Steve Nasypany [email protected] PowerVM Performance Updates HMC v8 Performance Capacity Monitor Dynamic Platform Optimizer PowerVP 1.1.2 VIOS Performance Advisor 2.2.3 © 2014 IBM Corporation First, a HIPER APAR… AIX.
Download ReportTranscript Steve Nasypany [email protected] PowerVM Performance Updates HMC v8 Performance Capacity Monitor Dynamic Platform Optimizer PowerVP 1.1.2 VIOS Performance Advisor 2.2.3 © 2014 IBM Corporation First, a HIPER APAR… AIX.
Steve Nasypany [email protected] PowerVM Performance Updates HMC v8 Performance Capacity Monitor Dynamic Platform Optimizer PowerVP 1.1.2 VIOS Performance Advisor 2.2.3 © 2014 IBM Corporation First, a HIPER APAR… AIX 6.1 TL9 SP1 XMGC XMGC IV53582 AIX 7.1 TL3 SP1 XMGC XMGC IV53587 XMGC NOT TRAVERSING ALL KERNEL HEAPS Systems running 6100-09 Technology Level with bos.mp64 below the 6.1.9.2 level Systems running 7100-03 Technology Level with bos.mp64 below the 7.1.3.2 level PROBLEM DESCRIPTION: xmalloc garbage collector is not traversing all kernel heaps, causing pinned and virtual memory growth. This can lead to low memory or low paging space issues, resulting in performance degradations and, in some cases, a system hang or crash You can’t diagnose this with vmstat or svmon easily. Systems just run out of memory pinned or computational memory keeps climbing, and cannot be accounted to a process 2 © 2014 IBM Corporation Optimization Redbook Draft available now! POWER7 & POWER8 PowerVM Hypervisor AIX, i & Linux Java, WAS, DB2… Compilers & optimization Performance tools & tuning http://www.redbooks.ibm.com/redpieces/abstracts/sg248171.html 3 © 2014 IBM Corporation HMC Version 8 Performance Capacity Monitor 4 © 2014 IBM Corporation Power Systems Performance Monitoring HMC 780 or earlier HMC 810 Evolution from disjoint set of OS tools to integrated monitoring solution System resource monitoring via a single touch-point (HMC) Data collection and aggregation of performance metrics via Hypervisor REST API (WEB APIs) for integration with IBM and third-party products Trending of the utilization data Assists in first level of performance analysis & capacity planning 5 © 2014 IBM Corporation Performance Metrics (complete set, firmware dependent) Physical System Level Processor & Memory Resource Usage Statistics – System Processor Usage Statistics (w/ LPAR, VIOS & Power Hypervisor usage breakdown) – System Dedicated Memory Allocation and Shared Memory Usage Statistics (w/ LPAR, 
VIOS & Power Hypervisor usage breakdown)
Advanced Virtualization Statistics
– Per-LPAR Dispatch Wait Time Statistics
– Per-LPAR Placement Indicator (a score indicating whether the LPAR placement is good or bad)
Virtual IO Statistics
– Virtual I/O Server CPU / Memory Usage (aggregated & breakdown)
– SEA Traffic & Bandwidth Usage Statistics (aggregated & per client, intra/inter-LPAR breakdown)
– NPIV Traffic & Bandwidth Usage Statistics (HBA & per-client breakdown)
– vSCSI Statistics (aggregated & per-client usage)
– VLAN Traffic & Bandwidth Usage Statistics (adapter & LPAR breakdown)
SR-IOV Traffic & Bandwidth Usage Statistics (Physical & Virtual Function statistics w/ LPAR breakdown)

Performance Metrics (cont.)
Raw Metrics
– Cumulative counters (since IPL) or quantities (size, config, etc.)
– Fixed sampling intervals:
  • General-purpose monitoring: 30 seconds, 30-minute cache
  • Short-term problem diagnosis: 5 seconds, 15-minute cache
Processed Metrics
– Utilization (CPU, I/O, etc.)
– Fixed interval of 30 seconds, preserved for 4 hrs
Aggregated Metrics
– Rolled-up Processed Metrics
– Rolled-up data at 15-minute, 2-hour & daily intervals (min, average & max)
– Preserved for a maximum of 365 days (configurable per HMC & limited by storage space)

New control for storage and enablement
Aggregate Server: Current Usage (CPU, Memory, IO)
Partition: Entitlement vs Usage Spread, Detail
Partition: Processor Utilization
Partition: Network, including SR-IOV support
Storage by VIOS, vSCSI or NPIV

HMC v8 Monitor Support (June 2014 GA)
Minimum features with all POWER6 & above models:
– Managed System CPU Utilization (point in time & historical)
– Managed System Memory Assignment (point in time & historical)
– Server Overview section of historical data with LPAR & VIOS view
– Processor
Trend Views with LPAR, VIOS & Processor Pool (no System Firmware Utilization or Dispatch Metrics; these will be shown as zero)
– Memory Trend Views with LPAR & VIOS view
These metrics were available via legacy HMC performance data collection mechanisms and are picked up by the monitor.

HMC v8 Monitor Support (new firmware-based function)
FW 780 & VIOS 2.2.3: all function, except for 770/780-MxB models:
– No support for LPAR Dispatch Wait Time
– No support for Power Hypervisor Utilization
FW 780 or above with a VIOS level below 2.2.3: the following functions are not available (basically, no IO utilization):
– Network Bridge / Virtual Storage Trend Data
– VIOS Network / Storage Utilization
FW 770 or less with VIOS 2.2.3 or later: these are not provided:
– Network Bridge Trend Data
– LPAR Dispatch Wait Time
– Power Hypervisor Utilization
FW 770 or less with a VIOS level below 2.2.3: the tool will not provide:
– Network Bridge / Virtual Storage Trend Data
– VIOS Network / Storage Utilization
– LPAR Dispatch Wait Time
– Power Hypervisor Utilization

Dynamic Platform Optimizer Update

What is Dynamic Platform Optimizer (DPO)?
DPO is a PowerVM virtualization feature that enables users to improve partition memory and processor placement (affinity) on Power servers after they are up and running. DPO performs a sequence of memory and processor relocations to transform the existing server layout into the optimal layout based on the server topology.
Client benefits:
– Ability to run without a platform IPL (of the entire system)
– Improved performance in cloud or highly virtualized environments
– Dynamically adjust topology after mobility

What is Affinity?
Affinity is a locality measurement of an entity with respect to physical resources.
– An entity could be a thread within AIX/i/Linux, or the OS instance itself
– Physical resources could be a core, chip, node, socket, cache (L1/L2/L3), memory controller, memory DIMMs, or I/O buses
Affinity is optimal when the number of cycles required to access resources is minimized.
POWER7+ 760 planar: note the x & z buses between chips, and the A & B buses between Dual Chip Modules (DCMs). In this model, each DCM is a "node".

Partition Affinity: Why is it not always optimal?
Partition placement can become sub-optimal because of:
– Poor choices in Virtual Processor, Entitlement or Memory sizing. The Hypervisor uses Entitlement & Memory settings to place a partition; wide use of 10:1 Virtual Processor to Entitlement settings does not lend any information for optimal placement. Before you ask, there is no single golden rule, magic formula, or IBM-wide Best Practice for Virtual Processor & Entitlement sizing. If you want education in sizing, ask for it.
– Dynamic creation/deletion, and processor and memory ops (DLPAR)
– Hibernation (Suspend or Resume)
– Live Partition Mobility (LPM)
– CEC Hot Add, Repair, & Maintenance (CHARM)
Older firmware levels are less sophisticated in placement and dynamic operations.

Partition Affinity: Hypothetical 4-Node Frame
[Diagram: partitions X, Y and Z interleaved across the four nodes before the DPO operation; each partition consolidated, with free LMBs grouped, afterward]

Current and Predicted Affinity enhancement with V7R780 firmware
Scores at the partition level, along with the system-wide scores:
lsmemopt -m managed_system -o currscore -r [sys | lpar]
lsmemopt -m managed_system -o calcscore -r [sys | lpar] [--id request_partition_list] [--xid protect_partition_list]
sys = system-wide score (default if the -r option is not specified)
lpar = partition scores

Example: V7R780 firmware current affinity score
lsmemopt -m calvin -o currscore -r sys
> curr_sys_score=97
lsmemopt -m calvin -o currscore -r lpar
> lpar_name=calvinp1,lpar_id=1,curr_lpar_score=100
lpar_name=calvinp2,lpar_id=2,curr_lpar_score=100
lpar_name=calvinp50,lpar_id=50,curr_lpar_score=100
lpar_name=calvinp51,lpar_id=51,curr_lpar_score=none
lpar_name=calvinp52,lpar_id=52,curr_lpar_score=100
lpar_name=calvinp53,lpar_id=53,curr_lpar_score=74
lpar_name=calvinp54,lpar_id=54,curr_lpar_score=none
Get predicted score:
lsmemopt -m calvin -o calcscore -r sys
> curr_sys_score=97,predicted_sys_score=100,requested_lpar_ids=none,protected_lpar_ids=none

HMC CLI: Starting/Stopping a DPO Operation
optmem -m managed_system -t affinity -o start [--id requested_partition_list] [--xid protect_partition_list]
Use these switches to exclude partitions by name or number (for example, partitions that are not DPO-aware):
– Partition lists are comma-separated and can include ranges, e.g. --id 1,3,5-8
– Requested partitions: partitions that should be prioritized (default = all LPARs)
– Protected partitions: partitions that should not be touched (default = no LPARs)
– Exclude by name (-x CAB,ZIN) or by LPAR id number (--xid 5,10,16-20)
optmem -m managed_system -t affinity -o stop

HMC CLI: DPO Status
lsmemopt -m managed_system
> in_progress=0,status=Finished,type=affinity,opt_id=1,progress=100,requested_lpar_ids=none,protected_lpar_ids=none,"impacted_lpar_ids=106,110"
• Unique optimization identifier
• Estimated progress %
• LPARs that were impacted by the optimization (i.e., had CPUs, memory, or their hardware page table moved)

What's New (V7R7.8.0): DPO Schedule, Thresholds, Notifications
System affinity score (not LPAR affinity score)

DPO – Supported Hardware and Firmware Levels
Introduced in fall 2012 (with feature code EB33):
• 770-MMD and 780-MHD with firmware level 760.00
• 795-FHB with firmware level 760.10 (760 with fix pack 1)
• Recommended: 760_069, which has the enhancements below
Additional systems added spring 2013 with firmware level 770:
– 710, 720, 730, 740 D-models with firmware level 770.00
– 750, 760 D-models with firmware level 770.10 (770 with fix pack 1)
– 770-MMC and 780-MHC with firmware level 770.20 (770 with fix pack 2)
– Performance enhancements: DPO memory movement time reduced, scoring algorithm improvements
– Recommended firmware: 770_021
Affinity scoring at the LPAR level with firmware level 780, delivered Dec 2013:
– 770-MMB, 780-MHB added with 780.00
– 795-FHB updated with 780.00
– 770-MMD, 780-MHD (AM780_056_040 level released 4/30/2014)
http://www-304.ibm.com/support/customercare/sas/f/power5cm/power7.html
* Some Power models and firmware releases listed above are currently planned for the future and have not yet been announced.
* All statements regarding IBM's future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only.
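The scoring and optimization flow above can be scripted: filter the per-LPAR currscore output for poorly placed partitions and build a comma-separated list suitable for the --id switch. A minimal sketch using standard shell tools — the sample lines are copied from the lsmemopt example earlier, and the threshold of 90 is an arbitrary illustration, not an IBM guideline:

```shell
# Sample per-LPAR scores, as printed by: lsmemopt -m calvin -o currscore -r lpar
scores='lpar_name=calvinp1,lpar_id=1,curr_lpar_score=100
lpar_name=calvinp53,lpar_id=53,curr_lpar_score=74
lpar_name=calvinp54,lpar_id=54,curr_lpar_score=none'

# Split each record on '=' and ',': field 4 is the LPAR id, field 6 the score.
# Skip "none" (scoring not applicable) and collect ids scoring below 90.
low_ids=$(printf '%s\n' "$scores" | awk -F'[=,]' '
  $6 != "none" && $6+0 < 90 { ids = ids (ids ? "," : "") $4 }
  END { print ids }')
echo "$low_ids"
```

The resulting list (here just partition 53) could then be passed as `optmem -m calvin -t affinity -o start --id "$low_ids"` to prioritize those partitions.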
Running DPO
DPO-aware Operating Systems:
– AIX 6.1 TL8 or later, AIX 7.1 TL2 or later
– IBM i 7.1 TR6 or later
– Linux: some reaffinitization in RHEL7/SLES12 (fully implemented in follow-on releases)
– VIOS 2.2.2.0 or later
– HMC V7R7.6.1
Partitions that are DPO-aware are notified after DPO completes.
Re-affinitization required:
– Performance team measurements show reaffinitization is critical
– For older OS levels, users can exclude those partitions from optimization, or reboot them after running the optimizer
Affinity (at a high level) is as good as after a CEC IPL, assuming unconstrained DPO.

More Information
IBM PowerVM Virtualization Managing and Monitoring (June 2013), SG24-7590-04: http://www.redbooks.ibm.com/abstracts/sg247590.html?Open
IBM PowerVM Virtualization Introduction and Configuration (June 2013), SG24-7940-05: http://www.redbooks.ibm.com/abstracts/sg247940.html?Open
POWER7 Information Center, under logical partitioning topics: http://pic.dhe.ibm.com/infocenter/powersys/v3r1m5/index.jsp?topic=%2Fp7hat%2Fiphblmanagedlparp6.htm
IBM developerWorks: https://www.ibm.com/developerworks/community/blogs/PowerFW/entry/dynamic_platform_optimizer5?lang=en
POWER7 Logical Partitions "Under the Hood": http://www-03.ibm.com/systems/resources/power_software_i_perfmgmt_processor_lpar.pdf

PowerVP

PowerVP Redbook — draft available now!
http://www.redbooks.ibm.com/redpieces/pdfs/redp5112.pdf

Review – POWER7+ 750/760 Four-Socket Planar Layout
Note the x & z buses between chips, and the A & B buses between Dual Chip Modules (nodes). (Power 750/760 D Technical Overview)

Review – POWER7+ 770/780 Four-Socket Planar Layout
Not as pretty as the 750+ diagram: note that we have x, w & z buses between chips with this model, and the buses to other nodes (not pictured) and IO are a little more cryptic. (Power 770/780 D Technical Overview)

PowerVP – Virtual/Physical Topology Utilization

Why PowerVP – Power Virtualization Performance
During an IPL of the entire Power system, the Hypervisor determines an optimal resource placement strategy for the server based on the partition configuration and the hardware topology of the system. There was a desire for a visual understanding of how the hardware resources were assigned and consumed by the various partitions running on the platform. It was also desired to have a visual indication of a resource's consumption: when it passes a warning threshold (yellow) and when it enters an overcommitted threshold (red).
PowerVP Overview
– Graphically displays data from existing and new performance tools
– Converges performance data from across the system
– Shows CEC, node & partition-level performance data
– Illustrates topology utilization with colored "heat" threshold settings
– Enables drill-down for both physical and logical approaches
– Allows real-time monitoring and a recording function
– Simplifies physical/virtual environment monitoring and analysis
– Not intended to replace any current monitoring or management product

PowerVP Environment
Partition Collectors (required for the logical view): LPAR CPU utilization, disk activity, network activity, CPI analysis, cache analysis
System-wide Collector (one required per system): P7 topology information, P7 chip/core utilizations, P7 Power bus utilizations, memory and I/O utilization, LPAR entitlements and utilization
Components: System Collector, Partition Collector, operating system (IBM i, AIX, VIOS, Linux), Hypervisor interfaces, chip/core HPMCs, PMUlets, FW/Hypervisor, thread PMUs, Power hardware
You only need to install a single system-wide collector to see global metrics.

PowerVP – System, Node and Partition Views: System Topology, Node Drill-Down, Partition Drill-Down

PowerVP – System Topology
• The initial view shows the hardware topology of the system you are logged into
• In this view, we see a Power 795 with all eight books/nodes installed, each with four sockets
• Values within boxes show CPU usage
• Lines between nodes show SMP fabric activity

PowerVP – Node Drill-Down
• This view appears when you click on a node, and lets you see resource assignments and consumption
• In this view, we see a POWER7 780 node with four chips, each with four cores
• Active buses are shown with solid colored lines. These can be between nodes, chips, memory controllers and IO buses.
PowerVP 1.1.2: Node View (POWER7 780)

PowerVP 1.1.2: Chip View (POWER7 780 with 4 cores)
[Diagram labels: SMP bus, IO, memory controller, chip, DIMMs, LPARs]

PowerVP 1.1.2: CPU Affinity
LPAR 7 has 8 VPs. As we select cores, 2 VPs are "homed" to each core. The fourth core has 4 VPs from four LPARs "homed" to it. This does not prevent VPs from being dispatched elsewhere in the pool as utilization requirements demand.

PowerVP 1.1.2: Memory Affinity
LPAR 7 Online Memory is 32768 MB, 50% of the 64 GB in DIMMs. Note: LPARs will be listed in color order in the shipping version.

PowerVP – Partition Drill-Down
• This view lets us drill down on the resources being used by a selected partition
• In this view, we see CPU, memory, disk IOPS, and Ethernet being consumed. We can also get an idea of cache and memory affinity.
• We can drill down on several of these resources; for example, on disk transfer or network activity, by selecting the resource

PowerVP – Partition Drill-Down (CPU, CPI)
PowerVP – Partition Drill-Down (Disk)

PowerVP – How do I use this?
• PowerVP is not intended to replace traditional performance management products
• It does not let you manage CPU, memory or IO resources
• It does provide an overview of hardware resource activity that gives you a high-level view:
  • Node/socket activity
  • Cores assigned to dedicated and shared pools
  • A VM's Virtual Processors assigned to cores
  • A VM's memory assigned to DIMMs
  • Memory bus activity
  • IO bus activity
• Provides partition activity related to:
  • Storage & network
  • CPU
  • Software Cycles-Per-Instruction

PowerVP – How do I use this?
High-Level
• The high-level view can allow visual identification of node and bus stress
• Thresholding is largely arbitrary, but if one memory controller is obviously saturated while others are inactive, you have an indication that a more detailed review is required
• There are no rules of thumb or best practices for thresholds
• You can review system Redbooks to determine where you stand with respect to bus performance (not always available, but newer Redbooks are more informative)
• This tool provides high-level diagnosis with some detailed views (if partition-level collectors are installed)

PowerVP – How do I use this?
Low-Level
• Cycles-Per-Instruction (CPI) is a complicated subject; assessing it in detail is beyond the capacity of most customers
• In general, a lower CPI is better: the fewer CPU cycles per instruction, the more instructions can get done
• PowerVP gives you various CPI values; these, in conjunction with OS tools, can tell you whether you have good affinity
• Affinity is a measurement of a thread's locality to physical resources. Resources can be many things: L1/L2/L3 cache, core(s), chip, memory controller, socket, node, drawer, etc.
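The CPI idea above reduces to simple arithmetic: CPI = cycles consumed / instructions completed. A minimal sketch — the counter values are made-up illustrations, not measurements from any real PMU:

```shell
# Hypothetical hardware counter readings over a sampling interval
cycles=12000000000        # processor cycles consumed
instructions=8000000000   # instructions completed

# CPI = cycles / instructions; lower means fewer cycles per instruction
cpi=$(awk -v c="$cycles" -v i="$instructions" 'BEGIN { printf "%.2f", c / i }')
echo "CPI=$cpi"
```

Here the ratio works out to 1.50; comparing such values across partitions (alongside OS affinity tools) is how PowerVP's CPI figures are meant to be read, with lower generally better.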
AIX Enhanced Affinity
AIX on POWER7 and above uses Enhanced Affinity instrumentation to localize threads by Scheduler Resource Allocation Domain (SRAD). AIX Enhanced Affinity measures:
– Local: usually a chip
– Near: local node/DCM (intranode)
– Far: other node/drawer/CEC (internode; e.g., between POWER7 770/780/795 nodes or POWER8 S824 DCMs)
These are logical mappings, which may or may not be exactly 1:1 with physical resources.

AIX topas Logical Affinity ('M' option)

Topas Monitor for host: claret4    Interval: 2
===================================================================
REF1 SRAD TOTALMEM  INUSE  FREE   FILECACHE HOMETHRDS CPUS
-------------------------------------------------------------------
0    2    4.48G     515M   3.98G  52.9M     134.0     12-15
     0    12.1G     1.20G  10.9G  141M      236.0     0-7
1    1    4.98G     537M   4.46G  59.0M     129.0     8-11
     3    3.40G     402M   3.01G  39.7M     116.0     16-19
===================================================================
CPU  SRAD TOTALDISP LOCALDISP% NEARDISP% FARDISP%
----------------------------------------------------------
0    0    303.0     43.6       15.5      40.9
2    0    1.00      100.0      0.0       0.0
3    0    1.00      100.0      0.0       0.0
4    0    1.00      100.0      0.0       0.0
5    0    1.00      100.0      0.0       0.0
6    0    1.00      100.0      0.0       0.0

Local (chip) dispatch is optimal; near is intranode, far is internode. What's a bad FARDISP% rate? There is no rule of thumb, but thousands of far dispatches per second will likely indicate lower performance. How do we fix it?
Entitlement & memory sizing best practices + current firmware + Dynamic Platform Optimizer.

PowerVP Physical Affinity: VM View
• PowerVP can show us physical affinity (local, remote & distant)
• AIX topas can show us logical affinity (local, near & far)
• More local is more ideal
Cache affinity and DIMM affinity: local is optimal. Computed CPI is an inverse measure: lower is typically better.

PowerVP Supported Power Models and ITEs
Power System models and ITEs with 770 firmware support:
• 710-E1D, 720-E4D, 730-E2D, 740-E6D (also includes Linux D models)
• 750-E8D, 760-RMD
• 770-MMC, 780-MHC, ESE 9109-RMD
• p260-22X, p260-23X, p460-42X, p460-43X, p270-24X, p470-44X, p24L-7FL
• 71R-L1S, 71R-L1C, 71R-L1D, 71R-L1T, 7R2-L2C, 7R2-L2S, 7R2-L2D, 7R2-L2T
Power System models added with 780 firmware support (Dec 2013 780 Power firmware):
– 770-MMB and 780-MHB (eConfig support 1/28/2014)
– 795-FHB
Power System models with 780 firmware support:
– 770-MMD, 780-MHD (4/30/2014)
Pre-770 firmware models do not have the instrumentation to support PowerVP.
http://www-304.ibm.com/support/customercare/sas/f/power5cm/power7.html
* Some Power models and firmware releases listed above are currently planned for the future and have not yet been announced.
* All statements regarding IBM's future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only.
PowerVP OS Support
Announced and GA in 4Q 2013; PowerVP 1.1.2 ships 6/2014. Available as a standalone product or with PowerVM Enterprise Edition.
Agents run on IBM i, AIX, Linux on Power and VIOS:
– IBM i V7R1, AIX 6.1 & 7.1, any VIOS level that supports POWER7
– RHEL 6.4, SUSE 11 SP3
Client supported on Windows, Linux, and AIX:
– Client requires Java 1.6 or greater
– Installer provided for Windows, Linux, and AIX
– Also includes a Java installer, which has worked under VMware and OS X (limited testing) where the platform-specific installers do not apply

VIOS Performance Advisor 2.2.3

VIOS Performance Advisor: What is it?
Not another performance monitoring tool, but an integrated report that leverages other tools and the lab's knowledge base:
– Summarizes the overall performance of a VIOS
– Identifies potential bottlenecks and performance inhibitors
– Proposes actions to be taken to address the bottlenecks
The "beta" VIOS Performance Advisor, productized and shipped with the Virtual I/O Server (shipped with VIOS 2.2.2).

Performance Advisor: How does it work?
– Polls key performance metrics over a period of time
– Analyzes the data
– Produces an XML-formatted report for viewing in a browser
The "part" command is available in the VIOS restricted shell — pronounced "p-Art" (Performance Analysis & Reporting Tool). It can be executed in two different modes:
– Monitoring mode (actually uses nmon recording now)
– Post-processing nmon recording mode
The final report, along with the supporting files, is bundled together in a "tar"-formatted file. Users can download and extract it to a PC or any machine with a browser installed to view the report.
VIOS Performance Advisor: Process
1. Collect data — monitoring mode (5 to 60 minutes):
IBM Virtual I/O Server
login: padmin
$ part -i 30
or post-processing an nmon recording:
$ part -f vio1_130915_1205.nmon
2. Transfer & view report: transfer the generated tar file to a machine with browser support, extract the tar file, and load the *.xml file in a browser.

VIOS Performance Advisor: Browser View

VIOS Performance Advisor: Legend, Risk & Impact
Advisor legend: Informative, Investigate, Optimal, Warning, Critical, Help/Info
Risk: level of risk, on a range of 1 to 5, of making the suggested value change
Impact: potential performance impact, on a range of 1 to 5, of making the suggested value change

VIOS Performance Advisor: Config

VIOS Performance Advisor: Tunable Information
When you select the help icon, a pop-up with guidance appears.

VIOS Performance Advisor: CPU Guidance

VIOS Performance Advisor: Shared Pool Guidance
If shared pool monitoring is enabled, the Advisor will report status, settings, and whether there is a constraint. Enablement is via the partition properties panel, as "Allow performance information collection".

VIOS Performance Advisor: Memory Guidance
VIOS Performance Advisor: IO Total & Disks
VIOS Performance Advisor: Disk Adapters
VIOS Performance Advisor: FC Details (FC utilization based on peak IOPS rates)
VIOS Performance Advisor: NPIV Breakdowns
VIOS Performance Advisor: Storage Pool

VIOS Performance Advisor: Shared Ethernet
The accounting feature must be enabled on the VIOS:
chdev -dev ent* -attr accounting=enabled

VIOS Performance Advisor: Shared Tunings
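The transfer-and-view step can be scripted once the tar bundle is off the VIOS: extract it and locate the XML report to open in a browser. A minimal sketch — the bundle here is a locally created stand-in, since the real one comes from `part` on the VIOS, and its file names will differ:

```shell
# Stand-in for the advisor bundle downloaded from the VIOS
workdir=$(mktemp -d)
mkdir -p "$workdir/report"
echo '<xml/>' > "$workdir/report/vios_advisor_report.xml"
tar -cf "$workdir/advisor.tar" -C "$workdir" report

# The actual workflow: extract the bundle, then find the report XML
mkdir "$workdir/extract"
tar -xf "$workdir/advisor.tar" -C "$workdir/extract"
report=$(find "$workdir/extract" -name '*.xml')
echo "Open in a browser: $report"
```

The supporting files extracted alongside the XML must stay in place, since the report references them when rendered in the browser.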
Performance Advisor: Overhead
– The CPU overhead of running this tool on a VIOS is the same as that of running an nmon recording: very low
– The memory footprint of the command is also kept to a minimum
– However, in post-processing mode, if the recording contains a high number of samples, the part command will consume noticeable CPU when executed. Example: an nmon recording with 4000 samples, 100 MB in size, collected on a VIOS with 255 disks configured, will take about 2 minutes to analyze on a VIOS with an entitlement of 0.2. A typical (default) nmon recording contains 1440 samples, so this example is on the high end of the scale.

Affinity Backup

What is Affinity?
Affinity is a locality measurement of an entity with respect to physical resources.
– An entity could be a thread within AIX/i/Linux, or the OS instance itself
– Physical resources could be a core, chip, node, socket, cache (L1/L2/L3), memory controller, memory DIMMs, or I/O buses
Affinity is optimal when the number of cycles required to access resources is minimized.
POWER7+ 760 planar: note the x & z buses between chips, and the A & B buses between Dual Chip Modules (DCMs). In this model, each DCM is a "node".

Thread Affinity
Performance is closer to optimal when threads stay close to physical resources.
Thread Affinity is a measurement of proximity to a resource.
– Examples of resources: L2/L3 cache, memory, core, chip and node
– Cache affinity: threads in different domains need to communicate with each other, or cache needs to move with thread(s) migrating across domains
– Memory affinity: threads need to access data held in a different memory bank, not associated with the same chip or node
Modern, highly multi-threaded workloads are architected to have lightweight threads and distributed application memory:
– They can span domains with limited impact
– Unix scheduler/dispatcher/memory-manager mechanisms spread workloads

Partition Affinity: Why is it not always optimal?
Partition placement can become sub-optimal because of:
– Poor choices in Virtual Processor, Entitlement or Memory sizing. The Hypervisor uses Entitlement & Memory settings to place a partition; wide use of 10:1 Virtual Processor to Entitlement settings does not lend much information for best placement. Before you ask, there is no single golden rule, magic formula, or IBM-wide Best Practice for Virtual Processor & Entitlement sizing. If you want education in sizing, ask for it.
– Dynamic creation/deletion, and processor and memory ops (DLPAR)
– Hibernation (Suspend or Resume)
– Live Partition Mobility (LPM)
– CEC Hot Add, Repair, & Maintenance (CHARM)
Older firmware levels are less sophisticated in placement and dynamic operations.

How does partition placement work?
PowerVM knows the chip types and memory configuration, and attempts to pack partitions onto the smallest number of chips/nodes/drawers.
– Optimizing placement results in higher exploitation of local CPU and memory resources
– Dispatches across node boundaries incur longer latencies, and both AIX and PowerVM actively try to minimize them via Enhanced Affinity mechanisms
It considers the partition profiles and calculates optimal placements:
– Placement is a function of the Desired Entitlement and the Desired & Maximum Memory settings
– Virtual Processor counts are not considered
– Maximum memory defines the size of the Hardware Page Table (HPT) maintained for each partition: 1/64th of Maximum on POWER7, and 1/128th on POWER7+ and POWER8
– Ideally, Desired + (Maximum / HPT ratio) < node memory size, if possible

What tools exist for optimizing affinity?
Within AIX, two technologies are used to maximize thread affinity:
– The AIX dispatcher uses Enhanced Affinity services to keep a thread within the same POWER7 multiple-core chip, to optimize chip and memory controller use
– Dynamic System Optimizer (DSO) proactively monitors, measures and moves threads, their associated memory pages, and memory pre-fetch algorithms to maximize core, cache and DIMM efficiency. We do not cover this feature in this presentation.
Within a PowerVM frame, three technologies assist in maximizing partition affinity:
– The PowerVM Hypervisor determines an optimal resource placement strategy for the server, based on the partition configuration and the hardware topology of the system
– Dynamic Platform Optimizer relocates OS instances within a frame for optimal physical placement
– PowerVP allows us to monitor placement, node, memory bus, IO bus and Symmetric Multi-Processor (SMP) bus activity

AIX Enhanced Affinity
AIX on POWER7 and above uses Enhanced Affinity instrumentation to localize threads by Scheduler Resource Allocation Domain (SRAD). AIX Enhanced Affinity measures:
– Local: usually a chip
– Near: local node/DCM (intranode)
– Far: other node/drawer/CEC (internode; e.g., between POWER7 770/780/795 nodes or POWER8 S824 DCMs)
These are logical mappings, which may or may not be exactly 1:1 with physical resources.

AIX Affinity: the lssrad tool shows logical placement
View of a 24-way, two-socket POWER7+ 760 with Dual Chip Modules (DCMs): 6 cores per chip, 12 in each DCM; 5 Virtual Processors x 4-way SMT = 20 logical CPUs.
Terms: REF1 = node (drawer, or DCM/MCM socket); SRAD = Scheduler Resource Allocation Domain.

# lssrad -av
REF1 SRAD      MEM    CPU
0
        0 12363.94    0-7
        2  4589.00  12-15
1
        1  5104.50   8-11
        3  3486.00  16-19

REF1 0 is node 0 (SRADs 0 & 2); REF1 1 is node 1 (SRADs 1 & 3). If a thread's 'home' domain was SRAD 0, SRAD 2 would be 'near', and SRADs 1 & 3 would be 'far'.

Affinity: Cycles-Per-Instruction
Another way to look at affinity is by watching how many cycles a thread uses. This can be done via Cycles-Per-Instruction (CPI) measurements. POWER architectures are instrumented with a variety of CPI values for chip resources. These measurements are usually a complicated subject, the domain of hardware and software developers. In general, a lower CPI is better: the fewer CPU cycles per instruction, the more efficient the execution. We will return to this concept in the PowerVP section.

Affinity: Diagnosis
When may I have a problem?
– An SRAD has CPUs but no memory, or vice-versa
– When CPU or memory are very unbalanced
But how do I really know?
– Tools tell you: lssrad/topas/mpstat/svmon (AIX), numactl (Linux) & PowerVP
– A high percentage of threads with far dispatches
– Disparity in performance between equivalent systems
PowerVM & POWER8 provide a variety of improvements:
– PowerVM has come a long way in the last three years: firmware, AIX, Dynamic Platform Optimizer and PowerVP give you a lot of options
– Cache (sizes, pre-fetch, L4, Non-Uniform Cache Access logic), controller, and massive DIMM bandwidth improvements
– Inter-socket latencies and efficiency have progressively improved from POWER7 to POWER7+ and now POWER8

How do I Optimize Affinity?
This is a separate topic, but an overview of the options:
• Follow POWER7 best practices for sizing (in general, tailor partition entitlement, desired & maximum memory settings to real usage and chip/node sizes)
• Update to newer firmware levels; they are much smarter about the physical placement of virtualized OS instances
• Use Dynamic Platform Optimizer (DPO) to optimally place partitions within a frame
• Monitor Enhanced Affinity metrics in AIX (topas 'M')
• Use Dynamic System Optimizer (DSO) to optimally place threads within AIX; DSO does this by monitoring core, cache and DIMM memory use by individual threads
• Use software products that are affinity-aware (newer levels of some WebSphere products are capable of this)
• Manually create Resource Sets (rsets) of CPU & memory resources and assign workloads to them (expert level)
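The first diagnosis check above ("an SRAD has CPUs but no memory") can be automated against lssrad-style rows. A minimal sketch over hand-made illustrative input (SRAD, MEM in MB, CPU range) rather than real lssrad output:

```shell
# Illustrative per-SRAD rows: <srad> <mem_MB> <cpu_range>
# SRAD 2 is the problem case: it has CPUs assigned but no memory.
lssrad_rows='0 12363.94 0-7
1 5104.50 8-11
2 0.00 12-15'

# Flag any SRAD with a CPU list but zero memory
bad=$(printf '%s\n' "$lssrad_rows" | awk '$2+0 == 0 && $3 != "" { print $1 }')
echo "SRADs with CPUs but no memory: $bad"
```

On a real system, the same pattern would run over parsed `lssrad -av` output; a non-empty result is a hint to review placement (and consider DPO or resizing) per the diagnosis slide.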