

  1. Performance Scalability of a Multi-core Web Server
     Bryan Veal and Annie Foong, Intel R&D

  2. Overview
     • The number of CPU cores on modern servers is increasing rapidly
     • Premise: for highly parallel workloads, performance should scale with the number of cores
     • We tested this premise for web servers
     • Our results show that web servers do not scale
     • We tested for common problems with poor parallel programming
     • We found few parallelism problems in the TCP/IP stack and the web software
     • Instead, we found problems inherent to server hardware design
     12/03/2007 ANCS 2007 -- Performance Scalability of a Multi-Core Web Server

  3. Why Performance Should Scale
     • Typical networked servers
       – Have multiple cores
       – Have NICs mapped onto cores
       – Support many clients
       – Each client has its own flow
     • Independence between flows
       – Parallelism in the TCP/IP stack
       – Parallelism in the application
     • Because of flow-level parallelism, performance should scale
     [Diagram: many clients connect to a server whose NICs are each mapped onto a core, with shared memory behind the cores]
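The flow independence described above can be sketched in code. A minimal illustration (the hashing scheme and names here are hypothetical, not from the talk): mapping each flow's 4-tuple to one core keeps all of a flow's packets, and hence its TCP state, on a single core, so flows need no cross-core coordination.

```python
import hashlib

def core_for_flow(src_ip: str, src_port: int, dst_ip: str, dst_port: int,
                  n_cores: int) -> int:
    """Hash a flow's 4-tuple so every packet of the flow lands on one core."""
    key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}".encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:4], "big") % n_cores

# All packets of a flow map to the same core, so per-flow TCP state needs
# no cross-core locking; independent flows spread across all the cores.
```

This is the basis of the scaling premise: with many independent flows, adding cores should add capacity linearly.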

  4. How Performance Scales on Web Servers
     • Example web server benchmark: SPECweb2005
       – Official results from HP
       – Similar scaling for Intel and AMD CPUs
       – Performance metric is throughput
     • Ideal performance scales linearly
     • Actual performance scales poorly
       – 2x the cores
       – 1.5x the performance
     • Performance does not scale with the number of cores!
     [Chart: SPECweb2005 speedup vs. number of cores (0 to 16); actual performance falls increasingly short of ideal linear scaling]
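The "2x the cores, 1.5x the performance" observation corresponds to a parallel efficiency well below 1. A small calculation makes this concrete (the compounding to 16 cores is an illustrative extrapolation, not a figure from the slides):

```python
def speedup(throughput_n: float, throughput_1: float) -> float:
    """Speedup relative to the single-core throughput."""
    return throughput_n / throughput_1

def efficiency(throughput_n: float, throughput_1: float, n_cores: int) -> float:
    """Fraction of ideal linear scaling actually achieved."""
    return speedup(throughput_n, throughput_1) / n_cores

# Doubling cores for only 1.5x throughput is 75% efficiency per doubling.
# Compounded over four doublings (1 -> 16 cores), ideal speedup is 16x,
# but the actual speedup would be only about 1.5**4, roughly 5.1x.
```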

  5. Determining Why Performance Scales Poorly
     • Reproduced the published results on our own server
     • Tested common causes of poor scaling
     • System
       – 8-core Intel Xeon server
       – 4 1GbE NICs
     • Software
       – Apache 2, Linux 2.6, PHP 5
       – SPECweb2005 Support workload: highest throughput of the 3 SPECweb2005 workloads
     • Performance metrics
       – Compare throughput when increasing from 1 to 8 cores
       – Compare cycles executed per byte transmitted when increasing from 1 to 8 cores
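The cycles-per-byte metric above can be derived from CPU frequency, core count, utilization, and bytes transmitted per second. A sketch with hypothetical numbers (the clock rate and throughput below are illustrative, not the measured values):

```python
def cycles_per_byte(cpu_hz: float, n_cores: int, utilization: float,
                    bytes_per_second: float) -> float:
    """CPU cycles spent per byte transmitted (lower is better)."""
    return (cpu_hz * n_cores * utilization) / bytes_per_second

# Hypothetical example: one 2 GHz core, fully busy, sending 1 GB/s
# spends 2 cycles on every byte it transmits.
```

A rising cycles/byte as cores are added means each byte of output costs more CPU work, which is exactly the poor-scaling signature the talk investigates.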

  6. Our Performance Scaling Results
     • Throughput: scaling falls short of ideal as cores increase
     • Cycles/byte: more cycles are needed to send each byte of data
     • Like the published results, our server scales poorly.
     [Charts: web server throughput (Gb/s) and speedup vs. number of cores (0 to 8), actual below ideal; cycles/byte vs. number of cores, increasing]

  7. Where Does Cycles per Byte Increase?
     • Cycles/byte rises for both the OS (TCP/IP stack) and the application (web server)
     • The OS-to-application CPU utilization ratio is steady as cores increase
     • Either both the OS and the application are poorly parallelized, or something else is affecting them both.
     [Charts: cycles/byte contributions of OS and application vs. number of cores; OS-to-application utilization ratio vs. number of cores, roughly flat]

  8. Possible Causes of Poor Scaling
     • We investigated many other causes (details in paper)
     • Potential parallelism problems in software
       – Bad parallelism in the TCP/IP stack
       – Longer code path per flow
       – Stalls due to cache and TLB misses
     • Potential scaling problems in hardware
       – Stalls due to system bus saturation

  9. Scaling in the TCP/IP Stack
     • Removed the web server (TCP only)
     • Bulk transmit
     • 6 NICs at line rate
     • 128 flows per core
     • Per-core CPU utilization remains flat
     • The TCP/IP stack is parallelized well.
     [Charts: TCP/IP stack throughput (Gb/s) growing with cores (0 to 6); per-core CPU utilization vs. number of cores, flat]
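Binding each TCP worker to its own core, as in the test above, can be done on Linux with `os.sched_setaffinity`. A minimal sketch of the pinning step only (the NIC setup and the 128 flows per worker are omitted; this is an assumed mechanism, the slides do not say how pinning was implemented):

```python
import os

# Pick one core from the set this process is allowed to run on and pin
# to it. Each worker handling its own batch of flows would be pinned to
# a distinct core this way. Linux-only API.
available = os.sched_getaffinity(0)   # cores currently allowed
target = min(available)
os.sched_setaffinity(0, {target})     # restrict this process to one core
```

With one worker per core and flows spread evenly, flat per-core utilization as cores are added is the expected signature of good parallelization.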

  10. Possible Causes of Poor Scaling
      • Potential parallelism problems in software
        – Bad parallelism in the TCP/IP stack
        – Longer code path per flow
        – Stalls due to cache and TLB misses
      • Potential scaling problems in hardware
        – Stalls due to system bus saturation

  11. Code Path Scaling
      • Length of code path may increase with number of cores
      • Examples
        – Waiting longer for spin locks
        – Traversing larger data structures
      • A longer code path would increase instructions per cycle (IPC)
      • In fact, IPC is decreasing
      • Code path does not increase significantly.
      • Decreasing IPC suggests instruction pipeline stalls.
      [Chart: instructions per cycle vs. number of cores (0 to 8), decreasing]
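IPC is simply retired instructions divided by elapsed cycles; if it falls as cores are added while the code path stays the same length, the extra cycles must be stalls. A sketch of the check (the counter values below are hypothetical, not measured data from the talk):

```python
def ipc(instructions: float, cycles: float) -> float:
    """Instructions retired per CPU cycle."""
    return instructions / cycles

def strictly_decreasing(values) -> bool:
    """True if each measurement is lower than the one before it."""
    return all(b < a for a, b in zip(values, values[1:]))

# Hypothetical counter readings at increasing core counts: fewer
# instructions complete per cycle even though the code path is unchanged,
# so the added cycles are pipeline stalls rather than extra work.
ipc_by_cores = [ipc(7.0e9, 1.0e10), ipc(1.2e10, 2.0e10), ipc(2.0e10, 4.0e10)]
```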

  12. Finding Pipeline Stalls
      • Top third poorest-scaling functions, ranked by cycles/byte increase between 1 and 8 cores:
        memcpy, tcp_init_tso_segs, tcp_ack, memcpy_c, free_block, memset_c,
        copy_user_generic_string, dev_hard_start_xmit, __alloc_skb,
        _zend_mm_alloc_int, kmem_cache_free, _zend_hash_quick_add_or_update,
        __d_lookup, ap_merge_per_dir_configs, skb_clone, tcp_sendpage,
        zend_hash_find
      • The top of the list is dominated by memory load/store stalls
      • Scaling is harmed the most by stalls for memory loads and stores.

  13. Possible Causes of Poor Scaling
      • Potential parallelism problems in software
        – Bad parallelism in the TCP/IP stack
        – Longer code path per flow
        – Stalls due to cache and TLB misses
      • Potential scaling problems in hardware
        – Stalls due to system bus saturation

  14. Stalls Caused by Cache and TLB Misses
      • More cache and TLB misses can increase memory accesses
      • Can be caused by increased data sharing between cores
      • In fact, last-level cache and data TLB misses per cycle are decreasing
      • Cache and TLB misses do not cause memory load/store stalls. Something else does.
      [Chart: last-level cache and data TLB misses per cycle vs. number of cores (0 to 8), both decreasing]

  15. Possible Causes of Poor Scaling
      • Potential parallelism problems in software
        – Bad parallelism in the TCP/IP stack
        – Longer code path per flow
        – Stalls due to cache and TLB misses
      • Potential scaling problems in hardware
        – Stalls due to system bus saturation

  16. Possible Cause of Bus Saturation
      • The system bus (front-side bus) has two main components
        – Address bus: carries requests and responses for data, called snoops
        – Data bus: carries the data itself
      • Bus transaction example
        – A cache miss generates a snoop on the address bus
        – The snoop is broadcast to memory and all remote caches to find the most current data
        – The current copy of the data is in memory
        – All remote caches and memory respond
      • More caches mean more sources and more destinations for snoops
      • Snoops grow O(n²) with the number of caches!
      [Diagram: cores and caches on a shared bus with a memory controller and main memory; a snoop broadcast fans out to all remote caches and memory, which respond]
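The O(n²) claim follows from two linear factors: misses scale with the number of caches (more sources), and each snoop is broadcast to every other cache (more destinations). A back-of-the-envelope model (the constant per-cache miss count is a simplifying assumption):

```python
def snoop_messages(n_caches: int, misses_per_cache: int) -> int:
    """Total snoops on the address bus per measurement interval."""
    # n_caches sources, each generating misses_per_cache misses, and each
    # miss broadcast to the n_caches - 1 remote caches: O(n^2) overall.
    return n_caches * misses_per_cache * (n_caches - 1)

# Doubling the number of caches roughly quadruples snoop traffic even if
# each cache's own miss rate stays constant.
```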

  17. The Effect of Snoops on Scaling
      • Snoops may increase bus utilization
      • Bus utilization above 2/3 is considered saturated
      • Data bus utilization increases, but is not saturated
        – Confirms data sharing between cores is minimal
      • Address bus utilization increases faster
        – Becomes saturated at 8 cores
      • Address bus saturation causes poor scaling!
      [Chart: address bus and data bus utilization vs. number of cores (0 to 8); the address bus crosses the 2/3 saturation line at 8 cores]
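Combining the 2/3 saturation threshold with the quadratic snoop growth gives a rough model of when the address bus saturates. This is a sketch under stated assumptions: that utilization tracks snoop traffic (so it grows roughly with n²), and with the constant k below calibrated, hypothetically, so saturation is reached at 8 cores as measured.

```python
SATURATION = 2.0 / 3.0   # bus utilization above this is considered saturated

def address_bus_utilization(n_cores: int, k: float) -> float:
    """Rough model: utilization tracks snoop traffic, which grows ~n^2."""
    return k * n_cores ** 2

# Calibrate k so the model just reaches saturation at 8 cores, matching
# the measurement; at 4 cores the same model stays below the threshold.
k = SATURATION / 8 ** 2
```

Under this model, quadratic snoop growth makes saturation arrive quickly: the address bus that is comfortable at 4 cores is the bottleneck at 8.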

  18. Insights
      • Although web servers are highly parallelized and share little data…
      • Systems are designed for shared-memory applications
      • Snoops are broadcast regardless of good parallelism

  19. Conclusions
      • Our web server scales poorly with the number of cores
      • The OS and application exploit flow-level parallelism and scale well
      • Address bus saturation due to broadcast snoops causes poor scaling
