

  1. The search engine you can see: connects people to information and services.

  2. The search engine you cannot see. Total data: ~1 EB; data processed: ~100 PB/day; total web pages: ~1,000 billion; web pages updated: ~10 billion/day; requests: ~10 billion/day; total logs: ~100 PB; logs updated: ~1 PB/day.

  3. The search engine you don’t see: intelligent speech, image recognition systems, HCI, and other cutting-edge technologies, built on large-scale distributed computing and large-scale distributed storage.

  4. The history of Moore’s law. [Chart: processor performance over time, log scale.] • Moore’s law is coming to an end.

  5. The history of the data center: mainframe → PC cluster → SDDC. [Chart: processor performance over time, log scale.]

  6. History of the data center • PC cluster: 2000~now, driven by scalability • SDDC: 2013~, driven by efficiency

  7. Outline • What is a PC cluster • What is SDDC • Baidu’s practice • Conclusion

  8. PC cluster • Background – Web-scale applications – The performance and cost limitations of mainframes • Scale out PC servers over Ethernet – Up to 10K servers per cluster • Typical configurations – Commodity hardware: x86 CPU, INSPUR, HUAWEI… – Software: MR/HDFS/Spark…

  9. PC cluster • Typical stack: applications / distributed software / commodity hardware – Each layer is independent – The interfaces are highly abstract – Follows the technology paradigms of the PC • Limitations – Multiple highly abstract layers block exploiting the hardware’s performance potential – Commodity hardware cannot support emerging applications, such as AI and big data – The end of Moore’s law

  10. Software-Defined Data Center (SDDC) • What is SDDC – Application-driven hardware and software – Whole-stack co-design • How – Algorithm: customized for new hardware and architectures – System and software: separate the data path and the control path – Hardware: expose low-level APIs, fully controlled by software; customized for applications

  11. Software-Defined Data Center (SDDC) • Why SDDC – Exploit performance potential across multiple layers – Customize hardware to extend Moore’s law for emerging applications (AI and big data) – Achieve extreme efficiency • The FPGA in SDDC – Enables the possibility of whole-stack co-design

  12. SDDC – Baidu’s vision and practice • Vision – Shift from PC cluster to SDDC in the next 3 years – Define and design the SDDC, collaborating with partners • Practice – SDF: software-defined flash (2011) – SDA: software-defined accelerator (2013) – Distributed SD system: in design (2015)

  13. Software-defined flash – background • Traditional SSD limitations – Low bandwidth utilization: 40% or less under real workloads – Limited capacity utilization: only 50%~70% usable by applications – Less predictable performance • Large scale – 10,000+ SSDs deployed per year (10 PB+ capacity) • Challenges – Acquisition of extra devices – Higher cost

  14. Software-defined flash – designs • Software-defined – Expose the low-level hardware interface to software: one device per channel (/dev/sda0 ~ /dev/sdaN) instead of a single /dev/sda – Software can control the hardware completely • New hardware architecture – Expose the hardware channels to software – An individual FTL controller per channel, instead of one SSD controller for all flash channels • New HW/SW interface – Write in units of the erase block size – Leverage global resources for data persistency: removes cross-channel parity coding [Diagram: SDF vs. conventional SSD channel architecture.]
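The SDF write model above can be sketched in a few lines: each channel appears as its own device, and every write is a whole erase block. This is a minimal in-memory simulation, not the SDF driver API; the erase-block size and all function names are illustrative, while the channel count and /dev/sdaN naming come from the slides.

```python
ERASE_BLOCK = 2 * 1024 * 1024   # illustrative erase-block size (2 MiB)
NUM_CHANNELS = 44               # SDF exposes 44 channels (slide 16)

# Stand-ins for the per-channel devices /dev/sda0 ~ /dev/sdaN.
channels = {i: [] for i in range(NUM_CHANNELS)}

def write(channel: int, data: bytes) -> int:
    """Pad data to one erase block and append it to a channel.

    SDF requires writes in erase-block units, so partial data is
    padded rather than buffered in a hidden FTL cache.
    """
    if len(data) > ERASE_BLOCK:
        raise ValueError("split data into erase-block-sized writes first")
    block = data.ljust(ERASE_BLOCK, b"\x00")
    channels[channel].append(block)
    return len(block)

# Round-robin placement keeps all channels busy, which is how SDF
# reaches ~95% of the raw write bandwidth.
for i, payload in enumerate([b"pageA", b"pageB", b"pageC"]):
    write(i % NUM_CHANNELS, payload)
```

Because software sees every channel, the data layout (here, simple round-robin) is a host-side decision rather than something hidden inside the SSD controller.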

  15. Software-defined flash – designs • Removing unnecessary software layers – To reduce latency and CPU cycles – To remove the complexity of kernel configuration • User-space scheduler – Data layout – Erase scheduling [Diagram: conventional SSD IO path (VFS, page cache, file system, generic block layer, IO scheduler, SCSI mid-layer, SATA/SAS translation, low-level device driver) vs. the SDF path (IOCTL directly to the PCIe driver), for both buffered and direct IO.]

  16. Software-defined flash – designs • Hardware – 25 nm MLC NAND, 44 channels, ONFI 1.x asynchronous at 40 MHz – 5 FPGAs: 4 Spartan-6 for the FTL (11 channels each), 1 Virtex-5 for PCIe (x8)

  17. Software-defined flash – conclusions • Key ideas – Expose flash channels to software – SW/HW co-design • Results – 95% write and 99% read bandwidth utilization – 99% capacity utilization – 50% cost reduction per GB compared with SSDs for production workloads • 3,000+ deployed in Baidu’s web-page storage system – 3x better performance than commodity SSDs – 50% cost reduction

  18. Software-defined accelerator – background • AI is the core technology – speech, image, page ranking, and ads • Extremely high computing density is required – GPU: high cost; high power and space consumption; higher demands on data center cooling, power supply, and space utilization – CPU: medium cost and power consumption; low speed – FPGA: the most potential; needs faster development iteration

  19. Software-defined accelerator – design • Xilinx K7 FPGA – Best performance per cost and per watt • Evaluations – Batch size = 8, layers = 8 – Workload 1 (weight matrix size = 512): FPGA is 4.1x faster than GPU and 3x faster than CPU – Workload 2 (weight matrix size = 2048): FPGA is 2.5x faster than GPU and 3.5x faster than CPU • Conclusions – The FPGA can merge small requests to improve performance – FPGA throughput (req/s) scales better with thread count [Fig. a and b: req/s vs. thread count for CPU, GPU, and FPGA on workloads 1 and 2.]
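The request-merging idea above can be sketched as a simple batching queue: many small inference requests are grouped into one hardware launch so the accelerator stays fully utilized. The batch size of 8 follows the slide; the queue, function name, and request labels are illustrative, and the real SDA would dispatch each batch as one matrix multiply on the FPGA.

```python
BATCH = 8  # slide: batch size = 8

def merge_requests(queue):
    """Group pending requests into batches of up to BATCH.

    One batch = one hardware launch, so N small requests cost
    ceil(N / BATCH) launches instead of N.
    """
    return [queue[i:i + BATCH] for i in range(0, len(queue), BATCH)]

# 20 small requests become 3 launches: two full batches of 8, one of 4.
pending = [f"req{i}" for i in range(20)]
batches = merge_requests(pending)
print(len(batches))  # prints 3
```

This is also why throughput in req/s scales better on the FPGA as thread count grows: more concurrent clients simply fill batches faster rather than contending for the device.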

  20. Conclusion • Paradigm shift – From PC cluster to SDDC • What is SDDC – Application-driven – Whole-stack co-design and tuning • The FPGA in SDDC – Enables SDDC • Baidu’s vision and practice – Shift from PC cluster to SDDC – SDF, SDA, and more
