

  1. Data Processing on Future Hardware Gustavo Alonso Systems Group Computer Science ETH Zurich Switzerland DAGSTUHL 17101, March 2017

  2. www.systems.ethz.ch Systems Group • 7 faculty • ~30 PhD students • ~11 postdocs • Researching all aspects of system architecture, software and hardware

  3. Do not replace, enhance: help the engine to do what it does not do well

  4. • Databases great: Persistence, Consistency, Fault tolerance, Concurrency, Optimization, Declarative interfaces, Extensible, Scalability
     • Databases not great: Text, Large data types, Multimedia, Floating point, Geometry/spatial, Graphs, Most of ML

  5. Text search in databases (Istvan et al., FCCM'16). INTEL HARP: this is an experimental system provided by Intel; any results presented are generated using pre-production hardware and software, and may not reflect the performance of production or future systems.
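
What makes text search a good FPGA target is that matching reduces to a constant amount of state update per input byte. A minimal software analogue of that idea (a generic bit-parallel shift-and matcher, offered only as a sketch, not the FCCM'16 circuit) shows the kind of per-byte register update the hardware pipelines:

    # Bit-parallel "shift-and" substring search: one bitmask update per
    # input byte, the constant-work-per-cycle state update that maps
    # naturally onto FPGA registers. Hypothetical sketch, not the
    # FCCM'16 design.
    def shift_and_search(pattern, stream):
        m = len(pattern)
        # Precompute, for every byte value, which pattern positions it matches.
        masks = [0] * 256
        for i, b in enumerate(pattern):
            masks[b] |= 1 << i
        state, hits = 0, []
        for pos, b in enumerate(stream):
            # Advance all partial matches by one position, start a new one.
            state = ((state << 1) | 1) & masks[b]
            if state & (1 << (m - 1)):          # full match reached
                hits.append(pos - m + 1)
        return hits

    print(shift_and_search(b"data", b"big data, fast data"))  # [4, 15]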

  6. 100% processing on FPGA

  7. Hybrid Processing CPU/FPGA

  8. Inside the FPGA … Owaida et al. FCCM 2017

  9. Accelerating real engines Sidler et al., SIGMOD’17

  10. Near-memory processing: see previous talk, or … (from Oracle M7 documentation)

  11. DoppioDB: An engine that actually processes data • MonetDB + Intel HARP (v1 and v2) • String processing • Skyline • Data Partitioning • Stochastic Gradient Descent • Decision trees • …

  12. Integration of Partitioned Hash Joins (Kara et al., SIGMOD 2017). [Figure: target architecture, Intel Xeon+FPGA. Ten Xeon cores and an Altera Stratix V FPGA partitioner share 96 GB of main memory; the FPGA reads R and S through a QPI endpoint at 64B cache-line granularity (~6.5 GB/s over QPI vs. ~30 GB/s at the CPU's memory controller) and writes back Counts R, Counts S, and the partitioned R and S tables, with padding.]
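
For the shape of the algorithm being offloaded: the accelerator handles the partitioning pass that precedes the per-partition hash joins, i.e. building the count histograms shown in the figure, turning them into write offsets, and scattering tuples. A plain-software sketch of that two-pass radix partition and join (illustrative only, not the SIGMOD 2017 code) follows:

    # Two-pass radix partitioning followed by per-partition hash joins.
    # Illustrative software analogue of the join whose partitioning pass
    # the Xeon+FPGA prototype offloads; not the SIGMOD 2017 code.
    RADIX_BITS = 4
    FANOUT = 1 << RADIX_BITS

    def partition(tuples):
        # Pass 1: per-partition histogram (the "Counts R"/"Counts S"
        # buffers in the figure).
        counts = [0] * FANOUT
        for key, _ in tuples:
            counts[key & (FANOUT - 1)] += 1
        # Prefix sums turn counts into write offsets for the scatter pass.
        offsets = [0] * FANOUT
        for p in range(1, FANOUT):
            offsets[p] = offsets[p - 1] + counts[p - 1]
        bounds = offsets + [len(tuples)]   # snapshot of partition boundaries
        out = [None] * len(tuples)
        for key, payload in tuples:        # Pass 2: scatter into place
            p = key & (FANOUT - 1)
            out[offsets[p]] = (key, payload)
            offsets[p] += 1
        return out, bounds

    def partitioned_hash_join(R, S):
        pr, rb = partition(R)
        ps, sb = partition(S)
        out = []
        for p in range(FANOUT):
            table = {}
            for key, payload in pr[rb[p]:rb[p + 1]]:      # build on R partition
                table.setdefault(key, []).append(payload)
            for key, payload in ps[sb[p]:sb[p + 1]]:      # probe with S partition
                for r_payload in table.get(key, ()):
                    out.append((key, r_payload, payload))
        return out

    R = [(1, "r1"), (17, "r2"), (2, "r3")]
    S = [(1, "s1"), (17, "s2"), (3, "s3")]
    print(partitioned_hash_join(R, S))  # [(1,'r1','s1'), (17,'r2','s2')]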

  13. SGD on FPGA (Kara et al., FCCM 2017; Kara et al., SIGMOD 2017). [Figure: SGD pipeline. Data arrives from a sensor, database, or storage device as 64B cache lines of 16 32-bit floats, processed at 12.8 GB/s. Pipeline stages in custom logic: 16 float multipliers and float-to-fixed converters compute the dot product ax; a fixed-point adder tree and a fixed-to-float converter produce ax − b; further float multipliers compute the gradient γ(ax − b)a; once the batch size is reached, the model held in FPGA BRAM is updated as x ← x − γ(ax − b)a.]
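
The update the pipeline implements is plain SGD for linear models: for each example (a, b), compute the dot product ax, the residual ax − b, and step the model by γ(ax − b)a. A software rendering of that loop (floating point throughout; the circuit's float-to-fixed conversion stages and batching are omitted, so this is a sketch, not the FCCM'17 design):

    import numpy as np

    # Software analogue of the pipelined SGD circuit: per example, a dot
    # product stage, a residual, and a model update x <- x - gamma*(ax - b)*a.
    def sgd_linear(A, b, gamma=0.01, epochs=10):
        x = np.zeros(A.shape[1])
        for _ in range(epochs):
            for a_i, b_i in zip(A, b):
                residual = a_i @ x - b_i        # dot product stage: ax - b
                x = x - gamma * residual * a_i  # gradient + model update stage
        return x

    # Tiny check: recover x* = [2, -1] from exact observations b = A @ x*.
    rng = np.random.default_rng(0)
    A = rng.normal(size=(256, 2))
    b = A @ np.array([2.0, -1.0])
    print(sgd_linear(A, b, gamma=0.05, epochs=50).round(3))  # ~[ 2. -1.]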

  14. If the data moves, do it efficiently: bumps in the wire(s)

  15. IBEX (Woods et al., VLDB'14)

  16. Sounds good? The goal is to be able to do this at all levels (a toy sketch of the pattern follows below): • Smart storage • On the network switch (SDN-like) • On the network card (smart NIC) • On the PCI express bus • On the memory bus (active memory). Every element in the system (a node, a computer rack, a cluster) will be a processing component.
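
The common pattern at all these levels is a pass-through element that does useful work on data already in motion. A toy rendering of such a bump in the wire (hypothetical record fields; IBEX-style selection and projection pushed into the data path, purely illustrative):

    # "Bump in the wire": a pass-through element that applies selection
    # and projection to records as they stream past, forwarding only
    # what the host still needs. Hypothetical sketch with invented fields.
    def bump_in_the_wire(records, predicate, projection):
        for rec in records:            # data is already moving; touch it in flight
            if predicate(rec):         # selection pushed into the data path
                yield projection(rec)  # projection drops unneeded bytes early

    # Upstream: raw records flowing from storage/NIC toward the host.
    stream = ({"id": i, "temp": 20 + i % 15, "payload": b"\x00" * 64}
              for i in range(1000))

    # Host receives only hot readings, already projected.
    hot = bump_in_the_wire(stream,
                           predicate=lambda r: r["temp"] > 30,
                           projection=lambda r: (r["id"], r["temp"]))
    print(next(hot))  # (11, 31)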

  17. Disaggregated data center: near-data computation

  18. Consensus in a Box (Istvan et al., NSDI'16). [Figure: a Xilinx VC709 evaluation board runs the networking stack, a replicated key-value store, and atomic broadcast on the FPGA, backed by 8 GB of DRAM; software clients connect over TCP through one SFP+ port, while the remaining SFP+ ports carry direct connections to the other nodes for replicated reads and writes.]
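
The replication logic the box implements in hardware is leader-based atomic broadcast, as in Zookeeper: the leader sequences each write, replicates it, and applies it once a majority of nodes acknowledge. A minimal in-memory sketch of that commit rule (illustrative only, not the NSDI'16 pipeline):

    # Leader-side commit rule of ZAB-style atomic broadcast: a write is
    # applied to the replicated key-value store only after a majority of
    # replicas acknowledge it. In-memory sketch, not the NSDI'16 design.
    class Leader:
        def __init__(self, followers):
            self.followers = followers   # follower node objects
            self.log = []                # ordered proposals
            self.kv = {}                 # replicated key-value store

        def put(self, key, value):
            seq = len(self.log)
            self.log.append((seq, key, value))
            # Replicate; count our own vote plus follower acks.
            acks = 1 + sum(f.accept(seq, key, value) for f in self.followers)
            majority = (len(self.followers) + 1) // 2 + 1
            if acks >= majority:         # commit point
                self.kv[key] = value
                for f in self.followers:
                    f.commit(seq)
                return True
            return False                 # caller retries / leader steps down

    class Follower:
        def __init__(self):
            self.log, self.kv = [], {}
        def accept(self, seq, key, value):
            self.log.append((seq, key, value))
            return True                  # ack; a real node persists first
        def commit(self, seq):
            _, key, value = self.log[seq]
            self.kv[key] = value

    leader = Leader([Follower(), Follower()])  # 3-node group, as on the boards
    print(leader.put("k", "v"), leader.kv)     # True {'k': 'v'}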

  19. The system: a 3-FPGA cluster behind a 10 Gbps switch; clients (×12) communicate over TCP/IP, nodes communicate over direct connections; leader election and recovery included. • Drop-in replacement for memcached with Zookeeper's replication • Standard tools for benchmarking (libmemcached) • Simulating 100s of clients

  20. Latency of puts in a KVS: consensus over direct connections ~3 μs; over TCP / 10 Gbps Ethernet ~10 μs; end-to-end as seen by memaslap clients (ixgbe) 15-35 μs.

  21. The benefit of specialization… [Chart: throughput (consensus rounds/s, log scale to 10,000,000) vs. consensus latency (μs, 1-1000). Specialized solutions (FPGA direct, FPGA TCP, DARE* on Infiniband) outperform general-purpose solutions (Libpaxos, Etcd, Zookeeper, all over TCP) by 10-100x. *Extrapolated from the 5-node setup to a 3-node setup.]
  [1] Dragojevic et al. FaRM: Fast Remote Memory. In NSDI'14.
  [2] Poke et al. DARE: High-Performance State Machine Replication on RDMA Networks. In HPDC'15.

  22. This is the end… • There is a killer application (data science / big data) • There is a very fast evolution of the infrastructure for data processing (appliances, data centers) • Conventional processors and architectures are not good enough • FPGAs are great tools to: explore parallelism, explore new architectures, explore Software Defined X/Y/Z, prototype accelerators
