Data Processing on Future Hardware
Gustavo Alonso, Systems Group, Computer Science, ETH Zurich, Switzerland
Dagstuhl Seminar 17101, March 2017
www.systems.ethz.ch
Systems Group: 7 faculty, ~30 PhD students, ~11 postdocs
Researching all aspects of system architecture, software and hardware
Do not replace, enhance: help the engine to do what it does not do well
Databases are great at:
• Persistence
• Consistency
• Fault tolerance
• Concurrency
• Optimization
• Declarative interfaces
• Extensible
• Scalability
Databases are not great at:
• Text
• Large data types
• Multimedia
• Floating point
• Geometry, spatial
• Graphs
• Most of ML
Text search in databases (Istvan et al., FCCM'16)
Intel HARP: this is an experimental system provided by Intel; any results presented are generated using pre-production hardware and software, and may not reflect the performance of production or future systems.
100% processing on FPGA
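To make the offloaded operation concrete, below is a minimal software sketch, my own and not the FCCM'16 circuit, of character-at-a-time substring matching of the kind that maps naturally onto an FPGA pipeline (one character consumed per clock cycle, a shift register of comparators in hardware):

// Minimal sketch (not the Istvan et al. FCCM'16 design): a streaming
// substring matcher that consumes one character per step, the way an
// FPGA pipeline consumes one character per clock cycle.
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

class StreamMatcher {
public:
    explicit StreamMatcher(std::string pattern) : pattern_(std::move(pattern)) {}

    // Feed one character; returns true if the pattern just completed.
    // In hardware this is a shift register feeding a row of comparators.
    bool push(char c) {
        window_.push_back(c);
        if (window_.size() > pattern_.size()) window_.erase(window_.begin());
        return window_.size() == pattern_.size() &&
               std::equal(window_.begin(), window_.end(), pattern_.begin());
    }

private:
    std::string pattern_;
    std::vector<char> window_;
};

int main() {
    StreamMatcher m("data");                                   // hypothetical predicate: LIKE '%data%'
    std::string tuple = "future hardware for data processing";
    for (char c : tuple)
        if (m.push(c)) std::cout << "match\n";                 // fires once per occurrence
}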
Hybrid Processing CPU/FPGA
Inside the FPGA … Owaida et al. FCCM 2017
Accelerating real engines Sidler et al., SIGMOD’17
Near-memory processing: see the previous talk, or … (from the Oracle M7 documentation)
DoppioDB: an engine that actually processes data
• MonetDB + Intel HARP (v1 and v2)
• String processing
• Skyline (see the dominance-test sketch below)
• Data partitioning
• Stochastic gradient descent
• Decision trees
• …
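To illustrate what one of these operators computes, here is a minimal skyline sketch, my own simplification rather than DoppioDB's hardware operator, assuming "smaller is better" in every dimension: a tuple survives if no other tuple dominates it.

// Minimal skyline sketch (assumption: smaller is better in every dimension);
// an FPGA operator would produce the same result with a streaming,
// parallel dominance check instead of this nested loop.
#include <cstddef>
#include <iostream>
#include <vector>

using Point = std::vector<double>;

// True if a dominates b: a is <= b in every dimension and < b in at least one.
bool dominates(const Point& a, const Point& b) {
    bool strictly_better = false;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (a[i] > b[i]) return false;
        if (a[i] < b[i]) strictly_better = true;
    }
    return strictly_better;
}

std::vector<Point> skyline(const std::vector<Point>& pts) {
    std::vector<Point> result;
    for (const Point& p : pts) {
        bool dominated = false;
        for (const Point& q : pts)
            if (dominates(q, p)) { dominated = true; break; }
        if (!dominated) result.push_back(p);
    }
    return result;
}

int main() {
    std::vector<Point> hotels = {{50, 8}, {80, 2}, {90, 9}, {40, 10}};  // {price, distance}
    for (const Point& p : skyline(hotels))
        std::cout << p[0] << " " << p[1] << "\n";  // {50,8}, {80,2}, {40,10}
}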
Integration of partitioned hash joins (Kaan et al., SIGMOD'17)
[Figure: target architecture, Intel Xeon+FPGA. An Altera Stratix V FPGA is attached to a 10-core Intel Xeon over QPI (~6.5 GB/s; the CPU's memory controller reaches ~30 GB/s) and shares 96 GB of main memory. The FPGA hosts the partitioner: 64 B cache lines of relations R and S stream through the QPI endpoint, and the partitioned R and S, per-partition counts, and padding are written back to memory for the CPU cores (core 0 … core 9) to join.]
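A software sketch of the idea behind the figure, under my own simplifications: the partitioning pass (the part offloaded to the FPGA in this work) splits both relations by a hash of the join key so that each partition pair fits in cache, after which the CPU cores build and probe per-partition hash tables.

// Minimal radix-partitioned hash join sketch (my simplification, not the
// SIGMOD'17 code): partition R and S by key, then join partition pairs.
// In the paper the partitioning pass runs on the FPGA over QPI; here
// everything runs on the CPU to show the logic only.
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <unordered_map>
#include <vector>

struct Tuple { uint64_t key; uint64_t payload; };

constexpr int kRadixBits = 4;                         // 16 partitions
constexpr int kPartitions = 1 << kRadixBits;

static int part_of(uint64_t key) { return static_cast<int>(key & (kPartitions - 1)); }

std::vector<std::vector<Tuple>> partition(const std::vector<Tuple>& rel) {
    std::vector<std::vector<Tuple>> parts(kPartitions);
    for (const Tuple& t : rel) parts[part_of(t.key)].push_back(t);
    return parts;
}

// Join one partition pair: build a hash table on R, probe with S.
std::size_t join_partition(const std::vector<Tuple>& r, const std::vector<Tuple>& s) {
    std::unordered_map<uint64_t, uint64_t> table;
    for (const Tuple& t : r) table.emplace(t.key, t.payload);
    std::size_t matches = 0;
    for (const Tuple& t : s)
        if (table.count(t.key)) ++matches;            // a real engine would emit the joined tuple
    return matches;
}

int main() {
    std::vector<Tuple> R = {{1, 10}, {2, 20}, {17, 30}};
    std::vector<Tuple> S = {{1, 100}, {17, 200}, {5, 300}};
    auto pr = partition(R), ps = partition(S);
    std::size_t total = 0;
    for (int p = 0; p < kPartitions; ++p)             // one partition pair per core in practice
        total += join_partition(pr[p], ps[p]);
    std::cout << total << " matches\n";               // prints: 2 matches
}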
SGD on FPGA (Kaan et al., FCCM'17; Kaan et al., SIGMOD'17)
[Figure: FPGA SGD pipeline. The data source (sensor, database, storage device) delivers one 64 B cache line per cycle, i.e. 16 32-bit floating-point values, for a processing rate of 12.8 GB/s. Pipeline stages: 16 float multipliers and 16 float-to-fixed converters; a tree of 16 fixed adders computing the dot product a·x (a and b buffered in FIFOs); a fixed-to-float converter, one float adder and one float multiplier forming γ(a·x − b); 16 float multipliers for the gradient γ(a·x − b)a; and, once the batch size is reached, 16 fixed adders applying the model update x ← x − γ(a·x − b)a, with the model held in FPGA BRAM (custom logic, model loading).]
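For reference, the update the pipeline implements is ordinary SGD for linear least squares, x ← x − γ(a·x − b)a. A minimal software sketch follows (my own; the FPGA adds the float/fixed conversions, the 16-wide data path, and the batched updates shown in the figure):

// Minimal SGD sketch for linear least squares: the FPGA pipeline above
// computes the same update x <- x - gamma*(a.x - b)*a, but processes
// 16 floats per cycle and keeps the model x in on-chip BRAM.
#include <cstddef>
#include <iostream>
#include <vector>

using Vec = std::vector<float>;

float dot(const Vec& a, const Vec& x) {
    float s = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * x[i];
    return s;
}

// One epoch of per-sample SGD over samples (a_i, b_i) with step size gamma.
void sgd_epoch(const std::vector<Vec>& A, const Vec& b, Vec& x, float gamma) {
    for (std::size_t i = 0; i < A.size(); ++i) {
        float err = dot(A[i], x) - b[i];              // a.x - b
        for (std::size_t j = 0; j < x.size(); ++j)
            x[j] -= gamma * err * A[i][j];            // x <- x - gamma*(a.x - b)*a
    }
}

int main() {
    // Toy problem: b = 2*a0 + 1*a1
    std::vector<Vec> A = {{1, 0}, {0, 1}, {1, 1}, {2, 1}};
    Vec b = {2, 1, 3, 5};
    Vec x = {0, 0};
    for (int epoch = 0; epoch < 200; ++epoch) sgd_epoch(A, b, x, 0.1f);
    std::cout << x[0] << " " << x[1] << "\n";         // converges toward 2 and 1
}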
If the data moves, do it efficiently: bumps in the wire(s)
IBEX (Woods et al., VLDB'14)
Sounds good? The goal is to be able to do this at all levels:
• Smart storage
• On the network switch (SDN-like)
• On the network card (smart NIC)
• On the PCI Express bus
• On the memory bus (active memory)
Every element in the system (a node, a computer rack, a cluster) will be a processing component.
Disaggregated data center: near-data computation
Consensus in a Box (Istvan et al., NSDI'16)
[Figure: Xilinx VC709 evaluation board. The FPGA runs a replicated key-value store with 8 GB of DRAM. Software clients and other nodes issue reads and writes over TCP through an SFP+ port; atomic broadcast to the other nodes goes over direct SFP+ connections.]
The system
[Figure: a 3-FPGA cluster and 12 client machines behind a 10 Gbps switch. Communication with clients is over TCP/IP, communication among the FPGAs is over direct connections; the protocol includes leader election and recovery.]
• Drop-in replacement for memcached with ZooKeeper's replication
• Standard tools for benchmarking (libmemcached; see the client sketch below)
• Simulating 100s of clients
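Because the box presents itself as memcached, an off-the-shelf libmemcached client works unchanged; a minimal sketch is below (host name and port are placeholders of mine, not values from the paper):

// Minimal libmemcached client sketch: the FPGA-based replicated KVS is a
// drop-in memcached replacement, so a standard client can talk to it.
// "fpga-kvs-leader" and port 11211 are placeholders, not from the paper.
#include <libmemcached/memcached.h>
#include <cstdlib>
#include <cstring>
#include <iostream>
#include <string>

int main() {
    memcached_st* memc = memcached_create(nullptr);
    memcached_server_add(memc, "fpga-kvs-leader", 11211);

    const char* key = "sensor:42";
    const char* value = "17.3";
    memcached_return_t rc = memcached_set(memc, key, std::strlen(key),
                                          value, std::strlen(value),
                                          /*expiration=*/0, /*flags=*/0);
    if (rc != MEMCACHED_SUCCESS)
        std::cerr << "set failed: " << memcached_strerror(memc, rc) << "\n";

    size_t value_len = 0;
    uint32_t flags = 0;
    char* fetched = memcached_get(memc, key, std::strlen(key), &value_len, &flags, &rc);
    if (rc == MEMCACHED_SUCCESS) {
        std::cout << "get -> " << std::string(fetched, value_len) << "\n";
        std::free(fetched);
    }
    memcached_free(memc);
}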
Latency of puts in a KVS
[Figure: latency breakdown. A memaslap client (ixgbe) reaches the store over TCP / 10 Gbps Ethernet in 15-35 μs; the consensus round among the FPGAs takes ~3 μs over direct connections and ~10 μs over TCP.]
The benefit of specialization…
[Figure: throughput (consensus rounds/s) vs. consensus latency (μs). Specialized solutions (FPGA direct, FPGA TCP, DARE* over InfiniBand) deliver 10-100x the throughput of general-purpose solutions (Libpaxos, etcd, ZooKeeper, all over TCP).]
[1] Dragojevic et al. FaRM: Fast Remote Memory. NSDI'14.
[2] Poke et al. DARE: High-Performance State Machine Replication on RDMA Networks. HPDC'15.
* Extrapolated from the 5-node setup to a 3-node setup.
This is the end …
• There is a killer application (data science / big data)
• There is a very fast evolution of the infrastructure for data processing (appliances, data centers)
• Conventional processors and architectures are not good enough
• FPGAs are great tools to:
  • Explore parallelism
  • Explore new architectures
  • Explore software-defined X/Y/Z
  • Prototype accelerators