FPGAs as Tools and Architectures at ETH Systems



  1. FPGAs as Tools and Architectures at ETH Systems

  2. FPGAs as Tools and Architectures at ETH Systems. Real-Time Tracing and Verification: the FPGA as a tool. Analysing a multi-Gb trace stream in real time. BRISC: Research Architecture for Large Systems: the FPGA as an architecture. A platform for hardware and software research. Expose the coherent interface to an FPGA, with lots and lots of fast IO links. David Cock | 14 September 2016

  3. Real-Time Tracing and Verification

  4. We're Going to Build a Large Program Collider. Collide instructions at 0.99c, and observe the decay products. Images: CERN; Chaix & Morel et associés

  5. Programmers Once (Thought They) Understood Computer Architecture. Image: Computer Systems: A Programmer's Perspective, Bryant & O'Hallaron, 2011

  6. Symmetric Multiprocessors Were Fairly Simple. [Diagram: two caches, each with a write buffer (WB), in front of shared RAM.]

  7. Concurrent Code Makes Architecture Visible. Consider message passing. Pretty much the simplest thing you can do with shared memory. Systems like Barrelfish rely on it. When are barriers required? You can't write good code without sufficiently understanding the hardware. We're combining components in new ways.

  8. Message Passing with Shared Memory. Sender CPU: Write: *x = 42; Write: *y = 1. Receiver CPU: Read: *y = 1; Read: *x = 42. [Diagram: RAM, where *x goes from 0 to 42 and *y goes from 0 to 1.]

  9. Message Passing with a Write Buffer. Sender CPU: Write: *x = 42; Write: *y = 1. Receiver CPU: Read: *y = 1; Read: *x = 0. [Diagram: *x = 42 is still held in the write buffer (WB) while *y = 1 has already reached RAM, so RAM holds *x = 0 and *y = 1.]

  10. Message Passing with a Barrier. Sender CPU: Write: *x = 42; Write: *y = 1. Receiver CPU: Read: *y = 1; Read: *x = 42. [Diagram: the barrier forces *x = 42 out of the write buffer (WB) into RAM before *y = 1 is written, so RAM holds *x = 42 and *y = 1.]
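
The three message-passing slides can be replayed in a toy model. The sketch below is only an illustration, not the hardware's actual behaviour: the `Core` class, its `drain` and `barrier` methods, and the assumption that the flag `y` happens to be written back before the payload `x` are all invented here to reproduce the reordering shown on the slides.

```python
class Core:
    """Toy core: stores go into a write buffer before they reach shared RAM."""

    def __init__(self, ram):
        self.ram = ram
        self.wb = []  # pending (address, value) stores, oldest first

    def write(self, addr, value):
        self.wb.append((addr, value))  # buffered, not yet visible to other cores

    def drain(self, addr):
        """Model the hardware writing back the oldest buffered store to addr."""
        for i, (a, v) in enumerate(self.wb):
            if a == addr:
                self.ram[a] = v
                del self.wb[i]
                return

    def barrier(self):
        """Write back every buffered store, in order, before continuing."""
        for a, v in self.wb:
            self.ram[a] = v
        self.wb.clear()


def message_pass(use_barrier):
    ram = {"x": 0, "y": 0}
    sender = Core(ram)
    sender.write("x", 42)            # payload
    if use_barrier:
        sender.barrier()             # payload must reach RAM before the flag
    sender.write("y", 1)             # flag
    sender.drain("y")                # assumption: the flag is written back first
    flag, data = ram["y"], ram["x"]  # receiver reads the flag, then the payload
    return flag, data


print(message_pass(use_barrier=False))
print(message_pass(use_barrier=True))
```

Without the barrier the receiver sees the flag set but the stale payload, `(1, 0)`; with the barrier it sees `(1, 42)`.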

  11. Of Course, CPUs Aren't That Simple. [Diagram: four CPUs, each with a write buffer (WB) and an L1 cache, sharing two L2 caches and an L3, with a coherent interconnect linking RAM and PCI; annotated "9 hops".]

  12. You Can't Trust the Hardware. seL4 was verified modulo a hardware model. The Cortex A8 has bugs: cache flushes don't work. As of today, these "errata" are still not public. We rediscovered these by accident. Non-coherent memory is coming. Source: Chip Errata for the i.MX51, Freescale Semiconductor

  13. And Then There's Rack Scale... [Diagram: a rack of multi-socket nodes, each built from CPUs with write buffers (WB), L1, L2 and L3 caches, coherent interconnects, RAM, PCI and NICs, connected through top-of-rack (TOR) switches to a backhaul.]

  14. There's a Lot of Data Available. Cache dumps. Program trace. Port mirroring. Event triggers. OpenFlow. [Diagram: the rack-scale system from the previous slide, annotated with these data sources.]

  15. ARM High-Speed Serial Trace Port. Streams from the Embedded Trace Macrocell. Cycle-accurate control flow + events @ 6GiB/s+. Compatible with FPGA PHYs. Well-documented protocol. Aurora 8B/10B. Available on ARMv8. Image: Teledyne LeCroy

  16. The HSSTP Hardware. The official tool is CHF 10,000 per core. The maximum cable run is 15cm. It's PHY-compatible with common FPGAs. A CHF 6k FPGA could easily handle 10 cores. 15x cheaper! We have a development prototype.

  17. HSSTP Testbench

  18. Fancy Triggering and Filtering. The ETM has sophisticated filtering, e.g. the Sequencer. Bn and Fn can be just about any events on the SoC. States can enable/disable trace, or log events. A powerful facility for pre-filtering. [Diagram: sequencer States 0 to 3, with events B0/F0, B1/F1 and B2/F2 on the transitions between adjacent states.]
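
A rough software model of such a sequencer might look as follows. This is a sketch under stated assumptions, not the ETM's documented behaviour: it assumes Fn moves forward from state n to state n+1, Bn moves back from state n+1 to state n, a forward event wins if both fire, and trace is enabled only in the last state. The `Sequencer` class and `trace_enabled` policy are hypothetical names.

```python
class Sequencer:
    """Four-state sequencer driven by named events (toy model)."""

    LAST_STATE = 3

    def __init__(self):
        self.state = 0

    def step(self, events):
        """Advance one cycle; events is the set of names that fired, e.g. {'F0'}."""
        s = self.state
        if s < self.LAST_STATE and f"F{s}" in events:
            self.state = s + 1        # assumed: Fn moves state n -> n+1
        elif s > 0 and f"B{s - 1}" in events:
            self.state = s - 1        # assumed: Bn moves state n+1 -> n
        return self.state


def trace_enabled(state):
    """Hypothetical policy: only emit trace once the full sequence has matched."""
    return state == Sequencer.LAST_STATE


seq = Sequencer()
for fired in [{"F0"}, {"F1"}, {"B1"}, {"F1"}, {"F2"}]:
    state = seq.step(fired)
    print(fired, "->", state, "trace on" if trace_enabled(state) else "trace off")
```

Pre-filtering of this kind keeps uninteresting trace off the link entirely, which matters when the downstream consumer has to keep up at line rate.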

  19. Filtering and Offload in an FPGA. We'll need to intelligently filter high-rate data. We're using an FPGA for the physical interface already. How much processing could we do? We have expertise in the group with FPGA query offloading. We have a Master's student working on this.

  20. What Could We Do With This Data?

  21. Hardware Tracing for Correctness. Are HW operations right? unmap(pa); cleanDCache(); flushTLB(); Real-time pipeline trace on ARM. Can halt and inspect caches. HW has "errata" (bugs). Check that it actually works! Catch transient and race bugs. [Diagram: 5Gb/s trace stream, filter at line rate, check temporal assertions, log & process offline.]

  22. Hardware Tracing for Performance. Is URPC optimal? Should see N coherency messages. Do we? The HW knows! Core 0: URPC[0] = x; URPC[1] = 1; Core 1: while(!URPC[1]); x = URPC[0]; [Diagram: Cache 0 and Cache 1 exchanging INVAL(0) and READ(1) coherency messages; 5Gb/s trace stream, filter at line rate, log & process offline.]

  23. Properties to Check: Security. Runtime verification is an established field. Lots of existing work to build on. What properties could we check efficiently? How could we map them to the filtering pipeline? /* A very simple TESLA assertion. */ TESLA_WITHIN(example_syscall, previously(security_check(ANY(ptr), o, op) == 0)); http://www.cl.cam.ac.uk/research/security/ctsrd/tesla/

  24. Processing Engine. That's a lot of data, how can we process it? This is what rack-scale systems are for! We have a software pipeline, thanks to a Master's student: Andrei Pârvu.

  25. Properties to Check: Memory Management. void *a = malloc(); ... {a is still allocated} free(a); Could we check this? Gp $free(x) -> P !$free(x) S x = $malloc; Read as: it's always been true that (Gp), if x is freed now ($free(x) ->), then before this free (P) there were no frees of x since it was allocated (!$free(x) S x = $malloc).

  26. Checking LTL with Automata. This is a well-studied problem, and standard algorithms exist: Gp $free(x) -> P !$free(x) S x = $malloc; [Diagram: the automaton generated for this formula, with bit-vector state labels such as 00111011 and 11000000 and transitions labelled malloc and free.]
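
For a single x, the "since" operator in this formula can be monitored with one bit of state, updated once per event. The sketch below uses the standard recurrence for past-time "since" (a S b holds now if b holds now, or a holds now and a S b held at the previous step). It is a hand-written illustration, not the automaton on the slide; the function name and the event encoding as strings are assumptions.

```python
def first_violation(events):
    """Check 'every free(x) is preceded by a malloc(x) with no free in between'.

    events is a list of strings for one fixed x: 'malloc', 'free', or anything
    else for unrelated events. Returns the index of the first violating event,
    or None if the trace satisfies the property.
    """
    since = False  # value of (!free S malloc) at the previous step
    for i, ev in enumerate(events):
        if ev == "free" and not since:
            return i  # freed without a live allocation
        # Recurrence: malloc now, or (not free now and it held before).
        since = (ev == "malloc") or (ev != "free" and since)
    return None


print(first_violation(["malloc", "use", "free"]))
print(first_violation(["malloc", "free", "free"]))
```

The first trace is accepted (`None`); the second is rejected at index 2, the double free.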

  27. Bound Variables and Multiple Automata. So far only one x value. Every x needs an automaton instance: Gp $free(1) -> P !$free(1) S 1 = $malloc; Gp $free(2) -> P !$free(2) S 2 = $malloc; Gp $free(3) -> P !$free(3) S 3 = $malloc; Requires dynamic allocation. Not trivial in HW. [Diagram: a stream of malloc and free events dispatched to per-x automaton instances.]
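
In software, one automaton instance per x is easy to sketch with a dictionary keyed by the bound value, which is exactly the dynamic allocation that is hard to do in hardware. The function below is an illustrative assumption, not the pipeline from the talk: events are `(op, addr)` tuples, an instance is created on malloc and released on free, and anything still live at the end is reported as a possible leak.

```python
def check_events(events):
    """Run one malloc/free monitor per address over a mixed event stream.

    events is a list of (op, addr) tuples with op in {'malloc', 'free'}.
    Returns (violations, still_live): the (index, addr) of every free that had
    no live allocation, and the sorted addresses never freed.
    """
    live = {}         # addr -> True while allocated; one instance per bound x
    violations = []
    for i, (op, addr) in enumerate(events):
        if op == "malloc":
            live[addr] = True                 # allocate an instance for this x
        elif op == "free":
            if not live.pop(addr, False):     # release it; missing means bad free
                violations.append((i, addr))
    return violations, sorted(live)


trace = [("malloc", 1), ("malloc", 2), ("free", 1), ("free", 1), ("free", 3)]
print(check_events(trace))
```

For this trace the result is `([(3, 1), (4, 3)], [2])`: a double free of 1, a free of the never-allocated 3, and 2 left allocated. Releasing the instance on free keeps the table bounded by the number of live allocations rather than the length of the trace.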

  28. A Streaming Verification Engine. [Diagram: pipeline with columns Sources, Capture, Processing and Properties; labels include ETM, Sequencer, HSSTP, FPGA Capture, Packet Capture, Dataflow Engine, FPGA Offload, TESLA, malloc() pairing and Coherence correctness, together with Constraints and Requirements.]

  29. Software Pipeline Performance: LTL checking in software. [Plot: time (seconds, 0 to 6) against number of events (1000s, 0 to 2000) for three properties: no double allocation, no double frees, no leaks.]

  30. Software Pipeline Performance: trace parsing in software. [Plot: time (seconds, 0 to 160) against number of events (1000s, 0 to 2000); series: write trace, write trace w/ASM, write parsed trace; stages labelled Trace, ASM and Parser.]
