Firmware Projects and Hardware Demonstrators
Yuri Gershtein, Rutgers University
for Tracklet Team
Technical Review, 28-Aug-2017
Outline
▪ Reminder about structure of firmware
  • Barrel
  • Disk
▪ Hardware test stands
  • CTP7 / VC709
▪ Comparison tools
  • Emulation – Simulation – Hardware
▪ Current status / plans
  • Tracklet 2.0
  • No sector-to-sector communication (?)
  • 25G link demo board
High Level Overview
▪ Massively parallel track reconstruction
  – Divide detector into 28 φ sectors (a 2 GeV track spans at most 2 sectors), spanning all η
  – Time multiplexed system (TMUX = 6)
▪ Consider a SECTOR, consisting of one board (SECTOR PROCESSOR, SP), as the top-level FW unit
▪ Tracklets formed within a sector
▪ A 2 GeV track can project into its adjacent sectors
▪ Projections must be sent to other sectors for stub matching
▪ Minimal data duplication
▪ One SP board for each sector → need inter-SP communication
Note: with 25G DTC→SP links, there is enough bandwidth to duplicate data to the SPs instead of using inter-SP communication. This saves on complexity and reduces latency.
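As a quick sanity check on the sector geometry above (a small Python sketch; the sector count comes from the slide, everything else is plain arithmetic):

```python
import math

N_SECTORS = 28  # phi sectors, from the slide

# Angular coverage of one sector
sector_width_deg = 360.0 / N_SECTORS
sector_width_rad = 2 * math.pi / N_SECTORS

print(f"{sector_width_deg:.2f} deg = {sector_width_rad:.3f} rad per sector")
# -> 12.86 deg = 0.224 rad per sector
```

So each sector is a fairly narrow ~12.9° wedge, which is why a low-pT (2 GeV) track can curl across a sector boundary into its neighbor.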
Firmware Overview
▪ Tracklet algorithm lends itself to a modular, massively parallelized, pipelined, fixed-latency implementation, ideal for FPGAs
  • Algorithm consists of 11 steps
  • Each step processes a new BX every 25 ns × N_TMUX
▪ High level overview: a few (simple) calculations, lots of data movement, replicated many times
  • We use aggressive partitioning to reduce occupancy (combinatorics!)
  • Massive parallelization speeds processing
▪ Most challenging part of the firmware is to move the data to where it needs to be for calculations.
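The per-step timing budget implied by the bullets above works out as follows (plain arithmetic; the 240 MHz processing clock figure appears later in the latency slides):

```python
TMUX  = 6        # time-multiplexing factor
BX_NS = 25.0     # LHC bunch-crossing period, ns
F_CLK = 240e6    # processing clock in Hz, from the latency slides

window_ns   = TMUX * BX_NS              # time between new events at one step
window_clks = window_ns * 1e-9 * F_CLK  # clock ticks available per step

print(f"{window_ns:.0f} ns window = {window_clks:.0f} clock ticks per step")
# -> 150 ns window = 36 clock ticks per step
```

This is the same 36-tick window that shows up later in the latency measurement (36 ticks of the 240 MHz clock = 150 ns).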
Firmware processing steps
1. Layer Router
2. VM Router
3. Tracklet Engines
4. Forming tracklets, initial param estimate
   A. Projection to neighbors
5. Transmit projections
6. Organize projections
7. Match Engines
8. Match projections to stubs
9. Match transmission
10. Track fit
11. Duplicate removal
(Color legend: Data Movement • Calculations • Links)
Two FW projects
▪ There are two firmware projects to cover half of a sector
  • ½ barrel project
  • Hybrid + Disk project: slightly different math, and the fit is more memory intensive
▪ Infrastructure code (moving data, etc.) shared between projects
▪ Different challenges presented by each project
(Diagram: seeding layer pairs L1+L2, L3+L4, L5+L6, D1+D2, D3+D4, L1+D1 for the ½ barrel and Hybrid + Disk projects)
Fixed Latency Design
▪ Processing modules read from, and write to, memories (BRAMs)
  • Mechanism for hand-off between algorithm stages
  • Dual-ported memories decouple clock domains
▪ Each processing step takes a fixed time to produce its first output (latency)
▪ The pipelined design then produces new output for the next N_TMUX × 25 ns
▪ After that time, move to the next BX (truncate if necessary)
▪ Result: fixed latency for the entire processing chain
▪ Pipelined: multiple BX in flight at one instant in time
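A toy model of the memory-processing-memory chain, as a Python sketch (the stage names and latency values here are invented for illustration; the key property shown is that total latency is a fixed sum, independent of data occupancy):

```python
class Stage:
    """One processing step: reads an input BRAM, writes an output BRAM,
    and always takes the same number of clocks to its first output."""
    def __init__(self, name, latency_clks):
        self.name = name
        self.latency = latency_clks

def chain_latency(stages):
    # Fixed-latency property: the chain latency is just the sum of the
    # per-stage latencies. Data volume never changes it, because events
    # that would overrun the time budget are truncated instead.
    return sum(s.latency for s in stages)

# Illustrative pipeline (names and numbers are NOT the real design values)
pipeline = [Stage("VMRouter", 40), Stage("TrackletEngine", 36),
            Stage("MatchEngine", 36), Stage("TrackFit", 60)]
print(chain_latency(pipeline))  # -> 172
```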
Firmware implementation structure
▪ Firmware consists of a few hand-optimized Verilog modules replicated thousands of times.
▪ Wiring up these modules is automated via Python scripts driven by configuration data.
  • Configuration stored in a master configuration file
  • Same config steers the C++ emulation and the firmware
  • Different sub-configurations allow specializations for the barrel and hybrid+disk projects while sharing code
▪ Debugging is easier as the C++ emulation has the same memory-processing-memory format
▪ FW is stored in a GitHub repo for efficient many-person development
  • Tagged FW and emulation repos kept in sync
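A minimal sketch of what script-driven wiring can look like, in Python. All names below (module types, instance names, memory names, the emit function) are hypothetical, invented for illustration; the real flow uses Wires.py and the master configuration file described above:

```python
# Hypothetical configuration: each entry wires one module instance between
# an input memory and an output memory.
config = [
    # (module type, instance name, input memory, output memory)
    ("VMRouter",       "VMR_L1PHI1",   "stubs_L1PHI1",   "vmstubs_L1PHI1"),
    ("TrackletEngine", "TE_L1L2_1",    "vmstubs_L1PHI1", "stubpairs_L1L2_1"),
]

def emit_verilog(cfg):
    """Turn the configuration table into Verilog module instantiations."""
    lines = []
    for mtype, inst, mem_in, mem_out in cfg:
        lines.append(f"{mtype} {inst} (.din({mem_in}), .dout({mem_out}));")
    return "\n".join(lines)

print(emit_verilog(config))
```

The same table can steer the C++ emulation, which is what keeps the emulator and the firmware structurally identical.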
Track finder project generation flow chart
(Flow chart: the master configuration is processed by project-generation scripts (Wires.py, SubProject.py) into project configuration files, Verilog code, LUTs, and a reduced config; these feed the tracklet emulation (C++), the Vivado simulation, and the 'bit' file loaded into the Sector Processor hardware. Stubs are the input; L1 tracks are the output. Legend: input, C++ or Python processing, output file.)
Tools generate top-level HDL, C++ emulator configuration, and this diagram; processing steps (red) implement the algorithm.
(Diagram: 1/4 barrel project, not all connections shown. Stub input → stub organization → forming tracklets → projection transmission to neighbors → organize tracklet projections → match tracklet projections to stubs → match transmission → track fit → track output. Duplicate removal is the next step.)
Each step takes a predetermined amount of time – fixed latency
Nearest-neighbor communication
▪ Reminder: low-pT tracks project to neighboring sectors, so communication is needed
▪ Use serial links over fiber optics for inter-sector communication
  • 8b/10b protocol, synchronized across boards
▪ Estimated positions for projected tracklets are sent to the neighbor
▪ Corrections to the estimated position are sent back to the original board for the final fit
▪ Implemented in hardware
Link Organization
▪ Projections to be sent are grouped to balance the loads on the transceivers
▪ Use a priority encoder to read many memories (almost all empty)
  • MUX memory sources and send a single data stream
  • DeMUX on the receiving end and route to the appropriate memory
▪ Transceivers for returning matches are nearly identical
▪ Vivado simulation is done up to the edge of the transceiver
  • Latency simulated with a holding FIFO
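The priority-encoder MUX/DeMUX scheme described above can be sketched in Python as follows (a toy model; the memory contents are illustrative):

```python
def priority_mux(memories):
    """Scan the (mostly empty) memories in fixed priority order each clock
    and emit the first available word together with its source index, so
    the receiving end can route it back into the right memory."""
    for i, mem in enumerate(memories):
        if mem:
            return i, mem.pop(0)
    return None  # nothing to send this clock

def demux(routed, word_with_src):
    """Receiving end: route the word to the memory named by its source index."""
    src, word = word_with_src
    routed[src].append(word)

# Illustrative projection memories: most are empty, as on the real board
mems = [[], [], [0xABCD], [0x1234]]
out = [[] for _ in mems]
demux(out, priority_mux(mems))  # word from memory 2 lands in output memory 2
```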
Track Finder development workflow
(Diagram of the development workflow, with DTC boards at either end)
Hardware Platform
▪ Xilinx VC709 board (Virtex-7 FPGA XC7VX690T): OSU, Cornell, Rutgers
▪ CTP7 board (developed at U Wisconsin):
  • Test stand at Rutgers (4 CTP7)
  • Test stand at CERN (4 CTP7)
  • Test stand at Cornell (4 CTP7)
Demonstrator Test Stands (4 CTP7s)
▪ Hardware:
  • 4 CTP7 boards in μTCA
  • 3 sector processors (SPs)
  • 1 data source and sink
  • AMC13 card for clock/sync
  • Represents 1 of N_TMUX
▪ I/O:
  • Data source: emulates the DTC; provides stub inputs for both the central sector and its neighbors
  • Sector boards: neighbor communication of projections and matches
  • Data sink: captures L1 tracks
Links and Synchronization
▪ Inter-board communication modules built on Wisconsin's protocol
  • 8b/10b encoding, 10 Gb/s link speed
  • We can run with the system configured as TMUX = 4 or TMUX = 6
▪ Each CTP7 has 48 optical outputs and 67 inputs at 10 Gb/s
  • Sufficient I/O for demonstrator needs
▪ Boards are synchronized from the central BC0 and clock sent from the AMC13
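The aggregate bandwidth implied by these numbers (plain arithmetic; the payload efficiency follows from the 8b/10b encoding named above):

```python
N_TX, N_RX = 48, 67       # CTP7 optical outputs / inputs, from the slide
LINE_GBPS  = 10.0         # line rate per link
CODE_EFF   = 8.0 / 10.0   # 8b/10b: 8 payload bits per 10 line bits

tx_raw = N_TX * LINE_GBPS             # 480 Gb/s raw outbound
rx_raw = N_RX * LINE_GBPS             # 670 Gb/s raw inbound
tx_pay = tx_raw * CODE_EFF            # 384 Gb/s usable outbound payload
rx_pay = rx_raw * CODE_EFF            # 536 Gb/s usable inbound payload

print(tx_raw, rx_raw, tx_pay, rx_pay)
```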
Demonstrator Tests
▪ Latency
  • Design is a fixed-latency approach
  • Time from the start of sending input stubs from the "DTC" to the arrival of the first track at the track sink
  • Measured with various designs: half barrel, hybrid + disks
▪ Performance versus emulation
  • Capture the output of the SP and compare bitwise with emulation
▪ Stability
  • Run for an extended period of time with error checking
Latency measurement
▪ Single-pass, end-to-end measurement
  • Processing of each step, including transmission of data
  • Includes inter-board communication
▪ Measured latency includes:
  • Stub input links from the DTC board to the sector board
  • All processing steps
  • Track output links back to the DTC board
▪ Measurement with a clock counter
  • 240 MHz clock, same as the processing clock
  • Implemented on the DTC emulator
  • Start: read enable of the input memory
  • Counter output is written to a memory when the first stub is sent or valid tracks are received, giving a list of time stamps
  • The BX of the received track is also written into the memory
▪ Done for many events: single muon, ttbar
Latency measurement
Latency = Trk_Clk - Start - BX × 36
(Plot, single muons: from first stub sent (event #1) to arrival of track at sink; 36 ticks of the 240 MHz clock = 150 ns; bunch crossing recorded in binary)
• From first stub sent until the first possible track arrives at the sink: latency = 800 clks = 3.33 μs
• Plus 36 clks / 150 ns until the last possible track arrives at the sink: latency = 836 clks = 3.48 μs
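The numbers on this slide can be checked with a few lines of Python (the formula and constants are taken from the slide):

```python
F_CLK_MHZ    = 240.0  # processing clock; clocks / MHz gives microseconds
TICKS_PER_BX = 36     # 36 ticks of the 240 MHz clock = 150 ns (TMUX = 6)

def latency_clks(trk_clk, start, bx):
    # Slide formula: subtract the BX offset so a track from a later bunch
    # crossing is measured against its own start time.
    return trk_clk - start - bx * TICKS_PER_BX

first_us = 800 / F_CLK_MHZ  # first possible track at the sink
last_us  = 836 / F_CLK_MHZ  # last possible track, one 36-tick window later

print(f"{first_us:.2f} us to {last_us:.2f} us")  # -> 3.33 us to 3.48 us
```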
Latency Model (TMUX 6, 240 MHz CLK)
(Diagram of the latency model, broken down in the following bullets)
• Processing time of each module before moving to the next BX (TMUX = 6)
• Overhead in each processing module
• Inter-board communication latency: transmission protocol for stub inputs, projections, matches, and track outputs; 76 clk (240 MHz) measured with ChipScope