Design Challenge of a QuadHDTV Video Decoder
Youn-Long Lin
Department of Computer Science, National Tsing Hua University
MPSOC2007, Japan
2 YLLIN NTHU-CS More Pixels
NHK Proposes UHD TV Broadcast
• Super HiVision: 7680x4320 pixels at 60 fps (16x HDTV)
• The baseband signal is 24 Gbps. Using 16 MPEG-2 encoding chips, the signal was compressed to 250 Mbps for transmission.
• Present HDTV signals are 1.5 Gbps at baseband and 20 Mbps compressed.
• High-performance compression/decompression and transmission/storage are needed for 24 Gbps → ~300 Mbps
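The bit rates quoted on this slide can be cross-checked with simple arithmetic. The sketch below assumes 12 bits per pixel (8-bit 4:2:0 sampling); the slide states only the roughly 24 Gbps total, so that pixel format is an assumption, and the function and variable names are illustrative.

```python
def raw_bitrate_bps(width, height, fps, bits_per_pixel):
    """Uncompressed video bit rate in bits per second."""
    return width * height * fps * bits_per_pixel

# Assumption: 12 bits/pixel (8-bit 4:2:0). The slide gives only the ~24 Gbps total.
uhd_bps = raw_bitrate_bps(7680, 4320, 60, 12)

# Pixel-count ratio between 7680x4320 and 1920x1080 (the "16x HDTV" claim).
pixel_ratio = (7680 * 4320) / (1920 * 1080)

# Compression ratios implied by the quoted figures.
uhd_ratio = 24e9 / 250e6    # 24 Gbps baseband -> 250 Mbps
hdtv_ratio = 1.5e9 / 20e6   # 1.5 Gbps baseband -> 20 Mbps

print(f"UHD baseband (assumed 12 bpp): {uhd_bps / 1e9:.1f} Gbps")
print(f"UHD / HDTV pixel ratio: {pixel_ratio:.0f}x")
print(f"Compression ratio, UHD: {uhd_ratio:.0f}:1, HDTV: {hdtv_ratio:.0f}:1")
```

Under that assumption the computed baseband rate lands close to the quoted 24 Gbps, and the UHD link needs a somewhat higher compression ratio than present HDTV.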
7680x4320 – UHD TV
3840x2160 – QFHD TV
1920x1080 – HDTV
SDTV
Video Coding Technology Trend
[Chart. Visible labels: H.264, 50%, 69%]
Features of Video Coding Standards
Standard           MPEG-1           MPEG-2          MPEG-4           H.264
MB size            16*16            16*16 (frame)   16*16            16*16
Block size         8*8              8*8             16*16, 8*8       16*16, 16*8, 8*16, 8*8, 8*4, 4*8, 4*4
Transform          DCT              DCT             DCT/Wavelet      4*4 int transform
Entropy coding     VLC              VLC             VLC              VLC, CAVLC and CABAC
ME, MC             Yes              Yes             Yes              41 MVs per MB
Pixel accuracy     ½ pel            ½ pel           ¼ pel            ¼ pel
Reference frames   One frame        One frame       One frame        Multiple (5) frames
Picture type       I, P, B          I, P, B         I, P, B          I, P, B
Transmission rate  Up to 1.5 Mbps   2-15 Mbps       64kbps~2Mbps     64kbps ~ 150Mbps
Not all H.264/AVC systems are equal
Relative Computational Complexity
#Ref Frames   Search Range 8   Search Range 16   Search Range 32
5             16.9             24.6              55.7
1             1                2.54              8.87
Source: Video Coding with H.264/AVC: Tools, Performance and Complexity, J. Ostermann et al, IEEE CAS Mag., Q1 2004.
Quality vs Bit-rate vs Decoding Throughput
Decoding Capability of a 600MHz CPU
QP   Bit Rate (Kbps)   Fps
16   1723              44
21   704               55
26   307               65
Source: H.264/AVC Baseline Profile Decoder Complexity Analysis, M. Horowitz, IEEE T-CSVT, July 2003
Our Target
‧ Single-Chip Decoder for QFHD (3840x2160) H.264/AVC High Profile Video
– CABAD
– 8x8 Transform
– Commodity DDR External Memory
– Platform-Based Design
Performance
Resolution             Size    Clock Frequency   Application
SQCIF (128 x 96)       1.0     0.4 MHz           Video phone
QCIF (176 x 144)       2.0     0.8 MHz
CIF (352 x 288)        8.3     3 MHz             Mobile TV, Car TV, Surveillance
D2 (720 x 480)         28.1    10 MHz
720HD (1280 x 720)     75.0    30 MHz            Home theater
1080HD (1920 x 1088)   170.0   62 MHz            Digital signage, Medical video,
QFHD (3840 x 2160)     675.0   249 MHz           Satellite image, Space exploration
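The Size column appears to be the pixel count normalized to SQCIF, and the clock column scales with it. The sketch below reproduces both under stated assumptions: 720HD is taken as 1280 x 720 (the value consistent with a relative size of 75.0), and the clock estimate assumes 30 fps and 256 cycles per macroblock. Neither of those two parameters is stated on the slide; they are chosen only because they roughly reproduce the quoted clock frequencies.

```python
import math

RESOLUTIONS = {
    "SQCIF": (128, 96),
    "QCIF": (176, 144),
    "CIF": (352, 288),
    "D2": (720, 480),
    "720HD": (1280, 720),    # assumption: 1280 wide, consistent with size 75.0
    "1080HD": (1920, 1088),
    "QFHD": (3840, 2160),
}

def relative_size(name, base="SQCIF"):
    """Pixel count of `name` divided by the pixel count of `base`."""
    w, h = RESOLUTIONS[name]
    bw, bh = RESOLUTIONS[base]
    return (w * h) / (bw * bh)

def macroblocks_per_frame(name):
    """Number of 16x16 macroblocks needed to cover one frame."""
    w, h = RESOLUTIONS[name]
    return math.ceil(w / 16) * math.ceil(h / 16)

def required_clock_hz(name, fps=30, cycles_per_mb=256):
    """Clock needed to decode in real time (fps and cycles_per_mb are assumed)."""
    return macroblocks_per_frame(name) * fps * cycles_per_mb

for name in RESOLUTIONS:
    print(f"{name:7s} size x{relative_size(name):6.1f}  "
          f"clock {required_clock_hz(name) / 1e6:6.1f} MHz")
```

The point of the exercise is that the required clock grows linearly with the macroblock rate, so any per-macroblock cycle saving is multiplied by 32400 macroblocks per QFHD frame.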
Essential Issues
‧ Memory
– Tradeoff Between the Size of Internal Memory and Bandwidth of External Access
‧ Massive Parallelism
‧ Macroblock Decoding Scheduling
NTHU H.264 Decoder Architecture
[Block diagram. System components on the AHB: CPU, Memory, Display, Ethernet Controller. H.264 Video Decoder blocks: MAU & AMBA Interface Translator, Parser, CAVLD/CABAD, IQ & IT, IPRED, INTERP, MVG, BSG, DF. Signals between blocks: coeff, residual, recon, mv & ridx, mvdinfo, bs, para & predinfo.]
Memory
Size vs. b/w in ME
Full HD 30fps, # of rf = 1, SRV = SRH = 64
Level   Memory Size (Bytes)   Memory Bandwidth (MB/s)
A       240                   19658
B       1200                  1516
C       4977                  317
D       124,929               62
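The four data-reuse levels trade on-chip memory for external bandwidth. As a rough illustration of how such a table might be used, the sketch below picks the level with the least internal memory that still fits a given bandwidth budget. The selection function and the budget values are hypothetical; only the (size, bandwidth) pairs come from the slide. The last lines check that the Level D figure matches reading every reference pixel exactly once, assuming one byte per pixel.

```python
# (memory bytes, bandwidth MB/s) for Full HD 30 fps, 1 reference frame, SRV = SRH = 64
LEVELS = {
    "A": (240, 19658),
    "B": (1200, 1516),
    "C": (4977, 317),
    "D": (124929, 62),
}

def smallest_level_within(budget_mb_s):
    """Return the level needing the least memory whose bandwidth fits the budget.

    Returns None when no level fits. Hypothetical helper for illustration.
    """
    fitting = [(mem, name) for name, (mem, bw) in LEVELS.items() if bw <= budget_mb_s]
    return min(fitting)[1] if fitting else None

for budget in (50, 400, 2000):
    print(budget, "MB/s ->", smallest_level_within(budget))

# Assumption: 1 byte per pixel, each pixel of a 1920x1080 frame read once per frame.
once_per_pixel_mb_s = 1920 * 1080 * 30 / 1e6
print(f"read-once bandwidth: {once_per_pixel_mb_s:.1f} MB/s")
```

Moving from Level A to Level D cuts bandwidth by more than two orders of magnitude at the cost of roughly 500 times more internal memory, which is why the choice of level is a system-level decision.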
IME block diagram
[Block diagram. Blocks: CB mem, CB AG, rf0 mem, rf1 mem, rf AG, rf router, rf reg array, CMB (x4), comparator (x4), MVGen (x8), MV mem, MV AG.]
Size vs. b/w in ME
[Chart: Memory Size (Bytes) against Memory Bandwidth (MB/s) for levels A (240 B, 19658 MB/s), B (1200 B, 1516 MB/s), C (4977 B, 317 MB/s) and D (124929 B, 62 MB/s), with an added point labeled "ours".]
Reference-data Pre-fetch System
• No redundant fetching
– Collect the motion vectors of several MBs and read each shared location with a single operation
• Minimize the number of burst initials
– On average 2 burst initials per MB (1 for luma, 1 for chroma)
– Burst read: a group of sequential reads
Reference-data Pre-fetch System (Cont)
[Block diagram. Blocks: CABAC, Motion Vector Generator, Information Translator, Region Analyzer / Searcher, Reference Region & Index Register (R0 to R7), OES manager, MAU Interface, Buffer, Interp. Labels: MB motion vector information (MB0 to MB10), region information (R0, R1, R2), data from SDRAM.]
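The idea of collecting several macroblocks' motion vectors and fetching each shared reference area once can be sketched as follows. This is a simplified model, not the design on the slide: the region size (64 x 16 pixels), integer-pel motion vectors, and the absence of frame-boundary clipping and of the extra border needed for sub-pel interpolation are all assumptions, and every name is illustrative.

```python
MB_SIZE = 16
REGION_W, REGION_H = 64, 16   # hypothetical region size in pixels

def regions_for_mb(mb_x, mb_y, mv_x, mv_y):
    """Region ids touched by the 16x16 reference block of one macroblock.

    mb_x, mb_y are macroblock coordinates; mv_x, mv_y are integer-pel offsets.
    """
    x0 = mb_x * MB_SIZE + mv_x
    y0 = mb_y * MB_SIZE + mv_y
    x1, y1 = x0 + MB_SIZE - 1, y0 + MB_SIZE - 1
    return {
        (rx, ry)
        for rx in range(x0 // REGION_W, x1 // REGION_W + 1)
        for ry in range(y0 // REGION_H, y1 // REGION_H + 1)
    }

def plan_fetches(batch):
    """Map each region to the macroblocks (by batch index) that need it."""
    plan = {}
    for idx, (mb_x, mb_y, mv_x, mv_y) in enumerate(batch):
        for region in sorted(regions_for_mb(mb_x, mb_y, mv_x, mv_y)):
            plan.setdefault(region, []).append(idx)
    return plan

# Four neighbouring macroblocks with small horizontal motion vectors.
batch = [(0, 0, 0, 0), (1, 0, 0, 0), (2, 0, 4, 0), (3, 0, 8, 0)]
plan = plan_fetches(batch)
naive_reads = sum(len(regions_for_mb(*mb)) for mb in batch)
merged_reads = len(plan)
print("per-MB region reads:", naive_reads, "merged region reads:", merged_reads)
```

Because neighbouring macroblocks tend to reference overlapping areas, grouping requests by region reduces both the data volume and the number of bursts that have to be started.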
Massive Parallelism
RLD/IQ/IDCT Timing Diagram
[Timing diagram. Rows: coeflag_mem read, coeff_mem read, IQ stage 1, IQ stage 2, IDCT stage 1, IDCT stage 2, residual_mem write. Time markers: 122, 140, 144, 161, 195, 212, 219. Phases shown: luma ac_0_1 through ac_14_15, dc, chroma ac_0_1 through ac_6_7.]
DF Timing Diagram
Dual Pipelined Edge Filter
[Pipeline diagram over pixel groups L, M and R (rows 0 to 3), Stage 1 to Stage 5. Stage 1: Read Pixels. Operations shown in the intermediate stages: Strong filter (Bs=4), Left delta calculation, Left Weak Filter (Bs<4), Right delta calculation, Right Weak filter (Bs<4), R21 delta calculation, R21 filter. Stage 5: Write Pixels.]
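For readers unfamiliar with the delta calculation and weak filter named in the pipeline, the sketch below shows the H.264 normal-strength (Bs < 4) update for the two samples adjacent to an edge. It is a software illustration only and does not model the hardware pipeline. The thresholds alpha, beta and tc are passed in as plain parameters; in a real decoder they are derived from QP and Bs via the standard's tables, and the optional p1/q1 updates are omitted here.

```python
def clip3(lo, hi, value):
    """Clamp value to the inclusive range [lo, hi]."""
    return max(lo, min(hi, value))

def weak_filter_p0_q0(p1, p0, q0, q1, alpha, beta, tc):
    """Filter the samples next to an edge for Bs < 4 (8-bit samples).

    p1, p0 lie on one side of the edge and q0, q1 on the other.
    Returns the (possibly unchanged) pair (p0, q0).
    """
    # Only filter when the step across the edge looks like a blocking artifact
    # rather than a real image edge.
    if not (abs(p0 - q0) < alpha and abs(p1 - p0) < beta and abs(q1 - q0) < beta):
        return p0, q0
    delta = clip3(-tc, tc, (((q0 - p0) << 2) + (p1 - q1) + 4) >> 3)
    return clip3(0, 255, p0 + delta), clip3(0, 255, q0 - delta)

# Small step across the edge: smoothed, with the change limited by tc.
print(weak_filter_p0_q0(60, 62, 70, 72, alpha=20, beta=6, tc=2))
# Large step across the edge: treated as real content and left untouched.
print(weak_filter_p0_q0(48, 50, 100, 102, alpha=20, beta=6, tc=2))
```

Since the delta depends only on four samples per line, the calculation and the filtering can sit in separate pipeline stages, which is what makes a multi-stage edge filter practical.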
System-Level Optimization Cyclic-Queue-Based IP Interface
Sequential Decoder Timing Diagram (I Frame)
[Timing diagram. Rows: PARSER, CABAD, IQ/IT, BSG, IPRED, DF. Along the time axis: header information decode; initial context table and condition offset; then MB 0 decode, MB 1 decode, MB 2 decode, one after another.]
Elastic Pipeline Decoder Timing Diagram (I Frame)
[Timing diagram. Rows: PARSER, CABAD, IQ/IT, BSG, IPRED, DF. Along the time axis: header information decode; initial context table and condition offset; then MB 0 decode through MB 6 decode, overlapped across the stages.]
ASAP Decode with Cyclic Queue Timing Diagram (I frame)
[Timing diagram. Rows: PARSER, CABAD, IQ/IT, BSG, IPRED, DF. Along the time axis: header information decode; initial context table and condition offset; then MB 0 decode through MB 8 decode, overlapped across the stages.]
Comparison of Different Scheduling Methods
[Chart: cycles per MB (left axis, 0 to 650) and SRAM usage (right axis, 0 to 9 KB) for four methods: Sequential, Elastic Pipeline, ASAP Ping-Pong, ASAP Cyclic-queue. Series: SRAM Usage, Turnaround Cycle, Processing Cycle. Data labels: 644, 620, 540, 486, 486, 161, 159, 140 cycles; 8.3, 5.6, 5.6, 2.62 KB.]
Test Pattern: “pedestrian”
Resolution: 720*480
QP: 28
GOP: III…
Frame #: 30
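The effect of queue depth between two decoder stages can be illustrated with a small timing model. This is a sketch under simplifying assumptions, not the scheduler measured above: only two stages are modeled, a queue slot is held from the moment the producer starts writing an item until the consumer has finished reading it, and the per-macroblock cycle counts are made up. In this model a depth of 1 behaves like sequential decoding, a depth of 2 like a ping-pong buffer, and larger depths like a cyclic queue.

```python
def total_cycles(producer, consumer, depth):
    """Finish time of a two-stage pipeline linked by a queue with `depth` slots.

    producer[i] and consumer[i] are the cycles each stage spends on macroblock i.
    """
    n = len(producer)
    p_done = [0] * n
    c_done = [0] * n
    for i in range(n):
        p_start = p_done[i - 1] if i > 0 else 0
        if i >= depth:
            # Wait until the slot used by item i - depth has been released.
            p_start = max(p_start, c_done[i - depth])
        p_done[i] = p_start + producer[i]
        # The consumer needs both the finished item and its own previous job done.
        c_start = max(p_done[i], c_done[i - 1] if i > 0 else 0)
        c_done[i] = c_start + consumer[i]
    return c_done[-1]

# Hypothetical per-macroblock workloads with uneven stage times.
producer = [100, 100, 100, 400]
consumer = [300, 300, 100, 100]

sequential = total_cycles(producer, consumer, depth=1)
ping_pong = total_cycles(producer, consumer, depth=2)
cyclic = total_cycles(producer, consumer, depth=4)
print(sequential, ping_pong, cyclic)
```

When stage times vary from macroblock to macroblock, a deeper queue lets the faster stage run further ahead, so fewer cycles are lost to stalls.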
Verification Environment
[Directory-tree diagram, annotated "Easy Bug Tracing". Names shown under H264: filelist, tbench, fpga_lib, asic_lib, rtl_sim, gate_sim, syn, vn, nlint, netlist, jm11.0. Module names: top, main_ctrl, parser, cabad, idct, ipred, interp, df, mvg, bsg, mem, mfu, lm_wrap, amba_wrap, hd_amba. Names shown under Sub IP: def, filelist, tbench, rtl, rtl_sim, gate_sim, syn, vn, nlint. Memory libraries: xilinx_mem, altera_mem, artisan_mem.]
A Multimedia SOC Platform
[Block diagram. High-Speed Bus: CPU, Accelerator, ROM/Flash memory, SRAM, Static memory controller, SDRAM Controller (4-CH), VIC, USB 2.0 with USB (PHY), DMA, JPEG Codec, Display Controller, Capture, APB Bridge, FPGA Daughter Board. Peripheral Bus: PWM, WDT, TIMER, DAI, SSI, SD, SM, UART, GPIO, I2C. External devices: SDRAM, Audio Codec (I2S with SSI), Flash Card, Video-In (CCIR601), TV/LCD, Button, LED.]
Summary
‧ Super High Definition Video Capturing, Delivery and Display are on the Horizon
‧ Massive Parallelism is Essential for Making Consumer Applications Possible
‧ Tradeoff Among Memory Usage, Bandwidth and Logic Has Profound Impact on the Overall System Performance
‧ System Design Should Be Adaptable to Content, Quality Variation