Accelerating Proton Computed Tomography with GPUs
Thomas D. Uram, Argonne Leadership Computing Facility
Michael E. Papka, Argonne Leadership Computing Facility, Northern Illinois University
Nicholas T. Karonis, Northern Illinois University, Argonne National Laboratory
Overview
‣ Proton computed tomography (pCT) is an alternative to X-ray based CAT scans, which promises several medical benefits at the cost of being significantly more computationally expensive
‣ We designed a 60-node GPU cluster to meet the computational challenge
‣ Computed tomography
‣ Benefits of proton computed tomography
‣ Computational problem description
‣ CPU/GPU performance comparison
Argonne Leadership Computing Facility - Thomas D. Uram (turam@anl.gov)
What is Computed Tomography?
‣ CAT (or CT) scans are well-known
‣ CAT == “computerized axial tomography”
‣ CAT scans are used to reconstruct the density distribution within a volume, typically in medical imaging
‣ CAT scans are conducted with photons (X-rays)
‣ What is Proton Computed Tomography?
  • A reconstruction technique similar to X-ray computed tomography, conducted with protons instead of photons
Why Proton Computed Tomography?
‣ 13 million people are diagnosed with cancer each year worldwide
‣ 2.6 million of them are candidates for proton therapy treatment
‣ Proton therapy involves depositing protons at precise locations within a tumor site where they irradiate the target tissue
‣ The protons emit lower radiation as they travel through the body until they reach the target, where they emit a burst of radiation (the Bragg peak)
  • Healthy tissue beyond the tumor site receives nominally no radiation
‣ It is crucially important to precisely identify the tumor site
  • To ensure that cancerous tissue is destroyed
  • To avoid damaging healthy tissue surrounding the tumor, especially in sensitive areas
‣ Proton therapy treatment planning is currently performed using X-ray imaging
  • Photons and protons interact with intermediate material differently
  • Conversion between photon/proton modalities involves a systematic range error of 3-5%
Image source: Wikipedia
Proton computed tomography
‣ Our goal is to reconstruct the volume of an adult human head in under 10 minutes
‣ Protons are directed through two frontal planes, the target volume, two backing planes, and finally a calorimeter
‣ Measures position and angle of incidence of protons at entry and exit, and the energy loss
[Figure: detector schematic. Final system (in black): 4 tracking planes with XY Si detectors; calorimeter with 64 end-on CsI crystals. Planned scaled prototype (in red): 4 planes of XY Si detectors (2 X-SSDs and 2 Y-SSDs per plane); 8 CsI crystal bars. Calorimeter: each bar corresponds to a 5cm x 5cm CsI crystal, read out by a photodiode. Tracking plane: each large square corresponds to one double-sided or two single-sided 9cm x 9cm SSDs.]
Problem Description
‣ Proton source, detector planes, and calorimeter mounted on a rotating gantry, as in familiar X-ray CT configurations
‣ Data collected over a full rotation of the gantry, 180 samples (every 2 degrees)
‣ Initial detector designed to image a human head (nominally 25cm cube)
‣ From the physics domain, and so that each voxel is sufficiently represented in the resulting system matrix, we approximate requiring a volume consisting of 256x256x36 (2,359,296 =~ 2.4M) voxels and 2 billion protons total
‣ For each proton, we track 11 values:
  • [x,y,z] at entry
  • [x,y,z] at exit
  • angle at entry and exit
  • input and output energy
  • gantry rotation angle
[Figure: detector schematic (same as previous slide)]
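The per-proton record and the problem size above can be written down as a minimal sketch. The field names and the 4-byte value size are assumptions made for illustration; the slides only give the count of 11 values, the 256x256x36 volume, and the 2 billion proton total.

```python
from dataclasses import dataclass, fields


@dataclass
class ProtonRecord:
    """One tracked proton; field names are hypothetical, the count (11) is from the slides."""
    entry_x: float
    entry_y: float
    entry_z: float
    exit_x: float
    exit_y: float
    exit_z: float
    entry_angle: float
    exit_angle: float
    energy_in: float
    energy_out: float
    gantry_angle: float


VOLUME_DIMS = (256, 256, 36)  # reconstruction volume quoted in the slides


def voxel_count(dims=VOLUME_DIMS):
    """Number of voxels in the reconstruction volume."""
    nx, ny, nz = dims
    return nx * ny * nz


def raw_input_bytes(n_protons, bytes_per_value=4):
    """Raw input size, assuming each tracked value is stored in 4 bytes (an assumption)."""
    return n_protons * len(fields(ProtonRecord)) * bytes_per_value
```

Under the 4-byte assumption, `raw_input_bytes(2_000_000_000)` puts the raw input for 2 billion protons at 88 GB, before any path or system-matrix data is generated.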
Baseline execution times
‣ Began with serial code that took more than 7 hours to process 131M protons
‣ Parallelized with MPI to use multiple CPUs
‣ Established baseline execution times

1 billion protons, 60 nodes, CPU only

Phase                      Execution time (seconds)
Setup                      128.2
Most Likely Path (MLP)     1278.5
Linear solver (CARP)       664.9
Overall execution time     2072.0
MLP (Most Likely Path)
‣ In contrast with X-ray computed tomography, in which the particles traverse the volume in straight lines, in pCT the protons are scattered by the material as they travel through the volume
‣ MLP computes the path integral of the protons through the material based on their known entry and exit locations and angles and the energy loss
‣ The proton paths are discretized as the voxels touched while traversing the volume
‣ Path integral calculations are independent and parallelize at the level of protons (but are inherently sequential within each path)
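To illustrate how a curved path is discretized into touched voxels, here is a rough sketch. It is not the MLP formula: a cubic Hermite curve through the entry and exit points stands in for the most likely path, and the path is sampled at fixed steps, so voxels that are only clipped at a corner can be missed. All function names and parameters are hypothetical.

```python
def hermite_point(p0, d0, p1, d1, t):
    """Point at parameter t in [0, 1] on a cubic Hermite curve from p0 to p1.

    d0 and d1 are tangent vectors at entry and exit (stand-ins for the
    measured entry and exit directions).
    """
    h00 = 2 * t**3 - 3 * t**2 + 1
    h10 = t**3 - 2 * t**2 + t
    h01 = -2 * t**3 + 3 * t**2
    h11 = t**3 - t**2
    return tuple(
        h00 * a + h10 * b + h01 * c + h11 * d
        for a, b, c, d in zip(p0, d0, p1, d1)
    )


def voxels_touched(p0, d0, p1, d1, voxel_size, steps=1000):
    """Ordered list of voxel indices visited along one proton's path.

    The walk is sequential within a path, but each proton's path is
    independent of every other proton's, which is what allows one
    thread (or task) per proton.
    """
    path = []
    for i in range(steps + 1):
        x, y, z = hermite_point(p0, d0, p1, d1, i / steps)
        idx = (int(x // voxel_size), int(y // voxel_size), int(z // voxel_size))
        if not path or path[-1] != idx:
            path.append(idx)
    return path
```

For example, `voxels_touched((0.5, 0.5, 0.5), (3, 0, 0), (3.5, 0.5, 0.5), (3, 0, 0), 1.0)` walks a straight line along x through four unit voxels.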
Linear solver (CARP)
‣ The result of MLP is a system of equations relating each proton’s touched voxels to the relative stopping power (roughly, the energy loss)
‣ We began the project with a CPU implementation of the row-action based sparse iterative solver CARP (component averaged row projections)
‣ CARP decomposes the matrix into row blocks, one block per processor, and iterates to satisfactory convergence:
  • Performs a Jacobi-like iteration sequentially through the rows to produce a per-block solution vector
  • Averages the per-block solution vectors (in component-wise fashion)
  • Redistributes the solution vector x to all processors
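The three CARP steps above can be sketched on a small dense system. This is an illustration only: the per-block sweep is written as a Kaczmarz-style row projection (an assumption about what the "Jacobi-like iteration" looks like), the blocks are processed in a plain loop rather than on separate processors, and the function names, relaxation parameter, and iteration count are hypothetical.

```python
def row_sweep(rows, rhs, x, relax=1.0):
    """One sequential pass of row projections over a block; returns a new vector."""
    x = list(x)
    for a, b in zip(rows, rhs):
        norm2 = sum(v * v for v in a)
        if norm2 == 0.0:
            continue  # skip empty rows
        r = (b - sum(ai * xi for ai, xi in zip(a, x))) / norm2
        for j, aj in enumerate(a):
            if aj != 0.0:
                x[j] += relax * r * aj
    return x


def carp(blocks, n, iterations=2000):
    """Component-averaged row projections over a list of (rows, rhs) blocks."""
    # For each block, record which components its rows actually touch.
    touches = [
        [any(row[j] != 0.0 for row in rows) for j in range(n)]
        for rows, _ in blocks
    ]
    x = [0.0] * n
    for _ in range(iterations):
        # Step 1: each block sweeps its own rows, starting from the shared x.
        local = [row_sweep(rows, rhs, x) for rows, rhs in blocks]
        # Step 2: average each component over the blocks that touch it.
        new_x = []
        for j in range(n):
            vals = [xl[j] for xl, t in zip(local, touches) if t[j]]
            new_x.append(sum(vals) / len(vals) if vals else x[j])
        # Step 3: the averaged vector becomes the shared x for the next iteration.
        x = new_x
    return x
```

A usage example with two row blocks of a consistent 4x3 system whose solution is (1, 2, 3): `carp([([[2, 1, 0], [1, 3, 0]], [4, 7]), ([[0, 1, 2], [1, 0, 1]], [8, 4])], 3)`.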
Hardware: Gaea GPU cluster at Northern Illinois University
‣ 60 compute nodes
‣ Node configuration
  • 2x Intel X5650 CPUs (12 cores per node)
  • 2x NVIDIA M2070 GPUs
  • 72GB RAM
  • QDR InfiniBand
Data decomposition
‣ 2.1B protons / 60 nodes =~ 35M protons per node
‣ 2 GPUs -> 17M protons per GPU
‣ The maximum voxels per proton is ~364
‣ 17M protons x 364 voxels x 4 bytes/voxel = 25GB data per GPU
  • Larger than available M2070 GPU memory of 6GB
‣ High watermark memory requirement on cluster is 3TB (aggregate)
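The arithmetic above can be reproduced with a short sketch. The function names are hypothetical, and the batch estimate assumes the whole 6GB of GPU memory is available for voxel data, which in practice it would not be.

```python
GB = 10**9


def per_gpu_footprint(total_protons, nodes, gpus_per_node, max_voxels, bytes_per_voxel):
    """Return (protons per GPU, worst-case voxel bytes per GPU)."""
    protons_per_gpu = total_protons // (nodes * gpus_per_node)
    return protons_per_gpu, protons_per_gpu * max_voxels * bytes_per_voxel


def min_batches(bytes_needed, gpu_memory_bytes):
    """Lower bound on the number of batches needed to stage the data through one GPU."""
    return -(-bytes_needed // gpu_memory_bytes)  # ceiling division


protons, nbytes = per_gpu_footprint(2_100_000_000, 60, 2, 364, 4)
aggregate = nbytes * 60 * 2
```

Without the slide's rounding to 17M, this gives 17.5M protons and about 25.5GB per GPU, roughly 3TB in aggregate, and at least 5 batches per GPU.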
MLP (Most Likely Path) CUDA implementation
‣ MLP involves calculating the path integral of the protons
‣ Initial implementation assigns a thread per proton
‣ Per-GPU proton data is larger than GPU memory on M2070
‣ Stage batches of protons to GPU
‣ MLP was ported to the GPU, with multiple variants
  • gpu struct: Direct port of CPU-based code using structured proton/voxel data
  • gpu flat memory: Flat memory space with per-proton padded voxel arrays
  • gpu flat memory + overlap: Streaming computation to overlap compute and host-device transfers
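The "flat memory" layout can be sketched on the host side as follows. This is an assumed layout, not the authors' code: each proton gets a fixed-size slot of 364 voxel indices (the maximum quoted earlier), unused entries hold a hypothetical sentinel value, and thread p would read its data starting at offset p * stride.

```python
from array import array

MAX_VOXELS = 364  # maximum voxels per proton quoted in the slides
PAD = -1          # hypothetical sentinel marking unused slots


def flatten_paths(paths, stride=MAX_VOXELS, pad=PAD):
    """Pack variable-length voxel lists into one flat, padded integer array."""
    flat = array("i", [pad]) * (len(paths) * stride)
    for p, voxels in enumerate(paths):
        if len(voxels) > stride:
            raise ValueError("path longer than the per-proton slot")
        start = p * stride
        # Equal-length slice assignment keeps every proton's slot at a fixed offset.
        flat[start:start + len(voxels)] = array("i", voxels)
    return flat


def batches(n_protons, protons_per_batch):
    """Yield (start, stop) proton ranges for staging batches to the GPU."""
    for start in range(0, n_protons, protons_per_batch):
        yield start, min(start + protons_per_batch, n_protons)
```

A fixed stride trades some wasted memory for uniform addressing, and a single contiguous buffer per batch is convenient for one host-to-device copy.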
MLP (Most Likely Path) CUDA implementation (26M protons, 2 GPUs)

Implementation                Execution time (seconds)    Speedup
cpu                           598.7                       -
gpu_struct                    77.6                        7.7x
gpu_flat_memory               55.5                        10.8x
gpu_flat_memory + overlap     53.0                        11.3x
Linear solver (CARP) CUDA implementation (26M protons, 2 GPUs)
‣ CARP ported directly from CPU code
‣ Per-node row-block data larger than GPU memory; batch process
‣ Further subdivide per-node row-block into row-blocks per streaming multiprocessor

Implementation    Execution time (seconds)    Speedup
cpu               161.0                       -
gpu               139.3                       1.16x

‣ Limited speedup in GPU implementation, because:
  • row-action based solver constrains parallel granularity
  • scattered memory accesses constrain performance, as is typical of sparse matrix operations
Performance at scale
2 billion protons, 60 nodes, 12 CPU cores/node, 2 GPUs/node

Phase                      Execution time (seconds)
Setup                      22.3
Most Likely Path (MLP)     151.0
Linear solver (CARP)       265.5
Overall execution time     438.8

Initial goal was to complete in <600s (10 mins)
Further work: CARP Hybrid CPU/GPU
‣ Assign row blocks to CPU and GPU simultaneously
‣ Weighted work distribution based on initial performance measurements

2 billion protons, 60 nodes, 12 cores/node, 2 GPUs/node

Implementation    Execution time (seconds)    Speedup
cpu               161.0                       -
gpu               139.3                       1.16x
hybrid            102.3                       1.57x
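One plausible reading of "weighted work distribution" is to split row blocks in proportion to each device's measured throughput. The sketch below assumes that reading; the function name and the rule for leftover blocks are hypothetical, and the timings are the cpu and gpu figures from the table above.

```python
def split_work(n_blocks, measured_seconds):
    """Split n_blocks across devices in proportion to 1 / measured time."""
    rates = {name: 1.0 / t for name, t in measured_seconds.items()}
    total = sum(rates.values())
    shares = {name: int(n_blocks * r / total) for name, r in rates.items()}
    # Blocks lost to integer truncation go to the fastest device.
    fastest = max(rates, key=rates.get)
    shares[fastest] += n_blocks - sum(shares.values())
    return shares


shares = split_work(1000, {"cpu": 161.0, "gpu": 139.3})
```

With these timings the GPU side receives slightly more than half of the blocks, reflecting its modest 1.16x advantage over the CPU side.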