Parallelization of the AliRoot Event-Reconstruction Stefan B. Lohn CERN, 6. October 2011
Outline
1. Introduction
2. Transformation of Thread-Safety
3. Transformation of ROOT & AliRoot
4. Critical section: CINT
5. Multi-threaded execution
6. Sharing resources to reduce the footprint
7. Testing Event-Reconstruction
8. Conclusions
Introduction
This work is based on tools and techniques developed for the parallelization of sequential C/C++ source code, which were successfully applied to the parallel Monte Carlo simulation Geant4.
More and more computing units are available: Intel's Westmere already has 12 cores plus Hyper-Threading. But resources such as caches, I/O bandwidth and internal data links are limited.
Two ways of parallel processing:
1) Multi-processing
• Slow context switch
• Sharing memory is complicated
• PROOF-Lite
2) Multi-threading
• Depends on support of thread-safety
(Figure origin: Dennis Schmitz, Wikimedia Foundation)
Introduction
Thread-safety: access to resources shared among threads must not interfere with the processing of other threads, even in unpredicted ways. This is also called unconditional thread-safety.
Question: Can we introduce a parallel AliRoot event reconstruction using multi-threading?
The basic steps are:
1. Introducing thread-safety
2. Keeping performance and scalability
3. Reducing the memory footprint
Software stack (diagram): Event-Reconstruction; AliRoot (physics analysis); ROOT (physics analysis for huge amounts of data); CINT (C/C++ interpreter)
Transformation of Thread-Safety
Source-to-source transformation: What are we looking for to obtain thread-safety?
Searching:
• Global declarations
• Static declarations
• Extern declarations
Pipeline (diagram): source-code files -> parsing -> abstract syntax tree (AST) -> static analysis -> transformation -> rewriting (pretty-printing) -> source-code files
Adding the thread-local specifier:
a) __thread int Variable;
b) static __thread int Variable;
c) extern __thread int Variable;
Transformation of Thread-Safety (continued)
But non-PODs need to be changed. For a declaration such as
std::string Var;
1. the variable is replaced by a thread-local pointer: __thread std::string* Var_Ptr;
2. accesses from functions are corrected accordingly.
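The two cases on the slides above (a POD with a thread-local specifier and a non-POD replaced by a thread-local pointer) can be sketched as follows. This is a minimal illustration, assuming GCC or Clang for the `__thread` extension; the accessor `Var()`, the variable `Counter` and the helper `WorkerLeavesCallerUntouched()` are hypothetical names, not taken from the slides.

```cpp
#include <string>
#include <thread>

// POD case: the __thread specifier gives every thread its own copy.
static __thread int Counter = 0;

// Non-POD case: __thread cannot run constructors, so the object is
// replaced by a thread-local pointer (as on the slide) ...
static __thread std::string* Var_Ptr = 0;

// ... and every access goes through a helper that initializes lazily.
// Var() is a hypothetical accessor introduced for this sketch.
std::string& Var() {
    if (!Var_Ptr) Var_Ptr = new std::string("init");
    return *Var_Ptr;
}

// Lets a worker thread modify its own copies, then reports whether the
// calling thread's copies were left untouched.
bool WorkerLeavesCallerUntouched() {
    Counter = 1;
    Var() = "main";
    std::thread worker([] {
        Counter = 99;        // writes the worker's Counter only
        Var() = "worker";    // allocates the worker's own string
        delete Var_Ptr;      // release the worker's copy before exit
        Var_Ptr = 0;
    });
    worker.join();
    return Counter == 1 && Var() == "main";
}
```

Calling `WorkerLeavesCallerUntouched()` from any thread should return `true`. The sketch does not free the calling thread's string; a real transformation needs a cleanup policy for that.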
Transformation of Thread-Safety
First attempt: patching the GCC parser to obtain information about declarations.
Pipeline (diagram): source-code files -> parsing (patched GCC parser) -> AST -> static analysis -> transformation -> rewriting (pretty-printing) -> source-code files
Unfortunately, the patched parser offers no interaction with the abstract syntax tree, and the GCC plugin support is of no use in our case.
Transformation of Thread-Safety
Alternatives:
1. ROSE compiler with the EDG frontend
2. LLVM with Clang as C/C++ frontend
Both are capable of performing the proposed transformation with high precision.
But: EDG does not accept the whole AliRoot code and is commercially licensed.
The RecursiveASTVisitor template in Clang is used for traversing the AST; statement, expression and type visitors give access to the nodes of the AST. The Rewriter object can be used for replacing and adding source code.
Implementation not finished.
Transformation of ROOT & AliRoot
Converting static/global/extern declarations -> TLS:

           Statics   Globals   Extern
AliRoot    1724      196       220
ROOT       897       7         554
CINT       749       715       941

Finally, around 1000 TLS specifiers have been added in ROOT and 366 in AliRoot. 3000 lines in ROOT and 1660 in AliRoot have been added for initialization.
=> about 6000 lines added automatically, with a few unusual exceptions treated manually.
Critical section: CINT
As demonstrated, the transformation in its current state lacks access to more precise and reliable information from the AST, so CINT has not been transformed yet. In addition, CINT is not only executed itself but also generates source code that is executed, the so-called dictionaries. It therefore remains thread-unaware.
Q: Can we work around this issue?
1. Use ACLiC, i.e. compile macros first.
2. Avoid concurrent write access to type information in the interpreter by building it in advance.
3. Lock the critical sections in which CINT is called.
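The third step, locking the critical sections around interpreter calls, might look like the sketch below. `Interpreter`, `ProcessLine`, `gInterpreterMutex` and `RunLocked` are hypothetical stand-ins invented for this illustration; they are not the CINT or TCint API.

```cpp
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for a thread-unaware interpreter singleton.
class Interpreter {
public:
    static Interpreter& Instance() { static Interpreter i; return i; }
    // Touches unsynchronized internal state, like a thread-unaware library.
    void ProcessLine(const std::string& line) { if (!line.empty()) ++fCalls; }
    long Calls() const { return fCalls; }
private:
    Interpreter() : fCalls(0) {}
    long fCalls;
};

// One global lock guards every entry into the interpreter.
std::mutex gInterpreterMutex;

void LockedProcessLine(const std::string& line) {
    std::lock_guard<std::mutex> lock(gInterpreterMutex);  // critical section
    Interpreter::Instance().ProcessLine(line);
}

// Starts nThreads threads that each make callsPerThread locked calls and
// returns how many calls the interpreter registered during this run.
long RunLocked(int nThreads, int callsPerThread) {
    const long before = Interpreter::Instance().Calls();
    std::vector<std::thread> pool;
    for (int t = 0; t < nThreads; ++t)
        pool.emplace_back([callsPerThread] {
            for (int i = 0; i < callsPerThread; ++i) LockedProcessLine("x");
        });
    for (auto& th : pool) th.join();
    return Interpreter::Instance().Calls() - before;
}
```

With the lock in place no update is lost, so `RunLocked(4, 1000)` registers 4000 calls. The price is that all interpreter calls are serialized, which is why the first two steps try to keep such calls rare.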
Critical section: CINT
Following these three steps, CINT and its interface TCint can be used as singletons and stay thread-unaware. But TROOT accesses TCint and should stay a singleton too.
Q: Can TROOT be used as a singleton?
The lists that TROOT keeps on the heap will be replaced by lists on the thread-private heap of each thread:
• ListOfFiles
• ListOfMappedFiles
• ListOfCanvases
• ListOfStyles
• and so on.
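The idea of one shared singleton whose lists live per thread can be sketched as follows. `Root` is a hypothetical stand-in for TROOT, and the sketch uses C++11 `thread_local` (an assumption; the slides use `__thread`, which cannot hold a non-POD such as a vector directly).

```cpp
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for TROOT: a single shared instance ...
class Root {
public:
    static Root& Instance() { static Root r; return r; }
    // ... whose list is private to the calling thread.
    std::vector<std::string>& ListOfFiles() {
        static thread_local std::vector<std::string> files;
        return files;
    }
private:
    Root() {}
};

// Registers one file in the calling thread and one in a worker thread,
// then checks that each thread saw only its own entry.
bool ListsArePerThread() {
    Root::Instance().ListOfFiles().clear();
    Root::Instance().ListOfFiles().push_back("main.root");
    std::size_t seenByWorker = 0;
    std::thread worker([&seenByWorker] {
        Root::Instance().ListOfFiles().push_back("worker.root");
        seenByWorker = Root::Instance().ListOfFiles().size();
    });
    worker.join();
    return seenByWorker == 1 && Root::Instance().ListOfFiles().size() == 1;
}
```

Both threads go through the same `Root::Instance()`, yet neither sees the other's file entry, so `ListsArePerThread()` returns `true`.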
Multi-threaded execution
Simple test setup (diagram): Initialization -> Extract -> concurrent processing -> Termination -> Merging -> Exit
1. No interference between threads
2. Most parts stay almost the same
3. Minor changes in the code for steering the event reconstruction
4. An extraction step needs to distribute the required information
5. A merging step needs to fuse the results
BUT
1. Additional runtime for extraction, merging and initialization
2. The original initialization is repeated per thread, which wastes time
3. With many cores, I/O performance gets worse
=> To fix this, an I/O-managing thread is proposed for implementation
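The extract, concurrent-processing and merge steps listed above can be sketched with standard threads. `Reconstruct` is a hypothetical placeholder for the per-event work and `ReconstructParallel` is an illustrative name; the round-robin distribution is an assumption, not the scheme used in AliRoot.

```cpp
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical placeholder for reconstructing a single event.
double Reconstruct(int event) { return event * 0.5; }

double ReconstructParallel(const std::vector<int>& events, unsigned nThreads) {
    if (nThreads == 0) nThreads = 1;

    // Extract: distribute the events round-robin over the threads.
    std::vector<std::vector<int> > chunks(nThreads);
    for (std::size_t i = 0; i < events.size(); ++i)
        chunks[i % nThreads].push_back(events[i]);

    // Concurrent processing: each thread writes only its own result slot.
    std::vector<double> partial(nThreads, 0.0);
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < nThreads; ++t)
        pool.emplace_back([&chunks, &partial, t] {
            double sum = 0.0;
            for (int e : chunks[t]) sum += Reconstruct(e);
            partial[t] = sum;
        });
    for (auto& th : pool) th.join();

    // Merge: fuse the per-thread results into one.
    return std::accumulate(partial.begin(), partial.end(), 0.0);
}
```

Because every thread works on its own chunk and its own result slot, no locking is needed during processing; only the extraction and merging steps run sequentially, which corresponds to the additional runtime noted on the slide.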
Multi-threaded execution
Investigating scalability with the simple test setup (Initialization -> Extract -> concurrent processing -> Termination -> Merging -> Exit):
• Test: creation of 1M random numbers, stored in separate files of 900 MB in total.
• The speedup grows up to 9.64 with 12 threads.
• The gap in the plot is caused by I/O usage (~100 MB/s).
Test machine: 12-core Intel Westmere, 2.6 GHz, 12 MB LL cache.
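To put the reported speedup into perspective, the parallel efficiency and the experimentally determined serial fraction (Karp-Flatt metric) can be computed from the two numbers on the slide. The symbols E, e, S and N and the choice of the Karp-Flatt metric are assumptions added here; the slide itself attributes the gap to I/O usage and gives no such figures.

```latex
S = 9.64, \quad N = 12:
\qquad
E = \frac{S}{N} = \frac{9.64}{12} \approx 0.80,
\qquad
e = \frac{1/S - 1/N}{1 - 1/N} = \frac{1/9.64 - 1/12}{1 - 1/12} \approx 0.022
```

Read this way, roughly 2% of the work behaves as if it were serial, which is compatible with a limit imposed by shared I/O rather than by the computation itself.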
Sharing resources
The value of this approach lies not just in using multi-threading, but in using shared memory to reduce overall memory consumption. In the same way that we shared TROOT among all threads, we can share other classes as well.
Member fields of a shared class:
• Relatively read-only fields (read-only after initialization) => stay on the global heap
• Transitory fields (read-write) => go to the thread-private heap
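The split between relatively read-only and transitory member fields can be sketched as below. `Geometry`, its fields and `CountPerThread` are hypothetical names for this illustration, and making the transitory field a `thread_local` static member is a simplification that assumes a single shared instance.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical shared class.
struct Geometry {
    // Relatively read-only: filled during initialization, afterwards only
    // read, so one copy on the global heap serves all threads.
    std::vector<double> fLayerRadius;
    // Transitory: read-write scratch state, one copy per thread.
    static thread_local int fHitsThisEvent;
};
thread_local int Geometry::fHitsThisEvent = 0;

// One thread per cut value: each thread reads the shared radii and counts
// the layers below its cut in its own private counter.
std::vector<int> CountPerThread(const Geometry& geo,
                                const std::vector<double>& cuts) {
    std::vector<int> result(cuts.size(), 0);
    std::vector<std::thread> pool;
    for (std::size_t t = 0; t < cuts.size(); ++t)
        pool.emplace_back([&geo, &cuts, &result, t] {
            Geometry::fHitsThisEvent = 0;
            for (double r : geo.fLayerRadius)
                if (r < cuts[t]) ++Geometry::fHitsThisEvent;
            result[t] = Geometry::fHitsThisEvent;
        });
    for (auto& th : pool) th.join();
    return result;
}
```

The radii exist once in memory regardless of the number of threads, while the counters never interfere with each other, which is the memory saving the slide describes.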
Sharing resources
(Origin: X. Dong, G. Cooperman, J. Apostolakis, "Multithreaded Geant4: Semi-automatic Transformation into Scalable Thread-Parallel Software")
1) General classification using a profiler, e.g. Massif.
2) Then one must roughly classify the member fields.
3) ptrace and memory protection can then be used to verify whether they are relatively read-only or transitory fields.
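The memory-protection part of step 3 can be illustrated with a small POSIX sketch: a field is placed on its own page, the page is made read-only after initialization, and any later write shows up as a fault. This assumes Linux, covers only `mprotect` (not ptrace), and is not thread-safe; `WritesToProtected`, `ReadOnlyUse` and `WritingUse` are hypothetical names, and recovering from the fault with `siglongjmp` is a common testing trick rather than the method of the cited work.

```cpp
#include <setjmp.h>
#include <signal.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstddef>

static sigjmp_buf gJump;  // resume point after a protection fault

static void OnSegv(int) { siglongjmp(gJump, 1); }

// Runs `candidate` on a field that lives on a read-only page and returns
// true if the candidate tried to write to it (returns false if the page
// could not be set up).
bool WritesToProtected(void (*candidate)(int*)) {
    const std::size_t page = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
    void* mem = mmap(0, page, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) return false;
    int* field = static_cast<int*>(mem);
    *field = 42;                      // initialization phase: writes allowed
    mprotect(mem, page, PROT_READ);   // afterwards: read-only

    struct sigaction sa, old;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = OnSegv;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, &old);

    volatile bool wrote = false;
    if (sigsetjmp(gJump, 1) == 0)
        candidate(field);             // faults if it writes
    else
        wrote = true;                 // reached via the signal handler

    sigaction(SIGSEGV, &old, 0);      // restore the previous handler
    munmap(mem, page);
    return wrote;
}

// Two hypothetical candidates: one only reads the field, one writes it.
void ReadOnlyUse(int* p) { volatile int v = *p; (void)v; }
void WritingUse(int* p)  { *static_cast<volatile int*>(p) = 7; }
```

A field that survives a full run under such protection is a candidate for the relatively read-only category; a fault marks it as transitory.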
Testing Event-Reconstruction
Preliminary results for the PPBench raw reconstruction (proton-proton collisions), tested with 200 events and ITS only:
• 4 threads run with a speedup of 2.5.
• Only twice as much memory is used as in a single-threaded reconstruction.
• Only CINT and TROOT are shared.
Test machine: 12-core Intel Westmere, 2.6 GHz, 12 MB LL cache.
Conclusions
1. A simple way of parallelization that works for AliRoot.
2. Reducing time in development and maintenance.
3. Introducing multi-threading without expert knowledge.
4. Keeping memory consumption under control.
5. Providing an analysis technique to investigate candidates for shared classes
6. and to investigate concerns of correctness.
Further efforts:
1. Analyze the correctness of this approach.
2. Find sharable classes to reduce memory consumption (e.g. ITSgeom).
3. Investigate further needs for massive multi-threading.