Speed And Accuracy Dilemma In NoC Simulation: What About Memory Impact?
Manuel Selva, Abdoulaye Gamatié, David Novo, Gilles Sassatelli
LIRMM (CNRS and University of Montpellier)
18 January 2016
Context
Manycore processors integrating a NoC already exist
◮ Intel Xeon Phi
◮ Kalray MPPA2-256
◮ TILE-Gx72
NoC simulation tools are needed (and already available)
◮ Booksim, NoCTweak, Garnet, Noxim, McSim
◮ The perfect simulator is both fast and accurate
◮ Speed/accuracy dilemma
What about memory footprint?
1 / 13
Why Care About Memory Footprint?
[Figure: simulation time against the number of cores in the simulated manycore, with the region where swapping is required marked.]
Evaluate memory footprint of different simulators
2 / 13
Outline
◮ Considered Simulators
◮ Impact Of Accuracy On Memory Footprint
◮ Impact Of Programming Abstraction On Memory Footprint
◮ Conclusions and Perspectives
3 / 13
Considered Simulators - 2 Criteria
Accuracy
◮ Bit-accurate
◮ Cycle-accurate
◮ Transaction Level Modeling (TLM)
Programming abstraction (higher abstraction: more productivity, less simulation speed)
◮ Simulation frameworks (SystemC, Ptolemy II)
◮ High level programming languages (C++, Java)
◮ Low level programming languages (C)
4 / 13
Considered Simulators

Simulator    Accuracy         Programming abstraction   Injector
McSim-TLM    TLM              SystemC                   Application Model
McSim-CA     Cycle-accurate   SystemC                   Application Model
Booksim      Cycle-accurate   C++                       Random uniform

McSim-CA is based on NoCTweak
5 / 13
Simulated Hardware - Distributed Memory System
[Figure: 3x3 mesh of tiles numbered 0 to 8; each tile contains a router, a core and a local memory.]
6 / 13
Simulated Hardware - Priority Based Routers
[Figure: router block diagram with a local port and data in / data out ports, built from routing, arbitration and packet switching blocks.]
7 / 13
McSim-TLM vs McSim-CA - Accuracy
[Figure: bar chart of execution time (ms) against NoC size (2x2, 3x3, 4x4, 5x5, 8x8, 10x10, 15x15, 20x20) for McSim-TLM and McSim-CA; bar labels range from 21 to 38 ms.]
8 / 13
McSim-TLM vs McSim-CA - Memory Footprint
[Figure: average memory footprint (Mb, logarithmic scale) against NoC size (4x4, 8x8, 16x16, 20x20, 32x32, 64x64, 128x128) for McSim-TLM and McSim-CA, with the host memory limit marked at 4,000 Mb; data labels range from 9 to 3,777 Mb.]
9 / 13
McSim-CA vs Booksim - Memory Footprint
[Figure: average memory footprint (Mb, logarithmic scale) against NoC size (4x4, 8x8, 16x16, 20x20, 32x32, 64x64, 128x128) for BookSim and McSim-CA, with the host memory limit marked at 4,000 Mb; data labels range from 5 to 3,777 Mb.]
10 / 13
Deep Memory Footprint Analysis
A lot of objects
◮ A few big objects accounting for 1% of the footprint
◮ A lot of small SystemC objects (3,500,000 for 20x20)
Accellera implementation
◮ Each SystemC object has a unique name
  ◮ Used for debugging purposes
  ◮ Required by the standard
11 / 13
Optimized Accellera - Memory Footprint
[Figure: average memory footprint (Mb, logarithmic scale) against NoC size (axis labels 4x4 to 20x20) for BookSim, McSim-CA and McSim-CA-Opt; annotation: 738 Mb saved.]
12 / 13
Conclusion
From TLM to cycle-accurate
◮ Costs memory in addition to CPU
Cycle-accurate concerns
◮ Programming abstraction costs memory in addition to CPU
◮ SystemC object names can consume a lot of memory
Perspectives
◮ Evaluate memory footprint of other simulators
◮ Perform lazy allocation in SystemC?
13 / 13
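The lazy allocation perspective can be illustrated with a minimal, generic C++ sketch. It is not SystemC or Accellera code: `Lazy`, `PortBuffers`, `Router` and `count_allocated` are hypothetical names introduced only for this illustration, and the 1024-entry buffer is an arbitrary assumption. The idea is that a component's storage is created on first use, so routers that never see traffic never pay for their buffers.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for a memory-hungry part of a simulated router.
struct PortBuffers {
    std::vector<int> flits;
    PortBuffers() : flits(1024, 0) {}  // arbitrary size, for illustration only
};

// Generic lazy holder: the wrapped object is allocated on first access.
template <typename T>
class Lazy {
public:
    T& get() {
        if (!ptr_) ptr_.reset(new T());  // allocate only when first needed
        return *ptr_;
    }
    bool allocated() const { return ptr_ != nullptr; }

private:
    std::unique_ptr<T> ptr_;
};

// Hypothetical router model whose buffers are created lazily.
struct Router {
    Lazy<PortBuffers> buffers;
    void receive(int flit) { buffers.get().flits[0] = flit; }
};

// Counts how many routers have actually allocated their buffers.
std::size_t count_allocated(const std::vector<Router>& mesh) {
    std::size_t n = 0;
    for (const Router& r : mesh) {
        if (r.buffers.allocated()) ++n;
    }
    return n;
}
```

Building a 20x20 mesh as `std::vector<Router> mesh(400)` allocates no buffers until `receive` is called on a given router. Whether this pattern can be applied inside a SystemC kernel, where objects are registered during elaboration, is exactly the open question raised on the slide.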
References
◮ N. Agarwal, T. Krishna, L. S. Peh, and N. K. Jha. Garnet: A detailed on-chip network model inside a full-system simulator. In Performance Analysis of Systems and Software, 2009. ISPASS 2009. IEEE International Symposium on, pages 33–42, April 2009.
◮ V. Catania, A. Mineo, S. Monteleone, M. Palesi, and D. Patti. Noxim: An open, extensible and cycle-accurate network on chip simulator. In Application-specific Systems, Architectures and Processors (ASAP), 2015 IEEE 26th International Conference on, pages 162–163, July 2015.
◮ L. S. Indrusiak and O. M. dos Santos. Fast and accurate transaction-level model of a wormhole network-on-chip with priority preemptive virtual channel arbitration. In Design, Automation & Test in Europe Conference & Exhibition (DATE), 2011, pages 1–6, March 2011.
◮ Leandro Soares Indrusiak, James Harbin, and Osmar Marchi Dos Santos. Fast simulation of networks-on-chip with priority-preemptive arbitration. ACM Trans. Des. Autom. Electron. Syst., 20(4):56:1–56:22, September 2015.
◮ Nan Jiang, D. U. Becker, G. Michelogiannakis, J. Balfour, B. Towles, D. E. Shaw, J. Kim, and W. J. Dally. A detailed and flexible cycle-accurate network-on-chip simulator. In Performance Analysis of Systems and Software (ISPASS), 2013 IEEE International Symposium on, pages 86–96, April 2013.
◮ Khalid Latif, Manuel Selva, Charles Effiong, Roman Ursu, Abdoulaye Gamatié, Gilles Sassatelli, Leonardo Zordan, Luciano Ost, Piotr Dziurzanski, and Leandro Soares Indrusiak. Design space exploration for complex automotive applications: An engine control system case study. In Proceedings of the 2016 Workshop on Rapid Simulation and Performance Evaluation: Methods and Tools, RAPIDO ’16, pages 2:1–2:7, New York, NY, USA, 2016. ACM.
◮ L. Lehtonen, E. Salminen, and T. D. Hämäläinen. Analysis of modeling styles on network-on-chip simulation. In NORCHIP, 2010, pages 1–4, Nov 2010.
◮ Gunar Schirner and Rainer Dömer. Quantitative analysis of the speed/accuracy trade-off in transaction level modeling. ACM Trans. Embed. Comput. Syst., 8(1):4:1–4:29, January 2009.
◮ Anh T. Tran and Bevan Baas. NoCTweak: A highly parameterizable simulator for early exploration of performance and energy of networks on-chip. Technical Report ECE-VCL-2012-2, VLSI Computation Lab, ECE Department, University of California, Davis, 2012.
C++ String Implementation
◮ g++ 5.2.1
◮ for a 2-character string:
  ◮ stack space = 32 bytes, heap space = 0 bytes, capacity = 15
◮ for a 16-character string:
  ◮ stack space = 32 bytes, heap space = 17 bytes, capacity = 16
◮ 15-character stack buffer to avoid dynamic memory allocation
16 / 13
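The figures above can be checked on a given toolchain with a small measurement sketch. It assumes nothing about SystemC: `CountingAllocator`, `CountedString`, `StringCost` and `measure` are hypothetical helpers written for this illustration. The allocator records every byte the string requests, so the in-object size and the heap cost can be read separately. Results depend on the standard library implementation, so the numbers on the slide should only be expected with a matching g++ and libstdc++ setup.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>

// Hypothetical helper: total bytes requested through CountingAllocator.
static std::size_t g_heap_bytes = 0;

// Minimal allocator that forwards to std::allocator and counts the bytes.
template <typename T>
struct CountingAllocator {
    using value_type = T;
    CountingAllocator() = default;
    template <typename U>
    CountingAllocator(const CountingAllocator<U>&) {}
    T* allocate(std::size_t n) {
        g_heap_bytes += n * sizeof(T);
        return std::allocator<T>().allocate(n);
    }
    void deallocate(T* p, std::size_t n) {
        std::allocator<T>().deallocate(p, n);
    }
};
template <typename T, typename U>
bool operator==(const CountingAllocator<T>&, const CountingAllocator<U>&) {
    return true;
}
template <typename T, typename U>
bool operator!=(const CountingAllocator<T>&, const CountingAllocator<U>&) {
    return false;
}

using CountedString =
    std::basic_string<char, std::char_traits<char>, CountingAllocator<char>>;

struct StringCost {
    std::size_t object_bytes;  // size of the string object itself
    std::size_t heap_bytes;    // bytes requested from the allocator
    std::size_t capacity;      // characters storable without reallocation
};

// Builds a string of the given length and reports where its bytes live.
StringCost measure(std::size_t length) {
    g_heap_bytes = 0;
    CountedString s(length, 'x');
    assert(s.size() == length);
    return StringCost{sizeof(s), g_heap_bytes, s.capacity()};
}
```

Calling `measure(2)` and `measure(16)` and printing the three fields reproduces the experiment on the slide; a short name that fits the internal buffer should report zero heap bytes on implementations that use a small string optimization.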