Case for Transforming Parallel Run-times Into Operating System Kernels

  1. Case for Transforming Parallel Run-times Into Operating System Kernels. Paper Reading Group. Paper by Kyle Hale and Peter Dinda. Presented by Maksym Planeta, 18.02.2016

  2. Table of Contents: Introduction, Evaluation, Development effort, Conclusion

  3. Table of Contents: Introduction, Evaluation, Development effort, Conclusion

  4. What is this project? 1. Northwestern University, Sandia Labs, Oak Ridge 2. Part of the Hobbes Project 3. The same group also develops Palacios

  8. Why is it interesting for us? ◮ Proposes a microkernel ◮ Uses hyperthreads in HPC context ◮ Targets Xeon Phi ◮ It cites the L4 paper: [40] J. Liedtke. On micro-kernel construction. In Proceedings of the 15th ACM Symposium on Operating Systems Principles (SOSP 1995), pages 237–250, Dec. 1995.

  9. Idea 1. The HPC app runs in user mode 2. Full hardware access is available only in kernel mode 3. When an HPC program runs in kernel mode: 3.1 All hardware features are directly available

  10. A typical dialog with the kernel

  11. Are the provided kernel abstractions the right ones? Not always. Runtime (user mode): "I’d like to pin memory to a specific PFN range, please." General OS (kernel mode): "NO!"

  12. Are the provided kernel abstractions the right ones? Not always. Runtime (user mode): "I’d like to never be interrupted, please." General OS (kernel mode): "NOPE"

  13. Restricted access to hardware. Runtime (user mode): "I’d like to set up some custom page mappings, please." General OS (kernel mode): "Uh, no"

  14. Restricted access to hardware. Runtime (user mode): "I’d like to interrupt another processor, please." General OS (kernel mode): "HA!"

  16. Motivation 1. The HPC app runs in user mode 2. Full hardware access is available only in kernel mode 3. When an HPC program runs in kernel mode: 3.1 All hardware features are directly available 3.2 The kernel does not restrict the program with ill-fitting abstractions. For example, in user mode the run-time might need subset barriers and be forced to build them out of mutexes. 3.3 The kernel does not waste resources on features the application doesn’t need. For example, in user mode the run-time might not require coherence but get it anyway.
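
To make point 3.2 concrete, here is a minimal sketch of what "building a subset barrier out of mutexes" can look like in a user-mode run-time. It is not code from the paper or from Nautilus; the type `subset_barrier` and both functions are hypothetical names, and the only real API used is POSIX threads (a mutex plus a condition variable).

```c
#include <pthread.h>

/* Hypothetical subset barrier: only `parties` threads take part,
 * not every thread in the process. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  all_arrived;
    unsigned parties;     /* number of threads in the subset */
    unsigned waiting;     /* threads that have arrived in this round */
    unsigned generation;  /* incremented each time the barrier opens */
} subset_barrier;

/* Returns 0 on success, -1 on invalid arguments or init failure. */
int subset_barrier_init(subset_barrier *b, unsigned parties)
{
    if (parties == 0)
        return -1;
    b->parties = parties;
    b->waiting = 0;
    b->generation = 0;
    if (pthread_mutex_init(&b->lock, NULL) != 0)
        return -1;
    if (pthread_cond_init(&b->all_arrived, NULL) != 0) {
        pthread_mutex_destroy(&b->lock);
        return -1;
    }
    return 0;
}

/* Blocks until `parties` threads have called it.
 * Returns 1 in the last thread to arrive, 0 in all others. */
int subset_barrier_wait(subset_barrier *b)
{
    pthread_mutex_lock(&b->lock);
    unsigned my_generation = b->generation;

    if (++b->waiting == b->parties) {
        /* Last arrival: reset for the next round and release everyone. */
        b->waiting = 0;
        b->generation++;
        pthread_cond_broadcast(&b->all_arrived);
        pthread_mutex_unlock(&b->lock);
        return 1;
    }

    /* The generation check guards against spurious wake-ups. */
    while (my_generation == b->generation)
        pthread_cond_wait(&b->all_arrived, &b->lock);

    pthread_mutex_unlock(&b->lock);
    return 0;
}
```

Every arrival takes the mutex and most arrivals also sleep on the condition variable, so each barrier crossing goes through general-purpose kernel-backed primitives. That indirection is the kind of cost the slide is pointing at.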

  17. Contributions 1. Criticize the traditional architecture 2. Propose a new OS structure 3. Port some existing run-times

  18. Hybrid Run-Time (HRT). Figure: (a) Current Model: the parallel app and the parallel run-time run in user mode on top of a general kernel in kernel mode, above the node HW. (b) Hybrid Run-time Model: the parallel app and the hybrid run-time (HRT) both run in kernel mode directly on the node HW (the performance path).

  22. What is HRT? ◮ The runtime is the kernel, built within a kernel framework ◮ Everything is kernel space ◮ HRT has full access to the hardware ◮ HRT can pick its own abstractions

  23. Aerokernel. Figure 2: Structure of Nautilus. Diagram labels: user mode; kernel mode, containing the parallel application and run-time (together the HRT) on top of Nautilus (threads, sync., paging, events, topology, bootstrap, timers, IRQs, console); hardware.

  24. Benefits ◮ Better abstractions ◮ Noiseless ◮ Lightweight

  25. Legacy support. Figure: (c) Hybrid Run-time Model Within a Hybrid Virtual Machine (HVM). Legacy path: parallel app and parallel run-time in user mode over a general kernel in kernel mode, on a general virtualization model. Performance path: parallel app and hybrid run-time (HRT) in kernel mode, on a specialized virtualization model. Both sit inside the HVM on the node HW.

  26. Table of Contents: Introduction, Evaluation, Development effort, Conclusion

  27. Thread creation. Figure: Average, minimum, and maximum time to create a number of threads in sequence. Panels (a), (b), (c); x-axis: threads (2 to 64); y-axis: cycles (0 to 7x10^6); series: Nautilus, Linux.

  29. Thread creation. Figure: Ratio of Linux to Nautilus times from the previous figure (speedup). Panels (d), (e), (f); x-axis: threads (2 to 64); y-axis: speedup. Why the bends, in (d) at 8 threads, in (e) at 32, and in (f) at 8? Bugs?

  30. Thread creation (summary). Figure 3: Time to create a single thread measured in cycles.
      OS        Avg    Min    Max
      Nautilus  16795  2907   44264
      Linux     38456  34447  238866
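
As a quick worked check of the summary table, the sketch below divides the Linux cycle counts by the Nautilus ones. The function names are made up for illustration; the six numbers are the ones from Figure 3 above.

```c
#include <stdio.h>

/* Ratio of a baseline cycle count to a candidate cycle count.
 * Returns 0.0 if the candidate count is zero (no meaningful ratio). */
double cycle_ratio(unsigned long baseline_cycles, unsigned long candidate_cycles)
{
    if (candidate_cycles == 0)
        return 0.0;
    return (double)baseline_cycles / (double)candidate_cycles;
}

/* Prints the Linux/Nautilus ratio for the avg, min, and max columns
 * of the single-thread creation table. */
void print_thread_creation_ratios(void)
{
    printf("avg: %.2fx\n", cycle_ratio(38456UL, 16795UL));
    printf("min: %.2fx\n", cycle_ratio(34447UL, 2907UL));
    printf("max: %.2fx\n", cycle_ratio(238866UL, 44264UL));
}
```

On these numbers the average case is roughly 2.3 times cheaper on Nautilus, while the best case differs by more than a factor of ten.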

  31. Spinlock microbenchmark. Figure 5: Total time to acquire and release a spinlock 500 million times on Nautilus and Linux, and average time in cycles for an acquire/release pair.
      OS        Execution time (s)  Avg. acquire/release time (cycles)
      Nautilus  13.72               59
      Linux     12.53               36
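
For readers who want to see the shape of such a microbenchmark, here is a minimal single-threaded sketch using a C11 `atomic_flag` spinlock. It is an assumption-laden stand-in, not the benchmark from the paper: the `spinlock` type and function names are hypothetical, and it reports CPU time per acquire/release pair via the standard `clock()` function rather than cycle counts.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical test-and-set spinlock built on C11 atomic_flag. */
typedef struct { atomic_flag held; } spinlock;
#define SPINLOCK_INIT { ATOMIC_FLAG_INIT }

static inline void spin_lock(spinlock *l)
{
    /* Spin until the previous value of the flag was "clear". */
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
        ; /* busy-wait */
}

static inline void spin_unlock(spinlock *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/* Acquires and releases an uncontended lock `iterations` times and
 * returns the average CPU time per pair in nanoseconds (0.0 if
 * iterations is zero). */
double bench_spinlock(uint64_t iterations)
{
    if (iterations == 0)
        return 0.0;

    spinlock l = SPINLOCK_INIT;
    clock_t start = clock();
    for (uint64_t i = 0; i < iterations; i++) {
        spin_lock(&l);
        spin_unlock(&l);
    }
    clock_t end = clock();

    double seconds = (double)(end - start) / CLOCKS_PER_SEC;
    return seconds * 1e9 / (double)iterations;
}
```

Calling `bench_spinlock(500000000)` would mirror the iteration count on the slide. Converting the result to cycles needs the clock frequency of the machine, which this sketch does not try to detect.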

  32. Wake-up microbenchmark. Figure 6: Average event wakeup latency. y-axis: cycles (0 to 30000); bars: Linux, Nautilus MWAIT, Nautilus condvar, Nautilus with kick. Annotations on the plot: "not available in userspace" and "overhead too high in userspace".

  33. Circuit simulator benchmark. Figure 11: Run time of Legion circuit simulator versus core count. The baseline Nautilus version has higher performance at 62 cores than the Linux version. x-axis: Legion processors (threads), 2 to 62; y-axis: runtime (s); series: Nautilus, Linux.

  34. Circuit simulator benchmark. Figure 12: Speedup of Legion (normalized to 2 Legion processors) circuit simulator running on Linux and Nautilus as a function of Legion processor (thread) count. x-axis: Legion processors (threads), 2 to 62; y-axis: speedup; series: Nautilus, Linux.

  35. Circuit simulator benchmark. Figure 13: Speedup of Legion circuit simulator comparing the baseline Nautilus version and a Nautilus version that executes Legion tasks with interrupts off. x-axis: 2 to 62; y-axis: speedup (0.5 % to 5 %).

  36. Table of Contents: Introduction, Evaluation, Development effort, Conclusion

  37. Kernel development. The process of building Nautilus as a minimal kernel layer with support for a complex, modern, many-core x86 machine took six person-months of effort on the part of seasoned OS/VMM kernel developers. Figure 9: Source lines of code for the Nautilus kernel.
      Language      SLOC
      C             22697
      C++           133
      x86 Assembly  428
      Scripting     706

  38. Run-time support. Porting Legion: ◮ 43000 SLOC in C++ ◮ Most of the work went into understanding Legion ◮ Some code added to Nautilus ◮ Four person-months to port. Also ported: NESL and NDPC (related to each other). Figure 10: Lines of code added to Nautilus to support Legion, NDPC, and NESL.
      Language  SLOC
      C++       133
      C         636

  39. Table of Contents: Introduction, Evaluation, Development effort, Conclusion

  40. Conclusion ◮ A microkernel ◮ And a lightweight kernel ◮ Requires effort for porting ◮ Early stage of development
