
ROBUST SOFTWARE DEVELOPMENT: BUG PREVENTION AND ISOLATION, Erika Dignam and Ross Cunniff

April 4-7, 2016 | Silicon Valley. ROBUST SOFTWARE DEVELOPMENT: BUG PREVENTION AND ISOLATION. Erika Dignam and Ross Cunniff, 04 April 2016.


  1. April 4-7, 2016 | Silicon Valley
     ROBUST SOFTWARE DEVELOPMENT: BUG PREVENTION AND ISOLATION
     Erika Dignam and Ross Cunniff, 04 April 2016

  2. ABOUT US
     Ross Cunniff: Senior Software Engineer and NVIDIA SPEC representative. 15-year NVIDIA employee. Over 30 years of computer engineering experience.
     Erika Dignam: Technical Program Manager and Bug Triager. Studied computer arts. At NVIDIA for 9 years.
     4/25/2016

  3. STRUCTURE
     Bug types | Triage and Tools | Recap
     Process Details | Bookkeeping
     Prevention and Benchmarking

  4. BUG TYPES
     • Crash or TDR
     • Corruption
     • Performance
     • SLI Scaling

  5. TOOLS AND TRIAGE: Traces (all bug types)
     What is a trace? It intercepts calls between the application and the driver and records them to a file.
     [Diagram labels: APP, apitrace, NV Driver, file.trace]
     • Apitrace (DX and OpenGL) - http://apitrace.github.io/
     • Pass along the .trace file: replay, performance info, and a dump of the API stream
     • Simple to use: copy <API>.dll to the executable location
     • Caveats: long reproductions mean large files | Tracing tools don’t always capture | Some apps are not tracing friendly out of the box
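
The record, replay, and dump steps on the slide above can be sketched with the apitrace command-line tool. This is a minimal sketch, assuming the `apitrace trace`, `apitrace replay`, and `apitrace dump` subcommands are available on the PATH; the helper name `apitrace_commands` and the application name `myapp.exe` are hypothetical. The code only builds the command lines and does not run them.

```python
import shlex


def apitrace_commands(trace_file, app_argv=None):
    """Build apitrace command lines for a record / replay / dump round-trip.

    Nothing is executed here; the returned lists can be passed to
    subprocess.run() on a machine where apitrace is installed.
    """
    commands = {}
    if app_argv:
        # Record: run the application under apitrace and write trace_file.
        commands["record"] = ["apitrace", "trace", "--output", trace_file, *app_argv]
    # Replay the recorded API stream against the installed driver.
    commands["replay"] = ["apitrace", "replay", trace_file]
    # Dump the API stream as text for inspection.
    commands["dump"] = ["apitrace", "dump", trace_file]
    return commands


if __name__ == "__main__":
    # "myapp.exe" is a placeholder application name.
    for name, argv in apitrace_commands("file.trace", ["myapp.exe"]).items():
        print(f"{name}: {shlex.join(argv)}")
```

Only the resulting .trace file needs to be attached to a bug; the person triaging can replay it without the original application.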

  6. TOOLS AND TRIAGE: Traces - More tracing tools
     • GLIntercept (OpenGL) - https://github.com/dtrebilco/glintercept
     • Useful for error states and other tracing; a little older than apitrace
     • Copy opengl32.dll and gliConfig.ini to the executable folder location
     • Swapping in the DebugContext.ini config file can give very helpful information, for example on issues with SLI scaling
     • EXAMPLE: OpenGL: Performance(Medium) 131234: SLI performance warning: SLI AFR copy and synchronization for texture mipmaps (42)
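
When a log contains many debug-context messages like the example above, it can help to split each line into fields and filter for performance warnings. This is a minimal sketch: the line layout (`source: category(severity) id: text`) is an assumption inferred from the single example on the slide, and the names `DebugMessage` and `parse_debug_line` are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class DebugMessage(NamedTuple):
    source: str       # e.g. "OpenGL"
    category: str     # e.g. "Performance"
    severity: str     # e.g. "Medium"
    message_id: int   # numeric ID reported by the driver
    text: str         # human-readable message


# Assumed layout: "<source>: <category>(<severity>) <id>: <text>"
_LINE = re.compile(
    r"^(?P<source>\w+):\s*(?P<category>\w+)\((?P<severity>\w+)\)\s+"
    r"(?P<id>\d+):\s*(?P<text>.*)$"
)


def parse_debug_line(line: str) -> Optional[DebugMessage]:
    """Return the parsed fields, or None if the line does not match the layout."""
    match = _LINE.match(line.strip())
    if match is None:
        return None
    return DebugMessage(
        match["source"],
        match["category"],
        match["severity"],
        int(match["id"]),
        match["text"],
    )


if __name__ == "__main__":
    sample = (
        "OpenGL: Performance(Medium) 131234: SLI performance warning: "
        "SLI AFR copy and synchronization for texture mipmaps (42)"
    )
    print(parse_debug_line(sample))
```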

  7. TOOLS AND TRIAGE: Crashes/TDR
     Dump files
     • Mini dump: always helpful. Right-click the process in Task Manager or Process Explorer and select “Dump to File”
     • Full dump: better, but larger
     • https://msdn.microsoft.com/en-us/library/windows/desktop/bb787181(v=vs.85).aspx
     TDR (Timeout Detection and Recovery)
     • Increase the TDR delay. What are the results then?
     • https://msdn.microsoft.com/en-us/library/windows/hardware/ff569918(v=vs.85).aspx
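
The "increase the TDR delay" experiment can be prepared as a registry file. This is a minimal sketch, assuming the `TdrDelay` DWORD value (a timeout in seconds, default 2) under the `GraphicsDrivers` key described on the MSDN page linked above; the helper name `tdr_delay_reg_file` is hypothetical. The code only generates the .reg text. Importing it needs administrator rights, and a reboot is typically required before the new value takes effect.

```python
# Registry key that holds the TDR settings (per the linked MSDN page).
TDR_KEY = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers"


def tdr_delay_reg_file(seconds: int) -> str:
    """Build .reg file text that sets TdrDelay (a DWORD, in seconds)."""
    if not 1 <= seconds <= 0xFFFFFFFF:
        raise ValueError("TdrDelay must be at least 1 and fit in a 32-bit DWORD")
    return (
        "Windows Registry Editor Version 5.00\n"
        "\n"
        f"[{TDR_KEY}]\n"
        f'"TdrDelay"=dword:{seconds:08x}\n'  # .reg DWORDs are 8 hex digits
    )


if __name__ == "__main__":
    # Example: raise the timeout to 10 seconds for a triage run.
    print(tdr_delay_reg_file(10), end="")
```

If the TDR disappears with a longer delay, the workload is probably just slow; if it still occurs, a hang is more likely. Restore the original value after testing.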

  8. TOOLS AND TRIAGE: CPU Profilers - Performance
     • Intel VTune: in-depth perf analysis, finer-tuned control, filters noise | Needs a license, not free
       https://software.intel.com/en-us/intel-vtune-amplifier-xe
     • AMD CodeAnalyst: simple, free, runs on both Intel and AMD CPUs | Less robust than VTune, no longer supported
       http://developer.amd.com/tools-and-sdks/archive/amd-codeanalyst-performance-analyzer/
     App bound? Driver bound? GPU bound? Performance paths taken

  9. TOOLS AND TRIAGE: Performance/Resources
     • Process Explorer: free quick-overview tool. Check loaded .dlls; see load on resources, memory leaks, and whether the app is GPU or CPU bound
       https://technet.microsoft.com/en-us/sysinternals/processexplorer.aspx
     • GPUView: free Windows tool included with the Windows Performance Toolkit (WPT)
       https://graphics.stanford.edu/~mdfisher/GPUView.html
       https://developer.nvidia.com/content/are-you-running-out-video-memory-detecting-video-memory-overcommitment-using-gpuview

  10. PROCESS EXPLORER

  11. TOOLS AND TRIAGE: Tools
      • gDEBugger - http://www.gremedy.com/
      • Free OpenGL debugging tool
      • Useful for data gathering, good for tracking state changes, lets you look at the stream dynamically
      • EXAMPLE: polygon count information from models. A performance bug was root-caused to one mode of the model sending significantly more polygons into the OpenGL pipeline.

  12. NVIDIA TOOLS AND LOGS
      NVIDIA OpenGL Driver Error codes
      External Swak (Swak = Swiss Army Knife)
      • NVIDIA tool used to capture detailed system information
      • Only available under NDA, on the partners site
      WSAppNotifier.exe: Profiles
      • For application profile problems; tells you which profiles are running/applied
      • You may have to launch the app twice
      • NDA only, on the partners site

  13. WSAPPNOTIFIER.EXE

  14. TRIAGE/DEBUGGING: Profiles - Things to Try
      Changing Global Profiles
      • Workstation App - Dynamic Streaming | Turns off some optimized driver paths
      • 3D App - Game Development | Simulates a GeForce
      • SLI Aware Application | SLI performance testing
      Threaded optimization = OFF | In Profile settings
      Notebooks
      • Try setting the NVIDIA GPU as the default | In profiles or SBIOS if available

  15. RECAP: What tools for what bugs
      • Crash or TDR: TDR Delay RegKeys | Collect dump files | Trace | GPUView
      • Corruption: Trace | Changing profiles
      • Performance: Changing profiles | apitrace | VTune/CodeAnalyst
      • SLI Scaling: Debug Context from GLIntercept

  16. TRIAGE/DEBUGGING: Vulkan
      https://www.khronos.org/vulkan/
      • New API that puts the application developer in control; the app developer manages GPU memory and resources
      • Built-in Validation Layer: API violations
      • SDK - https://vulkan.lunarg.com/signin | Need account
      • Demos - https://github.com/SaschaWillems/Vulkan | https://github.com/McNopper/Vulkan
      • RenderDoc | Graphics Debugger - https://github.com/baldurk/renderdoc

  17. TRIAGE/DEBUGGING: Vulkan
      Vulkan Talks
      • S6818 - Vulkan and NVIDIA: The Essentials
      • S6138 - GPU Driven Rendering in Vulkan and OpenGL
      • S6133 - VKCPP: A C++ Layer on Top of Vulkan
      • Three Hangouts, Monday and Tuesday afternoons
      Resources
      • https://github.com/KhronosGroup/Khronosdotorg/blob/master/api/vulkan/resources.md

  18. BUG PROCESS: Normal External Bug Flow
      External Bug -> QA -> Triage -> Engineering
      Accounts to file bugs
      • partners.nvidia.com - Needs NDA
      • developer.nvidia.com/join
      • Access to early release drivers and NVIDIA tools, report bugs!

  19. BUG PROCESS: Overview
      NVBUGS
      • Start by filing as a software issue
      • Important to have basic reproduction steps
      • OS, driver, card, application and version if applicable, system information, frequency
      • Severity and impact for you
      • Type - Performance, Crash, Corruption, TDR
      • Regression information is very helpful if it can be provided

  20. 20

  21. TOOLS AND TRIAGE: Overview
      Simple app/license
      • A trace would be great; no license/app/model needed
      • Avoids delays; very useful when a third party has a repro others can’t get
      • If not possible, then models/scenes/app/license/demo will be needed - Time sink
      What to attach to bugs
      • Logs, traces, performance snapshots, dump files, videos, event logs
      • System information via externSwak (NVTOOL)

  22. WHAT HAPPENS TO YOUR BUG: Fixes -> Driver | Branches
      ODE = Optimized Driver for Enterprise
      • Long-lived branch
      • Multiple releases or dot versions per branch
      • For production use and certification
      QNF = Quadro New Feature
      • Short-lived branch
      • One release per branch
      • Release driver for testing new features and fixes
      WHQL = Windows Hardware Quality Labs (tested and signed)

  23. PREVENTION: What NVIDIA does
      ATP and QA
      • We have QA teams with application experts around the world testing applications, GPUs, OSs, and drivers
      • ATP is our automated test harness for further testing to cover more configurations
      DVS
      • Driver Validation System. Automated and run with every single code change. 10 million images/tests per day
      German Test Lab and Global Test Lab
      • 24/7 automated testing of professional applications and features

  24. PREVENTION: Best Process
      We want benchmarks and test suites!
      • Early detection of bugs and issues
      • Early detection of performance regressions
      • Get involved in industry-standard benchmarks, for example SPEC
      Over to Ross to discuss Performance Benchmark creation!

  25. PERFORMANCE BENCHMARKING
      A key to high-quality user experience

  26. “WHEN YOU CANNOT MEASURE IT… …your knowledge is of a meagre and unsatisfactory kind” - Lord Kelvin
      • Anything a computer can do, a human can do. Given enough time …
      • Computers are accelerators.
      • Without good performance, user experience is bad.
      • Benchmarking is the technique to ensure repeatable performance

  27. WHAT MAKES A BENCHMARK?
      Originally a surveying mark which provided a repeatable reference for placing a leveling rod.
      Key attributes:
      #1: repeatable
      #2: accurate
      #3: reportable

  28. UNITS ARE NOT BENCHMARKS
      • Many common units exist: MIPS, FLOPS, FPS, LPM, …
      • Just because you can run a test and get units out does not make your test a benchmark
      • Quiz: if your test returns a result of 60 FPS, what might you be measuring? What about 30, 20, 15, … FPS?
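
The slide does not give the quiz answer. One likely reading is that the test is measuring the display refresh rate rather than the workload: with vertical sync on a 60 Hz display, every frame waits for the next refresh, so reported rates snap to 60, 30, 20, 15, and so on. This is a minimal sketch of that effect, assuming simple double-buffered vsync and a 60 Hz display; the function name `vsynced_fps` is hypothetical.

```python
import math


def vsynced_fps(render_ms: float, refresh_hz: float = 60.0) -> float:
    """Frame rate observed when each frame must wait for the next refresh.

    Assumes simple double-buffered vsync: a frame that takes longer than
    one refresh interval is shown at the following refresh instead.
    """
    refresh_ms = 1000.0 / refresh_hz
    # Whole refresh intervals consumed per frame (at least one).
    intervals = max(1, math.ceil(render_ms / refresh_ms))
    return refresh_hz / intervals


if __name__ == "__main__":
    for render_ms in (5.0, 17.0, 34.0, 51.0):
        print(f"{render_ms:5.1f} ms per frame -> {vsynced_fps(render_ms):.0f} FPS")
```

Under this model a frame that renders in 5 ms and one that renders in 16 ms both report 60 FPS, which is why a raw FPS number alone says little about the work being measured.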

  29. REPEATABILITY
      First principle: make sure the same operations are benchmarked on all configs
      Most benchmarks exhibit some randomness in performance. The causes are many; some examples:
      • Non-deterministic operating system process / thread scheduler
      • Disk I/O: variable times to reach a sector with rotational media; variable wear leveling for solid state media
      • Build-to-build variation due to cache layout changes
      • Virus scan cycles
      Rule of thumb: a variation of up to 5% is generally acceptable (if higher, use multiple runs and rely on regression toward the mean)
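
The 5% rule of thumb can be checked automatically across repeated runs. This is a minimal sketch: the slide does not define "variation", so the code assumes it means the spread between the best and worst run relative to the mean. The function name `summarize_runs` and the sample scores are hypothetical.

```python
import statistics


def summarize_runs(scores, max_spread=0.05):
    """Summarize repeated benchmark runs and flag excessive variation.

    "Variation" is taken here to be (max - min) / mean, which is an
    assumption; other definitions (e.g. standard deviation) also work.
    """
    if len(scores) < 2:
        raise ValueError("need at least two runs to judge repeatability")
    mean = statistics.mean(scores)
    spread = (max(scores) - min(scores)) / mean
    return {
        "mean": mean,                          # value to report
        "spread": spread,                      # relative run-to-run variation
        "repeatable": spread <= max_spread,    # within the rule of thumb?
    }


if __name__ == "__main__":
    fps_runs = [58.9, 60.2, 59.5, 61.0, 59.8]  # made-up sample scores
    summary = summarize_runs(fps_runs)
    print(f"mean={summary['mean']:.2f} spread={summary['spread']:.1%} "
          f"repeatable={summary['repeatable']}")
```

When `repeatable` is false, adding more runs and reporting the mean follows the slide's advice to rely on regression toward the mean.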
