
OPTIMIZING BUILDS ON WINDOWS: SOME PRACTICAL CONSIDERATIONS

Alexandre Ganea, Ubisoft (alexandre.ganea@ubisoft.com)
2019 Bay Area LLVM Developers' Meeting, Oct. 22-23


  1. OPTIMIZING BUILDS ON WINDOWS: SOME PRACTICAL CONSIDERATIONS
      Alexandre Ganea, Ubisoft, alexandre.ganea@ubisoft.com
      2019 Bay Area LLVM Developers' Meeting, Oct. 22-23

  2. SUMMARY
      PART 1: PREAMBLE
      PART 2: EXPERIMENTS
      PART 3: PROPOSAL
      PART 4: NEXT STEPS

  3. PART 1 PREAMBLE: CHALLENGES

  4. Lines of Code (Assassin's Creed, Far Cry)

  5. Game production constraints @ Ubisoft
      Concurrent AAA games: 20 - 25
      LoC/game: 30 - 50 M
      Programmers/title: 100 - 250
      Code changes/day: 100 - 150 (peak: 400)
      Build targets/platform: 5 - 6
      Platforms/game: 4+
      Code workspace: 70 - 100 GB
      Data workspace: 100 - 200 GB
      Game builds/day: 100 - 150
      Stripped build: 1 - 6 GB
      Final build: 50 - 90 GB
     Editor build
      20,000 .CPP
      25,000 .H
      23 GB .OBJ
      9 GB .DEBUG$T
      10 M type records
      42 M symbols
      300 M .EXE
      2 GB .PDB
      Windows 10
      Fastbuild, distributed
      Always Unity builds

  6. AAA GAME, CLEAN REBUILD, X64 EDITOR RELEASE (FASTBUILD)
      Configuration                      Compiler         Linker
      2017 (MSVC)                        08 min 50 sec    10 min 20 sec
      2018 (MSVC)                        08 min 33 sec    06 min 46 sec
      Fall 2018 (MSVC + LLD)             08 min 50 sec    01 min 18 sec
      2019 (MSVC + LLD)                  08 min 20 sec    43 sec
      2019 (Clang)                       07 min 00 sec    29 sec
      100% cache hit, local SSD          04 min 00 sec    29 sec
      100% cache hit, 1 Gbps network     04 min 15 sec    29 sec

  7. PART 2 EXPERIMENTS

  8. 2.1 Clang-scan-deps & Fastbuild cache

  9. FASTBUILD CACHE READ ALGORITHM
      Preprocessor-based lookup: a.cpp -> clang-cl /E (5-10 sec) -> md5sum -> curl https://store/ -> found, or not found -> clang-cl
      Dependency-based lookup: a.cpp -> clang-scan-deps (0.02 sec) -> deps.txt -> while read x; do md5sum $x; done (0.02 sec) -> deps+MD5.txt
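The dependency-based lookup can be pictured as folding each dependency's path and contents into a single cache key. The following is only a sketch under stated assumptions: FNV-1a stands in for the slide's md5sum so that the example needs no external library, dependencies are passed in memory rather than read from deps.txt, and the names `fnv1a64`, `Dependency` and `cacheKey` are hypothetical (they are not Fastbuild or clang-scan-deps APIs).

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// FNV-1a (64-bit) constants.
constexpr std::uint64_t kFnvOffsetBasis = 14695981039346656037ull;
constexpr std::uint64_t kFnvPrime = 1099511628211ull;

// Hashes Data, continuing from Seed so that calls can be chained.
// Stand-in for md5sum; not the hash used on the slide.
inline std::uint64_t fnv1a64(const std::string &Data,
                             std::uint64_t Seed = kFnvOffsetBasis) {
  std::uint64_t H = Seed;
  for (unsigned char C : Data) {
    H ^= C;
    H *= kFnvPrime;
  }
  return H;
}

// Hypothetical record: one line of deps.txt plus that file's contents.
struct Dependency {
  std::string Path;
  std::string Contents;
};

// Folds every dependency, in listed order, into one key.
inline std::uint64_t cacheKey(const std::vector<Dependency> &Deps) {
  std::uint64_t Key = kFnvOffsetBasis;
  for (const Dependency &D : Deps) {
    Key = fnv1a64(D.Path, Key);
    Key = fnv1a64(D.Contents, Key);
  }
  return Key;
}
```

Two translation units with identical dependency lists and contents map to the same key, and editing any listed header changes it. A production key would also need to cover the compiler version and command line, which this sketch leaves out.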

  10. 100% NETWORK CACHE HITS: AAA GAME, X64 EDITOR RELEASE (FASTBUILD)
      Configuration                        Compiler/Cache   Linker
      VS2017 15.9.16                       06 min 10 sec    40 sec
      Network cache                        04 min 05 sec    40 sec
      Network cache + clang-scan-deps      35 sec           40 sec

  11. Timeline chart (ms): clang-scan-deps; LLD + network cache (MSVC OBJs + ghash)
      50k files, 7 GB -> 22.6 GB
      Intel Xeon W-2135 @ 3.7 GHz, 128 GB, NVMe SSD, 1 Gbps network

  12. 2.2 StringMap

  13. CLANG-SCAN-DEPS STANDALONE (50K FILES)
      avg ~90% cpu, 11.5% process time
      Intel Xeon W-2135 @ 3.7 GHz (6-core), 128 GB, NVMe SSD

  14. STRINGMAP

  15. STRINGMAP

  16. DOWN THE RABBIT HOLE
      sizeof(std::error_code) -> 16 bytes
      sizeof(llvm::ErrorOr<DirectoryEntry&>) -> 24 bytes
      sizeof(llvm::StringMapEntry<llvm::ErrorOr<DirectoryEntry&>>) -> 32 bytes (+string contents)
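The growth from 16 to 24 bytes can be reproduced without LLVM headers. This is a rough sketch: `ErrorOrRef` is a hypothetical stand-in that only mimics the general shape of an error-or-reference type (an error code or a pointer, plus a flag), not the real llvm::ErrorOr, and the printed sizes depend on the target and standard library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <system_error>

struct DirectoryEntry; // opaque here; only a pointer to it is stored

// Hypothetical stand-in for llvm::ErrorOr<DirectoryEntry &>.
struct ErrorOrRef {
  union {
    std::error_code Error;  // valid when HasError is true
    DirectoryEntry *Value;  // valid when HasError is false
  };
  bool HasError;            // the extra flag forces padding after the union
  ErrorOrRef() : Value(nullptr), HasError(false) {}
};

// Prints the sizes for the current target.
inline void printSizes() {
  std::printf("sizeof(std::error_code) = %zu\n", sizeof(std::error_code));
  std::printf("sizeof(ErrorOrRef)      = %zu\n", sizeof(ErrorOrRef));
}
```

The flag cannot fit inside the union, so the struct is rounded up to the next multiple of its alignment. On common 64-bit targets that would be expected to match the 16 and 24 bytes quoted on the slide, but the sketch does not guarantee it.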

  17. STRINGMAP: MEMORY LAYOUT
      [Diagram: table header with NumBuckets and a uint32_t count; bucket array of StringMapEntry* that is mostly nullptr; each StringMapEntry holds a size_t, the value T and the string]

  18. STRINGMAP (VTUNE)

  19. STRINGMAP STATS
      [Two histograms: "Hash collisions / call" and "Cachelines hit / call"; 187 M samples]

  20. DenseMap<uint64_t,T> + xxHash64() + StringSaver

  21. DenseMap<__int128,T> + XXH128() + StringSaver
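The idea on these two slides, keying the map by a fixed-width hash of the string and keeping the string itself in separate storage, can be sketched with standard containers. The stand-ins are assumptions and should be read as such: `std::unordered_map` replaces llvm::DenseMap, `std::deque<std::string>` replaces llvm::StringSaver, FNV-1a replaces xxHash64(), and `HashedStringMap` is a hypothetical name. The sketch also assumes that 64-bit hash collisions are rare enough to ignore, so two different strings with the same hash would share one slot.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <string>
#include <unordered_map>

// FNV-1a (64-bit); stand-in for xxHash64().
inline std::uint64_t hashKey(const std::string &S) {
  std::uint64_t H = 14695981039346656037ull;
  for (unsigned char C : S) {
    H ^= C;
    H *= 1099511628211ull;
  }
  return H;
}

template <typename T> class HashedStringMap {
  struct Entry {
    const std::string *Key; // points into Strings
    T Value;
  };
  std::unordered_map<std::uint64_t, Entry> Map; // keyed by hash only
  std::deque<std::string> Strings; // push_back keeps element addresses stable

public:
  // Returns the value slot for Key, creating it on first use.
  T &operator[](const std::string &Key) {
    const std::uint64_t H = hashKey(Key);
    auto It = Map.find(H);
    if (It == Map.end()) {
      Strings.push_back(Key);
      It = Map.emplace(H, Entry{&Strings.back(), T()}).first;
    }
    return It->second.Value;
  }

  // Returns nullptr when Key was never inserted.
  const T *lookup(const std::string &Key) const {
    auto It = Map.find(hashKey(Key));
    return It == Map.end() ? nullptr : &It->second.Value;
  }

  std::size_t size() const { return Map.size(); }
};
```

A lookup hashes the string once and then compares integers only, which avoids following a pointer to each candidate entry to compare string contents. Widening the key to 128 bits, as on the second slide, makes an undetected collision less likely at the cost of larger buckets.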

  22. 2.4 Multithreading LLD (COFF driver)

  23. LINK AAA GAME, X64 EDITOR RELEASE (22.8 GB MSVC OBJS)
      VS2019 16.2: 58 sec
      LLD 9.0: 62 sec
      LLD 8 + // GHASH: 49 sec
      Intel Xeon W-2135 @ 3.7 GHz (6-core), 128 GB, NVMe SSD

  24. [Bar chart, 0 to 25 s]
      Clang 9.0, no Ghash: 19.21 s
      Clang 8.0 + // Ghash (12-byte buckets: uint64_t GHash + uint32_t TypeIndex): 7.42 s
      Clang 8.0 + // Ghash (8-byte buckets: one uint64_t holding TypeIndex + GHash): 5.29 s
      Clang 8.0 + // Ghash (8-byte buckets) + 2MB pages: 4.20 s
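The step from 12-byte to 8-byte buckets amounts to packing the type index and part of the hash into one 64-bit word. The slide does not say how the bits are divided, so the 32/32 split below is an assumption, and `Bucket12` and `Bucket8` are hypothetical names rather than LLD types.

```cpp
#include <cassert>
#include <cstdint>

// 12-byte bucket: full 64-bit hash plus a 32-bit type index.
#pragma pack(push, 1)
struct Bucket12 {
  std::uint64_t GHash;
  std::uint32_t TypeIndex;
};
#pragma pack(pop)
static_assert(sizeof(Bucket12) == 12, "packed layout expected");

// 8-byte bucket: type index in the high 32 bits, low 32 hash bits below it.
// The bit split is an assumption made for this sketch.
struct Bucket8 {
  std::uint64_t Bits;

  static Bucket8 make(std::uint64_t GHash, std::uint32_t TypeIndex) {
    return Bucket8{(std::uint64_t(TypeIndex) << 32) | (GHash & 0xffffffffu)};
  }
  std::uint32_t typeIndex() const { return std::uint32_t(Bits >> 32); }
  std::uint32_t hashBits() const { return std::uint32_t(Bits); }
};
static_assert(sizeof(Bucket8) == 8, "one 64-bit word expected");
```

With 8-byte buckets, eight of them fit in a 64-byte cache line and each one can be read or written as a single 64-bit word, which is one plausible reason for the drop from 7.42 s to 5.29 s. The trade-off is that only part of the hash remains available for comparison inside the bucket.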

  25. 2.3 Process Creation

  26. COMPILING WITH CLANG 9.0

  27. CLANG CC1 IN PROCMON: 93 ms

  28. MAKING CC1 REENTRANT

      clang/lib/driver/Job.cpp

        int Command::Execute(ArrayRef<llvm::Optional<StringRef>> Redirects,
                             std::string *ErrMsg, bool *ExecutionFailed) const {
          [...]
          typedef int (*ClangDriverMainFunc)(SmallVectorImpl<const char *> &);
          ClangDriverMainFunc ClangDriverMain = nullptr;
          [...]
          if (ClangDriverMain) {
            [...]
            llvm::CrashRecoveryContext CRC;
            CRC.EnableExceptionHandler = true;
            const void *PrettyState = llvm::SavePrettyStackState();
            int Ret = 0;
            auto ExecuteClangMain = [&]() { Ret = ClangDriverMain(Argv); };
            if (!CRC.RunSafely(ExecuteClangMain)) {
              llvm::RestorePrettyStackState(PrettyState);
              return CRC.RetCode;
            }
            return Ret;
          } else {
            auto Args = llvm::toStringRefArray(Argv.data());
            return llvm::sys::ExecuteAndWait(Executable, Args, Env, Redirects,
                                             /*secondsToWait*/ 0, /*memoryLimit*/ 0,
                                             ErrMsg, ExecutionFailed);
          }
        }

      clang/tools/driver/driver.cpp

        int main(int argc_, const char **argv_) {
          noteBottomOfStack();
          llvm::InitLLVM X(argc_, argv_);
          SmallVector<const char *, 256> argv(argv_, argv_ + argc_);
          if (llvm::sys::Process::FixupStandardFileDescriptors())
            return 1;
          llvm::InitializeAllTargets();
          return ClangDriverMain(argv);
        }

        int ClangDriverMain(SmallVectorImpl<const char *> &argv) {
          static LLVM_THREAD_LOCAL bool EnterPE = true;
          if (EnterPE) {
            llvm::sys::DynamicLibrary::AddSymbol("ClangDriverMain", (void*)( i..
            EnterPE = false;
          } else {
            llvm::cl::ResetAllOptionOccurrences();
          }
          auto TargetAndMode = ToolChain::getTargetAndModeFromProgramName(arg..

  29. CLANG DRIVER & CC1 MERGED

  30. BYPASSING THE CC1 PROCESS: CLEAN REBUILD OF LLVM, CLANG & LLD
      Machine                       VS2019 16.2      Clang 9.0        Clang 9.0 + cc1 bypass
      6-core, W10 build 1803        34 min 00 sec    32 min 30 sec    22 min 46 sec
      6-core, W10 build 1903        28 min 00 sec    30 min 16 sec    19 min 54 sec
      36-core, W10 build 1709       12 min 00 sec    13 min 10 sec    07 min 10 sec

  31. 2.5 CRT Allocator

  32. LINKING RAINBOW6: SIEGE WITH THINLTO :-( 4%, 96% idle

  33. THINLTO: ALLOCATOR CONTENTION

  34. REPLACING THE CRT ALLOCATOR

      llvm/lib/Support/Windows/Memory.inc

        #include "rpmalloc/rpmalloc.c"

        extern "C" {
        _ACRTIMP _CRTRESTRICT void *malloc(size_t size) { return rpmalloc(size); }
        _ACRTIMP void free(void *p) { rpfree(p); }
        _ACRTIMP _CRTRESTRICT void *calloc(size_t n, size_t elem_size) {
          return rpcalloc(n, elem_size);
        }
        _ACRTIMP _CRTRESTRICT void *realloc(void *ptr, size_t size) {
          return rprealloc(ptr, size);
        }
        }

        // Bypass CRT debug allocator
        #ifdef _DEBUG
        void *operator new(decltype(sizeof(0)) n) noexcept(false) { return malloc(n); }
        void __CRTDECL operator delete(void *const block) noexcept { free(block); }
        void *operator new[](std::size_t s) throw(std::bad_alloc) { return malloc(s); }
        void operator delete[](void *p) throw() { free(p); }
        #endif

      Callout on the slide: $ LD_PRELOAD=/path/to/my/malloc.so /bin/ls

      https://github.com/mjansson/rpmalloc
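The slide swaps the allocator by redefining the Windows CRT's malloc family. A portable way to experiment with the same idea is to replace the global `operator new` and `operator delete`, which standard C++ allows. The sketch below does that instead of the slide's technique; `myAlloc` and `myFree` are hypothetical placeholders for rpmalloc() and rpfree(), and the counter exists only to show that allocations are being routed through the replacement.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>

// Counts calls that reach the replacement allocator.
static std::atomic<std::size_t> NumAllocations{0};

// Hypothetical placeholders for rpmalloc()/rpfree(); they forward to the
// C allocator so the sketch builds without rpmalloc.
inline void *myAlloc(std::size_t Size) {
  NumAllocations.fetch_add(1, std::memory_order_relaxed);
  return std::malloc(Size ? Size : 1);
}
inline void myFree(void *P) { std::free(P); }

// Replacement global allocation functions.
void *operator new(std::size_t Size) {
  if (void *P = myAlloc(Size))
    return P;
  throw std::bad_alloc();
}
void operator delete(void *P) noexcept { myFree(P); }
void *operator new[](std::size_t Size) { return ::operator new(Size); }
void operator delete[](void *P) noexcept { ::operator delete(P); }

// Number of allocations seen so far.
inline std::size_t allocationCount() {
  return NumAllocations.load(std::memory_order_relaxed);
}
```

This only covers C++ allocations made through `new`; code that calls malloc directly is untouched, which is why the slide overrides malloc, free, calloc and realloc as well.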

  35. THINLTO (CLEAN REBUILD): RAINBOW 6: SIEGE, PC GAME PROFILE
      Configuration                   6-core (W10 build 1903)   36-core (W10 build 1709)
      VS 2017 15.9.16                 57 min 00 sec             37 min 12 sec
      Clang 9.0 ThinLTO               20 min 13 sec             > 1 h 30 min
      Clang 9.0 ThinLTO + rpmalloc    16 min 19 sec             03 min 57 sec

  36. PART 3 PROPOSAL: PROOF-OF-CONCEPT

  37. PREVIOUS BUILD PROCESS
      FASTBUILD: clang.exe, rc.exe, llvm-lib.exe, lld-link.exe, cmake.exe, ml64.exe (masm), clang-tblgen.exe, llvm-tblgen.exe

  38. Maybe there’s a better way

  39. Image Credit: Caterpillar

  40. BUILDING WITH BUILDOZER
      FASTBUILD, LLVM-BUILDOZER
      Tools shown: rc.exe, clang.exe, cmake.exe, llvm-lib.exe, lld-link.exe, ml64.exe (masm), llvm-tblgen.exe, clang-tblgen.exe

  41. [Diagram: FASTBUILD with local rc.exe, cmake.exe and ml64.exe (masm), plus Worker 1 to Worker 5]

  42. RUNNING THE DOZER

        int buildozer::ImportEXE(llvm::StringRef EXE) {
          [..]
          HINSTANCE H = LoadLibraryA(EXE.data());
          if (!H)
            return 0;
          RemapIAT(H);
          InitDebInfo();
          PatchRPMalloc(M);
          InitializeStaticTLS(H);
          InitializeCRT(M);
          FindEntryPoints(M);
          [..]
        }

      "LoadLibrary can also be used to load other executable modules. [..] However, do not use LoadLibrary to run an .exe file. Instead, use the CreateProcess function." (MSDN)

  43. RUNNING THE DOZER

        int buildozer::ImportEXE(llvm::StringRef EXE) {
          [..]
          HINSTANCE H = LoadLibraryA(EXE.data());
          if (!H)
            return 0;
          RemapImportAddressTable(H);
          InitDebInfo();
          PatchRPMalloc(M);
          InitializeStaticTLS(H);
          InitializeCRT(M);
          FindEntryPoints(M);
          [..]
        }
