OPTIMIZING BUILDS ON WINDOWS SOME PRACTICAL CONSIDERATIONS Alexandre Ganea, Ubisoft alexandre.ganea@ubisoft.com 2019 Bay Area LLVM Developers' Meeting, Oct.22-23 1
SUMMARY PART 1 PREAMBLE PART 2 EXPERIMENTS PART 3 PROPOSAL PART 4 NEXT STEPS 2
PART 1 PREAMBLE: CHALLENGES 3
Lines of Code (Assassin’s Creed, Far Cry) PART 1 – PREAMBLE 4
Game production constraints @ Ubisoft
  Concurrent AAA games      20 – 25
  LoC/game                  30 – 50 M
  Programmers/title         100 – 250
  Code changes/day          100 – 150 (peak: 400)
  Build targets/platform    5 – 6
  Platforms/game            4+
  Code workspace            70 – 100 GB
  Data workspace            100 – 200 GB
  Game builds/day           100 – 150
  Stripped build            1 – 6 GB
  Final build               50 – 90 GB
Editor Build
  20,000 .CPP
  25,000 .H
  23 GB .OBJ
  9 GB .DEBUG$T
  10 M type records
  42 M symbols
  300 M .EXE
  2 GB .PDB
  Windows 10
  Fastbuild, distributed
  Always Unity builds
PART 1 – PREAMBLE 5
AAA GAME, CLEAN REBUILD X64 EDITOR RELEASE (FASTBUILD)
  Configuration                       Compiler        Linker
  2017 (MSVC)                         08 min 50 sec   10 min 20 sec
  2018 (MSVC)                         08 min 33 sec   06 min 46 sec
  Fall 2018 (MSVC + LLD)              08 min 50 sec   01 min 18 sec
  2019 (MSVC + LLD)                   08 min 20 sec   43 sec
  2019 (Clang)                        07 min 00 sec   29 sec
  100% cache hit, local SSD           04 min 00 sec   29 sec
  100% cache hit, 1 Gbps network      04 min 15 sec   29 sec
PART 1 – PREAMBLE 6
PART 2 EXPERIMENTS 7
2.1 Clang-scan-deps & Fastbuild cache PART 2 – EXPERIMENTS 8
FASTBUILD CACHE READ ALGORITHM
  Preprocessor path: a.cpp -> clang-cl /E (5-10 sec) -> md5sum -> curl https://store/ -> found, or not found -> clang-cl
  clang-scan-deps path: a.cpp -> clang-scan-deps (0.02 sec) -> deps.txt -> while read x; do md5sum $x; done (0.02 sec) -> deps+MD5.txt
PART 2 – EXPERIMENTS 9
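The dependency-hashing path on this slide (a dependency list from clang-scan-deps, then one hash per listed file) can be sketched as below. This is a minimal sketch under stated assumptions, not Fastbuild's implementation: FNV-1a stands in for md5sum to keep the code dependency-free, and the names hashBytes, hashFile and cacheKey are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// FNV-1a, 64-bit. Assumption: a stand-in for md5sum, chosen only so the
// sketch needs no external library.
std::uint64_t hashBytes(const std::string &Data,
                        std::uint64_t Seed = 14695981039346656037ULL) {
  std::uint64_t H = Seed;
  for (unsigned char C : Data) {
    H ^= C;
    H *= 1099511628211ULL;
  }
  return H;
}

// Hash the contents of one file. Returns false if the file cannot be opened.
bool hashFile(const std::string &Path, std::uint64_t &Out) {
  std::ifstream In(Path, std::ios::binary);
  if (!In)
    return false;
  std::string Data((std::istreambuf_iterator<char>(In)),
                   std::istreambuf_iterator<char>());
  Out = hashBytes(Data);
  return true;
}

// Fold the path and content hash of every dependency into one cache key.
// Deps plays the role of deps.txt; the key plays the role of deps+MD5.txt.
bool cacheKey(const std::vector<std::string> &Deps, std::uint64_t &Key) {
  std::uint64_t K = 14695981039346656037ULL;
  for (const std::string &Path : Deps) {
    std::uint64_t FileHash = 0;
    if (!hashFile(Path, FileHash))
      return false; // a missing dependency means no usable key
    K = hashBytes(Path, K);
    K = hashBytes(std::to_string(FileHash), K);
  }
  Key = K;
  return true;
}
```

The key changes whenever any listed dependency changes, without running the preprocessor over the translation unit.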
100% NETWORK CACHE HITS, AAA GAME, X64 EDITOR RELEASE (FASTBUILD)
  Configuration                      Compiler/Cache   Linker
  VS2017 15.9.16                     06 min 10 sec    40 sec
  Network cache                      04 min 05 sec    40 sec
  Network cache + clang-scan-deps    35 sec           40 sec
PART 2 – EXPERIMENTS 10
clang-scan-deps LLD + network cache (MSVC OBJs + ghash) (ms) 50k files 7 GB -> 22.6 GB Intel Xeon W-2135 @ 3.7 GHz, 128 GB, NVMe SSD, 1 Gbps network PART 2 – EXPERIMENTS 11
2.2 StringMap PART 2 – EXPERIMENTS 12
CLANG-SCAN-DEPS STANDALONE (50K FILES) avg ~90% cpu 11.5% process time Intel Xeon W-2135 @ 3.7 GHz (6-core), 128 GB, NVMe SSD PART 2 – EXPERIMENTS 13
STRINGMAP PART 2 – EXPERIMENTS 14
STRINGMAP PART 2 – EXPERIMENTS 15
DOWN THE RABBIT HOLE
  sizeof(std::error_code) -> 16 bytes
  sizeof(llvm::ErrorOr<DirectoryEntry&>) -> 24 bytes
  sizeof(llvm::StringMapEntry<llvm::ErrorOr<DirectoryEntry&>>) -> 32 bytes (+ string contents)
PART 2 – EXPERIMENTS 16
STRINGMAP: MEMORY LAYOUT [Diagram. Labels: StringMapEntry* bucket slots (mostly nullptr), uint32_t slots (mostly 0), count, NumBuckets; entry fields: size_t, T value, string] PART 2 – EXPERIMENTS 17
STRINGMAP (VTUNE) PART 2 – EXPERIMENTS 18
STRINGMAP STATS [Two histograms: "Hash collisions / call" and "Cachelines hit / call"; 187 M samples] PART 2 – EXPERIMENTS 19
DenseMap<uint64_t,T> + xxHash64() + StringSaver PART 2 – EXPERIMENTS 20
DenseMap<__int128,T> + XXH128() + StringSaver PART 2 – EXPERIMENTS 21
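The idea on these two slides, keying the map on a precomputed hash of the string while a separate saver keeps the string bytes alive, can be sketched with standard containers. Everything here is a stand-in and an assumption: std::unordered_map replaces llvm::DenseMap, FNV-1a replaces xxHash64, std::deque<std::string> replaces llvm::StringSaver, and the class name HashedStringMap is hypothetical. The sketch treats equal hashes as equal keys and does not detect collisions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <string>
#include <unordered_map>

// FNV-1a, 64-bit. Assumption: stand-in for xxHash64.
std::uint64_t hash64(const std::string &S) {
  std::uint64_t H = 14695981039346656037ULL;
  for (unsigned char C : S) {
    H ^= C;
    H *= 1099511628211ULL;
  }
  return H;
}

template <typename T> class HashedStringMap {
  struct Slot {
    const std::string *Key; // points into Saved; never moves
    T Value;
  };
  // std::deque keeps element addresses stable across push_back, which is
  // the property borrowed from a string saver.
  std::deque<std::string> Saved;
  std::unordered_map<std::uint64_t, Slot> Map;

public:
  // Returns the existing value for Key, or inserts Default and returns it.
  T &getOrInsert(const std::string &Key, const T &Default) {
    std::uint64_t H = hash64(Key);
    auto It = Map.find(H);
    if (It != Map.end())
      return It->second.Value;
    Saved.push_back(Key);
    auto Res = Map.emplace(H, Slot{&Saved.back(), Default});
    return Res.first->second.Value;
  }

  // Returns nullptr when Key is absent.
  const T *lookup(const std::string &Key) const {
    auto It = Map.find(hash64(Key));
    return It == Map.end() ? nullptr : &It->second.Value;
  }

  std::size_t size() const { return Map.size(); }
};
```

A lookup compares one fixed-size integer key instead of following a pointer to a separately allocated entry and comparing string contents.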
2.4 Multithreading LLD (COFF driver) PART 2 – EXPERIMENTS 23
LINK AAA GAME, X64 EDITOR RELEASE (22.8 GB MSVC OBJS)
  VS2019 16.2          58 sec
  LLD 9.0              62 sec
  LLD 8 + // GHASH     49 sec
Intel Xeon W-2135 @ 3.7 GHz (6-core), 128 GB, NVMe SSD
PART 2 – EXPERIMENTS 24
  Clang 9.0, no Ghash                                   19.21 s
  Clang 8.0 + // Ghash (12-byte buckets)                 7.42 s   bucket: uint64_t GHash + uint32_t TypeIndex
  Clang 8.0 + // Ghash (8-byte buckets)                  5.29 s   bucket: one uint64_t holding TypeIndex and GHash
  Clang 8.0 + // Ghash (8-byte buckets) + 2MB pages      4.20 s
PART 2 – EXPERIMENTS 25
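The two bucket sizes on this slide can be illustrated with a short sketch. The slide does not say how the 8-byte bucket divides its bits, so the split used here (upper 32 bits for a truncated hash, lower 32 bits for the type index), the use of 0 as the empty marker, and the names packBucket, typeIndexOf, hashTagOf and tryClaim are all assumptions made for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// 12-byte bucket: full 64-bit hash plus 32-bit type index, packed so the
// compiler adds no padding.
#pragma pack(push, 1)
struct Bucket12 {
  std::uint64_t GHash;
  std::uint32_t TypeIndex;
};
#pragma pack(pop)
static_assert(sizeof(Bucket12) == 12, "packed bucket should be 12 bytes");

// 8-byte bucket: both fields in one machine word.
using Bucket8 = std::uint64_t;
static_assert(sizeof(Bucket8) == 8, "single-word bucket should be 8 bytes");

// Assumed split: keep the upper 32 bits of the hash, store the index below.
inline Bucket8 packBucket(std::uint64_t GHash, std::uint32_t TypeIndex) {
  return (GHash & 0xFFFFFFFF00000000ULL) | TypeIndex;
}
inline std::uint32_t typeIndexOf(Bucket8 B) {
  return static_cast<std::uint32_t>(B);
}
inline std::uint32_t hashTagOf(Bucket8 B) {
  return static_cast<std::uint32_t>(B >> 32);
}

// A one-word bucket can be claimed with a single compare-and-swap.
// Assumption: 0 marks an empty slot, so Value must be nonzero.
inline bool tryClaim(std::atomic<Bucket8> &Slot, Bucket8 Value) {
  Bucket8 Expected = 0;
  return Slot.compare_exchange_strong(Expected, Value);
}
```

With 8-byte buckets, eight buckets fit in a 64-byte cache line instead of five, and each slot can be updated atomically by concurrent threads.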
2.3 Process Creation PART 2 – EXPERIMENTS 26
COMPILING WITH CLANG 9.0 PART 2 – EXPERIMENTS 27
CLANG CC1 IN PROCMON 93 ms PART 2 – EXPERIMENTS 28
MAKING CC1 REENTRANT

clang/lib/driver/Job.cpp

int Command::Execute(ArrayRef<llvm::Optional<StringRef>> Redirects,
                     std::string *ErrMsg, bool *ExecutionFailed) const {
  [...]
  typedef int (*ClangDriverMainFunc)(SmallVectorImpl<const char *> &);
  ClangDriverMainFunc ClangDriverMain = nullptr;
  [...]
  if (ClangDriverMain) {
    [...]
    llvm::CrashRecoveryContext CRC;
    CRC.EnableExceptionHandler = true;
    const void *PrettyState = llvm::SavePrettyStackState();
    int Ret = 0;
    auto ExecuteClangMain = [&]() { Ret = ClangDriverMain(Argv); };
    if (!CRC.RunSafely(ExecuteClangMain)) {
      llvm::RestorePrettyStackState(PrettyState);
      return CRC.RetCode;
    }
    return Ret;
  } else {
    auto Args = llvm::toStringRefArray(Argv.data());
    return llvm::sys::ExecuteAndWait(Executable, Args, Env, Redirects,
                                     /*secondsToWait*/ 0, /*memoryLimit*/ 0,
                                     ErrMsg, ExecutionFailed);
  }
}

clang/tools/driver/driver.cpp

int main(int argc_, const char **argv_) {
  noteBottomOfStack();
  llvm::InitLLVM X(argc_, argv_);
  SmallVector<const char *, 256> argv(argv_, argv_ + argc_);
  if (llvm::sys::Process::FixupStandardFileDescriptors())
    return 1;
  llvm::InitializeAllTargets();
  return ClangDriverMain(argv);
}

int ClangDriverMain(SmallVectorImpl<const char *> &argv) {
  static LLVM_THREAD_LOCAL bool EnterPE = true;
  if (EnterPE) {
    llvm::sys::DynamicLibrary::AddSymbol("ClangDriverMain", (void*)( i..
    EnterPE = false;
  } else {
    llvm::cl::ResetAllOptionOccurrences();
  }
  auto TargetAndMode = ToolChain::getTargetAndModeFromProgramName(arg..

PART 2 – EXPERIMENTS 29
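The control flow on this slide (call the tool's entry point in-process when it has been registered, otherwise spawn a process, and reset global option state on re-entry) can be sketched without LLVM. All names below are hypothetical: toolRegistry stands in for the symbol registered through DynamicLibrary::AddSymbol, a C++ exception handler stands in for CrashRecoveryContext, and the spawn step is passed in as a callback rather than calling ExecuteAndWait. The sketch is single-threaded.

```cpp
#include <cassert>
#include <exception>
#include <functional>
#include <map>
#include <string>
#include <vector>

using ToolMain = std::function<int(const std::vector<std::string> &)>;

// Hypothetical registry of in-process entry points.
std::map<std::string, ToolMain> &toolRegistry() {
  static std::map<std::string, ToolMain> Registry;
  return Registry;
}

// Global state that a second in-process run would otherwise inherit.
static int OptionOccurrences = 0;

// Demo tool: registers itself on first entry, resets its state on re-entry.
// It returns the number of "-v" flags seen, only to make the reset visible.
int demoToolMain(const std::vector<std::string> &Args) {
  static bool FirstEntry = true;
  if (FirstEntry) {
    toolRegistry()["demo"] = demoToolMain;
    FirstEntry = false;
  } else {
    OptionOccurrences = 0; // mirrors resetting option occurrences
  }
  for (const std::string &A : Args)
    if (A == "-v")
      ++OptionOccurrences;
  return OptionOccurrences;
}

// Run in-process if the tool is registered, else use the spawn fallback.
int execute(const std::string &Name, const std::vector<std::string> &Args,
            const ToolMain &SpawnFallback) {
  auto It = toolRegistry().find(Name);
  if (It == toolRegistry().end())
    return SpawnFallback(Args);
  try {
    return It->second(Args);
  } catch (const std::exception &) {
    return 1; // assumption: report a failed in-process run as exit code 1
  }
}
```

Calling execute before the tool has registered takes the fallback path; later calls reuse the already loaded code and skip process creation.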
CLANG DRIVER & CC1 MERGED PART 2 – EXPERIMENTS 30
BYPASSING THE CC1 PROCESS: CLEAN REBUILD LLVM, CLANG & LLD
  Machine                      VS2019 16.2     Clang 9.0       Clang 9.0 + cc1 bypass
  6-core - W10 build 1803      34 min 00 sec   32 min 30 sec   22 min 46 sec
  6-core - W10 build 1903      28 min 00 sec   30 min 16 sec   19 min 54 sec
  36-core - W10 build 1709     12 min 00 sec   13 min 10 sec   07 min 10 sec
PART 2 – EXPERIMENTS 31
2.5 CRT Allocator PART 2 – EXPERIMENTS 32
LINKING RAINBOW6: SIEGE WITH THINLTO :-( 4% 96% idle PART 2 – EXPERIMENTS 33
THINLTO: ALLOCATOR CONTENTION PART 2 – EXPERIMENTS 34
REPLACING THE CRT ALLOCATOR

llvm/lib/Support/Windows/Memory.inc

#include "rpmalloc/rpmalloc.c"

extern "C" {
_ACRTIMP _CRTRESTRICT void *malloc(size_t size) { return rpmalloc(size); }
_ACRTIMP void free(void *p) { rpfree(p); }
_ACRTIMP _CRTRESTRICT void *calloc(size_t n, size_t elem_size) {
  return rpcalloc(n, elem_size);
}
_ACRTIMP _CRTRESTRICT void *realloc(void *ptr, size_t size) {
  return rprealloc(ptr, size);
}
}

// Bypass CRT debug allocator
#ifdef _DEBUG
void *operator new(decltype(sizeof(0)) n) noexcept(false) { return malloc(n); }
void __CRTDECL operator delete(void *const block) noexcept { free(block); }
void *operator new[](std::size_t s) throw(std::bad_alloc) { return malloc(s); }
void operator delete[](void *p) throw() { free(p); }
#endif

Compare: $ LD_PRELOAD=/path/to/my/malloc.so /bin/ls
https://github.com/mjansson/rpmalloc
PART 2 – EXPERIMENTS 35
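The same routing idea can be shown in portable C++ by replacing the global operator new and operator delete. This is a sketch under stated assumptions, not the change made in Memory.inc: myAlloc and myFree are hypothetical stand-ins for rpmalloc and rpfree that simply forward to the C allocator, and AllocCount exists only so the redirection can be observed.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>

// Counts calls that reach the replacement allocator. std::atomic has a
// constexpr constructor, so this is initialized before any allocation.
static std::atomic<std::size_t> AllocCount{0};

// Hypothetical allocator entry points (stand-ins for rpmalloc / rpfree).
void *myAlloc(std::size_t N) {
  ++AllocCount;
  return std::malloc(N ? N : 1); // never request zero bytes
}
void myFree(void *P) { std::free(P); }

// Replaceable global allocation functions: every new-expression in the
// program now goes through myAlloc.
void *operator new(std::size_t N) {
  if (void *P = myAlloc(N))
    return P;
  throw std::bad_alloc();
}
void *operator new[](std::size_t N) { return ::operator new(N); }

void operator delete(void *P) noexcept { myFree(P); }
void operator delete[](void *P) noexcept { myFree(P); }
void operator delete(void *P, std::size_t) noexcept { myFree(P); }
void operator delete[](void *P, std::size_t) noexcept { myFree(P); }
```

Swapping the bodies of myAlloc and myFree for a thread-caching allocator changes the allocator for the whole program without touching any call site.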
THINLTO (CLEAN REBUILD) RAINBOW 6: SIEGE, PC GAME PROFILE
  Toolchain                        6-core (W10 build 1903)   36-core (W10 build 1709)
  VS 2017 15.9.16                  57 min 00 sec             37 min 12 sec
  Clang 9.0 ThinLTO                20 min 13 sec             > 1 h 30 min
  Clang 9.0 ThinLTO + rpmalloc     16 min 19 sec             03 min 57 sec
PART 2 – EXPERIMENTS 36
PART 3 PROPOSAL PROOF-OF-CONCEPT 37
PREVIOUS BUILD PROCESS FASTBUILD clang.exe rc.exe llvm-lib.exe lld-link.exe cmake.exe ml64.exe (masm) clang-tblgen.exe llvm-tblgen.exe PART 3 – PROPOSAL 38
Maybe there’s a better way PART 3 – PROPOSAL 39
Image Credit: Caterpillar PART 3 – PROPOSAL 40
BUILDING WITH BUILDOZER FASTBUILD LLVM-BUILDOZER rc.exe clang.exe cmake.exe llvm-lib.exe lld-link.exe ml64.exe (masm) llvm-tblgen.exe clang-tblgen.exe PART 3 – PROPOSAL 41
[Diagram: FASTBUILD with local steps (rc.exe, cmake.exe, ml64.exe (masm)) and Workers 1 to 5] PART 3 – PROPOSAL 42
RUNNING THE DOZER

int buildozer::ImportEXE(llvm::StringRef EXE) {
  [..]
  HINSTANCE H = LoadLibraryA(EXE.data());
  if (!H)
    return 0;
  RemapIAT(H);
  InitDebInfo();
  PatchRPMalloc(M);
  InitializeStaticTLS(H);
  InitializeCRT(M);
  FindEntryPoints(M);
  [..]
}

“LoadLibrary can also be used to load other executable modules. [..] However, do not use LoadLibrary to run an .exe file. Instead, use the CreateProcess function.” (MSDN)
PART 3 – PROPOSAL 43
RUNNING THE DOZER

int buildozer::ImportEXE(llvm::StringRef EXE) {
  [..]
  HINSTANCE H = LoadLibraryA(EXE.data());
  if (!H)
    return 0;
  RemapImportAddressTable(H);
  InitDebInfo();
  PatchRPMalloc(M);
  InitializeStaticTLS(H);
  InitializeCRT(M);
  FindEntryPoints(M);
  [..]
}

PART 3 – PROPOSAL 44