
CPU design effects that can degrade performance of your programs

Jakub Beránek (jakub.beranek@vsb.cz)

whoami
•PhD student @ VSB-TUO, Ostrava, Czech Republic
•Research assistant @ IT4Innovations (HPC center)
•HPC, distributed


  1-9. Simple branch predictor - unsorted array: if (data[i] < 6) { ... } [Animation over a small array: each frame compares the current element with 6 and shows the prediction. Frame by frame, the prediction reads Not taken, Taken, Taken, Not taken, Not taken, Taken, Taken. The last frame reports the hits, the misses and the hit rate; the array values and these counts are not recoverable.]

  10-27. Simple branch predictor - sorted array: if (data[i] < 6) { ... } [Animation over the same kind of array, now sorted: each frame compares the current element with 6 and shows the prediction. The prediction reads Not taken for two frames, then Taken for eight frames, then Not taken for six frames, so it changes only twice. The last frame reports the hits, the misses and the hit rate; the array values and these counts are not recoverable.]
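
The two animations can be replayed with a small simulation. This is a minimal sketch, assuming a one-bit "last outcome" predictor that starts at Not taken; the slides do not name the scheme, and real CPUs use far more elaborate predictors. The names `PredictorStats` and `simulate_last_outcome` are hypothetical, and the sample values are made up because the array on the slides was lost.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hit/miss counters for one pass over the data.
struct PredictorStats {
    std::size_t hits = 0;
    std::size_t misses = 0;
};

// Simulates a one-bit "last outcome" predictor for the branch
// `if (data[i] < threshold)`. The predictor starts at "not taken" and,
// after every branch, predicts whatever that branch just did.
// Assumption: this is the simplest possible scheme, not a real CPU design.
PredictorStats simulate_last_outcome(const std::vector<int>& data, int threshold) {
    PredictorStats stats;
    bool predict_taken = false;
    for (int value : data) {
        const bool taken = value < threshold;
        if (taken == predict_taken) {
            ++stats.hits;
        } else {
            ++stats.misses;
        }
        predict_taken = taken;  // remember the last outcome
    }
    assert(stats.hits + stats.misses == data.size());
    return stats;
}

// Made-up sample input (the values on the slides are not recoverable).
inline std::vector<int> sample_unsorted() {
    return {2, 9, 4, 8, 1, 7, 3, 6};
}

// The same values in ascending order.
inline std::vector<int> sample_sorted() {
    std::vector<int> data = sample_unsorted();
    std::sort(data.begin(), data.end());
    return data;
}
```

With this sample and a threshold of 6, the unsorted input alternates on every element and yields 0 hits and 8 misses, while the sorted input yields 6 hits and 2 misses: one miss at the start and one where the outcome flips.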

  28. How can the compiler help? With float, there are two branches per iteration

  29. How can the compiler help? With int, one branch is removed (using cmov)
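
The slides do not show the loop being compiled, so the following is only a sketch under the assumption that it sums the elements below a threshold. It contrasts the plain `if` with a "select" form that a compiler can more readily turn into a conditional move. Both function names are hypothetical, and whether `cmov` is actually emitted depends on the compiler, its flags and the target; inspecting the generated assembly is the only way to know.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Branchy form: the `if` may be compiled to a conditional jump,
// which the CPU then has to predict on every iteration.
template <typename T>
T sum_below_branchy(const std::vector<T>& data, T threshold) {
    T sum = 0;
    for (T value : data) {
        if (value < threshold) {
            sum += value;
        }
    }
    return sum;
}

// Select form: always add, but add zero when the condition is false.
// This gives the compiler an easy opportunity to avoid a jump.
template <typename T>
T sum_below_select(const std::vector<T>& data, T threshold) {
    T sum = 0;
    for (T value : data) {
        sum += (value < threshold) ? value : T{0};
    }
    return sum;
}
```

Instantiating both templates with `int` and with `float` and comparing the assembly is one way to reproduce the observation on the two slides above.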

  30-31. How to measure? branch-misses: how many times was a branch mispredicted?
      $ perf stat -e branch-misses ./example0a
      with sort -> 383 902
      without sort -> 101 652 009
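
The source of `./example0a` is not part of the slides. As a rough idea of what such a workload could look like, here is a hypothetical stand-in: it fills a vector from a fixed seed, optionally sorts it, and then runs the data-dependent branch. The function name, the seed and the value range are all assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical stand-in for the measured program (its real source is not shown).
// Fills `count` values in [0, 10) from a fixed seed, optionally sorts them,
// then counts how many are below `threshold`.
std::size_t count_below(std::size_t count, int threshold, bool sort_first) {
    std::mt19937 rng(12345);  // fixed seed so both runs see the same values
    std::uniform_int_distribution<int> dist(0, 9);
    std::vector<int> data(count);
    for (int& value : data) {
        value = dist(rng);
    }
    if (sort_first) {
        std::sort(data.begin(), data.end());
    }
    std::size_t below = 0;
    for (int value : data) {
        if (value < threshold) {  // the branch being measured
            ++below;
        }
    }
    assert(below <= count);
    return below;
}
```

Calling this from a `main` that picks `sort_first` from a command-line flag, and running each variant under `perf stat -e branch-misses`, mirrors the comparison on the slide. An optimizing compiler may turn such a simple counting loop into branch-free code, in which case the difference shrinks or disappears; the figures above come from the author's program, not from this sketch.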

  32-35. How to help the branch predictor?
      •More predictable data
      •Profile-guided optimization
      •Remove (unpredictable) branches
      •Compiler hints (use with caution)
          if (__builtin_expect(will_it_blend(), 0)) {
              // this branch is not likely to be taken
          }
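
The compiler hint from the last bullet can be wrapped so that the code still builds where the builtin is missing. This is a minimal sketch: `__builtin_expect` is a GCC/Clang builtin, while the macro `HINT_UNLIKELY` and the function `count_negative` are hypothetical names chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Wrap the builtin so that other compilers simply see the plain condition.
#if defined(__GNUC__) || defined(__clang__)
#define HINT_UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#define HINT_UNLIKELY(x) (x)
#endif

// Counts negative values, telling the compiler that they are expected
// to be rare. The hint changes code layout only, never the result.
std::size_t count_negative(const std::vector<int>& data) {
    std::size_t count = 0;
    for (int value : data) {
        if (HINT_UNLIKELY(value < 0)) {
            ++count;
        }
    }
    return count;
}
```

The hint mainly influences how the compiler lays out the generated code; if the assumption is wrong for the real data, it can make things slower, which is why the slide says to use it with caution. C++20 also offers the `[[likely]]` and `[[unlikely]]` attributes for the same purpose.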

  36-39. Branch target prediction
      •Target of a jump is not known at compile time:
          •Function pointer
          •Function return address
          •Virtual method
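
The function-pointer case from the list above can be illustrated as follows. This is a sketch with hypothetical names (`Handler`, `add_one`, `add_two`, `run_handlers`): every call in the loop goes through a pointer, so the CPU has to predict the destination address, not just whether a jump happens.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A handler is called through a pointer, so its address is only known at run time.
using Handler = void (*)(std::size_t*);

void add_one(std::size_t* data) { *data += 1; }
void add_two(std::size_t* data) { *data += 2; }

// Calls every handler in order and returns the accumulated sum.
std::size_t run_handlers(const std::vector<Handler>& handlers) {
    std::size_t sum = 0;
    for (Handler handler : handlers) {
        assert(handler != nullptr);
        handler(&sum);  // indirect call: the target must be predicted
    }
    return sum;
}
```

If the handlers in the vector alternate unpredictably, the predicted target is often wrong; grouping equal handlers together makes the sequence of targets easier to predict. A compiler that can see all targets may inline or otherwise remove the indirect calls, so this is an illustration, not a benchmark.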

  40. Code (backup)
      struct A {
          virtual ~A() = default;  // instances are destroyed through unique_ptr<A>
          virtual void handle(size_t* data) const = 0;
      };
      struct B: public A {
          void handle(size_t* data) const final { *data += 1; }
      };
      struct C: public A {
          void handle(size_t* data) const final { *data += 2; }
      };

      std::vector<std::unique_ptr<A>> data = /* 4K random B/C instances */;
      // std::sort(data.begin(), data.end(), /* sort by instance type */);

      size_t sum = 0;
      for (auto& x : data) {
          x->handle(&sum);  // virtual call: the target depends on the dynamic type
      }

  41. Result (backup)

  42. perf (backup)
      $ perf stat -e branch-misses ./example0b
      with sort -> 337 274
      without sort -> 84 183 161
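
The backup code leaves two parts open: how the random instances are created and how they are sorted by type. One way to fill both in is sketched below. The generator, the seed parameter and the comparator based on `std::type_index` are assumptions, not the author's code; the helper names `make_instances`, `sort_by_type` and `run_experiment` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <memory>
#include <random>
#include <typeindex>
#include <typeinfo>
#include <vector>

struct A {
    virtual ~A() = default;
    virtual void handle(std::size_t* data) const = 0;
};
struct B : public A {
    void handle(std::size_t* data) const final { *data += 1; }
};
struct C : public A {
    void handle(std::size_t* data) const final { *data += 2; }
};

// Creates `count` instances, each a B or a C with equal probability.
std::vector<std::unique_ptr<A>> make_instances(std::size_t count, unsigned seed) {
    std::mt19937 rng(seed);
    std::bernoulli_distribution pick_b(0.5);
    std::vector<std::unique_ptr<A>> data;
    data.reserve(count);
    for (std::size_t i = 0; i < count; ++i) {
        if (pick_b(rng)) {
            data.push_back(std::make_unique<B>());
        } else {
            data.push_back(std::make_unique<C>());
        }
    }
    return data;
}

// Groups instances of the same dynamic type together.
void sort_by_type(std::vector<std::unique_ptr<A>>& data) {
    std::sort(data.begin(), data.end(),
              [](const std::unique_ptr<A>& lhs, const std::unique_ptr<A>& rhs) {
                  const A& left = *lhs;
                  const A& right = *rhs;
                  return std::type_index(typeid(left)) < std::type_index(typeid(right));
              });
}

// Runs the loop from the slide and returns the sum.
std::size_t run_experiment(std::size_t count, unsigned seed, bool sort_first) {
    std::vector<std::unique_ptr<A>> data = make_instances(count, seed);
    if (sort_first) {
        sort_by_type(data);
    }
    std::size_t sum = 0;
    for (const auto& x : data) {
        x->handle(&sum);  // virtual call through the base class
    }
    assert(sum >= count && sum <= 2 * count);
    return sum;
}
```

Sorting does not change the sum, only the order in which the two `handle` implementations are called, which is what makes the call target predictable.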

  43. Code (backup)
      // Addresses of N integers, each `offset` bytes apart
      std::vector<int*> data = ...;
      for (auto ptr: data) {
          *ptr += 1;
      }
      // Offsets: 4, 64, 4000, 4096, 4128

  44. Result (backup)
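
The slide elides how the pointer vector is built. A minimal sketch of one possible setup follows, assuming the integers live in a single contiguous buffer and that `offset` is a multiple of `sizeof(int)`, which holds for all five offsets listed when `int` is 4 bytes. The names `StridedInts`, `make_strided`, `increment_all` and `touch_and_sum` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Backing memory plus one pointer every `offset_bytes` bytes.
// Do not copy this struct: the pointers refer to `storage` of the original.
struct StridedInts {
    std::vector<int> storage;
    std::vector<int*> pointers;
};

// Builds `count` pointers into one zero-initialized buffer,
// each `offset_bytes` bytes after the previous one.
StridedInts make_strided(std::size_t count, std::size_t offset_bytes) {
    assert(offset_bytes >= sizeof(int) && offset_bytes % sizeof(int) == 0);
    const std::size_t step = offset_bytes / sizeof(int);
    StridedInts result;
    result.storage.assign(count * step, 0);
    result.pointers.reserve(count);
    for (std::size_t i = 0; i < count; ++i) {
        result.pointers.push_back(result.storage.data() + i * step);
    }
    return result;
}

// The loop from the slide.
void increment_all(const std::vector<int*>& data) {
    for (int* ptr : data) {
        *ptr += 1;
    }
}

// Runs the loop `passes` times and returns the sum of the whole buffer,
// which equals count * passes because only the pointed-to ints change.
long long touch_and_sum(std::size_t count, std::size_t offset_bytes, int passes) {
    StridedInts ints = make_strided(count, offset_bytes);
    for (int pass = 0; pass < passes; ++pass) {
        increment_all(ints.pointers);
    }
    return std::accumulate(ints.storage.begin(), ints.storage.end(), 0LL);
}
```

Timing `increment_all` for each offset on the slide (4, 64, 4000, 4096, 4128) is the experiment; the measured results belong on the result slide and are not reproduced here.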

  45. Cache memory
