Non-transient Side Channels (6.888 L5), Mengjia Yan, Fall 2020

Lab Assignment
  • Handout on course website
  • Each (regular) student will receive an email
  • Solo or 2-person group
  • Individual GitHub repo


  8. Find Eviction Set Using Virtual Addresses
      • Virtual address (48 bits): virtual page number (upper bits) | page offset (bits 11-0; 12 bits for a 4KB page)
      • Physical address (32 bits): physical page number (bits 31-12) | page offset (bits 11-0)
      • Cache mapping (8 sets): tag | index (3 bits) | line offset (6 bits)
      • Cache mapping (256 sets): tag | set index (8 bits) | line offset (6 bits)
      • With 256 sets, set index plus line offset span 14 bits, so the top 2 set-index bits lie above the 12-bit page offset and are not controllable via virtual address.
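The bit arithmetic on this slide can be checked with a small sketch. It assumes a 64-byte line (6 offset bits) and a power-of-two number of sets, as on the slide; the function names are illustrative, not from the lecture.

```python
LINE_OFFSET_BITS = 6    # 64-byte cache line, as on the slide
PAGE_OFFSET_BITS = 12   # 4KB page


def set_index(addr: int, num_sets: int) -> int:
    """Return the cache set selected by an address (num_sets is a power of two)."""
    index_bits = num_sets.bit_length() - 1
    return (addr >> LINE_OFFSET_BITS) & ((1 << index_bits) - 1)


def uncontrolled_index_bits(num_sets: int, page_offset_bits: int = PAGE_OFFSET_BITS) -> int:
    """Count the set-index bits that lie above the page offset."""
    index_bits = num_sets.bit_length() - 1
    return max(0, LINE_OFFSET_BITS + index_bits - page_offset_bits)


# 8 sets: 3 + 6 = 9 bits, all inside the 12-bit page offset.
print(uncontrolled_index_bits(8))
# 256 sets: 8 + 6 = 14 bits, so 2 index bits come from the physical page number.
print(uncontrolled_index_bits(256))
```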

  11. Huge Pages
      • Huge page size: 2MB or 1GB
      • Number of bits for page offset?
      • 4KB page: virtual page number | page offset (bits 11-0, 12 bits)
      • 2MB page: virtual page number | page offset (bits 20-0, 21 bits)
      • Cache mapping (256 sets): tag | set index (8 bits) | line offset (6 bits); both fields fit inside the 21-bit page offset of a 2MB page
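A short sketch answers the "number of bits" question for each page size and checks whether the 256-set index is fully covered by the page offset. The helper names are hypothetical and the 6-bit line offset is taken from the earlier cache-mapping slide.

```python
KB, MB, GB = 1 << 10, 1 << 20, 1 << 30


def page_offset_bits(page_size_bytes: int) -> int:
    """log2 of a power-of-two page size."""
    return page_size_bytes.bit_length() - 1


def index_fully_controllable(page_size_bytes: int, num_sets: int,
                             line_offset_bits: int = 6) -> bool:
    """True if every set-index bit lies inside the page offset."""
    index_bits = num_sets.bit_length() - 1
    return line_offset_bits + index_bits <= page_offset_bits(page_size_bytes)


for size in (4 * KB, 2 * MB, 1 * GB):
    print(size, page_offset_bits(size), index_fully_controllable(size, 256))
```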

  14. Multi-level Caches
      • Diagram: each core has its own D-L1, I-L1, and L2; the L2s sit above one LLC
      • Motivation: a memory cannot be large and fast. Add levels of cache to reduce miss penalty.
      • A typical configuration of Intel Ivy Bridge. Configurations differ across processor types.

                                L1-I/D cache   L2 cache   L3 cache (LLC)   DRAM
        Size                    32KB           256KB      1MB/core         16GB
        Associativity (# ways)  4 or 8         8          16               N/A
        Latency (cycles)        1-5            12         ~40              ~150

  18. Multi-level Caches
      • Motivation: a memory cannot be large and fast. Add levels of cache to reduce miss penalty.
      • LLC is generally divided into multiple slices
      • Address fields: tag | set index | line offset
      • Slice ID = Hash(bits), an undocumented secret hash function
      • Conflict happens if addresses map to the same slice and the same set
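The conflict condition can be written down as a sketch. The real slice hash is undocumented, so the hash below is purely hypothetical: the parity of a few arbitrarily chosen address bits, with two slices and 256 sets per slice assumed for illustration only.

```python
LINE_OFFSET_BITS = 6
SET_INDEX_BITS = 8                      # assumption: 256 sets per slice
HYPOTHETICAL_HASH_MASK = 0b1011 << 17   # made-up bit selection, NOT the real hash


def set_index(addr: int) -> int:
    return (addr >> LINE_OFFSET_BITS) & ((1 << SET_INDEX_BITS) - 1)


def slice_id(addr: int) -> int:
    """Hypothetical 2-slice hash: parity of the masked address bits."""
    return bin(addr & HYPOTHETICAL_HASH_MASK).count("1") & 1


def conflicts(a: int, b: int) -> bool:
    """Two addresses compete for the same ways only if slice and set both match."""
    return slice_id(a) == slice_id(b) and set_index(a) == set_index(b)


print(conflicts(0, 1 << 17))   # same set, different slice
print(conflicts(0, 3 << 17))   # same set, same slice
```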

  21. Eviction Set Construction Algorithm
      Sender and receiver share a cache. Timeline:
      • Receiver: access candidate addresses
      • Sender: access target address (while the receiver waits)
      • Receiver: measure latency of each candidate address
      Vila et al. Theory and Practice of Finding Eviction Sets. S&P’19
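To make the idea concrete, here is a toy simulation, assuming a small LRU cache (16 sets, 4 ways) instead of real timing. It uses a simple one-at-a-time reduction: drop a candidate whenever the remaining ones still evict the target. This is a sketch of the general approach, not necessarily the exact procedure on the slide, and all names and parameters are illustrative.

```python
LINE_BYTES = 64
NUM_SETS = 16   # toy parameters, chosen for illustration
WAYS = 4


def set_of(addr: int) -> int:
    return (addr // LINE_BYTES) % NUM_SETS


class ToyCache:
    """Set-associative cache with true LRU; each set is ordered LRU -> MRU."""

    def __init__(self):
        self.sets = [[] for _ in range(NUM_SETS)]

    def access(self, addr: int) -> bool:
        lines = self.sets[set_of(addr)]
        tag = addr // LINE_BYTES
        hit = tag in lines
        if hit:
            lines.remove(tag)
        elif len(lines) == WAYS:
            lines.pop(0)          # evict the least recently used line
        lines.append(tag)
        return hit


def evicts(candidates, target) -> bool:
    """Load the target, touch all candidates, then check whether the target misses."""
    cache = ToyCache()
    cache.access(target)
    for c in candidates:
        cache.access(c)
    return not cache.access(target)


def reduce_to_minimal(candidates, target):
    """Remove candidates one at a time while the rest still evict the target."""
    if not evicts(candidates, target):
        raise ValueError("candidate pool does not evict the target")
    current = list(candidates)
    i = 0
    while i < len(current):
        trial = current[:i] + current[i + 1:]
        if evicts(trial, target):
            current = trial
        else:
            i += 1
    return current


target = 5 * LINE_BYTES
pool = [(1 << 20) + k * LINE_BYTES for k in range(NUM_SETS * 6)]
eviction_set = reduce_to_minimal(pool, target)
print([hex(a) for a in eviction_set])
```

On real hardware the `evicts` check would be a timed reload of the target rather than a lookup in a model.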

  26. Problems Due to Replacement Policy
      • Self-eviction due to replacement policy
      • An LRU (least recently used) example:
        Initial: (empty set)
        Prime: 1 2 3 4 5 6 7 8
        Victim access: 9 replaces 1, the least recently used line
        Probe: Which to evict?
      • A small trick: access addresses in reverse order
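The example can be replayed with a minimal sketch of one 8-way set under true LRU (real replacement policies are often only approximations of LRU). It counts probe misses for forward and reverse probe order; the function names are illustrative.

```python
WAYS = 8


def access(lru_set, line) -> bool:
    """Touch `line` in one cache set (list ordered LRU -> MRU). Return True on a hit."""
    hit = line in lru_set
    if hit:
        lru_set.remove(line)
    elif len(lru_set) == WAYS:
        lru_set.pop(0)            # evict the least recently used line
    lru_set.append(line)
    return hit


def probe_misses(probe_order, victim_accessed: bool) -> int:
    cache_set = []
    for line in range(1, 9):      # prime with lines 1..8 in order
        access(cache_set, line)
    if victim_accessed:
        access(cache_set, 9)      # victim line evicts line 1
    return sum(not access(cache_set, line) for line in probe_order)


forward = list(range(1, 9))
reverse = forward[::-1]
print(probe_misses(forward, True))   # every probe evicts the next line to be probed
print(probe_misses(reverse, True))   # only line 1 misses
print(probe_misses(reverse, False))  # no victim access, no misses
```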

  29. Measure Latency of Multiple Accesses
      • HW Prefetcher + Out-of-order execution
          T1 = rdtsc()
          Dummy1 = Ld(Addr1)
          ……
          Dummy8 = Ld(Addr8)
          T2 = rdtsc()
          Latency = T2 - T1
      • What we expect: Ld A1, Ld A2, ……, Ld A7, Ld A8 run one after another in time
      • What actually will happen: Ld A1 through Ld A8 overlap in time

  33. Out-of-Order Processor
      • Pipeline: Fetch → Decode → RegRead → Execute → Writeback (Commit)
      • RegRead: check whether the register to read is ready
      • Ld A1, Ld A2, ……, Ld A7, Ld A8 have no dependencies on each other, so they overlap in time
      • Question: How to serialize data accesses?

  37. Serialize Data Accesses
      • A special instruction “mfence”: https://www.felixcloutier.com/x86/mfence
      • Add data dependency by creating a linked list: instead of dummy content (Dummy1 = Ld(Addr1)), each node holds a pointer to the next node (Addr2 = Ld(Addr1))
        A1 → A2 → A3 → ……
      • Doubly linked list to access addresses in reverse order
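The linked-list idea is sketched below as a data structure only. Python cannot show the hardware effect, but the shape is the same as a pointer chase in C: each step needs the value returned by the previous "load" before the next address is known. The dictionaries stand in for memory, and the addresses are arbitrary examples.

```python
def build_links(addrs):
    """Return (next_of, prev_of) maps forming a doubly linked list over addrs."""
    next_of = {a: b for a, b in zip(addrs, addrs[1:])}
    next_of[addrs[-1]] = None
    prev_of = {b: a for a, b in zip(addrs, addrs[1:])}
    prev_of[addrs[0]] = None
    return next_of, prev_of


def chase(start, links):
    """Follow the pointers; the next address comes from the current node's content."""
    visited = []
    addr = start
    while addr is not None:
        visited.append(addr)
        addr = links[addr]        # data dependency: load result feeds the next load
    return visited


addrs = [0x1000, 0x2040, 0x3080, 0x40C0]   # arbitrary example addresses
next_of, prev_of = build_links(addrs)
print([hex(a) for a in chase(addrs[0], next_of)])    # forward order
print([hex(a) for a in chase(addrs[-1], prev_of)])   # reverse order
```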

  39. Handle Noise
      A real-world example: Square-and-Multiply Exponentiation
      • What you generally see in papers (e_i is bit i of the exponent e, n is the modulus):
          for i = |e|-1 downto 0 do
            r = sqr(r) mod n
            if e_i == 1 then
              r = mul(r, b) mod n
            end
          end
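A runnable sketch of the loop, assuming `r` starts at 1, shows why the operation sequence matters: recording which operations run ("S" for square, "M" for multiply) is enough to read back the exponent bits. The trace recording and the `recover_bits` helper are illustrative additions, not part of the slide.

```python
def square_and_multiply(base: int, exponent: int, modulus: int):
    """Left-to-right square-and-multiply; also returns the sequence of operations."""
    result = 1
    trace = []
    for i in range(exponent.bit_length() - 1, -1, -1):
        result = (result * result) % modulus
        trace.append("S")
        if (exponent >> i) & 1:           # secret-dependent branch
            result = (result * base) % modulus
            trace.append("M")
    return result, "".join(trace)


def recover_bits(trace: str) -> str:
    """A square followed by a multiply is a 1 bit; a square alone is a 0 bit."""
    bits = []
    i = 0
    while i < len(trace):
        if i + 1 < len(trace) and trace[i + 1] == "M":
            bits.append("1")
            i += 2
        else:
            bits.append("0")
            i += 1
    return "".join(bits)


value, ops = square_and_multiply(7, 0b1011, 1000003)
print(value, ops, recover_bits(ops))
```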

  40. The Multiply Function

  42. Raw Trace
      • Access latencies measured in the probe operation in Prime+Probe.
      • A sequence of “01010111011001” can be deduced as part of the exponent.

  43. There may exist other problems
      • Tips for lab assignment:
        • Build the attack step-by-step
        • Recommended reading: “Last-Level Cache Side-Channel Attacks are Practical”
        • Ask questions via Piazza

  44. Defenses

  49. Micro-architecture Side Channels
      • Victim (secret-dependent execution) → a channel (a micro-architecture structure) → Attacker
      • Channel space: {Cache, DRAM, TLB, NoC, etc.} × {Transient, Non-transient}
      Kiriansky et al. DAWG: a defense against cache timing attacks in speculative execution processors. MICRO’18

  52. Micro-architecture Side Channels
      • Victim (secret-dependent execution) → a channel (a micro-architecture structure) → Attacker
      • Defenses:
        • Block creation of signals: oblivious execution, speculative execution defenses, etc.
        • Close the channel: isolation, etc.
        • Block detection of signals: randomization, etc.
      Kiriansky et al. DAWG: a defense against cache timing attacks in speculative execution processors. MICRO’18

  53. Defense Design Considerations
      • Security
      • Performance
      • Portability

  54. The Problem: The ISA Abstraction
      • Interface between HW and SW: ISA
      • Advantage: HW optimizations without affecting usability/portability
      • Layers: Software (branch, arithmetic instruction, load/store) / ISA (instruction set architecture) / Hardware (caches, DRAM, TLBs, etc.)

  55. From https://www.felixcloutier.com/x86/index.html

  56. The Problem: The ISA Abstraction
      • Interface between HW and SW: ISA
      • ISA specifies functionality, not performance/timing
      • Compare Intel Ivy Bridge and Cascade processors. Example: DEC [addr]
      • Layers: Software (branch, arithmetic instruction, load/store) / ISA (instruction set architecture) / Hardware (caches, DRAM, TLBs, etc.)

  60. Data Oblivious/“Constant time” Programming
      Write program w/o data-dependent behavior
      (secret = confidential, addr1 = public, addr2 = public)
      • Original:
          if ( secret )
            a = *(addr1);
          else
            a = *(addr2);
      • Data Oblivious:
          a ← load (addr1);
          b ← load (addr2);
          cmov a = ( secret ) ? a : b;
      • As a dataflow graph: a ← load addr1 and b ← load addr2 both feed cmov secret, b, a
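The data-oblivious version can be sketched with a branch-free select built from a bit mask, standing in for `cmov`. This only illustrates the dataflow: Python itself gives no constant-time guarantee, and the 64-bit width and function names are assumptions for the example.

```python
MASK64 = (1 << 64) - 1


def oblivious_select(secret_bit: int, a: int, b: int) -> int:
    """Return a if secret_bit == 1 else b, without branching on secret_bit."""
    mask = (-secret_bit) & MASK64          # all ones for 1, all zeros for 0
    return (a & mask) | (b & ~mask & MASK64)


def oblivious_load(memory, secret_bit: int, addr1: int, addr2: int) -> int:
    """Always load both public addresses, then select; the access pattern is fixed."""
    a = memory[addr1]
    b = memory[addr2]
    return oblivious_select(secret_bit, a, b)


memory = {0x10: 111, 0x20: 222}            # toy memory for illustration
print(oblivious_load(memory, 1, 0x10, 0x20))
print(oblivious_load(memory, 0, 0x10, 0x20))
```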

  61. Programming in Circuit Abstraction
      • Program = DAG (“circuit”)
      • Operations = nodes (“gates”)
      • Data transfers = edges (“wires”)
      • Topology must be confidential data-independent
      • Each gate’s execution must hide its inputs
      • Each wire must hide the value it carries
      • Diagram: nodes/gates op1, op2, op3, op4 connected by edges/wires

  65. What assumptions underpin the model?
      (secret = confidential, addr1 = public, addr2 = public)
      • Original: if ( secret ) a = *(addr1); else a = *(addr2);
      • Circuit: a ← load addr1 and b ← load addr2 both feed cmov secret, b, a
      • Rule 1: instruction/gate execution = confidential data-independent
      • Rule 2: data transfer/wire = confidential data-independent
      • Rule 3: circuit/program topology = fixed

  66. Today’s machines can violate these assumptions
      • Circuit: a ← load addr1 and b ← load addr2 both feed cmov secret, b, a
      • Rule 1: instruction/gate execution = confidential data-independent
      • Rule 2: data transfer/wire = confidential data-independent
      • Rule 3: circuit/program topology = fixed
      • Violations due to: data-dependent instruction optimizations (e.g., zero-skip, early exit, microcode, silent stores, …)

  67. Today’s machines can violate these assumptions
      • Violations due to: data at rest optimizations (e.g., compression in register file/uop fusion, cache, page tables, …)

  68. Today’s machines can violate these assumptions
      • Violations due to: speculative/OoO execution

  71. HW Resource Partition
      • Security vs. Quality of Service (QoS)
      • Intel Cache Allocation Technology (CAT)
      • Temporal Partition vs. Spatial Partition
      • Challenges nowadays:
        • Security domain determination is tricky
        • Scalability: what if #domains > #partitions?
        • How to partition inside cores?
        • Why not execute applications on a single node?

  74. Randomization/Fuzzing
      • Introduce noise to time measurement / make time measurement coarse-grained
      • Pros and cons?
        + Simple and no performance overhead
        + Effective against a group of popular attacks
        ……
        - Not effective against attacks that do not measure time
        - Not effective for victims that cause a big timing difference
        - Affects usability if a benign application needs a fine-grained timer
      • Randomize cache mapping functions
        • Pros and cons?
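The coarse-grained timer trade-off can be illustrated with a small sketch. It assumes a timer that rounds down to a multiple of some granularity (a hypothetical 1000 cycles here) and reuses the ~40-cycle LLC and ~150-cycle DRAM latencies from the Ivy Bridge table. The start value is arbitrary; a measurement that straddles a tick boundary would behave differently.

```python
def coarse_timestamp(cycles: int, granularity: int) -> int:
    """Round a cycle count down to a multiple of the timer granularity."""
    return (cycles // granularity) * granularity


def measured_latency(start: int, latency: int, granularity: int) -> int:
    """Latency as seen through the coarse timer."""
    end = start + latency
    return coarse_timestamp(end, granularity) - coarse_timestamp(start, granularity)


START = 2000            # arbitrary start time in cycles
LLC_HIT, DRAM = 40, 150

# Fine-grained timer: the two cases are clearly distinguishable.
print(measured_latency(START, LLC_HIT, 1), measured_latency(START, DRAM, 1))
# Coarse timer: both collapse to the same reading.
print(measured_latency(START, LLC_HIT, 1000), measured_latency(START, DRAM, 1000))
# A victim with a large timing difference is still visible.
print(measured_latency(START, 5000, 1000))
```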
