
Verifiable ASICs: trustworthy hardware with untrusted components
Riad S. Wahby, Max Howald, Siddharth Garg, abhi shelat, and Michael Walfish
Stanford University, New York University, The Cooper Union


  1–13. Zebra builds on IPs of GKR [GKR08, CMT12, VSBW13]. F must be expressed as a layered arithmetic circuit. [Diagram: V and P exchange messages about F(x).]
  1. V sends inputs x.
  2. P evaluates F, returns output y.
  3. V constructs a polynomial relating y to the last layer's input wires.
  4. V engages P in a sum-check [LFKN90], getting a claim about the second-to-last layer.
  5. V iterates with more sum-checks until it gets a claim about the inputs, which it can check directly.
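The sum-check interaction above can be made concrete in code. The following is a toy, non-cryptographic Python sketch (not Zebra's implementation): `g` is a hypothetical example polynomial of degree at most 2 per variable, so the prover can describe each round's polynomial by its values at 0, 1, and 2.

```python
import itertools
import random

p = 2**31 - 1  # hypothetical field modulus

def g(v):
    # hypothetical example polynomial; degree <= 2 in each variable
    x, y, z = v
    return (x * y + 2 * y * z * z + x + 3) % p

def interp3(vals, r):
    # evaluate, at point r, the degree-<=2 polynomial whose values
    # at 0, 1, 2 are vals[0], vals[1], vals[2]
    inv2 = pow(2, p - 2, p)
    return (vals[0] * (r - 1) * (r - 2) % p * inv2
            - vals[1] * r * (r - 2)
            + vals[2] * r * (r - 1) % p * inv2) % p

def sumcheck(m=3):
    # P claims H = sum of g over {0,1}^m; V checks the claim round by round
    claim = sum(g(b) for b in itertools.product((0, 1), repeat=m)) % p
    rs, H = [], claim
    for i in range(m):
        # prover: evaluate this round's polynomial s_i at k = 0, 1, 2
        s = [sum(g(tuple(rs) + (k,) + tail)
                 for tail in itertools.product((0, 1), repeat=m - i - 1)) % p
             for k in (0, 1, 2)]
        # verifier: s_i(0) + s_i(1) must equal the running claim
        assert (s[0] + s[1]) % p == H
        r = random.randrange(p)  # V's random coin
        H = interp3(s, r)
        rs.append(r)
    # final check: V evaluates g itself at the random point
    assert g(tuple(rs)) == H
    return claim
```

In the real protocol V cannot evaluate F's wiring itself at the final step; the iteration over layers (step 5 above) reduces that check to one on the inputs.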

  14–16. Zebra builds on IPs of GKR [GKR08, CMT12, VSBW13]:
  Soundness error ∝ 1/p.
  Cost to execute F directly: O(depth · width).
  V's sequential running time: O(depth · log width + |x| + |y|) (assuming precomputed queries).
  P's sequential running time: O(depth · width · log width).

  17–20. Extracting parallelism in Zebra. When P executes the AC, layers are sequential, but all gates at a layer can be executed in parallel. Proving step: can V and P interact about all of F's layers at once? No: V must ask its questions in order, or soundness is lost. But there is still parallelism to be extracted...

  21–28. Extracting parallelism in Zebra's P. V questions P about F(x1)'s output layer; simultaneously, P returns F(x2). V then questions P about F(x1)'s next layer and F(x2)'s output layer; meanwhile, P returns F(x3). This process continues until V and P interact about every layer simultaneously, but for different computations; V and P can then complete one proof in each time step. [Diagram: staggered instances F(x1) through F(x8).]

  29–30. Extracting parallelism in Zebra's P with pipelining. [Diagram: input x enters P; sub-provers for layers d−1 through 0 each exchange queries and responses with V; output y emerges.] This approach is just a standard hardware technique, pipelining; it is possible because the protocol is naturally staged. There are other opportunities to leverage the protocol's structure.
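The pipeline's staging can be sketched with a toy schedule (a model, not Zebra's hardware): with d layers, the sub-prover for layer (t − i) serves instance i at time step t, so after a d-step fill the pipeline finishes one proof per step.

```python
def pipeline_schedule(d, n_instances, n_steps):
    # model: instance i is questioned about layer (t - i) at time step t
    finished = []
    for t in range(n_steps):
        busy = []
        for i in range(n_instances):
            layer = t - i
            if 0 <= layer < d:
                busy.append((i, layer))      # layer's sub-prover serves instance i
            if layer == d - 1:
                finished.append((t, i))      # instance i's proof completes at step t
        # no two instances occupy the same layer's sub-prover in one step
        assert len({layer for _, layer in busy}) == len(busy)
    return finished

done = pipeline_schedule(d=4, n_instances=8, n_steps=12)
# the first proof completes after d steps; afterwards, one per step
```

The schedule confirms the claim in the slides: once full, every sub-prover is busy at every step, each on a different instance.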

  31–41. Per-layer computations. For each sum-check round, P sums over each gate in a layer, evaluating

      H[k] = Σ_{g ∈ layer} δ(g, k),   k ∈ {0, 1, 2}

  In software:

      // compute H[0], H[1], H[2]
      for k ∈ {0, 1, 2}:
          H[k] ← 0
          for g ∈ layer:
              H[k] ← H[k] + δ(g, k)   // δ uses state[g]
      // update lookup table with V's random coin
      for g ∈ layer:
          state[g] ← δ(g, r_j)

  In hardware: one gate prover per gate evaluates δ(g, 0), δ(g, 1), δ(g, 2), and δ(g, r_j) in parallel; an adder tree sums the gate provers' outputs; and each gate prover holds state[g] locally, so no RAM is needed.
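The software loop above can be written as runnable Python. Here `delta` is a hypothetical stand-in for the protocol's δ; only its shape is kept (a small per-gate computation that reads gate g's local state):

```python
p = 2**61 - 1  # hypothetical field modulus

def delta(g, k, state):
    # hypothetical stand-in for the protocol's per-gate polynomial delta(g, k);
    # like the real one, it reads only gate g's local state
    return state[g] * ((k + g) % p + 1) % p

def sumcheck_round(layer, state, r_j):
    # compute H[0], H[1], H[2]: P's message for this round,
    # one sum over the layer's gates per evaluation point
    H = [sum(delta(g, k, state) for g in layer) % p for k in (0, 1, 2)]
    # then update each gate's lookup-table entry with V's random coin r_j
    for g in layer:
        state[g] = delta(g, r_j, state)
    return H
```

The inner sums are exactly what the hardware's adder tree computes in parallel, and the `state` dictionary plays the role of the per-gate-prover registers.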

  42–44. Zebra's design approach:
  ✓ Extract parallelism: e.g., pipelined proving; parallel evaluation of δ by gate provers.
  ✓ Exploit locality, distributing data and control: e.g., no RAM (data is kept close to the places it is needed); latency-insensitive design (localized control).
  ✓ Reduce, reuse, recycle: e.g., computation (save energy by adding memoization to P); hardware (save chip area by reusing the same circuits).

  45–50. Architectural challenges. [Diagram: V sends input x to P; P returns output y and a proof that y = F(x); V retrieves encrypted precomputations E_k(pre_i) from storage.]
  Interaction between V and P requires a lot of bandwidth. ✗ V and P on a circuit board? Too much energy and circuit area. ✓ Zebra uses 3D integration.
  The protocol requires input-independent precomputation [VSBW13]. ✓ Zebra amortizes precomputations over many V-P pairs.
  Precomputations need secrecy and integrity. ✗ Give V trusted storage? The cost would be prohibitive. ✓ Zebra uses untrusted storage plus authenticated encryption.
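The "untrusted storage plus authenticated encryption" pattern can be sketched as follows. This is a toy illustration of the pattern only: the stream cipher is hand-rolled from SHA-256 so the sketch is self-contained, whereas a real design would use a standard AEAD, and the key and nonce handling shown here is hypothetical.

```python
import hashlib
import hmac

def _keystream(key, nonce, n):
    # toy stream cipher: SHA-256 in counter mode (illustration only)
    out, ctr = b"", 0
    while len(out) < n:
        out += hashlib.sha256(key + nonce + ctr.to_bytes(8, "big")).digest()
        ctr += 1
    return out[:n]

def seal(enc_key, mac_key, nonce, precomputation):
    # encrypt-then-MAC; the sealed blob can live on untrusted storage
    ks = _keystream(enc_key, nonce, len(precomputation))
    ct = bytes(a ^ b for a, b in zip(precomputation, ks))
    tag = hmac.new(mac_key, nonce + ct, hashlib.sha256).digest()
    return nonce + ct + tag

def open_sealed(enc_key, mac_key, blob):
    # verify integrity before decrypting; nonce is 16 bytes, tag is 32
    nonce, ct, tag = blob[:16], blob[16:-32], blob[-32:]
    want = hmac.new(mac_key, nonce + ct, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, want):
        raise ValueError("precomputation failed authentication")
    ks = _keystream(enc_key, nonce, len(ct))
    return bytes(a ^ b for a, b in zip(ct, ks))
```

Any modification of the stored blob changes the MAC, so a tampered precomputation is rejected before V ever uses it.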

  51. Implementation. Zebra's implementation includes:
  • a compiler that produces synthesizable Verilog for P
  • two V implementations: hardware (Verilog) and software (C++)
  • a library to generate V's precomputations
  • Verilog simulator extensions to model a software or hardware V's interactions with P

  52–53. ...and it seemed to work really well! Zebra can produce 10k–100k proofs per second, while existing systems take tens of seconds per proof! But that's not a serious evaluation...

  54–57. Evaluation method. [Diagram: V-and-P (input x, output y, proof that y = F(x)) vs. a direct implementation of F.]
  Baseline: a direct implementation of F in the same technology as V.
  Metrics: energy and chip size per throughput (discussed in paper).
  Measurements: based on circuit synthesis and simulation, published chip designs, and CMOS scaling models. Charged costs: V, P, and communication; retrieving and decrypting precomputations; PRNG; the operator communicating with V.
  Constraints: trusted fab = 350 nm (circa 1997, Pentium II); untrusted fab = 7 nm (≈2017 [TSMC]), roughly a 20-year gap between trusted and untrusted fabs; 200 mm² max chip area; 150 W max total power.

  58–59. Application #1: number-theoretic transform (NTT), a Fourier transform over F_p, widely used, e.g., in computer algebra. [Plot: ratio of baseline energy to Zebra energy (higher is better), y-axis 0.1 to 3, vs. log2(NTT size) from 6 to 13.]
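For concreteness, here is a naive O(n²) NTT sketch over a toy field. The parameters are hypothetical (chosen so that n divides p − 1); Zebra's actual NTT circuit and parameters are not shown here.

```python
# hypothetical toy parameters: prime p with n dividing p - 1
p, n = 337, 8
# an element of multiplicative order exactly n = 8; since 8 is a power of 2,
# checking x^8 == 1 and x^4 != 1 suffices
w = next(x for x in range(2, p)
         if pow(x, n, p) == 1 and pow(x, n // 2, p) != 1)

def ntt(a):
    # forward transform: y[k] = sum_j a[j] * w^(jk) mod p
    return [sum(a[j] * pow(w, j * k, p) for j in range(n)) % p
            for k in range(n)]

def intt(y):
    # inverse transform, scaled by n^-1 mod p
    inv_n, inv_w = pow(n, p - 2, p), pow(w, p - 2, p)
    return [sum(y[k] * pow(inv_w, j * k, p) for k in range(n)) * inv_n % p
            for j in range(n)]
```

A production NTT would use an O(n log n) butterfly structure; the point here is only the definition of the transform that the benchmark computes.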

  60–61. Application #2: Curve25519 point multiplication. Curve25519 is a commonly used elliptic curve; point multiplication is a primitive, e.g., for ECDH. [Plot: ratio of baseline energy to Zebra energy (higher is better), y-axis 0.1 to 3, vs. parallel Curve25519 point multiplications from 84 to 1147.]
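To show what the point-multiplication primitive computes, here is a toy double-and-add sketch on a hypothetical short Weierstrass curve over a tiny field. This is NOT Curve25519 itself, which is a Montgomery curve and is implemented with x-only, constant-time ladders; the group law and scalar loop are the same idea.

```python
# hypothetical toy curve: y^2 = x^3 + 2x + 3 over F_97
p, a, b = 97, 2, 3

def ec_add(P, Q):
    # elliptic-curve group law; None represents the point at infinity
    if P is None:
        return Q
    if Q is None:
        return P
    x1, y1 = P
    x2, y2 = Q
    if x1 == x2 and (y1 + y2) % p == 0:
        return None                       # P + (-P) = infinity
    if P == Q:
        lam = (3 * x1 * x1 + a) * pow(2 * y1, p - 2, p) % p
    else:
        lam = (y2 - y1) * pow(x2 - x1, p - 2, p) % p
    x3 = (lam * lam - x1 - x2) % p
    return (x3, (lam * (x1 - x3) - y1) % p)

def scalar_mult(k, P):
    # double-and-add, least-significant bit first
    R = None
    while k:
        if k & 1:
            R = ec_add(R, P)
        P = ec_add(P, P)
        k >>= 1
    return R
```

The benchmarked workload runs many such scalar multiplications in parallel, which is why the x-axis of the plot counts parallel instances.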

  62. A qualified success. Zebra: a hardware design that saves costs... sometimes.

  63–64. Summary of Zebra's applicability. Applies to IPs, but not arguments:
  1. Computation F must have a layered, shallow, deterministic AC.
  2. There must be a wide gap between the cutting-edge fab (for P) and the trusted fab (for V).
  3. Zebra amortizes precomputations over many instances.
  4. Computation F must be very large for V to save work.
  5. Computation F must be efficient as an arithmetic circuit.

  65–68. Arguments versus IPs, redux.

      Design principle         IPs [GKR08, CMT12, VSBW13]   Arguments [GGPR13, SBVBPW13, PGHR13, BCTV14]
      Extract parallelism      ✓                            ✓
      Exploit locality         ✓                            ✗
      Reduce, reuse, recycle   ✓                            ✗

  Argument protocols seem unfriendly to hardware: P computes over the entire AC at once ⇒ it needs RAM; P does crypto for every gate in the AC ⇒ it needs special crypto circuits. ...but we hope these issues are surmountable!

  69. Summary of Zebra's applicability:
  1. Computation F must have a layered, shallow, deterministic AC.
  2. There must be a wide gap between the cutting-edge fab (for P) and the trusted fab (for V).
  3. Zebra amortizes precomputations over many instances.
  4. Computation F must be very large for V to save work.
  5. Computation F must be efficient as an arithmetic circuit.
  Common to essentially all built proof systems.
