Zebra builds on IPs of GKR [GKR08, CMT12, VSBW13]

F must be expressed as a layered arithmetic circuit.

1. V sends inputs x.
2. P evaluates, returns output y.
3. V constructs a polynomial relating y to the last layer’s input wires.
4. V engages P in a sum-check [LFKN90], gets a claim about the second-to-last layer (sketched below).
5. V iterates (more sum-checks), gets a claim about the inputs, which it can check directly.

Soundness error ∝ 1/p (p is the field size).

Cost to execute F directly: O(depth · width)
V’s sequential running time: O(depth · log width + |x| + |y|) (assuming precomputed queries)
P’s sequential running time: O(depth · width · log width)
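To make steps 4 and 5 concrete, here is a minimal sum-check sketch in Python. This is an illustration, not Zebra’s code: the prime, the toy polynomial f, and all names are hypothetical, and the real protocol’s per-round polynomial is built from each layer’s wiring predicates. Each round, P sends H[0], H[1], H[2] (the restriction of the remaining sum to the current variable); V checks H[0] + H[1] against its running claim and replies with a random coin.

    # Toy sum-check [LFKN90] over F_p: P convinces V that the sum of f over
    # {0,1}^n equals a claim. Per-round polynomials have degree <= 2, so P
    # sends the three values H[0], H[1], H[2] each round.
    import itertools
    import random

    P_MOD = 2**31 - 1  # a Mersenne prime, standing in for the protocol's field

    def f(xs):
        # example polynomial with degree <= 2 per variable (hypothetical)
        x0, x1, x2 = xs
        return (x0 * x1 + 2 * x1 * x2 + x0) % P_MOD

    def round_poly(f, n, prefix, k):
        """H[k]: sum of f with bound vars `prefix`, current var = k, boolean suffix."""
        rest = n - len(prefix) - 1
        return sum(f(list(prefix) + [k] + list(tail))
                   for tail in itertools.product((0, 1), repeat=rest)) % P_MOD

    def eval_deg2(h, r):
        """Evaluate the degree-2 poly through (0,h[0]), (1,h[1]), (2,h[2]) at r."""
        inv2 = pow(2, P_MOD - 2, P_MOD)
        l0 = (r - 1) * (r - 2) % P_MOD * inv2 % P_MOD
        l1 = r * (r - 2) % P_MOD * (P_MOD - 1) % P_MOD
        l2 = r * (r - 1) % P_MOD * inv2 % P_MOD
        return (h[0] * l0 + h[1] * l1 + h[2] * l2) % P_MOD

    def sum_check(f, n):
        claim = sum(f(list(xs)) for xs in itertools.product((0, 1), repeat=n)) % P_MOD
        prefix = []
        for _ in range(n):
            h = [round_poly(f, n, prefix, k) for k in (0, 1, 2)]  # P's message
            assert (h[0] + h[1]) % P_MOD == claim                 # V's consistency check
            r = random.randrange(P_MOD)                           # V's random coin
            claim = eval_deg2(h, r)
            prefix.append(r)
        # In GKR this final check becomes a claim about the next layer down;
        # here V simply evaluates f at the random point itself.
        assert f(prefix) == claim

    sum_check(f, 3)
    print("sum-check accepted")

V’s work per round is constant; the expensive part (the sums inside round_poly) is P’s, which is exactly what Zebra parallelizes below.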
Extracting parallelism in Zebra

P executing AC: layers are sequential, but all gates at a layer can be executed in parallel.

Proving step: Can V and P interact about all of F’s layers at once? No. V must ask questions in order, or soundness is lost.

But: there is still parallelism to be extracted. . .
Extracting parallelism in Zebra’s P

V questions P about F(x1)’s output layer. Simultaneously, P returns F(x2).

V questions P about F(x1)’s next layer, and F(x2)’s output layer. Meanwhile, P returns F(x3).

This process continues until V and P interact about every layer simultaneously, but for different computations. V and P can complete one proof in each time step.
Extracting parallelism in Zebra’s P with pipelining

[Figure: V exchanges queries and responses with a chain of sub-provers, one per layer (layer d−1 down to layer 0); input x enters at the top and output y leaves at the bottom.]

This approach is just a standard hardware technique, pipelining; it is possible because the protocol is naturally staged. There are other opportunities to leverage the protocol’s structure.
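The schedule is easy to visualize with a toy simulation (Python; DEPTH, N_INSTANCES, and the stage numbering are hypothetical, chosen only for illustration): at time step t, the stage for layer s works on instance t − s, so once the pipeline fills, one proof completes per step.

    # Toy simulation of the pipelined proving schedule: one sub-prover per
    # layer; at step t, pipeline stage s works on computation instance t - s.
    DEPTH = 4        # circuit layers, i.e., pipeline stages (hypothetical)
    N_INSTANCES = 8  # computations F(x_1), ..., F(x_8)

    for t in range(N_INSTANCES + DEPTH - 1):
        active = [f"stage {s}: F(x_{t - s + 1})"
                  for s in range(DEPTH) if 0 <= t - s < N_INSTANCES]
        done = t - DEPTH + 1  # instance whose last layer was just proved
        tail = f"  => proof of F(x_{done + 1}) done" if done >= 0 else ""
        print(f"step {t}: " + ", ".join(active) + tail)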
Per-layer computations

For each sum-check round, P sums over each gate in a layer, evaluating H[k] for k ∈ {0, 1, 2}:

    H[k] = Σ_{g ∈ layer} δ(g, k)

In software (a runnable transcription follows below):

    // compute H[0], H[1], H[2]
    for k ∈ {0, 1, 2}:
        H[k] ← 0
        for g ∈ layer:
            H[k] ← H[k] + δ(g, k)    // δ uses state[g]

    // update lookup table with V’s random coin
    for g ∈ layer:
        state[g] ← δ(g, r_j)

In hardware: [Figure: one gate prover per gate; each evaluates δ(g, 0), δ(g, 1), δ(g, 2), and then δ(g, r_j), keeping state[g] in a local register; an adder tree sums the gate provers’ outputs, so no RAM is needed.]
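Here is the same loop as runnable Python (a sketch: the modulus and the toy δ are placeholders; in Zebra, δ is built from the layer’s wiring and gate values):

    # Runnable sketch of P's per-layer work in one sum-check round: compute
    # H[0..2], then fold V's random coin r_j into the per-gate table `state`.
    P_MOD = 2**31 - 1

    def delta(g, k, state):
        # toy stand-in for the per-gate polynomial; only reads state[g]
        return state[g] * (k + g + 1) % P_MOD

    def sum_check_round(state, r_j):
        layer = range(len(state))
        H = [0, 0, 0]
        for k in (0, 1, 2):          # compute H[0], H[1], H[2]
            for g in layer:
                H[k] = (H[k] + delta(g, k, state)) % P_MOD
        for g in layer:              # update lookup table with V's coin
            state[g] = delta(g, r_j, state)
        return H

    state = [3, 1, 4, 1]             # per-gate state for a 4-gate layer (toy values)
    print(sum_check_round(state, r_j=12345))

The hardware version computes each gate prover’s δ values concurrently and replaces the inner loops with the adder tree.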
Zebra’s design approach

✓ Extract parallelism
    e.g., pipelined proving
    e.g., parallel evaluation of δ by gate provers

✓ Exploit locality: distribute data and control
    e.g., no RAM: data is kept close to the places it is needed
    e.g., latency-insensitive design: localized control

✓ Reduce, reuse, recycle
    e.g., computation: save energy by adding memoization to P (see the sketch below)
    e.g., hardware: save chip area by reusing the same circuits
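On the memoization point, a generic sketch (this illustrates the general technique, not Zebra’s specific optimization; the cached function is hypothetical): repeated subexpressions are computed once and reused.

    # Generic memoization: pay for each distinct evaluation once, reuse after.
    from functools import lru_cache

    P_MOD = 2**31 - 1

    @lru_cache(maxsize=None)
    def subexpr(g, r):
        # stand-in for an expensive term that recurs across sum-check rounds
        return pow(g + 2, r, P_MOD)

    for _ in range(3):              # the same requests repeat across rounds
        for g in range(4):
            subexpr(g, 123456789)
    print(subexpr.cache_info())     # 4 misses, 8 hits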
Architectural challenges

Interaction between V and P requires a lot of bandwidth.
✗ V and P on circuit board? Too much energy, circuit area.
✓ Zebra uses 3D integration.

Protocol requires input-independent precomputation [VSBW13].
✓ Zebra amortizes precomputations over many V-P pairs.

Precomputations need secrecy, integrity.
✗ Give V trusted storage? Cost would be prohibitive.
✓ Zebra uses untrusted storage + authenticated encryption (sketched below).

[Figure: V receives input x and encrypted precomputations E_k(pre_i), interacts with P, and outputs y together with a proof that y = F(x).]
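To illustrate the last point, a minimal sketch using ChaCha20-Poly1305 from the Python `cryptography` package (an assumption for illustration; Zebra’s actual cipher, key provisioning, and precomputation format are not specified here): V keeps E_k(pre_i) in untrusted storage and detects tampering on decryption.

    # Sketch: store precomputations in untrusted memory under authenticated
    # encryption; decryption fails loudly if the stored blob was tampered with.
    import os
    from cryptography.hazmat.primitives.ciphers.aead import ChaCha20Poly1305
    from cryptography.exceptions import InvalidTag

    key = ChaCha20Poly1305.generate_key()  # provisioned into V at (trusted) fab time
    aead = ChaCha20Poly1305(key)

    def store(pre_i: bytes, index: int):
        """Encrypt precomputation i for untrusted storage; bind it to its index."""
        nonce = os.urandom(12)
        return nonce, aead.encrypt(nonce, pre_i, index.to_bytes(8, "big"))

    def load(nonce: bytes, blob: bytes, index: int) -> bytes:
        """Decrypt and authenticate; raises InvalidTag if storage misbehaved."""
        return aead.decrypt(nonce, blob, index.to_bytes(8, "big"))

    nonce, blob = store(b"precomputed queries for instance 7", 7)
    assert load(nonce, blob, 7) == b"precomputed queries for instance 7"
    try:
        load(nonce, blob, 8)               # wrong index -> authentication failure
    except InvalidTag:
        print("tampering (or wrong index) detected")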
Implementation

Zebra’s implementation includes:
• a compiler that produces synthesizable Verilog for P
• two V implementations: hardware (Verilog) and software (C++)
• a library to generate V’s precomputations
• Verilog simulator extensions to model software or hardware V’s interactions with P
. . . and it seemed to work really well!

Zebra can produce 10k–100k proofs per second, while existing systems take tens of seconds per proof!

But that’s not a serious evaluation. . .
Evaluation method

[Figure: Zebra (V and P, producing y and a proof that y = F(x)) vs. a direct implementation of F.]

Baseline: direct implementation of F in the same technology as V.

Metrics: energy and chip size per throughput (discussed in paper).

Measurements: based on circuit synthesis and simulation, published chip designs, and CMOS scaling models. Charge for V, P, and communication; retrieving and decrypting precomputations; PRNG; the operator communicating with V.

Constraints: trusted fab = 350 nm; untrusted fab = 7 nm; 200 mm² max chip area; 150 W max total power. (350 nm: 1997, Pentium II; 7 nm: ≈2017 [TSMC]; that is, an ≈20-year gap between trusted and untrusted fab.)
Application #1: number theoretic transform

NTT: a Fourier transform over F_p. Widely used, e.g., in computer algebra.
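For reference, a minimal NTT in Python (illustrative parameters: the prime 257 and an 8-point transform are toy choices, not the sizes evaluated below): it is the DFT with exp(−2πi/n) replaced by a primitive n-th root of unity in F_p. Naive O(n²) for clarity.

    # Minimal number theoretic transform: a DFT over F_p, with a primitive
    # n-th root of unity omega in place of the complex root of unity.
    P_MOD = 257  # prime with 256 | P_MOD - 1, so the needed roots exist

    def ntt(a, omega):
        n = len(a)
        return [sum(a[j] * pow(omega, i * j, P_MOD) for j in range(n)) % P_MOD
                for i in range(n)]

    def intt(a, omega):
        n = len(a)
        inv_n = pow(n, P_MOD - 2, P_MOD)
        rev = ntt(a, pow(omega, P_MOD - 2, P_MOD))  # inverse uses omega^-1
        return [x * inv_n % P_MOD for x in rev]

    # 3 is a primitive root mod 257, so 3^((257-1)/8) is a primitive 8th root
    n = 8
    omega = pow(3, (P_MOD - 1) // n, P_MOD)
    x = [1, 2, 3, 4, 0, 0, 0, 0]
    assert intt(ntt(x, omega), omega) == x
    print(ntt(x, omega))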
Application #1: number theoretic transform

[Plot: ratio of baseline energy to Zebra energy, baseline vs. Zebra (higher is better), against log₂(NTT size) from 6 to 13; y-axis spans 0.1 to 3.]
Application #2: Curve25519 point multiplication

Curve25519: a commonly used elliptic curve. Point multiplication: a primitive, e.g., for ECDH.
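For reference, a compact sketch of X25519 scalar multiplication via the Montgomery ladder, following RFC 7748 (pedagogical and not constant-time, unlike a production implementation or Zebra’s circuit; the demo scalars are arbitrary):

    # X25519 on Curve25519 (RFC 7748): x-only Montgomery ladder.
    P = 2**255 - 19
    A24 = 121665  # (486662 - 2) / 4, the curve constant used by the ladder

    def x25519(k: int, u: int) -> int:
        k &= ~7              # clamp: clear the 3 low bits
        k &= (1 << 255) - 1  # clamp: clear bit 255
        k |= 1 << 254        # clamp: set bit 254
        x1 = u % P
        x2, z2, x3, z3 = 1, 0, x1, 1
        swap = 0
        for t in reversed(range(255)):
            bit = (k >> t) & 1
            if swap ^ bit:   # conditional swap (not constant-time here)
                x2, x3, z2, z3 = x3, x2, z3, z2
            swap = bit
            # one ladder step: differential double-and-add
            a, b = (x2 + z2) % P, (x2 - z2) % P
            c, d = (x3 + z3) % P, (x3 - z3) % P
            aa, bb = a * a % P, b * b % P
            e = (aa - bb) % P
            da, cb = d * a % P, c * b % P
            x3 = (da + cb) ** 2 % P
            z3 = x1 * (da - cb) ** 2 % P
            x2 = aa * bb % P
            z2 = e * (aa + A24 * e) % P
        if swap:
            x2, z2 = x3, z3
        return x2 * pow(z2, P - 2, P) % P  # affine x = X / Z

    # Diffie-Hellman sanity check: both parties derive the same shared secret
    a, b = 0x123456789ABCDEF, 0xFEDCBA987654321
    assert x25519(a, x25519(b, 9)) == x25519(b, x25519(a, 9))
    print("shared secret matches")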
Application #2: Curve25519 point multiplication

[Plot: ratio of baseline energy to Zebra energy, baseline vs. Zebra (higher is better), against the number of parallel Curve25519 point multiplications (84, 170, 340, 682, 1147); y-axis spans 0.1 to 3.]
A qualified success

Zebra: a hardware design that saves costs. . . sometimes.
Summary of Zebra’s applicability

1. Computation F must have a layered, shallow, deterministic AC (this applies to IPs, but not arguments)
2. Must have a wide gap between cutting-edge fab (for P) and trusted fab (for V)
3. Amortizes precomputations over many instances
4. Computation F must be very large for V to save work
5. Computation F must be efficient as an arithmetic circuit
Arguments versus IPs, redux

Design principle           IPs [GKR08, CMT12, VSBW13]   Arguments [GGPR13, SBVBPW13, PGHR13, BCTV14]
Extract parallelism        ✓                            ✓
Exploit locality           ✓                            ✗
Reduce, reuse, recycle     ✓                            ✗

Argument protocols seem unfriendly to hardware:
P computes over the entire AC at once ⇒ need RAM
P does crypto for every gate in the AC ⇒ special crypto circuits

. . . but we hope these issues are surmountable!
Summary of Zebra’s applicability

1. Computation F must have a layered, shallow, deterministic AC
2. Must have a wide gap between cutting-edge fab (for P) and trusted fab (for V)
3. Amortizes precomputations over many instances
4. Computation F must be very large for V to save work
5. Computation F must be efficient as an arithmetic circuit

Requirements 3–5 are common to essentially all built proof systems.