Cortex-A7 and the original Cortex-A8 are in-order, issuing at most 2 insns at once. ⇒ Simpler, cheaper, more energy-efficient. More than one billion Cortex-A7 devices have been sold. Popular in low-cost and mid-range smartphones: Mobiistar Buddy, Mobiistar Kool, Mobiistar LAI Z1, Samsung Galaxy J1 Ace Neo, etc. Also used in typical TV boxes, Sony SmartWatch 3, Samsung Gear 2, Raspberry Pi 2, etc.

Basic ARM insn set uses 16 32-bit registers: 512 bits.
Optional NEON extension uses 16 128-bit registers: 2048 bits.
Cortex-A7 and Cortex-A8 (and Cortex-A15 and Cortex-A17 and Qualcomm Scorpion and Qualcomm Krait) always have NEON insns.
Cortex-A5 and Cortex-A9 sometimes have NEON insns.

NEON crypto

2012 Bernstein–Schwabe "NEON crypto" software: new Cortex-A8 speed records for various crypto primitives. e.g. Curve25519 ECDH: 460200 cycles on Cortex-A8-fast, 498284 cycles on Cortex-A8-slow.

Compare to OpenSSL cycles on Cortex-A8-slow for NIST P-256 ECDH: 9 million for OpenSSL 0.9.8k; 4.8 million for OpenSSL 1.0.1c; 3.9 million for OpenSSL 1.0.2j.
NEON instructions

4x a = b + c
is a vector of 4 32-bit additions:
a[0] = b[0] + c[0];
a[1] = b[1] + c[1];
a[2] = b[2] + c[2];
a[3] = b[3] + c[3].

Cortex-A8 NEON arithmetic unit can do this every cycle.
Stage N2: reads b and c.
Stage N3: performs addition.
Stage N4: a is ready.
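A minimal sketch (mine, not the slides' qhasm-style code) of the same 4-lane 32-bit addition written with ARM NEON intrinsics; vld1q_u32, vaddq_u32 and vst1q_u32 are the standard GCC/Clang intrinsics for the corresponding load, add and store insns.

#include <arm_neon.h>
#include <stdio.h>

int main(void) {
  uint32_t b[4] = {1, 2, 3, 4}, c[4] = {10, 20, 30, 40}, a[4];
  uint32x4_t vb = vld1q_u32(b);      /* load 4 32-bit lanes of b */
  uint32x4_t vc = vld1q_u32(c);      /* load 4 32-bit lanes of c */
  uint32x4_t va = vaddq_u32(vb, vc); /* a[i] = b[i] + c[i] for i = 0..3 */
  vst1q_u32(a, va);
  printf("%u %u %u %u\n", a[0], a[1], a[2], a[3]);
  return 0;
}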
4x a = b - c
is a vector of 4 32-bit subtractions:
a[0] = b[0] - c[0];
a[1] = b[1] - c[1];
a[2] = b[2] - c[2];
a[3] = b[3] - c[3].

Stage N1: reads c.
Stage N2: reads b, negates c.
Stage N3: performs addition.
Stage N4: a is ready.

Also logic insns, shifts, etc.
Multiplication insn:
c[0,1] = a[0] signed* b[0];
c[2,3] = a[1] signed* b[1].
Two cycles on Cortex-A8.

Multiply-accumulate insn:
c[0,1] += a[0] signed* b[0];
c[2,3] += a[1] signed* b[1].
Also two cycles on Cortex-A8.

Stage N1: reads b.
Stage N2: reads a.
Stage N3: reads c if accumulate.
...
Stage N8: c is ready.
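A sketch of these two insns using intrinsics, under the assumption that the slides' "signed*" is the long signed multiply VMULL.S32 / VMLAL.S32: each 32-bit lane pair of a and b produces a 64-bit product, and the accumulate form adds into the existing 64-bit lanes.

#include <arm_neon.h>

/* c[0,1] = a[0] signed* b[0]; c[2,3] = a[1] signed* b[1]
   then     c += e signed* f  lane-wise (multiply-accumulate). */
int64x2_t mul_then_mla(int32x2_t a, int32x2_t b, int32x2_t e, int32x2_t f) {
  int64x2_t c = vmull_s32(a, b); /* two 32x32 -> 64-bit products  */
  c = vmlal_s32(c, e, f);        /* accumulate two more products  */
  return c;
}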
Typical sequence of three insns:
c[0,1] = a[0] signed* b[0];
c[2,3] = a[1] signed* b[1];
c[0,1] += e[2] signed* f[2];
c[2,3] += e[3] signed* f[3];
c[0,1] += g[0] signed* h[2];
c[2,3] += g[1] signed* h[3].

Cortex-A8 recognizes this pattern.
Reads c in N6 instead of N3.
[Pipeline diagram: cycles 1–12 versus NEON stages N1–N8 for this three-insn sequence, showing when the operands b, a, f, e, h, g are read, when the multiplications (×) and accumulations (+) execute, and when each half of c becomes ready.]
NEON also has load/store insns and permutation insns: e.g., r = s[1] t[2] r[2,3].

Cortex-A8 has a separate NEON load/store unit that runs in parallel with the NEON arithmetic unit. Arithmetic is typically the most important bottleneck: can often schedule insns to hide loads/stores/perms.

Cortex-A7 is different: one unit handling all NEON insns.
Curve25519 on NEON

Radix 2^25.5: Use small integers (f0, f1, f2, f3, f4, f5, f6, f7, f8, f9) to represent the integer
f = f0 + 2^26 f1 + 2^51 f2 + 2^77 f3 + 2^102 f4 + 2^128 f5 + 2^153 f6 + 2^179 f7 + 2^204 f8 + 2^230 f9
modulo 2^255 − 19.

Unscaled polynomial view: f is the value at 2^25.5 of the poly
f0 t^0 + 2^0.5 f1 t^1 + f2 t^2 + 2^0.5 f3 t^3 + f4 t^4 + 2^0.5 f5 t^5 + f6 t^6 + 2^0.5 f7 t^7 + f8 t^8 + 2^0.5 f9 t^9.
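A tiny check (mine, not from the slides) that the limb weights listed above are exactly ceil(25.5·i), i.e. the limbs alternate between 26 and 25 bits:

#include <stdio.h>

/* Prints the exponents 0, 26, 51, 77, 102, 128, 153, 179, 204, 230. */
int main(void) {
  for (int i = 0; i < 10; i++)
    printf("limb %d has weight 2^%d\n", i, (51 * i + 1) / 2); /* ceil(25.5*i) */
  return 0;
}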
h ≡ f g (mod 2^255 − 19) where
h0 = f0 g0 + 38 f1 g9 + 19 f2 g8 + 38 f3 g7 + 19 f4 g6 + 38 f5 g5 + 19 f6 g4 + 38 f7 g3 + 19 f8 g2 + 38 f9 g1;
h1 = f0 g1 + f1 g0 + 19 f2 g9 + 19 f3 g8 + 19 f4 g7 + 19 f5 g6 + 19 f6 g5 + 19 f7 g4 + 19 f8 g3 + 19 f9 g2;
h2 = f0 g2 + 2 f1 g1 + f2 g0 + 38 f3 g9 + 19 f4 g8 + 38 f5 g7 + 19 f6 g6 + 38 f7 g5 + 19 f8 g4 + 38 f9 g3;
h3 = f0 g3 + f1 g2 + f2 g1 + f3 g0 + 19 f4 g9 + 19 f5 g8 + 19 f6 g7 + 19 f7 g6 + 19 f8 g5 + 19 f9 g4;
h4 = f0 g4 + 2 f1 g3 + f2 g2 + 2 f3 g1 + f4 g0 + 38 f5 g9 + 19 f6 g8 + 38 f7 g7 + 19 f8 g6 + 38 f9 g5;
h5 = f0 g5 + f1 g4 + f2 g3 + f3 g2 + f4 g1 + f5 g0 + 19 f6 g9 + 19 f7 g8 + 19 f8 g7 + 19 f9 g6;
h6 = f0 g6 + 2 f1 g5 + f2 g4 + 2 f3 g3 + f4 g2 + 2 f5 g1 + f6 g0 + 38 f7 g9 + 19 f8 g8 + 38 f9 g7;
h7 = f0 g7 + f1 g6 + f2 g5 + f3 g4 + f4 g3 + f5 g2 + f6 g1 + f7 g0 + 19 f8 g9 + 19 f9 g8;
h8 = f0 g8 + 2 f1 g7 + f2 g6 + 2 f3 g5 + f4 g4 + 2 f5 g3 + f6 g2 + 2 f7 g1 + f8 g0 + 38 f9 g9;
h9 = f0 g9 + f1 g8 + f2 g7 + f3 g6 + f4 g5 + f5 g4 + f6 g3 + f7 g2 + f8 g1 + f9 g0.

Proof: multiply polys mod t^10 − 19.
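The ten formulas follow a simple rule, which a compact sketch (mine) makes explicit: f[i]*g[j] lands in h[(i+j) mod 10], with a factor 2 when i and j are both odd (the 2^0.5 scaling on odd limbs), and a factor 19 when the product wraps past t^10 (since t^10 ≡ 19).

#include <stdint.h>

/* Reference form of the displayed formulas (before any precomputation). */
void mul_10x10(int64_t h[10], const int32_t f[10], const int32_t g[10]) {
  for (int k = 0; k < 10; k++) h[k] = 0;
  for (int i = 0; i < 10; i++) {
    for (int j = 0; j < 10; j++) {
      int64_t coeff = ((i & 1) && (j & 1)) ? 2 : 1; /* both limbs odd */
      if (i + j >= 10) coeff *= 19;                 /* wraparound: t^10 = 19 */
      h[(i + j) % 10] += coeff * (int64_t)f[i] * (int64_t)g[j];
    }
  }
}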
Each hi is a sum of ten products after precomputation of 2 f1, 2 f3, 2 f5, 2 f7, 2 f9, 19 g1, 19 g2, ..., 19 g9.

Each hi fits into 64 bits under reasonable limits on the sizes of f1, g1, ..., f9, g9.
(Analyze this very carefully: bugs can slip past most tests! See 2011 Brumley–Page–Barbosa–Vercauteren and several recent OpenSSL bugs.)

h0, h1, ... are too large for subsequent multiplication.
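A sketch (mine) of how the precomputation turns each hi into exactly ten products; h0 is shown as an example, since ten products map directly onto one multiply insn followed by multiply-accumulates.

#include <stdint.h>

/* With g19[j] = 19*g[j] and f2[i] = 2*f[i] (odd i) precomputed,
   h0 = f0 g0 + 38 f1 g9 + 19 f2 g8 + ... + 38 f9 g1 is ten products. */
int64_t compute_h0(const int32_t f[10], const int32_t g[10]) {
  int32_t g19[10], f2[10];
  for (int j = 1; j < 10; j++) g19[j] = 19 * g[j];
  for (int i = 1; i < 10; i += 2) f2[i] = 2 * f[i];
  return (int64_t)f[0] * g[0]
       + (int64_t)f2[1] * g19[9] + (int64_t)f[2] * g19[8]
       + (int64_t)f2[3] * g19[7] + (int64_t)f[4] * g19[6]
       + (int64_t)f2[5] * g19[5] + (int64_t)f[6] * g19[4]
       + (int64_t)f2[7] * g19[3] + (int64_t)f[8] * g19[2]
       + (int64_t)f2[9] * g19[1];
}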
Carry h0 → h1: i.e., replace (h0, h1) with (h0 mod 2^26, h1 + ⌊h0 / 2^26⌋). This makes h0 small.

Similarly for the other hi. Eventually all hi are small enough.

We actually use signed coeffs. Slightly more expensive carries (given details of insn set) but more room for ab + c^2 etc.

Some things we haven't tried yet:
• Mix signed, unsigned carries.
• Interleave reduction, carrying.
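A minimal sketch (mine) of one such carry step for 64-bit coefficients; the signed-coefficient variant used in the actual software would round rather than floor, but the shape is the same.

#include <stdint.h>

/* Replace (h0, h1) with (h0 mod 2^26, h1 + floor(h0 / 2^26)). */
void carry_h0_to_h1(int64_t *h0, int64_t *h1) {
  int64_t carry = *h0 >> 26; /* arithmetic shift = floor division on usual compilers */
  *h1 += carry;
  *h0 -= carry << 26;        /* h0 is now reduced to its bottom 26 bits */
}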
Minor challenge: pipelining. The result of each insn cannot be used until a few cycles later. Find an independent insn for the CPU to start working on while the first insn is in progress.

Sometimes it helps to adjust higher-level computations. Example: carries h0 → h1 → h2 → h3 → h4 → h5 → h6 → h7 → h8 → h9 → h0 → h1 have a long chain of dependencies.
Alternative: carry
h0 → h1 and h5 → h6;
h1 → h2 and h6 → h7;
h2 → h3 and h7 → h8;
h3 → h4 and h8 → h9;
h4 → h5 and h9 → h0;
h5 → h6 and h0 → h1.

12 carries instead of 11, but latency is much smaller. Now much easier to find independent insns for the CPU to handle in parallel.
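A sketch (mine, with helper names that are not the slides' API) of this interleaved carry schedule: each line pairs two independent carries, so the CPU can overlap them. It assumes the radix-2^25.5 limb sizes from earlier (even limbs 26 bits, odd limbs 25 bits) and that the carry out of h9 is scaled by 19, since 2^255 ≡ 19 in this field.

#include <stdint.h>

static void carry(int64_t h[10], int from, int to) {
  int bits = (from & 1) ? 25 : 26;   /* even limbs: 26 bits, odd limbs: 25 bits */
  int64_t c = h[from] >> bits;
  h[to] += (from == 9) ? 19 * c : c; /* h9 -> h0 wraps around modulo 2^255 - 19 */
  h[from] -= c << bits;
}

void carry_chain(int64_t h[10]) {
  carry(h, 0, 1); carry(h, 5, 6);
  carry(h, 1, 2); carry(h, 6, 7);
  carry(h, 2, 3); carry(h, 7, 8);
  carry(h, 3, 4); carry(h, 8, 9);
  carry(h, 4, 5); carry(h, 9, 0);
  carry(h, 5, 6); carry(h, 0, 1); /* 12 carries in total */
}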
Major challenge: vectorization. e.g. 4x a = b + c does 4 additions at once, but needs a particular arrangement of inputs and outputs.

On Cortex-A8, occasional permutations run in parallel with arithmetic, but frequent permutations would be a bottleneck.

On Cortex-A7, every operation costs cycles.
Often higher-level operations do a pair of mults in parallel: h = f g; h′ = f′ g′. Vectorize across those mults.

Merge f0, f1, ..., f9 and f′0, f′1, ..., f′9 into vectors (fi, f′i). Similarly (gi, g′i). Then compute (hi, h′i).

The computation fits naturally into NEON insns: e.g.,
c[0,1] = a[0] signed* b[0];
c[2,3] = a[1] signed* b[1].
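A sketch (mine) of that fit, using intrinsics: lane 0 carries the (f, g, h) computation and lane 1 carries (f′, g′, h′), so one long multiply produces one partial product for each of the two field multiplications at once.

#include <arm_neon.h>

/* Returns ( f_i*g_j , f'_i*g'_j ) as two 64-bit lanes. */
int64x2_t paired_partial_product(int32_t fi, int32_t fpi,
                                 int32_t gj, int32_t gpj) {
  int32x2_t a = vset_lane_s32(fpi, vdup_n_s32(fi), 1); /* packed (f_i, f'_i) */
  int32x2_t b = vset_lane_s32(gpj, vdup_n_s32(gj), 1); /* packed (g_j, g'_j) */
  return vmull_s32(a, b);
}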
Example: Recall C = X1 · X2; D = Y1 · Y2 inside the point-addition formulas for Edwards curves.

Example: Can compute 2P, 3P, 4P, 5P, 6P, 7P as
2P = P + P;
3P = 2P + P and 4P = 2P + 2P;
5P = 4P + P and 6P = 3P + 3P and 7P = 4P + 3P.

Example: Typical algorithms for fixed-base scalarmult have many parallel point adds.

Example: A busy server with a backlog of scalarmults can vectorize across them.
Beware a disadvantage of vectorizing across two mults: 256-bit f, f′, g, g′, h, h′ occupy at least 1536 bits, leaving very little room for temporary registers.

We use some loads and stores inside vectorized mulmul. Mostly invisible on Cortex-A8, but a bigger issue on Cortex-A7.
Some field ops are hard to pair inside a single scalarmult.

Example: At the end of ECDH, convert the fraction (X : Z) into Z^−1·X ∈ {0, 1, ..., p − 1}.
Easy, constant time: Z^−1 = Z^(p−2).
11 M + 254 S for p = 2^255 − 19:

z2 = z1^2^1
z8 = z2^2^2
z9 = z1*z8
z11 = z2*z9
z22 = z11^2^1
z_5_0 = z9*z22
z_10_5 = z_5_0^2^5
z_10_0 = z_10_5*z_5_0
z_20_10 = z_10_0^2^10
z_20_0 = z_20_10*z_10_0
z_40_20 = z_20_0^2^20
z_40_0 = z_40_20*z_20_0
z_50_10 = z_40_0^2^10
z_50_0 = z_50_10*z_10_0
z_100_50 = z_50_0^2^50
z_100_0 = z_100_50*z_50_0
z_200_100 = z_100_0^2^100
z_200_0 = z_200_100*z_100_0
z_250_50 = z_200_0^2^50
z_250_0 = z_250_50*z_50_0
z_255_5 = z_250_0^2^5
z_255_21 = z_255_5*z11
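The same chain as C, as a sketch under stated assumptions: fe_sq(out, in) (one squaring), fe_mul(out, a, b) and the fe type are hypothetical helpers, not the slides' API; x^2^n above means n successive squarings. The counts check out: 1+2+1+5+10+20+10+50+100+50+5 = 254 squarings and 11 multiplications.

#include <stdint.h>

typedef struct { int32_t v[10]; } fe;    /* assumed field-element type */
void fe_sq(fe *out, const fe *in);       /* assumed: out = in^2        */
void fe_mul(fe *out, const fe *a, const fe *b); /* assumed: out = a*b  */

static void fe_sq_n(fe *out, const fe *in, int n) { /* n squarings */
  fe_sq(out, in);
  for (int i = 1; i < n; i++) fe_sq(out, out);
}

void fe_invert(fe *out, const fe *z1) {  /* out = z1^(p-2), p = 2^255 - 19 */
  fe z2, z8, z9, z11, z22, z_5_0, z_10_0, z_20_0, z_40_0, z_50_0,
     z_100_0, z_200_0, z_250_0, t;
  fe_sq_n(&z2, z1, 1);
  fe_sq_n(&z8, &z2, 2);
  fe_mul(&z9, z1, &z8);
  fe_mul(&z11, &z2, &z9);
  fe_sq_n(&z22, &z11, 1);
  fe_mul(&z_5_0, &z9, &z22);
  fe_sq_n(&t, &z_5_0, 5);     fe_mul(&z_10_0, &t, &z_5_0);
  fe_sq_n(&t, &z_10_0, 10);   fe_mul(&z_20_0, &t, &z_10_0);
  fe_sq_n(&t, &z_20_0, 20);   fe_mul(&z_40_0, &t, &z_20_0);
  fe_sq_n(&t, &z_40_0, 10);   fe_mul(&z_50_0, &t, &z_10_0);
  fe_sq_n(&t, &z_50_0, 50);   fe_mul(&z_100_0, &t, &z_50_0);
  fe_sq_n(&t, &z_100_0, 100); fe_mul(&z_200_0, &t, &z_100_0);
  fe_sq_n(&t, &z_200_0, 50);  fe_mul(&z_250_0, &t, &z_50_0);
  fe_sq_n(&t, &z_250_0, 5);   fe_mul(out, &t, &z11);
}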