Homomorphic Encryption Based Secure Genome Data Analysis
Miran Kim ⋆ and Kristin Lauter †
⋆ Seoul National University, † Microsoft Research
iDASH Privacy & Security Workshop, March 16, 2015
Secure Outsourcing GWAS
Minor Allele Frequency

There are 200 people, and each of them has 311 genotypes. Each genotype has two kinds of SNPs.

Data Encoding
For a fixed genotype, suppose that the 200 people have "AT, AT, AA, ..., TT". Then the encoding method is as follows:
◮ If the pair consists of different SNPs (AT), encode it as '1'.
◮ The first pair with the same SNP (AA) is encoded as '0'.
◮ The other pair with the same SNP (TT) is encoded as '2'.
(⇒ The encoded value is the number of 'T' alleles in the individual's pair.)

[Figure: a 200 × 311 matrix of encoded values in {0, 1, 2}, with rows $P_1, \dots, P_{200}$ (people) and columns $G_1, \dots, G_{311}$ (genotypes); for example, an AT entry is stored as 1 and a TT entry as 2.]
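The encoding rule above can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the function name `encode_genotype_column` and the representation of each allele pair as a two-character string are assumptions made here.

```python
def encode_genotype_column(pairs):
    """Encode one genotype column (one allele pair per person) as 0, 1 or 2.

    Assumption: each pair is a two-character string such as "AT" or "AA".
    """
    first_homozygous = None  # the first same-allele pair seen, encoded as 0
    encoded = []
    for pair in pairs:
        if pair[0] != pair[1]:
            # Two different alleles (e.g. AT): encode as 1.
            encoded.append(1)
        else:
            if first_homozygous is None:
                first_homozygous = pair
            # First homozygous kind -> 0, the other homozygous kind -> 2.
            encoded.append(0 if pair == first_homozygous else 2)
    return encoded
```

For the slide's example, `encode_genotype_column(["AT", "AT", "AA", "TT"])` gives `[1, 1, 0, 2]`, which is the number of 'T' alleles per person.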
Minor Allele Frequency

Encryption & Evaluation
[Figure: each person's encoded row $P_i$ (one slot per genotype $G_1, \dots, G_{311}$) is encrypted to a ciphertext $C_i$; the ciphertexts are added to give $\sum_{i=1}^{200} C_i$, which decrypts to the per-genotype count #(T).]
(We can perform the aggregate operations simultaneously for all the genotypes.)

Decryption
◮ Decrypt the ciphertext $\sum_{i=1}^{200} C_i$ with the secret key.
◮ Let $\ell_i$ be the value in the $i$-th slot.

Decoding
◮ For $1 \le i \le 311$, if $\ell_i > 200$, then $\ell_i \leftarrow 400 - \ell_i$.
◮ The minor allele frequency of the genotype $G_i$ is $\ell_i / 400$.
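The aggregation and decoding steps can be illustrated with a plaintext stand-in for the encrypted computation. No homomorphic encryption library is used here: `aggregate_slots` simply adds the encoded rows slot by slot modulo a plaintext modulus `t`, which is what adding the ciphertexts and decrypting would yield. The function names and the parameter `t` are assumptions for this sketch.

```python
def aggregate_slots(rows, t):
    """Slot-wise sum of encoded rows modulo t.

    Plaintext stand-in for adding the ciphertexts C_1, ..., C_n and decrypting.
    """
    return [sum(column) % t for column in zip(*rows)]


def decode_maf(slot_values, num_people=200):
    """Turn per-genotype allele counts into minor allele frequencies."""
    total_alleles = 2 * num_people  # 400 alleles for 200 people
    frequencies = []
    for count in slot_values:
        if count > num_people:
            # The counted allele is the major one; switch to the minor count.
            count = total_alleles - count
        frequencies.append(count / total_alleles)
    return frequencies
```

With three people and three genotypes, `aggregate_slots([[1, 2, 0], [2, 2, 0], [1, 2, 1]], t=1024)` gives the counts `[4, 6, 1]`, and `decode_maf` with `num_people=3` folds the first two counts to 2 and 0 before dividing by 6.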
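Chi-squared Test

Data Encoding
◮ For each genotype, encode the given SNPs of the case group and the control group.

* Note that the result of the chi-squared test is

$$\frac{n(ad - bc)^2}{r \cdot s \cdot g \cdot k} = \frac{800\,\bigl(a(400 - c) - c(400 - a)\bigr)^2}{400 \cdot 400 \cdot g \cdot k} = \frac{800\,(a - c)^2}{(a + c)\bigl(800 - (a + c)\bigr)}$$

where $a$ and $c$ are the allele counts of some SNP in the case and control groups. The middle expression substitutes $n = 800$, $r = s = 400$, $b = 400 - a$ and $d = 400 - c$; the last one uses $a(400 - c) - c(400 - a) = 400(a - c)$, $g = a + c$ and $k = 800 - (a + c)$.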
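The simplification of the chi-squared statistic can be checked numerically. The sketch below assumes the usual 2 × 2 layout (case row $a, b$; control row $c, d$; row totals $r, s$; column totals $g, k$), which matches the substitution $r = s = 400$ in the formula; the function names are hypothetical. Exact rational arithmetic is used so the two forms can be compared with `==`.

```python
from fractions import Fraction


def chi_squared_table(a, b, c, d):
    """General 2x2 statistic n(ad - bc)^2 / (r s g k)."""
    n = a + b + c + d
    r, s = a + b, c + d  # row totals (case, control)
    g, k = a + c, b + d  # column totals (counted allele, other allele)
    return Fraction(n * (a * d - b * c) ** 2, r * s * g * k)


def chi_squared_simplified(a, c, group_alleles=400):
    """Simplified form 800(a - c)^2 / ((a + c)(800 - (a + c)))."""
    n = 2 * group_alleles
    return Fraction(n * (a - c) ** 2, (a + c) * (n - (a + c)))
```

For example, with `a = 120` and `c = 90`, `chi_squared_table(120, 280, 90, 310)` and `chi_squared_simplified(120, 90)` return the same fraction.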
Chi-squared Test

Evaluation
Let $C_i$ and $C'_i$ denote the ciphertexts for the case and control groups.
◮ Evaluate $C_{\text{case}} := \sum_{i=1}^{200} C_i$ and $C_{\text{cont}} := \sum_{i=1}^{200} C'_i$.
◮ Compute $C_{\text{case}} - C_{\text{cont}}$ and $C_{\text{case}} + C_{\text{cont}}$.

Decryption
For the message space $\mathbb{Z}_t = [0, t)$, let
◮ $\text{den} := \text{Dec}(C_{\text{case}} + C_{\text{cont}}) = a + c \ (< t)$
◮ $\text{num} := \text{Dec}(C_{\text{case}} - C_{\text{cont}}) = \begin{cases} a - c & \text{if } a \ge c, \\ (a - c) + t & \text{otherwise.} \end{cases}$

Decoding
◮ If $\text{num} > t/2$, then $\text{num} \leftarrow \text{num} - t$.
◮ The result of the chi-squared test is $\dfrac{800\,(\text{num})^2}{(\text{den})(800 - \text{den})}$.
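The decryption and decoding steps amount to recovering a signed difference from a value reduced modulo $t$. The following sketch simulates this on plaintext integers; `simulate_decryption` stands in for decrypting the two evaluated ciphertexts and is not a real library call, and both function names are assumptions. It presumes $t > 800$ so that $a + c < t$ and $|a - c| < t/2$.

```python
from fractions import Fraction


def simulate_decryption(a, c, t):
    """Values that decrypting C_case + C_cont and C_case - C_cont would give."""
    den = (a + c) % t
    num = (a - c) % t  # a negative difference wraps around to (a - c) + t
    return den, num


def decode_chi_squared(den, num, t, total_alleles=800):
    """Map num back to a signed value, then apply the simplified formula."""
    if 2 * num > t:
        num -= t  # undo the wrap-around for negative differences
    # Note: den == 0 or den == total_alleles would make the denominator zero.
    return Fraction(total_alleles * num * num, den * (total_alleles - den))
```

With `a = 90`, `c = 120` and `t = 1024`, the simulated decryption yields `den = 210` and `num = 994`; decoding maps `num` back to `-30` and produces the same statistic as the case `a = 120`, `c = 90`.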
Secure Comparison between Genomic Data
Hamming Distance

Two individuals have genotypes over many SNPs. For a fixed genotype,

$$d = \begin{cases} 1 & \text{if } (S_1 = \text{null}) \,\|\, (S_2 = \text{null}) \,\|\, (S_1.\text{alt} \neq S_2.\text{alt}), \\ 0 & \text{otherwise.} \end{cases}$$

$x[j]$ := the $j$-th bit of $x$, starting with the least significant bit of $x$.
$\oplus$: XOR gate (= addition over $\mathbb{Z}_2$), $\wedge$: AND gate (= multiplication over $\mathbb{Z}_2$).

SVTYPE                              d
SV_1 or SV_2 = INS/DEL              0
SV_1 or SV_2 = null                 1
SV_1 and SV_2 = SNP/SUB             EQU(S_1, S_2) ⊕ 1

where

$$\text{EQU}(S_1, S_2) = \bigwedge_{j=1}^{\mu} \bigl(S_1[j] \oplus S_2[j] \oplus 1\bigr) = \begin{cases} 1 & \text{if } S_1 = S_2, \\ 0 & \text{otherwise.} \end{cases}$$

We need the encodings to determine 'null' and 'INS/DEL'.
Hamming Distance

Data Encoding
◮ Clean the two datasets using POS, then make the merged list $L$.
◮ For $i \in [1, \#(L)]$, define

$$m_i = \begin{cases} 1 & \text{if } \text{POS}_i \in L, \\ 0 & \text{otherwise,} \end{cases} \qquad h_i = \begin{cases} 0 & \text{if } \text{SV}_i = \text{INS/DEL}, \\ 1 & \text{otherwise.} \end{cases}$$

⇒ $(m_1 \oplus m_2) = 1$ iff $(\text{SV}_{1,i} = \text{null})$ or $(\text{SV}_{2,i} = \text{null})$
⇒ $(h_1 \wedge h_2) = 0$ iff $(\text{SV}_{1,i} = \text{INS/DEL})$ or $(\text{SV}_{2,i} = \text{INS/DEL})$
◮ Encode the SNP string as follows: A → 00, G → 01, C → 10, T → 11.
⋆ Each SNP is encoded, and the encodings are concatenated.
⋆ Pad '1' at the end of the string, then pad with '0's to make a 21-bit string, say $S_i$.
⋆ A missing genotype is encoded as the all-'0' string.
⋆ For example, 'GTA' is encoded as '01 || 11 || 00 || 1 || 0...0', where the trailing run consists of 14 zeros.
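The 21-bit string encoding can be sketched as follows. The base-to-bits table and the padding rule come from the slide; the function name `encode_snp_string`, the use of `None` for a missing genotype, and the error raised for over-long inputs are assumptions made for this illustration.

```python
BASE_BITS = {"A": "00", "G": "01", "C": "10", "T": "11"}


def encode_snp_string(bases, length=21):
    """Encode a base string such as 'GTA' as a fixed-length bit string.

    Assumption: a missing genotype is passed as None.
    """
    if bases is None:
        return "0" * length  # missing genotype: all-zero string
    # Two bits per base, followed by a single terminating '1'.
    bits = "".join(BASE_BITS[base] for base in bases) + "1"
    if len(bits) > length:
        raise ValueError("base string too long for the chosen bit length")
    # Fill the remainder with '0's.
    return bits.ljust(length, "0")
```

`encode_snp_string("GTA")` returns `"0111001"` followed by 14 zeros. The terminating '1' presumably serves to keep strings of different lengths apart (for instance 'A' and 'AA') and to separate any present string from the all-zero missing value.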
Hamming Distance

Encryption
◮ Embed the data of $P_1$ (= $m_{1,i}, h_{1,i}, S_{1,i}$) and $P_2$ (= $m_{2,i}, h_{2,i}, S_{2,i}$) into the plaintext slots in a bit-by-bit manner.
◮ Encrypt the slots with the public key.

[Figure: for each person, the rows $m_1$, $h_1$, $S_1[1], \dots, S_1[21]$ are each packed into a vector of $\#(L)$ plaintext slots, one slot per position in $L$.]
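The bit-by-bit embedding can be pictured as a transpose: one record per position becomes one slot vector per bit. The sketch below shows only this plaintext layout, before any encryption; the function name `pack_bit_sliced` and the tuple layout `(m, h, s_bits)` are assumptions, not part of the original slides.

```python
def pack_bit_sliced(records):
    """Turn per-position records into one slot vector per bit.

    records: list of (m, h, s_bits) tuples, one per position in L, where
    s_bits is a list of 0/1 values of equal length for every record.
    Returns [m_row, h_row, s_row_1, ..., s_row_k], each of length len(records).
    """
    m_row = [m for m, _, _ in records]
    h_row = [h for _, h, _ in records]
    # Transpose the bit strings: row j holds bit j of every position.
    s_rows = [list(column) for column in zip(*(s for _, _, s in records))]
    return [m_row, h_row] + s_rows
```

With 21-bit strings this yields 23 slot vectors per person, each of which would then be encrypted as one packed plaintext.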
Hamming Distance

Evaluation
◮ Evaluate the following binary circuit over encrypted data:

$$(h_{1,i} \wedge h_{2,i}) \wedge \Bigl[ (m_{1,i} \oplus m_{2,i}) \oplus \bigl( (m_{1,i} \oplus m_{2,i} \oplus 1) \wedge (\text{EQU}(S_{1,i}, S_{2,i}) \oplus 1) \bigr) \Bigr]$$

◮ Take the scheme parameter $m = 8191$ so that we can embed 630 messages into one ciphertext and perform the operations simultaneously for all the messages.

Decryption
◮ Decrypt the evaluated value and let $\ell_i$ be the value in the $i$-th slot.

Decoding
◮ Note that $\ell_i$ is the Hamming distance result for the $i$-th genotype.
◮ Compute $\sum_{i=1}^{\#(L)} \ell_i$.
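The binary circuit can be exercised on plaintext bits to confirm that it reproduces the three SVTYPE cases. This is a sketch on unencrypted 0/1 integers, where XOR and AND stand in for homomorphic addition and multiplication over $\mathbb{Z}_2$; the bracketing of the circuit is one reading chosen to agree with the SVTYPE table, and the function names are hypothetical.

```python
def equ(s1, s2):
    """Return 1 if the two bit lists are equal, else 0 (AND of XNORs)."""
    result = 1
    for b1, b2 in zip(s1, s2):
        result &= b1 ^ b2 ^ 1
    return result


def distance_bit(m1, h1, s1, m2, h2, s2):
    """Per-position distance d, using only XOR and AND gates."""
    one_missing = m1 ^ m2       # 1 iff exactly one individual lacks the position
    differ = equ(s1, s2) ^ 1    # 1 iff the encoded strings differ
    return (h1 & h2) & (one_missing ^ ((one_missing ^ 1) & differ))


def hamming_distance(person1, person2):
    """Sum the per-position bits; each person is a list of (m, h, s_bits)."""
    return sum(distance_bit(*p1, *p2) for p1, p2 in zip(person1, person2))
```

An INS/DEL at either individual forces the result to 0 through the `h1 & h2` factor, a position missing from one individual gives 1, and otherwise the result is 1 exactly when the two encoded strings differ.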