Parquet Modular Encryption Gidon Gershinsky IBM Research – Haifa Lab
Speaker Senior Architect at IBM Research – Haifa Lab gidon@il.ibm.com Leading role in Apache Parquet work on definition of encryption format and its implementation • community work, folks from many companies are involved Number of projects on secure analytics on encrypted data • connected car and healthcare usecases • Apache Spark with Parquet encryption • Spark&AI Summit talk, 2018
Overview • Goals of this technology • Parquet encryption – Features • Sample usecases • How to use Parquet encryption API • Basic integration with Apache Spark • Performance implications • Roadmap
Apache Parquet Popular columnar storage format Encoding, compression Advanced data filtering • columnar projection: skip columns • predicate pushdown: skip files, or row groups, or data pages Performance benefits • less data to fetch from storage: I/O, latency • less data to process: CPU, latency How to protect sensitive Parquet data? • in any storage - keeping projection/predicates, supporting column access control, data tamper-proofing etc.
Parquet Encryption: Goals Protect sensitive data-at-rest (in storage) • data privacy/confidentiality: encryption - hiding sensitive information • data integrity: tamper-proofing sensitive information • in any storage - untrusted, cloud or private, file system, object store, archives Preserve performance of analytic engines • full Parquet capabilities (columnar projection, predicate pushdown, etc) with encrypted data Leverage encryption for fine-grained access control • per-column encryption keys • key-based access in any storage: private -> cloud -> archive
Parquet Encryption: Features Privacy: Hiding sensitive information • Full encryption: all data and metadata modules • min/max values, schema, encryption key ids, list of sensitive columns, etc • Separate keys for sensitive columns • column data and metadata • column access control • Separate key for file-wide metadata • Parquet file footer – encrypted with footer key • Storage server / admin never sees encryption keys or unencrypted data • “client - side” encryption
Parquet Encryption: Features Privacy: Hiding sensitive information (continued) • Multiple encryption algorithms • different security and performance trade-offs • currently two algorithms are defined and implemented • AES_GCM: encrypts and tamper-proofs everything (data and metadata) • AES_GCM_CTR: encrypts everything, tamper-proofs metadata only • could be useful on platforms without AES hardware acceleration, like Java 8 • if you need a new one, talk to us • Optional plaintext footer mode for legacy readers • any (old) Parquet reader can access unencrypted columns • footer is unencrypted – but tamper-proofed • signed with footer key
Parquet Encryption: Features Data integrity verification • File data and metadata are not tampered with • modifying data page contents • replacing one data page with another • File not replaced with wrong file • unmodified - but e.g. outdated • sign file contents and file id • Example: altering customer / billing data • Example: altering healthcare data (!) - patient record or medical sensor readings • AES GCM: "authenticated encryption" • implemented in hardware [Figure: two files, customers-sept-2019.part0.parquet and customers-jan-2014.part0.parquet]
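The "authenticated encryption" property can be illustrated with the JDK alone. The sketch below is not the Parquet implementation; it only shows the underlying AES-GCM behaviour, assuming a hypothetical class name `GcmTamperDemo` and a made-up page payload. A single flipped ciphertext bit makes decryption fail instead of returning altered data.

```java
import javax.crypto.AEADBadTagException;
import javax.crypto.Cipher;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;

// Hypothetical demo class; not part of any Parquet library.
final class GcmTamperDemo {
    static final int TAG_BITS = 128;

    // Encrypts with AES-GCM; the result is the ciphertext followed by a 16-byte tag.
    static byte[] encrypt(byte[] key, byte[] nonce, byte[] plaintext) {
        try {
            Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
            c.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"),
                   new GCMParameterSpec(TAG_BITS, nonce));
            return c.doFinal(plaintext);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns the plaintext, or null when the authentication tag does not verify.
    static byte[] decryptOrNull(byte[] key, byte[] nonce, byte[] ciphertext) {
        try {
            Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
            c.init(Cipher.DECRYPT_MODE, new SecretKeySpec(key, "AES"),
                   new GCMParameterSpec(TAG_BITS, nonce));
            return c.doFinal(ciphertext);
        } catch (AEADBadTagException e) {
            return null; // tampering detected
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        SecureRandom random = new SecureRandom();
        byte[] key = new byte[16];   // 128-bit key
        byte[] nonce = new byte[12]; // must be unique per encryption with the same key
        random.nextBytes(key);
        random.nextBytes(nonce);

        byte[] page = "patient-42,heart-rate=71".getBytes(StandardCharsets.UTF_8);
        byte[] encrypted = encrypt(key, nonce, page);

        byte[] tampered = encrypted.clone();
        tampered[0] ^= 1; // flip one bit of the stored data

        System.out.println("intact page verifies:   " + (decryptOrNull(key, nonce, encrypted) != null));
        System.out.println("tampered page verifies: " + (decryptOrNull(key, nonce, tampered) != null));
    }
}
```

Returning `null` on a failed tag is a choice made for the demo; a reader would more likely surface the failure as an error.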
Current Status • Apache Parquet community work • Encryption specification approved in January 2019 • signed-off by PMC • Specification and Thrift format merged • in apache/parquet-format master • part of parquet-format-2.7.0 release pull request (merged too) • Implementation • C++ and Java code • pull requests being reviewed, some already merged • implementation and API that closely follows the encryption specification
Parquet Encryption Usecases Same as "Parquet Usecases" – with sensitive column data • Data queries, analytic applications - in any industry • Spark/Hive/Presto with Parquet: horizontal platform, not a vertical solution • Protect data privacy / confidentiality • personal data privacy • sensitive business data • regulations • Protect data integrity • business processes • wrong billing due to tampering with e.g. customer data • personal health • wrong treatment due to tampering with patient records or sensor readings
Connected Car Usecase "RestAssured" – EU Horizon 2020 research project (N 731678) Project partners IBM, Adaptant, OCC, Thales, UDE, IT Innovation Project usecases • usage-based car insurance, social services • encryption: protect personal data • integrity: prevent billing tampering Spark&AI Summit EU 2018: demo shots with Spark/Parquet Encryption
Healthcare Usecase "ProTego" – EU Horizon 2020 research project (N 826284) Project partners St Raffaele hospital, Marina Salud hospital, IBM, GFI, ITI, UAH, IMEC, KUL, ICE Project usecases • Queries / analytics on sensitive healthcare data • HL7 FHIR standard: maps nicely to Parquet • encryption: protect personal data • integrity: prevent tampering with diagnosis and treatment
Encryption API
• Parquet API - without encryption
    ParquetFileWriter fileWriter = new ParquetFileWriter(file_path, schema, …);
  • then write data
    ParquetFileReader fileReader = ParquetFileReader.open(file_path, options);
  • then read data
• Parquet API - with encryption
    ParquetFileWriter fileWriter = new ParquetFileWriter(file_path, schema, …, fileEncryptionProperties);
  • then write data (just like before)
    ParquetFileReader fileReader = ParquetFileReader.open(file_path, options, fileDecryptionProperties);
  • then read data (just like before)
File Encryption Properties: Trivial
• encrypt all columns (and footer) with key0
• tamper-proof encrypted content
• enable columnar projection, predicate pushdown, etc
    byte[] key0 = … // e.g. 128-bit key (16 bytes)
    FileEncryptionProperties fileEncryptionProps = FileEncryptionProperties.builder(key0).build();
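The slide leaves the origin of `key0` open. One possible way to obtain a 16-byte key, assuming no KMS is involved and using a hypothetical helper class `AesKeys`, is to draw it from the JDK's `SecureRandom`:

```java
import java.security.SecureRandom;

// Hypothetical helper; in production the key would usually come from a KMS.
final class AesKeys {
    private static final SecureRandom RANDOM = new SecureRandom();

    // Returns a fresh random AES key of the given size in bits (128, 192 or 256).
    static byte[] newKey(int bits) {
        if (bits != 128 && bits != 192 && bits != 256) {
            throw new IllegalArgumentException("AES keys are 128, 192 or 256 bits");
        }
        byte[] key = new byte[bits / 8];
        RANDOM.nextBytes(key);
        return key;
    }

    public static void main(String[] args) {
        byte[] key0 = newKey(128);
        System.out.println("key0 length in bytes: " + key0.length);
    }
}
```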
File Encryption Properties Basic • encrypt columnA with key1, columnB with key2 (and footer with key0) • differential column access control • assign key IDs (key metadata) for simplified key retrieval • tamper-proof encrypted content • enable columnar projection, predicate pushdown, etc
File Encryption Properties: Basic
• encrypt columnA with key1, columnB with key2 (and footer with key0)
    byte[] key1 = … // e.g. 128-bit key (16 bytes)
    ColumnEncryptionProperties encrColumnA = ColumnEncryptionProperties
        .builder("columnA")
        .withKey(key1)
        .withKeyID("key1")
        .build();
• same for columnB. Then file properties:
    FileEncryptionProperties fileEncryptionProps = FileEncryptionProperties.builder(key0)
        .withFooterKeyID("key0")
        .withEncryptedColumns(encryptedColumns) // list (map) of column encryption properties
        .build();
File Encryption Properties: Advanced
• Protect against file replacement attacks
  • Replacement with untampered but e.g. outdated file (table partition)
    String fileID = "customers-sept-2019.part0";
    byte[] aadPrefix = fileID.getBytes();
    FileEncryptionProperties fileEncryptionProps = FileEncryptionProperties.builder(key0)
        .withFooterKeyID("key0")
        .withAADPrefix(aadPrefix)
        .withEncryptedColumns(encryptedColumns)
        .build();
File Encryption Properties: Advanced
• Allow legacy clients to read unencrypted columns in encrypted files
  • plaintext (unencrypted) footer mode
  • visible file metadata (schema, names of secret columns and of their keys, etc)
  • tamper-proof (sign) file metadata with footer key
    FileEncryptionProperties fileEncryptionProps = FileEncryptionProperties.builder(key0)
        .withFooterKeyID("key0")
        .withPlaintextFooter()
        .withEncryptedColumns(encryptedColumns)
        .build();
File Encryption Properties: Advanced
• Use alternative encryption algorithm
  • better performance in old Java versions
  • tamper-proofing metadata only (not data)
    FileEncryptionProperties fileEncryptionProps = FileEncryptionProperties.builder(key0)
        .withFooterKeyID("key0")
        .withAlgorithm(ParquetCipher.AES_GCM_CTR_V1)
        .withEncryptedColumns(encryptedColumns)
        .build();
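What "tamper-proofing metadata only (not data)" means for data encrypted in CTR mode can be sketched with the JDK. This is an illustration of plain AES-CTR behaviour, not of the Parquet code path; the class name `CtrMalleabilityDemo` and the payload are made up. A flipped ciphertext bit goes unnoticed and flips the same bit of the decrypted data.

```java
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;

// Hypothetical demo class; not part of any Parquet library.
final class CtrMalleabilityDemo {
    // AES-CTR is a stream mode: the same routine serves encryption and decryption.
    static byte[] ctr(int mode, byte[] key, byte[] iv, byte[] input) {
        try {
            Cipher c = Cipher.getInstance("AES/CTR/NoPadding");
            c.init(mode, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
            return c.doFinal(input);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        SecureRandom random = new SecureRandom();
        byte[] key = new byte[16];
        byte[] iv = new byte[16]; // initial counter block; must not repeat for the same key
        random.nextBytes(key);
        random.nextBytes(iv);

        byte[] page = "amount=100".getBytes(StandardCharsets.UTF_8);
        byte[] encrypted = ctr(Cipher.ENCRYPT_MODE, key, iv, page);

        encrypted[7] ^= 0x01; // an attacker flips one bit without knowing the key

        // No exception is thrown: CTR alone carries no authentication tag.
        byte[] decrypted = ctr(Cipher.DECRYPT_MODE, key, iv, encrypted);
        System.out.println("original:  " + new String(page, StandardCharsets.UTF_8));
        System.out.println("decrypted: " + new String(decrypted, StandardCharsets.UTF_8));
    }
}
```

This is the trade-off the slide names: the CTR variant avoids the cost of authenticating data pages, so data-page integrity has to be acceptable as unprotected.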
File Decryption Properties
Simpler than encryption properties
• most details are specified in the file metadata
    StringKeyIdRetriever keyRetriever = new StringKeyIdRetriever();
    keyRetriever.putKey("key0", key0);
    keyRetriever.putKey("key1", key1);
    keyRetriever.putKey("key2", key2);
    FileDecryptionProperties fileDecryptionProps = FileDecryptionProperties.builder()
        .withKeyRetriever(keyRetriever)
        .build();
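The idea behind a key retriever is a lookup from the key metadata stored in the file to the key bytes. A minimal stand-alone sketch follows; `InMemoryKeyStore` is a hypothetical class, not the library's retriever, and it assumes the key metadata is simply the UTF-8 encoding of the key ID.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a key retriever; it does not implement any Parquet interface.
final class InMemoryKeyStore {
    private final Map<String, byte[]> keys = new HashMap<>();

    // Registers a key under its ID; the array is copied so callers can wipe theirs.
    void putKey(String keyId, byte[] key) {
        keys.put(keyId, key.clone());
    }

    // Looks up a key from key metadata, assumed here to be the UTF-8 bytes of the key ID.
    byte[] getKey(byte[] keyMetadata) {
        String keyId = new String(keyMetadata, StandardCharsets.UTF_8);
        byte[] key = keys.get(keyId);
        if (key == null) {
            throw new IllegalArgumentException("no key registered for ID: " + keyId);
        }
        return key.clone();
    }

    public static void main(String[] args) {
        InMemoryKeyStore store = new InMemoryKeyStore();
        store.putKey("key1", new byte[16]); // placeholder key material for the demo
        byte[] metadataFromFile = "key1".getBytes(StandardCharsets.UTF_8);
        System.out.println("retrieved key length: " + store.getKey(metadataFromFile).length);
    }
}
```

A reader that lacks the key for a column gets an error for that column only, which is what makes per-column keys usable for access control.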
File Decryption Properties: Advanced
• Protect against file replacement attacks
    String fileID = "customers-sept-2019.part0";
    byte[] aadPrefix = fileID.getBytes();
    FileDecryptionProperties fileDecryptionProps = FileDecryptionProperties.builder()
        .withKeyRetriever(keyRetriever)
        .withAADPrefix(aadPrefix)
        .build();
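Why an AAD prefix defeats file replacement can be shown with plain AES-GCM. The sketch below is a simplification: it uses the file ID as the entire AAD, whereas the Parquet format is assumed to combine the prefix with further per-module data. The class name `AadPrefixDemo` and the payload are illustrative. A file written under one ID does not verify when the reader expects another, even though the key is correct and the bytes are untouched.

```java
import javax.crypto.AEADBadTagException;
import javax.crypto.Cipher;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;

// Hypothetical demo class; not part of any Parquet library.
final class AadPrefixDemo {
    // Builds an AES-GCM cipher whose tag also covers the (unencrypted) AAD bytes.
    private static Cipher init(int mode, byte[] key, byte[] nonce, byte[] aad)
            throws GeneralSecurityException {
        Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
        c.init(mode, new SecretKeySpec(key, "AES"), new GCMParameterSpec(128, nonce));
        c.updateAAD(aad);
        return c;
    }

    static byte[] encrypt(byte[] key, byte[] nonce, byte[] aad, byte[] plaintext) {
        try {
            return init(Cipher.ENCRYPT_MODE, key, nonce, aad).doFinal(plaintext);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns the plaintext, or null when the data or the AAD does not match the tag.
    static byte[] decryptOrNull(byte[] key, byte[] nonce, byte[] aad, byte[] ciphertext) {
        try {
            return init(Cipher.DECRYPT_MODE, key, nonce, aad).doFinal(ciphertext);
        } catch (AEADBadTagException e) {
            return null;
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        SecureRandom random = new SecureRandom();
        byte[] key = new byte[16];
        byte[] nonce = new byte[12];
        random.nextBytes(key);
        random.nextBytes(nonce);

        byte[] oldId = "customers-jan-2014.part0".getBytes(StandardCharsets.UTF_8);
        byte[] newId = "customers-sept-2019.part0".getBytes(StandardCharsets.UTF_8);
        byte[] oldFile = encrypt(key, nonce, oldId, "outdated rows".getBytes(StandardCharsets.UTF_8));

        // The outdated file is swapped in, but the reader supplies the ID it expects.
        System.out.println("verifies under its own ID:  " + (decryptOrNull(key, nonce, oldId, oldFile) != null));
        System.out.println("verifies under expected ID: " + (decryptOrNull(key, nonce, newId, oldFile) != null));
    }
}
```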
Beyond Low Level API Low level API – full power of Parquet encryption • directly implements the approved specification features • enables any key management scheme • work with KMS instead of explicit keys • you need to build one – choosing from many options for KMS, Auth, envelope encryption (data key wrapping) • if you know how – Parquet low level encryption API is all you need • no one-size-fits-all solution for KMS/Auth/Wrapping
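Envelope encryption (data key wrapping) can be sketched with the JDK as follows. This is one possible scheme, not something the slides prescribe: the master key is a local byte array standing in for a key held by a KMS, the data key is wrapped with AES-GCM, and the class name `EnvelopeDemo` and the `nonce || ciphertext || tag` layout are assumptions made for the example.

```java
import javax.crypto.Cipher;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.Arrays;

// Hypothetical demo class; a real deployment would call a KMS instead of holding the master key.
final class EnvelopeDemo {
    private static final SecureRandom RANDOM = new SecureRandom();
    static final int NONCE_LEN = 12;

    // Wraps dataKey under masterKey; output layout is nonce || ciphertext || tag.
    static byte[] wrap(byte[] masterKey, byte[] dataKey) {
        try {
            byte[] nonce = new byte[NONCE_LEN];
            RANDOM.nextBytes(nonce);
            Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
            c.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(masterKey, "AES"),
                   new GCMParameterSpec(128, nonce));
            byte[] ct = c.doFinal(dataKey);
            byte[] out = Arrays.copyOf(nonce, NONCE_LEN + ct.length);
            System.arraycopy(ct, 0, out, NONCE_LEN, ct.length);
            return out;
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // Recovers the data key; fails if the wrapped bytes were altered or the master key is wrong.
    static byte[] unwrap(byte[] masterKey, byte[] wrapped) {
        try {
            Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
            c.init(Cipher.DECRYPT_MODE, new SecretKeySpec(masterKey, "AES"),
                   new GCMParameterSpec(128, wrapped, 0, NONCE_LEN));
            return c.doFinal(wrapped, NONCE_LEN, wrapped.length - NONCE_LEN);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] masterKey = new byte[16]; // stand-in for a key that never leaves the KMS
        byte[] dataKey = new byte[16];   // per-file or per-column key used for the actual data
        RANDOM.nextBytes(masterKey);
        RANDOM.nextBytes(dataKey);

        byte[] wrapped = wrap(masterKey, dataKey); // safe to store next to the data
        System.out.println("data key recovered: " + Arrays.equals(dataKey, unwrap(masterKey, wrapped)));
    }
}
```

The wrapped key could travel as key metadata, so that a reader with access to the master key recovers the data key at read time.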