Encryption and Anonymization in Hadoop
Current and Future Needs
Sept-28-2015, ApacheCon, Budapest
Page 1
Agenda
• Need for data protection – Encryption and Anonymization
• Current state of encryption in Hadoop
• Demo
• Future focus areas for the community
Page 2
Speakers bganesan@apache.org bosco@apache.org Chief Security Architect, Sr Director, Enterprise Security Hortonworks Hortonworks Committer - Apache Ranger Committer - Apache Ranger and Apache Hawq Page 3
Security today in Hadoop
Centralized Security Administration w/ Ranger (Hadoop Ecosystem)
• Authentication: Who am I/prove it?
  - Kerberos
  - API security with Apache Knox
• Authorization: What can I do?
  - Fine grain access control with Apache Ranger
• Audit: What did I do?
  - Centralized audit reporting w/ Apache Ranger
• Data Protection: Can data be encrypted at rest and over the wire?
  - Wire encryption in Hadoop
  - HDFS, HBase encryption
Page 4
Data Protection
Encryption and Anonymization
Page 5
Why is Encryption at Rest required?
• Sensitive data could be stored in Hadoop
• Compliance or external regulation may mandate encryption, for example PCI (Retail, Consumer) or HIPAA (Healthcare)
• Cost of not encrypting is increasing
• Enhanced security
• Added layer on top of authentication (passwords) and authorization (ACLs)
• Protects sensitive data from rogue administrators
Page 6
Available Hadoop Encryption Options
[Figure: OS, HDFS, HBase, and custom encryption options positioned along two axes, granularity and ease of implementation]
Page 7
OS Level Encryption – LUKS/DM-CRYPT
Why it helps?
• Encrypts entire disk volume – all data is encrypted
• Simpler setup, native OS and vendor solutions available
Cons
• Performance challenges
• Admin can still see raw data
[Diagram: Hadoop on mount points (root, grid0, grid2, ... gridn), layered over DM-CRYPT, layered over Partition 1, Partition 2..n]
Page 8
HDFS Transparent Encryption
Why it helps?
• Encrypt only specific data
• Different access control levels
• Transparent to the end application, few changes needed
• Auditing of key access
[Diagram: Solution. HDFS Client, Ranger KMS, and NN, with file blocks A, B, C, D distributed across three DNs]
Page 9
HDFS Encryption – Protect Application Data
Guidelines
• Encrypt Hive, HBase data stored in HDFS
• Specific changes in Hive to ensure the scratch dir is encrypted
• Separate admins in HDFS, YARN, Oozie
• Spark application logs should be in an EZ (encryption zone)
[Diagram: Spark, HBase, Hive, Oozie, and Sqoop above the NN, with file blocks A, B, C, D distributed across three DNs]
Page 10
Ranger KMS – Centralized Key Management Page 11
HDFS TDE Workflow
Actors: NN, DN; Ranger KMS; Client
• Client: Create EZ (encryption zone) keys
• Ranger KMS: Provide EZ keys
• Client: Create encryption zone
• NN: NN marks folder as EZ
Page 12
HDFS TDE Workflow – Write a File
Actors: NN, DN; Ranger KMS; Client
• Client: Client requests to write to the EZ
• NN: NN does access check
• Ranger KMS: Create DEK (data encryption key) and encrypt it with the EZ key
• NN: Send block information to the client. EDEK (encrypted DEK) is stored with the file
• Client: Receive EDEK, request DEK
• Ranger KMS: Decrypt EDEK, provide DEK
• Client: Encrypt data and write to DN
Page 13
HDFS TDE Workflow – Read a File
Actors: NN, DN; Ranger KMS; Client
• Client: Client requests to read from the EZ
• NN: NN does access check. Provide data, EDEK
• Client: Receive EDEK, request DEK
• Ranger KMS: Decrypt EDEK, provide DEK
• Client: Use DEK to read file data
Page 14
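The write and read workflows on the two slides above follow an envelope-encryption pattern, which can be sketched in a few lines of Python. This is only an illustration under loud assumptions: `ToyKMS`, `write_file`, `read_file`, and the dictionaries standing in for the NN and DN are hypothetical names, access checks are omitted, and `toy_cipher` is an insecure XOR keystream used in place of a real cipher so that the sketch needs only the standard library. It is not how HDFS or Ranger KMS is implemented.

```python
import hashlib
import os


def toy_cipher(key, nonce, data):
    """XOR data with a SHA-256-derived keystream.

    Toy stand-in for a real cipher (NOT secure). XOR is symmetric, so the
    same call both encrypts and decrypts.
    """
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        block = hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        stream.extend(block)
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))


class ToyKMS:
    """Hypothetical stand-in for Ranger KMS: EZ keys never leave this object."""

    def __init__(self):
        self._ez_keys = {}

    def create_ez_key(self, name):
        self._ez_keys[name] = os.urandom(32)

    def generate_edek(self, key_name):
        # Create a fresh DEK and return it only in encrypted form (the EDEK).
        dek = os.urandom(32)
        nonce = os.urandom(16)
        return nonce, toy_cipher(self._ez_keys[key_name], nonce, dek)

    def decrypt_edek(self, key_name, nonce, edek):
        return toy_cipher(self._ez_keys[key_name], nonce, edek)


def write_file(kms, namenode, datanode, ez_key_name, path, plaintext):
    # KMS creates the DEK and encrypts it with the EZ key; the EDEK is
    # stored with the file metadata (the "NN" dictionary).
    nonce, edek = kms.generate_edek(ez_key_name)
    namenode[path] = (ez_key_name, nonce, edek)
    # The client asks the KMS for the DEK, encrypts, and writes to the "DN".
    dek = kms.decrypt_edek(ez_key_name, nonce, edek)
    datanode[path] = toy_cipher(dek, b"file-data", plaintext)


def read_file(kms, namenode, datanode, path):
    # The "NN" provides the EDEK; the client asks the KMS to decrypt it.
    ez_key_name, nonce, edek = namenode[path]
    dek = kms.decrypt_edek(ez_key_name, nonce, edek)
    return toy_cipher(dek, b"file-data", datanode[path])


kms = ToyKMS()
kms.create_ez_key("ez_key_1")
namenode, datanode = {}, {}
secret = b"card=4111 1111 1111 1234"
write_file(kms, namenode, datanode, "ez_key_1", "/secure/cards.txt", secret)
restored = read_file(kms, namenode, datanode, "/secure/cards.txt")
print(datanode["/secure/cards.txt"] != secret, restored == secret)
```

The point of the pattern is visible in the sketch: the stored bytes and the stored EDEK are useless without the EZ key, and the EZ key stays inside the KMS.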
HBase Encryption in 0.98
Why it helps?
• HFile encrypted and stored on disk
• Per-CF (column family) configuration
• Keys stored in Java keystore
Page 15
Demo
Don Bosco Durai
Page 16
Future Work Focus areas for the community Page 17
Encryption and Anonymization - Future Focus Areas
• Hive Column Encryption
• Solidifying HBase Encryption
• Kafka and Solr Encryption
• Need for Tokenization/Masking
Page 18
Hive Column Encryption
• Being discussed in the community. Apache JIRA # ORC-14
• Handled at the ORC layer
• Elegant solution. Encryption is done after ORC compression.
• Columns are stored separately, so each can be encrypted with a different key
• Leverages the KeyProvider API. Potentially can use Hadoop/Ranger KMS
How it will help?
• Encrypt fields instead of the whole file
• Data protected in HDFS as well as at the OS layer
Page 19
Kafka Encryption
• Discussion going on in the Kafka community
• Two possible approaches
  - Broker encrypts and stores the data
  - Client(s) encrypt/decrypt the data
• Pros with client side encrypt/decrypt
  - No encryption/decryption overhead on the Broker side
  - Keys not available on the Broker, so data is safe from everyone
  - No need for wire encryption
• Cons with client side encrypt/decrypt
  - Compaction/compression not effective with encrypted data
  - Needs a protocol change and updated client libraries
How it will help?
• Encrypt any local data stored on disks
• Data encrypted on the wire
Page 20
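The compression drawback of client-side encryption listed above can be seen with a small standard-library experiment. This is a hedged sketch, not a Kafka benchmark: the sample messages are made up, and `os.urandom` is used as a stand-in for ciphertext on the assumption that the output of a good cipher looks like random bytes.

```python
import os
import zlib

# Repetitive plaintext, loosely shaped like JSON messages (made-up sample data).
plaintext = b'{"user": "jdoe", "action": "login", "status": "ok"}\n' * 200

# Stand-in for ciphertext: random bytes of the same length (an assumption,
# no real cipher is involved here).
ciphertext_like = os.urandom(len(plaintext))

plain_ratio = len(zlib.compress(plaintext)) / len(plaintext)
cipher_ratio = len(zlib.compress(ciphertext_like)) / len(ciphertext_like)

print(f"plaintext compresses to {plain_ratio:.1%} of its original size")
print(f"ciphertext-like data compresses to {cipher_ratio:.1%} of its original size")
```

The repetitive plaintext shrinks to a small fraction of its size, while the random stand-in does not shrink at all, which is why compressing after client-side encryption gains nothing.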
Solr Encryption
• No active discussion currently
• Will be good to have native support
• Index files could be encrypted/decrypted just like ORC
• Could be integrated with external KMS (Hadoop/Ranger)
• Higher granularity than OS or HDFS encryption
How it will help?
• Sensitive data could be stored in indexes, may need to be encrypted
Page 21
Beyond Encryption... Anonymization [1]
- Tokenization – Replace a sensitive field (e.g. card number) with some other value. Could be a format-preserving or random unique value.
- Redaction – Mask sensitive data (e.g. card numbers can be changed to xxxx xxxx xxxx 1234)
How it helps?
• Protect sensitive data beyond access control
• Field level control
• Enable compliance with privacy laws
[1] http://blogs.gartner.com/merv-adrian/2014/01/13/aaa-is-not-enough-security-in-the-big-data-era/
Page 22
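The two techniques on this slide can be sketched as follows. The names `redact_card` and `TokenVault` are hypothetical and chosen for illustration only; the vault is an in-memory dictionary, whereas a real tokenization service would persist and protect the mapping. The sketch shows the "random unique value" flavour of tokenization, not the format-preserving one.

```python
import secrets


def redact_card(card_number, visible=4):
    """Mask every digit except the last `visible` ones, keeping separators."""
    total_digits = sum(ch.isdigit() for ch in card_number)
    seen = 0
    out = []
    for ch in card_number:
        if ch.isdigit():
            seen += 1
            out.append(ch if seen > total_digits - visible else "x")
        else:
            out.append(ch)
    return "".join(out)


class TokenVault:
    """Hypothetical in-memory vault mapping random tokens to original values."""

    def __init__(self):
        self._by_value = {}
        self._by_token = {}

    def tokenize(self, value):
        # Reuse the existing token so the same value always maps to one token.
        if value not in self._by_value:
            token = secrets.token_hex(8)
            while token in self._by_token:  # keep tokens unique
                token = secrets.token_hex(8)
            self._by_value[value] = token
            self._by_token[token] = value
        return self._by_value[value]

    def detokenize(self, token):
        return self._by_token[token]


print(redact_card("4111 1111 1111 1234"))  # prints xxxx xxxx xxxx 1234

vault = TokenVault()
token = vault.tokenize("4111 1111 1111 1234")
print(token, vault.detokenize(token))
```

Redaction is one-way, while the vault lets authorized callers recover the original value, which is the practical difference between the two bullet points.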
Where is it applicable?
• Sensitive data in HDFS files
• Column values in Hive or HBase
• Field values in Solr
• Messages in Kafka or NiFi
Page 23
How?
• Tokenize on source
  - Tokenize while ingesting data (Flume, NiFi, Sqoop, etc.)
  - Data is stored tokenized, so it is safe to give access to others
  - Selected users can de-tokenize if needed
• Tokenize/Mask on read
  - E.g. select name, mobile_number from customer;
  - Based on policy, if the user is a Data Scientist, tokenize/mask the data before returning it

Name        Actual        Returned (Format Preserved)
John Doe    415-123-4567  415-682-5638
Jane Smith  408-123-4567  408-802-4027
Mary Pick   650-123-4567  650-865-6921
Page 24
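A minimal sketch of the tokenize-on-read idea, using the customer rows from the table above. Everything beyond those rows is an assumption: the role name `data_scientist`, the functions `tokenize_phone` and `select_name_mobile`, and the hard-coded `SECRET` are hypothetical. The token is derived with HMAC, so it is deterministic and keeps the area code and separators, but it is one-way; supporting de-tokenization for selected users would need a vault or a reversible format-preserving scheme instead. The generated digits will not match the values in the table.

```python
import hashlib
import hmac

# Hypothetical demo key; a real deployment would fetch this from a KMS.
SECRET = b"demo-only-secret"

# Rows taken from the slide's example table.
CUSTOMERS = [
    ("John Doe", "415-123-4567"),
    ("Jane Smith", "408-123-4567"),
    ("Mary Pick", "650-123-4567"),
]


def tokenize_phone(phone, keep_digits=3):
    """Replace all but the first `keep_digits` digits with HMAC-derived digits.

    Separators are kept, so the result has the same format as the input.
    Taking a byte modulo 10 is slightly biased, which is fine for a sketch.
    """
    digest = hmac.new(SECRET, phone.encode(), hashlib.sha256).digest()
    out = []
    digit_index = 0
    for ch in phone:
        if ch.isdigit():
            if digit_index < keep_digits:
                out.append(ch)
            else:
                out.append(str(digest[digit_index % len(digest)] % 10))
            digit_index += 1
        else:
            out.append(ch)
    return "".join(out)


def select_name_mobile(role):
    """Hypothetical policy: data scientists see tokens, other roles see raw values."""
    if role == "data_scientist":
        return [(name, tokenize_phone(mobile)) for name, mobile in CUSTOMERS]
    return list(CUSTOMERS)


for name, mobile in select_name_mobile("data_scientist"):
    print(name, mobile)
```

Because the token is deterministic, the same phone number always maps to the same token, so joins and group-bys on the tokenized column still work for the restricted role.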
Questions ?? Page 25