Fast and Secure Laptop Backups with Encrypted De-duplication
Le Zhang <zhang.le@ed.ac.uk>
Paul Anderson <dcspaul@ed.ac.uk>
University of Edinburgh - LISA 2010
Laptop Backup Options: External Hard Drive
No offsite storage? What if I have a break-in? Or there is a fire?
I need a very large capacity to handle archival storage as well ...
Laptop Backup Options: Recordable CD/DVD
DVDs are only small - I can only back up subsets of files ...
I have to make multiple copies if I want offsite storage ...
Laptop Backup Options: Cloud Storage
Broadband upload speeds are slow - 30 days to upload 300 GB to cloud storage is typical ...
Often, there is a transfer cost as well as a storage cost ...
Laptop Backup Options
• External Hard Drive
• Recordable CD/DVD
• Cloud Storage
What do people do?
[Pie chart of survey responses: Store no vital data, Regular full backups, Partial backups, Keep copy on University machine, Don't do backups, Don't use laptop - 33%, 25%, 16%, 11%, 11%, 5%]
When people bother keeping backups, they are mostly ad-hoc - and usually only involve hand-selected subsets
What kind of data?
[Pie chart of data breakdown: User files, Applications, System files - 63%, 29%, 8%]
Perhaps a lot of the system files and application files (at least) are common?
From our sample of academic Mac laptop users
Shared Data
It seems like there is a good deal of duplication among the system and application files, and this increases with the number of machines added.
But it is interesting that a good many files are not common! So is it a good idea not to back up these categories?
[Charts: SYS Storage Saving and APP Storage Saving - actual storage vs saved storage (TB) against number of machines added]
Shared Data
Obviously, there is less sharing among the user data - but the overall saving is still significant.
And we might expect a higher degree of sharing among the user data for different communities - for example, common music files would make a big difference ...
[Charts: USR Storage Saving and Overall Storage Saving - actual storage vs saved storage (TB) against number of machines added]
Deduplication
“Deduplication” is becoming very popular for saving space when storing multiple copies of the same file
• A “hash” (digital signature) is generated from the contents of the file
• Two files with the same content will have the same hash
• Two files with different contents have a very high chance of having different hashes
• Use the hash as the name of the stored file
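As an illustration, here is a minimal Python sketch of hash-named storage; the talk does not name a particular hash function or store layout, so SHA-256 and the store location below are assumptions:

```python
import hashlib
import shutil
from pathlib import Path

STORE = Path("backup-store")          # hypothetical local object store

def store_file(path: Path) -> str:
    """Store a file under the hash of its contents.
    A second file with identical contents maps to the same name,
    so it is stored only once (file-level deduplication)."""
    STORE.mkdir(parents=True, exist_ok=True)
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    digest = h.hexdigest()
    dest = STORE / digest
    if not dest.exists():             # duplicate content: nothing new to store
        shutil.copyfile(path, dest)
    return digest                     # the hash is the stored file's name
```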
Block sizes
Deduplicating at the block level is more efficient than the file level. What is an appropriate block size?
[Charts: file size distribution (log10 scale); (a) data duplication rate vs block size, (b) actual storage needed vs block size, (c) number of backup objects (all vs stored) vs block size, for 128K, 256K, 512K, 1024K and whole-file blocks]
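A minimal sketch of fixed-size block chunking, the simplest way to move deduplication from whole files to blocks; the 1024K block size is just one of the sizes compared above, and SHA-256 is again an assumption:

```python
import hashlib

BLOCK_SIZE = 1024 * 1024              # 1024K, one of the block sizes compared above

def file_blocks(path):
    """Yield (hash, block) pairs for fixed-size blocks of a file,
    so deduplication can be done per block rather than per whole file."""
    with open(path, "rb") as f:
        while True:
            block = f.read(BLOCK_SIZE)
            if not block:
                break
            yield hashlib.sha256(block).hexdigest(), block
```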
Deduplication problems?
Most de-duplication systems work at the storage level. This has two problems in our application ...
• If the data is encrypted “at source” (with different keys) then the deduplication is defeated (the cipher text will be different)
• The full data still has to be transmitted to the “server” - and this transmission time is a more significant problem than the storage!
Convergent Encryption
“Convergent Encryption” neatly solves the first problem ...
• Files are encrypted using the hash of the data as the key
• Files containing the same data will encrypt to the same cipher text and hence deduplication continues to work
• File owners will have the key (because they originally had the data) and will be able to decrypt the data - others won’t
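The talk does not specify a cipher; the sketch below assumes AES-256 in CTR mode (via the Python cryptography package), with both the key and the nonce derived deterministically from the block contents so that identical plaintexts produce identical ciphertexts:

```python
import hashlib
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

def convergent_encrypt(data: bytes):
    """Encrypt a block with a key derived from its own contents, so that
    identical plaintexts produce identical ciphertexts and deduplication
    still works on the encrypted data."""
    key = hashlib.sha256(data).digest()                    # the content hash is the key
    nonce = hashlib.sha256(b"nonce" + key).digest()[:16]   # deterministic nonce (assumption)
    enc = Cipher(algorithms.AES(key), modes.CTR(nonce)).encryptor()
    return key, enc.update(data) + enc.finalize()

def convergent_decrypt(key: bytes, ciphertext: bytes) -> bytes:
    """Anyone who once had the data can re-derive the key and decrypt."""
    nonce = hashlib.sha256(b"nonce" + key).digest()[:16]
    dec = Cipher(algorithms.AES(key), modes.CTR(nonce)).decryptor()
    return dec.update(ciphertext) + dec.finalize()
```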
Managing keys
Each (unique) file now has a separate key which we need to manage
• Our solution creates a “data object” for each directory which contains the keys for the children, as well as their metadata
• The directory object is then encoded and stored in the same way as a normal file
• The user only has to record the key for the root object
• Entire duplicate subtrees can be detected (as sketched below)
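A sketch of how such a directory object might be built; the JSON layout, the metadata fields and the store_block helper (assumed to convergent-encrypt the data, store it, and return the block's key and stored name) are illustrative assumptions, not the authors' format:

```python
import json
import os

def store_directory(path, store_block):
    """Build a 'directory object' holding each child's key and metadata,
    then encrypt and store it exactly like an ordinary file.
    store_block(data) -> (key, name) is the assumed convergent-encrypt-and-store
    helper sketched earlier. Returns (key, name) for this directory's object;
    the user only has to remember the key of the root object."""
    entries = {}
    for child_name in sorted(os.listdir(path)):
        child = os.path.join(path, child_name)
        if os.path.isdir(child):
            key, name = store_directory(child, store_block)
        else:
            with open(child, "rb") as f:
                key, name = store_block(f.read())
        entries[child_name] = {
            "key": key.hex(),                  # child's decryption key lives in the parent object
            "object": name,
            "mode": os.stat(child).st_mode,    # illustrative metadata stored alongside the key
        }
    data = json.dumps(entries, sort_keys=True).encode()
    return store_block(data)                   # identical subtrees encode - and store - identically
```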
Avoiding Transmission
To avoid transmitting data which already exists on the server, we need to do the deduplication on the source system
Many services (eg. Amazon) don’t provide the necessary interfaces for the client to communicate directly
There are several approaches to this, depending on the specific application ...
• A private server
• A local “caching” server for a remote cloud service
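The exact client/server protocol is not given in the talk; a minimal sketch of source-side deduplication might look like this, where server.missing() and server.put() are hypothetical interfaces to whichever server variant is used:

```python
def backup_blocks(blocks, server):
    """Source-side deduplication: ask the server which hashes it already
    holds, then transmit only the missing blocks.
    server.missing(hashes) and server.put(hash, data) are hypothetical."""
    by_hash = {h: data for h, data in blocks}   # e.g. output of file_blocks()
    needed = server.missing(list(by_hash))      # one round trip for the whole hash list
    for h in needed:
        server.put(h, by_hash[h])               # only blocks the server lacks cross the network
    return len(needed), len(by_hash)
```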
A Prototype
A Mac OS X client
[Client pipeline: local FS events → changed files → disk meta update → Local Meta DB → list of files to back up → Backup Manager → data compression (optional) → symmetric encryption with key generated from block content → encrypted blocks → upload queue → upload threads → Backup Server; backup status updates flow back to the Local Meta DB]
A local (departmental, home) server which performs hash checking, authentication and high-speed caching before forwarding unique blocks to the cloud
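A rough sketch of the upload-queue part of such a client, using worker threads to drain a queue of encrypted blocks; all names here are illustrative, not taken from the prototype:

```python
import queue
import threading

upload_queue = queue.Queue()          # encrypted (hash, block) pairs waiting to be sent

def uploader(server, stop):
    """Upload worker: drain the queue and push blocks to the backup server
    (server.put() is the same hypothetical interface as above)."""
    while not (stop.is_set() and upload_queue.empty()):
        try:
            h, block = upload_queue.get(timeout=0.5)
        except queue.Empty:
            continue
        server.put(h, block)
        upload_queue.task_done()

def start_uploaders(server, n_threads=4):
    """Start a pool of upload threads; returns the stop event and the threads."""
    stop = threading.Event()
    threads = [threading.Thread(target=uploader, args=(server, stop), daemon=True)
               for _ in range(n_threads)]
    for t in threads:
        t.start()
    return stop, threads
```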
Where next? Performance depends heavily on the characteristics of the data itself, and the underlying network/storage (eg. latency) • We would like to study this more We would like to develop a production quality client, and investigate a possible service in a datacentre • we are looking for possible funding/partners