Safety First: New Features in Linux File & Storage Systems

  1. Safety First: New Features in Linux File & Storage Systems
     Ric Wheeler, rwheeler@redhat.com
     September 27, 2010

  2. Overview
     ● Why Care about Data Integrity
     ● Common Causes of Data Loss
     ● Data Loss Exposure Timeline
     ● Reliability Building Blocks
     ● Examples
     ● Grading our Progress
     ● Questions?

  3. Why Care about Data Integrity
     Linux is used to store critical personal data
     ● On Linux based desktops and laptops
     ● In Linux based devices like TiVo, cell phones or a home NAS storage device
     ● Remote backup providers use Linux systems to back up your personal data
     Linux servers are very common in corporate data centers
     ● Financial services
     ● Medical images
     ● Banks
     Linux is also used as the internal OS for appliances

  4. Linux has one IO & FS Stack
     One IO stack has to run on both critical and non-critical devices
     ● Data integrity is the only “safe” default configuration
     ● Sophisticated users can choose non-default options that might disable data reliability features
     Trade-offs need to be made with care (a mount-option sketch follows this slide)
     ● Some configurations can hurt data integrity (like ext3 writeback mode)
     ● Some expose data integrity issues (non-default support for write barriers)
     Where possible, the system should auto-configure these options
     ● Detection of a battery or UPS can allow mounting without write barriers
     ● Special tunings for SSDs and high-end arrays to avoid barrier operations
     ● Similar to the way we auto-configure features for other types of hardware
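
As a concrete illustration of the trade-off above, the sketch below mounts an ext3 file system with the integrity-oriented options, data=ordered plus write barriers, via the mount(2) system call. The device and mount point are hypothetical, chosen only for the example.

```c
/* Minimal sketch: mounting ext3 with the integrity-oriented options
 * discussed above. The device and mount point are hypothetical. */
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <sys/mount.h>

int main(void)
{
    /* data=ordered + barrier=1 keeps the journal and the drive's volatile
     * write cache consistent; data=writeback,barrier=0 trades that safety
     * for speed and should only be chosen deliberately. */
    const char *safe_opts = "data=ordered,barrier=1";

    if (mount("/dev/sdb1", "/mnt/data", "ext3", 0, safe_opts) != 0) {
        fprintf(stderr, "mount failed: %s\n", strerror(errno));
        return 1;
    }
    printf("mounted with options: %s\n", safe_opts);
    return 0;
}
```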

  5. Common Causes of Data Loss
     There are uncountable ways to lose data
     ● Honesty requires that vendors who sell you a storage device provide some disclaimer about when, not if, you will lose data
     Storage device manufacturers work hard to
     ● Track the causes of data loss or system unavailability in real, deployed systems
     ● Classify those instances into discrete buckets
     ● Carefully select the key issues that are most critical (or common) to fix first
     Tracking real-world issues and focusing on fixing high-priority issues requires really good logging and reporting from deployed systems
     ● Voluntary methods like the kernel-oops project
     ● Critical insight comes from being able to gather good data, do real analysis and then monitor how your fixes affect the deployed base

  6. User Errors
     Accidental destruction or deletion of data
     ● Overwriting good data with bad
     ● Restoring old backups over good data
     Service errors
     ● Replacing the wrong drive in a RAID array
     ● Not plugging in the UPS
     Losing track of where you put it
     ● Too many places to store a photo, MP3, etc.
     Misunderstanding or lack of knowledge about how to configure the system properly
     Developers need to take responsibility for providing easy-to-use systems in order to minimize this kind of error!

  7. Application Developer Errors
     Some application coders do not worry about data integrity to begin with
     ● Occasionally on purpose: speed is critical, the data is not (it can be rerun)
     ● Occasionally by ignorance
     ● Occasionally by mistake
     For application authors who do care, we make data integrity difficult by giving them poor documentation:
     ● rename: how many fsync() calls should we use when renaming a file?
     ● fsync: just the file? The file and its parent directory?
     ● Best practices change with the type of storage, the file system type and the file system journal mode
     Primitives need to be clearly understood and well documented (a safe-rename sketch follows this slide)
     ● Too many choices make it next to impossible for application developers to deliver reliable software
     “Mostly works” is not good enough for data integrity
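
To make the rename/fsync ambiguity concrete, here is one conservative pattern often suggested for replacing a file safely: write a temporary file, fsync it, rename it over the target, then fsync the parent directory. This is a sketch of that general approach, not a recipe from the talk; the helper name and arguments are hypothetical, and as the slide says, the exact requirements depend on the file system and journal mode.

```c
/* Sketch of one conservative "safe rename" pattern; exact requirements
 * vary with file system type and journal mode, as the slide notes. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int replace_file(const char *dir, const char *tmp, const char *final_path)
{
    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    if (write(fd, "new contents\n", 13) != 13 || /* the payload is a stand-in */
        fsync(fd) != 0) {                        /* push data to stable storage */
        close(fd);
        return -1;
    }
    if (close(fd) != 0)
        return -1;

    if (rename(tmp, final_path) != 0)            /* atomic swap of the name */
        return -1;

    int dfd = open(dir, O_RDONLY | O_DIRECTORY); /* persist the directory entry */
    if (dfd < 0)
        return -1;
    int rc = fsync(dfd);
    close(dfd);
    return rc;
}
```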

  8. OS & Configuration Errors
     Configuration errors
     ● Does the system have write barriers enabled if needed?
     Bugs in the IO stack
     ● Do we use write barriers correctly to flush volatile write caches?
     ● Do we properly return error codes so the application can handle failures? (see the error-checking sketch after this slide)
     ● Do we log the correct failure in /var/log/messages in a way that points the user to a precise component?
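
On the application side, returned error codes only help if they are checked. The sketch below (hypothetical helper and path) checks every step of a small write path and logs failures through syslog(3) so that they land in /var/log/messages alongside the kernel's own report.

```c
/* Sketch: check every step of a small write path and log failures via
 * syslog(3). The path and helper name are hypothetical. */
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <syslog.h>
#include <unistd.h>

int write_record(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0)
        goto fail;

    if (write(fd, buf, len) != (ssize_t)len)
        goto fail_close;
    if (fsync(fd) != 0)       /* EIO here often means data never reached media */
        goto fail_close;
    if (close(fd) != 0)
        goto fail;
    return 0;

fail_close: {
        int saved = errno;    /* do not let close() clobber the real error */
        close(fd);
        errno = saved;
    }
fail:
    syslog(LOG_ERR, "write_record(%s) failed: %s", path, strerror(errno));
    return -1;
}
```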

  9. Hardware Failures
     ● Hard disk failures
     ● Power supply failures
     ● DRAM failures
     ● Cable failures

  10. Disasters
      Name your favorite disaster
      ● Fire, flood, blizzards...
      ● Power outage
      ● Terrorism
      Eliminating single points of failure requires that any site have a way to keep a copy at some secondary location
      ● Remote data mirroring can be done at the file level or the block level
      ● Backup, with off-site storage of the backup images
      ● Buying expensive hardware support for remote replication, etc.

  11. Data Loss Exposure Timeline
      Rough timeline:
      ● State 0: Data creation
      ● State 1: Stored to persistent storage
      ● State 2: Component failure in your system
      ● State 3: Detection of the failed component
      ● State 4: Data repair started
      ● State 5: Data repair completed
      Minimizing the time spent out of State 1 is what storage designers lose sleep over!

  12. What is the expected frequency of disk failure?
      Hard failures
      ● Total disk failure
      ● Read or write failure
      ● Usually detected instantly
      Soft failures
      ● Can happen at any time
      ● Detection usually requires scrubbing or scanning the storage
      ● Unfortunately, they can be discovered during a RAID rebuild
      Note that this is not just a rotating-disk issue
      ● SSDs wear out, paper and ink fade, CDs deteriorate, etc.

  13. How long does it take you to detect a failure?
      Does your storage system detect latent errors in
      ● Hours? Days? Weeks?
      ● Only when you try to read the data back?
      Most storage systems do several levels of data scanning and scrubbing
      ● Periodic reads (proper read or read_verify commands) to ensure that any latent errors are detected
      ● File servers and object-based storage systems can, for example, do whole-file reads and compare the data to a digital hash (a sketch follows this slide)
      ● Balance is needed between frequent scanning/scrubbing and system performance for its normal workload
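
A minimal user-space sketch of the whole-file scrub idea: read every byte back and compare a checksum against a previously stored value. The expected checksum and the use of zlib's CRC32 are assumptions for illustration; a production scrubber would normally use a stronger digest and throttle its I/O.

```c
/* Sketch of a user-space scrub pass: read every byte of a file and
 * compare a CRC32 against a previously stored value (hypothetical). */
#include <stdio.h>
#include <zlib.h>

int scrub_file(const char *path, unsigned long expected_crc)
{
    FILE *f = fopen(path, "rb");
    if (!f)
        return -1;

    unsigned char buf[64 * 1024];
    uLong crc = crc32(0L, Z_NULL, 0);
    size_t n;
    while ((n = fread(buf, 1, sizeof(buf), f)) > 0)
        crc = crc32(crc, buf, n);

    int read_error = ferror(f);       /* a latent sector error shows up here */
    fclose(f);

    if (read_error || crc != expected_crc)
        return -1;                    /* trigger repair from a good replica */
    return 0;
}
```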

  14. How long does it take you to repair a failure?
      Repair the broken storage physically
      ● Rewrite a few bad sectors (multiple seconds)?
      ● Replace a broken drive and rebuild the RAID group (multiple hours)?
      Can we repair the damage done to the file system?
      ● Are any files present but damaged?
      ● Do I need to run fsck?
      ● Very useful to be able to map an IO error back to a user file, metadata or unallocated space
      Repair the logical structure of the file system metadata
      ● Fsck can take hours or days
      ● Restore any lost data from backup
      Users like to be able to verify file system integrity after a repair
      ● Are all of my files still on the system?
      ● Is the data in those files intact and unchanged?
      ● Can you tell me precisely what I lost?

  15. Exposure to Permanent Data Loss
      Your exposure is a combination of the factors described:
      ● Robustness of the storage system
      ● Rate of failure of components
      ● Time to detect the failure
      ● Time to repair the physical media
      ● Time to repair the file system metadata (fsck)
      ● Time to summarize for the user any permanent loss
      If the time required to detect and repair is longer than the time between failures, you will lose data! (A rough model follows this slide.)
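
A rough way to put a number on that closing claim, assuming independent failures at a constant rate (an assumption for illustration, not a formula from the talk):

```latex
% Back-of-the-envelope sketch, assuming independent failures at rate 1/MTBF.
% T_detect and T_repair are the times discussed on slides 13 and 14.
P(\text{second failure while exposed})
  \;=\; 1 - e^{-(T_{\mathrm{detect}} + T_{\mathrm{repair}})/\mathrm{MTBF}}
  \;\approx\; \frac{T_{\mathrm{detect}} + T_{\mathrm{repair}}}{\mathrm{MTBF}}
  \qquad \text{when } T_{\mathrm{detect}} + T_{\mathrm{repair}} \ll \mathrm{MTBF}.
```

Shrinking either the detection time or the repair time shrinks the exposure window, which is exactly why scrubbing frequency and rebuild/fsck speed matter.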

  16. Storage Downtime without Data Loss Counts
      Unavailable data loses money
      ● Banking transactions
      ● Online transactions
      Data unavailability can be truly mission critical
      ● X-rays are digital and used in operations
      ● Digital maps used in search
      Horrible performance during a repair is downtime in many cases
      ● Minimizing repair time minimizes this loss as well

  17. How Many Concurrent Failures Can Your Data Survive?
      Protection against failure is expensive, in
      ● Storage system performance
      ● Utilized capacity
      ● Extra costs for hardware, power and cooling for less efficient storage systems
      A single drive can survive soft failures
      ● A single disk is 100% efficient
      RAID5 can survive 1 hard failure plus soft failures
      ● RAID5 with 5 data disks and 1 parity disk is 83% efficient
      RAID6 can survive 2 hard failures plus soft failures
      ● RAID6 with 4 data disks and 2 parity disks is only 66% efficient! (the arithmetic is spelled out after this slide)
      Fancier schemes (erasure coding) can survive many failures
      ● Any “k” drives out of “n” are sufficient to recover the data
      ● Popular in cloud and object storage systems
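
The efficiency percentages on this slide are just the ratio of data disks to total disks; written out (the slide rounds 4/6 down to 66%):

```latex
% Usable-capacity efficiency with k data disks and m redundancy disks.
\text{efficiency} \;=\; \frac{k}{k+m}
\qquad
\text{single disk: } \tfrac{1}{1+0} = 100\%, \quad
\text{RAID5: } \tfrac{5}{5+1} \approx 83\%, \quad
\text{RAID6: } \tfrac{4}{4+2} \approx 67\%.
```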

  18. Example: MD RAID5 & EXT3
      RAID5 gives us the ability to survive 1 hard failure
      ● Any second, soft failure during the RAID rebuild can cause data loss, since we need to read each sector of all the other disks during the rebuild
      ● Rebuild can begin only when we have a new or spare drive to use for the rebuild
      Concurrent hard drive failures in a RAID group are rare...
      ● ...but detecting latent (soft) errors during rebuild is increasingly common!
      MD has the ability to “check” RAID members on demand (a sketch follows this slide)
      ● Useful to be able to de-prioritize this background scan
      ● Should run once every 2 to 4 weeks
      RAID rebuild times are linear with drive size
      ● Can run up to 1 day for a healthy set of disk drives
      EXT3 fsck times can run long
      ● A 1TB file system fsck with 45 million files ran 1 hour (reports from the field of run times up to 1 week!)
      ● Hard (but not impossible) to map bad sectors back to user files using ncheck/icheck
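
The on-demand MD "check" is normally started through the md sysfs interface, equivalent to echoing check into /sys/block/<array>/md/sync_action. A small sketch in C, assuming an array named md0:

```c
/* Sketch: kick off an MD consistency check ("scrub") on demand.
 * The array name (md0) is an assumption; adjust for your system.
 * Equivalent to: echo check > /sys/block/md0/md/sync_action */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/sys/block/md0/md/sync_action";
    int fd = open(path, O_WRONLY);
    if (fd < 0) {
        fprintf(stderr, "open %s: %s\n", path, strerror(errno));
        return 1;
    }
    if (write(fd, "check", 5) != 5) {
        fprintf(stderr, "write: %s\n", strerror(errno));
        close(fd);
        return 1;
    }
    close(fd);
    puts("background check started; progress is reported in /proc/mdstat");
    return 0;
}
```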
