Safety First: New Features in Linux File & Storage Systems
Ric Wheeler
rwheeler@redhat.com
September 27, 2010
Overview
● Why Care about Data Integrity
● Common Causes of Data Loss
● Data Loss Exposure Timeline
● Reliability Building Blocks
● Examples
● Grading our Progress
● Questions?
Why Care about Data Integrity
Linux is used to store critical personal data
● On Linux-based desktops and laptops
● In Linux-based devices like TiVo boxes, cell phones, or home NAS storage devices
● Remote backup providers use Linux systems to back up your personal data
Linux servers are very common in corporate data centers
● Financial services
● Medical imaging
● Banking
Linux is also used as the internal OS for appliances
Linux has one IO & FS Stack
One IO stack has to run on both critical and non-critical devices
● Data integrity is the only “safe” default configuration
● Sophisticated users can choose non-default options that might disable data reliability features
Trade-offs need to be made with care
● Some configurations can hurt data integrity (like ext3 writeback mode)
● Some expose data integrity issues (non-default support for write barriers); a minimal mount sketch follows below
Where possible, the system should auto-configure these options
● Detection of a battery or UPS can allow for mounting without write barriers
● Special tunings for SSDs and high-end arrays to avoid barrier operations
● Similar to the way we auto-configure features for other types of hardware
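A minimal sketch of the barrier option above, assuming an ext3 file system on a hypothetical device /dev/sdb1 mounted at /mnt/data: the last argument of the mount(2) system call carries file-system-specific options, and "barrier=1" turns write barriers on explicitly rather than relying on the non-default ext3 behavior described above.

/* Sketch: mount ext3 with write barriers explicitly enabled.
 * Device and mount point are hypothetical placeholders. */
#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
    /* The final argument passes file-system-specific options. */
    if (mount("/dev/sdb1", "/mnt/data", "ext3", 0, "barrier=1") != 0) {
        perror("mount");
        return 1;
    }
    printf("mounted with write barriers enabled\n");
    return 0;
}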
Common Causes of Data Loss
There are uncountable ways to lose data
● Honesty requires that vendors selling you a storage device provide some disclaimer about when – not if – you will lose data
Storage device manufacturers work hard to
● Track the causes of data loss or system unavailability in real, deployed systems
● Classify those instances into discrete buckets
● Carefully select the key issues that are most critical (or common) to fix first
Tracking real-world issues and focusing on fixing high-priority issues requires really good logging and reporting from deployed systems
● Voluntary methods like the kerneloops project
● Critical insight comes from being able to gather good data, do real analysis, and then monitor how your fixes impact the deployed base
User Errors
Accidental destruction or deletion of data
● Overwriting good data with bad
● Restoring old backups over good data
Service errors
● Replacing the wrong drive in a RAID array
● Not plugging in the UPS
Losing track of where you put it
● Too many places to store a photo, MP3, etc.
Misunderstanding or lack of knowledge about how to configure the system properly
Developers need to be responsible for providing easy-to-use systems in order to minimize this kind of error!
Application Developer Errors
Some application coders do not worry about data integrity to begin with
● Occasionally on purpose – speed is critical, the data is not (it can be regenerated)
● Occasionally out of ignorance
● Occasionally by mistake
For application authors that do care, we make data integrity difficult by giving them poor documentation:
● rename: how many fsync() calls should we use when renaming a file?
● fsync: just the file? The file and its parent directory? (one common answer is sketched below)
● Best practices change with the type of storage, the file system type, and the file system journal mode
Primitives need to be clearly understood & well documented
● Too many choices make it next to impossible for application developers to deliver reliable software
“Mostly works” is not good enough for data integrity
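As a sketch of the rename/fsync question above, here is one commonly recommended pattern, with hypothetical file names: write the new contents to a temporary file, fsync it, rename it over the target, then fsync the parent directory so the rename itself is durable. Whether every step is strictly required has historically varied with file system and journal mode, which is exactly the documentation gap described above.

/* Sketch: durable file replacement via write + fsync + rename +
 * directory fsync. Paths are caller-supplied placeholders. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int safe_replace(const char *dir, const char *tmp, const char *final,
                 const char *data, size_t len)
{
    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    /* Check every return code: write, fsync, and close can all fail. */
    if (write(fd, data, len) != (ssize_t)len || fsync(fd) != 0) {
        close(fd);
        return -1;
    }
    if (close(fd) != 0)
        return -1;

    /* Atomically replace the old file with the new contents. */
    if (rename(tmp, final) != 0)
        return -1;

    /* fsync the parent directory to make the rename itself durable. */
    int dfd = open(dir, O_RDONLY | O_DIRECTORY);
    if (dfd < 0)
        return -1;
    int rc = fsync(dfd);
    close(dfd);
    return rc;
}

A call like safe_replace("/home/user", "/home/user/.cfg.tmp", "/home/user/cfg", buf, n) would then replace the file without ever exposing a torn or empty version, under the assumptions noted above.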
OS & Configuration Errors
Configuration errors
● Does the system have the write barrier enabled if needed?
Bugs in the IO stack
● Do we use write barriers correctly to flush volatile write caches?
● Do we properly return error codes so the application can handle failures?
● Do we log the correct failure in /var/log/messages in a way that can point the user to a precise component?
Hardware Failures
● Hard disk failures
● Power supply failures
● DRAM failures
● Cable failures
Disasters
Name your favorite disaster
● Fire, flood, blizzards...
● Power outage
● Terrorism
Avoiding a single point of failure requires that any site have a method to keep a copy at some secondary location
● Remote data mirroring can be done at the file level or the block level
● Backup and storage of backup images off site
● Buying expensive hardware support for remote replication, etc.
Data Loss Exposure Timeline
Rough timeline:
● State 0: Data creation
● State 1: Stored to persistent storage
● State 2: Component failure in your system
● State 3: Detection of the failed component
● State 4: Data repair started
● State 5: Data repair completed
Minimizing the time spent out of State 1 is what storage designers lose sleep over!
What is the expected frequency of disk failure?
Hard failures
● Total disk failure
● Read or write failure
● Usually detected instantaneously
Soft failures
● Can happen at any time
● Detection usually requires scrubbing or scanning the storage
● Unfortunately, can be discovered during a RAID rebuild
Note that this is not just a rotating disk issue
● SSDs wear out, paper and ink fade, CDs deteriorate, etc.
How long does it take you to detect a failure?
Does your storage system detect latent errors in
● Hours? Days? Weeks?
● Only when you try to read the data back?
Most storage systems do several levels of data scanning and scrubbing
● Periodic reads (regular read or read_verify commands) to ensure that any latent errors are detected
● File servers and object-based storage systems can do whole-file reads and compare the data to a digital hash, for example (a minimal sketch follows below)
● Balance is needed between frequent scanning/scrubbing and system performance for the normal workload
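A minimal sketch of the file-level scrub described above: read a file back in full and compare a checksum against a previously stored value. A real file server would use a cryptographic digest; the FNV-1a hash here just keeps the sketch self-contained, and the path and expected value are caller-supplied.

/* Sketch: scrub one file by reading every byte and comparing a
 * checksum. Returns 0 on match, 1 on mismatch, -1 on read error. */
#include <stdint.h>
#include <stdio.h>

int scrub_file(const char *path, uint64_t expected)
{
    FILE *f = fopen(path, "rb");
    if (!f)
        return -1;

    uint64_t hash = 14695981039346656037ULL;  /* FNV-1a offset basis */
    unsigned char buf[65536];
    size_t n;

    /* Reading every byte forces the drive to surface latent errors. */
    while ((n = fread(buf, 1, sizeof(buf), f)) > 0)
        for (size_t i = 0; i < n; i++) {
            hash ^= buf[i];
            hash *= 1099511628211ULL;         /* FNV-1a prime */
        }

    int err = ferror(f);
    fclose(f);
    if (err)
        return -1;                            /* latent error detected */
    return hash == expected ? 0 : 1;
}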
How long does it take you to repair a failure?
Repair the broken storage physically
● Rewrite a few bad sectors (multiple seconds)?
● Replace a broken drive and rebuild the RAID group (multiple hours)?
Can we repair the damage done to the file system?
● Are any files present but damaged?
● Do I need to run fsck?
● Very useful to be able to map an IO error back to a user file, metadata, or unallocated space
Repair the logical structure of the file system metadata
● Fsck can take hours or days
● Restore any lost data from backup
Users like to be able to verify file system integrity after a repair
● Are all of my files still on the system?
● Is the data in those files intact and unchanged?
● Can you tell me precisely what I lost?
Exposure to Permanent Data Loss
A combination of the factors described:
● Robustness of the storage system
● Rate of failure of components
● Time to detect the failure
● Time to repair the physical media
● Time to repair the file system metadata (fsck)
● Time to summarize for the user any permanent loss
If the time required to detect and repair failures approaches the mean time between failures, you will lose data! (see the back-of-the-envelope formula below)
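A standard back-of-the-envelope approximation (not from the talk itself) makes this race concrete for RAID5: data is lost when a second drive in an N-drive group fails inside the repair window (MTTR) opened by the first failure, so the mean time to data loss is roughly

\[
\mathrm{MTTDL}_{\mathrm{RAID5}} \approx \frac{\mathrm{MTTF}^2}{N\,(N-1)\,\mathrm{MTTR}}
\]

For example, N = 6 drives with MTTF = 500{,}000 hours and a 24-hour detect-plus-rebuild window gives roughly \(500000^2 / (6 \cdot 5 \cdot 24) \approx 3.5 \times 10^8\) hours; every extra hour of detection, rebuild, or fsck time shrinks that figure linearly.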
Storage Downtime without Data Loss Counts
Unavailable data loses money
● Banking transactions
● Online transactions
Data unavailability can be truly mission critical
● X-rays are digital and used during operations
● Digital maps used in search
Horrible performance during repair is downtime in many cases
● Minimizing repair time minimizes this loss as well
How Many Concurrent Failures Can Your Data Survive?
Protection against failure is expensive
● Storage system performance
● Utilized capacity
● Extra costs for hardware, power, and cooling for less efficient storage systems
A single drive can survive soft failures
● A single disk is 100% efficient
RAID5 can survive 1 hard failure & soft failures
● RAID5 with 5 data disks and 1 parity disk is 83% efficient
RAID6 can survive 2 hard failures & soft failures
● RAID6 with 4 data disks and 2 parity disks is only 66% efficient!
Fancy schemes (erasure coding) can survive many failures
● Any “k” drives out of “n” are sufficient to recover the data (see the efficiency formula below)
● Popular in cloud and object storage systems
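The efficiency figures above all follow from one small formula: a scheme that stores k disks' worth of data on n total disks has

\[
\text{efficiency} = \frac{k}{n}, \qquad \text{e.g. RAID5: } \tfrac{5}{6} \approx 83\%, \quad \text{RAID6: } \tfrac{4}{6} \approx 66.7\%
\]

Erasure-coded systems choose k and n freely, trading efficiency against the n − k concurrent failures they can survive.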
Example: MD RAID5 & EXT3
RAID5 gives us the ability to survive 1 hard failure
● Any second (soft) failure during the RAID rebuild can cause data loss, since rebuild must read every sector of all the other disks
● Rebuild can begin only when we have a new or spare drive to use
Concurrent hard drive failures in a RAID group are rare
● ... but detecting latent (soft) errors during rebuild is increasingly common!
MD has the ability to “check” RAID members on demand (a sketch of triggering this follows below)
● Useful to be able to de-prioritize this background scan
● Should run once every 2 to 4 weeks
RAID rebuild times are linear with drive size
● Can run up to 1 day for a healthy set of disk drives
EXT3 fsck times can run a long time
● A 1TB file system fsck with 45 million files ran 1 hour (reports from the field of run times up to 1 week!)
● Hard (though not impossible) to map bad sectors back to user files using ncheck/icheck
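A minimal sketch of triggering that on-demand MD check, assuming a hypothetical array md0: the MD driver exposes a sync_action file in sysfs, and writing "check" to it starts a background read-and-verify scrub of all members (progress appears in /proc/mdstat). In practice this is usually driven from a distribution's periodic script rather than a one-off program like this.

/* Sketch: start a background consistency check of /dev/md0 by
 * writing to its sysfs sync_action file (requires root). */
#include <stdio.h>

int main(void)
{
    const char *path = "/sys/block/md0/md/sync_action";
    FILE *f = fopen(path, "w");
    if (!f) {
        perror(path);
        return 1;
    }
    /* Valid actions include "check" (read and verify all members)
       and "repair"; fclose flushes the buffered write to sysfs. */
    if (fputs("check", f) == EOF || fclose(f) == EOF) {
        perror("writing sync_action");
        return 1;
    }
    return 0;
}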