INFORMATION TECHNOLOGY SYSTEMS AND SERVICES
STANFORD UNIVERSITY

Backing Up AFS Using TSM

Xueshan Feng
Stanford University, March 24th, 2004

ABSTRACT

AFS is Stanford's enterprise file system. It stores 2.5 TB of data and serves roughly 40,000 individual users, 3,400 classes, 1,600 departments and groups, and thousands of other applications and systems on campus. A well-designed backup system should allow us to back up data and restore it should the original data be lost. This document presents the design and implementation of Stanford's AFS backup system, which is built on a vendor backup management product, TSM. The same product is used campus-wide to back up administrative applications and desktops through a common infrastructure.

BACKUP REQUIREMENTS

Our requirements for backup are simple: We should be able to back up AFS data and restore the data back into AFS. We want to preserve AFS access control lists and have the flexibility to restore an entire volume as well as a single file. Backup and restore should be automated as much as possible. File restoration should not require manual tape handling.

IMPLEMENTATION

In 1999 we started working on a backup project to replace the old AFS backup system built around Legato software. At that time our backups and restores were not reliable; the operation relied heavily on staff intervention; file restores could take days and many hours of staff time; and the system did not scale well as AFS usage increased over the years. We selected IBM’s ADSM product as our AFS backup solution, in line with the campus backup systems used for the large administrative applications. ADSM was later combined with the Tivoli product line, which IBM had acquired; the product is now called “Tivoli Storage Manager” (TSM). The Stanford AFS backup implementation using TSM is described below.

• Hardware

The hardware for the Stanford AFS backup system consists of:

- Two AIX RS6000 H50 servers, each with mirrored system disks and 3 GB of memory.
- One IBM 3494 automated tape library, accessible from the network, with 60 TB of "in-shelf" capacity.
- About 200 GB of EMC CLARiion disks used for the TSM database, event logs, and data staging spool.

The disks are part of the campus SAN disk storage infrastructure and are accessible from the backup servers via a fiber channel card. Data is first backed up to the EMC disk backup spool and is then moved to the secondary tape pool when convenient. The data usually remains in the disk backup spool for 24 hours, which allows faster restores.

OpenAFS Best Practices Workshop, Stanford, California, March 24-26, 2004

• Software

We use Tivoli's TSM Unix backup-archive client to back up AFS volumes. The AFS component of the product allows file-level (instead of volume-level) backups and restores. AFS ACLs are also backed up and can be restored with the file. The file-level backup treats the AFS system just like a Unix file system, so we can restore a single file, as well as whole volumes for quicker recovery.

• AFS Backup Procedures

Backing up more than 55,000 AFS volumes would take days without optimizing and parallelizing the processes. Our challenge was to perform the backups within the off-peak processing window. Tivoli's AFS backup software processes one volume at a time. To complete the daily backup within a limited time frame, a few extra processes were put into place:

1. A nightly job generates an AFS volume inventory, which contains information such as volume name, volume id, last modification time, and last AFS backup volume creation time. This information is stored in an Oracle database. The AFS backup volume is an on-line read-only snapshot and is available for immediate restores when the lost data is less than 24 hours old. The read-only AFS backup volumes, rather than the read-write volumes, are used for tape backups, so that files that are changing can still be backed up through their read-only copy.
2. A nightly script queries the AFS inventory database to generate lists of AFS volumes that are candidates for TSM backup. The lists contain volume name, modification time, and backup time. The lists are stored in AFS and are accessible from the TSM backup servers.
3. If there are volumes we do not wish to back up, we add a suffix of “.nb” to those volume names. This reduces unneeded backup processing.
4. A TSM job on the backup servers reads the list of volumes and processes them for backup. This job is a script written internally.
It compares the volume modification time with the last TSM backup time for the volume: a volume is backed up only if its modification time is newer than the TSM backup time-stamp. The script also schedules multiple concurrent TSM AFS backup processes to increase throughput. The last backup time and modification time are stored in a database shared among the multiple jobs. The last TSM backup database is updated when a backup job is done.

Miscellaneous Issues

• AFS Backup Policy

Backup policy should factor in the cost of keeping multiple copies, the need for multiple versions, and the organization's data retention policy. Stanford’s current policy for AFS is to keep 365 backup versions. Once a volume is deleted, its backups are kept for another 365 days before being removed from the backup system entirely.

• ACL issues

The backup job usually runs TSM processes as the “root” user, which does not have permission to read AFS files. To resolve this issue, we created a “system:backup” PTS group, which contains one user called "backup". We did a one-time "sweep" to add the “system:backup read” permission to all volumes in our cell. We also wrapped the AFS "fs" command so that people cannot remove system:backup unintentionally. The root backup job runs as the “backup” user principal, so it has read access to AFS volumes.
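
The volume selection rule from step 4 of the backup procedures, together with the “.nb” convention from step 3, can be sketched in a few lines of Python. This is only an illustration and not the internal Stanford script: the names `Volume`, `needs_backup`, `select_candidates`, and `run_backups` are hypothetical, a plain dictionary stands in for the shared last-backup database, and `backup_one` is a caller-supplied placeholder for whatever starts a TSM backup of one volume. Treating a volume with no recorded backup as a candidate, and recording the time-stamp only after a successful job, are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class Volume:
    name: str
    modified: int  # last modification time, seconds since the epoch


def needs_backup(vol, last_tsm_backup):
    """Return True when a volume should be handed to a TSM backup process."""
    # Volumes renamed with the ".nb" suffix are never backed up.
    if vol.name.endswith(".nb"):
        return False
    last = last_tsm_backup.get(vol.name)
    # Assumption: a volume with no recorded backup is always a candidate.
    return last is None or vol.modified > last


def select_candidates(volumes, last_tsm_backup):
    """Keep only volumes modified since their last recorded TSM backup."""
    return [v for v in volumes if needs_backup(v, last_tsm_backup)]


def run_backups(candidates, backup_one, last_tsm_backup, now, workers=4):
    """Run backups concurrently and return the names of volumes that failed.

    backup_one(volume) is a hypothetical callable that returns True on success.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(backup_one, candidates))
    failed = []
    for vol, ok in zip(candidates, results):
        if ok:
            # Assumption: the time-stamp is recorded only after success.
            last_tsm_backup[vol.name] = now
        else:
            failed.append(vol.name)
    return failed
```

A volume whose modification time equals its recorded backup time is skipped, which matches the "newer than" rule above. Leaving the time-stamp untouched for a failed volume means it is selected again on the next run.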

• Exclude data you don't want to back up

TSM has a configuration file that allows you to include or exclude specific files from the backup. You can use this file to exclude items such as sensitive data, core-dump files, and the AFS cache.

• Mount points

You do not want the backup to cross AFS mount points, which may create loops if an AFS volume contains a mount point that points to itself. By default, TSM does not cross AFS mount points. You can override this by changing the TSM configuration file; if you do, when TSM comes across an AFS mount point, it only copies files into that volume’s backup space.

• Reporting

TSM stores backup information in its own DB2-like database. It does not come with ready-made reporting at the script level. We wrote our own report script to summarize how many volumes are backed up each day and the size of the backups. A typical backup report looks like this:

    BACKUP REPORT ON rescue2-a
    Fri Mar 19 14:32:30 PST 2004, up 147 days.
    -------------------------------------------------------------------------
    AFS VOLUME BACKUP REPORT
    -------------------------------------------------------------------------
    Server name: rescue2
    Total number of objects inspected:   3,612,499
    Total number of objects backed up:   52,606
    Total number of objects updated:     180,362
    Total number of objects deleted:     0
    Total number of objects failed:      950
    Total number of bytes transferred:   21,318MB
    Total AFS volumes processed:         2178
    Volumes backed up: users.[a-l], data.*

CONCLUSIONS

Thanks to concurrent processing and the ability to select only the volumes that require backup, TSM currently processes only 10% of the 55,000 AFS volumes each day. The daily backup averages 50 GB and finishes within a few hours. Staff intervention is only needed for moving tapes to off-site storage, or for loading new and reused tapes into the tape library.
Restoring files can be done from the command line without having to physically locate the tapes: TSM keeps track of where the files are stored, and the tape library mounts the tapes as needed for file restoration.

We have been using the TSM backup system for more than 5 years. During this period, AFS disk usage grew from less than 700 GB to 2.6 TB, and the system has scaled very well. Although TSM might be expensive, it has significantly simplified the backup and restore process, one of the most important operations in any IT service organization.