Semantic Data Placement for Power Management in Archival Storage

Avani Wildani & Ethan L. Miller
Storage Systems Research Center
Center for Research in Intelligent Storage
University of California, Santa Cruz

November 15, 2010
What is archival data?

• Tape back-ups
• Compliance records
  • Sarbanes-Oxley
  • Government correspondence
• Abandoned experimental data
• Outdated media
• “Filed” documents
• Vital records
Mission

• Save power in archival systems
  • Disks incur the highest power cost in a datacenter
  • As disks get faster, power grows roughly as the square of rotational speed
• We can save power by reducing the number of spin-ups in archival systems
  • Spin-ups can consume ~25x the power of idling
  • Spin-ups reduce device lifetime
Saving power

• Power management in archival storage typically relies on having few reads
  • Modern, crawled archives can't make this assumption
• Steady workload types can be exploited
  • A 30% hit rate gives ≥10% power savings
  • Hits: reads that happen on spinning disks
“Archival by accident”

• Hundreds of exabytes of data are created annually
  • Flickr, blogs, YouTube, ...
• “Write once / read maybe” may not hold
  • Search indexers
  • Working set changes
• The web has archival characteristics
  • Top 10 websites account for 40% of accesses*
  • Drop-off is exponential, not a long tail
• Much data becomes archival by accident

* The Long Tail Internet Myth: Top 10 domains aren’t shrinking (2006)
  http://blog.compete.com/2006/12/19/long-%20tail-chris-anderson-top-10-domains
Big Idea

• Fragmentation on a disk causes a significant drop in performance
• “Fragmentation” of a group of files that tend to be accessed together across a large storage system is similarly bad
• Defragmentation is hard, but we should at least try to append onto groups where we can!
Overview of our method

1. The storage system is divided into access groups
2. Files likely to be accessed together are placed into the same access group
3. When a file in an access group is accessed:
   3.1. Its disks are spun up
   3.2. The disks are left on for a period of time t to catch subsequent accesses
• Goal: save power by avoiding repeated spin-ups (see the sketch below)
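A minimal simulation of this spin-up policy, assuming reads arrive as (timestamp, group_id) pairs. The trace format, the `simulate` helper, and the `IDLE_WINDOW` constant are illustrative stand-ins, not details from the talk.

```python
# Minimal sketch of the spin-up policy described above.
IDLE_WINDOW = 50  # seconds to keep a group's disks spinning after an access

def simulate(trace, idle_window=IDLE_WINDOW):
    """Count spin-ups and hits for a trace of (timestamp, group_id) reads."""
    spin_down_at = {}   # group_id -> time when its disks spin back down
    spinups = hits = 0
    for t, group in sorted(trace):
        if t < spin_down_at.get(group, float("-inf")):
            hits += 1               # disks already spinning: a "hit"
        else:
            spinups += 1            # disks were off: pay a spin-up
        spin_down_at[group] = t + idle_window  # extend the spin window
    return spinups, hits

if __name__ == "__main__":
    trace = [(0, "A"), (10, "A"), (30, "B"), (120, "A")]
    print(simulate(trace))  # -> (3, 1): three spin-ups, one hit
```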
System design

• Index server:
  • Classification
  • Cache
• Disks:
  • MAID semantics: usually off
  • Logically arranged into access groups
  • Parity is computed over an access group
System design: bootstrap

• Start with a set of data
• Index servers split the data into groups
  • Assumption: classifications will last for the system lifetime
  • O(n³)
  • Cheaper, linear methods exist, but this only has to be done once!
• Stripe data onto access groups (sketch below)
• Parity is determined by the total desired system cost
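A sketch of the bootstrap grouping step, assuming files are already described by numeric metadata feature vectors. k-means is a cheap stand-in for the index servers' classifier; the talk does not commit to a specific algorithm, and the synthetic feature matrix is purely illustrative.

```python
# Hedged sketch: cluster per-file feature vectors into access groups.
import numpy as np
from sklearn.cluster import KMeans

def bootstrap_groups(features, n_groups):
    """features: (n_files, n_features) array -> per-file group labels."""
    km = KMeans(n_clusters=n_groups, n_init=10, random_state=0)
    return km.fit_predict(features)

rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 8))   # hypothetical metadata vectors
labels = bootstrap_groups(features, n_groups=20)
# Each label now names the access group (and disk stripe) a file lands on.
```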
System design: writes

• Writes are batched by default
  • A file is written at the next spin-up
  • Sooner, if the write cache fills (sketch below)
• If a file group is full, split it
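A sketch of the batched write path. The `WRITE_CACHE_LIMIT` threshold and the omitted device-level flush are hypothetical; the slide only specifies that writes wait for a spin-up unless the cache fills.

```python
# Sketch of write batching: buffer writes per group, flush a group's
# batch when its disks spin up, or flush everything if the cache fills.
from collections import defaultdict

WRITE_CACHE_LIMIT = 1024   # illustrative cap on buffered writes

class WriteBuffer:
    def __init__(self):
        self.pending = defaultdict(list)   # group_id -> buffered writes
        self.count = 0

    def write(self, group, data):
        self.pending[group].append(data)
        self.count += 1
        if self.count >= WRITE_CACHE_LIMIT:
            self.flush_all()               # cache full: write out early

    def on_spin_up(self, group):
        self.flush_group(group)            # piggyback on a read spin-up

    def flush_group(self, group):
        batch = self.pending.pop(group, [])
        self.count -= len(batch)
        # ...hand `batch` to the device layer here (omitted)...

    def flush_all(self):
        for group in list(self.pending):
            self.flush_group(group)
```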
System design: reads

• The cache could be a simple LRU (sketch below)
• If a file group is spinning, extend the spin time
  • Catches subsequent accesses
  • Power is wasted if there are no subsequent accesses
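Since the slide only says the cache "could be a simple LRU", here is a generic LRU sketch; the capacity and interface are illustrative, not the system's actual cache.

```python
# A generic LRU read cache; hits here never touch the disk groups at all.
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.store = OrderedDict()

    def get(self, key):
        if key not in self.store:
            return None                    # miss: caller goes to disk
        self.store.move_to_end(key)        # mark as most recently used
        return self.store[key]

    def put(self, key, value):
        self.store[key] = value
        self.store.move_to_end(key)
        if len(self.store) > self.capacity:
            self.store.popitem(last=False) # evict least recently used
```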
Splitting an access group

• Access groups grow as files are added
• Large access groups lower the power gain: split them! (sketch below)
  • Large access groups are marked for splitting
  • Wait for the next spin-up
• Groups too small to sub-classify
  • Split randomly
  • Could potentially use an existing split (e.g., path hierarchy)
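A sketch of the split decision. `MAX_GROUP_SIZE` is an illustrative threshold, and `sub_classify` is a hypothetical classifier; when it cannot distinguish the files, the fallback is the random split the slide describes.

```python
# Hedged sketch of the split policy for an oversized access group.
import random

MAX_GROUP_SIZE = 10_000    # illustrative: beyond this, power gain drops

def maybe_split(group_files, sub_classify):
    """Return one or two groups. `sub_classify` is a hypothetical
    classifier returning sub-groups (possibly a single group when it
    cannot meaningfully separate the files)."""
    if len(group_files) <= MAX_GROUP_SIZE:
        return [group_files]                 # small enough: leave alone
    parts = sub_classify(group_files)
    if len(parts) >= 2:
        return parts                         # semantic sub-grouping worked
    shuffled = random.sample(group_files, len(group_files))
    mid = len(shuffled) // 2
    return [shuffled[:mid], shuffled[mid:]]  # fallback: random split
```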
Selecting classification features

• Select features to classify with: type, creator, path
  • Frequently metadata
  • Use labels if provided
• Pick features with principal component analysis (sketch below)
  • “Which features matter most in differentiating groups of files?”
• Use expectation maximization:
  • Expectation: calculate the log likelihood for eigenvectors in the covariance matrix
  • Maximization: maximize over the expectations
  • Repeat the expectation step
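One plausible reading of this slide, sketched with off-the-shelf tools: rank features by their PCA loadings, then let a Gaussian mixture's EM loop handle the expectation/maximization refinement. The loading-weighting scheme and the mixture size are assumptions, not the authors' exact procedure.

```python
# Hedged sketch: PCA-based feature ranking plus an EM-fitted mixture.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

def rank_features(X, feature_names, n_components=2):
    """Rank features by their loadings on the top principal components."""
    pca = PCA(n_components=n_components).fit(X)
    # Weight each feature's loading by the variance its component explains.
    scores = np.abs(pca.components_.T) @ pca.explained_variance_ratio_
    order = np.argsort(scores)[::-1]
    return [(feature_names[i], float(scores[i])) for i in order]

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))              # hypothetical numeric metadata
print(rank_features(X, ["type", "creator", "path_depth"]))

# sklearn's GaussianMixture runs the E/M loop the slide outlines:
# the E-step computes per-point log likelihoods, the M-step re-fits
# the Gaussians, and the two alternate until convergence.
gm = GaussianMixture(n_components=4, random_state=0).fit(X)
```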
Classification

• Without history:
  • Blind source separation
  • tf-idf
• With history:
  • Hierarchical clustering (sketch below)
    • Make lots of small clusters and progressively combine them
  • Access prediction
    • Learn what is likely to be accessed together
    • Create a dynamic Bayesian network
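A small sketch of the with-history path: agglomerative clustering over a pairwise co-access matrix, merging small clusters into larger ones. The distance transform 1/(1 + co-accesses) is an illustrative choice, not from the talk.

```python
# Hedged sketch: hierarchical clustering on file co-access counts.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def group_by_coaccess(coaccess, n_groups):
    """coaccess: symmetric (n_files, n_files) co-access count matrix."""
    dist = 1.0 / (1.0 + coaccess)      # more co-accesses -> "closer" files
    np.fill_diagonal(dist, 0.0)
    Z = linkage(squareform(dist), method="average")
    return fcluster(Z, t=n_groups, criterion="maxclust")

co = np.array([[0, 5, 0, 0],
               [5, 0, 1, 0],
               [0, 1, 0, 4],
               [0, 0, 4, 0]], dtype=float)
print(group_by_coaccess(co, 2))   # files {0, 1} and {2, 3} group together
```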
Definitions

• Hit rate: % of reads that happen on spinning disks
• Singletons: % of reads that result in a spin-up with no subsequent hits within t = 50 seconds
• Power saved: % of power saved vs. paying one spin-up cost for every read

(A sketch computing all three metrics follows.)
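These three metrics can be computed directly from a (timestamp, group) read trace. This back-of-envelope sketch charges one unit per spin-up and ignores the idle power spent keeping disks spinning, so its power-saved figure is optimistic relative to a full power model.

```python
# Back-of-envelope metrics from a (timestamp, group) read trace.
SINGLETON_WINDOW = 50   # seconds: the t from the singleton definition

def metrics(trace, window=SINGLETON_WINDOW):
    spin_down_at = {}       # group -> time its disks spin back down
    last_spinup_hit = {}    # group -> did its latest spin-up catch a hit?
    reads = spinups = hits = singletons = 0
    for t, g in sorted(trace):
        reads += 1
        if t < spin_down_at.get(g, float("-inf")):
            hits += 1
            last_spinup_hit[g] = True
        else:
            spinups += 1
            if last_spinup_hit.get(g) is False:
                singletons += 1            # previous spin-up went unused
            last_spinup_hit[g] = False
        spin_down_at[g] = t + window
    singletons += sum(not hit for hit in last_spinup_hit.values())
    return {
        "hit_rate": hits / reads,
        "singleton_rate": singletons / reads,
        # Baseline pays one spin-up per read; grouping pays per spin-up.
        "power_saved": 1.0 - spinups / reads,
    }

print(metrics([(0, "A"), (10, "A"), (30, "B"), (120, "A")]))
```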
Data sets

• Web access logs for a water management database (DWR)
  • ~90,000 accesses from 2007–2009
  • 2.3 GB dataset
  • Accesses come pre-labeled with features
    • E.g., Site, Site Type, District
• Washington State records (WA)
  • ~5,000,000 accesses from 2007–2010
  • Accesses are for retrieved records
  • 16.5 TB dataset
  • Single category, pre-labeled
Access frequencies: DWR

[Figure: accesses per day for DWR, shown with and without search indexers]

• Search indexers can cause significant spikes in archival access logs
Access frequencies: WA

• Spikes can appear without a clear culprit
How can we group the DWR data set?

• Clustering is difficult because the directory structure isn't exposed
• We can automatically infer ‘Site’
  • Some water files can be parsed to detect signatures
  • Not generally applicable
Power savings

• Power savings are strongly dependent on singletons
• Hit rate is >30% for all datasets
Grouped vs. always on

• All our groupings save more power than leaving all disks on
• The spike is from indexers
Effect of search indexers

• Search indexers can alter feature importance
• Site subgroup: search indexers can create singletons
Future work

• Failure isolation
• Refined grouping
  • Caching the entire active access group
  • Re-allocation of access groups
• SLO / priority implementation
• More data sets
Summary

• Files used all the time don't impact the rest of the archival system's power footprint
• Real data has enough closely consecutive accesses to save power (30–60%)
  • The range indicates we could do better
• Grouping data saves significant power (up to 50%)
• Archival-by-accident systems are a growing research area
Questions?

Please come talk to me if you have I/O traces from archival systems.

Thanks to Ian Adams for help with the traces!
Thanks to our sponsors.