
Querying Irregular Multi-dimensional Datasets; Dataset Structure



  1. Very Large Dataset Access and Manipulation: Active Data Repository (ADR) and DataCutter
     Joel Saltz
     University of Maryland, College Park and Johns Hopkins Medical Institutions
     http://www.cs.umd.edu/projects/adr
     National Partnership for Advanced Computational Infrastructure

     Data Intensive Research Group (University of Maryland/Johns Hopkins)
     • Mike Beynon
     • Umit Catalyurek
     • Chialin Chang
     • Renato Ferreira
     • Tahsin Kurc
     • Alan Sussman

     Irregular Multi-dimensional Datasets
     • Spatial/multi-dimensional, multi-scale, multi-resolution datasets
     • Applications select portions of one or more datasets
     • Selection of data subset makes use of spatial index (e.g., R-tree, quad-tree, etc.)
     • Data not used “as-is”, generally preprocessing is needed - often to reduce data volumes

     Tools to Manage Storage Hierarchy
     • Mass storage:
        • Load subset of data from tertiary storage into disk cache or client
        • Access data from distributed data collections
        • Preprocess close to data sources
     • Fast secondary storage:
        • Tools for on-demand data product generation, interactive data exploration, visualization
        • Target closely coupled sets of processors/disks

     DataCutter
     • A suite of middleware for subsetting and filtering multi-dimensional datasets stored on archival storage systems
     • Subsetting through Range Queries
        • a hyperbox in dataset’s multi-dimensional space
        • retrieve items with multi-dimensional coordinates in box
     • Processing (filtering/aggregations) through Filters
        • Carry out processing near data, compute servers

     Active Data Repository (ADR)
     • Set of services for building parallel databases of multi-dimensional datasets
        • enables integration of storage, retrieval and processing of multi-dimensional datasets on parallel machines.
        • can maintain and jointly process multiple datasets.
        • provides support and runtime system for common operations such as
           • data retrieval,
           • memory management,
           • scheduling of processing across a parallel machine.
        • customizable for various application specific processing.
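     To make the range-query idea concrete, here is a minimal Python sketch of subsetting by a hyperbox followed by a toy aggregation filter. It is not DataCutter's API: the names `in_box`, `range_query`, and `mean_filter`, the in-memory item list, and the inclusive box bounds are all assumptions made for illustration.

```python
from typing import Iterable, Iterator, Optional, Sequence, Tuple

Point = Tuple[float, ...]
Box = Tuple[Point, Point]          # (low corner, high corner); bounds assumed inclusive
Item = Tuple[Point, float]         # (multi-dimensional coordinates, value)


def in_box(coords: Sequence[float], box: Box) -> bool:
    """Return True if every coordinate lies inside the hyperbox."""
    low, high = box
    return all(lo <= c <= hi for c, lo, hi in zip(coords, low, high))


def range_query(items: Iterable[Item], box: Box) -> Iterator[Item]:
    """Yield the items whose coordinates fall inside the hyperbox."""
    for coords, value in items:
        if in_box(coords, box):
            yield coords, value


def mean_filter(selected: Iterable[Item]) -> Optional[float]:
    """Hypothetical aggregation filter: average the selected values."""
    values = [value for _, value in selected]
    return sum(values) / len(values) if values else None


# Four items in a 3-dimensional space; two of them fall inside the box.
items = [
    ((0.0, 0.0, 0.0), 1.0),
    ((1.0, 2.0, 3.0), 5.0),
    ((4.0, 4.0, 4.0), 9.0),
    ((2.0, 2.0, 2.0), 7.0),
]
box = ((0.5, 0.5, 0.5), (3.0, 3.0, 3.0))
subset = list(range_query(items, box))
average = mean_filter(subset)
```

     A real system would use a spatial index to avoid scanning every item; the linear scan here only illustrates the selection semantics.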

  2. Querying Irregular Multi-dimensional Datasets
     • Irregular datasets
        • Think of disk-based unstructured meshes, data structures used in adaptive multiple grid calculations, sensor data
        • indexed by spatial location (e.g., position on earth, position of microscope stage)
     • Spatial query used to specify iterator
        • computation on data obtained from spatial query
        • computation aggregates data - resulting data product size significantly smaller than results of range query

     Dataset Structure
     • Spatial and temporal resolution may depend on spatial location
     • Physical quantities computed and stored vary with spatial location

     Processing Irregular Datasets: Example -- Interpolation
     [Figure labels: "Specify portion of raw sensor data corresponding to some search criterion"; "Output grid onto which a projection is carried out"]

     Applications
     [Figure panels: Pathology; Volume Rendering; Processing Remotely Sensed Data (AVHRR Level 1 data, NOAA TIROS-N with AVHRR sensor); Surface/Groundwater Modeling; Satellite Data Analysis]

     Application Scenarios
     • Locate TB spatio-temporal region in multi-scale, multi-resolution PB dataset, project data onto new spatio-temporal grid
     • Ad-hoc queries, data products from satellite sensor data
     • Browse or analyze (multi-resolution) digitized slides from high power light or electron microscopy
        • 1-50 GBytes per digitized slide, 5-50 slides per case, 100’s of cases per day per hospital

     Application Scenarios (cont.)
     • Sensor data, fluid dynamics and chemistry codes to predict condition of waterways (e.g. Chesapeake bay simulation) and to carry out petroleum reservoir simulation
     • Predict materials properties using electron microscope computerized tomography sensor data
     • Post-processing, analysis and visualization of data generated by large scientific simulations
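     The digitized-slide figures quoted in the application scenarios above imply a wide range of daily data volumes per hospital. The short Python calculation below works out the bounds. Two things are assumptions, not statements from the slides: "100's of cases" is taken as 100 cases per day (a lower bound), and 1 TB is taken as 1000 GB.

```python
GB_PER_TB = 1000        # assumption: decimal units
CASES_PER_DAY = 100     # assumption: lower bound for "100's of cases per day"


def daily_volume_tb(gb_per_slide: float, slides_per_case: int, cases_per_day: int) -> float:
    """Daily data volume in TB for one hospital."""
    return gb_per_slide * slides_per_case * cases_per_day / GB_PER_TB


# Smallest case: 1 GB per slide, 5 slides per case.
low_tb = daily_volume_tb(1, 5, CASES_PER_DAY)
# Largest case: 50 GB per slide, 50 slides per case.
high_tb = daily_volume_tb(50, 50, CASES_PER_DAY)

print(f"{low_tb} TB to {high_tb} TB per day")
```

     Under these assumptions a single hospital would produce somewhere between half a terabyte and a few hundred terabytes per day, which is why subsetting and aggregation close to the data matter.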

  3. Processing Remotely Sensed Data
     AVHRR Level 1 Data (NOAA TIROS-N with AVHRR sensor)
     • As the TIROS-N satellite orbits, the Advanced Very High Resolution Radiometer (AVHRR) sensor scans perpendicular to the satellite’s track.
     • At regular intervals along a scan line measurements are gathered to form an instantaneous field of view (IFOV).
     • Scan lines are aggregated into Level 1 data sets.
     A single file of Global Area Coverage (GAC) data represents:
     • ~one full earth orbit.
     • ~110 minutes.
     • ~40 megabytes.
     • ~15,000 scan lines.
     One scan line is 409 IFOV’s

     Spatial Irregularity
     [Figure: AVHRR Level 1B, NOAA-7 satellite, 16x16 IFOV blocks; axes: Latitude, Longitude]

     Typical Query
     [Figure labels: "Specify portion of raw sensor data corresponding to some search criterion"; "Output grid onto which a projection is carried out"]

     Active Data Repository

     Application Processing Loop
     O ← Output dataset, I ← Input dataset
     A ← Accumulator (intermediate results)
     [S_I, S_O] ← Intersect(I, O, R_query)
     foreach o_e in S_O do
         read o_e
         a_e ← Initialize(o_e)
     foreach i_e in S_I do
         read i_e
         S_A ← Map(i_e) ∩ S_O
         foreach a_e in S_A do
             a_e ← Aggregate(i_e, a_e)
     foreach a_e in S_O do
         o_e ← Output(a_e)
         write o_e

     Architecture of Active Data Repository
     [Figure: Client 1 (parallel) and Client 2 (sequential) send a Query to and receive Results from the Front End]
     • Front End: Application Front End, Query Submission Service, Query Interface Service
     • Back End: Query Execution Service, Query Planning Service, Dataset Service, Indexing Service, Attribute Space Service, Data Aggregation Service
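     The application processing loop can be sketched sequentially in Python. This is an illustration under stated assumptions, not ADR code: inputs are hypothetical `(x, y, value)` points, output elements are unit grid cells named by integer `(cx, cy)` pairs, Map sends a point to the cell containing it, and Aggregate/Output compute a per-cell mean. The function names `cell_overlaps` and `run_query` are invented for this sketch.

```python
import math
from typing import Dict, Iterable, List, Optional, Tuple

Cell = Tuple[int, int]
Region = Tuple[Tuple[float, float], Tuple[float, float]]  # (low corner, high corner)


def cell_overlaps(cell: Cell, region: Region, size: float) -> bool:
    """True if the square grid cell touches or overlaps the query region."""
    (x0, y0), (x1, y1) = region
    cx, cy = cell
    return (cx * size <= x1 and (cx + 1) * size >= x0
            and cy * size <= y1 and (cy + 1) * size >= y0)


def run_query(inputs: Iterable[Tuple[float, float, float]],
              output_cells: Iterable[Cell],
              region: Region,
              size: float = 1.0) -> Dict[Cell, Optional[float]]:
    (x0, y0), (x1, y1) = region

    # [S_I, S_O] <- Intersect(I, O, R_query)
    s_i: List[Tuple[float, float, float]] = [
        (x, y, v) for x, y, v in inputs if x0 <= x <= x1 and y0 <= y <= y1
    ]
    s_o = {c for c in output_cells if cell_overlaps(c, region, size)}

    # a_e <- Initialize(o_e): accumulator holds a running (sum, count)
    acc = {c: (0.0, 0) for c in s_o}

    # S_A <- Map(i_e) ∩ S_O, then a_e <- Aggregate(i_e, a_e)
    for x, y, v in s_i:
        targets = {(math.floor(x / size), math.floor(y / size))} & s_o
        for c in targets:
            total, n = acc[c]
            acc[c] = (total + v, n + 1)

    # o_e <- Output(a_e): turn each accumulator into a mean (None if empty)
    return {c: (total / n if n else None) for c, (total, n) in acc.items()}


points = [(0.2, 0.3, 10.0), (0.8, 0.1, 20.0), (1.5, 0.5, 4.0), (5.0, 5.0, 99.0)]
cells = [(0, 0), (1, 0), (2, 0)]
result = run_query(points, cells, region=((0.0, 0.0), (2.0, 1.0)))
```

     The accumulator is kept separate from the output so that partial results can be combined before the final Output step; in this sketch the output is much smaller than the selected input, which is the aggregation pattern the slides describe.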

  4. Loading Datasets into ADR
     • A user
        • should decompose dataset into data chunks
        • optionally can distribute chunks across the disks, and provide an index for accessing them
     • ADR, given data chunks and associated minimum bounding rectangles in a set of files
        • can distribute data chunks across the disks using a Hilbert-curve based declustering algorithm,
        • can create an R-tree based index on the dataset.

     Loading Datasets into ADR
     [Figure: ADR Data Loading Service feeding a Disk Farm]
     • ADR Data Loading Service
        • Distributes chunks across the disks in the system (e.g., using Hilbert curve based declustering)
        • Constructs an R-tree index using bounding boxes of the data chunks

     Data Loading Service
     • User must decompose the dataset into chunks
     • For a fully cooked dataset, User
        • moves the data and index files to disks (via ftp, for example)
        • registers the dataset using ADR utility programs
     • For a half cooked dataset, ADR
        • computes placement information using a Hilbert curve-based declustering algorithm,
        • builds an R-tree index,
        • moves the data chunks to the disks
        • registers the dataset

     Query Execution in Active Data Repository
     • An ADR Query contains a reference to
        • the data set of interest,
        • a query window (a multi-dimensional bounding box in input dataset’s attribute space),
        • default or user defined index lookup functions,
        • user-defined accumulator,
        • user-defined projection and aggregation functions,
        • how the results are handled (write to disk, or send back to the client).
     • ADR handles multiple simultaneous active queries

     ADR Query Execution
     [Figure: Client sends query; Index lookup; Generate query plan; then four phases]
     • Initialization Phase: Initialize output
     • Local Reduction Phase: Aggregate local input data into output
     • Global Combine Phase: Combine partial output results
     • Output Handling Phase: Send output to clients
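     One simple way to picture Hilbert-curve based declustering is sketched below in Python. The slides do not specify ADR's actual algorithm, so this is an assumed simplification: each chunk is keyed by the Hilbert index of the center of its minimum bounding rectangle on a 2-D grid, chunks are sorted by that key, and disks are assigned round-robin. The names `hilbert_index` and `decluster`, the unit-square coordinates, and the power-of-two grid size are all assumptions.

```python
from typing import Dict, Tuple

MBR = Tuple[Tuple[float, float], Tuple[float, float]]  # ((xmin, ymin), (xmax, ymax))


def hilbert_index(n: int, x: int, y: int) -> int:
    """Position of cell (x, y) along a Hilbert curve over an n x n grid.

    n must be a power of two. This is the standard iterative conversion.
    """
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) > 0 else 0
        ry = 1 if (y & s) > 0 else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has the right orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d


def decluster(chunks: Dict[str, MBR], num_disks: int, grid_size: int) -> Dict[str, int]:
    """Assign each chunk to a disk; chunk coordinates are assumed to lie in [0, 1]."""
    def key(name: str) -> int:
        (xmin, ymin), (xmax, ymax) = chunks[name]
        cx = min(int((xmin + xmax) / 2 * grid_size), grid_size - 1)
        cy = min(int((ymin + ymax) / 2 * grid_size), grid_size - 1)
        return hilbert_index(grid_size, cx, cy)

    ordered = sorted(chunks, key=key)
    # Neighbors along the curve land on different disks.
    return {name: rank % num_disks for rank, name in enumerate(ordered)}


quadrants = {
    "sw": ((0.0, 0.0), (0.5, 0.5)),
    "nw": ((0.0, 0.5), (0.5, 1.0)),
    "ne": ((0.5, 0.5), (1.0, 1.0)),
    "se": ((0.5, 0.0), (1.0, 0.5)),
}
placement = decluster(quadrants, num_disks=2, grid_size=2)
```

     Because consecutive positions on a Hilbert curve are spatially adjacent, round-robin assignment along the curve tends to put neighboring chunks on different disks, so a range query touching a compact region can read from several disks in parallel.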
