DataCutter Joel Saltz Alan Sussman Tahsin Kurc University of - PowerPoint PPT Presentation

DataCutter
Joel Saltz, Alan Sussman, Tahsin Kurc
University of Maryland, College Park and Johns Hopkins Medical Institutions
http://www.cs.umd.edu/projects/adr

DataCutter
• A suite of middleware for subsetting and filtering multi-dimensional datasets stored on archival storage systems
• Subsetting through range queries
  • a hyperbox defined in the multi-dimensional space underlying the dataset
  • items whose multi-dimensional coordinates fall into the box are retrieved
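The range-query bullets above can be illustrated with a small sketch. The types `HyperBox` and `Point` and the function `selectInBox` are hypothetical names chosen for illustration, not part of DataCutter, and the box is assumed to be closed (boundary points count as inside).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical types for illustration; not part of the DataCutter API.
struct HyperBox {
    std::vector<double> min;  // lower corner, one value per dimension
    std::vector<double> max;  // upper corner, one value per dimension
};

typedef std::vector<double> Point;

// True when every coordinate of p lies in [box.min[d], box.max[d]].
// ASSUMPTION: the box is closed, so points on the boundary are included.
bool contains(const HyperBox& box, const Point& p) {
    assert(box.min.size() == box.max.size());
    if (p.size() != box.min.size()) return false;  // dimension mismatch
    for (std::size_t d = 0; d < p.size(); ++d) {
        if (p[d] < box.min[d] || p[d] > box.max[d]) return false;
    }
    return true;
}

// Return the items whose multi-dimensional coordinates fall into the box.
std::vector<Point> selectInBox(const std::vector<Point>& items,
                               const HyperBox& box) {
    std::vector<Point> hits;
    for (std::size_t i = 0; i < items.size(); ++i) {
        if (contains(box, items[i])) hits.push_back(items[i]);
    }
    return hits;
}
```

This scans every item. In the system described here the scan is avoided by indexing segments, so the per-item test would only run on segments the index has already selected.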

DataCutter
• Restricted processing (filtering/aggregations) through filters
  • to reduce the amount of data transferred to the client
  • filters can run anywhere, but are intended to run near the storage system (i.e., over a local area network)
  • based on the filter-stream programming model, to optimize use of limited resources such as memory and disk space

DataCutter
[Figure: DataCutter architecture. Clients send range queries to the Client Interface Service and receive segment data. Connected to it are the Filtering Service (filters), the Indexing Service (segment info), and the Data Access Service, which sit in front of the archival storage systems. Segments are identified as (File, Offset, Size).]

DataCutter Architecture
• Client Interface Service
  • Manages client connections and client requests
  • Manages data and information flow between different services
• Indexing Service
  • Two-level hierarchical indexing: summary and detailed index files
  • Customizable
    • Default R-tree index
    • User can add new indexing methods

DataCutter Architecture
• Filtering Service
  • Manages filters (registered in the system)
  • Users can add/run new filters
• Data Access Service
  • Manages storage/retrieval of data from the tertiary storage
  • Low-level, system-dependent I/O operations

DataCutter -- Subsetting
• Datasets are partitioned into segments
  • used to index the dataset, unit of retrieval
• Indexing very large datasets
  • Multi-level hierarchical indexing scheme
  • Summary index files -- to index a group of segments or detailed index files
  • Detailed index files -- to index the segments
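A minimal sketch of the two-level lookup described above, under stated assumptions: each summary entry carries a bounding box that covers one detailed index, and each detailed entry carries the bounding box of one segment. All type and function names are hypothetical, and plain linear scans stand in for the default R-tree, which this sketch does not implement.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical types for illustration; not the DataCutter index format.
struct Box { std::vector<double> min, max; };

// Two closed boxes intersect when their intervals overlap in every dimension.
bool boxesIntersect(const Box& a, const Box& b) {
    assert(a.min.size() == b.min.size() && a.max.size() == b.max.size());
    for (std::size_t d = 0; d < a.min.size(); ++d) {
        if (a.max[d] < b.min[d] || b.max[d] < a.min[d]) return false;
    }
    return true;
}

struct Segment { std::string file; unsigned offset; unsigned size; };
struct DetailedEntry { Box mbr; Segment seg; };              // one segment
struct DetailedIndex { std::vector<DetailedEntry> entries; };
struct SummaryEntry { Box mbr; DetailedIndex detail; };      // one detailed index

// Level 1: skip whole detailed indexes whose summary box misses the query.
// Level 2: inside the surviving indexes, keep the intersecting segments.
std::vector<Segment> searchTwoLevel(const std::vector<SummaryEntry>& summary,
                                    const Box& query) {
    std::vector<Segment> hits;
    for (std::size_t i = 0; i < summary.size(); ++i) {
        if (!boxesIntersect(summary[i].mbr, query)) continue;
        const std::vector<DetailedEntry>& entries = summary[i].detail.entries;
        for (std::size_t j = 0; j < entries.size(); ++j) {
            if (boxesIntersect(entries[j].mbr, query)) hits.push_back(entries[j].seg);
        }
    }
    return hits;
}
```

The point of the summary level is that a detailed index file whose box misses the query never has to be read at all, which matters when the detailed files themselves live on slow storage.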

DataCutter -- Filters
• Filters
  • Specialized user program to process data (segments) before returning them to the client
• Filter-stream programming model
  • Originally developed for Active Disks environment (Acharya, Uysal, and Saltz)
  • Based on stream abstraction
    • A stream denotes a supply of data
    • Streams deliver data in fixed size buffers
  • Communication of a filter with its environment is restricted to its input and output streams
  • init, process, finalize interface
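The filter-stream model above can be sketched in a few lines. Everything here is a hypothetical stand-in, not the DataCutter classes: `Stream` is an in-memory queue that hands out buffers of at most a fixed size (the last buffer of a write may be shorter, an assumption of this sketch), and `ThresholdFilter` is an example filter that touches only its input and output streams through an init, process, finalize sequence.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <deque>
#include <utility>
#include <vector>

typedef std::vector<unsigned char> Buffer;

// Hypothetical in-memory stream that delivers data in fixed size buffers.
class Stream {
public:
    explicit Stream(std::size_t bufferSize) : bufferSize_(bufferSize) {
        assert(bufferSize > 0);
    }
    // Split data into buffers of at most bufferSize_ bytes and queue them.
    void write(const Buffer& data) {
        for (std::size_t pos = 0; pos < data.size(); pos += bufferSize_) {
            std::size_t end = std::min(pos + bufferSize_, data.size());
            queue_.emplace_back(data.begin() + static_cast<std::ptrdiff_t>(pos),
                                data.begin() + static_cast<std::ptrdiff_t>(end));
        }
    }
    // Fetch the next buffer; returns false when the stream is drained.
    bool read(Buffer& out) {
        if (queue_.empty()) return false;
        out = std::move(queue_.front());
        queue_.pop_front();
        return true;
    }
private:
    std::size_t bufferSize_;
    std::deque<Buffer> queue_;
};

// Example filter: forwards only the bytes at or above a threshold,
// reducing the volume sent downstream.
class ThresholdFilter {
public:
    ThresholdFilter() : threshold_(0), kept_(0) {}
    int init(unsigned char threshold) { threshold_ = threshold; kept_ = 0; return 0; }
    int process(Stream& in, Stream& out) {
        Buffer buf;
        while (in.read(buf)) {
            Buffer keep;
            for (std::size_t i = 0; i < buf.size(); ++i) {
                if (buf[i] >= threshold_) keep.push_back(buf[i]);
            }
            kept_ += keep.size();
            if (!keep.empty()) out.write(keep);
        }
        return 0;
    }
    int finalize() { return 0; }
    std::size_t kept() const { return kept_; }
private:
    unsigned char threshold_;
    std::size_t kept_;
};
```

Because the filter never names a host, a file, or another filter, the same object could in principle run next to the storage system or on the client, which is the location independence the deck attributes to the model.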

A Motivating Scenario
Sample application:
• generate 3D reconstructed view from new set of sensor readings
• compare features with reference db
Grid configuration:
• remote data server - reference db
• sensor host - large raw readings
• parallel computation farm available
• 3D reconstruction computationally intensive
[Figure: a Client PC, a Sensor host with the Raw Dataset (sensor readings), a Computation Farm, and a Data Server with the Reference DB (feature list), connected over a WAN.]

A Motivating Scenario (2)
Application:
  // process relevant raw readings
  // generate 3D view
  // compute features of 3D view
  // find similar features in reference db
  // display new view and similar cases
[Figure: the boxes Extract raw, 3D reconstruction, Extract ref, and View result, drawn over the Client PC, the Sensor host (Raw Dataset, sensor readings), the Computation Farm, and the Data Server (Reference DB, feature list) connected by the WAN.]

Filters
• Filters
  • communicate with other filters only using streams
  • cannot change stream endpoints
  • are allowed to pre-disclose dynamic allocation of memory/scratch space in init phase, before processing phase
• Advantages
  • location independence
  • easier scheduling of resources
  • filter stop and restart is defined explicitly in model

Placement
• The dynamic assignment of filters to particular hosts for execution is placement (mapping)
• Optimization criteria:
  • Communication
    • leverage filter affinity to dataset
    • minimize communication volume on slower connections
    • co-locate filters with large communication volume
  • Computation
    • expensive computation on faster, less loaded hosts
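The communication criteria above can be made concrete with a toy cost model. This is an illustrative sketch, not the placement algorithm of DataCutter (the deck states that the prototype uses manual placement). Assumptions: each stream has a known data volume, each host pair has a known bandwidth, co-located filters communicate for free, dataset affinity is modeled by pinning a filter to a host, and the computation criterion is ignored. All names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// A stream from filter `from` to filter `to` carrying `volume` units of data.
struct StreamEdge { std::size_t from, to; double volume; };

// Transfer cost of a placement: volume / bandwidth for every stream that
// crosses hosts, zero for streams between co-located filters.
double commCost(const std::vector<StreamEdge>& streams,
                const std::vector<std::size_t>& hostOf,
                const std::vector<std::vector<double> >& bandwidth) {
    double cost = 0.0;
    for (std::size_t i = 0; i < streams.size(); ++i) {
        std::size_t a = hostOf[streams[i].from];
        std::size_t b = hostOf[streams[i].to];
        if (a != b) cost += streams[i].volume / bandwidth[a][b];
    }
    return cost;
}

// Exhaustive search over all host assignments (only viable for tiny cases).
// pinned[f] >= 0 forces filter f onto that host; -1 leaves it free.
// Returns an empty vector if no assignment satisfies the pins.
std::vector<std::size_t> bestPlacement(std::size_t numFilters, std::size_t numHosts,
                                       const std::vector<StreamEdge>& streams,
                                       const std::vector<std::vector<double> >& bandwidth,
                                       const std::vector<int>& pinned) {
    assert(numHosts > 0 && pinned.size() == numFilters);
    std::vector<std::size_t> current(numFilters, 0), best;
    double bestCost = std::numeric_limits<double>::infinity();
    while (true) {
        bool ok = true;
        for (std::size_t f = 0; f < numFilters; ++f) {
            if (pinned[f] >= 0 && current[f] != static_cast<std::size_t>(pinned[f])) ok = false;
        }
        if (ok) {
            double c = commCost(streams, current, bandwidth);
            if (c < bestCost) { bestCost = c; best = current; }
        }
        // Advance to the next assignment, odometer style.
        std::size_t i = 0;
        while (i < numFilters && ++current[i] == numHosts) { current[i] = 0; ++i; }
        if (i == numFilters) break;
    }
    return best;
}
```

With a data-extraction filter pinned to the data host and a viewer pinned to the client, this model places a heavy intermediate filter on the data side whenever its input stream is much larger than its output, which matches the "co-locate filters with large communication volume" bullet.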

Restructuring Process
[Figure: Application -> Decompose -> some set of filters (f1, f2, f3, f4, f5) -> Placement / Schedule (using the Target Configuration) -> Execute Application]

Software Infrastructure
• Prototype implementation of filter framework
  • C++ language binding
  • manual placement
  • wide-area execution service
  • one thread for each instantiated filter

Filter Framework
class MyFilter : public AS_Filter_Base {
public:
    int init(int argc, char *argv[]) { … }
    int process(stream_t st) { … }
    int finalize(void) { … }
};

Filter Connectivity / Placement
[filter.A]
outs = stream1 stream3
[filter.B]
ins = stream1
outs = stream2
[filter.C]
ins = stream2 stream3

[placement]
A = host1.cs.umd.edu
B = host2.cs.umd.edu
C = host3.cs.umd.edu

[Figure: filter graph. A -> B via stream1, B -> C via stream2, A -> C via stream3.]
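As a sketch of how such a connectivity description determines the filter graph, the code below pairs each stream's producer (the filter listing it under outs) with its consumer (the filter listing it under ins). The types and the function `resolveStreams` are hypothetical, no config-file parsing is attempted, and the rule that every stream has exactly one producer and one consumer is an assumption of this sketch rather than something the slide states.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical in-memory form of one [filter.X] section.
struct FilterSpec { std::vector<std::string> ins, outs; };

// One resolved stream: who writes it and who reads it.
struct Edge { std::string stream, producer, consumer; };

// Pair producers with consumers by stream name.
// ASSUMPTION: each stream has exactly one producer and one consumer.
std::vector<Edge> resolveStreams(const std::map<std::string, FilterSpec>& filters) {
    std::map<std::string, Edge> byStream;
    typedef std::map<std::string, FilterSpec>::const_iterator FilterIt;
    for (FilterIt it = filters.begin(); it != filters.end(); ++it) {
        const std::string& name = it->first;
        const FilterSpec& spec = it->second;
        for (std::size_t i = 0; i < spec.outs.size(); ++i) {
            Edge& e = byStream[spec.outs[i]];
            if (!e.producer.empty())
                throw std::runtime_error("two producers for " + spec.outs[i]);
            e.stream = spec.outs[i];
            e.producer = name;
        }
        for (std::size_t i = 0; i < spec.ins.size(); ++i) {
            Edge& e = byStream[spec.ins[i]];
            if (!e.consumer.empty())
                throw std::runtime_error("two consumers for " + spec.ins[i]);
            e.stream = spec.ins[i];
            e.consumer = name;
        }
    }
    std::vector<Edge> edges;
    typedef std::map<std::string, Edge>::const_iterator EdgeIt;
    for (EdgeIt it = byStream.begin(); it != byStream.end(); ++it) {
        if (it->second.producer.empty() || it->second.consumer.empty())
            throw std::runtime_error("dangling stream " + it->first);
        edges.push_back(it->second);
    }
    assert(edges.size() == byStream.size());  // every stream became one edge
    return edges;
}
```

Applied to the three filter sections on the slide, this yields the edges A to B over stream1, B to C over stream2, and A to C over stream3, in stream-name order.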

Execution Service
[Figure: execution service in three steps. 1. Read: the Console reads the application's filter/stream specs and placement. 2. Query: the Directory Daemon (dir.cs.umd.edu:6000) is queried; its directory holds name, host, and port entries. 3. Exec: AppExec Daemons on host1.cs.umd.edu, host2.cs.umd.edu, and host3.cs.umd.edu each EXEC the application with the filter library, running filter A, filter B, and filter C respectively.]

Related Work
[Figure: related systems arranged by level. Level labels: Application Level, Programming Models, Infrastructure Services, Resource Level. Systems shown: Client/Server, AppLeS, Sockets, HPC++, DataCutter, NetSolve, Legion, DSM, MPI, RPC, JavaRMI, Ninf, Harmony, DCOM, NWS, CORBA, SRB, Globus, Condor, DPSS. Resource labels: Pool, Grid available Resources, User specified Resources, Idle Resources.]

Integrating DataCutter with the Storage Resource Broker

Storage Resource Broker (SRB)
• Middleware between clients and storage resources
• Remote access to storage resources
• Various types:
  • File systems - UNIX, HPSS, UniTree, DPSS (LBL)
  • DB large objects - Oracle, DB2, Illustra
• Uniform client interface (API)

Storage Resource Broker (SRB)
• MCAT - MetaData Catalog
  • Datasets (files) and Collections (directories) - inodes and more
  • Storage resources
  • User information - authentication, access privileges, etc.
• Software package
  • Server, client library, UNIX-like utilities, Java GUI
  • Platforms - Solaris, Sun OS, Digital Unix, SGI Irix, Cray T90

SRB/DataCutter - Prototype Implementation
• Support for Range Queries
  • Creation of indices over data sets (composed set of data files)
  • Subsetting of data sets
    • Search for files or portions of files that intersect a given range query
• Restricted filter operations on portions of files (data segments) before returning them to the client (to perform filtering or aggregation to reduce data volume)

SRB/DataCutter System
[Figure: an Application (SRB client) issues requests to the Storage Resource Broker (SRB), labeled File SID, DBLobjID, ObjSID, Resource, User, and Range Query. Within the SRB are the DataCutter Indexing Service and Filtering Service (with filters) and MCAT (application meta-data), on top of the SRB I/O and MCAT API. Below are the distributed storage resources: DB2, Oracle, Illustra, ObjectStore; HPSS, UniTree; UNIX, ftp.]

SRB/DataCutter Client Interface
• Creating and Deleting Index

int sfoCreateIndex(srbConn *conn, sfoClass class, int catType,
                   char *inIndexName, char *outIndexName,
                   char *resourceName)

int sfoDeleteIndex(srbConn *conn, sfoClass class, int catType,
                   char *indexName)

SRB/DataCutter Client Interface
• Searching Index -- R-tree index

typedef struct {
    int dim;          /* bounding box dimensions */
    double *min;      /* minimum in each dimension */
    double *max;      /* maximum in each dimension */
} sfoMBR;             /* Bounding box structure */

typedef struct {
    sfoMBR segmentMBR;      /* bounding box of the segment */
    char *objID;            /* object in SRB that contains the segment */
    char *collectionName;   /* collection where object is stored */
    unsigned int offset;    /* offset of the segment in the object */
    unsigned int size;      /* size of segment */
} segmentInfo;              /* segment meta-data information */

typedef struct {
    int segmentCount;       /* number of segments returned */
    segmentInfo *segments;  /* segment meta-data information */
    int continueIndex;      /* continuation flag */
} indexSearchResult;        /* search result structure */

SRB/DataCutter Client Interface
• Searching Index -- R-tree index

int sfoSearchIndex(srbConn *conn, sfoClass class, char *indexName,
                   void *query, indexSearchResult *myresult,
                   int maxSegCount)

typedef struct {
    int dim;
    double *min, *max;
} rangeQuery;

int sfoGetMoreSearchResult(srbConn *conn, int continueIndex,
                           indexSearchResult *myresult,
                           int maxSegCount)
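The pairing of a first search call with a "get more" call, a maxSegCount limit, and a continueIndex field suggests a paged retrieval loop. The sketch below shows that pattern with stand-ins only: it does not call the SRB library, `PageResult` is a simplified substitute for indexSearchResult, and two conventions are pure assumptions because the slides do not state them: a return value of 0 means success, and a negative continueIndex means no further results.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Simplified stand-in for indexSearchResult; segments are plain ints here.
struct PageResult {
    std::vector<int> segments;
    int continueIndex;  // ASSUMPTION: negative means "no more results"
    PageResult() : continueIndex(-1) {}
};

// Stand-ins for the first search call and the follow-up call.
// ASSUMPTION: both return 0 on success and nonzero on error.
typedef std::function<int(PageResult&, int)> FirstPage;
typedef std::function<int(int, PageResult&, int)> NextPage;

// Fetch pages of at most maxSegCount segments until the continuation
// value signals the end. Returns 0, or the first nonzero status seen.
int collectAll(const FirstPage& first, const NextPage& next,
               int maxSegCount, std::vector<int>& out) {
    assert(maxSegCount > 0);
    PageResult page;
    int status = first(page, maxSegCount);
    while (status == 0) {
        out.insert(out.end(), page.segments.begin(), page.segments.end());
        if (page.continueIndex < 0) break;   // last page consumed
        int token = page.continueIndex;
        page = PageResult();                 // reset before the next call
        status = next(token, page, maxSegCount);
    }
    return status;
}
```

In a real client the two callbacks would wrap sfoSearchIndex and sfoGetMoreSearchResult, and the actual end-of-results and error conventions would need to be taken from the SRB/DataCutter documentation rather than from this sketch.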

Applying Filters

typedef struct {
    segmentInfo segInfo;    /* info on segment data buffer after filter oper. */
    char *segment;          /* segment data buffer after filter is applied */
} segmentData;

typedef struct {
    int segmentDataCount;   /* #segments in segmentData array */
    segmentData *segments;  /* segmentData array */
    int continueIndex;      /* continuation flag */
} filterDataResult;
