SUPPORTING SQL QUERIES FOR SUBSETTING LARGE-SCALE DATASETS IN PARAVIEW
Yu Su*, Gagan Agrawal*, Jon Woodring†
*The Ohio State University
†Los Alamos National Laboratory
SC’11 UltraVis Workshop, November 13, 2011

Subsetting Large-Scale Data
- Data subsetting is needed for efficient large-scale scientific visualization and analysis
  - Post-processing is still needed for some visualization and analysis scenarios (global time analysis, exploratory visualization, etc.)
  - I/O and network bandwidth are slow
  - Memory footprint per core is decreasing at exascale
  - New data generated during the analysis process increases the memory footprint
Operated by Los Alamos National Security, LLC for the U.S. Department of Energy’s NNSA

Data Subsetting in ParaView
- Load an entire data set (or spatial subset)
  - Apply a series of filters and/or the data selection interface
  - Multiple filters keep multiple data copies in memory
  - New grids may be created, increasing memory and time
- Rectilinear Grid readers: Slow (can be fast for spatial)
  - Extract Subset/Slice/Clip Filter: Fast
  - Threshold Filter: Very Slow (new unstructured grid)
- Unstructured Grid readers: Slow
  - Extract Subset/Slice/Clip Filter: Medium
  - Threshold Filter: Slow (new unstructured grid)

A Faster Solution
- Subset at the I/O level
  - Reduced I/O times and memory footprint
  - User specifies the subset in one query for both space and value ranges
- SQL queries in ParaView reader modules
  - Standard: A flexible language for specifying subsets
  - Efficiency: Smaller data load from disk to memory
  - Functionality: One query is equal to multiple filters

Broader Research Context
- Automatic Data Virtualization Research at The Ohio State University
- Virtual Relational/XML view on low-level scientific data
  - A light-weight database management solution
  - Support for flat-files and HDF5 in past work
  - No need to load data into a specific database; the data can stay as-is

ParaView Reader and SQL Queries
[Diagram: user input query -> parse query -> retrieve data -> the rest of the VTK pipeline]
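
The parse step in the diagram above can be illustrated with a small sketch. This is not the authors' parser: the function name `split_query`, the restriction to simple `AND`-joined comparisons, and the rule that names starting with `t_` are coordinate (space) variables are all assumptions made for illustration, based only on the example queries shown later in this talk.

```python
import re

# Assumed convention: names starting with "t_" (e.g. t_lat, t_lon) are
# coordinate variables; every other name is a data value.
COND = re.compile(r"^\s*(\w+)\s*(<=|>=|=|<|>)\s*(-?\d+(?:\.\d+)?)\s*$")

def split_query(sql):
    """Split a simple 'SELECT v FROM d WHERE a AND b' query into
    (variable, spatial conditions, value conditions)."""
    m = re.match(r"^\s*SELECT\s+(\w+)\s+FROM\s+\w+\s+WHERE\s+(.*?);?\s*$",
                 sql, flags=re.IGNORECASE)
    if m is None:
        raise ValueError("unsupported query: " + sql)
    variable, where = m.group(1), m.group(2)
    spatial, value = [], []
    for part in re.split(r"\s+AND\s+", where, flags=re.IGNORECASE):
        c = COND.match(part)
        if c is None:
            raise ValueError("unsupported condition: " + part)
        name, op, number = c.group(1), c.group(2), float(c.group(3))
        # Route each condition to the spatial or the value side.
        (spatial if name.startswith("t_") else value).append((name, op, number))
    return variable, spatial, value
```

In such a design, the spatial conditions would go to the file-format API and the value conditions to the index, so a single query stands in for a reader plus several filters.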

Retrieval/Indexing Functionality
- Spatial queries are relatively easy
  - NetCDF and HDF5 support spatial queries in their API
  - Unstructured data is harder (but feasible)
- Value queries are harder: Bitmap Indexing!
  - One truth bit for each data element and value range (1 if the datum is in the value range, 0 if not), giving an N x B bit matrix
    - Value bin cardinality B (range quantization) is a tradeoff between indexing time and space, and there are bitmap compression techniques (WAH, multi-hashing, etc.)
  - Fast bitwise operations on the bitmap index determine the data point selection sets from queries
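
The N x B bit matrix described above can be sketched in a few lines. This is a minimal illustration, not the index used in the talk: each of the B value bins is stored as one Python integer used as a bit vector, and the function names and the bin edges in the usage lines are hypothetical.

```python
import bisect

def build_bitmap_index(values, edges):
    """Return one bit vector (a Python int) per value bin.

    Bin b covers edges[b] <= v < edges[b + 1]; bit i of bitmaps[b]
    is 1 when values[i] falls into bin b.
    """
    bitmaps = [0] * (len(edges) - 1)
    for i, v in enumerate(values):
        b = bisect.bisect_right(edges, v) - 1
        if 0 <= b < len(bitmaps):
            bitmaps[b] |= 1 << i
    return bitmaps

def select_bins(bitmaps, first_bin, last_bin):
    """OR together the bit vectors of bins first_bin..last_bin (inclusive)."""
    hits = 0
    for b in range(first_bin, last_bin + 1):
        hits |= bitmaps[b]
    return hits

def bit_positions(bits):
    """List the element IDs whose bit is set."""
    return [i for i in range(bits.bit_length()) if (bits >> i) & 1]

# Usage: six data elements, three bins [0,10), [10,20), [20,30).
values = [3.0, 12.5, 7.1, 25.0, 18.2, 9.9]
index = build_bitmap_index(values, [0, 10, 20, 30])
selected = bit_positions(select_bins(index, 1, 2))  # 10 <= v < 30 -> [1, 3, 4]
```

A range query touches only the bit vectors of the bins it covers, which is why the bin count B trades index size against how precisely a query range can be matched.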

Experimental Test Setup
- POP (Parallel Ocean Program) NetCDF data files
  - 3600x2400x42 structured grid
  - 1.4 GB per variable, 4 variables (5.6 GB)
- SQL + NetCDF API + Bitmap indexing vs. reader modules + multiple VTK filters (single threaded)
  - Type 1: Spatial queries (skipped; it is as fast as the NetCDF API can service a spatial query)
  - Type 2: Value queries (100 random queries)
  - Type 3: Space + Value queries (100 random queries)
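
The per-variable size can be checked with quick arithmetic. The sketch below assumes 4-byte (single-precision) values, which the slides do not state; under that assumption the result is consistent with the roughly 1.4 GB per variable quoted above.

```python
# Grid dimensions from the test setup.
nx, ny, nz = 3600, 2400, 42

# Assumption: each value is a 4-byte float (not stated in the slides).
bytes_per_value = 4

elements = nx * ny * nz                       # 362,880,000 grid points
bytes_per_variable = elements * bytes_per_value  # 1,451,520,000 bytes
total_bytes = 4 * bytes_per_variable          # four variables

print(f"{bytes_per_variable / 1e9:.2f} GB per variable")
print(f"{total_bytes / 1e9:.2f} GB for four variables")
```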

Experiment: Value Only Queries
- Spatial query ignored in this test
  - Two-level bitmap indexing
  - One-level bitmap indexing
  - Read whole data + multiple VTK threshold filters
- 2-level indexing
  - Coarse-grain index on values for a first pass (fewer bins), followed by a finer-grain index per bin in a second pass
  - Throws out a large number of candidates on the first pass; more efficient than 1-level indexing in many cases
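
One common way to organize a two-level index is sketched below. The slides do not give the actual layout, so everything here is an assumption: fine bins are grouped a fixed number at a time into coarse bins, a coarse bin that lies entirely inside the query range is answered from its single coarse bitmap, and only the coarse bins at the edges of the range are refined with fine bitmaps. The function names are hypothetical.

```python
import bisect

def build_two_level(values, fine_edges, fine_per_coarse):
    """Index values into fine bins, grouped fine_per_coarse at a time
    into coarse bins. Bit i of a bitmap refers to values[i]."""
    n_fine = len(fine_edges) - 1
    n_coarse = -(-n_fine // fine_per_coarse)  # ceiling division
    fine = [0] * n_fine
    coarse = [0] * n_coarse
    for i, v in enumerate(values):
        g = bisect.bisect_right(fine_edges, v) - 1
        if 0 <= g < n_fine:
            fine[g] |= 1 << i
            coarse[g // fine_per_coarse] |= 1 << i
    return coarse, fine

def query_two_level(coarse, fine, fine_per_coarse, g_lo, g_hi):
    """Select elements in fine bins g_lo..g_hi (inclusive).
    Returns (bitmask of hits, number of bitmaps read)."""
    hits, reads = 0, 0
    for c in range(g_lo // fine_per_coarse, g_hi // fine_per_coarse + 1):
        first = c * fine_per_coarse
        last = min(first + fine_per_coarse, len(fine)) - 1
        if g_lo <= first and last <= g_hi:
            hits |= coarse[c]          # whole coarse bin qualifies
            reads += 1
        else:
            for g in range(max(first, g_lo), min(last, g_hi) + 1):
                hits |= fine[g]        # refine only the boundary bins
                reads += 1
    return hits, reads

# Usage: 8 fine bins of width 5, grouped into 2 coarse bins of 4.
values = [1, 5, 12, 17, 23, 28, 34, 39]
coarse, fine = build_two_level(values, [0, 5, 10, 15, 20, 25, 30, 35, 40], 4)
hits, reads = query_two_level(coarse, fine, 4, 3, 7)  # values in [15, 40)
```

In this example the query covers five fine bins but reads only two bitmaps, because the second coarse bin is covered completely.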

[Figure. Legend: m1 = 2-level indexing and read only query data; m2 = 1-level indexing and read only query data; m3 = read all data and use multiple thresholds. Annotation: grid creation. Axis: query data size compared to whole data size]

Memory Usage on Value Queries
[Figure. Annotation: memory occupied by reading the entire data set]

Experiment: Space + Value Queries
- Subsetting on IDs (space) combined with values
  - 2-level indexing + NetCDF API
  - 1-level indexing + NetCDF API
  - VTK NetCDF reader + subset filter + threshold filter
    - The reader is smart enough to read only the requested spatial subsets
- Dominant factor for indexing is still in values
  - SELECT temp FROM DATASET WHERE t_lat=0.5 AND t_lon = 0.5 AND temp<100;
  - SELECT temp FROM DATASET WHERE temp=0 AND t_lat>0;
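
Combining a spatial subset with a value selection can be pictured as intersecting two sets of element IDs. The sketch below is only an illustration under stated assumptions: a tiny 2-D row-major grid, an index-space box rather than latitude/longitude coordinates, and a plain scan standing in for the bitmap index lookup. The function names are hypothetical, and the talk itself uses the NetCDF API for the spatial part.

```python
def box_mask(shape, lo, hi):
    """Bitmask of flat IDs inside the index box lo <= (j, i) < hi
    of a 2-D grid stored in row-major order."""
    ny, nx = shape
    mask = 0
    for j in range(lo[0], hi[0]):
        for i in range(lo[1], hi[1]):
            mask |= 1 << (j * nx + i)
    return mask

def ids_of(mask):
    """List the flat IDs whose bit is set."""
    return [k for k in range(mask.bit_length()) if (mask >> k) & 1]

# A 3 x 4 grid of temperatures, flattened row by row.
temps = [50, 120, 80, 200,
         90, 95, 130, 10,
         300, 40, 60, 70]

# Stand-in for a bitmap index lookup of "temp < 100".
value_hits = sum(1 << k for k, t in enumerate(temps) if t < 100)

# Rows 1..2 and columns 1..2, intersected with the value hits.
space_hits = box_mask((3, 4), (1, 1), (3, 3))
result_ids = ids_of(value_hits & space_hits)  # -> [5, 9, 10]
```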

[Figure: % of values vs. % of IDs (space) in query result]

Remove Unstructured Grid Generation
- The first experiment showed that ParaView/VTK spends a lot of time generating new grids
  - This could be said for many filters: VTK tends to generate new grids too often, wasting time and space
- New filtering idea (seems obvious but it was “Aha!”)
  - Instead of generating new grids after filtering, mark “unqualified” data values as NaN
  - Data stay on the original grid
  - No extra grid generation (saves time and space)
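
The NaN-marking idea can be shown with a flat list of values. This is a minimal sketch, not the VTK filter from the talk; the function name `mask_outside` is hypothetical. The point is that the output has the same length and ordering as the input, so it still maps onto the original grid and no new grid is built.

```python
import math

def mask_outside(values, lo, hi):
    """Return a copy of values with everything outside [lo, hi]
    replaced by NaN. Length and ordering are unchanged."""
    return [v if lo <= v <= hi else math.nan for v in values]

# Usage: keep values between 2.0 and 6.0, mark the rest as NaN.
field = [1.0, 5.0, 9.0, 3.0]
masked = mask_outside(field, 2.0, 6.0)
kept = [v for v in masked if not math.isnan(v)]  # -> [5.0, 3.0]
```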

[Figure: indexing vs. new Threshold filter (set unqualified values to NaN)]

Conclusion
- Use SQL queries to support flexible data subsetting
  - Translate query into operations
  - Use NetCDF/HDF5 API to support spatial subsets
  - Use bitmap indexing to support value subsets
  - No need to move data into a database
- Reduced memory and time for query
  - Multi-level indexing can drastically improve time
  - Skipping extra grid generation steps in VTK can improve time and memory usage as well

Questions?
Thanks for listening!

Bitmap Compression (Bitmaps can be large)
- WAH compression
  - Compresses the bit vectors based on runs of consecutive 0s or 1s
  - Cannot subset by the IDs (spatial dimensions) before the value indexing operation (assuming we don’t add x, y, z to the bitmap index)
- Multi-Hash compression
  - Uses multiple hash functions to set 1s at hash(id, value) for each 1 in a bit vector
  - Supports subsetting over both IDs (space) and values
  - Hash clashes map different (id, value) pairs to the same array position (false positives)
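
Both ideas can be illustrated in simplified form. The run-length function below only shows why long runs of identical bits compress well; it is not WAH's actual word-aligned encoding. The `MultiHashBitmap` class is a Bloom-filter-style sketch of the multi-hash idea: the class name, the use of SHA-256, and the key format are assumptions for illustration, not the scheme used in the talk.

```python
import hashlib
from itertools import groupby

def run_lengths(bits):
    """Collapse a sequence of 0/1 bits into (bit, run length) pairs."""
    return [(bit, len(list(run))) for bit, run in groupby(bits)]

class MultiHashBitmap:
    """Bloom-filter-style store for (element id, value bin) pairs."""

    def __init__(self, n_bits, n_hashes):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = 0

    def _positions(self, element_id, value_bin):
        # Derive n_hashes array positions from the (id, value bin) pair.
        for k in range(self.n_hashes):
            key = f"{k}:{element_id}:{value_bin}".encode()
            digest = hashlib.sha256(key).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, element_id, value_bin):
        for p in self._positions(element_id, value_bin):
            self.bits |= 1 << p

    def might_contain(self, element_id, value_bin):
        # True for every pair that was added; may also be True for a
        # pair that was never added (a false positive from hash clashes).
        return all((self.bits >> p) & 1
                   for p in self._positions(element_id, value_bin))

# Usage: store which value bin each element falls into.
store = MultiHashBitmap(n_bits=256, n_hashes=3)
for element_id, value_bin in [(0, 2), (1, 0), (2, 2), (3, 1)]:
    store.add(element_id, value_bin)
```

Because a lookup is keyed on the (id, value) pair, a query can test any element ID directly, which is what lets this style of structure support subsetting over both space and values, at the cost of occasional false positives.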
