Leveraging in-memory computation: Using Spark for textual queries

Presented by: Tejas Kannan
Date: 28/11/2018

Traditional Applications

Complex textual queries are generally expensive to run on traditional database platforms.

[Diagram: four clients connected to a stack of Presentation Logic, Business Logic, and Data Access Layer (RDBMS, NoSQL)]

Elasticsearch [1] Background

Elasticsearch relies on inverted indexes to enhance search efficiency.

Documents:
1. winter is coming
2. yours is the fury
3. the choice is yours

Term     Frequency  Documents
choice   1          3
coming   1          1
fury     1          2
is       3          1,2,3
the      2          2,3
winter   1          1
yours    2          2,3

[1] Elasticsearch, https://www.elastic.co/
Example from: https://www.elastic.co/blog/found-elasticsearch-from-the-bottom-up
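As a rough illustration of the inverted index idea, the sketch below builds a term-to-documents map for the three example documents on this slide. It is a minimal plain-Python sketch, not how Elasticsearch implements its indexes; the function name `build_inverted_index` and the whitespace tokenization are assumptions made for this example.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each term to the sorted list of document IDs that contain it.

    Hypothetical helper for illustration; terms are lowercased and split on whitespace.
    """
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


# The three example documents from the slide.
documents = {
    1: "winter is coming",
    2: "yours is the fury",
    3: "the choice is yours",
}

index = build_inverted_index(documents)
for term in sorted(index):
    # Print term, number of documents containing it, and the document IDs.
    print(term, len(index[term]), index[term])
```

Looking up a term is then a single dictionary access instead of a scan over every document, which is the efficiency gain the slide refers to.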

Elasticsearch requires...
• explicitly marking searchable fields at ingestion time
• a dedicated index for each searchable field

Initial Elasticsearch Benchmark

Inverted indexes provide large performance improvements at the expense of additional storage.

Raw data size: 12.25 MB
Elasticsearch size with inverted index: 56.73 MB

[Figure: Suffix Query Latency on Elasticsearch. Y-axis: Query Latency (ms), 0 to 180. Bars: Inverted Index, Without Inverted Index.]

Notes on experiment:
• 600,000 documents of actor/actress names from the IMDb dataset [1]
• Queries were 2-character strings based on common English names
• Error bars represent the 25th, 50th, and 75th percentiles; data collected from 1000 trials

[1] IMDb Dataset, https://www.imdb.com/interfaces/
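To make the storage cost concrete, the arithmetic below works out the overhead implied by the two sizes reported on this slide. Only the two input figures come from the slide; the variable names are chosen for this example.

```python
raw_mb = 12.25      # raw data size reported on the slide
indexed_mb = 56.73  # Elasticsearch size with inverted index, reported on the slide

overhead_mb = indexed_mb - raw_mb  # extra storage taken by the index
ratio = indexed_mb / raw_mb        # indexed size relative to the raw data

print(f"extra storage: {overhead_mb:.2f} MB, ratio: {ratio:.2f}x")
# prints "extra storage: 44.48 MB, ratio: 4.63x"
```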

Using Spark [1] for Query Execution

Instead of requiring explicit indexes, we can try to use Spark as a computation engine for executing complex textual queries.

[Diagram: Presentation Logic, Business Logic, Data Access Layer (Spark), Redis [2]]

The Data Access Layer must use a persistent SparkContext to reduce job overhead.

[1] Zaharia, Matei, et al. "Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing." Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation. USENIX Association, 2012.
[2] Redis, https://redis.io/

Why might using Spark + Redis be a good idea?
• FiloDB is an open-source database which uses Spark as a computation engine on top of Cassandra for real-time stream analysis [1]
• Using Spark with Redis can provide more than a 45x performance increase over Spark + HDFS [2,3]
• Spark as a computation engine provides flexibility of query execution
• By not requiring indexes for every searchable field, such a system can reduce the memory footprint

[1] FiloDB, https://velvia.github.io/Introducing-FiloDB/
[2] Shvachko, Konstantin, et al. "The Hadoop distributed file system." Mass Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium on. IEEE, 2010.
[3] Redis Accelerates Spark by over 100 times, https://redislabs.com/press/redis-accelerates-spark-by-over-100-times/

Caching Intermediate Results

Using some additional memory, we can cache intermediate results to speed up future queries.

SELECT * FROM Hospitals WHERE name ends with "ity"

T1: filter(h => h.name.endsWith("y"))
T2: filter(h => h.name.endsWith("ity"))

[Diagram: Redis Database -> T1 -> RDD -> T2 -> RDD. The RDD produced by T1 is kept in a Redis Cache of all documents with name ending with "y".]
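The reuse of a cached intermediate result can be sketched in plain Python, without Spark or Redis. This is only an illustration of the idea on the slide: the class `SuffixQueryCache`, its methods, and the hospital names are hypothetical, and an in-process dictionary stands in for the Redis cache.

```python
class SuffixQueryCache:
    """Answers 'name ends with X' queries, reusing cached results for shorter suffixes."""

    def __init__(self, names):
        self.names = list(names)
        self.cache = {}  # suffix -> names ending with that suffix (stand-in for Redis)

    def warm(self, suffix):
        """Run the T1-style full scan once and keep its result."""
        self.cache[suffix] = [n for n in self.names if n.endswith(suffix)]

    def query(self, suffix):
        """Run the T2-style filter over the smallest applicable cached result.

        A cached suffix is applicable when the queried suffix ends with it,
        because every match for the query is then already in that cached list.
        Returns the matches and the cached suffix that was used (or None).
        """
        candidates, source = self.names, None
        for cached_suffix, rows in self.cache.items():
            if suffix.endswith(cached_suffix) and len(rows) < len(candidates):
                candidates, source = rows, cached_suffix
        return [n for n in candidates if n.endswith(suffix)], source


# Hypothetical hospital names, for illustration only.
hospitals = ["Mercy General", "Unity", "City Clinic", "Trinity", "St. Mary", "University"]

cache = SuffixQueryCache(hospitals)
cache.warm("y")                        # T1: cache all names ending with "y"
matches, used = cache.query("ity")     # T2: filter only the cached subset
print(matches, used)
```

In this sketch the second filter scans four cached names instead of all six, which mirrors how T2 avoids rereading the full dataset.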

Caching Intermediate Results

Caching patterns can be chosen based on common phrases to maximize effectiveness with limited memory.

SELECT * FROM Hospitals WHERE name ends with "ity"

T1: filter(h => h.name.endsWith("ty"))
T2: filter(h => h.name.endsWith("ity"))

[Diagram: Redis Database -> T1 -> RDD -> T2 -> RDD. The RDD produced by T1 is kept in a Redis Cache of all documents with name ending with "ty".]
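One possible way to choose caching patterns under a memory limit is a greedy selection, sketched below. The slides do not say how patterns are selected, so everything here is an assumption: the function `choose_cache_patterns`, the use of cached row count as a stand-in for memory, the hits-per-row score, and the sample names and query counts.

```python
def choose_cache_patterns(names, query_counts, max_cached_rows):
    """Greedily pick suffixes to cache, favoring many expected hits per cached row.

    query_counts maps a candidate suffix to how often past queries ended with it
    (hypothetical input). Memory is approximated by the number of cached rows.
    """
    scored = []
    for suffix, count in query_counts.items():
        rows = sum(1 for n in names if n.endswith(suffix))
        if rows:
            scored.append((count / rows, suffix, rows))

    chosen, used = [], 0
    for _, suffix, rows in sorted(scored, reverse=True):
        if used + rows <= max_cached_rows:  # skip patterns that would exceed the limit
            chosen.append(suffix)
            used += rows
    return chosen, used


# Hypothetical data, for illustration only.
hospitals = ["Mercy General", "Unity", "City Clinic", "Trinity", "St. Mary", "University"]
query_counts = {"y": 10, "ty": 8, "ic": 1}

chosen, used = choose_cache_patterns(hospitals, query_counts, max_cached_rows=4)
print(chosen, used)
```

With a limit of four rows, this sketch prefers "ty" over "y": it serves almost as many expected queries while caching fewer rows, which is the trade-off the slide describes.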

Goals of this Project
• Create a Spark + Redis platform which can handle "prefix," "suffix," and "contains" queries
• Implement a caching feature using a configurable memory limit
• Benchmark the results against Elasticsearch to compare query latency and memory usage

[Figure: Suffix Query Latency on Elasticsearch. Y-axis: Query Latency (ms), 0 to 180. Bars: Inverted Index, Without Inverted Index. Annotation: "Can Spark + Redis fit into here while saving on storage?"]
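The three query types named in the goals can be expressed as simple scan predicates. The sketch below uses plain Python lists rather than Spark RDDs; the function `run_text_query` and the sample names are hypothetical, and matching is case-sensitive by assumption.

```python
def run_text_query(names, kind, pattern):
    """Scan all names and keep those matching a prefix, suffix, or contains query."""
    predicates = {
        "prefix": str.startswith,
        "suffix": str.endswith,
        "contains": str.__contains__,
    }
    if kind not in predicates:
        raise ValueError(f"unsupported query kind: {kind!r}")
    match = predicates[kind]
    return [name for name in names if match(name, pattern)]


# Hypothetical names, for illustration only.
names = ["Anna Lee", "Joanna Smith", "Leanne Ross"]

print(run_text_query(names, "prefix", "An"))
print(run_text_query(names, "suffix", "ss"))
print(run_text_query(names, "contains", "an"))
```

Each query is a full scan with no index, so its cost grows with the dataset; that is the latency the proposed caching is meant to reduce.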

Questions?

References
1. Elasticsearch, https://www.elastic.co/
2. Elasticsearch from the Bottom Up, https://www.elastic.co/blog/found-elasticsearch-from-the-bottom-up
3. FiloDB, https://velvia.github.io/Introducing-FiloDB/
4. IMDb Dataset, https://www.imdb.com/interfaces/
5. Redis, https://redis.io/
6. Redis Accelerates Spark by over 100 times, https://redislabs.com/press/redis-accelerates-spark-by-over-100-times/
7. Shvachko, Konstantin, et al. "The Hadoop distributed file system." Mass Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium on. IEEE, 2010.
8. Zaharia, Matei, et al. "Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing." Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation. USENIX Association, 2012.
