Leveraging in-memory computation: Using Spark for textual queries. Presented by: Tejas Kannan. PowerPoint PPT Presentation.


SLIDE 1

Leveraging in-memory computation: Using Spark for textual queries

Presented by: Tejas Kannan Date: 28/11/2018

SLIDE 2

Traditional Applications

[Diagram: Clients → Presentation Logic → Business Logic → Data Access Layer (RDBMS, NoSQL)]


Complex textual queries are generally expensive to run on traditional database platforms

SLIDE 3

Elasticsearch1 Background

Elasticsearch relies on inverted indexes to enhance search efficiency


  • 1. winter is coming
  • 2. yours is the fury
  • 3. the choice is yours

Term     Frequency   Documents
choice   1           3
coming   1           1
fury     1           2
is       3           1,2,3
the      2           2,3
winter   1           1
yours    2           2,3

1Elasticsearch, https://www.elastic.co/

Example from: https://www.elastic.co/blog/found-elasticsearch-from-the-bottom-up
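The inverted index above can be reproduced in a few lines of Python. This is an illustrative sketch of the data structure only, not Elasticsearch's actual implementation, which stores considerably more per-term metadata:

```python
from collections import defaultdict

# The three example documents from the slide.
documents = {
    1: "winter is coming",
    2: "yours is the fury",
    3: "the choice is yours",
}

# term -> set of ids of documents containing that term
index = defaultdict(set)
for doc_id, text in documents.items():
    for term in text.split():
        index[term].add(doc_id)

def lookup(term):
    """Return (document frequency, sorted document ids) for a term."""
    docs = sorted(index.get(term, set()))
    return len(docs), docs

print(lookup("is"))     # (3, [1, 2, 3])
print(lookup("yours"))  # (2, [2, 3])
```

A term lookup is then a dictionary access rather than a scan over all documents, which is exactly the trade-off the next slides quantify: fast reads paid for with extra storage.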

SLIDE 4

Elasticsearch requires…

  • …explicitly marking searchable fields at ingestion time
  • …a dedicated index for each searchable field


SLIDE 5

Initial Elasticsearch Benchmark

[Chart: Suffix query latency (ms) on Elasticsearch, with and without an inverted index]

Inverted indexes provide large performance improvements at the expense of additional storage

Raw Data Size: 12.25 MB
Elasticsearch Size with Inverted Index: 56.73 MB

Notes on Experiment:

  • 600,000 documents of actor/actress names from IMDb dataset1
  • Queries were 2-character strings based on common English names
  • Error bars represent 25th-50th-75th percentiles; data collected from 1000 trials

1IMDb Dataset, https://www.imdb.com/interfaces/

SLIDE 6

Using Spark1 for Query Execution

Instead of requiring explicit indexes, we can try to use Spark as a computation engine for executing complex textual queries
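In this model a textual query is a brute-force parallel scan (a Spark filter transformation) over the document set rather than an index lookup. A minimal Python sketch of the idea, with a plain list standing in for a cached RDD and hypothetical helper names:

```python
# Plain list standing in for a cached RDD of document strings.
documents = ["City Hospital", "County Clinic", "Trinity Hospital", "Mercy Center"]

# Each query type is a full scan; in Spark these would be rdd.filter(...)
# transformations executed in parallel across partitions.
def prefix_query(docs, prefix):
    return [d for d in docs if d.startswith(prefix)]

def suffix_query(docs, suffix):
    return [d for d in docs if d.endswith(suffix)]

def contains_query(docs, substring):
    return [d for d in docs if substring in d]

print(suffix_query(documents, "Hospital"))  # ['City Hospital', 'Trinity Hospital']
```

No per-field index is built or stored; the cost is that every query touches every record, which is why keeping the data in memory (Redis) and reusing a warm SparkContext matter.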


[Diagram: Presentation Logic → Business Logic → Data Access Layer → Spark + Redis2]

Data Access Layer must use a persistent SparkContext to reduce job overhead

1Zaharia, Matei, et al. "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing." Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation. USENIX Association, 2012.

2Redis, https://redis.io/

SLIDE 7

Why might using Spark + Redis be a good idea?

  • FiloDB is an open-source database which uses Spark as a computation engine on top of Cassandra for real-time stream analysis1
  • Using Spark with Redis can provide over a 45x increase in performance over Spark + HDFS2,3
  • Spark as a computation engine provides flexibility of query execution
  • By not requiring indexes for every searchable field, such a system can reduce the memory footprint


1FiloDB, https://velvia.github.io/Introducing-FiloDB/
2Shvachko, Konstantin, et al. "The Hadoop Distributed File System." Mass Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium on. IEEE, 2010.
3Redis Accelerates Spark by over 100 times, https://redislabs.com/press/redis-accelerates-spark-by-over-100-times/

SLIDE 8

Caching Intermediate Results

SELECT * FROM Hospitals WHERE name ends with “ity”


[Diagram: Redis Database → RDD → RDD → Redis Cache]

T1: filter(h => h.name.endsWith("y"))  T2: filter(h => h.name.endsWith("ity"))

Cache of all documents with name ending with ‘y’

Using some additional memory, we can cache intermediate results to speed up future queries
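The cache-reuse logic above can be sketched in Python (plain lists in place of RDDs; the names are hypothetical): a query for names ending in "ity" only rescans the cached "y" subset, since every match for "ity" must also end in "y".

```python
documents = ["unity", "trinity", "mercy", "county", "general"]

# T1 ran earlier and its intermediate result was cached in Redis:
# all documents whose name ends with "y".
suffix_cache = {"y": [d for d in documents if d.endswith("y")]}

def suffix_query(suffix):
    # If the new suffix extends a cached one, the cached subset is a
    # superset of the answer, so we scan it instead of the whole database.
    for cached_suffix, subset in suffix_cache.items():
        if suffix.endswith(cached_suffix):
            return [d for d in subset if d.endswith(suffix)]
    # No usable cache entry: fall back to a full scan.
    return [d for d in documents if d.endswith(suffix)]

print(suffix_query("ity"))  # ['unity', 'trinity']
```

The same containment argument applies to prefix and contains queries: any cached result for a shorter pattern safely bounds the candidates for a longer pattern that extends it.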

SLIDE 9

Caching Intermediate Results

SELECT * FROM Hospitals WHERE name ends with “ity”


[Diagram: Redis Database → RDD → RDD → Redis Cache]

T1: filter(h => h.name.endsWith("ty"))  T2: filter(h => h.name.endsWith("ity"))

Cache of all documents with name ending with ‘ty’

Caching patterns can be chosen based on common phrases to maximize effectiveness with limited memory
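One simple way to pick caching patterns under a memory limit is to cache the most frequently queried patterns that fit the budget. This is a hedged sketch of that idea; the query log, per-entry sizes, and greedy policy are all illustrative assumptions, not the project's actual algorithm:

```python
from collections import Counter

# Hypothetical query log and assumed per-entry cache sizes (MB).
query_log = ["ty", "ty", "ty", "y", "y", "ity", "ity", "er"]
entry_cost_mb = {"ty": 3.0, "y": 6.0, "ity": 1.5, "er": 2.0}
memory_limit_mb = 5.0

def choose_cached_patterns(log, costs, limit):
    """Greedily keep the most common patterns whose caches fit the budget."""
    chosen, used = [], 0.0
    for pattern, _count in Counter(log).most_common():
        if used + costs[pattern] <= limit:
            chosen.append(pattern)
            used += costs[pattern]
    return chosen

print(choose_cached_patterns(query_log, entry_cost_mb, memory_limit_mb))
# ['ty', 'ity']: "y" matches many queries but its cache is too large for the budget
```

This captures the trade-off stated on the slide: frequent patterns are worth caching, but broad patterns (like a one-character suffix) match so many documents that their cache may not pay for its memory.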

SLIDE 10

Goals of this Project

[Chart repeated from slide 5: Suffix query latency (ms) on Elasticsearch, with and without an inverted index]

Can Spark + Redis fit into here while saving on storage?

  • Create a Spark + Redis platform which can handle "prefix," "suffix," and "contains" queries
  • Implement a caching feature using a configurable memory limit
  • Benchmark the results against Elasticsearch to compare query latency and memory usage

SLIDE 11

Questions?


SLIDE 12

References

1. Elasticsearch, https://www.elastic.co/
2. Elasticsearch from the Bottom Up, https://www.elastic.co/blog/found-elasticsearch-from-the-bottom-up
3. FiloDB, https://velvia.github.io/Introducing-FiloDB/
4. IMDb Dataset, https://www.imdb.com/interfaces/
5. Redis, https://redis.io/
6. Redis Accelerates Spark by over 100 times, https://redislabs.com/press/redis-accelerates-spark-by-over-100-times/
7. Shvachko, Konstantin, et al. "The Hadoop Distributed File System." Mass Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium on. IEEE, 2010.
8. Zaharia, Matei, et al. "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing." Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation. USENIX Association, 2012.
