Search Engines and Time Series Databases

Università degli Studi di Roma "Tor Vergata"
Dipartimento di Ingegneria Civile e Ingegneria Informatica

Search Engines and Time Series Databases
Course: Sistemi e Architetture per Big Data (Systems and Architectures for Big Data), academic year 2017/18
Valeria Cardellini

The reference Big Data stack
• High-level Interfaces
• Support / Integration
• Data Processing
• Data Storage
• Resource Management

Valeria Cardellini - SABD 2017/18

Why search engines?
• How to add search to NoSQL data stores?
  – E.g., key-value data stores
• How to find documents that match queries?
  – With text search faster than in RDBMSs
• How to obtain specific features?
  – Such as highlighting, spatial search, suggestions, guided navigation, …

Search engines
• Most popular search engines:
  – Apache Solr
  – Elasticsearch
• ETL process

Apache Solr
• Scalable, highly reliable, open-source framework for searching data
• Built on Apache Lucene
  – Open-source library for indexing and search
  – Used by Solr for full-text search
• Can index documents written in
  – XML, JSON, CSV and binary formats
• Runs as a standalone application service
• Provides a REST-like web service that exposes operations to manage the lifecycle of documents in the index (indexing, querying, …)
• Used by popular Web apps (Apple, Instagram, LinkedIn, …)

Solr: key features
• Faceting
  – To group the results based on a specific field or defined criteria, providing the count of each subset
  – Example: a shopping site can provide facets to narrow search results by manufacturer or price
• Auto-suggest
  – To present a list of possible query terms
• Spell check
  – To suggest corrected spellings of query terms
• Highlighting
• Document clustering
  – To group related documents in the search results
• Spatial search
  – To filter search results based on location

Solr: key features (continued)
• Pagination and ranking of search results
• Results grouping
  – To group the results based on a grouping field and return the top documents in each group
• Near real-time search
  – To search documents immediately after they have been indexed; useful for apps with dynamically changing content (e.g., news)
• More Like This
  – To identify other documents that are similar to one in a result set

[Figure: Solr feature example]
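The faceting and results grouping features listed above can be illustrated with a minimal sketch in plain Python. This is not how Solr implements them internally; the in-memory result list, the field names and the helper functions `facet_counts` and `group_top` are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical in-memory "search results"; field names are made up for illustration.
results = [
    {"id": 1, "name": "Laptop A", "manufacturer": "Acme", "score": 0.9},
    {"id": 2, "name": "Laptop B", "manufacturer": "Globex", "score": 0.8},
    {"id": 3, "name": "Tablet C", "manufacturer": "Acme", "score": 0.7},
    {"id": 4, "name": "Phone D", "manufacturer": "Acme", "score": 0.4},
]


def facet_counts(docs, field):
    """Faceting: count how many results fall under each value of `field`."""
    return Counter(doc[field] for doc in docs)


def group_top(docs, field, limit=1):
    """Results grouping: keep the ids of the `limit` best-scoring documents per group."""
    groups = defaultdict(list)
    for doc in sorted(docs, key=lambda d: d["score"], reverse=True):
        if len(groups[doc[field]]) < limit:
            groups[doc[field]].append(doc["id"])
    return dict(groups)


facets = facet_counts(results, "manufacturer")
top_per_group = group_top(results, "manufacturer", limit=2)
print(facets)
print(top_per_group)
```

A shopping site would show the facet counts next to the result list so that users can narrow the search by manufacturer, while grouping returns only the best hits for each manufacturer.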

[Figure: Solr components]

Solr components
• Request Handlers: handle a client request at a URL
  – To query, a GET request to the /select handler
  – To index a document, a POST request to the /update handler
• Response Writers: serialize and stream the response to the client
• Search Components: part of a Search Handler, a componentized request handler
  – Includes: Query, Faceting, Highlighting, Debug, …
  – Distributed Search capable
• Update Handlers: handle an indexing request
• Update Processors chain: per-handler componentized chain that handles updates
• Query Parsing plugins
  – Mix and match query types in a single request
  – Function plugins for Function Query
• Text Analysis plugins: Analyzers, Tokenizers, TokenFilters
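The two request handlers above can be exercised from any HTTP client. The sketch below only builds the requests with Python's standard library and does not send them, so it runs without a Solr server. The host, port and `techproducts` collection follow Solr's example setup; the document that is indexed, the `wt=json` and `commit=true` parameters, and the helper names are assumptions chosen for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

SOLR_BASE = "http://localhost:8983/solr"  # assumed local Solr instance
COLLECTION = "techproducts"               # collection used by Solr's example setup


def select_request(query, **params):
    """Build a GET request for the /select request handler."""
    qs = urlencode({"q": query, **params})
    return Request(f"{SOLR_BASE}/{COLLECTION}/select?{qs}", method="GET")


def update_request(docs, commit=True):
    """Build a POST request that sends JSON documents to the /update handler."""
    qs = urlencode({"commit": "true" if commit else "false"})
    return Request(
        f"{SOLR_BASE}/{COLLECTION}/update?{qs}",
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


search = select_request("cat:electronics", wt="json")
index = update_request([{"id": "doc-1", "name": "hypothetical document"}])
print(search.get_method(), search.full_url)
print(index.get_method(), index.full_url)
# To actually send a request: urllib.request.urlopen(search) (needs a running Solr).
```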

Basic searching
• Solr can be queried via
  – REST clients, curl, wget, Postman (Chrome), etc., as well as via native clients available for many programming languages
• Example: to search all documents in the index via curl
    curl "http://localhost:8983/solr/techproducts/select?indent=on&q=*:*"
• Example: to search for a single term
    curl "http://localhost:8983/solr/techproducts/select?q=foundation"
• Example: to search all “electronics” documents in the index
    curl "http://localhost:8983/solr/techproducts/select?q=cat:electronics"
• See https://bit.ly/2GDLn3G

Scaling Solr: SolrCloud
• How to provide distributed indexing and search capabilities?
  – Up to millions of users and millions of indexed documents
• SolrCloud: deployment functionality of Solr that allows setting up clusters of Solr servers
  – Enables and simplifies horizontal scaling of a search index through replication and sharding
  – Sharding: incoming queries are distributed to the shards in the collection, and their partial results are merged into a single response
  – Replication: handles a higher concurrent query load by spreading the requests across multiple servers
• No master node to allocate nodes, shards and replicas
• SolrCloud uses ZooKeeper for storing shared configuration files and for coordination
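The sharding and replication ideas above can be sketched as a toy scatter-gather search. This is a conceptual model only, not SolrCloud's actual routing or ranking: the CRC32-based routing, the term-count scoring, the round-robin replica choice, and every class and function name are assumptions made for illustration.

```python
import heapq
import itertools
import zlib

NUM_SHARDS = 2  # hypothetical cluster size


def shard_for(doc_id, num_shards=NUM_SHARDS):
    """Route a document to a shard by hashing its id (CRC32 is an illustrative choice)."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


class Shard:
    def __init__(self, replicas):
        self.docs = {}
        # Replicas hold the same data; queries rotate over them to spread the load.
        self._replicas = itertools.cycle(replicas)

    def add(self, doc_id, text):
        self.docs[doc_id] = text

    def search(self, term, k):
        """Return up to k (score, doc_id, replica) hits; score = term occurrences."""
        replica = next(self._replicas)
        hits = []
        for doc_id, text in self.docs.items():
            score = text.lower().split().count(term)
            if score > 0:
                hits.append((score, doc_id, replica))
        return heapq.nlargest(k, hits)


def distributed_search(shards, term, k=3):
    """Scatter the query to every shard, then merge the partial top-k lists."""
    partial = [hit for shard in shards for hit in shard.search(term, k)]
    return heapq.nlargest(k, partial)


shards = [Shard([f"shard{i}-replica{j}" for j in range(2)]) for i in range(NUM_SHARDS)]
corpus = {
    "a": "solr solr solr search",
    "b": "solr search",
    "c": "elasticsearch search",
    "d": "solr solr index",
}
for doc_id, text in corpus.items():
    shards[shard_for(doc_id)].add(doc_id, text)

first = distributed_search(shards, "solr", k=2)
second = distributed_search(shards, "solr", k=2)
print(first)
print(second)
```

Because each shard returns its own top k hits, merging the partial lists yields the global top k regardless of how documents were distributed. The second query is answered by a different replica of each shard, which is how replication spreads concurrent query load.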

[Figure: Solr distributed architecture]

Elasticsearch
• Distributed, multitenant-capable and scalable full-text search engine with a REST-based interface and schema-free JSON documents
• Search engine based on Apache Lucene
• Developed in Java
• Distributed
  – Indices can be divided into shards, and each shard can have zero or more replicas
  – Each server hosts one or more shards and acts as a coordinator to delegate operations to the correct shard(s)
  – Rebalancing and routing are done automatically
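As a rough illustration of how shards and replicas are configured, the sketch below builds (but does not send) a REST request that creates an index with a given number of primary shards and replicas per shard. The `number_of_shards` and `number_of_replicas` settings are standard Elasticsearch index settings; the local address, the index name `articles` and the helper functions are assumptions for this example.

```python
import json
from urllib.request import Request

ES_BASE = "http://localhost:9200"  # assumed local node on the default HTTP port
INDEX = "articles"                 # hypothetical index name


def create_index_request(primary_shards, replicas_per_shard):
    """Build a PUT request that creates an index with the given shard layout."""
    body = {
        "settings": {
            "number_of_shards": primary_shards,
            "number_of_replicas": replicas_per_shard,
        }
    }
    return Request(
        f"{ES_BASE}/{INDEX}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def total_shard_copies(primary_shards, replicas_per_shard):
    """Each primary shard exists once plus once per replica."""
    return primary_shards * (1 + replicas_per_shard)


req = create_index_request(primary_shards=3, replicas_per_shard=1)
print(req.get_method(), req.full_url, req.data.decode("utf-8"))
print(total_shard_copies(3, 1))
```

With 3 primary shards and 1 replica each, the cluster has 6 shard copies to place on its servers; with zero replicas there would be only the 3 primaries.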

Elastic (ELK) Stack
• Elasticsearch is closely integrated with Logstash and Kibana (Elastic Stack)
• Logstash
  – Server-side data processing pipeline that ingests data from a multitude of sources simultaneously, transforms it, and then sends it to Elasticsearch
• Kibana
  – Data visualization platform

Solr vs. Elasticsearch
• Elasticsearch vs. Solr on Google Trends [figure]
• Solr
  – Mature, widely deployed product
  – Active and large developer community
  – Provides a highly detailed functional environment; a wide range of plug-ins is available
• Elasticsearch
  – Newer, but already very widely used
  – Focus on extracting value from data generally, and not just on search
  – Part of the ELK stack
  – Schema-free and document-oriented

Time series database (TSDB)
• How to analyze DevOps monitoring data, application metrics, and sensor data from smart factories, smart cities, or smart vehicles?
• Time series databases (TSDBs)
  – A possible solution, not the only one!
• Optimized for handling high-volume time series data
  – Time series: sequence of data points (arrays of numbers) indexed by time (a date time or a date time range), e.g.:
    • Stock prices (price curve)
    • Energy consumption (load profile)
    • Temperature values (temperature trace)
• Optimized for providing complex logic to analyze time series data
  – Queries on historical data that involve time ranges, roll-ups and arbitrary time zone conversions are difficult in a conventional DBMS

TSDB: overview
• Create, enumerate, update and destroy various time series and organize them in some fashion
  – Series may be organized hierarchically and have companion metadata
  – Provide basic calculations on a series as a whole (e.g., multiplying, adding, or combining various time series into a new time series)
  – Filter on arbitrary patterns (e.g., day of the week, low value, high value)
  – Provide additional statistical functions that are targeted to time series data
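The operations listed above (roll-ups, combining series, filtering on a pattern such as day of the week) can be sketched in a few lines of plain Python. This is only a toy model of what a TSDB provides natively and at scale; the data, the representation of a series as a list of `(timestamp, value)` pairs, and the function names are assumptions for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone
from statistics import mean


def rollup(series, bucket, agg=mean):
    """Roll up a series into fixed-size time buckets (e.g., hourly averages)."""
    size = int(bucket.total_seconds())
    buckets = defaultdict(list)
    for ts, value in series:
        epoch = int(ts.timestamp())
        floored = datetime.fromtimestamp(epoch - epoch % size, tz=timezone.utc)
        buckets[floored].append(value)
    return [(ts, agg(values)) for ts, values in sorted(buckets.items())]


def combine(a, b, op):
    """Combine two series point by point on their shared timestamps."""
    b_by_ts = dict(b)
    return [(ts, op(v, b_by_ts[ts])) for ts, v in a if ts in b_by_ts]


def only_weekday(series, weekday):
    """Filter on a pattern: keep points that fall on one day of the week (0 = Monday)."""
    return [(ts, v) for ts, v in series if ts.weekday() == weekday]


start = datetime(2018, 3, 5, tzinfo=timezone.utc)  # a Monday
load = [(start + timedelta(minutes=30 * i), float(i)) for i in range(6)]
hourly = rollup(load, timedelta(hours=1))
doubled = combine(load, load, lambda x, y: x + y)
daily = [(start + timedelta(days=d), float(d)) for d in range(3)]
mondays = only_weekday(daily, 0)
print(hourly)
print(doubled[:2])
print(mondays)
```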

TSDB: some products
• Some open-source products
  – CrateDB https://crate.io
  – Chronix http://www.chronix.io
  – Graphite https://graphiteapp.org
    • Stores numeric time-series data and renders graphs of this data on demand
  – InfluxDB https://www.influxdata.com
  – KairosDB https://kairosdb.github.io
    • Stores its time series in Cassandra
  – OpenTSDB http://opentsdb.net
    • Stores its time series in HBase
  – Riak-TS http://basho.com/products/riak-ts/
    • NoSQL key/value store optimized for time series data with a masterless architecture (similar to Riak-KV)

InfluxDB
• Written in Go
• Supports high write loads and large data set storage
• Conserves space through downsampling and by automatically expiring and deleting unwanted data
  – Also supports backup and restore
• Provides an easy-to-use SQL-like query language for interacting with data
• Provides simple, high-performing write and query HTTP(S) APIs, e.g.:
  – To create a database
      curl -i -XPOST http://localhost:8086/query --data-urlencode "q=CREATE DATABASE mydb"
  – To write data
      curl -i -XPOST 'http://localhost:8086/write?db=mydb' --data-binary 'cpu_load_short,host=server01,region=us-west value=0.64 1434055562000000000'
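The body of the write request above uses InfluxDB's line protocol: a measurement name, comma-separated tags, a space, the fields, a space, and a timestamp in nanoseconds. The sketch below builds such a line for the same data point. It is a simplified illustration, not a full client: it assumes float field values only, performs no escaping of special characters, and the helper names `to_ns` and `to_line_protocol` are made up for this example.

```python
from datetime import datetime, timezone


def to_ns(dt):
    """Convert a timezone-aware datetime (whole seconds) to epoch nanoseconds."""
    return int(dt.timestamp()) * 10**9


def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Format one point as: measurement,tag=value,... field=value,... timestamp."""
    tag_part = "".join(f",{key}={value}" for key, value in sorted(tags.items()))
    field_part = ",".join(f"{key}={value}" for key, value in sorted(fields.items()))
    return f"{measurement}{tag_part} {field_part} {timestamp_ns}"


point_time = datetime(2015, 6, 11, 20, 46, 2, tzinfo=timezone.utc)
line = to_line_protocol(
    "cpu_load_short",
    tags={"host": "server01", "region": "us-west"},
    fields={"value": 0.64},
    timestamp_ns=to_ns(point_time),
)
print(line)
```

The resulting string is what the curl example passes with `--data-binary`; tags (`host`, `region`) are indexed metadata, while fields (`value`) carry the measured data.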
