Università degli Studi di Roma “Tor Vergata”
Dipartimento di Ingegneria Civile e Ingegneria Informatica

Search Engines and Time Series Databases
Course: Sistemi e Architetture per Big Data
Academic year 2017/18
Valeria Cardellini

The reference Big Data stack
• High-level Interfaces
• Support / Integration
• Data Processing
• Data Storage
• Resource Management

Valeria Cardellini - SABD 2017/18

Why search engines?
• How to add search to NoSQL data stores?
  – E.g., key-value data stores
• How to find documents that match queries?
  – With text search that is faster than in RDBMSs
• How to obtain specific features?
  – Such as highlighting, spatial search, suggestions, guided navigation, …

Search engines
• Most popular search engines:
  – Apache Solr
  – Elasticsearch
• ETL process

Apache Solr
• Scalable, highly reliable, open-source framework for searching data
• Built on Apache Lucene
  – Open-source library for indexing and search
  – Used by Solr for full-text search
• Can index documents written in
  – XML, JSON, CSV and binary formats
• Runs as a standalone application service
• Provides a REST-like web service that exposes operations to manage the lifecycle of documents in the index (indexing, querying, …)
• Used by many popular Web apps (Apple, Instagram, LinkedIn, …)

Solr: key features
• Faceting
  – To group the results based on a specific field or defined criteria, providing the count of each subset
  – Example: a shopping site can provide facets to narrow search results by manufacturer or price
• Auto-suggest
  – To present a list of possible query terms
• Spell check
  – To suggest corrected spellings of query terms
• Highlighting
• Document clustering
  – To group related documents in the search results
• Spatial search
  – To filter search results based on location

Solr: key features (continued)
• Pagination and ranking of search results
• Results grouping
  – To group the results based on a grouping field and return the top documents in each group
• Near real-time search
  – To search documents immediately after they have been indexed; useful for apps with dynamically changing content (e.g., news)
• More Like This
  – To identify other documents that are similar to one in a result set

Solr feature example (figure)
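
As a rough illustration of what faceting and results grouping compute, the sketch below reproduces both over a small in-memory list. It is not Solr code: the documents, the field names (`manu`, `price`) and the helper functions `facet_counts` and `group_top` are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical documents; in Solr these would be the hits of a query.
DOCS = [
    {"id": 1, "name": "USB cable", "manu": "Belkin", "price": 11.5},
    {"id": 2, "name": "Hard drive", "manu": "Maxtor", "price": 350.0},
    {"id": 3, "name": "iPod cable", "manu": "Belkin", "price": 19.95},
    {"id": 4, "name": "SSD", "manu": "Samsung", "price": 92.0},
]


def facet_counts(docs, field):
    """Faceting: count how many documents fall under each value of `field`."""
    return Counter(doc[field] for doc in docs)


def group_top(docs, field, sort_key, limit=1):
    """Results grouping: group by `field`, keep the top `limit` docs per group."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc[field]].append(doc)
    return {
        value: sorted(members, key=sort_key)[:limit]
        for value, members in groups.items()
    }


if __name__ == "__main__":
    print(facet_counts(DOCS, "manu"))
    print(group_top(DOCS, "manu", sort_key=lambda d: d["price"]))
```

A shopping site would show the facet counts next to the result list so that the user can narrow the search by manufacturer.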

Solr components
• Request Handlers: handle a client request at a URL
  – To query, send a GET request to the /select handler
  – To index a document, send a POST request to the /update handler
• Response Writers: serialize and stream the response to the client
• Search Components: parts of a Search Handler, a componentized request handler
  – Include: Query, Faceting, Highlighting, Debug, …
  – Distributed Search capable
• Update Handlers: handle an indexing request
• Update Processor chain: per-handler componentized chain that handles updates
• Query Parsing plugins
  – Mix and match query types in a single request
  – Function plugins for Function Query
• Text Analysis plugins: Analyzers, Tokenizers, TokenFilters
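
The two request handlers named above can be exercised from any HTTP client. The sketch below only builds the requests, without sending them; it assumes a local Solr instance on port 8983 with a collection called `techproducts`, and the helper names `select_url` and `update_request` are made up for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: a local Solr with a collection named "techproducts".
SOLR_BASE = "http://localhost:8983/solr/techproducts"


def select_url(query, **params):
    """URL for a GET request to the /select handler."""
    return f"{SOLR_BASE}/select?" + urlencode({"q": query, **params})


def update_request(docs):
    """POST request to the /update handler carrying a JSON list of documents."""
    body = json.dumps(docs).encode("utf-8")
    return Request(
        f"{SOLR_BASE}/update?commit=true",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    print(select_url("cat:electronics", indent="on"))
    req = update_request([{"id": "doc1", "name": "example"}])
    print(req.get_method(), req.full_url)
    # To actually send: urllib.request.urlopen(req), with Solr running.
```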

Basic searching
• Solr can be queried via
  – REST clients, curl, wget, Postman, etc., as well as via native clients available for many programming languages
• Example: to search all documents in the index via curl
    curl "http://localhost:8983/solr/techproducts/select?indent=on&q=*:*"
• Example: to search for a single term
    curl "http://localhost:8983/solr/techproducts/select?q=foundation"
• Example: to search all “electronics” documents in the index
    curl "http://localhost:8983/solr/techproducts/select?q=cat:electronics"
See https://bit.ly/2GDLn3G

Scaling Solr: SolrCloud
• How to provide distributed indexing and search capabilities?
  – Up to millions of users and millions of indexed documents
• SolrCloud: deployment mode of Solr that allows setting up clusters of Solr servers
  – Enables and simplifies horizontal scaling of a search index through replication and sharding
  – Sharding: incoming queries are distributed to the shards of the collection, and the partial results are merged into one response
  – Replication: handles a higher concurrent query load by spreading the requests over multiple servers
• No master node to allocate nodes, shards and replicas
• SolrCloud uses ZooKeeper for storing shared configuration files and for coordination
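
To make the roles of sharding and replication concrete, here is a toy in-memory model. It is a simplification under stated assumptions, not SolrCloud's implementation: documents are routed with a CRC32 hash (SolrCloud's actual router differs), every replica of a shard receives each write, and a query visits one replica per shard in round-robin order before merging. The class `ToyCollection` and the function `shard_for` are hypothetical names.

```python
import zlib
from itertools import cycle


def shard_for(doc_id, num_shards):
    """Route a document to a shard with a stable hash (CRC32, for illustration)."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


class ToyCollection:
    def __init__(self, num_shards, replicas_per_shard):
        # shards[s][r] is the document list held by replica r of shard s.
        self.shards = [
            [[] for _ in range(replicas_per_shard)] for _ in range(num_shards)
        ]
        # Round-robin replica selection per shard, to spread the query load.
        self._next_replica = [
            cycle(range(replicas_per_shard)) for _ in range(num_shards)
        ]

    def index(self, doc):
        """Sharding: a document lives on one shard, copied to all its replicas."""
        s = shard_for(doc["id"], len(self.shards))
        for replica in self.shards[s]:
            replica.append(doc)

    def search(self, predicate):
        """Scatter the query to one replica of every shard, then merge."""
        results = []
        for s, replicas in enumerate(self.shards):
            replica = replicas[next(self._next_replica[s])]
            results.extend(d for d in replica if predicate(d))
        return sorted(results, key=lambda d: d["id"])


if __name__ == "__main__":
    coll = ToyCollection(num_shards=3, replicas_per_shard=2)
    for i in range(10):
        coll.index({"id": f"doc{i}", "cat": "electronics" if i % 2 else "book"})
    print(coll.search(lambda d: d["cat"] == "electronics"))
```

Adding shards lets the index grow beyond one server, while adding replicas lets more queries run concurrently against the same data.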

Solr distributed architecture (figure)

Elasticsearch
• Distributed, multitenant-capable and scalable full-text search engine with a REST-based interface and schema-free JSON documents
• Search engine based on Apache Lucene
• Developed in Java
• Distributed
  – Indices can be divided into shards, and each shard can have zero or more replicas
  – Each server hosts one or more shards and acts as a coordinator to delegate operations to the correct shard(s)
  – Rebalancing and routing are done automatically
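
A minimal sketch of how the shard and replica counts of an index could be set through the REST interface, and how a schema-free JSON document could be indexed. The requests are only built, not sent. The host and port, the index name `logs`, and the helper names are assumptions; `number_of_shards` and `number_of_replicas` are Elasticsearch index settings, while the exact shape of the document endpoint has changed across Elasticsearch versions and should be checked against the version in use.

```python
import json
from urllib.request import Request

# Assumption: a local Elasticsearch node on the default HTTP port.
ES_BASE = "http://localhost:9200"


def _json_request(method, path, body):
    """Build (but do not send) a JSON request against the REST interface."""
    return Request(
        f"{ES_BASE}/{path}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method=method,
    )


def create_index_request(index, shards, replicas):
    """Create an index split into `shards` shards, each with `replicas` replicas."""
    settings = {"number_of_shards": shards, "number_of_replicas": replicas}
    return _json_request("PUT", index, {"settings": settings})


def index_document_request(index, doc_id, document):
    """Index a JSON document; no schema has to be declared beforehand.

    The "_doc" path segment is version-dependent (assumption, see lead-in).
    """
    return _json_request("PUT", f"{index}/_doc/{doc_id}", document)


if __name__ == "__main__":
    req = create_index_request("logs", shards=3, replicas=1)
    print(req.get_method(), req.full_url, req.data.decode("utf-8"))
    req = index_document_request("logs", "1", {"msg": "hello", "level": "info"})
    print(req.get_method(), req.full_url, req.data.decode("utf-8"))
```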

Elastic (ELK) Stack
• Elasticsearch is closely integrated with Logstash and Kibana (Elastic Stack)
• Logstash
  – Server-side data processing pipeline that ingests data from a multitude of sources simultaneously, transforms it, and then sends it to Elasticsearch
• Kibana
  – Data visualization platform

Solr vs. Elasticsearch
• Elasticsearch vs. Solr on Google Trends (figure)
• Solr
  – Mature, widely deployed product
  – Active and large developer community
  – Provides a highly detailed functional environment; a wide range of plug-ins is available
• Elasticsearch
  – Newer, but already very widely used
  – Focus on extracting value from data generally, and not just on search
  – Part of the ELK stack
  – Schema-free and document-oriented

Time series database (TSDB)
• How to analyze DevOps monitoring data, application metrics, and sensor data from smart factories, smart cities, or smart vehicles?
• Time series databases (TSDBs)
  – A possible solution, not the only one!
• Optimized for handling high-volume time series data
  – Time series: sequence of data points (arrays of numbers) indexed by time (a datetime or a datetime range), e.g.:
    • Stock prices (price curve)
    • Energy consumption (load profile)
    • Temperature values (temperature trace)
• Optimized for providing complex logic to analyze time series data
  – Queries on historical data, replete with time ranges, roll-ups and arbitrary time zone conversions, are difficult in a DBMS

TSDB: overview
• Create, enumerate, update and destroy various time series and organize them in some fashion
  – Series may be organized hierarchically and have companion metadata
  – Provide basic calculations on a series as a whole (e.g., multiplying, adding, or combining various time series into a new time series)
  – Filter on arbitrary patterns (e.g., day of the week, low value, high value)
  – Provide additional statistical functions that are targeted to time series data
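
The operations listed above can be pictured with a plain-Python sketch over lists of `(timestamp, value)` pairs: a roll-up into fixed time buckets, a pointwise sum of two series, and a day-of-week filter. This shows what a TSDB computes, not how any product implements it; the function names and the sample data are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone
from statistics import mean

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def rollup(points, bucket):
    """Roll-up: average the values falling into each `bucket`-wide interval."""
    buckets = defaultdict(list)
    for ts, value in points:
        start = ts - ((ts - EPOCH) % bucket)  # align to the bucket boundary
        buckets[start].append(value)
    return sorted((start, mean(vals)) for start, vals in buckets.items())


def add_series(a, b):
    """Combine two series into a new one by summing values at shared timestamps."""
    b_by_ts = dict(b)
    return [(ts, v + b_by_ts[ts]) for ts, v in a if ts in b_by_ts]


def filter_weekday(points, weekday):
    """Keep only the points that fall on `weekday` (Monday is 0)."""
    return [(ts, v) for ts, v in points if ts.weekday() == weekday]


if __name__ == "__main__":
    t0 = datetime(2018, 1, 1, tzinfo=timezone.utc)  # a Monday
    load = [(t0 + timedelta(minutes=30 * i), float(2 * i + 1)) for i in range(3)]
    print(rollup(load, timedelta(hours=1)))
    print(add_series(load, load))
    print(filter_weekday(load, 0))
```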

TSDB: some products
• Some open-source products
  – CrateDB https://crate.io
  – Chronix http://www.chronix.io
  – Graphite https://graphiteapp.org
    • Stores numeric time-series data and renders graphs of this data on demand
  – InfluxDB https://www.influxdata.com
  – KairosDB https://kairosdb.github.io
    • Stores its time series in Cassandra
  – OpenTSDB http://opentsdb.net
    • Stores its time series in HBase
  – Riak-TS http://basho.com/products/riak-ts/
    • NoSQL key/value store optimized for time series data with masterless architecture (similar to Riak-KV)

InfluxDB
• Written in Go
• Supports high write loads and large data set storage
• Conserves space through downsampling and by automatically expiring and deleting unwanted data
• Supports backup and restore
• Provides an easy-to-use SQL-like query language for interacting with data
• Provides simple, high-performing write and query HTTP(S) APIs, e.g.:
  – To create a database
    curl -i -XPOST http://localhost:8086/query --data-urlencode "q=CREATE DATABASE mydb"
  – To write data
    curl -i -XPOST 'http://localhost:8086/write?db=mydb' --data-binary 'cpu_load_short,host=server01,region=us-west value=0.64 1434055562000000000'
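
The body of the write request above is a single text line: measurement name, comma-separated tags, a space, the fields, a space, and a timestamp in nanoseconds. The sketch below rebuilds that line and the two request targets in Python without contacting a server. The helper names are hypothetical, only float fields are handled, and the escaping of special characters is left out.

```python
from urllib.parse import urlencode

# Assumption: a local InfluxDB on port 8086, as in the curl examples.
INFLUX_BASE = "http://localhost:8086"


def to_line(measurement, tags, fields, timestamp_ns):
    """Format one point as: measurement,tag=v,... field=v,... timestamp."""
    tag_part = "".join(f",{k}={v}" for k, v in sorted(tags.items()))
    # Simplification: every field is written as a float, without escaping.
    field_part = ",".join(f"{k}={float(v)!r}" for k, v in fields.items())
    return f"{measurement}{tag_part} {field_part} {timestamp_ns}"


def write_url(db):
    """Target of the POST that carries the formatted lines as its body."""
    return f"{INFLUX_BASE}/write?" + urlencode({"db": db})


def create_db_form(db):
    """URL-encoded form body for the POST to /query that creates a database."""
    return urlencode({"q": f"CREATE DATABASE {db}"})


if __name__ == "__main__":
    line = to_line(
        "cpu_load_short",
        {"host": "server01", "region": "us-west"},
        {"value": 0.64},
        1434055562000000000,
    )
    print(write_url("mydb"))
    print(line)
    print(create_db_form("mydb"))
```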