
Revealing Elasticsearch Implementation, Integration, and Execution



  1. Revealing Elasticsearch Implementation, Integration, and Execution

  2. Objective: Get access to a cluster, index documents, find them, and present them.

  3. Target audience
     ● Web developers
     ● Data scientists
     ● Report developers
     ● Technologists
     ● Infrastructure/DevOps

  4. What is Elasticsearch?
     ○ Written in Java
        ■ Open source
        ■ Cross platform
     ○ Based on Apache Lucene (the same library that underlies Apache Solr)
     ○ Scaled, real-time search & analytics
     ○ Full RESTful API
     ○ Plugin ecosystem
     ○ SDKs for Java, .NET, and many more
     ○ Eventually consistent

  5. An Elastic Timeline

  6. Elasticsearch History (timeline figure, 2010 to 2017). Milestones shown: releases 0.x, 1.x, 2.x, and 5.x; $104M in funding; Elastic Cloud; Prelert

  7. Getting Started

  8. Objective: All you need is an endpoint http://localhost:9200/_search

  9. Getting Out of the Gates
     Option 1 (*nix)
     ● Apt-get the latest version of Elasticsearch (5.2.1) from elastic.co
     ● Run bin/elasticsearch
     ● Curl http://localhost:9200
     Option 2 (Windows)
     ● Download the latest version of Elasticsearch from elastic.co
     ● Run bin\elasticsearch.bat
     Option 3 (Cloud)
     ● Create a free account with Elastic.co
     ● Create a free account with Amazon Web Services
     ● Many other providers
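
The "Curl http://localhost:9200" step can also be done from a script. The following is a minimal sketch using only the Python standard library; it assumes a local node on the default port, and the helper names `build_request` and `fetch_json` are illustrative, not part of any Elasticsearch SDK.

```python
import json
import urllib.request

# Assumption: a local single-node cluster listening on the default HTTP port.
BASE_URL = "http://localhost:9200"


def build_request(path="", base_url=BASE_URL):
    """Build (but do not send) a GET request for a cluster endpoint."""
    url = base_url.rstrip("/") + "/" + path.lstrip("/")
    return urllib.request.Request(url, method="GET")


def fetch_json(request, timeout=5):
    """Send the request and decode the JSON body returned by the cluster."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a node running, `fetch_json(build_request())` should return the small JSON document the root endpoint serves to describe the node, and `fetch_json(build_request("_search"))` queries the search endpoint named in the objective.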

  10. Cluster Overview

  11. Objective: Understand how data is stored and transactions are scaled

  12. Standard Configuration
     ● A typical production cluster will contain 3 nodes (installations)
        ○ Additional nodes can be brought online through discovery
     ● A typical index will contain 5 primary shards and 5 replica shards (one replica per primary), spread across the nodes
     ● Each shard is replicated on a different node, so the loss of a single node will not lose data
     ● A dedicated master node is commonly specified to manage cluster state (such as shard allocation); any node can receive a request and route it to the right shards
     ● Data is also serialized to disk and can be recovered
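
The 5 primary and 5 replica shards mentioned on this slide correspond to two index-level settings. As a rough sketch, assuming the 5.x defaults of 5 primaries with 1 replica each, the body sent when creating an index could be built like this (the helper names are hypothetical):

```python
import json


def index_settings(primary_shards=5, replicas_per_primary=1):
    """Body for an index-creation request (e.g. PUT /patients)."""
    return {
        "settings": {
            "number_of_shards": primary_shards,
            "number_of_replicas": replicas_per_primary,
        }
    }


def total_shards(primary_shards, replicas_per_primary):
    """Primaries plus all of their replica copies."""
    return primary_shards * (1 + replicas_per_primary)


body = json.dumps(index_settings(), indent=2)
```

With the defaults, `total_shards(5, 1)` gives 10 shard copies, which the cluster distributes so that a primary and its replica do not share a node.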

  13. Storage: A cluster of 3 machines with 32GB of RAM each has 32GB of cache.

  14. Questions?

  15. Indexing Data

  16. Objective: All you need is Postman

  17. Inverted Indexes
     Elasticsearch uses a structure called an inverted index
     ● Find all the unique words that appear in a document
     ● List documents in which each word (token) appears
     Reduces total search size
     ● Find all documents in which a token exists
     ● Ranks documents based on occurrences
     Word stemming & casing
     ● Cases are removed in tokens
     ● Stemming algorithm drops “ing”, “ly”, “s”, etc.
     Normalization
     ● All inverted indexes are normalized
     ● Custom analyzers can be applied to documents
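
To make the idea concrete, here is a toy inverted index in Python. It is only a sketch of the concept on this slide, not how Lucene stores its index: the tokenizer, the three-suffix stemmer, and the occurrence-count ranking are deliberate simplifications, and the sample documents are made up.

```python
import re
from collections import defaultdict

# Only the suffixes named on the slide; real stemmers handle far more cases.
SUFFIXES = ("ing", "ly", "s")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def stem(token):
    """Drop one known suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def build_inverted_index(docs):
    """Map each stemmed token to {doc_id: number of occurrences}."""
    index = defaultdict(lambda: defaultdict(int))
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[stem(token)][doc_id] += 1
    return index


def search(index, word):
    """Return matching doc ids, most occurrences first."""
    postings = index.get(stem(word.lower()), {})
    return sorted(postings, key=lambda doc_id: (-postings[doc_id], doc_id))


docs = {
    1: "Searching logs quickly",
    2: "Quick search, quick results",
    3: "Indexing documents",
}
index = build_inverted_index(docs)
```

Here `search(index, "Quick")` returns `[2, 1]`: both documents contain the normalized token, and document 2 ranks first because the token occurs twice.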

  18. Mappings
     Elasticsearch Mapping
     ● Elasticsearch will attempt to “guess” type mappings as each document is indexed.
     ● Once created, the mapping of an existing field cannot be changed without re-creating the index.
     ● A custom mapping can be applied before indexing documents.
     Available types:
     ● Boolean
     ● Long
     ● Double
     ● Date
     ● String (split into text and keyword in 5.x)
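
As an illustration of applying a custom mapping before indexing, the sketch below builds a mapping body for a few fields of the patient document used later in the deck. It assumes the 5.x request shape (a `patient` mapping type under `mappings`) and would be sent as the body of the index-creation request; the choice of type for each field is my assumption, not something the slides specify.

```python
import json

# Assumed field types for a subset of the patient document's fields.
patient_mapping = {
    "mappings": {
        "patient": {
            "properties": {
                "first_name": {"type": "text"},
                "gender": {"type": "keyword"},  # exact-value matching
                "age": {"type": "long"},
                "height": {"type": "double"},
                "dob": {"type": "date", "format": "epoch_millis"},
                "location": {"type": "geo_point"},  # lat/lon object
            }
        }
    }
}


def field_types(mapping, doc_type):
    """Flatten a mapping into {field name: type} for quick inspection."""
    props = mapping["mappings"][doc_type]["properties"]
    return {name: spec["type"] for name, spec in props.items()}


request_body = json.dumps(patient_mapping, indent=2)
```

Declaring `location` as `geo_point` matters because dynamic mapping would otherwise guess two plain numeric fields for `lat` and `lon`, which geo queries cannot use.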

  19. Analyzers
     ● None
     ● Standard
        ○ Splits the input text on word boundaries
        ○ Terms are lower cased
     ● Whitespace
        ○ Breaks text into terms whenever it encounters a whitespace character
     ● Simple
        ○ Breaks text into terms whenever it encounters a character which is not a letter
        ○ Terms are lower cased
     ● Language
        ○ 33+ languages supported
        ○ Stems words based on language
        ○ Removes language specific “stop” words
     ● Custom
        ○ E.g. Remove “stop” words using a language filter
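
The difference between these analyzers is easiest to see on a sample string. The functions below are rough Python approximations of the whitespace, simple, and a custom stop-word analyzer as described on this slide; they are not the Lucene implementations, and the stop-word list is a hypothetical sample.

```python
import re


def whitespace_analyzer(text):
    """Break text into terms at whitespace; case is preserved."""
    return text.split()


def simple_analyzer(text):
    """Break text at any non-letter character and lowercase the terms."""
    return re.findall(r"[^\W\d_]+", text.lower())


# Hypothetical sample list; real language analyzers ship their own.
STOP_WORDS = {"a", "an", "and", "of", "the", "to"}


def custom_analyzer(text, stop_words=STOP_WORDS):
    """Simple analysis followed by stop-word removal."""
    return [term for term in simple_analyzer(text) if term not in stop_words]


sample = "The Quick brown-fox ran to 2 Dogs"
```

For this sample, the whitespace analyzer keeps `brown-fox` and `2` as terms, the simple analyzer yields `the, quick, brown, fox, ran, to, dogs`, and the custom analyzer additionally drops `the` and `to`.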

  20. Patient Document Example
     ● JSON format (JavaScript Object Notation)
     ● Index by PUTting document to index endpoint
        ○ (PUT patients/patient/1)
        ○ Last item is unique key (1)
     ● Index operation automatically creates an index if it has not been created before
     ● Elasticsearch “guesses” types as they are posted
     ● Each indexed document is given a version number
     ● Index API optionally allows for optimistic concurrency control when the version parameter is specified
     ● Bulk-indexing supported (Bulk API)
     ● River plugins (Oracle, MSSQL, MySQL); note that rivers were deprecated in 1.5 and removed in 2.0
     Document:
     {
       "patient": {
         "first_name": "John",
         "last_name": "Doe",
         "dob": 252507600000,
         "gender": "Male",
         "race": "White",
         "height": 1.8288,
         "weight": 90.7185,
         "eyes": "blue",
         "hair": "brown",
         "age": 39,
         "tobacco": "no",
         "location": {
           "lat": 40.762446,
           "lon": -73.831653
         },
         "conditions": [{
           "icd10": "M54.5",
           "description": "Low back pain"
         }, {
           "icd10": "Z91.018",
           "description": "Allergy to other foods"
         }],
         "medications": [{
           "name": "Aspirin",
           "dosage": 150,
           "units": "mg",
           "frequency": 8,
           "freq_units": "hours"
         }]
       }
     }
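
The single-document PUT and the Bulk API differ mainly in how the request is shaped. The sketch below builds the path for the former and the newline-delimited body for the latter (one action line followed by one source line per document, ending with a newline). The helper names and the second patient are hypothetical, and the documents are trimmed to a few fields.

```python
import json


def index_path(index, doc_type, doc_id):
    """Path for indexing one document, e.g. PUT /patients/patient/1."""
    return "/{}/{}/{}".format(index, doc_type, doc_id)


def bulk_body(index, doc_type, docs):
    """Newline-delimited JSON for the Bulk API: action line, then source."""
    lines = []
    for doc_id, source in docs.items():
        action = {"index": {"_index": index, "_type": doc_type, "_id": str(doc_id)}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(source))
    # The Bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"


patients = {
    1: {"first_name": "John", "last_name": "Doe", "age": 39},
    2: {"first_name": "Jane", "last_name": "Doe", "age": 41},  # hypothetical
}
payload = bulk_body("patients", "patient", patients)
```

The payload would be POSTed to the `_bulk` endpoint; sending many documents in one request avoids a network round trip per document.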

  21. Questions?

  22. Querying Documents

  23. Objective: All you need is JSON

  24. Query DSL (Domain-Specific Language)
     ● Leaf query clauses
        ○ Leaf query clauses look for a particular value in a particular field, such as the match, term or range queries. These queries can be used by themselves.
     ● Compound query clauses
        ○ Compound query clauses wrap other leaf or compound queries and are used to combine multiple queries in a logical fashion (such as the bool or dis_max query), or to alter their behavior (such as the constant_score query).

  25. Common Query Types
     ● Full Text
        ○ Match All
        ○ Query String
     ● Term
        ○ Term
        ○ Range
        ○ Exists
        ○ Regexp
        ○ Fuzzy
     ● Compound
        ○ Bool
        ○ Boosting
     ● Joining
        ○ Nested
     ● Geo
        ○ Geo Shape
        ○ Geo Distance
        ○ Geo Polygon
     ● Specialized
        ○ More Like This
        ○ Template
        ○ Script

  26. Sample Bool Query
     ● JSON format (JavaScript Object Notation)
     ● Search by performing GET against a specific index
        ○ GET /patients/_search
     ● This query returns all men between the ages of 30 and 50 who use aspirin
     Query:
     {
       "query": {
         "bool": {
           "must": [{
             "match": {
               "medications.name": "Aspirin"
             }
           }],
           "filter": [{
             "term": {
               "gender": "Male"
             }
           }, {
             "range": {
               "age": {
                 "lte": 50,
                 "gte": 30
               }
             }
           }]
         }
       }
     }
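
When the same query is needed with different values, it helps to build the body programmatically. The sketch below reproduces this slide's bool query from parameters; the function name is hypothetical. Clauses under `must` contribute to the relevance score, while `filter` clauses only include or exclude documents. One assumption to flag: a `term` filter matches the exact stored value, so `"Male"` only matches if `gender` is mapped as an exact-value (keyword) field rather than analyzed text.

```python
import json


def patient_query(medication, gender, min_age, max_age):
    """Bool query: full-text match on medication, filtered by gender and age."""
    return {
        "query": {
            "bool": {
                # Scored clause: analyzed full-text match.
                "must": [{"match": {"medications.name": medication}}],
                # Unscored clauses: exact value and numeric range.
                "filter": [
                    {"term": {"gender": gender}},
                    {"range": {"age": {"lte": max_age, "gte": min_age}}},
                ],
            }
        }
    }


body = json.dumps(patient_query("Aspirin", "Male", 30, 50), indent=2)
```

The resulting `body` is the JSON shown on the slide and would be sent with the search request against the `patients` index.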

  27. Query Result
     Query returns a formatted JSON result indicating the search metrics
     ● Took
        ○ Length of time in milliseconds the query took to execute and return
     ● Shards
        ○ Number of shards utilized in execution of the query
     ● Hits
        ○ Total and max score of all results
        ○ Hits[] is an array of resulting documents, which can be limited by size
     Result:
     {
       "took": 1,
       "timed_out": false,
       "_shards": {
         "total": 5,
         "successful": 5,
         "failed": 0
       },
       "hits": {
         "total": 1,
         "max_score": 1.3862944,
         "hits": [{
           "_index": "patients",
           "_type": "patient",
           "_id": "1",
           "_score": 1.3862944,
           "_source": {
             "first_name": "John",
             "last_name": "Doe",
             "dob": 252507600000,
             . . .
           }
         }]
       }
     }
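
Once the response is decoded into a dictionary, pulling out the fields described on this slide is straightforward. The following sketch assumes the 5.x response shape shown here, where `hits.total` is a plain integer; `summarize` is a hypothetical helper, and the sample `_source` contains only the fields visible on the slide.

```python
def summarize(response):
    """Extract timing, shard health, and matched documents from a response."""
    hits = response["hits"]
    return {
        "took_ms": response["took"],
        "failed_shards": response["_shards"]["failed"],
        "total": hits["total"],
        "documents": [(h["_id"], h["_score"], h["_source"]) for h in hits["hits"]],
    }


# The slide's result, with _source limited to the fields it shows.
sample_response = {
    "took": 1,
    "timed_out": False,
    "_shards": {"total": 5, "successful": 5, "failed": 0},
    "hits": {
        "total": 1,
        "max_score": 1.3862944,
        "hits": [{
            "_index": "patients",
            "_type": "patient",
            "_id": "1",
            "_score": 1.3862944,
            "_source": {"first_name": "John", "last_name": "Doe", "dob": 252507600000},
        }],
    },
}
summary = summarize(sample_response)
```

Checking `failed_shards` before trusting `total` is a reasonable habit, since a query that fails on some shards can still return partial results.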

  28. Aggregates
     An aggregation can be seen as a unit-of-work that builds analytic information over a set of documents.
     Bucketing
     ● A family of aggregations that build buckets, where each bucket is associated with a key and a document criterion.
     Metric
     ● Aggregations that keep track and compute metrics over a set of documents.
     Matrix
     ● A family of aggregations that operate on multiple fields and produce a matrix result based on the values extracted from the requested document fields.
     Pipeline
     ● Aggregations that aggregate the output of other aggregations and their associated metrics
     Query:
     {
       "query": {
         "bool": {
           "must": [{
             "match": {
               "gender": "Male"
             }
           }]
         }
       },
       "aggs": {
         "medications": {
           "terms": {
             "field": "medications.name"
           }
         }
       }
     }

  29. Query Result
     A bucket aggregation finds all documents matching the query (in this case all males) and aggregates the results into key and doc_count fields.
     Only documents matching the initial query will be considered for aggregation.
     Result:
     {
       . . .
       "aggregations": {
         "medications": {
           "doc_count_error_upper_bound": 0,
           "sum_other_doc_count": 0,
           "buckets": [
             {
               "key": "Aspirin",
               "doc_count": 2465
             },
             {
               "key": "Omeprazole",
               "doc_count": 1824
             },
             {
               "key": "Lisinopril",
               "doc_count": 1121
             }
           ]
         }
       }
     }
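
For reporting, the bucket list is usually flattened into a simple mapping. A minimal sketch, using the bucket values from this slide and a hypothetical helper name:

```python
def bucket_counts(response, agg_name):
    """Turn a terms aggregation's buckets into {key: doc_count}."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return {bucket["key"]: bucket["doc_count"] for bucket in buckets}


# The aggregation part of the slide's result.
sample = {
    "aggregations": {
        "medications": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
                {"key": "Aspirin", "doc_count": 2465},
                {"key": "Omeprazole", "doc_count": 1824},
                {"key": "Lisinopril", "doc_count": 1121},
            ],
        }
    }
}
counts = bucket_counts(sample, "medications")
most_common = max(counts, key=counts.get)
```

The name passed as `agg_name` is the one chosen in the request's `aggs` section (`medications` here), which is how a response with several aggregations is told apart.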

  30. Statistical Aggregates
     The aggregations in this family compute metrics based on values extracted in one way or another from the documents that are being aggregated. The values are typically extracted from the fields of the document (using the field data), but can also be generated using scripts.
     Query:
     {
       "query": {
         "bool": {
           "must": [{
             "match": {
               "gender": "Male"
             }
           }]
         }
       },
       "aggs": {
         "age_stats": {
           "extended_stats": {
             "field": "age"
           }
         }
       }
     }
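
To give a sense of what an `extended_stats` aggregation reports, the sketch below computes the same kind of summary locally for a list of values. It is an approximation under stated assumptions: the variance is computed as a population variance from the running sums, which is my understanding of the 5.x behaviour, and the ages are made-up sample data.

```python
import math


def extended_stats(values):
    """Compute count, min, max, avg, sum, sum of squares, variance, std dev."""
    count = len(values)
    if count == 0:
        return {"count": 0}
    total = sum(values)
    sum_of_squares = sum(v * v for v in values)
    avg = total / count
    # Population variance from running sums; clamp tiny negative rounding error.
    variance = max(sum_of_squares / count - avg * avg, 0.0)
    return {
        "count": count,
        "min": min(values),
        "max": max(values),
        "avg": avg,
        "sum": total,
        "sum_of_squares": sum_of_squares,
        "variance": variance,
        "std_deviation": math.sqrt(variance),
    }


ages = [30, 40, 50]  # hypothetical ages of matching patients
stats = extended_stats(ages)
```

Working from running sums is what makes this kind of metric cheap to compute across shards: each shard can report its own count, sum, and sum of squares, and the partial results combine by addition.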
