
Amundsen: A Data Discovery Platform from Lyft (April 17th 2019)



  1. Amundsen: A Data Discovery Platform from Lyft April 17th 2019 Jin Hyuk Chang | @jinhyukchang | Engineer, Lyft Tao Feng | @feng-tao | Engineer, Lyft

  2. Agenda • Data at Lyft • Challenges with Data Discovery • Data Discovery at Lyft • Demo • Architecture • Summary

  3. Data platform users: General Managers, Analysts, Data Scientists, Data Modelers, Product Managers, Engineers, Experimenters, all served by the Data Platform

  4. Core Infra high-level architecture (diagram; includes custom apps)

  5. Data Discovery

  6. Hi! I am a n00b Data Scientist! • My first project is to analyze and predict Data Council attendance • Where is the data? • What does it mean?

  7. Status quo • Option 1: Phone a friend! • Option 2: GitHub search

  8. Understand the context • What does this field mean? ‒ Does attendance data include employees? ‒ Does it include revenue? • Let me dig in and understand

  9. Explore: SELECT * FROM default.my_table WHERE ds='2018-01-01' LIMIT 100;

  10. Exploring with SELECT * is EVIL 1. Lack of productivity for data scientists 2. Increased load on the databases

  11. Data Scientists spend up to 1/3rd of their time in Data Discovery... • Data discovery ‒ Lack of understanding of what data exists, where it lives, who owns it, who uses it, and how to request access.

  12. Audience for data discovery

  13. Data Discovery - User personas: General Managers, Analysts, Data Scientists, Data Modelers, Product Managers, Engineers, Experimenters, all served by the Data Platform

  14. 3 Data Scientist personas • Power user: all info in their head; get interrupted a lot due to questions • Noob user: lost; ask “power users” a lot of questions • Manager: dependencies landing on time; communicating with stakeholders

  15. Data Discovery answers 3 kinds of questions • Search based: Where is the table/dashboard for X? What does it contain? Does this analysis already exist? • Lineage based: I am changing a data model, who are the owners and most common users? This table’s delivery was delayed today, I want to notify everyone downstream. • Network based: I want to follow a power user in my team. I want to bookmark tables of interest and get a feed of data delays, schema changes, and incidents.
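
The lineage-based question above (notifying everyone downstream of a delayed table) amounts to a graph traversal. A minimal sketch, assuming a hypothetical in-memory lineage map; the table names and the `downstream` helper are invented for illustration and are not taken from Amundsen:

```python
from collections import deque

# Hypothetical lineage: resource -> resources that read from it directly.
LINEAGE = {
    "raw.rides": ["core.rides"],
    "core.rides": ["mart.daily_rides", "mart.driver_stats"],
    "mart.daily_rides": ["dash.exec_summary"],
    "mart.driver_stats": [],
    "dash.exec_summary": [],
}


def downstream(resource, lineage):
    """Return every resource reachable from `resource`, breadth-first."""
    seen = {resource}
    order = []
    queue = deque([resource])
    while queue:
        current = queue.popleft()
        for child in lineage.get(current, []):
            if child not in seen:  # guard against diamonds and cycles
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


# Everything that should hear about a delay in core.rides.
affected = downstream("core.rides", LINEAGE)
```

In a real deployment the map would come from the metadata store rather than a literal dictionary; the traversal itself stays the same.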

  16. Meet Amundsen • First person to discover the South Pole: Norwegian explorer Roald Amundsen

  17. Landing page optimized for search

  18. Search results ranked on relevance and query activity

  19. How does search work?

  20. Relevance - search for “apple” on Google (figure: low relevance vs. high relevance)

  21. Popularity - search for “apple” on Google (figure: low popularity vs. high popularity)

  22. Striking the balance • Relevance: names, descriptions, tags, [owners, frequent users] • Popularity: querying activity; dashboarding; different weights for automated vs ad hoc querying
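
One way to picture the balance between relevance and popularity is a combined score. This is only a sketch under stated assumptions: the weights, the log damping, and the function names are hypothetical and are not Amundsen's actual ranking formula.

```python
import math


def popularity(adhoc_queries, automated_queries,
               adhoc_weight=1.0, automated_weight=0.1):
    """Weighted query count, log-damped so very busy tables do not dominate.

    Ad hoc queries count more than automated ones (assumed weights).
    """
    weighted = adhoc_weight * adhoc_queries + automated_weight * automated_queries
    return math.log1p(weighted)


def score(relevance, adhoc_queries, automated_queries, popularity_weight=0.5):
    """Combine a text-relevance score with the popularity signal."""
    return relevance + popularity_weight * popularity(adhoc_queries, automated_queries)


# Hypothetical candidates returned by a text match on "rides".
candidates = [
    {"name": "rides_tmp", "relevance": 2.0, "adhoc": 1, "automated": 0},
    {"name": "rides", "relevance": 1.8, "adhoc": 400, "automated": 5000},
]
ranked = sorted(
    candidates,
    key=lambda c: score(c["relevance"], c["adhoc"], c["automated"]),
    reverse=True,
)
```

Here the heavily queried `rides` table outranks `rides_tmp` even though its text relevance is slightly lower.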

  23. Back to mocks...

  24. Search results ranked on relevance and query activity

  25. Detailed description and metadata about data resources

  26. Data Preview within the tool

  27. Computed stats about column metadata Disclaimer: these stats are arbitrary.

  28. Built-in user feedback

  29. Demo

  30. Open source in mind • Pluggable code in each microservice via Python entry points, etc. • Pluggable API endpoints via Blueprint • Build your ingestion pipeline like Lego bricks
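
The pluggability idea can be illustrated with the `module:attribute` string format that Python entry points use. A minimal sketch, assuming a hypothetical `METADATA_PROXY` config key; it resolves the string with `importlib` directly instead of going through installed package metadata, and it uses a standard-library class as a stand-in for a custom plugin:

```python
import importlib


def load_plugin(spec):
    """Resolve a 'package.module:attribute' string to the object it names."""
    module_name, _, attribute = spec.partition(":")
    module = importlib.import_module(module_name)
    return getattr(module, attribute)


# Hypothetical configuration; a deployment would point this at its own class.
CONFIG = {"METADATA_PROXY": "collections:OrderedDict"}

proxy_class = load_plugin(CONFIG["METADATA_PROXY"])
proxy = proxy_class()
```

Swapping the implementation then only requires changing the configured string, not the calling code.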

  31. Amundsen’s architecture

  32. Architecture diagram: Frontend Service; Metadata Service (backed by Neo4j); Search Service (backed by Elasticsearch); Databuilder (crawler); other microservices (ML Feature Service, Security Service); metadata sources (Postgres, Hive, Redshift, ..., Presto, Github source files)

  33. 1. Frontend Service

  34. Architecture diagram (repeat of slide 32)

  35. Amundsen table detail page

  36. 2. Metadata Service

  37. Architecture diagram (repeat of slide 32)

  38. 2. Metadata Service • A thin proxy layer to interact with the graph database ‒ Currently Neo4j is the default option for the graph backend engine ‒ Working with the community to support Apache Atlas • Supports REST API for other services pushing / pulling metadata directly
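
The "thin proxy layer" idea can be sketched as an abstract interface with interchangeable backends. All class and method names below are hypothetical, and the in-memory backend merely stands in for a graph database such as Neo4j or Apache Atlas:

```python
from abc import ABC, abstractmethod


class MetadataProxy(ABC):
    """Hypothetical interface the service would call, whatever the backend."""

    @abstractmethod
    def get_table(self, key):
        """Return the metadata dict for `key`, or None if unknown."""

    @abstractmethod
    def put_table_description(self, key, description):
        """Create or update the description stored for `key`."""


class InMemoryProxy(MetadataProxy):
    """Stand-in backend for illustration; a real one would query a graph store."""

    def __init__(self):
        self._tables = {}

    def get_table(self, key):
        return self._tables.get(key)

    def put_table_description(self, key, description):
        self._tables.setdefault(key, {"key": key})["description"] = description


def describe(proxy, key):
    """Service-level helper that only depends on the MetadataProxy interface."""
    table = proxy.get_table(key)
    return table["description"] if table else None


backend = InMemoryProxy()
backend.put_table_description("hive://default/my_table", "Daily ride facts")
```

Because callers depend only on the interface, supporting another backend means adding one more subclass.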

  39. Trade Off #1: Why choose Graph database

  40. Why Graph database?

  41. Why Graph database?

  42. Trade Off #2: Why not propagate the metadata back to source

  43. Why not propagate the metadata back to source

  44. Why not propagate the metadata back to source

  45. Why not propagate the metadata back to source

  46. 3. Search Service

  47. Architecture diagram (repeat of slide 32)

  48. 3. Search Service • A thin proxy layer to interact with the search backend ‒ Currently it supports Elasticsearch as the search backend. • Supports different search patterns ‒ Normal Search: match records based on relevancy ‒ Category Search: match records first based on data type, then relevancy ‒ Wildcard Search
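
The three search patterns can be illustrated with a small dispatcher. This is a simplified sketch and not the service's real query logic: the `classify` and `search` helpers and the sample catalog are hypothetical, normal search is reduced to a substring match, and wildcard matching uses the standard-library `fnmatch` rather than a search backend.

```python
from fnmatch import fnmatch


def classify(query):
    """Return (kind, category, term) for a raw query string."""
    if ":" in query:
        category, _, term = query.partition(":")
        return ("category", category.strip(), term.strip())
    if "*" in query:
        return ("wildcard", None, query)
    return ("normal", None, query)


# Hypothetical catalog: table name -> column names.
TABLES = {
    "event_rides": ["ride_id", "is_line_ride"],
    "event_payments": ["payment_id"],
    "drivers": ["driver_id"],
}


def search(query, tables):
    kind, category, term = classify(query)
    if kind == "category" and category == "column":
        # Category search: restrict the match to one kind of field.
        return sorted(name for name, columns in tables.items() if term in columns)
    if kind == "wildcard":
        return sorted(name for name in tables if fnmatch(name, term))
    # Normal search (and unknown categories): plain substring match on the name.
    return sorted(name for name in tables if term in name)
```

For example, `search("event_*", TABLES)` matches both event tables, while `search("column: is_line_ride", TABLES)` matches only the table that has that column.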

  49. Challenge #1: How to make the search result more relevant?

  50. How to make the search result more relevant? • Define a search quality metric ‒ Click-Through-Rate (CTR) over top 5 results • Search behaviour instrumentation is key • Couple of improvements: ‒ Boost the exact table ranking ‒ Support wildcard search (e.g. event_* ) ‒ Support category search (e.g. column: is_line_ride )
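
A click-through rate over the top 5 results could be computed from instrumented search events roughly as follows. The event shape (`clicked_rank`, 1-based, `None` when nothing was clicked) and the exact definition are assumptions for illustration:

```python
def ctr_at_k(search_events, k=5):
    """Fraction of searches followed by a click on one of the top-k results."""
    if not search_events:
        return 0.0
    hits = sum(
        1
        for event in search_events
        if event["clicked_rank"] is not None and event["clicked_rank"] <= k
    )
    return hits / len(search_events)


# Hypothetical instrumentation log.
events = [
    {"query": "rides", "clicked_rank": 1},
    {"query": "event_*", "clicked_rank": 7},      # click outside the top 5
    {"query": "drivers", "clicked_rank": None},   # no click at all
    {"query": "payments", "clicked_rank": 3},
]
ctr = ctr_at_k(events)
```

Tracking this number before and after a ranking change gives a simple signal of whether the change helped.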

  51. 4. Data Builder

  52. Architecture diagram (as on slide 32, with “Other Services” shown in place of “Security Service”)

  53. Challenge #1: Various forms of metadata

  54. Metadata Sources @ Lyft

  55. Metadata - Challenges • No Standardization: No single data model fits all data resources ‒ A data resource could be a table, an Airflow DAG or a dashboard • Different Extraction: Each data set's metadata is stored and fetched differently ‒ Hive Table: Stored in Hive metastore ‒ RDBMS (Postgres etc.): Fetched through DBAPI interface ‒ Github source code: Fetched through git hook ‒ Mode dashboard: Fetched through Mode API ‒ …
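
One common way to cope with differing extraction paths is to hide each source behind the same small interface and normalize into one record shape. The sketch below is illustrative only: `ResourceRecord`, the extractor classes, and their inputs are hypothetical, and the extractors read from in-memory stubs rather than a real metastore or dashboard API.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, List, Protocol, Tuple


@dataclass(frozen=True)
class ResourceRecord:
    """Hypothetical common shape for any data resource."""
    kind: str    # e.g. "table" or "dashboard"
    name: str
    source: str


class Extractor(Protocol):
    def extract(self) -> Iterator[ResourceRecord]: ...


class HiveTableExtractor:
    """Stands in for a metastore reader; takes (schema, table) rows directly."""

    def __init__(self, rows: Iterable[Tuple[str, str]]) -> None:
        self._rows = list(rows)

    def extract(self) -> Iterator[ResourceRecord]:
        for schema, table in self._rows:
            yield ResourceRecord("table", f"{schema}.{table}", "hive")


class DashboardExtractor:
    """Stands in for a dashboard API client; takes an already-decoded payload."""

    def __init__(self, payload: List[Dict[str, str]]) -> None:
        self._payload = payload

    def extract(self) -> Iterator[ResourceRecord]:
        for item in self._payload:
            yield ResourceRecord("dashboard", item["title"], "mode")


def collect(extractors: Iterable[Extractor]) -> List[ResourceRecord]:
    """Run every extractor and flatten the results into one list."""
    return [record for extractor in extractors for record in extractor.extract()]


records = collect([
    HiveTableExtractor([("default", "my_table")]),
    DashboardExtractor([{"title": "Weekly rides"}]),
])
```

Downstream code then handles one record type regardless of where the metadata came from.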

  56. Challenge #2: Pull model vs Push model

  57. Pull model vs. push model • Pull model: periodically update the index by pulling from the system (e.g. database) via crawlers. Diagram components: Scheduler, Crawler, Database, Data graph • Push model: the system (e.g. database) pushes metadata to a message bus which downstream subscribes to. Diagram components: Database, Message queue, Data graph

  58. Pull model vs. push model • Pull model: onus of integration lies on the data graph; no interface to prescribe, hard to maintain crawlers • Push model: onus of integration lies on the database; message format serves as the interface; allows for near-real-time indexing

  59. Pull model vs. push model • Pull model: onus of integration lies on the data graph; no interface to prescribe, hard to maintain crawlers. Preferred if: waiting for indexing is ok; working with “strapped” teams; there’s already an interface • Push model: onus of integration lies on the database; message format serves as the interface; allows for near-real-time indexing. Preferred if: near-real-time indexing is important; a clean interface doesn’t exist; other tools like Wherehows are moving towards the push model
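
The two models can be contrasted in a few lines. A minimal sketch, assuming a hypothetical `DataGraph` index and using the standard-library `queue.Queue` as a stand-in for a message bus; none of these names come from Amundsen:

```python
import queue


class DataGraph:
    """Hypothetical metadata index: table name -> list of column names."""

    def __init__(self):
        self.index = {}

    def upsert(self, table, schema):
        self.index[table] = schema


def crawl(database, graph):
    """Pull model: a scheduled crawler reads the source's catalog."""
    for table, schema in database.items():
        graph.upsert(table, schema)


def publish(bus, table, schema):
    """Push model: the source emits a message; the format is the interface."""
    bus.put({"table": table, "schema": schema})


def consume(bus, graph):
    """Push model: the data graph drains whatever messages have arrived."""
    while True:
        try:
            message = bus.get_nowait()
        except queue.Empty:
            return
        graph.upsert(message["table"], message["schema"])


# Pull: the graph only sees changes when the crawler next runs.
database = {"rides": ["ride_id", "ds"]}
pulled = DataGraph()
crawl(database, pulled)

# Push: a schema change is announced as soon as it happens.
bus = queue.Queue()
pushed = DataGraph()
publish(bus, "rides", ["ride_id", "ds", "city"])
consume(bus, pushed)
```

In the pull variant all integration code lives in `crawl`, on the data graph side; in the push variant the source has to call `publish`, which is why the onus shifts to the database team.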

  60. 4. Databuilder

  61. Databuilder in action

  62. How are we building data? Databuilder

  63. How is databuilder orchestrated? Amundsen uses Apache Airflow to orchestrate Databuilder jobs

  64. What’s next?

  65. Amundsen seems to be more useful than what we thought • Tremendous success at Lyft ‒ Used by Data Scientists, Engineers, PMs, Ops, even Cust. Service! • Many organizations have similar problems ‒ Collaborating with ING, WeWork and more ‒ We plan to announce open source soon

  66. Impact - Amundsen at Lyft (chart annotated with three milestones: Alpha release, Beta release (internal), Generally Available (GA) release)

  67. Summary

  68. Adding more kinds of data resources • Resource types: Dashboards, Data sets, People, Streams, Schemas, Workflows • Phases: Phase 1 (Complete), Phase 2 (In development), Phase 3 (In scoping)

  69. Summary • Data Discovery adds 30+% more productivity to Data Scientists • Metadata is key to the next wave of big data applications • Amundsen - Lyft’s metadata and data discovery platform • Blog post with more details: go.lyft.com/datadiscoveryblog
