Building Multi-Model Big Data Platform for Real Estate Analytics
Karthik Karuppaiya
ApacheCon Big Data - 05/18/2017
Agenda
  1. Company Overview
  2. Overview of Ten-X Datasets
  3. Ten-X Data Platform Architecture
  4. Data-pipeline Implementation Overview
  5. Platform Layers
  6. Lessons Learned
  7. What’s next?
Ten-X At a Glance
Who am I?
  Karthik Karuppaiya
  Sr. Engineering Manager, Data and Analytics
  @karthikkrk
  https://www.linkedin.com/in/karthikkrk/
  kkaruppaiya@ten-x.com
  WE ARE HIRING!
CUSTOMER CENTRICITY: BUILDING A 360 DEGREE VIEW
Bringing multiple data sets together to truly understand the customer
  • BEHAVIORAL DATA: What are they doing on Ten-X? Where are they doing it?
  • COMMUNICATIONS DATA: What are all the interactions they have with us online/offline/phone?
  • TRANSACTIONAL DATA: What are they buying or selling on Ten-X?
  • PREFERENCES / DEMOGRAPHICS: How does this person prefer to work with us?
  • RELATIONSHIP DATA: Who are their relationships with other entities and people?
  • 3RD PARTY ACTIVITY DATA: What has this person done (e.g. bought/sold/own outside of Ten-X)?
Data Sets
  Data Set | Typical Data Format
  Behavioral Data (Instrumentation) | Files/API
  Transactional Data (OLTP) | Kafka/RDBMS
  Communications (AdWords/Marketo/SalesForce) | API
  Preferences/Demographics (OLTP) | RDBMS
  Relationship Data (OLTP/Third Party) | RDBMS, Files, API
  3rd Party Activity Data (MLS, REIS, etc.) | Files, API, RDBMS
  Documents and Objects (Pictures, PDFs, etc.) | Binary Files, Spreadsheets, etc.
Data Platform Challenges
  • Support mass storage of both structured and unstructured data
  • Support both batch and real-time streaming data
  • Support machine learning across multiple data sets
  • Enable discovery and modeling of complex relationships
  • Join records and resolve duplicates across a very large number of data sets
Platform Design Goals
  • Private data center
  • All open source tools
  • Extremely easy for business teams to analyze data
  • Multi-tenant: one platform for all lines of business and all teams
  • Easily scalable
  • Keep it Simple, Stupid
The Technologies That Power Our Data Platform!
Platform Design
  [Architecture diagram. Components shown: Hue, Tableau, Spring Boot, Oozie, Ambari, Ambari Metrics, Atlas, Ranger, TinkerPop, Gremlin, Spark, Pig, Hive, Tez, MapReduce, JanusGraph, YARN, Ansible, ELK, Elasticsearch, Cassandra, HDFS, HBase]
Data Pipeline Architecture
  Raw Layer -> Clean Layer -> Derived Layer -> Analysis Layer
Raw Layer
  • All the data lands here.
  • Data is never exposed from the Raw Layer.
  • Data is mostly stored in its original format (mostly text).
Clean Layer
  • All the cleansing rules are applied to the data.
      • Example: Standardize the Gender data to Male/Female/Unknown for all records.
      • Example: Standardize the Updated_Timestamp column on all tables.
  • Data is optimized for storage and querying.
      • Data is mostly stored in the ORC file format.
      • External Hive tables are created for all the datasets.
  • Platform users typically have access to the data in the Clean Layer for exploratory purposes.
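The two cleansing examples above can be sketched in a few lines of Python. This is only an illustration of the kind of rule the slide describes, not the platform's actual code: the field names `gender` and `updated_timestamp`, the spelling map, the list of accepted timestamp formats, and the assumption that zone-less values are UTC are all hypothetical.

```python
from datetime import datetime, timezone

# Hypothetical mapping of raw gender spellings to the standard values.
_GENDER_MAP = {
    "m": "Male", "male": "Male",
    "f": "Female", "female": "Female",
}

# Hypothetical list of timestamp formats seen in the source tables.
_TS_FORMATS = ("%Y-%m-%d %H:%M:%S", "%m/%d/%Y %H:%M", "%Y-%m-%dT%H:%M:%S")


def standardize_gender(raw):
    """Map a raw gender value to Male, Female or Unknown."""
    if raw is None:
        return "Unknown"
    return _GENDER_MAP.get(str(raw).strip().lower(), "Unknown")


def standardize_timestamp(raw):
    """Return the timestamp as ISO 8601 text, or None if it cannot be parsed.

    Assumes (for this sketch) that values without a zone are already in UTC.
    """
    for fmt in _TS_FORMATS:
        try:
            parsed = datetime.strptime(raw.strip(), fmt)
        except (ValueError, AttributeError):
            continue  # wrong format, or raw is not a string: try the next one
        return parsed.replace(tzinfo=timezone.utc).isoformat()
    return None


def clean_record(record):
    """Return a cleaned copy of one row (a dict); the input is left untouched."""
    cleaned = dict(record)
    cleaned["gender"] = standardize_gender(record.get("gender"))
    cleaned["updated_timestamp"] = standardize_timestamp(
        record.get("updated_timestamp")
    )
    return cleaned
```

Keeping each rule as a small pure function makes it easy to apply the same rule to every table that carries the column.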
Derived Layer
  • Data is de-normalized for faster analytics queries.
  • This reduces cluster resource usage, because the same joins are not run repeatedly.
  • Multiple sources are joined together for a unified view of a customer.
      • Example: Join Omniture data with the User Profile data to get a complete view.
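As a rough illustration of the de-normalization described above, the sketch below joins behavioral events onto user profiles once and stores the result as one wide row per user. The real pipeline presumably does this at scale with the Hadoop tools named elsewhere in the deck; plain Python is used here only to show the shape of the join. The field names `user_id` and `page` are hypothetical.

```python
from collections import defaultdict


def build_customer_view(profiles, page_views):
    """Join behavioral events onto user profiles, one output row per profile."""
    # Group the events by user once, so each profile lookup is cheap.
    views_by_user = defaultdict(list)
    for view in page_views:
        views_by_user[view["user_id"]].append(view["page"])

    derived = []
    for profile in profiles:
        pages = views_by_user.get(profile["user_id"], [])
        row = dict(profile)  # copy so the source profile is not modified
        row["page_view_count"] = len(pages)
        row["pages_viewed"] = sorted(set(pages))
        derived.append(row)
    return derived
```

Downstream queries can then read the pre-joined rows instead of repeating the join each time.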
Analysis Layer
  • This is the consumable layer for APIs, BI dashboards and reports.
  • All the aggregations are performed in this layer.
  • Relationship modeling and entity resolution happen in this layer.
  • Data in the Analysis Layer is served from the appropriate store, based on the need:
      • JanusGraph: for data that fits a graph data model
      • Cassandra/HBase: for key-value type data sets
      • Elasticsearch: for fast searches
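The slides do not say how entity resolution is implemented. One common way to think about it is as clustering: records that share an identifying value belong to the same entity. The sketch below uses a union-find structure for that; it is an illustrative technique, not the author's method, and the match keys `email` and `phone` are hypothetical.

```python
def resolve_entities(records, match_keys=("email", "phone")):
    """Group record indexes into clusters that share any match-key value."""
    parent = list(range(len(records)))

    def find(i):
        # Follow parent links to the cluster root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            # Attach the larger root index under the smaller one.
            parent[max(root_i, root_j)] = min(root_i, root_j)

    first_seen = {}  # (key, normalized value) -> index of first record with it
    for index, record in enumerate(records):
        for key in match_keys:
            value = record.get(key)
            if not value:
                continue  # missing values never link records together
            normalized = (key, str(value).strip().lower())
            if normalized in first_seen:
                union(index, first_seen[normalized])
            else:
                first_seen[normalized] = index

    clusters = {}
    for index in range(len(records)):
        clusters.setdefault(find(index), []).append(index)
    return sorted(clusters.values())
```

Matches are transitive here: if record A shares an email with B, and B shares a phone number with C, all three land in one cluster. Exact matching on normalized values is the simplest possible rule; real data usually needs fuzzier comparison.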
API Layer
  • Read-only APIs serve the data to the rest of the organization.
  • Mesos and Docker provide a scalable API layer.
  • Spring Boot is used for faster API development.
  • The API layer also consumes directly from Kafka for real-time needs.
  • It publishes feedback information to Kafka, which goes through the pipeline again.
Lessons Learned
Monitor NameNode checkpointing
  • Checkpointing is a process that takes an fsimage and edit log and compacts them into a new fsimage.
  • Checkpointing with NameNode HA is the recommended way.
  • If there is no NameNode HA, the Secondary NameNode does the checkpointing.
  • When checkpointing fails to happen, the edit log keeps growing, the disks fill up and the NameNode crashes.
  • There is potential to lose data if the edits file gets corrupted for any reason. There are some bugs in older versions of HDFS that might cause the edits file to get corrupted.
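A monitoring check for this lesson can be as simple as comparing the age of the last successful checkpoint against a threshold. The sketch below assumes the last checkpoint time is already available as epoch milliseconds (how it is collected from the cluster is left out), and the warning and critical thresholds are hypothetical values to tune per cluster.

```python
import time

MS_PER_HOUR = 3_600_000


def checkpoint_status(last_checkpoint_ms, now_ms=None,
                      warn_after_hours=6.0, critical_after_hours=24.0):
    """Classify how stale the last NameNode checkpoint is.

    Returns a (level, age_in_hours) tuple where level is
    "OK", "WARNING" or "CRITICAL". Thresholds are illustrative only.
    """
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    age_hours = (now_ms - last_checkpoint_ms) / MS_PER_HOUR
    if age_hours >= critical_after_hours:
        level = "CRITICAL"
    elif age_hours >= warn_after_hours:
        level = "WARNING"
    else:
        level = "OK"
    return level, age_hours
```

Alerting on checkpoint age catches the failure early, well before the growing edit log fills the disk.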
Lessons Learned…
Take backups of all the critical metadata
  • Hive Metastore (MySQL/Postgres)
  • NameNode fsimage/edits files
  • Ambari DB (MySQL/Postgres)
  • Oozie DB (MySQL/Postgres)
  • Ranger DB (MySQL/Postgres)
Lessons Learned…
Monitor the logs regularly and set alerts for runtime errors.
  • This helps identify jobs that fail due to runtime errors, so we can optimize them and reduce wasted cluster resources.
Make sure the YARN queues are defined correctly and policies are set appropriately.
  • Make sure one bad query does not affect the rest of the cluster.
  • Group teams together.
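The log-alerting idea can be sketched as a small scan that counts runtime errors per job and flags jobs above a threshold. The deck lists ELK among the platform tools, so the real alerting most likely lives there; this standalone Python version only illustrates the logic. The log line layout (`<timestamp> <LEVEL> <job_name>: <message>`) and the threshold are hypothetical.

```python
import re
from collections import Counter

# Hypothetical log line layout: "<timestamp> <LEVEL> <job_name>: <message>"
_LINE = re.compile(
    r"^\S+\s+(?P<level>[A-Z]+)\s+(?P<job>[\w.-]+):\s+(?P<message>.*)$"
)


def jobs_to_alert(log_lines, max_errors=3):
    """Return {job_name: error_count} for jobs with at least max_errors errors."""
    errors = Counter()
    for line in log_lines:
        match = _LINE.match(line)
        # Lines that do not fit the layout are ignored rather than failing.
        if match and match.group("level") in ("ERROR", "FATAL"):
            errors[match.group("job")] += 1
    return {job: count for job, count in errors.items() if count >= max_errors}
```

Jobs that fail repeatedly are good first candidates for optimization, since every failed run consumes queue capacity without producing output.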
Monitoring and Alerting
Data Governance & MDM - Atlas
  • The most important questions asked by platform users:
      • What is the source of this data?
      • What exactly does this column mean?
      • Who is responsible for populating this data set?
  • Hive Metastore -> Apache Atlas
Data Exploration - Zeppelin
  • A web-based notebook that enables interactive data analytics.
  • Lets people use their choice of language.
  • Makes it easy to create charts and graphs.
Thank you! Q & A
  @karthikkrk
  https://www.linkedin.com/in/karthikkrk/
  kkaruppaiya@ten-x.com
  WE ARE HIRING!