Tracking Data Lineage at Stitch Fix
Neelesh Srinivas Salian
Strata Data Conference - New York
September 12, 2018

Stitch Fix
● Personalized styling service serving Men, Women, and Kids
● Founded in 2011, led by CEO & Founder Katrina Lake
● Employs more than 5,800 people nationwide (USA)
● Algorithms + Humans

About Me

This talk
● Data Ecosystem
● Data Lineage
● The Need
● Challenges
● Approach
● Architecture
● Questions

Data Ecosystem

Data Lineage

The Need and Challenges

Key Terminology
Resource
● Structured Data - Hive Table
● Postgres Database
Job
● Service defined batch jobs
● Performs read/write on resources
ID - Unique identifier
Event
● Service generated
● Synthesised
● Read Resource
● Write Resource

Managing a Resource
● Visibility - Data Scientists need to know what could break.
  ○ Upstream and Downstream to a Resource
● Effects of Change - If a resource is modified, what does it affect?
  ○ Schema change
  ○ Data type modification
● Tracing - How did we get to this resource, from source to destination?
  ○ Journey of a resource
● Debugging - How can you reliably debug a large pipeline?
● History - What has been writing to this resource?
  ○ Historical information
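The visibility and tracing questions above amount to walking a dependency graph. The following is a minimal sketch, not the Stitch Fix implementation: the resource names, the `EDGES` list, and both helper functions are hypothetical and exist only to illustrate how upstream and downstream sets could be computed from read/write pairs.

```python
from collections import defaultdict, deque

# Hypothetical lineage edges: (read_resource, write_resource) pairs,
# meaning some job read the first resource and wrote the second.
EDGES = [
    ("raw.orders", "staging.orders"),
    ("staging.orders", "warehouse.order_facts"),
    ("raw.clients", "warehouse.order_facts"),
    ("warehouse.order_facts", "reports.weekly_sales"),
]


def build_graph(edges):
    """Index the edges in both directions."""
    upstream = defaultdict(set)
    downstream = defaultdict(set)
    for src, dst in edges:
        downstream[src].add(dst)
        upstream[dst].add(src)
    return upstream, downstream


def reachable(start, neighbors):
    """Breadth-first walk returning every resource reachable from start."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


upstream, downstream = build_graph(EDGES)
# What could break if staging.orders changes?
print(sorted(reachable("staging.orders", downstream)))
# How did we get to warehouse.order_facts?
print(sorted(reachable("warehouse.order_facts", upstream)))
```

The same walk answers both questions; only the direction of the index changes.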

Upstream and Downstream

Traceability

Challenges - Consistency
● Multiple services
● Different Job Representations
● Different points of concern
● Extractable information needs to be identified

Approach

Simplifying the Data Model
● Owner (User/Team)
● Job
● Parent Job
● Read Resource / Write Resource
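One way to picture the simplified model is as a handful of small record types. This is a sketch under assumptions: the slide names only the entities (Owner, Job, Parent Job, Read/Write Resource), so every field name below is hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Owner:
    name: str
    kind: str  # assumed values: "user" or "team"


@dataclass(frozen=True)
class Resource:
    database: str
    table: str


@dataclass
class Job:
    job_id: str
    owner: Owner
    # A job may point at the job that spawned it (the "Parent Job").
    parent_job_id: Optional[str] = None
    reads: List[Resource] = field(default_factory=list)
    writes: List[Resource] = field(default_factory=list)


team = Owner("algo-platform", "team")
parent = Job("job-1", team)
child = Job(
    "job-2",
    team,
    parent_job_id=parent.job_id,
    reads=[Resource("raw", "orders")],
    writes=[Resource("warehouse", "order_facts")],
)
print(child)
```

Keeping the parent link as an ID rather than an object reference mirrors how such a relationship would typically be stored in a table.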

Augmenting Code
● Avoid breaking API Changes
  ○ If any, there needs to be better communication
● Augment with necessary information to pass to Data Ingestion pipeline
● Most of the changes are backend libraries
● Idempotency in workflows
  ○ Behavior
  ○ Function

Architecture

Data Acquisition
Event Driven
● Using the Data Ingestion pipeline
● A Custom S3 Sink to write to Hive table
● Clients can send lineage information
Scheduled
● Ad-hoc usage
● Use only if additional information is needed
● Harder to maintain

Event Driven

Intermediate Data Collection
Resource Attributes
● database
● table
● batchId
Service Data Attributes
● owner
● jobId
● serviceName
● parentId
Hive Tables
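The attributes listed above could travel as a single event record. The sketch below is illustrative only: the attribute names come from the slide, but their grouping, the `access` field, the `LineageEvent` class, and the JSON encoding are assumptions, not the actual message format.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class LineageEvent:
    # Resource attributes (names taken from the slide)
    database: str
    table: str
    batchId: str
    # Service data attributes (names taken from the slide)
    owner: str
    jobId: str
    serviceName: str
    parentId: Optional[str] = None
    # Assumption: an explicit access type distinguishing reads from writes.
    access: str = "read"


def to_message(event):
    """Serialize an event to a JSON string a client could send."""
    if event.access not in ("read", "write"):
        raise ValueError("access must be 'read' or 'write'")
    return json.dumps(asdict(event), sort_keys=True)


event = LineageEvent(
    database="warehouse",
    table="order_facts",
    batchId="batch-001",
    owner="algo-platform",
    jobId="job-2",
    serviceName="spark",
    parentId="job-1",
    access="write",
)
print(to_message(event))
```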

Presto Data Lineage
● Extract information from Queries
● Currently implemented
● Missing pieces
  ○ Parent-Child relationship
  ○ Augmenting various clients
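To make "extract information from queries" concrete, here is a deliberately naive sketch that pulls read and write tables out of a SQL string with regular expressions. It is not how the talk's system works (the slides do not say), and a production extractor would rely on a real SQL parser: this version is fooled by subqueries, CTE names, comments, and expressions such as `EXTRACT(DAY FROM ts)`. The function name and sample query are hypothetical.

```python
import re

# A dotted identifier such as schema.table
_NAME = r"([A-Za-z_][\w.]*)"
_WRITE = re.compile(
    r"\b(?:INSERT\s+INTO|CREATE\s+TABLE(?:\s+IF\s+NOT\s+EXISTS)?)\s+" + _NAME,
    re.IGNORECASE,
)
_READ = re.compile(r"\b(?:FROM|JOIN)\s+" + _NAME, re.IGNORECASE)


def extract_lineage(sql):
    """Return the tables a query appears to read from and write to."""
    return {
        "reads": sorted(set(_READ.findall(sql))),
        "writes": sorted(set(_WRITE.findall(sql))),
    }


query = """
INSERT INTO warehouse.order_facts
SELECT o.id, c.segment
FROM staging.orders o
JOIN raw.clients c ON o.client_id = c.id
"""
print(extract_lineage(query))
```

Note that a query alone carries no parent job, which is consistent with the parent-child relationship being listed as a missing piece.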

Spark Data Lineage
● Adding ability to log reads and writes as they happen
● Move over to Parquet as the default FileFormat
● Augmenting library + clients to pass parentage information

Data Refinement
● Regular cadence of ETLs extracting Lineage information
● Output into clean Postgres Tables
● ETLs for
  ○ Aggregated Metric Extraction
  ○ Resource Relationships
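The refinement step can be sketched as a small batch transform over collected events. Everything here is an assumption for illustration: the event shape, the `EVENTS` sample, and both function names are hypothetical, and in-memory Python structures stand in for the Hive input and Postgres output.

```python
from collections import Counter, defaultdict

# Hypothetical raw events, as they might land in the intermediate tables.
EVENTS = [
    {"jobId": "job-2", "access": "read", "resource": "staging.orders"},
    {"jobId": "job-2", "access": "read", "resource": "raw.clients"},
    {"jobId": "job-2", "access": "write", "resource": "warehouse.order_facts"},
    {"jobId": "job-3", "access": "read", "resource": "warehouse.order_facts"},
    {"jobId": "job-3", "access": "write", "resource": "reports.weekly_sales"},
]


def resource_relationships(events):
    """Pair each job's reads with its writes: (source, destination, jobId)."""
    reads, writes = defaultdict(set), defaultdict(set)
    for e in events:
        target = reads if e["access"] == "read" else writes
        target[e["jobId"]].add(e["resource"])
    edges = set()
    for job_id, written in writes.items():
        for dst in written:
            for src in reads.get(job_id, ()):
                edges.add((src, dst, job_id))
    return sorted(edges)


def write_counts(events):
    """Aggregated metric: how many write events each resource received."""
    return Counter(e["resource"] for e in events if e["access"] == "write")


for row in resource_relationships(EVENTS):
    print(row)
print(write_counts(EVENTS))
```

The relationship rows are the kind of output that upstream and downstream dashboard views could be built on.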

User Interaction
● Dashboards for Resource Views
  ○ Showing Upstream and Downstream dependencies
● Static Views
  ○ Metrics from the Warehouse
● Dynamic Views
  ○ In-flux changes to Resources
● Custom dashboards can be built

Reach Out neeleshssalian@gmail.com

Thank you! https://multithreaded.stitchfix.com/
