

  1. Handling Personal Information in LinkedIn’s Content Ingestion System David Max Senior Software Engineer LinkedIn

  2. About Me • Software Engineer at LinkedIn NYC since 2015 • Content Ingestion team • Office Hours – Thursday 11:30-12:00 David Max Senior Software Engineer LinkedIn www.linkedin.com/in/davidpmax/

  3. About LinkedIn New York Engineering • Located in the Empire State Building • Approximately 100 engineers and 1,000 employees total in New York • Multiple teams: front end, back end, and data science

  4. Disclaimers • I’m not a lawyer • Some details omitted • I am not a spokesperson for official LinkedIn policy

  5. Our Mission: Create economic opportunity for every member of the global workforce

  6. LinkedIn • > 546M members – world’s largest professional network • > 70% of members reside outside the U.S. • More than 200 countries and territories worldwide

  7. General Data Protection Regulation • Applies to all companies worldwide that process personal data of EU citizens • Widens the definition of personal data • Introduces restrictive data handling principles • Enforceable from May 25, 2018

  8. Handling Personally Identifiable Information (PII)
  • Data Minimization – limit personal data collection, storage, and usage
  • Consent – cannot use collected data for a different purpose
  • Retention – do not hold data longer than necessary
  • Deletion – must delete data upon request

  9. Handling PII in Content Ingestion Content Ingestion Data Protection Babylonia Data Minimization Consent Retention Deletion

  10. What is Content Ingestion? Content Ingestion Babylonia

  11. Babylonia Content Ingestion

  12. Babylonia Content Ingestion

  13. url: https://www.youtube.com/watch?v=MS3c9hz0bRg title: "SATURN 2017 Keynote: Software is Details" image: https://i.ytimg.com/vi/MS3c9hz0bRg/hqdefault.jpg?sqpoaymwEYCKgBEF5IVfKriqkDCwgBFQAAiEIYAXAB\u0026rs=AOn4CLClwjQlBmMeoRCePtHaThN-qXRHqg

  14. Babylonia Content Ingestion

  15. What is Content Ingestion? • Extracts metadata from web pages • Source of truth for 3rd party content • Also contains metadata for some public 1st party content • Used by LinkedIn services for sharing, decorating, and embedding content • Data also feeds into content understanding and relevance models

  16. How does PII get into Babylonia?

  17. Ingesting 1st party pages containing publicly viewable member PII • Profile pages • Published posts • SlideShare content

  18. When a Member Account is Closed
  What happens:
  • Babylonia (along with other systems) is notified that the member’s account is closed
  • Other systems take down the member’s content (e.g. public profile page, published posts, etc.)
  What Babylonia needs to do:
  • Remove scraped data relating to the member pages that have been taken down
  • Notify downstream systems that might be holding a copy of the data
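The two steps on this slide – purge the taken-down pages, then notify holders of copies – can be sketched as follows. This is an illustrative sketch, not LinkedIn's actual code; the function and parameter names are invented for the example.

```python
# Hypothetical sketch of handling an account-closed event: purge the
# member's taken-down pages and notify downstream systems holding copies.

def handle_account_closed(member_urls, store, downstream_notify):
    """Purge ingested data for a closed member account.

    member_urls: URLs of the member's taken-down pages.
    store: dict-like mapping URL -> scraped metadata.
    downstream_notify: callback invoked once per purged URL so systems
    holding a copy of the data can delete it too.
    """
    purged = []
    for url in member_urls:
        if url in store:
            del store[url]          # remove scraped data for the page
            downstream_notify(url)  # tell downstream systems to purge their copy
            purged.append(url)
    return purged

# Example usage
store = {"https://example.com/in/alice": {"title": "Alice"},
         "https://example.com/posts/1": {"title": "A post"}}
notified = []
handle_account_closed(["https://example.com/in/alice"], store, notified.append)
# store now only holds the post; notified == ["https://example.com/in/alice"]
```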

  19. Babylonia Datasets • Espresso Database (online) → ETL → HDFS Datasets (offline) • Brooklin data change events

  20. Downstream and Upstream Datasets • Online: Espresso Database; Offline: HDFS via ETL • Brooklin data change events • Upstream 1st party web pages: profile, job, article, publishing

  21. Challenges of member PII in Babylonia • Need to identify URLs that contain a member’s PII • My post might contain your PII • The connection between the member and the URL resides in the upstream system

  22. Option #1: Require Upstream Systems to Notify Babylonia
  Pros:
  • Simple – Babylonia waits to be told specifically which URLs should be purged
  • Babylonia only does extra work when a URL needs to be purged
  • Puts responsibility where the knowledge is
  Cons:
  • Requires additional work by every system that exposes PII in publicly accessible web pages
  • If the notification is missed, how will Babylonia know?
  • 1st party URLs sometimes change as upstream systems are changed – need to correctly handle old URLs too

  23. Option #2: Actively Refetch Every 1st Party URL
  Pros:
  • Simple logic: page gone? Purge the page.
  • Requires little additional work from upstream systems
  • Works also for old 1st party URLs
  Cons:
  • There are a lot of 1st party URLs in Babylonia
  • Continuous polling of all 1st party URLs consumes a lot of resources just for the sake of the very few URLs that are actually affected
  • Extra work to avoid false positives or false negatives
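The "page gone? Purge the page" rule above can be sketched in a few lines. `fetch_status` and `purge` are illustrative stand-ins for Babylonia's actual fetch and purge interfaces, which the slides do not specify.

```python
def refetch_and_purge(urls, fetch_status, purge):
    """Refetch each 1st-party URL and purge entries whose page is gone.

    fetch_status(url) stands in for an HTTP fetch returning a status code;
    purge(url) deletes the ingested data for that URL.
    """
    for url in urls:
        if fetch_status(url) in (404, 410):  # page is gone
            purge(url)

# Example: only the deleted page gets purged
purged = []
refetch_and_purge(["kept", "deleted"],
                  lambda u: 404 if u == "deleted" else 200,
                  purged.append)
# purged == ["deleted"]
```

Note that this simple check is exactly where the "false positives" con bites: a page that soft-404s with status 200 is never purged by this loop, which motivates Option 3 below.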

  24. Option #3: Eliminate Member PII in Babylonia
  Pros:
  • The easiest data to delete is data that isn’t in your system to begin with
  • Gets closer to Single Source of Truth (SSOT) for all 1st party content – better for consistency, not only for compliance
  Cons:
  • Babylonia is relied upon by numerous systems to have content for URLs – excluding 1st party content will affect member experience
  • No substitute currently available
  • Difficult to achieve based on URL – can’t always tell by looking at a URL if it resolves to 1st party content (e.g. shortlinks)

  25. Blended Approach • Option 1 – Having upstream systems notify is best, but might miss some pages • Option 2 – Active refetch is thorough but expensive; must be used to catch pages that won’t support notifications • Option 3 – Some pages won’t work with active refetch, e.g. pages that still return an HTTP status code 200 even when the data has been removed; these must be blocked

  26. Classification of Ingested URLs • 3rd party URL • 1st party URL – Blocked, or Whitelisted (Actively Refetched / Notified by Upstream)
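The classification above can be modeled as a small decision function. This is a sketch only: the predicates and category names are illustrative, not Babylonia's actual taxonomy or API.

```python
def classify_url(url, is_first_party, whitelist, supports_notification):
    """Classify an ingested URL per the slide's decision tree (illustrative)."""
    if not is_first_party(url):
        return "3rd-party"
    if url not in whitelist:
        return "blocked"              # 1st party but not whitelisted
    if supports_notification(url):
        return "notified-by-upstream"
    return "actively-refetched"

# Example usage with toy predicates
wl = {"https://example.com/in/alice"}
kind = classify_url("https://example.com/in/alice",
                    lambda u: True, wl, lambda u: True)
# kind == "notified-by-upstream"
```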

  27. Option 1 – Upstream Notification • Upstream system sends a Kafka message • Babylonia consumes the message and purges the data • Open source – kafka.apache.org
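As a sketch of the notify-and-purge flow, using an in-memory queue as a stand-in for Kafka (the real system consumes a Kafka topic; the message shape here is an assumption, not LinkedIn's actual schema):

```python
from queue import Queue

def consume_purge_messages(topic: Queue, store: dict):
    """Drain purge notifications and delete the ingested data for each URL."""
    while not topic.empty():
        msg = topic.get()
        store.pop(msg["url"], None)  # purge; no-op if the URL is unknown

# Example: upstream publishes a takedown, Babylonia's consumer purges it
topic = Queue()
topic.put({"url": "https://example.com/in/alice"})
store = {"https://example.com/in/alice": {"title": "Alice"}}
consume_purge_messages(topic, store)
# store is now empty
```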

  28. Option 2 – Active Refetching • Espresso Database → ETL → HDFS → offline job builds the refetch URL table • Refetch process pushes Kafka messages to the refetch job • Takedown requests UPDATE the table for deleted pages

  29. Option 3 – Whitelist • Block all 1st party URLs that can’t meet minimal requirements • Mainly: must return a 404 for an invalid or deleted URL • Ensures new 1st party URLs are onboarded before being ingested
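The whitelisting bar above reduces to one check at onboarding time. A minimal sketch, assuming a hypothetical `probe_deleted_url` callable that returns the HTTP status the upstream service gives for a known-deleted page:

```python
def can_onboard(probe_deleted_url) -> bool:
    """A 1st-party URL pattern is whitelisted only if its service
    returns a real 404 for invalid or deleted URLs (no soft-404s)."""
    return probe_deleted_url() == 404

# A service that soft-404s (returns 200 for deleted pages) stays blocked,
# because active refetching could never detect its takedowns.
assert can_onboard(lambda: 404) is True
assert can_onboard(lambda: 200) is False
```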

  30. Managing PII in Datasets • Espresso Database (online) → ETL → HDFS Datasets (offline)

  31. Espresso Datasets
  What is Espresso?
  • LinkedIn distributed NoSQL database
  • Data stored in Avro format (JSON)
  • Indexed by specific primary key fields
  Challenges:
  • Reference to PII not always in the key
  • ETL snapshots of Espresso become offline datasets

  32. Offline (HDFS) Datasets
  Challenges:
  • Files of Avro (JSON) records
  • Need to read the whole record to see if it has PII
  • Files not conducive to removing one record from the middle
  • A dataset can be the source for downstream jobs that also need to be purged

  33. WhereHows • Data discovery and lineage tool • Central location for all schemas • Documents the meaning of each column • Traces downstream/upstream lineage of datasets • Tags every column that can contain a member reference or PII – which datasets contain member PII? • Open source – github.com/linkedin/wherehows
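The column-tagging idea can be illustrated with a toy model: once every schema carries tags, "which datasets contain member PII?" becomes a simple lookup. The dataset names, columns, and tag strings below are invented for the example, not WhereHows' actual schema.

```python
# Illustrative WhereHows-style metadata: dataset -> column -> set of tags
schemas = {
    "profiles":   {"member_id": {"MEMBER_ID"}, "headline": {"PII"}},
    "page_views": {"url": {"INGESTED_CONTENT_URL"}, "count": set()},
}

def datasets_with_tag(tag):
    """Return datasets having at least one column carrying the given tag."""
    return sorted(ds for ds, cols in schemas.items()
                  if any(tag in tags for tags in cols.values()))

# datasets_with_tag("PII") -> ["profiles"]
# datasets_with_tag("INGESTED_CONTENT_URL") -> ["page_views"]
```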

  34. Dali (Data Access at LinkedIn) • Interface for accessing datasets (WhereHows metadata → Dali reader → raw dataset) • Combines dataset schema with WhereHows metadata • Defines output virtual datasets while preserving data tags • Supports defining virtual datasets where PII is excluded or obfuscated

  35. Access Control List (ACL) • Controls access to PII data via a known list of authorized systems • We only approve access for systems that can handle PII properly • Ensures that member PII can’t leak into untracked systems/datasets • Acts as a list of downstream services
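A minimal sketch of the ACL idea, with the dual role the slide describes: reads of PII-tagged data succeed only for approved systems, and the same list doubles as the set of downstreams to notify on a purge. The system names are hypothetical.

```python
# Hypothetical approved-systems list; not LinkedIn's actual ACL mechanism
ACL = {"relevance-pipeline", "search-indexer"}

def read_pii(system: str, record: dict) -> dict:
    """Allow a read of PII data only for systems on the ACL."""
    if system not in ACL:
        raise PermissionError(f"{system} is not approved to read PII")
    return record

def purge_notification_targets():
    """The ACL also serves as the list of downstreams to notify on purge."""
    return sorted(ACL)
```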

  36. Keeping Track of Personal Information in Babylonia
  WhereHows:
  • Field tagging for fields containing PII
  • Know where the PII is
  Dali:
  • Downstreams use Dali, which preserves the WhereHows tagging on new virtual datasets
  • Keeps tags with the data as it moves from one dataset to another
  ACL:
  • Controls the spread of PII data, allowing only authorized readers
  • Serves as a list of current downstream systems to notify when data is purged

  37. Apache Gobblin • Framework for transforming large datasets • Data lifecycle management • Uses WhereHows tags to identify data in our Espresso or offline datasets that needs to be purged • Open source – gobblin.apache.org

  38. WhereHows and Gobblin • Created tags in WhereHows representing ingested content URLs • Enables downstream systems to onboard with auto purge by tagging columns in their tables as containing a URL or Ingested Content URN (Uniform Resource Name)

  39. Compliance Comes First • Choose an implementation where restriction is the default until proven safe • Whitelisting ensures all allowed 1st party URLs meet a minimum technical bar for ingestion • Simplicity of active refetching helps keep the bar low enough to include most content safely

  40. Bigger Picture • Added constraints to the system • Developer restrictions • Made certain kinds of things harder to do
