Lightly Supervised Content Modeling for Corporate Text Analytics


  1. Lightly Supervised Content Modeling for Corporate Text Analytics
     Raphael Cohen
     Data Science as a Service, EMC
     EMC CONFIDENTIAL — INTERNAL USE ONLY.

  2. Talk Outline
     Text Analytics for Customer Services
     • Customer interaction data – structured data vs. textual data
     • Motivation / objectives
     • Rule-based approach
     • Data-driven modeling
     • Machine learning for preprocessing
     • Modeling topics
     • Injecting SME knowledge into a topic model
     • Cool results

  3. Customer Data – in our classic CRM DB
     Customer interactions we care about:
     • Service requests
     • Support chat
     Structured data:
     • Date
     • Employee ID
     • Product type
     • Problem code
     • Resolution code
     "Something that makes me particularly proud is the use of Big Data analytics to create a detailed picture of service delivery characteristics for continuous improvement."
     – Kevin Roche, Senior Vice President, EMC Global Services

  4. Customer Data – in the Data Lake
     Unstructured data:
     • Problem summary (e.g. "Exchange backup failing")
     • Resolution summary (e.g. "hf 23.45 applied, issue solved")
     • Chat data:
       – Customer: "hello"
       – Helpdesk: "hi tom, how can I help"
       – Customer: "my daily exchange backup is failing…"
       – Helpdesk: "did you try to restart the service?"

  5. Objectives / Motivation
     Service organization wish list:
     • Early detection of emerging problems
       – Example: "we are getting a lot of service requests about Exchange 2010 backups in version 2.41; let's initiate a service-pack install for all of these"
     • Root cause analysis
       – Search for similar problem descriptions and rank the solutions
     • Identifying call-volume drivers
       – Example: "oh, we are spending 10% of our time in Europe on VM memory problems"
     • Improving service
       – Example: according to chat transcripts, this employee is slow to identify code-bug issues

  6. Rule-Based Approach – the old industry standard
     Workflow: (1) SME writes keyword rules, (2) evaluate on documents, (3) identify common errors, repeat.
     • Subject matter experts create keyword rules (install -> "install calls")
     • Long tuning process (up to 6 months)
     • Usually low recall
     • High precision requires more and more complex rules, e.g. "DB temp unavailable" / "DB temporarily unavailable"…
     • A strong preprocessing unit plus a rule-creation engine cuts the manual labor from years to months
     • Lets the users feel in control
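The rule-based approach can be sketched as a small keyword matcher; the rule names and patterns below are hypothetical examples, not the production rule set:

```python
import re

# Each rule maps a tag to keyword/phrase patterns (hypothetical examples).
RULES = {
    "install calls": [r"\binstall(?:ation)?\b"],
    "db unavailable": [r"\bDB temp(?:orarily)? unavailable\b"],
}

def tag_document(text, rules=RULES):
    """Return the set of rule tags whose patterns match the text."""
    return {tag for tag, patterns in rules.items()
            if any(re.search(p, text, re.IGNORECASE) for p in patterns)}
```

The "DB temp unavailable" rule shows why recall stays low: every phrasing variant must be enumerated by hand inside the pattern.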

  7. The Alternative – data-driven machine learning modeling
     Our dream approach:
     • Model the data that you have, not the data you think you have
     • Automate
     • But don't automate so much that the user is left out of the loop
     • Reproducible (quickly integrate into a new business unit)
     • Quick (analysts can tune the modeling engine)
     Pipeline: (1) Preprocess, (2) Unsupervised clustering, (3) SME annotation, (4) Actionable insights

  8. Step 0: ETL / Technologies
     Show me the data! Where is the data? This is actually the hardest step.
     Where it lives:
     1) Database column (most likely)
     2) Hadoop
     How to access it:
     1) The Pivotal Greenplum big data warehouse
     2) Pivotal HD (Hadoop by Pivotal)
     3) For streams: GemFireXD

  9. Preprocess – smart dimensionality reduction
     Classic preprocessing of text:
     • A friendly tokenization regexp:
       re.findall(r"[A-Z]{2,}(?![a-z])|[A-Z][a-z]+(?=[A-Z])|[\'\w\-]+", s)
     • Then use the Porter stemmer:
       – "ponies" -> "poni"
       – "expression" -> "express"
     Drawbacks:
     • Loses information
     • Eyesore for the customer
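The slide's tokenization regexp, written as a raw string, keeps acronyms whole and splits camel case before falling back to ordinary word characters. A minimal sketch:

```python
import re

# The deck's tokenizer: acronyms ("EMC"), camel-case pieces
# ("EMCBackup" -> "EMC", "Backup"), then word chars with ' and -.
CAMEL_TOKEN_RE = re.compile(r"[A-Z]{2,}(?![a-z])|[A-Z][a-z]+(?=[A-Z])|['\w\-]+")

def tokenize(text):
    return CAMEL_TOKEN_RE.findall(text)
```

For example, `tokenize("EMCBackup failing")` splits the fused product name into `["EMC", "Backup", "failing"]` instead of one opaque token.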

  10. Preprocess – smart dimensionality reduction
     Before we start the modeling:
     • Lemmatize instead of stem:
       – Preprocess text with a POS tagger
       – Use NLTK lemmatizers to get the most probable base form
     • Word clusters:
       – Create a deep-learning representation of the words (word2vec)
       – Extract likely synonyms using heuristics
       – Allow the SMEs to edit the synonym dictionary
       – Can be leveraged for query expansion in Search
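A lemmatizer uses the POS tag to pick a real base form where a stemmer only chops suffixes. The toy below is a stand-in for NLTK's WordNetLemmatizer (which takes the same word + POS inputs), just to show why lemmas read better than Porter stems:

```python
# Toy POS-aware lemmatizer (a stand-in, NOT NLTK's WordNetLemmatizer):
# a small irregulars table plus suffix rules chosen by part of speech.
IRREGULAR = {"ponies": "pony"}

def lemmatize(word, pos="n"):
    word = word.lower()
    if word in IRREGULAR:
        return IRREGULAR[word]
    if pos == "n" and word.endswith("ies"):
        return word[:-3] + "y"
    if pos == "n" and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    if pos == "v" and word.endswith("ing"):
        return word[:-3]
    return word
```

So "ponies" comes back as the readable "pony" rather than the Porter stem "poni", and "expression" is left intact rather than truncated to "express".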

  11. Preprocess – Synonym Extraction
     (word variants with corpus counts)
     backup: backup 15492, backups 4419, backed 386, backup's 32, bakcup 28, bakup 14, backuped 14, bacup 8, backp 8, buckup 7, backu 7, backus 5, backuo 5, backup1 4, backkup 4
     licenses: licenses 347, licence 119, licences 54
     networker: networker 8703, netwoker 59, netwroker 22, netowrker 15, networke 13, neworker 10, netorker 5
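One simple heuristic in this spirit: treat high-frequency words as canonical and attach rare, similarly spelled variants to them. The frequency cutoff and similarity threshold here are assumptions for illustration, not the deck's actual heuristics:

```python
from difflib import SequenceMatcher

def group_variants(freq, min_canonical=100, sim=0.8):
    """Map rare spelling variants to the most similar frequent word.

    freq: {word: corpus count}. Words with count >= min_canonical are
    canonical; rarer words attach to the closest canonical word when
    string similarity reaches `sim`. Thresholds are illustrative.
    """
    canonical = [w for w, c in freq.items() if c >= min_canonical]
    if not canonical:
        return {}
    mapping = {}
    for word, count in freq.items():
        if count >= min_canonical:
            continue
        best = max(canonical,
                   key=lambda c: SequenceMatcher(None, word, c).ratio())
        if SequenceMatcher(None, word, best).ratio() >= sim:
            mapping[word] = best
    return mapping
```

On the counts above this maps "bakcup" to "backup" and "netwoker" to "networker", while unrelated rare words stay unmapped; SMEs can then edit the resulting dictionary.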

  12. Unsupervised Clustering – Topic Modeling
     • A data-driven approach requires that we look for the topics present in the data
     • Topic modeling is established as a premier approach in statistical machine learning
     • Latent Dirichlet Allocation, the mixture-model approach by Blei, Ng and Jordan, has been cited by 8,500 academic papers
     • Input: text divided into documents
     • Output: soft topics
     • Recipes: symmetric vs. asymmetric prior; redundancy reduction (de-dup)

  13. Topic Modeling – In Practice
     • Read "Care and Feeding of Topic Models" by Boyd-Graber, Mimno and Newman
     • Asymmetric priors (Wallach):
       – Vanilla LDA assumes all topics are equally likely
       – I've never encountered such a corpus
       – Assume the prior for each topic is different and sample it as well
       – Not supported by most off-the-shelf big-data LDA solutions
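Since most off-the-shelf big-data LDA implementations hard-code one symmetric alpha, the per-topic prior has to go into the sampler itself. A toy-scale collapsed Gibbs sketch (an illustration of where the asymmetric alpha enters, not the production system):

```python
import random
from collections import defaultdict

def lda_gibbs(docs, num_topics, alpha, beta=0.01, iters=100, seed=0):
    """Collapsed Gibbs sampling for LDA with an asymmetric Dirichlet
    prior: `alpha` is a per-topic list of length num_topics.

    docs: list of token lists. Returns per-document topic counts.
    """
    rng = random.Random(seed)
    V = len({w for d in docs for w in d})
    dz = [[0] * num_topics for _ in docs]               # doc-topic counts
    zw = [defaultdict(int) for _ in range(num_topics)]  # topic-word counts
    zc = [0] * num_topics                               # topic totals
    assign = []
    for i, doc in enumerate(docs):
        za = []
        for w in doc:
            z = rng.randrange(num_topics)
            za.append(z)
            dz[i][z] += 1; zw[z][w] += 1; zc[z] += 1
        assign.append(za)
    for _ in range(iters):
        for i, doc in enumerate(docs):
            for j, w in enumerate(doc):
                z = assign[i][j]
                dz[i][z] -= 1; zw[z][w] -= 1; zc[z] -= 1
                # alpha[k] differs per topic -- the asymmetric prior
                weights = [(dz[i][k] + alpha[k]) *
                           (zw[k][w] + beta) / (zc[k] + beta * V)
                           for k in range(num_topics)]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assign[i][j] = z
                dz[i][z] += 1; zw[z][w] += 1; zc[z] += 1
    return dz
```

Wallach-style hyperparameter re-estimation of alpha between sweeps is omitted; the point is only that a per-topic alpha vector is a one-line change once you own the sampler.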

  14. Topic Modeling – In Practice
     • Redundancy (Cohen et al.):
       – Copy-paste / boilerplate text introduces noise into the topic distribution (see the paper)
       – These occur a lot in corporate data sets
       – Remove redundant documents (e.g. 10,000 occurrences of "SR closed")
       – Alternatively, sample with Redundancy-Aware LDA
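The simple remove-redundant-documents option amounts to hashing normalized text and keeping first occurrences; a minimal sketch (exact duplicates only, unlike the near-duplicate handling in the paper):

```python
import hashlib

def deduplicate(docs):
    """Keep the first occurrence of each document after lowercasing and
    whitespace normalization, so thousands of boilerplate "SR closed"
    notes collapse to one before topic modeling."""
    seen, kept = set(), []
    for doc in docs:
        key = hashlib.md5(" ".join(doc.lower().split()).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept
```

Hashing keeps the seen-set small on large corpora; near-duplicates (shingling, minhash) need more machinery than this sketch.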

  15. SME Annotation – inject the domain knowledge
     • Unsupervised approaches provide us with clusters of documents / words
     • How can we use this to benefit the business need?
     • Have the subject matter expert explore the clusters and name them
     • Provide as many layers of information as possible to make it easy
     • Coach them first to understand that precision is never 100%
     • Allow them to tune the results

  16. Actionable Insights – good old Business Intelligence
     Leverage the tags.

  17. Actionable Insights – chat transcripts
     Analyze chat sessions:
     • What topics are associated with the conversation?
     • How quickly does the support person zoom in on the problem?
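One way to quantify "how quickly the support person zooms in" is to count helpdesk turns until the diagnosed topic is first mentioned; the speaker labels and keyword list below are illustrative assumptions, not the deck's actual metric:

```python
def turns_to_diagnosis(transcript, topic_keywords):
    """Count helpdesk turns until one mentions a topic keyword.

    transcript: list of (speaker, text) pairs. Returns None if the
    helpdesk never names the topic.
    """
    turns = 0
    for speaker, text in transcript:
        if speaker != "Helpdesk":
            continue
        turns += 1
        if any(k in text.lower() for k in topic_keywords):
            return turns
    return None
```

Averaged per employee and per topic, a metric like this backs claims such as "this employee is slow to identify code-bug issues".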

  18. Actionable Insights – leverage the tags, combine with structured data
     Topic distribution according to "problem code".
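Combining topic tags with a structured column such as problem code reduces to a cross-tabulation; a sketch with hypothetical codes and topic names:

```python
from collections import Counter, defaultdict

def topic_distribution(records):
    """records: (problem_code, topic) pairs.

    Returns {problem_code: {topic: share of that code's requests}}.
    """
    counts = defaultdict(Counter)
    for code, topic in records:
        counts[code][topic] += 1
    return {code: {t: n / sum(c.values()) for t, n in c.items()}
            for code, c in counts.items()}
```

The same function applies unchanged to any structured field (location, product type, employee ID), which is what drives the per-region views on the next slide.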

  19. Actionable Insights – leverage the tags, combine with structured data
     Topic distribution according to "location":
     • Disk replacement is prevalent in the Americas
     • Hot-fix is more common in Europe
     • Can zoom in on the requests to analyze

  20. Actionable Insights – easily combine search with topics / machine learning
     Use GP-Text for Lucene text searches:
     • What topics are associated with "node"?
     • What are the structured fields?
     • GPDB supports machine learning – throw in sentiment analysis in 5 minutes
