
From Content Publishing to Data Solutions via Machine Learning - PowerPoint PPT Presentation

Presentation to Los Angeles Machine Learning Meetup, 2019-02-19. Bradley Allen, Chief Architect, Elsevier.


  1. From Content Publishing to Data Solutions via Machine Learning. Presentation to Los Angeles Machine Learning Meetup, 2019-02-19. Bradley Allen, Chief Architect, Elsevier

  2. Twenty-three years ago: “It’s hard to imagine a sweeter business than publishing academic journals. The editorial content is contributed free of charge by scholars desperate to publish to get tenure. School libraries are automatic customers - professors insist on it. ... Is the party over? It may be nearing its end. The Internet is closing in.” - Forbes, December 18, 1995

  3. Today: from content publishing to data solutions. Read this / Search this / Do this. Products shown: Cell, ScienceDirect, Sherpath, Fundamentals, Scopus, Mendeley, Gray’s Anatomy, ClinicalKey, Knovel, Reaxys

  4. Our five main customer segments
      • Clinicians: ‘You could use this treatment to save a life’
      • Researchers: ‘This article answers your questions’
      • Governments: ‘This is the research to invest in’
      • Pharmaceutical companies: ‘This is the cancer treatment you should pursue’
      • Nursing students: ‘This is the area you need to improve to qualify’

  5. The challenges our customers face
      • Global research spend is growing every year [1]. Predicted spend: 3.4% from 2015; $1.9TN research in 2016
      • Researchers lack the tools they need to be effective [2]. Studies: 70-80% of research asks the wrong questions or cannot be reproduced
      • Life-saving drugs are expensive to develop [3]. 1/20 success rate of drugs; $2.5BN median pharmaceutical spend per drug
      • Health providers cannot save lives without the best information [4]. Preventable medical errors: third largest cause of death in the US (heart disease 611k, cancer 585k, medical error 225k, respiratory illness 149k)
      Sources: 1. Industrial Research Institute 2. The Lancet 3. Tufts 4. World Health Organization

  6. The assets we have at hand
      • Content
        − Chemistry database: 500m published experimental facts
        − User queries: 13m monthly users on ScienceDirect
        − Books: 35,000 published books
        − Drug database: 100% of drug information from pharmaceutical companies updated daily
        − Research: 16% of the world’s research data and articles published by Elsevier
      • Technology
        − 1,000 technologists employed by Elsevier
        − Machine reading: 475m facts extracted from ScienceDirect
        − Semantic enhancement: knowledge on 50m chemicals captured as 11B facts
        − Machine learning: over 1,000 predictive models trained on 1.5 billion electronic health care events
        − Collaborative filtering: 1bn scientific articles added by 2.5m researchers analyzed daily to generate over 250m article recommendations

  7. How we think about delivering data solutions
      • Determine the question (including use case and personae)
      • Describe the data that needs to be produced to address the question
      • If we have that data, reuse it
      • If not, use the data we have to create it
      • If we don’t have data we need, acquire what we’re missing
      From Justin O’Beirne, “Google Maps’ Moat – How far ahead of Apple Maps is Google Maps?”, 2017-12. Retrieved from https://www.justinobeirne.com/google-maps-moat on 2018-05-31.

  8. Breaking it down into eight simple steps
      1 • Market Definition: Determine target market personae & product features
      2 • Use Case Definition: Describe tasks performed by personae yielding use cases
      3 • Data & Query Specification: Describe data schemas & features to support use cases
      4 • Data Acquisition: Acquire content & data in multiple formats from multiple sources
      5 • Data Enhancement: Extract entities, attributes & relations, map entities to ontologies & taxonomies
      6 • Data Linking: Link extracted entities with other entities in existing enterprise data
      7 • Knowledge Graph Construction: Store mapped & linked data for access & discovery
      8 • Knowledge Delivery: Deliver query & visualisation of data

  9. Knowledge graphs make it all hang together. “I really believe that the key battleground in any industry is that of its knowledge graph. Google has it for media/advertising, Netflix has it for filmed entertainment, Uber has it for inner city transportation, Facebook has it across social media as well as messaging and the multiples speak for themselves.” - Tony Askew, Founder/Partner at REV (personal communication, September 29, 2016)

  10. The role that machine learning (ML) plays
      • Our goal is to drive business by enabling better outcomes through:
        − Delivery of timely, appropriate advice for decision making & problem solving
        − Enhanced discovery and query over massive amounts of information
      • We plan to achieve this by using ML to build knowledge graphs that enable the rapid development of data solutions
        − Implementing entity/object extraction, relation extraction, entity disambiguation, classification, and sentiment analysis
        − Based on the scientific & medical literature, experimental data, and the data exhaust associated with the practice of scientific communication & medical practice

  11. Breaking down our ML efforts
      • Early wins
        − Deployed systems adding value to existing products and solutions
      • Roofshots
        − Task-specific use of ML to improve discoverability, knowledge delivery
      • Practicalities
        − Human-in-the-loop NLP pipelines augmented with ML components to scale entity and relation extraction, entity linking for knowledge graph construction
      • Moonshots
        − Use of multi-task learning architectures to develop a general-purpose approach to question answering from the scientific and medical literature and from experimental data

  12. Early win: Recognizing decision graphs in medical content
      • ClinicalKey is Elsevier’s flagship medical reference search product
      • Clinicians prefer “answers” in the form of tables or flowcharts
        − Eliminates need to page through retrieved content to find actionable information
      • ClinicalKey provides a sidebar section displaying answers, but this feature depends on very labor-intensive manual curation
      • Solution: automatically classify images in medical content corpus at index time
      • Benefits: lower cost and improved user experience

  13. Early win: Recognizing decision graphs in medical content
      • Perfect fit for transfer learning approach
        − Input to the classifier is an image and output is one of 8 classes: Photo, Radiological, Data graphic, Illustration, Microscopy, Flowchart, Electrophoresis, Medical decision graph
        − Image dataset is augmented by producing variations of the training images by rotating, flipping, transposing, jittering, etc.
        − Reusing all but the last two Dense layers of a pre-trained model (VGG-CNN, available from Caffe’s “model zoo”)
        − VGG-CNN was trained on ImageNet (14 million images from the Web, 1000 general topic classes, e.g., Cat, Airplane, House)
        − Last layer is a multinomial logistic regression (or softmax) classifier
      • Model trained on 10,167 images with a 70/30 train/test split
      • Achieves 93% test set accuracy
        − Evaluated image + caption text model but did not get a big performance boost
      • Searchable image base used to support training set and model development
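The augmentation and softmax steps named on this slide can be illustrated with a minimal, dependency-free sketch. This is not the pipeline from the talk (which reused a pre-trained VGG-CNN from Caffe); the helper names, the toy 2x3 "image", and the example scores are assumptions made purely for illustration.

```python
import math

# The eight output classes listed on the slide.
CLASSES = [
    "Photo", "Radiological", "Data graphic", "Illustration",
    "Microscopy", "Flowchart", "Electrophoresis", "Medical decision graph",
]

def flip_horizontal(img):
    """Mirror each row left to right."""
    return [row[::-1] for row in img]

def flip_vertical(img):
    """Reverse the order of the rows."""
    return img[::-1]

def transpose(img):
    """Swap rows and columns."""
    return [list(col) for col in zip(*img)]

def rotate_90(img):
    """Rotate 90 degrees clockwise: reverse the rows, then transpose."""
    return transpose(img[::-1])

def augment(img):
    """Return the original image plus four simple variations of it."""
    return [img, flip_horizontal(img), flip_vertical(img),
            transpose(img), rotate_90(img)]

def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict(scores):
    """Pick the most probable class for one vector of 8 scores."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return CLASSES[best], probs[best]

# Toy usage: a 2x3 "image" of pixel values (hypothetical data).
toy_image = [[1, 2, 3],
             [4, 5, 6]]
variants = augment(toy_image)
label, confidence = predict([0.1, 0.2, 0.0, 0.0, 0.0, 1.5, 0.0, 3.0])
```

In a real transfer-learning setup the scores fed to `softmax` would come from the reused convolutional layers of the pre-trained network rather than from a hand-written list.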

  14. Early win: Generating topic pages from scientific content
      • Take a taxonomy
      • Take a ScienceDirect article
      • Find occurrences of concepts
      • Output: Definition, Snippet(s)
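The "find occurrences of concepts" step can be sketched as a simple label match that collects surrounding text as snippets. This is only an assumed, simplified illustration: the function name `find_snippets`, the `window` parameter, the toy taxonomy, and the sample text (borrowed from a later slide rather than from a ScienceDirect article) are all hypothetical, and the production system presumably does far more than exact string matching.

```python
import re

def find_snippets(article_text, taxonomy, window=40):
    """Map each taxonomy concept found in the text to its snippets.

    A snippet is the match plus up to `window` characters on each side.
    Concepts that never occur are left out of the result.
    """
    snippets = {}
    for concept in taxonomy:
        # Whole-word, case-insensitive match on the concept label.
        pattern = re.compile(r"\b" + re.escape(concept) + r"\b", re.IGNORECASE)
        for match in pattern.finditer(article_text):
            start = max(0, match.start() - window)
            end = min(len(article_text), match.end() + window)
            snippets.setdefault(concept, []).append(article_text[start:end].strip())
    return snippets

# Toy usage with a hypothetical three-concept taxonomy.
text = ("Three clinical symptoms were considered to be highly suggestive of PE: "
        "recent dyspnoea, recent chest pain and unusual tachycardia.")
toy_taxonomy = ["dyspnoea", "chest pain", "fever"]
result = find_snippets(text, toy_taxonomy)
```

Here `result` has entries for "dyspnoea" and "chest pain" only, since "fever" does not occur in the sample text.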

  15. Early win: Generating topic pages from scientific content

  16. Roofshot: Extracting clinically useful relationships from medical content

  17. Roofshot: Extracting clinically useful relationships from medical content
      • Example sentence: “Three clinical symptoms were considered to be highly suggestive of PE: recent dyspnoea, recent chest pain and unusual tachycardia >75/min.”
      • Extracted relation: Pulmonary Embolism, has Clinical Finding, Dyspnea
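One common way to hold an extracted relation like the one on this slide is as a subject-predicate-object triple. The sketch below assumes that representation; the `Triple` type and the `findings_for` helper are hypothetical names, not something the slides specify.

```python
from typing import List, NamedTuple

class Triple(NamedTuple):
    """One extracted relation: subject, predicate, object."""
    subject: str
    predicate: str
    obj: str

# The single relation shown on the slide.
triples = [
    Triple("Pulmonary Embolism", "has Clinical Finding", "Dyspnea"),
]

def findings_for(disease: str, graph: List[Triple]) -> List[str]:
    """List the clinical findings recorded for a disease."""
    return [
        t.obj for t in graph
        if t.subject == disease and t.predicate == "has Clinical Finding"
    ]

found = findings_for("Pulmonary Embolism", triples)
```

Further triples extracted from other sentences could be appended to the same list and queried in the same way.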

  18. Roofshot: Extracting clinically useful relationships from medical content
      • Syntactic analysis with spaCy on Apache Spark
      • Semantic analysis using FPE annotation
      • CNN implemented in Keras
        − Input: relationship labels and syntax paths linking relation arguments, semantic tagging
        − Embed in 64-dimensional space (like Word2vec)
        − Compute 1-dimensional convolution to learn path structure
        − Perform final softmax activation to predict one of N relations
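The embed, convolve, softmax sequence above can be traced with a small forward pass in plain Python. This is a sketch under several assumptions, not the Keras model from the talk: the weights are random and untrained, the kernel width, filter count, ReLU, and global max pooling are choices the slide does not state, and the path tokens and two-label relation set are invented for illustration.

```python
import math
import random

EMBED_DIM = 64      # embedding size named on the slide
KERNEL_WIDTH = 3    # assumption: the slide gives no kernel width
NUM_FILTERS = 4     # assumption: the slide gives no filter count
RELATIONS = ["has Clinical Finding", "no relation"]  # assumption: toy label set (N = 2)

rng = random.Random(0)  # fixed seed so the sketch is repeatable

def rand_vec(n):
    """Small random vector standing in for learned weights."""
    return [rng.uniform(-0.1, 0.1) for _ in range(n)]

def embed(tokens, table):
    """Look up a 64-d vector per path token, creating it on first use."""
    vectors = []
    for tok in tokens:
        if tok not in table:
            table[tok] = rand_vec(EMBED_DIM)
        vectors.append(table[tok])
    return vectors

def conv1d(seq, filters):
    """Slide each filter along the path, apply ReLU, return one map per filter."""
    if len(seq) < KERNEL_WIDTH:
        raise ValueError("path is shorter than the kernel width")
    feature_maps = []
    for filt in filters:  # filt has shape KERNEL_WIDTH x EMBED_DIM
        fmap = []
        for i in range(len(seq) - KERNEL_WIDTH + 1):
            total = sum(filt[k][d] * seq[i + k][d]
                        for k in range(KERNEL_WIDTH) for d in range(EMBED_DIM))
            fmap.append(max(0.0, total))
        feature_maps.append(fmap)
    return feature_maps

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict(tokens, table, filters, dense):
    """Embed the path, convolve, max-pool each map, then score each relation."""
    seq = embed(tokens, table)
    pooled = [max(fmap) for fmap in conv1d(seq, filters)]
    scores = [sum(w * x for w, x in zip(row, pooled)) for row in dense]
    return softmax(scores)

# Untrained weights and a hypothetical syntax path between two relation arguments.
filters = [[rand_vec(EMBED_DIM) for _ in range(KERNEL_WIDTH)] for _ in range(NUM_FILTERS)]
dense = [rand_vec(NUM_FILTERS) for _ in RELATIONS]
table = {}
path = ["Pulmonary Embolism", "nsubj", "suggestive", "prep", "Dyspnea"]
probs = predict(path, table, filters, dense)
```

Because the weights are random, `probs` is only a valid probability distribution over `RELATIONS`, not a meaningful prediction; training would fit the embedding table, filters, and dense weights to labeled paths.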

  19. Roofshot: Extracting clinically useful relationships from medical content

  20. Roofshot: Extracting clinically useful relationships from medical content

  21. Roofshot: Assistants for the interpretation of pathology imagery
      • What we need: annotated raw images
      • What we have: images with their captions, e.g. “Notice the multiple subependymal nodules in fig 3.”

  22. Roofshot: Assistants for the interpretation of pathology imagery

  23. Practicality: Building continuous modeling and quality control into our deployment workflows
