DOCUMENT DIGITIZATION Rethinking it with Machine Learning Nischal Harohalli Padmanabha QConAI SFO 2019
“The brain sure as hell doesn’t work by somebody programming in rules.” - Geoffrey Hinton
@nischalhp | Document Digitization | QconAI SFO 2019
PROBLEM
Understanding unstructured documents and extracting semantic information to automate claims handling.
DOCUMENT CLASS: Policy
POLICY NUMBER: H 54/16 307 728
POLICY: Liability Protection
CUSTOMER: Renolate GmbH, 10115 Berlin
AGENT: pma Insurance Broker, 48149 Nurnberg
EFFECTIVE DATE OF CHANGE: 22.12.2016 12:00
TERMINATION: 22.12.2019 12:00
ANNUAL CHARGE: EUR 424,63
COVERAGES
● Persons & property damage flat: EUR 3.000.000
● Financial losses: EUR 100.000
● Environmental damage basic flat: EUR 3.000.000
RISK DESCRIPTION / INSURED LOCATION
● Private liability insurance comfort plus
● Dog liability
● Environmental damage insurance
● Employees on premises
REWIND
TABULAR INFORMATION EXTRACTION
COURSE OF ACTION - ROUND 1
● Writing a lot of rules
● Evaluation on known data
● Initial results gave us a lot of happiness.
RESULT
In production: 58% accuracy
We failed, miserably. Rules became cumbersome & brittle.
Life or death situation for the project (and us engineers)
ADAPTIVE LEARNING THOUGHT PROCESS
How does a human solve the same problem?
● Identifies groupings of text to build context, e.g. tables, paragraphs, passages
● Given the context, applies domain knowledge and semantic understanding of the text
Sounds straightforward, right?
TECH STACK CHECK
NEXT STEPS
● What are our deadlines?
● Which algorithms to use?
● How to agile this?
● What should we feed as input to the algorithm?
● Human and computation resources required?
● What to annotate?
COURSE OF ACTION - ROUND 2
Which algorithms to use?
Supervised Learning
Computer Vision:
● Object detection
● Messaging parsing networks
● Custom CNN networks
NLP:
● Topic modeling
● Custom RNN + CNN networks with domain adaptation
Unsupervised Learning
Computer Vision: Implementation of Deep ... Using this technique to generate data for supervised training.
NLP: Wrote implementations of deep clustering and word / sentence / page / document embeddings.
EMPHASIS ON SUPERVISED LEARNING
COURSE OF ACTION - ROUND 2
What should we feed as input to the algorithm? What to annotate?
Built an in-house Annotation System
Computer Vision:
● Drawing polygon bounding boxes
● Labeling pages
● Labeling documents
NLP: Complex annotation of passages, phrases, tables, line items, and the hierarchical nature of textual information
Workflows support huge annotation jobs
COURSE OF ACTION - ROUND 2
Human and computation resources required?
Data Scientists
● Data scientists from academia
● Deep learning engineers
● Research programme with universities
● Master thesis sponsorship at omni:us
Engineers
● Full stack engineers
● Data engineers
● DevOps
Leadership & Mentors
● Team leads with experience in AI
● Identifying and convincing industry experts to mentor
● DevOps
Cloud startup programmes
● Credits to support memory and GPU for training algorithms
● Mentoring to scale operations
COURSE OF ACTION - ROUND 2
What are our deadlines? How to agile this?
● Sprint planning for research
● Quick turnaround of POC
● Engineer AI systems to run experiments in a systematic and automated way
RESULT
In production: 94% accuracy
Successful AI delivery
TECH STACK CHECK
GO LIVE OR GO HOME
AI IN PRODUCTION
● Trained models predict
● Human in the loop fixes the errors and validates corrections
● Train on the corrections, continuous improvements
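The predict / correct / retrain loop above can be sketched roughly as follows. This is a minimal illustration, not the omni:us implementation: the `Prediction` record, `collect_corrections`, and `accuracy` are hypothetical names, and the reviewer's input is modelled as a plain dictionary keyed by document and field.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prediction:
    """One extracted field value produced by a trained model (hypothetical shape)."""
    doc_id: str
    field_name: str
    value: str


def collect_corrections(predictions, reviewed_values):
    """Return (prediction, corrected_value) pairs where the human changed the value.

    reviewed_values maps (doc_id, field_name) to the value the reviewer confirmed.
    Fields missing from the mapping are treated as accepted unchanged.
    """
    corrections = []
    for p in predictions:
        human_value = reviewed_values.get((p.doc_id, p.field_name), p.value)
        if human_value != p.value:
            corrections.append((p, human_value))
    return corrections


def accuracy(predictions, reviewed_values):
    """Share of predictions the reviewer left untouched."""
    if not predictions:
        return 0.0
    untouched = len(predictions) - len(collect_corrections(predictions, reviewed_values))
    return untouched / len(predictions)


if __name__ == "__main__":
    preds = [
        Prediction("doc-1", "annual_charge", "EUR 424,63"),
        Prediction("doc-1", "termination", "22.12.2016 12:00"),
    ]
    reviewed = {("doc-1", "termination"): "22.12.2019 12:00"}
    # The corrections become new training examples for the next training run.
    print(collect_corrections(preds, reviewed))
    print(accuracy(preds, reviewed))
```

The corrected pairs are what would be fed back into training; how often to retrain is left open here.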
DO NOT IGNORE
● Domain knowledge is essential
● Educate your customers on AI
● Engineer end-to-end AI systems to solve a business use case, not a dataset
PLATFORM
● Training Platform
● Prediction Platform with human in the loop
● Management Console of Infrastructure, Applications & Users
COURSE OF ACTION - ROUND 3
Training Platform
● Annotation System: System to define data models, annotate data, manage annotation jobs, audit the annotated data and version control the datasets
● Ability to train and evaluate models: Mechanism and system to trigger training, retraining, evaluation and versioning of different types of models, in a managed way, across various infrastructures supporting CPU and GPU
● Console connecting the two together
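One way to picture "version control the datasets" and "versioning of models" is sketched below. It is an assumption-laden illustration, not the platform's design: `dataset_version` derives an id from a content hash of the annotation records, and `ModelRegistry` is a hypothetical in-memory stand-in for whatever store tracks training runs.

```python
import hashlib
import json


def dataset_version(annotations):
    """Derive a deterministic version id from annotation records.

    Each record is assumed to be a JSON-serialisable dict with a unique "id".
    Records are sorted so the id does not depend on annotation order.
    """
    ordered = sorted(annotations, key=lambda record: record["id"])
    canonical = json.dumps(ordered, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


class ModelRegistry:
    """Hypothetical in-memory record of training runs and their evaluation metrics."""

    def __init__(self):
        self._runs = []

    def register(self, model_name, dataset_id, metrics):
        """Store a run and assign the next version number for this model name."""
        version = 1 + sum(1 for run in self._runs if run["model"] == model_name)
        run = {
            "model": model_name,
            "version": version,
            "dataset": dataset_id,
            "metrics": dict(metrics),
        }
        self._runs.append(run)
        return run

    def best(self, model_name, metric):
        """Return the run with the highest value of the given metric."""
        runs = [run for run in self._runs if run["model"] == model_name]
        if not runs:
            raise KeyError(model_name)
        return max(runs, key=lambda run: run["metrics"][metric])
```

Tying every registered model to the dataset id it was trained on is what makes a retraining run auditable after the annotations change.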
COURSE OF ACTION - ROUND 3
Prediction Platform with human in the loop
● Async API for ingestion: REST API that supports asynchronous data upload capabilities
● Validation UI: User interface to fix prediction errors
● AI microservices: Scaling deep learning models as microservices
● Data pipelines: Robust data pipelines connecting the services, providing high throughput, reliability and retry mechanisms
● Prediction console connects all.
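The "retry mechanisms" mentioned for the data pipelines could look something like the sketch below. The slides do not say how retries are implemented, so exponential backoff is an assumption here, and `run_with_retry` and its parameters are hypothetical names.

```python
import time


def run_with_retry(step, payload, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call step(payload), retrying on any exception with exponential backoff.

    The delay doubles after each failed attempt: base_delay, 2 * base_delay, ...
    The last failure is re-raised so the caller can route the payload elsewhere.
    The sleep function is injectable so the backoff can be tested without waiting.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return step(payload)
        except Exception:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```

A pipeline stage that calls a model microservice would wrap the call, for example `run_with_retry(call_model_service, page)`, where `call_model_service` is whatever function sends the request.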
COURSE OF ACTION - ROUND 3
Management Console of Infrastructure, Applications & Users
● Configuration management: Central management of configuration of various systems, consoles and services
● User management: Managing users and providing authentication and authorisation capabilities for services
● Infrastructure logs: Monitoring infrastructure usage and patterns to set up alerts and notifications
● Application logs: Monitoring logs of applications and setting up dashboards for internal and external stakeholders
● Management and monitoring console
TECH STACK CHECK
omni:us platform console
Learnings
● It is very important for an entire organization to believe that AI can solve problems.
● Engineer AI products; do not believe that having just AI models is good enough.
● Agile for AI works; choose an interpretation that works for your team.
● Pay attention to details, domain knowledge and the use case to be solved.
● A combination of multiple technologies has to be used to solve a use case, not just one hammer for all.
● Do not try to “AI” everything; certain mature technologies are capable of solving certain problems well. Use them wisely.
● Believe in human in the loop; it builds trust with the business.
● Educate internal and external stakeholders about the possibilities and limitations of AI.
● Visualisation is a powerful tool to understand and explain AI to everybody. Use it.
● AI is no longer a black box; it can be fine-tuned, managed and configured appropriately.
● Automate your current processes as much as possible; this gives more room for research.