Automated Case Identification and Coding of Cancer Pathology Reports
Jon Patrick PhD, DipLSurv, MSc, BSc(Psych), Grad Dip Psych, FACHI, FACS, MAMIA CEO
Awarded 7th June 2018
The Many Faces of NLP? • Text Mining - rules, regular expressions, bag of words – deterministic – cannot find anything that hasn’t been defined in the rules – Strong on Positives but typically over-generalises. No ability to find unseens. • Real NLP – the field of computing the structure of language - nka Computational Linguistics • Statistical NLP – NLP plus Machine Learning from examples and then can generalize – non-deterministic – finds what it hasn’t seen • Language Engineering – Building production grade SNLP solutions
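The deterministic nature of rule-based text mining can be shown with a small sketch. The rule list and the two sample sentences are invented for illustration and are not taken from the project. The only point is that a term missing from the rules is never found.

```python
import re

# Hypothetical rule set: one regular expression per known cancer term.
RULES = [r"\bcarcinoma\b", r"\bmelanoma\b"]

def rule_based_hits(text):
    """Return every rule that matches the text (case-insensitive)."""
    return [rule for rule in RULES if re.search(rule, text, re.IGNORECASE)]

seen = "Invasive ductal carcinoma identified."
unseen = "Findings consistent with lymphoma."  # term not covered by any rule

print(rule_based_hits(seen))    # the carcinoma rule fires
print(rule_based_hits(unseen))  # empty list: the rules cannot generalise
```

A statistical model trained on examples could, in principle, still flag the second sentence from its context; the rule set cannot until someone adds a rule.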
California Cancer Registry Problem • 500,000 documents per annum – potentially 1 million in the coming years • 50% unwanted and need to be filtered out • Separation of non-reportable and reportable cancers • Coding the reportables
California CR Project Objectives - Analysis of Histopathology Reports • Develop an automated service to: • Determine Reportability • Codify 5 attributes of – Site – Histology – Grade – Behaviour – Laterality
The Language Engineering Issues • How well can we do it - efficiency • With what volume of training materials - 5000 • In what amount of time – 15 months • To what accuracy - 90% • For what cost – not enough • The SNLP is our infrastructure – it is the Language Modeling using machine learning, and coding to the client deliverables that have to be engineered.
Language Engineering Overview of Tasks • Pathology Reports Classifier – In scope – histopathology – Out-of-scope - Immunohistochemistry & Genetics • Case Identification – Reportability Classifier • Clinical Concept Recognizer • Coding Inference Engine
Two Training Corpora needed to create a Gold Standard for learning • 5000 Reportables in 10 Batches • 5000 Non-Reportables • From 50+ laboratories • Covering – 133 Histology codes – 140 Site codes • Subsequently - 212 Reportables in Non-Reportables corpus transferred to Batch 11
Pathology Report Type Classifier

Classes          TP     FP   FN   P       R       F-Score
Histopathology   4373   57   29   98.71   99.34   99.03
Other             653   29   57   95.75   91.97   93.82
OVERALL          5026   86   86   98.32   98.32   98.32
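The P, R and F-Score columns follow from the TP/FP/FN counts. A minimal sketch, assuming the standard definitions of precision, recall and the balanced F-score (the function name is illustrative):

```python
def precision_recall_f(tp, fp, fn):
    """Return (precision, recall, F-score) as percentages rounded to 2 places."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    return tuple(round(100 * x, 2) for x in (precision, recall, f_score))

# Counts taken from the report type classifier results.
rows = {
    "Histopathology": (4373, 57, 29),
    "Other": (653, 29, 57),
    "OVERALL": (5026, 86, 86),
}
for name, counts in rows.items():
    print(name, precision_recall_f(*counts))
```

The OVERALL row has equal precision and recall because every report misclassified out of one class is a false positive for the other, so total FP equals total FN.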
Reportability Results for Reportability Corpus

TP     FP   FN    TN     P       R       F
3510   27   111   1160   99.24   96.93   98.10

Interpreted as 0.76% FP and 3.07% FN (loss of Cancer reports to the Non-cancer class). Acceptable but pride would want us to do better.
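The two quoted error percentages are the complements of precision and recall: false positives as a share of everything predicted reportable, and false negatives as a share of everything truly reportable. A short check using the counts above (variable names are illustrative):

```python
tp, fp, fn, tn = 3510, 27, 111, 1160

# Share of predicted-reportable reports that were not reportable (100 - P).
fp_share = 100 * fp / (tp + fp)
# Share of truly reportable reports that were missed (100 - R).
fn_share = 100 * fn / (tp + fn)

print(f"FP: {fp_share:.2f}%  FN: {fn_share:.2f}%")
```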
Finding Clinical Concepts: Annotating and Tagging • Design a schema of semantic tags • Manually annotate the training corpus with the tags • Build a machine learning Language Model with the training corpus to recognise the tags • Check the consistency of the annotations with the model • Iteratively correct the annotations and the structure of the model to improve it
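The consistency check in this loop can be sketched as a token-by-token comparison between the manual annotations and the model output, with disagreements sent back for review. This is only an illustration: the tag names reuse the coded attributes, while the "O" (no tag) label, the sample tokens and the function name are assumptions.

```python
def find_disagreements(gold_tags, model_tags):
    """Return (position, gold, predicted) wherever annotator and model disagree."""
    if len(gold_tags) != len(model_tags):
        raise ValueError("tag sequences must align token by token")
    return [
        (i, g, m)
        for i, (g, m) in enumerate(zip(gold_tags, model_tags))
        if g != m
    ]

# Hypothetical example: one token where the model and the annotator differ.
tokens = ["left", "breast", "ductal", "carcinoma"]
gold = ["Laterality", "Site", "Histology", "Histology"]
model = ["Laterality", "Site", "O", "Histology"]

for i, g, m in find_disagreements(gold, model):
    print(f"review token {tokens[i]!r}: annotated {g}, model said {m}")
```

Each flagged token is either an annotation error to correct or a model weakness to address, which is what drives the iteration.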
Language Model Accuracy – Run 29B & Run 41 - Sample of 34/32 Tags

                  2017                                                  2018
Tag               TP       FP    FN    P       R       F       N        F       N
Site              18072    50    59    99.72   99.67   99.70   18131    99.87   21943
Histology         11172    7     87    99.94   99.23   99.58   11259    99.78   13784
Behaviour         8247     5     27    99.94   99.67   99.81   8274     99.91   9603
Grade             897      1     19    99.89   97.93   98.90   916      98.79   1335
Laterality        6351     10    20    99.84   99.69   99.76   6371     99.89   8000
Total (34 tags)   115550   226   921   99.80   99.21   99.51   116471   99.71   141412

Number of tags above each threshold

Threshold   P (2017)   R (2017)   F (2017)   F (2018)
>99%        29         16         21         27
>99.5%      29         10         14         22
TOTAL       34         34         34         32
Coding Reports to ICD-O-3 – Problem Definition • Separate out each specimen. • Clinical Concept Recognition: Apply the Language Model to tag the correct concepts needed for coding. • CODIFY: Map the tags to the appropriate elements in the ICD-O-3 definitions for each coded attribute including applying: – SEER multiple primaries rules and – any other local rules. • Evaluate each specimen for its cancer reportability • The Summary - Select the specimens that are required to produce the correct case codes.
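The steps above can be sketched as a small pipeline over per-specimen tags. Everything here is hypothetical: the class and function names, the two-entry lookup tables, and in particular the reportability test, which is a simplifying assumption standing in for the SEER and local rules. The two codes used (C50.9, 8500/3) are standard ICD-O-3 entries for breast, NOS and infiltrating duct carcinoma.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CodedSpecimen:
    """The five coded attributes for one specimen."""
    site: Optional[str] = None
    histology: Optional[str] = None
    behaviour: Optional[str] = None
    grade: Optional[str] = None
    laterality: Optional[str] = None

# Toy lookup tables; a real system would cover the full code set.
SITE_CODES = {"breast": "C50.9"}
HISTOLOGY_CODES = {"infiltrating duct carcinoma": ("8500", "3")}

def codify(tags):
    """Map recognised concept tags (tag name -> text span) to codes."""
    coded = CodedSpecimen()
    coded.site = SITE_CODES.get(tags.get("Site", "").lower())
    hist = HISTOLOGY_CODES.get(tags.get("Histology", "").lower())
    if hist:
        coded.histology, coded.behaviour = hist
    coded.laterality = tags.get("Laterality")
    return coded

def is_reportable(coded):
    # Simplifying assumption: only malignant behaviour (/3) is reportable.
    return coded.behaviour == "3"

# Two hypothetical specimens from one report, already tagged.
specimens = [
    {"Site": "Breast", "Histology": "Infiltrating duct carcinoma", "Laterality": "left"},
    {"Site": "Breast"},
]
coded = [codify(tags) for tags in specimens]
case = [c for c in coded if is_reportable(c)]  # the summary step
print(case)
```

Keeping codification and reportability as separate per-specimen steps mirrors the slide: the case-level summary only has to select among specimens that are already coded.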
Coding Results Overview – All Reportables 2017.v1

Attribute              # of Extractable records   # of correct coded   Accuracy
Site (4 digits)        3165                       3014                 95.23%
hist_type (4 digits)   3165                       3050                 96.37%
hist_grade             3165                       3126                 98.77%
hist_behavior          3165                       3150                 99.53%
laterality             3165                       3096                 97.82%
TOTAL                  15825                      15436                487.72% (sum of the five accuracies)
Average                                                                97.54%
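The accuracy column and the average can be reproduced from the counts. In this short sketch (variable names are illustrative), the 487.72% figure is the sum of the five per-attribute accuracies, and dividing it by five gives the reported average; pooling all 15825 attribute decisions gives the same value to two decimal places.

```python
records = 3165  # extractable records per attribute
correct = {
    "Site": 3014,
    "hist_type": 3050,
    "hist_grade": 3126,
    "hist_behavior": 3150,
    "laterality": 3096,
}

# Per-attribute accuracy as a percentage.
accuracy = {name: round(100 * n / records, 2) for name, n in correct.items()}

# Mean of the five accuracies (487.72 / 5).
average = round(sum(accuracy.values()) / len(accuracy), 2)

# Accuracy pooled over all attribute decisions (15436 / 15825).
pooled = round(100 * sum(correct.values()) / (records * len(correct)), 2)

print(accuracy, average, pooled)
```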
FINAL Estimated Efficiency Gains • 100% automated Case finding - Reportability • 72% automated coding at 94% overall accuracy => 28% manual coding. • Up to 90% automatic coding is possible. • Histology codes coverage: 93.9% of supplied examples • Site codes coverage: 94.2% of supplied examples • Automated Coding of code set coverage 97.5% of reports. • Reduce manual errors by 40-80%. • More improvements are being made.
Epilogue – 2018 Revisions • 187 Site codes – 35% increase • 237 Histology codes – 85% increase • 15% increase in reports – Drawn from audits, new samplings • 2018.v1 is being tested by CCR • 2017.v2 is released
The Foreseeable Future • Increase accuracy for more difficult complexity classes -> reduces manual processing • Further investigate tumour stream classifications for improving accuracy • Analyse Immunohistochemistry and Genetics reports • Add more extraction and coding functions – e.g. Biomarkers, recurrence, tumour size, margins, … • Add more document types – Radiology, Nuclear Medicine • For the CDPH – Other Health topics – First Fractures of osteoarthritis, Adherence to National guidelines for diagnostics • Provide a Subscription Reportability Service for third parties.
• END