Eliminating the regular expression Datalogue
Tim Delisle. Datalogue: CEO & Co-Founder. Cornell Tech: MS Htech. Merck: Data Science & Insights.
Me, feeling the pain
Obsession: How might we automate the mundane, painful process of data preparation to get data into the hands of the people who need it?
Data prep means many different things to different people: casual data user, data engineer, data scientist
But the process is similar: 1. Semantic + structural understanding; 2. Parsing of unstructured data; 3. Translation of data from one format to another
Mardi 22 Mars, 1991; Tuesday April 8th 1962; 05/5/2017
{day: 22, month: “march”, year: 1991 }; {day: 08, month: “april”, year: 1962 }; {day: 05, month: “may”, year: 2017 }
1. Semantic + structural understanding: Mardi 22 Mars, 1991; Tuesday April 8th 1962; 05/5/2017
1. Semantic + structural understanding. Dates: Mardi 22 Mars, 1991; Tuesday April 8th 1962; 05/5/2017
1. Semantic + structural understanding. Dates: Mardi_22_Mars, 1991_; Tuesday_April_8th_1962_; 05/5/2017_
2. Parsing of compound/unstructured data. Dates: Mardi 22 Mars, 1991; Tuesday April 8th 1962; 05/5/2017. Token labels: WD (week day), D (day), M (month), Y (year)
3. Translation of data from one format to another. Mardi 22 Mars, 1991 -> {day: 22, month: “march”, year: 1991 }; Tuesday April 8th 1962 -> {day: 08, month: “april”, year: 1962 }; 05/5/2017 -> {day: 05, month: “may”, year: 2017 }
How do we do this today?!?
Regular expressions
Regex approach: Mardi 22 Mars, 1991 ([a-zA-Z]{3,}( |,)) Tuesday April 8th 1962 (\d*)*\w*|(\d*)|(\d*)\/(\d*) 05/05/2017
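To make the brittleness concrete, here is a minimal sketch in Python. The pattern `ENGLISH_DATE` and the helper `parse_date` are hypothetical names written for this illustration, not the regexes from the slide. The pattern is tuned to a single layout ("Weekday Month 8th 1962"), so it handles one of the three sample strings and silently misses the other two.

```python
import re

# Hypothetical pattern written for one layout only: "Weekday Month 8th 1962".
ENGLISH_DATE = re.compile(
    r"^(?P<weekday>[A-Za-z]{3,}) (?P<month>[A-Za-z]{3,}) "
    r"(?P<day>\d{1,2})(?:st|nd|rd|th)? (?P<year>\d{4})$"
)


def parse_date(text):
    """Return a record like the target format, or None if the pattern misses."""
    match = ENGLISH_DATE.match(text)
    if match is None:
        return None
    return {
        "day": int(match.group("day")),
        "month": match.group("month").lower(),
        "year": int(match.group("year")),
    }


samples = ["Mardi 22 Mars, 1991", "Tuesday April 8th 1962", "05/5/2017"]
results = {text: parse_date(text) for text in samples}
for text, parsed in results.items():
    print(f"{text!r} -> {parsed}")
```

Covering the French and numeric layouts would mean adding a new pattern, or a new branch, per format, which is the scaling problem the following slides point at.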
“You can write a million test cases and regexs will still blow up in your hands” Jai Chaudhary, Google
Regular expressions (regexes): impossible to scale!
Regular expressions + Machine Learning
Regex + Machine Learning approach. Dates, as label sequences: Week Day, Day, Month, Year (Mardi 22 Mars, 1991); Week Day, Month, Day, Year (Tuesday March 22nd 1991); Month, Day, Year (03/22/1991). Token types: Text, Numbers, Special chars. Hand-generated features per token: Length 4; # Letters 0; # Digits 4; # Special chars 0; Index special char -1; …
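The hand-generated features listed on this slide can be sketched as a small Python function. The function name `token_features` and the dictionary keys are assumptions made for this illustration, and "special character" is taken here to mean anything that is neither a letter nor a digit.

```python
def token_features(token):
    """Compute simple hand-crafted features for one token (illustrative only)."""
    letters = sum(ch.isalpha() for ch in token)
    digits = sum(ch.isdigit() for ch in token)
    # Assumption: a "special char" is any character that is not a letter or digit.
    special_positions = [i for i, ch in enumerate(token) if not ch.isalnum()]
    return {
        "length": len(token),
        "letters": letters,
        "digits": digits,
        "special_chars": len(special_positions),
        # -1 means the token contains no special character.
        "index_special_char": special_positions[0] if special_positions else -1,
    }


features_1991 = token_features("1991")
features_mars = token_features("Mars,")
print(features_1991)
print(features_mars)
```

For the token "1991" this yields the same values as the slide's list (length 4, 0 letters, 4 digits, 0 special characters, index -1). Each feature like this has to be designed by hand, which is why the approach gets harder to maintain as new classes are added.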
Regular expressions + Machine Learning: hand-generated features, hard to scale with new classes
Deep Learning
“Convolutional neural networks take advantage of the 2D structure of the input”
Address Phone Number
Phone Number -> Char Embedding -> VDCNN -> Label
Layers: 45; Params: 1,016,101; Test Acc: 94%; Highest error rate classes: Name -> Business Name
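As a rough sketch of the step that comes before a character embedding, the snippet below turns a string into a fixed-length sequence of integer indices, which is the kind of input an embedding layer typically consumes. The alphabet, the reserved `PAD` and `UNKNOWN` indices, the `max_len` value, and the sample phone number are all assumptions made for illustration; the talk does not specify them, and the embedding layer and the network itself are not shown.

```python
import string

# Assumed alphabet; the actual character set used in the talk is not given.
ALPHABET = string.ascii_lowercase + string.digits + string.punctuation + " "
PAD, UNKNOWN = 0, 1  # reserved indices for padding and out-of-alphabet characters
CHAR_TO_INDEX = {ch: i + 2 for i, ch in enumerate(ALPHABET)}


def encode(text, max_len=32):
    """Map each character to an integer index, then pad to a fixed length."""
    ids = [CHAR_TO_INDEX.get(ch, UNKNOWN) for ch in text.lower()[:max_len]]
    return ids + [PAD] * (max_len - len(ids))


encoded = encode("(503) 555-0100")  # made-up phone number
print(encoded)
```

No hand-written features are involved at this stage: the model sees only character indices and is left to learn which patterns distinguish a phone number from an address.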
Char Embedding -> ConvNet + Bidirectional LSTM. Input: 10 Airport Road SE,Salem,NY,97301. Parsed String: AAAAAAAAAAAAAAAAAAUCCCCCUSSUZZZZZ
Layers: 7; Params: 232,121; Val Acc: 99.73%
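The parsed string assigns one label to each input character. A minimal sketch of turning such a label string back into fields follows, assuming that A, C, S and Z stand for address, city, state and zip, and that U marks separator characters; the slide shows only the letters, so these meanings, the `LABEL_NAMES` table and the `decode` helper are hypothetical. The label string is spelled out as run lengths so that it lines up with the 33 characters of the example input.

```python
from itertools import groupby

# Assumed meaning of the label letters; the slide shows only the letters.
LABEL_NAMES = {"A": "address", "C": "city", "S": "state", "Z": "zip"}


def decode(text, labels):
    """Group consecutive characters that share a label into named fields."""
    if len(text) != len(labels):
        raise ValueError("need exactly one label per character")
    fields = {}
    position = 0
    for label, run in groupby(labels):
        run_length = len(list(run))
        name = LABEL_NAMES.get(label)
        if name is not None:  # other labels, such as "U", are treated as separators
            fields[name] = fields.get(name, "") + text[position:position + run_length]
        position += run_length
    return fields


text = "10 Airport Road SE,Salem,NY,97301"
labels = "A" * 18 + "U" + "C" * 5 + "U" + "S" * 2 + "U" + "Z" * 5
parsed = decode(text, labels)
print(parsed)
```

Because the model labels characters rather than matching a fixed pattern, the decoding step stays the same whatever order or punctuation the fields arrive in.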
But the process is similar and can be automated: 1. Semantic + structural understanding, using VDCNN; 2. Parsing of unstructured data, using ConvNet + LSTMs; 3. Translation of data from one format to another, using Seq2Seq models
Thank you! Ask away.