Eliminating the regular expression
Datalogue CEO & Co-Founder Cornell Tech MS Htech Merck Data science & Insights
Tim Delisle
Feeling the Pain
Me, feeling the pain
Obsession: How might we automate the mundane, painful process of data preparation to get data into the hands of the people who need it?
Casual data user Data engineer Data scientist
Data prep means many different things to different people
1 Semantic + structural understanding
2 Parsing of unstructured data
3 Translation of data from one format to another
But the process is similar
Mardi 22 Mars, 1991 Tuesday April 8th 1962 05/5/2017
{day: 22, month: “march”, year: 1991 } {day: 08, month: “april”, year: 1962 } {day: 05, month: “may”, year: 2017 }
1 Semantic + structural understanding
Mardi 22 Mars, 1991 Tuesday April 8th 1962 05/5/2017
1 Semantic + structural understanding
Dates Mardi 22 Mars, 1991 Tuesday April 8th 1962 05/5/2017
1 Semantic + structural understanding
Mardi_22_Mars, 1991_ Tuesday_April_8th_1962_ 05/5/2017_ Dates
2 Parsing of compound/ unstructured data
Mardi 22 Mars, 1991 Tuesday April 8th 1962 05/5/2017 WD D M Y Dates
3 Translation of data from one format to another
Mardi 22 Mars, 1991 Tuesday April 8th 1962 05/5/2017
{day: 22, month: “march”, year: 1991 } {day: 08, month: “april”, year: 1962 } {day: 05, month: “may”, year: 2017 }
How do we do this today?!?
Regular expressions
Regex approach
Mardi 22 Mars, 1991 Tuesday April 8th 1962 05/05/2017
([a-zA-Z]{3,}( |,)) (\d*)*\w*| (\d*)|(\d*)\/ (\d*)
“You can write a million test cases and regexs will still blow up in your hands” Jai Chaudhary, Google
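The brittleness described above can be sketched in a few lines. The pattern below is a hypothetical one written for this illustration (it is not the regex from the slide): it is tuned to the "Tuesday April 8th 1962" layout and silently fails on the other two sample dates.

```python
import re

# Hypothetical pattern, for illustration only: "<weekday> <month> <day><suffix> <year>".
PATTERN = re.compile(
    r"^(?P<weekday>[A-Za-z]+) (?P<month>[A-Za-z]+) "
    r"(?P<day>\d{1,2})(?:st|nd|rd|th)? (?P<year>\d{4})$"
)

SAMPLES = ["Mardi 22 Mars, 1991", "Tuesday April 8th 1962", "05/5/2017"]


def try_parse(text):
    """Return a {day, month, year} dict, or None when the pattern does not match."""
    match = PATTERN.match(text)
    if match is None:
        return None
    return {
        "day": int(match["day"]),
        "month": match["month"].lower(),
        "year": int(match["year"]),
    }


# Only the layout the pattern was written for is parsed; the rest come back as None.
results = {sample: try_parse(sample) for sample in SAMPLES}
```

Covering the two remaining layouts means adding more alternations or more patterns, and every new layout after that repeats the exercise.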
Regular expressions
Regexes… Impossible to scale!
Regular expressions + Machine Learning
Inputs: Mardi 22 Mars, 1991 / Tuesday March 22nd 1991 / 03/22/1991
Classes: Text, Numbers, Dates, Special chars
Date parts: Week Day, Month, Day, Year
Features (values as shown on the slide): Length 4, # Letters, # Digits 4, # Special chars, Index special char -1, …
Regex + Machine Learning approach
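A minimal sketch of the hand-generated features named on this slide (length, letter count, digit count, special-character count, index of the first special character). The function name, the dictionary keys, and the definition of "special" as anything that is neither alphanumeric nor whitespace are assumptions made for this example. A classifier would be trained on such vectors; that part is not shown.

```python
def hand_features(token):
    """Compute simple hand-written features for one token (illustrative only)."""
    # Assumption: a "special" character is neither alphanumeric nor whitespace.
    special_positions = [
        i for i, ch in enumerate(token) if not ch.isalnum() and not ch.isspace()
    ]
    return {
        "length": len(token),
        "n_letters": sum(ch.isalpha() for ch in token),
        "n_digits": sum(ch.isdigit() for ch in token),
        "n_special": len(special_positions),
        # -1 signals "no special character", as in the slide's example value.
        "index_special": special_positions[0] if special_positions else -1,
    }


features_year = hand_features("1991")
features_date = hand_features("03/22/1991")
```

Each new class usually needs new features of this kind, which is the scaling problem the next slide points at.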
Regular expressions + Machine Learning
Hand generated features
Hard to scale with new classes
Regular expressions + Machine Learning
Deep Learning
“Convolutional neural networks take advantage of the 2D structure of the input”
Address Phone Number
Char Embedding -> VDCNN -> Label (Phone Number)
Layers: 45
Params: 1,016,101
Test Acc: 94%
Highest error rate classes: Name -> Business Name
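The talk does not say how strings are fed to the network, so the following is only a sketch of one common character-level encoding: each character becomes a one-hot row, giving the 2D input a convolutional network can work on. The alphabet, the fixed length of 16, and the sample string "555-0100" are all assumptions made for this example; a learned character embedding would typically sit on top of indices like these.

```python
# Assumed alphabet for illustration; a real model would choose its own.
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789 ,./-()+"
CHAR_TO_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def one_hot_matrix(text, max_len=16):
    """Return a max_len x len(ALPHABET) matrix with one 0/1 row per character."""
    rows = []
    for ch in text.lower()[:max_len]:
        row = [0] * len(ALPHABET)
        index = CHAR_TO_INDEX.get(ch)
        if index is not None:
            row[index] = 1  # characters outside the alphabet stay all-zero
        rows.append(row)
    # Pad short inputs with all-zero rows so every input has the same shape.
    while len(rows) < max_len:
        rows.append([0] * len(ALPHABET))
    return rows


matrix = one_hot_matrix("555-0100")  # hypothetical phone-number-like input
```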
Char Embedding -> ConvNet + Bidirectional LSTM -> Parsed String
10 Airport Road SE,Salem,NY,97301
AAAAAAAAAAAAAAAAAAUCCCCCUSSUZZZZZ
Layers: 7 Params: 232,121 Val Acc: 99.73%
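The per-character labels in the address example can be turned back into fields by grouping runs of identical labels. This is a sketch of that decoding step only, not of the model. The label string is rebuilt programmatically to match the input length, and the meanings given to the labels (A = address, C = city, S = state, Z = zip, U = separator) are assumptions read off the example.

```python
from itertools import groupby

text = "10 Airport Road SE,Salem,NY,97301"
# One label per input character, rebuilt to follow the pattern in the example.
labels = "A" * 18 + "U" + "C" * 5 + "U" + "S" * 2 + "U" + "Z" * 5

# Assumed label meanings; "U" is treated as a separator and dropped.
FIELD_NAMES = {"A": "address", "C": "city", "S": "state", "Z": "zip"}


def decode(text, labels):
    """Group consecutive identical labels and slice the matching text spans."""
    if len(text) != len(labels):
        raise ValueError("text and labels must have the same length")
    fields = {}
    position = 0
    for label, run in groupby(labels):
        run_length = len(list(run))
        if label in FIELD_NAMES:
            # A label that appears in several runs keeps only its last span.
            fields[FIELD_NAMES[label]] = text[position:position + run_length]
        position += run_length
    return fields


parsed = decode(text, labels)
```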
1 Semantic + structural understanding using VDCNN
2 Parsing of unstructured data using ConvNet + LSTMs
3 Translation of data from one format to another using Seq2Seq models