Eliminating the regular expression - Datalogue - PowerPoint PPT Presentation



slide-1
SLIDE 1

Datalogue

Eliminating the regular expression

slide-2
SLIDE 2

Tim Delisle

Datalogue: CEO & Co-Founder
Cornell Tech: MS Htech
Merck: Data science & Insights

slide-3
SLIDE 3

Feeling the Pain

Me, feeling the pain

slide-4
SLIDE 4

Obsession

How might we automate the mundane, painful process of data preparation to get data into the hands of the people who need it?

slide-5
SLIDE 5

Casual data user, data engineer, data scientist

Data prep means many different things to different people

slide-6
SLIDE 6

1 Semantic + structural understanding

2 Parsing of unstructured data

3 Translation of data from one format to another

But the process is similar

slide-7
SLIDE 7

Mardi 22 Mars, 1991
Tuesday April 8th 1962
05/5/2017

slide-8
SLIDE 8

{day: 22, month: “march”, year: 1991 }
{day: 08, month: “april”, year: 1962 }
{day: 05, month: “may”, year: 2017 }

slide-9
SLIDE 9

1 Semantic + structural understanding

Mardi 22 Mars, 1991
Tuesday April 8th 1962
05/5/2017

slide-10
SLIDE 10

1 Semantic + structural understanding

Dates
Mardi 22 Mars, 1991
Tuesday April 8th 1962
05/5/2017

slide-11
SLIDE 11

1 Semantic + structural understanding

Dates
Mardi_22_Mars, 1991_
Tuesday_April_8th_1962_
05/5/2017_

slide-12
SLIDE 12

1 Semantic + structural understanding

Dates
Mardi_22_Mars, 1991_
Tuesday_April_8th_1962_
05/5/2017_

slide-13
SLIDE 13

2 Parsing of compound/ unstructured data

Dates
Mardi 22 Mars, 1991
Tuesday April 8th 1962
05/5/2017

Components: WD, D, M, Y

slide-14
SLIDE 14

3 Translation of data from one format to another

Mardi 22 Mars, 1991
Tuesday April 8th 1962
05/5/2017

{day: 22, month: “march”, year: 1991 }
{day: 08, month: “april”, year: 1962 }
{day: 05, month: “may”, year: 2017 }
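As a rough illustration of this translation step, the sketch below hand-codes the mapping for the three example strings. It is not the talk's method: the function name `to_record`, the one-entry French alias table, and the month/day/year reading of all-numeric dates are assumptions made for the example.

```python
import re

# English month names, indexed so that MONTH_NAMES[0] is January.
MONTH_NAMES = ["january", "february", "march", "april", "may", "june",
               "july", "august", "september", "october", "november", "december"]
# Hypothetical alias table: only the French month name used on the slides.
MONTH_ALIASES = {"mars": "march"}


def to_record(text):
    """Translate one date string into a {day, month, year} record."""
    numeric = re.fullmatch(r"(\d{1,2})/(\d{1,2})/(\d{4})", text.strip())
    if numeric:
        # Assumption: all-numeric dates are written month/day/year.
        month, day, year = (int(part) for part in numeric.groups())
        if not 1 <= month <= 12:
            raise ValueError(f"invalid month in {text!r}")
        return {"day": day, "month": MONTH_NAMES[month - 1], "year": year}

    record = {}
    for token in re.findall(r"[^\W\d_]+|\d+", text):
        word = MONTH_ALIASES.get(token.lower(), token.lower())
        if token.isdigit():
            # Four digits are read as a year, anything shorter as a day.
            record["year" if len(token) == 4 else "day"] = int(token)
        elif word in MONTH_NAMES:
            record["month"] = word
        # Weekday names and ordinal suffixes such as "th" are ignored.
    if set(record) != {"day", "month", "year"}:
        raise ValueError(f"could not translate {text!r}")
    return record
```

Every new language or layout needs another table entry or branch here, which is the maintenance cost the rest of the talk tries to remove.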

slide-15
SLIDE 15

How do we do this today?!?

slide-16
SLIDE 16

Regular expressions

slide-17
SLIDE 17

Regex approach

Mardi 22 Mars, 1991
Tuesday April 8th 1962
05/05/2017

([a-zA-Z]{3,}( |,)) (\d*)*\w*| (\d*)|(\d*)\/ (\d*)
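To make the regex approach concrete, here is a minimal sketch with one pattern per known layout. The patterns are illustrative stand-ins written for this example, not the expression shown on the slide, and `match_date` is a hypothetical helper name.

```python
import re

# Illustrative patterns (assumptions, not the slide's regex): one per layout.
PATTERNS = [
    # "Mardi 22 Mars, 1991"
    re.compile(r"(?P<weekday>[A-Za-z]+) (?P<day>\d{1,2}) "
               r"(?P<month>[A-Za-z]+), (?P<year>\d{4})"),
    # "Tuesday April 8th 1962"
    re.compile(r"(?P<weekday>[A-Za-z]+) (?P<month>[A-Za-z]+) "
               r"(?P<day>\d{1,2})(?:st|nd|rd|th) (?P<year>\d{4})"),
    # "05/05/2017", assumed to be month/day/year
    re.compile(r"(?P<month>\d{1,2})/(?P<day>\d{1,2})/(?P<year>\d{4})"),
]


def match_date(text):
    """Return the named groups of the first pattern that matches, else None."""
    for pattern in PATTERNS:
        match = pattern.fullmatch(text)
        if match:
            return match.groupdict()
    return None
```

The three slide examples match, but a layout nobody anticipated, such as `22 March 1991`, falls through every pattern and returns `None` until someone writes a fourth expression.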

slide-18
SLIDE 18

“You can write a million test cases and regexs will still blow up in your hands” Jai Chaudhary, Google

slide-19
SLIDE 19

Regular expressions

slide-20
SLIDE 20

Regular expressions

Regexes…

slide-21
SLIDE 21

Regular expressions

Regexes… Impossible to scale!

slide-22
SLIDE 22

Regular expressions + Machine Learning

slide-23
SLIDE 23

Regex + Machine Learning approach

Mardi 22 Mars, 1991
Tuesday March 22nd 1991
03/22/1991

Text, Numbers, Dates, Special chars

Labels: Week Day, Month, Day, Year

Features: Length (4), # Letters, # Digits (4), # Special chars, Index special char (-1), …
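The hand-generated features named on this slide can be sketched as a small function. The feature names follow the slide; the function name `hand_features`, the definition of "special" as anything that is neither alphanumeric nor whitespace, and the `-1` value for "no special character" are assumptions.

```python
def hand_features(token):
    """Compute simple per-token features to feed a conventional classifier."""
    # Assumption: "special" means neither alphanumeric nor whitespace.
    specials = [i for i, ch in enumerate(token)
                if not ch.isalnum() and not ch.isspace()]
    return {
        "length": len(token),
        "n_letters": sum(ch.isalpha() for ch in token),
        "n_digits": sum(ch.isdigit() for ch in token),
        "n_special": len(specials),
        # Position of the first special character, or -1 when there is none.
        "index_special": specials[0] if specials else -1,
    }
```

For the token `1991` this gives length 4, 0 letters, 4 digits, 0 special characters, and index -1. Each new class of data tends to need new features of this kind, written by hand.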

slide-24
SLIDE 24

Regular expressions + Machine Learning

slide-25
SLIDE 25

Regular expressions + Machine Learning

Hand generated features

slide-26
SLIDE 26

Regular expressions + Machine Learning

Hand generated features
Hard to scale with new classes

slide-27
SLIDE 27

Regular expressions + Machine Learning

slide-28
SLIDE 28

Deep Learning

slide-29
SLIDE 29
slide-30
SLIDE 30

“Convolutional neural networks take advantage of the 2D structure of the input”

slide-31
SLIDE 31
slide-32
SLIDE 32
slide-33
SLIDE 33

Address
Phone Number

slide-34
SLIDE 34

Char Embedding -> VDCNN -> Label (Phone Number)
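A character-level model of this kind starts from integer character ids rather than hand-built features. The sketch below shows only that input step, under assumptions: the alphabet, the fixed length of 32, and the shared id 0 for padding and unknown characters are all chosen for illustration, and the phone number is a made-up example.

```python
import string

# Assumption: a small fixed alphabet; id 0 is reserved for padding and unknowns.
ALPHABET = string.ascii_lowercase + string.digits + " ,./-()+#"
CHAR_TO_ID = {ch: i + 1 for i, ch in enumerate(ALPHABET)}


def encode(text, max_len=32):
    """Map a string to a fixed-length list of character ids."""
    ids = [CHAR_TO_ID.get(ch, 0) for ch in text.lower()[:max_len]]
    # Right-pad with zeros so every input has the same length.
    return ids + [0] * (max_len - len(ids))


# Hypothetical input; an embedding layer would turn each id into a vector.
phone_ids = encode("(555) 123-4567")
```

The network then learns its own features from these ids, so adding a new class means adding labeled examples instead of new feature code.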

slide-35
SLIDE 35

Layers: 45
Params: 1,016,101
Test Acc: 94%
Highest error rate classes: Name -> Business Name

slide-36
SLIDE 36
slide-37
SLIDE 37

Char Embedding -> ConvNet + Bidirectional LSTM -> Parsed String

10 Airport Road SE,Salem,NY,97301
AAAAAAAAAAAAAAAAAAUCCCCCUSSUZZZZZ
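The label string above appears to assign one label to each input character. Assuming that alignment, the sketch below turns such a per-character prediction back into fields. The meanings given to A, C, S, and Z (address, city, state, zip) and the treatment of U as a separator are inferred from the example, not stated on the slide.

```python
from itertools import groupby

# Assumption: label meanings inferred from the slide's example.
LABEL_NAMES = {"A": "address", "C": "city", "S": "state", "Z": "zip"}


def decode(text, labels):
    """Group consecutive characters that share a label into named fields."""
    if len(text) != len(labels):
        raise ValueError("text and labels must have the same length")
    fields = {}
    pos = 0
    for label, run in groupby(labels):
        n = len(list(run))
        if label in LABEL_NAMES:  # other labels (e.g. "U") are skipped
            fields.setdefault(LABEL_NAMES[label], []).append(text[pos:pos + n])
        pos += n
    return {name: " ".join(parts) for name, parts in fields.items()}


# Label sequence as on the slide, built by run length to keep it readable.
labels = "A" * 18 + "U" + "C" * 5 + "U" + "S" * 2 + "U" + "Z" * 5
print(decode("10 Airport Road SE,Salem,NY,97301", labels))
```

Framing parsing as per-character labeling means the model, not a delimiter rule, decides where each field starts and ends.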

slide-38
SLIDE 38

Layers: 7
Params: 232,121
Val Acc: 99.73%

slide-39
SLIDE 39

1 Semantic + structural understanding using VDCNN

2 Parsing of unstructured data using ConvNet + LSTMs

3 Translation of data from one format to another using Seq2Seq models

But the process is similar and can be automated

slide-40
SLIDE 40
slide-41
SLIDE 41
slide-42
SLIDE 42

Thank you! Ask away.