Learning text representations from character-level data
Grzegorz Chrupała
Department of Communication and Information Sciences
Tilburg University
CLIN 2013
Text representations
Traditionally focused on the word level
◮ Brown or HMM word classes
◮ Collobert and Weston distributed representations
◮ LDA-type soft classes
Successfully used as features in
◮ Chunking and named entity recognition
◮ Parsing
◮ Semantic relation labeling
Limitations
Assuming words as input is not always realistic
Agglutinative and other morphologically complex languages
Naturally occurring text: natural-language strings often commingled with other character data
Sample post on Stack Overflow
[Figure: screenshot of a Stack Overflow post]
Segmentation of the character stream
To define tokenization meaningfully, first need to segment and label character data:
◮ English
◮ Code block (Java, Python, ...)
◮ Inline code
◮ ...
Test case for inducing text representations
Stack Overflow HTML markup as supervision signal
Character-level sequence model (CRF)
Character n-gram features as baseline
→ Add text representation features
→ Learned from raw character data (no labels)
Simple Recurrent Neural Network (Elman net)
Current input and previous state are combined by the hidden units to create the current state
Output is generated from the current state
Self-supervised
[Figure: network diagram with input, hidden, and output units unrolled over time steps t-1, t, t+1]
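As a concrete illustration, one step of such a network could look like the following NumPy sketch; the weight names (W_in, W_rec, W_out) and the sigmoid/softmax nonlinearities are assumptions for illustration, not necessarily the exact parameterization used here.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    def elman_step(x_t, s_prev, W_in, W_rec, W_out):
        """One step of a simple recurrent (Elman) network.
        x_t: one-hot vector for the current character;
        s_prev: hidden state from the previous time step."""
        # Current input and previous state are combined into the new state
        s_t = sigmoid(W_in @ x_t + W_rec @ s_prev)
        # Output (a distribution over the next character) is generated from
        # the current state; self-supervised training pushes probability
        # toward the character that actually comes next
        y_t = softmax(W_out @ s_t)
        return s_t, y_t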
Hidden units
Encode history
Hopefully, generalize
Sample of nearest neighbors according to cosine similarity of the hidden layer activation, within a span of 10,000 characters
[Table: pairs of character strings whose hidden-layer activations are nearest by cosine, e.g. English fragments ending in "-ati(on)" ("writing · a · .NET · applicati", "d · to · test · a · IP · verificati", "enerate · each · IP · combinati"), JSON fragments ('p": · {"last_share": · 130738'), and PHP code ('echo · $n1.’.’.$n2.’.’')]
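Such neighbors can be retrieved by recording the hidden activation at every position and comparing positions by cosine similarity. A minimal sketch, assuming a matrix H of per-position activations (function and variable names are illustrative):

    import numpy as np

    def nearest_neighbors(H, i, k=5):
        """Indices of the k positions whose hidden activations are
        closest (by cosine) to the activation at position i.
        H: (n_positions, n_hidden) matrix of recorded activations."""
        # Normalize rows so that dot products equal cosine similarities
        Hn = H / np.linalg.norm(H, axis=1, keepdims=True)
        sims = Hn @ Hn[i]
        sims[i] = -np.inf  # exclude the query position itself
        return np.argsort(-sims)[:k]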
Generated random text
I · only · make · event · glds. so, · on · the · cell · proceedclicks · like · completed, · with · color?
···· st · potention, ‘column’]HeaderException=ID · = · new · Put="True" · MetadataTemplate, · grwTrowerRow="SELECTEMBRow" · on?
All · clearBeanLockCollection="#7293df3335b-E9" · />
············ <Image:DataKey="BackgroundCollectionC2UTID" · onclick="Nore" ·
Segmentation and labeling of Stack Overflow posts
Generate labels from HTML markup
From the trained RNN model:
◮ Run on labeled train and test data
◮ Record hidden unit activations at each position in the text
◮ Use them as extra features for the CRF
Labels
Block:
  chars:  w r o n g ? ¶ t    r    y
  labels: O O O O O O O B-BL I-BL I-BL
Inline:
  chars:  · e r .    .    /    i    m    g
  labels: O O O B-IN I-IN I-IN I-IN I-IN I-IN
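These labels can be generated mechanically from the posts' HTML. A sketch of one way to do it, assuming Stack Overflow's convention of <pre><code> for code blocks and bare <code> for inline code; the class name and tag-to-label mapping are illustrative assumptions:

    from html.parser import HTMLParser

    class CharLabeler(HTMLParser):
        """Emit one BIO label per character of the text content:
        BL spans for <pre> code blocks, IN spans for inline <code>."""
        def __init__(self):
            super().__init__()
            self.chars, self.labels = [], []
            self.current = None  # None, "BL", or "IN"
            self.begin = False   # next character starts a new span

        def handle_starttag(self, tag, attrs):
            if tag == "pre":
                self.current, self.begin = "BL", True
            elif tag == "code" and self.current != "BL":
                self.current, self.begin = "IN", True

        def handle_endtag(self, tag):
            if tag in ("pre", "code"):
                self.current = None

        def handle_data(self, data):
            for ch in data:
                self.chars.append(ch)
                if self.current is None:
                    self.labels.append("O")
                elif self.begin:
                    self.labels.append("B-" + self.current)
                    self.begin = False
                else:
                    self.labels.append("I-" + self.current)

    # Usage: labeler = CharLabeler(); labeler.feed(post_html)
    # labeler.chars and labeler.labels are parallel per-character sequences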
Baseline feature set
Example context: ...wrong?¶try{... with the current position at "?"
Unigram: n, g, ?, ¶, t
Bigram: g?, ?¶
Trigram: g?¶
Fourgram: ng?¶, g?¶t
Fivegram: ng?¶t
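The extraction pattern can be read off the example: every unigram in a small window around the current position, plus the longer n-grams inside that window that cover the position. A sketch under that reading (the slide lists only a subset of the covering trigrams, so this generates a slight superset):

    def char_ngram_features(text, i, max_n=5, window=2):
        """Character n-gram features around position i: every unigram
        in a +/-window context, plus all 2..max_n-grams inside the
        context that include position i."""
        lo, hi = max(0, i - window), min(len(text), i + window + 1)
        feats = [f"uni={text[j]}" for j in range(lo, hi)]
        for n in range(2, max_n + 1):
            for start in range(lo, hi - n + 1):
                if start <= i < start + n:  # n-gram must cover position i
                    feats.append(f"{n}gram={text[start:start + n]}")
        return feats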
Augmented feature set
Baseline features
400-unit hidden layer activation
◮ For each of the 10 most active units
⋆ Is the activation > 0.5?
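One reading of this feature set, sketched below: for each of the 10 most active hidden units at the current position, emit a binary feature recording whether its activation exceeds 0.5. The feature-string format is an assumption.

    import numpy as np

    def activation_features(s_t, k=10, threshold=0.5):
        """Binary indicator features from hidden activation vector s_t:
        for each of the k most active units, report whether its
        activation exceeds the threshold."""
        top = np.argsort(-s_t)[:k]  # indices of the k largest activations
        return [f"unit{u}>{threshold}={s_t[u] > threshold}" for u in top]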
Data sets
Labeled
◮ Train: 1.2–10 million characters
◮ Test: 2 million characters
Unlabeled
◮ 465 million characters
Baseline F-score
[Figure: F1 as a function of labeled training set size in millions of characters (2–10); F1 axis spans 63–69]
Augmented
[Figure: F1 vs. labeled training set size in millions of characters (2–10), comparing the augmented feature set against the baseline]
Details (best model)
Label    Precision  Recall  F-1
All      83.6       59.1    69.2
Block    90.8       90.6    90.7
Inline   40.8       10.5    16.7
Sequence accuracy: 70.7%
Character accuracy: 95.2%
Conclusion
Simple Recurrent Networks learn abstract distributed representations useful for character-level prediction tasks.
Future work
◮ Alternative network architectures: Sutskever et al. 2011, dropout
◮ Distributed analog of bag-of-words
◮ Test on other tasks/datasets