Low-Resource Natural Language Processing


  1. Low-Resource Natural Language Processing. Behnam Sabeti, Sharif University of Technology, October 2019.

  2. Who am I? Behnam Sabeti, Ph.D. Candidate at Sharif University of Technology. Project Manager and NLP Expert at Miras Technologies International. Does all kinds of NLP stuff, especially on Persian. behnamsabeti

  3. NLP @ Miras
     • The focus of the Miras NLP team is developing text processing services for Persian: document classification, named entity recognition, sentiment analysis, emotion analysis, …
     • The challenge: data!

  4. Dataset sizes (documents):
     IMDB                  50 K
     SST                   10 K
     Sentiment140          160 K
     Amazon Product Data   142.8 M

  5. Problem?
     • Deep learning models are data hungry.
     • The Persian NLP community is not large.
     • We do not have enough public resources.
     • Funding is also limited, so we cannot afford to build huge resources either.

  6. Get More Data • Get Better Data • Use Related Data • Problem Modeling

  7. The strategies as tagged on the following slides: Get Better Data (Better) • Use Related Data (Related) • Problem Modeling (Modeling)

  8. Solutions (technique → case study):
     • Self-Supervision → Emotion Analysis
     • Weak Supervision → Document Classification
     • Transfer Learning → Named Entity Recognition
     • Multi-Task Learning → Satire Detection
     • Active Learning → Sentiment Analysis

  9. Modeling: Self-Supervision
     • Straightforward (document, label) modeling is not always your best choice.
     • Model your problem in a setting where labels are easy to acquire.
     • Self-supervision: the labels are already in your data, as in language modeling and word embeddings (see the sketch below).
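
  The self-supervision idea in miniature: a sketch (the sample sentence and the whitespace tokenization are illustrative, not from the talk) of turning raw text into (context, next-word) pairs for a language model, with no hand labeling at all.

    # Labels are "already in the data": every next word is a target.
    def lm_pairs(tokens, context_size=3):
        """Yield (context, target) pairs from a token sequence."""
        for i in range(context_size, len(tokens)):
            yield tokens[i - context_size:i], tokens[i]

    text = "we do not have enough labeled data for persian".split()
    for context, target in lm_pairs(text):
        print(context, "->", target)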

  10. Modeling case study: Emotion Analysis
     • An emoji is a good indicator of emotion.
     • Instead of manually labeling your data, use the emoji as the label: the dataset then needs no hand-labeling effort!
     • Emoji Prediction ⟹ Emotion Analysis (a sketch follows)
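
  A minimal sketch of the emoji trick; the emoji-to-emotion map and the sample tweets below are illustrative, not the ones used in the talk. The emoji is stripped from the text so the model cannot simply read the label back.

    EMOJI_TO_EMOTION = {
        "\U0001F602": "joy",      # face with tears of joy
        "\U0001F62D": "sadness",  # loudly crying face
        "\U0001F621": "anger",    # pouting face
    }

    def self_label(tweet):
        """Return (text without the emoji, emotion) or None if no mapped emoji."""
        for emoji, emotion in EMOJI_TO_EMOTION.items():
            if emoji in tweet:
                return tweet.replace(emoji, "").strip(), emotion
        return None  # skip tweets without a mapped emoji

    tweets = ["I really love autumn \U0001F602", "stuck in traffic again \U0001F621"]
    dataset = [pair for pair in map(self_label, tweets) if pair]
    print(dataset)  # [('I really love autumn', 'joy'), ('stuck in traffic again', 'anger')]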

  11. Image: medium.com/@bjarkefelbo/what-can-we-learn-from-emojis

  12. Modeling: the DeepMoji model
     • Predict the emoji.
     • Map the emoji to an emotion.
     Image: medium.com/huggingface/understanding-emotions-from-keras-to-pytorch

  13. Example tweet (Persian, translated): "I really love autumn! Such a lovely season! Come sooner..."

  14. Modeling: Weak Supervision
     • Provide noisy labels using a set of heuristics or domain knowledge (sketched below).
     • Use other, weaker classifiers.
     • Use constraints.
     • Data transformation: think of a transformation of your data that reduces the effort in the annotation process.
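
  A minimal sketch of heuristic labeling functions; the keyword rules are hypothetical examples. Each function returns a label or None (abstain), and the votes are combined by simple majority; frameworks such as Snorkel replace the majority vote with a learned label model.

    from collections import Counter

    def lf_positive_words(doc):
        return "positive" if any(w in doc for w in ("great", "excellent")) else None

    def lf_negative_words(doc):
        return "negative" if any(w in doc for w in ("awful", "terrible")) else None

    LABELING_FUNCTIONS = [lf_positive_words, lf_negative_words]

    def weak_label(doc):
        """Majority vote over the non-abstaining labeling functions."""
        votes = [v for v in (lf(doc) for lf in LABELING_FUNCTIONS) if v is not None]
        return Counter(votes).most_common(1)[0][0] if votes else None

    print(weak_label("the service was great"))  # positive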

  15. Modeling case study: Document Classification
     • Latent Dirichlet Allocation (LDA) is a generative model for topic modeling: it computes a set of topics, each a distribution over words, and the distribution of each document over those topics.
     • Instead of manually labeling documents, annotate topics!
     • With this transformation you can get a pretty good result by labeling just a handful of topics (see the sketch below).
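
  A minimal gensim sketch of the topic-labeling transformation: fit LDA on unlabeled documents, hand-label the topics, then tag each document with the label of its dominant topic. The tiny corpus, the topic count, and the topic-label mapping are all illustrative.

    from gensim import corpora, models

    documents = ["oil prices and bank credit risk rose sharply",
                 "the team won the final match of the season",
                 "central bank lending and inflation figures"]
    texts = [doc.split() for doc in documents]
    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(t) for t in texts]
    lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary)  # far more topics in practice

    # Inspect lda.print_topics() and annotate a handful of topics
    # instead of thousands of documents:
    TOPIC_LABELS = {0: "economy", 1: "sports"}  # illustrative labels

    def classify(bow):
        """Label a document by its most probable topic."""
        topic, _ = max(lda.get_document_topics(bow), key=lambda tp: tp[1])
        return TOPIC_LABELS.get(topic, "other")

    print([classify(bow) for bow in corpus])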

  16. Image: m-cacm.acm.org/magazines/2012/4/147361-probabilistic-topic-models

  17. Example document (Persian news text, translated): "Under current conditions, according to a report by the Economist Intelligence Unit, the biggest risks threatening Iran's economy are banking-sector risk and political risk. Asr-e Bank: the economic studies department of the Tehran Chamber of Commerce has analyzed the country-risk model in a report. According to this report, the country-risk model is a model designed by the Economist Intelligence Unit to measure and compare the credit risk of different countries. This interactive tool makes it possible to quantify the risk of financial transactions, including bank loans, trade finance, and investment in securities…"

  18. Related: Transfer Learning
     • Train on a source task for which you have enough data.
     • Fine-tune the trained model on a new target task for which only limited data is available.
     • The source and target tasks need to share characteristics:
       • Source: language modeling → Target: document classification
       • Source: emotion detection → Target: satire detection
       • Source: document classification → Target: sentiment analysis (document classification is a word-based task; sentiment analysis is a phrase-level, semantics-based task)
     • A minimal fine-tuning sketch follows this slide.
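
  A minimal PyTorch sketch of the recipe: reuse an encoder trained on a data-rich source task, attach a fresh head for the low-resource target task, and fine-tune. `SourceEncoder` and the checkpoint file name are hypothetical stand-ins, not the talk's actual model.

    import torch
    import torch.nn as nn

    class SourceEncoder(nn.Module):
        """Stand-in for whatever encoder the source task produced."""
        def __init__(self, vocab_size=10000, dim=128):
            super().__init__()
            self.embed = nn.EmbeddingBag(vocab_size, dim)
        def forward(self, token_ids):
            return self.embed(token_ids)

    encoder = SourceEncoder()
    # encoder.load_state_dict(torch.load("source_task.pt"))  # trained source weights

    for p in encoder.parameters():   # optionally freeze the shared layers
        p.requires_grad = False

    model = nn.Sequential(encoder, nn.Linear(128, 2))  # fresh head for the target task
    logits = model(torch.randint(0, 10000, (4, 20)))   # fine-tune on the small target set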

  19. Image: machinelearningmastery.com/transfer-learning-for-deep-learning

  20. Related: Pre-Trained Models
     • Train your own model on a source task, or use a pre-trained model.
     • Pre-trained models are a good choice because they are trained on HUGE datasets.
     • Pre-trained language models: BERT, GPT, XLNet, XLM, CTRL, … (loading one is sketched below)
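
  A minimal Hugging Face `transformers` sketch of starting from a pre-trained language model instead of training from scratch; the model name and label count are illustrative choices, not the talk's configuration.

    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    name = "bert-base-multilingual-cased"  # covers Persian among ~100 languages
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=3)

    batch = tokenizer(["an example sentence"], return_tensors="pt")
    outputs = model(**batch)  # then fine-tune this model on your small labeled set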

  21. Image: jalammar.github.io/illustrated-bert

  22. Image: medium.com/huggingface/introducing-fastbert-a-simple-deep-learning-library-for-bert-models

  23. Related case study: Named Entity Recognition
     • Target task: named entity recognition, i.e. extract locations, persons, organizations, events, and times from text.
     • Source: multilingual BERT model.
     • Data: 50K hand-labeled sentences with NER tags (setup sketched below).
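
  A minimal sketch of this setup: multilingual BERT with a token-classification head, to be fine-tuned on the hand-labeled sentences. The BIO tag set shown is illustrative, not necessarily the talk's exact scheme.

    from transformers import AutoTokenizer, AutoModelForTokenClassification

    TAGS = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]
    name = "bert-base-multilingual-cased"
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForTokenClassification.from_pretrained(name, num_labels=len(TAGS))
    # When fine-tuning, word-level NER tags must be aligned to BERT's subword
    # tokens (subwords after the first are typically masked out of the loss).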


  25. Related: Multi-Task Learning
     • Train multiple tasks together: more data, and synergistic effects during training.
     • Tasks: tweet reconstruction + emoji prediction + satire detection (general features, emotion features, satire features).
     • This entails a multi-objective loss function (see the sketch below).
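
  A minimal Keras sketch of the multi-task idea: a shared encoder with one output head per task and a weighted multi-objective loss. Layer sizes and loss weights are illustrative, and the sequence-to-sequence reconstruction head is omitted for brevity.

    from tensorflow import keras
    from tensorflow.keras import layers

    inputs = keras.Input(shape=(100,), dtype="int32")          # token ids
    shared = layers.Embedding(20000, 128)(inputs)              # shared features
    shared = layers.Bidirectional(layers.LSTM(64))(shared)

    emoji = layers.Dense(64, activation="softmax", name="emoji")(shared)
    satire = layers.Dense(1, activation="sigmoid", name="satire")(shared)

    model = keras.Model(inputs, [emoji, satire])
    model.compile(
        optimizer="adam",
        loss={"emoji": "sparse_categorical_crossentropy",
              "satire": "binary_crossentropy"},
        loss_weights={"emoji": 1.0, "satire": 1.0},  # the multi-objective loss
    )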

  26. Image: medium.com/manash-en-blog/multi-task-learning-in-keras-implementation-of-multi-task-classification-loss

  27. Related case study: Satire Detection
     • Satire dataset: 2K tweets
     • Emotion dataset: 300K tweets
     • Reconstruction tweets: as many as you have! (200M)


  29. Satire model performance (F1):
      Single task   55 %
      Multi task    68 %

  30. Better: Active Learning
     • How should we select samples for annotation?
     • Random: annotate as much as you can.
     • Smart: annotate "better" samples (a loop skeleton follows).
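
  A skeleton of pool-based active learning, assuming a scikit-learn-style classifier; `oracle_label` is a hypothetical stand-in for the human annotator.

    import numpy as np

    def active_learning_loop(model, X_lab, y_lab, X_pool, rounds=5, k=100):
        """Retrain, score the pool, send the top-k samples for annotation, repeat."""
        for _ in range(rounds):
            model.fit(X_lab, y_lab)                   # retrain on current labels
            probs = model.predict_proba(X_pool)
            uncertainty = 1.0 - probs.max(axis=1)     # least-confident score
            picked = np.argsort(uncertainty)[-k:]     # most uncertain samples
            y_new = oracle_label(X_pool[picked])      # hypothetical human annotator
            X_lab = np.vstack([X_lab, X_pool[picked]])
            y_lab = np.concatenate([y_lab, y_new])
            X_pool = np.delete(X_pool, picked, axis=0)
        return model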


  32. Image: www.datacamp.com/community/tutorials/active-learning

  33. Better: Active Learning
     • How to select samples for annotation: random vs. smart.
     • Smart strategies (scored on a worked example below):
       • Select samples the current model is least certain about (Least Confident, LC).
       • Select samples with a low margin between the top two category labels (Margin).
       • Select samples with the highest entropy (Entropy).
     • The goal: better performance with fewer labeled samples.
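
  A minimal NumPy sketch of the three scores, evaluated on the D1-D3 probability table from the next slides; the computed entropies reproduce the slide's values.

    import numpy as np

    P = np.array([[0.50, 0.05, 0.45],    # D1
                  [0.40, 0.30, 0.30],    # D2
                  [0.95, 0.05, 0.00]])   # D3

    least_confident = 1.0 - P.max(axis=1)                    # D2 scores highest
    top2 = np.sort(P, axis=1)[:, -2:]
    margin = top2[:, 1] - top2[:, 0]                         # D1 has the smallest margin
    entropy = -(P * np.log2(P.clip(min=1e-12))).sum(axis=1)  # 1.23, 1.57, 0.29 bits

    print(least_confident.argmax(), margin.argmin(), entropy.argmax())  # 1 0 1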

  34. Worked example: the current model's class probabilities for three unlabeled documents.

            Positive   Neutral   Negative
      D1    0.5        0.05      0.45
      D2    0.4        0.3       0.3
      D3    0.95       0.05      0

  36. Least Confident: select D2; its highest class probability (0.4) is the lowest of the three documents.

  38. Margin: select D1; the gap between its top two labels (0.5 − 0.45 = 0.05) is the smallest.

  39. Entropy(P) = −Σ_i p_i log₂ p_i

  40. Entropies: D1 = 1.23, D2 = 1.57, D3 = 0.29 bits.

  41. Entropy: select D2; its entropy (1.57 bits) is the highest.
