  1. Presenting TWITTIRÒ-UD: An Italian Twitter Treebank in Universal Dependencies
     Alessandra Teresa Cignarella (a,b), Cristina Bosco (b) and Paolo Rosso (a)
     a. Universitat Politècnica de València
     b. Università degli Studi di Torino

  2. Motivation
     1. Sentiment Analysis and Opinion Mining → irony, sarcasm, stance, hate speech, misogyny...
     2. Dealing with social media texts → hard!
     3. Syntax → Universal Dependencies are cool!

  3. Research Questions
     1. How can we automatically detect irony?
     2. Could syntactic information help in the detection of irony? ...and maybe in other detection tasks too?
     Our approach: let's build a corpus and find out!

  4. What is TWITTIRÒ-UD?
     An Italian Twitter treebank in Universal Dependencies, annotated for irony and sarcasm.

  5. Related Work
     Social media & Twitter:
     ● Tagging the Twitterverse (Foster et al., 2011)
     ● The French Social Media Bank (Seddah et al., 2012)
     ● TWEEBANK (Kong et al., 2014)
     ● TWEEBANK v2 (Liu et al., 2018)
     ● Arabic (Albogamy and Ramsay, 2017)
     ● African-American English (Blodgett et al., 2018)
     ● Hindi-English (Bhat et al., 2018)

  6. Related Work
     Two main references for our work:
     ● UD_Italian treebank (Simi et al., 2014)
     ● PoSTWITA-UD (Sanguinetti et al., 2018)

  7. Data
     ● 1,424 tweets from TWITTIRÒ (Cignarella et al., 2018)
     ● fine-grained irony annotation (Karoui et al., 2017), on two layers:
       irony type: 1. EXPLICIT  2. IMPLICIT
       irony category: 1. ANALOGY  2. EUPHEMISM  3. RHETORICAL QUESTION  4. OXYMORON or PARADOX  5. FALSE ASSERTION  6. CONTEXT SHIFT  7. HYPERBOLE or EXAGGERATION  8. OTHER
     ● sarcasm annotation (EVALITA 2018)

  8. Annotation
     # text = Presentato il nuovo iPhone. È già al 36% di batteria.
     # irony = EXPLICIT OXYMORON/PARADOX
     # sarcasm = 1
     Translation: The new iPhone has been launched. Battery is already at 36%.
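In CoNLL-U, these irony and sarcasm labels travel as sentence-level comment lines, so they can be recovered alongside the trees with a few lines of code. A minimal Python sketch, assuming a local file named twittiro.conllu with exactly the comment keys shown above:

    from collections import Counter

    def read_tweet_labels(path):
        """Yield (text, irony, sarcasm) for each sentence block in a CoNLL-U file."""
        meta = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line.startswith("#") and "=" in line:
                    key, _, value = line[1:].partition("=")
                    meta[key.strip()] = value.strip()
                elif not line and meta:  # blank line closes a sentence/tweet
                    yield meta.get("text"), meta.get("irony"), meta.get("sarcasm")
                    meta = {}

    # Example: distribution of irony labels among the sarcastic tweets.
    print(Counter(irony for _, irony, sarcasm
                  in read_tweet_labels("twittiro.conllu") if sarcasm == "1"))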

  9. Data
     With the tool UDPipe we performed tokenization, lemmatization, PoS-tagging and dependency parsing on all 1,424 tweets (17,933 tokens).
     Full release in the UD repository: November 2019
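This pre-annotation step can be reproduced with UDPipe's Python bindings (pip install ufal.udpipe). A sketch under the assumption that a pre-trained Italian UDPipe model is available locally; the model file name below is a placeholder:

    from ufal.udpipe import Model, Pipeline, ProcessingError

    # Load a pre-trained Italian model (placeholder file name).
    model = Model.load("italian-ud.udpipe")

    # Tokenize raw text, then tag and parse with the model's defaults,
    # emitting CoNLL-U.
    pipeline = Pipeline(model, "tokenize",
                        Pipeline.DEFAULT, Pipeline.DEFAULT, "conllu")
    error = ProcessingError()
    conllu = pipeline.process("Presentato il nuovo iPhone. "
                              "È già al 36% di batteria.", error)
    print(conllu)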

  10. Data
      1. Fine-grained annotation for irony
      2. Morpho-syntactic information

  11. Issues Encountered and Lessons Learned
      ● Tokenization errors caused by misspelled words (xkè → perché); see the tokenizer sketch after this list
      ● Punctuation used irregularly
      ● Twitter marks: #hashtags and @mentions
      ● No sentence splitting
      ● Single-root constraint
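Several of these issues concern tokenization. A toy sketch of a tweet-aware tokenizer that keeps #hashtags and @mentions whole and normalizes a few frequent misspellings; the lookup table is illustrative, not the project's actual normalization lexicon:

    import re

    # Illustrative misspelling lexicon; a real one would be corpus-derived.
    NORMALIZE = {"xkè": "perché", "xké": "perché", "cmq": "comunque"}

    # Match #hashtags and @mentions as single tokens before falling back
    # to word characters and individual punctuation marks.
    TOKEN = re.compile(r"#\w+|@\w+|\w+|[^\w\s]")

    def tweet_tokens(text):
        return [NORMALIZE.get(tok.lower(), tok) for tok in TOKEN.findall(text)]

    print(tweet_tokens("xkè l'iPhone è già al 36%?? #batteria @apple"))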

  12. Other Highlights
      ● Punctuation is exploited more extensively in the two social media datasets than in UD_Italian (a counting sketch follows below).
      ● Mentions and hashtags have a similar distribution in the two social media datasets.
      ● The use of the passive voice (aux:pass) is low in PoSTWITA-UD and TWITTIRÒ-UD, indicating a preference for the active voice, as in spoken language.
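Distributional observations like these can be checked by counting DEPREL values over each treebank. A rough sketch, assuming the three treebanks are available locally under the (hypothetical) file names below:

    from collections import Counter

    def deprel_freq(path):
        """Relative frequency of each dependency relation in a CoNLL-U file."""
        counts = Counter()
        with open(path, encoding="utf-8") as f:
            for line in f:
                cols = line.rstrip("\n").split("\t")
                # Regular token lines have 10 columns and an integer ID;
                # this skips comments, multiword tokens and empty nodes.
                if len(cols) == 10 and cols[0].isdigit():
                    counts[cols[7]] += 1
        total = sum(counts.values())
        return {rel: n / total for rel, n in counts.items()}

    for name in ("twittiro-ud.conllu", "postwita-ud.conllu", "ud_italian.conllu"):
        freq = deprel_freq(name)
        print(name, "punct:", round(freq.get("punct", 0.0), 3),
              "aux:pass:", round(freq.get("aux:pass", 0.0), 4))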

  13. A Parsing Experiment
      We evaluated UDPipe using the TWITTIRÒ-UD gold corpus as a test set, under three settings:
      1. training UDPipe on UD_Italian only
      2. training UDPipe on PoSTWITA-UD only
      3. training UDPipe on both resources (see the sketch below)
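For setting 3, the two training treebanks can simply be concatenated, since CoNLL-U separates sentences with blank lines; the merged file is then passed to the UDPipe trainer. A sketch with placeholder file names:

    # Merge UD_Italian and PoSTWITA-UD training data into one file
    # (file names are placeholders for the actual releases).
    with open("train-both.conllu", "w", encoding="utf-8") as out:
        for path in ("ud_italian-train.conllu", "postwita-ud-train.conllu"):
            with open(path, encoding="utf-8") as f:
                text = f.read()
                # Ensure each treebank ends with the blank line that
                # terminates its last sentence.
                out.write(text if text.endswith("\n\n") else text.rstrip("\n") + "\n\n")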

  14. A Parsing Experiment
      Results are in line with the state of the art (PoSTWITA-UD; Sanguinetti et al., 2018).
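Parser comparisons like this one are usually reported as UAS/LAS: the share of tokens receiving the correct head, and the correct head plus relation label. A minimal sketch of the computation, assuming gold and predicted CoNLL-U files with identical tokenization (the official CoNLL 2018 evaluation script additionally aligns diverging tokenizations):

    def parse_rows(path):
        """Collect (head, deprel) for every regular token line in a CoNLL-U file."""
        rows = []
        with open(path, encoding="utf-8") as f:
            for line in f:
                cols = line.rstrip("\n").split("\t")
                if len(cols) == 10 and cols[0].isdigit():
                    rows.append((cols[6], cols[7]))  # HEAD and DEPREL columns
        return rows

    def uas_las(gold_path, pred_path):
        gold, pred = parse_rows(gold_path), parse_rows(pred_path)
        assert len(gold) == len(pred), "assumes identical tokenization"
        uas = sum(g[0] == p[0] for g, p in zip(gold, pred)) / len(gold)
        las = sum(g == p for g, p in zip(gold, pred)) / len(gold)
        return uas, las

    # Placeholder file names.
    print(uas_las("twittiro-gold.conllu", "twittiro-pred.conllu"))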

  15. Conclusions
