

  1. Verbal grammars for weather bulletins in isiXhosa and isiZulu Generation and similarity Zola Mahlaza zmahlaza@cs.uct.ac.za Department of Computer Science University of Cape Town September SAICSIT ’17 Supervisor: Dr. C. Maria Keet

  2. Outline
  ◮ Field of study: brief summary.
  ◮ Identified problem.
  ◮ Current solution.
  ◮ Proposed improved solution.
  ◮ Research questions.
  ◮ Methodology.
  ◮ Results.
  ◮ Conclusion and final remarks.

  3. Background
  ◮ Natural language processing.
  ◮ Natural language understanding.
  ◮ Natural language generation: producing natural language texts from structured representations of data, information, or knowledge.
  Figure: An example of the input and output of an NLG system (Source: Arria NLG plc n.d.)

  4. Background
  ◮ Met Office (S. Sripada et al. 2014).
    ◮ Online trial ended 17 May 2016.
    ◮ Five-day weather forecast for 10,000 locations worldwide in under 2 minutes.
    ◮ Different climates & time zone changes.
    ◮ Based on the Arria NLG engine.
  ◮ Swiss Federal Institute for Snow and Avalanche Research (Winkler, Kuhn, and Volk 2014).
    ◮ Avalanche warnings.
    ◮ German, French, Italian, and English.
    ◮ Catalogue-based system.

  5. Background
  Table: NLG systems that have been developed to produce weather forecasts.

  System name                   | Establishing literature                      | Realisation method  | Languages                        | Year
  WMO-based and NATURAL         | Gkatzia, Lemon, and Rieser 2016              | SimpleNLG           | English                          | 2016
  CBR-METEO                     | Adeyanju 2015                                | String manipulation | English                          | 2015
  Winkler-Kuhn-Volk's system    | Winkler, Kuhn, and Volk 2014                 | Catalogued phrases  | German, French, Italian, English | 2014
  Zhang-Wu-Gao-Zhao-Lv's system | H. Zhang et al. 2011                         | Not implemented     | Chinese                          | 2011
  pCRU                          | Belz 2008                                    | Statistical methods | Possibly all                     | 2007
  SumTime-Mousam                | S. G. Sripada et al. 2002                    | "Grammar"           | English                          | 2003
  SumTime                       | S. G. Sripada et al. 2002                    | "Grammar"           | English                          | 2001
  Mitkov's system               | Mitkov 1991 (as cited by Sigurd et al. 1992) | -                   | -                                | 2001
  Autotext                      | -                                            | -                   | -                                | 2000
  MLWFA                         | Yao, D. Zhang, and Wang 2000                 | Grammar             | English, German, Chinese         | 2000
  Siren                         | -                                            | -                   | -                                | 2000
  Scribe                        | -                                            | -                   | -                                | 1999
  TREND                         | Boyd 1998                                    | FUF/SURGE           | English                          | 1998
  Multimeteo                    | -                                            | -                   | -                                | 1998
  ICWF                          | Ruth and Peroutka 1993                       | Grammar             | English                          | 1993
  IGEN                          | Rubinoff 1992                                | Grammar             | English                          | 1992
  Kerpedjiev's system           | Kerpedjiev 1992                              | Grammar             | English                          | 1992
  Weathra                       | Sigurd et al. 1992                           | Grammar             | English, Swedish                 | 1992
  FoG                           | Bourbeau et al. 1990                         | MTT models          | English, French                  | 1990
  MARWORDS                      | Goldberg, Kittredge, and Polguere 1988       | Grammar             | English, French                  | 1988
  RAREAS                        | Kittredge, Polguère, and Goldberg 1986       | -                   | English, French                  | 1986
  Glahn's system                | Glahn 1970                                   | Templates           | English                          | 1970

  6. Problem
  In our examination of the current state and use of the Nguni languages, we observed that there is no fast, large-scale producer, automated or otherwise, of weather summaries in these languages.

  7. Current reporting
  ◮ SABC TV station (SABC 1) daily report.
  ◮ IsiZulu/isiXhosa report at 19h00 South African Standard Time (SAST).
  ◮ IsiNdebele/siSwati report at 17h30 SAST.
  ◮ Nguni-language radio stations (e.g. Umhlobo Wenene [1], Ukhozi [2]).
  Figure: SABC weather report (Source: SABCNewsOnline)
  [1] http://www.umhlobowenenefm.co.za/
  [2] http://www.ukhozifm.co.za/

  8. Possible solution and challenges
  ◮ Four NLG systems.
  ◮ The languages are "verby" (Nurse 2008).
  ◮ Agglutinating morphology + concordial agreement system.
  ◮ zizakuhamba (they will walk/leave) → [zi][za][ku]hamb[a].
  Figure: Bantu verb structure (Source: Keet and Khumalo 2016).

  9. Possible solution and challenges
  ◮ Templates are incompatible (Keet and Khumalo 2014; Keet and Khumalo 2017).
  ◮ Grammars are the solution for realisation.
  ◮ Nguni languages S40: isiXhosa S41, isiZulu S42, siSwati S43, and isiNdebele S44 (Maho 1999).
  Figure: Example of a database table with South African domestic bus schedules (Adapted from Gyawali 2016, p.20).
  Figure: Example of a template for describing the bus schedules: "The bus [bus number] departing from [origin] reaches [destination] in [duration]." (Source: Gyawali 2016, p.20).
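To make the template incompatibility concrete, here is a toy Python sketch (the noun classes, concords, and the 'ya' present-tense marker are simplified stand-ins, not the full Nguni agreement system): a single fixed English template string fits every subject, while the Nguni verb form has to be assembled from slots whose fillers agree with the subject's noun class, so no fixed string can be stored in the template.

```python
# Toy subject-concord table: noun class -> concord (heavily simplified).
SUBJECT_CONCORD = {1: 'u', 2: 'ba', 9: 'i', 10: 'zi'}

def english_template(bus_number):
    # In English one fixed verb form works for every slot filler.
    return f"The bus {bus_number} departs"

def nguni_like_verb(noun_class, stem):
    # The verb prefix changes with the subject's noun class, so it
    # cannot be baked into a single template string.
    return SUBJECT_CONCORD[noun_class] + 'ya' + stem + 'a'

print(english_template('S41'))
print(nguni_like_verb(10, 'hamb'))  # 'ziyahamba'
print(nguni_like_verb(1, 'hamb'))   # 'uyahamba'
```

The point of the sketch is only that the verb's surface form is a function of grammatical features of its arguments, which is exactly what slot-filling templates cannot express.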

  10. Research questions
  ◮ How grammatically similar are isiZulu verbs with their isiXhosa counterparts?
  ◮ Can a single merged set of grammar rules be used to produce correct verbs for both languages?

  11. Methodology
  ◮ A corpus to determine the output text requirements (Dale and Reiter 2000).
  ◮ The weather corpus will be collected from the South African Weather Service (SAWS).
  ◮ Translated into isiXhosa by members of the School of African Languages and Literature at UCT.
  ◮ Incrementally develop grammar rules for isiZulu and isiXhosa through a literature-intensive approach.
  ◮ The quality of the rules will be evaluated using an expertise-oriented approach (Rovai 2003, p.117; Ross 2010, p.483).
  ◮ IsiXhosa and isiZulu compared through verb rule parse trees and 'language' space using binary similarity measures.

  12. Corpus development
  Directed to the Western Cape regional office:
  ◮ South African Weather Service (SAWS): no records.
  After further queries to the Tshwane office:
  ◮ SAWS: forecasts for the first day of each month in 2015 (Jan 2015 - Dec 2015).

  13. Corpus development
  ◮ Data cleaning (e.g. "The expected UVB sunburn index").
  ◮ Randomly sampled 48 sentences for translation from English to isiXhosa.
  ◮ Translated by the School of African Languages & Literature at UCT.
  "Lipholile kumkhwezo wonxweme apho kulindeleke izibhaxu zenkungu yakusasa ngaphaya kokoliyakuthi gqabagqaba ngamafu kwaye libeshushu okanye litshise kwaye libeneziphango ezithe saa emantla"
  ◮ 53 verbs, only 27 unique ('verb' here means the surface string, not the verb root).
  ◮ 22 indicative, 2 participial, 3 subjunctive.
  ◮ Near past, present, and near future.
  ◮ Simple, exclusive, and progressive.
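The "string, not root" counting above (53 verbs, 27 unique) collapses repeated surface forms; a minimal sketch of that bookkeeping, with toy stand-in strings rather than the actual corpus:

```python
from collections import Counter

# Toy surface-string list; in the study this would be the 53 verb
# strings extracted from the translated forecasts.
surface_verbs = ['kulindeleke', 'lipholile', 'litshise', 'lipholile']

counts = Counter(surface_verbs)
print(len(surface_verbs), len(counts))  # total vs unique: 4 3
```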

  14. CFG Development
  ◮ Increment 0: Prefix
    ◮ Gathering preliminary rules.
    ◮ Verb generation, correctness classification, and elimination of incorrect verbs.
  ◮ Increment 1: Prefix + Object Concord + Verb Root + Suffix - Final Vowel
    ◮ Suffix addition, verb generation, and correctness classification.
    ◮ Elimination of incorrect verbs, verb generation, and correctness classification.
  ◮ Increment 2: Complete verbs
    ◮ Investigate missing features, add missing features (where necessary), add final vowel, correctness classification.
    ◮ Elimination of incorrect verbs, verb generation, and correctness classification.

  15. CFG Development
  Indicative and participial (isiXhosa):
  ◮ Verb → NPC_2 A_pes OC VR S_p
  ◮ Verb → NPC_0 A_pes OC VR S_np
  Figure: Context-free grammar rules that generate isiXhosa past tense indicative and participial verbs.
  Indicative and participial (isiZulu):
  ◮ Verb → NPC_0 A_pes OC VR S_np
  ◮ Verb → NPC_2 A_pes OC VR S_p
  Figure: Context-free grammar rules that generate isiZulu past tense indicative and participial verbs.
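Each increment's generate-and-classify loop enumerates every string the current rules admit. A dependency-free sketch of that enumeration step (the slot fillers below are toy values, not the study's actual rule sets; the project itself used NLTK's CFG tooling for the real grammars):

```python
from itertools import product

# One toy "increment": a verb is the concatenation of one filler per slot.
slots = [
    ['zi', 'u'],        # subject concord (toy: two noun classes)
    ['ya', ''],         # long vs short present-tense marker
    ['hamb', 'fund'],   # verb roots: -hamb- 'walk/leave', -fund- 'learn'
    ['a'],              # final vowel
]

# Enumerate every candidate string; each then gets a correctness label
# in the classify-and-eliminate step described on the previous slide.
candidates = [''.join(parts) for parts in product(*slots)]
print(len(candidates))            # 2 * 2 * 2 * 1 = 8
print('ziyahamba' in candidates)  # True
```

Adding a slot (an increment) multiplies the candidate set, which is why the later increments generate tens of thousands of strings from a handful of concords and roots.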

  16. CFG IsiXhosa Quality
  Table: Number of correct and incorrect words generated using the third-increment isiXhosa grammar (indicative and participial mood). Correctness is divided into syntactic and semantic categories.

  Tense   | Category  | % correct | Correct | Incorrect | Total
  Past    | Syntax    | 97.4%     | 38      | 1         | 39
  Past    | Semantics | 51.3%     | 20      | 19        | 39
  Present | Syntax    | 80.0%     | 28      | 7         | 35
  Present | Semantics | 45.7%     | 16      | 19        | 35
  Future  | Syntax    | 98.6%     | 72      | 1         | 73
  Future  | Semantics | 53.4%     | 39      | 34        | 73

  17. CFG IsiZulu Quality
  Table: Number of correct and incorrect words generated using the third-increment isiZulu grammar (indicative and participial mood). Correctness is divided into syntactic and semantic categories.

  Tense   | Category  | % correct | Correct | Incorrect | Total
  Past    | Syntax    | 97.2%     | 35      | 1         | 36
  Past    | Semantics | 47.2%     | 17      | 19        | 36
  Present | Syntax    | 88.9%     | 16      | 2         | 18
  Present | Semantics | 55.6%     | 10      | 8         | 18
  Future  | Syntax    | 98.6%     | 72      | 1         | 73
  Future  | Semantics | 53.4%     | 39      | 34        | 73

  18. CFG Linguist Evaluation
  ◮ 2 linguists (UCT & UKZN).
  ◮ 25 isiZulu and isiXhosa verbs from an English-isiZulu dictionary (Doke et al. 1990).
  ◮ For the -zol- root, 5 pairs of subject and object concords were randomly selected.
  ◮ Generated 49,400 strings using the Natural Language Toolkit (NLTK), and sampled 100.
  ◮ Packaged 99 in a spreadsheet, and sent them to the linguists.
  ◮ Strings were not subjected to phonological conditioning.
  ◮ Task: True/False for syntactic correctness, True/False for semantic correctness, and add a comment.
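The sampling step above (49,400 generated strings down to 100 for evaluation) is plain sampling without replacement; a minimal sketch, with placeholder strings standing in for the real generated verbs:

```python
import random

# Placeholders for the 49,400 NLTK-generated strings.
generated = [f"verb{i}" for i in range(49400)]

random.seed(17)  # fixed seed so the sample is reproducible
sample = random.sample(generated, 100)  # sampling without replacement

print(len(sample), len(set(sample)))  # 100 100 (no duplicates)
```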

  19. CFG Linguist Evaluation
  Table: Summary of the linguists' semantic and syntactic correctness evaluation of the isiXhosa and isiZulu generated strings.

  Language | Category  | % correct | Correct | Incorrect | Total
  IsiXhosa | Syntax    | 52%       | 51      | 48        | 99
  IsiXhosa | Semantics | 58%       | 57      | 42        | 99
  IsiZulu  | Syntax    | 23%       | 16      | 57        | 71
  IsiZulu  | Semantics | 25%       | 17      | 52        | 69

  20. CFG Linguist Evaluation
  ◮ Significant statistical association between syntactic correctness and language (two-tailed p = 0.0001, Fisher's exact test).
  ◮ The same holds for semantic correctness and language (two-tailed p = 0.0023, Fisher's exact test).
  ◮ Some verb phrases were returned without a semantic correctness annotation.
  ◮ The updated values show a strong, statistically significant association between syntactic correctness and language (two-tailed p < 0.0001, Fisher's exact test).
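The Fisher's exact tests above run on 2x2 contingency tables built from the counts on the previous slide. A stdlib-only sketch of the two-tailed test (summing all hypergeometric tables no more probable than the observed one), applied to the syntactic-correctness counts as reported: isiXhosa 51 correct / 48 incorrect, isiZulu 16 correct / 57 incorrect:

```python
from math import comb

def fisher_exact_two_tailed(a, b, c, d):
    """Two-tailed Fisher's exact test on the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def p_table(x):
        # Hypergeometric probability of a table with x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = p_table(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum probabilities of all tables at least as extreme as the observed one.
    return sum(p_table(x) for x in range(lo, hi + 1)
               if p_table(x) <= p_obs * (1 + 1e-7))

p = fisher_exact_two_tailed(51, 48, 16, 57)
print(f"{p:.4f}")
```

With these counts the p-value lands in the ~0.0001 range reported on the slide; for real analyses one would normally reach for `scipy.stats.fisher_exact` instead of hand-rolling the test.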

  21. Similarity Questions and Methods
  Asking:
  ◮ How grammatically similar are isiZulu verbs with their isiXhosa counterparts?
  ◮ Can a single merged set of grammar rules be used to produce correct verbs for both languages?
  Answered by:
  ◮ Manual scanning.
  ◮ Parse tree analysis.
  ◮ Binary similarity measures.
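As one concrete instance of the binary similarity measures listed above, a Jaccard-style comparison over 0/1 feature vectors; the vectors below are hypothetical illustrations (each position marking whether a language's rules cover some verb feature), not the study's actual grammar-feature encoding:

```python
def jaccard(a, b):
    """Jaccard similarity of two equal-length 0/1 feature vectors."""
    both = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)    # shared features
    either = sum(1 for x, y in zip(a, b) if x == 1 or y == 1)   # features in either
    return both / either if either else 1.0

# Hypothetical coverage vectors for five verb features.
isixhosa = [1, 1, 0, 1, 1]
isizulu  = [1, 1, 1, 1, 0]
print(jaccard(isixhosa, isizulu))  # 3 shared / 5 total = 0.6
```

Other binary measures (e.g. simple matching, which also counts shared 0s) weight agreement differently, which is why comparing the languages under several measures is more informative than under one.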
