  1. India's Ancient Manuscripts: complexities and challenges for language informatics. Girish Nath Jha, Professor in Computational Linguistics, Special Center for Sanskrit Studies, and Professor & Concurrent Faculty, Center of Linguistics, J.N.U., New Delhi-67. Talk at Institute of Linguistics, Adam Mickiewicz University, Poznan, Poland, 12/9/2015.

  2. In this presentation: 1) Big Data, Language Informatics, Digital Humanities and the desirable goals for ancient manuscripts; 2) Levels of digitization and the complexity involved; 3) Standards, tools and technologies required; 4) Work done in India in general and at JNU; 5) Digitization and beyond; 6) Suggestions and conclusion

  3. Big data is here: Big data can get bigger in India; Language informatics; Opportunities for Indian languages; Less-resourced and fringe languages; Scheduled languages; Classical and heritage languages

  4. DH (Digital Humanities): Applying IT to the various sub-disciplines of the humanities; India is a curious case for DH research, as we have a multitude of languages, literatures, arts, traditions, etc.; All of these can potentially lead to big data and data-oriented informatics and intelligence

  5. Indian language families and % of speakers: Indo-Aryan 76.87%; Dravidian 20.82%; Austro-Asiatic 1.11%; Tibeto-Burman 1%; Andamanese* 0%

  6. Official languages and scripts of India

  7. Why Sanskrit? The language with the most heritage material; Predominantly Devanagari handwritten texts, but other scripts are also used (such as Odia, Maithili, Bangla, Grantha, Sharada, Brahmi and other major scripts); More than 30 million manuscripts waiting to be digitized; 95% estimated to be in the domain of Science & Technology

  8. Tasks at hand: Digitize manuscripts; Editing/limited processing; Enable search and cross-linking; Enable readability; Text processing; Translation; Research & Development; Promotion

  9. The problem: Definition of a Sanskrit or Indian manuscript? Sanskrit vs Indian manuscript; Geographical expansion (whole of South Asia, South East Asia, China, and other countries that are culturally related or where any mss/text/translation is found); Older rough count (30 million per David Pingree, 6.2 million per NMM); 67% or more in Sanskrit; Estimated loss (several hundred per week, per Dominik Wujastyk)

  10. National Manuscript Mission (NMM): Liberal definition of manuscript; Effort to collect copies of mss; Good survey of libraries in northern India (Orissa, Bihar and Uttar Pradesh); 35000 repositories; Cataloguing and microfilming; Online search for some; Training in ancient scripts

  11. NMM problems: Poor quality of catalogues; Missing manuscripts, incomplete folios, access issues; No work on creating technology and standards; Over-dependence on manpower; No retention of trained manpower

  12. Desirable goals: A right mix of human labor and computing technologies; Digitizing, archiving, search, cross-linking; Reading help, translation; Fundamental research, experimentation; Promotion (popular media, targeting younger readers, multilingual delivery, internationalization)

  13. Levels of digitization: Online/interactive catalogues with multilingual/multi-script search; Scanned images (e-books/download); Human-transcribed e-texts (e-books/downloadable); OCR-transcribed/human-edited e-texts (e-books/downloadable)
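
As an illustration of the multi-script search mentioned in slide 13, here is a minimal Python sketch. It is not the approach described in the talk. It relies on the fact that several Indic Unicode blocks share a largely parallel layout inherited from ISCII, so a fixed code-point offset maps most letters from one script onto another. The function names (`to_devanagari`, `search`) and the sample catalogue are hypothetical, and the mapping is only approximate: script-specific letters and historical scripts such as Grantha or Sharada would need proper transliteration tables.

```python
import unicodedata

# Start code points of a few Indic Unicode blocks (each is 0x80 wide).
# Assumption: offset-based mapping is "good enough" for a search key;
# it is not a linguistically complete transliteration.
BLOCK_START = {
    "devanagari": 0x0900,
    "bengali": 0x0980,
    "gujarati": 0x0A80,
    "oriya": 0x0B00,
}
BLOCK_SIZE = 0x80


def to_devanagari(text: str) -> str:
    """Map characters from the listed Indic blocks onto the Devanagari block."""
    out = []
    # Normalize first so equivalent character sequences compare equal.
    for ch in unicodedata.normalize("NFC", text):
        cp = ord(ch)
        for start in BLOCK_START.values():
            if start <= cp < start + BLOCK_SIZE:
                cp = BLOCK_START["devanagari"] + (cp - start)
                break
        out.append(chr(cp))
    return "".join(out)


def search(catalogue: list[str], query: str) -> list[str]:
    """Return catalogue titles whose script-neutral key contains the query's key."""
    key = to_devanagari(query)
    return [title for title in catalogue if key in to_devanagari(title)]


# Hypothetical catalogue: one title in Devanagari, one in Bengali script.
catalogue = ["\u0930\u093e\u092e\u093e\u092f\u0923", "\u0997\u09c0\u09a4\u09be"]
# Query typed in Gujarati script still finds the Devanagari title.
hits = search(catalogue, "\u0ab0\u0abe\u0aae")
```

Reducing every title and query to one script before comparison keeps the catalogue itself unchanged, so the original script of each record is preserved for display.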

  14. What is required? Standards for: uni-, bi- and multimodal data encoding; metadata, storage, search. Tools: data input/output mechanisms; editing, spelling & grammar checking; text readers; translation; e-learning/multimedia

  15. Standards for digital technologies: do we have one? How difficult it is to get standards in India; MSR initiative; Efforts under BIS; Sanskrit POS background; The ILCI corpora and the first national standard in POS

  16. Input mechanisms: With Unicode available for most of our major languages, texts can be entered; However, most of the heritage exists as handwritten manuscripts; Do we have a mechanism for it? The printed text recognition consortium under IIT Delhi

  17. Input mechanisms (continued): Oliver Hellwig's OCR; OLHWR; Consortium under I.I.Sc Bangalore; OLHWR (Hindi) for tablets; Microsoft Windows group (Redmond); How difficult it is; Resources needed

  18. Efforts at JNU - OLHWR: Microsoft consultancy; Ink collection; Hindi states of UP, Rajasthan, Delhi; 2 million ink samples; Lexical resources; System dictionary (basic wordlist, corpora of newspapers, literature, frequency-marked words, offensive words, NEs)

  19. Efforts at JNU - OLHWR (continued): Devanagari/Hindi model; Tablet PCs are no longer the focus at Microsoft, so further development is on hold; We can start from where MS left off and adapt it for Sanskrit

  20. Complexity of Sanskrit handwritten texts: Historical documents with scanty information on date/authorship; Physical condition of the manuscript; Quality of the writing in the manuscript; Can be in multiple languages and scripts; Have non-linguistic marks

  21. What if we develop the handwriting OCR for Sanskrit? Text readers; Searches; Inter-linking; Translation; Research, experimentation

  22. Next steps: Promotion; Multimedia content creation; Electronic media; Films, documentaries

  23. Work done at JNU: Critical edition, translation and publication of rare manuscripts; Digitizing rare manuscripts; Efforts to promote ancient scripts; Computer tools and resources for Sanskrit; Machine translation; E-learning/multimedia presentation of texts; Research on fundamental texts

  24. Tools: Text-to-speech for Sanskrit; NERs; Language analyzers and generators; Lexical resources, multimedia content; Corpora & standards; Emotion detection

  25. Machine Translation: English-Urdu MT released by Microsoft in Feb 2013; English-Sindhi is complete, to be released soon; SaHiT (JNU's Sanskrit Hindi Translator), a simple rule-based system for split prose, will be out this year; SHMT (Sanskrit consortium system): basic version is out; Sanskrit-English Translation (SETrans) being developed using the Microsoft Translator Hub platform; English to Gujarati, Maithili, Bengali being developed

  26. Lexical resources for interpretation: Koshas (Amara, Apte, Halayudha, Mankha, Medini, Nirukta, Nighantus, Ayurveda dictionary); Textual search (Vedas, Upanishads, Ayurveda, Mahabharata, Ramayana, Kalidasa)

  27. Corpora & Standards: ILCI consortium: parallel corpora in 17 Indian languages (including English), with Sanskrit going to be added; Tagged Sanskrit corpora and tagset (some of it already published by LDC, U Penn); LDC (Univ. of Pennsylvania): multimodal corpora in 8 languages for training security systems (Indian English, Hindi, Urdu, Bangla, Tamil, Malayalam, Pushto, Dari)
