India's Ancient Manuscripts - complexities and challenges for language informatics
Girish Nath Jha
Professor in Computational Linguistics, Special Center for Sanskrit Studies; Professor & Concurrent Faculty, Center of Linguistics, J.N.U., New Delhi-67
Talk at Institute of Linguistics, Adam Mickiewicz University, 12/9/2015, Poznan, Poland
In this presentation…
  1) Big Data, Language Informatics, Digital Humanities and the desirable goals for ancient manuscripts
  2) Levels of digitization and the complexity involved
  3) Standards, tools and technologies required
  4) Work done in India in general and at JNU
  5) Digitization and beyond
  6) Suggestions and conclusion
Big data is here
  Big data can get bigger in India
  Language informatics
  Opportunities for Indian languages
    Less resourced and fringe languages
    Scheduled languages
    Classical and heritage languages
DH – Digital Humanities
  Applying IT to the various sub-disciplines of the humanities
  India is a curious case for DH research, as we have a multitude of languages, literatures, arts, traditions, etc.
  All of these can potentially lead to big data and data-oriented informatics and intelligence
Indian Language Families and % Speakers
  Indo-Aryan: 76.87%
  Dravidian: 20.82%
  Austro-Asiatic: 1.11%
  Tibeto-Burman: 1%
  Andamanese*: 0%
Official languages and scripts of India
Why Sanskrit?
  The language with the most heritage material
  Predominantly handwritten Devanagari texts, but other scripts are also used (such as Odia, Maithili, Bangla, Grantha, Sharada, Brahmi and other major scripts)
  More than 30 million manuscripts waiting to be digitized
  95% estimated to be in the domain of Science & Technology
Tasks at hand
  Digitize manuscripts
  Editing/limited processing
  Enable search and cross-linking
  Enable readability
  Text processing
  Translation
  Research & Development
  Promotion
The problem
  Definition of Sanskrit or Indian manuscript?
    Sanskrit vs Indian manuscript
  Geographical expansion
    Whole of South Asia, South East Asia, China, other countries culturally related or where any mss/text/translation is found
  Older
  Rough count: 30 million (David Pingree), 6.2 million (NMM)
    67% or more in Sanskrit
  Estimated loss: several hundred per week (Dominik Wujastyk)
National Manuscript Mission (NMM)
  Liberal definition of manuscript
  Effort to collect copies of mss
  Good survey of libraries in northern India (Orissa, Bihar and Uttar Pradesh)
  35000 repositories
  Cataloguing and microfilming
  Online search for some
  Training in ancient scripts
NMM - problems
  Poor quality of catalogues, missing manuscripts, incomplete folios, access issues
  No work on creating technology and standards
  Over-dependence on manpower
  No retention of trained manpower
Desirable goals…
  A right mix of human labor and computing technologies
  Digitizing, archiving, search, cross-linking
  Reading help, translation
  Fundamental research, experimentation
  Promotion (popular media, targeting younger readers, multilingual delivery, internationalization)
Levels of digitization
  Online/interactive catalogues with multilingual/multi-script search
  Scanned images (e-books/downloadable)
  Human transcribed e-texts (e-books/downloadable)
  OCR transcribed/human edited e-texts (e-books/downloadable)
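The multi-script search mentioned for online catalogues can be sketched roughly as follows. This is only an illustration, not a description of any existing catalogue system: it assumes that folding text into Devanagari by a fixed code point offset is acceptable, which relies on the major Indic Unicode blocks having largely parallel layouts. The names `to_devanagari` and `search_catalogue` are hypothetical, and the mapping is approximate (some letters have no counterpart at the same position), so a production system would need proper transliteration tables.

```python
# Sketch only: the block start values are Unicode facts, but mapping by a
# fixed offset is an approximation chosen here for illustration.
DEVANAGARI_START = 0x0900
BLOCK_SIZE = 0x80

# Start code points of some Indic Unicode blocks (each 0x80 code points wide).
INDIC_BLOCK_STARTS = {
    "Devanagari": 0x0900,
    "Bengali": 0x0980,
    "Gurmukhi": 0x0A00,
    "Gujarati": 0x0A80,
    "Oriya": 0x0B00,
    "Tamil": 0x0B80,
    "Telugu": 0x0C00,
    "Kannada": 0x0C80,
    "Malayalam": 0x0D00,
}


def to_devanagari(text: str) -> str:
    """Fold characters from the listed Indic blocks into the Devanagari block."""
    out = []
    for ch in text:
        cp = ord(ch)
        for start in INDIC_BLOCK_STARTS.values():
            if start <= cp < start + BLOCK_SIZE:
                cp = DEVANAGARI_START + (cp - start)
                break
        out.append(chr(cp))
    return "".join(out)


def search_catalogue(query: str, titles: list) -> list:
    """Return the titles that contain the query once both are script-folded."""
    key = to_devanagari(query)
    return [t for t in titles if key in to_devanagari(t)]


if __name__ == "__main__":
    bengali_title = "\u09b0\u09be\u09ae\u09be\u09af\u09bc\u09a3"  # Ramayana in Bangla script
    query = "\u0930\u093e\u092e"  # "rama" typed in Devanagari
    print(search_catalogue(query, [bengali_title, "Meghaduta"]))
```

A query typed in one script then matches catalogue entries stored in another, at the cost of occasional mismatches where the blocks diverge.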
What is required?
  Standards for
    Uni-, bi- and multimodal data encoding
    Metadata, storage, search
  Tools
    Data input/output mechanisms
    Editing, spelling & grammar checking
    Text readers
    Translation
    E-learning/multimedia
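To make the metadata requirement concrete, a catalogue record could be modelled along the following lines. This is a minimal sketch under stated assumptions: the class name `ManuscriptRecord`, every field name, and the sample values are hypothetical and do not come from NMM or any published standard. The point is only that a fixed, serializable schema is what makes storage and search interoperable.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class ManuscriptRecord:
    """Hypothetical metadata schema for one manuscript (illustrative only)."""
    record_id: str
    title: str
    languages: List[str]
    scripts: List[str]
    repository: str
    digitization_level: str        # e.g. "catalogue", "scanned", "transcribed"
    author: Optional[str] = None   # often unknown for historical documents
    date: Optional[str] = None     # free text, since dating is often uncertain
    incomplete_folios: bool = False


def to_json(record: ManuscriptRecord) -> str:
    """Serialize a record to UTF-8 friendly JSON with stable key order."""
    return json.dumps(asdict(record), ensure_ascii=False, sort_keys=True)


def from_json(text: str) -> ManuscriptRecord:
    """Rebuild a record from the JSON produced by to_json."""
    return ManuscriptRecord(**json.loads(text))


if __name__ == "__main__":
    rec = ManuscriptRecord(
        record_id="demo-0001",            # made-up identifier
        title="Untitled grammar fragment",
        languages=["Sanskrit"],
        scripts=["Devanagari"],
        repository="Example repository",  # placeholder, not a real holding
        digitization_level="scanned",
    )
    print(to_json(rec))
```

Keeping author and date optional reflects the scanty information usually available for such documents.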
Standards for digital technologies: do we have one?
  How difficult it is to get standards in India
  MSR initiative
  Efforts under BIS
  Sanskrit POS background
  The ILCI corpora and the first national standard in POS
Input mechanisms
  With Unicode in most of our major languages, texts can be entered
  However, most of the heritage exists as handwritten manuscripts
  Do we have a mechanism for it?
  The printed text recognition consortium under IIT Delhi
Input mechanisms…
  Oliver Hellwig’s OCR
  OLHWR Consortium under I.I.Sc Bangalore
  OLHWR (Hindi) for tablets
    Microsoft Windows group (Redmond)
  How difficult it is
  Resources needed
Efforts at JNU - OLHWR
  Microsoft consultancy
  Ink collection
    Hindi states of UP, Rajasthan, Delhi
    2 million ink samples
  Lexical resources
    System dictionary (basic wordlist, corpora of newspapers, literature, frequency marked words, offensive words, NEs)
Efforts at JNU – OLHWR…
  Devanagari/Hindi model
  Tablet PCs are no longer the focus at Microsoft, so further development is on hold
  We can start from where MS left off and adapt it for Sanskrit
Complexity of Sanskrit handwritten texts
  Historical documents with scanty information on date/authorship
  Physical condition of the manuscript
  Quality of the writing in the manuscript
  Can be in multiple languages and scripts
  Can have non-linguistic marks
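Once a manuscript has been transcribed into Unicode, the "multiple languages and scripts" problem can at least be detected automatically. The sketch below is a rough heuristic, not an established method: it takes the first word of each character's Unicode name as its script, because Python's standard library does not expose the Unicode Script property directly. The function names `script_of` and `script_counts` are hypothetical.

```python
import unicodedata
from collections import Counter


def script_of(ch: str) -> str:
    """Guess a character's script from the first word of its Unicode name.

    Heuristic only: it works for names such as 'DEVANAGARI LETTER KA' but
    would mislabel scripts whose names have several words.
    """
    is_mark = unicodedata.category(ch).startswith("M")
    if not (ch.isalpha() or is_mark):
        return "OTHER"  # spaces, digits, punctuation, non-linguistic marks
    try:
        return unicodedata.name(ch).split()[0]
    except ValueError:  # unnamed or unassigned code point
        return "OTHER"


def script_counts(text: str) -> Counter:
    """Count letters and combining marks per script, ignoring everything else."""
    counts = Counter(script_of(ch) for ch in text)
    counts.pop("OTHER", None)
    return counts


if __name__ == "__main__":
    # Devanagari "rama", Latin "rama", Bangla "rama" in one line of text.
    mixed = "\u0930\u093e\u092e rama \u09b0\u09be\u09ae"
    print(script_counts(mixed))
```

A transcription whose counts show more than one script can then be routed to the right reader, font or OCR model.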
What if we develop the handwriting OCR for Sanskrit?
  Text readers
  Searches
  Inter-linking
  Translation
  Research, experimentation
Next Steps…
  Promotion
    Multimedia content creation
    Electronic media
    Films, documentaries
Work done at JNU
  Critical edition, translation and publication of rare manuscripts
  Digitization of rare manuscripts
  Efforts to promote ancient scripts
  Computer tools and resources for Sanskrit
  Machine Translation
  E-learning/multimedia presentation of texts
  Research on fundamental texts
Tools…
  Text To Speech for Sanskrit
  NERs
  Language analyzers and generators
  Lexical resources
  Multimedia content
  Corpora & Standards
  Emotion detection
Machine Translation
  English-Urdu MT released by Microsoft in Feb 2013
  English-Sindhi is complete, to be released soon
  SaHiT (JNU’s rule-based Sanskrit Hindi Translator): a simple rule-based system for split prose, will be out this year
  SHMT (Sanskrit consortium system): basic version is out
  Sanskrit-English Translation (SETrans) being developed using the Microsoft Translator Hub platform
  English to Gujarati, Maithili, Bengali being developed
Lexical Resources for interpretation
  Koshas: Amara, Apte, Halayudha, Mankha, Medini, Nirukta, Nighantus, Ayurveda dictionary
  Textual search: Vedas, Upanishadas, Ayurveda, Mahabharata, Ramayana, Kalidasa
Corpora & Standards
  ILCI consortium: parallel corpora in 17 Indian languages (including English). Sanskrit is going to be added
  Tagged Sanskrit corpora, tagset (some of it already published by LDC, U Penn)
  LDC (Univ. of Pennsylvania): 8 languages
    Multimodal corpora for training security systems (Indian English, Hindi, Urdu, Bangla, Tamil, Malayalam, Pushto, Dari)