India's Ancient Manuscripts - complexities and challenges for language informatics
Girish Nath Jha
Professor in Computational Linguistics, Special Center for Sanskrit Studies
Professor & Concurrent Faculty, Center of Linguistics
J.N.U., New Delhi-67
Talk at Institute of Linguistics, Adam Mickiewicz University, Poznan, Poland, 12/9/2015

In this presentation…
1) Big Data, Language Informatics, Digital Humanities and the desirable goals for ancient manuscripts
2) Levels of digitization and the complexity involved
3) Standards, tools and technologies required
4) Work done in India in general and at JNU
5) Digitization and beyond
6) Suggestions and conclusion

Big data is here
- Big data can get bigger in India
- Language informatics
- Opportunities for Indian languages
- Less resourced and fringe languages
- Scheduled languages
- Classical and heritage languages

DH – Digital Humanities
- Applying IT to various sub-disciplines of the humanities
- India is a curious case for DH research, as we have a multitude of languages, literatures, arts, traditions etc.
- All of these can potentially lead to big data and data-oriented informatics and intelligence

Indian Language Families and % Speakers
- Indo-Aryan: 76.87%
- Dravidian: 20.82%
- Austro-Asiatic: 1.11%
- Tibeto-Burman: 1%
- Andamanese*: 0%

Official languages and scripts of India

Why Sanskrit?
- The language with the most heritage material
- Predominantly handwritten Devanagari texts, but other scripts are also used (such as Odia, Maithili, Bangla, Grantha, Sharada, Brahmi and other major scripts)
- More than 30 million manuscripts waiting to be digitized
- 95% estimated to be in the domain of Science & Technology

Tasks at hand
- Digitize manuscripts
- Editing/limited processing
- Enable search and cross linking
- Enable readability
- Text processing
- Translation
- Research & Development
- Promotion

The problem
Definition of Sanskrit or Indian manuscript?
- Sanskrit vs Indian manuscript
- Geographical expansion
  - Whole of South Asia, South East Asia, China, and other countries that are culturally related or where any mss/text/translation is found
- Older rough count: 30 million (David Pingree), 6.2 million (NMM)
- 67% or more in Sanskrit
- Estimated loss: several hundred per week (Dominik Wujastyk)

National Manuscript Mission (NMM)
- Liberal definition of manuscript
- Effort to collect copies of mss
- Good survey of libraries in northern India (Orissa, Bihar and Uttar Pradesh)
- 35000 repositories
- Cataloguing and microfilming
- Online search for some
- Training in ancient scripts

NMM - problems
- Poor quality of catalogues
- Missing manuscripts, incomplete folios, access issues
- No work on creating technology and standards
- Over-dependence on manpower
- No retention of trained manpower

Desirable goals…
- A right mix of human labor and computing technologies
- Digitizing, archiving, search, cross linking
- Reading help, translation
- Fundamental research, experimentation
- Promotion (popular media, targeting younger readers, multilingual delivery, internationalization)

Levels of digitization
- Online/interactive catalogues with multilingual/multi-script search
- Scanned images: e-books/downloadable
- Human transcribed e-texts: e-books/downloadable
- OCR transcribed/human edited e-texts: e-books/downloadable
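A small illustration of the catalogue-search level above: the same Devanagari word can be stored as different Unicode code point sequences (for example, a precomposed nukta letter versus base letter plus nukta), so a catalogue search usually needs to normalize both the stored titles and the query. The sketch below is only a minimal example under that assumption; the function names and the catalogue entries are hypothetical and are not taken from any system described in the talk.

```python
import unicodedata


def normalize_for_search(text: str) -> str:
    """Return a canonical form so that equivalent encodings compare equal."""
    # NFC canonical normalization, then case folding for Latin-script titles.
    return unicodedata.normalize("NFC", text).casefold()


def search_catalogue(titles, query):
    """Return the titles that contain the query after normalization."""
    needle = normalize_for_search(query)
    return [t for t in titles if needle in normalize_for_search(t)]


# Hypothetical catalogue entries, for illustration only.
# The first title is stored as base letter KA (U+0915) + nukta (U+093C).
titles = ["\u0915\u093c\u093f\u0932\u093e", "Amarakosha"]

# The query uses the precomposed letter QA (U+0958) instead.
query = "\u0958\u093f\u0932\u093e"

matches = search_catalogue(titles, query)
```

Without the normalization step, a plain substring comparison between the query and the first title would fail even though both render identically on screen.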

What is required?
- Standards for
  - Uni, bi and multimodal data encoding
  - Metadata, storage, search
- Tools
  - Data input/output mechanisms
  - Editing, spelling & grammar checking
  - Text readers
  - Translation
  - E-learning/multimedia

Standards for Digital technologies
- Do we have one?
- How difficult it is to get standards in India
- MSR initiative
- Efforts under BIS
- Sanskrit POS background
- The ILCI corpora and the first national standard in POS

Input mechanisms
- With Unicode available for most of our major languages, texts can be entered
- However, most of the heritage exists as handwritten manuscripts
- Do we have a mechanism for it?
- The printed text recognition consortium under IIT Delhi

Input mechanisms…
- Oliver Hellwig’s OCR
- OLHWR: consortium under I.I.Sc Bangalore
- OLHWR (Hindi) for tablets: Microsoft Windows group (Redmond)
- How difficult it is
- Resources needed

Efforts at JNU - OLHWR
- Microsoft consultancy
- Ink collection
  - Hindi states of UP, Rajasthan, Delhi
  - 2 million ink samples
- Lexical resources
  - System dictionary (basic wordlist, corpora of newspapers, literature, frequency marked words, offensive words, NEs)

Efforts at JNU – OLHWR…
- Devanagari/Hindi model
- Tablet PCs are no longer the focus at Microsoft, so further development is on hold
- We can start from where MS left off and adapt it for Sanskrit

Complexity of Sanskrit handwritten texts
- Historical documents with scanty information on date/authorship
- Physical condition of the manuscript
- Quality of the writing in the manuscript
- Can be in multiple languages and scripts
- Can have non-linguistic marks

What if we develop the handwriting OCR for Sanskrit?
- Text readers
- Searches
- Inter-linking
- Translation
- Research, experimentation

Next Steps…
- Promotion
- Multimedia content creation
- Electronic media
- Films, documentaries

Work done at JNU
- Critical edition, translation and publication of rare manuscripts
- Digitization of rare manuscripts
- Efforts to promote ancient scripts
- Computer tools and resources for Sanskrit
- Machine Translation
- E-learning/multimedia presentation of texts
- Research on fundamental texts

Tools…
- Text To Speech for Sanskrit
- NERs
- Language analyzers and generators
- Lexical resources, multimedia content
- Corpora & standards
- Emotion detection

Machine Translation
- English-Urdu MT released by Microsoft in Feb 2013
- English-Sindhi is complete and is to be released soon
- SaHiT (JNU’s rule-based Sanskrit Hindi Translator), a simple rule-based system for split prose, will be out this year
- SHMT (Sanskrit consortium system): basic version is out
- Sanskrit-English Translation (SETrans) is being developed using the Microsoft Translator Hub platform
- English to Gujarati, Maithili and Bengali systems are being developed

Lexical Resources for interpretation
- Koshas: Amara, Apte, Halayudha, Mankha, Medini, Nirukta, Nighantus, Ayurveda dictionary
- Textual search: Vedas, Upanishadas, Ayurveda, Mahabharata, Ramayana, Kalidasa

Corpora & Standards
- ILCI consortium: parallel corpora in 17 Indian languages (including English). Sanskrit is going to be added
- Tagged Sanskrit corpora and tagset (some of it already published by LDC, U Penn)
- LDC (Univ. of Pennsylvania): multimodal corpora in 8 languages for training security systems (Indian English, Hindi, Urdu, Bangla, Tamil, Malayalam, Pushto, Dari)
