Creating a Handwriting Recognition Corpus for Bushman Languages
Kyle Williams and Hussein Suleman

BUSHMAN PEOPLE
● Bushman people of Southern Africa
● Earliest inhabitants of Earth
● Unique view of the world
● No living speakers of many Bushman languages
Digital Libraries Laboratory, University of Cape Town

BLEEK AND LLOYD COLLECTION
● Collection contains notebooks, art and dictionaries
● Bushman culture encoded in metaphorical stories
● Preserving this collection → preserving Bushman culture

BLEEK AND LLOYD COLLECTION

BLEEK AND LLOYD COLLECTION
● Systems for preserving and viewing the collection already exist
● Next step involves enhancing use
● Make text searchable
● Index text
● Reprint text in books
● Text-to-speech
● Need a corpus of transcriptions

BUSHMAN TEXT
● Text contains complex diacritics
● Stacked above and below characters
● Span multiple characters

BUSHMAN TEXT
● Diacritics cannot be represented using Unicode
● No one left who speaks the |xam language!
● Over 137 different diacritics (more still being found)

ENCODING
● Bushman text cannot be encoded using Unicode
● LaTeX IPA package contains diacritics
● Allows for custom macros to be created
● Stacked, nested, multiple characters
● Example macros (rendered glyphs not preserved): \uline{a}, \xbelow{\uline{a}}, \xbelow{aa}
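The nesting in macros such as \xbelow{\uline{a}} can be made explicit with a small parser. This is only a sketch under stated assumptions: the macro names \uline and \xbelow come from the slide, while the function names `parse` and `parse_transcription`, the tuple-based tree, and the rule that every macro takes exactly one braced argument are hypothetical and not taken from the authors' tool.

```python
def parse(text, pos=0):
    """Parse from pos up to a closing brace or end of text.

    Returns (nodes, next_pos). A node is either a plain character or a
    (macro_name, children) tuple. Assumes one braced argument per macro.
    """
    nodes = []
    while pos < len(text) and text[pos] != "}":
        if text[pos] == "\\":
            # Read the macro name: letters following the backslash.
            end = pos + 1
            while end < len(text) and text[end].isalpha():
                end += 1
            name = text[pos + 1:end]
            if end >= len(text) or text[end] != "{":
                raise ValueError(f"macro \\{name} has no braced argument")
            # Recurse into the argument, which may itself contain macros.
            children, pos = parse(text, end + 1)
            if pos >= len(text) or text[pos] != "}":
                raise ValueError(f"macro \\{name} is missing a closing brace")
            pos += 1
            nodes.append((name, children))
        else:
            nodes.append(text[pos])
            pos += 1
    return nodes, pos


def parse_transcription(text):
    """Parse a whole transcription string into a list of nodes."""
    nodes, pos = parse(text)
    if pos != len(text):
        raise ValueError("unmatched closing brace")
    return nodes
```

With this representation, a diacritic that spans several characters, as in \xbelow{aa}, is simply a macro node with more than one child, and a stacked diacritic is a macro node whose child is another macro node.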

ENCODING

XÒÄ'XÒÄ - “TO WRITE”
● An AJAX tool to create a Bushman corpus
● Automatic algorithms
● User input
● Preprocessing
● Line and word segmentation
● Transcription
● Job and user management

TEXT SELECTION

LINE SEGMENTATION
● Projection profile-based line segmentation
● Count foreground-background transitions for each row
● Minima suggest space between lines
● Minima could also represent space between base characters and diacritics
● Gaussian smoothing of projection profile
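The line segmentation steps can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes the page is already binarised into a list of rows of 0/1 values, and the function names, the default `sigma`, and the kernel radius of three standard deviations are all assumptions.

```python
import math


def transition_profile(rows):
    """Count foreground/background transitions along each row of a binary image."""
    return [sum(a != b for a, b in zip(row, row[1:])) for row in rows]


def gaussian_smooth(profile, sigma=2.0):
    """Smooth a 1-D profile with a Gaussian kernel (zero padding at the ends)."""
    radius = int(3 * sigma)
    kernel = [math.exp(-(d * d) / (2 * sigma * sigma))
              for d in range(-radius, radius + 1)]
    total = sum(kernel)
    smoothed = []
    for i in range(len(profile)):
        acc = 0.0
        for k, weight in enumerate(kernel):
            j = i + k - radius
            if 0 <= j < len(profile):
                acc += profile[j] * weight
        smoothed.append(acc / total)
    return smoothed


def line_gaps(rows, sigma=2.0):
    """Return row indices that are local minima of the smoothed profile.

    Each minimum is a candidate boundary between two text lines.
    """
    s = gaussian_smooth(transition_profile(rows), sigma)
    return [i for i in range(1, len(s) - 1)
            if s[i] < s[i - 1] and s[i] <= s[i + 1]]
```

The smoothing step is what keeps the narrow gap between a base character and its diacritic from being reported as a line boundary: a larger `sigma` merges such small dips into the surrounding text line, at the cost of possibly merging closely spaced lines.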

LINE SEGMENTATION

WORD SEGMENTATION
● Line slant is automatically corrected
● Connected components in text lines are identified
● Distances between adjacent components are calculated
● Distances above threshold separate words
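A rough sketch of the component-distance idea follows. It is not the authors' code: it skips slant correction entirely, assumes the text line is a binary grid (list of rows of truthy/falsy values), uses 8-connectivity, and measures distance as the number of blank columns between horizontal extents. The names `component_spans`, `segment_words` and `gap_threshold` are hypothetical.

```python
def component_spans(line):
    """Find 8-connected foreground components; return sorted (left, right) column spans."""
    height, width = len(line), len(line[0])
    seen = [[False] * width for _ in range(height)]
    spans = []
    for r in range(height):
        for c in range(width):
            if line[r][c] and not seen[r][c]:
                # Flood-fill one component, tracking its horizontal extent.
                seen[r][c] = True
                stack, left, right = [(r, c)], c, c
                while stack:
                    y, x = stack.pop()
                    left, right = min(left, x), max(right, x)
                    for ny in (y - 1, y, y + 1):
                        for nx in (x - 1, x, x + 1):
                            if (0 <= ny < height and 0 <= nx < width
                                    and line[ny][nx] and not seen[ny][nx]):
                                seen[ny][nx] = True
                                stack.append((ny, nx))
                spans.append((left, right))
    return sorted(spans)


def segment_words(line, gap_threshold):
    """Group components into words; a gap wider than gap_threshold starts a new word."""
    words = []
    for left, right in component_spans(line):
        if words and left - words[-1][1] - 1 <= gap_threshold:
            # Close enough (or overlapping, e.g. a diacritic): extend the current word.
            words[-1] = (words[-1][0], max(words[-1][1], right))
        else:
            words.append((left, right))
    return words
```

For example, with `gap_threshold=2` a row such as `##.#.....##.#.` yields two words, because only the five-column gap in the middle exceeds the threshold.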

WORD SEGMENTATION

TRANSCRIPTION

CORPUS CREATION WORKSHOPS
● Workshop held to create Bushman corpus
● 29 data capturers recruited
● 900 pages from 2 authors randomly selected
● 729 pages were segmented into lines and words
● 1547 text lines were transcribed
● 452 text lines could not be transcribed
● Reasons: characters unsupported by the interface, noise, English text

CORPUS CREATION WORKSHOPS
● Quality and efficiency of data capturers evaluated
● 5 data capturers asked to return
● 1700 more lines transcribed
● More efficient and potentially fewer errors

CORPUS CREATION WORKSHOP

USER CONTRIBUTIONS

DATA QUALITY
● Quality represented by accuracy and correctness of transcriptions
● Useful in planning for follow-on workshops
● Random transcriptions by each user evaluated by a research assistant
● Errors counted: wrong diacritics, wrong characters, etc.
● Average of 0.48 errors per text line
● Acceptable for lay persons?

EFFICIENCY VS QUALITY

CONCLUSIONS
● Creation of corpora for historical texts is often difficult due to complexities of script
● Semi-automatic tool allowed for more efficient and less expensive creation of corpus
● Currently being used in handwriting recognition study
● Applicable to other historical collections

THANK YOU
Questions?
