MT Development Experience of Vietnam

MT Development Experience of Vietnam
VU Tat Thang, Ph.D.
Institute of Information Technology
Vietnamese Academy of Science and Technology
vtthang@ioit.ac.vn
NAC, April 1st, 2013

Thang VU
- 2002 ~ 2005: IOIT, Vietnam. Speech processing problems:
  - Hybrid ANN/HMM model for speech recognition
  - HMM-based approach for Vietnamese LVCSR
  - Fujisaki model in Vietnamese synthesis
- 2005 ~ 2008: JAIST, Japan
  - Restoration of bone-conducted speech
- 2008: ATR SLC, Japan
- 2008 ~ 2010: NICT SLC, Japan. Speech translation problem:
  - Vietnamese LVCSR, tone recognition
  - HMM-based Vietnamese speech synthesis
  - Machine translation (Vietnamese-English)
- 2010 ~: IOIT, Vietnam
  - Research and development in national projects:
    - 2007-2010: VLSP (Vietnamese language and speech processing)
    - 2011-2014: S2S (English-Vietnamese and Vietnamese-English speech translation in a specific domain)

Outline
- Vietnamese Language
- Some Results in MT from Vietnam
- Experience with VLSP Project
- Experience with S2S Project

54 ethnic groups in Vietnam
Language groups:
- Mon-Khmer
- Tay-Thai
- Tibeto-Burman
- Malayo-Polynesian
- Kadai
- Mong-Dao
- Han

Vietnamese language
- The Vietnamese language was established a long time ago
- Chinese characters were used for a long time
- A writing system unique to Vietnam, Chu Nom (字喃), in the 10th century
- Romanized script (Quốc Ngữ) since the beginning of the 20th century
Example:
  Nam quốc sơn hà Nam đế cư
  南国山河南帝居
  "Over Mountains and Rivers of the South, Reigns the Emperor of the South"

Vietnamese language
- Vietnamese is an analytic language (words are composed of a single morpheme).
  - ngôn ngữ (analytic), lang-gua-ge (synthetic), 言語 (synthetic)
- Vietnamese does not use morphological marking of case, gender, number, or tense.
  - Trưa nay tôi ăn ba thằng tôm
- Syntax conforms to Subject-Verb-Object word order.
  - Cái thằng chồng em nó chẳng ra gì.
    FOCUS CLASSIFIER husband I he not turn.out what
    "That husband of mine, he is good for nothing."

VLSP Project 2007-2010
Objectives:
1. Basic research on methods for processing Vietnamese language and speech
2. Build and develop several typical VLSP products for public end-users
3. Build and develop indispensable resources and tools for VLSP development
[Diagram labels: Typical products; Resources and tools; Computation methods]
- All the tools are built on the same view of words, label assignment, sentences, and resources.
- The tools are built from the corpora using statistical and machine learning methods.
- Tools and resources are to be released to the public.

NLP groups
Group: Experience
- National Center for Technology Progress: Rule-based MT; the only commercial MT systems in Vietnam (EVTRAN3.0, VETRAN3.0)
- Univ. of Natural Sciences, VNUHCM: Transfer-based MT using Bitext Transfer Learning; dictionary and bilingual corpus work
- HCM Univ. of Technology, VNUHCM: Since 1989, with various trials; SMT since 2002; PBT and phrase extraction from Penn Treebank (since 2003)
- JAIST: SMT since 2007; improving the rule-based MT system using statistical techniques
- UNS - VNU Hanoi: Text alignment, bitext; tools: POS tagging, chunking, parsing
- Lexicography Center: Dictionary, corpora
- ColTech - VNU Hanoi: Focus on SMT; improving the rule-based MT system using statistical techniques
- HUT: Tool development: POS tagging, chunking, parsing
- Danang Univ: Tool development: spelling, POS tagging, chunking, parsing; French-Vietnamese-French dictionary (Papillon Project)

Development of Text Corpora
- VietTreeBank
  - 10,000 items of fully annotated corpus
    - Word boundary
    - POS tagging
    - Syntax labeling
  - Text corpus of 1 million syllables with word boundaries
  - Web-based tool for accessing and updating POS-tagged sentences
- Bilingual corpora: English-Vietnamese
  - 100,000 sentence pairs (60,000 parallel sentence pairs and 40,000 comparable sentence pairs)
- Vietnamese Machine Readable Dictionary
  - 35,000 entries with full lexical, syntactic and semantic information
  - Covers all modern Vietnamese words

Development of NLP Tools
- Word boundary detection
  - Accuracy about 99%
  - Text corpus in XML format with boundary labelling
- POS tagging
  - Accuracy >90%
  - POS tagging rules shared with VietTreeBank
  - Trained on 10,000 labelled sentences
- Chunking
  - Accuracy >85%
- Syntax parser
  - Accuracy >80%

Project target products
- SP1: Application-oriented systems based on Vietnamese speech recognition & synthesis
- SP2: Speech recognition system with large vocabulary
- SP3: English-Vietnamese translation system
- SP4: IREST: Internet use support system
- SP5: Vietnamese spelling checker
- SP6.1: Corpora for speech recognition
- SP6.2: Corpora for speech synthesis
- SP6.3: Corpora for specific words
- SP7.1: English-Vietnamese dictionary
- SP7.2: Viet dictionary
- SP7.3: Vietnamese treebank
- SP7.4: E-V corpora of aligned sentences
- SP8.1: Speech analysis tools
- SP8.2: Vietnamese word segmentation
- SP8.3: Vietnamese POS tagger
- SP8.4: Vietnamese chunker
- SP8.5: Vietnamese syntax analyser
To be standard for long-term development.

SP7.3: Viet Treebank
- A treebank, or parsed corpus, is a text corpus in which each sentence has been parsed, i.e. annotated with syntactic structure.
- English: Penn Treebank (4.5M words) and many others
- Chinese: Penn Chinese Treebank (507K words), Sinica Treebank (61,087 trees, 361K words)
- Japanese: ATR Dependency corpus, Kyoto Text Corpus, Verbmobil treebanks
- Korean: Korean Treebank (5078 trees, 54K words)
- Viet Treebank:
  - 10,000 trees
  - 1,000,000 morphemes
[Figure: parse tree of "Ông già đi nhanh quá" with node labels S, NP, VP, P, V, T]
[Figure: Viet Treebank; Viet word segmenter, Viet POS tagger, Viet chunker, Viet syntactic parser; Viet machine translation, info extraction, etc.]

SP7.3: Viet Treebank
- Study various existing treebanks, modern theories of syntax, and the Vietnamese language
- Build guidelines for word segmentation, POS, and syntax
  - "Nhà cửa bề bộn quá" and "Ở nhà cửa ngõ chẳng đóng gì cả" ("the house is in a jumble" and "at home the door is not closed")
  - "Cô ấy giữ gìn sắc đẹp" and "Bức này màu sắc đẹp hơn" ("she keeps her beauty" and "this painting has better color")
- Build the tools
- Labeling agreement between labelers (95%)
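The first pair of guideline sentences above shows why word segmentation needs more than dictionary lookup. As a minimal sketch, assuming a tiny hand-made lexicon and a greedy longest-match strategy (neither is taken from the VLSP tools; `LEXICON` and `segment` are hypothetical names), the same syllables "nhà cửa" are grouped identically in both sentences, although the second one should be read as "nhà" + "cửa ngõ":

```python
# Hypothetical toy lexicon of two-syllable words, for illustration only.
LEXICON = {"nhà cửa", "cửa ngõ", "bề bộn", "giữ gìn", "sắc đẹp", "màu sắc"}
MAX_WORD_LEN = 2  # longest lexicon entry, counted in syllables


def segment(sentence):
    """Greedy left-to-right longest-match segmentation over syllables."""
    syllables = sentence.split()
    words = []
    i = 0
    while i < len(syllables):
        for n in range(MAX_WORD_LEN, 0, -1):
            if i + n > len(syllables):
                continue  # not enough syllables left for an n-syllable word
            candidate = " ".join(syllables[i:i + n])
            # A single syllable is always accepted as a fallback.
            if n == 1 or candidate.lower() in LEXICON:
                words.append(candidate)
                i += n
                break
    return words


first = segment("Nhà cửa bề bộn quá")
second = segment("Ở nhà cửa ngõ chẳng đóng gì cả")
print(first)
print(second)
```

In the second sentence the greedy matcher commits to "nhà cửa" and leaves "ngõ" stranded, which is the kind of ambiguity that written guidelines and corpus-trained models are meant to resolve.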

Setting up the "standards" for VLSP
- An appropriate view shared across the different research groups
- Challenge: standards for sustainable development
- Guidelines for:
  - Word recognition and description: morphological, syntactic, semantic criteria
  - Label set: noun phrase, verb phrase, clause, …
  - Sentence splitting
- For example:
  - 36 word labels in English, from Penn Treebank (1989)
  - 30 word labels in Chinese, from Chinese TreeBank (1998)
  - 47 word labels in Thai, from Orchid corpus (1997)
- How many POS tags for Vietnamese?

SP3: Machine translation and EVSMT1.0
SMT: statistical machine translation
[Figure: Vietnamese-English bilingual text and English text each go through statistical analysis. The Vietnamese input "Ông già đi nhanh quá" is mapped to broken English candidates, and then to the English output "The old man died too fast".]
Broken English candidates:
- Died the old man too fast
- The old man too fast died
- The old man died too fast
- Old man died the too fast
(Slides 31-32 adapted from tutorial on SMT, K. Knight and P. Koehn)

SP3: Machine translation and EVSMT1.0
SMT: statistical machine translation
[Figure: statistical analysis of the Vietnamese-English bilingual text yields the translation model; statistical analysis of the English text yields the language model. The Vietnamese input passes through broken English to English.]
Decoding algorithm: argmax over e of P(v|e) × P(e)
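The decoding rule on this slide, argmax P(v|e) × P(e), can be illustrated with a toy sketch. It reuses the four candidate English outputs for "Ông già đi nhanh quá" from the previous slide. All probabilities are invented for illustration: the translation model is assumed to score every candidate equally (they contain the same words), and the language model is a hypothetical bigram table in which "seen" bigrams get probability 0.5 and unseen ones 0.01. This is not the EVSMT1.0 implementation.

```python
import math

# Candidate English outputs e for the Vietnamese input v.
CANDIDATES = [
    "died the old man too fast",
    "the old man too fast died",
    "the old man died too fast",
    "old man died the too fast",
]

# Toy translation model log P(v | e): assumed uniform over the candidates.
TM_LOGPROB = {e: math.log(0.25) for e in CANDIDATES}

# Toy language model: bigrams assumed to occur in English training text.
SEEN_BIGRAMS = {
    ("<s>", "the"), ("the", "old"), ("old", "man"), ("man", "died"),
    ("died", "too"), ("too", "fast"), ("fast", "</s>"),
}


def lm_logprob(sentence):
    """log P(e) under the toy bigram model (0.5 if seen, 0.01 otherwise)."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    return sum(
        math.log(0.5 if bigram in SEEN_BIGRAMS else 0.01)
        for bigram in zip(tokens, tokens[1:])
    )


def decode(candidates):
    """Return argmax over e of log P(v|e) + log P(e)."""
    return max(candidates, key=lambda e: TM_LOGPROB[e] + lm_logprob(e))


best = decode(CANDIDATES)
print(best)
```

Because the translation model cannot separate the candidates here, the language model alone decides the ranking, which is the division of labour the slide describes: the translation model proposes "broken English" and the language model prefers the fluent ordering. A real decoder searches a much larger hypothesis space instead of enumerating a fixed list.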

SP3: Machine translation and EVSMT1.0
Issues in Vietnamese SMT:
- Corpus building
- Language modeling
- Translation model (phrase-based)
- Decoder
- Others
SMT core (English sentence, Vietnamese sentence):
- Decoder (search problem): MOSES
- Translation model: GIZA++, MOSES, MERT
- Language model: SRILM
Pre-processing (Vietnamese-English parallel corpus, Vietnamese corpus):
- Standardization
- Word segmentation (VNsegmenter)
- POS tagger (CRF Postagger, VnQtag)
- Morphological analyser (morpha)
SMT resource processing:
- Pre-processing (sentence splitter, tokenizer, etc.), Web crawler
- Sentence alignment tools
Corpus collecting and building:
- Raw materials (documents, books, …)
- Automatic extraction of parallel text from the Web

Conclusion
- Completed the first phase of the VLSP infrastructure.
- Use advanced technologies and experience from processing other languages, especially statistical learning from large corpora.
- Work in collaboration and share results.
- Seek investment from government and industry for the next phase, and seek further collaboration.

Thank you!
