CLSW 2020
Revisiting Tibetan Word Segmentation with Neural Networks
Sangjie Duanzhu, Cizhen Jiacuo, Cairang Jia
Key Laboratory of Tibetan Information Processing and Machine Translation, Qinghai Normal University
May 2020
Outline
1 Introduction to Tibetan Word Segmentation
   • Tibetan alphabet and word-formation
   • Tibetan Word Segmentation
   • TWS research background
   • Our work
2 Tibetan word segmentation with neural networks
   • Tagging schemes
   • Model architecture
   • CRF for tag inference
3 Experiments
   • Datasets
   • Results
4 Conclusion
5 Acknowledgment
Introduction to Tibetan Word Segmentation

Tibetan alphabet
[Figure: the Tibetan alphabet shown alongside the Sanskrit alphabet]
Introduction to Tibetan Word Segmentation

Word formation
• A word is composed of syllables
• A syllable is composed of one or more characters
• Syllables are separated by a special character, ་ (tsheg)

This word means “peace”: it is composed of 2 syllables, of which the first contains 2 characters and the second contains 3 characters.
The syllable composition can get more complex, e.g. this syllable contains 7 characters.
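The tsheg-delimited syllable structure above can be illustrated with a few lines of Python. The slide's example word is not recoverable from the extracted text, so the Tibetan word for "peace", ཞི་བདེ, is supplied here as an illustration matching the description (2 syllables of 2 and 3 characters):

```python
# Tibetan syllables are delimited by the tsheg character (U+0F0B).
TSHEG = "\u0f0b"

def split_syllables(word):
    """Split a Tibetan word into syllables on the tsheg delimiter."""
    return [s for s in word.split(TSHEG) if s]

# ཞི་བདེ ("peace") has two syllables: ཞི (2 characters) and བདེ (3 characters).
word = "ཞི་བདེ"
print(split_syllables(word))                     # → ['ཞི', 'བདེ']
print([len(s) for s in split_syllables(word)])   # → [2, 3]
```

Counting Unicode code points per syllable reproduces the 2-and-3-character composition described on the slide.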
Introduction to Tibetan Word Segmentation

Tibetan Word Segmentation
• Different from European languages such as English, there is no explicit delimiter between words in Tibetan
• Tibetan word segmentation (TWS) is usually the first and essential sub-task to tackle in a Tibetan NLP workflow
• The performance of TWS has a crucial impact on many downstream tasks, due to the fact that errors propagate in a multi-stage NLP pipeline

Example
ཇ་ཁང་ (restaurant)/ ནང་ (inside)/ ན་ (functional case)/ བ་མ་ (pots)/ མང་ (many)/ ། (punctuation)/ → There are many pots in the restaurant.
ནམ་མཁ (sky)/ ར་ (functional case)/ འཕཱུར (fly)/ ། (punctuation)/ → Fly into the sky.
Introduction to Tibetan Word Segmentation

TWS research background
• Given the significance of TWS, researchers began to address it using maximum matching methods back in 1999 [Tsering, 1999]
• Dictionary-based, rule-based, or hybrids of these two approaches later became the main methods in this field
• Currently, traditional statistical models, e.g. HMM, CRF or EM, are the primary choice of implementation for TWS systems
• In recent years, with the widespread adoption of deep learning methods in NLP, the Tibetan NLP community started to embrace the new paradigm shift
• Some initial work has been done to explore TWS with neural networks [Li et al., 2018]
Introduction to Tibetan Word Segmentation

TWS research background
• Dictionary-based methods rely heavily on dictionaries, linguistic rules and other forms of knowledge hand-crafted with great care by linguistic experts
• Statistical methods hold strong assumptions of conditional independence and require discrete representations of basic language units as input, which:
   • limit the capacity for feature selection
   • limit the capacity for modeling contextual signals
   • lead to a moderate amount of semantic information loss
   • impose constraints on the modeling capacity
Introduction to Tibetan Word Segmentation

Our work
• We used pre-trained models for both character-level and syllable-level contextual representations to better capture semantic information
• A combination of CNN and Bi-LSTM network stacks is used to fully capture sentence-level representations
• A subsequent CRF layer is appended to serve as the inference component of our model, to tag syllables
• In experiments, the accuracy, recall and F1 score reach 93.4%, 95.4% and 94.1% on the test set, surpassing our base models by a large margin
Tibetan word segmentation with neural networks

Tagging schemes
• The BMES tagging scheme is commonly used in both TWS and CWS tasks
• In Tibetan, many functional suffixes such as འི འང ས འོ ར can be agglutinated with certain syllables, so a Tibetan syllable is not necessarily the smallest language unit that requires tagging
• The BMES scheme will therefore potentially produce a large number of invalid Tibetan character combinations such as ནམ་མཁ
• To eliminate this, an extra tag (N) can be introduced to label agglutinated Tibetan syllables

Different tagging schemes for TWS (example: “Fly into the sky”):
Original sentence:    ནམ་མཁར་འཕཱུར།
Tokenized sentence:   ནམ་མཁ ར་ འཕཱུར །
BMES:                 ནམ [B] ་ [M] མཁ [E] ར [B] ་ [E] འཕཱུར [S] ། [S]
BMESN with tsheg:     ནམ [B] ་ [M] མཁར་ [N] འཕཱུར [S] ། [S]
BMESN without tsheg:  ནམ [B] མཁར [N] འཕཱུར [S] ། [S]
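The basic BMES assignment described above can be sketched in a few lines; this is a generic illustration of the tagging scheme, not the paper's code, and it omits the extra N tag that the BMESN variants add for agglutinated syllables:

```python
def bmes_tags(words):
    """Assign BMES tags to the syllables of a segmented sentence:
    B = begin, M = middle, E = end of a multi-syllable word, S = single-syllable word.
    `words` is a list of words, each word a list of syllables."""
    tagged = []
    for word in words:
        if len(word) == 1:
            tagged.append((word[0], "S"))
        else:
            tagged.append((word[0], "B"))
            tagged += [(syl, "M") for syl in word[1:-1]]
            tagged.append((word[-1], "E"))
    return tagged

# A toy segmented sentence: one three-syllable word followed by a single-syllable word.
print(bmes_tags([["a", "b", "c"], ["d"]]))
# → [('a', 'B'), ('b', 'M'), ('c', 'E'), ('d', 'S')]
```

Decoding simply inverts this mapping: a word boundary is emitted after every E or S tag.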
Tibetan word segmentation with neural networks

Model architecture
• A CNN is applied to the characters of each syllable to capture character-level information
• The output of the CNN is fed into subsequent Bi-LSTM networks, which encode Tibetan sentences based on syllable-level signals
• The output of the Bi-LSTM is passed into the final inference layer to predict the correct tag for each syllable

[Figure: model architecture — per-syllable character CNN outputs are concatenated with syllable embeddings, encoded by a Bi-LSTM, and decoded by a softmax/CRF layer; example input ནམ མཁར འཕཱུར, output tags B N S]
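The data flow through this stack can be traced in terms of tensor shapes. The dimensions below are made up for illustration (the paper does not state its hyperparameters here); only the shape transitions reflect the architecture described above:

```python
def shape_flow(batch=32, sent_len=20, syl_len=7,
               char_dim=50, cnn_filters=100, syl_dim=100,
               lstm_hidden=200, num_tags=5):
    """Trace tensor shapes through the CNN -> Bi-LSTM -> softmax/CRF stack."""
    # 1. Character embeddings for every syllable of every sentence.
    chars = (batch, sent_len, syl_len, char_dim)
    # 2. CNN + pooling over the character axis yields one
    #    character-level vector per syllable.
    cnn_out = (batch, sent_len, cnn_filters)
    # 3. Concatenated with pre-trained syllable embeddings.
    lstm_in = (batch, sent_len, cnn_filters + syl_dim)
    # 4. Bi-LSTM: forward and backward hidden states concatenated.
    lstm_out = (batch, sent_len, 2 * lstm_hidden)
    # 5. Projection to per-syllable tag scores for the inference layer.
    emissions = (batch, sent_len, num_tags)
    return [chars, cnn_out, lstm_in, lstm_out, emissions]

for shape in shape_flow():
    print(shape)
```

The key point is that the character axis is collapsed by the CNN, so the Bi-LSTM operates purely on syllable-level sequences.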
Tibetan word segmentation with neural networks

CRF for tag inference
• Word segmentation can be formalized as a sequence labeling task, which requires modeling not only the mapping from each input token to its label, but also the dependencies between predicted labels
• A CRF implements sequential dependencies in the predictions, which allows unconstrained features to model the conditional probability of the output y for a given input x

[Figure: a linear-chain CRF over inputs x_{i-1}, x_i, x_{i+1} with labels y_{i-1}, y_i, y_{i+1}; example: ནམ མཁར འཕཱུར → B N S]
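At inference time, a linear-chain CRF selects the tag sequence maximizing the sum of emission and transition scores, typically via Viterbi decoding. A minimal sketch with a toy tag set and made-up scores (illustrative only, not the paper's implementation):

```python
def viterbi(emissions, transitions, tags):
    """Return the highest-scoring tag sequence under a linear-chain CRF.
    emissions: list of dicts, one per position, mapping tag -> score.
    transitions: dict mapping (prev_tag, cur_tag) -> score."""
    # Scores for the first position come from emissions alone.
    score = {t: emissions[0][t] for t in tags}
    back = []
    for emit in emissions[1:]:
        new_score, pointers = {}, {}
        for cur in tags:
            # Best previous tag for reaching `cur` at this position.
            prev = max(tags, key=lambda p: score[p] + transitions[(p, cur)])
            new_score[cur] = score[prev] + transitions[(prev, cur)] + emit[cur]
            pointers[cur] = prev
        score, back = new_score, back + [pointers]
    # Backtrack from the best final tag.
    best = max(tags, key=score.get)
    path = [best]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return list(reversed(path))

# Toy example: transition scores forbid an E tag that does not follow a B.
tags = ["B", "E", "S"]
trans = {(p, c): 0.0 for p in tags for c in tags}
trans[("S", "E")] = trans[("E", "E")] = -100.0  # invalid continuations
emissions = [{"B": 1.0, "E": 0.0, "S": 0.9},
             {"B": 0.0, "E": 1.0, "S": 0.2}]
print(viterbi(emissions, trans, tags))  # → ['B', 'E']
```

The transition scores are exactly what lets the CRF suppress label sequences that are invalid under the tagging scheme, which a per-position softmax cannot do.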
Experiments

Datasets

Dataset for pre-training syllable and character representations:

Embedding type        Total tokens   Unique tokens
Syllable embedding    14M            10206
Character embedding   60M            306

Syllable tag distribution and data sizes for training/validation/test sets:

Dataset      B (%)   M (%)   E (%)   S (%)   N (%)   Data size (sentences, K)
Training     27.60   7.35    26.79   32.29   5.98    150
Validation   27.66   7.29    26.83   32.33   5.89    10
Testing      27.71   7.26    26.77   32.23   6.03    10
Overall      27.61   7.34    26.79   32.29   5.97    170
Experiments

Results

Experimental results of four types of models:

Models          Embedding    Accuracy (%)   Recall (%)   F1 (%)
CRF             -            89.0           86.6         87.5
LSTM+SOFTMAX    Random       89.7           87.9         88.6
LSTM+SOFTMAX    Pretrained   90.1           88.6         89.5
LSTM+CRF        Random       90.9           89.6         89.7
LSTM+CRF        Pretrained   92.0           90.4         90.5
CNN+LSTM+CRF    Random       92.5           90.0         91.3
CNN+LSTM+CRF    Pretrained   93.4           94.2         94.1
Conclusion
• In this work, we explored Tibetan word segmentation models with multiple neural network architectures, compared them with traditional statistical methods, and finally verified that the CNN + LSTM + CRF neural architecture performs best on the test data set
• Due to limited labeled data, the model can currently only be used for the Tibetan word segmentation task. We plan to use this work as a basis in the future to study and implement a general Tibetan sequence labeling framework for Tibetan word segmentation, part-of-speech tagging, and NER
• Recently, Transformer-based models in NLP have truly changed the way researchers work with text data. There is potential to further improve our model by using Transformer models to encode Tibetan syllable or even character sequences, but we leave this for future work