From Pāṇinian Sandhi to Finite State Calculus
Malcolm D. Hyman
Max Planck Institute for the History of Science, Berlin
First International Sanskrit Computational Linguistics Symposium, Paris, 2007
Overview
1. Research context
2. An XML vocabulary for Pāṇinian rules
3. From Pāṇinian rules to an FST
4. Implications: remarks on linguistic description
Research context
Ongoing work on modeling components of Sanskrit grammar according to Pāṇinian principles:
  nominal inflection
  verbal inflection (using the Dhātupāṭha)
  stem formation (perfect stem, participial stems, ...)
  morphophonology (sandhi)
Methodology
How closely to follow Pāṇini?
  Practical concerns dictate an incremental approach.
  We are obliged to interpret Pāṇini.
  Research results concerning both Indian grammatical methods and facts of the Sanskrit language will emerge from computational studies.
Building blocks of an XML model
The rules model not only a Pāṇinian sūtra but also its context and its interpretation.
  An XML schema
  A sound-based encoding (SLP1)
  A regular expression dialect (PCREs)
The SLP1 encoding
  a ā i ī u ū      a A i I u U
  ṛ ṝ ḷ ḹ          f F x X
  e ai o au        e E o O
  k kh g gh ṅ      k K g G N
  c ch j jh ñ      c C j J Y
  ṭ ṭh ḍ ḍh ṇ      w W q Q R
  t th d dh n      t T d D n
  p ph b bh m      p P b B m
  y r l v          y r l v
  ś ṣ s h          S z s h
  anusvāra = M; visarga = H
The rule element
8.3.23 mo ’nusvāraḥ
  <rule source="m" target="M" rcontext="[@(wb)][@(hal)]" ref="A.8.3.23"/>
(We may need more than one rule to express a sūtra.)
The macro element
We need some means of translating Pāṇini’s metalanguage, e.g. sound classes (pratyāhāras):
  <macro name="JaS" value="JBGQDjbgqd" c="voiced stop"/>
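A minimal Python sketch of how macro references such as @(hal) might be expanded before a rule is applied. Only the JaS value is taken from the slides; the values given here for wb and hal, and the helper names expand and apply_rule, are assumptions made for illustration and are not the author's implementation.

```python
import re

# Macro table. "JaS" is quoted from the slides; "wb" and "hal" are assumed values.
MACROS = {
    "JaS": "JBGQDjbgqd",                         # voiced stops
    "wb": " #",                                  # assumed: word boundary is a space or '#'
    "hal": "kKgGNcCjJYwWqQRtTdDnpPbBmyrlvSzsh",  # assumed: all SLP1 consonants
}

def expand(pattern, macros=MACROS):
    """Replace every @(name) reference with the escaped macro value."""
    return re.sub(r"@\((\w+)\)", lambda m: re.escape(macros[m.group(1)]), pattern)

def apply_rule(text, source, target, rcontext=""):
    """Rewrite source as target wherever the right context follows."""
    regex = expand(source)
    if rcontext:
        regex += "(?=" + expand(rcontext) + ")"  # lookahead: context is not consumed
    return re.sub(regex, target, text)

# 8.3.23: m becomes anusvara (M) before a word boundary plus a consonant.
print(apply_rule("tam ca", "m", "M", rcontext="[@(wb)][@(hal)]"))  # taM ca
```

The right context is compiled as a lookahead so that the following word is left untouched and remains available to later rules.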
The mapping element
1.1.2 adeṅ guṇaḥ
  <mapping name="guna" ref="A.1.1.2">
    <map from="@(a)" to="a"/>
    <map from="@(i)" to="e"/>
    <map from="@(u)" to="o"/>
    <map from="@(f)" to="a"/>
    <map from="@(x)" to="a"/>
  </mapping>
The function element
  <function name="gunate">
    <rule source="[@(a)@(i)@(u)]" target="%(guna($1))"/>
    <rule source="[@(f)@(x)]" target="%(guna($1))%(semivowel($1))"/>
  </function>
Applying a function
6.1.87 ād guṇaḥ
  <rule source="[@(a)][@(wb)]([@(ik)])" target="!(gunate($1))" ref="A.6.1.87"/>
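The interplay of mapping, function and rule can be sketched in Python as follows. The guṇa table mirrors the mapping element; the semivowel table (ṛ to r, ḷ to l), the membership of the ik class and the word boundary as space or '#' are assumptions, and the function names are hypothetical.

```python
import re

# guna mapping (A. 1.1.2), with each vowel class spelled out as short + long.
GUNA = {"a": "a", "A": "a", "i": "e", "I": "e", "u": "o", "U": "o",
        "f": "a", "F": "a", "x": "a", "X": "a"}

# Assumed semivowel mapping: vocalic r/l contribute r/l after the guna vowel.
SEMIVOWEL = {"f": "r", "F": "r", "x": "l", "X": "l"}

def gunate(vowel):
    """guna of a vowel, followed by a semivowel for vocalic r and l."""
    return GUNA[vowel] + SEMIVOWEL.get(vowel, "")

def rule_6_1_87(text):
    """a or A, a word boundary, then an ik vowel: replace all three by gunate(ik)."""
    return re.sub(r"[aA][ #]([iIuUfFxX])", lambda m: gunate(m.group(1)), text)

print(rule_6_1_87("deva indra"))    # devendra
print(rule_6_1_87("hita upadeSa"))  # hitopadeSa
print(rule_6_1_87("mahA fzi"))      # maharzi
```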
Implementing the modeled rules
  The XML model captures some of the structure of Pāṇini’s grammar.
  But the obvious serial application of the rules is computationally inefficient.
  The rules can be automatically translated into regular expressions for compilation into a finite state transducer using tools such as xfst (Xerox) or fsa (van Noord).
  The relation between the underlying strings and the surface strings is a regular relation.
The replace operator
Rules may be translated into regular expressions employing the replace operator (Karttunen 1995).
  (a | A)(␣ | #)(a | A) → a
  (a | A)(␣ | #)(i | I) → e
  (a | A)(␣ | #)(u | U) → o
  (a | A)(␣ | #)(f | F) → ar
  (a | A)(␣ | #)(x | X) → al
(␣ stands for a space.)
Context-dependent replacement
Documented algorithms exist for the translation of context-dependent replacements into FSTs (Mohri & Sproat 1996).
6.1.109 eṅaḥ padāntād ati
  <rule source="a" target="’" lcontext="[@(eN)][@(wb)]" ref="6.1.109"/>
  a → ’ / (e | o)(␣ | #) _
An FST for 6.1.109
6.1.109 eṅaḥ padāntād ati
[Figure: transducer with states 0, 1 and 2; arc labels include "e, o", "␣, #", "?" and "a:’"]
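A three-state transducer for this rule can be simulated in a few lines of Python. The sketch below encodes one reading of the rule (short a becomes avagraha after word-final e or o) and does not claim to reproduce the exact arcs of the slide's diagram; the function name is hypothetical, the word boundary is assumed to be a space or '#', and the avagraha is written as an ASCII apostrophe.

```python
def avagraha_6_1_109(text):
    """Simulate a 3-state transducer: state 1 = just read e/o,
    state 2 = read e/o followed by a word boundary."""
    state, out = 0, []
    for ch in text:
        if state == 2 and ch == "a":
            out.append("'")      # arc a:' from state 2 back to state 0
            state = 0
            continue
        out.append(ch)           # every other arc copies its input symbol
        if ch in "eo":
            state = 1
        elif state == 1 and ch in " #":
            state = 2
        else:
            state = 0
    return "".join(out)

print(avagraha_6_1_109("vane atra"))  # vane 'tra
print(avagraha_6_1_109("te api"))     # te 'pi
```

Because the machine reads each symbol once and keeps only a state number, its cost is linear in the length of the input regardless of how many rules have been composed into it.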
A composed FST for external sandhi
  37 sūtras constitute the core rules for external sandhi
  XML: 48 rules, 61 macros, 16 mappings, 3 functions
  compiled regular expressions are ~268 KB
  composed transducer has 4,994 states and 417,814 arcs
Comparing two approaches
Serial application of rules:
  FORM      SŪTRA
  tat ca    8.2.39
  tad ca    8.4.40, 44
  taj ca    8.4.55
  tac ca
  tacca
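The serial derivation can be replayed with ordered regular-expression substitutions. The patterns below are deliberately simplified stand-ins that cover only this example, not full statements of the sūtras; the step labelled "join" (removing the space) and the name derive are assumptions added for illustration.

```python
import re

# (label, pattern, replacement); each pattern is a simplified stand-in.
STEPS = [
    ("8.2.39", r"t(?= )", "d"),                  # word-final t is voiced
    ("8.4.40", r"d(?= [cCjJY])", "j"),           # dental becomes palatal before a palatal
    ("8.4.55", r"j(?= [kKcCwWtTpPSzs])", "c"),   # devoicing before a voiceless sound
    ("join", r" ", ""),                          # assumed final step: write as one word
]

def derive(form):
    """Apply the steps in order and return every intermediate form."""
    history = [form]
    for _label, pattern, replacement in STEPS:
        form = re.sub(pattern, replacement, form)
        history.append(form)
    return history

print(derive("tat ca"))  # ['tat ca', 'tad ca', 'taj ca', 'tac ca', 'tacca']
```

Each step rescans the whole string, which is the inefficiency that composing the rules into a single transducer avoids.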
Comparing two approaches
A unique path through the transducer:
  <t:t><a:a><t:c><" ":c><c:ε><a:a>
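Reading the path as a sequence of input:output pairs, the underlying and surface strings are simply its two projections. A small Python sketch, with ε represented as the empty string and the names PATH and project chosen for illustration:

```python
# The path from the slide as (input, output) pairs; "" stands for epsilon.
PATH = [("t", "t"), ("a", "a"), ("t", "c"), (" ", "c"), ("c", ""), ("a", "a")]

def project(path):
    """Return the input (underlying) and output (surface) strings of a path."""
    underlying = "".join(i for i, _ in path)
    surface = "".join(o for _, o in path)
    return underlying, surface

print(project(PATH))  # ('tat ca', 'tacca')
```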
Limitations of segmentalism
  Segments are atomic, and enumerating them limits linguistic generalization.
  Features overlap segments.
  It was J. R. Firth’s insight that “some phonological properties are not uniquely ‘placed’ with respect to particular segments within a larger unit” (Anderson, 1985, 185).
  Coarticulation “can be detected in almost every phoneme sequence in normal speech” (Goodglass, 1993, 62).
Positions of the Indian grammarians
  Pāṇini moved beyond the vikāra system of earlier linguistic thinkers (Cardona 1965, 311).
  Use of abbreviations (pratyāhāras) for sound classes and the principle of sāvarṇya (A. 1.1.50) emphasize featural analysis.
  Segments contain subsegments (e.g. /ṛ/ contains r: MBh. 3.452.1 ff.).
  Pitch is a property of the syllable (ṚPr. 3.9) or spreads to adjacent consonants (TPr. 1.43).
N-retroflexion in finite state modeling
Non-final /n/ is realized as ṇ after {ṛ, ṝ, r, ṣ} despite intervening vowels, semivowels, gutturals/velars, labials, or anusvāra.
  <rule source="n" target="R" lcontext="[fFrz][#@(aw)@(ku)@(pu)M]*" rcontext=".*[@(ac)]" ref="8.4.1-2"/>
N-retroflexion examples
There is a regular relation between a set of underlying and surface strings that includes the following pairs:
  UNDERLYING     SURFACE
  bṛṃhana        bṛṃhaṇa        ‘making big/strong’
  ārabhyamāna    ārabhyamāṇa    ‘being commenced’
  niṣanna        niṣaṇṇa        ‘sitting’
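The long-distance left context of n-retroflexion can be approximated with an ordinary regular expression. The Python sketch below spells out the classes aw, ku, pu and ac with assumed SLP1 values, and uses a captured group in place of a left context because Python's re module does not allow variable-width lookbehind; the names are hypothetical. It covers the first two pairs only: this single pattern does not produce the second ṇ of niṣaṇṇa.

```python
import re

VOWELS = "aAiIuUfFxXeEoO"  # assumed value of the ac class
# Assumed intervening sounds: '#', aw (vowels + h y v r), ku, pu, anusvara.
INTERVENING = "#" + VOWELS + "hyvr" + "kKgGN" + "pPbBm" + "M"

# Trigger, any run of permitted intervening sounds, then n with a vowel later on.
PATTERN = re.compile("([fFrz][" + INTERVENING + "]*)n(?=.*[" + VOWELS + "])")

def retroflex_n(word):
    """Replace n by R (retroflex n) in the context above, until nothing changes."""
    previous = None
    while previous != word:
        previous = word
        word = PATTERN.sub(lambda m: m.group(1) + "R", word)
    return word

print(retroflex_n("bfMhana"))     # bfMhaRa
print(retroflex_n("AraByamAna"))  # AraByamARa
print(retroflex_n("ratna"))       # ratna (t blocks the spread)
```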
A prosody of retroflexion
When R is projected onto the linear phonematic plane, ṇ occurs within its extension (Allen 1951, 943).
[Figure: the prosody R drawn over its extension in bṛṃhaṇa, ā-rabhyamāṇa and ni-ṣaṇṇa]
How to represent length?
  /devāt/ ([+long] segment)
  /devaːt/ (phoneme of length)
  /devaat/ (two phonemes)
Autosegmental approaches to length
[Figure: two autosegmental representations of devāt: the segments d e v a t with a [DBL] feature, and the segments d e v a t linked to the skeletal tier C V C V V C]