some useful tasks involving language


Finite-State Machines and Regular Languages
Detmar Meurers: Intro to Computational Linguistics I
OSU, LING 684.01

Some useful tasks involving language
• Find all phone numbers in a text, e.g., occurrences such as
  When you call (614) 292-8833, you reach the fax machine.
• Find multiple adjacent occurrences of the same word in a text, as in
  I read the the book.
• Determine the language of the following utterance: French or Polish?
  Czy pasazer jadacy do Warszawy moze jechac przez Londyn?

More useful tasks involving language
• Look up the following words in a dictionary:
  laughs, became, unidentifiable, Thatcherization
• Determine the part-of-speech of words like the following, even if you can't find them in the dictionary:
  conurbation, cadence, disproportionality, lyricism, parlance
⇒ Such tasks can be addressed using so-called finite-state machines.
⇒ How can such machines be specified?

Regular expressions
• A regular expression is a description of a set of strings, i.e., a language.
• They can be used to search for occurrences of these strings.
• A variety of unix tools (grep, sed), editors (emacs), and programming languages (perl, python) incorporate regular expressions.
• Just like any other formalism, regular expressions as such have no linguistic contents, but they can be used to refer to linguistic units.

The syntax of regular expressions (1)
Regular expressions consist of
• strings of characters: c , A100 , natural language , 30 years!
• disjunction:
  – ordinary disjunction: devoured|ate , famil(y|ies)
  – character classes: [Tt]he , bec[oa]me
  – ranges: [A-Z] (a capital letter)
• negation: [^a] (any symbol but a), [^A-Z0-9] (not an uppercase letter or digit)

The syntax of regular expressions (2)
• counters
  – optionality: ?                                 colou?r
  – any number of occurrences: * (Kleene star)     [0-9]* years
  – at least one occurrence: +                     [0-9]+ dollars
• wildcard for any character: .                    beg.n for any character in between beg and n
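The first two tasks and some of the syntax above can be tried out directly. The following is a minimal sketch in Python's re module; the phone pattern assumes numbers are always written like (614) 292-8833, the {3} and {4} counters and the \1 backreference are Python notation not shown on the slides, and the function names are made up for illustration.

```python
import re

# Assumption: phone numbers always look like "(614) 292-8833".
PHONE = re.compile(r"\([0-9]{3}\) [0-9]{3}-[0-9]{4}")

# A word, whitespace, then the same word again. The backreference \1 goes
# beyond regular expressions in the strict formal sense.
REPEATED_WORD = re.compile(r"\b(\w+)\s+\1\b")


def find_phone_numbers(text):
    """Return all substrings of text that match the assumed phone format."""
    return PHONE.findall(text)


def find_repeated_words(text):
    """Return each word that is immediately followed by itself."""
    return [m.group(1) for m in REPEATED_WORD.finditer(text)]


def matches(pattern, string):
    """True if the whole string is in the language described by pattern."""
    return re.fullmatch(pattern, string) is not None


if __name__ == "__main__":
    print(find_phone_numbers("When you call (614) 292-8833, you reach the fax machine."))
    print(find_repeated_words("I read the the book."))
    for pattern, string in [("colou?r", "color"), ("[0-9]+ dollars", "30 dollars"),
                            ("beg.n", "begun"), ("famil(y|ies)", "families")]:
        print(pattern, string, matches(pattern, string))
```

Whole-string matching (re.fullmatch) is used for the syntax examples because a regular expression describes a set of strings; searching inside a longer text (findall, finditer) is the separate use mentioned above.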

The syntax of regular expressions (3)
Operator precedence, from highest to lowest:
  parentheses            ()
  counters               * + ?
  character sequences
  disjunction            |
Note: The various unix tools and languages differ w.r.t. the exact syntax of the regular expressions they allow.

Regular languages
How can the class of regular languages which is specified by regular expressions be characterized?
Let Σ be the set of all symbols of the language, the alphabet, then:
1. {} is a regular language
2. ∀a ∈ Σ: {a} is a regular language
3. If L_1 and L_2 are regular languages, so are:
   (a) the concatenation of L_1 and L_2: L_1 · L_2 = {xy | x ∈ L_1, y ∈ L_2}
   (b) the union of L_1 and L_2: L_1 ∪ L_2
   (c) the Kleene closure of L: L* = L^0 ∪ L^1 ∪ L^2 ∪ ..., where L^i is the concatenation of i copies of L (so L^0 = {ε})

Properties of regular languages
The regular languages are closed under (L_1 and L_2 regular languages):
• concatenation: L_1 · L_2      set of strings with beginning in L_1 and continuation in L_2
• Kleene closure: L_1*          set of concatenations of zero or more strings from L_1
• union: L_1 ∪ L_2              set of strings in L_1 or in L_2
• complementation: Σ* − L_1     set of all possible strings that are not in L_1
• difference: L_1 − L_2         set of strings which are in L_1 but not in L_2
• intersection: L_1 ∩ L_2       set of strings in both L_1 and L_2
• reversal: L_1^R               set of the reversal of all strings in L_1

Finite state machines
Finite state machines (or automata) (FSM, FSA) recognize or generate regular languages, exactly those specified by regular expressions.
Example:
• Regular expression: colou?r
• Finite state machine:
  [Figure: finite state machine for colou?r with states 0–6 and arcs labelled c, o, l, o, u, r, r]

Defining finite state automata
A finite state automaton is a quintuple (Q, Σ, E, S, F) with
• Q a finite set of states
• Σ a finite set of symbols, the alphabet
• S ⊆ Q the set of start states
• F ⊆ Q the set of final states
• E ⊆ Q × (Σ ∪ {ε}) × Q a set of edges
The transition function d can be defined as
  d(q, a) = {q′ ∈ Q | (q, a, q′) ∈ E}
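To make the quintuple concrete, here is a minimal sketch of an automaton for colou?r written out as (Q, Σ, E, S, F) in Python. The state numbering is an assumption chosen for illustration and is not recovered from the figure; the automaton has no ε edges, so the sketch does not handle them.

```python
# Assumed numbering: states 0..4 spell c-o-l-o, then either u-r or r directly.
Q = {0, 1, 2, 3, 4, 5, 6}
SIGMA = set("colur")
E = {
    (0, "c", 1), (1, "o", 2), (2, "l", 3), (3, "o", 4),
    (4, "u", 5), (5, "r", 6),
    (4, "r", 6),  # the optional u is skipped
}
S = {0}
F = {6}


def d(q, a):
    """Transition function d(q, a) = {q' in Q | (q, a, q') in E}."""
    return {q2 for (q1, symbol, q2) in E if q1 == q and symbol == a}


def accepts(word):
    """True if some path from a start state to a final state spells word."""
    current = set(S)
    for a in word:
        current = {q2 for q in current for q2 in d(q, a)}
    return bool(current & F)


if __name__ == "__main__":
    for w in ["color", "colour", "colouur", "colo"]:
        print(w, accepts(w))
```

Tracking a set of current states rather than a single state keeps the function usable for non-deterministic edge sets as well, although this particular automaton happens to be deterministic.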

Language accepted by an FSA
The extended set of edges Ê ⊆ Q × Σ* × Q is the smallest set such that
• ∀(q, σ, q′) ∈ E: (q, σ, q′) ∈ Ê
• ∀(q_0, σ_1, q_1), (q_1, σ_2, q_2) ∈ Ê: (q_0, σ_1σ_2, q_2) ∈ Ê
The language L(A) of a finite state automaton A is defined as
  L(A) = {w | q_s ∈ S, q_f ∈ F, (q_s, w, q_f) ∈ Ê}

Finite state transition networks (FSTN)
Finite state transition networks are graphical descriptions of finite state machines:
• nodes represent the states
• start states are marked with a short arrow
• final states are indicated by a double circle
• arcs represent the transitions

Example for a finite state transition network
[Figure: network with states S0 (start), S1, S2 and S3 (final); arcs S0 –a→ S1, S1 –b→ S3, S0 –c→ S2, S2 –b→ S2, S2 –b→ S3]
Regular expression specifying the language generated or accepted by the corresponding FSM: ab|cb+

Finite state transition tables
Finite state transition tables are an alternative, textual way of describing finite state machines:
• the rows represent the states
• start states are marked with a dot after their name
• final states with a colon
• the columns represent the alphabet
• the fields in the table encode the transitions

The example specified as finite state transition table
        a      b         c      d
  S0.   S1               S2
  S1           S3:
  S2           S2,S3:
  S3:

Some properties of finite state machines
• The recognition problem can be solved in time linear in the length of the input (for a deterministic automaton, independent of the size of the automaton).
• There is an algorithm to transform each automaton into a unique equivalent deterministic automaton with the least number of states.
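The transition table for ab|cb+ can be checked mechanically. The sketch below stores the table as a Python dictionary, keeping the dot and colon markers on the row names, and compares the automaton against the regular expression on all strings over {a, b, c, d} up to length 5. The dictionary layout and the function names are assumptions made for this illustration.

```python
import re
from itertools import product

# Row names carry the markers from the table: "." = start state, ":" = final state.
TABLE = {
    "S0.": {"a": ["S1"], "c": ["S2"]},
    "S1": {"b": ["S3"]},
    "S2": {"b": ["S2", "S3"]},
    "S3:": {},
}


def parse(table):
    """Split the marked table into start states, final states and transitions."""
    starts, finals, transitions = set(), set(), {}
    for row, fields in table.items():
        name = row.rstrip(".:")
        if "." in row:
            starts.add(name)
        if ":" in row:
            finals.add(name)
        transitions[name] = fields
    return starts, finals, transitions


STARTS, FINALS, TRANSITIONS = parse(TABLE)


def accepts(word):
    """One pass over word, tracking every state the automaton could be in."""
    current = set(STARTS)
    for symbol in word:
        current = {t for s in current for t in TRANSITIONS[s].get(symbol, [])}
    return bool(current & FINALS)


def agrees_with_regex(max_length=5):
    """Compare the automaton with ab|cb+ on all short strings over a, b, c, d."""
    for n in range(max_length + 1):
        for letters in product("abcd", repeat=n):
            word = "".join(letters)
            if accepts(word) != (re.fullmatch("ab|cb+", word) is not None):
                return False
    return True


if __name__ == "__main__":
    print(accepts("ab"), accepts("cbbb"), accepts("abb"))
    print(agrees_with_regex())
```

The loop in accepts reads each input symbol once, which is where the linear dependence on the input length comes from; because this table is non-deterministic (the S2 row lists two targets for b), the work per symbol still depends on how many states are active.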

Deterministic Finite State Automata
A finite state automaton is deterministic iff it has
• no ε transitions and
• for each state and each symbol there is at most one applicable transition.
Every non-deterministic automaton can be transformed into a deterministic one:
• Define new states representing a disjunction of old states for each non-determinacy which arises.
• Define arcs for these states corresponding to each transition which is defined in the non-deterministic automaton for one of the disjuncts in the new state names.

Example: Determinization of FSA
[Figure: a non-deterministic automaton with states 1–6 and arcs labelled a, b, c, d, e, shown next to its determinized version, which adds the set-valued states {3,5}, {4,5} and {5,6}]

From Automata to Transducers
Needed: mechanism to keep track of path taken
A finite state transducer is a 6-tuple (Q, Σ_1, Σ_2, E, S, F) with
• Q a finite set of states
• Σ_1 a finite set of symbols, the input alphabet
• Σ_2 a finite set of symbols, the output alphabet
• S ⊆ Q the set of start states
• F ⊆ Q the set of final states
• E ⊆ Q × (Σ_1 ∪ {ε}) × Q × (Σ_2 ∪ {ε}) a set of edges

Transducers and determinization
A finite state transducer understood as consuming an input and producing an output cannot generally be determinized.
Example:
[Figure: transducer with arcs labelled a:b, b:b, a:b, a:c, c:c, a:c]

Summary
• Notations for characterizing regular languages:
  – Regular expressions
  – Finite state transition networks
  – Finite state transition tables
• Finite state machines and regular languages: Definitions and some properties
• Finite state transducers

Reading assignment 2
• Ch. 1 “Finite State Techniques” of course notes
• Ch. 2 “Regular expressions and automata”, Jurafsky & Martin (2000)
• For a more in-depth discussion of the NLP aspects, take a look at:
  – Chapter 1 (Introduction) of E. Roche and Y. Schabes (1997): Finite-State Language Processing. MIT Press.
  – Richard Sproat, “Lexical Analysis”, in Robert Dale, Hermann Moisl, and Harold Somers (eds.) Handbook of NLP. 2000.
• Good reference books on the theoretical computer science aspects:
  – H.R. Lewis, C.H. Papadimitriou. “Elements of the Theory of Computation.” 2nd Ed. 1998. Prentice-Hall.
  – John E. Hopcroft, Rajeev Motwani, Jeffrey D. Ullman. “Introduction to Automata Theory, Languages, and Computation.” 2nd Ed. 2001. Addison-Wesley. Or the 1979 version by John E. Hopcroft and Jeffrey D. Ullman.
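Returning to the determinization recipe from the Deterministic Finite State Automata slide, the two steps (new states standing for sets of old states, and arcs collected from every member of the set) can be sketched as the usual subset construction. This is an illustrative sketch, not the course's own code: it assumes the input automaton has no ε edges, and the example automaton for ab|cb+ and all function names are chosen here for illustration.

```python
from collections import deque


def determinize(edges, starts, finals):
    """Subset construction for an automaton without epsilon edges.

    edges is a set of (state, symbol, state) triples; starts and finals are sets.
    Returns (start, dfa_edges, dfa_finals), where every DFA state is a frozenset
    of original states.
    """
    alphabet = sorted({symbol for (_, symbol, _) in edges})
    start = frozenset(starts)
    dfa_edges, dfa_finals = {}, set()
    seen, queue = {start}, deque([start])
    while queue:
        state = queue.popleft()
        if state & finals:
            dfa_finals.add(state)
        for a in alphabet:
            # Collect the targets of every member of the set state.
            target = frozenset(q2 for (q1, symbol, q2) in edges
                               if q1 in state and symbol == a)
            if not target:
                continue
            dfa_edges[(state, a)] = target
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return start, dfa_edges, dfa_finals


def run(dfa, word):
    """Follow the single applicable transition for each symbol of word."""
    state, dfa_edges, dfa_finals = dfa
    for a in word:
        state = dfa_edges.get((state, a))
        if state is None:
            return False
    return state in dfa_finals


# Example: a non-deterministic automaton for ab|cb+ (two b arcs leave S2).
NFA_EDGES = {("S0", "a", "S1"), ("S0", "c", "S2"), ("S1", "b", "S3"),
             ("S2", "b", "S2"), ("S2", "b", "S3")}
DFA = determinize(NFA_EDGES, {"S0"}, {"S3"})

if __name__ == "__main__":
    for (state, a), target in sorted(DFA[1].items(), key=str):
        print(sorted(state), a, sorted(target))
    print(run(DFA, "cbb"), run(DFA, "abb"))
```

Only subsets reachable from the start set are built, so the result is often far smaller than the 2^|Q| worst case. Handling ε edges would additionally require closing each set under ε transitions, which this sketch leaves out.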
