Finite-State Machines and Regular Languages

Detmar Meurers: Intro to Computational Linguistics I
OSU, LING 684.01

Some useful tasks involving language

• Find all phone numbers in a text, e.g., occurrences such as
  When you call (614) 292-8833, you reach the fax machine.
• Find multiple adjacent occurrences of the same word in a text, as in
  I read the the book.
• Determine the language of the following utterance: French or Polish?
  Czy pasazer jadacy do Warszawy moze jechac przez Londyn?

More useful tasks involving language

• Look up the following words in a dictionary:
  laughs, became, unidentifiable, Thatcherization
• Determine the part-of-speech of words like the following, even if you can't find them in the dictionary:
  conurbation, cadence, disproportionality, lyricism, parlance

⇒ Such tasks can be addressed using so-called finite-state machines.
⇒ How can such machines be specified?

Regular expressions

• A regular expression is a description of a set of strings, i.e., a language.
• They can be used to search for occurrences of these strings.
• A variety of unix tools (grep, sed), editors (emacs), and programming languages (perl, python) incorporate regular expressions.
• Just like any other formalism, regular expressions as such have no linguistic contents, but they can be used to refer to linguistic units.

The syntax of regular expressions (1)

Regular expressions consist of
• strings of characters: c, A100, natural language, 30 years!
• disjunction:
  – ordinary disjunction: devoured|ate, famil(y|ies)
  – character classes: [Tt]he, bec[oa]me
  – ranges: [A-Z] (a capital letter)
• negation: [^a] (any symbol but a), [^A-Z0-9] (not an uppercase letter or number)

The syntax of regular expressions (2)

• counters
  – optionality: ?
    colou?r
  – any number of occurrences: * (Kleene star)
    [0-9]* years
  – at least one occurrence: +
    [0-9]+ dollars
• wildcard for any character: .
  beg.n for any character in between beg and n

The syntax of regular expressions (3)

Operator precedence, from highest to lowest:
  parentheses            ()
  counters               * + ?
  character sequences
  disjunction            |

Note: The various unix tools and languages differ w.r.t. the exact syntax of the regular expressions they allow.

Regular languages

How can the class of regular languages which is specified by regular expressions be characterized?

Let Σ be the set of all symbols of the language, the alphabet, then:
1. {} (the empty language) is a regular language
2. ∀a ∈ Σ: {a} is a regular language
3. If L₁ and L₂ are regular languages, so are:
   (a) the concatenation of L₁ and L₂: L₁ · L₂ = {xy | x ∈ L₁, y ∈ L₂}
   (b) the union of L₁ and L₂: L₁ ∪ L₂
   (c) the Kleene closure of L: L* = L⁰ ∪ L¹ ∪ L² ∪ ..., where Lⁱ is the language of all strings formed by concatenating i strings from L

Properties of regular languages

The regular languages are closed under (L₁ and L₂ regular languages):
• concatenation: L₁ · L₂
  set of strings with beginning in L₁ and continuation in L₂
• Kleene closure: L₁*
  set of strings obtained by concatenating zero or more strings from L₁
• union: L₁ ∪ L₂
  set of strings in L₁ or in L₂
• complementation: Σ* − L₁
  set of all possible strings that are not in L₁
• difference: L₁ − L₂
  set of strings which are in L₁ but not in L₂
• intersection: L₁ ∩ L₂
  set of strings in both L₁ and L₂
• reversal: L₁ᴿ
  set of the reversal of all strings in L₁

Finite state machines

Finite state machines (or automata) (FSM, FSA) recognize or generate regular languages, exactly those specified by regular expressions.

Example:
• Regular expression: colou?r
• Finite state machine:
  [Figure: finite state machine for colou?r, with states 0 to 6 and arcs labelled c, o, l, o, u, r and a second arc labelled r]

Defining finite state automata

A finite state automaton is a quintuple (Q, Σ, E, S, F) with
• Q a finite set of states
• Σ a finite set of symbols, the alphabet
• S ⊆ Q the set of start states
• F ⊆ Q the set of final states
• E ⊆ Q × (Σ ∪ {ε}) × Q a set of edges

The transition function d can be defined as
  d(q, a) = {q′ ∈ Q | (q, a, q′) ∈ E}

Language accepted by an FSA

The extended set of edges Ê ⊆ Q × Σ* × Q is the smallest set such that
• ∀(q, σ, q′) ∈ E: (q, σ, q′) ∈ Ê
• ∀(q₀, σ₁, q₁), (q₁, σ₂, q₂) ∈ Ê: (q₀, σ₁σ₂, q₂) ∈ Ê

The language L(A) of a finite state automaton A is defined as
  L(A) = {w | q_s ∈ S, q_f ∈ F, (q_s, w, q_f) ∈ Ê}

Finite state transition networks (FSTN)

Finite state transition networks are graphical descriptions of finite state machines:
• nodes represent the states
• start states are marked with a short arrow
• final states are indicated by a double circle
• arcs represent the transitions

Example for a finite state transition network

[Figure: network with states S0, S1, S2, S3 and arcs labelled a, b, c, b, b]

Regular expression specifying the language generated or accepted by the corresponding FSM: ab|cb+

Finite state transition tables

Finite state transition tables are an alternative, textual way of describing finite state machines:
• the rows represent the states
• start states are marked with a dot after their name
• final states with a colon
• the columns represent the alphabet
• the fields in the table encode the transitions

The example specified as finite state transition table

        a      b         c      d
S0.     S1               S2
S1             S3:
S2             S2,S3:
S3:

Some properties of finite state machines

• Recognition problem can be solved in time linear in the length of the input (independent of the size of the automaton).
• There is an algorithm to transform each automaton into a unique equivalent deterministic automaton with the least number of states.
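As a minimal sketch of how such a transition table can be put to work, the Python snippet below encodes the table for ab|cb+ as a dictionary and tracks the set of states the machine could be in after each symbol. The names EDGES, START, FINAL and accepts are illustrative assumptions, not part of the course material, and the sketch assumes an automaton without ε transitions.

```python
import re

# Transition table for ab|cb+ : (state, symbol) -> set of successor states.
# The encoding as a dict is an assumption made for this sketch.
EDGES = {
    ("S0", "a"): {"S1"},
    ("S0", "c"): {"S2"},
    ("S1", "b"): {"S3"},
    ("S2", "b"): {"S2", "S3"},  # two possible successors: non-deterministic
}
START = {"S0"}
FINAL = {"S3"}


def accepts(word, edges=EDGES, start=START, final=FINAL):
    """Return True if some path from a start state to a final state spells word."""
    current = set(start)
    for symbol in word:
        # Follow every applicable edge from every state we might be in.
        current = {q2 for q in current for q2 in edges.get((q, symbol), set())}
        if not current:
            return False  # no state left: the word cannot be accepted
    return bool(current & final)


for w in ["ab", "cb", "cbbb", "a", "abb", "c", ""]:
    # Cross-check the automaton against Python's regular expression engine.
    assert accepts(w) == bool(re.fullmatch(r"ab|cb+", w))
    print(repr(w), accepts(w))
```

Tracking a set of states rather than a single state is what lets the same loop handle the choice at S2, where b leads to both S2 and S3.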
Deterministic Finite State Automata

A finite state automaton is deterministic iff it has
• no ε transitions and
• for each state and each symbol there is at most one applicable transition.

Every non-deterministic automaton can be transformed into a deterministic one:
• Define new states representing a disjunction of old states for each non-determinacy which arises.
• Define arcs for these states corresponding to each transition which is defined in the non-deterministic automaton for one of the disjuncts in the new state names.

Example: Determinization of FSA

[Figure: two automata with states 1 to 6 and arcs labelled a, b, c, d, e; the second also contains the combined states {3,5}, {4,5} and {5,6}]

From Automata to Transducers

Needed: mechanism to keep track of path taken

A finite state transducer is a 6-tuple (Q, Σ₁, Σ₂, E, S, F) with
• Q a finite set of states
• Σ₁ a finite set of symbols, the input alphabet
• Σ₂ a finite set of symbols, the output alphabet
• S ⊆ Q the set of start states
• F ⊆ Q the set of final states
• E ⊆ Q × (Σ₁ ∪ {ε}) × Q × (Σ₂ ∪ {ε}) a set of edges

Transducers and determinization

A finite state transducer understood as consuming an input and producing an output cannot generally be determinized.

Example:
[Figure: transducer with arcs labelled a:b, b:b, a:b and a:c, c:c, a:c]

Summary

• Notations for characterizing regular languages:
  – Regular expressions
  – Finite state transition networks
  – Finite state transition tables
• Finite state machines and regular languages: Definitions and some properties
• Finite state transducers

Reading assignment 2

• Chapter 1 “Finite State Techniques” of course notes
• Chapter 2 “Regular expressions and automata” of Jurafsky and Martin (2000)
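The two determinization steps from the slide "Deterministic Finite State Automata" can be sketched in Python as follows. This is one possible rendering under stated assumptions: the input automaton has no ε transitions, the automaton used is the ab|cb+ network rather than the one in the determinization figure, and the names NFA_EDGES, determinize and accepts are hypothetical.

```python
from collections import deque

# Non-deterministic automaton for ab|cb+ : (state, symbol) -> successor states.
# The dict encoding is an assumption made for this sketch.
NFA_EDGES = {
    ("S0", "a"): {"S1"},
    ("S0", "c"): {"S2"},
    ("S1", "b"): {"S3"},
    ("S2", "b"): {"S2", "S3"},
}


def determinize(edges, start, final):
    """Build a deterministic automaton whose states are sets of old states."""
    alphabet = sorted({symbol for (_, symbol) in edges})
    start_state = frozenset(start)
    dfa_edges = {}
    seen = {start_state}
    queue = deque([start_state])
    while queue:
        current = queue.popleft()
        for symbol in alphabet:
            # New state: all old states reachable from any disjunct in current.
            target = frozenset(
                q2 for q in current for q2 in edges.get((q, symbol), ())
            )
            if not target:
                continue  # no transition defined, so no arc is added
            dfa_edges[(current, symbol)] = target
            if target not in seen:
                seen.add(target)
                queue.append(target)
    # A new state is final if at least one of its disjuncts was final.
    dfa_final = {state for state in seen if state & set(final)}
    return dfa_edges, start_state, dfa_final


def accepts(word, dfa_edges, start_state, dfa_final):
    """Run the deterministic automaton: at most one arc per state and symbol."""
    state = start_state
    for symbol in word:
        state = dfa_edges.get((state, symbol))
        if state is None:
            return False
    return state in dfa_final


dfa_edges, dfa_start, dfa_final = determinize(NFA_EDGES, {"S0"}, {"S3"})
for (source, symbol), target in dfa_edges.items():
    print(sorted(source), symbol, sorted(target))
```

In this example the only non-determinacy is at S2, so the construction introduces a single combined state {S2, S3}; every other new state contains exactly one old state.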