

  1. Towards more complex grammar systems
     Some basic formal language theory
     Detmar Meurers: Intro to Computational Linguistics I
     OSU, LING 684.01, 15. January 2004

     Overview
     • Grammars, or: how to specify linguistic knowledge
     • Automata, or: how to process with linguistic knowledge
     • Levels of complexity in grammars and automata:
       → The Chomsky hierarchy

     Grammars
     A grammar is a 4-tuple (N, Σ, S, P) where
     • N is a finite set of non-terminals
     • Σ is a finite set of terminal symbols, with N ∩ Σ = ∅
     • S is a distinguished start symbol, with S ∈ N
     • P is a finite set of rewrite rules of the form α → β, with α, β ∈ (N ∪ Σ)* and α including at least one non-terminal symbol.

     A simple example
     N = {S, NP, VP, V_i, V_t, V_s}
     Σ = {John, Mary, laughs, loves, thinks}
     S = S
     P = { S → NP VP,
           NP → John,
           NP → Mary,
           VP → V_i,
           VP → V_t NP,
           VP → V_s S,
           V_i → laughs,
           V_t → loves,
           V_s → thinks }
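
The conditions in the 4-tuple definition can be checked mechanically. The following is a minimal sketch, not part of the original slides: the tuple encoding of rules, the names `Vi`, `Vt`, `Vs` (standing in for V_i, V_t, V_s) and the helper `is_grammar` are all assumptions made for illustration.

```python
# Hypothetical encoding of the simple example grammar as a 4-tuple (N, Sigma, S, P).
N = {"S", "NP", "VP", "Vi", "Vt", "Vs"}
SIGMA = {"John", "Mary", "laughs", "loves", "thinks"}
START = "S"
# Each rule is a pair (left-hand side, right-hand side) of symbol tuples.
P = [
    (("S",), ("NP", "VP")),
    (("NP",), ("John",)),
    (("NP",), ("Mary",)),
    (("VP",), ("Vi",)),
    (("VP",), ("Vt", "NP")),
    (("VP",), ("Vs", "S")),
    (("Vi",), ("laughs",)),
    (("Vt",), ("loves",)),
    (("Vs",), ("thinks",)),
]


def is_grammar(nonterminals, terminals, start, rules):
    """Check the conditions of the 4-tuple definition."""
    symbols = nonterminals | terminals
    return (
        nonterminals.isdisjoint(terminals)      # N and Sigma share no symbol
        and start in nonterminals               # S is a non-terminal
        and all(
            set(lhs) <= symbols                 # alpha is over N and Sigma
            and set(rhs) <= symbols             # beta is over N and Sigma
            and any(sym in nonterminals for sym in lhs)  # alpha has a non-terminal
            for lhs, rhs in rules
        )
    )


print(is_grammar(N, SIGMA, START, P))
```

Sets make the disjointness and membership conditions one-liners; tuples keep the order of symbols on each side of a rule.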

  2. How does a grammar define a language?
     Assume α, β ∈ (N ∪ Σ)*, with α containing at least one non-terminal.
     • A sentential form for a grammar G is defined as:
       − The start symbol S of G is a sentential form.
       − If αβγ is a sentential form and there is a rewrite rule β → δ then αδγ is a sentential form.
     • α (directly or immediately) derives β if α → β ∈ P. One writes:
       − α ⇒* β if β is derived from α in zero or more steps
       − α ⇒+ β if β is derived from α in one or more steps
     • A sentence is a sentential form consisting only of terminal symbols.
     • The language L(G) generated by the grammar G is the set of all sentences which can be derived from the start symbol S, i.e., L(G) = { γ ∈ Σ* | S ⇒* γ }

     Processing with grammars: automata
     An automaton in general has three components:
     • an input tape, divided into squares with a read-write head positioned over one of the squares
     • an auxiliary memory characterized by two functions
       − fetch: memory configuration → symbols
       − store: memory configuration × symbol → memory configuration
     • and a finite-state control relating the two components.

     Different levels of complexity in grammars and automata
     Let A, B ∈ N, x ∈ Σ, α, β, γ ∈ (N ∪ Σ)*, and δ ∈ (N ∪ Σ)+, then:

     Type  Automaton             Grammar
           Memory     Name       Rule             Name
     0     Unbounded  TM         α → β            General rewrite
     1     Bounded    LBA        β A γ → β δ γ    Context-sensitive
     2     Stack      PDA        A → β            Context-free
     3     None       FSA        A → xB, A → x    Right linear

     Abbreviations:
     – TM: Turing Machine
     – LBA: Linear-Bounded Automaton
     – PDA: Push-Down Automaton
     – FSA: Finite-State Automaton

     Type 3: Right-Linear Grammars and FSAs
     A right-linear grammar is a 4-tuple (N, Σ, S, P) with P a finite set of rewrite rules of the form α → β, with α ∈ N and β ∈ { γδ | γ ∈ Σ*, δ ∈ N ∪ {ε} }, i.e.:
     − left-hand side of rule: a single non-terminal, and
     − right-hand side of rule: a string containing at most one non-terminal, as the rightmost symbol
     Right-linear grammars are formally equivalent to left-linear grammars.
     A finite-state automaton consists of
     – a tape
     – a finite-state control
     – no auxiliary memory
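
The definition of sentential forms and of L(G) suggests a direct, if naive, way to enumerate sentences: start from S and keep rewriting. The sketch below is an illustration, not the procedure of the slides. It assumes that no rule shortens a sentential form (true of the simple example grammar, which has no ε rules), so that forms longer than a chosen bound can be discarded; the function name `sentences` and the names `Vi`, `Vt`, `Vs` are hypothetical.

```python
from collections import deque

# The simple example grammar; V_i, V_t, V_s are written Vi, Vt, Vs here.
NONTERMINALS = {"S", "NP", "VP", "Vi", "Vt", "Vs"}
RULES = [
    (("S",), ("NP", "VP")),
    (("NP",), ("John",)),
    (("NP",), ("Mary",)),
    (("VP",), ("Vi",)),
    (("VP",), ("Vt", "NP")),
    (("VP",), ("Vs", "S")),
    (("Vi",), ("laughs",)),
    (("Vt",), ("loves",)),
    (("Vs",), ("thinks",)),
]


def sentences(nonterminals, start, rules, max_len):
    """Enumerate the sentences of at most max_len symbols derivable from start.

    Sentential forms are explored breadth-first. Pruning by length is only
    sound under the assumption that no rule shortens a form.
    """
    seen = {(start,)}
    queue = deque([(start,)])
    found = set()
    while queue:
        form = queue.popleft()
        if not any(sym in nonterminals for sym in form):
            found.add(" ".join(form))  # only terminals left: a sentence
            continue
        for lhs, rhs in rules:
            k = len(lhs)
            for i in range(len(form) - k + 1):
                if form[i:i + k] == lhs:
                    # alpha beta gamma => alpha delta gamma
                    new = form[:i] + rhs + form[i + k:]
                    if len(new) <= max_len and new not in seen:
                        seen.add(new)
                        queue.append(new)
    return found


RESULT = sentences(NONTERMINALS, "S", RULES, max_len=3)
print(sorted(RESULT))
```

With a bound of three symbols, this finds the two intransitive and four transitive sentences; the `thinks` rule only yields sentences of four or more words.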

  3. A regular language example: (ab|c)ab*(a|cb)?
     Right-linear grammar:
     N = {Expr, X, Y, Z}
     Σ = {a, b, c}
     S = Expr
     P = { Expr → ab X,
           Expr → c X,
           X → a Y,
           Y → b Y,
           Y → Z,
           Z → a,
           Z → cb,
           Z → ε }
     Finite-state transition network:
     [Figure: finite-state transition network for this language, with states 0 to 5 and arcs labelled a, b and c]

     A context-free language example: a^n b^n
     Context-free grammar:
     N = {S}
     Σ = {a, b}
     S = S
     P = { S → a S b,
           S → ε }
     Push-down automaton:
     [Figure: two states, 0 and 1, connected by an ε-transition; state 0 loops on a (push x), state 1 loops on b (pop x)]

     Type 2: Context-Free Grammars and Push-Down Automata
     A context-free grammar is a 4-tuple (N, Σ, S, P) with P a finite set of rewrite rules of the form α → β, with α ∈ N and β ∈ (Σ ∪ N)*, i.e.:
     − left-hand side of rule: a single non-terminal, and
     − right-hand side of rule: a string of terminals and/or non-terminals
     A push-down automaton is a
     − finite state automaton, with a
     − stack as auxiliary memory

     Thinking about regular languages
     − A language is regular iff one can define an FSM (or regular expression) for it.
     − An FSM only has a fixed amount of memory, namely the number of states.
     − Strings longer than the number of states, and hence all but finitely many strings of an infinite language, must result from a loop in the FSM.
     − Pumping Lemma: if a language contains arbitrarily long strings for which there is no such loop, i.e. no part that can be repeated any number of times, the language cannot be regular.
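
The push-down automaton for a^n b^n can be simulated in a few lines. This is a minimal sketch under stated assumptions: the automaton diagram does not survive in text form, so the acceptance condition used here (input fully consumed and stack empty) and the function name `accepts_anbn` are assumptions rather than something the slides spell out.

```python
def accepts_anbn(string):
    """Simulate a two-state push-down automaton for a^n b^n (n >= 0)."""
    state = 0
    stack = []
    for symbol in string:
        if state == 0 and symbol == "a":
            stack.append("x")      # state 0: read a, push x
            continue
        if state == 0:
            state = 1              # epsilon move from state 0 to state 1
        if symbol == "b" and stack:
            stack.pop()            # state 1: read b, pop x
        else:
            return False           # no transition applies
    # Assumed acceptance condition: input consumed and stack empty
    # (the epsilon move to state 1 is always available at this point).
    return not stack


for candidate in ["", "ab", "aabb", "aab", "abb", "ba", "abab"]:
    print(repr(candidate), accepts_anbn(candidate))
```

The stack is what a finite-state automaton lacks: it counts the a's without fixing an upper bound on n in advance.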

  4. Type 1: Context-Sensitive Grammars and Linear-Bounded Automata
     A rule of a context-sensitive grammar
     – rewrites at most one non-terminal from the left-hand side, and
     – has a right-hand side at least as long as its left-hand side, i.e. the grammar only contains rules of the form α → β with |α| ≤ |β|, and optionally S → ε with the start symbol S not occurring in any β.
     A linear-bounded automaton is a
     – finite state automaton, with an
     – auxiliary memory which cannot exceed the length of the input string.

     A context-sensitive language example: a^n b^n c^n
     Context-sensitive grammar:
     N = {S, B, C}
     Σ = {a, b, c}
     S = S
     P = { S → a S B C,
           S → a b C,
           b B → b b,
           b C → b c,
           c C → c c,
           C B → B C }

     Type 0: General Rewrite Grammar and Turing Machines
     • In a general rewrite grammar there are no restrictions on the form of a rewrite rule.
     • A Turing machine has an unbounded auxiliary memory.
     • Any language for which there is a recognition procedure can be defined, but the recognition problem is not decidable.

     Properties of different language classes
     Languages are sets of strings, so that one can apply set operations to languages and investigate the results for particular language classes.
     Some closure properties:
     − All language classes are closed under union with themselves.
     − All language classes are closed under intersection with regular languages.
     − The class of context-free languages is not closed under intersection with itself.
       Proof: The intersection of the two context-free languages L1 and L2 is not context free:
       − L1 = { a^n b^n c^i | n ≥ 1 and i ≥ 0 }
       − L2 = { a^j b^n c^n | n ≥ 1 and j ≥ 0 }
       − L1 ∩ L2 = { a^n b^n c^n | n ≥ 1 }
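
To see how the context-sensitive rules cooperate, it helps to trace one derivation, here of aabbcc. The sketch below is illustrative only: it encodes each symbol as a single character, rewrites the leftmost occurrence of a rule's left-hand side, and uses a rule order chosen by hand; `rewrite`, `derive` and `STEPS` are hypothetical names.

```python
def rewrite(form, lhs, rhs):
    """Rewrite the leftmost occurrence of lhs in the sentential form."""
    if lhs not in form:
        raise ValueError(f"rule {lhs} -> {rhs} does not apply to {form}")
    return form.replace(lhs, rhs, 1)


def derive(start, steps):
    """Apply the rules in order and return every sentential form visited."""
    forms = [start]
    for lhs, rhs in steps:
        forms.append(rewrite(forms[-1], lhs, rhs))
    return forms


# One hand-picked sequence of rule applications, all taken from P.
STEPS = [
    ("S", "aSBC"),   # S   -> a S B C
    ("S", "abC"),    # S   -> a b C
    ("CB", "BC"),    # C B -> B C
    ("bB", "bb"),    # b B -> b b
    ("bC", "bc"),    # b C -> b c
    ("cC", "cc"),    # c C -> c c
]

DERIVATION = derive("S", STEPS)
print(" => ".join(DERIVATION))
```

The rule C B → B C does the sorting: it moves every B in front of every C before the remaining rules turn them into terminals from left to right.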

  5. Criteria under which to evaluate grammar formalisms
     There are three kinds of criteria:
     – linguistic naturalness
     – mathematical power
     – computational effectiveness and efficiency
     The weaker the type of grammar:
     – the stronger the claim made about possible languages
     – the greater the potential efficiency of the parsing procedure
     Reasons for choosing a stronger grammar class:
     – to capture the empirical reality of actual languages
     – to provide for elegant analyses capturing more generalizations (→ more “compact” grammars)

     Language classes and natural languages
     Natural languages are not regular
     (1) a. The mouse escaped.
         b. The mouse that the cat chased escaped.
         c. The mouse that the cat that the dog saw chased escaped.
         d. ...
     (2) a. aa
         b. abba
         c. abccba
         d. ...
     Center-embedding of arbitrary depth needs to be captured to capture language competence
     → Not possible with a finite state automaton.

     Language classes and natural languages (cont.)
     • Any finite language is a regular language.
     • The argument that natural languages are not regular relies on competence as an idealization, not performance.
     • Note that even if English were regular, a context-free grammar characterization could be preferable on the grounds that it is more transparent than one using only finite-state methods.

     Accounting for the facts vs. linguistically sensible analyses
     Looking at grammars from a linguistic perspective, one can distinguish their
     − weak generative capacity, considering only the set of strings generated by a grammar
     − strong generative capacity, considering the set of strings and their syntactic analyses generated by a grammar
     Two grammars can be strongly or weakly equivalent.
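
The parallel between examples (1) and (2) is that nouns and verbs pair up from the outside in, like the mirrored letters of abccba. The sketch below makes that pairing explicit by generating the sentences in (1) from a list of (noun, verb) pairs; the function `center_embed` and its input format are assumptions made for illustration.

```python
def center_embed(pairs):
    """Build a center-embedded sentence from (noun, verb) pairs.

    pairs[0] is the outermost clause. The nouns appear in order, the verbs
    in reverse order, so each verb matches the noun at the mirrored position.
    """
    nouns = [noun for noun, _ in pairs]
    verbs = [verb for _, verb in pairs]
    subject = " that ".join(f"the {noun}" for noun in nouns)
    predicate = " ".join(reversed(verbs))
    sentence = f"{subject} {predicate}."
    return sentence[0].upper() + sentence[1:]


PAIRS = [("mouse", "escaped"), ("cat", "chased"), ("dog", "saw")]
for depth in range(1, len(PAIRS) + 1):
    print(center_embed(PAIRS[:depth]))
```

Since the list of pairs can be arbitrarily long, the number of nouns still waiting for their verb is unbounded, which is the memory a finite-state automaton does not have.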

  6. Example for weakly equivalent grammars
     Example string:
     if x then if y then a else b
     Grammar 1:
     P = { S → if T then S else S,
           S → if T then S,
           S → a,
           S → b,
           T → x,
           T → y }

     First analysis:
     [S if [T x] then [S if [T y] then [S a]] else [S b]]
     Second analysis:
     [S if [T x] then [S if [T y] then [S a] else [S b]]]

     Grammar 2 rules:
     A weakly equivalent grammar eliminating the ambiguity (only licenses second structure).
     P = { S1 → if T then S1,
           S1 → if T then S2 else S1,
           S1 → a,
           S1 → b,
           S2 → if T then S2 else S2,
           S2 → a,
           S2 → b,
           T → x,
           T → y }

     Reading assignment
     • Chapter 2 “Basic Formal Language Theory” of our Lecture Notes
     • Chapter 3 “Formal Languages and Natural Languages” of our Lecture Notes
     • Chapter 13 “Language and complexity” of Jurafsky and Martin (2000)
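
The claim that Grammar 1 is ambiguous for the example string while Grammar 2 is not can be checked by counting analyses. The following is a small sketch, not the method of the slides: it counts parse trees by trying every split of the token sequence, and it assumes a grammar without ε rules and without left recursion (true of both grammars here). The names `count_parses`, `GRAMMAR_1` and `GRAMMAR_2` are hypothetical.

```python
from functools import lru_cache

GRAMMAR_1 = {
    "S": [("if", "T", "then", "S", "else", "S"),
          ("if", "T", "then", "S"),
          ("a",), ("b",)],
    "T": [("x",), ("y",)],
}
GRAMMAR_2 = {
    "S1": [("if", "T", "then", "S1"),
           ("if", "T", "then", "S2", "else", "S1"),
           ("a",), ("b",)],
    "S2": [("if", "T", "then", "S2", "else", "S2"),
           ("a",), ("b",)],
    "T": [("x",), ("y",)],
}


def count_parses(grammar, start, tokens):
    """Count the parse trees of tokens rooted in start.

    Assumes no epsilon rules and no left recursion, so every symbol
    covers at least one token and the recursion terminates.
    """
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def symbol(sym, i, j):
        if sym not in grammar:  # terminal symbol
            return int(j == i + 1 and tokens[i] == sym)
        return sum(sequence(rhs, i, j) for rhs in grammar[sym])

    @lru_cache(maxsize=None)
    def sequence(rhs, i, j):
        if not rhs:
            return int(i == j)
        # The first symbol covers tokens[i:k], the rest covers tokens[k:j].
        return sum(symbol(rhs[0], i, k) * sequence(rhs[1:], k, j)
                   for k in range(i + 1, j + 1))

    return symbol(start, 0, len(tokens))


EXAMPLE = "if x then if y then a else b".split()
print(count_parses(GRAMMAR_1, "S", EXAMPLE))
print(count_parses(GRAMMAR_2, "S1", EXAMPLE))
```

Both grammars accept the example string, which is what weak equivalence is about; they differ in how many structures they assign to it.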
