CPSC 313: Chomsky Normal Form October 25, 2020 We want a simple standard format to describe the productions motivation of a grammar so that we can do proofs and constructions more readily on the grammar. We now must show that all grammars can be converted into this normal form. In Chomsky Normal Form (CNF) all productions are of one of the following three types: A → BC ( A, B, C ∈ V ) A → a ( A ∈ V and a ∈ Σ) S → ǫ (where S ∈ V and is the start symbol) S does not occur on the right hand side of any production. (( T, α ) ∈ P ⇒ n S ( α ) = 0) • remove S from the right side of productions • remove epsilon productions not from the start symbol ( A → ǫ and A � = S ) • remove mixing terminals and non-terminals on right hand side ( A → abCD ) • remove more than 1 alphabet symbols on right hand side ( A → aA ) • remove more than 2 variables in its righthand side ( A → ABC ) • remove productions of length one that are to variables ( A → B ) That is, remove any productions that violate the rules. To remove the epsilon transitions we just think about what variables can derive epsilon, and replace them by omiting them in productions that create that variable. For example if the variable A ⇒ ⋆ ǫ , and we have productions like B → ABC then we add production B → BC . We need to include S → ǫ if S ⇒ ⋆ ǫ as the base case. For the unit productions ( A → B ) we just allow A to derive everything B can derive. So if we had A → B and B → CD | aa we just add A → CD and A → aa and remove A → B To remove alphabet symbols and variables in our rules we simply promote symbols to a variable and include a rule from that variable to the symbol as its only derivation. If we had A → aB we can add a new rule X a → a and replace A → aB with A → X a B
Finally we need to get rid of productions that have more than 2 variables on the right hand side (all that we still need to deal with, that is, these are the only non-CNF productions that remain in our grammar) and we do this by chaining them as follows: A → BCDE we replace this with: A → BA 1 A 1 → CA 2 A 2 → DE Removing epsilon productions 1. determine the nullable variable set (variables that can derive ǫ A ⇒ ⋆ ǫ ): (a) let T be the set of all variables A such that A → ǫ is a production in our grammar. (base case) (b) for all right hand sides of a production B → x 1 x 2 . . . x r , if all x i are nullable then B is nullable. Add B to our set T of nullable variables. (recursive case) (c) continues until no new variables can be added to our nullable set. 2. Add new productions where any nullable variables A is replaced by either A or ǫ in all possible ways. For example, suppose that C, D are nullable and A, B are not and we have a production of the form H → CADB then we would add the following productions to our grammar H → ADB and H → CAB and H → AB to accomodate the possibilities of the variables going to the empty string. 3. Remove all productions of the form A → ǫ 4. If S is in our nullable set T then add the production S → ǫ Question: Why is it insufficent to write “ remove all productions A → ǫ except for S ”? An example: C → DE D → FG E → ǫ F → ǫ G → ǫ Round 1: T = { E, F, G } Round 2: T = { D, E, F, G } Round 3: T = { C, D, E, F, G }
Remove Unit Productions A unit production is A → B or more generally A ⇒ ⋆ B 1. remove all ǫ productions 2. for each pair of variables ( A, B ) such that A ⇒ ⋆ B and B → α is a production ( α ∈ ( V ∪ Σ) ⋆ ) we add the production A → α to our grammar 3. remove all unit productions If B can derive some sentinels forms and A can derive B in some number of steps, we allow A to derive all sentinel forms that B can derive directly instead of requiring that A first turn into B through an effective unit production and instead derive the things that B can derive directly as rules in our grammar. After we remove all rules of the form A → B we no longer have any more rules in our grammar of the form A → α where | α | = 1 and α ∈ V ⋆ An algorithm to find pairs ( A, B ) such that A ⇒ ⋆ B is the following: For each variable A create a set T A := { A } Then, we do the following: for all productions of the form X → Y where X ∈ T A then we add Y to T A Example of this algorithm: A → BF | D | a B → CA | b | E C → c D → B | AC | d E → CC F → E | f Initially, we create the following sets: T A = { A } T B = { B } T C = { C } T D = { D } T E = { E } T F = { F } then after running through the set of rules in our grammar once, we get the following for our sets: T A = { A, D } T B = { B, E } T C = { C } T D = { D, B } T E = { E } T F = { F, E } We repeat this recursive step until we are not adding anything new to our sets. T A = { A, D, B } T B = { B, E } T C = { C }
T D = { D, B, E } T E = { E } T F = { F, E } We run it again: T A = { A, D, B, E } T B = { B, E } T C = { C } T D = { D, B, E } T E = { E } T F = { F, E } We then run it again, notice that we don’t change any of the sets, and terminate. Now we have a bunch of sets T X such that for all Y ∈ T X it is the case that X ⇒ ⋆ Y . Note that this corresponds to a unit production since both X and Y are variables. Then we can continue with the algorithm to explicitly add rules for X corresponding to what Y can derive. At this point we have only productions Promoting Symbols to Variables in our grammar of the form: S → ǫ A → a A → α where | α | ≥ 2 ∧ α ∈ ( V ∪ Σ) ⋆ We want to ensure that instead all such productions where | α | ≥ 2 are strings over V ⋆ and not over ( V ∪ Σ) ⋆ We make a new variable for each symbol that appears on the right hand side and a production for new variable to derive the symbol. if a ∈ Σ and we see that α = βaγ then we add a new variable X a to our grammar with a production X a → a as its only production and then we rewrite the production βaγ as βX a γ After we do this for all terminal symbols on the right hand side of productions with length greater or equal to 2, we now have that all such productions are strings over variables alone, not over alphabet symbols. A → BCDE we effectively replace it with Chain long forms into pairs a bunch of variables and productions, each with length two on the right hand side, and each effectively continuing along the chain. A → BX 1 X 1 → CX 2 X 2 → DE Seen as parse trees, the difference is: A B C D E is the original rule and
A B X 1 C X 2 D E is the chained varient After doing this step, we end up with a grammar that has only productions involving variables having length 2. As a result, all productions now are compliant to the format of CNF and at no step in our process did we change the language that the grammar produced. Therefore, we have taken an arbitrary grammar and converted it into CNF.
Recommend
More recommend