CSE 3341: Principles of Programming Languages Syntax Jeremy Morris 1
Syntax vs. Semantics Syntax: What kinds of symbols are allowed in a language? Semantics: What do the symbols in a language mean? 2
Language Terminology Alphabet – finite set of symbols. String – sequence of symbols. Language – set of strings over an alphabet. Grammar – rules that define which strings over an alphabet are in the language and which ones are not. 3
Terminology Example Consider the Java programming language. Alphabet – the tokens in the Java language: if, else, while, do, >, <, String, variable names, etc. Note: not the individual characters, so not your intuitive understanding of the term “alphabet”. String – a sequence of tokens from the alphabet. Language – the set of all syntactically correct Java programs. Grammar – the rules for producing syntactically correct Java programs. https://docs.oracle.com/javase/specs/jls/se8/html/index.html (It’s a nearly 800-page book; you don’t need to read it) 4
Language Terminology We typically talk about languages in mathematical terms, as sets. Alphabet – finite set of symbols, often denoted Σ. String – finite sequence of symbols. Empty string: ε – a sequence of length 0. Σ* – the set of all strings over Σ (including ε). The * represents the “Kleene closure”; we’ll discuss this more later. Σ+ – the set of all non-empty strings over Σ. The + represents “one or more” where the * represents “zero or more”. Language – set of strings: L ⊆ Σ*, defined by a grammar. It probably will not contain everything in Σ*. 5
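Σ* is infinite, but its shortest members are easy to list. The following is a minimal sketch, assuming a toy alphabet {a, b} and a hypothetical helper name `strings_up_to`; it enumerates every string over Σ up to a chosen length, starting with ε.

```python
from itertools import product

def strings_up_to(sigma, max_len):
    """Yield every string over the alphabet `sigma` with length 0..max_len."""
    for length in range(max_len + 1):
        # product(..., repeat=0) yields one empty tuple, which is the empty string
        for symbols in product(sorted(sigma), repeat=length):
            yield "".join(symbols)

sigma = {"a", "b"}  # assumed toy alphabet, not from the slides
print(list(strings_up_to(sigma, 2)))
# ['', 'a', 'b', 'aa', 'ab', 'ba', 'bb']
```

A language over this Σ is any subset of the full (unbounded) enumeration.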
Syntax - Specification We use syntax rules to specify the syntax of a language (Language – set of strings). Some rules for non-negative integers:
number → digit digit*
digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
With these we can specify any non-negative integer. 6
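The two rules above translate almost symbol for symbol into a regular expression. This is a small sketch using Python's `re` module; the names `DIGIT`, `NUMBER` and `is_number` are illustrative choices, not part of the slides.

```python
import re

DIGIT = "[0-9]"                # digit → 0 | 1 | ... | 9
NUMBER = DIGIT + DIGIT + "*"   # number → digit digit*

def is_number(s):
    """Return True if the whole string s can be produced by the number rule."""
    return re.fullmatch(NUMBER, s) is not None

print(is_number("655"))  # True
print(is_number(""))     # False: at least one digit is required
print(is_number("12a"))  # False: 'a' is not a digit
```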
Syntax Rule Terminology Terminal symbol – any symbol that represents a member of the alphabet for the language, i.e. any symbol that is in the set of all possible tokens for the language. Will only appear on the right hand side of a syntax rule (at least for our purposes; not strictly true). Non-terminal symbol – any symbol that represents a rule to be expanded. Non-terminal meaning “we need to keep going”. Can appear on either the left or the right hand side of a syntax rule. Meta-symbols – symbols used to write the rules, but not part of the alphabet or non-terminals: →, |, *, etc. 7
Terminology Example
number → digit digit*
digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Which of these are terminal symbols? Non-terminal? Meta? 8
Syntax – Types of Grammars Chomsky Hierarchy Outlines how complex formal languages are based on their rules Type-0 – Unrestricted (aka Recursively enumerable) Type-1 – Context-sensitive Type-2 – Context-free Type-3 – Regular We will focus on those last two 9
Regular Languages (aka Regular Expressions) The simplest kind of grammar. Requires only 3 kinds of rules: Concatenation – join two things together. Alternation – select between two choices. “Kleene closure” – repeat something zero or more times. No recursion is allowed. If we allow recursion, then we get Context-free grammars. 10
Regular Languages (aka Regular Expressions) Assume an alphabet Σ. A regular expression over Σ is: ∅ – the empty set. ε – the empty string. Any member a of Σ (denoting the set { a }). Concatenation – if R and S are both regular expressions over Σ, then so is RS, where RS = { rs | r ∈ R and s ∈ S }. Alternation – if R and S are both regular expressions over Σ, then so is R ∪ S, written as R|S: choose between R or S. “Kleene closure” – if R is a regular expression over Σ, then so is R*: R repeated 0 or more times, i.e. R concatenated with itself. 11
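The three operations can be tried out on small finite sets of strings. The sketch below treats a language as a Python set; the function names are illustrative, and `star` takes an assumed `max_reps` bound because the true Kleene closure is infinite whenever R contains a non-empty string.

```python
def concat(R, S):
    """RS = { rs | r in R and s in S }"""
    return {r + s for r in R for s in S}

def union(R, S):
    """R | S: every string that is in R or in S."""
    return R | S

def star(R, max_reps):
    """Approximate R* by allowing at most max_reps repetitions of R."""
    result = {""}   # zero repetitions gives the empty string
    layer = {""}
    for _ in range(max_reps):
        layer = concat(layer, R)   # one more copy of R on the end
        result |= layer
    return result

R = {"a"}
S = {"b", "c"}
print(sorted(concat(R, S)))  # ['ab', 'ac']
print(sorted(union(R, S)))   # ['a', 'b', 'c']
print(sorted(star(R, 3)))    # ['', 'a', 'aa', 'aaa']
```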
Regular Languages In syntax rules we can define a regular language like this:
number → digit digit*
digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Another way of saying: Σ = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 } and number = Σ Σ*, i.e. one symbol of Σ followed by zero or more symbols of Σ. (There might be a problem with this definition of a natural number – can you spot it?) 12
Regular Languages Another example (from the textbook): numeric constants
number → integer | real
integer → digit digit*
real → integer exp | decimal (exp | ε)
decimal → digit* (. digit | digit .) digit*
exp → (e | E) (+ | - | ε) integer
digit → 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 0
13
Derivations Using syntax rules we can derive strings that are in our language. Using the previous set of rules, can we show that 655 is in our language of “numeric constants”?
number ⇒ integer ⇒ digit digit* ⇒ 6 digit* ⇒ 6 5 digit* ⇒ 6 5 5 digit* ⇒ 6 5 5
(Each step replaces one symbol using a rule; digit* either produces one more digit or, in the last step, nothing.) 14
Derivations Example Using the numeric-constant rules above, determine if the following strings are in the language for numeric constants:
10e5
.65e30
.65e0.30
10.0e5.0
10.0e-5
15
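One way to check answers to the exercise is to transcribe the numeric-constant rules into a regular expression, rule by rule. This is a sketch, not the textbook's method; the variable names mirror the non-terminals and `is_numeric_constant` is a hypothetical helper.

```python
import re

digit = "[0-9]"
integer = f"{digit}{digit}*"                                  # integer → digit digit*
exp = f"(?:e|E)(?:\\+|-|){integer}"                           # exp → (e | E) (+ | - | ε) integer
decimal = f"{digit}*(?:\\.{digit}|{digit}\\.){digit}*"        # decimal → digit* (. digit | digit .) digit*
real = f"(?:{integer}{exp}|{decimal}(?:{exp}|))"              # real → integer exp | decimal (exp | ε)
number = f"(?:{integer}|{real})"                              # number → integer | real

def is_numeric_constant(s):
    """Return True if the whole string s is derivable from `number`."""
    return re.fullmatch(number, s) is not None

# Work the derivations by hand first, then run this to compare.
for candidate in ["10e5", ".65e30", ".65e0.30", "10.0e5.0", "10.0e-5"]:
    print(candidate, is_numeric_constant(candidate))
```

Because no rule refers to itself, the substitution terminates and the result is an ordinary regular expression, which is exactly what makes this language regular.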
Context-Free Languages The Chomsky Hierarchy mentioned above is a containment hierarchy: all Regular Languages are also Context-Free, but not all Context-Free Languages are Regular. Consider the language L = { a^n b^n | n ≥ 0 }. The empty string, ab, aabb, aaabbb, etc. are all in this language; aabbb, aaabb, a, etc. are not. Can we derive the rules for this language using only the rules set out for regular languages? No, as it turns out. You can prove this mathematically using a theorem known as the pumping lemma, but that’s outside the scope of this class (see CSE 3321 – Formal Languages and Automata). But if we allow recursion we can do it easily. 16
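With recursion, one possible grammar for this language is S → a S b | ε (the non-terminal name S is an assumption here, not from the slides). A recognizer can follow that rule directly, as in this short sketch:

```python
def in_anbn(s):
    """Recognise { a^n b^n | n >= 0 } by following S → a S b | ε."""
    if s == "":
        return True                # S → ε
    if len(s) >= 2 and s[0] == "a" and s[-1] == "b":
        return in_anbn(s[1:-1])    # S → a S b: strip one 'a' and one 'b'
    return False

print([w for w in ["", "ab", "aabb", "aaabbb"] if in_anbn(w)])
# ['', 'ab', 'aabb', 'aaabbb']
print([w for w in ["aabbb", "aaabb", "a"] if in_anbn(w)])
# []
```

The recursive call is what keeps the count of a's and b's matched, which concatenation, alternation and Kleene closure alone cannot do.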
Context-Free Grammars (CFGs) A grammar that defines a Context-Free language has the same properties as a Regular grammar (Concatenation, Alternation, Kleene Closure) but allows for recursion in its rules. Either immediate recursion – the same non-terminal on both the left and right hand side of one rule (we’ll see an example of this on the next slide). Or mutual recursion – a non-terminal on the left expands to a rule that eventually expands to that same non-terminal (we’ll see an example of this in a moment – hang in there). 17
Context-Free Grammars (CFGs) The following grammar is not Regular, but is Context-Free:
expr → number | expr op expr | ( expr )
op → + | - | / | *
number → digit digit*
digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Note the recursion in the rule for expanding expr. This grammar is problematic… Let’s derive 1+3*2 using these rules. 18
Context-Free Grammars We can represent a derivation graphically as a parse tree or syntax tree. The root of the tree is the start symbol for the grammar. The internal nodes are non-terminal symbols. The leaf nodes are terminal symbols.
[Figure: parse tree for 1+3*2, children indented under their parent]
expr
  expr
    number
      1
  op
    +
  expr
    expr
      number
        3
    op
      *
    expr
      number
        2
19
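Context-Free Grammars Consider these two trees, both derived from the above expr grammar:
[Figure: two parse trees for 1+3*2, children indented under their parent]
Tree 1 (1+3 grouped first):
expr
  expr
    expr
      number
        1
    op
      +
    expr
      number
        3
  op
    *
  expr
    number
      2
Tree 2 (3*2 grouped first):
expr
  expr
    number
      1
  op
    +
  expr
    expr
      number
        3
    op
      *
    expr
      number
        2
20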
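The two trees matter because they lead to different results once the expression is evaluated. This sketch encodes each tree as nested tuples (an assumed representation that keeps only the operands and operators) and evaluates both.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}

def evaluate(tree):
    """Evaluate a tree given as an int leaf or a (left, op, right) tuple."""
    if isinstance(tree, int):
        return tree
    left, op, right = tree
    return OPS[op](evaluate(left), evaluate(right))

tree_1 = ((1, "+", 3), "*", 2)   # 1+3 grouped first
tree_2 = (1, "+", (3, "*", 2))   # 3*2 grouped first

print(evaluate(tree_1))  # 8
print(evaluate(tree_2))  # 7
```

One string with two parse trees, and therefore two possible meanings, is what makes the grammar ambiguous.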
Context-Free Grammars A better, unambiguous grammar:
expr → term | expr add_op term
term → factor | term mult_op factor
factor → number | ( expr )
mult_op → * | /
add_op → + | -
number → digit digit*
digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Still not Regular, but Context-Free. Recursion is still there. 21
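The unambiguous grammar maps naturally onto one function per non-terminal. The following is a sketch of a recursive-descent evaluator under two stated assumptions: the left-recursive rules (expr → expr add_op term, term → term mult_op factor) are rewritten as loops, which accepts the same strings and keeps left-to-right grouping, and the `tokenize` helper is a simplification that skips any character it does not recognise.

```python
import re

def tokenize(text):
    # Assumed helper: digit runs, operators and parentheses become tokens;
    # anything else (such as whitespace) is silently skipped.
    return re.findall(r"\d+|[-+*/()]", text)

def parse_expr(tokens, pos):
    # expr → term | expr add_op term, written as: term (add_op term)*
    value, pos = parse_term(tokens, pos)
    while pos < len(tokens) and tokens[pos] in ("+", "-"):
        op = tokens[pos]
        right, pos = parse_term(tokens, pos + 1)
        value = value + right if op == "+" else value - right
    return value, pos

def parse_term(tokens, pos):
    # term → factor | term mult_op factor, written as: factor (mult_op factor)*
    value, pos = parse_factor(tokens, pos)
    while pos < len(tokens) and tokens[pos] in ("*", "/"):
        op = tokens[pos]
        right, pos = parse_factor(tokens, pos + 1)
        value = value * right if op == "*" else value / right
    return value, pos

def parse_factor(tokens, pos):
    # factor → number | ( expr )
    if pos >= len(tokens):
        raise ValueError("unexpected end of input")
    if tokens[pos] == "(":
        value, pos = parse_expr(tokens, pos + 1)
        if pos >= len(tokens) or tokens[pos] != ")":
            raise ValueError("expected ')'")
        return value, pos + 1
    if not tokens[pos].isdigit():
        raise ValueError(f"unexpected token {tokens[pos]!r}")
    return int(tokens[pos]), pos + 1

def evaluate(text):
    tokens = tokenize(text)
    value, pos = parse_expr(tokens, 0)
    if pos != len(tokens):
        raise ValueError(f"unexpected token {tokens[pos]!r}")
    return value

print(evaluate("1+3*2"))    # 7
print(evaluate("(1+3)*2"))  # 8
```

Because term sits below expr in the grammar, multiplication is always grouped before addition, so 1+3*2 has only one parse.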
Languages in Compilers & Interpreters Stream of Characters → Tokenizer/Scanner → Stream of tokens → Parser → Parse Tree → Next Steps 22
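The first stage of this pipeline, turning characters into tokens, can be sketched with regular expressions alone. The token names and the `scan` function below are illustrative assumptions for the arithmetic language used in these slides, not a real compiler's scanner.

```python
import re

# Assumed token kinds for the arithmetic examples; each is a regular language.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("OP",     r"[-+*/]"),
    ("LPAREN", r"\("),
    ("RPAREN", r"\)"),
    ("SKIP",   r"\s+"),   # whitespace separates tokens but is not a token itself
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def scan(text):
    """Turn a stream of characters into a list of (kind, text) tokens."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
        if match.lastgroup != "SKIP":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens

print(scan("12 + 3*(4)"))
# [('NUMBER', '12'), ('OP', '+'), ('NUMBER', '3'), ('OP', '*'),
#  ('LPAREN', '('), ('NUMBER', '4'), ('RPAREN', ')')]
```

The parser then works on this token list rather than on individual characters, which matches the earlier point that a language's alphabet is its set of tokens.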
Syntax - Specification The previous syntax rules are one type of notation for a syntax.
number → digit digit*
digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Here’s another:
<number> ::= <digit> | <digit> <number>
<digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Backus-Naur Form (aka Backus normal form aka BNF). Note that pure BNF does not use Kleene-star or Kleene-plus. Some extensions provide shorthand for these, but leaving them out does not reduce expressiveness: the recursive <number> rule above shows how to replace a Kleene star. 23
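The recursive BNF rule for <number> can be read as a recursive function with one branch per alternative. This sketch (the name `is_number_bnf` is hypothetical) accepts the same strings as digit digit* without using any repetition operator.

```python
DIGITS = set("0123456789")   # <digit> ::= 0 | 1 | ... | 9

def is_number_bnf(s):
    """<number> ::= <digit> | <digit> <number>"""
    if len(s) == 0 or s[0] not in DIGITS:
        return False                 # every alternative starts with a <digit>
    if len(s) == 1:
        return True                  # <number> ::= <digit>
    return is_number_bnf(s[1:])      # <number> ::= <digit> <number>

print(is_number_bnf("655"))  # True
print(is_number_bnf(""))     # False
print(is_number_bnf("6a5"))  # False
```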
BNF Specification
<number> ::= <digit> | <digit> <number>
<digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Special symbols: <, >, | and ::= are reserved (or ‘meta’) symbols. Non-terminals: wrapped in <> tags (<digit> or <number>); they indicate rules that need to be expanded. Terminals: not wrapped in <> tags; they indicate “terminal” symbols, with no more expansion. 24