SLIDE 1

Taaltheorie en Taalverwerking

BSc Artificial Intelligence

Raquel Fernández Institute for Logic, Language, and Computation

Winter 2012, lecture 1a

Raquel Fernández TtTv 2012 - lecture 1a 1 / 27

SLIDE 2

TTTV: Practical Matters

  • Lecturer:

∗ Raquel Fernández (raquel.fernandez@uva.nl)
∗ Office: C3.131. Office hours: Mondays 11-13h (by appointment)

  • Tutors: meet them today at the practical session

∗ Sharon Gieske
∗ Elise Koster
∗ Tim van Rossum
∗ Kirsten Teulen

  • Timetable:

∗ We have a slightly irregular timetable. Please check the online schedule for each week (time slots, rooms, etc.).

SLIDE 3

TTTV: Practical Matters

Every week:

  • Two lectures (a, b)

∗ slides online on Blackboard
∗ readings that need to be done before the lecture

  • Two practical sessions

∗ two groups in different rooms with 2 tutors per group
∗ some sessions will take place in a computer lab

  • Homework exercises

∗ one deadline per week
∗ use practical sessions to resolve doubts about homework

  • Materials

∗ see Course Information on Blackboard, where for each week you’ll find the learning objectives, readings, homework, deadlines, etc.

SLIDE 4

Taaltheorie en Taalverwerking 2012

Schedule for Part 1 of the course

Legend: Lectures; Practical Sessions; Recommended self-study times (for reading, studying, homework, etc.); Time reserved for other courses (e.g. Linear Algebra).

Week 1 (Mon-Fri)

  • 13-15h: Lecture 1a, Practical
  • 15-17h: Practical, Lecture 1b
  • Fri 19h: submission deadline for the HW on regular expressions and finite-state methods

Week 2 (Mon-Fri)

  • 11-13h: Lecture 2a
  • 13-15h: Practical, Practical
  • 15-17h: Lecture 2b
  • Fri 19h: submission deadline for the HW on formal language theory, syntax and DCGs in Prolog

Week 3 (Mon-Fri)

  • 13-15h: Lecture 3a, Lecture 3b
  • 15-17h: Practical
  • 17-19h: Practical
  • Thu 17h: submission deadline for the HW on more advanced syntax and parsing

Week 4: Midterm exam (Deeltoets), Tue 13-15h

SLIDE 5

TTTV: Evaluation

  • Weekly homework exercises

20% of the final grade. Can be done in pairs. One deadline per week

  • Two exams

35% + 35% of the final grade. Must be done individually.

  • Presentation and webpage

10% of the final grade. Done in groups of 4-5 students. At the end of the course, each team will be expected to give a presentation about a natural language application of their choice and create a webpage summarizing their findings. More details will be provided in class.

SLIDE 6

TTTV: Evaluation

You will pass the course only if all the following conditions apply:

  • you have submitted all homework assignments and gotten a minimum average homework grade of 4.5.
  • you have taken the two exams and gotten a minimum grade of 4.5 for each of them.
  • you have participated in a presentation and your team has gotten a minimum grade of 4.5.

  • your overall weighted average grade is at least 5.5.
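
The pass conditions above, combined with the weights from the previous slide (20% homework, 35% + 35% exams, 10% presentation), can be sketched as a small calculation. This is only an illustration: the names `WEIGHTS`, `final_grade`, and `passes` are hypothetical, and it assumes the homework grade is supplied as a single average and that both exams and the presentation were actually taken.

```python
# Weights and thresholds taken from the evaluation slides; all names are illustrative.
WEIGHTS = {"homework": 0.20, "exam1": 0.35, "exam2": 0.35, "presentation": 0.10}
MIN_COMPONENT = 4.5   # minimum for the homework average, each exam, and the presentation
MIN_OVERALL = 5.5     # minimum overall weighted average


def final_grade(grades):
    """Return the weighted average of the four graded components."""
    return sum(WEIGHTS[name] * grades[name] for name in WEIGHTS)


def passes(grades, all_homework_submitted=True):
    """Check every pass condition listed on the slide."""
    if not all_homework_submitted:
        return False
    if any(grades[name] < MIN_COMPONENT for name in WEIGHTS):
        return False
    return final_grade(grades) >= MIN_OVERALL


example = {"homework": 7, "exam1": 6, "exam2": 6, "presentation": 8}
print(round(final_grade(example), 2), passes(example))
```

Note that a high average is not enough on its own: a single component below 4.5 fails the course even when the weighted average is above 5.5.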

SLIDE 10

What is this course about?

Taaltheorie ≈ (Theoretical/Formal) Linguistics

en

Taalverwerking ≈ Computational Linguistics ≈ Natural Language Processing ≈ Human Language Technology

SLIDE 13

What is this course about?

Main goals of the course:

  • to understand human linguistic abilities

∗ language is a cognitive ability that is exclusively human
∗ recall the Turing test

  • to emulate these abilities using computational models

∗ no need to be committed to simulating actual cognitive processes
∗ but cognitive modelling might also be an underlying aim

  • to study how these computational models can be used to get computers to perform useful tasks involving human language

∗ plenty of practical applications involving natural language processing, for instance: machine translation, email filtering, information retrieval, ...

SLIDE 15

Language and Communication

Diagram from Russell & Norvig (2003) Artificial Intelligence: a Modern Approach

SLIDE 17

What is this course about?

We’ll focus on the comprehension/hearer’s side.

  • First part of the course: structure

∗ formal language theory
∗ syntax
∗ parsing

  • Second part of the course: meaning

∗ compositional semantics
∗ lexical semantics
∗ pragmatics and dialogue

N.B.: The contents of the course are slightly different from previous years.

SLIDE 18

Related Courses in the AI Curriculum

We will build on knowledge and skills you have acquired during the first semester of the 1st year:

  • Logisch Programmeren en Zoektechnieken
  • Inleiding Logica

Other language-related courses in subsequent years:

  • 2nd year: Natuurlijke Taalmodellen en Interfaces
  • 3rd year: Discourse

SLIDE 19

Taaltheorie en Taalverwerking 2012

Raquel Fernández

Content and Overall Learning Objectives

The overall goal of this course is to introduce students to the fundamental topics in computational linguistics and to explain how linguistic knowledge can be used for natural language processing and other key problems in artificial intelligence, such as machine translation and conversational agents. By the end of the course, students should be able to:

  • 1. demonstrate an understanding of the basic concepts in formal language theory, by being able to define formal languages with formalisms and automata and to compare languages, automata, and grammars with different levels of complexity.
  • 2. analyse the syntactic structure of natural language sentences by means of formal grammars and implement some of those grammars in Prolog.
  • 3. describe and compare different parsing algorithms for syntactic processing.
  • 4. represent the meaning of natural language sentences with logic-based formulas and calculate those formulas in a systematic and compositional fashion, on paper and in Prolog.
  • 5. describe the main computational tasks associated with word meanings, including the disambiguation of word meanings in context and the computation of relations between words.
  • 6. describe the main computational challenges of modelling language interaction, including dialogue coherence and the automatic recognition of speech acts.
  • 7. demonstrate an understanding of the inner workings of key natural language applications, by being able to explain how different types of linguistic knowledge bear on applications such as machine translation, information retrieval, and dialogue systems.

Materials

The main resource for the course is the textbook by Jurafsky & Martin (2009). This is a big book which [...]

SLIDE 20

TTTV: Course Materials

Main resource:

  • Jurafsky & Martin (2009) Speech and Language Processing, Second Edition, Pearson Education.

Draft versions of some chapters will be available on Blackboard. Other materials, such as online articles and book chapters, will be pointed out during the course.

⇒ See Course Information on Blackboard.

SLIDE 21

Break

SLIDE 22

Overview

This week we’ll look into Formal Language Theory

  • Today:

∗ formal languages: alphabets and strings
∗ regular expressions

  • Next lecture:

∗ finite state automata
∗ finite state methods for simple natural language tasks

SLIDE 29

Formal Languages: strings and alphabets

A formal language is a set of strings, each string composed of symbols from a finite set called an alphabet (or a vocabulary).

Examples

  • Let Σ1 = {0, 1} be an alphabet. Then all binary numbers are strings over Σ1. For instance: 01101, 000001, 1101.
  • Let Σ2 = {a, b, c, d, e, f, g} be an alphabet. Then bee, dad, cabbage, and face are strings over Σ2, as are fffff and agagag.
  • Let Σ3 = {ba, ca, fa, ce, fe, ge} be an alphabet. Then face is a string over Σ3 but bee, dad and cabbage are not.
  • Let Σ4 = {♠, △, ♣} be an alphabet. Then ♠♠ and ♣△♣ are strings over Σ4.
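
The examples above can be checked mechanically. The following is a minimal sketch, not part of the original slides: the function names `segmentations` and `is_string_over` are hypothetical, and it assumes an alphabet is given as a Python set of non-empty symbols, where a symbol may be longer than one character (as in Σ3).

```python
def segmentations(s, alphabet):
    """Return every way of splitting s into a sequence of symbols from alphabet."""
    if s == "":
        return [[]]  # the empty string has exactly one (empty) segmentation
    results = []
    for symbol in alphabet:
        if symbol and s.startswith(symbol):
            for rest in segmentations(s[len(symbol):], alphabet):
                results.append([symbol] + rest)
    return results


def is_string_over(s, alphabet):
    """True if s can be built entirely from symbols of alphabet."""
    return len(segmentations(s, alphabet)) > 0


SIGMA2 = set("abcdefg")
SIGMA3 = {"ba", "ca", "fa", "ce", "fe", "ge"}

for word in ["bee", "dad", "cabbage", "face"]:
    print(word, is_string_over(word, SIGMA2), is_string_over(word, SIGMA3))
```

The same written form can therefore be a string over one alphabet and not over another: what counts as a symbol depends on the alphabet, not on the characters used to write it down.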

SLIDE 40

Strings and Substrings

The length of a string is the number of symbol tokens (occurrences of alphabet symbols) it contains.

Examples

  • the length of face over Σ2 = {a, b, c, d, e, f, g} is 4
  • the length of face over Σ3 = {ba, ca, fa, ce, fe, ge} is 2
  • the length of aaaaaaaa over Σ2 is 8
  • the length of gegegege over Σ3 is 4

The string of length 0 is called the empty string, denoted ε.

Given a string s, a substring of s is a string formed by taking contiguous symbols of s in the order in which they occur in s. An initial substring is called a prefix and a final substring, a suffix.

Examples

Let unthinkable be a string over Σ = {a, b, c, ..., x, y, z}. Then ε, un, unth, and unthinkable are prefixes, while ε, e, able, thinkable, and unthinkable are suffixes. Other substrings include nthi, inka, bl.
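
As a rough illustration of these definitions, the sketch below treats a string as a Python sequence of symbols. The helper names `prefixes`, `suffixes`, and `substrings` are hypothetical. For single-character alphabets a Python `str` is used directly; for an alphabet such as Σ3 a tuple of symbols is assumed, so that the length counts symbols rather than characters.

```python
def prefixes(symbols):
    """All initial substrings, from the empty string up to the whole string."""
    return [symbols[:i] for i in range(len(symbols) + 1)]


def suffixes(symbols):
    """All final substrings, from the whole string down to the empty string."""
    return [symbols[i:] for i in range(len(symbols) + 1)]


def substrings(symbols):
    """All contiguous subsequences, including the empty string and the string itself."""
    n = len(symbols)
    return {symbols[i:j] for i in range(n + 1) for j in range(i, n + 1)}


word = "unthinkable"
print(prefixes(word)[:5])   # the empty string '' plays the role of ε
print(suffixes(word)[-5:])

# Length depends on the alphabet: face is 4 symbols over Σ2 but 2 symbols over Σ3.
print(len("face"), len(("fa", "ce")))
```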

SLIDE 44

Some Operations on Strings

  • Concatenation: two strings s1 and s2 over Σ can be concatenated (written one after the other) to form a new string s1 · s2 over Σ.

Σ = {a, b}    a · b = ab

  • Exponent: we can apply an exponent operator n to a string s. The resulting string s^n is obtained by concatenating n copies of s.

a^0 = ε, a^1 = a, a^2 = aa, a^3 = aaa, ...

  • Kleene star: a special exponent operator ∗ which, applied to a string s, denotes any string obtained by concatenating any number of copies of s (including none).

a∗ = ε or a or aa or aaa ...
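
These three operations map fairly directly onto Python string operations. The sketch below is only an illustration with hypothetical names (`concat`, `power`, `kleene_star`); because s∗ stands for infinitely many strings when s is non-empty, the star is modelled as a lazy generator from which a finite number of strings is taken.

```python
from itertools import count, islice


def concat(s1, s2):
    """Concatenation s1 · s2."""
    return s1 + s2


def power(s, n):
    """Exponent s^n: n copies of s; s^0 is the empty string."""
    return s * n


def kleene_star(s):
    """Lazily yield s^0, s^1, s^2, ... without ever building the whole set."""
    for n in count():
        yield power(s, n)


print(concat("a", "b"))                   # ab
print([power("a", n) for n in range(4)])  # ['', 'a', 'aa', 'aaa']
print(list(islice(kleene_star("ab"), 3)))  # ['', 'ab', 'abab']
```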

SLIDE 53

Formal Languages

We use Σ∗ to denote the set of all strings over an alphabet Σ.

→ note that Σ∗ is infinite whenever Σ contains at least one symbol, no matter how few symbols that is.

We may now define a formal language over an alphabet Σ as any subset of Σ∗.

Examples of formal languages

Let Σ = {a, b, c, ..., x, y, z}. Then Σ∗ is the set of all strings over the Latin alphabet, and the following are formal languages over Σ (that is, subsets of Σ∗):

  • the set of strings consisting of consonants only
  • the set of strings containing at least one vowel and one consonant
  • the set of strings whose length is less than 9 symbols
  • the set {one, two, three, four, five, six, seven, eight, nine, ten}
  • the set of all English words
  • the empty set
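
Since a formal language is just a subset of Σ∗, it can be modelled either as an explicit (finite) set or as a membership test over strings. The sketch below illustrates both under stated assumptions: all names are hypothetical, only a finite slice of Σ∗ is enumerated (up to a maximum length), and the letters a, e, i, o, u are taken as the vowels, with y counted as a consonant.

```python
from itertools import product


def strings_up_to(alphabet, max_len):
    """Yield every string over alphabet of length 0..max_len, shortest first."""
    for n in range(max_len + 1):
        for combo in product(sorted(alphabet), repeat=n):
            yield "".join(combo)


VOWELS = set("aeiou")  # assumption: y is treated as a consonant


def consonants_only(s):
    return all(ch not in VOWELS for ch in s)


def vowel_and_consonant(s):
    return any(ch in VOWELS for ch in s) and any(ch not in VOWELS for ch in s)


def shorter_than_9(s):
    return len(s) < 9


NUMERALS = {"one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten"}
EMPTY_LANGUAGE = set()

# A finite slice of Σ∗ for Σ = {a, b}: 1 + 2 + 4 + 8 = 15 strings of length at most 3.
print(len(list(strings_up_to("ab", 3))))
# Selecting a language out of that slice with a membership test.
print([s for s in strings_up_to("ab", 2) if consonants_only(s)])
```

Under this reading the empty string belongs to the consonants-only language, because it contains no vowel. The set of all English words has no such simple membership test, which is what motivates the formalisms introduced next.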

SLIDE 56

Formal Languages

How can we characterise the language(s) we are interested in?

  • given an alphabet Σ and the infinite set Σ∗ of strings it gives rise to, how can we select a particular formal language (a particular subset of Σ∗)?

For instance:

∗ can we distinguish the set of strings of letters that constitute proper English words?
∗ can we distinguish the set of strings of words that count as well-formed sentences of Dutch?

We have two formal mechanisms at our disposal:

  • formalisms (formal expressions and grammars): sets of rules
  • automata: computational devices for computing languages

SLIDE 58

Formalisms and Automata

  • Formalisms and automata allow us to distinguish a formal language of interest (a set of strings) from other possible languages over a given alphabet.

∗ they capture the patterns that characterise a language
∗ as such, they act as a definition of the language they capture

  • From an abstract point of view, a natural language, like Dutch or English, is a set of strings (of sounds/letters, of words, etc.)
  • Therefore, formalisms and automata can help us to model aspects of natural languages.

Remainder of today: we’ll look into one formalism to define formal languages: regular expressions

SLIDE 66

Regular Expressions

Regular expressions are a formal notation for characterising sets of strings that follow a fairly simple regular pattern. We can construct regular expressions over an alphabet Σ as follows:

                           Regular expression   Language
empty set:                 ∅                    {}
empty string:              ε                    {ε}
symbol (∀a ∈ Σ):           a                    {a}

If a and b are regular expressions, so are:

concatenation:             a · b                {ab}
disjunction (or union):    (a|b)                {a, b}
Kleene star (or closure):  a∗                   {ε, a, aa, aaa, aaaa, . . .}

  • we often ignore the dot in concatenation (a · b) and simply write ab
  • disjunction (or union) may be written as a|b, a + b or a ∪ b
  • the notation Σ∗ can be seen as abbreviating (a|b|...)∗ for any symbol a, b, . . . in Σ.
  • Σ^n can be seen as abbreviating the concatenation of (a|b|...) with itself n times
  • a^n can be used to abbreviate the concatenation of a with itself n times
  • a+ can be used to abbreviate a∗a (the set of a-strings with at least 1 a)
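
The constructions above can be tried out with a small Python sketch that treats each regular expression as the set of strings it denotes. Since a∗ denotes an infinite set, the sketch only keeps strings up to a fixed length. The names MAX_LEN, concat, union, star, power and plus are illustrative choices made here, not part of the lecture material.

```python
MAX_LEN = 4  # assumption for this sketch: only keep strings of at most this length

EMPTY_SET = set()   # the language of ∅
EPSILON = {""}      # the language of ε: just the empty string


def symbol(a):
    """Language of a single symbol a."""
    return {a}


def concat(l1, l2):
    """Concatenation: every string of l1 followed by every string of l2."""
    return {x + y for x in l1 for y in l2 if len(x + y) <= MAX_LEN}


def union(l1, l2):
    """Disjunction (union) of two languages."""
    return l1 | l2


def star(l):
    """Kleene star, truncated at MAX_LEN: zero or more concatenations of l."""
    result = {""}
    frontier = {""}
    while frontier:
        # Extend the newest strings by one more element of l; keep only new ones.
        frontier = concat(frontier, l) - result
        result |= frontier
    return result


def power(l, n):
    """l^n: the concatenation of l with itself n times."""
    result = {""}
    for _ in range(n):
        result = concat(result, l)
    return result


def plus(l):
    """l+ as an abbreviation of l∗l (at least one occurrence)."""
    return concat(star(l), l)


if __name__ == "__main__":
    a, b = symbol("a"), symbol("b")
    print(sorted(star(a), key=len))         # ['', 'a', 'aa', 'aaa', 'aaaa']
    print(sorted(power(union(a, b), 2)))    # ['aa', 'ab', 'ba', 'bb']
```

Truncating at MAX_LEN is only a device for printing finite samples; the languages themselves are unbounded whenever the star is applied to a non-empty symbol.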

Raquel Fernández TtTv 2012 - lecture 1a 22 / 27

slide-81
SLIDE 81

Regular Expressions: Examples

What kind of strings would the language defined by each of the following regular expressions contain?

Let Σ = {a, b, c, d, . . . , x, y, z}

Regular expression   Language
(ab)∗c               {c, abc, ababc, abababc, . . .}
(a|b)∗c              {c, ac, bc, aac, abc, bac, bbc, babc, . . .}
(a∗c)|(b∗c)          {c, ac, aac, aaac, . . . , bc, bbc, bbbc, . . .}
me(o)∗w              {mew, meow, meoow, meooow, meooooooooow, . . .}
ba(a)+               {baa, baaaaaa, baaaaaaaaa, . . .}
sunΣ∗                {sun, sunglasses, sunset, sunz, sunaaaaa, sunyxjshiksr, . . .}
(co)^2Σ∗             {coco, cocoa, coconut, cocoz, coconjsbfx, . . .}
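
Membership in these example languages can be checked mechanically. The following is a minimal sketch using Python's re module, assuming that Σ is written as the character class [a-z] and that the power (co)^2 is written (co){2}. Both are Python notation rather than the formal notation of the slide, and the helper name in_language is made up for this illustration.

```python
import re

# Each formal expression from the examples, rewritten in Python regex syntax,
# paired with a few strings that should belong to its language.
EXAMPLES = {
    r"(ab)*c": ["c", "abc", "ababc"],
    r"(a|b)*c": ["c", "ac", "bc", "babc"],
    r"(a*c)|(b*c)": ["c", "aac", "bbc"],
    r"me(o)*w": ["mew", "meow", "meooow"],
    r"ba(a)+": ["baa", "baaaaaa"],
    r"sun[a-z]*": ["sun", "sunset", "sunz"],
    r"(co){2}[a-z]*": ["coco", "cocoa", "coconut"],
}


def in_language(pattern, string):
    """True if the WHOLE string is described by the pattern."""
    return re.fullmatch(pattern, string) is not None


if __name__ == "__main__":
    for pattern, strings in EXAMPLES.items():
        for s in strings:
            print(pattern, repr(s), in_language(pattern, s))
    # Strings that fall outside a language:
    print(in_language(r"(a*c)|(b*c)", "abc"))  # False: a's and b's cannot be mixed
    print(in_language(r"ba(a)+", "ba"))        # False: at least one extra a is needed
```

re.fullmatch is used rather than re.search because a string belongs to the language only if the expression describes it from beginning to end.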

Raquel Fernández TtTv 2012 - lecture 1a 23 / 27

slide-83
SLIDE 83

Regular Expressions in Programming Languages

Many programming languages such as Perl, Python, Java, or Unix tools like grep include ways to specify regular expressions. For instance:

               Perl notation   underlying regular expression
ranges:        [a-z]           (a|b|c|d| . . . |z)
optionality:   colo(u)?r       colo(u|ε)r
digits:        \d              (0|1|2|3|4|5|6|7|8|9)

There are many other options, such as negation, upper and lower case, white space, etc.

These operators are “syntactic sugar”: every expression written with them can also be constructed with the basic operations we have seen earlier: concatenation, disjunction, and Kleene star, plus the empty string.
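
The "syntactic sugar" claim can be illustrated with a short Python sketch that compares each shorthand with a spelled-out version on a handful of test strings. The sample strings and the helper name same_on are assumptions made for this illustration. Agreement on a few samples is only a sanity check, not a proof of equivalence. The re.ASCII flag is passed because, without it, Python's \d also matches digits from other scripts.

```python
import re
import string

# Shorthand notation -> the same expression using only basic operations.
# In Python an empty alternative, as in (u|), plays the role of ε.
SUGAR_TO_BASIC = {
    r"[a-z]": "(" + "|".join(string.ascii_lowercase) + ")",
    r"colo(u)?r": r"colo(u|)r",
    r"\d": "(" + "|".join(string.digits) + ")",
}

# Hypothetical sample strings, chosen to include matches and non-matches.
SAMPLES = ["a", "m", "z", "A", "7", "0", "color", "colour", "colouur", "", "ab"]


def same_on(samples, sugar, basic):
    """True if both patterns accept exactly the same strings among the samples."""
    return all(
        (re.fullmatch(sugar, s, re.ASCII) is None)
        == (re.fullmatch(basic, s, re.ASCII) is None)
        for s in samples
    )


if __name__ == "__main__":
    for sugar, basic in SUGAR_TO_BASIC.items():
        print(sugar, "agrees with", basic, ":", same_on(SAMPLES, sugar, basic))
```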

Raquel Fernández TtTv 2012 - lecture 1a 24 / 27

slide-85
SLIDE 85

Regular Expressions for String Search

Regular expressions are very often used to search a text or collection of texts for particular types of strings. For instance:

  • we may try to find articles that talk about “globalisation” by searching for documents that include strings captured by the following regular expression:

(anti|ε)globali(s|z)(ation|e)

or in Perl notation:

(anti)?globali[sz](ation|e)

  • Google allows you to search by disjunction or concatenation.
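
Returning to the globalisation search, the idea can be sketched in Python as follows. The sketch assumes that the final group of the pattern is meant to be (ation|e), so that “globalisation” itself is matched. The three documents are invented for the example, and matching_documents is a hypothetical helper name.

```python
import re

# Assumption: the last group is (ation|e), so that "globalisation",
# "globalization", "globalise" and "globalize" are all found,
# each optionally prefixed with "anti".
PATTERN = re.compile(r"(anti)?globali[sz](ation|e)")

# Made-up mini collection of texts, for illustration only.
DOCUMENTS = {
    "doc1": "The antiglobalization movement grew in the 1990s.",
    "doc2": "Firms globalise their supply chains.",
    "doc3": "A local bakery opened downtown.",
}


def matching_documents(docs, pattern=PATTERN):
    """Names of the documents that contain at least one matching substring."""
    return [name for name, text in docs.items() if pattern.search(text)]


if __name__ == "__main__":
    print(matching_documents(DOCUMENTS))  # ['doc1', 'doc2']
```

Here re.search is the appropriate call, since the pattern only needs to occur somewhere inside a document rather than describe the whole text.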

Raquel Fernández TtTv 2012 - lecture 1a 25 / 27

slide-88
SLIDE 88

Required Readings

These readings cover what we have seen today and what we will do in the next lecture.

  • Wintner (2011) Formal Language Theory for Linguists

∗ chapter 1 from section 1.3 to the end, and chapter 2 up to and including section 2.1

  • Jurafsky & Martin (2009)

∗ chapter 2 and chapter 3 up to and including section 3.5

Available on Blackboard: under Course Information

  • electronic copies of the readings
  • learning objectives for week 1
  • homework #1

Raquel Fernández TtTv 2012 - lecture 1a 26 / 27

slide-89
SLIDE 89

Practical Sessions

Tutors per group:

  • Group A: Sharon and Elise
  • Group B: Tim and Kirsten

Rooms for today’s practical session:

  • Group A: G0.05
  • Group B: F1.02

Raquel Fernández TtTv 2012 - lecture 1a 27 / 27