SLIDE 1

Neural Networks: Old and New

Ju Sun

Computer Science & Engineering University of Minnesota, Twin Cities

January 29, 2020


SLIDE 4

Logistics

– Another great reference: Dive into Deep Learning by Aston Zhang, Zachary C. Lipton, Mu Li, and Alexander J. Smola. Live book online: https://d2l.ai/ (comprehensive coverage of recent developments and detailed implementations based on NumPy)
– Homework 0 will be posted tonight
– Waiting list

SLIDE 5

Outline

– Start from neurons
– Shallow to deep neural networks
– A brief history of AI
– Suggested reading


SLIDE 9

Model of biological neurons

Credit: Stanford CS231N

Biologically ...
– Each neuron receives signals from its dendrites
– Each neuron outputs signals via its single axon
– The axon branches out and connects via synapses to dendrites of other neurons


SLIDE 14

Model of biological neurons

Credit: Stanford CS231N

Mathematically ...
– Each neuron receives xi's from its dendrites
– The xi's are weighted by wi's (the synaptic strengths) and summed: Σ_i wi xi
– The neuron fires only when the combined signal is above a certain threshold: Σ_i wi xi + b > 0
– The firing rate is modeled by an activation function f, i.e., the output is f(Σ_i wi xi + b)
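The weighted-sum-and-threshold model above can be sketched directly in NumPy. This is a minimal illustration; the step activation and the particular numbers are made up for the example, not taken from the slides:

```python
import numpy as np

def neuron(x, w, b, f):
    """Single artificial neuron: weighted sum of inputs plus bias,
    passed through an activation function f."""
    return f(np.dot(w, x) + b)

# Threshold firing: output 1 only when w.x + b > 0
# (the firing threshold is absorbed into the bias b).
step = lambda z: 1.0 if z > 0 else 0.0

x = np.array([0.5, -1.0, 2.0])   # signals arriving at the dendrites
w = np.array([1.0, 0.5, 0.25])   # synaptic strengths
b = -0.3                         # bias (negative threshold)
print(neuron(x, w, b, step))     # 0.5 - 0.5 + 0.5 - 0.3 = 0.2 > 0, so it fires: 1.0
```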


SLIDE 19

Artificial neural networks

Brain neural networks

Credit: Max Pixel

Artificial neural networks

Why called artificial?
– (Over-)simplification at the neuron level
– (Over-)simplification at the connection level

In this course, neural networks are always artificial.

SLIDE 20

Outline

– Start from neurons
– Shallow to deep neural networks
– A brief history of AI
– Suggested reading


SLIDE 25

Artificial neurons

f(Σ_i wi xi + b) = f(w⊺x + b)

We shall use σ instead of f henceforth.

Examples of activation functions σ

Credit: [Hughes and Correll, 2016]
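A few standard activation choices can be written out explicitly. This is a generic sketch; which specific functions the cited figure shows is not assumed here:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # squashes into (0, 1)

def relu(z):
    return np.maximum(0.0, z)         # zero for negative inputs, identity otherwise

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z))   # sigmoid(0) = 0.5
print(np.tanh(z))   # squashes into (-1, 1); tanh(0) = 0
print(relu(z))      # [0. 0. 2.]
```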


SLIDE 29

Neural networks

– One neuron: σ(w⊺x + b)
– Neural networks (NN): structured organizations of artificial neurons
– The w's and b's are unknown and need to be learned
– Many models in machine learning are neural networks


SLIDE 34

A typical setup

Supervised learning
– Gather training data (x1, y1), . . . , (xn, yn)
– Choose a family of functions H, so that there is an f ∈ H ensuring yi ≈ f(xi) for all i
– Set up a loss function ℓ to measure the approximation quality
– Find an f ∈ H to minimize the average loss:

  min_{f ∈ H} (1/n) Σ_{i=1}^n ℓ(yi, f(xi))

... known as the empirical risk minimization (ERM) framework in learning theory
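The four ERM steps above can be mimicked on a toy problem. The squared loss, the tiny hypothesis family H, and the data below are illustrative assumptions chosen so the minimization can be done by brute force:

```python
def empirical_risk(loss, f, data):
    """Average loss of hypothesis f over the training set."""
    return sum(loss(y, f(x)) for x, y in data) / len(data)

# Toy instance: squared loss, data roughly following y = 2x,
# and a four-element hypothesis family H of lines through the origin.
sq = lambda y, yhat: (y - yhat) ** 2
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]
H = [lambda x, a=a: a * x for a in (1.0, 1.5, 2.0, 2.5)]

# ERM: pick the f in H with the smallest average loss.
best = min(H, key=lambda f: empirical_risk(sq, f, data))
print(best(1.0))  # the minimizer is the slope-2 line, so this prints 2.0
```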


SLIDE 38

A typical setup

Supervised learning from the NN viewpoint
– Gather training data (x1, y1), . . . , (xn, yn)
– Choose a NN with k neurons, so that there is a group of weights (w1, . . . , wk, b1, . . . , bk) ensuring yi ≈ {NN(w1, . . . , wk, b1, . . . , bk)}(xi) ∀i
– Set up a loss function ℓ to measure the approximation quality
– Find weights (w1, . . . , wk, b1, . . . , bk) to minimize the average loss:

  min_{w's, b's} (1/n) Σ_{i=1}^n ℓ[yi, {NN(w1, . . . , wk, b1, . . . , bk)}(xi)]


SLIDE 44

Linear regression

Credit: D2L

– Data: (x1, y1), . . . , (xn, yn), xi ∈ R^d
– Model: yi ≈ w⊺xi + b
– Loss: ‖y − ŷ‖₂²
– Optimization: min_{w,b} (1/n) Σ_{i=1}^n ‖yi − (w⊺xi + b)‖₂²

Credit: D2L

σ is the identity function
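The optimization on this slide has a closed-form least-squares solution. A sketch in NumPy, with made-up synthetic data (noiseless for simplicity), might look like:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                  # n = 100 samples, d = 3 features
w_true, b_true = np.array([1.0, -2.0, 0.5]), 0.7
y = X @ w_true + b_true                        # noiseless targets for illustration

# Least squares: append a constant column so b is learned jointly with w.
Xb = np.hstack([X, np.ones((100, 1))])
theta, *_ = np.linalg.lstsq(Xb, y, rcond=None)
w_hat, b_hat = theta[:3], theta[3]
print(np.round(w_hat, 3), round(b_hat, 3))     # recovers w_true and b_true
```

In practice (and later in the course) the same objective is minimized by gradient descent, which also works when σ is not the identity.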


SLIDE 49

Perceptron

Frank Rosenblatt (1928–1971)

– Data: (x1, y1), . . . , (xn, yn), xi ∈ R^d, yi ∈ {+1, −1}
– Model: yi ≈ σ(w⊺xi + b), with σ the sign function
– Loss: 0-1 loss 1{y ≠ ŷ}
– Optimization: min_{w,b} (1/n) Σ_{i=1}^n 1{yi ≠ σ(w⊺xi + b)}
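The slides state only the objective, not how to minimize it; a minimal sketch of the classic mistake-driven perceptron training rule, on made-up linearly separable data, is:

```python
import numpy as np

def perceptron_train(X, y, epochs=20):
    """Rosenblatt-style update: on each mistake, nudge (w, b) toward the example."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:   # misclassified (or on the boundary)
                w += yi * xi
                b += yi
    return w, b

# Toy separable data with labels in {+1, -1}.
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
w, b = perceptron_train(X, y)
print(np.sign(X @ w + b))   # all four points classified correctly
```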


SLIDE 52

Perceptron

– The Perceptron is a single artificial neuron for binary classification
– It dominated early AI (1950s–1970s)
– Logistic regression is similar but with sigmoid activation


SLIDE 58

Softmax regression

– Data: (x1, y1), . . . , (xn, yn), xi ∈ R^d, yi ∈ {L1, . . . , Lp}, i.e., a multiclass classification problem
– Data preprocessing: labels turned into vectors via one-hot encoding:

  Lk ⟹ [0, . . . , 0, 1, 0, . . . , 0]⊺  (k−1 0's before the 1, p−k 0's after)

  So: yi ⟹ yi, its one-hot vector
– Model: yi ≈ σ(W⊺xi + b), where σ is the softmax function (maps vectors to vectors): for z ∈ R^p,

  z ↦ (e^{z1}/Σ_j e^{zj}, . . . , e^{zp}/Σ_j e^{zj})⊺

– Loss: cross-entropy loss −Σ_j yj log ŷj
– Optimization ...
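The one-hot encoding, softmax map, and cross-entropy loss above can be written out directly; the class count p and the score vector z below are illustrative:

```python
import numpy as np

def softmax(z):
    z = z - z.max()              # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def one_hot(k, p):
    """Label L_k (1-indexed) -> length-p vector with a single 1 in position k."""
    v = np.zeros(p)
    v[k - 1] = 1.0
    return v

def cross_entropy(y, y_hat):
    return -np.sum(y * np.log(y_hat))

p = 3
y = one_hot(2, p)                  # true class is L_2
z = np.array([0.5, 2.0, -1.0])     # scores W^T x + b for one sample
y_hat = softmax(z)                 # a probability vector summing to 1
print(y_hat, cross_entropy(y, y_hat))
```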

SLIDE 59

Softmax regression

... for multiclass classification

Credit: D2L


SLIDE 62

Multilayer perceptrons

Credit: D2L

– Model: yi ≈ σ2(W2⊺ σ1(W1⊺ x + b1) + b2)
– Also called feedforward networks
– Modern NNs: many hidden layers (deep), refined connection structures and/or activations
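The two-layer model above can be sketched as a forward pass. The choice of ReLU for σ1 and identity for σ2, and all dimensions, are illustrative assumptions:

```python
import numpy as np

def mlp_forward(x, W1, b1, W2, b2):
    """Two-layer MLP: y = sigma2(W2^T sigma1(W1^T x + b1) + b2),
    with sigma1 = ReLU and sigma2 = identity (a regression head)."""
    h = np.maximum(0.0, W1.T @ x + b1)   # hidden layer
    return W2.T @ h + b2                 # output layer

rng = np.random.default_rng(1)
d, k, m = 4, 5, 2                        # input dim, hidden width, output dim
W1, b1 = rng.normal(size=(d, k)), np.zeros(k)
W2, b2 = rng.normal(size=(k, m)), np.zeros(m)
x = rng.normal(size=d)
print(mlp_forward(x, W1, b1, W2, b2).shape)  # (2,)
```

The W's and b's here are random; training would pick them by minimizing the average loss, exactly as in the ERM setup on the earlier slides.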


SLIDE 64

They're all (shallow) NNs

– Linear regression
– Perceptron and logistic regression
– Softmax regression
– Multilayer perceptrons (feedforward NNs)
– Support vector machines (SVM)
– PCA (autoencoder)
– Matrix factorization

see, e.g., Chapter 2 of [Aggarwal, 2018].

SLIDE 65

Outline

– Start from neurons
– Shallow to deep neural networks
– A brief history of AI
– Suggested reading


SLIDE 68

Birth of AI

– Crucial precursors: the first computers, the Turing test
– 1956: Dartmouth Summer Research Project on Artificial Intelligence — birth of AI

SLIDE 69

Turing test

Alan Turing (1912–1954)


SLIDE 73

First golden age

Symbolic AI: based on rules and logic

But what are the rules for recognizing dogs?


SLIDE 75

First AI winter

Gartner hype cycle


SLIDE 77

Perceptron

– Invented in 1962
– The book Perceptrons, written in 1969, marked the end of the Perceptron era

Marvin Minsky (1927–2016)

SLIDE 78

Birth of computer vision

(figure captions: 1966; around 1980)


SLIDE 80

Second golden age

Expert systems: can we build comprehensive knowledge bases and know all the rules?

SLIDE 81

Big bang in DNNs


SLIDE 83

After 2nd AI winter

Machine learning takes over ...

SLIDE 84

Golden age of machine learning

Starting in the 1990s:
– Support vector machines (SVM)
– AdaBoost
– Decision trees and random forests
– Deep learning
– ...

SLIDE 85

Outline

– Start from neurons
– Shallow to deep neural networks
– A brief history of AI
– Suggested reading

SLIDE 86

Suggested reading

– Chap 2, Neural Networks and Deep Learning.
– Chap 3–4, Dive into Deep Learning.
– Chap 1, Deep Learning with Python.

SLIDE 87

References

[Aggarwal, 2018] Aggarwal, C. C. (2018). Neural Networks and Deep Learning. Springer International Publishing.

[Hughes and Correll, 2016] Hughes, D. and Correll, N. (2016). Distributed machine learning in materials that couple sensing, actuation, computation and communication. arXiv:1606.03508.