
SLIDE 1

A Practical Guide to Deep Learning at the Department of Mathematics

Vegard Antun (UiO) March 19, 2019

1 / 61

SLIDE 2

Layout of the talk

Part I Computer resources, the linux operating system, large scale computations. Part II Neural networks, mathematical framework, practical example.

2 / 61

SLIDE 3

Computer resources

Figure: CPU – Cache – Memory – Hard drive.

3 / 61

SLIDE 4

Memory Hierarchies

Figure (from INF1060, Pål Halvorsen, University of Oslo): the memory hierarchy, from cache(s) through main memory and secondary storage (disks) down to tertiary storage (tapes). Typical access times, with their human-scale equivalents:

On-die cache:              0.3 – 1 ns   (< 1 s – 2 s on a human time scale)
Main memory:               50 ns        (1.5 minutes)
Secondary storage (disk):  5 ms         (3.5 months)

SLIDE 5

Computer resources

Figure: GPU with its own memory, alongside CPU – Cache – Memory – Hard drive.

5 / 61

SLIDE 6

Time measurements

Total time for 10 epochs on CIFAR10, batch size 10:
◮ CPU: 8 min, 35 sec
◮ GPU: 53 sec (≈10 times faster)

Figure: seconds needed to load 50 MR scans (each 40 MB) on nam-shub, comparing network, local disk, and RAM (axis 5–20 seconds).

6 / 61

SLIDE 7

Operating systems (OS)

Figure: the operating system as a layer on top of the hardware.

7 / 61

SLIDE 8

The Linux Filesystem Hierarchy

The uppermost directory in the Linux file system is /

[ ∼ ]$ ls
Desktop  Documents  Downloads  pc  Pictures  WINDOWS  www_docs
[ ∼ ]$ pwd
/mn/sarpanitu/ansatte-u4/vegarant
[ ∼ ]$ cd /
[ / ]$ ls
admin  bin  boot  dev  div  etc  hf  home  ifi  jus  lib  lib64  local
med  media  misc  mn  mnt  net  odont  opt  proc  rh  root  run  sbin
site  srv  sv  sys  tf  tmp  ub  uio  use  usit  usr  uv  var

8 / 61

SLIDE 9

Some important directories

◮ /bin Most basic executable files (ls, cp, cd)
◮ /lib Libraries used by the executables
◮ /boot Files related to the boot loader
◮ /dev All devices, /dev/random, /dev/null, /dev/pts/0
◮ /etc Configuration files, /etc/hostname, /etc/passwd
◮ /home/username Your home folder ∼/ (not on the UiO system)
◮ /root Home directory of the root user
◮ /tmp Temporary files – not preserved during reboots
◮ /usr Read-only user data. Multiuser applications
◮ /var Variable files, i.e. files which change during execution

9 / 61

SLIDE 10

SLIDE 11

Environment variables

A variable with a name and a value, used by one or more applications. To view them all, type env.

Some important environment variables:
◮ PATH All directories where we search for executables
◮ PYTHONPATH All directories where we search for python modules
◮ HOME Your home directory, i.e. the position of ∼/
◮ EDITOR Default editor
◮ TF_CPP_MIN_LOG_LEVEL Level of verbosity for tensorflow

11 / 61

SLIDE 12

Environment variables - Example

[ ∼ ]$ echo $PYTHONPATH
/path/to/module1:/path/to/module2
[ ∼ ]$ export PYTHONPATH=$PYTHONPATH:/path/to/new_module
[ ∼ ]$ echo $PYTHONPATH
/path/to/module1:/path/to/module2:/path/to/new_module

12 / 61

SLIDE 13

The ∼/.bashrc

The scripting language you type in the terminal is called "bash" (Bourne Again SHell). We often want the environment to stay persistent between logins. Set defaults in the files:
◮ ∼/.bashrc Run each time you open a terminal on your computer

[ ∼ ]$ cat ∼/.bashrc
export PYTHONPATH=$PYTHONPATH:/path/to/new_module
export TF_CPP_MIN_LOG_LEVEL=1
alias la='ls -a --color=auto'
alias ll='ls -lh --color=auto'
# Describes the command line prompt
PS1='[ \h \w ]$ '

13 / 61

SLIDE 14

The ∼/.bashrc and ∼/.bash_profile files

◮ ∼/.bashrc Run each time you open a terminal on your computer
◮ ∼/.bash_profile Run each time you log in remotely.

Having two different settings in ∼/.bashrc and ∼/.bash_profile is often inconvenient. To only use the ∼/.bashrc file, place the following lines in your ∼/.bash_profile:

[ ∼ ]$ cat ∼/.bash_profile
if [ -f ∼/.bashrc ]; then
    . ∼/.bashrc
fi

Note: Files starting with '.' don't show when you type ls. In order to see these files, type ls -a

14 / 61

SLIDE 15

Login to remote machines via SSH

Log in to the university's network from a personal linux or mac computer:

[ ∼ ]$ ssh -X username@login.math.uio.no

The -X option enables X11 forwarding, i.e. you can open GUI based applications. Once you are logged in, you can continue to the desired computer by typing:

[ ∼ ]$ ssh -X computername
[ ∼ ]$ # Example, logging into the hadad computer
[ ∼ ]$ ssh -X hadad

15 / 61

SLIDE 16

Login to remote machines via SSH

Next we will see how to make this procedure require less typing!

16 / 61

SLIDE 17

SSH config file

Create the file ∼/.ssh/config and add the following lines

Host uio
    Hostname login.math.uio.no
    User your_username
    ForwardX11 no

You can then log on to the university's network with

ssh -X uio

We assume you have this set up for the rest of this presentation.

17 / 61

SLIDE 18

SSH keys

To be secure, UiO passwords often require a lot of typing. SSH keys provide an easy way to maintain high security while typing shorter passwords.

18 / 61

SLIDE 19

Generate and set up SSH-key

[ ∼ ]$ ssh-keygen -t rsa -b 4096 -C "your@email.com"

This command will create two files:
◮ ∼/.ssh/id_rsa Private key. Do not share it.
◮ ∼/.ssh/id_rsa.pub Public key. Can be shared with anyone.

Copy the public key to the remote host (UiO):

[ ∼ ]$ ssh-copy-id -i ∼/.ssh/id_rsa.pub <username>@login.math.uio.no

19 / 61

SLIDE 20

SSH and jump connections

Your comp. → login.math.uio.no → math comp.

◮ A jump connection sends the ssh traffic directly through a computer, like a regular router
◮ You avoid some typing, and you do not allocate a terminal on the jump computer
◮ Only allows for one jump

20 / 61

SLIDE 21

SSH and jump connections

To use a jump connection, add the following to your ∼/.ssh/config:

# Setup for the math computers, this example belet-ili
Host belet-ili1
    Hostname belet-ili.uio.no
    ProxyJump vegarant@login.math.uio.no
    User vegarant

Or you can add the jump connection directly:

ssh -J <username>@login.math.uio.no <username>@<hostname>.uio.no

21 / 61

SLIDE 22

Terminal window managers

◮ Common choices are "tmux" or "screen"; see the sketch below.
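As a quick illustration (not from the slides; the session name and script are hypothetical), a minimal tmux workflow for keeping a long-running job alive after you log out:

[ ∼ ]$ tmux new -s training       # start a named session
[ ∼ ]$ python3 train.py           # run the job inside the session
# Detach with Ctrl-b d; the job keeps running after you log out
[ ∼ ]$ tmux attach -t training    # re-attach later
[ ∼ ]$ tmux ls                    # list running sessions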

22 / 61

SLIDE 23

Monitor CPU usage

◮ Use the htop command to view CPU usage and process priorities, for example:
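A small sketch (not from the slides):

[ ∼ ]$ htop            # interactive overview of all processes
[ ∼ ]$ htop -u $USER   # show only your own processes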

23 / 61

SLIDE 24

Reducing the priority of your process

◮ Linux processes have "niceness" values {−20, . . . , 19}, where a smaller value gives higher priority.
◮ Negative nice values can only be set by the root user/administrator.
◮ The default priority of any process you start is 0, i.e. you will typically reduce the priority.

[ ∼ ]$ nice -n 19 python3 my_python_script.py &

24 / 61

SLIDE 25

Monitor GPU usage

◮ All of our GPUs are from Nvidia. To view their current usage, use nvidia-smi
◮ To run this command every 5 seconds, use the watch command

[ ∼ ]$ watch -n 5 nvidia-smi
[ ∼ ]$ # or use
[ ∼ ]$ nvidia-smi -l 5

25 / 61

SLIDE 26

GPU resources at Dep. of Mathematics

Name         GPU              CPU cores  Mem.    Scratch
nam-shub-01  4 × RTX 2080 Ti  28         128 GB  30 GB
zadkiel      1 × RTX 2080     4          16 GB   −
belet-ili    1 × GTX 1080     4          16 GB   −
cleopatra    1 × GTX 1080     4          16 GB   −
euphrosyne   1 × GTX 1080     4          16 GB   −
hadad        1 × GTX 1080     4          16 GB   −

26 / 61

SLIDE 27

SLIDE 28

AI HUB

◮ An experimental service for machine learning provided by USIT, to gain experience with hardware and software for deep learning.
◮ Reserved for students on weekdays (Mon–Fri) from 09:00 to 17:00.
◮ Need to log in via Abel (add ssh keys as before).

Name  GPU              CPU cores  Mem.    Non-persistent scratch
ml1   4 × RTX 2080 Ti  28         128 GB  17 TB
ml2   4 × RTX 2080 Ti  28         128 GB  17 TB
ml3   4 × RTX 2080 Ti  28         128 GB  17 TB

◮ AI mailing list: itf-ai-announcements@usit.uio.no

28 / 61

SLIDE 29

Deep learning frameworks

◮ Many older frameworks like MatConvNet, Caffe, Theano, ...
◮ For most scientists, Tensorflow (and maybe Pytorch) would be the preferred option.

29 / 61

SLIDE 30

Tensorflow

◮ Developed by Google, and has a large community.
◮ Relatively well documented.
◮ Has APIs in Python, JavaScript, C++, Java, Go, Swift.
◮ Models can be deployed into applications, such as websites and phones.

30 / 61

SLIDE 31

How to run Tensorflow?

◮ No unified way to do this on all systems.
◮ The machines ml1, ml2 and ml3 have tensorflow v1.12 and PyTorch v1.0. Just type python3 to get started.
◮ On the math computers we use the module system (and maybe singularity):

module avail                        # See which modules are available
module load tensorflow/<version>    # Load tensorflow
module rm tensorflow/<version>      # Unload tensorflow
module list                         # View loaded modules

◮ ML software is located under python-ml/<version> and tensorflow/<version>. Do not load both.

31 / 61

SLIDE 32

Singularity

◮ Singularity (similar to docker) is a container with a minimal operating system.
◮ Shares the kernel with the host operating system, so the CPU overhead is almost none.
◮ You can install whatever software you like within the container, with the necessary libraries.
◮ Makes reproducible research much easier!
◮ Check out Tormod Landet's excellent guide to singularity: http://folk.uio.no/tormodla/singularity/
◮ On the math computers, precompiled singularity images are located at /mn/sarpanitu/singularity/images/Machine_learning
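As a rough sketch (the image filename and script are hypothetical), running a script inside one of these images could look like:

[ ∼ ]$ cd /mn/sarpanitu/singularity/images/Machine_learning
[ ∼ ]$ singularity exec <image>.img python3 my_script.py   # run one command in the container
[ ∼ ]$ singularity shell <image>.img                       # or open a shell inside it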

32 / 61

SLIDE 33

Neat commands

◮ ag or ack – Search for a pattern in each source file in the tree from the current directory and downwards.
◮ fzf – Fuzzy finder. Search for filenames in the tree from the current directory and downwards.
◮ which <command> – E.g. which python gives the location of the program python.
◮ nohup nice -n 19 python -u my_script.py > output.txt & – Start a process which isn't shut down when you exit the login shell.

33 / 61

SLIDE 34

File permissions

On UNIX systems, access can be given to a user, a group, or all. The three types of permissions are read, write and execute.

[ ∼/some/directory]$ ls -l
drwxrwxr-x. 1 vegarant vegarant 4096 Oct 26 10:53 my_dir
-rwxrwxr-x. 1 vegarant vegarant 8448 Oct 26 10:53 my_file
-rw-r--r--. 1 vegarant vegarant  108 Oct 26 10:52 my_file.c

Reading the first line:
d            – directory
rwx          – user permissions
rwx          – group permissions
r-x          – permissions for all (others)
vegarant     – username
vegarant     – group name
4096         – size
Oct 26 10:53 – last modified
my_dir       – name

[ ∼/some/directory]$ # Make directory private
[ ∼/some/directory]$ chmod 700 my_dir
[ ∼/some/directory]$ ls -l
drwx------. 1 vegarant vegarant 4096 Oct 26 10:53 my_dir
-rwxrwxr-x. 1 vegarant vegarant 8448 Oct 26 10:53 my_file
-rw-r--r--. 1 vegarant vegarant  108 Oct 26 10:52 my_file.c
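Permissions can also be changed symbolically; a couple of illustrative examples (not from the slides):

[ ∼/some/directory]$ chmod u+x my_script.sh   # add execute permission for the user
[ ∼/some/directory]$ chmod go-rwx my_dir      # remove all group/other access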

34 / 61

SLIDE 35

Part II

Neural networks, mathematical framework, practical example.

35 / 61

SLIDE 36

SLIDE 37

Neural Network

Definition 1

Let $\mathcal{NN}_{\mathbf{N},L,d}$ with $\mathbf{N} = (N_{L+1}, N_L, \dots, N_2, N_1)$, $N_{L+1} = c$, $N_1 = d$, denote the set of all $L$-layer neural networks. That is, all mappings $f \colon \mathbb{R}^d \to \mathbb{R}^c$ of the form
$$f(x) = W_L(\cdots \rho(W_2(\rho(W_1(x)))) \cdots), \qquad x \in \mathbb{R}^d,$$
where $W_j z = A_j z + b_j$ with $A_j \in \mathbb{R}^{N_{j+1} \times N_j}$, $b_j \in \mathbb{R}^{N_{j+1}}$, and $\rho \colon \mathbb{R} \to \mathbb{R}$ is a non-linear function that acts elementwise on a vector.
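To make Definition 1 concrete, here is a minimal NumPy sketch (not from the slides; all names are illustrative) evaluating such a network:

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def neural_net(x, As, bs, rho=relu):
    # f(x) = W_L(... rho(W_2(rho(W_1(x)))) ...), where W_j z = A_j z + b_j
    z = x
    for A, b in zip(As[:-1], bs[:-1]):
        z = rho(A @ z + b)       # affine map followed by the activation
    return As[-1] @ z + bs[-1]   # the last layer applies no activation

# Example: a 2-layer network from R^3 to R^2
As = [np.random.randn(5, 3), np.random.randn(2, 5)]
bs = [np.random.randn(5), np.random.randn(2)]
y = neural_net(np.random.randn(3), As, bs)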

37 / 61

SLIDE 38

Choices of ρ

$\rho \colon \mathbb{R} \to \mathbb{R}$ acts elementwise on a vector.

Sigmoid: $\rho(x) = 1/(1 + e^{-x})$
ReLU: $\rho(x) = \max(0, x)$
tanh: $\rho(x) = \tanh(x)$
Leaky ReLU: $\rho(x) = x$ for $x \geq 0$, and $\rho(x) = \alpha x$ for $x < 0$
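In NumPy these are one-liners (a sketch; alpha in the leaky ReLU is a small constant you choose):

import numpy as np

sigmoid    = lambda x: 1.0 / (1.0 + np.exp(-x))
relu       = lambda x: np.maximum(0.0, x)
leaky_relu = lambda x, alpha=0.01: np.where(x >= 0, x, alpha * x)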

38 / 61

SLIDE 39

Choices of ρ

$$\rho \begin{pmatrix} x_1 \\ \vdots \\ x_N \end{pmatrix} = \begin{pmatrix} \max\{x_1, x_2\} \\ \vdots \\ \max\{x_{N-1}, x_N\} \end{pmatrix}, \qquad \rho \begin{pmatrix} x_1 \\ \vdots \\ x_N \end{pmatrix} = \begin{pmatrix} (x_1 + x_2)/2 \\ \vdots \\ (x_{N-1} + x_N)/2 \end{pmatrix}$$

Max pooling (left) and average pooling (right); average pooling is a linear map.

39 / 61

SLIDE 40

Neural Network (Alternative definition)

Directed acyclic graph, with input x:

z1 = A1 x + b1
z2 = ρ1(z1)
z3 = A2 z2 + b2
z4 = A3 x + b3
z5 = ρ2(z4)
z6 = z3 + z5
z7 = ρ3(z6)
Output: z7
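A direct NumPy transcription of this graph (a sketch, not from the slides; the activations are taken to be ReLU for illustration):

import numpy as np

rho = np.vectorize(lambda t: max(0.0, t))   # assume rho1 = rho2 = rho3 = ReLU

def dag_net(x, A1, b1, A2, b2, A3, b3):
    z1 = A1 @ x + b1
    z2 = rho(z1)
    z3 = A2 @ z2 + b2
    z4 = A3 @ x + b3    # this branch takes the input x directly
    z5 = rho(z4)
    z6 = z3 + z5        # the two branches are summed
    return rho(z6)      # z7, the output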

40 / 61

SLIDE 41

What is machine learning?

41 / 61

SLIDE 42

Machine learning model

◮ Training set: $S = (z_1, \dots, z_m) \subset Z$, where each $z_i$ is drawn i.i.d. from an unknown probability distribution $D$ over $Z \subset \mathbb{R}^d$.
◮ Function class: $F$, a class of functions/hypotheses.
◮ Cost function: $C \colon F \times Z \to \mathbb{R}$
◮ Risk: $R_D(f) := \mathbb{E}_{z \sim D} C(f, z)$, where $z \sim D$ is independent of $S$.
◮ Goal: Find a "good hypothesis" $\hat{f} \in F$ based on $S$ such that $R_D(\hat{f})$ is small.

Shalev-Shwartz & Ben-David, Understanding Machine Learning: From Theory to Algorithms, Cambridge University Press, 2014.

42 / 61

SLIDE 43

Examples

Binary classification
◮ Training set: $\{(x_i, y_i)\}_{i=1}^m \subset \mathbb{R}^d \times \{0, 1\}$.
◮ Function class: $F$ can be a set of linear classifiers, neural networks, decision trees.
◮ Cost function: $C(f, (x_i, y_i)) = \mathbb{1}_{\{y_i \neq f(x_i)\}}$.

Linear regression
◮ Training set: $\{(x_i, y_i)\}_{i=1}^m \subset \mathbb{R}^d \times \mathbb{R}$.
◮ Function class: $F = \{\langle \cdot, \theta \rangle : \theta \in \mathbb{R}^{d+1}\}$
◮ Cost function: $C(f, (x_i, y_i)) = (y_i - \langle [x_i, 1], \theta \rangle)^2$.

Clustering
◮ Training set: $S = \{z_i\}_{i=1}^m \subset \mathbb{R}^d$.
◮ Function class: $F = \{T = \{T_1, \dots, T_k\} : T$ a partition of $S$ with centers $(c_1, \dots, c_k)\}$
◮ Cost function: $C(T, z_i) = \|z_i - c_j\|$ for $z_i \in T_j$.

43 / 61

SLIDE 44

Machine learning model

◮ Risk: $R_D(f) := \mathbb{E}_{z \sim D} C(f, z)$, where $z \sim D$ is independent of $S$.
◮ Goal: Find a "good hypothesis" $\hat{f} \in F$ based on $S$ such that $R_D(\hat{f})$ is small.

Notice: We cannot evaluate $R_D(f)$, since $D$ is unknown.

Empirical Risk Minimization

Approximate $R_D(f)$ by $R_S(f) = \frac{1}{|S|} \sum_{z \in S} C(f, z)$. We seek to find $f^{\sharp} \in \operatorname{argmin}_{f \in F} R_S(f)$.
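As a sketch in Python (the square loss and the tiny data set are only for illustration):

import numpy as np

def empirical_risk(f, S, cost):
    # R_S(f) = (1/|S|) * sum over z in S of C(f, z)
    return np.mean([cost(f, z) for z in S])

# Example: square loss on (x, y) pairs
cost = lambda f, z: (z[1] - f(z[0]))**2
S = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]
print(empirical_risk(lambda x: 2.0 * x, S, cost))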

44 / 61

SLIDE 45

Bias-Complexity tradeoff

Let $\epsilon_{\mathrm{approx}} = \min_{f \in F} R_D(f)$ and $f^{\sharp} \in \operatorname{argmin}_{f \in F} R_S(f)$. Then

$$R_D(f^{\sharp}) = \underbrace{\epsilon_{\mathrm{approx}}}_{\text{approximation error}} + \underbrace{R_D(f^{\sharp}) - \epsilon_{\mathrm{approx}}}_{\text{estimation error}}$$

45 / 61

SLIDE 46

Empirical Risk Minimization for Neural Networks

◮ Training set: $\{(x_i, y_i)\}_{i=1}^n \subset \mathbb{R}^d \times \mathbb{R}^c$.
◮ Function class: $F = \mathcal{NN}_{\mathbf{N},L,d}$, parametrized by the weights $\theta = (\mathrm{vec}(A_1), b_1, \dots, \mathrm{vec}(A_L), b_L)$, i.e. $f(\cdot, \theta) \colon \mathbb{R}^d \to \mathbb{R}^{N_{L+1}}$.
◮ Cost function: $C(f, (x_i, y_i)) = d(f(x_i, \theta), y_i)$, where $d \colon \mathbb{R}^c \times \mathbb{R}^c \to \mathbb{R}_+$ is problem dependent.

1. $\theta \in \mathbb{R}^p$ is often referred to as the weights.
2. Define the loss function
$$L(\theta) = \sum_{i=1}^{n} d(f(x_i, \theta), y_i)$$
3. Try to find $\theta \in \operatorname{argmin}_{\theta \in \mathbb{R}^p} L(\theta)$ using (stochastic) gradient descent.

46 / 61

SLIDE 47

Convex Optimization – Boyd & Vandenberghe

"Nonlinear optimization (or nonlinear programming) is the term used to describe an optimization problem when the objective or constraint functions are not linear, but not known to be convex. Sadly, there are no effective methods for solving the general nonlinear programming problem (1.1). Even simple looking problems with as few as ten variables can be extremely challenging, while problems with a few hundreds of variables can be intractable. Methods for the general nonlinear programming problem therefore take several different approaches, each of which involves some compromise."

$$\text{minimize } f_0(x), \; x \in \mathbb{R}^n \quad \text{subject to } f_i(x) \leq b_i, \; i = 1, \dots, m \qquad (1.1)$$

Boyd & Vandenberghe, Convex Optimization, Cambridge University Press, 2004.

47 / 61

SLIDE 48

Convex Optimization – Boyd & Vandenberghe

From the section on local optimization approaches to nonlinear optimization:

"Roughly speaking, local optimization methods are more art than technology. Local optimization is a well developed art, and often very effective, but it is nevertheless an art."

Boyd & Vandenberghe, Convex Optimization, Cambridge University Press, 2004.

48 / 61

SLIDE 49

Gradient Descent for Neural Networks

◮ Recall we wanted to minimize
$$L(\theta) = \sum_{i=1}^{n} d(f(x_i, \theta), y_i)$$
Gradient descent gives the iterations $\theta_{k+1} = \theta_k - \alpha_k \nabla L(\theta_k)$ for some step length $\alpha_k > 0$.
◮ What happens to the computational cost if $n$ is very large, say $n \approx 1\,200\,000$?

49 / 61

SLIDE 50

Stochastic Gradient Descent for Neural Networks

◮ Create a partition $\{T_1, \dots, T_k\}$ of the numbers $\{1, \dots, n\}$, where each $|T_j| \leq s$.
◮ Let $G_j(\theta) = \sum_{i \in T_j} \nabla_\theta C(f(x_i, \theta), y_i)$
◮ Perform the updates

1: t = 0
2: for e = 1, ..., M do
3:     for j = 1, ..., k do
4:         θ_{t+1} = θ_t − α_t G_j(θ_t)
5:         t = t + 1
6: return θ_{kM}
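A minimal NumPy sketch of this loop (not from the slides; grad_C stands for the gradient of C with respect to θ and is assumed given):

import numpy as np

def sgd(theta, grad_C, data, s, alpha, M):
    # Partition {1, ..., n} into batches T_1, ..., T_k with |T_j| <= s
    n = len(data)
    batches = [range(i, min(i + s, n)) for i in range(0, n, s)]
    for e in range(M):                                   # M epochs
        for T in batches:
            G = sum(grad_C(theta, data[i]) for i in T)   # G_j(theta)
            theta = theta - alpha * G
    return theta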

50 / 61

SLIDE 51

Alternative update rules

GD with momentum, $0 < \gamma < 1$:
$$v_{t+1} = \gamma v_t + \eta G_j(\theta_t), \qquad \theta_{t+1} = \theta_t - v_{t+1}$$

Individual scaling of the different parameters (Adagrad, RMSprop, Adam):
$$\theta_{t+1} = \theta_t - D_t G_j(\theta_t)$$
where $D_t$ is a diagonal matrix depending on some or all of the previously computed gradients.
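The momentum update as code (a sketch; gamma and eta are hyperparameters you choose, and theta, v, G are NumPy arrays):

def momentum_step(theta, v, G, gamma=0.9, eta=0.01):
    # v_{t+1} = gamma * v_t + eta * G_j(theta_t)
    # theta_{t+1} = theta_t - v_{t+1}
    v = gamma * v + eta * G
    return theta - v, v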

51 / 61

SLIDE 52

Tensorflow

import tensorflow as tf
import numpy as np

Most important tensors:
◮ tf.Variable (Must be initialized. Can take gradient)
◮ tf.placeholder (Input to the network)
◮ tf.constant (Constant values)
◮ tf.Tensor (Output of an operation)

Important attributes:
◮ shape (Default is None, i.e. not specified)
◮ dtype (tf.float32, tf.int32, . . .)
◮ name (Will be assigned a name if not specified)

52 / 61

SLIDE 53

Graph: x, A → z1 = Ax;  z1, b → z2 = z1 + b

◮ A: tf.Variable
◮ x: tf.placeholder
◮ z1: tf.Tensor
◮ b: tf.Variable, tf.placeholder or tf.constant
◮ z2: tf.Tensor

53 / 61

SLIDE 54

Tensorflow

# Nodes in a graph
a = tf.Variable(initial_value=np.random.randn(1,3),
                name='weights', dtype=tf.float32)
b = tf.Variable(initial_value=[0], name='bias',
                dtype=tf.float32)
print(a)
print(b)

$ python3 program_name.py
<tf.Variable 'weights:0' shape=(1, 3) dtype=float32_ref>
<tf.Variable 'bias:0' shape=(1,) dtype=float32_ref>

54 / 61

SLIDE 55

Linear regression

# Code generating all the data
N = 50
a_true = np.array([[4., -5, 3]], dtype=np.float32)
b_true = np.array([2], dtype=np.float32)
x_data = np.concatenate(
    (np.random.randn(1, N),
     np.random.uniform(size=[1, N]),
     np.random.chisquare(df=3.0, size=(1, N))) )
noise = 0.01*np.random.randn(1, N)
labels = np.dot(a_true, x_data) + b_true # + noise

$$a = \begin{pmatrix} 4 & -5 & 3 \end{pmatrix}, \quad b = 2, \quad x_i \in \mathbb{R}^3, \quad x_i^{\top} a + b = y_i, \quad i = 1, \dots, N$$

55 / 61

SLIDE 56

Tensorflow

# Nodes in a graph
a = tf.Variable(initial_value=np.random.randn(1,3),
                name='weights', dtype=tf.float32)
b = tf.Variable(initial_value=[0], name='bias',
                dtype=tf.float32)
X = tf.placeholder(dtype=tf.float32, name='data',
                   shape=[3, N])
prediction = tf.linalg.matmul(a, X) + b  # TF graph
print(X)
print(prediction)

$ python3 program_name.py
Tensor("data:0", shape=(3, 50), dtype=float32)
Tensor("add:0", shape=(1, 50), dtype=float32)

56 / 61

SLIDE 57

Tensorflow – Sessions

◮ Graphs only define the function you would like to compute.
◮ To execute a graph (function), open a tf.Session().

init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)  # All variables must be initialized
    # All relevant placeholders go into the feed_dict
    pred = sess.run(prediction, feed_dict={X: x_data})
    a_start = sess.run(a)
    print(a_start)
    print(pred)
    # pred is a numpy array with values = a*data + b

$ python3 program_name.py
[[-0.9025026   0.6354202  -0.09739944]]
[[-0.86136425  0.6985589   0.51153713  1.2961135  ...  0.91275173
  -1.0157912  -0.41740212  0.45071918  0.3727951  -0.81552047]]

57 / 61

SLIDE 58

Tensorflow – Gradient Descent

Y = tf.placeholder(dtype=tf.float32, name='label',
                   shape=[1, N])
# Compute sum_{i} (y[i]-prediction[i])^2
loss = tf.reduce_sum(tf.pow(prediction - Y, 2))
nbr_epochs = 100
step_length = 0.01  # often called learning rate
optimizer = tf.train.GradientDescentOptimizer(
    step_length).minimize(loss)

with tf.Session() as sess:
    sess.run(init)  # All variables must be initialized
    for epoch in range(nbr_epochs):
        # Do a gradient descent step
        sess.run(optimizer, feed_dict={X: x_data, Y: labels})
    a_pred, b_pred = sess.run([a, b])

58 / 61

SLIDE 59

NeurIPS (formerly NIPS)

Submitted papers

◮ 2016: 2406 submissions
◮ 2017: 3240 submissions
◮ 2018: ∼4900 submissions
Source: Twitter

59 / 61

SLIDE 60

SLIDE 61