Lecture 01 – Part 01 Algorithms
Recall DSC 40A...
▶ How do we formalize learning from data?
▶ How do we turn it into something a computer can do?
Example: Predicting Salary
The End
$\vec{w} = (X^T X)^{-1} X^T \vec{b}$
Wait...
▶ We actually need to compute the answer...
▶ We need an algorithm.
An Algorithm?
>>> import numpy as np
>>> w = np.linalg.solve(X.T @ X, X.T @ b)
▶ Will it work for 1,000,000 data points?
▶ What about for 1,000,000 features?
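A minimal sketch of the one-liner above, run on synthetic data. The sizes, the random seed, and the names `rng` and `true_w` are assumptions made for illustration; they are not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)        # fixed seed so the run is repeatable
n, d = 1000, 3                        # hypothetical: 1000 data points, 3 features
X = rng.normal(size=(n, d))           # design matrix, one row per data point
true_w = np.array([2.0, -1.0, 0.5])   # weights used to generate the targets
b = X @ true_w                        # noiseless targets, so w should recover true_w

# Solve the normal equations (X^T X) w = X^T b without forming the inverse.
w = np.linalg.solve(X.T @ X, X.T @ b)
```

Note that `X.T @ X` is a d-by-d matrix: forming it takes on the order of n·d² operations and solving the system roughly d³, so adding features is far more expensive than adding data points.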
Example: Minimize Error
▶ Goal: summarize a collection of numbers, $x_1, \ldots, x_n$.
▶ Idea: find number $M$ minimizing the total absolute error: $\sum_{i=1}^{n} |M - x_i|$
Example: Minimize Error
▶ Solution: The median of $x_1, \ldots, x_n$.
▶ But how do we actually compute the median?
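As a small sanity check of the claim that the median minimizes total absolute error, the sketch below compares the median against a grid of candidate values. The sample numbers and the grid of candidates are made up for illustration.

```python
import statistics

def total_absolute_error(M, xs):
    """Sum of |M - x| over all numbers x in xs."""
    return sum(abs(M - x) for x in xs)

xs = [3, 1, 9, 4, 7]                  # hypothetical data
median = statistics.median(xs)        # sorts the data and takes the middle value

# Try candidates 0.0, 0.1, ..., 10.0 and keep the one with the smallest error.
candidates = [c / 10 for c in range(0, 101)]
best = min(candidates, key=lambda M: total_absolute_error(M, xs))
```

Here `statistics.median` is one way to compute the median; it relies on sorting, which is only one of several possible approaches.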
Lecture 01 – Part 02 Example: Clustering
Clustering
▶ Given a pile of data, discover similar groups.
▶ Examples:
  ▶ Find political groups within social network data.
  ▶ Given data on COVID-19 symptoms, discover groups that are affected differently.
  ▶ Find the similar regions of an image (segmentation).
▶ Most useful when data is high dimensional...
Example: Old Faithful
Clustering
▶ Goal: for computer to identify the two groups in the data.
Clustering
▶ How do we turn this into something a computer can do?
▶ DSC 40A says: “Turn it into an optimization problem”.
▶ Idea: develop a way of quantifying the “goodness” of a clustering; find the best.
Quantifying Separation
Define the “separation” $\delta(B, R)$ to be the smallest distance between a blue point and red point.
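The definition of separation can be sketched directly as a minimum over all blue/red pairs. The slide only says "distance"; Euclidean distance is assumed here, and the function name `separation` and the sample points are hypothetical.

```python
import math

def separation(blue, red):
    """Smallest Euclidean distance between a blue point and a red point."""
    return min(math.dist(b, r) for b in blue for r in red)

blue = [(0.0, 0.0), (1.0, 0.0)]       # hypothetical blue cluster
red = [(4.0, 4.0), (1.0, 3.0)]        # hypothetical red cluster
sep = separation(blue, red)           # closest pair is (1, 0) and (1, 3)
```

This checks every blue/red pair, so its cost grows with the product of the two cluster sizes.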
The Problem
▶ Given: $n$ points $\vec{x}^{(1)}, \ldots, \vec{x}^{(n)}$.
▶ Find: an assignment of points to clusters R and B so as to maximize $\delta(B, R)$.
The End
The “Brute Force” Algorithm
▶ There are finitely-many possible clusterings.
▶ Algorithm: Try each possible clustering, return that with largest separation, $\delta(B, R)$.
▶ This is called a brute force algorithm.
best_separation = float('-inf')  # Python for "negative infinity"; we are maximizing
best_clustering = None
for clustering in all_clusterings(data):
    sep = calculate_separation(clustering)
    if sep > best_separation:  # keep the clustering with the largest separation
        best_separation = sep
        best_clustering = clustering
print(best_clustering)
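The brute-force loop can be fleshed out into something runnable. The slides name `all_clusterings` and `calculate_separation` but do not define them, so the bodies below are assumptions: clusterings are generated with `itertools.product`, empty clusters are skipped, and distance is taken to be Euclidean. The wrapper `brute_force` and the sample data are also hypothetical.

```python
import math
from itertools import product

def all_clusterings(data):
    """Yield every assignment of points to two non-empty clusters (blue, red)."""
    for labels in product("BR", repeat=len(data)):
        blue = [p for p, lab in zip(data, labels) if lab == "B"]
        red = [p for p, lab in zip(data, labels) if lab == "R"]
        if blue and red:              # separation is undefined for an empty cluster
            yield blue, red

def calculate_separation(clustering):
    """Smallest Euclidean distance between a blue point and a red point."""
    blue, red = clustering
    return min(math.dist(b, r) for b in blue for r in red)

def brute_force(data):
    """Try every clustering and return the one with the largest separation."""
    best_separation = float("-inf")
    best_clustering = None
    for clustering in all_clusterings(data):
        sep = calculate_separation(clustering)
        if sep > best_separation:
            best_separation = sep
            best_clustering = clustering
    return best_clustering, best_separation

data = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]   # two obvious groups
(best_blue, best_red), best_sep = brute_force(data)
```

On this tiny data set the search recovers the two visually obvious groups, but it examines every one of the label assignments to do so.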
The End
Wait...
▶ How long will this take to run if there are $n$ points?
▶ How many clusterings of $n$ things are there?
Combinatorics
▶ How many ways are there to assign R or B to $n$ objects?
▶ Two choices¹ for each object: $2 \times 2 \times \ldots \times 2 = 2^n$.
¹ Small nitpick: actual color doesn’t matter, $2^{n-1}$.
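The two counts can be checked by enumeration for small inputs. This is only an illustrative sketch; the function names are hypothetical, and the second count treats two labelings as the same clustering when one is the other with the colors swapped.

```python
from itertools import product

def count_assignments(n):
    """Number of ways to label n objects with R or B."""
    return sum(1 for _ in product("RB", repeat=n))

def count_distinct_clusterings(n):
    """Number of labelings when swapping the two colors does not count as new."""
    seen = set()
    for labels in product("RB", repeat=n):
        swapped = tuple("B" if lab == "R" else "R" for lab in labels)
        seen.add(min(labels, swapped))   # one canonical representative per pair
    return len(seen)
```

For example, `count_assignments(5)` gives 32 and `count_distinct_clusterings(5)` gives 16. Both counts include the trivial labeling that puts every object in one cluster.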
Time
▶ Suppose it takes at least 1 nanosecond to check a single clustering.
  ▶ One billionth of a second.
▶ If there are $n$ points, it will take at least $2^n$ nanoseconds to check all clusterings.
Time Needed
n     Time (approx.)
1     1 nanosecond
10    1 microsecond
20    1 millisecond
30    1 second
40    18 minutes
50    13 days
60    36 years
70    37,000 years
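The larger entries in the table can be reproduced with a few lines of arithmetic, assuming exactly 1 nanosecond per clustering and a 365.25-day year. The function name `brute_force_seconds` is hypothetical.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600

def brute_force_seconds(n, ns_per_clustering=1.0):
    """Seconds needed to check all 2**n clusterings at the given cost each."""
    return 2 ** n * ns_per_clustering * 1e-9

minutes_40 = brute_force_seconds(40) / 60
days_50 = brute_force_seconds(50) / 86400
years_60 = brute_force_seconds(60) / SECONDS_PER_YEAR
years_70 = brute_force_seconds(70) / SECONDS_PER_YEAR
```

Every 10 extra points multiplies the running time by about 1,000, since $2^{10} = 1024$.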
Example: Old Faithful
▶ The Old Faithful data set has 270 points.
▶ Brute force algorithm will finish in $6 \times 10^{64}$ years.
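The figure for 270 points follows from the same arithmetic, assuming 1 nanosecond per clustering and about $3.16 \times 10^{7}$ seconds per year:

```latex
\begin{align*}
2^{270}\ \text{ns} &\approx 1.9 \times 10^{81}\ \text{ns}
  && \text{since } 270 \log_{10} 2 \approx 81.28 \\
 &= 1.9 \times 10^{72}\ \text{s}
  && \text{dividing by } 10^{9}\ \text{ns per second} \\
 &\approx \frac{1.9 \times 10^{72}}{3.16 \times 10^{7}}\ \text{years}
  \approx 6 \times 10^{64}\ \text{years}
\end{align*}
```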
Algorithm Design
▶ Often, most obvious algorithm is unusably slow.
▶ Does this mean our problem is too hard?
▶ We’ll see an efficient solution by the end of the quarter.
DSC 40B
▶ Assess the efficiency of algorithms.
▶ Understand why and how common algorithms work.
▶ Develop faster algorithms using design strategies and data structures.