What's Next?

Aug 15, 2022

SLIDE 1

What's Next?

1. What's next?
2. K-means

SLIDE 2

What's next?

Last Class Friday
No office hours Thurs

SLIDE 3

What's next? Programming

Problems get bigger
Hundreds or thousands of lines of code in a program
Programs themselves become complex objects worthy of study!
Software engineering
Programming techniques that support large and complex software
Object-oriented programming (CS18)
Event-driven programming (most web stuff)
…

SLIDE 4

What's next? Data structures

Lists, trees are very simple
Amenable to recursion approaches
Build on these: heaps, priority queues, …
Generalize:
Directed acyclic graphs
Prerequisite structure in course requirements make a good example
Directed graphs
Streets in a city (some of them one-way) for example
Edges often "labelled" with data like "how long to traverse this one-block stretch?"
Problems like "find shortest path" (i.e., quickest route from here to there)
… [CS1570]
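
The "find shortest path" problem on a directed graph with labelled edges can be sketched with a priority queue (a heap), one of the structures mentioned above. The following is a minimal sketch of Dijkstra's algorithm, not something taken from the slides; the street map, the node names, and the function name `shortest_path_lengths` are hypothetical, and travel times are assumed to be non-negative.

```python
import heapq


def shortest_path_lengths(graph, source):
    """Return the minimum travel time from source to every reachable node.

    graph maps each node to a list of (neighbor, travel_time) pairs.
    Assumption: all travel times are non-negative.
    """
    best = {source: 0}
    frontier = [(0, source)]  # priority queue ordered by travel time so far
    while frontier:
        dist, node = heapq.heappop(frontier)
        if dist > best.get(node, float("inf")):
            continue  # stale entry: a quicker route to node was already found
        for neighbor, travel_time in graph.get(node, []):
            candidate = dist + travel_time
            if candidate < best.get(neighbor, float("inf")):
                best[neighbor] = candidate
                heapq.heappush(frontier, (candidate, neighbor))
    return best


# Hypothetical street map. Every street here is one-way: there is an
# edge A -> B but no edge B -> A.
streets = {
    "A": [("B", 4), ("C", 1)],
    "B": [("D", 1)],
    "C": [("B", 2), ("D", 6)],
    "D": [],
}

print(shortest_path_lengths(streets, "A"))
```

In this map the quickest route from A to B goes through C (1 + 2) rather than along the direct edge (4), which is the kind of answer the edge labels make possible.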

SLIDE 5

What's next? Analysis

Analysis of probabilistic programs like RandSelect
Analysis of performance of more complicated data structures
Analysis of algorithms like shortest-path
Study of "effective" solutions to (some instances of) provably hard problems

SLIDE 6

What's next? Algorithms

How does Google work?
How does Facebook choose which ads to show you?
How do we recognize unusual behaviors?
Securities fraud
Crime
How do you make a drone deliver a package?
How does Disney/Pixar make Frozen II?

SLIDE 7

A shift in style

In CS17, we've been very concrete: let's sort this list of numbers, let's find an integer in a tree with int-values at nodes, etc.
ADTs moved away from this a little: we have a Dictionary, but we don't know the details, only the runtime performance
In general CS work, the gap between the real world and the code is much greater

SLIDE 8

A conceptual gap

The internet consists of a bunch of computers tied together by network connections from computers to routers (specialized computers that can pass data from one machine to another)
The routers are interconnected as well
The connections come and go; some are permanent, some are very temporary
How do we get data from my computer to yours?
We'll work out an algorithm in which we somehow represent what a router is or can do, but in discussing the algorithm, we'll just draw pictures, etc.
Leave implementation for later

SLIDE 9

What makes the next steps difficult?

Complexity
Abstraction helps
Messiness
Real-world data doesn't arrive nicely formatted as lists
Real-world output often needs to be in specialized forms
Variety
Program output might need to go to a file, to your screen, to a remote computer, to your computer's speaker
Does every program need to consider the possibilities of every device?
Unreliability
Networks that fail
Programs interrupted by OS
Humans that type weird inputs
Data sources that are corrupted
…

SLIDE 10

SLIDE 11

Example problem and algorithm

We have a bunch of data:
We'd like to "classify" it into clusters (red dots could be cluster centers)
Notice how difficult it is to even specify the problem precisely!

SLIDE 12

Idea

First, decide how many clusters (by hand?)
A really annoying assumption, relieved by fancier methods
For our example, pick k = 2.
Grab any two points in the dataset as "centers"

SLIDE 13

SLIDE 14

Divide the data into those closer to each point

SLIDE 15

SLIDE 16

For each group, find the "mean"

SLIDE 17

SLIDE 18

Using these new means, reclassify!

SLIDE 19

SLIDE 20

Repeat until stabilized

What does "stabilized" mean?
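
The steps on the preceding slides (pick centers, divide the data, find the means, reclassify, repeat) can be sketched as follows. This is a minimal sketch under stated assumptions, not the course's implementation: points are assumed to be tuples of numbers, distance is assumed to be ordinary Euclidean distance, and "stabilized" is read as "no point changed groups in the last round", which is only one possible answer to the question above. The names `closest_center`, `mean`, `k_means`, and `max_rounds` are hypothetical.

```python
import math


def closest_center(point, centers):
    """Index of the center nearest to point (Euclidean distance assumed)."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))


def mean(points):
    """Coordinate-wise average of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, centers, max_rounds=100):
    """Alternate 'divide' and 'find the mean' until no point changes group."""
    centers = list(centers)
    labels = None
    for _ in range(max_rounds):
        new_labels = [closest_center(p, centers) for p in points]
        if new_labels == labels:
            break  # stabilized: the grouping is the same as last round
        labels = new_labels
        for i in range(len(centers)):
            group = [p for p, label in zip(points, labels) if label == i]
            if group:  # an empty group keeps its old center
                centers[i] = mean(group)
    return centers, labels


# Hypothetical data: two visually obvious clumps, with k = 2.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
# "Grab any two points in the dataset as centers."
final_centers, final_labels = k_means(data, [data[0], data[1]])
print(final_centers, final_labels)
```

Because `mean` and `math.dist` work coordinate by coordinate, the same sketch runs unchanged on 3-dimensional or 10-dimensional tuples.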

SLIDE 21

SLIDE 22

What didn't I mention?

How to find distances
Are data points stored in a list? An array? A tree?
What are the piles we created?
Are data points lists of ints? of floats? Are they tuples?
Are they all 2-dimensional? Could this work in 3D? in 10D?

SLIDE 23

Skills

Whatever math is needed
Whatever else is needed
For graphics: physics, …
An ability to guess some representation of the problem that might work
The ability to translate a pictorial record of a discussion into an actual algorithm ("pseudocode") and then a real program ("code")
Analysis (during and after the fact)

SLIDE 24

Application: classifying web pages

Start with a word list of all words that you want to consider (e.g., the words in a Merriam-Webster Dictionary)
Take a web page, and for each word, mark how often it appears:

a 4 aardvark absolute 2 abrupt … … … …
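
Counting how often each word on the list appears can be sketched in a few lines. This is a minimal sketch: the four-word vocabulary and the sample page text are made up for illustration, the function name `bag_of_words` is hypothetical, and splitting text into words with a simple regular expression is an assumption (real pages need messier handling).

```python
import re
from collections import Counter


def bag_of_words(text, vocabulary):
    """Return one count per vocabulary word, in vocabulary order."""
    # Assumption: a "word" is a run of letters or apostrophes, case ignored.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    return [counts[word] for word in vocabulary]


# Hypothetical word list and page text.
vocabulary = ["a", "aardvark", "absolute", "abrupt"]
page = "A cat, a dog, a bird and a fish: absolute, absolute chaos."
print(bag_of_words(page, vocabulary))
```

Words on the page that are not in the vocabulary ("cat", "chaos") are simply ignored, and vocabulary words that never appear get a count of zero.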

SLIDE 25

List of associated numbers ("bag of words") tells you something about the web page
Two pages whose word-counts look alike are "nearby"
Challenges:
What if my word-counts are exactly 5 times your word-counts?
Our pages are probably very similar!
Idea from geometry: treat the count-list as a direction in N-dimensional space (where N is the number of words in your dictionary) and divide it by its length to get a length-1 "vector".
Use "angle between directions" as a measure of "distance"
Now apply k-means.
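
The geometric idea above (divide the count-list by its length, then measure the angle between directions) can be sketched as follows. This is a minimal sketch; the function names `normalize` and `angle_between` are hypothetical, and the count lists are made up. It also assumes neither page has all-zero counts, since such a list has no direction.

```python
import math


def normalize(counts):
    """Scale a count list so that its Euclidean length is 1."""
    length = math.sqrt(sum(c * c for c in counts))
    if length == 0:
        raise ValueError("an all-zero count list has no direction")
    return [c / length for c in counts]


def angle_between(counts_a, counts_b):
    """Angle in radians between two count lists viewed as directions."""
    a, b = normalize(counts_a), normalize(counts_b)
    cosine = sum(x * y for x, y in zip(a, b))
    # Clamp to [-1, 1] so floating-point rounding cannot break acos.
    return math.acos(max(-1.0, min(1.0, cosine)))


mine = [5, 10, 0]   # exactly 5 times the counts below
yours = [1, 2, 0]
other = [0, 0, 7]   # shares no words with either page

print(angle_between(mine, yours))  # essentially zero: same direction
print(angle_between(mine, other))  # a right angle: nothing in common
```

Scaling every count by 5 leaves the direction unchanged, so the two pages come out at (essentially) zero distance, which is the behavior the "5 times" challenge asks for.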

SLIDE 26

Problems

This doesn't work
"Common" words ("the", "of", "and", "a", …) completely dominate everything else.

Remove those? [Commonly called a "stop list"]
Then it works…sort of OK.
Exotic words ("epimetheus"), which don't get counted at all, may be the most important "signature"

Maybe we need to "weight" the word-counts by word-use-frequency!
That handles stop-words as well: they're so frequent that they get totally discounted
That doesn't work either. :( Sufficiently rare words mess up everything
Misspell "Brady" as "Berady" and suddenly your football webpage is all about Arabic surnames
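
One common way to "weight the word-counts by word-use-frequency" is inverse document frequency: a word's count is multiplied by the logarithm of (number of pages / number of pages containing the word). The slides do not say which weighting was used, so this is a sketch of one plausible scheme rather than the method from the lecture; the function names `idf_weights` and `reweight`, the three-word vocabulary, and the counts are all hypothetical.

```python
import math


def idf_weights(pages):
    """One weight per vocabulary word from a list of per-page count lists."""
    n_pages = len(pages)
    n_words = len(pages[0])
    weights = []
    for j in range(n_words):
        pages_with_word = sum(1 for counts in pages if counts[j] > 0)
        if pages_with_word == 0:
            weights.append(0.0)  # word never seen: nothing to weight
        else:
            # A word on every page gets log(1) = 0, i.e. totally discounted.
            weights.append(math.log(n_pages / pages_with_word))
    return weights


def reweight(counts, weights):
    """Multiply each count by the weight of its word."""
    return [c * w for c, w in zip(counts, weights)]


# Hypothetical counts for the vocabulary ["the", "football", "epimetheus"].
pages = [
    [50, 3, 0],
    [40, 0, 1],
    [60, 5, 0],
]
weights = idf_weights(pages)
print(weights)
print(reweight(pages[1], weights))
```

The stop-word "the" ends up with weight 0, while the word that appears on a single page gets the largest weight. That is also the weakness described above: a one-off misspelling is just as rare, so it gets the same outsized influence.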

SLIDE 27

Results

The core algorithm — cluster by "nearer than" — is simple
Applying it in a new domain makes us consider hard questions
What are "distance" and "sameness"?
Why is word-use so skewed? Why are rare words so rare?
How is the vocabulary of a sentence related to its meaning?
When we used a bag-of-words, did we already throw away the essence?
Most of these questions are outside the domain of "pure computer science."
…which just shows that "pure computer science" may be a misguided notion
The "rare words" problem, and related ones, actually showed that k-means isn't the right path to follow
Led to multidimensional scaling, topic clustering, etc.

SLIDE 28

Summary

Pure CS is interesting…
… but it gets better (and harder) when it's influenced by the real world

CS also influences the world
We have a responsibility to consider that influence as we work