
What's Next?



  1. What's Next? 1. What's next? 2. K-means

  2. What's next? • Last Class Friday • No office hours Thurs

  3. What's next? Programming • Problems get bigger • Hundreds or thousands of lines of code in a program • Programs themselves become complex objects worthy of study! • Software engineering • Programming techniques that support large and complex software • Object-oriented programming (CS18) • Event-driven programming (most web stuff) • …

  4. What's next? Data structures • Lists, trees are very simple • Amenable to recursive approaches • Build on these: heaps, priority queues, … • Generalize: • Directed acyclic graphs • Prerequisite structure in course requirements makes a good example • Directed graphs • Streets in a city (some of them one-way) for example • Edges often "labelled" with data like "how long to traverse this one block stretch?" • Problems like "find shortest path" (i.e., quickest route from here to there) • … [CS1570]
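
The "find shortest path" problem on a labelled directed graph can be sketched as follows. This is a minimal illustration using Dijkstra's algorithm with a priority queue, which the slides do not name; the function `shortest_path_cost`, the `streets` map, and the traversal times are all hypothetical.

```python
import heapq

def shortest_path_cost(graph, start, goal):
    """Return the smallest total traversal time from start to goal, or None."""
    # graph maps each node to a list of (neighbor, minutes) pairs.
    # A one-way street appears in one direction only.
    best = {start: 0}
    frontier = [(0, start)]  # priority queue ordered by time so far
    while frontier:
        cost, node = heapq.heappop(frontier)
        if node == goal:
            return cost
        if cost > best.get(node, float("inf")):
            continue  # stale entry: a quicker route to node was already found
        for neighbor, minutes in graph.get(node, []):
            new_cost = cost + minutes
            if new_cost < best.get(neighbor, float("inf")):
                best[neighbor] = new_cost
                heapq.heappush(frontier, (new_cost, neighbor))
    return None  # goal is unreachable from start

# Hypothetical street map: every edge here is one-way.
streets = {
    "A": [("B", 2), ("C", 5)],
    "B": [("C", 1)],
    "C": [("D", 2)],
    "D": [],
}

print(shortest_path_cost(streets, "A", "D"))  # 5, via A -> B -> C -> D
print(shortest_path_cost(streets, "D", "A"))  # None: no street leads back
```

The priority queue here is the kind of structure the slide means by "build on these": it is a heap underneath.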

  5. What's next? Analysis • Analysis of probabilistic programs like RandSelect • Analysis of performance of more complicated data structures • Analysis of algorithms like shortest-path • Study of "effective" solutions to (some instances of) provably hard problems

  6. What's next? Algorithms • How does Google work? • How does Facebook choose which ads to show you? • How do we recognize unusual behaviors? • Securities fraud • Crime • How do you make a drone deliver a package? • How does Disney/Pixar make Frozen II?

  7. A shift in style • In CS17, we've been very concrete: let's sort this list of numbers, let's find an integer in a tree with int-values at nodes, etc. • ADTs moved away from this a little: we have a Dictionary, but we don't know the details, only the runtime performance • In general CS work, the gap between the real world and the code is much greater

  8. A conceptual gap • The internet consists of a bunch of computers tied together by network connections from computers to routers (specialized computers that can pass data from one machine to another) • The routers are interconnected as well • The connections come and go; some are permanent, some are very temporary • How do we get data from my computer to yours? • We'll work out an algorithm in which we somehow represent what a router is or can do, but in discussing the algorithm, we'll just draw pictures, etc. • Leave implementation for later

  9. What makes the next steps difficult? • Complexity • Abstraction helps • Messiness • Real-world data doesn't arrive nicely formatted as lists • Real-world output often needs to be in specialized forms • Variety • Program output might need to go to a file, to your screen, to a remote computer, to your computer's speaker • Does every program need to consider the possibilities of every device? • Unreliability • Networks that fail • Programs interrupted by OS • Humans that type weird inputs • Data sources that are corrupted • …

  10. Example problem and algorithm • We have a bunch of data: • We'd like to "classify" it into clusters (red dots could be cluster centers) • Notice how difficult it is to even specify the problem precisely!

  11. Idea • First, decide how many clusters (by hand?) • really annoying assumption, relieved by fancier methods • For our example, pick k = 2. • Grab any two points in the dataset as "centers"

  12. Divide the data into those closer to each point

  13. For each group, find the "mean"

  14. Using these new means, reclassify!

  15. Repeat until stabilized • What does "stabilized" mean?
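
The steps on slides 11 to 15 can be sketched in a few lines. This is one possible reading, not a definitive implementation: it assumes points are tuples of numbers, that "closer" means straight-line (Euclidean) distance, and that "stabilized" means the centers did not move in the last round. The names `k_means`, `assign`, `mean`, `distance`, and the `max_rounds` safety cap are all hypothetical.

```python
import math
import random

def distance(p, q):
    """Straight-line distance between two points of the same dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def mean(points):
    """Coordinate-wise average of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def assign(points, centers):
    """Divide the data into groups, one per center, by nearest center."""
    groups = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: distance(p, centers[i]))
        groups[nearest].append(p)
    return groups

def k_means(points, k, seed=0, max_rounds=100):
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # grab any k data points as centers
    for _ in range(max_rounds):
        groups = assign(points, centers)
        # An empty group keeps its old center (an assumption of this sketch).
        new_centers = [mean(g) if g else c for g, c in zip(groups, centers)]
        if new_centers == centers:  # "stabilized": no center moved
            break
        centers = new_centers
    return centers, assign(points, centers)

data = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers, groups = k_means(data, k=2)
print(sorted(centers))  # [(0.0, 0.5), (10.0, 10.5)]
```

Because `distance` and `mean` work coordinate by coordinate, the same sketch runs unchanged on 3D or 10D tuples.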

  16. What didn't I mention? • How to find distances • Are data points stored in a list? An array? A tree? • What are the piles we created? • Are data points lists of ints? of floats? Are they tuples? • Are they all 2-dimensional? Could this work in 3D? in 10D?

  17. Skills • Whatever math is needed • Whatever else is needed • For graphics: physics, … • An ability to guess some representation of the problem that might work • The ability to translate a pictorial record of a discussion into an actual algorithm ("pseudocode") and then a real program ("code") • Analysis (during and after the fact)

  18. Application: classifying web pages • Start with a word list of all words that you want to consider (e.g., the words in a Merriam-Webster Dictionary) • Take a web page, and for each word, mark how often it appears: a: 4, aardvark: 0, absolute: 2, abrupt: 0, …
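
Counting words against a fixed word list might look like this. It is a minimal sketch: the tokenizing rule (lowercase, runs of letters and apostrophes), the tiny `vocabulary`, and the sample `page` text are assumptions made for illustration, chosen to reproduce the counts on the slide.

```python
import re
from collections import Counter

def bag_of_words(text, vocabulary):
    """Return how often each vocabulary word appears in text, in vocabulary order."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)  # a Counter reports 0 for words it never saw
    return [counts[w] for w in vocabulary]

vocabulary = ["a", "aardvark", "absolute", "abrupt"]
page = "A page with a word or two: absolute zero, a fact, and a second absolute."

print(bag_of_words(page, vocabulary))  # [4, 0, 2, 0]
```

Word order is discarded on purpose: only the counts survive, which is why the result is called a "bag".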

  19. • List of associated numbers ("bag of words") tells you something about the web page • Two pages whose word-counts look alike are "nearby" • Challenges: • What if my word-counts are exactly 5 times your word-counts? • Our pages are probably very similar! • Idea from geometry: treat the count-list as a direction in N-dimensional space (where N is the number of words in your dictionary) and divide it by its length to get a length-1 "vector". • Use "angle between directions" as a measure of "distance" • Now apply k-means.
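
The "angle between directions" idea can be sketched as below, assuming each page is already a list of word counts over the same dictionary. The function names `normalize` and `angle_between` are hypothetical, and rejecting an all-zero count-list is a choice made for this sketch.

```python
import math

def normalize(counts):
    """Divide a count-list by its length to get a length-1 vector."""
    length = math.sqrt(sum(c * c for c in counts))
    if length == 0:
        raise ValueError("page contains no dictionary words")
    return [c / length for c in counts]

def angle_between(counts_a, counts_b):
    """Angle in radians between two count-lists, used as a 'distance'."""
    a, b = normalize(counts_a), normalize(counts_b)
    dot = sum(x * y for x, y in zip(a, b))
    # Clamp to [-1, 1] so rounding error cannot push acos out of range.
    return math.acos(max(-1.0, min(1.0, dot)))

yours = [4, 0, 2, 0]
mine = [20, 0, 10, 0]      # exactly 5 times your counts
other = [0, 3, 0, 1]       # shares no words with either page

print(angle_between(yours, mine) < 1e-6)   # True: same direction
print(angle_between(yours, other))         # pi/2: nothing in common
```

Scaling every count by 5 changes the length of the vector but not its direction, so the two pages come out at (essentially) zero distance.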

  20. Problems • This doesn't work • "Common" words ("the", "of", "and", "a", …) completely dominate everything else. • Remove those? [Commonly called a "stop list"] • Then it works…sort of OK. • Exotic words ("epimetheus"), which don't get counted at all, may be the most important "signature" • Maybe we need to "weight" the word-counts by word-use-frequency! • That handles stop-words as well: they're so frequent that they get totally discounted • That doesn't work either. :( Sufficiently rare words mess up everything • Misspell "Brady" as "Berady" and suddenly your football webpage is all about Arabic surnames
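
One way to "weight the word-counts by word-use-frequency" is sketched below. The slides give no formula, so this uses inverse document frequency (the log of total pages over pages containing the word), a common choice that is an assumption here; the names `idf_weights` and `reweight` and the page counts are made up for illustration.

```python
import math

def idf_weights(pages):
    """One weight per word: log(total pages / pages containing the word)."""
    n_pages = len(pages)
    n_words = len(pages[0])
    weights = []
    for j in range(n_words):
        containing = sum(1 for counts in pages if counts[j] > 0)
        # A word that appears nowhere gets weight 0 (a choice of this sketch).
        weights.append(math.log(n_pages / containing) if containing else 0.0)
    return weights

def reweight(counts, weights):
    return [c * w for c, w in zip(counts, weights)]

# Columns: "the", "football", "berady" (a misspelling on one page).
pages = [
    [50, 3, 1],
    [40, 5, 0],
    [60, 0, 0],
    [45, 0, 0],
]
weights = idf_weights(pages)
print(weights)                      # "the" gets weight 0.0
print(reweight(pages[0], weights))  # "berady" now outweighs each "football"
```

"the" appears on every page, so its weight is log(1) = 0 and it is totally discounted. The single misspelling gets the largest per-occurrence weight of all, which is the "rare words mess up everything" problem in miniature.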

  21. Results • The core algorithm (cluster by "nearer than") is simple • Applying it in a new domain makes us consider hard questions • What are "distance" and "sameness"? • Why is word-use so skewed? Why are rare words so rare? • How is the vocabulary of a sentence related to its meaning? • When we used a bag-of-words, did we already throw away the essence? • Most of these questions are outside the domain of "pure computer science." • …which just shows that "pure computer science" may be a misguided notion • The "rare words" problem, and related ones, actually showed that k-means isn't the right path to follow • Led to multidimensional scaling, topic clustering, etc.

  22. Summary • Pure CS is interesting… • … but it gets better (and harder) when it's influenced by the real world • CS also influences the world • We have a responsibility to consider that influence as we work
