Movie & Actor: QI, Xiaoxu; CHEN, Guanhao; JIN, Yue


  1. Movie & Actor: QI, Xiaoxu; CHEN, Guanhao; JIN, Yue

  2. OVERVIEW
     ● Goal: build a movie and actor portal that provides users with movie and actor data drawn from multiple data sources.
     ● Users can also search for the most popular movie (or actor) in a specified year or of a specific type. More interestingly, they can select a crew for a certain type of movie.

  3. CONTENT ● Specification ● Fetching Data ● Entity Resolution ● Data Fusion ● Data Portal ● Conclusion & Reference

  4. SPECIFICATION
     ✓ Data sources: (1) http://themoviedb.org/ (TMDB) (2) http://www.imdb.com/ (IMDB) (3) https://www.wikipedia.org/ (WIKI)
     ✓ Data file format: JSON & XML
     ✓ Database: MongoDB
     ✓ Programming language: Ruby

  5. Part 1: Fetching data

  6. Fetching data
     ● Crawling strategy:
       ● TMDB & WIKI: crawl all the data sequentially.
       ● IMDB: crawl with BFS. The popular movies on the front page serve as the seed URLs, and a thread-safe Queue stores the URLs. Multiple threads extract data from the current URL and push the new URLs found on that page back onto the queue.
     ● Raw data statistics:
       ● TMDB: 20,000+ movies & 20,000+ actors
       ● IMDB: 10,000+ movies & 11,000+ actors
       ● WIKI: 5,000+ movies & 7,000+ actors
     ● Raw data were stored in JSON or XML files.
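The IMDB strategy above can be sketched as follows. This is a minimal illustration, not the authors' crawler: `crawl`, `extract_links`, the `pending` counter used for termination, and the in-memory `links` hash that stands in for real HTTP fetching are all assumptions made for the example. Only the use of Ruby's thread-safe `Queue` with multiple worker threads comes from the slide.

```ruby
require "set"

# Crawl outward from the seed URLs in roughly breadth-first order.
# `extract_links` is any callable that maps a URL to the URLs found on that page.
def crawl(seeds, extract_links, worker_count: 4)
  seen = Set.new(seeds)
  return seen if seeds.empty?

  queue   = Queue.new     # thread-safe FIFO from Ruby core
  lock    = Mutex.new     # guards `seen` and `pending`
  pending = seeds.size    # URLs that are queued or currently being processed
  seeds.each { |url| queue << url }

  workers = Array.new(worker_count) do
    Thread.new do
      # Queue#pop returns nil once the queue is closed and empty.
      while (url = queue.pop)
        new_urls = extract_links.call(url)
        lock.synchronize do
          new_urls.each do |u|
            next unless seen.add?(u)   # add? returns nil if already seen
            pending += 1
            queue << u
          end
          pending -= 1
          queue.close if pending.zero? # nothing queued or in flight: stop
        end
      end
    end
  end
  workers.each(&:join)
  seen
end

# Hypothetical link graph standing in for real pages.
links = {
  "/movie/1" => ["/actor/1", "/movie/2"],
  "/movie/2" => ["/actor/1", "/actor/2"],
  "/actor/1" => ["/movie/1"],
  "/actor/2" => ["/movie/2", "/movie/3"],
  "/movie/3" => []
}
found = crawl(["/movie/1"], ->(url) { links.fetch(url, []) })
```

Counting the URLs that are queued or in flight lets the workers shut down cleanly once no thread can produce further URLs. A real crawler would also need rate limiting and error handling, which are omitted here.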

  8. Part 2: Entity Resolution

  9. Entity Resolution - Attribute Alignment
     ● Movie (attribute: data type): title: String; year: Integer; rating: Float; directors: Array; casts: Hash; main_casts: Array; total_time: Integer; languages: Array; alias: Array; country: Array; genre: Array; writers: Array; filming_locations: Array; keywords: Array; match_id: Integer; db_name: String
     ● Actor (attribute: data type): name: String; birthday: Date; gender: String; place_of_birth: String; nationality: String; known_credits: Integer; adult_actor: Boolean; years_active: String; alias: Array; biography: String; known_for: Array; match_id: Integer; db_name: String

  10. Entity Resolution – Methods
      ● Clustering based on characters (blocking)
        (i) Blocking movies: use the lowercase first character of the movie's title to split the data into sub-datasets.
        (ii) Blocking actors: use the first letters of the first name and the last name to form a key.
      ● Pairwise matching
        (i) Decision tree
        (ii) Pairwise matching scores: Jaro-Winkler; Monge-Elkan; Jaccard coefficient; N-grams.
        (iii) Transitivity, exclusivity, and functional dependency.
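A short sketch of the blocking keys and one of the listed similarity scores may help here. The function names are hypothetical, and two details are assumptions: the actor key is taken as first-name initial followed by last-name initial, and the Jaccard coefficient is computed over character bigrams. The slides do not say how the authors combined the Jaccard and N-gram measures.

```ruby
require "set"

# Blocking key for a movie: lowercase first character of the title.
def movie_block_key(title)
  title.strip[0].to_s.downcase
end

# Blocking key for an actor: initials of the first and last name.
def actor_block_key(name)
  parts = name.strip.split(/\s+/)
  return "" if parts.empty?
  (parts.first[0] + parts.last[0]).downcase
end

# Set of character n-grams of a string (bigrams by default).
def ngrams(text, n = 2)
  chars = text.downcase.chars
  return Set.new([text.downcase]) if chars.size < n
  Set.new(chars.each_cons(n).map(&:join))
end

# Jaccard coefficient: shared n-grams divided by all distinct n-grams.
def jaccard(a, b, n = 2)
  x = ngrams(a, n)
  y = ngrams(b, n)
  (x & y).size.to_f / (x | y).size
end

# Only titles that share a block are later compared pairwise.
titles = ["Alien", "Aliens", "Brazil"]
blocks = titles.group_by { |t| movie_block_key(t) }
```

Blocking trades a little recall for a large cut in comparisons: two records whose keys differ, for example because of a typo in the first letter, are never compared.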

  11. [Figure: blocking example with key "ab" and the entries Array Button, Barack Ajson, Adam W. Black]

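  13. [Figure: decision tree for pairwise matching. The root splits on name similarity using two thresholds, T1 and T2 (name similarity < T1; T1 <= name similarity < T2; T2 <= name similarity). Lower nodes compare birthday (same / different). Leaves: Match, Not Match, and Compute Distance between 2 entries.]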
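A decision tree of this kind might look like the sketch below. The figure did not survive extraction, so the assignment of leaves to branches here is one plausible reading rather than the authors' tree, and the threshold values are placeholders (the slides give no values for T1 and T2). `classify_pair` is a hypothetical name.

```ruby
require "date"

# Returns :match, :not_match, or :compute_distance (defer the decision to a
# full record-level distance). The branch-to-leaf mapping is an assumption.
def classify_pair(name_similarity, birthday_a, birthday_b, t1: 0.6, t2: 0.9)
  # Names too dissimilar: reject without looking at other fields.
  return :not_match if name_similarity < t1

  same_birthday = !birthday_a.nil? && birthday_a == birthday_b
  if name_similarity >= t2
    # Very similar names: an equal birthday confirms the match.
    same_birthday ? :match : :compute_distance
  else
    # Moderately similar names: an equal birthday earns a closer look.
    same_birthday ? :compute_distance : :not_match
  end
end

born = Date.new(1960, 5, 1)
decision = classify_pair(0.95, born, born)
```

The point of such a tree is that the cheap tests (name similarity, birthday equality) settle most pairs, so the more expensive distance over all attributes runs only on the uncertain ones.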

  15. We can skip many pairwise match calculations if we use transitivity and exclusivity.
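To illustrate how the two rules save work, here is a minimal sketch. It assumes records are hashes with hypothetical `:id` and `:source` keys, and it reads "exclusive" as: each source holds at most one record per real-world entity, so a cluster never needs two records from the same source. The slides do not spell out the authors' exact definition.

```ruby
# Cluster records while counting how many pairwise comparisons actually run.
# `matcher` is any callable that returns true when two records match.
def cluster(records, matcher)
  cluster_of = {}                     # record id => its cluster (Array)
  records.each { |r| cluster_of[r[:id]] = [r] }
  comparisons = 0

  records.combination(2) do |a, b|
    ca = cluster_of[a[:id]]
    cb = cluster_of[b[:id]]
    next if ca.equal?(cb)             # transitivity: already linked

    sources_a = ca.map { |r| r[:source] }
    sources_b = cb.map { |r| r[:source] }
    next unless (sources_a & sources_b).empty?  # exclusivity: source clash

    comparisons += 1
    next unless matcher.call(a, b)

    merged = ca + cb
    merged.each { |r| cluster_of[r[:id]] = merged }
  end

  [cluster_of.values.uniq(&:object_id), comparisons]
end

records = [
  { id: 1, source: "tmdb", name: "x" },
  { id: 2, source: "imdb", name: "x" },
  { id: 3, source: "wiki", name: "x" },
  { id: 4, source: "tmdb", name: "y" }
]
clusters, comparisons = cluster(records, ->(a, b) { a[:name] == b[:name] })
```

With these four records there are six possible pairs, but only two are compared: once records 1, 2, and 3 are linked, transitivity skips the remaining pair among them, and exclusivity skips every pair involving record 4 because the cluster already contains a TMDB record. The exclusivity shortcut is only safe when each source is free of internal duplicates.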

  16. Entity Resolution – Methods

  17. Part 3: Data Fusion

  18. Data Fusion
      ● Methods & algorithms: voting that accounts for the trustworthiness of the different data sources
        (a) Naive voting, with source accuracy ranked tmdb > imdb > wiki (actor.gender, actor.birthday, movie.year, etc.)
        (b) Longest string (actor.name, movie.title, etc.)
        (c) Union of arrays of strings (actor.biography, movie.director, etc.)
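The three fusion strategies can be sketched as small functions. The function names are hypothetical, and the voting rule shown (majority vote, with ties broken by the trust order tmdb > imdb > wiki) is an assumption about how the source ranking enters the vote. The slides only state that the ranking is used.

```ruby
# (a) Majority vote over the sources' values. Ties go to the value backed
# by the most trusted source. `values_by_source` maps source name => value.
def vote(values_by_source, trust_order: %w[tmdb imdb wiki])
  candidates = values_by_source.reject { |_source, value| value.nil? }
  return nil if candidates.empty?

  grouped = candidates.group_by { |_source, value| value }
  best = grouped.max_by do |_value, pairs|
    best_rank = pairs.map { |source, _v| trust_order.index(source) || trust_order.size }.min
    [pairs.size, -best_rank]          # more votes first, then higher trust
  end
  best.first
end

# (b) Keep the longest of the candidate strings.
def longest_string(values)
  values.compact.max_by(&:length)
end

# (c) Merge arrays of strings, dropping duplicates.
def union(arrays)
  arrays.compact.flatten.uniq
end

year   = vote({ "tmdb" => 1960, "imdb" => 1860, "wiki" => 1860 })
gender = vote({ "tmdb" => "male", "imdb" => "female" })
```

In the first call the two less trusted sources outvote TMDB. In the second there is a tie, so the TMDB value wins. Whether a majority should override the most trusted source is a design choice that depends on how accurate each source really is.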

  19. Part 4: Data Portal

  20. Data Portal
      Via the data portal, users can view the data both before and after data integration. The interesting part of the portal is that users can build a movie crew for a specific genre. Finally, users can search for the top 10 most popular movies for a given genre and year.
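The top-10 search could be expressed as below. This sketch works on an in-memory array instead of MongoDB and assumes that popularity is measured by the `rating` attribute, which the slides do not confirm. `top_movies` is a hypothetical name, and the hash keys follow the attribute names from the alignment slide.

```ruby
# Return the highest-rated movies of one genre released in one year.
def top_movies(movies, genre:, year:, limit: 10)
  movies
    .select { |m| m[:year] == year && Array(m[:genre]).include?(genre) }
    .sort_by { |m| -m[:rating].to_f }   # highest rating first
    .first(limit)
end

# Made-up sample data for illustration only.
movies = [
  { title: "A", year: 2001, genre: ["Comedy"], rating: 6.2 },
  { title: "B", year: 2001, genre: ["Comedy", "Horror"], rating: 7.5 },
  { title: "C", year: 2001, genre: ["Drama"], rating: 9.0 },
  { title: "D", year: 2002, genre: ["Comedy"], rating: 8.1 }
]
top = top_movies(movies, genre: "Comedy", year: 2001)
```

Here `top` contains movies B and A, in that order. In the real portal the same filter and sort would presumably run as a database query over the fused records.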

  21. Problems Encountered
      ● If a movie and its sequel are released in the same or the next year with the same director and cast, they are matched although they should not be (e.g. Scary Movie 2).
      ● Some sources contain mistakes in crucial fields (e.g. birthday: 1960-05-01 & 1860-03-01), which inflates the distance too much.
      ● Duplicates within a single source cannot be fully eliminated, so some records that should match are not matched during ER.
      ● Some entries listed as movies are actually TV shows or award ceremonies. We have not found a good way to solve this problem.

  22. References:
      1. ISO 639 Language Code List: https://www.loc.gov/standards/iso639-2/php/code_list.php
      2. Felix Naumann, "Similarity measures" [DPDC_12_Similarity]
      3. Jens Bleiholder and Felix Naumann, "Data Fusion", ACM Computing Surveys, Vol. 41, No. 1, Article 1

  23. Thank You!
