Movie & Actor QI, Xiaoxu CHEN, Guanhao JIN, Yue
OVERVIEW ● Goal: build a movie and actor portal that provides users with movie and actor data integrated from multiple data sources. ● Users can also search for the most popular movies (or actors) in a specified year or of a specific genre. More interestingly, they can assemble a crew for a given genre.
CONTENT ● Specification ● Fetching Data ● Entity Resolution ● Data Fusion ● Data Portal ● Conclusion & Reference
SPECIFICATION ✓ Data sources: (1) http://themoviedb.org/ (TMDB) (2) http://www.imdb.com/ (IMDB) (3) https://www.wikipedia.org/ (WIKI) ✓ Data file formats: JSON & XML ✓ Database: MongoDB ✓ Programming language: Ruby
Fetching data 1
Fetching data ● Crawling strategy: ● TMDB & WIKI: crawl all the data sequentially. ● IMDB: crawl with breadth-first search (BFS). The popular movies on the front page serve as the seed URLs, and a thread-safe Queue stores the URLs still to visit. Multiple threads each extract data from the current URL and push the new URLs found on that page back onto the queue. ● Raw data statistics: ● TMDB: 20,000+ movies & 20,000+ actors ● IMDB: 10,000+ movies & 11,000+ actors ● WIKI: 5,000+ movies & 7,000+ actors ● Raw data was stored in JSON or XML files.
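The IMDB crawling strategy can be sketched roughly as follows. This is a minimal illustration, not the project's crawler: the `crawl` method, the `FAKE_LINKS` table and the URLs in it are hypothetical, and an in-memory link table stands in for real HTTP requests so the example runs offline. With several workers the visiting order is only approximately breadth-first.

```ruby
require "set"

# Hypothetical stand-in for the web: each URL maps to the URLs found on its page.
FAKE_LINKS = {
  "/movie/1" => ["/actor/1", "/movie/2"],
  "/movie/2" => ["/actor/1", "/actor/2"],
  "/actor/1" => ["/movie/1"],
  "/actor/2" => ["/movie/2", "/movie/3"],
  "/movie/3" => []
}.freeze

# Crawl from the seed URLs with a pool of worker threads.
# The block receives a URL and returns the URLs linked from that page.
def crawl(seeds, worker_count: 4, &fetch_links)
  queue   = Queue.new          # thread-safe FIFO of URLs still to visit
  seen    = Set.new(seeds)     # URLs already queued, guarded by lock
  lock    = Mutex.new
  pending = seeds.size         # URLs queued or currently being processed
  visited = []

  seeds.each { |url| queue << url }
  queue.close if pending.zero? # nothing to do: let the workers exit at once

  workers = Array.new(worker_count) do
    Thread.new do
      # pop returns nil once the queue is closed and empty
      while (url = queue.pop)
        links = fetch_links.call(url)   # the slow part runs outside the lock
        lock.synchronize do
          visited << url
          links.each do |link|
            next unless seen.add?(link) # add? returns nil for known URLs
            pending += 1
            queue << link
          end
          pending -= 1
          queue.close if pending.zero?  # no work left anywhere
        end
      end
    end
  end

  workers.each(&:join)
  visited
end

pages = crawl(["/movie/1"]) { |url| FAKE_LINKS.fetch(url, []) }
puts pages.sort.inspect
```

Counting the URLs that are queued or in progress lets the workers stop cleanly: the queue is closed only when that count reaches zero, so no thread pushes onto a closed queue.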
Entity Resolution 2
Entity Resolution - Attribute Alignment

movie
  ATTRIBUTE            DATA TYPE
  title                String
  year                 Integer
  rating               Float
  directors            Array
  casts                Hash
  main_casts           Array
  total_time           Integer
  languages            Array
  alias                Array
  country              Array
  genre                Array
  writers              Array
  filming_locations    Array
  keywords             Array
  match_id             Integer
  db_name              String

actor
  ATTRIBUTE            DATA TYPE
  name                 String
  birthday             Date
  gender               String
  place_of_birth       String
  nationality          String
  known_credits        Integer
  adult_actor          Boolean
  years_active         String
  alias                Array
  biography            String
  known_for            Array
  match_id             Integer
  db_name              String
Entity Resolution – Methods ● Character-based clustering (blocking) (i) Blocking movies: use the lowercased first character of the movie's title to partition the data into sub-datasets. (ii) Blocking actors: use the first letters of both the first name and the last name to form a key. ● Pairwise Matching (i) Decision tree (ii) Pairwise matching scores: Jaro-Winkler; Monge-Elkan; Jaccard coefficient; N-grams. (iii) Transitivity, exclusivity, and functional dependency.
[Figure: blocking example. The key "ab" groups the entries "Array Button", "Barack Ajson" and "Adam W. Black".]
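The two blocking keys can be sketched as below. The method names are hypothetical. Sorting the two initials of an actor's name is an assumption read off the example, where "Barack Ajson" and "Adam W. Black" share the key "ab"; the slides do not state how the letters are ordered.

```ruby
# Movie key: the lowercased first character of the title.
def movie_block_key(title)
  title.strip[0].to_s.downcase
end

# Actor key: first letters of the first and last name.
# Assumption: the two letters are sorted, so name order does not matter.
def actor_block_key(name)
  parts = name.strip.split(/\s+/)
  return "" if parts.empty?
  [parts.first[0], parts.last[0]].map(&:downcase).sort.join
end

# Partition records into blocks; only records in the same block are compared later.
def block_by(records, &key)
  records.group_by(&key)
end

actors = ["Array Button", "Barack Ajson", "Adam W. Black", "Carol Dane"]
blocks = block_by(actors) { |name| actor_block_key(name) }
blocks.each { |key, names| puts "#{key}: #{names.join(', ')}" }
```

Pairwise matching then runs only inside each block, which avoids comparing every record with every other record.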
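As an illustration of one pairwise matching score, the sketch below combines two of the listed measures: the Jaccard coefficient computed over character n-grams. The method names and the choice of bigrams are assumptions; Jaro-Winkler and Monge-Elkan are not shown.

```ruby
require "set"

# Character n-grams of a normalized string (lowercased, whitespace collapsed).
# Strings shorter than n are treated as a single gram.
def ngrams(text, n = 2)
  normalized = text.downcase.strip.gsub(/\s+/, " ")
  grams = normalized.chars.each_cons(n).map(&:join)
  grams.empty? ? Set[normalized] : Set.new(grams)
end

# Jaccard coefficient: |A ∩ B| / |A ∪ B|, between 0.0 and 1.0.
def jaccard(a, b)
  (a & b).size.to_f / (a | b).size
end

def name_similarity(x, y, n = 2)
  jaccard(ngrams(x, n), ngrams(y, n))
end

puts name_similarity("night", "nacht")
```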
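[Figure: decision tree for pairwise matching. It branches on name similarity (name similarity < T1; T1 <= name similarity < T2; T2 <= name similarity) and on birthday (same / different). Its outcomes are Not Match, Match, and Compute Distance between 2 entries.]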
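A decision tree of this shape could look like the sketch below. The figure's exact wiring is not recoverable from the text, so the mapping of branches to outcomes here is an assumption, as are the threshold values `T1` and `T2` and the method name `decide`.

```ruby
require "date"

# Hypothetical thresholds; the slides name T1 and T2 but give no values.
T1 = 0.6
T2 = 0.9

# Returns :no_match, :match, or :compute_distance (compare the remaining attributes).
# Assumed wiring: a low name similarity rejects the pair outright; the birthday
# then decides between accepting, rejecting, and a full distance computation.
def decide(name_similarity, birthday_a, birthday_b)
  return :no_match if name_similarity < T1

  same_birthday = !birthday_a.nil? && birthday_a == birthday_b
  if name_similarity >= T2
    same_birthday ? :match : :compute_distance
  else
    same_birthday ? :compute_distance : :no_match
  end
end

puts decide(0.95, Date.new(1960, 5, 1), Date.new(1960, 5, 1))
```

The point of the tree is that the expensive distance over all attributes is computed only for pairs that cheap checks cannot settle.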
We can skip many pairwise match calculations if we use transitivity and exclusivity.
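One way to exploit both properties is a union-find structure that records which sources each cluster already covers. This is a sketch under assumptions, not the project's code: the class `MatchClusters` and its methods are hypothetical, and exclusivity is taken to mean that a source lists each entity at most once, which the Problems Encountered slide notes does not always hold.

```ruby
# Clusters of matched records, with the set of sources each cluster contains.
class MatchClusters
  # records: array of hashes like { id: "t1", source: "tmdb" }
  def initialize(records)
    @parent  = {}
    @sources = {}
    records.each do |r|
      @parent[r[:id]]  = r[:id]
      @sources[r[:id]] = [r[:source]]
    end
  end

  # Representative of the cluster containing id (with path compression).
  def find(id)
    @parent[id] = find(@parent[id]) unless @parent[id] == id
    @parent[id]
  end

  # Why the pair needs no similarity computation, or nil if it must be compared.
  def skip_reason(a, b)
    ra = find(a)
    rb = find(b)
    return :transitivity if ra == rb                               # already matched
    return :exclusivity unless (@sources[ra] & @sources[rb]).empty? # would merge two records of one source
    nil
  end

  # Record that a and b were found to match.
  def merge(a, b)
    ra = find(a)
    rb = find(b)
    return if ra == rb
    @parent[rb] = ra
    @sources[ra] |= @sources.delete(rb)
  end
end

clusters = MatchClusters.new([
  { id: "t1", source: "tmdb" }, { id: "i1", source: "imdb" },
  { id: "w1", source: "wiki" }, { id: "i2", source: "imdb" }
])
clusters.merge("t1", "i1")
clusters.merge("i1", "w1")
puts clusters.skip_reason("t1", "w1").inspect
puts clusters.skip_reason("w1", "i2").inspect
```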
Data Fusion 3
Data Fusion ● Methods & algorithms: voting that takes the trustworthiness of the different data sources into account (a) Naive voting, with source accuracy tmdb > imdb > wiki (actor.gender, actor.birthday, movie.year, etc.) (b) Longest string (actor.name, movie.title, etc.) (c) Union (arrays of strings) (actor.biography, movie.directors, etc.)
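The three fusion strategies can be sketched as follows. The method names are hypothetical, and the slides do not say how voting and source accuracy interact; this sketch assumes a majority vote whose ties are broken in favor of the more trusted source.

```ruby
# Lower rank means more trusted (tmdb > imdb > wiki).
SOURCE_RANK = { "tmdb" => 0, "imdb" => 1, "wiki" => 2 }.freeze

# (a) Voting: the most frequent value wins; ties go to the most trusted source.
# values_by_source: e.g. { "tmdb" => 1999, "imdb" => 2000 }
def vote(values_by_source)
  present = values_by_source.reject { |_, value| value.nil? }
  return nil if present.empty?

  counts = present.values.each_with_object(Hash.new(0)) { |v, h| h[v] += 1 }
  top = counts.values.max
  winners = counts.select { |_, count| count == top }.keys

  present.sort_by { |source, _| SOURCE_RANK.fetch(source, SOURCE_RANK.size) }
         .find { |_, value| winners.include?(value) }
         .last
end

# (b) Longest string: keep the most complete spelling.
def longest_string(values)
  values.compact.max_by(&:length)
end

# (c) Union: merge arrays of strings, dropping duplicates.
def union(arrays)
  arrays.compact.flatten.uniq
end

puts vote("tmdb" => 1999, "imdb" => 2000, "wiki" => 2000)
puts longest_string(["A. Smith", "Alice Smith"])
puts union([["Drama"], ["Drama", "Comedy"]]).inspect
```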
Data portal 4
Data Portal Via the data portal, users can view the data both before and after data integration. The interesting part of the portal is that users can build a movie crew for a specific genre. Finally, users can search for the top 10 most popular movies for a given genre and year.
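The top-10 search could be expressed as in the sketch below, shown on in-memory hashes rather than MongoDB. The method `top_movies` and the sample records are hypothetical, and using the `rating` attribute as the measure of popularity is an assumption.

```ruby
# Movies of the given genre and year, best rated first.
# Each movie is a hash with :title, :year, :rating and :genre (an array).
def top_movies(movies, genre:, year:, limit: 10)
  movies
    .select { |m| m[:year] == year && Array(m[:genre]).include?(genre) }
    .sort_by { |m| -m[:rating].to_f }
    .first(limit)
end

# Hypothetical sample records.
movies = [
  { title: "Movie A", year: 2001, rating: 7.1, genre: ["Comedy"] },
  { title: "Movie B", year: 2001, rating: 8.3, genre: ["Comedy", "Drama"] },
  { title: "Movie C", year: 2002, rating: 9.0, genre: ["Comedy"] },
  { title: "Movie D", year: 2001, rating: 6.0, genre: ["Horror"] }
]

top_movies(movies, genre: "Comedy", year: 2001).each { |m| puts m[:title] }
```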
Problems Encountered ● If a movie has a sequel released in the same or the following year with the same director and cast, the two are matched although they should not be (e.g. Scary Movie 2). ● Some sources contain mistakes in crucial fields (e.g. birthday: 1960-05-01 vs. 1860-03-01), which inflate the distance too much. ● Duplicates within a single source cannot be fully eliminated, so some records that should match are not matched during entity resolution. ● Some entries are not actually movies but TV shows or award ceremonies. We have not found a good way to solve this problem.
References: 1. ISO 639 Language Code List: https://www.loc.gov/standards/iso639-2/php/code_list.php 2. Felix Naumann, "Similarity measures" [DPDC_12_Similarity] 3. Jens Bleiholder and Felix Naumann, "Data Fusion", ACM Computing Surveys, Vol. 41, No. 1, Article 1
Thank You!