Learning to read URLs Finding the word boundaries in multi-word domain names with Python and sklearn. Calvin Giles
Who am I? Data Scientist at Adthena. PyData Co-Organiser. Physicist. I like to solve problems pragmatically.
The Problem Given a domain name: 'powerwasherchicago.com', 'catholiccommentaryonsacredscripture.com' Find the sentence that was concatenated: 'power washer chicago (.com)', 'catholic commentary on sacred scripture (.com)'
Why is this useful? How similar are 'powerwasherchicago.com' and 'extreme-tyres.co.uk'? How similar are 'power washer chicago (.com)' and 'extreme tyres (.co.uk)'? Domains resolved into words can be compared on a semantic level, not simply as strings.
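To make that concrete, here is a small illustration (not from the talk) of comparing segmented domains as sets of words; the jaccard helper and the domain 'chicagopressurewashing.com' are made up for the example:

    def jaccard(a, b):
        # Jaccard similarity between two token sets (illustration only)
        a, b = set(a), set(b)
        return len(a & b) / len(a | b)

    # As raw strings there is little to compare, but once segmented the shared
    # meaning becomes visible:
    print(jaccard('power washer chicago'.split(), 'chicago pressure washing'.split()))  # 0.2
    print(jaccard('power washer chicago'.split(), 'extreme tyres'.split()))             # 0.0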
Primary use case Given 500 domains in a market, what are the themes?
Scope of project As part of our internal idea incubation, Adthena Labs, this approach was developed during a one-day hack to determine whether it could be useful to the business.
Adthena's Data > 10 million unique domains > 50 million unique search terms 3rd Party Data Project Gutenberg (https://www.gutenberg.org/) Google ngram viewer datasets (http://storage.googleapis.com/books/ngrams/books/datasetsv2.html)
Process 1. Learn some words 2. Find where words occur in a domain name 3. Choose the most likely set of words
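As a toy preview of how the three steps fit together (a made-up three-word dictionary, not the talk's code):

    dictionary = {'power', 'washer', 'chicago'}                    # 1. learn some words
    domain = 'powerwasherchicago'
    occurrences = sorted((i, w) for i in range(len(domain))        # 2. find where words occur
                         for w in dictionary if domain.startswith(w, i))
    print(occurrences)  # [(0, 'power'), (5, 'washer'), (11, 'chicago')] -> 3. choose the best set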
1. Learn some words
Build a dictionary using suitable documents. Documents: search terms.

In [2]: import pandas, os
        search_terms = pandas.read_csv(os.path.join(data_directory, 'search_terms.csv'))
        search_terms = search_terms['SearchTerm'].dropna().str.lower()
        search_terms.iloc[1000000::2000000]

Out[2]: 1000000                new 2014 mercedes benz b200 cdi
        3000000                   weight watchers in glynneath
        5000000                property for rent in batlow nsw
        7000000                         us plug adaptor for uk
        9000000    which features mobile is best for purchase
        Name: SearchTerm, dtype: object

In [125]: from sklearn.feature_extraction.text import CountVectorizer

          def build_dictionary(corpus, min_df=0):
              vec = CountVectorizer(min_df=min_df, token_pattern=r'(?u)\b\w{2,}\b')  # require 2+ characters
              vec.fit(corpus)
              return set(vec.get_feature_names())
In [126]: st_dictionary = build_dictionary(corpus=search_terms, min_df=0.00001)
          dictionary_size = len(st_dictionary)
          print('{} words found'.format(num_fmt(dictionary_size)))
          sorted(st_dictionary)[dictionary_size//20::dictionary_size//10]

          21.4k words found

Out[126]: ['430',
           'benson',
           'colo',
           'es1',
           'hd7',
           'leed',
           'nikon',
           'razors',
           'springs',
           'vinyl']
We have 21 thousand words in our base dictionary. We can augment this with some books from Project Gutenberg:

In [127]: dictionary = st_dictionary
          for fname in os.listdir(os.path.join(data_directory, 'project_gutenberg')):
              if not fname.endswith('.txt'):
                  continue
              with open(os.path.join(data_directory, 'project_gutenberg', fname)) as f:
                  book = pandas.Series(f.readlines())
              book = book.str.strip()
              book = book[book != '']
              book_dictionary = build_dictionary(corpus=book, min_df=2)  # keep words that appear on at least 2 lines
              dictionary_size = len(book_dictionary)
              print('{} words found in {}'.format(num_fmt(dictionary_size), fname))
              dictionary |= book_dictionary
          print('{} words in dictionary'.format(num_fmt(len(dictionary))))

          2.11k words found in a_christmas_carol.txt
          1.65k words found in alice_in_wonderland.txt
          3.71k words found in huckleberry_finn.txt
          4.09k words found in pride_and_predudice.txt
          4.52k words found in sherlock_holmes.txt
          26.4k words in dictionary
Actually, scrap that... ...and use the Google ngram viewer datasets:
In [212]: dictionary = set()
          ngram_files = [fn for fn in os.listdir(ngram_data_directory)
                         if 'googlebooks' in fn and fn.endswith('_processed.csv')]
          for fname in ngram_files:
              ngrams = pandas.read_csv(os.path.join(ngram_data_directory, fname))
              ngrams = ngrams[(ngrams.match_count > 10*1000*1000) & (ngrams.ngram.str.len() == 2)
                              | (ngrams.match_count > 1000) & (ngrams.ngram.str.len() > 2)]
              ngrams = ngrams.ngram
              ngrams = ngrams.str.lower()
              ngrams = ngrams[ngrams != '']
              ngrams_dictionary = set(ngrams)
              dictionary_size = len(ngrams_dictionary)
              print('{} valid words found in "{}"'.format(num_fmt(dictionary_size), fname))
              dictionary |= ngrams_dictionary
          print('{} words in dictionary'.format(num_fmt(len(dictionary))))

          2.93k valid words found in "googlebooks-eng-all-1gram-20120701-0_processed.csv"
          12.7k valid words found in "googlebooks-eng-all-1gram-20120701-1_processed.csv"
          5.58k valid words found in "googlebooks-eng-all-1gram-20120701-2_processed.csv"
          4.09k valid words found in "googlebooks-eng-all-1gram-20120701-3_processed.csv"
          3.28k valid words found in "googlebooks-eng-all-1gram-20120701-4_processed.csv"
          2.72k valid words found in "googlebooks-eng-all-1gram-20120701-5_processed.csv"
          2.52k valid words found in "googlebooks-eng-all-1gram-20120701-6_processed.csv"
          2.18k valid words found in "googlebooks-eng-all-1gram-20120701-7_processed.csv"
          2.08k valid words found in "googlebooks-eng-all-1gram-20120701-8_processed.csv"
          2.5k valid words found in "googlebooks-eng-all-1gram-20120701-9_processed.csv"
          61.6k valid words found in "googlebooks-eng-all-1gram-20120701-a_processed.csv"
          55.2k valid words found in "googlebooks-eng-all-1gram-20120701-b_processed.csv"
          72k valid words found in "googlebooks-eng-all-1gram-20120701-c_processed.csv"
          46.1k valid words found in "googlebooks-eng-all-1gram-20120701-d_processed.csv"
          36.2k valid words found in "googlebooks-eng-all-1gram-20120701-e_processed.csv"
          32.4k valid words found in "googlebooks-eng-all-1gram-20120701-f_processed.csv"
          36k valid words found in "googlebooks-eng-all-1gram-20120701-g_processed.csv"
          37.9k valid words found in "googlebooks-eng-all-1gram-20120701-h_processed.csv"
          30.3k valid words found in "googlebooks-eng-all-1gram-20120701-i_processed.csv"
          12.3k valid words found in "googlebooks-eng-all-1gram-20120701-j_processed.csv"
          31.4k valid words found in "googlebooks-eng-all-1gram-20120701-k_processed.csv"
          36.7k valid words found in "googlebooks-eng-all-1gram-20120701-l_processed.csv"
          63.6k valid words found in "googlebooks-eng-all-1gram-20120701-m_processed.csv"
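The "_processed.csv" files are loaded ready-made above; they are not built in this notebook. A rough sketch of how one might be produced, assuming the raw Google Books 1-gram files' tab-separated layout (ngram, year, match_count, volume_count) and summing match_count across years:

    import pandas

    def preprocess_ngram_file(raw_path, out_path):
        # Raw 1-gram files: one row per (ngram, year), tab-separated, no header.
        cols = ['ngram', 'year', 'match_count', 'volume_count']
        raw = pandas.read_csv(raw_path, sep='\t', names=cols, quoting=3)  # quoting=3: treat quotes literally
        raw = raw[~raw.ngram.str.contains('_', na=False)]  # drop POS-tagged entries such as 'power_NOUN'
        totals = raw.groupby('ngram', as_index=False)['match_count'].sum()  # total count across all years
        totals.to_csv(out_path, index=False)

    # e.g. preprocess_ngram_file('googlebooks-eng-all-1gram-20120701-a',
    #                            'googlebooks-eng-all-1gram-20120701-a_processed.csv')

The larger files would want chunked reading in practice, but the output columns match what the loading loop above expects.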
That takes us to ~1M words! We even get some good two-letter words to work with:

In [130]: print('{} 2-letter words'.format(len({w for w in dictionary if len(w) == 2})))
          print(sorted({w for w in dictionary if len(w) == 2}))

          142 2-letter words
          ['00', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24',
           '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40',
           '41', '42', '43', '44', '45', '46', '47', '48', '49', '50', '51', '52', '53', '54', '55', '56',
           '57', '58', '59', '60', '61', '62', '63', '64', '65', '66', '67', '68', '69', '70', '71', '72',
           '73', '74', '75', '76', '77', '78', '79', '80', '81', '82', '83', '84', '85', '86', '87', '88',
           '89', '90', '91', '92', '93', '94', '95', '96', '97', '98', '99', 'ad', 'al', 'am', 'an', 'as',
           'at', 'be', 'by', 'cm', 'co', 'de', 'di', 'do', 'du', 'ed', 'el', 'en', 'et', 'ex', 'go', 'he',
           'if', 'ii', 'in', 'is', 'it', 'iv', 'la', 'le', 'me', 'mg', 'mm', 'mr', 'my', 'no', 'of', 'oh',
           'on', 'op', 'or', 're', 'se', 'so', 'st', 'to', 'un', 'up', 'us', 'vi', 'we', 'ye']
In [144]: from numpy.random import choice
          choice(list(dictionary), size=40)

Out[144]: array(['fades', 'archaeocyatha', 'subss', 'bikanir', 'fitn', 'cockley',
                 'chinard', 'curtus', 'quantitiative', 'obfervation', 'poplin',
                 'xciv', 'hanrieder', 'macaura', 'nakum', 'teuira', 'humphrey',
                 'improvisationally', 'enforeed', 'caillie', 'plachter', 'feirer',
                 'atomico', 'jven', 'ujvari', 'rekonstruieren', 'viverra',
                 'genéticos', 'layn', 'dryl', 'thonis', 'legítimos', 'latts',
                 'radames', 'bwlch', 'lanzamiento', 'quea', 'dumnoniorum', 'matu',
                 'conoció'], dtype='<U81')
2. Find where words occur in a domain name
Find all substrings of a domain that are in our dictionary, along with their start and end indices.
In [149]: def find_words_in_string(string, dictionary, longest_word=None):
              if longest_word is None:
                  longest_word = max(len(word) for word in dictionary)
              substring_indices = ((start, start + length)
                                   for start in range(len(string))
                                   for length in range(1, longest_word + 1))
              for start, end in substring_indices:
                  substring = string[start:end]
                  if substring in dictionary:
                      # use len(substring) in case we sliced beyond the end
                      yield substring, start, start + len(substring)
In [234]: domain = 'powerwasherchicago'
          words = sorted({w for w, *_ in find_words_in_string(domain, dictionary)})
          print(len(words))
          print(words)

          39
          ['ago', 'as', 'ash', 'ashe', 'asher', 'cag', 'cago', 'chi', 'chic', 'chica', 'chicag',
           'chicago', 'erc', 'erch', 'erw', 'go', 'he', 'her', 'herc', 'hic', 'hicago', 'ica',
           'icago', 'owe', 'ower', 'pow', 'powe', 'power', 'rch', 'rwa', 'rwas', 'she', 'sher',
           'was', 'wash', 'washe', 'washer', 'we', 'wer']
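These 39 candidates are the input to step 3: choose the most likely set of words. The talk's actual ranking isn't reproduced here, but one simple way to choose among overlapping candidates is a dynamic programme that covers the whole domain with as few words as possible (so longer words are preferred), penalising any characters that no dictionary word explains. A minimal sketch under that assumption:

    def segment(domain, dictionary, longest_word=None):
        # best[i] = (cost, words) for the cheapest segmentation of domain[:i];
        # a dictionary word costs 1, an unexplained span costs 10 per character
        if longest_word is None:
            longest_word = max(len(word) for word in dictionary)
        best = [(0, [])] + [(float('inf'), [])] * len(domain)
        for end in range(1, len(domain) + 1):
            for start in range(max(0, end - longest_word), end):
                piece = domain[start:end]
                cost = 1 if piece in dictionary else 10 * len(piece)
                if best[start][0] + cost < best[end][0]:
                    best[end] = (best[start][0] + cost, best[start][1] + [piece])
        return best[-1][1]

    print(segment('powerwasherchicago', dictionary))  # expect something like ['power', 'washer', 'chicago']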