Analyzing Time Structured Corpora


Analyzing Time Structured Corpora. Corpus Statistics Research Group launch event, Birmingham, 11th Feb 2016. Tony Hennessey (University of Nottingham), joint work with R. Carrington, Y. van Gennip, M. Mahlberg, S. Preston, K. Severn, V. Wiegand.


  1. Analyzing Time Structured Corpora. Corpus Statistics Research Group launch event, Birmingham, 11th Feb 2016. Tony Hennessey (University of Nottingham), joint work with R. Carrington, Y. van Gennip, M. Mahlberg, S. Preston, K. Severn, V. Wiegand. Tony Hennessey (UoN) 1 / 18

  2. Overview: how to look at the time dependency in the properties of a corpus.
     - Recap terminology and describe the main example used throughout the presentation.
     - Binning data, and how to think about binning mathematically.
     - Using kernels, which are better than bins.

  3. Setting the scene (and a bit of a recap). X: some matrix representation of the corpus.

     $$
     X = \begin{pmatrix}
     0 & 2 & 2 & 1 & 0 & \cdots \\
     0 & 0 & 2 & 1 & 1 & \cdots \\
     1 & 0 & 0 & 1 & 1 & \cdots \\
     1 & 1 & 0 & 0 & 1 & \cdots \\
     1 & 1 & 1 & 0 & 0 & \cdots \\
     \vdots & \vdots & \vdots & \vdots & \vdots & \ddots
     \end{pmatrix}
     $$

  4. Setting the scene (and a bit of a recap). X: some matrix representation of the corpus, here a document-term matrix (rows are documents, columns are terms).

     |        | bandicoot | aardvark | abacus | badger | bonsai | ... |
     |--------|-----------|----------|--------|--------|--------|-----|
     | doc 01 | 0         | 2        | 2      | 1      | 0      | ... |
     | doc 02 | 0         | 0        | 2      | 1      | 1      | ... |
     | doc 03 | 1         | 0        | 0      | 1      | 1      | ... |
     | doc 04 | 1         | 1        | 0      | 0      | 1      | ... |
     | doc 05 | 1         | 1        | 1      | 0      | 0      | ... |
     | ...    | ...       | ...      | ...    | ...    | ...    | ... |
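As a rough illustration of what a document-term matrix holds, the sketch below counts terms per document. The function name `document_term_matrix`, the whitespace tokenisation, and the three toy documents are all assumptions made for this example; they are not taken from the corpus or the tooling used in the talk.

```python
from collections import Counter


def document_term_matrix(docs):
    """Return (vocabulary, matrix) where matrix[i][j] counts term j in document i."""
    # Assumption: lower-casing and whitespace splitting stand in for real tokenisation.
    counts = [Counter(doc.lower().split()) for doc in docs]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix


# Hypothetical toy documents, invented for illustration only.
docs = [
    "abacus abacus badger",
    "badger bonsai",
    "aardvark bonsai badger",
]
vocabulary, X = document_term_matrix(docs)
```

Each row of `X` corresponds to one document and each column to one entry of `vocabulary`, which is the layout shown on the slide.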

  5. Setting the scene (and a bit of a recap). f(X): some function that we apply to the corpus.

  6. Setting the scene (and a bit of a recap). f(X): some function that we apply to the corpus. Here, f is the cosine of the angle between words in a vector space derived using a matrix factorization (the singular value decomposition, $X = USV^T$). This measure quantifies the degree of association between words, i.e. a bigger value implies closer association.
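One possible reading of this measure is sketched below with NumPy. The slide does not say how word vectors are read off the factorization or how many dimensions are kept, so two choices here are assumptions: each term is represented by its row of V scaled by the singular values, and the rank `k` is a free parameter. The function name `word_cosine` is hypothetical, and the matrix is the toy example from the earlier slide.

```python
import numpy as np


def word_cosine(X, i, j, k):
    """Cosine between terms i and j in a rank-k space taken from X = U S V^T."""
    U, s, Vt = np.linalg.svd(np.asarray(X, dtype=float), full_matrices=False)
    # Assumption: term vectors are the rows of V, scaled by the singular values.
    word_vectors = Vt[:k].T * s[:k]
    a, b = word_vectors[i], word_vectors[j]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy document-term matrix from the slide (rows: documents, columns: terms).
X = [
    [0, 2, 2, 1, 0],
    [0, 0, 2, 1, 1],
    [1, 0, 0, 1, 1],
    [1, 1, 0, 0, 1],
    [1, 1, 1, 0, 0],
]
association = word_cosine(X, 0, 1, k=2)  # k=2 is an arbitrary illustrative choice
```

With every singular value kept, this scaling makes the result equal to the plain cosine between the two term columns of X; truncating to a smaller `k` is what changes the measure.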

  7. Setting the scene (and a bit of a recap). X (document-term matrix): 11,543,110 documents and 472,331 terms.

  8. Setting the scene (and a bit of a recap). X (document-term matrix): 11,543,110 documents and 472,331 terms. Meta-data for each document includes a date.

  9. How does the corpus change with time? Let us try binning the data using dates.

  10. Binning by date. [Figure: the matrix X.]

  11. Binning by date. [Figure: the rows of X bracketed into groups by date: 1st Jan 1785, 2nd Jan 1785, 3rd Jan 1785.]

  12. Binning by date. [Figure: the rows of X for 1st Jan 1785, 2nd Jan 1785 and 3rd Jan 1785, each group enclosed in braces.]

  13. Binning by date. X(t): [Figure: one sub-matrix per date, X(t = 1st Jan 1785), X(t = 2nd Jan 1785) and X(t = 3rd Jan 1785).]

  14. Binning by date. X(t) and f(X(t)): [Figure: the sub-matrices X(t = 1st Jan 1785), X(t = 2nd Jan 1785) and X(t = 3rd Jan 1785), alongside a plot of f(X(t)) against t with one point for each of the three dates.]

  15. Binning by date. Identity matrix: $X = IX$, where $I = \operatorname{diag}(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, \dots)$ and

      $$
      X = \begin{pmatrix}
      0 & 2 & 2 & \cdots \\
      0 & 0 & 2 & \cdots \\
      1 & 0 & 0 & \cdots \\
      1 & 1 & 0 & \cdots \\
      1 & 1 & 1 & \cdots \\
      4 & 2 & 0 & \cdots \\
      2 & 1 & 1 & \cdots \\
      2 & 0 & 2 & \cdots \\
      1 & 2 & 0 & \cdots \\
      0 & 4 & 0 & \cdots \\
      \vdots & \vdots & \vdots & \ddots
      \end{pmatrix}
      $$

  16. Binning by date. Filter by date: $X(t) = b(t)X$, where $t$ = '1st Jan 1785' and $b(t) = \operatorname{diag}(1, 1, 1, 0, 0, 0, 0, 0, 0, 0, \dots)$.

      $$
      \begin{pmatrix}
      0 & 2 & 2 & \cdots \\
      0 & 0 & 2 & \cdots \\
      1 & 0 & 0 & \cdots \\
      0 & 0 & 0 & \cdots \\
      0 & 0 & 0 & \cdots \\
      0 & 0 & 0 & \cdots \\
      0 & 0 & 0 & \cdots \\
      0 & 0 & 0 & \cdots \\
      0 & 0 & 0 & \cdots \\
      0 & 0 & 0 & \cdots \\
      \vdots & \vdots & \vdots & \ddots
      \end{pmatrix}
      = b(t)
      \begin{pmatrix}
      0 & 2 & 2 & \cdots \\
      0 & 0 & 2 & \cdots \\
      1 & 0 & 0 & \cdots \\
      1 & 1 & 0 & \cdots \\
      1 & 1 & 1 & \cdots \\
      4 & 2 & 0 & \cdots \\
      2 & 1 & 1 & \cdots \\
      2 & 0 & 2 & \cdots \\
      1 & 2 & 0 & \cdots \\
      0 & 4 & 0 & \cdots \\
      \vdots & \vdots & \vdots & \ddots
      \end{pmatrix}
      $$

  17. Binning by date. Filter by date: $X(t) = b(t)X$, where $t$ = '2nd Jan 1785' and $b(t) = \operatorname{diag}(0, 0, 0, 1, 1, 1, 0, 0, 0, 0, \dots)$.

      $$
      \begin{pmatrix}
      0 & 0 & 0 & \cdots \\
      0 & 0 & 0 & \cdots \\
      0 & 0 & 0 & \cdots \\
      1 & 1 & 0 & \cdots \\
      1 & 1 & 1 & \cdots \\
      4 & 2 & 0 & \cdots \\
      0 & 0 & 0 & \cdots \\
      0 & 0 & 0 & \cdots \\
      0 & 0 & 0 & \cdots \\
      0 & 0 & 0 & \cdots \\
      \vdots & \vdots & \vdots & \ddots
      \end{pmatrix}
      = b(t)
      \begin{pmatrix}
      0 & 2 & 2 & \cdots \\
      0 & 0 & 2 & \cdots \\
      1 & 0 & 0 & \cdots \\
      1 & 1 & 0 & \cdots \\
      1 & 1 & 1 & \cdots \\
      4 & 2 & 0 & \cdots \\
      2 & 1 & 1 & \cdots \\
      2 & 0 & 2 & \cdots \\
      1 & 2 & 0 & \cdots \\
      0 & 4 & 0 & \cdots \\
      \vdots & \vdots & \vdots & \ddots
      \end{pmatrix}
      $$
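The filtering step X(t) = b(t) X can be sketched with NumPy as follows. The helper name `date_filter` and the ISO date strings are assumptions made for this example. The matrix rows are the ten toy rows from the slides; the slides assign the first three rows to 1st Jan 1785 and the next three to 2nd Jan 1785, while the dates given to the last four rows are assumed.

```python
import numpy as np


def date_filter(dates, t):
    """Diagonal 0/1 matrix b(t): entry (i, i) is 1 when document i is dated t."""
    return np.diag([1 if d == t else 0 for d in dates])


# Toy rows from the slides (first three term columns only).
X = np.array([
    [0, 2, 2], [0, 0, 2], [1, 0, 0],
    [1, 1, 0], [1, 1, 1], [4, 2, 0],
    [2, 1, 1], [2, 0, 2], [1, 2, 0],
    [0, 4, 0],
])
# Assumption: the dates of the last four documents are invented for illustration.
dates = ["1785-01-01"] * 3 + ["1785-01-02"] * 3 + ["1785-01-03"] * 3 + ["1785-01-04"]

# Rows dated 2nd Jan 1785 are kept; every other row becomes zero.
X_t = date_filter(dates, "1785-01-02") @ X
```

Summing b(t) over all distinct dates gives back the identity matrix, which is the sense in which X = I X is the starting point. For a corpus with millions of documents a dense diagonal matrix would be impractical, so a boolean row mask or a sparse matrix would be a more realistic choice.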

  18. Binning by date. How wide should the bins be?
      - It depends on your research question, e.g. over what time scale are you interested in examining change?
      - It depends on your data, e.g. how sparsely distributed are the traits you are looking at likely to be?

  19. Binning by date. An example of binning using the TDA, just showing f(X(t)) for 'smoking' and 'cancer'. [Figure: f(X(t)) plotted against year; vertical axis from 0.0 to 0.5, horizontal axis from 1900 to 2000.]

  20. Binning by date. An example of binning using the TDA, just showing f(X(t)) for 'smoking' and 'cancer'. [Figure: f(X(t)) plotted against year; vertical axis from 0.0 to 0.5, horizontal axis from 1900 to 2000.] We used 5 year bins because:
      - articles about smoking are quite sparsely distributed;
      - we are mainly interested in long-term trends.
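Grouping documents into fixed-width bins such as the 5 year bins above could look like the sketch below. The function name `bin_documents`, its parameters, and the list of years are hypothetical; the talk does not show how the binning was implemented.

```python
from collections import defaultdict


def bin_documents(years, width=5, start=1900):
    """Map each bin's first year to the indices of the documents falling in it."""
    bins = defaultdict(list)
    for index, year in enumerate(years):
        # Integer division assigns every year to exactly one bin.
        bin_start = start + ((year - start) // width) * width
        bins[bin_start].append(index)
    return dict(sorted(bins.items()))


# Hypothetical publication years for six documents.
years = [1900, 1903, 1904, 1905, 1911, 1914]
bins = bin_documents(years)
# bins == {1900: [0, 1, 2], 1905: [3], 1910: [4, 5]}
```

The rows of X listed for each bin form X(t) for that bin, and f would then be evaluated once per bin to give one point on the plot.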

  21. Binning by date. [Figure: a time axis t running from 1900 to 1916, with points marked at 1900, 1905 and 1910.]

  22. Binning by date. Sliding the bins. [Figure: the same time axis t from 1900 to 1916, with additional points appearing as the bin slides along the axis.]

  23. Binning by date. Sliding the bins. [Figure: the same time axis t from 1900 to 1916, now with a point for each year from 1900 to 1912.]
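Sliding the bins amounts to letting consecutive windows overlap. A minimal sketch follows, assuming a 5 year window moved one year at a time over the 1900 to 1916 axis in the figure; the function name `sliding_bins`, its parameters, and the list of years are hypothetical.

```python
def sliding_bins(years, width=5, step=1, first=1900, last=1916):
    """Map each window start s to the indices of documents with s <= year < s + width."""
    windows = {}
    # The last window is the one that ends exactly at `last`.
    for s in range(first, last - width + 2, step):
        windows[s] = [i for i, y in enumerate(years) if s <= y < s + width]
    return windows


# Hypothetical publication years for six documents.
years = [1900, 1903, 1904, 1905, 1911, 1914]
windows = sliding_bins(years)
```

Unlike fixed bins, a document now contributes to every window that covers its year, so neighbouring values of f(X(t)) share data and the resulting curve changes more gradually.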
