Towards Modelling Language Innovation Acceptance in Online Social Networks Date : 2016/05/02 Author : Daniel Kershaw, Matthew Rowe and Patrick Stacey Source : ACM WSDM’16 Advisor : Jia-ling Koh Speaker : Yi-hui Lee 1
Outline • Introduction • Approach • Experiment • Conclusion 2
Introduction • Goal : In this work we demonstrate how innovations in language can be identified across two different Online Social Networks (OSNs) through the operationalisation of known language acceptance models that incorporate relatively simple statistical tests. • Examples : your -> ur ; babe / Before Anyone Else -> bae (Pharrell, “Come Get It Bae”, 2014) 3
Introduction(cont.) • Reddit : https://www.reddit.com • Twitter : https://twitter.com 4
Introduction(cont.) • Framework : Input -> Pre-Processing -> Data Grouping -> Operationalisation ( 1. Frequency, 2. Form, 3. Meaning, 4. Classification ) -> Output 5
Outline • Introduction • Approach • Experiment • Conclusion 6
Approach • Pre-Processing : - TwitterNLP’s POS tagger : http://www.cs.cmu.edu/~ark/TweetNLP/ - removed using regex : hashtags (#), mentions (@) and HTTP links - long repetitions of the same letter were truncated down to just three characters, e.g. soooooooo would be normalised to soo. 7
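The regex steps on this slide can be sketched as follows. This is a minimal illustration, not the authors' code: the patterns, the function name `preprocess` and the `max_run` parameter are assumptions, and the POS-tagging step is left out. The slide says runs are cut to three characters but shows "soo" as the result, so the cap is left configurable.

```python
import re

URL_RE = re.compile(r"https?://\S+")   # HTTP(S) links
TAG_RE = re.compile(r"[#@]\w+")        # hashtags and mentions


def preprocess(text, max_run=3):
    """Strip links, hashtags and mentions, then shorten letter runs.

    max_run is a hypothetical parameter: the longest run of one
    repeated letter that is kept.
    """
    # Links go first so that '#' fragments inside URLs vanish with them.
    text = URL_RE.sub(" ", text)
    text = TAG_RE.sub(" ", text)
    # A letter followed by at least max_run copies of itself
    # (digits and underscores are deliberately not matched).
    run_re = re.compile(r"([^\W\d_])\1{%d,}" % max_run)
    text = run_re.sub(lambda m: m.group(1) * max_run, text)
    # Collapse the whitespace left behind by the removals.
    return " ".join(text.split())
```

With `max_run=2` the same call reproduces the slide's "soo" example; the default keeps three letters.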
Approach(cont.) • Data Grouping : Time : to group the data by time, a function weekofyear(e) returns the week in which the Tweet or Reddit post e was created.
Before grouping :
Word : I, am, a, girl, I, watch, the, movie
Time (weeks) : 1, 1, 1, 1, 2, 2, 2, 2
After grouping :
Time (weeks) 1 : I, am, a, girl
Time (weeks) 2 : I, watch, the, movie 8
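A small sketch of the time grouping, assuming each post is a (timestamp, text) pair and that weeks follow ISO numbering; the slide does not say which week convention is used, and `group_by_week` is a hypothetical helper name. Week numbers repeat every year, so multi-year data would also need the year in the key.

```python
from collections import defaultdict
from datetime import datetime


def weekofyear(created_at):
    """Return the ISO week number of a post's creation time."""
    return created_at.isocalendar()[1]


def group_by_week(posts):
    """Collect the whitespace-split tokens of all posts per week."""
    groups = defaultdict(list)
    for created_at, text in posts:
        groups[weekofyear(created_at)].extend(text.split())
    return dict(groups)


# Example mirroring the slide: two posts, one week apart.
posts = [
    (datetime(2016, 1, 5), "I am a girl"),
    (datetime(2016, 1, 12), "I watch the movie"),
]
by_week = group_by_week(posts)
```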
Approach(cont.) • Data Grouping : Community : 1. Reddit : Louvain community detection algorithm - the dataset is broken down into three community levels : local (the subreddit), regional (a collection of subreddits) and global (all subreddits). 2. Twitter : the Tweets are geographically bound to the UK, which means they can be clustered using the longitude and latitude associated with each Tweet. - Twitter API (coordinates) : https://dev.twitter.com/overview/terms/geo-developer-guidelines 9
Approach(cont.) • Data Grouping : Community : - a low-level community defined by a postcode such as LA1 can be compared to a subreddit (the lowest community in Reddit), potentially showing greater convergence in topic and language used - a higher-level community can be classed as showing the ‘general’ patterns that are globally understood across all sub-communities.
Before grouping :
Word : I, am, a, boy, I, watch, the, show
Community (Twitter/Reddit) : Twitter, Twitter, Twitter, Twitter, Reddit, Reddit, Reddit, Reddit
After grouping :
Community Twitter : I, am, a, boy
Community Reddit : I, watch, the, show 10
Approach(cont.) • Operationalisation : Frequency :
Grouped input :
Time (weeks) 1 : I, am, a, girl, I, am, a, boy
Time (weeks) 2 : I, watch, the, movie, I, watch, the, show
……
Time (weeks) n : When, bae, eat…
T(w, t) :
Week 1 : I 2/8, am 2/8, a 2/8, girl 1/8, boy 1/8
Week 2 : I 2/8, watch 2/8, the 2/8, movie 1/8, show 1/8
……
Week n : When ……, bae ……, eat… …… 11
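The fractions in the frequency table are consistent with T(w, t) being the count of word w in week t divided by the total number of tokens in that week. The sketch below assumes that reading; `relative_frequency` is a hypothetical name, and exact fractions are used so the values can be compared with the slide directly.

```python
from collections import Counter
from fractions import Fraction


def relative_frequency(tokens_by_week):
    """Map (word, week) to count(word in week) / tokens in week."""
    table = {}
    for week, tokens in tokens_by_week.items():
        total = len(tokens)
        for word, count in Counter(tokens).items():
            table[(word, week)] = Fraction(count, total)
    return table


# The two example weeks from the slide.
weeks = {
    1: "I am a girl I am a boy".split(),
    2: "I watch the movie I watch the show".split(),
}
T = relative_frequency(weeks)
```

`Fraction` reduces 2/8 to 1/4, so `T[("I", 1)]` is the same value the slide writes as 2/8.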
Approach(cont.) • Operationalisation : Form :
Grouped input :
Time (weeks) 1 : When, bae, I, am, watching, homosexual, joking
Time (weeks) 2 : I, am, listening, I, am, homosexual, homogeneous
……
Time (weeks) n : they, are, eating, homogeneous …
Prefix homo, MP(w, t, P) : week 1 = 1/7, week 2 = 2/7, ……
Suffix ing, MS(w, t, S) : week 1 = 2/7, week 2 = 1/7, …… 12
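The MP and MS values on this slide are consistent with the share of a week's tokens that start with a given prefix or end with a given suffix. The sketch below assumes that reading; the function names are hypothetical, matching is case-insensitive by choice, and the two token lists are a reconstruction of the slide's example weeks.

```python
from fractions import Fraction


def prefix_share(tokens, prefix):
    """Share of tokens that start with the given prefix."""
    if not tokens:
        return Fraction(0)
    hits = sum(1 for tok in tokens if tok.lower().startswith(prefix))
    return Fraction(hits, len(tokens))


def suffix_share(tokens, suffix):
    """Share of tokens that end with the given suffix."""
    if not tokens:
        return Fraction(0)
    hits = sum(1 for tok in tokens if tok.lower().endswith(suffix))
    return Fraction(hits, len(tokens))


week1 = ["When", "bae", "I", "am", "watching", "homosexual", "joking"]
week2 = ["I", "am", "listening", "I", "am", "homosexual", "homogeneous"]
```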
Approach(cont.) • Operationalisation : Meaning : - Word2vec : http://city.shaform.com/blog/2014/11/04/word2vec.html - W2V(t, c) : word2vec applied to each community (c) at each time (t) 13
Approach(cont.) • Operationalisation : Meaning : 14
Approach(cont.) • Operationalisation : Meaning : the value reflects similarity between communities while still showing variation. If the value is near 0, the word may be too diverse for general usage (i.e. too colloquial), while a value near 1 potentially indicates that the word is too specific. 15
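The formula behind this 0-to-1 value is not preserved in the slides. Purely to illustrate the kind of score being described, the sketch below uses an assumed stand-in: the mean pairwise Jaccard overlap between the nearest-neighbour word lists that each community's embedding returns for the same word. The measure, the function names and the neighbour lists are all hypothetical and not taken from the source.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard overlap of two collections, 0.0 when both are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cross_community_agreement(neighbours_by_community):
    """Mean pairwise Jaccard overlap of per-community neighbour lists.

    Near 0: communities use the word in unrelated contexts.
    Near 1: every community uses it in the same context.
    """
    pairs = list(combinations(neighbours_by_community.values(), 2))
    if not pairs:
        return 0.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Made-up neighbour lists for one word in two communities.
example = {
    "community_a": ["babe", "love", "girl"],
    "community_b": ["babe", "love", "boy"],
}
score = cross_community_agreement(example)
```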
Approach(cont.) • Operationalisation : Classification : Increase / Decrease : Spearman’s Rank
bae (Increase) : t = 1, 2, …, n ; Tw = 9, 18, …, 9*n
TGIF (Decrease) : t = 1, 2, …, n ; Tw = 1000, 900, …, 5 16
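One way to read the classification step is to take Spearman's rank correlation between the week index and a word's weekly count, and label the word by the sign of the coefficient. The sketch below assumes that reading. The `threshold` parameter and the "flat" label are assumptions, no significance test is included, and the middle values of the example series are made up to fill the slide's ellipses.

```python
def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    n = len(xs)
    if n < 2:
        return 0.0
    rx, ry = ranks(xs), ranks(ys)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def classify_trend(series, threshold=0.0):
    """Label a weekly count series by the sign of its rank correlation."""
    rho = spearman(list(range(1, len(series) + 1)), series)
    if rho > threshold:
        return "increase"
    if rho < -threshold:
        return "decrease"
    return "flat"


bae = [9, 18, 27, 36]        # 9*t, as on the slide
tgif = [1000, 900, 400, 5]   # middle value is illustrative
```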
Approach(cont.) • Operationalisation : Limitations : the three methods proposed do not, however, cover all the categories proposed by the VFRGT and FUDGE frameworks. 17
Approach(cont.) • Framework : Input -> Pre-Processing -> Data Grouping -> Operationalisation ( 1. Frequency, 2. Form, 3. Meaning, 4. Classification ) -> Output 18
Outline • Introduction • Approach • Experiment • Conclusion 19
Experiment • Frequency : 20
Experiment(cont.) • Form : 21
Experiment(cont.) • Meaning : words classified as innovations did not appear across all the communities, and when they did appear it was at a low rank; thus the embedding learned by the word2vec function generated sparse words within the context of the innovation. 22
Outline • Introduction • Approach • Experiment • Conclusion 23
Conclusion • we demonstrated that, through the use of relatively simple statistical tests, one is able to use known linguistic models to assess language and its change in online social networks • when the methods are applied to two online social networks, they can show variation in innovation usage and persistence • these methods can be applied to the individual communities that make up the networks, where we have shown that varying community structure has potentially different language dynamics. 24
Conclusion(cont.) • Future work : look into identifying the dynamics of language innovations within the context of users, along with the influence communities have over language and innovation diffusion. 25