BigARTM: Open Source Library for Regularized Multimodal Topic Modeling of Large Collections
Konstantin Vorontsov, Oleksandr Frei, Murat Apishev, Peter Romov, Marina Dudarenko
Yandex • CC RAS • MIPT • HSE • MSU
Analysis of Images, Social Networks and Texts
Ekaterinburg • 9–11 April 2015
Contents
1 Theory
    Probabilistic Topic Modeling
    ARTM: Additive Regularization for Topic Modeling
    Multimodal Probabilistic Topic Modeling
2 BigARTM implementation (http://bigartm.org)
    BigARTM: parallel architecture
    BigARTM: time and memory performance
    How to start using BigARTM
3 Experiments
    ARTM for combining regularizers
    Multi-ARTM for classification
    Multi-ARTM for multi-language TM
Theory • BigARTM implementation (http://bigartm.org) • Experiments
Probabilistic Topic Modeling • ARTM: Additive Regularization for Topic Modeling • Multimodal Probabilistic Topic Modeling

What is a "topic"?
A topic is the special terminology of a particular domain area.
A topic is a set of coherent terms (words or phrases) that often occur together in documents.
Formally, a topic is a probability distribution over terms: $p(w \mid t)$ is the (unknown) frequency of term $w$ in topic $t$.
Document semantics is a probability distribution over topics: $p(t \mid d)$ is the (unknown) frequency of topic $t$ in document $d$.
Each document $d$ consists of terms $w_1, w_2, \dots, w_{n_d}$: $p(w \mid d)$ is the (known) frequency of term $w$ in document $d$.
When writing term $w$ in document $d$, the author is thinking about topic $t$.
A topic model tries to uncover the latent topics of a text collection.

Konstantin Vorontsov (voron@yandex-team.ru) • BigARTM: Open Source Topic Modeling
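Of the three distributions above, only $p(w \mid d)$ is observed directly. A minimal sketch of how it can be estimated as a relative frequency, $p(w \mid d) = n_{dw} / n_d$, is given below; the function name `term_frequencies` and the toy document are illustrative assumptions, not part of BigARTM.

```python
from collections import Counter

def term_frequencies(tokens):
    """Empirical p(w|d) = n_dw / n_d for a single tokenized document."""
    counts = Counter(tokens)        # n_dw: occurrences of each term w in d
    n_d = sum(counts.values())      # n_d: length of the document in terms
    return {w: n / n_d for w, n in counts.items()}

# Toy document (hypothetical): five terms, "topic" occurs twice.
doc = "topic model finds topic structure".split()
p_wd = term_frequencies(doc)
```

The resulting dictionary is a probability distribution over the terms of the document; $p(w \mid t)$ and $p(t \mid d)$ have no such direct estimate and must be inferred by the model.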
Goals and applications of Topic Modeling

Goals:
    Uncover the hidden thematic structure of a text collection
    Find a compressed semantic representation of each document

Applications:
    Information retrieval for long-text queries
    Semantic search in large scientific document collections
    Revealing research trends and research fronts
    Expert search
    News aggregation
    Recommender systems
    Categorization, classification, summarization, and segmentation of texts, images, video, signals, and social media
    and many others
Probabilistic Topic Modeling: milestones and mainstream
    1. PLSA: Probabilistic Latent Semantic Analysis (1999)
    2. LDA: Latent Dirichlet Allocation (2003)
    3. Hundreds of PTMs based on graphical models and Bayesian inference

David Blei. Probabilistic topic models // Communications of the ACM, 2012. Vol. 55. No. 4. Pp. 77–84.
Generative Probabilistic Topic Model (PTM)

A topic model explains the terms $w$ in documents $d$ by topics $t$:

\[ p(w \mid d) = \sum_{t} p(w \mid t)\, p(t \mid d) \]

[Figure: an example document with terms $w_1, \dots, w_{n_d}$, its topic distribution $p(t \mid d)$, and the top terms of each topic with their probabilities $p(w \mid t)$.]
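The mixture formula and the generative story behind it (pick a topic from $p(t \mid d)$, then a term from $p(w \mid t)$) can be sketched in a few lines. The vocabulary, the probabilities, and the function names below are made-up illustrations, not values from the slide's figure.

```python
import random

# Hypothetical p(w|t): each term maps to its probability in topics 0 and 1.
# Every topic column sums to 1.
phi = {"gene": [0.7, 0.1], "dna": [0.2, 0.1], "matrix": [0.1, 0.8]}
# Hypothetical p(t|d) for one document.
theta_d = [0.25, 0.75]

def term_distribution(phi, theta_d):
    """p(w|d) = sum_t p(w|t) p(t|d) for every term w."""
    return {w: sum(p * q for p, q in zip(row, theta_d)) for w, row in phi.items()}

def generate_document(phi, theta_d, length, rng):
    """Sample terms one by one: topic t ~ p(t|d), then term w ~ p(w|t)."""
    terms = list(phi)
    words = []
    for _ in range(length):
        t = rng.choices(range(len(theta_d)), weights=theta_d)[0]
        w = rng.choices(terms, weights=[phi[v][t] for v in terms])[0]
        words.append(w)
    return words

p_wd = term_distribution(phi, theta_d)
sample = generate_document(phi, theta_d, length=10, rng=random.Random(0))
```

With these toy numbers the document is dominated by topic 1, so "matrix" receives most of the probability mass in `p_wd`.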
PLSA: Probabilistic Latent Semantic Analysis [T. Hofmann 1999]

Given:
    $D$ is a set (collection) of documents
    $W$ is a set (vocabulary) of terms
    $n_{dw}$ is the number of times term $w$ appears in document $d$

Find: the parameters $\varphi_{wt} = p(w \mid t)$, $\theta_{td} = p(t \mid d)$ of the topic model

\[ p(w \mid d) = \sum_{t} \varphi_{wt}\, \theta_{td}. \]

The problem of log-likelihood maximization under non-negativity and normalization constraints:

\[ \sum_{d,w} n_{dw} \ln \sum_{t} \varphi_{wt}\, \theta_{td} \;\to\; \max_{\Phi,\Theta}, \]

\[ \varphi_{wt} \ge 0, \quad \sum_{w \in W} \varphi_{wt} = 1; \qquad \theta_{td} \ge 0, \quad \sum_{t \in T} \theta_{td} = 1. \]
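The slide states the optimization problem but not a solver. A common way to solve it is the EM algorithm, sketched below in plain Python under the assumption that every document is non-empty; the function names, the random initialization, and the toy counts are illustrative and do not reflect BigARTM's implementation.

```python
import math
import random

def plsa_em(n_dw, num_topics, num_iters=50, seed=0):
    """EM for PLSA. n_dw: list of dicts (one per document), term -> count."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in n_dw for w in doc})
    topics = range(num_topics)

    def normalized(values):
        total = sum(values)
        return [v / total for v in values]

    # Random strictly positive initialization satisfying the constraints.
    cols = [normalized([rng.random() + 0.1 for _ in vocab]) for _ in topics]
    phi = {w: [cols[t][i] for t in topics] for i, w in enumerate(vocab)}
    theta = [normalized([rng.random() + 0.1 for _ in topics]) for _ in n_dw]

    for _ in range(num_iters):
        n_wt = {w: [0.0] * num_topics for w in vocab}
        n_td = [[0.0] * num_topics for _ in n_dw]
        for d, doc in enumerate(n_dw):
            for w, n in doc.items():
                # E-step: p(t|d,w) is proportional to phi_wt * theta_td.
                joint = [phi[w][t] * theta[d][t] for t in topics]
                z = sum(joint)
                for t in topics:
                    r = n * joint[t] / z
                    n_wt[w][t] += r
                    n_td[d][t] += r
        # M-step: renormalize the expected counts.
        n_t = [sum(n_wt[w][t] for w in vocab) for t in topics]
        phi = {w: [n_wt[w][t] / n_t[t] for t in topics] for w in vocab}
        theta = [normalized(row) for row in n_td]
    return phi, theta

def log_likelihood(n_dw, phi, theta):
    """sum_{d,w} n_dw * ln sum_t phi_wt * theta_td."""
    return sum(n * math.log(sum(p * q for p, q in zip(phi[w], theta[d])))
               for d, doc in enumerate(n_dw) for w, n in doc.items())

# Toy collection (hypothetical counts): two "genetics" and two "algebra" documents.
docs = [{"gene": 3, "dna": 2}, {"dna": 3, "genome": 2, "gene": 1},
        {"matrix": 3, "vector": 2}, {"vector": 3, "matrix": 1, "norm": 2}]
phi, theta = plsa_em(docs, num_topics=2)
```

Each EM iteration cannot decrease the log-likelihood, which gives a simple sanity check when experimenting with the sketch.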
Topic Modeling is an ill-posed inverse problem

Topic modeling is a problem of stochastic matrix factorization:

\[ p(w \mid d) = \sum_{t \in T} \varphi_{wt}\, \theta_{td}. \]

In matrix notation, $P_{W \times D} = \Phi_{W \times T} \cdot \Theta_{T \times D}$, where
    $P = \big( p(w \mid d) \big)_{W \times D}$ is the known term–document matrix,
    $\Phi = \big( \varphi_{wt} \big)_{W \times T}$ is the unknown term–topic matrix, $\varphi_{wt} = p(w \mid t)$,
    $\Theta = \big( \theta_{td} \big)_{T \times D}$ is the unknown topic–document matrix, $\theta_{td} = p(t \mid d)$.

The factorization is not unique, so the solution is unstable:

\[ \Phi\Theta = (\Phi S)(S^{-1}\Theta) = \Phi'\Theta' \]

for every $S$ such that $\Phi' = \Phi S$ and $\Theta' = S^{-1}\Theta$ are stochastic.

Hence regularization is needed to select an appropriate solution.
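The non-uniqueness is easy to check numerically. The sketch below uses made-up 3×2 and 2×2 matrices and one particular mixing matrix $S$; all values and helper names are assumptions chosen so that both transformed factors stay stochastic.

```python
def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def inverse_2x2(m):
    """Inverse of a 2x2 matrix via the adjugate formula."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def is_stochastic(m, tol=1e-9):
    """True if all entries are non-negative and every column sums to 1."""
    nonneg = all(x >= -tol for row in m for x in row)
    sums_ok = all(abs(sum(col) - 1.0) <= tol for col in zip(*m))
    return nonneg and sums_ok

# Hypothetical factors: 3 terms x 2 topics and 2 topics x 2 documents.
phi = [[0.7, 0.1], [0.2, 0.1], [0.1, 0.8]]
theta = [[0.3, 0.8], [0.7, 0.2]]
# A mixing matrix whose columns sum to 1.
s = [[0.9, 0.1], [0.1, 0.9]]

phi2 = matmul(phi, s)                   # Phi' = Phi S
theta2 = matmul(inverse_2x2(s), theta)  # Theta' = S^{-1} Theta
p1 = matmul(phi, theta)
p2 = matmul(phi2, theta2)
```

Here `phi2` and `theta2` differ from the original factors, both remain stochastic, and `p1` and `p2` agree up to rounding, so the data alone cannot distinguish the two solutions.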
ARTM: Additive Regularization of Topic Models

Additional regularization criteria: $R_i(\Phi, \Theta) \to \max$, $i = 1, \dots, n$.

The problem of regularized log-likelihood maximization under non-negativity and normalization constraints:

\[ \underbrace{\sum_{d,w} n_{dw} \ln \sum_{t \in T} \varphi_{wt}\, \theta_{td}}_{\text{log-likelihood } L(\Phi,\Theta)} \;+\; \underbrace{\sum_{i=1}^{n} \tau_i R_i(\Phi, \Theta)}_{R(\Phi,\Theta)} \;\to\; \max_{\Phi,\Theta}, \]

\[ \varphi_{wt} \ge 0; \quad \sum_{w \in W} \varphi_{wt} = 1; \qquad \theta_{td} \ge 0; \quad \sum_{t \in T} \theta_{td} = 1, \]

where $\tau_i > 0$ are regularization coefficients.

Vorontsov K. V., Potapenko A. A. Tutorial on Probabilistic Topic Modeling: Additive Regularization for Stochastic Matrix Factorization // AIST'2014, Springer CCIS, 2014. Vol. 436. Pp. 29–46.
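In the ARTM papers cited here, the regularizer enters the EM iteration through the M-step: $\varphi_{wt} \propto \big( n_{wt} + \varphi_{wt}\, \partial R / \partial \varphi_{wt} \big)_+$, normalized over terms. The sketch below illustrates that update for the simple regularizer $R = \beta \sum_{w,t} \ln \varphi_{wt}$, for which $\varphi_{wt}\, \partial R / \partial \varphi_{wt} = \beta$ (smoothing for $\beta > 0$, sparsing for $\beta < 0$). The function names and the expected counts are hypothetical, and the code is not BigARTM's API.

```python
def regularized_m_step(n_wt, phi, phi_times_grad, tau):
    """M-step for Phi with one additive regularizer.

    n_wt: term -> expected counts per topic (from the E-step).
    phi_times_grad(w, t, phi): value of phi_wt * dR/dphi_wt.
    Returns phi_wt proportional to max(n_wt + tau * phi_wt * dR/dphi_wt, 0).
    """
    terms = list(n_wt)
    num_topics = len(phi[terms[0]])
    raw = {w: [max(n_wt[w][t] + tau * phi_times_grad(w, t, phi), 0.0)
               for t in range(num_topics)] for w in terms}
    totals = [sum(raw[w][t] for w in terms) for t in range(num_topics)]
    # A topic whose entries are all truncated to zero is left empty.
    return {w: [raw[w][t] / totals[t] if totals[t] > 0 else 0.0
                for t in range(num_topics)] for w in terms}

def log_prior(beta):
    """R = beta * sum ln(phi_wt), so phi_wt * dR/dphi_wt = beta for every entry."""
    return lambda w, t, phi: beta

# Hypothetical expected counts for 3 terms and 2 topics.
n_wt = {"gene": [8.0, 0.5], "dna": [4.0, 0.5], "matrix": [0.5, 9.0]}
phi_old = {"gene": [0.5, 0.1], "dna": [0.3, 0.1], "matrix": [0.2, 0.8]}

plain = regularized_m_step(n_wt, phi_old, log_prior(0.0), tau=1.0)
smooth = regularized_m_step(n_wt, phi_old, log_prior(1.0), tau=1.0)
sparse = regularized_m_step(n_wt, phi_old, log_prior(-1.0), tau=1.0)
```

With these numbers the sparsing variant removes "matrix" from topic 0 entirely, while the smoothing variant keeps every entry strictly positive. Several regularizers would simply add their terms inside the `max`, which is the sense in which the regularization is additive.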
ARTM: available regularizers
    topic smoothing (equivalent to LDA)
    topic sparsing
    topic decorrelation
    topic selection via entropy sparsing
    topic coherence maximization
    supervised learning for classification and regression
    semi-supervised learning
    using document citations and links
    modeling temporal topic dynamics
    using vocabularies in multilingual topic models
    and many others

Vorontsov K. V., Potapenko A. A. Additive Regularization of Topic Models // Machine Learning. Special Issue "Data Analysis and Intelligent Optimization with Applications". Springer, 2014.
Multimodal Probabilistic Topic Modeling

Given a collection of text documents, a probabilistic topic model finds:
    $p(t \mid d)$, the topic distribution of each document $d$,
    $p(w \mid t)$, the term distribution of each topic $t$.

[Figure: topic modeling maps the text documents (doc1, doc2, doc3, doc4, ...) to the topics of documents and to the words and keyphrases of topics.]