Spectral analysis of Wikipedia and PhysRev networks
Klaus Frahm
Quantware MIPS Center, Université Paul Sabatier
Laboratoire de Physique Théorique, UMR 5152, IRSAMC, CNRS
supported by EC FET Open project NADINE
FET NADINE Workshop, Directed Networks Days 2013, Milano, 13 June 2013
Google matrix for directed networks

Define the adjacency matrix $A$ by $A_{ij} = 1$ if there is a link from node $j$ to node $i$ in the network (of size $N$) and $A_{ij} = 0$ otherwise. Let $S_{ij} = A_{ij} / \sum_{i'} A_{i'j}$, and $S_{ij} = 1/N$ if $\sum_{i'} A_{i'j} = 0$ (dangling nodes).

$S$ is of Perron-Frobenius type, but for many networks the eigenvalue $\lambda_1 = 1$ is highly degenerate [⇒ convergence problem to arrive at the stationary limit of $p(t+1) = S\,p(t)$].

Therefore define the Google matrix:
$$G(\alpha) = \alpha S + (1-\alpha)\,\frac{1}{N}\, e\, e^T$$
where $e = (1, \ldots, 1)^T$ and $\alpha = 0.85$ is a typical damping factor. Here there is a unique eigenvector for $\lambda_1 = 1$, called the PageRank $P$, and the convergence goes as $\alpha^t$. (The CheiRank $P^*$ is obtained by replacing $A \to A^* = A^T$.)
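The iteration $p(t+1) = G(\alpha)\,p(t)$ can be sketched without ever storing $G$, because the dangling-node columns and the $e\,e^T/N$ term only add a uniform contribution. This is a minimal illustration, not the code used for the results in these slides; the function name `pagerank`, the link-list format, and the four-node toy network are assumptions made for the example.

```python
def pagerank(links, n, alpha=0.85, tol=1e-12, max_iter=1000):
    """Power iteration p(t+1) = G(alpha) p(t) on a list of distinct links.

    links: list of (j, i) pairs meaning "node j links to node i".
    """
    out = [[] for _ in range(n)]
    for j, i in links:
        out[j].append(i)
    p = [1.0 / n] * n
    for _ in range(max_iter):
        # weight currently sitting on dangling nodes (columns of S equal to 1/N)
        dangling = sum(p[j] for j in range(n) if not out[j])
        # uniform part: dangling columns plus the (1 - alpha) e e^T / N term
        base = (alpha * dangling + (1.0 - alpha)) / n
        new = [base] * n
        for j in range(n):
            if out[j]:
                share = alpha * p[j] / len(out[j])
                for i in out[j]:
                    new[i] += share
        diff = sum(abs(a - b) for a, b in zip(new, p))
        p = new
        if diff < tol:
            break
    return p


# Hypothetical toy network: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0; node 3 is dangling.
links = [(0, 1), (0, 2), (1, 2), (2, 0)]
P = pagerank(links, 4)
# CheiRank: same iteration on the network with all links inverted (A -> A^T).
P_star = pagerank([(i, j) for j, i in links], 4)
```

Since each step contracts the non-stationary components by at least a factor $\alpha$, the loop stops after a number of steps of order $\log(\text{tol})/\log(\alpha)$.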
Arnoldi method

Arnoldi method to (partly) diagonalize large sparse non-symmetric $d \times d$ matrices:
• choose an initial normalized vector $\xi_0$ (random or “otherwise”)
• determine the Krylov space of dimension $n_A$ (typically $1 \ll n_A \ll d$) spanned by the vectors $\xi_0,\ G\xi_0,\ \ldots,\ G^{\,n_A-1}\xi_0$
• determine by Gram-Schmidt orthogonalization an orthonormal basis $\{\xi_0, \ldots, \xi_{n_A-1}\}$ and the representation of $G$ in this basis:
$$G\,\xi_k = \sum_{j=0}^{k+1} H_{jk}\,\xi_j$$
• diagonalize the Arnoldi matrix $H$, which has Hessenberg form:
$$H = \begin{pmatrix} * & * & \cdots & * & * \\ * & * & \cdots & * & * \\ 0 & * & \cdots & * & * \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & * & * \\ 0 & 0 & \cdots & 0 & * \end{pmatrix}$$
This provides the Ritz eigenvalues, which are very good approximations to the “largest” eigenvalues of $G$.
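The orthogonalization step can be sketched as follows. This is a plain-Python illustration using modified Gram-Schmidt, not the implementation behind the results shown here; the names `arnoldi`, `matvec` and `breakdown_tol`, and the 4 × 4 toy matrix, are assumptions for the example. The final diagonalization of the small matrix $H$ would be done by any dense eigenvalue routine and is not shown.

```python
import math


def arnoldi(matvec, xi0, n_a, breakdown_tol=1e-14):
    """Return (basis, H) with matvec(basis[k]) = sum_j H[j][k] * basis[j].

    H has n_a + 1 rows and n_a columns; its top n_a x n_a block is the
    Arnoldi matrix. On breakdown (vanishing coupling element) after m
    steps, m basis vectors and the m x m block are returned instead.
    """
    norm = math.sqrt(sum(x * x for x in xi0))
    basis = [[x / norm for x in xi0]]
    H = [[0.0] * n_a for _ in range(n_a + 1)]
    for k in range(n_a):
        w = matvec(basis[k])
        for j in range(k + 1):  # modified Gram-Schmidt against xi_0..xi_k
            H[j][k] = sum(a * b for a, b in zip(basis[j], w))
            w = [a - H[j][k] * b for a, b in zip(w, basis[j])]
        H[k + 1][k] = math.sqrt(sum(x * x for x in w))
        if H[k + 1][k] < breakdown_tol:  # Krylov space is invariant: stop
            m = k + 1
            return basis, [row[:m] for row in H[:m]]
        basis.append([x / H[k + 1][k] for x in w])
    return basis, H


# Hypothetical dense 4 x 4 test matrix, stored as a list of rows.
G = [[0.1, 0.5, 0.3, 0.2],
     [0.4, 0.1, 0.3, 0.3],
     [0.3, 0.2, 0.1, 0.4],
     [0.2, 0.2, 0.3, 0.1]]


def matvec(v):
    return [sum(g * x for g, x in zip(row, v)) for row in G]


basis, H = arnoldi(matvec, [1.0, 2.0, 3.0, 4.0], n_a=2)
```

Only matrix-vector products with $G$ are needed, which is why the method suits large sparse matrices.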
Invariant subspaces

In realistic WWW or other networks, invariant subspaces of nodes create (possibly) large degeneracies of $\lambda_1$ (or of $\lambda_2$ if $\alpha < 1$), which is very problematic for the Arnoldi method.

Therefore one needs to determine the invariant subspaces, defined as subsets of nodes such that for any node in a subspace each outgoing link stays in the subspace. One can efficiently find all subspaces up to a maximal size (or dimension) $N_c$ (with $N_c = bN$ a certain fraction of the network size $N$, e.g. $b = 0.1$). All subspaces with common members are then merged, resulting in a decomposition of the network into many separate subspaces with $N_s$ nodes in total and a “big” core space of the remaining $N - N_s$ nodes.

Note that dangling nodes are by construction core space nodes.

Possible: core space node → subspace node
Impossible: subspace node → core space node
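One way to realize this decomposition is a breadth-first search from every node that is abandoned as soon as it reaches a dangling node or grows beyond $N_c$ nodes. The sketch below follows that idea; it is an assumption about how the search might be organized, not the algorithm actually used for these results, and the function name and the eight-node toy network are hypothetical.

```python
from collections import deque


def invariant_subspaces(out_links, n_c):
    """Split nodes into merged invariant subspaces and a core space.

    out_links[j] lists the nodes that node j links to.
    """
    closures = []
    for start in range(len(out_links)):
        seen = {start}
        queue = deque([start])
        closed = True
        while queue and closed:
            j = queue.popleft()
            if not out_links[j]:  # dangling node: links everywhere via 1/N
                closed = False
                break
            for i in out_links[j]:
                if i not in seen:
                    seen.add(i)
                    queue.append(i)
            if len(seen) > n_c:  # too large: start node belongs to the core
                closed = False
        if closed:
            closures.append(seen)
    # merge all subspaces that share at least one node
    merged = []
    for s in closures:
        s = set(s)
        rest = []
        for m in merged:
            if m & s:
                s |= m
            else:
                rest.append(m)
        merged = rest + [s]
    sub_nodes = set().union(*merged)
    core = [k for k in range(len(out_links)) if k not in sub_nodes]
    return merged, core


# Hypothetical toy network: two 2-cycles, node 7 feeding one of them,
# and nodes 4, 5, 6 (node 6 dangling) forming the core space.
out_links = [[1], [0], [3], [2], [0, 5], [4, 6], [], [1]]
subspaces, core = invariant_subspaces(out_links, n_c=3)
```

In the toy network the core nodes 4 and 5 link into a subspace, but no subspace node links back into the core, in line with the possible and impossible link directions listed above.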
The decomposition in subspaces and a core space implies a block structure of the matrix $S$:
$$S = \begin{pmatrix} S_{ss} & S_{sc} \\ 0 & S_{cc} \end{pmatrix}, \qquad S_{ss} = \begin{pmatrix} S_1 & 0 & \cdots \\ 0 & S_2 & \\ \vdots & & \ddots \end{pmatrix}$$
where $S_{ss}$ is block diagonal according to the subspaces. The subspace blocks of $S_{ss}$ are all matrices of PF type with at least one eigenvalue $\lambda_1 = 1$, explaining the high degeneracies.

To determine the spectrum of $S$ apply:
• Exact (or Arnoldi) diagonalization on each subspace.
• The Arnoldi method to $S_{cc}$ to determine the largest core space eigenvalues $\lambda_j$ (note: $|\lambda_j| < 1$).
The largest eigenvalues of $S_{cc}$ are no longer degenerate, but other degeneracies are possible (e.g. $\lambda_j = 0.9$ for Wikipedia).
Spectrum of Wikipedia

L. Ermann, KMF and D.L. Shepelyansky, Eur. Phys. J. B 86, 193 (2013)

Wikipedia 2009: $N = 3282257$ nodes, $N_\ell = 71012307$ network links.

Figure: spectrum of $S^*$ ($N_s = 21198$) and spectrum of $S$ ($N_s = 515$); $n_A = 6000$ for both cases.
Some eigenvectors; left (right) panel: PageRank (CheiRank) case.
• black: PageRank (CheiRank) at $\alpha = 0.85$
• grey: PageRank (CheiRank) at $\alpha = 1 - 10^{-8}$
• red and green: first two core space eigenvectors
• blue and pink: two eigenvectors with large imaginary part in the eigenvalue
Detailed study of 200 selected eigenvectors with eigenvalues “close” to the unit circle:
Power law decay of eigenvectors: $|\psi_i(K_i)| \sim K_i^{\,b}$ for $K_i \ge 10^4$, with $\varphi = \arg(\lambda_i)$.
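An exponent $b$ of this kind can be estimated by a straight-line fit of $\log|\psi|$ against $\log K$ over the range $K \ge K_{\min}$. The sketch below is one simple way to do this and is not necessarily the fitting procedure used for the figure; the function name and the synthetic test amplitudes are assumptions for the example.

```python
import math


def power_law_exponent(amplitudes, k_min):
    """Least-squares slope b of log|psi(K)| versus log K for K >= k_min.

    amplitudes[K - 1] is |psi(K)|, with K = 1..N the index obtained by
    sorting the eigenvector components by decreasing modulus.
    """
    xs, ys = [], []
    for K, a in enumerate(amplitudes, start=1):
        if K >= k_min and a > 0:
            xs.append(math.log(K))
            ys.append(math.log(a))
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


# Synthetic amplitudes with a known exponent, used only to check the fit.
synthetic = [K ** -0.7 for K in range(1, 2001)]
b = power_law_exponent(synthetic, k_min=100)
```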
Inverse participation ratio of eigenvectors:
$$\xi_{\rm IPR} = \Big(\sum_j |\psi_i(j)|^2\Big)^2 \Big/ \sum_j |\psi_i(j)|^4$$
with $\varphi = \arg(\lambda_i)$.
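The inverse participation ratio counts, roughly, the number of nodes on which an eigenvector has significant weight: it equals $m$ for a vector spread uniformly over $m$ nodes and 1 for a vector localized on a single node. A minimal sketch (the function name `ipr` is chosen here for illustration):

```python
def ipr(psi):
    """Inverse participation ratio (sum |psi|^2)^2 / sum |psi|^4."""
    s2 = sum(abs(x) ** 2 for x in psi)
    s4 = sum(abs(x) ** 4 for x in psi)
    return s2 * s2 / s4


uniform = ipr([1, 1, 1, 1])      # weight spread evenly over four nodes
localized = ipr([0, 3, 0])       # all weight on one node
complex_case = ipr([1j, -1])     # complex components are handled via abs()
```

The ratio does not depend on the normalization of the vector, so eigenvectors can be passed in as returned by the eigenvalue routine.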
“Themes” of certain eigenvectors (figure: theme labels attached to selected eigenvalues, in one overview panel with horizontal axis from −1 to 1 and two zoom panels with horizontal axes from −0.82 to −0.72 and from 0.8 to 0.98). Labels appearing in the figure, some of them for several eigenvalues: math (function, geometry, surface, logic-circuit), England, poetry, Iceland, aircraft, Kuwait, Bangladesh, football, biology, song, muscle-artery, New Zealand, DNA, Austria, Bible, Poland, music, Australia, Canada, protein, Brazil, China, RNA, skin, war, rail, Texas-Dallas-Houston, Gaafu Alif Atoll, Quantum Leap, Language, Switzerland, mathematics.
Number of links between or inside the sets $A$ and $B$ defined by the index $K_i$, which orders the nodes by decreasing absolute value of the Wikipedia eigenstates:
$A = \{1, \ldots, K_i\}$, $B = \{K_i + 1, \ldots, N\}$
Physical Review network

(work in progress: KMF, Young-Ho Eom, D. Shepelyansky)

$N = 463347$ nodes and $N_\ell = 4691015$ links.

Coarse-grained matrix structure ($500 \times 500$ cells); left: time ordered; right: journal and then time ordered.

“11” journals of Physical Review: (Phys. Rev. Series I), Phys. Rev., Phys. Rev. Lett., (Rev. Mod. Phys.), Phys. Rev. A, B, C, D, E, (Phys. Rev. STAB and Phys. Rev. STPER).
⇒ nearly triangular structure of the adjacency matrix: most citation links $t \to t'$ are for $t > t'$ (“past citations”), but there is a small number ($12126 = 2.6 \times 10^{-3}\, N_\ell$) of links $t \to t'$ with $t \le t'$ corresponding to future citations.

Spectrum by “double-precision” Arnoldi method with $n_A = 8000$:

Numerical problem: eigenvalues with $|\lambda| < 0.3$-$0.4$ are not reliable!
Reason: large Jordan subspaces associated to the eigenvalue $\lambda = 0$.
“Very bad” Jordan perturbation theory:

Consider a “perturbed” Jordan block of size $D$:
$$\begin{pmatrix} 0 & 1 & \cdots & 0 & 0 \\ 0 & 0 & \cdots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & 0 & 1 \\ \varepsilon & 0 & \cdots & 0 & 0 \end{pmatrix}$$
characteristic polynomial: $\lambda^D - \varepsilon$
$\varepsilon = 0 \ \Rightarrow\ \lambda = 0$
$\varepsilon \neq 0 \ \Rightarrow\ \lambda_j = \varepsilon^{1/D} \exp(2\pi i j/D)$
For $D \approx 10^2$ and $\varepsilon = 10^{-16}$ ⇒ “Jordan-cloud” of artificial eigenvalues due to rounding errors in the region $|\lambda| < 0.3$-$0.4$.
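The size of this effect is easy to check directly: the perturbed block acts as a shift, $(M v)_k = v_{k+1}$ for $k < D$ and $(M v)_D = \varepsilon v_1$, so $v_k = \lambda^{k-1}$ is an eigenvector whenever $\lambda^D = \varepsilon$, and all eigenvalues lie on a circle of radius $\varepsilon^{1/D}$. The sketch below verifies this for an even block size; the function names and the choice $D = 100$ are assumptions for the illustration.

```python
import cmath


def perturbed_jordan_eigenvalues(D, eps):
    """The D roots of lambda^D = eps, all of modulus eps**(1/D)."""
    radius = eps ** (1.0 / D)
    return [radius * cmath.exp(2j * cmath.pi * j / D) for j in range(D)]


def eigen_residual(D, eps, lam):
    """max_k |(M v - lam v)_k| for v_k = lam^k, k = 0..D-1 (0-based)."""
    v = [lam ** k for k in range(D)]
    Mv = v[1:] + [eps * v[0]]  # shift up, corner element eps feeds the last row
    return max(abs(a - lam * b) for a, b in zip(Mv, v))


D, eps = 100, 1e-16
lams = perturbed_jordan_eigenvalues(D, eps)
radius = abs(lams[0])
worst = max(eigen_residual(D, eps, lam) for lam in lams)
```

A perturbation at the level of double-precision rounding is thus amplified to a radius of order one, and the radius shrinks only slowly as the block size $D$ decreases.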
Triangular approximation

Remove the small number of links due to “future citations”. Semi-analytical diagonalization is then possible:
$$S = S_0 + e\, d^T / N$$
where $e_n = 1$ for all nodes $n$, $d_n = 1$ for dangling nodes $n$ and $d_n = 0$ otherwise. $S_0$ is the pure link matrix, which is nilpotent: $S_0^{\,l} = 0$ with $l = 352$.

Let $\psi$ be an eigenvector of $S$ with eigenvalue $\lambda$ and $C = d^T \psi$.
• If $C = 0$ ⇒ $\psi$ is an eigenvector of $S_0$ ⇒ $\lambda = 0$ since $S_0$ is nilpotent. These eigenvectors belong to large Jordan blocks and are responsible for the numerical problems.

Note: a similar situation occurs in the network of integer numbers, where $l = [\log_2(N)]$ and there is numerical instability for $|\lambda| < 0.01$.
• If $C \neq 0$ ⇒ $\lambda \neq 0$ since the equation $S_0 \psi = -C\, e/N$ does not have a solution ⇒ $\lambda \mathbf{1} - S_0$ invertible.
$$\Rightarrow\quad \psi = C\,(\lambda \mathbf{1} - S_0)^{-1}\, e/N = \frac{C}{\lambda} \sum_{j=0}^{l-1} \left(\frac{S_0}{\lambda}\right)^{j} e/N .$$
From $\lambda^l = (d^T \psi / C)\, \lambda^l$ ⇒ $P_r(\lambda) = 0$ with the reduced polynomial of degree $l = 352$:
$$P_r(\lambda) = \lambda^l - \sum_{j=0}^{l-1} \lambda^{\,l-1-j}\, c_j = 0, \qquad c_j = d^T S_0^{\,j}\, e/N .$$
⇒ at most $l = 352$ eigenvalues $\lambda \neq 0$, which can be numerically determined as the zeros of $P_r(\lambda)$.

However, there are still numerical problems:
• $c_{l-1} \approx 3.6 \times 10^{-352}$
• alternate sign problem with a strong loss of significance
• big sensitivity of the eigenvalues on $c_j$
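The coefficients $c_j = d^T S_0^{\,j}\, e/N$ can be generated by repeatedly applying $S_0$ to $e/N$ until the vector vanishes, which also yields $l$. The sketch below does this in exact rational arithmetic for a small triangular network; the function names and the six-node citation graph are assumptions for the illustration, not the Physical Review data.

```python
from fractions import Fraction


def reduced_coefficients(out_links):
    """Return [c_0, ..., c_{l-1}] with c_j = d^T S_0^j e / N (exact)."""
    n = len(out_links)
    v = [Fraction(1, n)] * n  # v = S_0^j e / N, starting with j = 0
    c = []
    for _ in range(n + 1):
        if not any(v):  # S_0 nilpotent: v becomes exactly zero after l steps
            return c
        # d^T v: sum over dangling nodes only
        c.append(sum(v[j] for j in range(n) if not out_links[j]))
        new = [Fraction(0)] * n
        for j, targets in enumerate(out_links):
            for i in targets:
                new[i] += v[j] / len(targets)  # (S_0)_{ij} = 1 / outdegree(j)
        v = new
    raise ValueError("link matrix is not nilpotent (network has a cycle)")


def p_r(lam, c):
    """Reduced polynomial P_r(lam) = lam^l - sum_j c_j lam^(l-1-j)."""
    l = len(c)
    return lam ** l - sum(cj * lam ** (l - 1 - j) for j, cj in enumerate(c))


# Hypothetical citation graph: node j cites only earlier nodes; node 0 is dangling.
out_links = [[], [0], [0], [1, 2], [1], [3, 4]]
c = reduced_coefficients(out_links)
at_one = p_r(Fraction(1), c)          # lambda = 1 must be a zero
at_minus_half = p_r(Fraction(-1, 2), c)
```

For this toy graph $l = 4 < N = 6$, so at most four eigenvalues are non-zero. Since $d^T = e^T(\mathbf{1} - S_0)$, the coefficients always sum to 1, which is the statement that $\lambda = 1$ is a zero of $P_r$.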
Solution:
Using the multi-precision library GMP with 256 binary digits, the zeros of $P_r(\lambda)$ can be determined with accuracy $\sim 10^{-18}$. Furthermore, the Arnoldi method can also be implemented with higher precision.

Figure legend:
• red crosses: zeros of $P_r(\lambda)$ from the 256 binary digits calculation
• blue squares: eigenvalues from the Arnoldi method with 52, 256, 512, 1280 binary digits. In the last case: ⇒ break off at $n_A = 352$ with vanishing coupling element.
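Part of the difficulty is visible without any root finding: a coefficient of magnitude $3.6 \times 10^{-352}$, as quoted for $c_{l-1}$, lies below the smallest positive double-precision number and is silently replaced by zero. The sketch below contrasts this with a multi-precision representation. Python's standard `decimal` module is used here only as a stand-in for GMP, the precision of 78 decimal digits is chosen to correspond to about 256 binary digits, and the test value $\lambda = 0.5$ is arbitrary.

```python
from decimal import Decimal, getcontext

getcontext().prec = 78  # about 256 binary digits (256 * log10(2) is roughly 77)

# In double precision the coefficient underflows to exactly zero ...
c_last_double = float("3.6e-352")

# ... while a multi-precision number keeps it.
c_last_multi = Decimal("3.6e-352")

# Contribution c_{l-1} / lambda^l to the condition sum_j c_j / lambda^(j+1) = 1
# for l = 352, evaluated at the arbitrary test value lambda = 0.5.
lam = Decimal("0.5")
term_multi = c_last_multi / lam ** 352
```

In double precision this term is lost entirely, so the zeros of the polynomial would be computed from a polynomial of effectively lower degree.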
Full Physical Review network

High precision Arnoldi method for the full Physical Review network (including the “future citations”) for 52, 256, 512, 768 binary digits and $n_A = 2000$: