Why do complex systems look critical?
Matteo Marsili, The Abdus Salam International Centre for Theoretical Physics, Trieste, Italy
With: Iacopo Mastromatteo, Yasser Roudi, Ariel Haimovici, Dante Chialvo, Silvio Franz, Claudia Battistin
The unreasonable effectiveness of science
"The miracle of the appropriateness of the language of mathematics for the formulation of the laws of physics is a wonderful gift which we neither understand nor deserve. We should be grateful for it and hope that it will remain valid in future research and that it will extend, for better or for worse, to our pleasure, even though perhaps also to our bafflement, to wide branches of learning." (E. P. Wigner, 1960)
• Galaxies have millions of stars, a piece of material has $10^{23}$ molecules, ... Yet we understand their behavior in terms of a few relevant variables!
• Will this work for a cell ($10^4$ genes), the brain ($10^7$ neurons), an economy ($10^6$ individuals), ... ?
• We build airplanes. Can we also cure cancer or avoid the next financial crisis?
• Even if the answer is no, what is the best we can do?
• How do we find the (most) relevant variables or description of complex phenomena?
Facts and questions
• Fact 1: Data deluge + advanced experimental techniques (e.g. sequencing). Complex systems involve many variables (high-dimensional inference, e.g. $10^4$ genes). Strong under-sampling. Prediction is typically hard (e.g. drug design).
• Fact 2: We observe "criticality", as a statistical regularity, in a wide variety of different systems, such as cities, the brain, languages, economy/finance, biology. [Figure (panel b): cumulative probability of land prices in Japan for 1985, 1987, 1991 and 1998, showing rank ∼ 1/size; Kaizoji & Kaizoji 2006.]
• Questions: Are there typical properties of high-dimensional samples of complex systems? Are there overarching organizing principles (e.g. SOC)? Can we exploit "criticality" (e.g. for model selection)?
P. Bak, How Nature Works (1996); T. Mora & W. Bialek, J. Stat. Phys. (2011); S. Ki Baek et al., New J. Phys. (2012)
Criticality in (statistical) physics
• Statistical mechanics: order and disorder. Spin configurations $s = (s_1, \ldots, s_N)$, $s_i = \pm 1$, distributed as $p\{s \mid \hat g\} = \frac{1}{Z} e^{-E_{\hat g}[s]/T}$.
  - $T \gg T_c$: weak interaction, short-range correlations, large entropy
  - $T \ll T_c$: strong interaction, long-range order, small entropy
• Critical phenomena at the critical point $T_c$:
  - anomalous fluctuations (peak of $C_V$)
  - scale invariance: $C(r) \sim r^{-(d-2+\eta)}$
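The specific-heat anomaly at $T_c$ can be seen directly in a few lines of Monte Carlo. The following is a minimal sketch (my own, not part of the talk), assuming a small 2D Ising lattice with $J = k_B = 1$, nearest-neighbour interactions and Metropolis dynamics; the lattice size, sweep counts and temperature grid are arbitrary illustrative choices, so the estimate of $C_V$ is rough.

```python
# Minimal Metropolis sketch (illustration only): the specific heat C_V of a small
# 2D Ising lattice peaks near the critical temperature T_c ≈ 2.27 (units J = k_B = 1).
import numpy as np

rng = np.random.default_rng(0)

def energy(spins):
    # nearest-neighbour energy with periodic boundaries (each bond counted once)
    return -np.sum(spins * (np.roll(spins, 1, axis=0) + np.roll(spins, 1, axis=1)))

def specific_heat(L=12, T=2.3, sweeps=1500, burn_in=500):
    spins = rng.choice([-1, 1], size=(L, L))
    energies = []
    for sweep in range(sweeps):
        for _ in range(L * L):
            i, j = rng.integers(L, size=2)
            # sum of the four neighbours of spin (i, j)
            nb = (spins[(i + 1) % L, j] + spins[(i - 1) % L, j]
                  + spins[i, (j + 1) % L] + spins[i, (j - 1) % L])
            dE = 2 * spins[i, j] * nb          # energy change of flipping (i, j)
            if dE <= 0 or rng.random() < np.exp(-dE / T):
                spins[i, j] *= -1
        if sweep >= burn_in:
            energies.append(energy(spins))
    E = np.array(energies, dtype=float)
    return E.var() / (T ** 2 * L ** 2)         # C_V per spin from energy fluctuations

for T in [1.5, 2.0, 2.27, 2.6, 3.5]:
    print(f"T = {T:.2f}   C_V per spin ≈ {specific_heat(T=T):.3f}")
```

The anomalous fluctuations show up as a maximum of the printed values around $T \approx 2.3$; away from $T_c$ the specific heat is markedly smaller.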
Criticality everywhere
[Figure (G. Kirby 1985): log-log rank-size plots. Left: A Populations of all countries, B Number of ships built by all countries, C Students at English universities, D Building Societies by assets, E Populations of World's religions, F US insurance companies by staff, G World languages, H English public schools by students. Right: frequency of word usage in English, and city populations for A United States, B China, C West Germany, D Spain, E France, F East Germany, G Switzerland, H United Kingdom, I Mexico.]
rank ∝ size$^{-1}$ ⇒ $N(\text{size}) \sim \text{size}^{-2}$, where $N(k)$ is the number of observations of size $k$ and $M = \sum_k k N(k)$ is the total number of observations.
From the empirical distribution to the energy of state $s$:
$$E\{s\} \simeq -\log K_s, \qquad P\{s\} = \frac{1}{Z} e^{-\beta E\{s\}}$$
Criticality = linear relation between energy and entropy; peak of $C_V$ in learned models.
T. Mora & W. Bialek, J. Stat. Phys. (2011)
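A quick way to see the two statements above at work is to generate a synthetic Zipf sample and measure both the exponent of $N(k)$ and the energy-entropy relation. The sketch below is my own illustration, not from the talk; the number of states, sample size and fitting range are arbitrary choices.

```python
# Illustration (assumptions: Zipf law p_r ∝ 1/r, 50,000 states, 100,000 observations):
# (i) rank ∝ size^{-1} implies N(k) ∼ k^{-2};
# (ii) the empirical energy E = -log(k/M) and the entropy S(E) = log[k N(k)]
#      are linearly related with slope ≈ 1 ("criticality" in the Mora-Bialek sense).
import numpy as np

rng = np.random.default_rng(1)

S_states, M = 50_000, 100_000
p = 1.0 / np.arange(1, S_states + 1)       # Zipf: probability ∝ 1/rank
p /= p.sum()

counts = rng.multinomial(M, p)             # K_s for every state s
K = counts[counts > 0]

# N(k): number of distinct states observed exactly k times
kvals, N_of_k = np.unique(K, return_counts=True)
mask = (kvals >= 2) & (kvals <= 30)        # range with decent statistics

slope_N = np.polyfit(np.log(kvals[mask]), np.log(N_of_k[mask]), 1)[0]
print(f"exponent of N(k): {slope_N:.2f}   (Zipf prediction: roughly -2)")

E = np.log(M / kvals)                      # empirical energy of a state seen k times
S_of_E = np.log(kvals * N_of_k)            # log-number of observations at that energy
slope_S = np.polyfit(E[mask], S_of_E[mask], 1)[0]
print(f"slope of S(E) vs E: {slope_S:.2f}   (linear, slope ≈ 1, at 'criticality')")
```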
Complex system = many degrees of freedom + function
• Complex systems are not random:
  - Individuals do not live in random cities
  - A writer does not choose words at random when writing
  - Proteins are not random sequences of amino acids
  - ...
• Only part of what they do is accessible to us:
  - Variables: $\vec s = (s_1, \ldots, s_n, s_{n+1}, \ldots, s_N)$, $s_i = \pm 1$, $N \gg 1$; knowns $s = (s_1, \ldots, s_n)$, unknowns $\bar s = (s_{n+1}, \ldots, s_N)$
  - Function: $U(\vec s) = u_s + v_{\bar s \mid s}$, with $\langle v_{\bar s \mid s} \rangle = 0$; $u_s$ is the model, $v_{\bar s \mid s}$ the unknown part of the function
  - Behavior: $s^* = \arg\max_s \left[ u_s + \max_{\bar s} v_{\bar s \mid s} \right]$
How relevant are the known variables? e.g. why do you live where you live?
• I live where I live because my zip code can be nicely decomposed into primes: 34151 = 13 × 37 × 71.
• Others choose where to live depending on job, marriage, interests, etc. The zip code is not a relevant variable in this choice, whereas the city is.
• The distribution of city sizes contains information about how people choose where to live. The distribution by zip code does not.
• The distribution of population by zip code is trivial; that by city is not.
• Same for language: words are the relevant variables, punctuation marks are not.
• Modeling: models should contain relevant variables to be predictive.
• Sampling: if the variables we sample are relevant, we can infer what the system is doing.
Modeling (the direct problem)
Nature: $\max_{s, \bar s} U(s, \bar s) \Rightarrow s^*$
Observables (knowns): $s = (s_1, \ldots, s_n)$, $n = fN$; unknowns: $\bar s = (s_{n+1}, \ldots, s_N)$
Model: $\max_s E_{\bar s}[U(s, \bar s)] = \max_s u_s \Rightarrow s^0$
Success probability of the model: $p_{s^*} = P\{s^0 = s^*\}$
Q: How many known variables? How relevant?
$$P\{s^* = s\} = \frac{1}{Z(\beta)} e^{\beta u_s}, \qquad Z(\beta) = \sum_s e^{\beta u_s}$$
Gibbs-Boltzmann distribution
• Without further knowledge, $v_{\bar s \mid s}$ has to be taken as an i.i.d. random variable.
• As long as $\langle |v_{\bar s \mid s}|^m \rangle < \infty$ for all $m$:
$$\max_{\bar s} v_{\bar s \mid s} = a + \beta^{-1} Y, \qquad Y \sim \text{Gumbel}$$
• Then
$$P\{s^* = s\} = \frac{1}{Z(\beta)} e^{\beta u_s}, \qquad Z(\beta) = \sum_s e^{\beta u_s}$$
• For Gaussian(0,1) $P\{v\}$: $\beta = \sqrt{2 N (1-f) \log 2}$
• Same as maximum entropy with $\langle u_s \rangle = \bar u$
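The Gumbel statement and the quoted value of $\beta$ can be checked numerically. A minimal sketch (my own, not from the talk), assuming standard Gaussian $v_{\bar s \mid s}$ with $N = 30$ and $f = 1/2$; since Gaussian maxima converge to the Gumbel law only logarithmically, the agreement at this size is rough.

```python
# Check that the maximum of the 2^{N(1-f)} i.i.d. Gaussian terms v_{s̄|s} has
# Gumbel fluctuations of inverse scale β = sqrt(2 N (1-f) log 2), as on the slide.
import numpy as np

rng = np.random.default_rng(2)

N, f = 30, 0.5
n_unknown_states = 2 ** int(N * (1 - f))    # configurations of the unknown spins

beta_theory = np.sqrt(2 * N * (1 - f) * np.log(2))

# sample the maximum over the unknowns many times
maxima = np.array([rng.standard_normal(n_unknown_states).max() for _ in range(2000)])

# for a Gumbel variable with scale 1/β, the standard deviation is π / (β sqrt(6))
beta_empirical = np.pi / (np.sqrt(6) * maxima.std())

print(f"β from the slide's formula : {beta_theory:.2f}")
print(f"β from simulated maxima    : {beta_empirical:.2f}")
```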
The most complex system: the REM
• Knowns $s = (s_1, \ldots, s_n)$, $n = fN$; unknowns $\bar s = (s_{n+1}, \ldots, s_N)$.
• If $u_s \sim \text{Gaussian}(0, \sigma^2)$ i.i.d., there is a critical relevance $\sigma_c = \sqrt{\frac{f}{1-f}}$:
  - for $\sigma > \sigma_c$: $\quad P\{s^* = s^0\} \simeq 1 - \frac{a}{1 + b(\sigma - \sigma_c)}$
  - for $\sigma < \sigma_c$: $\quad P\{s^* = s^0\} \simeq e^{-cN(\sigma_c - \sigma)}$
• Known variables should be relevant enough! (relevant = those the system cares about)
[Figure: critical relevance $\sigma_c$ vs. $f$, the fraction of relevant variables.]
(Random Energy Model; Cook & Derrida 1991)
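The qualitative content of this result, that the model's prediction $s^0$ matches Nature's optimum $s^*$ only if the known variables are relevant enough, can be illustrated with a brute-force toy. The sketch below is my own finite-size experiment with the distributions stated above ($u_s \sim \mathcal N(0, \sigma^2)$, $v_{\bar s \mid s} \sim \mathcal N(0, 1)$); at $N = 16$ it only shows the monotone rise of $P\{s^* = s^0\}$ with $\sigma$, not the sharp large-$N$ transition at $\sigma_c$.

```python
# Toy estimate of P(s* = s0): how often the full optimum
#   s* = argmax_s [ u_s + max_s̄ v_{s̄|s} ]
# coincides with the model's prediction s0 = argmax_s u_s, as σ grows.
import numpy as np

rng = np.random.default_rng(3)

N, f = 16, 0.5
n_known = 2 ** int(f * N)                  # configurations of the known spins
n_unknown = 2 ** int((1 - f) * N)          # configurations of the unknown spins

def prob_match(sigma, trials=400):
    hits = 0
    for _ in range(trials):
        u = sigma * rng.standard_normal(n_known)                      # u_s
        v_max = rng.standard_normal((n_known, n_unknown)).max(axis=1)  # max_s̄ v_{s̄|s}
        hits += np.argmax(u + v_max) == np.argmax(u)
    return hits / trials

for sigma in [0.1, 0.3, 1.0, 3.0, 10.0]:
    print(f"σ = {sigma:5.1f}   P(s* = s0) ≈ {prob_match(sigma):.2f}")
```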
Maximally informative models are critical (Mastromatteo & Marsili, JSTAT 2012)
• Example: $s$ = $n$ binary variables (e.g. spikes from a salamander retina).
• Parametric models: $p(s) = p(s \mid h, J)$ = Ising model.
• A uniform prior $P\{p(s)\}$ over distributions maps onto a non-uniform $P\{h, J\}$ over parameters, which concentrates around critical points.
• Intuition (Cramér-Rao): the density of models is set by the susceptibility $\chi = \frac{\delta \langle s \rangle_{\rm data}}{\delta (h, J)_{\rm params}}$, which peaks at criticality.
[Figure: inferred parameters in the $(h, J)$ plane accumulate near the critical region of the Ising model.]
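The first step of the argument, that a uniform measure over distributions $p(s)$ induces a strongly non-uniform measure over Ising parameters, can be made explicit for two spins, where the map $p \mapsto (h_1, h_2, J)$ is exact and invertible. This is my own two-spin illustration, not the paper's calculation; the concentration around critical points proper is a large-$n$ effect and is not visible at this size.

```python
# Draw p(s1, s2) uniformly from the 3-simplex (Dirichlet(1,1,1,1)) and invert
# p(s1, s2) ∝ exp(h1 s1 + h2 s2 + J s1 s2) exactly.  The induced density over
# (h1, h2, J) is far from uniform.
import numpy as np

rng = np.random.default_rng(4)

# uniform distribution on the simplex of the four states (++, +-, -+, --)
p = rng.dirichlet(np.ones(4), size=100_000)
ppp, ppm, pmp, pmm = p.T

# exact inversion of the two-spin Ising parametrization
J  = 0.25 * np.log(ppp * pmm / (ppm * pmp))
h1 = 0.25 * np.log(ppp * ppm / (pmp * pmm))
h2 = 0.25 * np.log(ppp * pmp / (ppm * pmm))

for name, x in [("h1", h1), ("h2", h2), ("J ", J)]:
    q = np.percentile(x, [5, 50, 95])
    print(f"{name}: 5%/50%/95% quantiles = {q[0]:+.2f} / {q[1]:+.2f} / {q[2]:+.2f}")
```

The Dirichlet(1,1,1,1) prior is exactly the "uniform $P\{p(s)\}$" of the slide; the heavy tails of the printed quantiles show how strongly the flat measure on distributions is distorted when expressed in $(h, J)$.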
Extensions:
• What is the analogue of Boltzmann for fat-tailed $P\{v\}$?
• How relevant and how many should the known variables be when $P\{v\}$ is sub-exponential?
• GREM (directed polymers on trees): optimal resolution/discounting
$$U(\vec s) = u^1_{s_1} + u^2_{s_2 \mid s_1} + u^3_{s_3 \mid s_2, s_1} + \ldots + u^m_{s_m \mid s_{m-1}, \ldots, s_1}$$
  Discounting: $u^k_{s_k \mid s_{k-1}, \ldots, s_1} \sim \delta^{k-1}$, $\delta < 1$.
  Knowns: $s \equiv s_{<k} = (s_1, \ldots, s_{k-1})$; unknowns: $\bar s \equiv s_{\geq k} = (s_k, \ldots, s_m)$.
Sampling (the inverse problem)
Nature: $\max_{s, \bar s} U(s, \bar s) \Rightarrow s^*$
Observables (knowns): we record only $s$.
Data: $M$ observations $\hat s = \left( s^{(1)}, \ldots, s^{(M)} \right)$
Q: What can I say about $u_s = E_{\bar s}[U(s, \bar s)]$? When is $M$ large enough? What do samples (typically) look like when $M$ is small?
Where is the information on $u_s$ in the sample?
• Sample of $M$ observations: $\hat s = \left( s^{(1)}, \ldots, s^{(M)} \right)$
• The number of times state $s$ is observed, $K_s = \sum_{i=1}^M \delta_{s^{(i)}, s}$, gives a noisy estimate of $u_s$: $\quad u_s \approx c + \beta^{-1} \log K_s$
• The information contained in the sample is
$$H[K] = -\sum_k \frac{k N(k)}{M} \log_2 \frac{k N(k)}{M}$$
  where $N(k)$ is the number of states observed $k$ times (e.g. the number of cities of size $k$).
The information content of the city size distribution: how many bits to find Mr X?
• Information gain and entropy: with $M$ people in the US, you need $\log_2 M$ bits to find Mr X.
• If you knew the size $K_X$ of the city where X lives, you would still need $\log_2 [K_X N(K_X)]$ binary questions (i.e. bits); the average gain is
$$H[K] = -\sum_k \frac{k N(k)}{M} \log_2 \frac{k N(k)}{M}$$
• If you knew which city $s_X$ X lives in, you would need $\log_2 K_X$ bits; the average gain is
$$H[s] = -\sum_k \frac{k N(k)}{M} \log_2 \frac{k}{M}$$
• If all individuals live in the same city ($K_X = M$), then $H[K] = H[s] = 0$: you gain no information either way.
• If each individual lives in a different city ($K_X = 1$), then $H[K] = 0$ and $H[s] = \log_2 M$: you gain nothing from knowing $K_X$, and you know everything from knowing $s_X$.
• The information gain depends on $N(K)$, and its amount is given by $H[K]$. What is the most informative $N(k)$ for $0 < H[s] < \log_2 M$?
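As a concrete companion to the definitions above, here is a small sketch (my own, not from the talk) that computes $H[s]$ and $H[K]$ from a sample of labels and reproduces the two extreme cases on this slide; the sample sizes and the Zipf example are arbitrary choices.

```python
# H[s] ("resolution") and H[K] ("relevance"), in bits, from a 1-D array of labels.
import numpy as np

def resolution_and_relevance(sample):
    sample = np.asarray(sample)
    M = sample.size
    _, K = np.unique(sample, return_counts=True)       # K_s for each observed state
    # H[s] = -Σ_s (K_s / M) log2 (K_s / M)  (same as the per-k sum on the slide)
    H_s = -np.sum((K / M) * np.log2(K / M))
    # H[K] = -Σ_k (k N(k) / M) log2 (k N(k) / M), with N(k) = #states observed k times
    kvals, N_of_k = np.unique(K, return_counts=True)
    w = kvals * N_of_k / M
    H_K = -np.sum(w * np.log2(w))
    return H_s, H_K

rng = np.random.default_rng(5)
ranks = np.arange(1, 2001)
zipf_sample = rng.choice(2000, size=10_000, p=(1 / ranks) / np.sum(1 / ranks))

for name, sample in [("one big city ", np.zeros(1000)),     # K_X = M:  H[s] = H[K] = 0
                     ("all different", np.arange(1000)),     # K_X = 1:  H[s] = log2 M, H[K] = 0
                     ("Zipf-like    ", zipf_sample)]:        # in between
    H_s, H_K = resolution_and_relevance(sample)
    print(f"{name}  H[s] = {H_s:6.3f}   H[K] = {H_K:6.3f}")
```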
Maximally informative samples (upper bound)
Maximize the relevance at fixed resolution:
$$\max_{\{N(k)\}} H[K] \quad \text{s.t.} \quad H[s] = H_0, \quad \sum_k k N(k) = M$$
Data processing inequality:
$$H[s] - H[K] = \sum_k \frac{k N(k)}{M} \log N(k) \geq 0$$
The bound $H[K] \leq H[s]$ is saturated when $N(k) = 1$ for all $k$; the absolute maximum of $H[K]$ is attained on Zipf's law, $N(k) \sim k^{-\mu}$ with $\mu = 2$.
[Figure: $H[K]$ vs. $H[s]$ for $M = 10^5$ and $M = 10^6$.]
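The identity behind the data-processing inequality can be verified directly on any sample. A minimal sketch (my own), using an arbitrary heavy-tailed sample of labels:

```python
# Numerical check of  H[s] - H[K] = Σ_k (k N(k)/M) log2 N(k)  ≥ 0.
import numpy as np

rng = np.random.default_rng(6)
sample = rng.geometric(0.001, size=5000)          # arbitrary broad sample of "city" labels

M = sample.size
_, K = np.unique(sample, return_counts=True)      # occupation numbers K_s
kvals, N_of_k = np.unique(K, return_counts=True)  # N(k)

H_s = -np.sum((K / M) * np.log2(K / M))
w = kvals * N_of_k / M
H_K = -np.sum(w * np.log2(w))
gap = np.sum(w * np.log2(N_of_k))

print(f"H[s] - H[K]                = {H_s - H_K:.4f}")
print(f"Σ_k (k N(k)/M) log2 N(k)   = {gap:.4f}")   # matches, and is ≥ 0
```

The gap vanishes exactly when every observed frequency occurs once ($N(k) \leq 1$), which is why that configuration saturates the bound on the slide.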