Information Theory in an Industrial Research Lab
Marcelo J. Weinberger
Information Theory Research Group, Hewlett-Packard Laboratories – Advanced Studies, Palo Alto, California, USA
(with contributions from the ITR group)
Purdue University – November 19, 2007
Information Theory (Shannon, 1948)
"It's all about models, bounds, and algorithms"
• The mathematical theory:
  – measures of information
  – fundamentals of data representation (codes) for
    · compactness
    · secure, reliable communication/storage over a possibly noisy channel
• A formal framework for areas of engineering and science for which the notion of "information" is relevant
• Components:
  – data models
  – fundamental bounds
  – codes, efficient encoding/decoding algorithms
• Engineering problems addressed:
  – data compression
  – error control coding
  – cryptography
  These are enabling technologies with many practical applications in computing, imaging, storage, multimedia, communications...
Information Theory research in the industry
• Mission: research the mathematical foundations and practical applications of information theory, generating intellectual property and technology for "XXX Company" through the advancement of scientific knowledge in these areas
• Applying the theory and working on the applications makes obvious sense for "XXX Company" research labs; but why invest in advancing the theory?
  – some simple answers, which apply to any basic research area: long-term investment, prestige, visibility, giving back to society...
  – this talk will be about a different type of answer: differentiating technology vs. enabling technology
• Main claim: working on the theory helps develop the analytical tools needed to envision innovative, technology-differentiating ideas
Case studies
• JPEG-LS: from universal context modeling to a lossless image compression standard
  [diagram: input bits → compress → store/transmit → decompress → output bits]
• DUDE (Discrete Universal DEnoiser): from a formal setting for universal denoising to actual image denoising algorithms
• Error-correcting codes in nanotechnology: the advantages of interdisciplinary research
• 2-D information theory: looking into the future of storage devices
Work paradigm
• Start: identify a (fairly abstract) practical problem — e.g., image compression, 2-D channel coding, ECC for nano, denoising — motivated by scientific interest and a vision of benefit to XXX
• Work on the theory: ideas, papers, participation in the scientific community; builds academic visibility and attracts talent
• Work on practical solutions: patents, technology, consulting, standards; builds industry visibility
• New challenges feed back into the start of the cycle
Universal Modeling and Coding
• Traditional Shannon theory assumes that a (probabilistic) model of the data is available, and aims at compressing the data optimally w.r.t. the model
• Kraft's inequality: every uniquely decipherable (UD) code with length function L(s), for strings s of length n over a finite alphabet A, satisfies
    ∑_{s ∈ Aⁿ} 2^(−L(s)) ≤ 1
  ⇒ a code defines a probability distribution P(s) = 2^(−L(s)) over Aⁿ
• Conversely, given a distribution P(·) (a model), there exists a UD code that assigns ⌈−log P(s)⌉ bits to s (Shannon code)
• Hence, P(·) serves as a model to encode s, and every code has an associated model
  – a model is a probabilistic tool to "understand" and predict the behavior of the data
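A minimal sketch of the code/model correspondence above, assuming a hypothetical toy distribution P over 2-bit strings: Shannon code lengths ⌈−log₂ P(s)⌉ always satisfy Kraft's inequality, so a UD (in fact prefix) code with those lengths exists.

```python
import math

# Toy (hypothetical) distribution over strings of length n = 2 over A = {0, 1}
P = {"00": 0.5, "01": 0.25, "10": 0.125, "11": 0.125}

# Shannon code: assign ceil(-log2 P(s)) bits to each string s
lengths = {s: math.ceil(-math.log2(p)) for s, p in P.items()}

# Kraft's inequality: sum over s of 2^(-L(s)) <= 1
kraft_sum = sum(2.0 ** -L for L in lengths.values())

print(lengths)           # {'00': 1, '01': 2, '10': 3, '11': 3}
print(kraft_sum <= 1.0)  # True: a prefix code with these lengths exists
```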
Universal Modeling and Coding (cont.)
• Given a model P(·) on n-tuples, arithmetic coding provides an effective means to sequentially assign to s a code word of length close to −log P(s)
  – if s = x₁x₂…xₙ, the "ideal code length" for symbol x_t is −log p(x_t | x₁x₂…x_{t−1})
  – the model can vary arbitrarily and "adapt" to the data
• CODING SYSTEM = MODEL + CODING UNIT
  – two separate problems: design a model and use it to encode
• We will view data compression as a problem of assigning probabilities to data
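A minimal sketch of the model/coding-unit separation, assuming a simple adaptive binary model (a Krichevsky–Trofimov estimator, chosen here only for illustration): the model assigns p(x_t | x₁…x_{t−1}) sequentially, and the sum of the ideal code lengths −log₂ p(x_t | past) is what an arithmetic coder would realize to within a couple of bits.

```python
import math

def ideal_code_length(bits):
    """Sum of the per-symbol ideal code lengths -log2 p(x_t | x_1..x_{t-1})
    under a sequential Krichevsky-Trofimov (add-1/2) binary model."""
    counts = [0, 0]                 # occurrences of 0 and 1 seen so far
    total = 0.0
    for x in bits:
        p = (counts[x] + 0.5) / (counts[0] + counts[1] + 1.0)
        total += -math.log2(p)      # ideal code length for this symbol
        counts[x] += 1
    return total

print(ideal_code_length([0, 1, 0, 0, 1, 0, 0, 0]))  # total ideal code length, in bits
```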
Coding with Model Classes
• Universal data compression deals with the optimal description of data in the absence of a given model
  – in most practical applications, the model is not given to us
• How do we make the concept of "optimality" meaningful?
  – there is always a code that assigns just 1 bit to the given data!
• The answer: model classes
  – we want a "universal" code to perform as well as the best model in a given class C for any string s, where the best competing model changes from string to string
  – universality makes sense only w.r.t. a model class
• A code with length function L(xⁿ) is pointwise universal w.r.t. the class C if, as n → ∞,
    R(L, xⁿ) = (1/n) [ L(xⁿ) − min_{C′ ∈ C} L_{C′}(xⁿ) ] → 0
  where L_{C′}(xⁿ) is the code length with model C′
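A minimal sketch of pointwise universality for the Bernoulli class, assuming a KT sequential code as the universal code (the same adaptive rule as in the previous sketch, repeated so this one runs on its own): the normalized redundancy against the best model in the class, i.e. n times the empirical entropy, vanishes as n grows.

```python
import math

def kt_code_length(bits):
    # sequential KT code length; universal for the class of Bernoulli models
    counts, total = [0, 0], 0.0
    for x in bits:
        p = (counts[x] + 0.5) / (counts[0] + counts[1] + 1.0)
        total += -math.log2(p)
        counts[x] += 1
    return total

def best_in_class_length(bits):
    # min over the Bernoulli class of -log2 P_theta(x^n) = n * empirical entropy
    n, k = len(bits), sum(bits)
    p = k / n
    if p in (0.0, 1.0):
        return 0.0
    return n * (-p * math.log2(p) - (1 - p) * math.log2(1 - p))

bits = [0] * 700 + [1] * 300
n = len(bits)
print((kt_code_length(bits) - best_in_class_length(bits)) / n)  # small, roughly (log2 n)/(2n)
```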
How to Choose a Model Class?
• Universal coding tells us how to encode optimally w.r.t. a class; it doesn't tell us how to choose a class!
• Some possible criteria:
  – complexity: existence of efficient algorithms
  – prior knowledge on the data
• We will see that the bigger the class, the slower the best possible convergence rate of the redundancy to 0
  – in this sense, prior knowledge is of paramount importance: don't learn what you already know!
• Ultimately, the choice of model class is an art
Parametric Model Classes
• A useful limitation of the model class is to assume C = { P_θ : θ ∈ Θ_d }, where Θ_d is a parameter space of dimension d
• Examples:
  – Bernoulli: d = 1; general i.i.d. model: d = α − 1 (α = |A|)
  – FSM model with k states: d = k(α − 1)
  – memoryless geometric distribution on the integers i ≥ 0: P(i) = θ^i (1 − θ), d = 1
• A straightforward method: two-part code [Rissanen '84]
    ⌈−log p(xⁿ | θ̂)⌉ bits  +  bits to encode the best θ̂
  – the first term describes the data under θ̂; the second is the "model cost," which grows with d
• Trade-off: the dimension of the parameter space plays a fundamental role in modeling problems
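A minimal sketch of the two-part code length for the Bernoulli class (d = 1), assuming the usual (d/2)·log₂ n bits as a stand-in for the cost of describing the quantized ML parameter; the quantization details are omitted for illustration.

```python
import math

def two_part_code_length(bits):
    n, k = len(bits), sum(bits)
    d = 1                                     # dimension of the Bernoulli class
    model_cost = 0.5 * d * math.log2(n)       # bits to describe the best theta ("model cost")
    theta = min(max(k / n, 1.0 / (2 * n)), 1.0 - 1.0 / (2 * n))  # keep the logs finite
    data_cost = -(k * math.log2(theta) + (n - k) * math.log2(1.0 - theta))
    return model_cost + data_cost

bits = [0] * 900 + [1] * 100
print(two_part_code_length(bits) / len(bits))  # bits/symbol, close to the entropy h(0.1)
```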
Fundamental Lower Bound
• A criterion for measuring the optimality of a universal model is provided by Rissanen's lower bound [Rissanen '84]
  – for every P(·), any ε > 0, and sufficiently large n,
      (1/n) E_θ[ −log P(xⁿ) ] ≥ (1/n) H_n(θ) + (1 − ε) · d · (log n)/(2n)
    for all parameter values θ except for a set whose volume → 0 as n → ∞, provided a "good" estimator of θ exists
• Conclusion: the number of parameters affects the achievable convergence rate of a universal code length to the entropy
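A worked instance of the bound for the Bernoulli case, assuming logs taken base 2 so the excess is measured in bits:

\[
\frac{1}{n}\,E_\theta\!\left[-\log P(x^n)\right] \;\ge\; \frac{1}{n} H_n(\theta) + (1-\varepsilon)\,\frac{\log n}{2n} \qquad (d = 1).
\]

For n = 10⁶ this is an unavoidable excess of roughly ½·log₂ 10⁶ ≈ 10 bits over the n-th order entropy — essentially the model cost paid by the two-part code on the previous slide.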
Contexts and Tree Models
• More efficient parametrization of a Markov process [Weinberger/Lempel/Ziv '92, Weinberger/Rissanen/Feder '95]
• Any suffix of a sequence x^t is called a context in which the next symbol x_{t+1} occurs
  [figure: a binary context tree; the most recent input bits select the context of the next input bit]
• For a finite-memory source P, the conditioning states s(x^t) are contexts that satisfy
    P(a | x^t) = P(a | u s(x^t))   for all u ∈ A*, a ∈ A
• Number of parameters: α − 1 per leaf of the tree
• There exist efficient universal schemes in the class of tree models of any size [Weinberger/Rissanen/Feder '95, Willems/Shtarkov/Tjalkens '95, Martín/Seroussi/Weinberger '04]
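A minimal sketch of how a tree model selects its conditioning state, assuming the contexts are stored as a complete, suffix-free set of strings (the leaves of the tree); the function name and the toy tree are illustrative only.

```python
def find_context(past, contexts):
    """Return the unique context (tree leaf) that is a suffix of `past`.
    `contexts` is assumed to be a complete, suffix-free set of strings."""
    for depth in range(1, len(past) + 1):
        suffix = past[-depth:]
        if suffix in contexts:
            return suffix
    return ""                               # root (empty) context

leaves = {"0", "01", "11"}                  # toy binary context tree
print(find_context("1100101", leaves))      # past ends in ...01 -> context "01"
print(find_context("1100110", leaves))      # past ends in 0     -> context "0"
```

The next symbol is then coded with the adaptive conditional distribution attached to that leaf, with α − 1 free parameters per leaf.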
Lossless Image Compression (the real thing…)
  [diagram: input image bits → compress → store/transmit → decompress → output bits]
• Some applications of lossless image compression:
  – images meant for further analysis and processing (as opposed to just human perception)
  – images where loss might have legal implications
  – images obtained at great cost
  – applications with intensive editing and repeated compression/decompression cycles
  – applications where the desired quality of the rendered image is unknown at the time of acquisition
• International standard: JPEG-LS (1998)
Universality vs. Prior Knowledge
• Application of universal algorithms for tree models directly to real images yields poor results
  – some structural symmetries typical of images are not captured by the model
  – a universal model has an associated "learning cost": why learn something we already know?
• Modeling approach: limit the model class by use of "prior knowledge"
  – for example, images tend to be a combination of smooth regions and edges
  – predictive coding was successfully used for years: it encodes the difference between a pixel and a predicted value of it
  – prediction errors tend to follow a Laplacian distribution ⇒ AR model + Laplacian, where both the center and the decay are context dependent
• Prediction = fixed prediction + adaptive correction
Models for Images
• In practice, contexts are formed out of a finite subset of the past sequence (a causal template):
      c  b  d
      a  x        ← current sample
• Conditional probability model for prediction errors: two-sided geometric distribution (TSGD), a "discrete Laplacian"
      P(e) = C(θ, s) · θ^|e + s|,   θ ∈ (0,1),  s ∈ [0,1)
  [figure: TSGD over the integers, peaked at −s and decaying on both sides]
• The shift s is constrained to [0,1) by an integer-valued adaptive correction (bias cancellation) applied to the fixed predictor
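A minimal sketch of the fixed part of the prediction over the causal template above — the median edge detector (MED) used in LOCO-I/JPEG-LS; the adaptive, context-dependent bias correction that keeps s in [0,1) is omitted here.

```python
def med_predict(a, b, c):
    """Median edge detector (MED) over the causal template
          c  b
          a  x     (x is the current sample):
    tends to pick b near a vertical edge, a near a horizontal edge,
    and the planar value a + b - c in smooth regions."""
    if c >= max(a, b):
        return min(a, b)
    if c <= min(a, b):
        return max(a, b)
    return a + b - c

# the prediction error e = x - (med_predict(a, b, c) + adaptive correction)
# is then modeled with the TSGD above
```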