Conference paper, April 1992. Source: https://www.researchgate.net/publication/330141363
A DIALOG PRESENTATION OF CENSUS RESULTS BY MEANS OF THE PROBABILISTIC EXPERT SYSTEM PES

JIŘÍ GRIM
Department of Computer Aided Decision-Making and Control
Institute of Information Theory and Automation, Czechoslovak Academy of Sciences
Pod vodárenskou věží 4, 18208 PRAGUE 8, Czechoslovakia

ABSTRACT

This paper suggests a qualitatively new method of presenting census results by means of the probabilistic expert system PES. The knowledge base of PES includes a statistical model of the original data which reproduces any marginal probability distribution, conditional or unconditional, with reasonable accuracy depending on the size of the corresponding subpopulation. The resulting data compression should enable easy distribution of the final software product, possibly on a single diskette.

1 Introduction

Extensive data sets are becoming indispensable as a basis for modern decision making in many spheres of social life. One of the most important data sets arises from the census. It is unique in its extreme extent, which covers the whole population, and in its correspondingly extreme costs. Because of its substantial importance the census is repeated periodically in many countries despite strong criticism concerning the possible misuse of private data. On the other hand, because of the necessary data protection measures, the availability of the information contained in census data is rather limited.

In some respects the census yields global values such as the number of citizens, but most questions of the census form offer a few alternative answers. Numerically answered questions may be discretized by introducing suitable intervals. The census form may also relate to objects such as households, buildings etc. in a complicated way. For this reason we assume in the following that the questionnaire relates to a single subject and that the variables are transformed to discrete type.
Other possible subjects of the census are assumed to be treated separately on the basis of the corresponding subsets of variables.

11th European Meeting on Cybernetics and Systems Research '92, pp. 997-1004, Eds: Trappl R., World Scientific, Singapore 1992 (Vienna, AT, 21.04.1992-24.04.1992)
For each question (variable) the fundamental result of the census is given by the global relative frequencies of the individual answers (values). The table of relative frequencies can be viewed as an estimate of the corresponding unconditional marginal probability distribution and displayed as a histogram. To estimate conditional probability distributions we can compute analogous histograms for different subpopulations, e.g. men, women, nationalities, regions etc. Generally, any combination of answers to different questions can be used as a criterion (condition) for the choice of a subpopulation which, in turn, may be characterized by histograms of the remaining, unspecified questions. Any histogram (conditional distribution) characterizing such a subpopulation formally corresponds to a possible user query and can be estimated by means of relative frequencies, with an accuracy depending on the subpopulation size.

The number of possible queries which can be formulated over a census questionnaire may become exceedingly large. Thus, e.g., in the case of 100 questions with 4 alternative answers we can consider 100 unconditional "global" histograms, 39600 conditional histograms corresponding to subpopulations specified by one answer and 125126400 conditional histograms for subpopulations specified by two answers. As subpopulations specified by three or four answers may still be quite plausible (even if partly empty), it is obvious that only a small portion of the potentially interesting histograms could be printed. The problem of accessibility of census results can hardly be solved by some kind of selection, since potential users may formulate very special and diverse queries.

One of the present approaches relies on technical means. Aggregate results in the form of contingency tables are stored in a high-capacity memory which is accessible from a computer network. However, even in this case only tables of low order (e.g.
6-10 variables) can be stored because of technical limitations. Thus, any member of the network may directly recall any conditional histogram as long as the number of involved variables does not exceed the size of the corresponding table. Despite technical problems and possibly high costs this "direct" approach could retain its importance in the future.

An alternative solution is made possible by the recently developed probabilistic expert system PES [2, 3]. Instead of contingency tables the census results are described in a highly compressed form by a multivariate discrete probability distribution which is included in the knowledge base. Using PES the customer can compute an estimate of any conditional or unconditional histogram without any further contact with the central data base.

2 The probabilistic expert system PES

Considering the probabilistic approach to expert systems we assume that input and output information is expressed in terms of some discrete (finite valued) random
variables

$$v_1, v_2, \ldots, v_N; \quad v_n \in X_n. \tag{1}$$

The certainty degree (truth value, degree of belief) that a random variable $v_n$ has a value $x_n \in X_n$ is assumed to be given by the corresponding probability $P\{v_n = x_n\}$. In this sense the uncertainty of a variable $v_n$ is characterized by a discrete probability distribution

$$P\{v_n = x_n\} = p_n(x_n); \quad x_n \in X_n; \quad \sum_{x_n \in X_n} p_n(x_n) = 1 \tag{2}$$

which can be viewed as a histogram having $|X_n|$ columns. Because of its informativeness and simplicity the histogram is the fundamental means of communication between the user and the expert system (ES).

The purpose of an ES usually consists in evaluating some output variables $v_{k+1}, v_{k+2}, \ldots, v_N$ (goals) given a knowledge base and some input variables $v_1, v_2, \ldots, v_k$ (questions). From a more general probabilistic point of view, we have to derive the distribution of the output vector $v_B = (v_{k+1}, v_{k+2}, \ldots, v_N)$

$$P\{v_B = x_B\} = P^{\star}_B(x_B); \quad x_B = (x_{k+1}, x_{k+2}, \ldots, x_N) \in X_B; \quad X_B = X_{k+1} \times X_{k+2} \times \ldots \times X_N \tag{3}$$

which corresponds to a given distribution of the input vector $v_A = (v_1, v_2, \ldots, v_k)$:

$$P\{v_A = x_A\} = P^{\star}_A(x_A); \quad x_A = (x_1, x_2, \ldots, x_k) \in X_A; \quad X_A = X_1 \times X_2 \times \ldots \times X_k. \tag{4}$$

In this case the probabilistic knowledge base is fully described by the system of conditional distributions on $X_B$

$$\Pi_{B|A} = \{ P_{B|A}(x_B | x_A); \; x_B \in X_B; \; x_A \in X_A \} \tag{5}$$

since for any input distribution $P^{\star}_A(x_A)$ we can write

$$P^{\star}_B(x_B) = \sum_{x_A \in X_A} P_{B|A}(x_B | x_A) \, P^{\star}_A(x_A); \quad x_B \in X_B. \tag{6}$$

The system of conditional distributions (5) defines a so-called memoryless information channel with noise, and the formula of complete probability (6) represents an exact inference mechanism for the considered problem. Obviously, for another choice of input and output variables we would need other classes of conditional distributions.
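As a minimal illustration of the inference formula (6), the following Python sketch propagates an input distribution through a table of conditional distributions. The function name `infer_output`, the dictionary layout and the two-valued toy variables are assumptions made for this example only; they are not taken from PES.

```python
def infer_output(cond, p_in):
    """Formula of complete probability (6).

    cond[x_a][x_b] holds P_{B|A}(x_b | x_a); p_in[x_a] holds P*_A(x_a).
    Returns a dict with P*_B(x_b) for every output value x_b.
    """
    p_out = {}
    for x_a, p_a in p_in.items():
        for x_b, p_ba in cond[x_a].items():
            # Accumulate the term P_{B|A}(x_b | x_a) * P*_A(x_a).
            p_out[x_b] = p_out.get(x_b, 0.0) + p_ba * p_a
    return p_out


# Hypothetical toy knowledge base: one input and one output variable.
cond = {
    "male":   {"yes": 0.75, "no": 0.25},
    "female": {"yes": 0.5, "no": 0.5},
}
p_in = {"male": 0.5, "female": 0.5}  # uncertain input, given as a histogram

p_out = infer_output(cond, p_in)
print(p_out)
```

The input is itself a histogram rather than a single value, which is what distinguishes (6) from a plain table lookup: a certain answer corresponds to the special case where one input value has probability 1.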
To avoid a difficult direct design of the conditional distributions (5) the knowledge base of PES is defined as a joint probability distribution of the involved variables, in the form of a finite mixture (weighted sum) of product components

$$P(x) = \sum_{m=1}^{M} w_m F(x|m); \quad F(x|m) = \prod_{n=1}^{N} p_n(x_n|m); \tag{7}$$
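To make the structure of (7) concrete, the sketch below evaluates a small product mixture in Python and derives an unconditional and a conditional histogram from it. The histogram computation is not spelled out in the text above; it relies on the observation that, within each product component, every unspecified variable sums out to 1. The weights, the component tables and the function names are invented for illustration and are not taken from PES.

```python
from math import prod

# Hypothetical mixture with M = 2 components over N = 2 variables.
weights = [0.5, 0.5]  # w_m, assumed to sum to 1
# comps[m][n][x] = p_n(x | m); variable 0 has 2 values, variable 1 has 3.
comps = [
    [[0.75, 0.25], [0.5, 0.25, 0.25]],
    [[0.25, 0.75], [0.25, 0.25, 0.5]],
]


def joint(x):
    """P(x) = sum_m w_m * prod_n p_n(x_n | m), as in (7)."""
    return sum(
        w * prod(tables[n][x_n] for n, x_n in enumerate(x))
        for w, tables in zip(weights, comps)
    )


def histogram(n, given=None):
    """Histogram of variable n, optionally conditioned on {variable: value}.

    Only the factors of the conditioning variables and of variable n remain,
    because all other variables sum out to 1 in each product component.
    """
    given = given or {}
    # q[m] = w_m * product over conditioning variables of p_c(x_c | m)
    q = [
        w * prod(tables[c][x_c] for c, x_c in given.items())
        for w, tables in zip(weights, comps)
    ]
    total = sum(q)  # probability of the condition itself
    size = len(comps[0][n])
    return [
        sum(q_m * tables[n][x] for q_m, tables in zip(q, comps)) / total
        for x in range(size)
    ]


print(joint((0, 2)))               # joint probability of one answer vector
print(histogram(1))                # unconditional histogram of variable 1
print(histogram(1, given={0: 0}))  # histogram of variable 1 given v_0 = 0
```

In this representation the storage grows linearly with the number of variables: $M$ weights plus $M \sum_n |X_n|$ component probabilities, whereas a full contingency table over the same variables needs $\prod_n |X_n|$ cells.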