Challenges of linking statistical data and phonetic pronunciation software
Case study: Problem of regular statistics establishments' frames in Egypt
Nehall Ahmed Farouk, nehall_ahmed@capmas.gov.eg
Research, sampling, and computer specialist
Central Agency for Public Mobilization and Statistics (CAPMAS), Egypt
Points of discussion
• Introduction
• Problem and methods
• Expected results
• Conclusion
Introduction
Different types of statistical data are processed for various reasons: to improve statistical work and to provide new indicators. Some types of data are measurable, comparable, and linkable, but others are not. Statistical work can face many challenges in mixing, comparing, and linking data; these challenges result from the nature of the data type.
Introduction
Case study: Problem of regular statistics establishments' frames in Egypt
CAPMAS, Egypt, conducts many regular statistics establishment surveys. Each survey has its own frame (an establishments' frame), and together the frames cover about 108515 establishments. The regular statistics comprise 89 frames distributed over 9 departments. Some frames contain only the establishments' main centers, others contain the main centers and some branches, and others contain only some main centers and some branches.
Problem and methods
• Problem core
• Current situation
• Aggregation process purposes
• Aggregation process structure
• Aggregation process implementation
Problem and methods: Problem core
CAPMAS seeks to generate one main aggregated frame from all of the regular statistics establishments' frames. A total of 67 frames are related and overlap. The problem appears when implementing the aggregation process, because there is currently no way to compare and link the same establishment across different frames.
Problem and methods: Current situation
• The establishments' frames overlap, and the same establishment exists in different frames.
• No establishment has a unique ID number that could be used in data linking.
• The same establishment cannot be matched across related frames, because its name is only partially compatible (not completely identical) owing to the nature of writing in Arabic.
• The same establishment cannot be matched across different frames when it appears under different names (about 20% of the frames).
Problem and methods: Current situation
Department                               Total number of establishments   Number of frames
Labor statistics department              10866                            7
Finance and price department             6091                             14
Industrial statistics department         7859                             8
Agriculture statistics department        748                              13
Service statistics department            18270                            24
Education statistics department          53158                            5
Trade statistics department              10028                            4
Transportation statistics department     1054                             11
Infrastructure statistics department     441                              3
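As a quick consistency check, the department figures above can be summed and compared with the totals quoted in the introduction (about 108515 establishments and 89 frames). The variable names below are illustrative only.

```python
# (department, total establishments, number of frames), copied from the table above
departments = [
    ("Labor statistics", 10866, 7),
    ("Finance and price", 6091, 14),
    ("Industrial statistics", 7859, 8),
    ("Agriculture statistics", 748, 13),
    ("Service statistics", 18270, 24),
    ("Education statistics", 53158, 5),
    ("Trade statistics", 10028, 4),
    ("Transportation statistics", 1054, 11),
    ("Infrastructure statistics", 441, 3),
]

total_establishments = sum(count for _, count, _ in departments)
total_frames = sum(frames for _, _, frames in departments)

print(total_establishments, total_frames)  # 108515 89
```

Both sums agree with the figures stated in the introduction.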
Problem and methods: Aggregation process purposes
• Creating an important component for generating administrative data about establishments.
• Resolving conflicts between frames and the repetition of establishments.
• Making each establishment unique, with its own ID, in the created master frame.
• Selecting all establishment surveys from the generated master frame.
Problem and methods: Aggregation process structure
• Determining and collecting metadata about all of the overlapped related frames.
• Determining relationships and inter-relationships between the frames.
• Classifying the frames by:
  Relationship (master frames, related frames, independent frames).
  Sectoral activity (public/business sector, governmental sector, private/investment sector).
• In parallel: creating a unique ID number and comparing names through the phonetic pronunciation system.
• Final aggregation process (matching through TTS software).
Problem and methods: Aggregation process implementation
1. Parts already achieved
2. Problems
3. Overcoming the problems by using phonetic pronunciation software
1. Parts already achieved
• Classifying the frames according to sectoral activity (public/business sector, governmental sector, private/investment sector).
• Classifying the frames according to the type of relation they have with each other, then excluding the 22 independent frames.
• Dividing each sectoral activity into 2 relation types: comprehensive relations frames and partial relations frames.
Comprehensive relations frames: usually a large master frame that may include establishments from other frames; these frames might also have relations with each other.
Partial relations frames: frames that have relations with each other and with the comprehensive relations frames.
1. Parts already achieved
[Figure: relations between the overlapped frames]
2. Problems
• The establishments have no unique ID number that could be used in data linking.
• The same establishment cannot be matched across related frames, because its name is almost the same but only partially compatible owing to the nature of writing in Arabic.
• The same establishment cannot be matched across different frames when it appears under different names (about 20% of the frames).
• The frame aggregation and unification process has not been accomplished because of the lack of matching techniques.
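To illustrate what "partially compatible" names mean in practice, the sketch below compares two spellings of one name with Python's standard difflib module. It only shows why exact string comparison fails while the names remain very similar; it is not the matching technique proposed in these slides, and the variable names are illustrative.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a similarity ratio between 0 and 1 for two names."""
    return SequenceMatcher(None, a, b).ratio()


# Two spellings of one name: final heh versus teh marbuta.
name_in_frame_a = "الشركه التعاونية"
name_in_frame_b = "الشركة التعاونية"

print(name_in_frame_a == name_in_frame_b)  # exact comparison fails
print(round(similarity(name_in_frame_a, name_in_frame_b), 2))
```

A character-level ratio like this flags near-identical spellings, but it cannot help with the establishments that appear under entirely different names.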
2. Problems
Strength (S)
• Collecting the metadata of all 67 establishments' frames.
• Excluding the independent frames and determining the relations between the remaining frames.
• Classifying the related frames in 2 stages (sectoral activity, relation type).
• Having consultants who monitor the project implementation process.
Weakness (W)
• No unique ID number for all establishments.
• Inability to aggregate the overlapped establishments in the different frames.
• Redundancy of establishments in different frames, with partially compatible names or different names.
• Lack of a software technique to handle the nature of Arabic writing.
Opportunity (O)
• Finding suitable phonetic pronunciation software that matches establishments' partially compatible names.
• Generating a unique ID number for each establishment during the implementation process.
• Ability to create the master aggregated establishments' frame.
• Achieving the core of producing administrative data for establishments in CAPMAS.
Threat (T)
• 20% of the establishments might have different names in different frames.
• Finding software that both generates the phonetic pronunciation and matches it.
3. Overcoming the problems
The idea of linking data here depends on phonetic pronunciation software as a main part of the aggregation process: the data are compared first and then linked.
The nature of Arabic writing poses challenges for TTS software:
• Arabic writing and pronunciation are very difficult.
• Arabic presents some problems when data are compared through TTS software.
Using phonetic pronunciation software
The phonetic pronunciation comparing process contains two levels:
1. Generating speech for the establishments' names by using a TTS software program.
2. Comparing the pronunciation of the establishments' names by using phonetic pronunciation software.
Using phonetic pronunciation software
Text (establishment 1), Text (establishment 2)
→ Speech (establishment 1), Speech (establishment 2)
→ Phonetic comparing process
   Y: aggregation, same ID
   N: different IDs
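The decision flow above can be sketched as follows. The function `sounds_alike` is a hypothetical stand-in for the phonetic comparison of the two generated pronunciations; here it only ignores case and spacing so that the sketch runs without any TTS software. The establishment names are made up.

```python
from typing import Callable, Dict, List


def assign_ids(names: List[str],
               sounds_alike: Callable[[str, str], bool]) -> Dict[str, int]:
    """Give matching names the same ID and non-matching names different IDs."""
    representatives: List[str] = []  # one name per distinct establishment; its index is the ID
    ids: Dict[str, int] = {}
    for name in names:
        for est_id, rep in enumerate(representatives):
            if sounds_alike(name, rep):   # Y branch: aggregate under the same ID
                ids[name] = est_id
                break
        else:                             # N branch: new establishment, new ID
            representatives.append(name)
            ids[name] = len(representatives) - 1
    return ids


def sounds_alike(a: str, b: str) -> bool:
    """Placeholder (assumption): equal once case and spaces are ignored."""
    return "".join(a.split()).casefold() == "".join(b.split()).casefold()


frame_names = ["Nile Trading Co", "nile trading co", "Delta Mills"]  # made-up names
print(assign_ids(frame_names, sounds_alike))
# {'Nile Trading Co': 0, 'nile trading co': 0, 'Delta Mills': 1}
```

Passing the comparison in as a function keeps the aggregation logic independent of whichever phonetic software is eventually chosen.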
Using phonetic pronunciation software
Phonetic pronunciation software: TTS program, then compare.
Most of the writing mistakes that arise from the nature of Arabic writing vanish, as in these two samples.
Sample 1:
الشركه التعاونية للأتصالات
الاسكندرية للعقارات
الشركة المصرية للأقمشة
Sample 2:
الشركة التعاونية للأتصالات
الإسكندرية للعقارات
الشركة المصرية
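One way to see why spelling variants like those in the samples can be reconciled is a simple orthographic normalization that folds letters which are often interchanged in writing. This is a swapped-in, text-only approximation and not the TTS-based comparison described in these slides; the letter mapping and the function name `normalize_name` are assumptions chosen for illustration.

```python
import re

# Letters that are often interchanged in written Arabic names (illustrative mapping).
FOLD = str.maketrans({
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0622": "\u0627",  # alef with madda -> bare alef
    "\u0629": "\u0647",  # teh marbuta -> heh
    "\u0649": "\u064A",  # alef maksura -> yeh
})
# Short-vowel marks (U+064B to U+0652) and tatweel (U+0640) are dropped.
MARKS = re.compile("[\u064B-\u0652\u0640]")


def normalize_name(name: str) -> str:
    """Fold common spelling variants and collapse whitespace."""
    folded = MARKS.sub("", name).translate(FOLD)
    return " ".join(folded.split())


print(normalize_name("الشركه") == normalize_name("الشركة"))          # True
print(normalize_name("الاسكندرية") == normalize_name("الإسكندرية"))  # True
```

Normalization of this kind only handles spelling variants of the same name; it does nothing for establishments recorded under genuinely different names.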
Using phonetic pronunciation software
In parallel, a new primary ID is generated for each establishment. Its code depends on several factors:
• The department that includes the establishment in one of its frames.
• The establishment's sector.
• The legal structure of the establishment.
• Whether the establishment is a main center or a branch.
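A minimal sketch of how such a primary ID could be composed from the four factors. The code tables, field widths, and serial number are all hypothetical; the slides do not specify the actual coding scheme.

```python
from dataclasses import dataclass

# Hypothetical code tables (assumptions for illustration only).
DEPARTMENT_CODES = {"labor": "01", "industrial": "03", "trade": "07"}
SECTOR_CODES = {"public/business": "1", "governmental": "2", "private/investment": "3"}


@dataclass(frozen=True)
class Establishment:
    department: str
    sector: str
    structure_code: str   # assumed one-digit code for the establishment's structure
    is_main_center: bool
    serial: int           # assumed running number within the master frame


def primary_id(e: Establishment) -> str:
    """Concatenate the factor codes and a zero-padded serial number."""
    unit = "0" if e.is_main_center else "1"   # 0 = main center, 1 = branch
    return (DEPARTMENT_CODES[e.department] + SECTOR_CODES[e.sector]
            + e.structure_code + unit + f"{e.serial:06d}")


print(primary_id(Establishment("trade", "governmental", "4", True, 125)))
# 07240000125
```

A fixed-width layout like this keeps each factor readable from the ID itself, at the cost of having to reissue the ID if one of the factors changes.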
Expected results
Generating the master aggregated frame is expected to have many effects on our statistical work and on economic and technical systems:
1. Data about each establishment will be collected only once.
2. Fieldwork costs will be reduced.
3. Some surveys can be dropped, which lowers the total cost.
4. It will help in generating administrative data for establishments.
Conclusions
• Linking incomparable data can be achieved through analysis of the data. Finding the relations between different data files, and how to compare them, is the most important step in linking data.
• Statisticians must study the nature of the data and then think about how to use the most suitable technological systems or methods to link it.
• Phonetic software is also useful in comparing and linking data, provided that suitable software is developed.