
Network PDF file unstructured information extraction



  1. Network PDF file unstructured information extraction. Pengchanghuan, Sunwei, Fuxiaohan. Shanghai Jiaotong University. 598095762@qq.com

  2. What is the most important thing in our modern society? DATA. Faster data = Faster cognition = Faster results

  3. THE BUSINESS PLAN. What did we do? Smart PDF Financial Report Data Search Engine. What can it do?

  4. Catalog: 1. Introduction; 2. Our process (2.1 PDF2CSV, 2.2 Preliminary extraction, 2.3 Autodetect module, 2.4 Visualization, 2.5 Convert to CSV file); 3. Evaluation (3.1 Table search, 3.2 Table extraction); 4. Discussion; 5. Conclusion

  5. 1. Introduction. Significance: once a company can acquire the economic data of most companies and quickly access market information, it can make correct decisions and occupy a favorable position in the market. Problems: PDF (Portable Document Format) is a document format designed to prevent modification (in fact, PDF files are difficult to modify). Results: in our test of 32 PDF reports, the program achieved a 93% recall rate.

  6. 2. Our process: 2.1 PDF2CSV. Tabula can extract data tables from PDF files and save them in CSV format so that the data can be easily accessed and reused. It is a free, open-source program. The Python library xlwt lets our Python program handle Excel spreadsheets. After you pass the correct parameters to these tools, you can transform the entire PDF file into a CSV file. In the CSV file, text and data are separated by commas and line breaks.
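As a rough illustration of working with comma- and line-break-separated output, the sketch below reads CSV text with Python's standard `csv` module and labels each cell as text or number. The sample text, `read_rows`, and `is_number` are hypothetical names and data chosen for illustration; they are not taken from the presentation.

```python
import csv
import io

# Hypothetical CSV text, standing in for the output of a PDF-to-CSV tool.
SAMPLE_CSV = 'Item,2016,2015\nRevenue,"1,200",950\nNet profit,300,210\n'


def is_number(cell: str) -> bool:
    """Return True if the cell parses as a number once thousands separators are removed."""
    try:
        float(cell.replace(",", ""))
        return True
    except ValueError:
        return False


def read_rows(csv_text: str) -> list[list[str]]:
    """Split CSV text into rows (line breaks) and cells (commas)."""
    return list(csv.reader(io.StringIO(csv_text)))


rows = read_rows(SAMPLE_CSV)
for row in rows:
    # Label every cell so text and data can be told apart later.
    print([("num" if is_number(cell) else "text", cell) for cell in row])
```

Quoted cells such as `"1,200"` stay in one piece because the `csv` module handles the quoting, which a plain `split(",")` would not.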

  7. 2. Our process: 2.1 PDF2CSV

  8. 2. Our process: 2.2 Preliminary extraction. We use specific known table headers as templates, extract the corresponding part of the data, and store it in our original database. This collects the training data we need for fuzzy matching. Features: text content, numeric type, numeric size, distribution of text and numbers, text density, and so on. We take 27 known headers from 5 different reports as our initial samples.
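A minimal sketch of the two ideas on this slide, matching a cell against known headers and computing simple row features, might look as follows. The header list, the feature names, and the use of `difflib.SequenceMatcher` for fuzzy matching are assumptions made for illustration; the slides do not say which matching method or exact feature definitions were used.

```python
from difflib import SequenceMatcher

# Hypothetical known headers; the presentation used 27 headers from 5 reports.
KNOWN_HEADERS = ["Total assets", "Net profit", "Operating revenue"]


def is_number(cell: str) -> bool:
    """Return True if the cell parses as a number once thousands separators are removed."""
    try:
        float(cell.replace(",", ""))
        return True
    except ValueError:
        return False


def best_header_match(cell: str, known=KNOWN_HEADERS):
    """Return the known header most similar to the cell, with its similarity in [0, 1]."""
    scored = [(SequenceMatcher(None, cell.lower(), h.lower()).ratio(), h) for h in known]
    score, header = max(scored)
    return header, score


def row_features(row):
    """Compute a few illustrative features for one CSV row."""
    cells = [c.strip() for c in row]
    filled = [c for c in cells if c]
    numeric = [c for c in filled if is_number(c)]
    return {
        # Share of non-empty cells that are numbers.
        "numeric_ratio": len(numeric) / len(filled) if filled else 0.0,
        # Average number of characters per cell, empty cells included.
        "text_density": sum(len(c) for c in filled) / len(cells) if cells else 0.0,
        # Share of cells that are not empty.
        "fill_ratio": len(filled) / len(cells) if cells else 0.0,
    }


header, score = best_header_match("Net profits")
features = row_features(["Net profit", "300", "210"])
print(header, round(score, 3), features)
```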

  9. 2. Our process: 2.2 Preliminary extraction

  10. 2. Our process: 2.3 Autodetect module. We find that the appearance of tables is related to the features mentioned in the previous part. We match the model trained on the existing data in the database against the contents of the CSV file: we compare their feature sets and select the parts with the higher matching score as tables. In addition, we use the distribution of ruling lines in the PDF page graphics as further evidence that a table is present.
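The comparison of feature sets could be sketched roughly as below: each row is reduced to a feature vector, compared with a template vector built from known tables, and runs of consecutive matching rows are reported as candidate tables. Cosine similarity, the threshold of 0.9, and all function names and numbers here are assumptions for illustration; the slides do not specify the similarity measure or the trained model.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def detect_table_rows(row_vectors, template, threshold=0.9):
    """Return indices of rows whose features are close enough to the template."""
    return [i for i, v in enumerate(row_vectors) if cosine(v, template) >= threshold]


def group_consecutive(indices):
    """Merge consecutive row indices into (first, last) ranges, one per candidate table."""
    groups = []
    for i in indices:
        if groups and i == groups[-1][-1] + 1:
            groups[-1].append(i)
        else:
            groups.append([i])
    return [(g[0], g[-1]) for g in groups]


# Hypothetical vectors of (numeric_ratio, fill_ratio), one per CSV row.
template = [0.67, 1.0]
row_vectors = [[0.0, 0.2], [0.6, 1.0], [0.7, 1.0], [0.0, 0.1], [0.66, 1.0]]

matched = detect_table_rows(row_vectors, template)
regions = group_consecutive(matched)
print(matched, regions)
```

Cosine similarity ignores vector magnitude, so a real detector would likely combine it with other evidence, such as the line distribution mentioned on the slide.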

  11. 2. Our process: 2.4 Visualization; 2.5 Convert to CSV file

  12. 3. Evaluation: Result. In the test, we used 32 PDF files. We obtained a recall rate of about 93%; such a high recall rate indicates that our program is of practical value. Table search. Table extraction.
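For reference, recall is the share of tables that should have been found and actually were: true positives divided by true positives plus false negatives. The sketch below computes it from two sets of table identifiers. The identifiers and counts are hypothetical and are not the data behind the 93% figure.

```python
def recall(true_positives: int, false_negatives: int) -> float:
    """Recall = TP / (TP + FN)."""
    total = true_positives + false_negatives
    if total == 0:
        raise ValueError("recall is undefined when there are no expected items")
    return true_positives / total


# Hypothetical evaluation: tables that exist in the reports vs. tables the program found.
expected = {"t1", "t2", "t3", "t4", "t5"}
found = {"t1", "t2", "t4", "t5", "x9"}

tp = len(expected & found)  # expected tables that were found
fn = len(expected - found)  # expected tables that were missed
r = recall(tp, fn)
print(f"recall = {r:.2f}")
```

Note that the spurious result `x9` does not lower recall; it would only show up in a precision measurement, which the slides do not report.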

  13. 3. Evaluation: 3.1 Table search

  14. 3. Evaluation: 3.2 Table extraction

  15. 3. Evaluation: 3.2 Table extraction

  16. 4 & 5. Discussion & Conclusion. 4. Discussion: in our project we cannot extract the title of a table, because the title can appear on any side of the table. The next step is therefore to locate table titles intelligently. 5. Conclusion: with our own intelligent analysis and extraction program, we achieve a very high recognition and extraction rate (a 93% recall rate in our test of 32 reports).

  17. Network PDF file unstructured information extraction. Thank you! Presented by Piracle
