Leon Derczynski - Supervised by Dr Amanda Sharkey - 2006
Keyword matching alone is a poor guide to relevance: a document about low-price movies contains the words "cheap film", but is not useful. Two further problems stand out:
- Little human feedback is gathered on what makes a document relevant; the process is mainly automated.
- The algorithms that decide relevancy are extremely complex and need to be built from scratch. In 2003, Google used over 120 independent variables to sort results.
Is it possible to teach a system how to identify relevant documents without defining any explicit rules?
To teach a system to distinguish relevant documents from irrelevant ones, a large amount of training data is required: a wide range of documents and queries is needed to give a realistic model. Early work in indexing documents, dating back to the 1960s, provides collections of sample queries matched to relevant document content. Cyril Cleverdon pioneered work on organising information and creating indexes. He led the creation of a 1400-strong set of aerospace documents, accompanied by hundreds of natural language queries; a list of matching documents was also manually created for each query. This set of documents, queries and relevance judgements became known as the Cranfield collection.
Searching all documents for a given query is a very time-consuming process. Documents can instead be indexed according to the words they contain, which shrinks the search space considerably.

Document A: "The aerodynamic properties of wing surfaces under pressure change according to temperature. The amount of pressure will also risk deforming the wing, thus moving any heat spots and adjusting flow."

Document B: "High pressure water hoses are a fantastic tool for cleaning your garden. They also have uses in farming, where cattle enjoy a high hygiene standard due to regular washdowns."

Index (word -> documents): deforming -> A; pressure -> A, B; properties -> A; surfaces -> A; standard -> B; washdowns -> B.

This allows documents containing keywords to be rapidly identified: only one lookup needs to be performed for each word in the query!
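As an illustration only (not code from the project), a minimal inverted index can be sketched in Python as below; the documents are abbreviated versions of Document A and B above, and the whitespace tokeniser is a deliberately crude assumption:

```python
from collections import defaultdict

# Toy stand-ins for Document A and Document B above.
documents = {
    "A": "The aerodynamic properties of wing surfaces under pressure risk deforming the wing.",
    "B": "High pressure water hoses meet a hygiene standard due to regular washdowns.",
}

def build_index(docs):
    """Map each word to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word.strip(".,!?")].add(doc_id)
    return index

def search(index, query):
    """One index lookup per query word; keep documents matching every word."""
    results = None
    for word in query.lower().split():
        hits = index.get(word, set())
        results = hits if results is None else results & hits
    return results or set()

index = build_index(documents)
print(search(index, "pressure"))       # {'A', 'B'}
print(search(index, "wing pressure"))  # {'A'}
```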
Identify document features
A set of statistics can be used to describe a document. They can be about the document itself, or about a particular word in the document. These numeric descriptions then become training examples for a machine learning algorithm. For example, two documents can be assessed against a query such as: "what chemical kinetic system is applicable to hypersonic aerodynamic problems". A set of statistics describing each document relative to the query can then be derived: independent stats (about the document itself), overall keyword info, and localised keyword info. A human judgement from the reference collection labels each example as positive or negative.
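A minimal Python sketch of this step, assuming naive whitespace and full-stop tokenisation; the statistics below (keyword frequency, density, first position, sentence counts) are illustrative stand-ins chosen to echo the attributes named in the decision tree that follows, not the project's actual feature set:

```python
def keyword_features(document, keyword):
    """Derive numeric statistics describing one keyword within one document.

    These become one training example; the reference collection's human
    judgement supplies the positive/negative label.
    """
    words = document.lower().split()
    sentences = [s for s in document.lower().split(".") if s.strip()]
    positions = [i for i, w in enumerate(words) if keyword in w]

    return {
        # Independent stats: about the document itself
        "doc_length": len(words),
        "sentence_count": len(sentences),
        # Overall keyword info: document-wide statistics
        "keyword_frequency": len(positions),
        "keyword_density": len(positions) / max(len(words), 1),
        # Localised keyword info: where the keyword appears
        "first_position": positions[0] / len(words) if positions else 1.0,
        "sentences_with_keyword": sum(keyword in s for s in sentences),
    }

example = keyword_features(
    "Hypersonic flow is governed by chemical kinetics. Kinetic models vary.",
    "kinetic",
)
print(example)
```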
Decision trees are acyclic graphs that have a decision at each branch, based on an attribute of an example, and end at leaves which classify a document as relevant or not relevant.

[Figure: a learned decision tree. Its internal nodes test attributes such as the first position of the keyword, the ratio of sentences missing the keyword to those containing it, the number of sentences in the document, absolute average word length, the proportion of paragraphs containing the keyword, keyword density and frequency, and the mean position of the keyword within sentences and paragraphs; each leaf classifies the document as Positive or Negative.]

A C4.5 decision tree, produced in an effort to emulate the decisions of the Cranfield judges. The full version of this tree attained an 80.4% accuracy rate.
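The project used C4.5; as a rough modern stand-in (not the project's code), scikit-learn's DecisionTreeClassifier implements CART rather than C4.5, but with the entropy criterion it also splits on information gain. The feature names and training rows below are invented placeholders:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

feature_names = [
    "first_position_of_keyword",
    "keyword_density",
    "sentence_count",
]

# Each row describes one (document, query) pair via statistics like those
# above; each label is a human relevance judgement from the collection.
X = [
    [0.05, 0.020, 40],  # keyword appears early and often
    [0.80, 0.001, 12],  # keyword appears late and rarely
    [0.10, 0.015, 55],
    [0.90, 0.002, 8],
]
y = [1, 0, 1, 0]  # 1 = relevant, 0 = irrelevant

tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)
print(export_text(tree, feature_names=feature_names))
```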
Neural nets
Neural nets have a set of nodes, each of which assigns a weight to each of its inputs. The inputs are coupled with document attributes, and when a node's weighted input passes a certain internal value, its output changes. Backpropagation is used to help converge on a net that solves the problem.

K-Nearest Neighbour
K-Nearest Neighbour plots all training data as points in multi-dimensional space, with one dimension for each attribute. New examples are classified by working out the weighted average classification of the k nearest training examples.

[Figure: two documents plotted as points near a query in attribute space.]
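A minimal Python sketch of distance-weighted k-NN classification as described above; the training points, feature meanings, and choice of k are all illustrative assumptions:

```python
import math

def knn_classify(training, example, k=3):
    """Weighted k-nearest-neighbour vote.

    `training` is a list of (attribute_vector, label) pairs; closer
    neighbours get a larger say via inverse-distance weighting.
    """
    def distance(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    neighbours = sorted(training, key=lambda t: distance(t[0], example))[:k]
    votes = {}
    for attrs, label in neighbours:
        weight = 1.0 / (distance(attrs, example) + 1e-9)  # avoid div by zero
        votes[label] = votes.get(label, 0.0) + weight
    return max(votes, key=votes.get)

# Hypothetical points in 2-D feature space (e.g. keyword density vs. first
# position of keyword), labelled by human relevance judgements.
training = [([0.020, 0.1], "relevant"), ([0.001, 0.9], "irrelevant"),
            ([0.015, 0.2], "relevant"), ([0.002, 0.8], "irrelevant")]
print(knn_classify(training, [0.018, 0.15], k=3))  # -> "relevant"
```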
The task is possible, with all of the algorithms managing to learn to identify a good proportion of relevant documents.

[Figure: accuracy difference plotted against the number of negative examples for four training sets.]

[Figure: MED accuracy for three runs (acc1, acc2, acc3) plotted against the number of hidden units, from 1 to 20.]

Not every document suggested as relevant by human judges could be matched by the system: sometimes words were used that did not occur in the document itself. Adding synonym lookup or a thesaurus should help.