Leon Derczynski - Supervised by Dr Amanda Sharkey - 2006
Keyword matching alone is a poor guide to relevance: a document about low-price movies contains the words "cheap film", but is not useful. Two further problems stand out:
- Little human feedback is gathered on what makes a document relevant; the process is mainly automated.
- The algorithms that decide relevancy are extremely complex and need to be built from scratch. In 2003, Google used over 120 independent variables to sort results.
Is it possible to teach a system how to identify relevant documents without defining any explicit rules?
To teach a system to distinguish relevant documents from irrelevant ones, a large amount of training data is required: a wide range of documents and queries is needed to give a realistic model. Early work in indexing documents, dating back to the 1960s, provides collections of sample queries matched to relevant document content. Cyril Cleverdon pioneered work on organising information and creating indexes. He led the creation of a 1400-strong set of aerospace documents, accompanied by hundreds of natural language queries; a list of matching documents was also manually created for each query. This set of documents, queries and relevance judgements became known as the Cranfield collection.
Searching all documents for a given query is a very time-consuming process. Documents can instead be indexed according to the words they contain, which shrinks the search space considerably.

Document A: "The aerodynamic properties of wing surfaces under pressure change according to temperature. The amount of pressure will also risk deforming the wing, thus moving any heat spots and adjusting flow."

Document B: "High pressure water hoses are a fantastic tool for cleaning your garden. They also have uses in farming, where cattle enjoy a high hygiene standard due to regular washdowns."

Index (word -> documents): deforming -> A; pressure -> A, B; properties -> A; surfaces -> A; standard -> B; washdowns -> B.

This allows documents containing keywords to be rapidly identified: only one lookup needs to be performed for each word in the query!
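As an illustration only (not code from the project), a minimal inverted index can be sketched in Python as below; the documents are abbreviated versions of Document A and B above, and the whitespace tokeniser is a deliberately crude assumption:

```python
from collections import defaultdict

# Toy stand-ins for Document A and Document B above.
documents = {
    "A": "The aerodynamic properties of wing surfaces under pressure risk deforming the wing.",
    "B": "High pressure water hoses meet a hygiene standard due to regular washdowns.",
}

def build_index(docs):
    """Map each word to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word.strip(".,!?")].add(doc_id)
    return index

def search(index, query):
    """One index lookup per query word; keep documents matching every word."""
    results = None
    for word in query.lower().split():
        hits = index.get(word, set())
        results = hits if results is None else results & hits
    return results or set()

index = build_index(documents)
print(search(index, "pressure"))       # {'A', 'B'}
print(search(index, "wing pressure"))  # {'A'}
```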
Identify document features
A set of statistics can be used to describe a document. They can be about the document itself, or about a particular word in the document. These numeric descriptions then become training examples for a machine learning algorithm. For example, two documents can be assessed against a query such as: "what chemical kinetic system is applicable to hypersonic aerodynamic problems". A set of statistics describing each document relative to the query can then be derived: independent stats (about the document itself), overall keyword info, and localised keyword info. A human judgement from the reference collection labels each example as positive or negative.
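A minimal Python sketch of this step, assuming naive whitespace and full-stop tokenisation; the statistics below (keyword frequency, density, first position, sentence counts) are illustrative stand-ins chosen to echo the attributes named in the decision tree that follows, not the project's actual feature set:

```python
def keyword_features(document, keyword):
    """Derive numeric statistics describing one keyword within one document.

    These become one training example; the reference collection's human
    judgement supplies the positive/negative label.
    """
    words = document.lower().split()
    sentences = [s for s in document.lower().split(".") if s.strip()]
    positions = [i for i, w in enumerate(words) if keyword in w]

    return {
        # Independent stats: about the document itself
        "doc_length": len(words),
        "sentence_count": len(sentences),
        # Overall keyword info: document-wide statistics
        "keyword_frequency": len(positions),
        "keyword_density": len(positions) / max(len(words), 1),
        # Localised keyword info: where the keyword appears
        "first_position": positions[0] / len(words) if positions else 1.0,
        "sentences_with_keyword": sum(keyword in s for s in sentences),
    }

example = keyword_features(
    "Hypersonic flow is governed by chemical kinetics. Kinetic models vary.",
    "kinetic",
)
print(example)
```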
Decision trees are acyclic graphs that have a decision at each branch, based on an attribute of an example, and end at leaves which classify a document as relevant or not relevant.

[Figure: a learned decision tree. Its internal nodes test attributes such as the first position of the keyword, the ratio of sentences missing the keyword to those containing it, the number of sentences in the document, absolute average word length, the proportion of paragraphs containing the keyword, keyword density and frequency, and the mean position of the keyword within sentences and paragraphs; each leaf classifies the document as Positive or Negative.]

A C4.5 decision tree, produced in an effort to emulate the decisions of the Cranfield judges. The full version of this tree attained an 80.4% accuracy rate.
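The project used C4.5; as a rough modern stand-in (not the project's code), scikit-learn's DecisionTreeClassifier implements CART rather than C4.5, but with the entropy criterion it also splits on information gain. The feature names and training rows below are invented placeholders:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

feature_names = [
    "first_position_of_keyword",
    "keyword_density",
    "sentence_count",
]

# Each row describes one (document, query) pair via statistics like those
# above; each label is a human relevance judgement from the collection.
X = [
    [0.05, 0.020, 40],  # keyword appears early and often
    [0.80, 0.001, 12],  # keyword appears late and rarely
    [0.10, 0.015, 55],
    [0.90, 0.002, 8],
]
y = [1, 0, 1, 0]  # 1 = relevant, 0 = irrelevant

tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)
print(export_text(tree, feature_names=feature_names))
```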
Neural nets
Neural nets have a set of nodes, each of which assigns a weight to each of its inputs. The inputs are coupled with document attributes, and when a node's weighted input passes a certain internal value, its output changes. Backpropagation is used to help converge on a net that solves the problem.

K-Nearest Neighbour
K-Nearest Neighbour plots all training data as points in multi-dimensional space, with one dimension for each attribute. New examples are classified by working out the weighted average classification of the k nearest training examples.

[Figure: two documents plotted as points near a query in attribute space.]
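A minimal Python sketch of distance-weighted k-NN classification as described above; the training points, feature meanings, and choice of k are all illustrative assumptions:

```python
import math

def knn_classify(training, example, k=3):
    """Weighted k-nearest-neighbour vote.

    `training` is a list of (attribute_vector, label) pairs; closer
    neighbours get a larger say via inverse-distance weighting.
    """
    def distance(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    neighbours = sorted(training, key=lambda t: distance(t[0], example))[:k]
    votes = {}
    for attrs, label in neighbours:
        weight = 1.0 / (distance(attrs, example) + 1e-9)  # avoid div by zero
        votes[label] = votes.get(label, 0.0) + weight
    return max(votes, key=votes.get)

# Hypothetical points in 2-D feature space (e.g. keyword density vs. first
# position of keyword), labelled by human relevance judgements.
training = [([0.020, 0.1], "relevant"), ([0.001, 0.9], "irrelevant"),
            ([0.015, 0.2], "relevant"), ([0.002, 0.8], "irrelevant")]
print(knn_classify(training, [0.018, 0.15], k=3))  # -> "relevant"
```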
The task is possible, with all of the algorithms managing to learn to identify a good proportion of relevant documents.

[Figure: accuracy difference plotted against the number of negative examples for four training sets.]

[Figure: MED accuracy for three runs (acc1, acc2, acc3) plotted against the number of hidden units, from 1 to 20.]

Not every document suggested as relevant by human judges could be matched by the system: sometimes words were used that did not occur in the document itself. Adding synonym lookup or a thesaurus should help.