Linguistics 384: Language and Computers
Topic 2: Searching
Scott Martin ∗
Dept. of Linguistics, OSU
Winter 2008
∗ The course was created together with Chris Brew, Markus Dickinson and Detmar Meurers.

Outline
◮ Introduction
  ◮ Text
  ◮ Speech
◮ Searching in a Library Catalogue
  ◮ Special characters
  ◮ Operators
◮ Searching the web
  ◮ Operators
  ◮ Improving searching
  ◮ Ranking of results
  ◮ Evaluating search results
◮ Advanced searches with regular expressions
  ◮ Syntax of regular expressions
  ◮ Grep: An example for using regular expressions
  ◮ Text corpora and searching them

Introduction
◮ An astounding number of information resources are available: books, databases, the web, newspapers, . . .
◮ To locate relevant information, we need to be able to search these resources, which often are written texts:
  ◮ Searching in a library catalogue (e.g., using OSCAR)
  ◮ Searching the web (e.g., using Google)
  ◮ Advanced searching in text corpora (e.g., using regular expressions in Opus)

Searching in speech
◮ One might also want to search for speech, e.g., to find a particular sentence spoken in an interview one only has a recording (audio file) of.
◮ With current technology, this is only possible if the interview is transcribed, using the IPA or another writing system.
◮ It is, however, already possible to
  ◮ detect the language of a spoken conversation, e.g., when listening in to a telephone conversation
  ◮ detect a new topic being started in a conversation
◮ In the following, we focus on searching in text.

Searching in a library catalogue
◮ To find articles, books, and other library holdings, a library generally provides a database containing information on its holdings.
◮ OSCAR is the database frontend providing access to the library database at OSU.
◮ OSCAR makes it possible to search for the occurrence of literal strings occurring in the author, title, keywords, call number, etc. associated with an item held by the library.

Basic searching in OSCAR
◮ Literal strings are composed of characters which naturally must be in the same character encoding system (e.g. ASCII, ISO8859-1, UTF-8) as the strings encoded in the database.
◮ For literal strings, OSCAR does not distinguish between upper and lower-case letters (i.e. they aren’t so literal after all)
◮ Adjacent words are searched as a phrase.
  ◮ art therapy
  ◮ vitamin c
◮ In addition to querying literal strings, the query language of OSCAR also supports the use of
  ◮ special characters to abbreviate multiple options
  ◮ special operators for combining two query strings (boolean operators) or modifying the meaning of a single string (unary operators)

OSCAR: Special characters
◮ Use * for 1–5 characters at end or within a word.
  ◮ art* finds arts, artists, artistic
  ◮ gentle*n
◮ Use ** for any number of characters at end of word.
  ◮ art** finds artificial, artillery
◮ Use ? for a single character at end or within a word.
  ◮ gentlem?n
◮ The special * and ? characters must have at least 2 characters to their left. (→ for efficiency reasons)

OSCAR: Literal Strings and Operators (I)
◮ Use "and" or "or" to specify multiple words in any field, any order.
  ◮ art and therapy
  ◮ art or therapy
◮ Use "and not" to exclude words.
  ◮ art and not therapy

OSCAR: Operators (II)
◮ Use parentheses to group words together when using more than one operator.
  ◮ art therapy and not ((music or dance) therapy)
◮ Use "near" to specify words within 10 words of each other, in any order.
  ◮ art near therapy
◮ Use "within n" to specify words within n words of each other. The value of n has no limit.
  ◮ art within 12 therapy
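
The OSCAR special characters and proximity operators can be approximated with ordinary regular expressions. The following is a minimal Python sketch, not OSCAR's actual implementation: the function names `wildcard_to_regex`, `matches` and `within` are hypothetical, "character" is assumed to mean a word character (`\w`), `**` is assumed to also match zero characters, and `within n` is assumed to ignore word order and to count distance as the difference between word positions.

```python
import re


def wildcard_to_regex(query_word: str) -> re.Pattern:
    """Translate one OSCAR-style wildcard word into a regex (illustrative sketch)."""
    # The slides require at least 2 characters to the left of * and ?.
    wildcard_positions = [i for i, ch in enumerate(query_word) if ch in "*?"]
    if wildcard_positions and wildcard_positions[0] < 2:
        raise ValueError("wildcards need at least 2 characters to their left")

    parts = []
    i = 0
    while i < len(query_word):
        if query_word.startswith("**", i):
            parts.append(r"\w*")       # ** : any number of characters (assumed: zero allowed)
            i += 2
        elif query_word[i] == "*":
            parts.append(r"\w{1,5}")   # *  : 1 to 5 characters
            i += 1
        elif query_word[i] == "?":
            parts.append(r"\w")        # ?  : exactly one character
            i += 1
        else:
            parts.append(re.escape(query_word[i]))  # literal character
            i += 1
    # Literal strings are matched without regard to upper/lower case.
    return re.compile("".join(parts), re.IGNORECASE)


def matches(query_word: str, candidate: str) -> bool:
    """True if the whole candidate word is covered by the wildcard pattern."""
    return wildcard_to_regex(query_word).fullmatch(candidate) is not None


def within(text: str, first: str, second: str, n: int = 10) -> bool:
    """True if `first` and `second` occur at most n word positions apart.

    With the default n=10 this mimics `near`; order is ignored (an assumption
    for `within n`, which the slides do not spell out).
    """
    words = re.findall(r"\w+", text.lower())
    first_positions = [i for i, w in enumerate(words) if w == first.lower()]
    second_positions = [i for i, w in enumerate(words) if w == second.lower()]
    return any(abs(a - b) <= n for a in first_positions for b in second_positions)
```

For example, `matches("art*", "artistic")` holds because "istic" is five characters long, while `matches("art*", "artificial")` does not; the longer word needs `art**`. Matching whole words with `fullmatch` is a design choice of this sketch, made so that the 1–5 character limit of `*` is actually enforced.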