overview
play

Overview 1. A step back: a look at the data we had to work with 2. - PowerPoint PPT Presentation

1/19 Overview 1. A step back: a look at the data we had to work with 2. Requirements for the application 3. Basic choices made for data storage 4. Specific technologies we use 5. A detailed example: storage of the tagging 6. A note on


  1. 1/19 Overview 1. A step back: a look at the data we had to work with 2. Requirements for the application 3. Basic choices made for data storage 4. Specific technologies we use 5. A detailed example: storage of the tagging 6. A note on long-term archiving

  2. 2/19 Data we had to work with • Sound files (.aiff) of the fieldwork interviews • Transcripts (plain text files) made from these sound files with the PRAAT transcription program (www.praat.org)

  3. 3/19 Requirements for the application • Text of transcripts must be searcheable for words, arbitrary substrings, and with wildcards • It must be possible to assign POS tagging to individual words • It must be possible to lemmatize individual words • It must be possible for one word to have more than one tagging/annotation associated with it • It must be possible to draw maps from the results of searches in the data • It must be possible to hear the original audio of text fragments which are the results of searches

  4. 4/19 So how should the data be stored? • … in such a way that textual searches are fast enough • … in such a way that it is possible to uniquely identify the smallest parts of the text, i.e. individual words, while preserving their context at all levels: sentences, question-answer pairs, and complete interviews • It should also be easy to link to external information (for example, geographical coordinates of the localities of the interviews) • The most flexible and economic way of achieving this seemed to us to use a relational database for storage of the transcripts.

  5. 5/19 From PRAAT transcripts to a relational database: a quick overview

  6. 6/19 transcription in PRAAT waveform of sound transcription of sound: separate speakers in separate tiers, divided into intervals

  7. 7/19 • PRAAT saves its transcripts as plain text files, containing, for each tier, intervals with transcribed text and start and end times, assigned by the transcriber • scripts were written to split the PRAAT files into individual words which were read into the database • the basic table structure for the transcripts looks like this: interval word file (interview) tier (speaker) interval_id word_id file_id tier_id tier_id interval_id file_id start time locality end time

  8. 8/19 Specific technologies we used • We want to make our data accessible in an open standard format and we preferably use open source software • On the storage end, data should be stored in such a way that it is always possible to get it out again of whatever storage mechanism one chooses. The storage mechanism should not be tied to a specific operating system. Since we wanted a relational database backend for speed and flexibility reasons, the best choice in this regard would be a non-platform-specific open source database system. • The database we use at the Meertens Institute is MySQL , version 4.1.11 at the time of writing. In the database for the SAND, we use InnoDB tables so that we can use foreign key constraints and transactions to guarantee data consistency.

  9. 9/19 Specific technologies (continued) • The application is written in PHP (version 4.3.6 at this moment, upgrade to 5.0 imminent). PHP is used to translate search questions from web pages into SQL queries for the database, and SQL results from the database back into HTML pages. • Within the PHP layer, programming logic and presentation (HTML) are separated by means of the Smarty templating system: smarty.php.net. • The end result: a web application, delivering web pages in standard XHTML: maximum accessability for end users. The only software needed is a modern web browser, on any operating system.

  10. 10/19 Specific technologies (continued): maps • The maps are SVG (Scalable Vector Graphics) images. SVG is an open standard (SVG images are actually XML files). • We have a base map of the Dutch language area in the so-called “Rijksdriehoeksstelsel” (National Triangulation System), and a database with about 4500 localities and their coordinates in the same system. Searches in the SAND database always return localities, so it is straightforward to display search results as symbols on the map. • In addition to SVG images, maps can be displayed as JPEG images (generated by the PHP GD library).

  11. 11/19 Specific technologies (continued): sound • The original (.aiff) sound files are converted to QuickTime movies, and served by QuickTime Streaming Server. With this setup it is possible to play a small slice of the file with a specific starting and ending time. • Since searches in the SAND database always return one or more intervals (from the original transcripts), and intervals have start- and endtimes which are also saved in the database, it is straightforward to construct an HTML <object> with “starttime” and “endtime” parameters which plays, using the QuickTime plugin, just that specific slice of the sound file which corresponds with the found interval. • We use Apple’s QuickTime server, but there is also an open source version called Darwin Streaming Server. It can be downloaded from www.opensource.apple.com.

  12. 12/19 Use of a relational database: a detailed example storage of the tagging The text of interval id 3817 (from the municipality of Aalter) is: k weet datij zal moete were keren . (“I know that he will have to return”.) The word moete is word id 12345, and this word has the following tag: V(-e,eind2,inf,mod) Main category: V = verb; with four attributes defined: -e (inflexion is -e); eind2 (“end2”, it is the second highest verb, hierarchically, in the verb cluster); inf (it is a verbally used infinitive), mod (it is a modal auxiliary verb).

  13. 13/19 storage of the tagging (continued) This tag, V(-e,eind2,inf,mod) , is broken up into its constituent parts in the database and linked to the word id in two database tables: • sandtag_toegekend_woordsoort (“sandtag_assigned_category”) • sandtag_toegekend_attribuut (“sandtag_assigned_attribute”) If we look up word id 12345 in these tables, we find this:

  14. 14/19 storage of the tagging (category) sandtag_toegekend_woordsoort +----------+---------------+ | woord_id | woordsoort_id | +----------+---------------+ | 12345 | 4 | +----------+---------------+ Category is assigned to word 12345 by means of an id number (4) which refers to another table ( sand_woordsoort , “sand_category”): a small table which contains all word categories used in the SAND tag set. sandtag_woordsoort +---------------+------------+-----------------+ | woordsoort_id | woordsoort | woordsoort_kort | +---------------+------------+-----------------+ | 4 | V-infin | V | +---------------+------------+-----------------+ Category id 4 is ‘V-infin’ (infinite verb), the short form of which, for readability purposes, is just ‘V’.

  15. 15/19 storage of the tagging (attributes) sandtag_toegekend_attribuut +----------+--------------+-----------+ | woord_id | attribuut_id | waarde_id | +----------+--------------+-----------+ | 12345 | 1 | 3 | | 12345 | 2 | 18 | | 12345 | 14 | 98 | | 12345 | 11 | 57 | +----------+--------------+-----------+ The attributes are stored as attribute-value pairs, as usual with id numbers. These numbers refer to two tables: sandtag_attribuut (“sandtag_attribute”) and sandtag_waarde (“sandtag_value”). If we look up id numbers found above in these tables, we find: sandtag_attribuut sandtag_waarde +--------------+-----------+ +-----------+--------+ | attribuut_id | attribuut | | waarde_id | waarde | +--------------+-----------+ +-----------+--------+ | 1 | infl | | 3 | -e | | 2 | pos | | 18 | eind2 | | 14 | infin | | 98 | inf | | 11 | aux | | 57 | mod | +--------------+-----------+ +-----------+--------

  16. 16/19 storage of the tagging (continued) Apart from the three tables mentioned earlier (for categories, attributes and values), there are also two tables which list the legal combinations of categories, attributes and values, whether attributes are optional or required, and the order in which these elements are to be presented. The complete SAND tag set is described in these five tables. The relational database principle of storing data and tagging in separate tables, which are linked by id numbers, makes interesting things possible …

  17. 17/19 flexibility • adding a translation of the tagging system in other languages is possible without disturbing the data or the existing tagging. Just add a column: +--------------+-----------+--------------+ | attribuut_id | attribuut | attribuut_en | +--------------+-----------+--------------+ | 1 | infl | infl | | 2 | pos | pos | | 3 | casus | case | | 4 | pers | pers | | 5 | getal | number | | 6 | genus | gender | | 7 | functie | function | This has already been done for English.

Recommend


More recommend