Finding datasets / resources: LING575 Analyzing Neural Language Models

Finding datasets / resources
LING575 Analyzing Neural Language Models
Shane Steinert-Threlkeld
01/30/2020

Other sources of data
● Thanks, Rachel Rudinger, for the presentation on decomp.io!
● Hopefully some of you will do analysis projects using that data
● Now: just a few pointers on finding data for your projects

Roles for data
● You will need data for your analysis project
● One simple case: the data captures linguistic feature X; ask which representations in which models can capture that feature
  ● (Can be good to use more than one dataset here if possible)
● More complicated: generate your own data
  ● Because you hypothesize that model X will struggle with it (“adversarial”)
  ● To carefully control various linguistic variables
  ● Can borrow / take inspiration from / build upon examples from linguistics papers
  ● Examples: Marvin and Linzen 2018, Warstadt et al. 2019, McCoy et al. 2019
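One way to generate controlled data of this kind is with templates that vary a single linguistic variable at a time. The sketch below builds subject-verb agreement minimal pairs, loosely in the spirit of the targeted evaluations cited above; it is not taken from any of those papers. The lexicon, the `make_minimal_pairs` name, and the output fields are all hypothetical choices made for illustration.

```python
from itertools import product

# Hypothetical toy lexicon; a real project would use a larger, vetted vocabulary.
SUBJECTS = {
    "singular": ["the author", "the pilot"],
    "plural": ["the authors", "the pilots"],
}
# Intervening phrases whose noun may or may not match the subject in number.
ATTRACTORS = ["near the guard", "near the guards"]
VERBS = {"singular": "laughs", "plural": "laugh"}


def make_minimal_pairs():
    """Return dicts pairing a grammatical sentence with its ungrammatical twin.

    The two sentences in each pair differ only in the number of the verb,
    so any difference in model behavior can be attributed to agreement.
    """
    pairs = []
    for number, subjects in SUBJECTS.items():
        other = "plural" if number == "singular" else "singular"
        for subject, attractor in product(subjects, ATTRACTORS):
            prefix = f"{subject} {attractor}"
            pairs.append({
                "subject_number": number,
                "attractor": attractor,
                "good": f"{prefix} {VERBS[number]}",
                "bad": f"{prefix} {VERBS[other]}",
            })
    return pairs


pairs = make_minimal_pairs()
for pair in pairs[:2]:
    print(pair["good"], "|", pair["bad"])
```

Recording the controlled variables (here `subject_number` and `attractor`) alongside each pair makes it easy to break results down by condition later.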

What makes a good dataset?
● Can depend on the project; try to find/build data that’s motivated by your question/hypothesis
● Well-designed:
  ● Clear annotation guidelines that yield consistent results
  ● Targets the intended task
● Relatively large (somewhat less important for analysis projects)
● Precedent in the literature
  ● If your project involves phenomena that are well-studied in NLP, use (and/or compare with) existing datasets!
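The slides do not say how to check whether annotation guidelines "yield consistent results". One common option, assumed here rather than prescribed by the course, is to have two annotators label the same items and compute Cohen's kappa, which corrects raw agreement for agreement expected by chance. The function name and the toy labels below are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: probability both pick the same label independently,
    # estimated from each annotator's own label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical toy annotations for four items.
annotator_1 = ["yes", "yes", "no", "no"]
annotator_2 = ["yes", "no", "no", "no"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(kappa)
```

Values near 1 indicate strong agreement and values near 0 indicate agreement no better than chance; what counts as acceptable depends on the task.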

LDC; Treehouse DB
● The Linguistic Data Consortium has many excellent datasets (think Penn Treebank)
● Many of those, and lots more, are pre-installed on patas
● For a complete directory, see https://cldb.ling.washington.edu/

SemEval
● International Workshop on Semantic Evaluation
● Each year, a shared task (or tasks)
  ● Multiple teams build models for one task
  ● Data is well-designed to be consumable by teams
● 2020 (links to older): http://alt.qcri.org/semeval2020/index.php?id=tasks
● Not every task will be appropriate, but you can search for your keywords + “semeval” and see if there’s been a task in the past
● NB: there are other shared tasks, not just SemEval, so you can also try keywords + “shared task”

Some general resources
● Brand new! Google Dataset Search
  ● https://datasetsearch.research.google.com/
  ● Personally some mixed results so far, but could be very useful
● The Big Bad NLP Database
  ● https://quantumstat.com/dataset/dataset.html
  ● New, has large/standard datasets, but fairly small coverage (low recall)

Special Topics Presentations

Presentations
● Each group will be responsible for leading an ~45 minute discussion on a special topic of their choosing
● For example:
  ● A deep dive into one or two papers that are important to your group’s project
  ● Survey of a method / model / dataset that you are using that was not covered in the earlier lectures
● Present material, but also lead/guide a discussion, to make these sessions as much seminar-style as possible
● You don’t need to have all the answers about everything that could possibly come up

Logistics
● Sign up here:
  ● https://docs.google.com/spreadsheets/d/1RNQ1PyMXylQ5ouzXFlA6ldUsuSsELRr_1JvQlRibo5A/edit?usp=sharing
● For now: pick a time slot. You only need to fill in the first two columns.
  ● NB: there are 9 groups, so one week will have three presentations
● One full week before your presentation:
  ● Fill in topic, and list of reading(s) / resources
  ● Email me as well
  ● I will post to the website so that everyone can read in advance
