
Evaluation Benchmarks and Learning Criteria for Discourse-Aware Sentence Representations
Mingda Chen
Joint work with Zewei Chu and Kevin Gimpel

Prior work on evaluation benchmarks
• Focus on capabilities of representations for stand-alone sentences
  • Sentiment analysis
  • Linguistic properties, e.g. verb tense prediction
  • …
• What about the broader context (i.e. discourse) for a sentence?

Our contributions
• An evaluation suite for evaluating discourse knowledge encoded in sentence representations.
• Benchmark and compare several pretrained sentence representations.
• Novel learning criteria for capturing discourse structures.

Discourse Evaluation (DiscoEval)
• Focus on evaluating the role of a sentence in its discourse context.
• 7 task groups, covering multiple domains (e.g. Wikipedia, stories, dialogues, and scientific literature).
• Probing tasks. Pretrained embeddings are kept fixed and we only use simple classifiers.
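As a rough illustration of the probing setup described above, the sketch below trains a small logistic-regression probe on top of fixed vectors. Everything in it is an assumption made for illustration: the toy 2-dimensional vectors stand in for frozen sentence embeddings, and `train_probe` and `predict` are hypothetical helper names, not part of DiscoEval or SentEval. The classifiers used in the original work may differ.

```python
import math


def train_probe(features, labels, epochs=200, lr=0.5):
    """Fit a binary logistic-regression probe with plain SGD.

    The feature vectors are treated as fixed inputs: only the probe's
    weights and bias are updated, never the embeddings themselves.
    """
    dim = len(features[0])
    weights = [0.0] * dim
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            z = bias + sum(w * xi for w, xi in zip(weights, x))
            p = 1.0 / (1.0 + math.exp(-z))  # predicted probability of class 1
            grad = p - y                     # gradient of the log loss w.r.t. z
            weights = [w - lr * grad * xi for w, xi in zip(weights, x)]
            bias -= lr * grad
    return weights, bias


def predict(weights, bias, x):
    """Return the predicted class (0 or 1) for one feature vector."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1 if z >= 0 else 0


# Toy stand-ins for frozen sentence embeddings (assumed, not real data).
frozen_embeddings = [[1.0, 0.2], [0.9, 0.1], [0.1, 0.8], [0.2, 1.0]]
labels = [0, 0, 1, 1]

weights, bias = train_probe(frozen_embeddings, labels)
predictions = [predict(weights, bias, x) for x in frozen_embeddings]
```

Keeping the probe this simple is the point of the design: if a linear classifier can recover a label from fixed embeddings, the information is plausibly encoded in the embeddings rather than learned by the classifier.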

Discourse Evaluation (DiscoEval)
• In general, we follow SentEval and use the following input for tasks involving pairs of sentences $x_1, x_2$:

$$[\,x_1,\; x_2,\; x_1 \odot x_2,\; |x_1 - x_2|\,]$$
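The pair input above can be sketched in a few lines of Python. This is a minimal illustration that assumes both sentence vectors have the same dimension and reads the two operators as the element-wise product and the element-wise absolute difference, following the SentEval convention; `pair_features` is a hypothetical helper name, not an API from either toolkit.

```python
def pair_features(x1, x2):
    """Concatenate x1, x2, their element-wise product, and |x1 - x2|."""
    if len(x1) != len(x2):
        raise ValueError("both sentence vectors must have the same dimension")
    product = [a * b for a, b in zip(x1, x2)]
    abs_diff = [abs(a - b) for a, b in zip(x1, x2)]
    return list(x1) + list(x2) + product + abs_diff


# Toy 2-dimensional vectors standing in for sentence embeddings.
features = pair_features([1.0, -2.0], [3.0, 0.5])
# features == [1.0, -2.0, 3.0, 0.5, 3.0, -1.0, 2.0, 2.5]
```

For vectors of dimension d, the classifier input therefore has dimension 4d.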

What is a discourse?
• A discourse is a coherent, structured group of sentences that acts as a fundamental type of structure in natural language.

What is a discourse?
• Linearly-structured, e.g. sentence ordering.
• The timing of introducing entities.
• Tree-structured, e.g. RST discourse tree.

Example:
1. The European Community's consumer price index rose a provisional 0.6% in September from August
2. and was up 5.3% from September 1988,
3. according to Eurostat, the EC's statistical agency.

[Figure: RST discourse tree over units 1, 2, 3 with relation labels NS-Attribution and NN-Comparison.]
• “N” represents “nucleus”, containing basic information for the relation.
• “S” represents “satellite”, containing additional information about the nucleus.

Discourse Relations
• Two human-annotated datasets: Penn Discourse Treebank (PDTB) and RST Discourse Treebank (RST-DT).
• PDTB provides discourse markers for adjacent sentences, whereas RST-DT offers document-level discourse trees.

Discourse Relations – PDTB
• Use a pair of sentences to predict discourse relations.
• We focus on predicting implicit relations (PDTB-I) and explicit relations (PDTB-E).

PDTB-E
1. In any case, the brokerage firms are clearly moving faster to create new ads than they did in the fall of 1987.
2. But it remains to be seen whether their ads will be any more effective.
Label: Comparison.Contrast

PDTB-I
1. “A lot of investor confidence comes from the fact that they can speak to us,” he says.
2. [so] “To maintain that dialogue is absolutely crucial.”
Label: Contingency.Cause

Discourse Relations – RST-DT
• Text is segmented into basic units, elementary discourse units (EDUs), upon which a discourse tree is built recursively.
• We use 18 fine-grained relations.

Example EDUs:
1. The European Community's consumer price index rose a provisional 0.6% in September from August
2. and was up 5.3% from September 1988,
3. according to Eurostat, the EC's statistical agency.

[Figure: RST discourse tree over EDUs 1, 2, 3 with relation labels NS-Attribution and NN-Comparison.]
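To make the recursive structure concrete, here is a small sketch of a binary discourse tree built over the three example EDUs. The nesting used below, with NN-Comparison joining EDUs 1 and 2 and NS-Attribution attaching EDU 3 to that span, is an assumed reading of the flattened figure, and `RSTNode`, `edus`, and `relations` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class RSTNode:
    """A node in a binary discourse tree: a leaf EDU or a labeled relation."""
    label: str                          # "EDU" for leaves, relation label otherwise
    text: str = ""                      # EDU text (leaves only)
    left: Optional["RSTNode"] = None
    right: Optional["RSTNode"] = None

    def is_leaf(self) -> bool:
        return self.left is None and self.right is None


def edus(node: RSTNode) -> List[str]:
    """Collect EDU texts in left-to-right order."""
    if node.is_leaf():
        return [node.text]
    return edus(node.left) + edus(node.right)


def relations(node: RSTNode) -> List[Tuple[str, int]]:
    """List (relation label, number of EDUs spanned), children before parents."""
    if node.is_leaf():
        return []
    here = [(node.label, len(edus(node)))]
    return relations(node.left) + relations(node.right) + here


edu1 = RSTNode("EDU", "The European Community's consumer price index rose "
                      "a provisional 0.6% in September from August")
edu2 = RSTNode("EDU", "and was up 5.3% from September 1988,")
edu3 = RSTNode("EDU", "according to Eurostat, the EC's statistical agency.")

# Assumed nesting (see the note above): the tree is built bottom-up.
comparison = RSTNode("NN-Comparison", left=edu1, right=edu2)
tree = RSTNode("NS-Attribution", left=comparison, right=edu3)

spans = relations(tree)
# spans == [("NN-Comparison", 2), ("NS-Attribution", 3)]
```

Under this reading, each internal node gives one labeled span, so a relation classifier could be fed the representations of a node's left and right children.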
