Statistical Significance Testing for Natural Language Processing. Rotem Dror

Title Statistical Significance Testing for Natural Language Processing
Author Rotem Dror
Series Synthesis Lectures on Human Language Technologies
isbn 9781681738307




Role Labeling

      Martha Palmer, Daniel Gildea, and Nianwen Xue

      2010

      Spoken Dialogue Systems

      Kristiina Jokinen and Michael McTear

      2009

      Introduction to Chinese Natural Language Processing

      Kam-Fai Wong, Wenjie Li, Ruifeng Xu, and Zheng-sheng Zhang

      2009

      Introduction to Linguistic Annotation and Text Analytics

      Graham Wilcock

      2009

      Dependency Parsing

      Sandra Kübler, Ryan McDonald, and Joakim Nivre

      2009

      Statistical Language Models for Information Retrieval

      ChengXiang Zhai

      2008

      Copyright © 2020 by Morgan & Claypool

      All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means—electronic, mechanical, photocopy, recording, or any other except for brief quotations in printed reviews, without the prior permission of the publisher.

      Statistical Significance Testing for Natural Language Processing

      Rotem Dror, Lotem Peled-Cohen, Segev Shlomov, and Roi Reichart

       www.morganclaypool.com

      ISBN: 9781681737959 paperback

      ISBN: 9781681737966 ebook

      ISBN: 9781681738307 epub

      ISBN: 9781681737973 hardcover

      DOI 10.2200/S00994ED1V01Y202002HLT045

      A Publication in the Morgan & Claypool Publishers series

       SYNTHESIS LECTURES ON HUMAN LANGUAGE TECHNOLOGIES

      Lecture #45

      Series Editor: Graeme Hirst, University of Toronto

      Series ISSN

      Print 1947-4040 Electronic 1947-4059

       Statistical Significance Testing for Natural Language Processing

      Rotem Dror, Lotem Peled-Cohen, Segev Shlomov, and Roi Reichart

      Technion – Israel Institute of Technology

       SYNTHESIS LECTURES ON HUMAN LANGUAGE TECHNOLOGIES #45

       ABSTRACT

      Data-driven experimental analysis has become the main evaluation tool of Natural Language Processing (NLP) algorithms. In fact, in the last decade, it has become rare to see an NLP paper, particularly one that proposes a new algorithm, that does not include extensive experimental analysis, and the number of involved tasks, datasets, domains, and languages is constantly growing. This emphasis on empirical results highlights the role of statistical significance testing in NLP research: if we, as a community, rely on empirical evaluation to validate our hypotheses and reveal the correct language processing mechanisms, we had better be sure that our results are not coincidental.

      The goal of this book is to discuss the main aspects of statistical significance testing in NLP. Our guiding assumption throughout the book is that the basic question NLP researchers and engineers deal with is whether or not one algorithm can be considered better than another. This question drives the field forward because it enables steady progress toward better technology for language processing challenges. In practice, researchers and engineers would like to draw the right conclusion from a limited set of experiments, and this conclusion should hold for other experiments, whether on datasets they do not have at their disposal or ones they cannot perform due to limited time and resources. The book hence discusses the opportunities and challenges in using statistical significance testing in NLP, from the point of view of experimental comparison between two algorithms. We cover topics such as choosing an appropriate significance test for the major NLP tasks, dealing with the unique aspects of significance testing for non-convex deep neural networks, accounting for a large number of comparisons between two NLP algorithms in a statistically valid manner (multiple hypothesis testing), and, finally, the unique challenges posed by the nature of the data and the practices of the field.

       KEYWORDS

      Natural Language Processing, statistics, statistical significance, hypothesis testing, algorithm comparison, deep neural network models, replicability analysis

       Contents

       Preface

       Acknowledgments

       1 Introduction

       2 Statistical Hypothesis Testing

       2.1 Hypothesis Testing

       2.2 P-Value in the World of NLP

       3 Statistical Significance Tests

       3.1 Preliminaries

       3.2 Parametric Tests

       3.3 Nonparametric Tests

       4 Statistical Significance in NLP

       4.1 NLP Tasks and Evaluation Measures

       4.2 Decision Tree for Significance Test Selection

       4.3 Matching Between Evaluation Measures and Statistical Significance Tests

       4.4 Significance with Large Test Samples

       5 Deep Significance

       5.1 Performance Variance in Deep Neural Network Models

       5.2 A Deep Neural Network Comparison Framework

       5.3 Existing Methods for Deep Neural Network Comparison

       5.4 Almost Stochastic Dominance

       5.5 Empirical Analysis

       5.6 Error Rate Analysis