Interpreting Models of Linguistic Complexity

This repository contains the data and code needed to reproduce all the experiments presented in:

Interpreting Neural Language Models for Linguistic Complexity Assessment, Gabriele Sarti, Data Science and Scientific Computing MSc Thesis, University of Trieste, 2020 [Gitbook] [Slides (Long)] [Slides (Short)]

UmBERTo-MTSA @ AcCompl-It: Improving Complexity and Acceptability Prediction with Multi-task Learning on Self-Supervised Annotations, Gabriele Sarti, Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA 2020) [ArXiv] [CEUR] [Video]

That Looks Hard: Characterizing Linguistic Complexity in Humans and Language Models, Gabriele Sarti, Dominique Brunato and Felice Dell'Orletta, Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics (CMCL) at NAACL 2021 [ACL Anthology]

If you find these resources useful for your research, please consider citing one or more of the following works:

@mastersthesis{sarti-2020-interpreting,
    author = {Sarti, Gabriele},
    institution = {University of Trieste},
    school = {University of Trieste},
    title = {Interpreting Neural Language Models for Linguistic Complexity Assessment},
    year = {2020}
}

@inproceedings{sarti-2020-umbertomtsa,
    author = {Sarti, Gabriele},
    title = {{UmBERTo-MTSA @ AcCompl-It}: Improving Complexity and Acceptability Prediction with Multi-task Learning on Self-Supervised Annotations},
    booktitle = {Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020)},
    editor = {Basile, Valerio and Croce, Danilo and Di Maro, Maria and Passaro, Lucia C.},
    publisher = {CEUR.org},
    year = {2020},
    address = {Online}
}

@inproceedings{sarti-etal-2021-looks,
    title = "That Looks Hard: Characterizing Linguistic Complexity in Humans and Language Models",
    author = "Sarti, Gabriele and Brunato, Dominique and Dell'Orletta, Felice",
    booktitle = "Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics",
    month = jun,
    year = "2021",
    address = "Mexico City, Mexico",
    publisher = "Association for Computational Linguistics",
    url = "TBD",
    doi = "TBD",
    pages = "TBD",
}

Overview

⚠️ TODO: Short summary and images ⚠️

Installation

Prerequisites

  • Python >= 3.6 is required to run the scripts provided in this repository. Torch should be installed using the wheels available on the PyTorch website that are compatible with your CUDA version.

  • For CUDA 10 and Python 3.6, we used the wheel torch-1.3.0-cp36-cp36m-linux_x86_64.whl; a sample install command follows this list.

  • Python >= 3.7 is required to run SyntaxGym-related scripts.
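For instance, a hedged sketch of the wheel install for that setup (the +cu100 version tag and the stable-wheel index URL are assumptions based on PyTorch's historical install instructions, not taken from this repository; adapt them to your CUDA version):

# Install a torch build matching CUDA 10.0 from the PyTorch wheel index
pip install torch==1.3.0+cu100 -f https://download.pytorch.org/whl/torch_stable.html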

Main dependencies

  • torch == 1.6.0
  • farm == 0.5.0
  • transformers == 3.3.1
  • syntaxgym
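If you are not using setup.sh, the pins above can be installed directly; a minimal sketch, assuming the package names as published on PyPI (setup.sh remains the supported path):

# Install the pinned dependencies listed above
pip install torch==1.6.0 farm==0.5.0 transformers==3.3.1 syntaxgym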

Setup procedure

python3 -m venv env
source env/bin/activate
pip install --upgrade pip
./scripts/setup.sh

Run scripts/setup.sh from the main project folder. This will install dependencies, download data, and create the repository structure. If you do not want to download the ZuCo MAT files (30GB), edit setup.sh and set DOWNLOAD_ZUCO_MAT_FILES=false.
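For example, a minimal sketch of that edit (the flag name comes from the note above; its exact location inside setup.sh is assumed):

# in scripts/setup.sh
DOWNLOAD_ZUCO_MAT_FILES=false  # skip the 30GB ZuCo MAT download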

You need to manually download the original perceived complexity dataset presented in Brunato et al. 2018 from the ItaliaNLP Lab website and place it in the data/complexity folder.

The AcCompl-IT campaign data and the Dundee corpus cannot be redistributed due to copyright restrictions.

After all datasets are in their respective folders, run python scripts/preprocess.py --all from the main project folder to preprocess the datasets. Refer to the Getting Started section for further steps.
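That is, from the repository root:

python scripts/preprocess.py --all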

Code Overview

Repository structure

  • data contains the subfolders for all data used throughout the study:

    • complexity: the Perceived Complexity corpus by Brunato et al. 2018.
    • eyetracking: Eye-tracking corpora (Dundee, GECO, ZuCo 1 & 2).
    • eval: SST dataset used for representational similarity evaluation.
    • garden_paths: three test suites taken from the SyntaxGym benchmark.
    • readability: OneStopEnglish corpus paragraphs by reading level.
    • preprocessed: The preprocessed versions of each corpus produced by scripts/preprocess.py.
  • src/lingcomp is the library developed for this work, composed of:

    • data_utils: Eye-tracking processors and utils.
    • farm: Custom extension of the FARM library to add token-level regression, better multitask learning for NLMs and the GPT-2 model.
    • similarity: Methods used for representational similarity evaluation.
    • syntaxgym: Methods used to perform evaluation over SyntaxGym test suites.
  • scripts contains the scripts used to carry out the analysis and modeling experiments (a usage example follows this list):

    • shortcuts: (in development) scripts that call other scripts multiple times to provide a quick interface.
    • analyze_linguistic_features: Produces a report containing correlations across various complexity metrics and linguistic features.
    • compute_sentence_baselines: Computes sentence-level avg., binned avg. and SVM baselines for complexity scores using cross-validation.
    • compute_similarity: Evaluates the representational similarity of embeddings produced by neural language models using different methods.
    • evaluate_garden_paths: Allows using custom metrics (surprisal, gaze metrics predictions) to estimate the presence of atypical constructions in SyntaxGym test suites.
    • finetune_sentence_level: Trains NLMs on sentence-level regression or classification tasks in single-task or multi-task settings.
    • finetune_token_regression: Trains NLMs on token-level regression in single-task or multi-task settings.
    • get_surprisals: Computes surprisal scores produced by NLMs for input sentences.
    • preprocess: Performs initial preprocessing and train/test splitting.
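Each of these scripts is a standalone command-line entry point run from the project root. A hedged usage sketch, assuming the scripts expose standard argparse interfaces (not verified for every script):

# List the available options for any script, e.g. the surprisal extractor
python scripts/get_surprisals.py --help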

Getting Started

Preprocessing

# Generate sentence-level dataset for eyetracking
python scripts/preprocess.py \
    --all \
    --do_features \
    --eyetracking_mode sentence \
    --do_train_test_split
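Until the remaining examples are filled in, here is a hedged variant for token-level data; the token value for --eyetracking_mode is an assumption suggested by the finetune_token_regression script, so verify it against the script's --help output:

# Generate token-level dataset for eyetracking (mode name assumed, not verified)
python scripts/preprocess.py \
    --all \
    --do_features \
    --eyetracking_mode token \
    --do_train_test_split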

⚠️ TODO: Examples for all experiments ⚠️

Contacts

If you have any questions, feel free to contact me via email (gabriele.sarti996@gmail.com) or to open a GitHub issue in this repository!