This repository contains the data and code needed to reproduce all the experiments presented in the following works:
- Interpreting Neural Language Models for Linguistic Complexity Assessment, Gabriele Sarti, Data Science and Scientific Computing MSc Thesis, University of Trieste, 2020. [Gitbook] [Slides (Long)] [Slides (Short)]
- UmBERTo-MTSA @ AcCompl-It: Improving Complexity and Acceptability Prediction with Multi-task Learning on Self-Supervised Annotations, Gabriele Sarti, Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA 2020). [ArXiv] [CEUR] [Video]
- That Looks Hard: Characterizing Linguistic Complexity in Humans and Language Models, Gabriele Sarti, Dominique Brunato and Felice Dell'Orletta, Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics at NAACL 2021. [ACL Anthology]
If you find these resources useful for your research, please consider citing one or more of the following works:
```bibtex
@mastersthesis{sarti-2020-interpreting,
  author = {Sarti, Gabriele},
  institution = {University of Trieste},
  school = {University of Trieste},
  title = {Interpreting Neural Language Models for Linguistic Complexity Assessment},
  year = {2020}
}

@inproceedings{sarti-2020-umbertomtsa,
  author = {Sarti, Gabriele},
  title = {{UmBERTo-MTSA @ AcCompl-It}: Improving Complexity and Acceptability Prediction with Multi-task Learning on Self-Supervised Annotations},
  booktitle = {Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020)},
  editor = {Basile, Valerio and Croce, Danilo and Di Maro, Maria and Passaro, Lucia C.},
  publisher = {CEUR.org},
  year = {2020},
  address = {Online}
}

@inproceedings{sarti-etal-2021-looks,
  title = "That Looks Hard: Characterizing Linguistic Complexity in Humans and Language Models",
  author = "Sarti, Gabriele and Brunato, Dominique and Dell'Orletta, Felice",
  booktitle = "Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics",
  month = jun,
  year = "2021",
  address = "Mexico City, Mexico",
  publisher = "Association for Computational Linguistics",
  url = "TBD",
  doi = "TBD",
  pages = "TBD"
}
```

Prerequisites
- Python >= 3.6 is required to run the scripts provided in this repository. Torch should be installed using the wheels available on the PyTorch website that are compatible with your CUDA version.
- For CUDA 10 and Python 3.6, we used the wheel torch-1.3.0-cp36-cp36m-linux_x86_64.whl (see the install sketch after this list).
- Python >= 3.7 is required to run SyntaxGym-related scripts.
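Once a compatible wheel has been downloaded from the PyTorch website, it can be installed directly with pip. This is only a sketch for the CUDA 10 / Python 3.6 combination mentioned above; the wheel filename will differ for other setups:

```bash
# Sketch: install a locally downloaded PyTorch wheel matching your CUDA and Python versions
pip install torch-1.3.0-cp36-cp36m-linux_x86_64.whl
```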
Main dependencies
- torch == 1.6.0
- farm == 0.5.0
- transformers == 3.3.1
- syntaxgym
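If you prefer installing the pinned dependencies manually instead of relying on scripts/setup.sh (see below), something along these lines should work, assuming the listed versions are still available on PyPI; torch itself is best installed from the CUDA-specific wheels described in the prerequisites:

```bash
# Sketch only: scripts/setup.sh remains the supported installation route
pip install farm==0.5.0 transformers==3.3.1 syntaxgym
```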
Setup procedure
```bash
python3 -m venv env
source env/bin/activate
pip install --upgrade pip
./scripts/setup.sh
```

Run scripts/setup.sh from the main project folder. This will install dependencies, download data and create the repository structure. If you do not want to download the ZuCo MAT files (30GB), edit setup.sh and set DOWNLOAD_ZUCO_MAT_FILES=false.
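For reference, opting out amounts to flipping a single shell variable, assuming setup.sh defines the flag as a plain variable as the name above suggests:

```bash
# Inside scripts/setup.sh (sketch): skip downloading the ~30GB ZuCo MAT archives
DOWNLOAD_ZUCO_MAT_FILES=false
```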
You need to manually download the original perceived complexity dataset presented in Brunato et al. 2018 from the ItaliaNLP Lab website and place it in the data/complexity folder.
The AcCompl-IT campaign data and the Dundee corpus cannot be redistributed due to copyright restrictions.
After all datasets are in their respective folders, run python scripts/preprocess.py --all from the main project folder to preprocess them. Refer to the Getting Started section for further steps.
Repository structure
- data contains the subfolders for all data used throughout the study:
  - complexity: the Perceived Complexity corpus by Brunato et al. 2018.
  - eyetracking: eye-tracking corpora (Dundee, GECO, ZuCo 1 & 2).
  - eval: SST dataset used for representational similarity evaluation.
  - garden_paths: three test suites taken from the SyntaxGym benchmark.
  - readability: OneStopEnglish corpus paragraphs grouped by reading level.
  - preprocessed: the preprocessed versions of each corpus produced by scripts/preprocess.py.
- src/lingcomp is the library built for this work, composed of:
  - data_utils: eye-tracking processors and utilities.
  - farm: custom extension of the FARM library adding token-level regression, better multitask learning for NLMs and the GPT-2 model.
  - similarity: methods used for representational similarity evaluation.
  - syntaxgym: methods used to perform evaluation over SyntaxGym test suites.
- scripts: used to carry out the analysis and modeling experiments:
  - shortcuts: in development; scripts calling other scripts multiple times to provide a quick interface.
  - analyze_linguistic_features: Produces a report containing correlations across various complexity metrics and linguistic features.
  - compute_sentence_baselines: Computes sentence-level average, binned average and SVM baselines for complexity scores using cross-validation.
  - compute_similarity: Evaluates the representational similarity of embeddings produced by neural language models using different methods.
  - evaluate_garden_paths: Allows using custom metrics (surprisal, gaze metric predictions) to estimate the presence of atypical constructions over SyntaxGym test suites.
  - finetune_sentence_level: Trains NLMs on sentence-level regression or classification tasks in single- or multi-task settings.
  - finetune_token_regression: Trains NLMs on token-level regression in single- or multi-task settings.
  - get_surprisals: Computes surprisal scores produced by NLMs for sentences.
  - preprocess: Performs initial preprocessing and train/test splitting.
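Each script is a standalone command-line entry point. Assuming they expose an argparse interface like scripts/preprocess.py (the invocations below only illustrate the pattern), you can inspect the available options before running them:

```bash
# Sketch: check each script's options with --help before launching long-running jobs
python scripts/compute_sentence_baselines.py --help
python scripts/finetune_sentence_level.py --help
```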
Preprocessing
```bash
# Generate sentence-level dataset for eyetracking
python scripts/preprocess.py \
    --all \
    --do_features \
    --eyetracking_mode sentence \
    --do_train_test_split
```

If you have any questions, feel free to contact me through email (gabriele.sarti996@gmail.com) or raise a GitHub issue in the repository!