# EDA-NLP
This is the code for the paper [EDA: Easy Data Augmentation techniques for boosting performance on text classification tasks](https://arxiv.org/abs/1901.11196).

We present **EDA**: **e**asy **d**ata **a**ugmentation techniques for boosting performance on text classification tasks. These are a generalized set of data augmentation techniques that are easy to implement and have shown improvements on five NLP classification tasks, with larger improvements on datasets of size *N < 500*. While other techniques require you to train a language model on an external dataset just to get a small boost, we found that simple text editing operations using EDA result in substantial performance gains. Given a sentence of *l* ordered words *[w_1, w_2, ..., w_l]*, we perform the following operations:

- **Synonym Replacement (SR):** Randomly choose *n* words from the sentence that are not stop words. Replace each of these words with one of its synonyms chosen at random.
- **Random Insertion (RI):** Find a random synonym of a random word in the sentence that is not a stop word. Insert that synonym into a random position in the sentence. Do this *n* times.
- **Random Swap (RS):** Randomly choose two words in the sentence and swap their positions. Do this *n* times.
- **Random Deletion (RD):** For each word in the sentence, randomly remove it with probability *p*.
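
To make the operations concrete, here is a minimal sketch of the two that need no thesaurus, RS and RD, in plain Python. This is an illustrative simplification, not the repo's implementation:

```python
import random

def random_swap(words, n):
    """RS: swap the positions of two randomly chosen words, n times."""
    words = words.copy()
    for _ in range(n):
        i = random.randrange(len(words))
        j = random.randrange(len(words))
        words[i], words[j] = words[j], words[i]
    return words

def random_deletion(words, p):
    """RD: drop each word independently with probability p.
    If everything is dropped, keep one word so the sentence is never empty."""
    kept = [w for w in words if random.random() > p]
    return kept if kept else [random.choice(words)]

sentence = "the quick brown fox jumps over the lazy dog".split()
print(" ".join(random_swap(sentence, 2)))
print(" ".join(random_deletion(sentence, 0.1)))
```

SR and RI work the same way but draw replacement/insertion words from a synonym source such as WordNet.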

# Usage

You can run EDA on any text classification dataset in less than 5 minutes. Just two steps:

## Install NLTK (if you don't have it already):

Pip install it.

```
pip install -U nltk
```

Download WordNet.
```
python
>>> import nltk; nltk.download('wordnet')
```

## Run EDA

You can easily write your own implementation, but this one takes input files in the format `label\tsentence` (note the `\t`). For instance, your input file should look like this (example from the Stanford Sentiment Treebank):

```
1	neil burger here succeeded in making the mystery of four decades back the springboard for a more immediate mystery in the present
0	it is a visual rorschach test and i must have failed
0	the only way to tolerate this insipid brutally clueless film might be with a large dose of painkillers
...
```

Now place this input file into the `data` folder. Run

```
python code/augment.py --input=<insert input filename>
```

The default output filename will prepend `eda_` to the input filename, but you can specify your own with `--output`. You can also specify the number of generated augmented sentences per original sentence using `--num_aug` (default is 9). For example, if your input file is `sst2_train.txt` and you want to output to `sst2_augmented.txt` with 16 augmented sentences per original sentence, you would do:

```
python code/augment.py --input=sst2_train.txt --output=sst2_augmented.txt --num_aug=16
```

Best of luck!

# Experiments (Coming soon)

### Word embeddings
Download [GloVe word embeddings](https://nlp.stanford.edu/projects/glove/) and place them in a folder named `word2vec`.