Commit 295e058: Update README.md (1 parent 412857f)

README.md: 38 additions & 26 deletions
# EDA-NLP

This is the code for the paper [EDA: Easy Data Augmentation techniques for boosting performance on text classification tasks](https://arxiv.org/abs/1901.11196).

We present **EDA**: **e**asy **d**ata **a**ugmentation techniques for boosting performance on text classification tasks. These are a generalized set of data augmentation techniques that are easy to implement and have shown improvements on five NLP classification tasks, with larger improvements on datasets of size *N < 500*. While other techniques require you to train a language model on an external dataset just to get a small boost, we found that simple text-editing operations with EDA result in substantial performance gains. Given a sentence of *l* ordered words *[w_1, w_2, ..., w_l]*, we perform the following operations:

- **Synonym Replacement (SR):** Randomly choose *n* words from the sentence that are not stop words. Replace each of these words with one of its synonyms chosen at random.
- **Random Insertion (RI):** Find a random synonym of a random word in the sentence that is not a stop word. Insert that synonym into a random position in the sentence. Do this *n* times.
- **Random Swap (RS):** Randomly choose two words in the sentence and swap their positions. Do this *n* times.
- **Random Deletion (RD):** For each word in the sentence, randomly remove it with probability *p*.
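The four operations above can be sketched in a few lines each. This is a minimal self-contained illustration, not the repo's implementation: a toy synonym table and stop-word set stand in for the WordNet and stop-word resources the actual code uses.

```python
import random

# Toy stand-ins for WordNet synonyms and a stop-word list (illustrative only).
SYNONYMS = {"quick": ["fast", "speedy"], "happy": ["glad", "joyful"]}
STOP_WORDS = {"the", "a", "is", "and"}

def synonym_replacement(words, n):
    """SR: replace up to n non-stop-words with a random synonym."""
    out = words[:]
    candidates = [i for i, w in enumerate(out)
                  if w not in STOP_WORDS and w in SYNONYMS]
    random.shuffle(candidates)
    for i in candidates[:n]:
        out[i] = random.choice(SYNONYMS[out[i]])
    return out

def random_insertion(words, n):
    """RI: insert a random synonym of a random non-stop-word, n times."""
    out = words[:]
    for _ in range(n):
        candidates = [w for w in out if w not in STOP_WORDS and w in SYNONYMS]
        if not candidates:
            break
        syn = random.choice(SYNONYMS[random.choice(candidates)])
        out.insert(random.randint(0, len(out)), syn)
    return out

def random_swap(words, n):
    """RS: swap two random positions, n times."""
    out = words[:]
    for _ in range(n):
        i, j = random.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out

def random_deletion(words, p):
    """RD: drop each word with probability p; never return an empty sentence."""
    out = [w for w in words if random.random() > p]
    return out if out else [random.choice(words)]
```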

# Usage

You can run EDA on any text classification dataset in less than five minutes. Just two steps:

## Install NLTK (if you don't have it already)

Install it with pip:

```
pip install -U nltk
```

Then download WordNet:
```
python
>>> import nltk; nltk.download('wordnet')
```

## Run EDA

You can easily write your own implementation, but this one takes input files in the format `label\tsentence` (note the `\t`). For instance, your input file should look like this (example from the Stanford Sentiment Treebank):

```
1 neil burger here succeeded in making the mystery of four decades back the springboard for a more immediate mystery in the present
0 it is a visual rorschach test and i must have failed
0 the only way to tolerate this insipid brutally clueless film might be with a large dose of painkillers
...
```

Now place this input file into the `data` folder. Then run:
```
python code/augment.py --input=<insert input filename>
```

The default output filename will prepend `eda_` to the input filename, but you can specify your own with `--output`. You can also specify the number of augmented sentences generated per original sentence using `--num_aug` (default is 9). For example, if your input file is `sst2_train.txt` and you want to write to `sst2_augmented.txt` with 16 augmented sentences per original sentence, you would run:

```
python code/augment.py --input=sst2_train.txt --output=sst2_augmented.txt --num_aug=16
```
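The overall augmentation loop is simple: for each input line, emit the original plus `num_aug` augmented copies with the same label. A minimal sketch, with a placeholder `augment_sentence` (here just a word shuffle) standing in for the four EDA operations so the example stays self-contained:

```python
import random

def augment_sentence(sentence):
    """Placeholder augmentation: shuffle the words of the sentence."""
    words = sentence.split()
    random.shuffle(words)
    return " ".join(words)

def augment_file_lines(lines, num_aug=9):
    """For each label\tsentence line, keep it and add num_aug augmented copies."""
    out = []
    for line in lines:
        label, sentence = line.split("\t", 1)
        out.append(line)  # keep the original sentence
        for _ in range(num_aug):
            out.append(f"{label}\t{augment_sentence(sentence)}")
    return out
```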

Best of luck!

# Experiments (Coming soon)

### Word embeddings
Download [GloVe word embeddings](https://nlp.stanford.edu/projects/glove/) and place them in a folder named `word2vec`.
