
Commit 08c3bba

Update README
1 parent 6cd58d3 commit 08c3bba


README.md

Lines changed: 7 additions & 5 deletions
@@ -83,7 +83,7 @@ This trained model is in a "released" state, which means that we stripped it fro
 To train a model from scratch:
 * Edit the file [train.sh](train.sh) to point it to the right preprocessed data. By default,
 it points to our "java14m" dataset that was preprocessed in the previous step.
-* Before training, you can edit the configuration hyper-parameters in the file [config.py](config.py),
+* Before training, you can edit the configuration hyper-parameters in the file [common.py](common.py),
 as explained in [Configuration](#configuration).
 * Run the [train.sh](train.sh) script:
 ```
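
As context for the rename in this hunk: the README documents hyper-parameters as attributes of a `config` object (see `config.NUM_EPOCHS = 20` further down), so the block being edited in common.py plausibly looks like the sketch below. Only `NUM_EPOCHS` and the "newest 10 versions are kept" behavior are stated in the README; the other attribute names are illustrative assumptions, not the repository's actual fields.

```python
# Hypothetical sketch of the hyper-parameter block in common.py (renamed from
# config.py in this commit). NUM_EPOCHS matches the README; the other names
# are assumptions made up for illustration.
class Config:
    NUM_EPOCHS = 20        # README: the network trains for 20 epochs by default
    SAVE_EVERY_EPOCHS = 1  # assumed: evaluation/checkpoint after every epoch
    MAX_TO_KEEP = 10       # assumed name for "the newest 10 versions are kept"

config = Config()
print(config.NUM_EPOCHS)  # -> 20
```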
@@ -94,7 +94,7 @@ source train.sh
 1. By default, the network is evaluated on the validation set after every training epoch.
 2. The newest 10 versions are kept (older are deleted automatically). This can be changed, but will be more space consuming.
 3. By default, the network is training for 20 epochs.
-These settings can be changed by simply editing the file [config.py](config.py).
+These settings can be changed by simply editing the file [common.py](common.py).
 Training on a Tesla v100 GPU takes about 50 minutes per epoch.
 Training on Tesla K80 takes about 4 hours per epoch.
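
For a sense of scale, a back-of-the-envelope total for the default 20-epoch run, using the per-epoch figures quoted in this hunk (a rough sketch; actual wall-clock time depends on the dataset and hardware):

```python
# Rough total training time implied by the README's per-epoch numbers.
num_epochs = 20                      # default number of epochs from the README
v100_total_h = num_epochs * 50 / 60  # ~16.7 hours at ~50 min/epoch (Tesla V100)
k80_total_h = num_epochs * 4         # ~80 hours at ~4 h/epoch (Tesla K80)
print(f"V100: ~{v100_total_h:.1f} h, K80: ~{k80_total_h} h")
```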

@@ -116,7 +116,8 @@ After the model loads, follow the instructions and edit the file Input.java and
 method or code snippet, and examine the model's predictions and attention scores.
 
 ## Configuration
-Changing hyper-parameters is possible by editing the file [config.py](config.py).
+Changing hyper-parameters is possible by editing the file [common.py](common
+.py).
 
 Here are some of the parameters and their description:
 #### config.NUM_EPOCHS = 20
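
To make the cadence behind these settings concrete (train for `config.NUM_EPOCHS` epochs, evaluating on the validation set after each one), here is a minimal self-contained sketch; the stub functions stand in for the real training code and are not from this repository:

```python
# Sketch of the training cadence the README describes: NUM_EPOCHS epochs,
# each followed by a validation pass. Stubs replace the actual training code.
class Config:
    NUM_EPOCHS = 20  # the default documented in this Configuration section

config = Config()

def train_one_epoch(epoch: int) -> None:
    print(f"training epoch {epoch}/{config.NUM_EPOCHS}")

def evaluate_on_validation(epoch: int) -> None:
    print(f"evaluating after epoch {epoch}")

for epoch in range(1, config.NUM_EPOCHS + 1):
    train_one_epoch(epoch)
    evaluate_on_validation(epoch)
```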
@@ -184,10 +185,11 @@ python3
 >>> from gensim.models import KeyedVectors as word2vec
 >>> vectors_text_path = 'models/java14m/targets.txt' # or: `models/java14m/tokens.txt'
 >>> model = word2vec.load_word2vec_format(vectors_text_path, binary=False)
->>> model.most_similar(positive=['equals', 'to|lower'])
+>>> model.most_similar(positive=['equals', 'to|lower']) # or: 'tolower', if using the downloaded embeddings
+>>> model.most_similar(positive=['download', 'send'], negative=['receive'])
 ```
 The above python commands will result in the closest name to both "equals" and "to|lower", which is "equals|ignore|case".
-Note: the input token and target words are saved using the symbol "|" as a subtokens delimiter ("*toLower*" is saved as: "*to|lower*").
+Note: In embeddings that were exported manually using the "--save_w2v" or "--save_t2v" flags, the input token and target words are saved using the symbol "|" as a subtokens delimiter ("*toLower*" is saved as: "*to|lower*"). In the embeddings that are available to download (which are the same as in the paper), the "|" symbol is not used, thus "*toLower*" is saved as "*tolower*".
 
 ## Extending to other languages
 In order to extend code2vec to work with other languages other than Java, a new extractor (similar to the [JavaExtractor](JavaExtractor))
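
Two remarks on the hunk above. First, in gensim, `most_similar(positive=..., negative=...)` does vector arithmetic, so the new second query searches near vector('download') + vector('send') - vector('receive'). Second, the updated note distinguishes two serializations of the same name ("to|lower" vs. "tolower"); the standalone helper below illustrates that convention by splitting a camelCase identifier into subtokens. It is an editor's sketch of the naming scheme, not code from this repository:

```python
# Illustration of the subtoken convention described in the updated note:
# camelCase names split into lowercase subtokens, joined with "|" in manually
# exported embeddings (--save_w2v / --save_t2v) and with nothing in the
# downloadable embeddings. A sketch, not part of the code2vec repository.
import re

def subtokenize(name: str, delimiter: str = "|") -> str:
    parts = re.findall(r"[a-z0-9]+|[A-Z][a-z0-9]*", name)
    return delimiter.join(p.lower() for p in parts)

print(subtokenize("toLower"))           # -> to|lower  (manual export)
print(subtokenize("toLower", ""))       # -> tolower   (downloaded embeddings)
print(subtokenize("equalsIgnoreCase"))  # -> equals|ignore|case
```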
