Merged
86 commits
624eab6
README minor fixes
eladn Mar 13, 2019
79f5891
train.sh - add #! hash first line
eladn Mar 13, 2019
04efcc5
gitignore: ignore models, data, .idea, and tar.gz
eladn Mar 13, 2019
beb2eda
new: Keras AttentionLayer
eladn Mar 13, 2019
36c69a7
Config / add params: DL_FRAMEWORK & DROPOUT_KEEP_RATE
eladn Mar 14, 2019
b79127d
common::split_to_batches() / use iterator instead of creating a list
eladn Mar 14, 2019
254ee5e
Keras AttentionLayer / minor comments modification
eladn Mar 14, 2019
7ebf1ce
add Keras model impl (not fully implemented yet); dispatch tf/keras i…
eladn Mar 14, 2019
bee8ccb
keras model: use `tf.data` reader [now training works]
eladn Mar 21, 2019
e41730c
export Config to config.py; common.SpecialDictWords(Enum);
eladn Mar 22, 2019
429258e
add common.tf_get_first_true()
eladn Mar 22, 2019
69dc76a
model_base: use common.SpecialDictWords
eladn Mar 22, 2019
9bf16e8
keras model: add prediction tf graph, add evaluation f1 metric; pass …
eladn Mar 22, 2019
290f254
export word prediction calculation from model into WordPredictionLayer
eladn Mar 22, 2019
79fdfe3
impl keras Words Subtoken Metrics (Precision, Recall, F1)
eladn Mar 22, 2019
b832b4e
export the `topk` param into config.TOP_K_WORDS_CONSIDERED_DURING_PRE…
eladn Mar 22, 2019
fd48203
keras model: use 'target_word_prediction' layer output as an addition…
eladn Mar 23, 2019
4a99e60
minor refactor: name for metrics
eladn Mar 23, 2019
6014a33
keras model: impl save+load, use val reader for train, use checkpoint…
eladn Mar 23, 2019
992622a
keras model: minor refactor
eladn Mar 23, 2019
d976f41
keras model: add code_vectors to output, impl evaluate()+predict()
eladn Mar 23, 2019
e80cc9b
base model redactor; OOV+PAD special words instead of NoSuch; impl sa…
eladn Mar 24, 2019
4ca69ca
fix store+load model (to use RELEASE param correctly); fix transform …
eladn Mar 25, 2019
8ea4851
SpecialDictWords: each word has its string representation and its index
eladn Mar 26, 2019
8db1d7d
subtoken metrics: fix subtoken separator
eladn Mar 26, 2019
d191727
impl Vocab class; refactor SpecialVocabWords
eladn Mar 26, 2019
ffd416a
reader: minor refactor + fix csv_record_defaults
eladn Mar 27, 2019
5122495
export vocabs management into `vocabularies.py`; new Code2VecVocabs c…
eladn Mar 28, 2019
c97d6c3
keras model store+load: use tf checkpoint to store optimizer status; …
eladn Apr 2, 2019
98a0b1b
move VocabType to vocabularies; move `save_word2vec_format()` to base…
eladn Apr 3, 2019
ec93a81
reader: add option `repeat_endlessly`; use in keras model.
eladn Apr 3, 2019
48965d3
tensorflow model: adapt to the new model base and the new reader; NOT…
eladn Apr 3, 2019
c062e64
add `framework` option to argparse; use unified `ModelEvaluationResul…
eladn Apr 4, 2019
058443c
tensorflow model: make it work correctly with the new reader and new …
eladn Apr 4, 2019
3c31a73
add log.txt to .gitignore
eladn Apr 4, 2019
91ff0d6
reader: support input lines + support predict; keras model: fix predi…
eladn Apr 4, 2019
89fcfe5
keras model: don't use `WordPredictionLayer` anymore, fix subtoken wo…
eladn Apr 5, 2019
b350596
make tensorflow the default impl; fix help descriptions in argparser.
eladn Apr 5, 2019
225d6f2
tensorflow model: `TopKAccuracyEvaluationMetric` refactor: use numpy …
eladn Apr 6, 2019
b682143
keras model: pass target word string in input pipeline for estimation
eladn Apr 6, 2019
d806f81
tensorflow model minor refactor
eladn Apr 15, 2019
5ab7672
update README: paper-version, dependencies, configuration params
eladn May 26, 2019
87b5b1a
Merge remote-tracking branch 'upstream/master'
eladn May 26, 2019
2acd3ba
update README: mention keras impl; fix links
eladn May 26, 2019
1db70f1
update README: add section in Features about choosing impl
eladn May 26, 2019
5d0ff2f
train.sh: remove accidentally added line before EOF
eladn May 26, 2019
143d2a6
config: minor refactor to params names & order
eladn May 26, 2019
c09e588
old reader: remove impl & remove its old config params
eladn May 26, 2019
1e368c1
keras WordsSubtokenMetric: fix duplicates preserve
eladn May 26, 2019
e105e66
add logger; use tensorboard; refactor config; tf: fix subtoken metric
eladn May 27, 2019
de5305c
keras words subtokens metric: fix
eladn May 27, 2019
915f3a3
config: rename prop; keras model: add logging on load model
eladn May 27, 2019
4545b02
config: rename param; add method to iter all params
eladn May 27, 2019
3a5f77b
logging format; log creating model; log config
eladn May 27, 2019
97e752f
keras model: refactor ckpt saver cb; refactor eval cb; cancel re-comp…
eladn May 27, 2019
64f30d4
main file minor refactor
eladn Jun 2, 2019
81d22bd
vocabs: add tiny doc line
eladn Jun 2, 2019
27b7e22
keras model: tiny fix
eladn Jun 2, 2019
365bc36
config: change default values
eladn Jun 2, 2019
46ae7ad
migrate to TF-2.0.0-alpha; now keras model works with BS=1024 + TP=12…
eladn Jun 2, 2019
5891587
config: refactor: add annotations + optional strings + `is_saving` + …
eladn Jun 3, 2019
df3b0ca
reader: fix `process_input_row()` to adapt TF2
eladn Jun 3, 2019
cbe8fa6
vocabs: fix minor issue (field initialization position)
eladn Jun 3, 2019
7da5dc4
keras modeL: save only when needed; adapt save+load+predict to TF2
eladn Jun 3, 2019
2547420
docs: add some docs to classes
eladn Jun 3, 2019
b79789e
keras model: refactor: add `_create_train_callbacks()`; add docs
eladn Jun 3, 2019
69b41b6
logger fix: don't print twice (turn off propagate)
eladn Jun 3, 2019
6edde7a
model base tiny refactor
eladn Jun 3, 2019
8242e6e
keras model save fix (to also save vocabs)
eladn Jun 3, 2019
49bbd64
tensorflow model: print # trainable params
eladn Jun 27, 2019
0aa498e
add requirements.txt
eladn Jul 15, 2019
f710132
README: add keras impl title, update requirements
eladn Jul 15, 2019
38ffab5
model base: create model save dir if not exists
eladn Jul 15, 2019
d43d0d0
update version: TF2.0.0-alpha ==> TF2.0.0-beta1 (use tf.compat.v1.str…
eladn Jul 15, 2019
1bc14d4
tensorflow model: make it work in TF2.0.0-beta1 (train+eval+predict+s…
eladn Jul 15, 2019
cb7df2b
add & support option `config.SEPARATE_OOV_AND_PAD` and `vocab.special…
eladn Jul 15, 2019
63a8019
README: minor bold-updates order fix
eladn Jul 15, 2019
41e13ec
vocabs: fix save&load to match old format (without special words); no…
eladn Jul 16, 2019
dfba714
config: rename param "NUM_BATCHES_TO_LOG_PROGRESS"
eladn Jul 16, 2019
cf716b5
vocabs: refactor error msg on failed load due to wrong min word idx
eladn Jul 16, 2019
b68f0e9
tf model: use logger instead of print()s
eladn Jul 16, 2019
cbc6487
vocabs: fix save_to_file() to adapt to old format
eladn Jul 16, 2019
86d5b43
keras & tf models: refactor - move aux classes below main model class
eladn Jul 16, 2019
419c7eb
config: make logging go to stdout instead of stderr
eladn Jul 17, 2019
1c92dab
readme: remove paper-version note. We just added new functionalities.…
eladn Jul 17, 2019
64bebf7
Merge remote-tracking branch 'upstream/master'
eladn Jul 17, 2019
5 changes: 5 additions & 0 deletions .gitignore
@@ -1,2 +1,7 @@
*.class
*.lst
**/models/**
**/data/**
**/.idea/**
*.tar.gz
**/log.txt
156 changes: 0 additions & 156 deletions PathContextReader.py

This file was deleted.

84 changes: 60 additions & 24 deletions README.md
@@ -5,17 +5,20 @@ This is an official implementation of the model described in:
[Uri Alon](http://urialon.cswp.cs.technion.ac.il), [Meital Zilberstein](http://www.cs.technion.ac.il/~mbs/), [Omer Levy](https://levyomer.wordpress.com) and [Eran Yahav](http://www.cs.technion.ac.il/~yahave/),
"code2vec: Learning Distributed Representations of Code", POPL'2019 [[PDF]](https://urialon.cswp.cs.technion.ac.il/wp-content/uploads/sites/83/2018/12/code2vec-popl19.pdf)

_**October 2018** - the paper was accepted to [POPL'2019](https://popl19.sigplan.org)_!
_**October 2018** - The paper was accepted to [POPL'2019](https://popl19.sigplan.org)_!

_**April 2019** - The talk video is available [here](https://www.youtube.com/watch?v=EJ8okcxL2Iw)_.

_**July 2019** - Added a `tf.keras` model implementation (see [here](#choosing-implementation-to-use))._

An **online demo** is available at [https://code2vec.org/](https://code2vec.org/).

This is a TensorFlow implementation, designed to be easy and useful in research,
and for experimenting with new ideas in machine learning for code tasks.
By default, it learns Java source code and predicts Java method names, but it can be easily extended to other languages,
since the TensorFlow network is agnostic to the input programming language (see [Extending to other languages](#extending-to-other-languages)).
Contributions are welcome.
This repo contains two model implementations: the first uses pure TensorFlow and the second uses TensorFlow's Keras.

<center style="padding: 40px"><img width="70%" src="https://github.com/tech-srl/code2vec/raw/master/images/network.png" /></center>

@@ -33,13 +36,18 @@ Table of Contents
On Ubuntu:
* [Python3](https://www.linuxbabe.com/ubuntu/install-python-3-6-ubuntu-16-04-16-10-17-04). To check if you have it:
> python3 --version
* TensorFlow - version 1.5 or newer ([install](https://www.tensorflow.org/install/install_linux)). To check TensorFlow version:
* TensorFlow - version 2.0.0-beta1 ([install](https://www.tensorflow.org/install/install_linux)).
To check TensorFlow version:
> python3 -c 'import tensorflow as tf; print(tf.\_\_version\_\_)'
* If you are using a GPU, you will need CUDA 9.0 ([download](https://developer.nvidia.com/cuda-90-download-archive))
* If you are using a GPU, you will need CUDA 10.0
([download](https://developer.nvidia.com/cuda-10.0-download-archive-base))
as this is the version that is currently supported by TensorFlow. To check CUDA version:
> nvcc --version
* For GPU: cuDNN (>=7.0) ([download](http://developer.nvidia.com/cudnn))
* For [creating a new dataset](#creating-and-preprocessing-a-new-java-dataset) or [manually examining a trained model](#step-4-manual-examination-of-a-trained-model) (any operation that requires parsing of a new code example) - [Java JDK](https://openjdk.java.net/install/)
* For GPU: cuDNN (>=7.5) ([download](http://developer.nvidia.com/cudnn)). To check the cuDNN version:
> cat /usr/include/cudnn.h | grep CUDNN_MAJOR -A 2
* For [creating a new dataset](#creating-and-preprocessing-a-new-java-dataset)
or [manually examining a trained model](#step-4-manual-examination-of-a-trained-model)
(any operation that requires parsing of a new code example) - [Java JDK](https://openjdk.java.net/install/)

## Quickstart
### Step 0: Cloning this repository
@@ -124,46 +132,74 @@ To manually examine a trained model, run:
```
python3 code2vec.py --load models/java14_model/saved_model_iter8 --predict
```
After the model loads, follow the instructions and edit the file Input.java and enter a Java
After the model loads, follow the instructions and edit the file [Input.java](Input.java) and enter a Java
method or code snippet, and examine the model's predictions and attention scores.

## Configuration
Changing hyper-parameters is possible by editing the file [common.py](common
.py).
Changing hyper-parameters is possible by editing the file
[config.py](config.py).

Here are some of the parameters and their description:
#### config.NUM_EPOCHS = 20
#### config.NUM_TRAIN_EPOCHS = 20
The max number of epochs to train the model. Stopping earlier must be done manually (kill).
#### config.SAVE_EVERY_EPOCHS = 1
Number of training epochs between model saves.
#### config.BATCH_SIZE = 1024
#### config.TRAIN_BATCH_SIZE = 1024
Batch size in training.
#### config.TEST_BATCH_SIZE = config.BATCH_SIZE
#### config.TEST_BATCH_SIZE = config.TRAIN_BATCH_SIZE
Batch size in evaluating. Affects only the evaluation speed and memory consumption, does not affect the results.
#### config.READING_BATCH_SIZE = 1300 * 4
The batch size of reading text lines to the queue that feeds examples to the network during training.
#### config.NUM_BATCHING_THREADS = 2
The number of threads enqueuing examples.
#### config.BATCH_QUEUE_SIZE = 300000
Max number of elements in the feeding queue.
#### config.DATA_NUM_CONTEXTS = 200
The number of contexts in a single example, as was created in preprocessing.
#### config.TOP_K_WORDS_CONSIDERED_DURING_PREDICTION = 10
Number of words with the highest scores in $\hat{y}$ to consider during prediction and evaluation.
#### config.NUM_BATCHES_TO_LOG_PROGRESS = 100
Number of batches (during training / evaluating) to complete between two progress-logging records.
#### config.NUM_TRAIN_BATCHES_TO_EVALUATE = 100
Number of training batches to complete between model evaluations on the test set.
#### config.READER_NUM_PARALLEL_BATCHES = 4
The number of threads enqueuing examples to the reader queue.
#### config.SHUFFLE_BUFFER_SIZE = 10000
Size of the reader buffer within which examples are shuffled during training.
A bigger buffer gives better randomness, but requires more memory and may harm training throughput.
#### config.CSV_BUFFER_SIZE = 100 * 1024 * 1024 # 100 MB
The buffer size (in bytes) of the CSV dataset reader.
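
The trade-off behind `SHUFFLE_BUFFER_SIZE` can be illustrated with a minimal pure-Python sketch of buffered shuffling. This is not the reader's actual code; the function name `buffered_shuffle` and its `seed` parameter are assumptions made for illustration only. It shows why a small buffer keeps examples close to their original order while a larger one mixes them more thoroughly at the cost of memory.

```python
import random
from typing import Iterable, Iterator, List, Optional, TypeVar

T = TypeVar("T")


def buffered_shuffle(items: Iterable[T], buffer_size: int,
                     seed: Optional[int] = None) -> Iterator[T]:
    """Yield items in a locally shuffled order using a bounded buffer.

    Hypothetical helper for illustration; not part of this repository.
    """
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    buffer: List[T] = []
    for item in items:
        if len(buffer) < buffer_size:
            # Fill the buffer before emitting anything.
            buffer.append(item)
            continue
        # Emit a random buffered element and put the new item in its slot.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Input exhausted: emit whatever is left, in random order.
    rng.shuffle(buffer)
    yield from buffer


shuffled = list(buffered_shuffle(range(10), buffer_size=4, seed=0))
```

With `buffer_size=1` the input order is preserved. In general, an example can be emitted at most `buffer_size - 1` positions earlier than where it was read, so only a buffer as large as the dataset can produce a fully uniform shuffle.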

#### config.MAX_CONTEXTS = 200
The number of contexts to use in each example.
#### config.WORDS_VOCAB_SIZE = 1301136
#### config.MAX_TOKEN_VOCAB_SIZE = 1301136
The max size of the token vocabulary.
#### config.TARGET_VOCAB_SIZE = 261245
#### config.MAX_TARGET_VOCAB_SIZE = 261245
The max size of the target words vocabulary.
#### config.PATHS_VOCAB_SIZE = 911417
#### config.MAX_PATH_VOCAB_SIZE = 911417
The max size of the path vocabulary.
#### config.EMBEDDINGS_SIZE = 128
Embedding size for tokens and paths.
#### config.DEFAULT_EMBEDDINGS_SIZE = 128
Default embedding size to be used for token and path if not specified otherwise.
#### config.TOKEN_EMBEDDINGS_SIZE = config.EMBEDDINGS_SIZE
Embedding size for tokens.
#### config.PATH_EMBEDDINGS_SIZE = config.EMBEDDINGS_SIZE
Embedding size for paths.
#### config.CODE_VECTOR_SIZE = config.PATH_EMBEDDINGS_SIZE + 2 * config.TOKEN_EMBEDDINGS_SIZE
Size of code vectors.
#### config.TARGET_EMBEDDINGS_SIZE = config.CODE_VECTOR_SIZE
Embedding size for target words.
#### config.MAX_TO_KEEP = 10
Keep this number of newest trained versions during training.
#### config.DROPOUT_KEEP_RATE = 0.75
Dropout keep rate used during training, i.e. the probability of keeping a unit (the drop probability is `1 - DROPOUT_KEEP_RATE`).
#### config.SEPARATE_OOV_AND_PAD = False
Whether to treat `<OOV>` and `<PAD>` as two different special tokens whenever possible.
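
The derived size parameters above can be worked through in a short sketch. It assumes that the token and path embedding sizes fall back to `DEFAULT_EMBEDDINGS_SIZE` when not set explicitly; the function `resolve_sizes` is a hypothetical helper, not part of `config.py`. The code vector size is `PATH_EMBEDDINGS_SIZE + 2 * TOKEN_EMBEDDINGS_SIZE` because each context concatenates a path embedding with the embeddings of its two end tokens.

```python
from typing import Dict, Optional


def resolve_sizes(default_embeddings_size: int = 128,
                  token_embeddings_size: Optional[int] = None,
                  path_embeddings_size: Optional[int] = None) -> Dict[str, int]:
    """Resolve derived sizes (hypothetical helper, for illustration only)."""
    # Assumption: unspecified sizes fall back to the default embedding size.
    token = (token_embeddings_size if token_embeddings_size is not None
             else default_embeddings_size)
    path = (path_embeddings_size if path_embeddings_size is not None
            else default_embeddings_size)
    # One path embedding plus two token embeddings per context.
    code_vector = path + 2 * token
    return {
        "TOKEN_EMBEDDINGS_SIZE": token,
        "PATH_EMBEDDINGS_SIZE": path,
        "CODE_VECTOR_SIZE": code_vector,
        # Target embeddings match the code vector size.
        "TARGET_EMBEDDINGS_SIZE": code_vector,
    }


sizes = resolve_sizes()
```

With the default of 128 this gives a code vector size of 128 + 2 * 128 = 384, which is also the target embedding size.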

## Features
Code2vec supports the following features:

### Choosing implementation to use
This repo comes with two model implementations:
(i) uses pure TensorFlow (written in [tensorflow_model.py](tensorflow_model.py));
(ii) uses TensorFlow's Keras (written in [keras_model.py](keras_model.py)).
The default implementation used by `code2vec.py` is the pure TensorFlow one.
To explicitly choose the desired implementation to use, specify `--framework tensorflow` or `--framework keras`
as an additional argument when executing the script `code2vec.py`.
In particular, this argument can be added to each of the usage examples (of `code2vec.py`) detailed in this file.
Note that in order to load a trained model (from file), one should use the same implementation used during its training.
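
As a rough sketch of how such an option might be declared with `argparse`: the flag name `--framework`, its two values, and the `tensorflow` default come from the description above, while `build_parser` and the `dl_framework` destination name are assumptions for illustration, not the actual code in this repo.

```python
from argparse import ArgumentParser


def build_parser() -> ArgumentParser:
    """Build a parser with only the framework option (illustrative sketch)."""
    parser = ArgumentParser()
    parser.add_argument(
        "--framework", dest="dl_framework",  # dest name is hypothetical
        choices=["tensorflow", "keras"], default="tensorflow",
        help="deep learning framework to use")
    return parser


default_args = build_parser().parse_args([])
keras_args = build_parser().parse_args(["--framework", "keras"])
```

Restricting the values with `choices` makes `argparse` reject any other framework name at parse time, before a model is created.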

### Releasing the model
If you wish to keep a trained model for inference only (without the ability to continue training it) you can
release the model using:
72 changes: 27 additions & 45 deletions code2vec.py
@@ -1,56 +1,38 @@
from common import Config, VocabType
from argparse import ArgumentParser
from vocabularies import VocabType
from config import Config
from interactive_predict import InteractivePredictor
from model import Model
import sys
from model_base import Code2VecModelBase

if __name__ == '__main__':
    parser = ArgumentParser()
    parser.add_argument("-d", "--data", dest="data_path",
                        help="path to preprocessed dataset", required=False)
    parser.add_argument("-te", "--test", dest="test_path",
                        help="path to test file", metavar="FILE", required=False)

    is_training = '--train' in sys.argv or '-tr' in sys.argv
    parser.add_argument("-s", "--save", dest="save_path",
                        help="path to save file", metavar="FILE", required=False)
    parser.add_argument("-w2v", "--save_word2v", dest="save_w2v",
                        help="path to save file", metavar="FILE", required=False)
    parser.add_argument("-t2v", "--save_target2v", dest="save_t2v",
                        help="path to save file", metavar="FILE", required=False)
    parser.add_argument("-l", "--load", dest="load_path",
                        help="path to save file", metavar="FILE", required=False)
    parser.add_argument('--save_w2v', dest='save_w2v', required=False,
                        help="save word (token) vectors in word2vec format")
    parser.add_argument('--save_t2v', dest='save_t2v', required=False,
                        help="save target vectors in word2vec format")
    parser.add_argument('--export_code_vectors', action='store_true', required=False,
                        help="export code vectors for the given examples")
    parser.add_argument('--release', action='store_true',
                        help='if specified and loading a trained model, release the loaded model for a lower model '
                             'size.')
    parser.add_argument('--predict', action='store_true')
    args = parser.parse_args()
def load_model_dynamically(config: Config) -> Code2VecModelBase:
    assert config.DL_FRAMEWORK in {'tensorflow', 'keras'}
    if config.DL_FRAMEWORK == 'tensorflow':
        from tensorflow_model import Code2VecModel
    elif config.DL_FRAMEWORK == 'keras':
        from keras_model import Code2VecModel
    return Code2VecModel(config)


if __name__ == '__main__':
    config = Config(set_defaults=True, load_from_args=True, verify=True)

    config = Config.get_default_config(args)
    model = load_model_dynamically(config)
    config.log('Done creating code2vec model')

    model = Model(config)
    print('Created model')
    if config.TRAIN_PATH:
    if config.is_training:
        model.train()
    if args.save_w2v is not None:
        model.save_word2vec_format(args.save_w2v, source=VocabType.Token)
        print('Origin word vectors saved in word2vec text format in: %s' % args.save_w2v)
    if args.save_t2v is not None:
        model.save_word2vec_format(args.save_t2v, source=VocabType.Target)
        print('Target word vectors saved in word2vec text format in: %s' % args.save_t2v)
    if config.TEST_PATH and not args.data_path:
    if config.SAVE_W2V is not None:
        model.save_word2vec_format(config.SAVE_W2V, VocabType.Token)
        config.log('Origin word vectors saved in word2vec text format in: %s' % config.SAVE_W2V)
    if config.SAVE_T2V is not None:
        model.save_word2vec_format(config.SAVE_T2V, VocabType.Target)
        config.log('Target word vectors saved in word2vec text format in: %s' % config.SAVE_T2V)
    if config.is_testing and not config.is_training:
        eval_results = model.evaluate()
        if eval_results is not None:
            results, precision, recall, f1 = eval_results
            print(results)
            print('Precision: ' + str(precision) + ', recall: ' + str(recall) + ', F1: ' + str(f1))
    if args.predict:
            config.log(
                str(eval_results).replace('topk', 'top{}'.format(config.TOP_K_WORDS_CONSIDERED_DURING_PREDICTION)))
    if config.PREDICT:
        predictor = InteractivePredictor(config, model)
        predictor.predict()
    model.close_session()
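
The new evaluation logging replaces the literal `topk` in the stringified results with the configured K. A minimal sketch of that idea follows; the `EvaluationResults` class and its field names are hypothetical stand-ins, since the real results class is not shown in this diff.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EvaluationResults:
    """Hypothetical stand-in for the model's evaluation results object."""
    topk_acc: List[float]
    subtoken_precision: float
    subtoken_recall: float
    subtoken_f1: float

    def __str__(self) -> str:
        return 'topk_acc: {}, precision: {}, recall: {}, F1: {}'.format(
            self.topk_acc, self.subtoken_precision,
            self.subtoken_recall, self.subtoken_f1)


def format_results(results: EvaluationResults, top_k: int) -> str:
    # Same substitution as in the diff: 'topk' -> 'top<K>'.
    return str(results).replace('topk', 'top{}'.format(top_k))


line = format_results(EvaluationResults([0.5, 0.6], 0.7, 0.6, 0.65), top_k=10)
```

Because the substitution is applied to the whole string, it relies on `topk` appearing only as a metric label in the printed form.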