tech-srl · urialon · Jul 17, 2019 · Mar 13, 2019 · Mar 13, 2019 · Mar 13, 2019
diff --git a/.gitignore b/.gitignore
@@ -1,2 +1,7 @@
 *.class
 *.lst
+**/models/**
+**/data/**
+**/.idea/**
+*.tar.gz
+**/log.txt
diff --git a/PathContextReader.py b/PathContextReader.py
diff --git a/README.md b/README.md
@@ -5,17 +5,20 @@ This is an official implementation of the model described in:
 [Uri Alon](http://urialon.cswp.cs.technion.ac.il), [Meital Zilberstein](http://www.cs.technion.ac.il/~mbs/), [Omer Levy](https://levyomer.wordpress.com) and [Eran Yahav](http://www.cs.technion.ac.il/~yahave/),
 "code2vec: Learning Distributed Representations of Code", POPL'2019 [[PDF]](https://urialon.cswp.cs.technion.ac.il/wp-content/uploads/sites/83/2018/12/code2vec-popl19.pdf)
 
-_**October 2018** - the paper was accepted to [POPL'2019](https://popl19.sigplan.org)_!
+_**October 2018** - The paper was accepted to [POPL'2019](https://popl19.sigplan.org)_!
 
 _**April 2019** - The talk video is available [here](https://www.youtube.com/watch?v=EJ8okcxL2Iw)_.
 
+_**July 2019** - Added a `tf.keras` model implementation (see [here](#choosing-implementation-to-use))._
+
 An **online demo** is available at [https://code2vec.org/](https://code2vec.org/).
 
 This is a TensorFlow implementation, designed to be easy and useful in research, 
 and for experimenting with new ideas in machine learning for code tasks.
 By default, it learns Java source code and predicts Java method names, but it can be easily extended to other languages, 
 since the TensorFlow network is agnostic to the input programming language (see [Extending to other languages](#extending-to-other-languages).
 Contributions are welcome.
+This repo contains two model implementations: the first uses pure TensorFlow and the second uses TensorFlow's Keras API.
 
 <center style="padding: 40px"><img width="70%" src="https://github.com/tech-srl/code2vec/raw/master/images/network.png" /></center>
 
@@ -33,13 +36,18 @@ Table of Contents
 On Ubuntu:
  * [Python3](https://www.linuxbabe.com/ubuntu/install-python-3-6-ubuntu-16-04-16-10-17-04). To check if you have it:
 > python3 --version
- * TensorFlow - version 1.5 or newer ([install](https://www.tensorflow.org/install/install_linux)). To check TensorFlow version:
+ * TensorFlow - version 2.0.0-beta1 ([install](https://www.tensorflow.org/install/install_linux)).
+ To check TensorFlow version:
 > python3 -c 'import tensorflow as tf; print(tf.\_\_version\_\_)'
- * If you are using a GPU, you will need CUDA 9.0 ([download](https://developer.nvidia.com/cuda-90-download-archive)) 
+ * If you are using a GPU, you will need CUDA 10.0
+ ([download](https://developer.nvidia.com/cuda-10.0-download-archive-base)) 
  as this is the version that is currently supported by TensorFlow. To check CUDA version:
 > nvcc --version
- * For GPU: cuDNN (>=7.0) ([download](http://developer.nvidia.com/cudnn))
- * For [creating a new dataset](#creating-and-preprocessing-a-new-java-dataset) or [manually examining a trained model](#step-4-manual-examination-of-a-trained-model) (any operation that requires parsing of a new code example) - [Java JDK](https://openjdk.java.net/install/)
+ * For GPU: cuDNN (>=7.5) ([download](http://developer.nvidia.com/cudnn)). To check cuDNN version:
+> cat /usr/include/cudnn.h | grep CUDNN_MAJOR -A 2
+ * For [creating a new dataset](#creating-and-preprocessing-a-new-java-dataset)
+ or [manually examining a trained model](#step-4-manual-examination-of-a-trained-model)
+ (any operation that requires parsing of a new code example) - [Java JDK](https://openjdk.java.net/install/)
 
 ## Quickstart
 ### Step 0: Cloning this repository
@@ -124,46 +132,74 @@ To manually examine a trained model, run:
 ```
 python3 code2vec.py --load models/java14_model/saved_model_iter8 --predict
 ```
-After the model loads, follow the instructions and edit the file Input.java and enter a Java 
+After the model loads, follow the instructions and edit the file [Input.java](Input.java) and enter a Java 
 method or code snippet, and examine the model's predictions and attention scores.
 
 ## Configuration
-Changing hyper-parameters is possible by editing the file [common.py](common
-.py).
+Changing hyper-parameters is possible by editing the file
+[common.py](common.py).
 
 Here are some of the parameters and their description:
-#### config.NUM_EPOCHS = 20
+#### config.NUM_TRAIN_EPOCHS = 20
 The max number of epochs to train the model. Stopping earlier must be done manually (kill).
 #### config.SAVE_EVERY_EPOCHS = 1
 After how many training iterations a model should be saved.
-#### config.BATCH_SIZE = 1024 
+#### config.TRAIN_BATCH_SIZE = 1024 
 Batch size in training.
-#### config.TEST_BATCH_SIZE = config.BATCH_SIZE
+#### config.TEST_BATCH_SIZE = config.TRAIN_BATCH_SIZE
 Batch size in evaluating. Affects only the evaluation speed and memory consumption, does not affect the results.
-#### config.READING_BATCH_SIZE = 1300 * 4
-The batch size of reading text lines to the queue that feeds examples to the network during training.
-#### config.NUM_BATCHING_THREADS = 2
-The number of threads enqueuing examples.
-#### config.BATCH_QUEUE_SIZE = 300000
-Max number of elements in the feeding queue.
-#### config.DATA_NUM_CONTEXTS = 200
-The number of contexts in a single example, as was created in preprocessing.
+#### config.TOP_K_WORDS_CONSIDERED_DURING_PREDICTION = 10
+Number of words with the highest scores in $\hat{y}$ to consider during prediction and evaluation.
+#### config.NUM_BATCHES_TO_LOG_PROGRESS = 100
+Number of batches (during training / evaluating) to complete between two progress-logging records.
+#### config.NUM_TRAIN_BATCHES_TO_EVALUATE = 100
+Number of training batches to complete between model evaluations on the test set.
+#### config.READER_NUM_PARALLEL_BATCHES = 4
+The number of threads enqueuing examples to the reader queue.
+#### config.SHUFFLE_BUFFER_SIZE = 10000
+Size of the reader's buffer within which examples are shuffled during training.
+A bigger buffer allows better randomness, but requires more memory and may harm training throughput.
+#### config.CSV_BUFFER_SIZE = 100 * 1024 * 1024 # 100 MB
+The buffer size (in bytes) of the CSV dataset reader.
+
 #### config.MAX_CONTEXTS = 200
 The number of contexts to use in each example.
-#### config.WORDS_VOCAB_SIZE = 1301136
+#### config.MAX_TOKEN_VOCAB_SIZE = 1301136
 The max size of the token vocabulary.
-#### config.TARGET_VOCAB_SIZE = 261245
+#### config.MAX_TARGET_VOCAB_SIZE = 261245
 The max size of the target words vocabulary.
-#### config.PATHS_VOCAB_SIZE = 911417
+#### config.MAX_PATH_VOCAB_SIZE = 911417
 The max size of the path vocabulary.
-#### config.EMBEDDINGS_SIZE = 128
-Embedding size for tokens and paths.
+#### config.DEFAULT_EMBEDDINGS_SIZE = 128
+Default embedding size to be used for tokens and paths if not specified otherwise.
+#### config.TOKEN_EMBEDDINGS_SIZE = config.DEFAULT_EMBEDDINGS_SIZE
+Embedding size for tokens.
+#### config.PATH_EMBEDDINGS_SIZE = config.DEFAULT_EMBEDDINGS_SIZE
+Embedding size for paths.
+#### config.CODE_VECTOR_SIZE = config.PATH_EMBEDDINGS_SIZE + 2 * config.TOKEN_EMBEDDINGS_SIZE
+Size of code vectors.
+#### config.TARGET_EMBEDDINGS_SIZE = config.CODE_VECTOR_SIZE
+Embedding size for target words.
 #### config.MAX_TO_KEEP = 10
 Keep this number of newest trained versions during training.
+#### config.DROPOUT_KEEP_RATE = 0.75
+Dropout keep rate (the probability of keeping a unit) used during training.
+#### config.SEPARATE_OOV_AND_PAD = False
+Whether to treat `<OOV>` and `<PAD>` as two different special tokens whenever possible.
 
 ## Features
 Code2vec supports the following features: 
 
+### Choosing implementation to use
+This repo comes with two model implementations:
+(i) uses pure TensorFlow (written in [tensorflow_model.py](tensorflow_model.py));
+(ii) uses TensorFlow's Keras (written in [keras_model.py](keras_model.py)).
+The default implementation used by `code2vec.py` is the pure TensorFlow one.
+To explicitly choose the desired implementation to use, specify `--framework tensorflow` or `--framework keras`
+as an additional argument when executing the script `code2vec.py`.
+In particular, this argument can be added to any of the usage examples (of `code2vec.py`) detailed in this file.
+Note that in order to load a trained model (from file), one should use the same implementation used during its training.
+
 ### Releasing the model
 If you wish to keep a trained model for inference only (without the ability to continue training it) you can
 release the model using:

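Reviewer note on the README hunk above: the documented relations between the embedding sizes can be checked with a small sketch. The class `EmbeddingSizes` below is hypothetical and is not the repo's `Config`; it only mirrors the defaults and formulas listed in the README, assuming that every derived value falls back to `DEFAULT_EMBEDDINGS_SIZE` when nothing else is specified.

```python
from dataclasses import dataclass


@dataclass
class EmbeddingSizes:
    """Hypothetical stand-in for the size-related fields documented in the README."""
    default_embeddings_size: int = 128

    @property
    def token_embeddings_size(self) -> int:
        # Falls back to the default size, as documented.
        return self.default_embeddings_size

    @property
    def path_embeddings_size(self) -> int:
        # Falls back to the default size, as documented.
        return self.default_embeddings_size

    @property
    def code_vector_size(self) -> int:
        # README formula: one path embedding plus two token embeddings.
        return self.path_embeddings_size + 2 * self.token_embeddings_size

    @property
    def target_embeddings_size(self) -> int:
        # Target embeddings share the dimension of the code vector.
        return self.code_vector_size


sizes = EmbeddingSizes()
print(sizes.code_vector_size)        # 128 + 2 * 128 = 384
print(sizes.target_embeddings_size)  # 384
```

With the default of 128, the code vector and the target embeddings both come out at 384 dimensions, so changing the default size scales all four values together.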
diff --git a/code2vec.py b/code2vec.py
@@ -1,56 +1,38 @@
-from common import Config, VocabType
-from argparse import ArgumentParser
+from vocabularies import VocabType
+from config import Config
 from interactive_predict import InteractivePredictor
-from model import Model
-import sys
+from model_base import Code2VecModelBase
 
-if __name__ == '__main__':
-    parser = ArgumentParser()
-    parser.add_argument("-d", "--data", dest="data_path",
-                        help="path to preprocessed dataset", required=False)
-    parser.add_argument("-te", "--test", dest="test_path",
-                        help="path to test file", metavar="FILE", required=False)
 
-    is_training = '--train' in sys.argv or '-tr' in sys.argv
-    parser.add_argument("-s", "--save", dest="save_path",
-                        help="path to save file", metavar="FILE", required=False)
-    parser.add_argument("-w2v", "--save_word2v", dest="save_w2v",
-                        help="path to save file", metavar="FILE", required=False)
-    parser.add_argument("-t2v", "--save_target2v", dest="save_t2v",
-                        help="path to save file", metavar="FILE", required=False)
-    parser.add_argument("-l", "--load", dest="load_path",
-                        help="path to save file", metavar="FILE", required=False)
-    parser.add_argument('--save_w2v', dest='save_w2v', required=False,
-                        help="save word (token) vectors in word2vec format")
-    parser.add_argument('--save_t2v', dest='save_t2v', required=False,
-                        help="save target vectors in word2vec format")
-    parser.add_argument('--export_code_vectors', action='store_true', required=False,
-                        help="export code vectors for the given examples")
-    parser.add_argument('--release', action='store_true',
-                        help='if specified and loading a trained model, release the loaded model for a lower model '
-                             'size.')
-    parser.add_argument('--predict', action='store_true')
-    args = parser.parse_args()
+def load_model_dynamically(config: Config) -> Code2VecModelBase:
+    assert config.DL_FRAMEWORK in {'tensorflow', 'keras'}
+    if config.DL_FRAMEWORK == 'tensorflow':
+        from tensorflow_model import Code2VecModel
+    elif config.DL_FRAMEWORK == 'keras':
+        from keras_model import Code2VecModel
+    return Code2VecModel(config)
+
+
+if __name__ == '__main__':
+    config = Config(set_defaults=True, load_from_args=True, verify=True)
 
-    config = Config.get_default_config(args)
+    model = load_model_dynamically(config)
+    config.log('Done creating code2vec model')
 
-    model = Model(config)
-    print('Created model')
-    if config.TRAIN_PATH:
+    if config.is_training:
         model.train()
-    if args.save_w2v is not None:
-        model.save_word2vec_format(args.save_w2v, source=VocabType.Token)
-        print('Origin word vectors saved in word2vec text format in: %s' % args.save_w2v)
-    if args.save_t2v is not None:
-        model.save_word2vec_format(args.save_t2v, source=VocabType.Target)
-        print('Target word vectors saved in word2vec text format in: %s' % args.save_t2v)
-    if config.TEST_PATH and not args.data_path:
+    if config.SAVE_W2V is not None:
+        model.save_word2vec_format(config.SAVE_W2V, VocabType.Token)
+        config.log('Origin word vectors saved in word2vec text format in: %s' % config.SAVE_W2V)
+    if config.SAVE_T2V is not None:
+        model.save_word2vec_format(config.SAVE_T2V, VocabType.Target)
+        config.log('Target word vectors saved in word2vec text format in: %s' % config.SAVE_T2V)
+    if config.is_testing and not config.is_training:
         eval_results = model.evaluate()
         if eval_results is not None:
-            results, precision, recall, f1 = eval_results
-            print(results)
-            print('Precision: ' + str(precision) + ', recall: ' + str(recall) + ', F1: ' + str(f1))
-    if args.predict:
+            config.log(
+                str(eval_results).replace('topk', 'top{}'.format(config.TOP_K_WORDS_CONSIDERED_DURING_PREDICTION)))
+    if config.PREDICT:
         predictor = InteractivePredictor(config, model)
         predictor.predict()
     model.close_session()
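
Reviewer note on `load_model_dynamically`: the framework dispatch can be illustrated without TensorFlow installed. The sketch below is not the repo's code. `FakeConfig`, `FakeTensorflowModel`, `FakeKerasModel`, and the `LOADERS` table are hypothetical stand-ins, and the lazy imports of `tensorflow_model` / `keras_model` are replaced by plain constructors. It also swaps the `assert` for an explicit `ValueError`, since assertions are skipped when Python runs with `-O`.

```python
from typing import Callable, Dict


class FakeConfig:
    """Hypothetical stand-in for the repo's Config; only DL_FRAMEWORK is modeled."""

    def __init__(self, dl_framework: str):
        self.DL_FRAMEWORK = dl_framework


class FakeTensorflowModel:
    """Placeholder for the pure-TensorFlow implementation."""
    backend = 'tensorflow'

    def __init__(self, config: FakeConfig):
        self.config = config


class FakeKerasModel:
    """Placeholder for the tf.keras implementation."""
    backend = 'keras'

    def __init__(self, config: FakeConfig):
        self.config = config


def _load_tensorflow(config: FakeConfig):
    # In the real script, the backend module would be imported here, lazily,
    # so that only the selected implementation is loaded.
    return FakeTensorflowModel(config)


def _load_keras(config: FakeConfig):
    return FakeKerasModel(config)


# Maps the framework name to a loader function.
LOADERS: Dict[str, Callable[[FakeConfig], object]] = {
    'tensorflow': _load_tensorflow,
    'keras': _load_keras,
}


def load_model_dynamically(config: FakeConfig):
    """Return a model instance for the framework named in the config."""
    try:
        loader = LOADERS[config.DL_FRAMEWORK]
    except KeyError:
        raise ValueError('Unknown framework: %r' % config.DL_FRAMEWORK)
    return loader(config)


model = load_model_dynamically(FakeConfig('keras'))
print(model.backend)  # keras
```

A lookup table keeps the set of valid framework names and the loading logic in one place, which may be easier to extend than an `if`/`elif` chain if further implementations are added.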