Hello TensorFlow community,
I’m running into steady memory growth when using TensorFlow for a multi-round training process. Specifically, I have a model training loop in which I generate new training and evaluation data in each round, and memory usage keeps growing until I eventually hit out-of-memory errors. I’m trying to understand how I can effectively release memory between these iterations.
Here is a simplified version of my code:

```python
import gc

for num_round in range(1, 1 + total_num_round):
    # Generate fresh training and evaluation data for this round
    train_data = generate_all_batch_s_path_samples(s_0_, net_list_c, batch_size, epochs_t + 1)
    eval_data = generate_all_batch_s_path_samples(s_0_, net_list_c, batch_size, eval_num_batch)

    # ... train and evaluate process ...

    # delete used data
    del train_data, eval_data
    gc.collect()
```

Issues I’m Facing:
- The `train_data` and `eval_data` generated in each round occupy a lot of memory, and I cannot seem to release this memory effectively, leading to continuous memory growth.
- I have tried several approaches to control memory usage:
  - Using `assign()` instead of repeatedly defining `train_data` and `eval_data` (see the sketch after this list).
  - Using `gc.collect()` and `del train_data, eval_data` to free up memory, but these methods did not work.
  - Using `tf.keras.backend.clear_session()` between rounds.
- The function `generate_all_batch_s_path_samples` is not decorated with `tf.function` because it uses threading for parallel computation, which makes it incompatible with `tf.function`.
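For reference, here is roughly how the `assign()` attempt looked (a minimal sketch; `train_shape` and `eval_shape` are placeholders for the actual tensor shapes):

```python
import tensorflow as tf

# Pre-allocate buffers once so each round reuses the same memory
# instead of allocating new tensors (shapes/dtypes are placeholders)
train_buf = tf.Variable(tf.zeros(train_shape, dtype=tf.float32), trainable=False)
eval_buf = tf.Variable(tf.zeros(eval_shape, dtype=tf.float32), trainable=False)

for num_round in range(1, 1 + total_num_round):
    # assign() overwrites the existing variable buffers in place
    train_buf.assign(generate_all_batch_s_path_samples(s_0_, net_list_c, batch_size, epochs_t + 1))
    eval_buf.assign(generate_all_batch_s_path_samples(s_0_, net_list_c, batch_size, eval_num_batch))

    # ... train and evaluate using train_buf / eval_buf ...
```

Even with this, the process's memory footprint still grew from round to round.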
Questions:
- Is there a more effective way to release memory between iterations, besides using `tf.keras.backend.clear_session()`? (My current per-round cleanup is sketched right after this list.)
- Is there a recommended approach to managing memory growth in multi-round training scenarios like this?
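To make the first question concrete, the most aggressive per-round cleanup I have tried looks roughly like this (a sketch of the same loop as above, with the training and evaluation steps elided):

```python
import gc
import tensorflow as tf

for num_round in range(1, 1 + total_num_round):
    train_data = generate_all_batch_s_path_samples(s_0_, net_list_c, batch_size, epochs_t + 1)
    eval_data = generate_all_batch_s_path_samples(s_0_, net_list_c, batch_size, eval_num_batch)

    # ... train and evaluate ...

    # Drop the Python references, force a collection pass, and reset
    # Keras's global state; memory still grows despite all three
    del train_data, eval_data
    gc.collect()
    tf.keras.backend.clear_session()
```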
Any advice, suggestions, or code examples would be greatly appreciated! Thank you all in advance for your help.
Context:
- I’m using TensorFlow 2.16.0.
- The data generation process (`generate_all_batch_s_path_samples`) creates new tensors for training and evaluation in each round; a rough skeleton is sketched below.
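In case the structure matters, the generator is shaped roughly like this (a simplified sketch; `simulate_one_batch` is a hypothetical stand-in for the actual path-simulation logic):

```python
import threading
import tensorflow as tf

def simulate_one_batch(s_0, net_list, batch_size):
    # Placeholder for the real per-batch path simulation
    return tf.random.normal([batch_size, len(net_list)])

def generate_all_batch_s_path_samples(s_0, net_list, batch_size, num_batch):
    results = [None] * num_batch

    def worker(i):
        results[i] = simulate_one_batch(s_0, net_list, batch_size)

    # Python-level threads drive the parallelism, which is why this
    # function cannot be wrapped in tf.function
    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_batch)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Stacking produces brand-new tensors every round
    return tf.stack(results)
```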
Thanks again for your support!
