Hello TensorFlow community,
I’m running into steady memory growth when using TensorFlow for a multi-round training process. Specifically, I have a model training loop in which I generate new training and evaluation data in each round, and memory usage keeps growing until I eventually hit out-of-memory errors. I’m trying to understand how I can effectively release memory between these iterations.
Here is a simplified version of my code:

```python
import gc

for num_round in range(1, 1 + total_num_round):
    # Generate fresh training and evaluation data for this round
    train_data = generate_all_batch_s_path_samples(s_0_, net_list_c, batch_size, epochs_t + 1)
    eval_data = generate_all_batch_s_path_samples(s_0_, net_list_c, batch_size, eval_num_batch)

    # ... train and evaluate process ...

    # delete used data
    del train_data, eval_data
    gc.collect()
```

Issues I’m Facing:
- The `train_data` and `eval_data` generated in each round occupy a lot of memory, and I cannot seem to release this memory effectively, leading to continuous memory growth.
- I have tried several approaches to control memory usage:
  - Using `assign()` instead of repeatedly defining `train_data` and `eval_data` (see the sketch after this list).
  - Using `gc.collect()` and `del train_data, eval_data` to free up memory, but these methods did not work.
  - Using `tf.keras.backend.clear_session()` between rounds.
- The function `generate_all_batch_s_path_samples` is not decorated with `tf.function` because it uses threading for parallel computation, which makes it incompatible with `tf.function`.
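For reference, here is roughly how the `assign()` attempt looked (a minimal sketch; `train_shape` and `eval_shape` are placeholders for the actual tensor shapes):

```python
import tensorflow as tf

# Pre-allocate buffers once so each round reuses the same memory
# instead of allocating new tensors (shapes/dtypes are placeholders)
train_buf = tf.Variable(tf.zeros(train_shape, dtype=tf.float32), trainable=False)
eval_buf = tf.Variable(tf.zeros(eval_shape, dtype=tf.float32), trainable=False)

for num_round in range(1, 1 + total_num_round):
    # assign() overwrites the existing variable buffers in place
    train_buf.assign(generate_all_batch_s_path_samples(s_0_, net_list_c, batch_size, epochs_t + 1))
    eval_buf.assign(generate_all_batch_s_path_samples(s_0_, net_list_c, batch_size, eval_num_batch))

    # ... train and evaluate using train_buf / eval_buf ...
```

Even with this, the process's memory footprint still grew from round to round.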
Questions:
- Is there a more effective way to release memory between iterations, besides using `tf.keras.backend.clear_session()`? (My current per-round cleanup is sketched right after this list.)
- Is there a recommended approach to managing memory growth in multi-round training scenarios like this?
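To make the first question concrete, the most aggressive per-round cleanup I have tried looks roughly like this (a sketch of the same loop as above, with the training and evaluation steps elided):

```python
import gc
import tensorflow as tf

for num_round in range(1, 1 + total_num_round):
    train_data = generate_all_batch_s_path_samples(s_0_, net_list_c, batch_size, epochs_t + 1)
    eval_data = generate_all_batch_s_path_samples(s_0_, net_list_c, batch_size, eval_num_batch)

    # ... train and evaluate ...

    # Drop the Python references, force a collection pass, and reset
    # Keras's global state; memory still grows despite all three
    del train_data, eval_data
    gc.collect()
    tf.keras.backend.clear_session()
```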
Any advice, suggestions, or code examples would be greatly appreciated! Thank you all in advance for your help.
Context:
- I’m using TensorFlow 2.16.0.
- The data generation process (`generate_all_batch_s_path_samples`) creates new tensors for training and evaluation in each round; a rough skeleton is sketched below.
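In case the structure matters, the generator is shaped roughly like this (a simplified sketch; `simulate_one_batch` is a hypothetical stand-in for the actual path-simulation logic):

```python
import threading
import tensorflow as tf

def simulate_one_batch(s_0, net_list, batch_size):
    # Placeholder for the real per-batch path simulation
    return tf.random.normal([batch_size, len(net_list)])

def generate_all_batch_s_path_samples(s_0, net_list, batch_size, num_batch):
    results = [None] * num_batch

    def worker(i):
        results[i] = simulate_one_batch(s_0, net_list, batch_size)

    # Python-level threads drive the parallelism, which is why this
    # function cannot be wrapped in tf.function
    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_batch)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Stacking produces brand-new tensors every round
    return tf.stack(results)
```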
Thanks again for your support!
