Reduce memory usage of forced alignment on CPU #3787
In the forced alignment C++ code, `backPtr` is an `int8` tensor, yet it only stores the values 0, 1, and 2, which fit in 2 bits instead of 8. Because the size of the `backPtr` tensor is `log_probs_len * (targets_length * 2 + 1)`, it can grow to unmanageable sizes for audio files that exceed 2 hours.

By using two `std::vector<bool>` to represent the two bits needed for `backPtr`, we guarantee that the results are exactly the same while lowering memory usage, since `std::vector<bool>` should use 1 bit to represent a boolean.

Best case scenario, memory usage drops to 25%; worst case scenario, memory usage doubles if a boolean is represented using 1 byte.
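To illustrate the idea, here is a minimal sketch of a 2-bit back-pointer store built from two `std::vector<bool>` bit planes. The class name `TwoBitBackPtr` and its `set`/`get` interface are hypothetical and are not taken from the actual change; the sketch only assumes a dense `rows x cols` layout where `rows` stands for `log_probs_len` and `cols` for `targets_length * 2 + 1`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not the actual code in this PR): stores the values
// 0, 1, and 2 using one bit in each of two std::vector<bool> planes.
class TwoBitBackPtr {
 public:
  TwoBitBackPtr(std::size_t rows, std::size_t cols)
      : cols_(cols), low_(rows * cols, false), high_(rows * cols, false) {}

  // Split the value into its low and high bit and store each separately.
  void set(std::size_t t, std::size_t s, std::uint8_t value) {
    assert(value <= 2);
    const std::size_t i = t * cols_ + s;
    low_[i] = (value & 1) != 0;
    high_[i] = (value & 2) != 0;
  }

  // Recombine the two bits into the original value.
  std::uint8_t get(std::size_t t, std::size_t s) const {
    const std::size_t i = t * cols_ + s;
    return static_cast<std::uint8_t>((high_[i] ? 2 : 0) | (low_[i] ? 1 : 0));
  }

 private:
  std::size_t cols_;
  std::vector<bool> low_;   // bit 0 of every entry
  std::vector<bool> high_;  // bit 1 of every entry
};
```

Because every entry round-trips through the same two bits, reads return exactly what was written, which is why the alignment results should not change.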
From my experiments, the new code can handle longer audio files without running out of memory. I also noticed that on average, only `1 - targets_length/log_probs_length` of the `backPtr` array is used (depending on the inputs), so further memory savings can be gained by using a shape that reduces unused elements.

Edit:
I implemented a better structure for the `backPtr` tensor that still uses two boolean vectors, but the number of elements is greatly reduced to achieve better memory efficiency.

The new structure is similar to a sparse matrix or a list of lists: instead of initializing a complete trellis matrix, we initialize only the elements that are going to be used, whose count is approximated by the formula in the code (deduced empirically and tested thoroughly). We also create two new arrays for indexing purposes.
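As a rough illustration of the list-of-lists layout, the sketch below stores only a caller-supplied range of states per frame and uses two index arrays to locate an entry. Everything here is an assumption for illustration: the class name `RaggedBackPtr`, the `ranges` parameter, and the choice of the two index arrays (row offsets and first stored state per row) are hypothetical, and the sketch does not reproduce the sizing formula from the actual code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical ragged store (not the actual code in this PR).
// ranges[t] = {first, last} is the half-open range of states kept for frame t.
class RaggedBackPtr {
 public:
  explicit RaggedBackPtr(
      const std::vector<std::pair<std::size_t, std::size_t>>& ranges) {
    std::size_t total = 0;
    for (const auto& r : ranges) {
      assert(r.first <= r.second);
      offsets_.push_back(total);   // where row t begins in the flat storage
      starts_.push_back(r.first);  // first state stored for row t
      total += r.second - r.first;
    }
    offsets_.push_back(total);  // sentinel marking the end of the last row
    low_.assign(total, false);
    high_.assign(total, false);
  }

  void set(std::size_t t, std::size_t s, std::uint8_t value) {
    assert(value <= 2);
    const std::size_t i = index(t, s);
    low_[i] = (value & 1) != 0;
    high_[i] = (value & 2) != 0;
  }

  std::uint8_t get(std::size_t t, std::size_t s) const {
    const std::size_t i = index(t, s);
    return static_cast<std::uint8_t>((high_[i] ? 2 : 0) | (low_[i] ? 1 : 0));
  }

  // Number of 2-bit entries actually allocated.
  std::size_t size() const { return low_.size(); }

 private:
  // Map (frame, state) to a position in the flat bit vectors.
  std::size_t index(std::size_t t, std::size_t s) const {
    assert(t + 1 < offsets_.size());
    assert(s >= starts_[t]);
    const std::size_t i = offsets_[t] + (s - starts_[t]);
    assert(i < offsets_[t + 1]);
    return i;
  }

  std::vector<std::size_t> offsets_;  // index array 1: row offsets
  std::vector<std::size_t> starts_;   // index array 2: first state per row
  std::vector<bool> low_;
  std::vector<bool> high_;
};
```

With per-frame ranges of `[0, 2)`, `[0, 4)`, and `[2, 5)`, for example, this allocates 9 entries instead of the 15 a dense 3 x 5 trellis would need. The cost is one extra lookup in each index array per access.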