@mmaciej2
Contributor

@mmaciej2 mmaciej2 commented Feb 1, 2018

Using the "int" function instead of proper rounding while creating the segment ID could lead to cases where two segments would erroneously receive the same ID.

@danpovey
Contributor

danpovey commented Feb 1, 2018

Can you give an example of the problem? I have a hard time seeing how this could happen.

@mmaciej2
Contributor Author

mmaciej2 commented Feb 1, 2018

@danpovey The problem happens when the start_time read out of the segments file is slightly smaller than its value rounded to two decimal places (I assume because of some kind of machine-precision issue). For example, a sub-segment listed as running from something like 0.99999 to 2.9999 will produce a segments file entry like this, where the ID is truncated but the times are rounded:
utt_000099_000299 1.00 3.00
If there is a segment going from 0.99 to 2.99, it will receive the same ID, i.e.
utt_000099_000299 0.99 2.99

I think there's an argument that you should not be trying to produce subsegments that are only one frame apart, so the ID collision isn't an issue, but nevertheless the code in the master branch produces IDs that are inconsistent with the time marks by 0.01 seconds.
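
To make the mismatch concrete, here is a minimal Python sketch of the two behaviours. The helper `segment_id`, the `utt` prefix, and the `%06d` ID layout are assumptions chosen to mirror the example lines above; this is not the actual code from the script being patched.

```python
def segment_id(utt, start, end, use_round):
    """Build a hypothetical segment ID from times given in seconds.

    The ID encodes start and end in hundredths of a second. With
    use_round=False the value is truncated via int(); with
    use_round=True it is rounded to the nearest hundredth first.
    """
    if use_round:
        to_centisec = lambda t: int(round(t * 100))
    else:
        to_centisec = lambda t: int(t * 100)
    return "%s_%06d_%06d" % (utt, to_centisec(start), to_centisec(end))


# The times written to the segments file are formatted to two decimals,
# so 0.99999 is printed as 1.00 and 2.9999 as 3.00.
noisy = (0.99999, 2.9999)
clean = (0.99, 2.99)

# Truncation: both segments get the same ID, and the noisy segment's ID
# disagrees with its printed times by 0.01 s.
truncated_noisy = segment_id("utt", noisy[0], noisy[1], use_round=False)
truncated_clean = segment_id("utt", clean[0], clean[1], use_round=False)

# Rounding: the IDs differ and each matches its printed times.
rounded_noisy = segment_id("utt", noisy[0], noisy[1], use_round=True)
rounded_clean = segment_id("utt", clean[0], clean[1], use_round=True)

printed_noisy = "%.2f %.2f" % noisy
```

With truncation both calls yield `utt_000099_000299`; with rounding the first segment becomes `utt_000100_000300`, which agrees with the printed times `1.00 3.00`.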

If you would like me to point you to a particular segments file and set of parameters that have this problem, I can set that up as well.

@danpovey danpovey merged commit c82560d into kaldi-asr:master Feb 1, 2018
@mmaciej2 mmaciej2 deleted the subsegmentation-fix branch February 1, 2018 23:16