This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Created on 2009-07-30 00:32 by mwh, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
make_msgid.diff ggenellina, 2009-11-15 10:12
make_msgid_2.diff serhiy.storchaka, 2015-05-17 12:19 review
Messages (18)
msg91073 - (view) Author: Michael Hudson (mwh) (Python committer) Date: 2009-07-30 00:32
If you call email.utils.make_msgid a number of times within the same second, the uniqueness of the results depends on random.randint(100000) returning different values each time. A little mathematics proves that you don't have to call make_msgid *that* often to get the same message id twice: if you call it 'n' times, the probability of a collision is approximately "1 - math.exp(-n*(n-1)/200000.0)", and for n == 100, that's about 5%. For n == 1000, it's over 99%. These numbers are born out by experiment: >>> def collisions(n): ... msgids = [make_msgid() for i in range(n)] ... return len(msgids) - len(set(msgids)) ... >>> sum((collisions(100)>0) for i in range(1000)) 49 >>> sum((collisions(1000)>0) for i in range(1000)) 991 I think probably having a counter in addition to the randomness would be a good fix for the problem, though I guess then you have to worry about thread safety.
msg91074 - (view) Author: Michael Hudson (mwh) (Python committer) Date: 2009-07-30 00:34
A higher resolution timer would also help, of course. (Thanks to James Knight for the prod).
msg91251 - (view) Author: Gabriel Genellina (ggenellina) Date: 2009-08-04 04:30
This patch replaces the random part with an increasing sequence (in a thread safe way). Also, added a test case for make_msgid (there was none previously)
msg91378 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-08-06 18:18
Is it ok if the message id is predictable? Besides, _gen_next_number() can more efficiently be written as: _gen_next_number = itertools.cycle(xrange(N))
msg91410 - (view) Author: Gabriel Genellina (ggenellina) Date: 2009-08-07 19:34
--- El jue 6-ago-09, Antoine Pitrou <report@bugs.python.org> escribió: > Antoine Pitrou <pitrou@free.fr> > added the comment: > > Is it ok if the message id is predictable? I don't know of any use of message ids apart from uniquely identifying the message, but we could still keep a random part followed by the sequence. > Besides, _gen_next_number() can more efficiently be written > as: > > _gen_next_number = itertools.cycle(xrange(N)) Written that way it will eventually hold a list with 100000 integers forever. Yahoo! Cocina Encontra las mejores recetas con Yahoo! Cocina. http://ar.mujer.yahoo.com/cocina/
msg95281 - (view) Author: Gabriel Genellina (ggenellina) Date: 2009-11-15 10:12
An updated version of the make_msgid patch. Includes a random part plus a sequential part. Testing takes at most 3 seconds now (we're interested in those msgids generated in a whole second)
msg236504 - (view) Author: Mark Lawrence (BreamoreBoy) * Date: 2015-02-24 15:18
Can we have a patch review please. If nothing else xrange will have to change for Python 3.
msg236527 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-02-24 17:51
The probability of a collision when generated n numbers from 0 to m-1 is 1 - factorial(m)/factorial(m-n)/m**n When n << sqrt(m), it is approximately equal to n**2/(2*m). When m = 1000000, we have problems for n about 1000. When increase m to 2**32, the probability of collisions is decreased to 0.01%. When increase m to 2**64 as recommended in [1], it is decreased to 2.7e-14. I.e. when generate 1000 IDs per second, collisions happen one per a million years. We could also take datetime with larger precision. E.g with 2 digits after the dot, as recommended in [1]. When apply both changes, we could generate 100000 IDs per second with one collision per 10000 years or 10000 IDs per second with one collision per 1000000 years. [1] https://tools.ietf.org/html/draft-ietf-usefor-message-id-01
msg243394 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-05-17 12:19
Here is a patch that changes the generating of message IDs: 1. The datetime is taken with higher precision, 2 decimal digits after dot. 2. The datetime is written as Unix time in 0.01 second units, not as YYYYmmddHHMMSS. This is more compact and faster. 3. The random part is taken as the random integer in the range 0 from to 2**64, not from 0 to 10**5. This increases the length of generated part of the ID from 26 to 39 characters in average. Perhaps in new releases we can use non-decimal or even non-alphanumeric characters in ID for compactness.
msg243398 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2015-05-17 13:15
Looking at my mail store, it looks like one more more mail clients use a GUID for the messageid, which is 36 characters. So 39 doesn't seem so bad. There's even one that is 74 characters long. There are also shorter ones, though, so compactifying the result wouldn't be a bad thing.
msg243411 - (view) Author: Barry A. Warsaw (barry) * (Python committer) Date: 2015-05-17 16:03
An increase of 13 characters doesn't seem so bad.
msg243417 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-05-17 17:46
Currently the maximal length of local part is 14+1+len(str(2**16))+1+len(str(10**5-1)) == 26 With the patch it will be len(str(2**31*100))+1+len(str(2**16))+1+len(str(2**64)) = 39 If encode all three numbers with hexadecimal, it will be len('%x'%(2**31*100))+1+len('%x'%(2**16))+1+len('%x'%(2**64)) = 34 If encode their with base32: math.ceil((31+math.log(100)/math.log(2))/5)+1+math.ceil(16/5)+1+math.ceil(64/5) = 27 Using base64 or base85 can be not safe, because message ID can be used as file name in case-insensitive file system.
msg243561 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2015-05-19 07:11
New changeset 6969bac411fa by Serhiy Storchaka in branch '2.7': Issue #6598: Increased time precision and random number range in https://hg.python.org/cpython/rev/6969bac411fa New changeset ea878f847eee by Serhiy Storchaka in branch '3.4': Issue #6598: Increased time precision and random number range in https://hg.python.org/cpython/rev/ea878f847eee New changeset 933addbc7041 by Serhiy Storchaka in branch 'default': Issue #6598: Increased time precision and random number range in https://hg.python.org/cpython/rev/933addbc7041
msg243562 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-05-19 07:23
Now there is a question. Is it worth to use base16 (hexadecimal) to compact message id to 34 characters or base32 to compact it to 27 characters? Using base16 is pretty easy: just replace %d with %x. Using base32 is more complicated. With base64 and base85 the length can be decreased to 23 and 21, but I doubt that this is safe.
msg243585 - (view) Author: Barry A. Warsaw (barry) * (Python committer) Date: 2015-05-19 12:33
On May 19, 2015, at 07:23 AM, Serhiy Storchaka wrote: >Now there is a question. Is it worth to use base16 (hexadecimal) to compact >message id to 34 characters or base32 to compact it to 27 characters? Using >base16 is pretty easy: just replace %d with %x. Using base32 is more >complicated. It might not be relevant, but we're using base 32 in Message-ID-Hash: http://wiki.list.org/DEV/Stable%20URLs We settled on base 32 after consulting with the curators of mail-archive.com and others.
msg254458 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-11-10 17:43
http://buildbot.python.org/all/builders/x86%20Tiger%202.7/builds/3246/steps/test/logs/stdio ====================================================================== FAIL: test_make_msgid_collisions (email.test.test_email.TestMiscellaneous) ---------------------------------------------------------------------- Traceback (most recent call last): File "/Users/db3l/buildarea/2.7.bolen-tiger/build/Lib/email/test/test_email.py", line 2434, in test_make_msgid_collisions pass File "/Users/db3l/buildarea/2.7.bolen-tiger/build/Lib/contextlib.py", line 24, in __exit__ self.gen.next() File "/Users/db3l/buildarea/2.7.bolen-tiger/build/Lib/test/test_support.py", line 1570, in start_threads raise AssertionError('Unable to join %d threads' % len(started)) AssertionError: Unable to join 5 threads ---------------------------------------------------------------------- Threads that should be finished for 3 seconds can't be finished for 15 minutes. It looks as clock() wrapped around. From clock (3) manpage: Note that the time can wrap around. On a 32-bit system where CLOCKS_PER_SEC equals 1000000 this function will return the same value approximately every 72 minutes. The solution is to use monotonic() (time() on older releases) instead of clock().
msg254461 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2015-11-10 17:54
New changeset 3c0a817ab616 by Serhiy Storchaka in branch '3.4': Issue #6598: Avoid clock wrapping around in test_make_msgid_collisions. https://hg.python.org/cpython/rev/3c0a817ab616 New changeset e259c0ab7a69 by Serhiy Storchaka in branch '3.5': Issue #6598: Avoid clock wrapping around in test_make_msgid_collisions. https://hg.python.org/cpython/rev/e259c0ab7a69 New changeset d1c11a78b43a by Serhiy Storchaka in branch 'default': Issue #6598: Avoid clock wrapping around in test_make_msgid_collisions. https://hg.python.org/cpython/rev/d1c11a78b43a New changeset bdd257d60da2 by Serhiy Storchaka in branch '2.7': Issue #6598: Avoid clock wrapping around in test_make_msgid_collisions. https://hg.python.org/cpython/rev/bdd257d60da2
msg254600 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-11-13 11:57
Perhaps will open a separate issue for more compact non-decimal message ID.
History
Date User Action Args
2022-04-11 14:56:51adminsetgithub: 50847
2015-11-13 11:57:05serhiy.storchakasetstatus: open -> closed
resolution: fixed
messages: + msg254600

stage: patch review -> resolved
2015-11-10 17:54:15python-devsetmessages: + msg254461
2015-11-10 17:43:49serhiy.storchakasetmessages: + msg254458
2015-05-19 12:33:37barrysetmessages: + msg243585
2015-05-19 07:23:10serhiy.storchakasetmessages: + msg243562
2015-05-19 07:11:08python-devsetnosy: + python-dev
messages: + msg243561
2015-05-19 07:04:32serhiy.storchakasetassignee: serhiy.storchaka
2015-05-17 17:46:32serhiy.storchakasetmessages: + msg243417
2015-05-17 16:03:48barrysetmessages: + msg243411
2015-05-17 13:15:45r.david.murraysetmessages: + msg243398
2015-05-17 12:19:31serhiy.storchakasetfiles: + make_msgid_2.diff

nosy: + barry
messages: + msg243394

stage: patch review
2015-02-24 17:52:00serhiy.storchakasetnosy: + serhiy.storchaka
messages: + msg236527
2015-02-24 15:18:59BreamoreBoysetnosy: + BreamoreBoy

messages: + msg236504
versions: + Python 3.4, Python 3.5, - Python 3.1, Python 3.2
2014-02-28 20:27:27varunsetnosy: + varun
2010-11-30 05:42:21eric.araujosetnosy: + r.david.murray

versions: - Python 2.6
2009-11-15 10:12:49ggenellinasetfiles: - utils.diff
2009-11-15 10:12:43ggenellinasetfiles: - test_email.diff
2009-11-15 10:12:29ggenellinasetfiles: + make_msgid.diff

messages: + msg95281
2009-08-07 19:59:53ggenellinasetfiles: + utils.diff
2009-08-07 19:59:18ggenellinasetfiles: - utils.diff
2009-08-07 19:34:15ggenellinasetmessages: + msg91410
2009-08-06 18:18:59pitrousetnosy: + pitrou
messages: + msg91378
2009-08-04 04:32:45ggenellinasetversions: - Python 2.5, Python 2.4, 3rd party, Python 3.0
2009-08-04 04:31:30ggenellinasetfiles: + test_email.diff
2009-08-04 04:30:54ggenellinasetfiles: + utils.diff

nosy: + ggenellina
messages: + msg91251

keywords: + patch
2009-07-30 08:21:31allenapsetnosy: + allenap
2009-07-30 00:34:46mwhsetmessages: + msg91074
2009-07-30 00:32:15mwhcreate