@Erotemic Erotemic commented Apr 7, 2023

I'm running into an issue when trying to use jsonschema with concurrent.futures to validate against multiple schemas in parallel. It complains that it can't pickle a lambda function. Changing this lambda to a globally defined function fixes the issue.

I'm currently pretty tight on time, so this PR contains only the change, without a test or MWE to demonstrate the issue, but I think it's simple enough and the rationale is reasonable.
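For anyone hitting something similar: the underlying pickle behavior is easy to reproduce without jsonschema at all. This is a minimal stdlib-only sketch; `is_string` is a made-up stand-in for the kind of callable the patch renames.

```python
import pickle

def is_string(checker, instance):
    # Module-level function: pickle stores it by qualified name,
    # so it round-trips fine.
    return isinstance(instance, str)

restored = pickle.loads(pickle.dumps(is_string))
print(restored(None, "hi"))  # True

# The equivalent lambda has no importable name, so pickling it fails.
try:
    pickle.dumps(lambda checker, instance: isinstance(instance, str))
    lambda_picklable = True
except Exception:
    lambda_picklable = False
print("lambda picklable:", lambda_picklable)  # False
```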


📚 Documentation preview 📚: https://python-jsonschema--1083.org.readthedocs.build/en/1083/


Julian commented Apr 7, 2023

Hi there, I appreciate the PR, but pickling isn't supported -- even if this were merged, I can't guarantee it wouldn't break again.

@Julian Julian closed this Apr 7, 2023

Erotemic commented Apr 7, 2023

Is this open to discussion?

Why isn't pickling supported? That prevents you from using multiprocessing, which is needed to execute jsonschema on large numbers of files in any reasonable amount of time. This PR makes zero functional difference other than adding a name to the module namespace (which can be changed or removed), and I've verified that it allows me to run jsonargparse nearly 8x faster (minutes down to seconds). What's the opposition to merging?

Also, even if you can't guarantee it won't break again, if you make a release with this diff I can always pin to that. My problem is solved, and other people who use this can get orders-of-magnitude speedups with multiprocessing. So I'm a bit confused by the quick dismissal.


Julian commented Apr 7, 2023

> That prevents you from using multiprocessing

Not really -- I'll assume you mean the multiprocessing module rather than multiprocessing in general, but you don't need to pickle validator objects across jobs to use even that. All you're "prevented" from doing is copying arbitrary objects via pickling -- something like creating one validator in each multiprocessing job is perfectly supported and involves no pickling of arbitrary objects.

And using any other better form of parallelization is indeed supported too.
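A rough sketch of that per-job pattern, stdlib only. `ThreadPoolExecutor` stands in here so the example is self-contained, but `ProcessPoolExecutor` accepts the same `initializer` argument; `_init_worker` and `_check` are invented names, and the lambda stands in for constructing a real jsonschema validator once per worker.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

_state = threading.local()

def _init_worker():
    # Build the (stand-in) validator once per worker; in a real program
    # this is where you'd construct a jsonschema validator. Nothing is
    # pickled, because the object never crosses a worker boundary.
    _state.validator = lambda instance: isinstance(instance, dict)

def _check(instance):
    return _state.validator(instance)

with ThreadPoolExecutor(max_workers=4, initializer=_init_worker) as pool:
    results = list(pool.map(_check, [{"a": 1}, [1, 2], {"b": 2}]))
print(results)  # [True, False, True]
```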

> Also, even if you can't guarantee it won't break again, if you make a release with this diff I can always pin to that. My problem is solved, other people that use this can get orders of magnitude speedups with multiprocessing.

That's not how merging fixes works, unfortunately. Merging something sends the message that it is supported and can be relied on -- the opposite of the message that closing sends, which is "what you're doing isn't really a good idea; I'd personally prefer you didn't do it". If next time there's some equally small PR that does nothing for the average user but is needed for pickle support, someone will again want it merged for the same reasoning, even if it's a bit larger.

Not to mention that every PR has a cost, both in ensuring it doesn't break anything else and in explanations like this one.

Again -- I do appreciate that you're trying to fix something, and that this PR seems small to you and like it'd solve all your problems. But if you really insist on doing something like this rather than using some other supported mechanism, just create your own type checker and use that instead -- it's about the same number of lines as this PR.


Erotemic commented Apr 9, 2023

It turns out I was wrong.

I made an MWE and it doesn't trigger the pickling error. I'm not exactly sure how I came to the conclusion that this was the issue -- likely some IPython autoreload quirk. Regardless, there is no issue and no patch is necessary.

For completeness, here is the MWE:

```python
from jsonschema import validate
import ubelt

# A sample schema
schema = {
    "type": "array",
    "contains": {
        "type": "object",
        "properties": {
            "price": {"type": "number"},
            "name": {"type": "string"},
        },
    }
}

instances = [
    [{"name": "Eggs" + str(_), "price": 34.99 + _} for _ in range(100)]
    for _ in range(1000)
]

with ubelt.Timer(verbose=1, label='serial'):
    jobs = ubelt.JobPool(mode='serial', max_workers=0)
    for instance in instances:
        jobs.submit(validate, instance=instance, schema=schema)
    for job in jobs.as_completed(desc='serial'):
        job.result()

with ubelt.Timer(verbose=1, label='parallel'):
    jobs = ubelt.JobPool(mode='process', max_workers=16)
    for instance in instances:
        jobs.submit(validate, instance=instance, schema=schema)
    for job in jobs.as_completed(desc='parallel'):
        job.result()
```
