@Erotemic Erotemic commented Apr 7, 2023

I'm running into an issue when trying to use jsonschema with concurrent.futures to validate against multiple schemas in parallel. It complains that it can't pickle a lambda function. Changing this lambda to a globally defined function fixes the issue.

I'm currently pretty tight on time, so this PR contains only the change, without a test or MWE to demonstrate the issue, but I think it's simple enough and the rationale is reasonable.
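For anyone hitting something similar: the underlying pickle behavior is easy to reproduce without jsonschema at all. This is a minimal stdlib-only sketch; `is_string` is a made-up stand-in for the kind of callable the patch renames.

```python
import pickle

def is_string(checker, instance):
    # Module-level function: pickle stores it by qualified name,
    # so it round-trips fine.
    return isinstance(instance, str)

restored = pickle.loads(pickle.dumps(is_string))
print(restored(None, "hi"))  # True

# The equivalent lambda has no importable name, so pickling it fails.
try:
    pickle.dumps(lambda checker, instance: isinstance(instance, str))
    lambda_picklable = True
except Exception:
    lambda_picklable = False
print("lambda picklable:", lambda_picklable)  # False
```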


📚 Documentation preview 📚: https://python-jsonschema--1083.org.readthedocs.build/en/1083/


Julian commented Apr 7, 2023

Hi there, I appreciate the PR, but pickling isn't supported -- even if this were merged, I can't guarantee it wouldn't break again.

@Julian Julian closed this Apr 7, 2023

Erotemic commented Apr 7, 2023

Is this open to discussion?

Why isn't pickling supported? That prevents you from using multiprocessing, which is needed to execute jsonschema on large numbers of files in any reasonable amount of time. This PR makes zero functional difference other than adding a name to the module namespace (which can be changed or removed), and I've verified that it allows me to run jsonargparse nearly 8x faster (minutes down to seconds). What's the opposition to merging?

Also, even if you can't guarantee it won't break again, if you make a release with this diff I can always pin to that. My problem is solved, and other people who use this can get orders-of-magnitude speedups with multiprocessing. So I'm a bit confused by the quick dismissal.


Julian commented Apr 7, 2023

> That prevents you from using multiprocessing

Not really -- I'll assume you mean the multiprocessing module rather than multiprocessing in general, but you don't need to pickle validator objects across jobs to use even that. All you're "prevented" from doing is copying arbitrary objects via pickling -- something like creating one validator in each multiprocessing job is perfectly supported and involves no pickling of arbitrary objects.

And using any other better form of parallelization is indeed supported too.
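A rough sketch of that per-job pattern, stdlib only. `ThreadPoolExecutor` stands in here so the example is self-contained, but `ProcessPoolExecutor` accepts the same `initializer` argument; `_init_worker` and `_check` are invented names, and the lambda stands in for constructing a real jsonschema validator once per worker.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

_state = threading.local()

def _init_worker():
    # Build the (stand-in) validator once per worker; in a real program
    # this is where you'd construct a jsonschema validator. Nothing is
    # pickled, because the object never crosses a worker boundary.
    _state.validator = lambda instance: isinstance(instance, dict)

def _check(instance):
    return _state.validator(instance)

with ThreadPoolExecutor(max_workers=4, initializer=_init_worker) as pool:
    results = list(pool.map(_check, [{"a": 1}, [1, 2], {"b": 2}]))
print(results)  # [True, False, True]
```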

> Also, even if you can't guarantee it won't break again, if you make a release with this diff I can always pin to that. My problem is solved, other people that use this can get orders of magnitude speedups with multiprocessing.

That's not how merging fixes works, unfortunately. Merging something sends the message that it is supported and can be relied on -- the opposite of the message that closing sends, which is "what you're doing isn't really a good idea; I'd personally prefer you didn't do it". If next time there's some equally small PR that does nothing for the average user but is needed for pickle support, someone will again want it merged for the same reasoning, even if it's a bit larger.

Not to mention that every PR has a cost, both in ensuring it doesn't break anything else and in explanations like this one.

Again -- I do appreciate that you're trying to fix something, and that this PR seems small to you and like it'd solve all your problems. But if you really insist on doing something like this rather than using some other supported mechanism, just create your own type checker and use that instead -- it's about the same number of lines as this PR.


Erotemic commented Apr 9, 2023

It turns out I was wrong.

I made an MWE and it doesn't trigger the pickling error. I'm not exactly sure how I came to the conclusion that this was the issue -- likely some IPython autoreload quirk. Regardless, there is no issue and no patch is necessary.

For completeness, here is the MWE:

```python
from jsonschema import validate
import ubelt

# A sample schema
schema = {
    "type": "array",
    "contains": {
        "type": "object",
        "properties": {
            "price": {"type": "number"},
            "name": {"type": "string"},
        },
    }
}

instances = [
    [{"name": "Eggs" + str(_), "price": 34.99 + _} for _ in range(100)]
    for _ in range(1000)
]

with ubelt.Timer(verbose=1, label='serial'):
    jobs = ubelt.JobPool(mode='serial', max_workers=0)
    for instance in instances:
        jobs.submit(validate, instance=instance, schema=schema)
    for job in jobs.as_completed(desc='serial'):
        job.result()

with ubelt.Timer(verbose=1, label='parallel'):
    jobs = ubelt.JobPool(mode='process', max_workers=16)
    for instance in instances:
        jobs.submit(validate, instance=instance, schema=schema)
    for job in jobs.as_completed(desc='parallel'):
        job.result()
```
