Long intervals during resource iteration can lead to issues #141

@hermit-crab

Hello.

Recently there was issue #121, for which a batch-read workaround was implemented. I am now experiencing what I believe to be the same or a similar issue, but while using JSON instead of msgpack. Basically, when I do `for item in job.items.iter(..., count=X, ...):`, if there are long intervals during iteration, the `count` can end up being ignored. I was able to reproduce it with the following snippet:

```python
import time

from scrapinghub import ScrapinghubClient

sh_client = ScrapinghubClient(APIKEY, use_msgpack=False)
take = 10_000
job_id = '168012/276/1'
for i, item in enumerate(sh_client.get_job(job_id).items.iter(count=take, meta='_key')):
    print(f'\r{i} ({item["_key"]})', end='')
    if i == 3000:
        print('\nsleeping')
        time.sleep(60 * 3)
    if i > take:
        print('\nWTF')
        break
```

With the sleep part removed, the WTF section does not fire and the iterator stops at item 168012/276/1/9999.

This seems to be more of a ScrapyCloud API platform problem, but I am reporting it here to track it nonetheless.

For now I am assuming resource/collection iteration is not robust if any client-side delays are possible during retrieval (I haven't tested any other potential issues), so as a habit I will try either preloading everything at once (`.list()`) or using `.list_iter()` when it makes sense, as sketched below.
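
For reference, a minimal sketch of that workaround, assuming `list_iter()` behaves as the #121 batch-read helper (accepting a `chunksize` plus the usual `iter()` kwargs and yielding items in chunks); `process()` is a hypothetical placeholder:

```python
import time

from scrapinghub import ScrapinghubClient

sh_client = ScrapinghubClient(APIKEY, use_msgpack=False)
job = sh_client.get_job('168012/276/1')
take = 10_000

# list_iter() fetches each chunk with its own short-lived request, so a
# long client-side pause only delays the next request instead of
# desyncing a single long-lived streaming response.
for chunk in job.items.list_iter(chunksize=1000, count=take, meta='_key'):
    for item in chunk:
        process(item)  # hypothetical per-item work
    time.sleep(60 * 3)  # pausing between chunks should now be safe
```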
