Description
Is your feature request related to a problem? Please describe.
My use case is to download a single, large blob (~16 GB) into memory in a Python application. This happens as part of a startup process that currently takes 5 minutes. The gsutil command-line utility has a way to enable sliced downloads and takes only 30 seconds on the same machine and network. I would like to take advantage of this optimization in a Pythonic way.
Describe the solution you'd like
Enable sliced downloads in the Python client library, for example:
blob.download_to_filename(..., sliced_downloads=True, max_components=16)
This would match gsutil, which copies the blob to the local filesystem. It would be great, however, if the blob could also be downloaded into memory:
blob.download_as_bytes(..., sliced_downloads=True, max_components=16)
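In the meantime, something close to this can be approximated with the existing `start`/`end` parameters of `download_as_bytes` plus a thread pool. A minimal sketch (the helper name, slice count, and bucket/blob names are placeholders, and it doesn't handle per-slice validation or retries):

```python
from concurrent.futures import ThreadPoolExecutor

from google.cloud import storage


def download_sliced_to_memory(bucket_name, blob_name, max_components=16):
    """Placeholder helper: download a blob into memory via parallel range reads."""
    client = storage.Client()
    blob = client.bucket(bucket_name).get_blob(blob_name)  # get_blob() populates blob.size
    size = blob.size
    slice_size = max(1, -(-size // max_components))  # ceiling division

    def fetch(start):
        # Each call issues one ranged GET; 'end' is an inclusive byte offset.
        end = min(start + slice_size, size) - 1
        return blob.download_as_bytes(start=start, end=end)

    with ThreadPoolExecutor(max_workers=max_components) as pool:
        parts = list(pool.map(fetch, range(0, size, slice_size)))
    return b"".join(parts)
```

This helps, but a first-class option in the library could also take care of checksum validation and worker management instead of leaving that to every caller.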
Describe alternatives you've considered
Knowing that gsutil can run the download concurrently, I tried calling it with the subprocess module. This doesn't work because, unlike an interactive command-line invocation, it will not run more than one process. Running a shell command from a Python process is also not great because it assumes the Cloud SDK is set up.
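For reference, the shell-out attempt looked roughly like this (the gs:// URI and destination path are placeholders):

```python
import subprocess

# Same command as the gsutil invocation shown under "Additional context" below.
subprocess.run(
    [
        "gsutil",
        "-o", "GSUtil:parallel_thread_count=1",
        "-o", "GSUtil:sliced_object_download_max_components=16",
        "cp", "gs://bucket/key", "/path/to/destination",
    ],
    check=True,
)
```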
I've also tried using ChunkedDownload in conjunction with multiprocessing, but I have not been able to get it to download chunks in parallel. There is also the additional overhead of dealing with the byte-stream buffer, transport authentication, checksum/data validation, etc., which makes it non-trivial.
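The ChunkedDownload experiment was along these lines; a single worker is shown because that is the part I could not cleanly parallelize (bucket/object names, scope, and chunk size are placeholders):

```python
import io

import google.auth
from google.auth.transport.requests import AuthorizedSession
from google.resumable_media.requests import ChunkedDownload

# Transport authentication that the client library normally handles for you.
credentials, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/devstorage.read_only"]
)
transport = AuthorizedSession(credentials)

# Placeholder media URL for gs://bucket/key.
media_url = (
    "https://storage.googleapis.com/download/storage/v1/"
    "b/bucket/o/key?alt=media"
)

# Byte-stream buffer the chunks are written into.
stream = io.BytesIO()
download = ChunkedDownload(media_url, 256 * 1024 * 1024, stream)

# Chunks are consumed sequentially; fanning this out across processes is
# where it gets non-trivial.
while not download.finished:
    download.consume_next_chunk(transport)

data = stream.getvalue()
```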
Additional context
Since gsutil is itself a Python executable, I would imagine this could be implemented in the client library (ultimately making the same HTTP Range requests).
The gsutil command I used on a GCE instance with 16 vCPUs:
gsutil -o 'GSUtil:parallel_thread_count=1' -o 'GSUtil:sliced_object_download_max_components=16' cp gs://bucket/key /path/to/destination
I'm also open to an existing solution I'm not aware of, but documentation on this topic is sparse.