✨ Source S3: Add reading files inside zip archive #31340
Conversation
Before Merging a Connector Pull Request
Wow! What a great pull request you have here! 🎉 To merge this PR, ensure the following has been done/considered for each connector added or updated:
If the checklist is complete, but the CI check is failing,
davydov-d left a comment
Please add tests, preferably CAT with the new archive type
| Step | Result |
|---|---|
| Build source-s3 docker image for platform(s) linux/x86_64 | ✅ |
| Unit tests | ✅ |
| Acceptance tests | ✅ |
| Code format checks | ✅ |
| Validate metadata for source-s3 | ✅ |
| Connector version semver check | ✅ |
| Connector version increment check | ✅ |
| QA checks | ✅ |
☁️ View runs for commit in Dagger Cloud
Please note that tests are only run on PRs that are ready for review. Please set your PR to draft mode to avoid flooding the CI engine and upstream services on subsequent commits.
You can run the same pipeline locally on this branch with the airbyte-ci tool using the following command:

```
airbyte-ci connectors --name=source-s3 test
```
davydov-d left a comment
👏
```python
self._file.seek(file_start + 26)  # Offset to the length fields in the ZIP header
name_len, extra_len = struct.unpack("<HH", self._file.read(4))
return file_start + 30 + name_len + extra_len
```
Could you please give more info about these hardcoded ints, 26 and 30?
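(Editorial context, not from the PR thread: these constants match the fixed layout of the ZIP local file header defined in the ZIP specification, PKWARE's APPNOTE. The two-byte name-length and extra-field-length fields start at byte 26, and the fixed portion of the header is 30 bytes long. A commented sketch of the same computation; the `fileobj` parameter is hypothetical, standing in for the reader's `self._file`:)

```python
import struct

LOCAL_HEADER_FIXED_SIZE = 30  # signature(4) + version(2) + flags(2) + method(2)
                              # + mod time(2) + mod date(2) + CRC-32(4)
                              # + compressed size(4) + uncompressed size(4)
                              # + name length(2) + extra length(2) = 30 bytes
NAME_LEN_FIELD_OFFSET = 26    # the two length fields are the last 4 header bytes


def data_offset(fileobj, file_start: int) -> int:
    # Seek past the fixed fields to the name/extra length pair.
    fileobj.seek(file_start + NAME_LEN_FIELD_OFFSET)
    name_len, extra_len = struct.unpack("<HH", fileobj.read(4))
    # Compressed data starts after the fixed header, the file name,
    # and the optional extra field.
    return file_start + LOCAL_HEADER_FIXED_SIZE + name_len + extra_len
```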
Force-pushed from e56c48c to 0fa7198.
| Step | Result |
|---|---|
| Build source-s3 docker image for platform(s) linux/x86_64 | ✅ |
| Unit tests | ✅ |
| Acceptance tests | ❌ |
| Code format checks | ✅ |
| Validate metadata for source-s3 | ✅ |
| Connector version semver check | ✅ |
| Connector version increment check | ✅ |
| QA checks | ✅ |
Force-pushed from 7217969 to eeafed3.
| Step | Result |
|---|---|
| Build source-s3 docker image for platform(s) linux/x86_64 | ✅ |
| Unit tests | ✅ |
| Acceptance tests | ✅ |
| Code format checks | ✅ |
| Validate metadata for source-s3 | ✅ |
| Connector version semver check | ✅ |
| Connector version increment check | ✅ |
| QA checks | ✅ |
What
In this PR, we've enhanced our S3 stream reader to efficiently access and read files within ZIP archives stored in S3.
This implementation provides a faster and more efficient way to access files within ZIP archives in S3, which is especially beneficial for large archives.
Closes #17021
How
The process of reading files from a ZIP archive stored in S3 is divided into two main steps:
1. Reading the Central Directory of the ZIP Archive, which holds each member file's metadata, including its offset within the archive (see the first sketch below).
2. Reading Individual Files Using Their Offset, jumping directly to each file's local header and compressed data (see the second sketch below).
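A minimal sketch of step 1, assuming a `ranged_get(start, length) -> bytes` helper that wraps an S3 ranged GET; the helper name and exact parsing are illustrative, not the connector's actual code, and ZIP64 archives (which use different records) are out of scope. It locates the End of Central Directory (EOCD) record at the tail of the archive, then walks the central directory entries to collect each member's name and local-header offset:

```python
import struct
from typing import Callable, Iterator, Tuple

EOCD_SIG = b"PK\x05\x06"   # End of Central Directory signature
CDFH_SIG = b"PK\x01\x02"   # Central Directory file header signature


def central_directory(ranged_get: Callable[[int, int], bytes],
                      archive_size: int) -> bytes:
    # The EOCD record is 22 bytes plus an optional comment of up to
    # 65535 bytes, so a tail read of that size must contain it.
    tail_len = min(archive_size, 22 + 0xFFFF)
    tail = ranged_get(archive_size - tail_len, tail_len)
    pos = tail.rfind(EOCD_SIG)
    if pos == -1:
        raise ValueError("EOCD record not found; not a ZIP archive?")
    cd_size, cd_offset = struct.unpack("<II", tail[pos + 12:pos + 20])
    # One more ranged read fetches the whole central directory.
    return ranged_get(cd_offset, cd_size)


def iter_members(cd: bytes) -> Iterator[Tuple[str, int]]:
    # Each entry is a 46-byte fixed header followed by variable-length
    # name, extra-field, and comment sections.
    pos = 0
    while cd[pos:pos + 4] == CDFH_SIG:
        name_len, extra_len, comment_len = struct.unpack("<HHH", cd[pos + 28:pos + 34])
        (local_header_offset,) = struct.unpack("<I", cd[pos + 42:pos + 46])
        name = cd[pos + 46:pos + 46 + name_len].decode("utf-8", "replace")
        yield name, local_header_offset
        pos += 46 + name_len + extra_len + comment_len
```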
In summary, our implementation efficiently reads files from a ZIP archive in S3 by first parsing the Central Directory to gather metadata and then using the offset information to directly access and read individual files.
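And a matching sketch of step 2, reusing the same hypothetical `ranged_get` helper: given a member's local-header offset from the central directory, two small ranged reads fetch the 30-byte header and then exactly the compressed bytes, which are inflated locally:

```python
import struct
import zlib


def read_member(ranged_get, local_header_offset: int) -> bytes:
    # Fetch only the fixed local header for this member.
    header = ranged_get(local_header_offset, 30)
    (method,) = struct.unpack("<H", header[8:10])
    (compressed_size,) = struct.unpack("<I", header[18:22])
    name_len, extra_len = struct.unpack("<HH", header[26:30])
    # Note: if general-purpose flag bit 3 is set, the sizes here are zero
    # and live in a trailing data descriptor; real code should take the
    # size from the central directory entry instead.
    data_start = local_header_offset + 30 + name_len + extra_len
    data = ranged_get(data_start, compressed_size)
    if method == 0:        # stored: bytes are already uncompressed
        return data
    if method == 8:        # DEFLATE: raw stream, so wbits must be -15
        return zlib.decompress(data, -15)
    raise NotImplementedError(f"compression method {method}")
```

In practice, `ranged_get` might be backed by `boto3`'s `get_object` with an HTTP `Range` header (e.g. `Range="bytes=start-end"`), which is what keeps every read partial and avoids downloading the whole archive.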