Conversation

@tolik0
Contributor

@tolik0 tolik0 commented Oct 12, 2023

What

In this PR, we've enhanced the capability of our S3 stream reader to efficiently access and read files within ZIP archives stored in S3. The key highlights include:

  • Central Directory Parsing: Introduced logic to read and parse the Central Directory of ZIP archives, which provides metadata about the files contained within.
  • Direct File Access: Leveraged the offset information from the Central Directory to directly seek and read individual files from the ZIP archive without the need to decompress the entire archive.

This implementation ensures a faster and more efficient way to access files within ZIP archives in S3, which is especially beneficial for large archives.

Closes:
#17021

How

The process of reading files from a ZIP archive stored in S3 is divided into two main steps:

1. Reading the Central Directory of the ZIP Archive:

  • The Central Directory (CD) is a crucial part of the ZIP file format. It contains metadata about all the files and directories stored inside the ZIP archive.
  • Our implementation begins by reading the Central Directory to gather information about the contents of the ZIP archive. The ZipFileHandler is responsible for this task.
  • By parsing the Central Directory, we obtain essential details about each file inside the ZIP, such as its name, compression method, compressed size, and most importantly, its offset within the ZIP archive. The offset is crucial because it tells us where the file's data starts in the ZIP archive.
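The Central Directory parsing described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual ZipFileHandler code: the struct offsets follow the ZIP specification (APPNOTE.TXT), and the names (ZipEntry, parse_central_directory) are made up for the example. It also ignores multi-disk archives and ZIP64.

```python
import struct
from dataclasses import dataclass
from typing import BinaryIO, List

@dataclass
class ZipEntry:
    filename: str
    compression_method: int
    compressed_size: int
    header_offset: int  # offset of this file's local header within the archive

def parse_central_directory(f: BinaryIO) -> List[ZipEntry]:
    # Locate the End of Central Directory (EOCD) record. Its signature is
    # 0x06054b50 ("PK\x05\x06") and it sits within the last 65557 bytes of
    # the file (22-byte record plus an up-to-64KB comment).
    f.seek(0, 2)
    size = f.tell()
    tail_len = min(size, 65557)
    f.seek(size - tail_len)
    tail = f.read(tail_len)
    eocd_pos = tail.rfind(b"PK\x05\x06")
    if eocd_pos < 0:
        raise ValueError("EOCD record not found; not a ZIP archive?")
    # EOCD fields at +10: total entry count (2), CD size (4), CD offset (4).
    n_total, _cd_size, cd_offset = struct.unpack(
        "<HII", tail[eocd_pos + 10 : eocd_pos + 20]
    )

    entries = []
    f.seek(cd_offset)
    for _ in range(n_total):
        fixed = f.read(46)  # fixed-size part of a central directory header
        (sig,) = struct.unpack("<I", fixed[:4])
        if sig != 0x02014B50:  # "PK\x01\x02"
            raise ValueError("bad central directory header signature")
        (method,) = struct.unpack("<H", fixed[10:12])
        (comp_size,) = struct.unpack("<I", fixed[20:24])
        name_len, extra_len, comment_len = struct.unpack("<HHH", fixed[28:34])
        (header_offset,) = struct.unpack("<I", fixed[42:46])
        # Assumes UTF-8/ASCII names; real code should honor the encoding flag.
        name = f.read(name_len).decode("utf-8")
        f.read(extra_len + comment_len)  # skip variable-length fields
        entries.append(ZipEntry(name, method, comp_size, header_offset))
    return entries
```

Against S3, the tail read and the central-directory read would each map to a ranged GET request, so only a few kilobytes of a potentially multi-gigabyte archive need to be fetched.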

2. Reading Individual Files Using Their Offset:

  • Once we have the offset of a file from the Central Directory, we can directly seek to that position in the ZIP archive to start reading the file's data.
  • The RemoteFileInsideArchive class represents each file inside the ZIP archive. It stores metadata about the file, including its offset, which is used to read the file.
  • When a request is made to read a specific file from the ZIP archive, our implementation seeks to the file's offset in the ZIP archive and begins reading the file's data from that position. This direct seek-and-read approach ensures efficient and fast access to individual files without the need to decompress or read the entire ZIP archive.
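The seek-and-read step can be sketched as below, assuming the offset, compressed size, and method have already been obtained from the Central Directory. The helper name read_member is hypothetical; in the PR this logic is spread across RemoteFileInsideArchive and the stream reader.

```python
import struct
import zlib
from typing import BinaryIO

ZIP_STORED, ZIP_DEFLATED = 0, 8

def read_member(
    f: BinaryIO, header_offset: int, compressed_size: int, method: int
) -> bytes:
    """Seek to one member's local header and read only that member's bytes."""
    # The local file header has 30 bytes of fixed fields followed by the
    # file name and an extra field; their lengths sit at offset 26 and 28.
    f.seek(header_offset + 26)
    name_len, extra_len = struct.unpack("<HH", f.read(4))
    data_start = header_offset + 30 + name_len + extra_len
    f.seek(data_start)
    data = f.read(compressed_size)
    if method == ZIP_DEFLATED:
        # wbits=-15 selects a raw DEFLATE stream (no zlib wrapper), as ZIP uses.
        return zlib.decompress(data, -15)
    if method == ZIP_STORED:
        return data
    raise NotImplementedError(f"compression method {method} not supported")
```

On S3 the two reads translate to two small ranged GET requests, one for the local header and one for the member's data, so the archive is never downloaded in full.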

In summary, our implementation efficiently reads files from a ZIP archive in S3 by first parsing the Central Directory to gather metadata and then using the offset information to directly access and read individual files.

@vercel

vercel bot commented Oct 12, 2023

The latest updates on your projects.

1 Ignored Deployment:
  • airbyte-docs: ⬜️ Ignored, updated Oct 18, 2023 8:08am (UTC)
@github-actions
Contributor

github-actions bot commented Oct 12, 2023

Before Merging a Connector Pull Request

Wow! What a great pull request you have here! 🎉

To merge this PR, ensure the following has been done/considered for each connector added or updated:

  • PR name follows PR naming conventions
  • Breaking changes are considered. If a Breaking Change is being introduced, ensure an Airbyte engineer has created a Breaking Change Plan.
  • Connector version has been incremented in the Dockerfile and metadata.yaml according to our Semantic Versioning for Connectors guidelines
  • You've updated the connector's metadata.yaml file with any other relevant changes, including a breakingChanges entry for major version bumps. See metadata.yaml docs
  • Secrets in the connector's spec are annotated with airbyte_secret
  • All documentation files are up to date. (README.md, bootstrap.md, docs.md, etc...)
  • Changelog updated in docs/integrations/<source or destination>/<name>.md with an entry for the new version. See changelog example
  • Migration guide updated in docs/integrations/<source or destination>/<name>-migrations.md with an entry for the new version, if the version is a breaking change. See migration guide example
  • If set, you've ensured the icon is present in the platform-internal repo. (Docs)

If the checklist is complete but the CI check is failing:

  1. Check for hidden checklists in your PR description

  2. Toggle the github label checklist-action-run on/off to re-run the checklist CI.

Contributor

@davydov-d davydov-d left a comment


Please add tests, preferably CAT with the new archive type

@tolik0 tolik0 marked this pull request as ready for review October 13, 2023 11:48
@tolik0 tolik0 changed the title Source S3: Add reading files inside zip archive ✨ Source S3: Add reading files inside zip archive Oct 13, 2023
@octavia-squidington-iii octavia-squidington-iii added the area/documentation Improvements or additions to documentation label Oct 13, 2023
@airbyte-oss-build-runner
Collaborator

source-s3 test report (commit e56c48c8b5) - ✅

⏲️ Total pipeline duration: 14mn55s

Steps:
Build source-s3 docker image for platform(s) linux/x86_64
Unit tests
Acceptance tests
Code format checks
Validate metadata for source-s3
Connector version semver check
Connector version increment check
QA checks

🔗 View the logs here

☁️ View runs for commit in Dagger Cloud

Please note that tests are only run on PRs that are ready for review. Please set your PR to draft mode to avoid flooding the CI engine and upstream services on subsequent commits.
You can run the same pipeline locally on this branch with the airbyte-ci tool using the following command:

airbyte-ci connectors --name=source-s3 test
@tolik0 tolik0 requested a review from davydov-d October 13, 2023 12:23
Contributor

@davydov-d davydov-d left a comment


👏

@tolik0 tolik0 requested review from bazarnov and lazebnyi October 16, 2023 09:22
Comment on lines 138 to 140
self._file.seek(file_start + 26) # Offset to the length fields in the ZIP header
name_len, extra_len = struct.unpack("<HH", self._file.read(4))
return file_start + 30 + name_len + extra_len
Contributor


Could you please give more info about these hardcoded ints, 26 and 30?
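For context on the question above: the two constants come from the fixed layout of the ZIP local file header, where the file-name and extra-field length fields start at byte 26 and the fixed portion is 30 bytes long. A sketch with named constants follows; data_offset is an illustrative helper, not the PR's actual code.

```python
import struct

# Local file header layout (ZIP APPNOTE 4.3.7); fixed part is 30 bytes:
#   offset  size  field
#        0     4  signature 0x04034b50 ("PK\x03\x04")
#        4     2  version needed to extract
#        6     2  general purpose bit flag
#        8     2  compression method
#       10     2  last mod file time
#       12     2  last mod file date
#       14     4  CRC-32
#       18     4  compressed size
#       22     4  uncompressed size
#       26     2  file name length      <- the "26"
#       28     2  extra field length
#       30     .  file name, then extra field, then file data  <- the "30"
NAME_LEN_OFFSET = 26   # where the two length fields start
FIXED_HEADER_LEN = 30  # size of the fixed portion of the header

def data_offset(f, file_start: int) -> int:
    """Return the absolute offset where a member's file data begins."""
    f.seek(file_start + NAME_LEN_OFFSET)
    name_len, extra_len = struct.unpack("<HH", f.read(4))
    return file_start + FIXED_HEADER_LEN + name_len + extra_len
```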

@octavia-squidington-iv octavia-squidington-iv requested review from a team October 17, 2023 10:18
@tolik0 tolik0 force-pushed the tolik0/source-s3/add-reading-inside-zip branch from e56c48c to 0fa7198 Compare October 17, 2023 23:13
@airbyte-oss-build-runner
Collaborator

source-s3 test report (commit 7217969808) - ❌

⏲️ Total pipeline duration: 20mn20s

Steps:
Build source-s3 docker image for platform(s) linux/x86_64
Unit tests
Acceptance tests
Code format checks
Validate metadata for source-s3
Connector version semver check
Connector version increment check
QA checks

🔗 View the logs here

☁️ View runs for commit in Dagger Cloud

Please note that tests are only run on PRs that are ready for review. Please set your PR to draft mode to avoid flooding the CI engine and upstream services on subsequent commits.
You can run the same pipeline locally on this branch with the airbyte-ci tool using the following command:

airbyte-ci connectors --name=source-s3 test
@tolik0 tolik0 force-pushed the tolik0/source-s3/add-reading-inside-zip branch from 7217969 to eeafed3 Compare October 18, 2023 07:55
@airbyte-oss-build-runner
Collaborator

source-s3 test report (commit 5f74715bbe) - ✅

⏲️ Total pipeline duration: 21mn00s

Steps:
Build source-s3 docker image for platform(s) linux/x86_64
Unit tests
Acceptance tests
Code format checks
Validate metadata for source-s3
Connector version semver check
Connector version increment check
QA checks

🔗 View the logs here

☁️ View runs for commit in Dagger Cloud

Please note that tests are only run on PRs that are ready for review. Please set your PR to draft mode to avoid flooding the CI engine and upstream services on subsequent commits.
You can run the same pipeline locally on this branch with the airbyte-ci tool using the following command:

airbyte-ci connectors --name=source-s3 test
@tolik0 tolik0 merged commit 951605a into master Oct 18, 2023
@tolik0 tolik0 deleted the tolik0/source-s3/add-reading-inside-zip branch October 18, 2023 08:53
ariesgun pushed a commit to ariesgun/airbyte that referenced this pull request Oct 23, 2023

Labels

  • area/connectors: Connector related issues
  • area/documentation: Improvements or additions to documentation
  • connectors/source/s3

7 participants