Conversation

@tolik0
Contributor

@tolik0 tolik0 commented Oct 12, 2023

What

In this PR, we've enhanced the capability of our S3 stream reader to efficiently access and read files within ZIP archives stored in S3. The key highlights include:

  • Central Directory Parsing: Introduced logic to read and parse the Central Directory of ZIP archives, which provides metadata about the files contained within.
  • Direct File Access: Leveraged the offset information from the Central Directory to directly seek and read individual files from the ZIP archive without the need to decompress the entire archive.

This implementation ensures a faster and more efficient way to access files within ZIP archives in S3, which is especially beneficial for large archives.

Closes:
#17021

How

The process of reading files from a ZIP archive stored in S3 is divided into two main steps:

1. Reading the Central Directory of the ZIP Archive:

  • The Central Directory (CD) is a crucial part of the ZIP file format. It contains metadata about all the files and directories stored inside the ZIP archive.
  • Our implementation begins by reading the Central Directory to gather information about the contents of the ZIP archive. The ZipFileHandler is responsible for this task.
  • By parsing the Central Directory, we obtain essential details about each file inside the ZIP, such as its name, compression method, compressed size, and most importantly, its offset within the ZIP archive. The offset is crucial because it tells us where the file's data starts in the ZIP archive.
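The Central Directory parsing described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual ZipFileHandler code: the struct offsets follow the ZIP specification (APPNOTE.TXT), and the names (ZipEntry, parse_central_directory) are made up for the example. It also ignores multi-disk archives and ZIP64.

```python
import struct
from dataclasses import dataclass
from typing import BinaryIO, List

@dataclass
class ZipEntry:
    filename: str
    compression_method: int
    compressed_size: int
    header_offset: int  # offset of this file's local header within the archive

def parse_central_directory(f: BinaryIO) -> List[ZipEntry]:
    # Locate the End of Central Directory (EOCD) record. Its signature is
    # 0x06054b50 ("PK\x05\x06") and it sits within the last 65557 bytes of
    # the file (22-byte record plus an up-to-64KB comment).
    f.seek(0, 2)
    size = f.tell()
    tail_len = min(size, 65557)
    f.seek(size - tail_len)
    tail = f.read(tail_len)
    eocd_pos = tail.rfind(b"PK\x05\x06")
    if eocd_pos < 0:
        raise ValueError("EOCD record not found; not a ZIP archive?")
    # EOCD fields at +10: total entry count (2), CD size (4), CD offset (4).
    n_total, _cd_size, cd_offset = struct.unpack(
        "<HII", tail[eocd_pos + 10 : eocd_pos + 20]
    )

    entries = []
    f.seek(cd_offset)
    for _ in range(n_total):
        fixed = f.read(46)  # fixed-size part of a central directory header
        (sig,) = struct.unpack("<I", fixed[:4])
        if sig != 0x02014B50:  # "PK\x01\x02"
            raise ValueError("bad central directory header signature")
        (method,) = struct.unpack("<H", fixed[10:12])
        (comp_size,) = struct.unpack("<I", fixed[20:24])
        name_len, extra_len, comment_len = struct.unpack("<HHH", fixed[28:34])
        (header_offset,) = struct.unpack("<I", fixed[42:46])
        # Assumes UTF-8/ASCII names; real code should honor the encoding flag.
        name = f.read(name_len).decode("utf-8")
        f.read(extra_len + comment_len)  # skip variable-length fields
        entries.append(ZipEntry(name, method, comp_size, header_offset))
    return entries
```

Against S3, the tail read and the central-directory read would each map to a ranged GET request, so only a few kilobytes of a potentially multi-gigabyte archive need to be fetched.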

2. Reading Individual Files Using Their Offset:

  • Once we have the offset of a file from the Central Directory, we can directly seek to that position in the ZIP archive to start reading the file's data.
  • The RemoteFileInsideArchive class represents each file inside the ZIP archive. It stores metadata about the file, including its offset, which is used to read the file.
  • When a request is made to read a specific file from the ZIP archive, our implementation seeks to the file's offset in the ZIP archive and begins reading the file's data from that position. This direct seek-and-read approach ensures efficient and fast access to individual files without the need to decompress or read the entire ZIP archive.
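The seek-and-read step can be sketched as below, assuming the offset, compressed size, and method have already been obtained from the Central Directory. The helper name read_member is hypothetical; in the PR this logic is spread across RemoteFileInsideArchive and the stream reader.

```python
import struct
import zlib
from typing import BinaryIO

ZIP_STORED, ZIP_DEFLATED = 0, 8

def read_member(
    f: BinaryIO, header_offset: int, compressed_size: int, method: int
) -> bytes:
    """Seek to one member's local header and read only that member's bytes."""
    # The local file header has 30 bytes of fixed fields followed by the
    # file name and an extra field; their lengths sit at offset 26 and 28.
    f.seek(header_offset + 26)
    name_len, extra_len = struct.unpack("<HH", f.read(4))
    data_start = header_offset + 30 + name_len + extra_len
    f.seek(data_start)
    data = f.read(compressed_size)
    if method == ZIP_DEFLATED:
        # wbits=-15 selects a raw DEFLATE stream (no zlib wrapper), as ZIP uses.
        return zlib.decompress(data, -15)
    if method == ZIP_STORED:
        return data
    raise NotImplementedError(f"compression method {method} not supported")
```

On S3 the two reads translate to two small ranged GET requests, one for the local header and one for the member's data, so the archive is never downloaded in full.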

In summary, our implementation efficiently reads files from a ZIP archive in S3 by first parsing the Central Directory to gather metadata and then using the offset information to directly access and read individual files.

@vercel

vercel bot commented Oct 12, 2023

The latest updates on your projects.

1 Ignored Deployment:
  • airbyte-docs: ⬜️ Ignored, updated Oct 18, 2023 8:08am (UTC)
@github-actions
Contributor

github-actions bot commented Oct 12, 2023

Before Merging a Connector Pull Request

Wow! What a great pull request you have here! 🎉

To merge this PR, ensure the following has been done/considered for each connector added or updated:

  • PR name follows PR naming conventions
  • Breaking changes are considered. If a Breaking Change is being introduced, ensure an Airbyte engineer has created a Breaking Change Plan.
  • Connector version has been incremented in the Dockerfile and metadata.yaml according to our Semantic Versioning for Connectors guidelines
  • You've updated the connector's metadata.yaml file with any other relevant changes, including a breakingChanges entry for major version bumps. See metadata.yaml docs
  • Secrets in the connector's spec are annotated with airbyte_secret
  • All documentation files are up to date. (README.md, bootstrap.md, docs.md, etc...)
  • Changelog updated in docs/integrations/<source or destination>/<name>.md with an entry for the new version. See changelog example
  • Migration guide updated in docs/integrations/<source or destination>/<name>-migrations.md with an entry for the new version, if the version is a breaking change. See migration guide example
  • If set, you've ensured the icon is present in the platform-internal repo. (Docs)

If the checklist is complete but the CI check is failing:

  1. Check for hidden checklists in your PR description

  2. Toggle the github label checklist-action-run on/off to re-run the checklist CI.

Contributor

@davydov-d davydov-d left a comment


Please add tests, preferably CAT with the new archive type

@tolik0 tolik0 marked this pull request as ready for review October 13, 2023 11:48
@tolik0 tolik0 changed the title Source S3: Add reading files inside zip archive ✨ Source S3: Add reading files inside zip archive Oct 13, 2023
@octavia-squidington-iii octavia-squidington-iii added the area/documentation Improvements or additions to documentation label Oct 13, 2023
@airbyte-oss-build-runner
Collaborator

source-s3 test report (commit e56c48c8b5) - ✅

⏲️ Total pipeline duration: 14mn55s

Steps:
Build source-s3 docker image for platform(s) linux/x86_64
Unit tests
Acceptance tests
Code format checks
Validate metadata for source-s3
Connector version semver check
Connector version increment check
QA checks

🔗 View the logs here

☁️ View runs for commit in Dagger Cloud

Please note that tests are only run on PRs that are ready for review. Please set your PR to draft mode to avoid flooding the CI engine and upstream services on subsequent commits.
You can run the same pipeline locally on this branch with the airbyte-ci tool using the following command:

airbyte-ci connectors --name=source-s3 test
@tolik0 tolik0 requested a review from davydov-d October 13, 2023 12:23
Contributor

@davydov-d davydov-d left a comment


👏

@tolik0 tolik0 requested review from bazarnov and lazebnyi October 16, 2023 09:22
Comment on lines 138 to 140
self._file.seek(file_start + 26) # Offset to the length fields in the ZIP header
name_len, extra_len = struct.unpack("<HH", self._file.read(4))
return file_start + 30 + name_len + extra_len
Contributor


Could you please give more info about these hardcoded ints, 26 and 30?
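For context on the question above: the two constants come from the fixed layout of the ZIP local file header, where the file-name and extra-field length fields start at byte 26 and the fixed portion is 30 bytes long. A sketch with named constants follows; data_offset is an illustrative helper, not the PR's actual code.

```python
import struct

# Local file header layout (ZIP APPNOTE 4.3.7); fixed part is 30 bytes:
#   offset  size  field
#        0     4  signature 0x04034b50 ("PK\x03\x04")
#        4     2  version needed to extract
#        6     2  general purpose bit flag
#        8     2  compression method
#       10     2  last mod file time
#       12     2  last mod file date
#       14     4  CRC-32
#       18     4  compressed size
#       22     4  uncompressed size
#       26     2  file name length      <- the "26"
#       28     2  extra field length
#       30     .  file name, then extra field, then file data  <- the "30"
NAME_LEN_OFFSET = 26   # where the two length fields start
FIXED_HEADER_LEN = 30  # size of the fixed portion of the header

def data_offset(f, file_start: int) -> int:
    """Return the absolute offset where a member's file data begins."""
    f.seek(file_start + NAME_LEN_OFFSET)
    name_len, extra_len = struct.unpack("<HH", f.read(4))
    return file_start + FIXED_HEADER_LEN + name_len + extra_len
```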

@octavia-squidington-iv octavia-squidington-iv requested review from a team October 17, 2023 10:18
@tolik0 tolik0 force-pushed the tolik0/source-s3/add-reading-inside-zip branch from e56c48c to 0fa7198 Compare October 17, 2023 23:13
@airbyte-oss-build-runner
Collaborator

source-s3 test report (commit 7217969808) - ❌

⏲️ Total pipeline duration: 20mn20s

Steps:
Build source-s3 docker image for platform(s) linux/x86_64
Unit tests
Acceptance tests
Code format checks
Validate metadata for source-s3
Connector version semver check
Connector version increment check
QA checks

🔗 View the logs here

☁️ View runs for commit in Dagger Cloud

Please note that tests are only run on PRs that are ready for review. Please set your PR to draft mode to avoid flooding the CI engine and upstream services on subsequent commits.
You can run the same pipeline locally on this branch with the airbyte-ci tool using the following command:

airbyte-ci connectors --name=source-s3 test
@tolik0 tolik0 force-pushed the tolik0/source-s3/add-reading-inside-zip branch from 7217969 to eeafed3 Compare October 18, 2023 07:55
@airbyte-oss-build-runner
Collaborator

source-s3 test report (commit 5f74715bbe) - ✅

⏲️ Total pipeline duration: 21mn00s

Steps:
Build source-s3 docker image for platform(s) linux/x86_64
Unit tests
Acceptance tests
Code format checks
Validate metadata for source-s3
Connector version semver check
Connector version increment check
QA checks

🔗 View the logs here

☁️ View runs for commit in Dagger Cloud

Please note that tests are only run on PRs that are ready for review. Please set your PR to draft mode to avoid flooding the CI engine and upstream services on subsequent commits.
You can run the same pipeline locally on this branch with the airbyte-ci tool using the following command:

airbyte-ci connectors --name=source-s3 test
@tolik0 tolik0 merged commit 951605a into master Oct 18, 2023
@tolik0 tolik0 deleted the tolik0/source-s3/add-reading-inside-zip branch October 18, 2023 08:53
ariesgun pushed a commit to ariesgun/airbyte that referenced this pull request Oct 23, 2023

Labels

  • area/connectors: Connector related issues
  • area/documentation: Improvements or additions to documentation
  • connectors/source/s3

7 participants