🐛 fix(mongodb-v2): use max 10k documents while discovering MongoDB schema #8491
What
The MongoDB v2 implementation doesn't limit the number of documents read while performing discovery. This PR remedies that. The MongoDB source documentation states the following:
However, upon testing the integration I noticed discovery was taking way too long. Digging into the code, I never found any reference to it being limited to 10k documents. As my source database has 1.2B documents, it makes sense why discovery is taking so long 😓
Further looking around, this was already mentioned in a previous PR.
The documentation probably originates from the v1 implementation, where there is an explicit limit:
https://github.com/airbytehq/airbyte/blob/5393070dc6c87c3fccad8b664cee8e30d3e52ba3/airbyte-integrations/connectors/source-mongodb/lib/mongodb_stream.rb
How
This PR brings the integration in line with the behavior described in the documentation by reading at most 10k documents during discovery.
Note: the original (v1) approach used `limit`. I'd argue that `$sample` is a better approach for discovery, so I originally implemented it that way and it appeared to work well. However, sampling 10k documents turned out to be very slow, so I reverted to the original `limit`-based implementation. Now it works :)
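To illustrate the idea (not the connector's actual Java code), here is a minimal Python sketch with hypothetical helper names: discovery iterates over at most 10,000 documents, as a `find().limit(10000)` cursor would yield, and merges the field types it observes into a rough schema.

```python
# Hypothetical sketch of capped schema discovery. It unions the JSON
# types seen per field across at most DISCOVERY_LIMIT documents,
# mirroring the effect of a `limit(10000)` on the discovery cursor.

DISCOVERY_LIMIT = 10_000

def json_type(value):
    """Map a Python value to a rough JSON-schema type name."""
    if isinstance(value, bool):  # check bool before int: bool is an int subclass
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "string"
    if isinstance(value, dict):
        return "object"
    if isinstance(value, list):
        return "array"
    return "null"

def discover_schema(documents, limit=DISCOVERY_LIMIT):
    """Build a field -> set-of-types mapping from at most `limit` documents."""
    schema = {}
    for i, doc in enumerate(documents):
        if i >= limit:  # the cap this PR introduces
            break
        for field, value in doc.items():
            schema.setdefault(field, set()).add(json_type(value))
    return schema
```

With `$sample` the cursor would instead be fed by an aggregation stage that picks 10k random documents, which is statistically nicer for huge collections but, as noted above, proved much slower in practice than simply taking the first 10k.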
🚨 User Impact 🚨
Are there any breaking changes? What is the end result perceived by the user? If yes, please merge this PR with the 🚨🚨 emoji so changelog authors can further highlight this if needed.
Pre-merge Checklist
Expand the relevant checklist and delete the others.
New Connector
Community member or Airbyter
- Secrets in the connector's spec are annotated with `airbyte_secret`
- Integration tests pass via `./gradlew :airbyte-integrations:connectors:<name>:integrationTest`
- Documentation updated:
  - Connector's `README.md`
  - Connector's `bootstrap.md` (see description and examples)
  - `docs/SUMMARY.md`
  - `docs/integrations/<source or destination>/<name>.md`, including changelog (see changelog example)
  - `docs/integrations/README.md`
  - `airbyte-integrations/builds.md`

Airbyter
If this is a community PR, the Airbyte engineer reviewing this PR is responsible for the below items.
- `/test connector=connectors/<name>` command is passing
- New connector version released by running the `/publish` command described here

Updating a connector
Community member or Airbyter
- Secrets in the connector's spec are annotated with `airbyte_secret`
- Integration tests pass via `./gradlew :airbyte-integrations:connectors:<name>:integrationTest`
- Documentation updated:
  - Connector's `README.md`
  - Connector's `bootstrap.md` (see description and examples)
  - `docs/integrations/<source or destination>/<name>.md`, including changelog (see changelog example)

Airbyter
If this is a community PR, the Airbyte engineer reviewing this PR is responsible for the below items.
- `/test connector=connectors/<name>` command is passing
- New connector version released by running the `/publish` command described here

Connector Generator
- Generated connector folders (those containing `-scaffold` in their name) have been updated with the latest scaffold by running `./gradlew :airbyte-integrations:connector-templates:generator:testScaffoldTemplates`, then checking in your changes