Skip to content

Conversation

@cjwooo
Copy link
Contributor

@cjwooo cjwooo commented Sep 7, 2021

What

Our use case requires the full values in certain streams, such as the author and committer fields in the Commits stream, not just their ids.

How

Add an option to the GitHub Source spec, allowing the user to choose whether or not to minimize fields for all streams. By default, the option is true, meaning fields will still be minimized.

Pre-merge Checklist

Community member or Airbyter

  • Grant edit access to maintainers (instructions)
  • Secrets in the connector's spec are annotated with airbyte_secret
  • Unit & integration tests added and passing. Community members, please provide proof of success locally e.g: screenshot or copy-paste unit, integration, and acceptance test output. To run acceptance tests for a Python connector, follow instructions in the README. For java connectors run ./gradlew :airbyte-integrations:connectors:<name>:integrationTest.
  • Code reviews completed
  • Documentation updated
    • Connector's README.md
    • Connector's bootstrap.md. See description and examples
    • Changelog updated in docs/integrations/<source or destination>/<name>.md including changelog. See changelog example
  • PR name follows PR naming conventions
  • Connector version bumped like described here

Airbyter

If this is a community PR, the Airbyte engineer reviewing this PR is responsible for the below items.

  • Create a non-forked branch based on this PR and test the below items on it
  • Build is successful
  • Credentials added to Github CI. Instructions.
  • /test connector=connectors/<name> command is passing.
  • New Connector version released on Dockerhub by running the /publish command described here
@github-actions github-actions bot added the area/connectors Connector related issues label Sep 7, 2021
Copy link
Contributor

@sherifnada sherifnada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is making me wonder if we should just not minimize fields at all. it seems to have been a fairly controversial move. I think I would prefer we do that since it's more in line with ELT. @cjwooo would you mind changing the PR to do that instead?

@cjwooo
Copy link
Contributor Author

cjwooo commented Sep 9, 2021

Yea I can do that. Do other sources have the same field minimization feature?

@sherifnada
Copy link
Contributor

I don't think so, no

@cjwooo
Copy link
Contributor Author

cjwooo commented Sep 10, 2021

@sherifnada Should we get some confirmation from the larger community before unminimizing everything? I'm only concerned it would break existing users' workflows.

@sherifnada
Copy link
Contributor

sherifnada commented Sep 13, 2021

@cjwooo apologies - I should have been clearer. Totally agreed with you that backwards compatibility is desirable here. I think we should:

  1. Keep the user_id field or any other minimized fields in records
  2. Add back the expanded fields into the recods
    so it would be something like:
record['user_id'] = record['user']['id'] 

and we can get rid of the minimization logic. I believe this would be totally backwards compatible. Does that help?

@cjwooo
Copy link
Contributor Author

cjwooo commented Sep 14, 2021

That makes sense, but one issue I see is that part of the current minimization logic is transforming a list of objects into a list of ids: https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/source-github/source_github/streams.py#L200-L201. And the record key for this case stays the same, e.g. https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/source-github/source_github/schemas/pull_requests.json#L62-L66. I don't think we can suddenly switch the outputted records for the PullRequests stream to have an array of label objects under record['labels']

@cjwooo
Copy link
Contributor Author

cjwooo commented Sep 15, 2021

@sherifnada
Copy link
Contributor

@cjwooo hmm that's a good point. I think we have a few options then:

  1. keep the current schemas and in instances like what you describe, give the columns some other name which doesn't conflict (but that name would be out of line with what the API calls it, so we'd need to maintain that transformation forever)
  2. Rip the bandaid and change the schema to what it's supposed to be, accepting that it would cause some breakage

I think I would rather do 2 rather than have a scar in the output schema for the rest of time. We could do a major version release to indicate this and manage the roll out such that consumers can prepare themselves for this. WDYT?

@cjwooo
Copy link
Contributor Author

cjwooo commented Sep 16, 2021

I'd also rather go with option 2. I'll go ahead and implement that.

@cjwooo
Copy link
Contributor Author

cjwooo commented Sep 16, 2021

@sherifnada Updated PR.

Copy link
Contributor

@sherifnada sherifnada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cjwooo apologies to increase the scope on you here, but would you mind removing all minimization? if we're going to make a breaking change I would rather that we do it exactly once rather than phase out minimization in different stages.

Though if the current set of minimized fields will require no breaking changes in the future feel free to let it be and we'll finish that piece on our side. LMK.

Otherwise this looks great! thanks so much for trooping through this issue. I'll merge it as soon as you let me know the verdict on the above.

@cjwooo
Copy link
Contributor Author

cjwooo commented Sep 17, 2021

@sherifnada Removed all minimization logic.

@sherifnada sherifnada changed the title GitHub Source: Add option to not minimize fields GitHub Source: Don't minimize fields by default Sep 20, 2021
@sherifnada sherifnada changed the title GitHub Source: Don't minimize fields by default GitHub Source: Don't minimize output fields Sep 20, 2021
@sherifnada sherifnada mentioned this pull request Sep 20, 2021
@sherifnada sherifnada changed the title GitHub Source: Don't minimize output fields 🎉 GitHub Source: Don't minimize output fields Sep 20, 2021
@sherifnada sherifnada merged commit 775bb03 into airbytehq:master Sep 20, 2021
@cjwooo cjwooo deleted the cwu/github branch September 30, 2021 19:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment