Skip to content

Conversation

@keshav-space
Copy link
Member

@keshav-space keshav-space commented Aug 2, 2024

This PR introduces the aboutcode pipeline for defining and running improvements on data (see #1395). It also adds a new pipeline to flag ghost packages. Ghost packages are those packages that are not present in the upstream package manager index.

Screenshot from 2024-08-23 18-45-24
@keshav-space keshav-space marked this pull request as draft August 2, 2024 12:48
@keshav-space keshav-space marked this pull request as ready for review August 7, 2024 07:57
@keshav-space keshav-space changed the title Add improver pipeline to remove ghost packages #644 #917 #1395 Add improver pipeline to flag ghost packages #644 #917 #1395 Aug 21, 2024
Copy link
Member

@pombredanne pombredanne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here are a few nits for your review.

Signed-off-by: Keshav Priyadarshi <git@keshav.space>
Signed-off-by: Keshav Priyadarshi <git@keshav.space>
Signed-off-by: Keshav Priyadarshi <git@keshav.space>
Signed-off-by: Keshav Priyadarshi <git@keshav.space>
Signed-off-by: Keshav Priyadarshi <git@keshav.space>
Signed-off-by: Keshav Priyadarshi <git@keshav.space>
Signed-off-by: Keshav Priyadarshi <git@keshav.space>
- Package can have malicious, ghost, yanked, valid and unknown status Signed-off-by: Keshav Priyadarshi <git@keshav.space>
Signed-off-by: Keshav Priyadarshi <git@keshav.space>
Signed-off-by: Keshav Priyadarshi <git@keshav.space>
Signed-off-by: Keshav Priyadarshi <git@keshav.space>
Signed-off-by: Keshav Priyadarshi <git@keshav.space>
Signed-off-by: Keshav Priyadarshi <git@keshav.space>
Copy link
Member

@pombredanne pombredanne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks... here a few nits for your review... I wonder if there is a real need to convert versions with univers for this use case.

Also:

  • Thanks for adding logging!
  • Please add a CHANGELOG entry for logging and pipeline introduction
version_class = RANGE_CLASS_BY_SCHEMES[purl.type].version_class

try:
return {version_class(v.value) for v in versions(str(purl))}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To check for a ghost, why do we need to convert things to a version class?
Could we just check the if the version string exists?
Otherwise we are eventually ghosting all the packages with a version we cannot parse

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To check for a ghost, why do we need to convert things to a version class?
Could we just check the if the version string exists?

Simple string comparison may lead to incorrectly flagging legitimate versions as ghost packages:

"1.2.0" != "1.02.0" "v1.2.3" != "1.2.3" "1.0" != "1.0.0"

Otherwise we are eventually ghosting all the packages with a version we cannot parse

If we cannot parse the version for a given base_purl, we skip those and don’t mark them as ghosts.

keshav-space and others added 3 commits August 23, 2024 20:45
 Signed-off-by: Keshav Priyadarshi <git@keshav.space> Co-authored-by: Philippe Ombredanne <pombredanne@nexb.com>
Signed-off-by: Keshav Priyadarshi <git@keshav.space>
Signed-off-by: Keshav Priyadarshi <git@keshav.space>
Copy link
Member

@pombredanne pombredanne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good to go!

@keshav-space keshav-space merged commit 1e3afdc into main Aug 26, 2024
@keshav-space keshav-space deleted the 917-ghost-remover branch August 26, 2024 07:50
@armijnhemel
Copy link
Contributor

Would this also address situations where the upstream index is wrong, such as #915 ?

@keshav-space
Copy link
Member Author

Would this also address situations where the upstream index is wrong, such as #915 ?

@armijnhemel at present, this will flag the ghost packages from the following ecosystems: cargo, composer, conan, deb, gem, github, golang, hex, maven, npm, nuget, and pypi. Once we have support for alpine in https://github.com/aboutcode-org/fetchcode/blob/master/src/fetchcode/package_versions.py, this will also flag ghost packages for the alpine ecosystem.

@armijnhemel
Copy link
Contributor

Would this also address situations where the upstream index is wrong, such as #915 ?

@armijnhemel at present, this will flag the ghost packages from the following ecosystems: cargo, composer, conan, deb, gem, github, golang, hex, maven, npm, nuget, and pypi. Once we have support for alpine in https://github.com/aboutcode-org/fetchcode/blob/master/src/fetchcode/package_versions.py, this will also flag ghost packages for the alpine ecosystem.

I am not sure you understood my question: what if upstream is wrong (as is the case in Alpine see #915 )? Are you always assuming that the index is correct?

@keshav-space
Copy link
Member Author

Would this also address situations where the upstream index is wrong, such as #915 ?

@armijnhemel at present, this will flag the ghost packages from the following ecosystems: cargo, composer, conan, deb, gem, github, golang, hex, maven, npm, nuget, and pypi. Once we have support for alpine in https://github.com/aboutcode-org/fetchcode/blob/master/src/fetchcode/package_versions.py, this will also flag ghost packages for the alpine ecosystem.

I am not sure you understood my question: what if upstream is wrong (as is the case in Alpine see #915 )? Are you always assuming that the index is correct?

@armijnhemel Yes, this pipeline assumes that package index is right. We should have seperate pipeline for case where index is wrong. We will need some generalized way to conclude that package index is reporting wrong version.

@armijnhemel
Copy link
Contributor

Would this also address situations where the upstream index is wrong, such as #915 ?

@armijnhemel at present, this will flag the ghost packages from the following ecosystems: cargo, composer, conan, deb, gem, github, golang, hex, maven, npm, nuget, and pypi. Once we have support for alpine in https://github.com/aboutcode-org/fetchcode/blob/master/src/fetchcode/package_versions.py, this will also flag ghost packages for the alpine ecosystem.

I am not sure you understood my question: what if upstream is wrong (as is the case in Alpine see #915 )? Are you always assuming that the index is correct?

@armijnhemel Yes, this pipeline assumes that package index is right. We should have seperate pipeline for case where index is wrong. We will need some generalized way to conclude that package index is reporting wrong version.

So far I have only seen it once (in Alpine, see #915) so perhaps it could just be added as an exception without building a separate pipeline but I guess it might happen more often.

@pombredanne
Copy link
Member

@armijnhemel The general approach to curation is going to be:

  1. use federatedcode to store backing data in VCS repos
  2. treat a curation outcome as ultimately creating a new "advisory" like record there
  3. use an importer that treat this advisories as any other data sources, but with an increased priority, and also being aware of the historical context of this curation.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

4 participants