v32 of ScanCode is all about improved license detections!
We have more licenses and rules, and major updates on post-processing matches to license detections.
We also have major improvements in package license detections and unknown references, along with top level detection
summaries for licenses, and reference data for the licenses detected too. There are also a couple of API changes due to
model changes in license data.
See also https://github.com/nexB/scancode.io/ for a complete, customizable SCA solution using ScanCode and
https://github.com/nexB/scancode-workbench/releases for visualizing data generated by ScanCode Toolkit.
Important API changes:
This is a major release with major API and output format changes and significant
feature updates.
In particular the output format has changed for the licenses and packages, and
also for some of the command line options.
The output format version is now 3.0.0.
See https://github.com/nexB/scancode-toolkit/milestone/15 for more details on this release.
Package detection:
Update GemfileLockParser to track the gem which the Gemfile.lock is for,
which we assign to the new GemfileLockParser.primary_gem field. Update
GemfileLockHandler.parse() to handle the case where there is a primary gem
detected from a Gemfile.lock. If there is a primary gem, a single Package is
created and the detected gem data within the Gemfile.lock are assigned as
dependencies. If there is no primary gem, then all of the dependencies are
collected into a Package with no name and yielded.
Repeated package and dependency results when scanning extracted rubygem #3072
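The Gemfile.lock behavior described above could be pictured roughly as follows. This is only a sketch under assumptions: the Gem and Package dataclasses and the build_package() helper are hypothetical stand-ins, not the actual scancode-toolkit classes, and excluding the primary gem from its own dependency list is an assumption of this sketch.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Gem:
    # Hypothetical stand-in for a gem entry found in a Gemfile.lock.
    name: str
    version: Optional[str] = None


@dataclass
class Package:
    # Hypothetical stand-in for a package with its dependencies.
    name: Optional[str]
    version: Optional[str]
    dependencies: List[Gem] = field(default_factory=list)


def build_package(gems: List[Gem], primary_gem: Optional[Gem] = None) -> Package:
    """Return a single Package for the gems found in one Gemfile.lock."""
    if primary_gem is not None:
        # A primary gem exists: it names the package, the rest are dependencies.
        deps = [g for g in gems if g.name != primary_gem.name]
        return Package(primary_gem.name, primary_gem.version, deps)
    # No primary gem: collect every gem into a package with no name.
    return Package(None, None, list(gems))


gems = [Gem("rails", "7.0.4"), Gem("rake", "13.0.6")]
with_primary = build_package(gems, primary_gem=Gem("myapp", "1.0.0"))
without_primary = build_package(gems)
```

Either way a single package comes back per lockfile, which is the point of the change: the gems listed in the lockfile end up as dependencies rather than as separate packages.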
Fix issue where dependencies were not reported when scanning an extracted
Python project by modifying BaseExtractedPythonLayout.assemble() to favor
using package data from a PKG-INFO file from an egg-info directory. Package
data from a PKG-INFO file from an egg-info directory contains the dependency
information collected from the requirements.txt file alongside PKG-INFO.
No dependency results when scanning celery-5.2.7.tar.gz #3083
Fix issue where we were returning an incorrect purl package type for
cocoapods. pods was being returned as the purl type for cocoapods; it should
be cocoapods instead.
https://github.com/package-url/purl-spec/blob/master/PURL-TYPES.rst#cocoapods
Incorrect purl type for cocoapods #3081
Code for parsing a Maven POM, npm package.json, freebsd manifest and haxelib
JSON have been separated into two functions: one that creates a PackageData
object from the parsed Resource, and another that calls the previous function
and yields the PackageData. This was done such that we can use the package
manifest data parsing code outside of the scancode-toolkit context in other
libraries.
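The two-function split described above might look like the sketch below. The function names, the plain dict standing in for a PackageData object, and the choice of npm as the example are all assumptions made for illustration; they do not reproduce the actual handler code.

```python
import json
from typing import Iterator


def build_package_data(manifest: dict) -> dict:
    """Create package data from already-parsed manifest content.

    Hypothetical helper: a plain dict stands in for a PackageData object.
    This part has no dependency on a scan context, so another library
    could call it directly.
    """
    return {
        "type": "npm",
        "name": manifest.get("name"),
        "version": manifest.get("version"),
    }


def parse(manifest_text: str) -> Iterator[dict]:
    """Parse the manifest text, call the builder and yield its result."""
    yield build_package_data(json.loads(manifest_text))


packages = list(parse('{"name": "left-pad", "version": "1.3.0"}'))
```

Keeping the builder separate from the generator is what allows reuse: callers that already hold parsed manifest data can skip the yielding wrapper entirely.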
The PackageData model now includes a holder field, which is populated with
holder data extracted from the copyright field if copyright data is present,
otherwise it remains empty.
Add Copyright Holder information for Packages #3290
DatafileHandlers now have a classmethod named get_top_level_resources(),
which is supposed to yield the top-level Resources of a Package codebase,
relative to a Package manifest file. maven.MavenPomXmlHandler is the first
DatafileHandler that has this method implemented.
License detection:
The SPDX license list has been updated to the latest v3.20
This is a major update to license detection where we now combine one or more
license matches in a larger license detection. This approach improves the
accuracy of license detection and removes a large number of false positive
or ambiguous license detections. For details, see
RFC: a plan for false positive license detection #2878
There is a new license_detections codebase level attribute with all the
unique license detections in the whole scan, both in resources and packages.
This has the 3 attributes also present in package/resource level license
detections: license_expression, identifier and detection_log (present
optionally if the --license-diagnostics option is enabled) with an additional
attribute:
count: Number of times in the codebase this unique license detection was
encountered.
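To illustrate how a count per unique detection relates to the file-level data, here is a minimal sketch that groups detections by their identifier. It is not the toolkit's implementation: it only looks at resources (the real attribute also covers packages), and the sample identifiers are made-up placeholders.

```python
from collections import Counter


def unique_detections(files):
    """Group file-level license detections by identifier and count them."""
    counts = Counter()
    expressions = {}
    for resource in files:
        for detection in resource.get("license_detections", []):
            identifier = detection["identifier"]
            counts[identifier] += 1
            expressions[identifier] = detection["license_expression"]
    return [
        {
            "identifier": identifier,
            "license_expression": expressions[identifier],
            "count": count,
        }
        for identifier, count in counts.items()
    ]


# Placeholder identifiers, for illustration only.
files = [
    {"license_detections": [{"identifier": "mit-aaaa", "license_expression": "mit"}]},
    {"license_detections": [{"identifier": "mit-aaaa", "license_expression": "mit"}]},
    {"license_detections": [{"identifier": "apache-2.0-bbbb", "license_expression": "apache-2.0"}]},
]
summary = unique_detections(files)
```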
The data structure of the JSON output has changed for licenses at file level:
The licenses attribute is deleted.
A new license_detections attribute contains license detections in that file.
This object has three attributes: license_expression, identifier and matches.
matches is a list of license matches and is roughly the same as licenses in
the previous version with additional structure changes detailed below.
identifier is the detected license expression with a UUID generated from the
content of matches such that this is unique for unique detections. We also
have another attribute detection_log with diagnostics information if the
--license-diagnostics option is enabled.
A new attribute license_clues contains license matches with the same data
structure as the matches attribute in license_detections. This contains
license matches that are mere clues and were not considered to be a proper
conclusive license detection.
The license_expressions list of license expressions is deleted and replaced
by a detected_license_expression single expression. Similarly
spdx_license_expressions was removed and replaced by
detected_license_expression_spdx.
See the license updates documentation at
https://scancode-toolkit.readthedocs.io/en/latest/explanations/license-detection-reference.html#change-in-license-data-format-resource
for examples and details.
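A consumer of the JSON output could read the new file-level attributes along these lines. The attribute names come from the notes above; the summarize_file() helper, the sample resource and its values are hypothetical, and real match objects carry more fields than shown here.

```python
def summarize_file(resource):
    """Collect the new file-level license attributes of one scanned resource."""
    detections = resource.get("license_detections", [])
    clues = resource.get("license_clues", [])
    return {
        "expression": resource.get("detected_license_expression"),
        "spdx": resource.get("detected_license_expression_spdx"),
        "detection_ids": [d["identifier"] for d in detections],
        "match_count": sum(len(d["matches"]) for d in detections),
        "clue_count": len(clues),
    }


# Hypothetical, heavily trimmed resource entry.
resource = {
    "path": "src/module.py",
    "detected_license_expression": "mit",
    "detected_license_expression_spdx": "MIT",
    "license_detections": [
        {
            "license_expression": "mit",
            "identifier": "mit-aaaa",  # placeholder, not a real identifier
            "matches": [{"license_expression": "mit"}],
        }
    ],
    "license_clues": [],
}
file_summary = summarize_file(resource)
```

Code that previously iterated over the removed licenses list would iterate over the matches inside each entry of license_detections instead, and treat license_clues as a separate, weaker signal.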
The data structure of license attributes in package_data and the codebase
level packages has been updated accordingly:
There is a new license_detections attribute for the primary, top-level
declared licenses of a package and an other_license_detections attribute
for the other secondary detections.
The license_expression is replaced by the declared_license_expression and
other_license_expression attributes with their SPDX counterparts
declared_license_expression_spdx and other_license_expression_spdx.
These expressions are parallel to detections.
The declared_license attribute is renamed extracted_license_statement
and is now a YAML-encoded string, which can be parsed to recreate the
original extracted license statement. Previously this used to be nested
python objects lists/dicts/string, but now this is always a YAML string.
See the license updates documentation at
https://scancode-toolkit.readthedocs.io/en/latest/explanations/license-detection-reference.html#change-in-license-data-format-package
for examples and details.
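Tools that must read package data from scans made both before and after this release could use small accessors such as the ones sketched here. The two helper functions are hypothetical; only the attribute names are taken from the notes above.

```python
def declared_expression(package):
    """Return the declared license expression from new or old package data."""
    if "declared_license_expression" in package:
        return package["declared_license_expression"]
    # Fall back to the attribute name used before this release.
    return package.get("license_expression")


def extracted_statement(package):
    """Return the extracted license statement from new or old package data.

    In the new format the value is a YAML-encoded string. In the old
    format it may be a nested Python object, so callers still need to
    handle both shapes.
    """
    if "extracted_license_statement" in package:
        return package["extracted_license_statement"]
    return package.get("declared_license")


new_style = {
    "declared_license_expression": "mit",
    "extracted_license_statement": "MIT",
}
old_style = {"license_expression": "mit", "declared_license": ["MIT"]}
```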
The license matches structure has changed: we used to report one match for
each license key of a matched license expression. We now report instead one
single match for each matched license expression, and list the license keys
as a licenses attribute. This avoids data duplication. Inside each match, we
list each match and matched rule attribute directly, avoiding nesting.
See the license updates documentation at
https://scancode-toolkit.readthedocs.io/en/latest/explanations/license-detection-reference.html#licensematch-result-data
for examples and details.
There are new codebase level attributes with --license-references to report
reference license metadata and texts once for each license matched across the
scan; we now have two codebase level attributes: license_references and
license_rule_references that list unique detected licenses and license rules.
This reference data is also removed from license matches at all levels, i.e.
from codebase, package and resource level license detections and resource
level license clues, irrespective of this CLI option being used, i.e. by
default with --license.
See the license updates documentation at
https://scancode-toolkit.readthedocs.io/en/latest/explanations/license-detection-reference.html#comparision-before-after-license-references
for examples and details.
We replaced the scancode --reindex-licenses command line option with a
new separate command named scancode-reindex-licenses.
The --reindex-licenses-for-all-languages CLI option is also moved to
the scancode-reindex-licenses command as an option --all-languages.
We can now detect licenses using custom license texts and license rules
stored in a directory or packaged as a plugin for consistent reuse and deployment.
There is an --additional-directory option with the scancode-reindex-licenses
command to add the licenses from a directory.
There is also a --only-builtin option to use only builtin licenses,
ignoring any additional license plugins.
See Add support for "extra", e.g. private or local licenses #480 for more details.
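For scripted setups, the new command and its options could be assembled as in this sketch. The reindex_command() helper is hypothetical, and it assumes that --additional-directory takes the directory path as its value; the command and option names themselves are the ones listed above.

```python
def reindex_command(additional_directory=None, only_builtin=False, all_languages=False):
    """Build the argument list for the separate reindexing command."""
    command = ["scancode-reindex-licenses"]
    if only_builtin:
        command.append("--only-builtin")
    if additional_directory:
        # Assumption: the option is followed by the directory path.
        command.extend(["--additional-directory", str(additional_directory)])
    if all_languages:
        command.append("--all-languages")
    return command


custom = reindex_command(additional_directory="/opt/licenses")
# To actually run it (requires scancode-toolkit to be installed):
# import subprocess; subprocess.run(custom, check=True)
```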
We combined the license data file and text file of each license in a single
file with a .LICENSE extension. The .yml data file is now included at the
top of each .LICENSE file as "YAML frontmatter". The same applies to license
rules and their .RULE and .yml files. This halves the number of data files
from about 60,000 to 30,000. Git line history is preserved for the combined
text + yml files.
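A combined file of this kind can be split back into its data and text parts with a few lines of code. The sketch below assumes the common frontmatter convention of a block delimited by lines containing only ---, and returns the YAML part as an unparsed string; the split_frontmatter() helper and the sample content are illustrative, not taken from the toolkit.

```python
def split_frontmatter(content):
    """Split a combined file into (yaml_data, text).

    Assumes the data block is delimited by lines made of '---' only.
    Returns an empty data string if no such block is found.
    """
    lines = content.splitlines()
    if not lines or lines[0].strip() != "---":
        return "", content
    for index in range(1, len(lines)):
        if lines[index].strip() == "---":
            data = "\n".join(lines[1:index])
            text = "\n".join(lines[index + 1:]).lstrip("\n")
            return data, text
    # Opening delimiter without a closing one: treat everything as text.
    return "", content


sample = "---\nkey: mit\nname: MIT License\n---\n\nPermission is hereby granted\n"
data, text = split_frontmatter(sample)
```

The data string can then be handed to any YAML parser, while the text part is the license or rule text itself.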
There is a new console script scancode-license-data to export license data
in JSON, YAML and HTML, with indexes and a static website for use in the
licensedb web site. This becomes the API way to get scancode license data.
See Add a command line option to dump the license data #2738
The deprecated "--is-license-text" option has been removed.
This is now built-in with the --license-text option and --info
and exposed with the "percentage_of_license_text" attribute.
The license dump() has been modified to add an extra space at empty
newlines for license files which also have multiple indentation levels
as this was generating invalid YAML output files when --license-text or
--license-references was enabled.
See scancode license scan produces invalid yaml #3219
A bugfix has been added to the --unknown-licenses option where we would
crash when using this option without using the --matched-text option. This
is now working correctly and also better tested.
See crash when using --unknown-licenses #3343
What's Changed
Fix issue 3155 by running scancode-reindex-licenses subcommand instead of using --reindex-licenses flag by @abhi-kr-2100 in #3159
Add get_top_level_resources() to DatafileHandler class by @JonoYang in #3315
New Contributors
Fix issue 3155 by running scancode-reindex-licenses subcommand instead of using --reindex-licenses flag #3159
Full Changelog: v31.2.4...v32.0.0
This discussion was created from the release v32.0.0.