A License Classifier

Introduction

The license classifier is a library and set of tools that can analyze text to determine what type of license it contains. It searches for license texts in a file and compares them to an archive of known licenses. These files could be, e.g., LICENSE files containing one or more licenses, or source code files with the license text in a comment.

A "confidence level" is associated with each result, indicating how close the match was. A confidence level of 1.0 indicates an exact match, while a confidence level of 0.0 indicates that no known license matched the text.

Adding a new license

Adding a new license is straightforward:

  1. Create a file in licenses/.

    • The filename should be the name of the license or its abbreviation. If the license is an Open Source license, use the appropriate identifier specified at https://spdx.org/licenses/.
    • If the license is the "header" version of the license, append the suffix ".header" to it. See licenses/README.md for more details.
  2. Add the license name to the list in license_type.go.

  3. Regenerate the licenses.db file by running the license serializer:

    $ license_serializer -output licenseclassifier/licenses

  4. Create and run appropriate tests to verify that the license is indeed present.

Tools

Identify license

identify_license is a command line tool that can identify the license(s) within a file.

$ identify_license LICENSE
LICENSE: GPL-2.0 (confidence: 1, offset: 0, extent: 14794)
LICENSE: LGPL-2.1 (confidence: 1, offset: 18366, extent: 23829)
LICENSE: MIT (confidence: 1, offset: 17255, extent: 1059)
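
For scripts that consume this output, one result line could be parsed roughly as follows. This is a minimal Go sketch that assumes every line has exactly the shape shown above; the `Result` type and the `parseLine` function are hypothetical helpers and are not part of the tool.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// Result holds one parsed line of identify_license output (hypothetical type).
type Result struct {
	File       string
	License    string
	Confidence float64
	Offset     int
	Extent     int
}

// linePattern assumes the output format shown in the example above.
var linePattern = regexp.MustCompile(
	`^(.+): (\S+) \(confidence: ([0-9.]+), offset: (\d+), extent: (\d+)\)$`)

// parseLine converts one output line into a Result.
func parseLine(line string) (Result, error) {
	m := linePattern.FindStringSubmatch(line)
	if m == nil {
		return Result{}, fmt.Errorf("unrecognized line: %q", line)
	}
	conf, err := strconv.ParseFloat(m[3], 64)
	if err != nil {
		return Result{}, err
	}
	offset, err := strconv.Atoi(m[4])
	if err != nil {
		return Result{}, err
	}
	extent, err := strconv.Atoi(m[5])
	if err != nil {
		return Result{}, err
	}
	return Result{File: m[1], License: m[2], Confidence: conf, Offset: offset, Extent: extent}, nil
}

func main() {
	r, err := parseLine("LICENSE: GPL-2.0 (confidence: 1, offset: 0, extent: 14794)")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", r)
}
```

Lines that do not match the assumed format are reported as errors rather than silently skipped, so a change in the tool's output would be noticed.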

License serializer

The license_serializer tool regenerates the licenses.db archive. The archive contains preprocessed license texts for quicker comparisons against unknown texts.

$ license_serializer -output licenseclassifier/licenses

This is not an official Google product (experimental or otherwise); it is just code that happens to be owned by Google.

Comments
  • Fix runtime panic on LoadLicenses()

    A path can have fewer than 3 segments, which caused a runtime panic in the license loader. We now skip those paths to fix the issue.

    Found: https://github.com/cri-o/cri-o/runs/7037925519
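
The kind of guard described in this comment might look like the following minimal Go sketch. The path layout and the `licenseNameFromPath` helper are assumptions made for illustration; the actual loader code is not shown in the comment.

```go
package main

import (
	"fmt"
	"strings"
)

// licenseNameFromPath is a hypothetical helper. It assumes paths of the form
// "<root>/<dir>/<name>" and reports ok=false for anything shorter, so the
// caller can skip the path instead of indexing past the end of the slice.
func licenseNameFromPath(path string) (name string, ok bool) {
	segments := strings.Split(path, "/")
	if len(segments) < 3 {
		return "", false
	}
	return segments[2], true
}

func main() {
	for _, p := range []string{"assets/licenses/MIT.txt", "MIT.txt"} {
		name, ok := licenseNameFromPath(p)
		if !ok {
			fmt.Println("skipping short path:", p)
			continue
		}
		fmt.Println("license file:", name)
	}
}
```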

  • identify_license: prevent send on closed channel

    I've occasionally been seeing the following panic when running identify_license:

    panic: send on closed channel
    
    goroutine 51 [running]:
    github.com/google/licenseclassifier/tools/identify_license/backend.(*ClassifierBackend).ClassifyLicenses.func1.1(0xc0043a1910, 0xc003466900)
            licenseclassifier/tools/identify_license/backend/backend.go:82 +0x4e
    github.com/google/licenseclassifier/tools/identify_license/backend.(*ClassifierBackend).ClassifyLicenses.func1(0x7ffeefbff9b9, 0x13)
            licenseclassifier/tools/identify_license/backend/backend.go:87 +0xb8
    created by github.com/google/licenseclassifier/tools/identify_license/backend.(*ClassifierBackend).ClassifyLicenses
            licenseclassifier/tools/identify_license/backend/backend.go:92 +0x1ea
    

    The commit message explains this issue and the fix:

    The analyze function calls Done on the waitgroup and then opens up
    a new slot by writing into the `task` channel. If this is the last
    goroutine running, this can cause an issue: there is a goroutine
    running that is waiting for the waitgroup to yield. After this it
    closes the `task` channel. If it gets closed before the
    `analyze` goroutine manages to send true into the channel, a panic
    can occur.
    
    The channel send and `wg.Done()` call could be reordered, but also
    there is no strict need to close the channel in this instance, as
    there is no receiver for it after this point.
    

    Happy to adjust to something else if there are different opinions on how to deal with this.

  • Update embed.go

    Fixed comment to work when using go get command

    Currently when using go get -u github.com/google/licenseclassifier/... I get the following error:

    /opt/gocode/pkg/mod/github.com/google/licenseclassifier@…/licenses/embed.go:8:12: pattern *.db: no matching files found
    /opt/gocode/pkg/mod/github.com/google/licenseclassifier@…/licenses/embed.go:8:12: pattern *.db: no matching files found

    But when I apply the attached change locally, it seems to work.

    Env: Ubuntu 16 with Go 1.16.15 installed

  • Switch to go:embed

    These two .db files are basically looked up directly at runtime. Does anyone have any objection to switching this to go:embed? And is there a maintainer around who would want to merge such a fix? This would need golang 1.16 as a minimum, but that's pretty reasonable.

    https://github.com/google/licenseclassifier/blob/main/licenses/forbidden_licenses.db

    Cheers!

  • Add dummy.go to licenses dir

    This allows users of https://github.com/google/go-licenses to run the tool in a Go-module-compliant way by depending on this package and running the tool. Without this, that invariably fails with 'licenses.db cannot be found'.

    With this, users can add

    github.com/google/go-licenses and github.com/google/licenseclassifier/licenses

    to their hack/tools.go file, and then the run of the license classifier is hermetic (uses only the versions known by go modules, which prevents weird inconsistencies due to mismatched tool versions).

  • licenses: add BSD-3-Clause-Go

    This is the BSD-3-Clause license used by the Go language project.

    The motivation for adding this is to add the header version, which is otherwise not detected.

    The license text is copied from: https://go.googlesource.com/go/+/refs/heads/master/LICENSE

  • Refactor License construction to permit loading from provided bytes.

    This permits using this package in environments without a GOPATH or without requiring filesystem operations (e.g. by embedding licenses.db).

    Also remove unnecessary mutex and loop counter from registerLicenses.

  • match offset and extent out of range

    Thank you for this amazing library! I'm amazed by its speed and accuracy, this is truly great work!

    One problem I notice is that:

    When I use MultipleMatch and parse the offset and extent, I find them to be inaccurate. This is understandable if the algorithm is not perfect. However, I found cases where the offset and extent are invalid for the full input text: offset + extent is greater than the full text length.

    My guess is that offset and extent did not take the text normalization step into consideration: https://github.com/google/licenseclassifier/blob/bb04aff29e72e636ba260ec61150c6e15f111d7e/classifier.go#L162. So they are in fact offset and extent of the normalized text, which doesn't match the original full text.
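
One way to reason about this mismatch is to keep a position map while normalizing, so that spans measured in the normalized text can be translated back. The Go sketch below is only an illustration: its normalization (collapsing runs of ASCII whitespace) is an assumption and is not the library's normalization step, and `normalizeWithMap` and `toOriginalSpan` are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeWithMap collapses each run of ASCII whitespace into one space and
// records, for every byte of the result, its byte index in orig.
func normalizeWithMap(orig string) (string, []int) {
	norm := make([]byte, 0, len(orig))
	origIndex := make([]int, 0, len(orig))
	inSpace := false
	for i := 0; i < len(orig); i++ {
		c := orig[i]
		isSpace := c == ' ' || c == '\n' || c == '\t' || c == '\r'
		if isSpace && inSpace {
			continue // drop repeated whitespace
		}
		if isSpace {
			c = ' '
		}
		inSpace = isSpace
		norm = append(norm, c)
		origIndex = append(origIndex, i)
	}
	return string(norm), origIndex
}

// toOriginalSpan converts an (offset, extent) pair measured in the normalized
// text into an (offset, extent) pair in the original text.
func toOriginalSpan(origIndex []int, offset, extent int) (int, int, bool) {
	if extent <= 0 || offset < 0 || offset+extent > len(origIndex) {
		return 0, 0, false
	}
	start := origIndex[offset]
	end := origIndex[offset+extent-1] + 1
	return start, end - start, true
}

func main() {
	orig := "MIT   License\n\n\nPermission is granted"
	norm, origIndex := normalizeWithMap(orig)
	offset := strings.Index(norm, "Permission")
	start, extent, ok := toOriginalSpan(origIndex, offset, len("Permission"))
	if ok {
		fmt.Printf("normalized offset %d maps to original offset %d: %q\n",
			offset, start, orig[start:start+extent])
	}
}
```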

  • Make the dummy dep on licenses dir stronger

    Without an actual code-level dependency, go mod vendor will elide the licenses directory, which is where the actual DB files are stored. Without those, it's not very useful to vendor this lib.

    @wcn3

  • Add a way to specify the db file

    In the case where I distribute the binary and the db file separately, I want to be able to specify where licenseclassifier should look for the db. This is related to go-licenses (https://github.com/google/go-licenses/blob/master/licenses/classifier.go)

    As of today, you need the source of the go-licenses binary in the right path (GOPATH, …) to be able to use it.

    Another alternative would be to embed the .db in the binary :angel:

  • Updated archive options to take user-provided bytes or a function

    This allows library users the flexibility to load their DB from an alternate location, including the network or an embedded array, if necessary.

    Also fixed a bug whereby archive serialization would store the whole path. Now, the archive properly stores only the ID instead of the entire path to the ID. This happened when an absolute path was used for the directory, which is useful if you're using a DB in an alternate location.
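
The flexibility described here is commonly achieved with functional options. The following Go sketch shows the pattern under stated assumptions: `archiveConfig`, `Option`, `WithBytes`, `WithFunc`, and `loadArchive` are hypothetical names chosen for illustration and are not the library's actual API.

```go
package main

import (
	"errors"
	"fmt"
)

// archiveConfig and the option names below are hypothetical; they only
// illustrate the pattern and are not taken from the library.
type archiveConfig struct {
	load func() ([]byte, error)
}

// Option configures where the archive bytes come from.
type Option func(*archiveConfig)

// WithBytes supplies the archive contents directly, e.g. from an embedded array.
func WithBytes(b []byte) Option {
	return func(c *archiveConfig) {
		c.load = func() ([]byte, error) { return b, nil }
	}
}

// WithFunc supplies a callback that fetches the archive, e.g. over the network.
func WithFunc(f func() ([]byte, error)) Option {
	return func(c *archiveConfig) { c.load = f }
}

// loadArchive applies the options in order and reads the archive bytes.
func loadArchive(opts ...Option) ([]byte, error) {
	var c archiveConfig
	for _, opt := range opts {
		opt(&c)
	}
	if c.load == nil {
		return nil, errors.New("no archive source configured")
	}
	return c.load()
}

func main() {
	data, err := loadArchive(WithBytes([]byte("archive contents")))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("loaded", len(data), "bytes")
}
```

Accepting a function as well as raw bytes lets the caller defer the fetch until the archive is actually needed.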

  • Publish binary releases for common platforms

    We're considering adding the identify_license tool to ORT as what we call a "scanner"; see here for some context. As such, it would be nice if we could easily bootstrap identify_license by simply downloading a binary for the respective platform.

    So, would it be possible to cut a new release (also see https://github.com/google/licenseclassifier/issues/37) and attach binary assets to it?

  • Compiled default classifier

    DefaultClassifier is a bit expensive as it tokenizes and normalizes licenses. We don't want to do that on every run. Is there any way to pass in already-parsed docs, given that docs is a private field? What if we add NewClassifierWithDocs or something like that? https://github.com/aquasecurity/licenseclassifier/blob/c913e304a1534c4580fa70c2c3af5cd85d99fc9c/v2/classifier.go#L194

  • Bug in the computeQ function - v2 classifier

    Describe the issue

    In computeQ, when the threshold is set to 1.0, the granularity is calculated as 10. But if we set the threshold to 0.95, 0.99, or 0.999, the granularity is calculated as 19, 99, or 999, respectively. The granularity grows steeply as the threshold approaches 1.0, and these values are greater than the granularity of 10 used at the maximum threshold (1.0).

    Is this intentional?

    As a result, when we set the threshold to 0.95 or greater, many licenses go undetected that are easily detected at a threshold of 0.9.

    I ran the program on around 17,300 license files. Around 2,950 BSD-3-Clause, 850 BSD-2-Clause, and some other licenses were not detected at all, although they were detected at a granularity of 10, because at that threshold the granularity is greater than 20 and nearly reaches 100.

    A possible solution would be to set the granularity to 10 for any threshold greater than 0.9, which would also handle the divide-by-zero case.
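
The reported values and the proposed cap can be reproduced with a small model. This Go sketch is inferred from the numbers quoted in this issue (19, 99, and 999 correspond to threshold / (1 - threshold)); it is not a copy of the library's computeQ, and `reportedQ` and `cappedQ` are hypothetical names. Rounding is used here so that floating-point error does not change the quoted values.

```go
package main

import (
	"fmt"
	"math"
)

// reportedQ models the behavior described in this issue: 10 at a threshold of
// exactly 1.0, otherwise threshold/(1-threshold). This formula is an
// assumption inferred from the reported values.
func reportedQ(threshold float64) int {
	if threshold == 1.0 {
		return 10 // special case that avoids dividing by zero
	}
	return int(math.Round(threshold / (1.0 - threshold)))
}

// cappedQ applies the fix proposed above: use 10 for any threshold above 0.9.
func cappedQ(threshold float64) int {
	if threshold > 0.9 {
		return 10
	}
	return reportedQ(threshold)
}

func main() {
	for _, t := range []float64{0.5, 0.95, 0.99, 0.999, 1.0} {
		fmt.Printf("threshold %.3f: reported %d, capped %d\n",
			t, reportedQ(t), cappedQ(t))
	}
}
```

With the cap in place, the threshold of 1.0 no longer needs its own branch in `cappedQ`, since it falls under the same rule as every other threshold above 0.9.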

  • Add main package file in v2

    Perhaps it would be good to have a main package file in v2 of licenseclassifier, as in v1. Alongside license detection capabilities, it could also support copyright detection, JSON results, and scanning an entire directory for faster and more efficient results.

    Thoughts?

  • Add blessing license to unencumbered licenses?

    I'm using sqlite3, which uses the blessing license, and I'd like to add it to the unencumbered licenses in this library. Can I do that?

    https://spdx.org/licenses/blessing.html
