Gives criticality score for an open source project

Open Source Security Foundation (OpenSSF)

Last update: Dec 30, 2022

Comments: 16

Open Source Project Criticality Score (Beta)

This project is maintained by members of the Securing Critical Projects WG.

Goals

Generate a criticality score for every open source project.
Create a list of critical projects that the open source community depends on.
Use this data to proactively improve the security posture of these critical projects.

Criticality Score

A project's criticality score defines the influence and importance of a project. It is a number between 0 (least-critical) and 1 (most-critical). It is based on the following algorithm by Rob Pike:
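Written out with the symbols of the parameter table below (signal S_i, weight α_i, max threshold T_i), the formula is a weighted average of log-scaled ratios. The form given here is a reconstruction rather than a quotation. It does reproduce the 0.9862 reported for Kubernetes in the Usage section.

```latex
C_{\text{project}} = \frac{1}{\sum_i \alpha_i} \sum_i \alpha_i \, \frac{\log(1 + S_i)}{\log\bigl(1 + \max(S_i, T_i)\bigr)}
```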

We use the following parameters to derive the criticality score for an open source project:

Parameter (S_i)	Weight (α_i)	Max threshold (T_i)	Description	Reasoning
created_since	1	120	Time since the project was created (in months)	An older project has a higher chance of being widely used or depended upon.
updated_since	-1	120	Time since the project was last updated (in months)	Unmaintained projects with no recent commits have a higher chance of being less relied upon.
contributor_count	2	5000	Count of project contributors (with commits)	Involvement of many different contributors indicates the project's importance.
org_count	1	10	Count of distinct organizations that contributors belong to	Indicates cross-organization dependency.
commit_frequency	1	1000	Average number of commits per week in the last year	Higher code churn is a slight indication of the project's importance. It also means higher susceptibility to vulnerabilities.
recent_releases_count	0.5	26	Number of releases in the last year	Frequent releases indicate user dependency. Lower weight since releases are not always used.
closed_issues_count	0.5	5000	Number of issues closed in the last 90 days	Indicates high contributor involvement and focus on closing user issues. Lower weight since it is dependent on project contributors.
updated_issues_count	0.5	5000	Number of issues updated in the last 90 days	Indicates high contributor involvement. Lower weight since it is dependent on project contributors.
comment_frequency	1	15	Average number of comments per issue in the last 90 days	Indicates high user activity and dependence.
dependents_count	2	500000	Number of project mentions in the commit messages	Indicates repository use, usually in version rolls. This parameter works across all languages, including C/C++, which lack package dependency graphs (though it is hack-ish). We plan to add package dependency trees in the near future.

NOTE:

We are looking for community ideas to improve upon these parameters.
There will always be exceptions to the individual reasoning rules.
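As a worked example, the sketch below computes the score as a weighted average of log-scaled signals, using the weights and thresholds from the table above and the Kubernetes figures from the Usage section. The function names are illustrative and are not the tool's API. The formula in `param_score` is an assumption about how the table is applied, checked only against that one published output.

```python
import math

# (weight alpha_i, max threshold T_i) per parameter, copied from the table above.
PARAMS = {
    "created_since": (1, 120),
    "updated_since": (-1, 120),
    "contributor_count": (2, 5000),
    "org_count": (1, 10),
    "commit_frequency": (1, 1000),
    "recent_releases_count": (0.5, 26),
    "closed_issues_count": (0.5, 5000),
    "updated_issues_count": (0.5, 5000),
    "comment_frequency": (1, 15),
    "dependents_count": (2, 500000),
}


def param_score(value, weight, threshold):
    """Assumed form: weight * log(1 + S) / log(1 + max(S, T))."""
    return weight * math.log(1 + value) / math.log(1 + max(value, threshold))


def criticality_score(signals):
    """Sum of the per-parameter scores divided by the sum of the weights."""
    total_weight = sum(weight for weight, _ in PARAMS.values())
    total = sum(
        param_score(signals[name], weight, threshold)
        for name, (weight, threshold) in PARAMS.items()
    )
    return total / total_weight


# Signal values taken from the Kubernetes example output in the Usage section.
kubernetes = {
    "created_since": 79,
    "updated_since": 0,
    "contributor_count": 3664,
    "org_count": 5,
    "commit_frequency": 102.7,
    "recent_releases_count": 76,
    "closed_issues_count": 2906,
    "updated_issues_count": 5136,
    "comment_frequency": 5.7,
    "dependents_count": 407254,
}

print(round(criticality_score(kubernetes), 4))
```

A signal at or above its threshold contributes its full weight, which is why recent_releases_count (76 against a threshold of 26) and updated_issues_count both count as 0.5 here.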

Usage

The program only requires one argument to run, the name of the repo:

$ pip3 install criticality-score

$ criticality_score --repo github.com/kubernetes/kubernetes
name: kubernetes
url: https://github.com/kubernetes/kubernetes
language: Go
created_since: 79
updated_since: 0
contributor_count: 3664
org_count: 5
commit_frequency: 102.7
recent_releases_count: 76
closed_issues_count: 2906
updated_issues_count: 5136
comment_frequency: 5.7
dependents_count: 407254
criticality_score: 0.9862

You can add your own parameters to the criticality score calculation. For example, you can add internal project usage data to re-adjust the project's criticality score for your prioritization needs. This can be done by adding the --params <param1_value>:<param1_weight>:<param1_max_threshold> ... argument on the command line.
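The following is a minimal sketch of how an extra parameter could be folded into an existing score. It assumes each parameter is passed as a colon-separated value:weight:max_threshold triple and is scored like the built-in parameters. `parse_param` and `rescore` are hypothetical helpers for illustration, not functions of the tool.

```python
import math


def parse_param(text):
    """Split 'value:weight:max_threshold' into three floats."""
    value, weight, threshold = (float(part) for part in text.split(":"))
    return value, weight, threshold


def rescore(base_score, base_weight, extra_params):
    """Fold extra parameters into an existing weighted-average score."""
    weighted_sum = base_score * base_weight
    total_weight = base_weight
    for text in extra_params:
        value, weight, threshold = parse_param(text)
        weighted_sum += weight * math.log(1 + value) / math.log(1 + max(value, threshold))
        total_weight += weight
    return weighted_sum / total_weight


# 8.5 is the sum of the ten default weights; "1000:2:1000" stands for a
# made-up internal usage count that already sits at its threshold.
print(rescore(0.9862, 8.5, ["1000:2:1000"]))
```

Because the result is a weighted average, a high-weight parameter at its threshold pulls the score toward 1, while a low value with the same weight pulls it down.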

Authentication

Before running criticality score, you need to:

For GitHub repos, you need to create a GitHub access token and set it in the environment variable GITHUB_AUTH_TOKEN. This helps to avoid GitHub's API rate limits for unauthenticated requests.

# For posix platforms, e.g. linux, mac:
export GITHUB_AUTH_TOKEN=<your access token>

# For windows:
set GITHUB_AUTH_TOKEN=<your access token>

For GitLab repos, you need to create a GitLab access token and set it in the environment variable GITLAB_AUTH_TOKEN. This helps to avoid GitLab's API limitations for unauthenticated users.

# For posix platforms, e.g. linux, mac:
export GITLAB_AUTH_TOKEN=<your access token>

# For windows:
set GITLAB_AUTH_TOKEN=<your access token>

Formatting Results

There are three formats currently: default, json, and csv. Others may be added in the future.

These may be specified with the --format flag.
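To illustrate the three formats, here is a sketch that renders one result in each of them. It assumes the default format is the "key: value" listing shown in the Usage example and that csv means a header row plus one data row. `render` is a hypothetical helper, not the tool's implementation.

```python
import csv
import io
import json


def render(result, fmt="default"):
    """Render a dict of stats as default, json, or csv text."""
    if fmt == "json":
        return json.dumps(result, indent=4)
    if fmt == "csv":
        buffer = io.StringIO()
        writer = csv.writer(buffer)
        writer.writerow(result.keys())    # header row
        writer.writerow(result.values())  # one data row
        return buffer.getvalue()
    if fmt == "default":
        return "\n".join(f"{key}: {value}" for key, value in result.items())
    raise ValueError(f"unknown format: {fmt}")


result = {"name": "kubernetes", "language": "Go", "criticality_score": 0.9862}
for fmt in ("default", "json", "csv"):
    print(render(result, fmt))
```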

Public Data

If you're only interested in seeing a list of critical projects with their criticality score, we publish them in csv format.

This data is available on Google Cloud Storage and can be downloaded via the gsutil command-line tool or the web browser here.

NOTE: Currently, these lists are derived from projects hosted on GitHub ONLY. We do plan to expand them in the near future to account for projects hosted on other source control systems.

$ gsutil ls gs://ossf-criticality-score/*.csv
gs://ossf-criticality-score/c_top_200.csv
gs://ossf-criticality-score/cplusplus_top_200.csv
gs://ossf-criticality-score/csharp_top_200.csv
gs://ossf-criticality-score/go_top_200.csv
gs://ossf-criticality-score/java_top_200.csv
gs://ossf-criticality-score/js_top_200.csv
gs://ossf-criticality-score/php_top_200.csv
gs://ossf-criticality-score/python_top_200.csv
gs://ossf-criticality-score/ruby_top_200.csv
gs://ossf-criticality-score/rust_top_200.csv
gs://ossf-criticality-score/shell_top_200.csv

This data is generated using this generator script. For example, to generate a list of top 200 C language projects, run:

$ pip3 install python-gitlab PyGithub
$ python3 -u -m criticality_score.generate \
    --language c --count 200 --sample-size 5000 --output-dir output

We have also aggregated the results for over 100K GitHub repositories (language-independent); they are available for download here.

Contributing

If you want to get involved or have ideas you'd like to chat about, we discuss this project in the Securing Critical Projects WG meetings.

See the Community Calendar for the schedule and meeting invitations.

See the Contributing documentation for guidance on how to contribute.

Owner

Open Source Security Foundation (OpenSSF)

https://github.com/ossf/criticality_score

Comments

GeoTools not showing in top 200 for java projects, run criticality score on larger sample set
I looked at the top 200 Java projects, out of curiosity, to see whether any of the projects I'm working on, like GeoTools, is included in the list. It was not, which is not an issue per se, but then I computed the criticality score from the command line and got this:

$ criticality_score --repo "https://github.com/geotools/geotools"
name: geotools
url: https://github.com/geotools/geotools
language: Java
created_since: 111
updated_since: 0
contributor_count: 315
org_count: 6
commit_frequency: 9.7
recent_releases_count: 16
closed_issues_count: 150
updated_issues_count: 161
comment_frequency: 1.0
dependents_count: 337
criticality_score: 0.66477

The score alone would place the project at around position 100 of the top 200 projects. Since it's a no-show, I'm wondering whether any other criteria are used to include/exclude projects besides the pure score.
Use project first commit date for created_since, instead of github project creation date

For many projects the github creation date might not match the project creation date.

Would it be better to look at the date of the oldest commit in the repository?

For example, for OpenSSL the computed created_since value is 95 months, measured from the creation date of the GitHub mirror (2013-01-15T22:34:48Z), but the project is almost 22 years old (the first commit in the master branch dates back to 1998-12-21T10:52:45+00:00)!

The cap for the field is 10 years anyway, so it's not that bad, but still it is one parameter in the equation that might be adjusted.

Edit: this also affects other fields (e.g. recent_releases) when they are estimated from the time since creation.

Thoughts?
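To put a number on "not that bad", the sketch below compares the created_since contribution for the two OpenSSL dates quoted above. It rests on several assumptions: months are counted as whole 30-day periods, the evaluation date is 2020-12-01 (picked so that the mirror date yields the quoted 95 months), and the contribution is log-scaled against the 120-month threshold with weight 1 out of a total weight of 8.5.

```python
import math
from datetime import date

CREATED_SINCE_THRESHOLD = 120  # months, from the parameter table
TOTAL_WEIGHT = 8.5             # sum of the ten weights in the parameter table


def months_since(start, as_of):
    """Whole 30-day months between two dates (assumed convention)."""
    return (as_of - start).days // 30


def created_since_score(months):
    """Log-scaled created_since contribution (weight 1)."""
    return math.log(1 + months) / math.log(1 + max(months, CREATED_SINCE_THRESHOLD))


as_of = date(2020, 12, 1)  # assumed evaluation date
mirror = months_since(date(2013, 1, 15), as_of)          # GitHub mirror creation
first_commit = months_since(date(1998, 12, 21), as_of)   # oldest commit on master

# Change in the overall score if the first-commit date were used instead.
gain = (created_since_score(first_commit) - created_since_score(mirror)) / TOTAL_WEIGHT
print(mirror, first_commit, round(gain, 4))
```

Under these assumptions the mirror date already earns about 95% of the maximum created_since contribution, so switching to the first-commit date would raise the overall score by well under 0.01.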
What is dependents_count parameter, looks suspect ?

I asked for the criticality info on several projects in my industry's ecosystem, and the dependents_count really confuses me and makes me suspicious about how it's computed. Some of the projects I checked are hard dependencies of others, so if transitive dependencies are being properly tracked, the former should always have higher dependents_count than the latter, no? But this is not the case.

One project that I run is very specialized and is of no use to casual small projects, only making sense as an embedded component of a large open source or commercial app. So while certainly very important in my industry and having a large number of end users touch those things in which it is embedded, I expect it to have a tiny number of directly downstream projects. Yet it has an absurdly, implausibly high dependents_count. Other projects I checked on that I know are directly used by orders of magnitude more projects, have implausibly low dependents_count.

Is there some kind of verbose mode that prints details that would give us more information about how these scores are computed? Like, more insight into why it thinks a project has few or many dependent projects?

I should mention that these are C++ projects, so perhaps the means by which dependencies are tracked is very flawed compared to, say, a Python project, which may have a requirements.txt. How is it computed for C++? Has anybody considered promoting a GitHub convention of having a particularly named file serve as a manifest for what other projects a code base is dependent on? (Informationally only, since no C++ build system cares about such things.)
Maven and Gradle not in the Top 2000 java list
Hi,

I just saw that the Maven and Gradle projects rank as less important than 2000 Java projects that use them as a build tool. Maybe this is due to the fact that they:

are not a declared dependency

https://github.com/ossf/criticality_score/issues/14

https://github.com/ossf/criticality_score/issues/23

external issue tracker

All the parts (pluggable, not a dependency!) are split into many repositories

Mostly downloaded via maven.org, sdkman, package systems, etc

Probably the same for other languages and build-tools, but haven’t checked.

Installation does not work as described in README

I get:

$ pip3 install criticality-score
Collecting criticality-score
  Could not find a version that satisfies the requirement criticality-score (from versions: )
No matching distribution found for criticality-score

Add Watchers/Description Metrics

I wanted to submit a suggestion to include GitHub Watchers (to help assess popularity) and the GitHub Description (to clarify the project's overall goal). I am currently helping contribute to OSSF's Security Metrics project, in which we are retrieving several of the GitHub metrics covered in this project (but also need to analyze the two mentioned above to help with our overall security assessment). If these can be included via the pull request I have submitted that would be extremely helpful. Thank you!
Handle empty repo case
While running the script, I bumped into these repos, which pass the filter because of their high number of stars but are actually empty, so the script throws an exception:

https://github.com/fossasia/libregraphics.asia
https://github.com/libredesktop/libredesktop-events
https://github.com/libredesktop/libredesktop-project-list
https://github.com/libredesktop/LibreDesktop-Specs
https://github.com/meilix/arch-meilix
https://github.com/meilix/deb-meilix
https://github.com/meilix/meilix-addons
https://github.com/meilix/meilix-art
https://github.com/meilix/meilix-connect
https://github.com/meilix/meilix-web
https://github.com/susiai/susi_partners
https://github.com/susiai/susi_sdk
https://github.com/ascoders/blog
https://github.com/bigdongdongCLUB/newGCP
https://github.com/koush/support-wiki
https://github.com/mariobehling/ai-packages
https://github.com/mariobehling/mb-sandbox
https://github.com/meilix/meilix-docs
https://github.com/paulirish/devtools-addons
https://github.com/QingDaoIT/BlackList
https://github.com/zhengzhouqiuzhi/zhengzhouqiuzhi

To handle it, for GitLab, checking the commits length was enough:

if len(repo.commits.list()) == 0:

For GitHub, I couldn't find any proper way to understand whether the repo is empty. When we call "get_commits().totalCount", it already throws an exception. What I did is to force it to throw the exception by assigning "totalCount" to an unused variable (I could do it by printing the value as well?). Not an ideal solution, so let me know what you think.

try:
    repo = get_github_auth_token().get_repo(repo_url)
    # Validate whether repo is empty; if it's empty, calling totalCount throws a 409 exception
    total_commits = repo.get_commits().totalCount
except github.GithubException as exp:
    if exp.status == 404 or exp.status == 409:
        return None
return GitHubRepository(repo)

Another remark is that we're spending one more request from our rate limit when calling "get_commits()" to make this validation. I only tested this for GitHub, but I'm assuming it's the same for GitLab as well.

Alternatively, we could make all these calls before initializing the repo, do the validations, and pass the results to the repo object as arguments. This would also help us reduce the number of calls to the API, but making these changes would take some time.

To be able to test my changes, I created empty repos on both GitHub & GitLab btw: https://github.com/coni2k/empty-repo https://gitlab.com/coni2k/empty-repo

Last, I also added this bit to "generate" script. Otherwise it fails when there are no processed repos:

if len(stats) == 0:
    return
Adds repolist command line parameter

The new --repolist parameter takes the name of a file containing a list of repositories to score.

usage: run.py [-h] (--repo REPO | --repolist REPOLIST | --local-file L_FILE)
              [--format {default,csv,json}] [--params PARAMS [PARAMS ...]]

Gives criticality score for an open source project or a list of projects.

optional arguments:
  -h, --help            show this help message and exit
  --repo REPO           repository url
  --repolist REPOLIST   listfile of repository urls
  --local-file L_FILE   path of a local csv file with repo stats
  --format {default,csv,json}
                        output format. allowed values are [default, csv, json]
  --params PARAMS [PARAMS ...]
                        Additional parameters in form <value>:<weight>:<max_threshold>

This at least partially addresses Issue #97

Signed-off-by: Arnaud J Le Hors

why apache/spark isn't in Java top 200 public data?

Spark has a much higher score than Elasticsearch and Beam, yet Spark is missing while Elasticsearch and Beam are listed. Why?

apache/spark:

$ criticality_score --repo github.com/apache/spark
name: spark
url: https://github.com/apache/spark
language: Scala
created_since: 83
updated_since: 0
contributor_count: 2374
org_count: 4
commit_frequency: 53.8
recent_releases_count: 20
closed_issues_count: 1252
updated_issues_count: 1456
comment_frequency: 12.1
dependents_count: 396346
criticality_score: 0.96476

elastic/elasticsearch:

$ criticality_score --repo github.com/elastic/elasticsearch
name: elasticsearch
url: https://github.com/elastic/elasticsearch
language: Java
created_since: 132
updated_since: 0
contributor_count: 1709
org_count: 3
commit_frequency: 127.1
recent_releases_count: 21
closed_issues_count: 7966
updated_issues_count: 9234
comment_frequency: 1.0
dependents_count: 95320
criticality_score: 0.88175

apache/beam:

$ criticality_score --repo github.com/apache/beam
name: beam
url: https://github.com/apache/beam
language: Java
created_since: 59
updated_since: 0
contributor_count: 980
org_count: 7
commit_frequency: 67.1
recent_releases_count: 7
closed_issues_count: 725
updated_issues_count: 826
comment_frequency: 4.3
dependents_count: 11397
criticality_score: 0.8319

How are the top 200 lists computed?

I am directly responsible for two open source projects. I was shocked to see that one is on your "top 200" list of C++ projects. The other project has been around longer, has more contributors, more PRs, surely has an order of magnitude more downstream users, and in fact has a much higher criticality score. But it's not on the list. I can't quite figure out what the top 200 would be measuring (I would think the 200 projects with the very highest criticality score itself? But apparently not?) for the first project to show up on the list but not the other.

Can you give any insight about WHAT is being ranked in your "top" lists?
Language implementation is less critical than language project generator, create list for TypeScript projects inside JS list.

tsdx, a TypeScript project generator, appears in the top 200 list for JavaScript packages; however, TypeScript itself does not. That seems somewhat counterintuitive.
Publish Docker Images to ghcr.io
Published the docker images to ghcr.io

Here is an example: https://github.com/nathannaveen?tab=packages&repo_name=criticality_score, https://github.com/nathannaveen/criticality_score/actions/runs/3810049085

After https://github.com/ossf/criticality_score/pull/293 gets merged in, I will include docker images for scorer.

Signed-off-by: nathannaveen
Included Docker File for Scorer
It is important to include a Dockerfile with a command-line interface (Scorer) project for the following reasons:

Reproducibility: A Dockerfile allows others to easily build the same environment in which the Scorer was developed and tested. This ensures that the Scorer will behave consistently across different systems.

Portability: With a Dockerfile, users can run the Scorer on any system that has Docker installed, regardless of the underlying operating system or dependencies.

Collaboration: A Dockerfile allows other developers to easily contribute to the Scorer project by providing a consistent and well-defined environment for development and testing.

In summary, including a Dockerfile with a CLI project makes it easier to use, test, and collaborate on the project, and ensures that the CLI will behave consistently across different systems.

Signed-off-by: nathannaveen
Included Build Targets for Binaries
Included Makefile targets for the binaries

Included the check in GitHub CI

Signed-off-by: nathannaveen

CLI seems to require more setup than shown in the documentation

After installing the CLI (from @main, see #288), I'm trying to run the example shown in the README. However, I'm getting an error:

$ criticality_score github.com/kubernetes/kubernetes
> 2022-12-21 11:40:20.631 INFO    Preparing default scorer
> 2022-12-21 11:40:20.639 ERROR   Failed to create collector      {"error": "init deps.dev source: bigquery: constructing client: google: could not find default credentials. See https://developers.google.com/accounts/docs/application-default-credentials for more information."}
> main.main
>       /Users/pnacht/go/pkg/mod/github.com/ossf/[email protected]/cmd/criticality_score/main.go:160
> runtime.main
>         /usr/local/go/src/runtime/proc.go:250

Seems to require additional credentials...

I've also found cmd/criticality_score/README.md, which says we need to log into GCP before using the CLI. Maybe that's what it needs.

I therefore ran gcloud auth login --update-adc (btw, the README has the command as gcloud login --update-adc, which isn't recognized) and repeated the criticality_score command:

$ criticality_score github.com/kubernetes/kubernetes
> 2022-12-21 14:47:20.748 INFO    Preparing default scorer
> 2022-12-21 14:47:20.750 ERROR   Failed to create collector      {"error": "init deps.dev source: unable to detect projectID, please refer to docs for DetectProjectID"}
> main.main
>        /Users/pnacht/go/pkg/mod/github.com/ossf/[email protected]/cmd/criticality_score/main.go:160
> runtime.main
>         /usr/local/go/src/runtime/proc.go:250

The error is different now, something about "DetectProjectID"? Looking through the criticality_score codebase, I only found one reference to it and honestly didn't know how to proceed from here.

What else is required to run the CLI as a standalone?

Can only install the standalone CLI from @main

The README suggests installing the CLI with

go install github.com/ossf/criticality_score/cmd/criticality_score

However, I get an error:

go: 'go install' requires a version when current directory is not in a module
Try 'go install github.com/ossf/criticality_score/cmd/criticality_score@latest' to install the latest version

However, when I then try @latest (and @v1.0.7, the latest release tag), I get another error:

go: github.com/ossf/criticality_score/cmd/criticality_score@latest: module github.com/ossf/criticality_score@latest found (v1.0.7), but does not contain package github.com/ossf/criticality_score/cmd/criticality_score

The only method I've found that works is using @main, but that's not an optimal solution since it requires that the main branch be in a usable state.
