|Authors||L. Moonen and L. Vidziunas|
|Title||CVEfixes Dataset v1.0.7: Automatically Collected Vulnerabilities and Their Fixes from Open-Source Software|
|Project(s)||Data-Driven Software Engineering Department|
|Year of Publication||2022|
|Keywords||dataset, security vulnerabilities, software repository mining, source code repair, vulnerability classification, vulnerability prediction|
CVEfixes is a comprehensive vulnerability dataset that is automatically collected and curated from Common Vulnerabilities and Exposures (CVE) records in the public U.S. National Vulnerability Database (NVD). The goal is to support data-driven security research based on source code and source code metrics related to fixes for CVEs in the NVD. To that end, the dataset provides detailed information at several interlinked levels of abstraction: the commit, file, and method levels, as well as the repository and CVE levels.
This release, v1.0.7, covers all CVEs published up to 27 August 2022. All open-source projects that were reported in CVE records in the NVD in this time frame _and_ had publicly available git repositories were fetched and considered for the construction of this vulnerability dataset. The dataset is organized as a relational database and covers 7798 vulnerability-fixing commits in 2487 open-source projects, for a total of 7637 CVEs in 209 different Common Weakness Enumeration (CWE) types. It includes the source code of 29309 files and 98250 functions, both before and after the change.
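To illustrate how interlinked levels in a relational layout can be queried together, here is a minimal sketch in Python using an in-memory SQLite database. The table and column names (`cve`, `fix`, `file_change`, `method_change`) and all rows are hypothetical placeholders chosen for this example; they are not taken from the dataset's documented schema, which should be consulted before querying the real data.

```python
import sqlite3

# Hypothetical, simplified schema. Table and column names are illustrative
# assumptions, not the dataset's documented schema.
SCHEMA = """
CREATE TABLE cve (cve_id TEXT PRIMARY KEY, cwe_id TEXT);
CREATE TABLE fix (cve_id TEXT, commit_hash TEXT);
CREATE TABLE file_change (
    file_change_id INTEGER PRIMARY KEY, commit_hash TEXT, filename TEXT
);
CREATE TABLE method_change (file_change_id INTEGER, name TEXT);
"""


def build_toy_db():
    """Create an in-memory database filled with synthetic placeholder rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO cve VALUES (?, ?)",
        [("CVE-0000-0001", "CWE-79"), ("CVE-0000-0002", "CWE-89")],
    )
    conn.executemany(
        "INSERT INTO fix VALUES (?, ?)",
        [("CVE-0000-0001", "aaa"), ("CVE-0000-0002", "bbb")],
    )
    conn.executemany(
        "INSERT INTO file_change VALUES (?, ?, ?)",
        [(1, "aaa", "view.py"), (2, "bbb", "db.py"), (3, "bbb", "util.py")],
    )
    conn.executemany(
        "INSERT INTO method_change VALUES (?, ?)",
        [(1, "render"), (2, "query"), (2, "connect"), (3, "escape")],
    )
    return conn


def methods_per_cve(conn):
    """Count changed methods per CVE by joining CVE -> commit -> file -> method."""
    query = """
        SELECT c.cve_id, COUNT(m.name)
        FROM cve AS c
        JOIN fix AS f ON f.cve_id = c.cve_id
        JOIN file_change AS fc ON fc.commit_hash = f.commit_hash
        JOIN method_change AS m ON m.file_change_id = fc.file_change_id
        GROUP BY c.cve_id
        ORDER BY c.cve_id
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    for cve_id, n_methods in methods_per_cve(build_toy_db()):
        print(cve_id, n_methods)
```

The same join pattern would carry over to the real database once the placeholder names are replaced with those from the actual schema.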
The repository includes the SQL dump of the dataset, as well as the JSON for the CVEs and the XML of the CWEs at the time of collection. The complete process is documented in the paper "CVEfixes: Automated Collection of Vulnerabilities and Their Fixes from Open-Source Software", published in the Proceedings of the 17th International Conference on Predictive Models and Data Analytics in Software Engineering (PROMISE '21). A copy of the paper is available in the Doc folder.
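As a rough sketch of how a SQL dump might be restored for local querying, the Python helper below replays a dump file into a SQLite database. It assumes the dump is SQLite-compatible SQL and possibly gzip-compressed; the function name `load_sql_dump` and any file paths passed to it are hypothetical and not taken from the repository.

```python
import gzip
import sqlite3


def load_sql_dump(dump_path, db_path):
    """Replay a plain or gzip-compressed SQL dump into a SQLite database.

    Assumes the dump contains SQLite-compatible statements. The whole file
    is read into memory, which may be impractical for very large dumps.
    """
    opener = gzip.open if str(dump_path).endswith(".gz") else open
    with opener(dump_path, "rt", encoding="utf-8") as handle:
        script = handle.read()
    conn = sqlite3.connect(db_path)
    conn.executescript(script)  # runs every statement in the dump
    conn.commit()
    return conn
```

For a large dump, piping the file into the `sqlite3` command-line shell is likely a better fit than reading it into memory first.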