-
Notifications
You must be signed in to change notification settings - Fork 201
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add PyPI OSV #632
Add PyPI OSV #632
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ziadhany Thank you ++ do you mind to add a few tests?
Also may be rename this module to pysec.py?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW you may want to amend your commit message in 2fb225d
"Add PyPL OSV" could be something like:
Add PyPI OSV importer
Reference: https://github.com/nexB/vulnerablecode/issues/607
Signed-off-by: Ziad <[email protected]>
and you could squash your commits too
3598d65
to
cf3d432
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank You for your contribution, please check for items before accessing them, if they are not there log it, please check alpine importer for reference. I have added some comments for your consideration.
@ziadhany Please rebase your branch as well :) |
006b85b
to
6a82e35
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for researching on OSV! There are a few points I'd like you to consider
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you ++ for the updates!
I added some comments for your kind consideration.
vulnerabilities/importers/pysec.py
Outdated
f"fixed_version InvalidVersion - PypiVersion - {raw_data['id'] !r}" | ||
) | ||
|
||
if fix_range["type"] == "SEMVER": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How would it be possible for a PyPI range to use semver?
Also this code is too complicated for me to understand as is the same code above... can you find a way to decompose this in something simpler? Also it is duplicated almost exactly with the code above... so creating a function would make sense to avoid duplicated code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ex: GHSA-2ghc-6v89-pw9j.json
https://osv.dev/vulnerability/GHSA-2ghc-6v89-pw9j
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ziadhany Thanks ... so in https://osv.dev/vulnerability/GHSA-2ghc-6v89-pw9j I see three different affected packages:
-
npm/body-parser-xml
which would become the purlpkg:npm/body-parser-xml
and is reported as SEMVER versioning which would become thenpm
vers scheme on our side. This is found at https://www.npmjs.com/package/body-parser-xml -
NuGet/body-parser-xml-nuget
which would become the purlpkg:nuget/body-parser-xml-nuget
and is reported as ECOSYSTEM versioning and would become thenuget
vers scheme on our side... BUT this does NOT exist as a nuget package at all. https://www.nuget.org/packages?q=body-parser-xml ... not even close -
PyPI/body-parser-xml-pip
which would become the purlpkg:nuget/body-parser-xml-pip
and is reported as ECOSYSTEM versioning and would become thepypi
vers scheme on our side. BUT this does NOT exist as a pypi package at all: https://pypi.org/search/?q=body-parser-xml
So what to make of this? OSV and GitHub data are likely not really reviewed after all. That's not the first time I see junk ghost records like this that are supposedly reviewed.
In anycase this are not a Pypi package... but it is going to be hard to figure this out ... at least not at import time.
I note that this is not seen at all in https://github.com/pypa/advisory-database
So practically:
- I entered Add improver to validate that collected purl are real. #644 to add improvers to validate purls
- short term we could flag these with lower confidence but I am not sure we have enough at import time
- so we can import these ... each should use their own specific vers scheme though and PyPI is NOT semver alright
There is not much harm done to import non-existing purl BTW because these would never be found in the wild.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Ziad, for your changes, I have added some comments for your consideration
@ziadhany there is a field called "database_specific" in test data, do see if we can derive some severities from there |
bf016a2
to
19d993a
Compare
df196c6
to
8766874
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking great!
I am nitpicking on a few extra things because I am picky!, but the code overall looks nice as it is.
Some general remarks:
- create variables early rather than looking up a dict over and over
- a small docstring to explain what a function does is not mandatory but useful for no-trivial, non-obvious functions.
vulnerabilities/tests/test_pysec.py
Outdated
} | ||
) == [ | ||
VulnerabilitySeverity( | ||
system=SCORING_SYSTEMS["cvssv3_vector"], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you sure this may not be CVSS:3.1 and cvssv3.1_vector
instead? https://github.com/nexB/vulnerablecode/blob/47f6ae62d919fff5094b960c1f56bc0c6b8b4be3/vulnerabilities/severity_systems.py#L83
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I'm not sure.
but have a look at osv-schema:
https://ossf.github.io/osv-schema/#severitytype-field
Severity Type: CVSS_V3
Score Description : A CVSS vector string ...
ex GHSA-xc3p-ff3m-f46v.json:
"score": "CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:N/A:N"
vulnerabilities/importers/pysec.py
Outdated
return [] | ||
|
||
|
||
def get_aff_purl(affected_pkg, raw_id): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about this? Spelling out whole words is fine
def get_aff_purl(affected_pkg, raw_id): | |
def get_affected_purl(affected_pkg, raw_id): | |
""" | |
Return an affected PackageURL given an ``affected_pkg`` mapping and a ``raw_id`` | |
""" |
BTW, why do you need a raw_id there?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just for logs (to track the root cause of errors and fix them easily)
vulnerabilities/importers/pysec.py
Outdated
logger.error(f"purl affected_pkg not found - {raw_id !r}") | ||
|
||
|
||
def get_aff_version_range(affected_pkg, raw_id): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as above...
def get_aff_version_range(affected_pkg, raw_id): | |
def get_affected_version_range(affected_pkg, raw_id): |
vulnerabilities/importers/pysec.py
Outdated
|
||
|
||
def get_aff_version_range(affected_pkg, raw_id): | ||
if "versions" in affected_pkg: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about something like this instead?
affected_versions = affected_pkg.get("versions")
if affected_versions:
try:
return PypiVersionRange(affected_versions)
vulnerabilities/importers/pysec.py
Outdated
if "versions" in affected_pkg: | ||
try: | ||
return PypiVersionRange(affected_pkg["versions"]) | ||
except Exception as e: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What could be the exceptions raised here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think InvalidVersionRange
vulnerabilities/importers/pysec.py
Outdated
if "type" in fixed_range: | ||
list_fixed = fixed_filter(fixed_range) | ||
for i in list_fixed: | ||
if fixed_range["type"] == "ECOSYSTEM": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Always prefer to create a named variable in such case:
if fixed_range["type"] == "ECOSYSTEM": | |
fixed_range_type = fixed_range["type"] | |
if fixed_range_type == "ECOSYSTEM": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See a few final nit pickings for your kind consideration.
As a general note, think about creating a variable early and then using this rather than doing multiple lookups in a mapping: this is more readable and more robust.
vulnerabilities/helpers.py
Outdated
|
||
def dedupe(original: List) -> List: | ||
""" | ||
Remove all duplicate items and return a new list |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove all duplicate items and return a new list | |
Remove all duplicate items and return a new list preserving ordering |
vulnerabilities/importers/pysec.py
Outdated
if "affected" in raw_data: | ||
for affected_pkg in raw_data["affected"]: | ||
purl = get_affected_purl(affected_pkg, raw_data["id"]) | ||
if purl.type == "pypi": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if purl.type == "pypi": | |
if purl.type != "pypi": | |
logger.error(f"Non PyPI package found in PYSEC advisories: {purl} - from: {raw_data!r}") | |
else: |
vulnerabilities/importers/pysec.py
Outdated
for affected_pkg in raw_data["affected"]: | ||
purl = get_affected_purl(affected_pkg, raw_data["id"]) | ||
if purl.type == "pypi": | ||
affected_version_range = get_affected_version_range(affected_pkg, raw_data["id"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Try to give a name early to avoid multiple lookups and carrying around less common names:
affected_version_range = get_affected_version_range(affected_pkg, raw_data["id"]) | |
vulnid = raw_data["id"] | |
affected_version_range = get_affected_version_range(affected_pkg, vulnid) |
vulnerabilities/importers/pysec.py
Outdated
if purl.type == "pypi": | ||
affected_version_range = get_affected_version_range(affected_pkg, raw_data["id"]) | ||
for fixed_range in affected_pkg.get("ranges", []): | ||
fixed_version = get_fixed_version(fixed_range, raw_data["id"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed_version = get_fixed_version(fixed_range, raw_data["id"]) | |
fixed_version = get_fixed_version(fixed_range, vulnid) |
vulnerabilities/importers/pysec.py
Outdated
|
||
|
||
def get_references(raw_data, severity) -> []: | ||
if "references" in raw_data: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You replace the whole function body by:
if "references" in raw_data: | |
references = raw_data.get("references") or [] | |
return [Reference(url=ref["url"], severities=severity) for ref in references if ref] |
Also what is severity
here? a list since it afterward assigned to severities
?
how does it apply to the whole list of references and not just to a single one?
Do you have an example of data?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reference field (url) and a severity field are separated from each other in osv-schema. I think we can make use of ADVISORY
type field in reference and assign the severities to it but What if there is two ADVISORY
?
{
"database_specific": {
"severity": "MODERATE",
"github_reviewed": true,
"cwe_ids": []
},
"references": [
{
"type": "ADVISORY",
"url": "https://nvd.nist.gov/vuln/detail/CVE-2011-1950"
},
{
"type": "WEB",
"url": "https://exchange.xforce.ibmcloud.com/vulnerabilities/67695"
},
{
"type": "ADVISORY",
"url": "https://github.com/advisories/GHSA-2qx8-589j-gcpx"
},
...
]
}
{
"database_specific": {
"severity": "CRITICAL",
}
"severity": [
{
"type": "CVSS_V3",
"score": "CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H"
}],
"references": [
{
"type": "ADVISORY",
"url": "https://nvd.nist.gov/vuln/detail/CVE-2022-28346"
},
{
"type": "WEB",
"url": "https://github.com/django/django/commit/93cae5cb2f9a4ef1514cf1a41f714fef08005200"
},
{
"type": "WEB",
"url": "https://lists.debian.org/debian-lts-announce/2022/04/msg00013.html"
},
{
"type": "PACKAGE",
"url": "https://github.com/django/django"
}
]
]
...}
Any suggestions?
vulnerabilities/importers/pysec.py
Outdated
except ValueError: | ||
logger.error(f"PackageURL ValueError - {raw_id !r} - purl: {package['purl'] !r}") | ||
|
||
if "ecosystem" in package and "name" in package: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same nit as above: you are doing two lookups in package: here and below on line 184.
You should collect ecosystem and name in variables once instead
vulnerabilities/importers/pysec.py
Outdated
if "type" in fixed_range: | ||
list_fixed = fixed_filter(fixed_range) | ||
for i in list_fixed: | ||
fixed_range_type = fixed_range["type"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be outside of the loop rather?
vulnerabilities/importers/pysec.py
Outdated
if fixed_range_type == "GIT": | ||
# TODO add GitHubVersion univers fix_version | ||
logger.error(f"NotImplementedError GIT Version - {raw_id !r} - {i !r}") | ||
else: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should put error checks first at the top and invert this if/else block and if there no "type" you should return right way.
This will make the code simpler and remove one indentation level
vulnerabilities/importers/pysec.py
Outdated
>>> fixed_filter({"type": "ECOSYSTEM","events":[{"introduced": "0"},{"fixed": "1.0.0"},{"fixed": "9.0.0"}]}) | ||
['1.0.0', '9.0.0'] | ||
""" | ||
filter_fixed = list(filter(lambda x: x.keys() == {"fixed"}, fixed_range["events"])) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about this instead? It may be easier to read than a filter with a lambda?
for event in fixed_range.get("events") or []:
fixed = event.get("fixed")
if fixed:
yield fixed
vulnerabilities/importers/pysec.py
Outdated
if "published" in raw_data: | ||
return parse(raw_data["published"]) | ||
else: | ||
logger.warning(f"date_published not found {raw_data['id'] !r}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure the date is important enough to warrant logging
vulnerabilities/importers/pysec.py
Outdated
def get_severity(raw_data) -> []: | ||
severities = [] | ||
if "severity" in raw_data: | ||
for sever_list in raw_data["severity"]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for sever_list in raw_data["severity"]: | |
for sever_list in raw_data.get("severity") or []: |
vulnerabilities/importers/pysec.py
Outdated
logger.warning(f"date_published not found {raw_data['id'] !r}") | ||
|
||
|
||
def get_severity(raw_data) -> []: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
get_severity --> get_severities since there are more than one
vulnerabilities/importers/pysec.py
Outdated
|
||
|
||
def get_severity(raw_data) -> []: | ||
severities = [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMHO you should rather yield as you go instead of accumulating in a list. The code will be leaner then
63ca14e
to
a197a84
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the updates. A very last round few nitpickings for your consideration!
vulnerabilities/importers/pysec.py
Outdated
|
||
|
||
def get_severities(raw_data) -> []: | ||
if "severity" in raw_data: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is no longer needed since the code below (for sever_list in raw_data.get("severity") or []:
handles all the possible cases:
if "severity" in raw_data: |
vulnerabilities/importers/pysec.py
Outdated
|
||
affected_packages = [] | ||
if "affected" not in raw_data: | ||
logger.error(f"affected_packages not found - {raw_data['id'] !r}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is always best to return early if there is nothing left to do.
This way you can de-indent the code that comes after
logger.error(f"affected_packages not found - {raw_data['id'] !r}") | |
logger.error(f"affected_packages not found - {raw_data['id'] !r}") | |
return |
vulnerabilities/importers/pysec.py
Outdated
if "affected" not in raw_data: | ||
logger.error(f"affected_packages not found - {raw_data['id'] !r}") | ||
|
||
if "affected" in raw_data: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Following from above, this becomes not needed:
if "affected" in raw_data: | |
if "affected" in raw_data: |
vulnerabilities/importers/pysec.py
Outdated
logger.error(f"affected_packages not found - {raw_data['id'] !r}") | ||
|
||
if "affected" in raw_data: | ||
for affected_pkg in raw_data["affected"]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
affected may not be a list, so to be safe, I would use this:
for affected_pkg in raw_data["affected"]: | |
for affected_pkg in raw_data.get("affected") or []: |
vulnerabilities/importers/pysec.py
Outdated
|
||
if "affected" in raw_data: | ||
for affected_pkg in raw_data["affected"]: | ||
purl = get_affected_purl(affected_pkg, raw_data["id"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use a raw_id variable defined above
vulnerabilities/importers/pysec.py
Outdated
logger.error( | ||
f"Non PyPI package found in PYSEC advisories: {purl} - from: {raw_data['id']!r}" | ||
) | ||
else: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Return early and de-indent:
else: | |
return |
vulnerabilities/importers/pysec.py
Outdated
f"Non PyPI package found in PYSEC advisories: {purl} - from: {raw_data['id']!r}" | ||
) | ||
else: | ||
vulnid = raw_data["id"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use raw_id variable
Reference: aboutcode-org#607 Signed-off-by: Ziad <[email protected]> add the necessary changes Signed-off-by: Ziad <[email protected]> remove aliases de-duplicate Signed-off-by: Ziad <[email protected]> reslove conflicts Signed-off-by: Ziad <[email protected]> add dateparser to setup.cfg Signed-off-by: Ziad <[email protected]> Resolving conflicts Signed-off-by: Ziad <[email protected]> resolve conflicts Signed-off-by: Ziad <[email protected]> add a require changes Signed-off-by: Ziad <[email protected]> add a require changes Signed-off-by: Ziad <[email protected]> add a require changes Signed-off-by: Ziad <[email protected]> add a require changes Signed-off-by: Ziad <[email protected]> add a require changes Signed-off-by: Ziad <[email protected]> add a require changes Signed-off-by: Ziad <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Merging! Thank you ++ for your peseverance!r
Signed-off-by: Philippe Ombredanne <[email protected]>
@@ -267,3 +267,16 @@ def _get_gh_response(gh_token, graphql_query): | |||
endpoint = "https://api.github.com/graphql" | |||
headers = {"Authorization": f"bearer {gh_token}"} | |||
return requests.post(endpoint, headers=headers, json=graphql_query).json() | |||
|
|||
|
|||
def dedupe(original: List) -> List: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could also have implemented this using a OrderedDict
from collections
.
>>> import collections
>>> collections.OrderedDict.fromkeys(["z","i","a","a","d","d"]).keys()
odict_keys(['z', 'i', 'a', 'd'])
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point! actually on Python 3.6+, this could be even shorter: dict.fromkeys(["z","i","a","a","d","d"]).keys()
#607
Signed-off-by: Ziad [email protected]