Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add PyPI OSV #632

Merged
merged 2 commits into from
May 15, 2022
Merged

Add PyPI OSV #632

merged 2 commits into from
May 15, 2022

Conversation

ziadhany
Copy link
Collaborator

#607
Signed-off-by: Ziad [email protected]

@pombredanne pombredanne changed the title Add PyPL OSV Add PyPY OSV Feb 21, 2022
@pombredanne pombredanne changed the title Add PyPY OSV Add PyPI OSV Feb 21, 2022
Copy link
Member

@pombredanne pombredanne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ziadhany Thank you ++ do you mind to add a few tests?
Also may be rename this module to pysec.py?

Copy link
Member

@pombredanne pombredanne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW you may want to amend your commit message in 2fb225d
"Add PyPL OSV" could be something like:

Add PyPI OSV importer

Reference: https://github.com/nexB/vulnerablecode/issues/607
Signed-off-by: Ziad <[email protected]>

and you could squash your commits too

@ziadhany ziadhany force-pushed the osv_import branch 3 times, most recently from 3598d65 to cf3d432 Compare March 4, 2022 03:42
Copy link
Contributor

@TG1999 TG1999 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank You for your contribution, please check for items before accessing them, if they are not there log it, please check alpine importer for reference. I have added some comments for your consideration.

vulnerabilities/importers/pysec.py Outdated Show resolved Hide resolved
vulnerabilities/importers/pysec.py Outdated Show resolved Hide resolved
vulnerabilities/importers/pysec.py Outdated Show resolved Hide resolved
vulnerabilities/importers/pysec.py Outdated Show resolved Hide resolved
@TG1999
Copy link
Contributor

TG1999 commented Mar 4, 2022

@ziadhany Please rebase your branch as well :)

@ziadhany ziadhany force-pushed the osv_import branch 4 times, most recently from 006b85b to 6a82e35 Compare March 6, 2022 09:12
Copy link
Collaborator

@Hritik14 Hritik14 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for researching on OSV! There are a few points I'd like you to consider

vulnerabilities/importers/pysec.py Outdated Show resolved Hide resolved
vulnerabilities/importers/pysec.py Outdated Show resolved Hide resolved
vulnerabilities/importers/pysec.py Outdated Show resolved Hide resolved
vulnerabilities/importers/pysec.py Outdated Show resolved Hide resolved
vulnerabilities/importers/pysec.py Outdated Show resolved Hide resolved
vulnerabilities/tests/test_data/pysec/pysec_test.json Outdated Show resolved Hide resolved
vulnerabilities/tests/test_pysec.py Outdated Show resolved Hide resolved
vulnerabilities/tests/test_pysec.py Outdated Show resolved Hide resolved
vulnerabilities/tests/test_pysec.py Outdated Show resolved Hide resolved
vulnerabilities/tests/test_pysec.py Outdated Show resolved Hide resolved
Copy link
Member

@pombredanne pombredanne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you ++ for the updates!
I added some comments for your kind consideration.

vulnerabilities/importers/pysec.py Outdated Show resolved Hide resolved
vulnerabilities/importers/pysec.py Outdated Show resolved Hide resolved
vulnerabilities/importers/pysec.py Outdated Show resolved Hide resolved
vulnerabilities/importers/pysec.py Outdated Show resolved Hide resolved
vulnerabilities/importers/pysec.py Outdated Show resolved Hide resolved
f"fixed_version InvalidVersion - PypiVersion - {raw_data['id'] !r}"
)

if fix_range["type"] == "SEMVER":
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would it be possible for a PyPI range to use semver?
Also this code is too complicated for me to understand as is the same code above... can you find a way to decompose this in something simpler? Also it is duplicated almost exactly with the code above... so creating a function would make sense to avoid duplicated code.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ziadhany Thanks ... so in https://osv.dev/vulnerability/GHSA-2ghc-6v89-pw9j I see three different affected packages:

  • npm/body-parser-xml which would become the purl pkg:npm/body-parser-xml and is reported as SEMVER versioning which would become the npm vers scheme on our side. This is found at https://www.npmjs.com/package/body-parser-xml

  • NuGet/body-parser-xml-nuget which would become the purl pkg:nuget/body-parser-xml-nuget and is reported as ECOSYSTEM versioning and would become the nuget vers scheme on our side... BUT this does NOT exist as a nuget package at all. https://www.nuget.org/packages?q=body-parser-xml ... not even close

  • PyPI/body-parser-xml-pip which would become the purl pkg:nuget/body-parser-xml-pip and is reported as ECOSYSTEM versioning and would become the pypi vers scheme on our side. BUT this does NOT exist as a pypi package at all: https://pypi.org/search/?q=body-parser-xml

So what to make of this? OSV and GitHub data are likely not really reviewed after all. That's not the first time I see junk ghost records like this that are supposedly reviewed.

In anycase this are not a Pypi package... but it is going to be hard to figure this out ... at least not at import time.
I note that this is not seen at all in https://github.com/pypa/advisory-database
So practically:

  • I entered Add improver to validate that collected purl are real. #644 to add improvers to validate purls
  • short term we could flag these with lower confidence but I am not sure we have enough at import time
  • so we can import these ... each should use their own specific vers scheme though and PyPI is NOT semver alright

There is not much harm done to import non-existing purl BTW because these would never be found in the wild.

vulnerabilities/importers/pysec.py Outdated Show resolved Hide resolved
vulnerabilities/tests/test_pysec.py Outdated Show resolved Hide resolved
vulnerabilities/tests/test_pysec.py Outdated Show resolved Hide resolved
vulnerabilities/tests/test_pysec.py Outdated Show resolved Hide resolved
Copy link
Contributor

@TG1999 TG1999 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Ziad, for your changes, I have added some comments for your consideration

vulnerabilities/importers/pysec.py Outdated Show resolved Hide resolved
vulnerabilities/importers/pysec.py Outdated Show resolved Hide resolved
vulnerabilities/importers/pysec.py Outdated Show resolved Hide resolved
vulnerabilities/importers/pysec.py Outdated Show resolved Hide resolved
vulnerabilities/importers/pysec.py Outdated Show resolved Hide resolved
vulnerabilities/importers/pysec.py Show resolved Hide resolved
vulnerabilities/importers/pysec.py Outdated Show resolved Hide resolved
vulnerabilities/tests/test_pysec.py Outdated Show resolved Hide resolved
@TG1999
Copy link
Contributor

TG1999 commented Mar 17, 2022

@ziadhany there is a field called "database_specific" in test data, do see if we can derive some severities from there

@ziadhany
Copy link
Collaborator Author

ziadhany commented Mar 20, 2022

@ziadhany there is a field called "database_specific" in test data, do see if we can derive some severities from there

@TG1999 Do we support CWE Scoring System ?

@ziadhany ziadhany force-pushed the osv_import branch 2 times, most recently from df196c6 to 8766874 Compare April 17, 2022 12:09
@ziadhany ziadhany requested a review from Hritik14 April 17, 2022 12:12
Copy link
Member

@pombredanne pombredanne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking great!
I am nitpicking on a few extra things because I am picky!, but the code overall looks nice as it is.

Some general remarks:

  • create variables early rather than looking up a dict over and over
  • a small docstring to explain what a function does is not mandatory but useful for no-trivial, non-obvious functions.

}
) == [
VulnerabilitySeverity(
system=SCORING_SYSTEMS["cvssv3_vector"],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I'm not sure.
but have a look at osv-schema:
https://ossf.github.io/osv-schema/#severitytype-field
Severity Type: CVSS_V3
Score Description : A CVSS vector string ...

ex GHSA-xc3p-ff3m-f46v.json:
"score": "CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:N/A:N"

vulnerabilities/tests/test_pysec.py Show resolved Hide resolved
return []


def get_aff_purl(affected_pkg, raw_id):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about this? Spelling out whole words is fine

Suggested change
def get_aff_purl(affected_pkg, raw_id):
def get_affected_purl(affected_pkg, raw_id):
"""
Return an affected PackageURL given an ``affected_pkg`` mapping and a ``raw_id``
"""

BTW, why do you need a raw_id there?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just for logs (to track the root cause of errors and fix them easily)

logger.error(f"purl affected_pkg not found - {raw_id !r}")


def get_aff_version_range(affected_pkg, raw_id):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as above...

Suggested change
def get_aff_version_range(affected_pkg, raw_id):
def get_affected_version_range(affected_pkg, raw_id):



def get_aff_version_range(affected_pkg, raw_id):
if "versions" in affected_pkg:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about something like this instead?

    affected_versions = affected_pkg.get("versions")
    if affected_versions:
        try:
            return PypiVersionRange(affected_versions)

if "versions" in affected_pkg:
try:
return PypiVersionRange(affected_pkg["versions"])
except Exception as e:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What could be the exceptions raised here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think InvalidVersionRange

if "type" in fixed_range:
list_fixed = fixed_filter(fixed_range)
for i in list_fixed:
if fixed_range["type"] == "ECOSYSTEM":
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Always prefer to create a named variable in such case:

Suggested change
if fixed_range["type"] == "ECOSYSTEM":
fixed_range_type = fixed_range["type"]
if fixed_range_type == "ECOSYSTEM":

vulnerabilities/importers/pysec.py Show resolved Hide resolved
Copy link
Member

@pombredanne pombredanne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See a few final nit pickings for your kind consideration.
As a general note, think about creating a variable early and then using this rather than doing multiple lookups in a mapping: this is more readable and more robust.


def dedupe(original: List) -> List:
"""
Remove all duplicate items and return a new list
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Remove all duplicate items and return a new list
Remove all duplicate items and return a new list preserving ordering

if "affected" in raw_data:
for affected_pkg in raw_data["affected"]:
purl = get_affected_purl(affected_pkg, raw_data["id"])
if purl.type == "pypi":
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if purl.type == "pypi":
if purl.type != "pypi":
logger.error(f"Non PyPI package found in PYSEC advisories: {purl} - from: {raw_data!r}")
else:

for affected_pkg in raw_data["affected"]:
purl = get_affected_purl(affected_pkg, raw_data["id"])
if purl.type == "pypi":
affected_version_range = get_affected_version_range(affected_pkg, raw_data["id"])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Try to give a name early to avoid multiple lookups and carrying around less common names:

Suggested change
affected_version_range = get_affected_version_range(affected_pkg, raw_data["id"])
vulnid = raw_data["id"]
affected_version_range = get_affected_version_range(affected_pkg, vulnid)

if purl.type == "pypi":
affected_version_range = get_affected_version_range(affected_pkg, raw_data["id"])
for fixed_range in affected_pkg.get("ranges", []):
fixed_version = get_fixed_version(fixed_range, raw_data["id"])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
fixed_version = get_fixed_version(fixed_range, raw_data["id"])
fixed_version = get_fixed_version(fixed_range, vulnid)

vulnerabilities/importers/pysec.py Show resolved Hide resolved


def get_references(raw_data, severity) -> []:
if "references" in raw_data:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You replace the whole function body by:

Suggested change
if "references" in raw_data:
references = raw_data.get("references") or []
return [Reference(url=ref["url"], severities=severity) for ref in references if ref]

Also what is severity here? a list since it afterward assigned to severities?
how does it apply to the whole list of references and not just to a single one?

Do you have an example of data?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reference field (url) and a severity field are separated from each other in osv-schema. I think we can make use of ADVISORY type field in reference and assign the severities to it but What if there is two ADVISORY ?

ex: GHSA-2qx8-589j-gcpx

{
  "database_specific": {
    "severity": "MODERATE",
    "github_reviewed": true,
    "cwe_ids": []
  },
"references": [
    {
      "type": "ADVISORY",
      "url": "https://nvd.nist.gov/vuln/detail/CVE-2011-1950"
    },
    {
      "type": "WEB",
      "url": "https://exchange.xforce.ibmcloud.com/vulnerabilities/67695"
    },
    {
      "type": "ADVISORY",
      "url": "https://github.com/advisories/GHSA-2qx8-589j-gcpx"
    },
...
  ]

}

ex: GHSA-2gwj-7jmv-h26r

{
  "database_specific": {
       "severity": "CRITICAL",
  }
  "severity": [
    {
      "type": "CVSS_V3",
      "score": "CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H"
    }],
"references": [
    {
      "type": "ADVISORY",
      "url": "https://nvd.nist.gov/vuln/detail/CVE-2022-28346"
    },
    {
      "type": "WEB",
      "url": "https://github.com/django/django/commit/93cae5cb2f9a4ef1514cf1a41f714fef08005200"
    },
    {
      "type": "WEB",
      "url": "https://lists.debian.org/debian-lts-announce/2022/04/msg00013.html"
    },
    {
      "type": "PACKAGE",
      "url": "https://github.com/django/django"
    }
  ]
  ]
...}

Any suggestions?

except ValueError:
logger.error(f"PackageURL ValueError - {raw_id !r} - purl: {package['purl'] !r}")

if "ecosystem" in package and "name" in package:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same nit as above: you are doing two lookups in package: here and below on line 184.
You should collect ecosystem and name in variables once instead

vulnerabilities/importers/pysec.py Show resolved Hide resolved
if "type" in fixed_range:
list_fixed = fixed_filter(fixed_range)
for i in list_fixed:
fixed_range_type = fixed_range["type"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be outside of the loop rather?

if fixed_range_type == "GIT":
# TODO add GitHubVersion univers fix_version
logger.error(f"NotImplementedError GIT Version - {raw_id !r} - {i !r}")
else:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should put error checks first at the top and invert this if/else block and if there no "type" you should return right way.
This will make the code simpler and remove one indentation level

>>> fixed_filter({"type": "ECOSYSTEM","events":[{"introduced": "0"},{"fixed": "1.0.0"},{"fixed": "9.0.0"}]})
['1.0.0', '9.0.0']
"""
filter_fixed = list(filter(lambda x: x.keys() == {"fixed"}, fixed_range["events"]))
Copy link
Member

@pombredanne pombredanne May 5, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about this instead? It may be easier to read than a filter with a lambda?

    for event in fixed_range.get("events") or []:
        fixed = event.get("fixed")
        if fixed:
            yield fixed

if "published" in raw_data:
return parse(raw_data["published"])
else:
logger.warning(f"date_published not found {raw_data['id'] !r}")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure the date is important enough to warrant logging

def get_severity(raw_data) -> []:
severities = []
if "severity" in raw_data:
for sever_list in raw_data["severity"]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
for sever_list in raw_data["severity"]:
for sever_list in raw_data.get("severity") or []:

logger.warning(f"date_published not found {raw_data['id'] !r}")


def get_severity(raw_data) -> []:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get_severity --> get_severities since there are more than one



def get_severity(raw_data) -> []:
severities = []
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO you should rather yield as you go instead of accumulating in a list. The code will be leaner then

@ziadhany ziadhany force-pushed the osv_import branch 2 times, most recently from 63ca14e to a197a84 Compare May 6, 2022 18:51
@ziadhany ziadhany requested a review from pombredanne May 6, 2022 19:09
Copy link
Member

@pombredanne pombredanne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the updates. A very last round few nitpickings for your consideration!



def get_severities(raw_data) -> []:
if "severity" in raw_data:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is no longer needed since the code below (for sever_list in raw_data.get("severity") or []: handles all the possible cases:

Suggested change
if "severity" in raw_data:

vulnerabilities/importers/pysec.py Show resolved Hide resolved
vulnerabilities/importers/pysec.py Show resolved Hide resolved

affected_packages = []
if "affected" not in raw_data:
logger.error(f"affected_packages not found - {raw_data['id'] !r}")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is always best to return early if there is nothing left to do.
This way you can de-indent the code that comes after

Suggested change
logger.error(f"affected_packages not found - {raw_data['id'] !r}")
logger.error(f"affected_packages not found - {raw_data['id'] !r}")
return

if "affected" not in raw_data:
logger.error(f"affected_packages not found - {raw_data['id'] !r}")

if "affected" in raw_data:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following from above, this becomes not needed:

Suggested change
if "affected" in raw_data:
if "affected" in raw_data:

logger.error(f"affected_packages not found - {raw_data['id'] !r}")

if "affected" in raw_data:
for affected_pkg in raw_data["affected"]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

affected may not be a list, so to be safe, I would use this:

Suggested change
for affected_pkg in raw_data["affected"]:
for affected_pkg in raw_data.get("affected") or []:


if "affected" in raw_data:
for affected_pkg in raw_data["affected"]:
purl = get_affected_purl(affected_pkg, raw_data["id"])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use a raw_id variable defined above

logger.error(
f"Non PyPI package found in PYSEC advisories: {purl} - from: {raw_data['id']!r}"
)
else:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Return early and de-indent:

Suggested change
else:
return

f"Non PyPI package found in PYSEC advisories: {purl} - from: {raw_data['id']!r}"
)
else:
vulnid = raw_data["id"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use raw_id variable

Reference: aboutcode-org#607
Signed-off-by: Ziad <[email protected]>

add the necessary changes

Signed-off-by: Ziad <[email protected]>

remove aliases de-duplicate

Signed-off-by: Ziad <[email protected]>

reslove conflicts

Signed-off-by: Ziad <[email protected]>

add dateparser to setup.cfg

Signed-off-by: Ziad <[email protected]>

Resolving conflicts

Signed-off-by: Ziad <[email protected]>

resolve conflicts

Signed-off-by: Ziad <[email protected]>

add a require changes

Signed-off-by: Ziad <[email protected]>

add a require changes

Signed-off-by: Ziad <[email protected]>

add a require changes

Signed-off-by: Ziad <[email protected]>

add a require changes

Signed-off-by: Ziad <[email protected]>

add a require changes

Signed-off-by: Ziad <[email protected]>

add a require changes

Signed-off-by: Ziad <[email protected]>
Copy link
Member

@pombredanne pombredanne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Merging! Thank you ++ for your peseverance!r

vulnerabilities/importers/pysec.py Outdated Show resolved Hide resolved
vulnerabilities/importers/pysec.py Show resolved Hide resolved
@pombredanne pombredanne merged commit b53e4f4 into aboutcode-org:main May 15, 2022
@@ -267,3 +267,16 @@ def _get_gh_response(gh_token, graphql_query):
endpoint = "https://api.github.com/graphql"
headers = {"Authorization": f"bearer {gh_token}"}
return requests.post(endpoint, headers=headers, json=graphql_query).json()


def dedupe(original: List) -> List:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could also have implemented this using a OrderedDict from collections.

>>> import collections
>>> collections.OrderedDict.fromkeys(["z","i","a","a","d","d"]).keys()
odict_keys(['z', 'i', 'a', 'd'])

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point! actually on Python 3.6+, this could be even shorter: dict.fromkeys(["z","i","a","a","d","d"]).keys()

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants