Use non-lowercased project names #4

jayvdb · 2019-05-08T15:31:03Z

All project names are lowercase and do not match the names shown on pypi.org, e.g. pyyaml instead of PyYAML. I suspect that may be the data this project has, in which case the problem is upstream.

That lowercasing is not very helpful - the name of projects can (and does) change over time in all sorts of ways, not just the case.

Applying lowercase can be done after the fact: it is a simple transform, but it is not reversible without post-processing all entries, as suggested in the follow-up comments on #1.

My use case is that I need to match the list up with openSUSE package names, which must use the PyPI package name, exactly, including casing and hyphen-vs-dash. The task is slightly more difficult and slower if I don't have the exact name to begin with.

If it can't be obtained from the source data, it is likely quicker for me to add post-processing to get the real name, rather than try to get exact results from case-insensitive openSUSE package searches.

hugovk · 2019-05-09T10:05:51Z

This repo doesn't alter the names, it dumps the result from pypinfo:

/usr/local/bin/pypinfo --json --indent 0 --limit 5000 --days 30 "" project > top-pypi-packages-30-days.json

Having a quick look in pypinfo, it's not changing the name of projects received from the Google BigQuery client.

pypinfo does have this:

import re

def normalize(name):
    """https://www.python.org/dev/peps/pep-0503/#normalized-names"""
    return re.sub(r'[-_.]+', '-', name).lower()

But that's only used to normalise the input when requesting info about a single project, and that input is blank in this case.

https://www.python.org/dev/peps/pep-0503/#normalized-names says:

This PEP references the concept of a "normalized" project name. As per PEP 426 the only valid characters in a name are the ASCII alphabet, ASCII numbers, ., -, and _. The name should be lowercased with all runs of the characters ., -, or _ replaced with a single - character. This can be implemented in Python with the re module:

(And then gives the same function.)
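As a rough illustration of why this normalisation cannot be undone from the normalised name alone, the sketch below groups a few spellings by their normalised form. The spellings are made up for the example and are not taken from the dataset.

```python
import re
from collections import defaultdict


def normalize(name):
    """PEP 503 normalisation, same expression as quoted above."""
    return re.sub(r"[-_.]+", "-", name).lower()


# Illustrative spellings only (assumed, not read from any real data).
spellings = [
    "PyYAML", "pyyaml", "PYYAML",
    "ruamel.yaml", "ruamel_yaml", "Ruamel-Yaml",
]

# Map each normalised name to every spelling that collapses onto it.
by_normalized = defaultdict(list)
for spelling in spellings:
    by_normalized[normalize(spelling)].append(spelling)

for key, originals in by_normalized.items():
    print(key, "<-", originals)
```

Several distinct spellings map to one key, so recovering the name shown on pypi.org needs an extra data source rather than a reverse transform.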

I didn't check whether Google BigQuery can also return the un-normalised name. If it can, that would need a change to pypinfo before being added here.

If that's not possible or easy, then I'd be fine adding extra data here. I think a second JSON file would be better than post-processing.

Or are the openSUSE package names identical to the PyPI names (e.g. PyYAML)?

If so, can you normalise PyYAML into pyyaml and then use the data here?
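A minimal sketch of that idea, assuming you already hold a list of exact names from the other side: normalise each exact name and look it up among the normalised names in the data. The function name `match_exact_names` and both input lists are hypothetical.

```python
import re


def normalize(name):
    """PEP 503 normalisation."""
    return re.sub(r"[-_.]+", "-", name).lower()


def match_exact_names(exact_names, normalized_names):
    """Return {normalised name: exact name} for names found in the data."""
    wanted = set(normalized_names)
    matches = {}
    for exact in exact_names:
        key = normalize(exact)
        if key in wanted:
            matches[key] = exact
    return matches


# Hypothetical inputs for illustration only.
dataset_names = ["pyyaml", "requests", "ruamel-yaml"]
exact_names = ["PyYAML", "ruamel.yaml", "Django"]

matches = match_exact_names(exact_names, dataset_names)
print(matches)
```

This only works in the direction of exact name to normalised name, so it assumes the exact names are available before the lookup.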

jayvdb · 2019-05-09T10:59:18Z

> Or are the openSUSE package names identical to the PyPI names (e.g. PyYAML)?

Yes, with a python- prefix.

https://build.opensuse.org/package/show/openSUSE:Factory/python-PyYAML

I would prefer to be using this data first, and looking up against openSUSE, rather than the other way around, or building a database of both and cross referencing.

I'll see what is happening inside pypinfo.

jayvdb · 2019-05-10T03:56:03Z

The schema is at https://bigquery.cloud.google.com/table/the-psf:pypi.downloads20161022?tab=schema, and both url and file.filename contain the proper project name. I have got them working with ad hoc queries, so now I just need to propose a PR to pypinfo to use the filename. It might be slightly slower, depending on whether BigQuery supports some more advanced SQL join syntax, and possibly even using https://bigquery.cloud.google.com/table/the-psf:pypi.simple_requests instead.
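To show why the filename carries the original casing, here is a hedged Python sketch that pulls the project name out of a source-distribution filename. It is not the SQL used in the ad hoc queries; the helper `name_from_sdist` and the suffix list are assumptions for illustration, and it assumes the version part contains no hyphen.

```python
# Source-distribution suffixes this sketch recognises (an assumed list).
SDIST_SUFFIXES = (".tar.gz", ".tar.bz2", ".tgz", ".zip")


def name_from_sdist(filename):
    """Return the project name from '<name>-<version><suffix>', else None."""
    for suffix in SDIST_SUFFIXES:
        if filename.endswith(suffix):
            stem = filename[: -len(suffix)]
            # Split on the last hyphen: the name itself may contain hyphens.
            name, sep, _version = stem.rpartition("-")
            return name if sep else None
    return None


print(name_from_sdist("PyYAML-5.1.tar.gz"))
print(name_from_sdist("ruamel.yaml-0.15.94.tar.gz"))
```

Wheel filenames are deliberately left out, since their name component is escaped and may not match the project name exactly.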

jayvdb · 2019-05-10T04:25:51Z

hugovk · 2019-05-10T05:10:44Z

Sounds good! One concern is the amount of BigQuery quota used, to ensure two requests can be made each week with the free quota. Hopefully it won't increase the amount used too much, but it would be nice to see the difference.

pypinfo reports how big each query is; you can see it in the JSON here.

hugovk · 2019-05-10T05:24:32Z

Good list! (I need to make a list of things using this data, too.)

Of those, https://github.com/psincraian/pepy and https://github.com/crflynn/pypistats.org are websites which essentially cache BigQuery data.

The latter is especially good and has an API, for which I've written a CLI client:

https://pypistats.org/api/
https://github.com/hugovk/pypistats

The data is limited to 6 months, and neither pepy nor pypistats.org has the specific mapping we're talking about. But maybe they could?

jayvdb · 2019-05-10T12:17:50Z

> One concern is the amount of BigQuery quota used, to ensure two requests can be made each week with the free quota.

It shouldn't need extra queries, just slightly slower queries, assuming the SQL engine is halfway decent.

Based on your recommendation, I've created issues in both of those projects to see which, if any, have an interest.

You'll be interested to learn that pepy is growing an API: psincraian/pepy@b3cf4ee

jayvdb · 2019-05-13T03:48:03Z

Now that I have the SQL changes needed (see the queries at psincraian/pepy#128 (comment)), I've also created an issue at ofek/pypinfo#73 before making the change there.

This was referenced May 10, 2019

Non-normalised package name crflynn/pypistats.org#18

Open

Non-normalised package name psincraian/pepy#128

Open

jayvdb mentioned this issue May 13, 2019

Non-normalised package name ofek/pypinfo#73

Open

jayvdb mentioned this issue May 15, 2019

added checking for release versions on PyPI di/pip-api#24

Closed

jayvdb mentioned this issue Jul 17, 2019

Support other languages DanielVenturini/vigilant-lamp#3

Open

hugovk closed this as completed Sep 1, 2020
