Use non-lowercased project names #4
Comments
This repo doesn't alter the names, it dumps the result from pypinfo:
Having a quick look in pypinfo, it's not changing the name of projects received from the Google BigQuery client. pypinfo does have this:

    def normalize(name):
        """https://www.python.org/dev/peps/pep-0503/#normalized-names"""
        return re.sub(r'[-_.]+', '-', name).lower()

But that's only used for normalising the input when wanting info about a single project, and is blank in this case. https://www.python.org/dev/peps/pep-0503/#normalized-names says:
(And then gives the same function.) I didn't check whether Google BigQuery can also return the un-normalised name; if so, that would need a change to pypinfo before being added here. If that's not possible or easy, then I'd be fine adding extra data here. Rather than post-processing, I think a second JSON file would be better. Or are the openSUSE package names identical to the PyPI names (eg. If so, can you normalise
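To illustrate why the normalised form cannot simply be reversed, here is a minimal sketch built on the `normalize` function quoted above from pypinfo. The example spellings are made up for illustration; the point is only that several spellings collapse to one normalised name.

```python
import re


def normalize(name):
    """PEP 503 name normalisation, as quoted from pypinfo in this thread."""
    return re.sub(r'[-_.]+', '-', name).lower()


# Hypothetical spellings chosen for illustration only.
spellings = ["ruamel.yaml", "ruamel_yaml", "Ruamel-YAML"]

# All of them map to the same normalised name, so the original spelling
# cannot be recovered from the normalised form alone.
normalised = {spelling: normalize(spelling) for spelling in spellings}

if __name__ == "__main__":
    for spelling, result in normalised.items():
        print(f"{spelling} -> {result}")
    print(f"PyYAML -> {normalize('PyYAML')}")
```

Going the other way needs an extra source of truth for the display name, which is what the rest of this thread is about.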
yes, with a https://build.opensuse.org/package/show/openSUSE:Factory/python-PyYAML I would prefer to use this data first and look it up against openSUSE, rather than the other way around, or building a database of both and cross-referencing. I'll see what is happening inside
The schema is at https://bigquery.cloud.google.com/table/the-psf:pypi.downloads20161022?tab=schema , and both url and file.filename have the proper project name. I have got them working with ad hoc queries, so now I just need to propose a PR to pypinfo to use the filename. It might be slightly slower, depending on whether BigQuery supports some more advanced SQL join syntax, and possibly even using https://bigquery.cloud.google.com/table/the-psf:pypi.simple_requests instead.
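As a rough sketch of the idea of recovering the project name from file.filename, the following Python splits a distribution filename into its name part. This is not the SQL used in the ad hoc queries mentioned above; the function name, the suffix list, and the example filenames are assumptions made for illustration.

```python
# Common sdist archive suffixes (an assumed, non-exhaustive list).
SDIST_SUFFIXES = (".tar.gz", ".tar.bz2", ".tgz", ".zip")


def project_from_filename(filename):
    """Return the project-name part of a distribution filename, or None.

    Hypothetical helper: assumes '{name}-{version}' for sdists and the
    '{distribution}-{version}-...' layout for wheels.
    """
    if filename.endswith(".whl"):
        # Wheels escape hyphens in the distribution name as underscores,
        # so the first hyphen-separated field is the (escaped) name.
        return filename.split("-", 1)[0]
    for suffix in SDIST_SUFFIXES:
        if filename.endswith(suffix):
            stem = filename[: -len(suffix)]
            # Assumes the version itself contains no hyphen.
            name, sep, _version = stem.rpartition("-")
            return name if sep else stem
    return None


if __name__ == "__main__":
    print(project_from_filename("PyYAML-3.13.tar.gz"))
    print(project_from_filename("PyYAML-5.1-cp37-cp37m-win_amd64.whl"))
```

Note that a wheel filename such as `python_dateutil-2.8.0-py2.py3-none-any.whl` yields `python_dateutil`, so wheel-derived names may still differ from the name shown on pypi.org in hyphen versus underscore.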
Other potential tools using BigQuery which might be usable, especially as some are doing post-processing to get more info from PyPI: https://github.com/cclauss/python3wos_asyncio and https://github.com/ubershmekel/python3wos , https://github.com/mara/bigquery-downloader , https://github.com/capicue/ncf/blob/master/packages/get-descriptions.py , https://github.com/fmenabe/pypi-stats , https://github.com/psincraian/pepy , https://github.com/ehfeng/installstats , https://github.com/datawrestler/lametric-pypi , https://github.com/OzymandiasTheGreat/pypes , https://github.com/rth/pypi-stats-viz , https://github.com/okfn/measure , https://github.com/crflynn/pypistats.org , https://github.com/RootLUG/aura , https://github.com/jantman/pypi-download-stats , https://github.com/scikit-hep/scikit-hep-orgstats , https://github.com/di/pyreadiness
Sounds good! One concern is the amount of BigQuery quota used: two requests need to fit within the free quota each week. Hopefully it won't increase the amount used too much, but it would be nice to see the difference. pypinfo reports how big each query is; you can see it in the JSON here.
Good list! (I need to make a list of things using this data, too.) Of those, https://github.com/psincraian/pepy and https://github.com/crflynn/pypistats.org are websites which essentially cache BigQuery data. The latter is especially good and has an API, for which I've written a CLI client: https://pypistats.org/api/ The data is limited to 6 months, and neither pepy nor pypistats.org has the specific mapping we're talking about. But maybe they could?
It shouldn't be extra queries, just slightly slower queries, assuming the SQL engine is halfway decent. Based on your recommendation, I've created issues in both of those projects to see which, if any, have an interest. You'll be interested to learn that pepy is growing an API: psincraian/pepy@b3cf4ee
Now that I have the SQL changes needed (see queries at psincraian/pepy#128 (comment)), I've also created an issue at ofek/pypinfo#73 before making the change there.
All project names are lower case, not matching the name shown on pypi.org, e.g. pyyaml instead of PyYAML. I suspect that may be the data this project has, in which case the problem is upstream.
That lowercasing is not very helpful - the names of projects can (and do) change over time in all sorts of ways, not just the case.
Applying lowercase can be done after the fact - it is a simple transform, but it is not reversible without the post-processing of all entries as suggested in the followup comments on #1.
My use-case is that I need to match the list up with openSUSE package names, which must use the PyPI package name exactly, including casing and hyphen-vs-dash. The task is slightly more difficult and slower if I don't have the exact name to begin with.
If it can't be obtained from the source data, it is likely quicker for me to add post-processing to get the real name, rather than try to get exact results from case-insensitive openSUSE package searches.
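One possible shape for that post-processing, as a minimal sketch: look each lowercased name up in the PyPI JSON API and read the display name from `info.name`. The helper names `fetch_json` and `real_name` are hypothetical, and this is only an assumption about how the lookup could be done, not what any of the tools in this thread actually do. It costs one HTTP request per project.

```python
import json
from urllib.request import urlopen


def fetch_json(url):
    """Fetch a URL and parse the response body as JSON."""
    with urlopen(url) as response:
        return json.load(response)


def real_name(lowercased, fetch=fetch_json):
    """Return the display name for a lowercased project name.

    `fetch` is injectable so the lookup can be stubbed out or cached.
    """
    data = fetch(f"https://pypi.org/pypi/{lowercased}/json")
    return data["info"]["name"]


if __name__ == "__main__":
    # Performs a real network request when run directly.
    print(real_name("pyyaml"))
```

Passing a caching `fetch` would avoid repeating requests for names already resolved in an earlier run.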