Query doesn't report reliably #1

Open
sdruskat opened this issue Dec 16, 2021 · 3 comments

@sdruskat
Owner

Unfortunately, the GitHub query doesn't retrieve reliable numbers. As suggested by @arfon, this may be due to ongoing work in the GitHub backend at query time.

Some options for working around this would be to clean the results retroactively, e.g., by removing any unexpected spikes (deleting rows that deviate from the general trend, judged against 1 measurement before and 2 after the spike).
Another option would be to run the script several times a day and average the results.
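
A minimal sketch of both workarounds, to make the idea concrete. The function names, the input shapes, and the `tolerance` threshold are assumptions for illustration and are not taken from the actual script. A value is only checked once 1 earlier and 2 later measurements exist, so the most recent rows are kept until enough follow-up data is available.

```python
from collections import defaultdict
from statistics import mean, median


def remove_spikes(counts, tolerance=0.5):
    """Drop values that deviate strongly from their neighbours.

    Each value is compared with the median of 1 measurement before and
    2 measurements after it. `tolerance` is an assumed threshold: the
    allowed relative deviation from that median.
    """
    kept = []
    for i, value in enumerate(counts):
        # Without a full window (1 before, 2 after) we cannot judge yet.
        if i < 1 or i + 2 >= len(counts):
            kept.append(value)
            continue
        reference = median([counts[i - 1], counts[i + 1], counts[i + 2]])
        if reference == 0 or abs(value - reference) / reference <= tolerance:
            kept.append(value)
    return kept


def daily_mean(samples):
    """Average several (day, count) samples taken on the same day."""
    by_day = defaultdict(list)
    for day, count in samples:
        by_day[day].append(count)
    return {day: mean(values) for day, values in sorted(by_day.items())}
```

Using the median of the three neighbours rather than their mean keeps a single spike from distorting the reference value for the rows next to it.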

sdruskat self-assigned this Dec 16, 2021
sdruskat added the bug label Dec 16, 2021
@sdruskat
Owner Author

This may have to do with searching for a filename only. Recent weeks (including work done in https://github.com/sdruskat/cff-in-the-wild) have shown that the GitHub Search API does not reliably produce results:

  • Vastly different numbers are reported for consecutive days
  • The UI shows only a handful of files (e.g., 2) for 7k+ reported results

One solution could be to combine filename search with a string we expect in each and every (real) CFF file, e.g., cff-version (required from the start).
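
A rough sketch of what such a combined query could look like against the REST code search endpoint. The filename `CITATION.cff`, the helper name, and the `per_page=1` setting are assumptions for illustration. The query uses the `filename:` qualifier of the REST code search syntax; an actual request needs authentication, and the reported number would be read from the `total_count` field of the JSON response.

```python
from urllib.parse import urlencode

SEARCH_CODE_ENDPOINT = "https://api.github.com/search/code"


def build_code_search_url(filename, required_text):
    """Build a code search URL that matches on filename AND file content.

    `required_text` is a string expected in every real CFF file,
    e.g. "cff-version".
    """
    query = f"{required_text} filename:{filename}"
    # per_page=1: only the total count is of interest, not the hits.
    return f"{SEARCH_CODE_ENDPOINT}?{urlencode({'q': query, 'per_page': 1})}"


url = build_code_search_url("CITATION.cff", "cff-version")
```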

sdruskat pushed a commit that referenced this issue Sep 26, 2022
* Create CITATION.cff

* Create cffconvert.yml

* Fix typo
sdruskat added a commit that referenced this issue Sep 26, 2022
[Enhancement] Add `CITATION.cff` (#1)
@sdruskat
Owner Author

sdruskat commented Feb 2, 2024

A current search for path:**/*.cff yields ~42k results. However, this includes all files present in the cff-corpus project, which is also hosted on GitHub. The same search, excluding my own repos (https://github.com/search?q=path%3A*.cff+-user%3Asdruskat&type=code), currently yields ~15.3k results. This roughly matches the ~15.5k unique repositories for which I have CFF files (including historical versions) in cff-corpus.

This means that by using the metadata encoded in the corpus, we can now again construct a more reliable history of CFF files on GitHub!
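
As a sketch of how such a history might be derived: the code below assumes the corpus metadata can be reduced to `(repository, date)` pairs, one per harvested CFF file version. That record shape and the function name are hypothetical and are not the actual cff-corpus format.

```python
from collections import Counter
from datetime import date


def cumulative_repo_counts(records):
    """Return [(day, number of repos that had a CFF file by that day)].

    `records` is an iterable of (repository, date) pairs; a repository
    counts from the earliest date it appears with.
    """
    first_seen = {}
    for repo, day in records:
        if repo not in first_seen or day < first_seen[repo]:
            first_seen[repo] = day

    new_per_day = Counter(first_seen.values())
    history, total = [], 0
    for day in sorted(new_per_day):
        total += new_per_day[day]
        history.append((day, total))
    return history


example = [
    ("a/x", date(2021, 1, 5)),
    ("b/y", date(2021, 1, 5)),
    ("a/x", date(2021, 3, 1)),  # later version of an already known file
    ("c/z", date(2021, 2, 1)),
]
history = cumulative_repo_counts(example)
```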

/cc @jspaaks

@sdruskat
Owner Author

sdruskat commented Feb 2, 2024

> a more reliable history of CFF files on GitHub!

Currently, the metadata in the corpus excludes deleted files. This information could be retrieved during harvesting by looking more carefully at the commits that change a CFF file and checking whether the file was removed.
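
One way this could look, as a sketch only: assume the harvester records, for each CFF file path, a list of `(day, status)` events where `status` is `"added"`, `"modified"`, or `"removed"` (these mirror the per-file status values the GitHub REST API reports for a commit). The event shape and the function name are hypothetical.

```python
def presence_intervals(events):
    """Turn per-file commit events into (start, end) presence intervals.

    `events` is an iterable of (day, status) pairs with ISO date strings.
    `end` is None while the file still exists.
    """
    intervals = []
    start = None
    for day, status in sorted(events):
        if status == "removed":
            if start is not None:
                intervals.append((start, day))
                start = None
        elif start is None:
            # First "added" or "modified" event after an absence.
            start = day
    if start is not None:
        intervals.append((start, None))
    return intervals


events = [
    ("2021-01-05", "added"),
    ("2021-06-01", "modified"),
    ("2022-02-10", "removed"),
    ("2023-03-03", "added"),
]
intervals = presence_intervals(events)
```

With intervals like these, a file would count toward the history only on days that fall inside one of its intervals.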
