[Sample Data] Request a list of all repositories in the open-digger dataset, in the format "GithubId/repository name". #1408

Closed
ZhangChunXian opened this issue Oct 13, 2023 · 6 comments · Fixed by #1412
Assignees
Labels
waiting for author need issue author's feedback

Comments

@ZhangChunXian

Usage

For personal research

Extract SQL

I would like a list of all repositories in the open-digger dataset, in the format "GithubId/repository name", such as "X-lab2017/open-digger". I'd appreciate it if you could provide it.

Does this dataset need to be updated regularly?

No response

@github-actions github-actions bot added the waiting for repliers need other's feedback label Oct 13, 2023
@Zzzzzhuzhiwei
Collaborator

Hi, you can use the data in the file for your research, or you can use the labeled data we released. They both contain the repository list.

@github-actions github-actions bot added waiting for author need issue author's feedback and removed waiting for repliers need other's feedback labels Oct 13, 2023
@frank-zsy
Contributor

@Zzzzzhuzhiwei I think @ZhangChunXian is requesting the whole repo and user list that OpenDigger exports, which is not currently in the OpenDigger sample data or exported data.

I think we can do this in the monthly export task, writing to CSV or JSONL files; however, the files may be really large.

@Zzzzzhuzhiwei
Collaborator

I see! If we export this file, it might be too large. There are currently 328,032,951 distinct repositories in ClickHouse.

@frank-zsy
Contributor

Although there are lots of repos on GitHub, we only export about 500 thousand of them. I tried to find out how large the file would be:

image

If we use a CSV file with the id and name on each line, the file will be about 17 MB.
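For anyone who wants to sanity-check an estimate like this, here is a minimal sketch of one way to extrapolate a CSV size from a few sample rows. It is not the query used above; the helper names (`csv_size_bytes`, `estimate_total_mb`), the sample ids, and the second repository name are all made up for illustration.

```python
import csv
import io


def csv_size_bytes(rows):
    """Return the UTF-8 size in bytes of the rows written as CSV (no header)."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerows(rows)
    return len(buf.getvalue().encode("utf-8"))


def estimate_total_mb(sample_rows, total_rows):
    """Extrapolate a full export size in MB (10**6 bytes) from a sample."""
    avg_bytes_per_row = csv_size_bytes(sample_rows) / len(sample_rows)
    return avg_bytes_per_row * total_rows / 1e6


# Hypothetical sample rows: (repository id, "owner/name"). The ids are invented.
sample = [
    (1, "X-lab2017/open-digger"),
    (2, "example-org/example-repo"),
]

# Roughly 500 thousand exported repos, as mentioned in this thread.
print(estimate_total_mb(sample, 500_000))
```

Real ids are much longer than the single digits used here, so a real sample would give a larger per-row average; at about 17 MB for about 500 thousand rows, the figure above works out to roughly 34 bytes per line.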

@frank-zsy
Contributor

And the user file will be about 5 MB. I think this is feasible for the monthly export task.

/self-assign

@frank-zsy
Contributor

@ZhangChunXian Thanks for the issue. The lists have been exported to repo_list.csv and user_list.csv; please feel free to use them in your research. Any further suggestions and questions are welcome.
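As a rough sketch of how the exported list could be turned into "GithubId/repository name" strings, the snippet below parses a small in-memory stand-in for repo_list.csv. The header names (`id`, `repo_name`), the ids, and the second repository are assumptions for illustration; check the actual file's header before relying on them.

```python
import csv
import io

# Stand-in for the contents of repo_list.csv. The header names and ids are
# assumptions; the real file may use different column names.
SAMPLE_CSV = """id,repo_name
1,X-lab2017/open-digger
2,example-org/example-repo
"""


def read_repo_names(fileobj, name_column="repo_name"):
    """Return the "owner/repo" names found in a CSV file object."""
    names = []
    for row in csv.DictReader(fileobj):
        name = (row.get(name_column) or "").strip()
        # Keep only entries with exactly one slash and non-empty halves.
        if name.count("/") == 1 and all(name.split("/")):
            names.append(name)
    return names


repos = read_repo_names(io.StringIO(SAMPLE_CSV))
print(repos)
```

To read a downloaded copy instead, pass `open(path, newline="", encoding="utf-8")` in place of the `io.StringIO` object, adjusting `name_column` if the header differs.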

3 participants