[Geo data] How can I contribute developer's geographic data to open-digger #1414

Closed
PureNatural opened this issue Oct 24, 2023 · 9 comments
Labels
waiting for author need issue author's feedback

Comments

@PureNatural
Contributor

Description

I currently have geographic location information for about 1 million developers, and more data will be parsed in the future.

How can I submit this data to open-digger? @frank-zsy

@github-actions github-actions bot added the waiting for repliers need other's feedback label Oct 24, 2023
@frank-zsy
Contributor

@PureNatural Thanks. How are the developers identified? Are they GitHub users, or users of other platforms? And what format is the geographic location information in?

Currently, OpenDigger contains about 215 thousand GitHub users with location information.

The location information in OpenDigger right now is:

  • location: the original location text the user provided
  • country: the country of the location
  • administrative_area_level_1: the province or state name of the location
  • administrative_area_level_2: the city of the location
  • locality: the detailed district or street of the location

We parse the location into detailed fields on a best-effort basis, so administrative_area_level_1 and the more detailed fields may be null.
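
The record shape described above can be sketched in Python as follows. This is only an illustration: the class name `LocationInfo` and the helper `finest_level` are hypothetical, and only the five field names come from the list above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LocationInfo:
    """Hypothetical container; only the field names come from this thread."""
    location: str                                       # original user-provided text
    country: Optional[str] = None
    administrative_area_level_1: Optional[str] = None   # province or state
    administrative_area_level_2: Optional[str] = None   # city
    locality: Optional[str] = None                      # district or street


def finest_level(info: LocationInfo) -> Optional[str]:
    """Return the name of the most detailed field that was parsed, if any."""
    for name in ("locality", "administrative_area_level_2",
                 "administrative_area_level_1", "country"):
        if getattr(info, name) is not None:
            return name
    return None
```

A helper like `finest_level` makes it easy to count how many records were resolved only to country level versus down to a city or district.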

@github-actions github-actions bot added waiting for author need issue author's feedback and removed waiting for repliers need other's feedback labels Oct 25, 2023
@srsxyc

srsxyc commented Oct 27, 2023

Developers are GitHub users.
The location information is:

  • adminDistrict: the province or state name of the location
  • adminDistrict2: district level
  • countryRegion: the country of location
  • formattedAddress: the formatted address string for the location
  • latitude
  • longitude

The fields above are parsed from the user's location. To get as much geolocation information as possible, we also parsed the email addresses of GitHub users to obtain some country-level geolocation information. Currently there are roughly 660,000 processed records, and the raw data is stored in JSON files. If you need guidance on the specifics of the data, I can show you the JSON file first.
@frank-zsy
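
One way the two field lists in this thread might be lined up is sketched below. The mapping is an assumption, not something either side confirmed: in particular, treating `adminDistrict2` ("district level") as `administrative_area_level_2` (described as the city) is a guess, `locality` has no obvious source field, and `formattedAddress` is simply dropped. The function name `to_opendigger_fields` is hypothetical.

```python
from typing import Any, Dict, Optional

# Assumed correspondence between the contributed fields and the OpenDigger
# fields listed earlier in the thread. This is a guess for illustration.
FIELD_MAP = {
    "countryRegion": "country",
    "adminDistrict": "administrative_area_level_1",
    "adminDistrict2": "administrative_area_level_2",
}


def to_opendigger_fields(raw_location: str, parsed: Dict[str, Any]) -> Dict[str, Optional[Any]]:
    """Map one parsed record onto OpenDigger-style field names."""
    row: Dict[str, Optional[Any]] = {
        "location": raw_location,
        "country": None,
        "administrative_area_level_1": None,
        "administrative_area_level_2": None,
        "locality": None,  # no obvious source field, so it stays null
    }
    for src, dst in FIELD_MAP.items():
        value = parsed.get(src)
        if value:  # treat missing keys and empty strings as null
            row[dst] = value
    # Coordinates have no counterpart in the existing list; carry them through.
    for key in ("latitude", "longitude"):
        row[key] = parsed.get(key)
    return row
```

Keeping the mapping in a single dictionary means a different correspondence only requires changing `FIELD_MAP`.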

@frank-zsy
Contributor

@srsxyc That would be really great. If you give me the JSON file, I will import it into ClickHouse for further use. A few more questions to confirm:

  • When was this data collected?
  • How did you parse the user-provided location info into geo info? I used the Google Geo location API to do this work.

@srsxyc

srsxyc commented Oct 28, 2023

The data source we used was the GitHub log data from 2015 to the present that you provided. We first extracted the de-duplicated usernames from the log data, then crawled each user's information through the GitHub API, and finally parsed the locations through the Bing Maps API. @frank-zsy
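
The three steps described above (de-duplicate usernames, fetch each profile, geocode the location) can be sketched roughly as follows. This is not the actual crawler: the log field name `actor_login` is an assumption, and `fetch_profile` and `geocode` are hypothetical callables standing in for wrappers around the GitHub API and a geocoding service, so the sketch makes no network calls itself.

```python
from typing import Callable, Dict, Iterable, List, Optional


def unique_logins(events: Iterable[Dict]) -> List[str]:
    """De-duplicate usernames while keeping first-seen order."""
    seen = set()
    logins = []
    for event in events:
        login = event.get("actor_login")  # assumed log field name
        if login and login not in seen:
            seen.add(login)
            logins.append(login)
    return logins


def build_geo_records(
    events: Iterable[Dict],
    fetch_profile: Callable[[str], Optional[Dict]],
    geocode: Callable[[str], Optional[Dict]],
) -> List[Dict]:
    """De-duplicate logins, fetch each profile, then geocode its location."""
    records = []
    for login in unique_logins(events):
        profile = fetch_profile(login)  # e.g. a wrapper around the GitHub API
        if not profile:
            continue
        location = profile.get("location")
        if not location:
            continue  # the user did not fill in a location
        geo = geocode(location)  # e.g. a wrapper around a geocoding service
        if geo is None:
            continue  # the location text could not be resolved
        records.append({"login": login, "location": location, **geo})
    return records
```

Passing the two network steps in as functions keeps rate limiting, retries, and API keys outside the pipeline logic, and lets the flow be exercised with stubbed data.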

@frank-zsy
Contributor

Great, can you give me the JSON file so that I can import it into ClickHouse? By "when" I mean that users may change their location information at any time, so I need to know when you called the GitHub API and retrieved the data from GitHub.

Also, are latitude and longitude often used for analysis? I can add those columns too.

@srsxyc

srsxyc commented Oct 30, 2023

We've only crawled data for 3 million users so far, with the earliest retrieved around March 2023 and the latest around May 2023. Because there is still a large amount of data that has not been crawled, we have not updated it.

Latitude and longitude are important pieces of information when performing geolocation analysis. I think it's best to keep them.

How do I give you the JSON file?

@frank-zsy
Contributor

@srsxyc Any form you like: compress it and send it to me via WeChat, share it through Baidu Pan, or upload it to OSS and share the link. Any of these is fine with me.

@frank-zsy
Contributor

@srsxyc Thanks for the data. All the user data has been inserted into the gh_user_info table, and I will insert the location info into the location_info table too.
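
For readers preparing a similar import, a minimal sketch of turning parsed records into newline-delimited JSON, which ClickHouse can read with its `JSONEachRow` input format, might look like this. The record keys shown in the usage note are illustrative only; the actual columns of `gh_user_info` and `location_info` are not given in this thread.

```python
import json
from typing import Dict, Iterable, Iterator


def to_json_each_row(records: Iterable[Dict]) -> Iterator[str]:
    """Yield one compact JSON object per record, one per output line."""
    for record in records:
        # ensure_ascii=False keeps non-ASCII place names readable in the file.
        yield json.dumps(record, ensure_ascii=False, separators=(",", ":"))
```

Writing these lines to a file, one per row, gives input that could be piped into something like `clickhouse-client --query "INSERT INTO location_info FORMAT JSONEachRow"`, assuming the JSON keys match the table's column names.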


@frank-zsy
Contributor

@srsxyc All the data has now been imported into ClickHouse, thanks a lot.
