Pull Request #60

Open
rzitomer opened this issue May 10, 2019 · 2 comments
Comments

@rzitomer
Collaborator

rzitomer commented May 10, 2019

@gvwilson Here's the info we can get from GitHub about pull requests (see http://ghtorrent.org/relational.html):

  • All pull requests and their current state (including the head repo the PR is from and the base repo it targets; whether the head and base repos are the same; and the associated commit IDs)
  • History of all previous states the pull requests were in (opened, closed, merged, reopened, synchronized)
  • All the commits associated with a pull request
  • All the code review comments associated with the pull request
  • We can also get all issues that are associated with each pull request

Details on code reviews data:
We pulled all of the code reviews from the ~10k projects we sampled. Around 0.5% of the projects had at least one code review, which was a little lower than expected. Our hypothesis is that the table only records comments made directly on code in a PR, and not general comments on the pull request (even though we might want to count those as code reviews too). These 'general' comments appear to be stored in a separate table called issue_comments, along with all the comments on issues (for more info, see the 4th paragraph in the Challenges and Limitations section here: http://www.ghtorrent.org/files/ghtorrent-data.pdf). The next step is to pull in these 'general' comments on PRs as well.
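One way to combine the two kinds of comments is sketched below. It is only an illustration: the table and column names are assumptions modelled on the GHTorrent relational schema (in particular, that `issues.pull_request_id` links an issue to its PR) and should be checked against the schema page before reuse. The rows are toy data in an in-memory SQLite database, not results from our sample.

```python
import sqlite3

# ASSUMPTION: table and column names are modelled on the GHTorrent
# relational schema and trimmed to the columns this sketch needs.
SCHEMA = """
CREATE TABLE pull_requests (id INTEGER PRIMARY KEY, base_repo_id INTEGER);
CREATE TABLE pull_request_comments (pull_request_id INTEGER, comment_id INTEGER);
CREATE TABLE issues (id INTEGER PRIMARY KEY, repo_id INTEGER, pull_request_id INTEGER);
CREATE TABLE issue_comments (issue_id INTEGER, comment_id INTEGER);
"""

# For each PR, count inline code comments and 'general' comments separately.
# General comments are reached through the issue attached to the PR.
QUERY = """
SELECT pr.id AS pull_request_id,
       (SELECT COUNT(*) FROM pull_request_comments prc
         WHERE prc.pull_request_id = pr.id) AS inline_comments,
       (SELECT COUNT(*) FROM issue_comments ic
          JOIN issues i ON i.id = ic.issue_id
         WHERE i.pull_request_id = pr.id) AS general_comments
FROM pull_requests pr
ORDER BY pr.id
"""


def comment_counts(conn):
    """Return (pull_request_id, inline_comments, general_comments) rows."""
    return conn.execute(QUERY).fetchall()


def demo_connection():
    """Build a tiny in-memory database with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO pull_requests VALUES (?, ?)", [(1, 10), (2, 10)])
    conn.executemany("INSERT INTO pull_request_comments VALUES (?, ?)", [(1, 100)])
    # Issue 9 is a plain issue (no PR), so its comment must not be counted.
    conn.executemany(
        "INSERT INTO issues VALUES (?, ?, ?)",
        [(7, 10, 1), (8, 10, 2), (9, 10, None)],
    )
    conn.executemany(
        "INSERT INTO issue_comments VALUES (?, ?)",
        [(7, 200), (8, 201), (8, 202), (9, 203)],
    )
    return conn


if __name__ == "__main__":
    for row in comment_counts(demo_connection()):
        print(row)
```

Keeping the two counts in separate columns lets us decide later whether a 'general' comment should count as a code review, rather than baking that choice into the query.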

Details on pull request data:
We pulled all of the pull requests from the ~10k projects we sampled. Around 6.6% of the projects had at least one PR.
Below we have plotted each cluster (where a cluster is a group of 'similar' GitHub projects based on Graph2Vec embeddings) by what % of the cluster is just a single chain (no branching and merging) against the % of projects in the cluster with at least one pull request. We expected these two variables to be negatively correlated, which is (kind of) what we see:
image

@gvwilson
Contributor

Thanks very much - I'll try to get you comments tonight.

@rzitomer
Collaborator Author

rzitomer commented May 22, 2019

A few followups on this:

  • As discussed last week, we'd expect the percentage of projects in a cluster that have at least one PR to be negatively correlated with the percentage of the cluster's motifs (length 5) that are just a single chain (which I'm assuming is a proxy for graph complexity, or at the very least a proxy for the number of branches and merges), but we didn't really see that. Here is the graph based on the updated clusters; the pattern doesn't really show up here either:
    image

Next we looked at the relationship at the project level by examining the 10k sampled projects we used to create the clusters.

  • I divided them into repos with at least one PR and repos without a PR. I did see that repos without a PR were more likely to just be a single chain.
    image
    Mean single chain (no PRs): .915
    Median single chain (no PRs): 1
    Mean single chain (>=1 PRs): .659
    Median single chain (>=1 PRs): .709

I assume a big confounding variable here is number of authors: my hypothesis is that a project having one author is highly correlated with having no PRs and highly correlated with single chain percentage. Will look into this.
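The split described above, and the planned check on number of authors, could be sketched roughly as follows. The field names (`n_prs`, `single_chain`, `n_authors`) and the single-author versus multi-author strata are assumptions made for illustration, not the names used in our data.

```python
from statistics import mean, median


def summarize_by_pr(repos):
    """Mean and median single-chain fraction for repos with and without PRs."""
    groups = {"no PRs": [], ">=1 PRs": []}
    for repo in repos:
        key = ">=1 PRs" if repo["n_prs"] >= 1 else "no PRs"
        groups[key].append(repo["single_chain"])
    return {
        key: {"n": len(vals), "mean": mean(vals), "median": median(vals)}
        for key, vals in groups.items()
        if vals  # skip empty groups so mean() and median() do not raise
    }


def summarize_within_author_strata(repos):
    """Repeat the PR split separately for single-author and multi-author repos.

    If the gap between the PR groups shrinks inside each stratum, number of
    authors is a plausible confounder.
    """
    strata = {"single author": [], "multiple authors": []}
    for repo in repos:
        key = "single author" if repo["n_authors"] == 1 else "multiple authors"
        strata[key].append(repo)
    return {key: summarize_by_pr(rows) for key, rows in strata.items() if rows}


if __name__ == "__main__":
    # Made-up rows purely to exercise the functions.
    toy = [
        {"n_prs": 0, "single_chain": 1.0, "n_authors": 1},
        {"n_prs": 0, "single_chain": 0.8, "n_authors": 1},
        {"n_prs": 2, "single_chain": 0.5, "n_authors": 3},
        {"n_prs": 5, "single_chain": 0.7, "n_authors": 6},
    ]
    print(summarize_by_pr(toy))
    print(summarize_within_author_strata(toy))
```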

  • I still see no relationship between the number of PRs and single chain motifs at the project level (given that a project has at least one PR):
    image

So we don't see this relationship even on "interesting" projects (those with >2 PRs).

  • As a sanity check of the data and our approach to pulling it, I checked that there is a relationship between the number of commits and the number of PRs, which there is:
    image
    image

I took an initial stab at controlling for commits by looking only at projects with at least 100 commits, and still see no obvious pattern between single chain % and # of PRs:
image
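To put a number on "no obvious pattern", one option is a rank correlation on the filtered projects. The sketch below uses Spearman's coefficient, which is my choice here rather than something we have already run, and the field names are hypothetical. The `min_prs=1` default carries over the "at least one PR" condition from the earlier bullet and is likewise an assumption.

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; None when either input has zero variance."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return None
    return cov / sqrt(vx * vy)


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


def spearman_for_large_projects(repos, min_commits=100, min_prs=1):
    """Correlate single-chain fraction with PR count on the filtered repos."""
    kept = [
        r for r in repos
        if r["n_commits"] >= min_commits and r["n_prs"] >= min_prs
    ]
    if len(kept) < 2:
        return None
    return spearman(
        [r["single_chain"] for r in kept],
        [r["n_prs"] for r in kept],
    )
```

Filtering on a commit threshold is a coarse control; a coefficient near zero on the filtered set would agree with the plot, but it would not rule out a relationship that only appears within narrower commit ranges.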

  • I spot checked some projects that had low single chain percentage (complex graphs) and < 2 PRs and found they were mostly 2-3 people working together on 1-3 branches.

  • I spot checked some projects that had high single chain percentage (simple graphs) and > 3 PRs and found they had a lot of people (~5-20 authors), a lot of commits, and no obvious pattern in branches.

  • We'll formalize this analysis by pulling number of authors, number of commits, number of branches etc. by repo and getting averages of these stats for high and low graph complexity and high and low number of PRs (and issues and code reviews).

  • I looked at number of issues by repo as well (issues include PRs). 13% of sampled repos had at least one issue. The patterns in issues are similar to the ones seen in PRs.
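The formalization step mentioned above (average repo stats for high and low graph complexity crossed with high and low number of PRs) might look something like this. The cutoffs `chain_cutoff=0.9` and `pr_cutoff=3` and all field names are placeholders I picked for illustration; we have not agreed on actual thresholds.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-repo fields to average within each group.
STATS = ("n_authors", "n_commits", "n_branches")


def quadrant(repo, chain_cutoff, pr_cutoff):
    """Label a repo by graph complexity and by number of PRs."""
    # A high single-chain fraction is treated as low graph complexity.
    if repo["single_chain"] >= chain_cutoff:
        complexity = "low complexity"
    else:
        complexity = "high complexity"
    prs = "high PRs" if repo["n_prs"] >= pr_cutoff else "low PRs"
    return (complexity, prs)


def quadrant_averages(repos, chain_cutoff=0.9, pr_cutoff=3):
    """Average each stat in STATS within every non-empty quadrant."""
    buckets = defaultdict(list)
    for repo in repos:
        buckets[quadrant(repo, chain_cutoff, pr_cutoff)].append(repo)
    return {
        key: {stat: mean([r[stat] for r in rows]) for stat in STATS}
        for key, rows in buckets.items()
    }
```

The same function could be reused for issues or code reviews by swapping `n_prs` for the corresponding count.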
