Merge pull request #9 from UBC-MDS/presentation_structure
Proposal Presentation
rzitomer authored Apr 26, 2019
2 parents 237cf4b + e9beec8 commit fa152de
Showing 22 changed files with 894 additions and 0 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -2,3 +2,4 @@
13 changes: 13 additions & 0 deletions RStudio-GitHub-Analysis.Rproj
@@ -0,0 +1,13 @@
Version: 1.0

RestoreWorkspace: Default
SaveWorkspace: Default
AlwaysSaveHistory: Default

EnableCodeIndexing: Yes
UseSpacesForTab: Yes
NumSpacesForTab: 4
Encoding: UTF-8

RnwWeave: Sweave
LaTeX: pdfLaTeX
File renamed without changes
Binary file added docs/imgs/branch_test1.png
Binary file added docs/imgs/degree_distribution.png
Binary file added docs/imgs/dna_encoding.png
Binary file added docs/imgs/dna_image.png
Binary file added docs/imgs/dna_matrix.png
Binary file added docs/imgs/g2v_clusters.png
Binary file added docs/imgs/g2v_flow.png
Binary file added docs/imgs/g2v_flow2.png
Binary file added docs/imgs/g2v_paper.png
Binary file added docs/imgs/g2v_repo.png
Binary file added docs/imgs/graph_1.png
Binary file added docs/imgs/graph_2.png
Binary file added docs/imgs/sub2vec.png
Binary file added docs/imgs/wl_kernel.png
10 changes: 10 additions & 0 deletions docs/libs/remark-css/default-fonts.css
@@ -0,0 +1,10 @@
@import url(;
@import url(,700,400italic);
@import url(,700);

body { font-family: 'Droid Serif', 'Palatino Linotype', 'Book Antiqua', Palatino, 'Microsoft YaHei', 'Songti SC', serif; }
h1, h2, h3 {
  font-family: 'Yanone Kaffeesatz';
  font-weight: normal;
}
.remark-code, .remark-inline-code { font-family: 'Source Code Pro', 'Lucida Console', Monaco, monospace; }
72 changes: 72 additions & 0 deletions docs/libs/remark-css/default.css
@@ -0,0 +1,72 @@
a, a > code {
  color: rgb(249, 38, 114);
  text-decoration: none;
}
.footnote {
  position: absolute;
  bottom: 3em;
  padding-right: 4em;
  font-size: 90%;
}
.remark-code-line-highlighted { background-color: #ffff88; }

.inverse {
  background-color: #272822;
  color: #d6d6d6;
  text-shadow: 0 0 20px #333;
}
.inverse h1, .inverse h2, .inverse h3 {
  color: #f3f3f3;
}

/* Two-column layout */
.left-column {
  color: #777;
  width: 20%;
  height: 92%;
  float: left;
}
.left-column h2:last-of-type, .left-column h3:last-child {
  color: #000;
}
.right-column {
  width: 75%;
  float: right;
  padding-top: 1em;
}
.pull-left {
  float: left;
  width: 47%;
}
.pull-right {
  float: right;
  width: 47%;
}
.pull-right ~ * {
  clear: both;
}
img, video, iframe {
  max-width: 100%;
}
blockquote {
  border-left: solid 5px lightgray;
  padding-left: 1em;
}
.remark-slide table {
  margin: auto;
  border-top: 1px solid #666;
  border-bottom: 1px solid #666;
}
.remark-slide table thead th { border-bottom: 1px solid #ddd; }
th, td { padding: 5px; }
.remark-slide thead, .remark-slide tfoot, .remark-slide tr:nth-child(even) { background: #eee }

@page { margin: 0; }
@media print {
  .remark-slide-scaler {
    width: 100% !important;
    height: 100% !important;
    transform: scale(1) !important;
    top: 0 !important;
    left: 0 !important;
  }
}
288 changes: 288 additions & 0 deletions docs/proposal_presentation.Rmd
@@ -0,0 +1,288 @@
title: "What the Git is going on here? <br>"
subtitle: "<br>RStudio Capstone Project Proposal"
#author: "Juno Chen, Ian Flores Siaca, Rayce Rossum, Richie Zitomer"
date: "2019/04/24"
lib_dir: libs
css: xaringan-themer.css
highlightStyle: github
highlightLines: true
countIncrementalSlides: false

class: inverse, center, middle

# Introduction

```{r setup, include=FALSE}
options(htmltools.dir.version = FALSE)
library(xaringanthemer)  # provides duo()
duo(primary_color = "#D8CEC5", secondary_color = "#49475B")
```

# Introduction

- Git is a version control system that tracks changes to files
- People use Git to collaborate in fields from software engineering (SE) to data science (DS)
- However, when using Git we might run into problems


<img src='' height='350'>

# Introduction

<img src='' height='350'>

# Introduction

- RStudio is interested in developing a new tool for Git users
- To do this, we want to understand how people use Git
  - What works for workflows
  - What is hindering workflows
  - **What are those workflows?**

- We only have data to answer one of these questions
  - Access to commit history

# Introduction - Getting the data

- GitHub API
  - Sampling & Rate Limiting
- GHTorrent (GitHub Torrent)
  - Mines the GitHub API for all the latest pushes
  - Tracks all of the repos and makes them available in a MySQL database
  - This amounts to 4TB of data overall



# Introduction - Getting the data

- Multiple tables containing information about projects, commits, users, issues, etc.
- Pipeline process:
  - Sample 1 million projects in the DB
  - Get the commits for all the sampled projects
  - Get the parents of those commits
  - Save to Buckets for export and storage
- Reproducibility in scope
  - SQL Versioning
  - Data Versioning
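
The pipeline steps above can be sketched end to end on a toy database. This is a minimal illustration, not the project's actual pipeline: it uses an in-memory SQLite database as a stand-in for the MySQL dump, and the table and column names (`projects`, `commits`, `commit_parents`) are a simplified, hypothetical schema. The fixed seed is one simple way to make the sampling step reproducible.

```python
import random
import sqlite3

# Hypothetical, simplified schema standing in for the MySQL database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE projects (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE commits (id INTEGER PRIMARY KEY, project_id INTEGER);
    CREATE TABLE commit_parents (commit_id INTEGER, parent_id INTEGER);
""")
conn.executemany("INSERT INTO projects VALUES (?, ?)",
                 [(1, "alpha"), (2, "beta"), (3, "gamma")])
conn.executemany("INSERT INTO commits VALUES (?, ?)",
                 [(10, 1), (11, 1), (12, 1), (20, 2), (21, 2), (30, 3)])
conn.executemany("INSERT INTO commit_parents VALUES (?, ?)",
                 [(11, 10), (12, 11), (21, 20)])


def sample_project_ids(conn, k, seed):
    """Sample k project ids; the same seed always yields the same sample."""
    ids = [row[0] for row in conn.execute("SELECT id FROM projects ORDER BY id")]
    return sorted(random.Random(seed).sample(ids, k))


def commit_parent_edges(conn, project_ids):
    """Return (project_id, commit_id, parent_id) rows for the given projects."""
    placeholders = ",".join("?" for _ in project_ids)
    query = f"""
        SELECT c.project_id, cp.commit_id, cp.parent_id
        FROM commits AS c
        JOIN commit_parents AS cp ON cp.commit_id = c.id
        WHERE c.project_id IN ({placeholders})
        ORDER BY c.project_id, cp.commit_id, cp.parent_id
    """
    return conn.execute(query, list(project_ids)).fetchall()


sampled = sample_project_ids(conn, k=2, seed=42)
edges = commit_parent_edges(conn, sampled)
```

Each returned row is one commit-to-parent link, which is the raw material for the graph representation discussed next.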


# Introduction - Data Structure

- How do we represent a history of commits?


#### Graphs
- A Git history is not just any type of graph; it is a Directed Acyclic Graph (DAG)
- Nodes/Vertices --> Commits
- Edges --> Parent-child links between commits

<img src='' height='300'>
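
As a rough sketch of this representation, the snippet below builds a graph from (commit, parent) pairs and checks that it is acyclic using Kahn's topological sort. The commit names and the function names are made up for illustration. Edges here point from parent to child (the direction history flows); Git itself stores the reverse pointer, so this is a modelling choice rather than a requirement.

```python
from collections import defaultdict, deque


def build_commit_graph(parent_pairs):
    """Map each parent commit to the commits that build on it."""
    children = defaultdict(list)
    nodes = set()
    for commit, parent in parent_pairs:
        children[parent].append(commit)
        nodes.update((commit, parent))
    return nodes, children


def topological_order(nodes, children):
    """Kahn's algorithm; returns None if the graph contains a cycle."""
    indegree = {node: 0 for node in nodes}
    for kids in children.values():
        for kid in kids:
            indegree[kid] += 1
    queue = deque(sorted(node for node in nodes if indegree[node] == 0))
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for kid in children.get(node, []):
            indegree[kid] -= 1
            if indegree[kid] == 0:
                queue.append(kid)
    return order if len(order) == len(nodes) else None


# A branches into C and D after B; E is a merge commit with two parents.
pairs = [("B", "A"), ("C", "B"), ("D", "B"), ("E", "C"), ("E", "D")]
nodes, children = build_commit_graph(pairs)
order = topological_order(nodes, children)
```

A merge commit shows up as a node with two incoming edges, and a branch point as a node with two outgoing edges.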


# Introduction - EDA: Simple Repo


# Introduction - EDA: Complex Repo


# Introduction - Questions

- To inform the design of a new tool that fixes issues with Git, and given the data we have available, we try to answer two questions:


### What are common sub-patterns in the way people use Git?


### What are workflow patterns across Git repositories?


class: inverse, center, middle
# Analysis
## What are common sub-patterns in the way people use Git?


## Inspiration - genetic data

- compared with the Git workflow representation

  - similarity: both are sequences, i.e. directed

  - difference: genetic data has fixed length and fixed variation (so one-hot encoding can be applied)
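
To make the one-hot encoding remark concrete, here is a small sketch for a DNA sequence. The four-letter alphabet `ACGT` and the example sequence are assumptions chosen for illustration. The encoding works because the set of possible symbols is fixed and known in advance, which is not true of arbitrary commit graphs.

```python
ALPHABET = "ACGT"  # assumed fixed set of symbols


def one_hot(sequence, alphabet=ALPHABET):
    """Encode each symbol as a row with a single 1 in its alphabet position."""
    index = {symbol: i for i, symbol in enumerate(alphabet)}
    rows = []
    for symbol in sequence:
        row = [0] * len(alphabet)
        row[index[symbol]] = 1
        rows.append(row)
    return rows


encoded = one_hot("GATTACA")
```

The result is a matrix with one row per position and one column per symbol, which is the kind of fixed-size input that image-style models expect.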




## Inspiration - genetic data

- current trend in genetic data analysis

  - DeepVariant

  - converts DNA sequences to images and feeds them through a convolutional neural network




## Inspiration - social network analysis (SNA)

- compared with the Git workflow representation

  - similarity: directed

  - difference: the goal of SNA is to predict whether links exist

- what we can learn from it

  - the first step of SNA: learning structural features of a connected graph

  - using sequence-generating algorithms such as node2vec
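
The sequence-generating idea can be sketched as a second-order biased random walk in the style of node2vec. This is a simplified illustration on an unweighted, undirected toy graph, not the reference implementation: the function name, the toy graph, and the parameter values are assumptions. The return parameter `p` and in-out parameter `q` follow the usual node2vec convention: stepping back to the previous node has weight `1/p`, staying near it has weight `1`, and moving further away has weight `1/q`.

```python
import random


def biased_walk(adjacency, start, length, p=1.0, q=1.0, seed=0):
    """Generate one node sequence using node2vec-style transition weights."""
    rng = random.Random(seed)
    walk = [start]
    while len(walk) < length:
        current = walk[-1]
        neighbors = sorted(adjacency[current])
        if not neighbors:
            break  # dead end: stop the walk early
        if len(walk) == 1:
            # First step has no previous node, so choose uniformly.
            walk.append(rng.choice(neighbors))
            continue
        previous = walk[-2]
        weights = []
        for candidate in neighbors:
            if candidate == previous:
                weights.append(1.0 / p)  # return to where we came from
            elif candidate in adjacency[previous]:
                weights.append(1.0)      # stay close to the previous node
            else:
                weights.append(1.0 / q)  # explore further away
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk


graph = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b"},
    "d": {"b", "e"},
    "e": {"d"},
}
walk = biased_walk(graph, "a", length=6, p=0.5, q=2.0, seed=1)
```

The resulting node sequences play the role of sentences, which a skip-gram model can then turn into node embeddings.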


## Approach - `Node2vec`

.pull-left[
- Samples the network neighborhood of each node using biased random walks
- Based on `Weisfeiler-Lehman Graph Kernels`
  - iterate nodes and edges, relabel and group, represent the features in a vector
]
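
The relabel-and-group step of the Weisfeiler-Lehman scheme named above can be sketched as follows. This is a minimal illustration under stated assumptions: initial labels are node degrees, labels are kept as readable strings rather than compressed hashes, and the function names and toy graphs are invented for the example.

```python
from collections import Counter


def wl_features(adjacency, labels, iterations=2):
    """Collect a bag of labels over rounds of Weisfeiler-Lehman relabeling."""
    current = dict(labels)
    features = Counter(current.values())
    for _ in range(iterations):
        updated = {}
        for node, neighbors in adjacency.items():
            # New label = own label plus the sorted labels of the neighbors.
            neighbor_labels = sorted(current[n] for n in neighbors)
            updated[node] = f"{current[node]}({','.join(neighbor_labels)})"
        current = updated
        features.update(current.values())
    return features


def wl_kernel(features_a, features_b):
    """Similarity of two graphs as the dot product of their label counts."""
    return sum(count * features_b[label] for label, count in features_a.items())


path = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
triangle = {"x": ["y", "z"], "y": ["x", "z"], "z": ["x", "y"]}

path_features = wl_features(path, {n: str(len(nb)) for n, nb in path.items()})
triangle_features = wl_features(
    triangle, {n: str(len(nb)) for n, nb in triangle.items()}
)
similarity = wl_kernel(path_features, triangle_features)
```

Each round folds one more hop of neighborhood structure into every label, so the label counts act as the feature vector for the whole graph.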



## Approach - `sub2vec`

- learns a feature representation of each subgraph such that the likelihood of preserving its properties is maximized in the latent feature space

- preserves two properties

  - `Neighborhood`: neighborhood information of all the nodes, sets of all paths (annotated by node IDs)

  - `Structural`: the subgraph structure (clique, degree, size of subgraph)


- advantage: better accuracy, incorporates the properties of entire subgraphs

- disadvantage: assumes unweighted, undirected graphs, but can be extended




## Approach - Motifs

- What is a Motif?

  - A subgraph that occurs in a network much more frequently than would be expected by random chance

.pull-left[<img src="imgs/graph_1.png" width="250" /> <img src="imgs/graph_2.png" width="250" />]
.pull-right[<img src="imgs/degree_distribution.png" width="500" />]
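
The definition above suggests a simple check: count a candidate subgraph in the observed network and compare it with counts in random networks. The sketch below does this for one pattern, the three-node feed-forward loop. The choice of pattern, the toy edge list, and the null model (random directed graphs with the same node and edge counts) are all assumptions made for illustration; a degree-preserving randomization is a common alternative.

```python
import random
from itertools import permutations


def count_feed_forward(edges):
    """Count node triples (a, b, c) with edges a->b, b->c and a->c."""
    edge_set = set(edges)
    nodes = {node for edge in edge_set for node in edge}
    return sum(
        1
        for a, b, c in permutations(nodes, 3)
        if (a, b) in edge_set and (b, c) in edge_set and (a, c) in edge_set
    )


def random_edges(nodes, n_edges, rng):
    """Draw n_edges distinct directed edges uniformly, without self-loops."""
    candidates = [(u, v) for u in nodes for v in nodes if u != v]
    return rng.sample(candidates, n_edges)


def motif_ratio(edges, n_random=200, seed=0):
    """Return the observed count and the mean count over random graphs."""
    rng = random.Random(seed)
    nodes = sorted({node for edge in edges for node in edge})
    n_edges = len(set(edges))
    observed = count_feed_forward(edges)
    baseline = [
        count_feed_forward(random_edges(nodes, n_edges, rng))
        for _ in range(n_random)
    ]
    return observed, sum(baseline) / n_random


example_edges = [(1, 2), (2, 3), (1, 3), (3, 4), (2, 4)]
observed, mean_random = motif_ratio(example_edges)
```

A pattern is a motif candidate when the observed count is well above the random mean; a full analysis would also report the spread of the random counts, for example as a z-score.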

class: inverse, center, middle
# Analysis
## What are workflow patterns across Git repositories?


## Graph2Vec Background

> "[Node2Vec and Sub2Vec] only model local similarity within a confined neighborhood and fails to learn global structural similarities that help to classify similar graphs together"

> "a neural embedding framework named graph2vec to learn data-driven distributed representations of arbitrary sized graphs."

> "graph2vec's embeddings are learnt in an unsupervised manner and are task agnostic."

## Graph2Vec Background



## Clustering Embeddings from Graph2Vec Model



## Graph2Vec Limitations

> Graph2Vec currently works with undirected graphs, so we will have to make modifications to support directed graphs.

> Graph2Vec only helps us address the question of workflow patterns across repositories (unless we can find a way to extract the learned subgraphs from the neural network).


# Projected Timeline

| Milestone | Date |
|-----------|------|
| Proposal Presentation | 4/26 |
| Proposal Report (to mentor) | 4/30 |
| Proposal Report (to partner) | 5/3 |
| End-to-end analysis | 5/10 |
| Complete workflow patterns across Git repositories | 5/24 |
| Choose best method for subgraph analysis | 5/31 |
| Choose and demonstrate output from subgraph analysis | 6/7 |
| Complete subgraph analysis | 6/14 |
| Final Presentation | 6/17-18 |
| Final Report (to mentor) | 6/21 |
| Final Report (to partner) and Data Product | 6/26 |

class: inverse, middle

# Acknowledgments

- RStudio
  - Greg Wilson

- UBC-MDS Teaching Team
  - Tiffany Timbers

- UBC-MDS Students
