Merge pull request #9 from UBC-MDS/presentation_structure
Proposal Presentation
Showing 22 changed files with 894 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -2,3 +2,4 @@ | |
*DS_Store* | ||
*Rhistory* | ||
*.json | ||
.Rproj.user |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,13 @@ | ||
Version: 1.0 | ||
|
||
RestoreWorkspace: Default | ||
SaveWorkspace: Default | ||
AlwaysSaveHistory: Default | ||
|
||
EnableCodeIndexing: Yes | ||
UseSpacesForTab: Yes | ||
NumSpacesForTab: 4 | ||
Encoding: UTF-8 | ||
|
||
RnwWeave: Sweave | ||
LaTeX: pdfLaTeX |
File renamed without changes
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,10 @@ | ||
@import url(https://fonts.googleapis.com/css?family=Yanone+Kaffeesatz); | ||
@import url(https://fonts.googleapis.com/css?family=Droid+Serif:400,700,400italic); | ||
@import url(https://fonts.googleapis.com/css?family=Source+Code+Pro:400,700); | ||
|
||
body { font-family: 'Droid Serif', 'Palatino Linotype', 'Book Antiqua', Palatino, 'Microsoft YaHei', 'Songti SC', serif; } | ||
h1, h2, h3 { | ||
font-family: 'Yanone Kaffeesatz'; | ||
font-weight: normal; | ||
} | ||
.remark-code, .remark-inline-code { font-family: 'Source Code Pro', 'Lucida Console', Monaco, monospace; } |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,72 @@ | ||
a, a > code { | ||
color: rgb(249, 38, 114); | ||
text-decoration: none; | ||
} | ||
.footnote { | ||
position: absolute; | ||
bottom: 3em; | ||
padding-right: 4em; | ||
font-size: 90%; | ||
} | ||
.remark-code-line-highlighted { background-color: #ffff88; } | ||
|
||
.inverse { | ||
background-color: #272822; | ||
color: #d6d6d6; | ||
text-shadow: 0 0 20px #333; | ||
} | ||
.inverse h1, .inverse h2, .inverse h3 { | ||
color: #f3f3f3; | ||
} | ||
/* Two-column layout */ | ||
.left-column { | ||
color: #777; | ||
width: 20%; | ||
height: 92%; | ||
float: left; | ||
} | ||
.left-column h2:last-of-type, .left-column h3:last-child { | ||
color: #000; | ||
} | ||
.right-column { | ||
width: 75%; | ||
float: right; | ||
padding-top: 1em; | ||
} | ||
.pull-left { | ||
float: left; | ||
width: 47%; | ||
} | ||
.pull-right { | ||
float: right; | ||
width: 47%; | ||
} | ||
.pull-right ~ * { | ||
clear: both; | ||
} | ||
img, video, iframe { | ||
max-width: 100%; | ||
} | ||
blockquote { | ||
border-left: solid 5px lightgray; | ||
padding-left: 1em; | ||
} | ||
.remark-slide table { | ||
margin: auto; | ||
border-top: 1px solid #666; | ||
border-bottom: 1px solid #666; | ||
} | ||
.remark-slide table thead th { border-bottom: 1px solid #ddd; } | ||
th, td { padding: 5px; } | ||
.remark-slide thead, .remark-slide tfoot, .remark-slide tr:nth-child(even) { background: #eee } | ||
|
||
@page { margin: 0; } | ||
@media print { | ||
.remark-slide-scaler { | ||
width: 100% !important; | ||
height: 100% !important; | ||
transform: scale(1) !important; | ||
top: 0 !important; | ||
left: 0 !important; | ||
} | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,288 @@ | ||
--- | ||
title: "What the Git is going on here? <br>" | ||
subtitle: "<br>RStudio Capstone Project Proposal" | ||
#author: "Juno Chen, Ian Flores Siaca, Rayce Rossum, Richie Zitomer" | ||
date: "2019/04/24" | ||
output: | ||
  xaringan::moon_reader: | ||
    lib_dir: libs | ||
    css: xaringan-themer.css | ||
    nature: | ||
      highlightStyle: github | ||
      highlightLines: true | ||
      countIncrementalSlides: false | ||
--- | ||
|
||
class: inverse, center, middle | ||
|
||
# Introduction | ||
|
||
```{r setup, include=FALSE} | ||
options(htmltools.dir.version = FALSE) | ||
library(xaringanthemer) | ||
duo(primary_color = "#D8CEC5", secondary_color = "#49475B") | ||
``` | ||
|
||
--- | ||
# Introduction | ||
|
||
- Git is a version control system that tracks changes to files | ||
- People use Git to collaborate, from software engineering (SE) to data science (DS) | ||
- However, when using Git we might encounter some problems | ||
|
||
-- | ||
|
||
<img src='https://knightlab.northwestern.edu/wp-content/uploads/2014/12/1.png' height='350'> | ||
|
||
--- | ||
# Introduction | ||
|
||
<img src='https://knightlab.northwestern.edu/wp-content/uploads/2014/12/2.png' height='350'> | ||
|
||
--- | ||
# Introduction | ||
|
||
- RStudio is interested in developing a new tool for Git users | ||
- For this, we want to understand how people use Git | ||
  - What works for workflows | ||
  - What is hindering workflows | ||
  - **What are those workflows?** | ||
|
||
- We only have data to answer one of these questions | ||
  - Access to commit history | ||
|
||
--- | ||
# Introduction - Getting the data | ||
|
||
- GitHub API | ||
  - Sampling & Rate Limiting | ||
- GHTorrent | ||
  - Mines the GitHub API for all the latest pushes | ||
  - Tracks all of the repos and makes them available in a MySQL database | ||
  - This amounts to 4 TB of data overall | ||
-- | ||
|
||
![](https://cdn-images-1.medium.com/max/1200/1*A8liBoeAwAZg7rDu394jYg.png) | ||
|
||
--- | ||
|
||
# Introduction - Getting the data | ||
|
||
- Multiple tables containing information about projects, commits, users, issues, etc. | ||
- Pipeline process: | ||
  - Sample 1 million projects in the DB | ||
  - Get the commits for all the projects | ||
  - Get the parents of the commits for all the projects | ||
  - Save to Buckets for export and storage | ||
- Reproducibility in scope | ||
  - SQL Versioning | ||
  - Data Versioning | ||
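
The pipeline steps above can be sketched against a toy in-memory SQLite database. The table and column names (`projects`, `project_commits`, `commit_parents`) are assumptions modeled on the GHTorrent relational schema, the sample size is shrunk from 1 million to 2, and a CSV string stands in for the storage bucket. This is an illustration only, not the project's actual SQL.

```python
import csv
import io
import sqlite3

# Toy stand-in for the MySQL dump (assumed, simplified schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE projects (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE project_commits (project_id INTEGER, commit_id INTEGER);
CREATE TABLE commit_parents (commit_id INTEGER, parent_id INTEGER);
INSERT INTO projects VALUES (1, 'a'), (2, 'b'), (3, 'c');
INSERT INTO project_commits VALUES (1, 10), (1, 11), (2, 20), (3, 30), (3, 31);
INSERT INTO commit_parents VALUES (11, 10), (31, 30);
""")

SAMPLE_SIZE = 2  # the slides sample 1 million projects

# Step 1: sample projects at random.
sampled = [row[0] for row in conn.execute(
    "SELECT id FROM projects ORDER BY RANDOM() LIMIT ?", (SAMPLE_SIZE,))]
placeholders = ",".join("?" for _ in sampled)

# Step 2: get the commits of the sampled projects.
commits = conn.execute(
    f"SELECT project_id, commit_id FROM project_commits "
    f"WHERE project_id IN ({placeholders})", sampled).fetchall()

# Step 3: get the parents of those commits.
parents = conn.execute(
    f"SELECT pc.project_id, cp.commit_id, cp.parent_id "
    f"FROM commit_parents cp "
    f"JOIN project_commits pc ON pc.commit_id = cp.commit_id "
    f"WHERE pc.project_id IN ({placeholders})", sampled).fetchall()

# Step 4: export (a CSV string here; a bucket upload in the real pipeline).
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["project_id", "commit_id", "parent_id"])
writer.writerows(parents)
exported = buffer.getvalue()
```

Keeping the queries in a versioned script like this is one way to meet the "SQL Versioning" goal listed above.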
|
||
--- | ||
|
||
# Introduction - Data Structure | ||
|
||
- How do we represent a history of commits? | ||
|
||
-- | ||
|
||
#### Graphs | ||
- A Git history is not just any graph: it is a Directed Acyclic Graph (DAG) | ||
  - Nodes/Vertices --> Commits | ||
  - Edges --> Links from each commit to its parent commit(s) | ||
|
||
<img src='https://upload.wikimedia.org/wikipedia/commons/c/c6/Topological_Ordering.svg' height='300'> | ||
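
As a minimal sketch of this representation, the snippet below stores a history as `(child, parent)` pairs and uses Kahn's algorithm to produce a topological order, which also verifies that the graph is acyclic. The commit labels and the function name `topological_order` are hypothetical, chosen only for illustration.

```python
from collections import defaultdict, deque

def topological_order(parent_edges):
    """Order commits so that every parent precedes its children.

    parent_edges: iterable of (child, parent) pairs.
    Raises ValueError if the edges contain a cycle.
    """
    children = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for child, parent in parent_edges:
        children[parent].append(child)
        indegree[child] += 1
        nodes.update((child, parent))

    # Start from root commits (those without parents).
    queue = deque(sorted(n for n in nodes if indegree[n] == 0))
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)

    if len(order) != len(nodes):
        raise ValueError("cycle detected: not a valid commit history")
    return order

# A branches into B and C, which are merged in D.
history = [("B", "A"), ("C", "A"), ("D", "B"), ("D", "C")]
order = topological_order(history)
```

Here `D` has two parents, which is how a merge commit appears in the graph.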
|
||
--- | ||
|
||
# Introduction - EDA: Simple Repo | ||
|
||
![](imgs/branch_test1.png) | ||
|
||
--- | ||
# Introduction - EDA: Complex Repo | ||
|
||
|
||
![](imgs/branch_test.png) | ||
|
||
--- | ||
# Introduction - Questions | ||
|
||
- Given the goal of designing a new tool to fix issues with Git, and the data we have available, we try to answer two questions: | ||
|
||
-- | ||
|
||
### What are common sub-patterns in the way people use Git? | ||
|
||
-- | ||
|
||
### What are workflow patterns across Git repositories? | ||
|
||
--- | ||
|
||
class: inverse, center, middle | ||
# Analysis | ||
## What are common sub-patterns in the way people use Git? | ||
|
||
--- | ||
|
||
## Inspiration - genetic data | ||
|
||
- comparing to git workflow representation | ||
|
||
- similarity: sequence, i.e. directed | ||
|
||
- difference: genetic sequences have fixed length and a fixed set of symbols (so one-hot encoding can be applied) | ||
|
||
![](imgs/dna_matrix.png) | ||
|
||
![](imgs/dna_encoding.png) | ||
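
One-hot encoding of a DNA sequence can be sketched as follows. The four-letter alphabet `ACGT` and the function name `one_hot` are assumptions for illustration; the point is that a fixed alphabet yields a fixed-width row per position, which commit graphs do not offer directly.

```python
ALPHABET = "ACGT"  # assumed fixed set of symbols

def one_hot(sequence):
    """Return one row per base, with a single 1 in that base's column."""
    index = {base: i for i, base in enumerate(ALPHABET)}
    matrix = []
    for base in sequence:
        row = [0] * len(ALPHABET)
        row[index[base]] = 1
        matrix.append(row)
    return matrix

encoded = one_hot("GATC")
```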
|
||
--- | ||
|
||
## Inspiration - genetic data | ||
|
||
- current trend of genetic data study | ||
|
||
- DeepVariant | ||
|
||
- converting DNA sequences to images and feeding them through a convolutional neural network | ||
|
||
![](imgs/dna_image.png) | ||
|
||
[Source: https://blog.floydhub.com/exploring-dna-with-deep-learning/] | ||
|
||
--- | ||
|
||
## Inspiration - social network analysis (SNA) | ||
|
||
- comparing to git workflow representation | ||
|
||
- similarity: directed | ||
|
||
- difference: the goal of SNA is to predict whether a link exists | ||
|
||
- can learn from | ||
|
||
- the first step of SNA: learning structural features of a connected graph | ||
|
||
- using sequence generating algorithms: node2vec | ||
|
||
[Source: http://terpconnect.umd.edu/~kpzhang/paper/INFOCOMM2018.pdf] | ||
--- | ||
|
||
## Approach - `Node2vec` | ||
|
||
.pull-left[- Samples the network neighborhood of each node using biased random walks | ||
- Based on `Weisfeiler-Lehman Graph Kernels` | ||
- iterate nodes and edges, relabel and group, represent the features in a vector] | ||
|
||
.pull-right[![](imgs/wl_kernel.png)] | ||
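
The biased random walk at the core of node2vec can be sketched in plain Python. This is a simplified illustration on an undirected toy graph, not the reference implementation: `p` is the return parameter and `q` the in-out parameter, and the graph `adj` and function name `biased_walk` are hypothetical.

```python
import random

def biased_walk(adj, start, length, p=1.0, q=1.0, rng=None):
    """Second-order random walk in the style of node2vec.

    adj maps each node to the set of its neighbors.
    Unnormalized weight of stepping to x, given the previous node t:
      1/p if x == t (return), 1 if x neighbors t, 1/q otherwise.
    """
    rng = rng or random.Random()
    walk = [start]
    while len(walk) < length:
        current = walk[-1]
        neighbors = sorted(adj[current])
        if not neighbors:
            break  # dead end
        if len(walk) == 1:
            # First step has no previous node: choose uniformly.
            walk.append(rng.choice(neighbors))
            continue
        previous = walk[-2]
        weights = []
        for nxt in neighbors:
            if nxt == previous:
                weights.append(1.0 / p)
            elif nxt in adj[previous]:
                weights.append(1.0)
            else:
                weights.append(1.0 / q)
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk

adj = {"A": {"B", "C"}, "B": {"A", "C", "D"}, "C": {"A", "B"}, "D": {"B"}}
walk = biased_walk(adj, "A", 6, p=0.5, q=2.0, rng=random.Random(0))
```

A low `p` keeps the walk close to where it came from, while a low `q` pushes it outward; the resulting walks are then fed to a skip-gram model to learn node vectors.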
|
||
--- | ||
|
||
## Approach - `sub2vec` | ||
|
||
- learn a feature representation of each subgraph that preserves its properties in the latent feature space | ||
|
||
- preserve two properties | ||
|
||
- `Neighborhood`: neighborhood information of all the nodes, sets of all paths (annotated by node IDs) | ||
|
||
- `Structural`: the subgraph structure (clique, degree, size of subgraph) | ||
|
||
-- | ||
|
||
- advantage: better accuracy, incorporate the properties of entire subgraphs | ||
|
||
- disadvantage: assumes unweighted, undirected graphs, but can be extended | ||
|
||
![](imgs/sub2vec.png) | ||
|
||
[Source: https://link.springer.com/chapter/10.1007/978-3-319-93037-4_14] | ||
|
||
--- | ||
|
||
## Approach - Motifs | ||
|
||
- What is a Motif? | ||
|
||
- A subgraph that occurs in a network much more frequently than would be expected by random chance | ||
|
||
.pull-left[<img src="imgs/graph_1.png" width="250" /> <img src="imgs/graph_2.png" width="250" />] | ||
.pull-right[<img src="imgs/degree_distribution.png" width="500" />] | ||
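
A rough sketch of motif detection: count a candidate pattern in the observed graph and compare it with counts in random graphs. The pattern (a feed-forward triad `a -> b -> c` with `a -> c`), the toy edge list, and the null model (uniformly random directed graphs with the same node and edge counts) are all simplifying assumptions; motif studies usually use a null model that also preserves degree sequences.

```python
import random
from itertools import permutations

def count_feed_forward(edges):
    """Count ordered triples (a, b, c) with edges a->b, b->c and a->c."""
    edge_set = set(edges)
    nodes = {n for edge in edge_set for n in edge}
    return sum(
        1
        for a, b, c in permutations(nodes, 3)
        if (a, b) in edge_set and (b, c) in edge_set and (a, c) in edge_set
    )

def random_edges(nodes, n_edges, rng):
    """Sample n_edges distinct directed edges without self-loops."""
    possible = [(u, v) for u in nodes for v in nodes if u != v]
    return rng.sample(possible, n_edges)

observed = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5), (3, 5)]
nodes = sorted({n for edge in observed for n in edge})
rng = random.Random(42)

observed_count = count_feed_forward(observed)
null_counts = [
    count_feed_forward(random_edges(nodes, len(observed), rng))
    for _ in range(200)
]
null_mean = sum(null_counts) / len(null_counts)
```

A pattern is a motif candidate when `observed_count` is well above `null_mean`.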
|
||
|
||
--- | ||
class: inverse, center, middle | ||
# Analysis | ||
## What are workflow patterns across Git repositories? | ||
|
||
--- | ||
|
||
## Graph2Vec Background | ||
|
||
> "[Node2Vec and Sub2Vec] only model local similarity within a confined neighborhood and fails to learn global structural similarities that help to classify similar graphs together" | ||
-- | ||
|
||
> "a neural embedding framework named graph2vec to learn data-driven distributed representations of arbitrary sized graphs." | ||
-- | ||
|
||
> "graph2vec's embeddings are learnt in an unsupervised manner and are task agnostic." | ||
--- | ||
## Graph2Vec Background | ||
|
||
![](imgs/g2v_flow2.png) | ||
|
||
[Source: https://arxiv.org/pdf/1707.05005.pdf] | ||
--- | ||
|
||
## Clustering Embeddings from Graph2Vec Model | ||
|
||
![](imgs/g2v_clusters.png) | ||
|
||
[Source: https://www.datascience.com/blog/k-means-clustering] | ||
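
Clustering the embeddings can be sketched with a small k-means (Lloyd's algorithm) in plain Python. The 2-D `embeddings` below are made-up values standing in for Graph2Vec output, which would have many more dimensions, and taking the first `k` points as initial centroids is a simplification to keep the example deterministic.

```python
def kmeans(points, k, iterations=10):
    """Lloyd's algorithm with the first k points as initial centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(
                range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])),
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids

# Hypothetical repository embeddings: two visibly separated groups.
embeddings = [(0.0, 0.1), (5.0, 5.1), (0.2, 0.0), (5.2, 4.9), (0.1, 0.2)]
labels, centroids = kmeans(embeddings, k=2)
```

Repositories that share a label would then be inspected together as one candidate workflow pattern.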
|
||
--- | ||
## Graph2Vec Limitations | ||
|
||
> Graph2Vec currently works with undirected graphs; we will therefore have to modify it to support directed graphs. | ||
-- | ||
|
||
> Graph2Vec only helps us address the question of workflow patterns across repositories (unless we can find a way to extract the learned subgraphs from the neural network). | ||
|
||
--- | ||
|
||
# Projected Timeline | ||
|
||
| Milestone | Date | | ||
|---|---| | ||
| Proposal Presentation | 4/26 | | ||
| Proposal Report (to mentor) | 4/30 | | ||
| Proposal Report (to partner) | 5/3 | | ||
| End-to-end analysis | 5/10 | | ||
| Complete workflow patterns across Git repositories | 5/24 | | ||
| Choose best method for subgraph analysis | 5/31 | | ||
| Choose and demonstrate output from subgraph analysis | 6/7 | | ||
| Complete subgraph analysis | 6/14 | | ||
| Final Presentation | 6/17-18 | | ||
| Final Report (to mentor) | 6/21 | | ||
| Final Report (to partner) and Data Product | 6/26 | | ||
|
||
--- | ||
class: inverse, middle | ||
|
||
# Acknowledgments | ||
|
||
- RStudio | ||
- Greg Wilson | ||
|
||
- UBC-MDS Teaching Team | ||
- Tiffany Timbers | ||
|
||
- UBC-MDS Students |