Merge pull request #5 from spgroup/master
Getting master
Rafael Mota Alves authored Dec 23, 2019
2 parents cd7c461 + ff7299c commit ed89d50
Showing 15 changed files with 258 additions and 205 deletions.
108 changes: 68 additions & 40 deletions README.md
@@ -1,65 +1,93 @@
# Mining Framework
This is a framework for mining and analyzing git projects.

We focus on analyzing merge commits, although this could be easily changed to analyze any kind of commit.

We basically have variability points (hot spots) for
* preprocessing the set of projects to be analyzed (like forking projects and enabling Travis CI services for such forks)
* filtering the set of merge commits in such projects (like for focusing only on merge commits with parents that involve changes to the same method)
* collecting experimental data from each merge commit (like revisions of the files declaring the method that was changed in both parents, commit hashes, line numbers of the changes in each parent, overall statistics about the merge commit, result of replaying the merge operation with different tools, etc.)
* postprocessing the collected experimental data (like aggregating and summarizing data, or any kind of operation that is more expensive to perform on a per-merge-commit basis, such as downloading Travis CI ".jar" files for each merge revision, merging spreadsheets created by different data collectors, etc.), after all projects have been analyzed

We also have a number of implementations for such variability points, so that one can reuse or adapt them as needed for instantiating the framework.
The examples illustrated above correspond to some of the implementations we provide here.

## Getting Started
* Fork and clone the project. If you want to run the project tests, you must clone the repository with the recursive option:
``` git clone --recursive https://github.com/spgroup/miningframework ```

* This project uses [Apache Groovy](http://groovy-lang.org/). You have to install version 2.5.x or newer to use the framework and start mining projects.

* For one of the implementations of the postprocessing variability point ([OutputProcessorImpl](https://github.com/spgroup/miningframework/tree/master/src/services/OutputProcessorImpl.groovy)), you also have to install [Python](https://www.python.org/) version 3.7.x or newer. This is needed for a script that fetches build files from Travis CI, and for another script that converts collected data to the format used by the SOOT static analyses invoked by this instantiation. If you don't wish to use this specific implementation of the postprocessing variability point, there is no need to install Python.

## Instantiating or extending the framework

You need to implement the following interfaces (see [interfaces/](https://github.com/spgroup/miningframework/tree/master/src/main/interfaces)) or choose their existing implementations (see [services/](https://github.com/spgroup/miningframework/tree/master/src/services/)):
* ProjectProcessor
* CommitFilter
* DataCollector
* OutputProcessor

They correspond to the four variability points described at the beginning of the page.
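
For illustration, a minimal sketch of a custom `CommitFilter` could look like the following, assuming the interface exposes a single predicate method (the method and type names here are illustrative assumptions, not the framework's actual signatures — check [interfaces/](https://github.com/spgroup/miningframework/tree/master/src/main/interfaces) for the real ones):

```groovy
// Hypothetical filter that keeps only merge commits.
// The method signature shown here is an assumption for illustration.
class MergeCommitsOnlyFilter implements CommitFilter {
    boolean applyFilter(Project project, Commit commit) {
        // a merge commit is a commit with two or more parents
        return commit.getParents().size() > 1
    }
}
```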

The framework uses [Google Guice](https://github.com/google/guice) to implement dependency injection and to inject the interface implementations.
So, to select the interface implementations you want to use in your desired instantiation of the framework, you also need to write a class such as [MiningModule](https://github.com/spgroup/miningframework/blob/master/src/services/MiningModule.groovy), which acts as the dependency injector. This one, in particular, is used as the default injector if no other is specified when invoking the framework.
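
To wire your own implementations, a minimal injector module in the spirit of `MiningModule` could look like this sketch (the `My*` classes are hypothetical placeholders for your own implementations):

```groovy
import com.google.inject.AbstractModule

class MyMiningModule extends AbstractModule {
    @Override
    protected void configure() {
        // bind each variability point interface to the chosen implementation
        bind(ProjectProcessor).to(MyProjectProcessor)
        bind(CommitFilter).to(MergeCommitsOnlyFilter)
        bind(DataCollector).to(MyDataCollector)
        bind(OutputProcessor).to(MyOutputProcessor)
    }
}
```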

## Running a specific framework instantiation

You can run the framework by including the [src](https://github.com/spgroup/miningframework/blob/master/src) directory in the classpath and executing `src/main/app/MiningFramework.groovy`.

This can be done by configuring an IDE or executing the following command in a terminal:
* Windows/Linux/Mac: `groovy -cp src src/main/app/MiningFramework.groovy [options] [input] [output]`

`[input]` is the path to a CSV file containing the list of projects to be analyzed (like [projects.csv](https://github.com/spgroup/miningframework/blob/master/projects.csv)), one project per line. The list can contain external projects to be downloaded by the framework (the path field should be a URL to a git project hosted in the cloud), or local projects (the path field should refer to a local directory).
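
For illustration, an input file could look like the sketch below (the local entry is a hypothetical example):

```
path,name
https://github.com/jhy/jsoup,jsoup
/home/user/projects/my-local-project,my-local-project
```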

`[output]` is the path to a directory that the framework should create containing the results (collected experimental data, statistics, etc.) of the mining process.
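
Judging by the paths used in the `fetch_jars.py` script shown later on this page, the output directory ends up with a layout along these lines (illustrative sketch, not a complete listing):

```
[output]/
├── data/
│   ├── results.csv
│   └── results-with-builds.csv
└── files/
    └── <project name>/
        └── <merge commit sha>/
            └── build/
```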

`[options]` is a combination of our command line configuration options. It's useful to pass `--help` in the `[options]` field to see the supported options and associated information.

> The options are available to all variability point implementations, but some of the implementations might not make use of all options. So check the documentation of the variability point implementations you need, to confirm that they really make use of the options of interest.

> If you intend to use the framework multithreading option, be aware of the need to synchronize access to output files or to any state manipulated by the implementations of the framework variability points.
> For example, to run the study we use to illustrate the variability points at the beginning of the page, we invoke the following command at the project top folder:
* Windows/Linux/Mac: `groovy -cp src src/main/app/MiningFramework.groovy --access-key github-personal-access-token --threads 2 ./projects.csv SOOTAnalysisOutput`

> For the variability point implementations used here, the provided GitHub [personal access token](https://github.com/settings/tokens) (opt for the repo scope) should be associated with a GitHub account that is also registered in [Travis](https://travis-ci.org/). Forks will be created for each project, and the builds will be generated via Travis and deployed to the forks as GitHub releases.
> The CLI has the following help page:
```
usage: miningframework [options] [input] [output]
the Mining Framework takes an input csv file and a name for the output dir
(default: output)
Options:
 -a,--access-key <access key>    Specify the access key of the git account
                                 for when the analysis needs user access to
                                 GitHub
 -h,--help                       Show help for executing commands
 -i,--injector <class>           Specify the class of the dependency
                                 injector (must provide the full name;
                                 default: src.services.MiningModule)
 -k,--keep-projects              Keep projects on disk after analysis
 -p,--push <link>                Specify a git repository to upload the
                                 output in the end of the analysis (format
                                 https://github.com/<owner>/<name>)
 -s,--since <date>               Use commits more recent than a specific
                                 date (format DD/MM/YYYY)
 -t,--threads <threads>          Number of cores used in analysis (default:
                                 1)
 -u,--until <date>               Use commits older than a specific
                                 date (format DD/MM/YYYY)
```
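
For instance, combining the options listed above, one could analyze only the commits from 2019 and keep the cloned projects on disk afterwards:
* Windows/Linux/Mac: `groovy -cp src src/main/app/MiningFramework.groovy --since 01/01/2019 --until 31/12/2019 --keep-projects ./projects.csv output`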


## Testing
One can run the framework tests by including `src` in the classpath and executing `src/test/TestSuite.groovy`.

This can be done by configuring an IDE or executing the following command in a terminal:
* Windows/Linux/Mac: `groovy -cp src src/test/TestSuite.groovy`

* To create new tests, you have to create a git repository with a merge scenario simulating a specific situation you want to test, add it to the `test_repositories` directory, add a corresponding entry to `src/test/input.csv`, and then create the Test class.
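
A minimal sketch of such a Test class, assuming JUnit (the repository name, paths, and assertion here are hypothetical; check the existing tests under `src/test` for the actual conventions):

```groovy
import org.junit.Test
import static org.junit.Assert.assertTrue

class MyMergeScenarioTest {
    @Test
    void shouldCollectDataForTheSimulatedMergeScenario() {
        // assumes the suite has run the framework on the hypothetical
        // test_repositories/my-scenario entry listed in src/test/input.csv
        File results = new File("output/data/results.csv")
        assertTrue(results.exists())
    }
}
```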
Binary file modified dependencies/soot-analysis.jar
6 changes: 4 additions & 2 deletions projects.csv
@@ -1,2 +1,4 @@
path,name
https://github.com/jhy/jsoup,jsoup
https://github.com/guilhermejccavalcanti/jFSTMerge
https://github.com/rbonifacio/soot-analysis
95 changes: 49 additions & 46 deletions scripts/fetch_jars.py
@@ -1,3 +1,6 @@
# This script receives as input the path to a framework input file, the path to a
# directory generated by the miningframework, and a github access token; it downloads
# the release files from github and moves them to the directory passed as input.
# Usage: python scripts/fetch_jars.py <input file> <output path> <github token>


import sys
import requests
import json
@@ -17,65 +20,65 @@
MESSAGE_PREFIX = "Trigger build #"
RELEASE_PREFIX = "fetchjar-"

input_path = sys.argv[1]  # input path passed as cli argument
output_path = sys.argv[2]  # output path passed as cli argument
token = sys.argv[3]  # token passed as cli argument

def fetchJars(input_path, output_path, token):
    # reads the csv input file, with the projects' names and paths;
    # for each project it downloads the builds generated via github releases
    # and moves the builds to the output generated by the framework

    print("Starting build collection")

    token_user = get_github_user(token)[LOGIN]

    parsed_input = read_input(input_path)
    parsed_output = read_output(output_path)
    new_results_file = []

    for project in parsed_input:

        splited_project_path = project[PATH].split('/')
        project_name = splited_project_path[len(splited_project_path) - 1]
        github_project = token_user + '/' + project_name
        print(project_name)

        get_builds_and_wait(github_project)

        releases = get_github_releases(token, github_project)

        # download the releases for the project, moving them to the output directories
        for release in releases:
            # check if the release was generated by the framework
            if (release[NAME].startswith(RELEASE_PREFIX)):
                commit_sha = release[NAME].replace(RELEASE_PREFIX, '')
                print("Downloading " + commit_sha)
                try:
                    download_path = mount_download_path(output_path, project, commit_sha)
                    download_url = release[ASSETS][0][DOWNLOAD_URL]
                    download_file(download_url, download_path)
                    if (commit_sha in parsed_output):
                        new_results_file.append(parsed_output[commit_sha])
                    untar_and_remove_file(download_path)
                    print(download_path + ' is ready')
                except:
                    # ignore releases with missing assets or failed downloads
                    pass

        remove_commit_files_without_builds(output_path, project_name)

    with open(output_path + "/data/results-with-builds.csv", 'w') as output_file:
        output_file.write("project;merge commit;className;method;left modifications;left deletions;right modifications;right deletions\n")
        output_file.write("\n".join(new_results_file))

def read_output(output_path):
    fo = open(output_path + "/data/results.csv")
    file = fo.read()
    fo.close()

    file_out_lines = file.split("\n")
    return parse_output(file_out_lines)

def parse_output(lines):
    result = {}
@@ -85,13 +88,13 @@ def parse_output(lines):
        result[cells[1]] = line
    return result

def read_input(input_path):
    f = open(input_path, "r")
    file = f.read()
    f.close()

    brute_lines = file.split("\n")
    return parse_input(brute_lines)

def parse_input(lines):
    # parse framework input csv file
@@ -112,15 +115,15 @@ def download_file(url, target_path):
    with open(target_path, 'wb') as f:
        f.write(response.raw.read())

def mount_download_path(output_path, project, commit_sha):
    # mount the path that the downloaded build will be moved to
    return output_path + '/files/' + project[NAME] + '/' + commit_sha + '/result.tar.gz'

def untar_and_remove_file(download_path):
    download_dir = download_path.replace('result.tar.gz', '')
    subprocess.call(['mkdir', download_dir + 'build'])
    subprocess.call(['tar', '-xf', download_path, '-C', download_dir + '/build'])
    subprocess.call(['rm', download_path])

def get_builds_and_wait(project):
    has_pendent = True
@@ -156,8 +159,8 @@ def get_headers(token):
    }

def remove_commit_files_without_builds(output_path, project_name):
    files_path = output_path + "/files/" + project_name + "/"

    if (os.path.exists(files_path)):
        commit_dirs = os.listdir(files_path)
@@ -172,4 +175,4 @@ def remove_commit_files_without_builds(output_path, project_name):
        if (len(os.listdir(files_path)) == 0):
            shutil.rmtree(files_path)

fetchJars(input_path, output_path, token)
