May Tasks

snadi edited this page Apr 20, 2018 · 5 revisions

I created a page describing the project and its goals.

Given that I'm away for 3 weeks, I wanted to give you a concrete set of tasks you can start with. The first step is the background tasks, which will get you more familiar with the area. Some concrete implementation tasks then follow. Please keep a wiki page for each week (ask Mehran or Benyamin to show you their wikis to give you an idea). I created a sample wiki format page that you can look at.

Background Tasks

Readings

Here is a list of papers that might be useful to read

Tools to explore

There are several ways you can mine a project's history:

  • Directly from git
    • you can write your own bash scripts to process the git log history (I wouldn't recommend this if there's an existing tool you can use, since it's kind of reinventing the wheel)
    • you can use a library that processes the git log history for you. Example: PyGit
  • Using the GitHub API, which also has libraries for it.
  • Using existing repository mining tools. For example:
  • Existing data sets
    • GHTorrent -- this is a mirror of GitHub that allows you to query things more easily without running into the API request limit. However, I'm not sure if it gets you down to the level of the code changes in a commit.
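To make the "directly from git" route a bit more concrete, here is a minimal sketch in Python that reads a repository's history by calling `git log` and splitting its output into records. The names `Commit`, `parse_log`, and `read_history` are made up for this illustration, and the separator characters are just one convenient choice; a dedicated mining library would likely be more robust.

```python
import subprocess
from typing import List, NamedTuple

# ASCII unit/record separators: unlikely to appear inside commit messages.
FIELD_SEP = "\x1f"
RECORD_SEP = "\x1e"

# git placeholders: full hash, author name, author date (strict ISO 8601),
# subject line, body.
LOG_FORMAT = FIELD_SEP.join(["%H", "%an", "%aI", "%s", "%b"]) + RECORD_SEP


class Commit(NamedTuple):
    sha: str
    author: str
    date: str
    subject: str
    body: str


def parse_log(raw: str) -> List[Commit]:
    """Split raw `git log` output (produced with LOG_FORMAT) into commits."""
    commits = []
    for record in raw.split(RECORD_SEP):
        record = record.strip("\n")
        if not record:
            continue
        fields = record.split(FIELD_SEP)
        if len(fields) != 5:
            continue  # skip anything that does not match the expected layout
        sha, author, date, subject, body = fields
        commits.append(Commit(sha, author, date, subject, body.strip()))
    return commits


def read_history(repo_path: str) -> List[Commit]:
    """Run `git log` in a local clone and return its commits, newest first."""
    result = subprocess.run(
        ["git", "-C", repo_path, "log", "--pretty=format:" + LOG_FORMAT],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_log(result.stdout)
```

Calling `read_history("path/to/clone")` on a local clone should return one `Commit` per entry in the log. Parsing is kept separate from the `git` call so it can be tried on a saved log without a repository at hand.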

If BOA can give you what you need, then I would recommend using it, since it allows you to do large-scale mining. The disadvantage is that the data set is old (2015), but the authors told me they are updating it soon, and even finding bugs in older projects is not a problem at this point -- we can simply run things again when the new data set comes out. Sven Amann, who is the main student behind MuBench, has his BOA script online. He was searching for specific APIs, though, so you might need to edit it a bit.

Implementation Tasks

  1. Select 10 popular Java repositories (you can start with 1 but 10 just gives you a higher probability of finding something). You can use this paper as a guide to choose "good" repo candidates.

  2. Using one of the above mining techniques, mine the history of these projects to identify commit messages that have the word "Fix" and any of the keywords that describe non-functional requirements.

  3. Look at the diff of the collected commits and see if you can understand the source of the problem and what the fix was. Then, document the example in the format specified in the project overview and similar to what MuBench did (note that for now, descriptions may be regular English sentences; we can come up with concrete categories later).

  4. Other things you can try out:

    • When identifying commits, is it better to filter out commits that don't change source files (e.g., .java files)? The reasoning is that some commits may only change documentation or similar files. I would rather start without the filter and then, based on the retrieved data, see if there's a recurring trend of "false positive"/useless commits that can be filtered out; at that point you can try the heuristic. The reason I say that is that it may be interesting to see fixes to non-functional requirements that rely on changing a library dependency in pom.xml files or certain configuration parameters. If the data doesn't seem promising on the first go, maybe also think of other heuristics that can be used to improve the search.
    • For projects that use the GitHub tracking system, maybe it's worth going the other way around: identify issues that are related to non-functional requirements (Fernando already did this before for security and performance), then look at their associated commits, and then look at the diffs of these commits. Note that this can all be done using the GitHub API. Given your interest in NLP and linguistics, can we maybe do something smarter than just looking for keywords?
    • I've never really tried doing any kind of repository mining on Python repos, so I would be curious about what we find there. You could try repeating the analysis on a bunch of Python repos. I imagine that, with many of them being related to machine learning and data science, things like performance changes come up a lot.
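As a starting point for task 2 and the source-file heuristic discussed above, the selection step could be sketched roughly as follows in Python. The keyword list is only a placeholder assumption (the real list of non-functional requirement keywords still has to be decided), and the function names are hypothetical.

```python
import re
from typing import Iterable, List, Sequence

# Placeholder keywords; NOT an agreed-upon list, replace with the real one.
NFR_KEYWORDS = ("performance", "security", "memory", "latency", "scalability")

# Matches "fix", "fixes", "fixed", "fixing" as whole words, in any case,
# so that words such as "prefix" do not count.
FIX_PATTERN = re.compile(r"\bfix(es|ed|ing)?\b", re.IGNORECASE)


def matched_keywords(message: str,
                     keywords: Sequence[str] = NFR_KEYWORDS) -> List[str]:
    """Return the keywords that start a word somewhere in the message."""
    lowered = message.lower()
    return [kw for kw in keywords
            if re.search(r"\b" + re.escape(kw.lower()), lowered)]


def is_candidate(message: str,
                 keywords: Sequence[str] = NFR_KEYWORDS) -> bool:
    """True if the message mentions a fix AND at least one keyword."""
    return bool(FIX_PATTERN.search(message)) and bool(
        matched_keywords(message, keywords))


def touches_source(paths: Iterable[str],
                   extensions: Sequence[str] = (".java",)) -> bool:
    """Optional heuristic: does the commit change any source file?"""
    suffixes = tuple(extensions)
    return any(path.endswith(suffixes) for path in paths)
```

For example, `is_candidate("Fix performance regression in parser")` is true, while `is_candidate("Fixed typo in README")` is not. `touches_source` is kept separate so that the first pass can run without it, and so that passing `extensions=(".py",)` covers the Python variant.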