GitHub Labeler

  • ML.NET version: v1.0.0-preview
  • API type: Dynamic API
  • Status: Up-to-date
  • App type: Console app
  • Data sources: .tsv file and GitHub issues
  • Scenario: Issues classification
  • ML task: Multi-class classification
  • Algorithms: SDCA multi-class classifier

This is a simple prototype application that demonstrates how to use the ML.NET APIs. The main focus is on creating, training, and using an ML (machine learning) model, which is implemented in the Predictor.cs class.

Overview

GitHubLabeler is a .NET Core console application that:

  • trains an ML model on your labeled GitHub issues so that it learns which label to assign to a new issue. (As an example, you can use the corefx-issues-train.tsv file, which contains issues from the public corefx repository.)
  • labels new issues. The application gets all unlabeled open issues from the GitHub repository specified in the appsettings.json file and labels them using the ML model trained in the previous step.

The model uses a multi-class classification algorithm (SdcaMultiClassTrainer) from ML.NET.

Enter your GitHub configuration data

  1. Provide your GitHub data in the appsettings.json file:

    To allow the app to label issues in your GitHub repository, you need to provide the following data in the appsettings.json file.

        {
          "GitHubToken": "YOUR-GUID-GITHUB-TOKEN",
          "GitHubRepoOwner": "YOUR-REPO-USER-OWNER-OR-ORGANIZATION",
          "GitHubRepoName": "YOUR-REPO-SINGLE-NAME"
        }

    Your user account (GitHubToken) should have write rights to the repository (GitHubRepoName).

    See GitHub's documentation for how to create a GitHub token.

    GitHubRepoOwner can be a GitHub user ID (e.g. "MyUser") or a GitHub organization (e.g. "dotnet").

  2. Provide training file

    a. You can use the existing corefx_issues.tsv data file to experiment with the program. In this case, the predicted labels will be chosen from the labels of the corefx repository. No changes are required.

    b. To work with labels from your own GitHub repository, you need to train the model on your data. To do so, export the GitHub issues from your repository to a .tsv file with the following columns:

    • ID - issue's ID
    • Area - issue's label (named this way to avoid confusion with the Label concept in ML.NET)
    • Title - issue's title
    • Description - issue's description

    Then add the file to the Data folder and update the dataSetLocation value to match your file's name:

let dataSetLocation = sprintf @"%s/corefx-issues-train.tsv" baseDatasetsLocation
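Before training on an exported file, it can help to confirm that its header row matches the columns listed above. The following is a minimal sketch, not part of the sample: `expectedColumns` and `hasExpectedHeader` are hypothetical names, and the check assumes the file has a header row, is tab-separated, and lists the columns in the order shown above.

```fsharp
open System

// Column names taken from the list above; the order is an assumption.
let expectedColumns = [| "ID"; "Area"; "Title"; "Description" |]

// Hypothetical helper: returns true when a tab-separated header line
// contains exactly the expected column names, in the expected order.
let hasExpectedHeader (headerLine: string) =
    let columns = headerLine.TrimEnd('\r', '\n').Split('\t')
    // F# compares arrays structurally, element by element.
    columns = expectedColumns

// Usage: check the header of an in-memory example line.
let headerOk = hasExpectedHeader "ID\tArea\tTitle\tDescription"
```

To check a real file, the first line could be read with `System.IO.File.ReadLines` and passed to `hasExpectedHeader`.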

Training

Training is the process of running an ML model through known examples (in our case, labeled issues) so that it learns how to label new issues. In this sample, training is done by calling the following function from the console app:

buildAndTrainModel dataSetLocation modelFilePathName MyTrainerStrategy.SdcaMultiClassTrainer

After training completes, the model is saved as a .zip file at MLModels\GitHubLabelerModel.zip.
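The path above is written with a Windows-style backslash. As a rough sketch of how the same location could be built portably and checked after training, the snippet below uses `Path.Combine`. The value of `baseModelsPath` is an assumption (this README does not show how the sample defines it), and `modelFilePathName` is redefined here only for illustration.

```fsharp
open System.IO

// Assumed base folder, taken from the path quoted above.
let baseModelsPath = "MLModels"

// Path.Combine inserts the directory separator of the current platform.
let modelFilePathName = Path.Combine(baseModelsPath, "GitHubLabelerModel.zip")

// True only if a model file has already been saved at that location.
let modelExists = File.Exists modelFilePathName

printfn "Model file: %s (exists: %b)" modelFilePathName modelExists
```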

Labeling

Once the model is trained, it can be used to predict the label of a new issue.

For a single test/demo without connecting to a real GitHub repo, call this method from the console app:

testSingleLabelPrediction modelFilePathName

To label real issues in a GitHub repo, call this function from the console app instead:

predictLabelsAndUpdateGitHub configuration modelFilePathName

For testing convenience, when reading issues from your GitHub repo the app loads only unlabeled issues that were created in the past 10 minutes. You can change that setting, though:

Since = Nullable (DateTimeOffset(DateTime.Now.AddMinutes(-10.)))
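As an illustration of changing that window, the expression above can be wrapped in a small function that takes the look-back period in minutes. This is only a sketch: `sinceMinutesAgo` is a hypothetical helper that does not exist in the sample, and the 60-minute value is an arbitrary example.

```fsharp
open System

// Hypothetical helper: builds a value of the same shape as the Since
// setting shown above, for a look-back window given in minutes.
let sinceMinutesAgo (minutes: float) : Nullable<DateTimeOffset> =
    Nullable (DateTimeOffset(DateTime.Now.AddMinutes(-minutes)))

// Example: consider issues created in the past hour instead of 10 minutes.
let since = sinceMinutesAgo 60.
```

The result could then be assigned in place of the original expression, for example `Since = sinceMinutesAgo 60.`.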

After predicting the label, the program updates the issue on your GitHub repo with the predicted label.