SubDisc is a Data Mining tool for discovering local patterns in data. SubDisc features a generic Subgroup Discovery algorithm that can be configured in many ways in order to implement various forms of local pattern discovery. The tool can deal with a range of data types, for both the input attributes and the target attributes, including nominal, numeric and binary.
A unique feature of SubDisc is its ability to deal with a range of Subgroup Discovery settings, determined by the type and number of target attributes. Where regular SD algorithms only consider a single target attribute, nominal or sometimes numeric, SubDisc is able to deal with targets consisting of multiple attributes, in a setting called Exceptional Model Mining.
SubDisc was previously developed under the name Cortana.
- Generic parameterized Subgroup Discovery algorithm.
- Multiple data types supported.
- Implemented in Java, so works on all major platforms, including Windows, Linux and Mac OS.
- Works on propositional (tabular) data from flat files, .TXT or .ARFF.
- Includes Exceptional Model Mining settings.
- Statistical validation of mining results.
- Graphical presentation of results, such as ROC curves, scatter plots, and exceptional models.
- Additional bioinformatics module for literature-based gene set enrichment (see bioinformatics below).
- Free binary version and open-source access.
- Wrappers available for R (https://github.com/SubDisc/rSubDisc) and Python (coming soon).
The code is compatible with Java 15.
- Either download the .jar file of the latest release (https://github.com/SubDisc/SubDisc/releases/) or build it yourself (below).
- Double-click the .jar file or run it from the command line, e.g. `java -jar target/subdisc-gui.jar`.
The interface should appear, and you are ready to open a data file and discover subgroups!
- Clone the repository:
git clone https://github.com/SubDisc/SubDisc.git
- Use maven to assemble the .jar file:
mvn package
- The .jar file is created in `./target` and is named something like `subdisc-gui-2.1094.jar`.
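Because the built .jar carries a version number in its name, the exact file name changes between builds. The following is a minimal sketch of one way to pick up whichever jar is in the build directory and launch it. The helper names `latest_subdisc_jar` and `run_subdisc` are hypothetical and not part of SubDisc; the sketch assumes a bash-compatible shell and that the jar name starts with `subdisc-gui`, as in the example name above.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not shipped with SubDisc.

# Print the most recently modified subdisc-gui*.jar in a directory
# (default: ./target). Prints nothing if no such file exists.
latest_subdisc_jar() {
    local dir="${1:-target}"
    local newest="" f
    for f in "$dir"/subdisc-gui*.jar; do
        # An unmatched glob stays literal, so skip entries that do not exist.
        [ -e "$f" ] || continue
        if [ -z "$newest" ] || [ "$f" -nt "$newest" ]; then
            newest="$f"
        fi
    done
    if [ -n "$newest" ]; then
        printf '%s\n' "$newest"
    fi
}

# Launch the GUI from the newest jar found, or report that a build is needed.
run_subdisc() {
    local jar
    jar="$(latest_subdisc_jar "${1:-target}")"
    if [ -z "$jar" ]; then
        echo "no subdisc-gui*.jar found; run 'mvn package' first" >&2
        return 1
    fi
    java -jar "$jar"
}
```

After `mvn package` has finished, calling `run_subdisc` from the repository root should start the interface; pass a different directory as the first argument if the jar lives elsewhere.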
Technical details concerning the algorithms behind SubDisc (formerly Cortana) can be found in various scientific publications:
- Subgroup Discovery in Ranked Data, with an Application to Gene Set Enrichment
- Exceptional Model Mining
- Exploiting False Discoveries - Statistical Validation of Patterns and Quality Measures in Subgroup Discovery
- Diverse Subgroup Set Discovery
- Flexible Enrichment with Cortana (Software Demo)
- Efficient Algorithms for Finding Richer Subgroup Descriptions in Numeric and Nominal Data
- Non-Redundant Subgroup Discovery in Large and Complex Data
- Subgroup Discovery meets Bayesian networks: an Exceptional Model Mining approach
- Discovering Local Subgroups, with an Application to Fraud Detection
- A Bayesian Scoring Technique for Mining Predictive and Non-Spurious Rules
The following people have contributed in various ways to the development of SubDisc/Cortana: