This package contains the code for executing 15 external clustering validity indices in Spark. Chi index Python implementation can be found in this repository: Chi Index Python . This package includes the following indices in Scala:
- Chi Index
- Entropy
- Purity
- Mutual-Information
- F-Measure
- Variation Of Information
- Goodman-Kruskal
- Rand Index
- Adjusted Rand Index
- Jaccard
- Fowlkes-Mallows
- Hubert
- Minkowski
- Criterion H
- CSI
- PSI
Chi Index takes a value in [0, 2], where 0 is given by the worst clustering solution, and 2 is the best value that Chi Index can achieve.
The cluster indices can be executed using K-means and Bisecting K-Means from Spark MLlib, and Linkage method.
Please, cite as: Luna-Romera JM, Martínez-Ballesteros M, García-Gutiérrez J, Riquelme JC. External clustering validity index based on chi-squared statistical test. Information Sciences (2019) 487: 1-17. https://doi.org/10.1016/j.ins.2019.02.046. (http://www.sciencedirect.com/science/article/pii/S0020025519301550)
The package is ready to be used. You only have to download it and import it into your workspace. The package includes iris dataset as example and it can be used with the main test class.
MainExternalTest class has been configured for being executed in a laptop. You can execute this class directly but there are some variables that you should be noticed:
- val numIterations = Maximum number of iterations for the clustering algorithm.
- var minClusters = Minimum number of clusters to begin the execution.
- var maxClusters = Maximum number of clusters to finish the execution.
- val origen = The path of the dataset.
- val destino: String = Utils.whatTimeIsIt() This is the name of the output file.
- var idIndex = The index of the instances' ID. If there is no id, set it as "-1".
- val classIndex = The index of the class, 0 is the first element.
- val delimiter = The delimiter between the columns.
val spark = SparkSession
.builder()
.appName("Featuring Clusters")
.master("local[*]")
.getOrCreate()
val irisFile = "iris.data"
val numIterations = 1000
var minClusters = 2
var maxClusters = 10
val origen = irisFile
val destino: String = Utils.whatTimeIsIt()
var idIndex = "-1"
val classIndex = 4
val delimiter = ","
This application can be used with KMeans, GaussianMixture, LDA and BisectingKMeans from SPARK ML. Just comment the methods that are not going to be used:
val clusteringResult = new KMeans()
//val clusteringResult = new GaussianMixture()
//val clusteringResult = new LDA()
//val clusteringResult = new BisectingKMeans()
.setK(numClusters)
.setSeed(1L)
.setMaxIter(numIterations)
.setFeaturesCol("features")
By default, the results are saved in the same a folder than the dataset. The results are saved in several folders:
- [DATETIME]-[DATESETNAME]-kmeansRes-[K]Results: It's the folder that contains the results of the clustering result as "class prediction".
- [DATETIME]-[DATESETNAME]-[K]-class-DFClusters: This folder contains the results of the contingency table with the relative values that sums 100% by rows.
- [DATETIME]-[DATESETNAME]-[K]-class-DFFeatures: This folder contains the results of the contingency table with the relative values that sums 100% by columns.
- [DATETIME]-[DATESETNAME]-loopingchi: This folder contains the results of the chi values for each K in 3 columns: Chi-squared value by row; Chi-squared value by column; and the Chi Index value.
This data can be copy-pasted directly into an spreadsheet to be visualized.
For the iris dataset, the results are following:
- José María Luna-Romera - Personal site
- María Martínez-Ballesteros - Research Group
- Jorge García-Gutiérrez
- José C. Riquelme
Please, cite as: Luna-Romera JM, Martínez-Ballesteros M, García-Gutiérrez J, Riquelme JC. External clustering validity index based on chi-squared statistical test. Information Sciences (2019) 487: 1-17. https://doi.org/10.1016/j.ins.2019.02.046. (http://www.sciencedirect.com/science/article/pii/S0020025519301550)
@article{LUNAROMERA20191,
title = {External clustering validity index based on chi-squared statistical test},
journal = {Information Sciences},
volume = {487},
pages = {1-17},
year = {2019},
issn = {0020-0255},
doi = {https://doi.org/10.1016/j.ins.2019.02.046},
url = {https://www.sciencedirect.com/science/article/pii/S0020025519301550},
author = {José María Luna-Romera and María Martínez-Ballesteros and Jorge García-Gutiérrez and José C. Riquelme},
keywords = {Clustering analysis, External validity indices, Comparing clusters, Big data}
}