Performance Improvements
- summaries are no longer computed when a `HandyFrame` is created.
- column statistics (`q1`, `q3`, `median`, `percentile`) now accept a `precision` argument (default = 0.01) to compute approximate statistics faster (see the sketch after this list).
- `stratify` operations no longer use RDD methods and instead rely on Spark's built-in DataFrame optimizer to deliver fast columnar statistics. A substantial performance improvement was achieved for almost every `stratify` operation.
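A minimal sketch of both improvements, assuming HandySpark's usual `toHandy()` conversion and `cols` accessor; passing `precision` as a keyword argument is an assumption based on the description above:

```python
from handyspark import *

hdf = df.toHandy()  # df: an existing Spark DataFrame, e.g. the Titanic dataset

# approximate median: a larger precision trades accuracy for speed
# (keyword usage assumed; default = 0.01)
median_age = hdf.cols['Age'].median(precision=0.01)

# stratified statistics now run through the DataFrame optimizer, not RDDs
stratified_medians = hdf.stratify(['Pclass', 'Sex']).cols['Age'].median()
```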
Stratified transformers
Transformers `HandyImputer` and `HandyFencer` now store values for stratified operations using the column name as the first level of the dictionary and the filter clause as the second level, as opposed to the inverse structure used in version 0.1.0a1.
- in version 0.1.0a1:

```python
{'Pclass == "1" and Sex == "female"': {'Age': 34.61176470588235},
 'Pclass == "1" and Sex == "male"': {'Age': 41.28138613861386},
 'Pclass == "2" and Sex == "female"': {'Age': 28.722972972972972},
 'Pclass == "2" and Sex == "male"': {'Age': 30.74070707070707},
 'Pclass == "3" and Sex == "female"': {'Age': 21.75},
 'Pclass == "3" and Sex == "male"': {'Age': 26.507588932806325}}
```
- in version 0.2.0a1:

```python
{'Age': {'Pclass == "1" and Sex == "female"': 34.61176470588235,
         'Pclass == "1" and Sex == "male"': 41.28138613861386,
         'Pclass == "2" and Sex == "female"': 28.722972972972972,
         'Pclass == "2" and Sex == "male"': 30.74070707070707,
         'Pclass == "3" and Sex == "female"': 21.75,
         'Pclass == "3" and Sex == "male"': 26.507588932806325}}
```
Outlier detection and removal
Two new methods are available, on both the `HandyFrame` and `HandyColumns` objects, for detecting and removing outliers based on the Mahalanobis distance:

- `get_outliers`: returns a Spark DataFrame containing all rows considered outliers
- `remove_outliers`: returns a filtered Spark DataFrame where all outliers were removed

Those methods consider only numeric columns and use a threshold (default 99.9%) to compute the corresponding chi-square critical value used to filter the rows.
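A minimal usage sketch, assuming a `HandyFrame` built with `toHandy()`; both calls below rely on the default 99.9% threshold, since the exact keyword for overriding it is not given above:

```python
from handyspark import *

hdf = df.toHandy()  # df: an existing Spark DataFrame with numeric columns

# rows whose Mahalanobis distance exceeds the chi-square critical
# value at the default 99.9% threshold
outliers = hdf.get_outliers()
outliers.show()

# the same criterion, returning the outlier-free Spark DataFrame instead
clean_df = hdf.remove_outliers()
```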
Binary classification metrics
The `BinaryClassificationMetrics` object was extended to take a Spark DataFrame (instead of an RDD only), together with the corresponding `scoreCol`, containing the vector of probabilities output by a classifier, and a `labelCol` with the true labels.
It exposes several methods that were not available in PySpark:

- `thresholds`
- `roc`
- `pr`
- `fMeasureByThreshold`
- `precisionByThreshold`
- `recallByThreshold`
It also implements some new methods (see the sketch after this list):

- `getMetricsByThreshold`: returns a Spark DataFrame with all metrics (FPR, Recall and Precision) by threshold
- `confusionMatrix`: returns a DenseMatrix representing the confusion matrix for the informed threshold
- `print_confusion_matrix`: returns a nice pandas DataFrame with the confusion matrix
- `plot_roc_curve`
- `plot_pr_curve`
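A hedged sketch of the extended object; the method names come from the lists above, while the import path, the column names and the `threshold` keyword are assumptions:

```python
from handyspark import BinaryClassificationMetrics  # import path is an assumption

# predictions: a Spark DataFrame produced by a fitted classifier, holding a
# 'probability' vector column (scores) and a 'label' column (true labels)
bcm = BinaryClassificationMetrics(predictions, scoreCol='probability', labelCol='label')

metrics_df = bcm.getMetricsByThreshold()   # Spark DataFrame: FPR, Recall, Precision by threshold
cm = bcm.confusionMatrix(threshold=0.5)    # DenseMatrix (threshold kwarg assumed)
bcm.print_confusion_matrix(threshold=0.5)  # pandas DataFrame view of the same matrix

bcm.plot_roc_curve()
bcm.plot_pr_curve()
```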
Information Theory
The `HandyColumns` object now exposes methods for computing entropy and mutual information:

- `entropy`: returns a pandas Series with the entropy of the informed columns
- `mutual_info`: returns a pandas DataFrame with the mutual information between the informed columns
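A short sketch, again assuming the `cols` accessor with list-of-columns indexing to select the informed columns:

```python
hdf = df.toHandy()

# entropy of each informed column, as a pandas Series
ent = hdf.cols[['Pclass', 'Survived']].entropy()

# pairwise mutual information between the informed columns, as a pandas DataFrame
mi = hdf.cols[['Pclass', 'Survived']].mutual_info()
```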