Releases: dvgodoy/handyspark

Performance Improvement

08 Mar 20:56
a727c4e

Performance Improvements

  • Summaries are no longer computed when a HandyFrame is created.
  • Column statistics (q1, q3, median, percentile) now accept a precision argument (default = 0.01), so approximate statistics can be computed faster.
  • Stratify operations no longer use RDD methods; they rely on Spark's built-in DataFrame optimizer to deliver fast columnar statistics, which substantially improves performance for almost every stratify operation.
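
To give a feel for what a precision of 0.01 can mean, the sketch below assumes (this is not confirmed by these notes) that the precision argument plays the role of the relative error in Spark's DataFrame.approxQuantile, whose documented guarantee is floor((p - err) * N) <= rank(x) <= ceil((p + err) * N). The function name acceptable_rank_range is hypothetical and exists only for illustration.

```python
import math


def acceptable_rank_range(n_rows, probability, relative_error):
    """Ranks an approximate quantile may land on under a relative-error bound.

    Assumption: the bound has the same form as the guarantee documented for
    Spark's DataFrame.approxQuantile. This helper is not part of HandySpark.
    """
    low = math.floor((probability - relative_error) * n_rows)
    high = math.ceil((probability + relative_error) * n_rows)
    # Clamp to ranks that actually exist in a dataset of n_rows elements.
    return max(low, 0), min(high, n_rows)


# A median (p = 0.5) over 100 rows with a deliberately loose error of 0.25.
low_rank, high_rank = acceptable_rank_range(100, 0.5, 0.25)
print(low_rank, high_rank)
```

A smaller error narrows the range of acceptable ranks at the cost of more computation, which is the trade-off a precision argument would expose.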

Stratified transformers

The HandyImputer and HandyFencer transformers now store values for stratified operations with the column name as the first level of the dictionary and the filter clause as the second level. This inverts the structure used in version 0.1.0a1.

  • in version 0.1.0a1:

{'Pclass == "1" and Sex == "female"': {'Age': 34.61176470588235},
'Pclass == "1" and Sex == "male"': {'Age': 41.28138613861386},
'Pclass == "2" and Sex == "female"': {'Age': 28.722972972972972},
'Pclass == "2" and Sex == "male"': {'Age': 30.74070707070707},
'Pclass == "3" and Sex == "female"': {'Age': 21.75},
'Pclass == "3" and Sex == "male"': {'Age': 26.507588932806325}}

  • in version 0.2.0a1:

{'Age': {'Pclass == "1" and Sex == "female"': 34.61176470588235,
'Pclass == "1" and Sex == "male"': 41.28138613861386,
'Pclass == "2" and Sex == "female"': 28.722972972972972,
'Pclass == "2" and Sex == "male"': 30.74070707070707,
'Pclass == "3" and Sex == "female"': 21.75,
'Pclass == "3" and Sex == "male"': 26.507588932806325}}
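
Code that still holds dictionaries in the 0.1.0a1 layout could convert them with a small helper along these lines. This is a minimal sketch: invert_nesting is a hypothetical name and is not part of HandySpark.

```python
def invert_nesting(by_clause):
    """Turn {clause: {column: value}} into {column: {clause: value}}."""
    by_column = {}
    for clause, column_values in by_clause.items():
        for column, value in column_values.items():
            # Create the per-column dictionary on first use, then file the
            # value under its filter clause.
            by_column.setdefault(column, {})[clause] = value
    return by_column


old_layout = {
    'Pclass == "1" and Sex == "female"': {'Age': 34.61176470588235},
    'Pclass == "1" and Sex == "male"': {'Age': 41.28138613861386},
}
new_layout = invert_nesting(old_layout)
print(new_layout)
```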

Outlier detection and removal

Two new methods, available on both HandyFrame and HandyColumns objects, detect and remove outliers based on the Mahalanobis distance:

  • get_outliers: returns a Spark DataFrame containing all rows considered outliers
  • remove_outliers: returns a filtered Spark DataFrame where all outliers were removed

These methods consider only numeric columns and use a threshold (default 99.9%) to compute the corresponding chi-square critical value used to filter the rows.
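
The idea can be illustrated in plain Python for the two-column case. This is a sketch of the general technique under stated assumptions, not HandySpark's implementation: it uses the sample covariance, compares the squared Mahalanobis distance against the critical value, and relies on the closed form that holds only for a chi-square distribution with 2 degrees of freedom. The function name mahalanobis_outliers_2d is hypothetical.

```python
import math


def mahalanobis_outliers_2d(points, threshold=0.999):
    """Return the (x, y) points whose squared Mahalanobis distance exceeds
    the chi-square critical value for 2 degrees of freedom."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n

    # Sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2

    # With 2 degrees of freedom the chi-square CDF is 1 - exp(-x / 2),
    # so the critical value has a closed form.
    critical = -2.0 * math.log(1.0 - threshold)

    outliers = []
    for x, y in points:
        dx, dy = x - mean_x, y - mean_y
        # Squared distance using the inverse of the 2x2 covariance matrix.
        d2 = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
        if d2 > critical:
            outliers.append((x, y))
    return outliers


# A small grid of regular points plus one point far away from it.
cluster = [(float(x), float(y)) for x in range(5) for y in range(6)]
flagged = mahalanobis_outliers_2d(cluster + [(100.0, 100.0)])
print(flagged)
```

For more than two columns the critical value has no such closed form and would normally come from a statistics library.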

Binary classification metrics

The BinaryClassificationMetrics object was extended to accept a Spark DataFrame (previously only an RDD), together with a scoreCol holding the vector of probabilities output by a classifier and a labelCol holding the true labels.

It exposes several methods that were not available in PySpark:

  • thresholds
  • roc
  • pr
  • fMeasureByThreshold
  • precisionByThreshold
  • recallByThreshold

It also implements some new methods:

  • getMetricsByThreshold: returns a Spark DataFrame with all metrics (FPR, recall and precision) by threshold
  • confusionMatrix: returns a DenseMatrix representing the confusion matrix for the given threshold
  • print_confusion_matrix: returns a formatted pandas DataFrame with the confusion matrix
  • plot_roc_curve
  • plot_pr_curve
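
For reference, the quantities reported per threshold can be computed by hand as in the sketch below. It is plain Python, not the library's code: metrics_at_threshold is a hypothetical helper, scores at or above the threshold are treated as positive predictions, and returning 0.0 when a denominator is empty is a convention chosen for this sketch only.

```python
def metrics_at_threshold(scores, labels, threshold):
    """Confusion-matrix counts plus precision, recall and FPR for one threshold."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1

    def ratio(numerator, denominator):
        # Sketch-only convention: report 0.0 when the denominator is empty.
        return numerator / denominator if denominator else 0.0

    return {
        "tp": tp, "fp": fp, "tn": tn, "fn": fn,
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "fpr": ratio(fp, fp + tn),
    }


scores = [0.9, 0.8, 0.4, 0.3]  # predicted probability of the positive class
labels = [1, 0, 1, 0]          # true labels
result = metrics_at_threshold(scores, labels, threshold=0.5)
print(result)
```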

Information Theory

The HandyColumn object now exposes methods for computing entropy and mutual information:

  • entropy: returns a pandas Series with the entropy of the given columns
  • mutual_info: returns a pandas DataFrame with the mutual information between the given columns
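
As a reminder of what these two quantities measure, here is a minimal plain-Python sketch for categorical values. It is not the library's implementation. The logarithm base is an assumption (base 2, so results are in bits), and mutual information is computed from the identity I(X;Y) = H(X) + H(Y) - H(X,Y).

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy, in bits, of a sequence of categorical values."""
    n = len(values)
    counts = Counter(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mutual_info(xs, ys):
    """Mutual information, in bits, between two equally long sequences."""
    # The joint entropy is the entropy of the paired values.
    joint = list(zip(xs, ys))
    return entropy(xs) + entropy(ys) - entropy(joint)


sex = ["female", "male", "female", "male"]
copy_of_sex = list(sex)                   # fully dependent on sex
unrelated = ["a", "a", "b", "b"]          # independent of sex in this sample

print(entropy(sex))
print(mutual_info(sex, copy_of_sex))
print(mutual_info(sex, unrelated))
```

A column shares all of its entropy with an exact copy of itself and none with a column that is independent of it in the sample.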