Release 0.12.2

Changes:

  • Fixed loading of epsilon dataset into memory
  • Fixed multiclass learning on GPU for >255 classes
  • Improved error handling
  • Some other minor fixes

Release 0.12.1.1

Changes:

  • Fixed Python compatibility issue in dataset downloading
  • Added sampling_type parameter for YetiRankPairwise loss

Release 0.12.1

Changes:

  • Support saving models in ONNX format (only for models without categorical features).
  • Added new dataset to our catboost.datasets() -- dataset epsilon, a large dense dataset for binary classification.
  • Speedup of Python cv on GPU.
  • Fixed creation of Pool from pandas.DataFrame with pandas.Categorical columns.

Release 0.12.0

Breaking changes:

  • Class weights are now taken into account by eval_metrics(), get_feature_importance(), and get_object_importance(). In previous versions the weights were ignored.
  • Parameter random-strength for pairwise training (PairLogitPairwise, QueryCrossEntropy, YetiRankPairwise) is not supported anymore.
  • Simultaneous use of MultiClass and MultiClassOneVsAll metrics is now deprecated.

New functionality:

  • cv method is now supported on GPU.
  • String labels for classes are supported in Python. In multiclassification the string class names are inferred from the data. To use string labels in binary classification, pass the class_names parameter and specify which class is negative (0) and which is positive (1). You can also use class_names in multiclassification mode to pass all possible class names to the fit function.
  • Borders can now be saved and reused. To save the feature quantization information obtained during training data preprocessing into a text file, use the CLI option --output-borders-file. To use the borders for training, use the CLI option --input-borders-file. This functionality is now supported on CPU and GPU (it was GPU-only in previous versions). File format for the borders is described here.
  • CLI option --eval-file is now supported on GPU.
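
The class_names convention for string labels described above can be pictured as a plain lookup by position. This is a minimal sketch in pure Python, not CatBoost code; `encode_labels` is a hypothetical helper introduced only for illustration.

```python
def encode_labels(labels, class_names):
    """Map string labels to integer class indices by their position in class_names."""
    index = {name: i for i, name in enumerate(class_names)}
    unknown = sorted(set(labels) - set(index))
    if unknown:
        raise ValueError(f"labels not listed in class_names: {unknown}")
    return [index[label] for label in labels]


# Binary case: the first name is the negative class (0), the second the positive class (1).
encoded = encode_labels(["spam", "ham", "spam"], class_names=["ham", "spam"])
```

Listing every possible class name up front also covers classes that happen to be absent from a particular training sample.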

Quality improvement:

  • Some cases in binary classification are fixed where training could diverge

Optimizations:

  • A great speedup of the Python applier (10x)
  • Reduced memory consumption in the Python cv function (by a factor equal to the fold count)

Benchmarks and tutorials:

  • Added speed benchmarks for CPU and GPU on a variety of different datasets.
  • Added benchmarks of different ranking modes. In this tutorial we compare different ranking modes in CatBoost, XGBoost and LightGBM.
  • Added tutorial for applying model in Java.
  • Added benchmarks of SHAP values calculation for CatBoost, XGBoost and LightGBM. The benchmarks also contain explanation of complexity of this calculation in all the libraries.

We also made a list of stability improvements and stricter checks of input data and parameters.

And we are so grateful to our community members @canorbal and @neer201 for their contribution in this release. Thank you.

Release 0.11.2

Changes:

  • Pure GPU implementation of NDCG metric
  • Enabled LQ loss function
  • Fixed NDCG metric on CPU
  • Added model_sum mode to command line interface
  • Added SHAP values benchmark (#566)
  • Fixed random_strength for Plain boosting (#448)
  • Enabled passing a test pool to caret training (#544)
  • Fixed a bug in exporting the model as python code (#556)
  • Fixed label mapper for multiclassification custom labels (#523)
  • Fixed hash type of categorical features (#558)
  • Fixed handling of cross-validation fold count options in python package (#568)

Release 0.11.1

Changes:

  • Accelerated formula evaluation by ~15%
  • Improved model application interface
  • Improved compilation time for building GPU version
  • Better handling of stray commas in list arguments
  • Added a benchmark that uses the Rossmann Store Sales dataset to compare the quality of GBDT packages
  • Added references to CatBoost papers in R-package CITATION file
  • Fixed a build issue in compilation for GPU
  • Fixed a bug in model applicator
  • Fixed model conversion, #533
  • Restored pre-0.11 behaviour for best_score_ and evals_result_ (issue #539)
  • Make valid RECORD in wheel (issue #534)

Release 0.11.0

Changes:

  • Changed default border count for float feature binarization to 254 on CPU to achieve better quality
  • Fixed random seed to 0 by default
  • Support model with more than 254 feature borders or one hot values when doing predictions
  • Added model summation support in python: use catboost.sum_models() to sum models with provided weights.
  • Added json model tutorial json_model_tutorial.ipynb
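
The arithmetic behind summing models with weights can be sketched without CatBoost. The snippet below is an illustration under the assumption that a summed model outputs the weighted sum of the raw outputs of its parts; `sum_predictors` and the toy models are hypothetical and do not show how catboost.sum_models() is implemented.

```python
def sum_predictors(predictors, weights):
    """Return a predictor whose raw output is the weighted sum of the inputs' outputs."""
    if len(predictors) != len(weights):
        raise ValueError("need exactly one weight per predictor")

    def combined(x):
        return sum(w * predict(x) for predict, w in zip(predictors, weights))

    return combined


# Two toy "models" producing raw scores for a single numeric feature.
model_a = lambda x: 2.0 * x
model_b = lambda x: x + 1.0

blended = sum_predictors([model_a, model_b], weights=[0.5, 0.5])
```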

Release 0.10.4.1

Changes:

  • Bugfix for #518

Release 0.10.4

Breaking changes:

In python 3 some functions returned dictionaries with keys of type bytes - particularly eval_metrics and get_best_score. These are fixed to have keys of type str.

Changes:

  • New metric NumErrors:greater_than=value
  • New metric and objective L_q:q=value
  • model.score(X, y) - can now work with Pool and labels from Pool
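
The two new metrics take their parameter after the colon. As a rough sketch, assuming NumErrors counts objects whose absolute error exceeds greater_than and L_q is the weighted mean of the absolute error raised to the power q (these definitions are not spelled out in the notes above), they could be computed like this in plain Python:

```python
def num_errors(labels, predictions, greater_than):
    """Count predictions whose absolute error exceeds the threshold."""
    return sum(1 for t, a in zip(labels, predictions) if abs(a - t) > greater_than)


def lq_loss(labels, predictions, q, weights=None):
    """Weighted mean of |prediction - label| ** q."""
    if weights is None:
        weights = [1.0] * len(labels)
    total = sum(w * abs(a - t) ** q for t, a, w in zip(labels, predictions, weights))
    return total / sum(weights)


labels = [1.0, 2.0, 3.0]
predictions = [1.5, 2.0, 5.0]
```

With q=2 the second function reduces to the mean squared error, and with q=1 to the mean absolute error.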

Release 0.10.3

Changes:

  • Added EvalResult output after GPU catboost training
  • Supported prediction type option on GPU
  • Added get_evals_result() method and evals_result_ property to model in python wrapper to allow user access metric values
  • Supported string labels for GPU training in cmdline mode
  • Many improvements in JNI wrapper
  • Updated NDCG metric: sped up and added NDCG with exponentiation in numerator as a new NDCG mode
  • CatBoost doesn't drop unused features from model after training
  • Write training finish time and catboost build info to model metadata
  • Fix automatic pairs generation for GPU PairLogitPairwise target

Release 0.10.2

Main changes:

  • Fixed Python 3 support in catboost.FeaturesData
  • 40% speedup of QuerySoftMax CPU training

Release 0.10.1

Improvements

  • 2x speedup of pairwise loss functions
  • For all the people struggling with occasional NaNs in test datasets - now we only write warnings about it

Bugfixes

  • We set up default loss_function in CatBoostClassifier and CatBoostRegressor
  • CatBoost now writes Warning and Error logs to stderr

Release 0.10.0

Breaking changes

R package

  • In R package we have changed parameter name target to label in method save_pool()

Python package

  • We don't support Python 3.4 anymore
  • CatBoostClassifier and CatBoostRegressor get_params() method now returns only the params that were explicitly set when constructing the object. That means that CatBoostClassifier and CatBoostRegressor get_params() will not contain 'loss_function' if it was not specified. This also means that this code:
model1 = CatBoostClassifier()
params = model1.get_params()
model2 = CatBoost(params)

will create model2 with default loss_function RMSE, not with Logloss. This breaking change is done to support sklearn interface, so that sklearn GridSearchCV can work.

  • We've removed several attributes and changed them to functions. This was needed to avoid sklearn warnings: is_fitted_ => is_fitted(), metadata_ => get_metadata().
  • We removed file with model from constructor of estimator. This was also done to avoid sklearn warnings.

Educational materials

  • We added tutorial for our ranking modes.
  • We published our slides, you are very welcome to use them.

Improvements

All

  • Now it is possible to save model in json format.
  • We have added Java interface for CatBoost model
  • We now have static linkage with CUDA, so you don't have to install any particular version of CUDA to get catboost working on GPU.
  • We implemented both multiclass modes on GPU, it is very fast.
  • It is possible now to use multiclass with string labels, they will be inferred from data
  • Added use_weights parameter to metrics. By default all metrics except AUC use weights, but you can disable this. To calculate a metric value without weights, set this parameter to false. Example: Accuracy:use_weights=false. This can be done only for custom_metrics or eval_metric, not for the objective function. The objective function always uses weights if they are present in the dataset.
  • We now use snapshot time intervals. It will work much faster if you save snapshot every 5 or 10 minutes instead of saving it on every iteration.
  • Reduced memory consumption by ranking modes.
  • Added automatic feature importance evaluation after completion of GPU training.
  • Allow nonexistent indices in the ignored features list
  • Added new metrics: LogLikelihoodOfPrediction, RecallAt:top=k, PrecisionAt:top=k and MAP:top=k.
  • Improved quality for multiclass with weighted datasets.
  • Pairwise modes now support automatic pairs generation (see tutorial for that).
  • Metric QueryAverage is renamed to the clearer AverageGain. This is a very important ranking metric. It shows the average target value in the top k documents of a group.
  • Introduced parameter best_model_min_trees - the minimal number of trees the best model should have.
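
Two of the metric changes above, use_weights and the top-k average behind AverageGain, can be illustrated with a short pure-Python sketch. The function names are hypothetical, and the AverageGain version handles a single group without weights, which is a simplifying assumption.

```python
def accuracy(labels, predicted, weights=None, use_weights=True):
    """Share of correct predictions, optionally weighted per object."""
    if weights is None or not use_weights:
        weights = [1.0] * len(labels)
    correct = sum(w for t, p, w in zip(labels, predicted, weights) if t == p)
    return correct / sum(weights)


def average_gain(labels, scores, top):
    """Mean label of the `top` highest-scored documents in one group."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    best = ranked[:top]
    return sum(label for _, label in best) / len(best)


labels = [1, 0, 1, 1]
predicted = [1, 0, 0, 1]
weights = [1.0, 1.0, 4.0, 2.0]
```

Here the single misclassified object carries half of the total weight, so the weighted and unweighted accuracy differ noticeably.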

Python

  • We now support sklearn GridSearchCV: you can pass categorical feature indices when constructing estimator. And then use it in GridSearchCV.
  • We added new method to utils - building of ROC curve: get_roc_curve.
  • Added get_gpu_device_count() method to python package. This is a way to check if your CUDA devices are available.
  • We implemented automatic selection of the decision boundary using the ROC curve. You can select the best classification boundary given the maximum FPR or FNR that you allow the model. Take a look at catboost.select_threshold(self, data=None, curve=None, FPR=None, FNR=None, thread_count=-1). You can also calculate FPR and FNR for each boundary value.
  • We have added pool slicing: pool.slice(doc_indices)
  • Allow GroupId and SubgroupId specified as strings.
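
The idea of choosing a decision boundary from a maximum allowed FPR can be sketched in plain Python. This is an illustration of the concept only, not the implementation of select_threshold; `rates_at` and `pick_threshold` are hypothetical helpers, and the sketch assumes both classes are present in the data.

```python
def rates_at(labels, probabilities, threshold):
    """Return (FPR, FNR) when objects with probability >= threshold are called positive."""
    fp = sum(1 for y, p in zip(labels, probabilities) if y == 0 and p >= threshold)
    fn = sum(1 for y, p in zip(labels, probabilities) if y == 1 and p < threshold)
    negatives = sum(1 for y in labels if y == 0)
    positives = len(labels) - negatives
    return fp / negatives, fn / positives


def pick_threshold(labels, probabilities, max_fpr):
    """Lowest candidate threshold whose FPR does not exceed max_fpr."""
    for threshold in sorted(set(probabilities)):
        fpr, _ = rates_at(labels, probabilities, threshold)
        if fpr <= max_fpr:
            return threshold
    # No candidate qualifies: a threshold above every score calls nothing positive.
    return float("inf")


labels = [0, 0, 0, 1, 1, 1]
probabilities = [0.1, 0.4, 0.6, 0.35, 0.7, 0.9]
```

Because FPR can only fall as the threshold rises, the first threshold that satisfies the limit is also the one with the lowest FNR among the qualifying candidates.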

R package

  • GPU support in R package. You need to use parameter task_type='GPU' to enable GPU training.
  • Models in R can be saved/restored by means of R: save/load or saveRDS/readRDS

Speedups

  • New way of loading data in Python using FeaturesData structure. Using FeaturesData will speed up both loading data for training and for prediction. It is especially important for prediction, because it gives around 10 to 20 times python prediction speedup.
  • Training multiclass on CPU ~ 60% speedup
  • Training of ranking modes on CPU ~ 50% speedup
  • Training of ranking modes on GPU ~ 50% speedup for datasets with many features and not very many objects
  • Speedups of metric calculation on GPU. Example of speedup on our internal dataset: training with the AUC eval metric on a test dataset with 2 million objects is sped up from 7 seconds to 0.2 seconds per iteration.
  • Speedup of all modes on CPU training.

We also did a lot of stability improvements, and improved usability of the library, added new parameter synonyms and improved input data validations.

Thanks a lot to everyone who created issues on GitHub. And thanks a lot to our contributor @pukhlyakova who implemented many new useful metrics!

Release 0.9.1.1

Bugfixes

  • Fixed #403 bug in cuda train submodule (training crashed without evaluation set)
  • Fixed exception propagation on pool parsing stage
  • Add support of string GroupId and SubgroupId in python-package
  • Print real class names instead of their labels in eval output

Release 0.9

Breaking Changes

  • We removed calc_feature_importance parameter from Python and R. Now feature importance calculation is almost free, so we always calculate feature importances. Previously you could disable it if it was slowing down your training.
  • We removed Doc type for feature importances. Use Shap instead.
  • We moved thread_count parameter in Python get_feature_importance method to the end.

Ranking

In this release we added several very powerful ranking objectives:

  • PairLogitPairwise
  • YetiRankPairwise
  • QueryCrossEntropy (GPU only)

Other ranking improvements:

  • We have made improvements to our existing ranking objectives QuerySoftMax and PairLogit.
  • We have added group weights support.

Accuracy improvements

  • Improvement for datasets with weights
  • Now we automatically calculate a good learning rate for you at the start of training, so you don't have to specify it. After the training has finished, you can look at the training curve on the evaluation dataset and make adjustments to the selected learning rate, but it will already be a good value.

Speedups:

  • Several speedups for GPU training.
  • 1.5x speedup for applying the model.
  • Speedup of multiclassification training.
  • 2x speedup for AUC calculation in eval_metrics.
  • Several speedups for eval_metrics for other metrics.
  • 100x speed up for Shap values calculation.
  • Speedup for feature importance calculation. It used to be a bottleneck for GPU training previously, now it's not.
  • We added the ability to skip metric calculation on the train dataset using MetricName:hints=skip_train~true (it might speed up your training if metric calculation is a bottleneck, for example, if you calculate many metrics or if you calculate metrics on GPU).
  • We added possibility to calculate metrics only periodically, not on all iterations. Use metric_period for that. (previously it only disabled verbose output on each iteration).
  • Now we disable by default calculation of expensive metrics on train dataset. We don't calculate AUC and PFound metrics on train dataset by default. You can also disable calculation of other metrics on train dataset using MetricName:hints=skip_train~true. If you want to calculate AUC or PFound on train dataset you can use MetricName:hints=skip_train~false.
  • Now if you want to calculate metrics using eval_metrics or during training you can use metric_period to skip some iterations. It will speed up eval_metrics and it might speed up training, especially GPU training. Note that the most expensive metric calculation is AUC calculation, for this metric and large datasets it makes sense to use metric_period. If you only want to see less verbose output, and still want to see metric values on every iteration written in file, you can use verbose=n parameter
  • Parallelization of calculation of most of the metrics during training
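
The effect of metric_period and the skip_train hint can be sketched in plain Python. The hint string format is taken from the notes above; the helper names are hypothetical, and the rule that the last iteration is always evaluated is an assumption made for this illustration.

```python
def iterations_to_evaluate(iteration_count, metric_period):
    """Iterations on which metrics are computed: every metric_period-th one plus the last."""
    chosen = list(range(0, iteration_count, metric_period))
    last = iteration_count - 1
    # Assumption: the final iteration is always evaluated.
    if last >= 0 and chosen[-1] != last:
        chosen.append(last)
    return chosen


def metric_description(name, skip_train):
    """Build a metric string with the skip_train hint, e.g. 'AUC:hints=skip_train~true'."""
    return f"{name}:hints=skip_train~{'true' if skip_train else 'false'}"
```

With 1000 iterations and metric_period=100, the metric would be computed 11 times under this assumption instead of 1000 times, which is where the saving for expensive metrics such as AUC comes from.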

Improved GPU experience

  • It is now possible to calculate and visualise custom_metric during training on GPU. You can use our Jupyter visualization, CatBoost viewer or TensorBoard the same way you used them for CPU training. Metric calculation might be a bottleneck, so if it slows down your training use metric_period=something and MetricName:hints=skip_train~true
  • We switched to CUDA 9.1. Starting from this release CUDA 8.0 will not be supported
  • Support for external borders on GPU for cmdline

Improved tools for model analysis

  • We added support of feature combinations to our Shap values implementation.
  • Added Shap values for MultiClass and added an example of its usage to our Shap tutorial.
  • Added pretified parameter to get_feature_importance(). With pretified=True the function will return list of features with names sorted in descending order by their importance.
  • Improved interfaces for eval-feature functionality
  • Shap values support in R-package

New features

  • It is possible now to save any metainformation to the model.
  • Empty values support
  • Better support of sklearn
  • feature_names_ for CatBoost class
  • Added silent parameter
  • Better stdout
  • Better diagnostic for invalid inputs
  • Better documentation
  • Added a flag to allow constant labels

New metrics

We added many new metrics that can be used for visualization, overfitting detection, selecting of best iteration of training or for cross-validation:

  • BrierScore
  • HingeLoss
  • HammingLoss
  • ZeroOneLoss
  • MSLE
  • MAE
  • BalancedAccuracy
  • BalancedErrorRate
  • Kappa
  • WKappa
  • QueryCrossEntropy
  • NDCG

New ways to apply the model

  • Saving model as C++ code
  • Saving model with categorical features as Python code

New ways to build the code

Added make files for binary with CUDA and for Python package

Tutorials

We created a new repo with tutorials, now you don't have to clone the whole catboost repo to run Jupyter notebook with a tutorial.

Bugfixes

We also have a set of bugfixes, and we are grateful to everyone who has filed a bug report, helping us make the library better.

Thanks to our Contributors

This release contains contributions from CatBoost team. We want to especially mention @pukhlyakova who implemented lots of useful metrics.

Release 0.8.1

Bug Fixes and Other Changes

  • New model method get_cat_feature_indices() in Python wrapper.
  • Minor fixes and stability improvements.

Release 0.8

Breaking changes

  • We fixed a bug in CatBoost Pool initialization from numpy.array and pandas.DataFrame with string values. The fix can cause slight inconsistency when using models trained with older versions: around 1% of cat feature hashes were treated incorrectly. If you experience a quality drop after the update, you should consider retraining your model.

Major Features And Improvements

  • The algorithm for finding the most influential training samples for a given object, from the 'Finding Influential Training Samples for Gradient Boosted Decision Trees' paper, is implemented. For every object in the input pool, this mode calculates a score for every object in the train pool. A positive score means that the given train object has made a negative contribution to the given test object prediction, and vice versa for negative scores. The larger the absolute value of the score, the larger the contribution. See the get_object_importance model method in the Python package and the ostr mode in the cli-version. Tutorial for Python is available here. More details and examples will be published in documentation soon.
  • We have implemented a new way of exploring feature importance - Shap values from paper. This helps you understand which features are most influential for a given object. You can also get more insight into your model; see details in a tutorial.
  • Save model as code functionality published. For now you can save a model as Python code with categorical features and as C++ code without categorical features.
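
To make the notion of per-object feature contributions more concrete, here is a brute-force computation of exact Shapley values for a toy model with three features. It is a sketch of the definition only: it enumerates all feature subsets, which is feasible just for tiny examples, and it is not the tree-based algorithm used in the library. `shapley_values` and `toy_value` are hypothetical names.

```python
from itertools import combinations
from math import factorial


def shapley_values(value, n_features):
    """Exact Shapley values for a set function value(frozenset_of_feature_indices)."""
    result = []
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        phi = 0.0
        for size in range(len(others) + 1):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n_features - size - 1) / factorial(n_features)
            for subset in combinations(others, size):
                coalition = frozenset(subset)
                phi += weight * (value(coalition | {i}) - value(coalition))
        result.append(phi)
    return result


def toy_value(present):
    """Toy model output when only the features in `present` are known."""
    score = 0.0
    if 0 in present:
        score += 2.0
    if 1 in present:
        score += 1.0
    if 0 in present and 1 in present:
        score += 1.0  # Interaction shared between features 0 and 1.
    return score


contributions = shapley_values(toy_value, n_features=3)
```

The contributions sum to the difference between the model output with all features known and with none known, and the unused third feature receives a contribution of zero.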

Bug Fixes and Other Changes

  • Fix _catboost reinitialization issues #268 and #269.
  • Python module catboost.utils extended with create_cd. It creates a column description file.
  • Now it's possible to load titanic and amazon (Kaggle Amazon Employee Access Challenge) datasets from Python code. Use catboost.datasets.
  • GPU parameter use_cpu_ram_for_cat_features renamed to gpu_cat_features_storage with possible values CpuPinnedMemory and GpuRam. Default is GpuRam.

Thanks to our Contributors

This release contains contributions from CatBoost team.

As usual we are grateful to all who filed issues or helped resolve them, asked and answered questions.

Release 0.7.2

Major Features And Improvements

  • GPU: New DocParallel mode for tasks without categorical features or with categorical features and --max-ctr-complexity 1. Provides the best performance for pools with a large number of documents.
  • GPU: Distributed training on several GPU hosts via MPI. See instructions on how to build the binary here.
  • GPU: Up to 30% learning speed-up for Maxwell and later GPUs with binarization level > 32

Bug Fixes and Other Changes

  • Hotfixes for GPU version of python wrapper.

Release 0.7.1

Major Features And Improvements

  • Python wrapper: added methods to download datasets titanic and amazon, to make it easier to try the library (catboost.datasets).
  • Python wrapper: added method to write column description file (catboost.utils.create_cd).
  • Made improvements to visualization.
  • Support non-numeric values in GroupId column.
  • Tutorials section updated.

Bug Fixes and Other Changes

  • Fixed problems with eval_metrics (issue #285)
  • Other fixes

Release 0.7

Breaking changes

  • Changed parameter order in train() function to be consistent with other GBDT libraries.
  • use_best_model is set to True by default if eval_set labels are present.

Major Features And Improvements

  • New ranking mode YetiRank optimizes NDCG and PFound.
  • New visualisation for eval_metrics and cv in Jupyter notebook.
  • Improved per document feature importance.
  • Supported verbose=int: if verbose > 1, metric_period is set to this value.
  • Supported type(eval_set) = list in python. Currently supporting only single eval_set.
  • Binary classification leaf estimation defaults are changed for weighted datasets so that training converges for any weights.
  • Add model_size_reg parameter to control model size. Fix ctr_leaf_count_limit parameter, also to control model size.
  • Beta version of distributed CPU training with only float features support.
  • Add subgroupId to Python/R-packages.
  • Add groupwise metrics support in eval_metrics.

Thanks to our Contributors

This release contains contributions from CatBoost team.

We are grateful to all who filed issues or helped resolve them, asked and answered questions.

Release 0.6.3

Breaking changes

  • boosting_type parameter value Dynamic is renamed to Ordered.
  • Data visualisation functionality in Jupyter Notebook requires ipywidgets 7.x+ now.
  • query_id parameter renamed to group_id in Python and R wrappers.
  • cv returns pandas.DataFrame by default if Pandas installed. See new parameter as_pandas.

Major Features And Improvements

  • CatBoost build with make file. Now it’s possible to build command-line CPU version of CatBoost under Linux with make file.
  • In column description column name Target is changed to Label. It will still work with previous name, but it is recommended to use the new one.
  • eval-metrics mode added into cmdline version. Metrics can be calculated for a given dataset using a previously trained model.
  • New classification metric CtrFactor is added.
  • Load CatBoost model from memory. You can load your CatBoost model from file or initialize it from buffer in memory.
  • Now you can run fit function using file with dataset: fit(train_path, eval_set=eval_path, column_description=cd_file). This will reduce memory consumption by up to two times.
  • 12% speedup for training.

Bug Fixes and Other Changes

  • JSON output data format is changed.
  • Python whl binaries with CUDA 9.1 support for Linux OS published into the release assets.
  • Added bootstrap_type parameter to CatBoostClassifier and Regressor (issue #263).

Thanks to our Contributors

This release contains contributions from newbfg and CatBoost team.

We are grateful to all who filed issues or helped resolve them, asked and answered questions.

Release 0.6.2

Major Features And Improvements

  • BETA version of distributed multi-host GPU training via MPI
  • Added possibility to import coreml model with oblivious trees. Makes possible to migrate pre-flatbuffers model (with float features only) to current format (issue #235)
  • Added QuerySoftMax loss function

Bug Fixes and Other Changes

  • Fixed GPU models bug on pools with both categorical and float features (issue #241)
  • Use all available cores by default
  • Fixed non-querywise loss for pools with QueryId
  • Default float features binarization method set to GreedyLogSum

Release 0.6.1.1

Bug Fixes and Other Changes

  • Hotfix for critical bug in Python and R wrappers (issue #238)
  • Added stratified data split in CV
  • Fix is_classification check and CV for Logloss

Release 0.6.1

Bug Fixes and Other Changes

  • Fixed critical bugs in formula evaluation code (issue #236)
  • Added scale_pos_weight parameter

Release 0.6

Speedups

  • 25% speedup of the model applier
  • 43% speedup for training on large datasets.
  • 15% speedup for QueryRMSE and calculation of querywise metrics.
  • Large speedups when using binary categorical features.
  • Significant (x200 on 5k trees and 50k lines dataset) speedup for plot and stage predict calculations in cmdline.
  • Compilation time speedup.

Major Features And Improvements

  • The industry's fastest applier implementation.
  • Introducing new parameter boosting-type to switch between standard boosting scheme and dynamic boosting, described in paper "Dynamic boosting".
  • Adding new bootstrap parameters bootstrap_type and subsample. Using the Bernoulli bootstrap type with subsample < 1 might increase the training speed.
  • Better logging for cross-validation, added parameter logging_level and metric_period (should be set in training parameters) to cv.
  • Added a separate train function that receives the parameters and returns a trained model.
  • Ranking mode QueryRMSE now supports default settings for dynamic boosting.
  • R-package pre-built binaries are included in the release.
  • We added many synonyms to our parameter names, now it is more convenient to try CatBoost if you are used to some other library.
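
The speed benefit of the Bernoulli bootstrap with subsample < 1 mentioned above comes from each iteration seeing only part of the data. A minimal sketch of such sampling in plain Python follows; `bernoulli_sample` is a hypothetical helper, and how the library draws and uses the sample internally is not shown here.

```python
import random


def bernoulli_sample(n_objects, subsample, seed=0):
    """Pick each object index independently with probability `subsample`."""
    rng = random.Random(seed)
    return [i for i in range(n_objects) if rng.random() < subsample]


picked = bernoulli_sample(n_objects=1000, subsample=0.5)
```

On average the sample contains subsample * n_objects objects, so a tree built on it touches proportionally fewer rows.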

Bug Fixes and Other Changes

  • Fix for CPU QueryRMSE with weights.
  • Adding several missing parameters into wrappers.
  • Fix for data split in querywise modes.
  • Better logging.
  • From this release we'll provide pre-built R-binaries
  • More parallelisation.
  • Memory usage improvements.
  • And some other bug fixes.

Thanks to our Contributors

This release contains contributions from CatBoost team.

We are grateful to all who filed issues or helped resolve them, asked and answered questions.

Release 0.5.2

Major Features And Improvements

  • We've made single document formula applier 4 times faster!
  • model.shrink function added in Python and R wrappers.
  • Added new training parameter metric_period that controls output frequency.
  • Added new ranking metric QueryAverage.
  • This version contains an easy way to implement new user metrics in C++. How-to example is provided.

Bug Fixes and Other Changes

  • Stability improvements and bug fixes

As usual we are grateful to all who filed issues, asked and answered questions.

Release 0.5

Breaking Changes

Cmdline:

  • Training parameter gradient-iterations renamed to leaf-estimation-iterations.
  • border option removed. If you want to specify border for binary classification mode you need to specify it in the following way: loss-function Logloss:Border=0.5
  • CTR parameters are changed:
    • Removed priors, per-feature-priors, ctr-binarization;
    • Added simple-ctr, combinations-ctr, per-feature-ctr. More details will be published in our documentation.

Python:

  • Training parameter gradient_iterations renamed to leaf_estimation_iterations.
  • border option removed. If you want to specify border for binary classification mode you need to specify it in the following way: loss_function='Logloss:Border=0.5'
  • CTR parameters are changed:
    • Removed priors, per_feature_priors, ctr_binarization;
    • Added simple_ctr, combinations_ctr, per_feature_ctr. More details will be published in our documentation.
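
The Border parameter in 'Logloss:Border=0.5' turns a numeric target into a binary one. The sketch below shows that idea in plain Python, assuming labels strictly greater than the border become the positive class and assuming ";" separates multiple parameters; both helpers are hypothetical and only illustrate the string form shown above.

```python
def binarize_target(labels, border=0.5):
    """Treat labels strictly greater than the border as class 1, the rest as class 0."""
    return [1 if label > border else 0 for label in labels]


def split_loss_description(description):
    """Split 'Logloss:Border=0.5' into ('Logloss', {'Border': '0.5'})."""
    name, _, raw_params = description.partition(":")
    params = dict(item.split("=", 1) for item in raw_params.split(";") if item)
    return name, params
```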

Major Features And Improvements

  • In Python we added a new method eval_metrics: now it's possible for a given model to calculate specified metric values for each iteration on specified dataset.
  • One command-line binary for CPU and GPU: in CatBoost you can switch between CPU and GPU training by changing single parameter value task-type CPU or GPU (task_type 'CPU', 'GPU' in python bindings). Windows build still contains two binaries.
  • We have sped up training by up to 30% for datasets with a lot of objects.
  • Up to 10% speed-up of GPU implementation on Pascal cards
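
Calculating a metric "for each iteration", as eval_metrics does, amounts to scoring the partial ensemble after every added tree. The following pure-Python sketch shows that idea with RMSE on made-up per-tree outputs; it does not call CatBoost, and `metric_per_iteration` is a hypothetical helper.

```python
from math import sqrt


def rmse(labels, predictions):
    """Root mean squared error."""
    return sqrt(sum((a - t) ** 2 for t, a in zip(labels, predictions)) / len(labels))


def metric_per_iteration(tree_outputs, labels, learning_rate=1.0):
    """RMSE of the partial ensemble after each tree is added.

    tree_outputs[k][i] is the output of tree k for object i.
    """
    running = [0.0] * len(labels)
    history = []
    for outputs in tree_outputs:
        running = [r + learning_rate * o for r, o in zip(running, outputs)]
        history.append(rmse(labels, running))
    return history


labels = [1.0, 3.0]
tree_outputs = [[1.0, 1.0], [0.0, 1.0], [0.0, 1.0]]
history = metric_per_iteration(tree_outputs, labels)
```

Keeping a running sum of tree outputs means each iteration costs one pass over the objects rather than a full re-evaluation of the ensemble.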

Bug Fixes and Other Changes

  • Stability improvements and bug fixes

As usual we are grateful to all who filed issues, asked and answered questions.

Release 0.4

Breaking Changes

FlatBuffers model format: new CatBoost versions won't break model compatibility anymore.

Major Features And Improvements

  • Training speedups: we have sped up the training by 33%.
  • Two new ranking modes are available:
    • PairLogit - pairwise comparison of objects from the input dataset. The algorithm maximises the probability of correctly ordering all dataset pairs.
    • QueryRMSE - mix of regression and ranking. It tries to produce the best ranking for each dataset query by input labels.
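
The two ranking modes can be sketched as loss functions in plain Python. The formulas are assumptions chosen to match the descriptions above: a logistic loss over (winner, loser) pairs for PairLogit, and an RMSE that ignores a constant per-query offset for QueryRMSE. The library's exact definitions, including weights, may differ.

```python
from math import exp, log, sqrt


def pair_logit(scores, pairs):
    """Mean negative log-likelihood that each (winner, loser) pair is ordered correctly."""
    total = 0.0
    for winner, loser in pairs:
        margin = scores[winner] - scores[loser]
        total += log(1.0 + exp(-margin))  # Equals -log(sigmoid(margin)).
    return total / len(pairs)


def query_rmse(labels, scores, query_ids):
    """RMSE after removing the mean residual of each query."""
    residuals = {}
    for t, a, q in zip(labels, scores, query_ids):
        residuals.setdefault(q, []).append(t - a)
    squared = 0.0
    for values in residuals.values():
        shift = sum(values) / len(values)
        squared += sum((v - shift) ** 2 for v in values)
    return sqrt(squared / len(labels))
```

In this sketch, scores that differ from the labels by the same constant within a query give a QueryRMSE of zero, which reflects that only the relative order and spacing inside a query matter for ranking.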

Bug Fixes and Other Changes

  • We have fixed a bug that caused quality degradation when using weights < 1.
  • Verbose flag is now deprecated, please use logging_level instead. You could set the following levels: Silent, Verbose, Info, Debug.
  • Fixed some other bugs.

Thanks to our Contributors

This release contains contributions from: avidale, newbfg, KochetovNicolai and CatBoost team.

We are grateful to all who filed issues or helped resolve them, asked and answered questions.

Release 0.3

Major Features And Improvements

GPU CUDA support is available. CatBoost supports multi-GPU training. Our GPU implementation is 2 times faster than LightGBM and more than 20 times faster than the XGBoost one. Check out the news with benchmarks on our site.

Bug Fixes and Other Changes

Stability improvements and bug fixes

Thanks to our Contributors

This release contains contributions from: daskol and CatBoost team.

We are grateful to all who filed issues or helped resolve them, asked and answered questions.

Release 0.2

Breaking Changes

  • R library interface significantly changed
  • New model format: CatBoost v0.2 model binary not compatible with previous versions
  • Cross-validation parameter changes: we changed the overfitting detector parameters of CV in Python so that they are the same as those in training.
  • CTR types: MeanValue => BinarizedTargetMeanValue

Major Features And Improvements

  • Training speedups: we have sped up the training by 20-30%.
  • Accuracy improvement with categoricals: we have changed computation of statistics for categorical features, which leads to better quality.
  • New type of overfitting detector: Iter. This type of detector was requested by our users. So now you can also stop training by a simple criterion: if after a fixed number of iterations there is no improvement of your evaluation function.
  • TensorBoard support: this is another way of looking at the graphs of different error functions both during training and after training has finished. To look at the metrics you need to provide train_dir when training your model and then run "tensorboard --logdir={train_dir}"
  • Jupyter notebook improvements: for our Python library users that experiment with Jupyter notebooks, we have improved our visualisation tool. Now it is possible to save image of the graph. We also have changed scrolling behaviour so that it is more convenient to scroll the notebook.
  • NaN features support: we have also added a simple but effective way of dealing with NaN features. If you have some NaNs in the train set, they will be changed to a value that is less than the minimum value or greater than the maximum value in the dataset (this is configurable), so that it is guaranteed that they are in their own bin, and a split would separate NaN values from all other values. By default, no NaNs are allowed, so you need to use option nan_mode for that. When applying a model, NaNs will be treated in the same way for the features where NaN values were seen in train. It is not allowed to have NaN values in test if no NaNs in train for this feature were provided.
  • Snapshotting: we have added snapshotting to our Python and R libraries. So if you think that something can happen with your training, for example machine can reboot, you can use snapshot_file parameter - this way after you restart your training it will start from the last completed iteration.
  • R library tutorial: we have added tutorial
  • Logging customization: we have added allow_writing_files parameter. By default some files with logging and diagnostics are written to disk, but you can turn this off by setting this flag to False.
  • Multiclass mode improvements: we have added a new objective for multiclass mode - MultiClassOneVsAll. We also added class_names param - now you don't have to renumber your classes to be able to use multiclass. And we have added two new metrics for multiclass: TotalF1 and MCC metrics. You can use these metrics to watch how their values change during training, for overfitting detection, or for cutting the model at the best value of a given metric.
  • Any delimiters support: in addition to datasets in tsv format, CatBoost now supports files with any delimiters
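
The Iter overfitting detector described above stops training when the evaluation metric has not improved for a fixed number of iterations. A minimal sketch of that criterion in plain Python is shown below; `stop_iteration` and its `wait` argument are hypothetical names, and the exact bookkeeping in the library may differ.

```python
def stop_iteration(metric_values, wait, lower_is_better=True):
    """Index of the iteration at which an Iter-style detector would stop, or None."""
    best_value = None
    best_index = -1
    for index, value in enumerate(metric_values):
        improved = (
            best_value is None
            or (value < best_value if lower_is_better else value > best_value)
        )
        if improved:
            best_value, best_index = value, index
        elif index - best_index >= wait:
            return index
    return None


errors = [0.9, 0.7, 0.6, 0.61, 0.62, 0.63, 0.59]
```

With wait=3 the sketch stops at iteration 5, three iterations after the best value at iteration 2, and so never sees the later improvement at iteration 6. A larger wait trades extra training time for a lower chance of stopping too early.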

Bug Fixes and Other Changes

Stability improvements and bug fixes

Thanks to our Contributors

This release contains contributions from: grayskripko, hadjipantelis and CatBoost team.

We are grateful to all who filed issues or helped resolve them, asked and answered questions.