scikit-learn 1.0 Now Available
scikit-learn is an open source machine learning library that supports supervised and unsupervised learning, and is used by an estimated 80% of data scientists, according to a recent Kaggle survey.
The library contains implementations of many common ML algorithms and models, including the widely-used linear regression, decision tree, and gradient-boosting algorithms. It also provides various tools for model fitting, data preprocessing, model selection and evaluation, and many other utilities.
This release includes some new key features as well as many improvements and bug fixes. Highlights include:
- Keyword and positional arguments
- Spline Transformers
- Quantile Regressor
- Feature Names Support
- A more flexible plotting API
- Online One-Class SVM
- Histogram-based Gradient Boosting Models are now stable
- New documentation improvements
For more details on the main highlights of the release, please refer to Release Highlights for scikit-learn 1.0.
To install the latest version (with pip):
pip install --upgrade scikit-learn
or with conda:
conda install -c conda-forge scikit-learn
Version 1.0.0
For a short description of the main highlights of the release, please refer to Release Highlights for scikit-learn 1.0.
- Major Feature : something big that you couldn’t do before.
- Feature : something that you couldn’t do before.
- Efficiency : an existing feature now may not require as much computation or memory.
- Enhancement : a miscellaneous minor improvement.
- Fix : something that previously didn’t work as documented – or according to reasonable expectations – should now work.
- API Change : you will need to change your code to have the same effect in the future; or a feature will be removed in the future.
Version 1.0.0 of scikit-learn requires Python 3.7+, NumPy 1.14.6+ and SciPy 1.1.0+. The optional dependency matplotlib requires version 2.2.2+.
Enforcing keyword-only arguments
In an effort to promote clear and non-ambiguous use of the library, most constructor and function parameters must now be passed as keyword arguments (i.e. using the param=value syntax) instead of positional. If a keyword-only parameter is used as positional, a TypeError is now raised.
Changed models
The following estimators and functions, when fit with the same data and parameters, may produce different models from the previous version. This often occurs due to changes in the modelling logic (bug fixes or enhancements), or in random sampling procedures.
- Fix manifold.TSNE now avoids numerical underflow issues during affinity matrix computation.
- Fix manifold.Isomap now connects disconnected components of the neighbors graph along some minimum distance pairs, instead of changing every infinite distance to zero.
- Fix The splitting criterion of tree.DecisionTreeClassifier and tree.DecisionTreeRegressor can be impacted by a fix in the handling of rounding errors. Previously some extra spurious splits could occur.
Details are listed in the changelog below.
(While we are trying to better inform users by providing this information, we cannot assure that this list is complete.)
Changelog
- API Change The option for using the squared error via loss and criterion parameters was made more consistent. The preferred way is by setting the value to "squared_error". Old option names are still valid, produce the same models, but are deprecated and will be removed in version 1.2. #19310 by Christian Lorentzen.
- For ensemble.ExtraTreesRegressor, criterion="mse" is deprecated, use "squared_error" instead which is now the default.
- For ensemble.GradientBoostingRegressor, loss="ls" is deprecated, use "squared_error" instead which is now the default.
- For ensemble.RandomForestRegressor, criterion="mse" is deprecated, use "squared_error" instead which is now the default.
- For ensemble.HistGradientBoostingRegressor, loss="least_squares" is deprecated, use "squared_error" instead which is now the default.
- For linear_model.RANSACRegressor, loss="squared_loss" is deprecated, use "squared_error" instead.
- For linear_model.SGDRegressor, loss="squared_loss" is deprecated, use "squared_error" instead which is now the default.
- For tree.DecisionTreeRegressor, criterion="mse" is deprecated, use "squared_error" instead which is now the default.
- For tree.ExtraTreeRegressor, criterion="mse" is deprecated, use "squared_error" instead which is now the default.
- API Change The option for using the absolute error via loss and criterion parameters was made more consistent. The preferred way is by setting the value to "absolute_error". Old option names are still valid, produce the same models, but are deprecated and will be removed in version 1.2. #19733 by Christian Lorentzen.
- For ensemble.ExtraTreesRegressor, criterion="mae" is deprecated, use "absolute_error" instead.
- For ensemble.GradientBoostingRegressor, loss="lad" is deprecated, use "absolute_error" instead.
- For ensemble.RandomForestRegressor, criterion="mae" is deprecated, use "absolute_error" instead.
- For ensemble.HistGradientBoostingRegressor, loss="least_absolute_deviation" is deprecated, use "absolute_error" instead.
- For linear_model.RANSACRegressor, loss="absolute_loss" is deprecated, use "absolute_error" instead which is now the default.
- For tree.DecisionTreeRegressor, criterion="mae" is deprecated, use "absolute_error" instead.
- For tree.ExtraTreeRegressor, criterion="mae" is deprecated, use "absolute_error" instead.
- API Change np.matrix usage is deprecated in 1.0 and will raise a TypeError in 1.2. #20165 by Thomas Fan.
- API Change get_feature_names_out has been added to the transformer API to get the names of the output features. get_feature_names has in turn been deprecated. #18444 by Thomas Fan.
- API Change All estimators store feature_names_in_ when fitted on pandas DataFrames. These feature names are compared to names seen in non-fit methods, e.g. transform, and a FutureWarning is raised if they are not consistent. These FutureWarnings will become ValueErrors in 1.2. #18010 by Thomas Fan.
sklearn.base
- Fix config_context is now threadsafe. #18736 by Thomas Fan.
sklearn.calibration
- Feature calibration.CalibrationDisplay added to plot calibration curves. #17443 by Lucy Liu.
- Fix The predict and predict_proba methods of calibration.CalibratedClassifierCV can now properly be used on prefitted pipelines. #19641 by Alek Lefebvre.
- Fix Fixed an error when using an ensemble.VotingClassifier as base_estimator in calibration.CalibratedClassifierCV. #20087 by Clément Fauchereau.
sklearn.cluster
- Efficiency The "k-means++" initialization of cluster.KMeans and cluster.MiniBatchKMeans is now faster, especially in multicore settings. #19002 by Jon Crall and Jérémie du Boisberranger.
- Efficiency cluster.KMeans with algorithm='elkan' is now faster in multicore settings. #19052 by Yusuke Nagasaka.
- Efficiency cluster.MiniBatchKMeans is now faster in multicore settings. #17622 by Jérémie du Boisberranger.
- Efficiency cluster.OPTICS can now cache the output of the computation of the tree, using the memory parameter. #19024 by Frankie Robertson.
- Enhancement The predict and fit_predict methods of cluster.AffinityPropagation now accept sparse data type for input data. #20117 by Venkatachalam Natchiappan.
- Fix Fixed a bug in cluster.MiniBatchKMeans where the sample weights were partially ignored when the input is sparse. #17622 by Jérémie du Boisberranger.
- Fix Improved convergence detection based on center change in cluster.MiniBatchKMeans which was almost never achievable. #17622 by Jérémie du Boisberranger.
- Fix cluster.AgglomerativeClustering now supports readonly memory-mapped datasets. #19883 by Julien Jerphanion.
- Fix cluster.AgglomerativeClustering correctly connects components when connectivity and affinity are both precomputed and the number of connected components is greater than 1. #20597 by Thomas Fan.
- Fix cluster.FeatureAgglomeration does not accept a **params kwarg in the fit function anymore, resulting in a more concise error message. #20899 by Adam Li.
- Fix Fixed a bug in cluster.KMeans, ensuring reproducibility and equivalence between sparse and dense input. #20200 by Jérémie du Boisberranger.
- API Change cluster.Birch attributes, fit_ and partial_fit_, are deprecated and will be removed in 1.2. #19297 by Thomas Fan.
- API Change The default value for the batch_size parameter of cluster.MiniBatchKMeans was changed from 100 to 1024 for efficiency reasons. The n_iter_ attribute of cluster.MiniBatchKMeans now reports the number of started epochs and the n_steps_ attribute reports the number of mini batches processed. #17622 by Jérémie du Boisberranger.
- API Change cluster.spectral_clustering raises an improved error when passed a np.matrix. #20560 by Thomas Fan.
sklearn.compose
- Enhancement compose.ColumnTransformer now records the output of each transformer in output_indices_. #18393 by Luca Bittarello.
- Enhancement compose.ColumnTransformer now allows DataFrame input to have its columns appear in a changed order in transform. Further, columns that are dropped will not be required in transform, and additional columns will be ignored if remainder='drop'. #19263 by Thomas Fan.
- Enhancement Adds **predict_params keyword argument to compose.TransformedTargetRegressor.predict that passes keyword argument to the regressor. #19244 by Ricardo.
- Fix compose.ColumnTransformer.get_feature_names supports non-string feature names returned by any of its transformers. However, note that get_feature_names is deprecated, use get_feature_names_out instead. #18459 by Albert Villanova del Moral and Alonso Silva Allende.
- Fix compose.TransformedTargetRegressor now takes nD targets with an adequate transformer. #18898 by Oras Phongpanagnam.
- API Change Adds verbose_feature_names_out to compose.ColumnTransformer. This flag controls the prefixing of feature names out in get_feature_names_out. #18444 and #21080 by Thomas Fan.
sklearn.covariance
- Fix Adds arrays check to covariance.ledoit_wolf and covariance.ledoit_wolf_shrinkage. #20416 by Hugo Defois.
- API Change Deprecates the following keys in cv_results_ of covariance.GraphicalLassoCV: 'mean_score', 'std_score', and 'split(k)_score' in favor of 'mean_test_score', 'std_test_score', and 'split(k)_test_score'. #20583 by Thomas Fan.
sklearn.datasets
- Enhancement datasets.fetch_openml now supports categories with missing values when returning a pandas dataframe. #19365 by Thomas Fan and Amanda Dsouza and EL-ATEIF Sara.
- Enhancement datasets.fetch_kddcup99 raises a better message when the cached file is invalid. #19669 by Thomas Fan.
- Enhancement Replace usages of __file__ related to resource file I/O with importlib.resources to avoid the assumption that these resource files (e.g. iris.csv) already exist on a filesystem, and by extension to enable compatibility with tools such as PyOxidizer. #20297 by Jack Liu.
- Fix Shorten data file names in the openml tests to better support installing on Windows and its default 260 character limit on file names. #20209 by Thomas Fan.
- Fix datasets.fetch_kddcup99 returns dataframes when return_X_y=True and as_frame=True. #19011 by Thomas Fan.
- API Change Deprecates datasets.load_boston in 1.0 and it will be removed in 1.2. Alternative code snippets to load similar datasets are provided. Please refer to the docstring of the function for details. #20729 by Guillaume Lemaitre.
sklearn.decomposition
- Enhancement Added a new approximate solver (randomized SVD, available with eigen_solver='randomized') to decomposition.KernelPCA. This significantly accelerates computation when the number of samples is much larger than the desired number of components. #12069 by Sylvain Marié.
- Fix Fixes incorrect multiple data-conversion warnings when clustering boolean data. #19046 by Surya Prakash.
- Fix Fixed dict_learning, used by decomposition.DictionaryLearning, to ensure determinism of the output. Achieved by flipping signs of the SVD output which is used to initialize the code. #18433 by Bruno Charron.
- Fix Fixed a bug in decomposition.MiniBatchDictionaryLearning, decomposition.MiniBatchSparsePCA and decomposition.dict_learning_online where the update of the dictionary was incorrect. #19198 by Jérémie du Boisberranger.
- Fix Fixed a bug in decomposition.DictionaryLearning, decomposition.SparsePCA, decomposition.MiniBatchDictionaryLearning, decomposition.MiniBatchSparsePCA, decomposition.dict_learning and decomposition.dict_learning_online where the restart of unused atoms during the dictionary update was not working as expected. #19198 by Jérémie du Boisberranger.
- API Change In decomposition.DictionaryLearning, decomposition.MiniBatchDictionaryLearning, decomposition.dict_learning and decomposition.dict_learning_online, transform_alpha will be equal to alpha instead of 1.0 by default starting from version 1.2. #19159 by Benoît Malézieux.
- API Change Rename variable names in KernelPCA to improve readability. lambdas_ and alphas_ are renamed to eigenvalues_ and eigenvectors_, respectively. lambdas_ and alphas_ are deprecated and will be removed in 1.2. #19908 by Kei Ishikawa.
- API Change The alpha and regularization parameters of decomposition.NMF and decomposition.non_negative_factorization are deprecated and will be removed in 1.2. Use the new parameters alpha_W and alpha_H instead. #20512 by Jérémie du Boisberranger.
sklearn.dummy
- API Change Attribute n_features_in_ in dummy.DummyClassifier and dummy.DummyRegressor is deprecated and will be removed in 1.2. #20960 by Thomas Fan.
sklearn.ensemble
- Enhancement HistGradientBoostingClassifier and HistGradientBoostingRegressor take cgroups quotas into account when deciding the number of threads used by OpenMP. This avoids performance problems caused by over-subscription when using those classes in a docker container for instance. #20477 by Thomas Fan.
- Enhancement HistGradientBoostingClassifier and HistGradientBoostingRegressor are no longer experimental. They are now considered stable and are subject to the same deprecation cycles as all other estimators. #19799 by Nicolas Hug.
- Enhancement Improve the HTML rendering of the ensemble.StackingClassifier and ensemble.StackingRegressor. #19564 by Thomas Fan.
- Enhancement Added Poisson criterion to ensemble.RandomForestRegressor. #19836 by Brian Sun.
- Fix Computing the out-of-bag (OOB) score is no longer allowed in ensemble.RandomForestClassifier and ensemble.ExtraTreesClassifier with multiclass-multioutput targets, since scikit-learn does not provide any metric supporting this type of target. Additional private refactoring was performed. #19162 by Guillaume Lemaitre.
- Fix Improve numerical precision for weights boosting in ensemble.AdaBoostClassifier and ensemble.AdaBoostRegressor to avoid underflows. #10096 by Fenil Suchak.
- Fix Fixed the range of the argument max_samples to be (0.0, 1.0] in ensemble.RandomForestClassifier, ensemble.RandomForestRegressor, where max_samples=1.0 is interpreted as using all n_samples for bootstrapping. #20159 by @murata-yu.
- Fix Fixed a bug in ensemble.AdaBoostClassifier and ensemble.AdaBoostRegressor where the sample_weight parameter got overwritten during fit. #20534 by Guillaume Lemaitre.
- API Change Removes tol=None option in ensemble.HistGradientBoostingClassifier and ensemble.HistGradientBoostingRegressor. Please use tol=0 for the same behavior. #19296 by Thomas Fan.
sklearn.feature_extraction
- Fix Fixed a bug in feature_extraction.text.HashingVectorizer where some input strings would result in negative indices in the transformed data. #19035 by Liu Yu.
- Fix Fixed a bug in feature_extraction.DictVectorizer by raising an error with unsupported value type. #19520 by Jeff Zhao.
- Fix Fixed a bug in feature_extraction.image.img_to_graph and feature_extraction.image.grid_to_graph where singleton connected components were not handled properly, resulting in a wrong vertex indexing. #18964 by Bertrand Thirion.
- Fix Raise a warning in feature_extraction.text.CountVectorizer with lowercase=True when there are vocabulary entries with uppercase characters to avoid silent misses in the resulting feature vectors. #19401 by Zito Relova.
sklearn.feature_selection
- Feature feature_selection.r_regression computes Pearson’s R correlation coefficients between the features and the target. #17169 by Dmytro Lituiev and Julien Jerphanion.
- Enhancement feature_selection.RFE.fit accepts additional estimator parameters that are passed directly to the estimator’s fit method. #20380 by Iván Pulido, Felipe Bidu, Gil Rutter, and Adrin Jalali.
- Fix Fix a bug in isotonic.isotonic_regression where the sample_weight passed by a user was overwritten during fit. #20515 by Carsten Allefeld.
- Fix Change feature_selection.SequentialFeatureSelector to allow for unsupervised modelling so that the fit signature need not do any y validation and allow for y=None. #19568 by Shyam Desai.
- API Change Raises an error in feature_selection.VarianceThreshold when the variance threshold is negative. #20207 by Tomohiro Endo.
- API Change Deprecates grid_scores_ in favor of split scores in cv_results_ in feature_selection.RFECV. grid_scores_ will be removed in version 1.2. #20161 by Shuhei Kayawari and @arka204.
sklearn.inspection
- Enhancement Add max_samples parameter in inspection.permutation_importance. It allows drawing a subset of the samples to compute the permutation importance. This is useful to keep the method tractable when evaluating feature importance on large datasets. #20431 by Oliver Pfaffel.
- Enhancement Add kwargs to format ICE and PD lines separately in partial dependence plots inspection.plot_partial_dependence and inspection.PartialDependenceDisplay.plot. #19428 by Mehdi Hamoumi.
- Fix Allow multiple scorers input to inspection.permutation_importance. #19411 by Simona Maggio.
- API Change inspection.PartialDependenceDisplay exposes a class method: from_estimator. inspection.plot_partial_dependence is deprecated in favor of the class method and will be removed in 1.2. #20959 by Thomas Fan.
sklearn.kernel_approximation
- Fix Fix a bug in kernel_approximation.Nystroem where the attribute component_indices_ did not correspond to the subset of sample indices used to generate the approximated kernel. #20554 by Xiangyin Kong.
sklearn.linear_model
- Feature Added linear_model.QuantileRegressor which implements linear quantile regression with L1 penalty. #9978 by David Dale and Christian Lorentzen.
- Feature The new linear_model.SGDOneClassSVM provides an SGD implementation of the linear One-Class SVM. Combined with kernel approximation techniques, this implementation approximates the solution of a kernelized One Class SVM while benefitting from a linear complexity in the number of samples. #10027 by Albert Thomas.
- Feature Added sample_weight parameter to linear_model.LassoCV and linear_model.ElasticNetCV. #16449 by Christian Lorentzen.
- Feature Added new solver lbfgs (available with solver="lbfgs") and positive argument to linear_model.Ridge. When positive is set to True, forces the coefficients to be positive (only supported by lbfgs). #20231 by Toshihiro Nakae.
- Efficiency The implementation of linear_model.LogisticRegression has been optimised for dense matrices when using solver='newton-cg' and multi_class!='multinomial'. #19571 by Julien Jerphanion.
- Enhancement fit method preserves dtype for numpy.float32 in linear_model.Lars, linear_model.LassoLars, linear_model.LarsCV and linear_model.LassoLarsCV. #20155 by Takeshi Oura.
- Enhancement Validate user-supplied gram matrix passed to linear models via the precompute argument. #19004 by Adam Midvidy.
- Fix linear_model.ElasticNet.fit no longer modifies sample_weight in place. #19055 by Thomas Fan.
- Fix linear_model.Lasso and linear_model.ElasticNet no longer have a dual_gap_ not corresponding to their objective. #19172 by Mathurin Massias.
- Fix sample_weight are now fully taken into account in linear models when normalize=True for both feature centering and feature scaling. #19426 by Alexandre Gramfort and Maria Telenczuk.
- Fix Points with residuals equal to residual_threshold are now considered as inliers for linear_model.RANSACRegressor. This allows fitting a model perfectly on some datasets when residual_threshold=0. #19499 by Gregory Strubel.
- Fix Sample weight invariance for linear_model.Ridge was fixed in #19616 by Olivier Grisel and Christian Lorentzen.
- Fix The dictionary params in linear_model.enet_path and linear_model.lasso_path should only contain parameters of the coordinate descent solver. Otherwise, an error will be raised. #19391 by Shao Yang Hong.
- API Change Raise a warning in linear_model.RANSACRegressor that from version 1.2, min_samples needs to be set explicitly for models other than linear_model.LinearRegression. #19390 by Shao Yang Hong.
- API Change The parameter normalize of linear_model.LinearRegression is deprecated and will be removed in 1.2. Motivation for this deprecation: the normalize parameter had no effect when fit_intercept was set to False and was therefore deemed confusing. The behavior of the deprecated LinearModel(normalize=True) can be reproduced with a Pipeline with LinearModel (where LinearModel is LinearRegression, Ridge, RidgeClassifier, RidgeCV or RidgeClassifierCV) as follows: make_pipeline(StandardScaler(with_mean=False), LinearModel()). The normalize parameter in LinearRegression was deprecated in #17743 by Maria Telenczuk and Alexandre Gramfort. Same for Ridge, RidgeClassifier, RidgeCV, and RidgeClassifierCV, in: #17772 by Maria Telenczuk and Alexandre Gramfort. Same for BayesianRidge, ARDRegression in: #17746 by Maria Telenczuk. Same for Lasso, LassoCV, ElasticNet, ElasticNetCV, MultiTaskLasso, MultiTaskLassoCV, MultiTaskElasticNet, MultiTaskElasticNetCV, in: #17785 by Maria Telenczuk and Alexandre Gramfort.
- API Change The normalize parameter of OrthogonalMatchingPursuit and OrthogonalMatchingPursuitCV will default to False in 1.2 and will be removed in 1.4. #17750 by Maria Telenczuk and Alexandre Gramfort. Same for Lars, LarsCV, LassoLars, LassoLarsCV and LassoLarsIC, in #17769 by Maria Telenczuk and Alexandre Gramfort.
- API Change Keyword validation has moved from __init__ and set_params to fit for the following estimators conforming to scikit-learn’s conventions: SGDClassifier, SGDRegressor, SGDOneClassSVM, PassiveAggressiveClassifier, and PassiveAggressiveRegressor. #20683 by Guillaume Lemaitre.
sklearn.manifold
- Enhancement Implement 'auto' heuristic for the learning_rate in manifold.TSNE. It will become default in 1.2. The default initialization will change to pca in 1.2. PCA initialization will be scaled to have standard deviation 1e-4 in 1.2. #19491 by Dmitry Kobak.
- Fix Change numerical precision to prevent underflow issues during affinity matrix computation for manifold.TSNE. #19472 by Dmitry Kobak.
- Fix manifold.Isomap now uses scipy.sparse.csgraph.shortest_path to compute the graph shortest path. It also connects disconnected components of the neighbors graph along some minimum distance pairs, instead of changing every infinite distance to zero. #20531 by Roman Yurchak and Tom Dupre la Tour.
- Fix Decrease the numerical default tolerance in the lobpcg call in manifold.spectral_embedding to prevent numerical instability. #21194 by Andrew Knyazev.
sklearn.metrics
- Feature metrics.mean_pinball_loss exposes the pinball loss for quantile regression. #19415 by Xavier Dupré and Olivier Grisel.
- Feature metrics.d2_tweedie_score calculates the D^2 regression score for Tweedie deviances with power parameter power. This is a generalization of the r2_score and can be interpreted as percentage of Tweedie deviance explained. #17036 by Christian Lorentzen.
- Feature metrics.mean_squared_log_error now supports squared=False. #20326 by Uttam kumar.
- Efficiency Improved speed of metrics.confusion_matrix when labels are integral. #9843 by Jon Crall.
- Enhancement metrics.hinge_loss now raises an error when pred_decision is 1d in a multiclass classification setting or when the pred_decision parameter is not consistent with the labels parameter. #19643 by Pierre Attard.
- Fix metrics.ConfusionMatrixDisplay.plot uses the correct max for colormap. #19784 by Thomas Fan.
- Fix Samples with zero sample_weight values do not affect the results from metrics.det_curve, metrics.precision_recall_curve and metrics.roc_curve. #18328 by Albert Villanova del Moral and Alonso Silva Allende.
- Fix Avoid overflow in metrics.cluster.adjusted_rand_score with large amounts of data. #20312 by Divyanshu Deoli.
- API Change metrics.ConfusionMatrixDisplay exposes two class methods, from_estimator and from_predictions, which create a confusion matrix plot from an estimator or from predictions. metrics.plot_confusion_matrix is deprecated in favor of these two class methods and will be removed in 1.2. #18543 by Guillaume Lemaitre.
- API Change metrics.PrecisionRecallDisplay exposes two class methods, from_estimator and from_predictions, which create a precision-recall curve from an estimator or from predictions. metrics.plot_precision_recall_curve is deprecated in favor of these two class methods and will be removed in 1.2. #20552 by Guillaume Lemaitre.
- API Change metrics.DetCurveDisplay exposes two class methods, from_estimator and from_predictions, which create a DET curve plot from an estimator or from predictions. metrics.plot_det_curve is deprecated in favor of these two class methods and will be removed in 1.2. #19278 by Guillaume Lemaitre.
sklearn.mixture
- Fix Ensure that the best parameters are set appropriately in the case of divergence for mixture.GaussianMixture and mixture.BayesianGaussianMixture. #20030 by Tingshan Liu and Benjamin Pedigo.
sklearn.model_selection
- Feature Added model_selection.StratifiedGroupKFold, which combines model_selection.StratifiedKFold and model_selection.GroupKFold, providing the ability to split data while preserving the distribution of classes in each split and keeping each group within a single split. #18649 by Leandro Hermida and Rodion Martynov.
- Enhancement Warn only once in the main process for per-split fit failures in cross-validation. #20619 by Loïc Estève.
- Enhancement The model_selection.BaseShuffleSplit base class is now public. #20056 by @pabloduque0.
- Fix Avoid premature overflow in model_selection.train_test_split. #20904 by Tomasz Jakubek.
sklearn.naive_bayes
- Fix The fit and partial_fit methods of the discrete naive Bayes classifiers (naive_bayes.BernoulliNB, naive_bayes.CategoricalNB, naive_bayes.ComplementNB, and naive_bayes.MultinomialNB) now correctly handle the degenerate case of a single class in the training set. #18925 by David Poznik.
- API Change The attribute sigma_ is now deprecated in naive_bayes.GaussianNB and will be removed in 1.2. Use var_ instead. #18842 by Hong Shao Yang.
sklearn.neighbors
- Enhancement The creation of neighbors.KDTree and neighbors.BallTree has been improved for their worst-case time complexity from O(n^2) to O(n). #19473 by jiefangxuanyan and Julien Jerphanion.
- Fix neighbors.DistanceMetric subclasses now support readonly memory-mapped datasets. #19883 by Julien Jerphanion.
- Fix neighbors.NearestNeighbors, neighbors.KNeighborsClassifier, neighbors.RadiusNeighborsClassifier, neighbors.KNeighborsRegressor and neighbors.RadiusNeighborsRegressor no longer validate weights in __init__; they validate weights in fit instead. #20072 by Juan Carlos Alfaro Jiménez.
- API Change The parameter kwargs of neighbors.RadiusNeighborsClassifier is deprecated and will be removed in 1.2. #20842 by Juan Martín Loyola.
sklearn.neural_network
- Fix neural_network.MLPClassifier and neural_network.MLPRegressor now correctly support continued training when loading from a pickled file. #19631 by Thomas Fan.
sklearn.pipeline
- API Change The predict_proba and predict_log_proba methods of the pipeline.Pipeline now support passing prediction kwargs to the final estimator. #19790 by Christopher Flynn.
sklearn.preprocessing
- Feature The new preprocessing.SplineTransformer is a feature preprocessing tool for the generation of B-splines, parametrized by the polynomial degree of the splines, number of knots n_knots and knot positioning strategy knots. #18368 by Christian Lorentzen. preprocessing.SplineTransformer also supports periodic splines via the extrapolation argument. #19483 by Malte Londschien. preprocessing.SplineTransformer supports sample weights for knot position strategy "quantile". #20526 by Malte Londschien.
- Feature preprocessing.OrdinalEncoder supports passing through missing values by default. #19069 by Thomas Fan.
- Feature preprocessing.OneHotEncoder now supports handle_unknown='ignore' and dropping categories. #19041 by Thomas Fan.
- Feature preprocessing.PolynomialFeatures now supports passing a tuple to degree, i.e. degree=(min_degree, max_degree). #20250 by Christian Lorentzen.
- Efficiency preprocessing.StandardScaler is faster and more memory efficient. #20652 by Thomas Fan.
- Efficiency Changed algorithm argument for cluster.KMeans in preprocessing.KBinsDiscretizer from auto to full. #19934 by Gleb Levitskiy.
- Efficiency The implementation of fit for preprocessing.PolynomialFeatures transformer is now faster. This is especially noticeable on large sparse input. #19734 by Fred Robinson.
- Fix The preprocessing.StandardScaler.inverse_transform method now raises error when the input data is 1D. #19752 by Zhehao Liu.
- Fix preprocessing.scale, preprocessing.StandardScaler and similar scalers detect near-constant features to avoid scaling them to very large values. This problem happens in particular when using a scaler on sparse data with a constant column with sample weights, in which case centering is typically disabled. #19527 by Olivier Grisel and Maria Telenczuk and #19788 by Jérémie du Boisberranger.
- Fix preprocessing.StandardScaler.inverse_transform now correctly handles integer dtypes. #19356 by @makoeppel.
- Fix preprocessing.OrdinalEncoder.inverse_transform does not support sparse matrices and now raises the appropriate error message. #19879 by Guillaume Lemaitre.
- Fix The fit method of preprocessing.OrdinalEncoder will not raise error when handle_unknown='ignore' and unknown categories are given to fit. #19906 by Zhehao Liu.
- Fix Fix a regression in preprocessing.OrdinalEncoder where large Python numerics would raise an error due to overflow when cast to a C type (np.float64 or np.int64). #20727 by Guillaume Lemaitre.
- Fix preprocessing.FunctionTransformer does not set n_features_in_ based on the input to inverse_transform. #20961 by Thomas Fan.
- API Change The n_input_features_ attribute of preprocessing.PolynomialFeatures is deprecated in favor of n_features_in_ and will be removed in 1.2. #20240 by Jérémie du Boisberranger.
sklearn.svm
- API Change The parameter **params of svm.OneClassSVM.fit is deprecated and will be removed in 1.2. #20843 by Juan Martín Loyola.
sklearn.tree
- Enhancement Add fontname argument in tree.export_graphviz for non-English characters. #18959 by Zero and wstates.
- Fix Improves compatibility of tree.plot_tree with high DPI screens. #20023 by Thomas Fan.
- Fix Fixed a bug in tree.DecisionTreeClassifier, tree.DecisionTreeRegressor where a node could be split whereas it should not have been due to incorrect handling of rounding errors. #19336 by Jérémie du Boisberranger.
- API Change The n_features_ attribute of tree.DecisionTreeClassifier, tree.DecisionTreeRegressor, tree.ExtraTreeClassifier and tree.ExtraTreeRegressor is deprecated in favor of n_features_in_ and will be removed in 1.2. #20272 by Jérémie du Boisberranger.
sklearn.utils
- Enhancement Deprecated the default value random_state=0 in randomized_svd. Starting in 1.2, the default value of random_state will be set to None. #19459 by Cindy Bezuidenhout and Clifford Akai-Nettey.
- Enhancement Added helper decorator utils.metaestimators.available_if to provide flexibility in metaestimators making methods available or unavailable on the basis of state, in a more readable way. #19948 by Joel Nothman.
- Enhancement utils.validation.check_is_fitted now uses __sklearn_is_fitted__ if available, instead of checking for attributes ending with an underscore. This also makes pipeline.Pipeline and preprocessing.FunctionTransformer pass check_is_fitted(estimator). #20657 by Adrin Jalali.
- Fix Fixed a bug in utils.sparsefuncs.mean_variance_axis where the precision of the computed variance was very poor when the real variance is exactly zero. #19766 by Jérémie du Boisberranger.
- Fix The docstrings of properties that are decorated with utils.deprecated are now properly wrapped. #20385 by Thomas Fan.
- Fix utils.stats._weighted_percentile now correctly ignores zero-weighted observations smaller than the smallest observation with positive weight for percentile=0. Affected classes are dummy.DummyRegressor for quantile=0 and ensemble.HuberLossFunction for alpha=0. #20528 by Malte Londschien.
- Fix utils._safe_indexing explicitly takes a dataframe copy when integer indices are provided, which avoids raising a warning from pandas. This warning was previously raised in resampling utilities and functions using those utilities (e.g. model_selection.train_test_split, model_selection.cross_validate, model_selection.cross_val_score, model_selection.cross_val_predict). #20673 by Joris Van den Bossche.
- Fix Fixed a regression in utils.is_scalar_nan where large Python numbers would raise an error due to overflow in C types (np.float64 or np.int64). #20727 by Guillaume Lemaitre.
- Fix Support for np.matrix is deprecated in check_array in 1.0 and will raise a TypeError in 1.2. #20165 by Thomas Fan.
- API Change utils._testing.assert_warns and utils._testing.assert_warns_message are deprecated in 1.0 and will be removed in 1.2. Use the pytest.warns context manager instead. Note that these functions were not documented and were not part of the public API. #20521 by Olivier Grisel.
- API Change Fixed several bugs in utils.graph.graph_shortest_path, which is now deprecated. Use scipy.sparse.csgraph.shortest_path instead. #20531 by Tom Dupre la Tour.
scikit-learn 1.0 Now Available
scikit-learn is an open source machine learning library that supports supervised and unsupervised learning, and is used by an estimated 80% of data scientists, according to a recent Kaggle survey.
The library contains implementations of many common ML algorithms and models, including the widely-used linear regression, decision tree, and gradient-boosting algorithms. It also provides various tools for model fitting, data preprocessing, model selection and evaluation, and many other utilities.
This release includes some new key features as well as many improvements and bug fixes. Highlights include:
- Keyword and positional arguments
- Spline Transformers
- Quantile Regressor
- Feature Names Support
- A more flexible plotting API
- Online One-Class SVM
- Histogram-based Gradient Boosting Models are now stable
- New documentation improvements
For more details on the main highlights of the release, please refer to Release Highlights for scikit-learn 1.0.
To install the latest version (with pip):
pip install --upgrade scikit-learn
or with conda:
conda install -c conda-forge scikit-learn
Version 1.0.0
For a short description of the main highlights of the release, please refer to Release Highlights for scikit-learn 1.0.
- Major Feature : something big that you couldn’t do before.
- Feature : something that you couldn’t do before.
- Efficiency : an existing feature now may not require as much computation or memory.
- Enhancement : a miscellaneous minor improvement.
- Fix : something that previously didn’t work as documentated – or according to reasonable expectations – should now work.
- API Change : you will need to change your code to have the same effect in the future; or a feature will be removed in the future.
Version 1.0.0 of scikit-learn requires python 3.7+, numpy 1.14.6+ and scipy 1.1.0+. Optional minimal dependency is matplotlib 2.2.2+.
Enforcing keyword-only arguments
In an effort to promote clear and non-ambiguous use of the library, most constructor and function parameters must now be passed as keyword arguments (i.e. using the param=value syntax) instead of positional. If a keyword-only parameter is used as positional, a TypeError is now raised.
The following estimators and functions, when fit with the same data and parameters, may produce different models from the previous version. This often occurs due to changes in the modelling logic (bug fixes or enhancements), or in random sampling procedures.
- Fix manifold.TSNE now avoids numerical underflow issues during affinity matrix computation.
- Fix manifold.Isomap now connects disconnected components of the neighbors graph along some minimum distance pairs, instead of changing every infinite distances to zero.
- Fix The splitting criterion of tree.DecisionTreeClassifier and tree.DecisionTreeRegressor can be impacted by a fix in the handling of rounding errors. Previously some extra spurious splits could occur.
Details are listed in the changelog below.
(While we are trying to better inform users by providing this information, we cannot assure that this list is complete.)
Changelog
- API Change The option for using the squared error via loss and criterion parameters was made more consistent. The preferred way is by setting the value to "squared_error". Old option names are still valid, produce the same models, but are deprecated and will be removed in version 1.2. #19310 by Christian Lorentzen.
- For ensemble.ExtraTreesRegressor, criterion="mse" is deprecated, use "squared_error" instead which is now the default.
- For ensemble.GradientBoostingRegressor, loss="ls" is deprecated, use "squared_error" instead which is now the default.
- For ensemble.RandomForestRegressor, criterion="mse" is deprecated, use "squared_error" instead which is now the default.
- For ensemble.HistGradientBoostingRegressor, loss="least_squares" is deprecated, use "squared_error" instead which is now the default.
- For linear_model.RANSACRegressor, loss="squared_loss" is deprecated, use "squared_error" instead.
- For linear_model.SGDRegressor, loss="squared_loss" is deprecated, use "squared_error" instead which is now the default.
- For tree.DecisionTreeRegressor, criterion="mse" is deprecated, use "squared_error" instead which is now the default.
- For tree.ExtraTreeRegressor, criterion="mse" is deprecated, use "squared_error" instead which is now the default.
- API Change The option for using the absolute error via loss and criterion parameters was made more consistent. The preferred way is by setting the value to "absolute_error". Old option names are still valid, produce the same models, but are deprecated and will be removed in version 1.2. #19733 by Christian Lorentzen.
- For ensemble.ExtraTreesRegressor, criterion="mae" is deprecated, use "absolute_error" instead.
- For ensemble.GradientBoostingRegressor, loss="lad" is deprecated, use "absolute_error" instead.
- For ensemble.RandomForestRegressor, criterion="mae" is deprecated, use "absolute_error" instead.
- For ensemble.HistGradientBoostingRegressor, loss="least_absolute_deviation" is deprecated, use "absolute_error" instead.
- For linear_model.RANSACRegressor, loss="absolute_loss" is deprecated, use "absolute_error" instead which is now the default.
- For tree.DecisionTreeRegressor, criterion="mae" is deprecated, use "absolute_error" instead.
- For tree.ExtraTreeRegressor, criterion="mae" is deprecated, use "absolute_error" instead.
- API Change np.matrix usage is deprecated in 1.0 and will raise a TypeError in 1.2. #20165 by Thomas Fan.
- API Change get_feature_names_out has been added to the transformer API to get the names of the output features. get_feature_names has in turn been deprecated. #18444 by Thomas Fan.
- API Change All estimators store feature_names_in_ when fitted on pandas Dataframes. These feature names are compared to names seen in non-fit methods, e.g. transform and will raise a FutureWarning if they are not consistent. These FutureWarning s will become ValueError s in 1.2. #18010 by Thomas Fan.
sklearn.base
- Fix config_context is now threadsafe. #18736 by Thomas Fan.
sklearn.calibration
- Feature calibration.CalibrationDisplay added to plot calibration curves. #17443 by Lucy Liu.
- Fix The predict and predict_proba methods of calibration.CalibratedClassifierCV can now properly be used on prefitted pipelines. #19641 by Alek Lefebvre.
- Fix Fixed an error when using a ensemble.VotingClassifier as base_estimator in calibration.CalibratedClassifierCV. #20087 by Clément Fauchereau.
sklearn.cluster
- Efficiency The "k-means++" initialization of cluster.KMeans and cluster.MiniBatchKMeans is now faster, especially in multicore settings. #19002 by Jon Crall and Jérémie du Boisberranger.
- Efficiency cluster.KMeans with algorithm='elkan' is now faster in multicore settings. #19052 by Yusuke Nagasaka.
- Efficiency cluster.MiniBatchKMeans is now faster in multicore settings. #17622 by Jérémie du Boisberranger.
- Efficiency cluster.OPTICS can now cache the output of the computation of the tree, using the memory parameter. #19024 by Frankie Robertson.
- Enhancement The predict and fit_predict methods of cluster.AffinityPropagation now accept sparse data type for input data. #20117 by Venkatachalam Natchiappan
- Fix Fixed a bug in cluster.MiniBatchKMeans where the sample weights were partially ignored when the input is sparse. #17622 by Jérémie du Boisberranger.
- Fix Improved convergence detection based on center change in cluster.MiniBatchKMeans which was almost never achievable. #17622 by Jérémie du Boisberranger.
- Fix cluster.AgglomerativeClustering now supports readonly memory-mapped datasets. #19883 by Julien Jerphanion.
- Fix cluster.AgglomerativeClustering correctly connects components when connectivity and affinity are both precomputed and the number of connected components is greater than 1. #20597 by Thomas Fan.
- Fix cluster.FeatureAgglomeration does not accept a **params kwarg in the fit function anymore, resulting in a more concise error message. #20899 by Adam Li.
- Fix Fixed a bug in cluster.KMeans, ensuring reproducibility and equivalence between sparse and dense input. #20200 by Jérémie du Boisberranger.
- API Change cluster.Birch attributes, fit_ and partial_fit_, are deprecated and will be removed in 1.2. #19297 by Thomas Fan.
- API Change the default value for the batch_size parameter of cluster.MiniBatchKMeans was changed from 100 to 1024 due to efficiency reasons. The n_iter_ attribute of cluster.MiniBatchKMeans now reports the number of started epochs and the n_steps_ attribute reports the number of mini batches processed. #17622 by Jérémie du Boisberranger.
- API Change cluster.spectral_clustering raises an improved error when passed a np.matrix. #20560 by Thomas Fan.
sklearn.compose
- Enhancement compose.ColumnTransformer now records the output of each transformer in output_indices_. #18393 by Luca Bittarello.
- Enhancement compose.ColumnTransformer now allows DataFrame input to have its columns appear in a changed order in transform. Further, columns that are dropped will not be required in transform, and additional columns will be ignored if remainder='drop'. #19263 by Thomas Fan.
- Enhancement Adds **predict_params keyword argument to compose.TransformedTargetRegressor.predict that passes keyword argument to the regressor. #19244 by Ricardo.
- Fix compose.ColumnTransformer.get_feature_names supports non-string feature names returned by any of its transformers. However, note that get_feature_names is deprecated, use get_feature_names_out instead. #18459 by Albert Villanova del Moral and Alonso Silva Allende.
- Fix compose.TransformedTargetRegressor now takes nD targets with an adequate transformer. #18898 by Oras Phongpanagnam.
- API Change Adds verbose_feature_names_out to compose.ColumnTransformer. This flag controls the prefixing of feature names out in get_feature_names_out. #18444 and #21080 by Thomas Fan.
sklearn.covariance
- Fix Adds arrays check to covariance.ledoit_wolf and covariance.ledoit_wolf_shrinkage. #20416 by Hugo Defois.
- API Change Deprecates the following keys in cv_results_: 'mean_score', 'std_score', and 'split(k)_score' in favor of 'mean_test_score' 'std_test_score', and 'split(k)_test_score'. #20583 by Thomas Fan.
sklearn.datasets
- Enhancement datasets.fetch_openml now supports categories with missing values when returning a pandas dataframe. #19365 by Thomas Fan and Amanda Dsouza and EL-ATEIF Sara.
- Enhancement datasets.fetch_kddcup99 raises a better message when the cached file is invalid. #19669 Thomas Fan.
- Enhancement Replace usages of __file__ related to resource file I/O with importlib.resources to avoid the assumption that these resource files (e.g. iris.csv) already exist on a filesystem, and by extension to enable compatibility with tools such as PyOxidizer. #20297 by Jack Liu.
- Fix Shorten data file names in the openml tests to better support installing on Windows and its default 260 character limit on file names. #20209 by Thomas Fan.
- Fix datasets.fetch_kddcup99 returns dataframes when return_X_y=True and as_frame=True. #19011 by Thomas Fan.
- API Change Deprecates datasets.load_boston in 1.0 and it will be removed in 1.2. Alternative code snippets to load similar datasets are provided. Please report to the docstring of the function for details. #20729 by Guillaume Lemaitre.
sklearn.decomposition
- Enhancement added a new approximate solver (randomized SVD, available with eigen_solver='randomized') to decomposition.KernelPCA. This significantly accelerates computation when the number of samples is much larger than the desired number of components. #12069 by Sylvain Marié.
- Fix Fixes incorrect multiple data-conversion warnings when clustering boolean data. #19046 by Surya Prakash.
- Fix Fixed dict_learning, used by decomposition.DictionaryLearning, to ensure determinism of the output. Achieved by flipping signs of the SVD output which is used to initialize the code. #18433 by Bruno Charron.
- Fix Fixed a bug in decomposition.MiniBatchDictionaryLearning, decomposition.MiniBatchSparsePCA and decomposition.dict_learning_online where the update of the dictionary was incorrect. #19198 by Jérémie du Boisberranger.
- Fix Fixed a bug in decomposition.DictionaryLearning, decomposition.SparsePCA, decomposition.MiniBatchDictionaryLearning, decomposition.MiniBatchSparsePCA, decomposition.dict_learning and decomposition.dict_learning_online where the restart of unused atoms during the dictionary update was not working as expected. #19198 by Jérémie du Boisberranger.
- API Change In decomposition.DictionaryLearning, decomposition.MiniBatchDictionaryLearning, decomposition.dict_learning and decomposition.dict_learning_online, transform_alpha will be equal to alpha instead of 1.0 by default starting from version 1.2 #19159 by Benoît Malézieux.
- API Change Rename variable names in KernelPCA to improve readability. lambdas_ and alphas_ are renamed to eigenvalues_ and eigenvectors_, respectively. lambdas_ and alphas_ are deprecated and will be removed in 1.2. #19908 by Kei Ishikawa.
- API Change The alpha and regularization parameters of decomposition.NMF and decomposition.non_negative_factorization are deprecated and will be removed in 1.2. Use the new parameters alpha_W and alpha_H instead. #20512 by Jérémie du Boisberranger.
sklearn.dummy
- API Change Attribute n_features_in_ in dummy.DummyRegressor and dummy.DummyRegressor is deprecated and will be removed in 1.2. #20960 by Thomas Fan.
sklearn.ensemble
- Enhancement HistGradientBoostingClassifier and HistGradientBoostingRegressor take cgroups quotas into account when deciding the number of threads used by OpenMP. This avoids performance problems caused by over-subscription when using those classes in a docker container for instance. #20477 by Thomas Fan.
- Enhancement HistGradientBoostingClassifier and HistGradientBoostingRegressor are no longer experimental. They are now considered stable and are subject to the same deprecation cycles as all other estimators. #19799 by Nicolas Hug.
- Enhancement Improve the HTML rendering of the ensemble.StackingClassifier and ensemble.StackingRegressor. #19564 by Thomas Fan.
- Enhancement Added Poisson criterion to ensemble.RandomForestRegressor. #19836 by Brian Sun.
- Fix Do not allow to compute out-of-bag (OOB) score in ensemble.RandomForestClassifier and ensemble.ExtraTreesClassifier with multiclass-multioutput target since scikit-learn does not provide any metric supporting this type of target. Additional private refactoring was performed. #19162 by Guillaume Lemaitre.
- Fix Improve numerical precision for weights boosting in ensemble.AdaBoostClassifier and ensemble.AdaBoostRegressor to avoid underflows. #10096 by Fenil Suchak.
- Fix Fixed the range of the argument max_samples to be (0.0, 1.0] in ensemble.RandomForestClassifier, ensemble.RandomForestRegressor, where max_samples=1.0 is interpreted as using all n_samples for bootstrapping. #20159 by @murata-yu.
- Fix Fixed a bug in ensemble.AdaBoostClassifier and ensemble.AdaBoostRegressor where the sample_weight parameter got overwritten during fit. #20534 by Guillaume Lemaitre.
- API Change Removes tol=None option in ensemble.HistGradientBoostingClassifier and ensemble.HistGradientBoostingRegressor. Please use tol=0 for the same behavior. #19296 by Thomas Fan.
sklearn.feature_extraction
- Fix Fixed a bug in feature_extraction.text.HashingVectorizer where some input strings would result in negative indices in the transformed data. #19035 by Liu Yu.
- Fix Fixed a bug in feature_extraction.DictVectorizer by raising an error with unsupported value type. #19520 by Jeff Zhao.
- Fix Fixed a bug in feature_extraction.image.img_to_graph and feature_extraction.image.grid_to_graph where singleton connected components were not handled properly, resulting in a wrong vertex indexing. #18964 by Bertrand Thirion.
- Fix Raise a warning in feature_extraction.text.CountVectorizer with lowercase=True when there are vocabulary entries with uppercase characters to avoid silent misses in the resulting feature vectors. #19401 by Zito Relova
sklearn.feature_selection
- Feature feature_selection.r_regression computes Pearson’s R correlation coefficients between the features and the target. #17169 by Dmytro Lituiev and Julien Jerphanion.
- Enhancement feature_selection.RFE.fit accepts additional estimator parameters that are passed directly to the estimator’s fit method. #20380 by Iván Pulido, Felipe Bidu, Gil Rutter, and Adrin Jalali.
- Fix Fix a bug in isotonic.isotonic_regression where the sample_weight passed by a user were overwritten during fit. #20515 by Carsten Allefeld.
- Fix Change feature_selection.SequentialFeatureSelector to allow for unsupervised modelling so that the fit signature need not do any y validation and allow for y=None. #19568 by Shyam Desai.
- API Change Raises an error in feature_selection.VarianceThreshold when the variance threshold is negative. #20207 by Tomohiro Endo
- API Change Deprecates grid_scores_ in favor of split scores in cv_results_ in feature_selection.RFECV. grid_scores_ will be removed in version 1.2. #20161 by Shuhei Kayawari and @arka204.
sklearn.inspection
- Enhancement Add max_samples parameter in inspection.permutation_importance. It enables to draw a subset of the samples to compute the permutation importance. This is useful to keep the method tractable when evaluating feature importance on large datasets. #20431 by Oliver Pfaffel.
- Enhancement Add kwargs to format ICE and PD lines separately in partial dependence plots inspection.plot_partial_dependence and inspection.PartialDependenceDisplay.plot. #19428 by Mehdi Hamoumi.
- Fix Allow multiple scorers input to inspection.permutation_importance. #19411 by Simona Maggio.
- API Change inspection.PartialDependenceDisplay exposes a class method: from_estimator. inspection.plot_partial_dependence is deprecated in favor of the class method and will be removed in 1.2. #20959 by Thomas Fan.
sklearn.kernel_approximation
- Fix Fix a bug in kernel_approximation.Nystroem where the attribute component_indices_ did not correspond to the subset of sample indices used to generate the approximated kernel. #20554 by Xiangyin Kong.
sklearn.linear_model
- Feature Added linear_model.QuantileRegressor which implements linear quantile regression with L1 penalty. #9978 by David Dale and Christian Lorentzen.
- Feature The new linear_model.SGDOneClassSVM provides an SGD implementation of the linear One-Class SVM. Combined with kernel approximation techniques, this implementation approximates the solution of a kernelized One Class SVM while benefitting from a linear complexity in the number of samples. #10027 by Albert Thomas.
- Feature Added sample_weight parameter to linear_model.LassoCV and linear_model.ElasticNetCV. #16449 by Christian Lorentzen.
- Feature Added new solver lbfgs (available with solver="lbfgs") and positive argument to linear_model.Ridge. When positive is set to True, forces the coefficients to be positive (only supported by lbfgs). #20231 by Toshihiro Nakae.
- Efficiency The implementation of linear_model.LogisticRegression has been optimised for dense matrices when using solver='newton-cg' and multi_class!='multinomial'. #19571 by Julien Jerphanion.
- Enhancement fit method preserves dtype for numpy.float32 in linear_model.Lars, linear_model.LassoLars, linear_model.LassoLars, linear_model.LarsCV and linear_model.LassoLarsCV. #20155 by Takeshi Oura.
- Enhancement Validate user-supplied gram matrix passed to linear models via the precompute argument. #19004 by Adam Midvidy.
- Fix linear_model.ElasticNet.fit no longer modifies sample_weight in place. #19055 by Thomas Fan.
- Fix linear_model.Lasso and linear_model.ElasticNet no longer have a dual_gap_ not corresponding to their objective. #19172 by Mathurin Massias
- Fix sample_weight are now fully taken into account in linear models when normalize=True for both feature centering and feature scaling. #19426 by Alexandre Gramfort and Maria Telenczuk.
- Fix Points with residuals equal to residual_threshold are now considered as inliers for linear_model.RANSACRegressor. This allows fitting a model perfectly on some datasets when residual_threshold=0. #19499 by Gregory Strubel.
- Fix Sample weight invariance for linear_model.Ridge was fixed in #19616 by Oliver Grisel and Christian Lorentzen.
- Fix The dictionary params in linear_model.enet_path and linear_model.lasso_path should only contain parameter of the coordinate descent solver. Otherwise, an error will be raised. #19391 by Shao Yang Hong.
- API Change Raise a warning in linear_model.RANSACRegressor that from version 1.2, min_samples need to be set explicitly for models other than linear_model.LinearRegression. #19390 by Shao Yang Hong.
- API Change : The parameter normalize of linear_model.LinearRegression is deprecated and will be removed in 1.2. Motivation for this deprecation: normalize parameter did not take any effect if fit_intercept was set to False and therefore was deemed confusing. The behavior of the deprecated LinearModel(normalize=True) can be reproduced with a Pipeline with LinearModel (where LinearModel is LinearRegression, Ridge, RidgeClassifier, RidgeCV or RidgeClassifierCV) as follows: make_pipeline(StandardScaler(with_mean=False), LinearModel()). The normalize parameter in LinearRegression was deprecated in #17743 by Maria Telenczuk and Alexandre Gramfort. Same for Ridge, RidgeClassifier, RidgeCV, and RidgeClassifierCV, in: #17772 by Maria Telenczuk and Alexandre Gramfort. Same for BayesianRidge, ARDRegression in: #17746 by Maria Telenczuk. Same for Lasso, LassoCV, ElasticNet, ElasticNetCV, MultiTaskLasso, MultiTaskLassoCV, MultiTaskElasticNet, MultiTaskElasticNetCV, in: #17785 by Maria Telenczuk and Alexandre Gramfort.
- API Change The normalize parameter of OrthogonalMatchingPursuit and OrthogonalMatchingPursuitCV will default to False in 1.2 and will be removed in 1.4. #17750 by Maria Telenczuk and Alexandre Gramfort. Same for Lars LarsCV LassoLars LassoLarsCV LassoLarsIC, in #17769 by Maria Telenczuk and Alexandre Gramfort.
- API Change Keyword validation has moved from __init__ and set_params to fit for the following estimators conforming to scikit-learn’s conventions: SGDClassifier, SGDRegressor, SGDOneClassSVM, PassiveAggressiveClassifier, and PassiveAggressiveRegressor. #20683 by Guillaume Lemaitre.
sklearn.manifold
- Enhancement Implement 'auto' heuristic for the learning_rate in manifold.TSNE. It will become default in 1.2. The default initialization will change to pca in 1.2. PCA initialization will be scaled to have standard deviation 1e-4 in 1.2. #19491 by Dmitry Kobak.
- Fix Change numerical precision to prevent underflow issues during affinity matrix computation for manifold.TSNE. #19472 by Dmitry Kobak.
- Fix manifold.Isomap now uses scipy.sparse.csgraph.shortest_path to compute the graph shortest path. It also connects disconnected components of the neighbors graph along some minimum distance pairs, instead of changing every infinite distances to zero. #20531 by Roman Yurchak and Tom Dupre la Tour.
- Fix Decrease the numerical default tolerance in the lobpcg call in manifold.spectral_embedding to prevent numerical instability. #21194 by Andrew Knyazev.
sklearn.metrics
- Feature metrics.mean_pinball_loss exposes the pinball loss for quantile regression. #19415 by Xavier Dupré and Oliver Grisel.
- Feature metrics.d2_tweedie_score calculates the D^2 regression score for Tweedie deviances with power parameter power. This is a generalization of the r2_score and can be interpreted as percentage of Tweedie deviance explained. #17036 by Christian Lorentzen.
- Feature metrics.mean_squared_log_error now supports squared=False. #20326 by Uttam kumar.
- Efficiency Improved speed of metrics.confusion_matrix when labels are integral. #9843 by Jon Crall.
- Enhancement A fix to raise an error in metrics.hinge_loss when pred_decision is 1d whereas it is a multiclass classification or when pred_decision parameter is not consistent with the labels parameter. #19643 by Pierre Attard.
- Fix metrics.ConfusionMatrixDisplay.plot uses the correct max for colormap. #19784 by Thomas Fan.
- Fix Samples with zero sample_weight values do not affect the results from metrics.det_curve, metrics.precision_recall_curve and metrics.roc_curve. #18328 by Albert Villanova del Moral and Alonso Silva Allende.
- Fix avoid overflow in metrics.cluster.adjusted_rand_score with large amount of data. #20312 by Divyanshu Deoli.
- API Change metrics.ConfusionMatrixDisplay exposes two class methods from_estimator and from_predictions allowing to create a confusion matrix plot using an estimator or the predictions. metrics.plot_confusion_matrix is deprecated in favor of these two class methods and will be removed in 1.2. #18543 by Guillaume Lemaitre.
- API Change metrics.PrecisionRecallDisplay exposes two class methods from_estimator and from_predictions allowing to create a precision-recall curve using an estimator or the predictions. metrics.plot_precision_recall_curve is deprecated in favor of these two class methods and will be removed in 1.2. #20552 by Guillaume Lemaitre.
- API Change metrics.DetCurveDisplay exposes two class methods from_estimator and from_predictions allowing to create a confusion matrix plot using an estimator or the predictions. metrics.plot_det_curve is deprecated in favor of these two class methods and will be removed in 1.2. #19278 by Guillaume Lemaitre.
sklearn.mixture
- Fix Ensure that the best parameters are set appropriately in the case of divergency for mixture.GaussianMixture and mixture.BayesianGaussianMixture. #20030 by Tingshan Liu and Benjamin Pedigo.
sklearn.model_selection
- Feature added model_selection.StratifiedGroupKFold, that combines model_selection.StratifiedKFold and model_selection.GroupKFold, providing an ability to split data preserving the distribution of classes in each split while keeping each group within a single split. #18649 by Leandro Hermida and Rodion Martynov.
- Enhancement warn only once in the main process for per-split fit failures in cross-validation. #20619 by Loïc Estève
- Enhancement The model_selection.BaseShuffleSplit base class is now public. #20056 by @pabloduque0.
- Fix Avoid premature overflow in model_selection.train_test_split. #20904 by Tomasz Jakubek.
sklearn.naive_bayes
- Fix The fit and partial_fit methods of the discrete naive Bayes classifiers (naive_bayes.BernoulliNB, naive_bayes.CategoricalNB, naive_bayes.ComplementNB, and naive_bayes.MultinomialNB) now correctly handle the degenerate case of a single class in the training set. #18925 by David Poznik.
- API Change The attribute sigma_ is now deprecated in naive_bayes.GaussianNB and will be removed in 1.2. Use var_ instead. #18842 by Hong Shao Yang.
sklearn.neighbors
- Enhancement The creation of neighbors.KDTree and neighbors.BallTree has been improved for their worst-cases time complexity from O(n2) to O(n). #19473 by jiefangxuanyan and Julien Jerphanion.
- Fix neighbors.DistanceMetric subclasses now support readonly memory-mapped datasets. #19883 by Julien Jerphanion.
- Fix neighbors.NearestNeighbors, neighbors.KNeighborsClassifier, neighbors.RadiusNeighborsClassifier, neighbors.KNeighborsRegressor and neighbors.RadiusNeighborsRegressor do not validate weights in __init__ and validates weights in fit instead. #20072 by Juan Carlos Alfaro Jiménez.
- API Change The parameter kwargs of neighbors.RadiusNeighborsClassifier is deprecated and will be removed in 1.2. #20842 by Juan MartÃn Loyola.
sklearn.neural_network
- Fix neural_network.MLPClassifier and neural_network.MLPRegressor now correctly support continued training when loading from a pickled file. #19631 by Thomas Fan.
sklearn.pipeline
- API Change The predict_proba and predict_log_proba methods of the pipeline.Pipeline now support passing prediction kwargs to the final estimator. #19790 by Christopher Flynn.
sklearn.preprocessing
- Feature The new preprocessing.SplineTransformer is a feature preprocessing tool for the generation of B-splines, parametrized by the polynomial degree of the splines, number of knots n_knots and knot positioning strategy knots. #18368 by Christian Lorentzen. preprocessing.SplineTransformer also supports periodic splines via the extrapolation argument. #19483 by Malte Londschien. preprocessing.SplineTransformer supports sample weights for knot position strategy "quantile". #20526 by Malte Londschien.
- Feature preprocessing.OrdinalEncoder supports passing through missing values by default. #19069 by Thomas Fan.
- Feature preprocessing.OneHotEncoder now supports handle_unknown='ignore' and dropping categories. #19041 by Thomas Fan.
- Feature preprocessing.PolynomialFeatures now supports passing a tuple to degree, i.e. degree=(min_degree, max_degree). #20250 by Christian Lorentzen.
- Efficiency preprocessing.StandardScaler is faster and more memory efficient. #20652 by Thomas Fan.
- Efficiency Changed algorithm argument for cluster.KMeans in preprocessing.KBinsDiscretizer from auto to full. #19934 by Gleb Levitskiy.
- Efficiency The implementation of fit for preprocessing.PolynomialFeatures transformer is now faster. This is especially noticeable on large sparse input. #19734 by Fred Robinson.
- Fix The preprocessing.StandardScaler.inverse_transform method now raises error when the input data is 1D. #19752 by Zhehao Liu.
- Fix preprocessing.scale, preprocessing.StandardScaler and similar scalers detect near-constant features to avoid scaling them to very large values. This problem happens in particular when using a scaler on sparse data with a constant column with sample weights, in which case centering is typically disabled. #19527 by Oliver Grisel and Maria Telenczuk and #19788 by Jérémie du Boisberranger.
- Fix preprocessing.StandardScaler.inverse_transform now correctly handles integer dtypes. #19356 by @makoeppel.
- Fix preprocessing.OrdinalEncoder.inverse_transform does not support sparse matrices and now raises an appropriate error message. #19879 by Guillaume Lemaitre.
- Fix The fit method of preprocessing.OrdinalEncoder no longer raises an error when handle_unknown='ignore' and unknown categories are given to fit. #19906 by Zhehao Liu.
- Fix Fixed a regression in preprocessing.OrdinalEncoder where large Python numeric values would raise an error due to overflow when cast to a C type (np.float64 or np.int64). #20727 by Guillaume Lemaitre.
- Fix preprocessing.FunctionTransformer does not set n_features_in_ based on the input to inverse_transform. #20961 by Thomas Fan.
- API Change The n_input_features_ attribute of preprocessing.PolynomialFeatures is deprecated in favor of n_features_in_ and will be removed in 1.2. #20240 by Jérémie du Boisberranger.
sklearn.svm
- API Change The parameter **params of svm.OneClassSVM.fit is deprecated and will be removed in 1.2. #20843 by Juan Martín Loyola.
sklearn.tree
- Enhancement Add fontname argument in tree.export_graphviz for non-English characters. #18959 by Zero and wstates.
- Fix Improves compatibility of tree.plot_tree with high DPI screens. #20023 by Thomas Fan.
- Fix Fixed a bug in tree.DecisionTreeClassifier, tree.DecisionTreeRegressor where a node could be split whereas it should not have been due to incorrect handling of rounding errors. #19336 by Jérémie du Boisberranger.
- API Change The n_features_ attribute of tree.DecisionTreeClassifier, tree.DecisionTreeRegressor, tree.ExtraTreeClassifier and tree.ExtraTreeRegressor is deprecated in favor of n_features_in_ and will be removed in 1.2. #20272 by Jérémie du Boisberranger.
sklearn.utils
- Enhancement The default value random_state=0 in randomized_svd is deprecated. Starting in 1.2, the default value of random_state will be None. #19459 by Cindy Bezuidenhout and Clifford Akai-Nettey.
- Enhancement Added the helper decorator utils.metaestimators.available_if to provide flexibility in metaestimators, making methods available or unavailable on the basis of state in a more readable way. #19948 by Joel Nothman.
- Enhancement utils.validation.check_is_fitted now uses __sklearn_is_fitted__ if available, instead of checking for attributes ending with an underscore. This also makes pipeline.Pipeline and preprocessing.FunctionTransformer pass check_is_fitted(estimator). #20657 by Adrin Jalali.
- Fix Fixed a bug in utils.sparsefuncs.mean_variance_axis where the precision of the computed variance was very poor when the real variance is exactly zero. #19766 by Jérémie du Boisberranger.
- Fix The docstrings of properties that are decorated with utils.deprecated are now properly wrapped. #20385 by Thomas Fan.
- Fix utils.stats._weighted_percentile now correctly ignores zero-weighted observations smaller than the smallest observation with positive weight for percentile=0. Affected classes are dummy.DummyRegressor for quantile=0 and ensemble.HuberLossFunction for alpha=0. #20528 by Malte Londschien.
- Fix utils._safe_indexing now explicitly takes a dataframe copy when integer indices are provided, which avoids a warning from pandas. This warning was previously raised in resampling utilities and functions using those utilities (e.g. model_selection.train_test_split, model_selection.cross_validate, model_selection.cross_val_score, model_selection.cross_val_predict). #20673 by Joris Van den Bossche.
- Fix Fixed a regression in utils.is_scalar_nan where large Python numbers would raise an error due to overflow in C types (np.float64 or np.int64). #20727 by Guillaume Lemaitre.
- Fix Support for np.matrix is deprecated in check_array in 1.0 and will raise a TypeError in 1.2. #20165 by Thomas Fan.
- API Change utils._testing.assert_warns and utils._testing.assert_warns_message are deprecated in 1.0 and will be removed in 1.2. Use the pytest.warns context manager instead. Note that these functions were not documented and were not part of the public API. #20521 by Olivier Grisel.
- API Change Fixed several bugs in utils.graph.graph_shortest_path, which is now deprecated. Use scipy.sparse.csgraph.shortest_path instead. #20531 by Tom Dupre la Tour.