
scikit-learn 1.1 Now Available
scikit-learn is an open source machine learning library that supports supervised and unsupervised learning, and is used by an estimated 80% of data scientists, according to a recent Kaggle survey.
The library contains implementations of many common ML algorithms and models, including the widely-used linear regression, decision tree, and gradient-boosting algorithms. It also provides various tools for model fitting, data preprocessing, model selection and evaluation, and many other utilities.
Highlights include:
- Quantile loss in ensemble.HistGradientBoostingRegressor
- get_feature_names_out available in all transformers
- Grouping infrequent categories in OneHotEncoder
- Performance improvements
- MiniBatchNMF: an online version of NMF
- BisectingKMeans: divide and cluster
For more details on the main highlights of the release, please refer to Release Highlights for scikit-learn 1.1.
To install the latest version (with pip):
pip install --upgrade scikit-learn
or with conda:
conda install -c conda-forge scikit-learn
Version 1.1.0
For a short description of the main highlights of the release, please refer to Release Highlights for scikit-learn 1.1.
- Major Feature : something big that you couldn’t do before.
- Feature : something that you couldn’t do before.
- Efficiency : an existing feature now may not require as much computation or memory.
- Enhancement : a miscellaneous minor improvement.
- Fix : something that previously didn’t work as documented – or according to reasonable expectations – should now work.
- API Change : you will need to change your code to have the same effect in the future; or a feature will be removed in the future.
Version 1.1.0 of scikit-learn requires Python 3.8+, NumPy 1.17.3+ and SciPy 1.3.2+. The optional minimal dependency is Matplotlib 3.1.2+.
Changed models
The following estimators and functions, when fit with the same data and parameters, may produce different models from the previous version. This often occurs due to changes in the modelling logic (bug fixes or enhancements), or in random sampling procedures.
- Efficiency cluster.KMeans now defaults to algorithm="lloyd" instead of algorithm="auto", which was equivalent to algorithm="elkan". Lloyd’s algorithm and Elkan’s algorithm converge to the same solution, up to numerical rounding errors, but in general Lloyd’s algorithm uses much less memory, and it is often faster.
- Efficiency Fitting tree.DecisionTreeClassifier, tree.DecisionTreeRegressor, ensemble.RandomForestClassifier, ensemble.RandomForestRegressor, ensemble.GradientBoostingClassifier, and ensemble.GradientBoostingRegressor is on average 15% faster than in previous versions thanks to a new sort algorithm to find the best split. Models might be different because of a different handling of splits with tied criterion values: both the old and the new sorting algorithm are unstable sorting algorithms. #22868 by Thomas Fan.
- Fix The eigenvectors initialization for cluster.SpectralClustering and manifold.SpectralEmbedding now samples from a Gaussian when using the 'amg' or 'lobpcg' solver. This change improves numerical stability of the solver, but may result in a different model.
- Fix feature_selection.f_regression and feature_selection.r_regression will now return finite scores by default instead of np.nan and np.inf for some corner cases. You can use force_finite=False if you really want to get non-finite values and keep the old behavior.
- Fix pandas DataFrames with all non-string columns, such as a MultiIndex, no longer warn when passed into an estimator. Estimators will continue to ignore the column names in DataFrames with non-string columns. For feature_names_in_ to be defined, columns must be all strings. #22410 by Thomas Fan.
- Fix preprocessing.KBinsDiscretizer changed handling of bin edges slightly, which might result in a different encoding with the same data.
- Fix calibration.calibration_curve changed handling of bin edges slightly, which might result in a different output curve given the same data.
- Fix discriminant_analysis.LinearDiscriminantAnalysis now uses the correct variance-scaling coefficient, which may result in different model behavior.
- Fix feature_selection.SelectFromModel.fit and feature_selection.SelectFromModel.partial_fit can now be called with prefit=True. estimators_ will be a deep copy of estimator when prefit=True. #23271 by Guillaume Lemaitre.
Changelog
- Efficiency Low-level routines for reductions on pairwise distances for dense float64 datasets have been refactored. The following functions and estimators now benefit from improved performances in terms of hardware scalability and speed-ups:
  - sklearn.metrics.pairwise_distances_argmin
  - sklearn.metrics.pairwise_distances_argmin_min
  - sklearn.cluster.AffinityPropagation
  - sklearn.cluster.Birch
  - sklearn.cluster.MeanShift
  - sklearn.cluster.OPTICS
  - sklearn.cluster.SpectralClustering
  - sklearn.feature_selection.mutual_info_regression
  - sklearn.neighbors.KNeighborsClassifier
  - sklearn.neighbors.KNeighborsRegressor
  - sklearn.neighbors.RadiusNeighborsClassifier
  - sklearn.neighbors.RadiusNeighborsRegressor
  - sklearn.neighbors.LocalOutlierFactor
  - sklearn.neighbors.NearestNeighbors
  - sklearn.manifold.Isomap
  - sklearn.manifold.LocallyLinearEmbedding
  - sklearn.manifold.TSNE
  - sklearn.manifold.trustworthiness
  - sklearn.semi_supervised.LabelPropagation
  - sklearn.semi_supervised.LabelSpreading
  For instance sklearn.neighbors.NearestNeighbors.kneighbors and sklearn.neighbors.NearestNeighbors.radius_neighbors can respectively be up to ×20 and ×5 faster than previously. #21987, #22064, #22065, #22288 and #22320 by Julien Jerphanion.
- Enhancement All scikit-learn models now generate a more informative error message when some input contains unexpected NaN or infinite values. In particular the message contains the input name (“X”, “y” or “sample_weight”) and if an unexpected NaN value is found in X, the error message suggests potential solutions. #21219 by Olivier Grisel.
- Enhancement All scikit-learn models now generate a more informative error message when setting invalid hyper-parameters with set_params. #21542 by Olivier Grisel.
- Enhancement Removes random unique identifiers in the HTML representation. With this change, Jupyter notebooks are reproducible as long as the cells are run in the same order. #23098 by Thomas Fan.
- Fix Estimators with non_deterministic tag set to True will skip both check_methods_sample_order_invariance and check_methods_subset_invariance tests. #22318 by Zhehao Liu.
- API Change The option for using the log loss, aka binomial or multinomial deviance, via the loss parameter was made more consistent. The preferred way is by setting the value to "log_loss". Old option names are still valid and produce the same models, but are deprecated and will be removed in version 1.3.
  - For ensemble.GradientBoostingClassifier, the loss parameter name “deviance” is deprecated in favor of the new name “log_loss”, which is now the default. #23036 by Christian Lorentzen.
  - For ensemble.HistGradientBoostingClassifier, the loss parameter names “auto”, “binary_crossentropy” and “categorical_crossentropy” are deprecated in favor of the new name “log_loss”, which is now the default. #23040 by Christian Lorentzen.
  - For linear_model.SGDClassifier, the loss parameter name “log” is deprecated in favor of the new name “log_loss”. #23046 by Christian Lorentzen.
- API Change Rich HTML representation of estimators is now enabled by default in Jupyter notebooks. It can be deactivated by setting display='text' in sklearn.set_config. #22856 by Jérémie du Boisberranger.
- Enhancement The error message is improved when importing model_selection.HalvingGridSearchCV, model_selection.HalvingRandomSearchCV, or impute.IterativeImputer without importing the experimental flag. #23194 by Thomas Fan.
- Enhancement Added an extension in doc/conf.py to automatically generate the list of estimators that handle NaN values. #23198 by Lise Kleiber, Zhehao Liu and Chiara Marmo.
sklearn.calibration
- Enhancement calibration.calibration_curve accepts a parameter pos_label to specify the positive class label. #21032 by Guillaume Lemaitre.
- Enhancement calibration.CalibratedClassifierCV.fit now supports passing fit_params, which are routed to the base_estimator. #18170 by Benjamin Bossan.
- Enhancement calibration.CalibrationDisplay accepts a parameter pos_label to add this information to the plot. #21038 by Guillaume Lemaitre.
- Fix calibration.calibration_curve handles bin edges more consistently now. #14975 by Andreas Müller and #22526 by Meekail Zain.
- API Change calibration.calibration_curve’s normalize parameter is now deprecated and will be removed in version 1.3. It is recommended that a proper probability (i.e. a classifier’s predict_proba positive class) is used for y_prob. #23095 by Jordan Silke.
sklearn.cluster
- Major Feature BisectingKMeans, introducing the Bisecting K-Means algorithm. #20031 by Michal Krawczyk, Tom Dupre la Tour and Jérémie du Boisberranger.
- Enhancement cluster.SpectralClustering and cluster.spectral_clustering now include the new 'cluster_qr' method that clusters samples in the embedding space as an alternative to the existing 'kmeans' and 'discrete' methods. See cluster.spectral_clustering for more details. #21148 by Andrew Knyazev.
- Enhancement Adds get_feature_names_out to cluster.Birch, cluster.FeatureAgglomeration, cluster.KMeans, cluster.MiniBatchKMeans. #22255 by Thomas Fan.
- Enhancement cluster.SpectralClustering now raises consistent error messages when passed invalid values for n_clusters, n_init, gamma, n_neighbors, eigen_tol or degree. #21881 by Hugo Vassard.
- Enhancement cluster.AffinityPropagation now returns cluster centers and labels if they exist, even if the model has not fully converged. When returning these potentially-degenerate cluster centers and labels, a new warning message is shown. If no cluster centers were constructed, then the cluster centers remain an empty list with labels set to -1 and the original warning message is shown. #22217 by Meekail Zain.
- Efficiency In cluster.KMeans, the default algorithm is now "lloyd" which is the full classical EM-style algorithm. Both "auto" and "full" are deprecated and will be removed in version 1.3. They are now aliases for "lloyd". The previous default was "auto", which relied on Elkan’s algorithm. Lloyd’s algorithm uses less memory than Elkan’s, it is faster on many datasets, and its results are identical, hence the change. #21735 by Aurélien Geron.
- Fix cluster.KMeans’s init parameter now properly supports array-like input and NumPy string scalars. #22154 by Thomas Fan.
sklearn.compose
- Fix compose.ColumnTransformer now removes validation errors from __init__ and set_params methods. #22537 by iofall and Arisa Y.
- Fix get_feature_names_out functionality in compose.ColumnTransformer was broken when columns were specified using slice. This is fixed in #22775 and #22913 by randomgeek78.
sklearn.covariance
- Fix covariance.GraphicalLassoCV now accepts NumPy array for the parameter alphas. #22493 by Guillaume Lemaitre.
sklearn.cross_decomposition
- Enhancement The inverse_transform method of cross_decomposition.PLSRegression, cross_decomposition.PLSCanonical and cross_decomposition.CCA now allows reconstruction of an X target when a Y parameter is given. #19680 by Robin Thibaut.
- Enhancement Adds get_feature_names_out to all transformers in the cross_decomposition module: cross_decomposition.CCA, cross_decomposition.PLSSVD, cross_decomposition.PLSRegression, and cross_decomposition.PLSCanonical. #22119 by Thomas Fan.
- Fix The shape of the coef_ attribute of cross_decomposition.CCA, cross_decomposition.PLSCanonical and cross_decomposition.PLSRegression will change in version 1.3, from (n_features, n_targets) to (n_targets, n_features), to be consistent with other linear models and to make it work with interfaces expecting a specific shape for coef_ (e.g. feature_selection.RFE). #22016 by Guillaume Lemaitre.
- API Change Add the fitted attribute intercept_ to cross_decomposition.PLSCanonical, cross_decomposition.PLSRegression, and cross_decomposition.CCA. The method predict is indeed equivalent to Y = X @ coef_ + intercept_. #22015 by Guillaume Lemaitre.
sklearn.datasets
- Feature datasets.load_files now accepts an ignore list and an allow list based on file extensions. #19747 by Tony Attalla and #22498 by Meekail Zain.
- Enhancement datasets.make_swiss_roll now supports the optional argument hole; when set to True, it returns the swiss-hole dataset. #21482 by Sebastian Pujalte.
- Enhancement datasets.make_blobs no longer copies data during the generation process, therefore uses less memory. #22412 by Zhehao Liu.
- Enhancement datasets.load_diabetes now accepts the parameter scaled, to allow loading unscaled data. The scaled version of this dataset is now computed from the unscaled data, and can produce slightly different results than in previous versions (within a 1e-4 absolute tolerance). #16605 by Mandy Gu.
- Enhancement datasets.fetch_openml now has two optional arguments n_retries and delay. By default, datasets.fetch_openml will retry 3 times in case of a network failure with a delay between each try. #21901 by Rileran.
- Fix datasets.fetch_covtype is now concurrent-safe: data is downloaded to a temporary directory before being moved to the data directory. #23113 by Ilion Beyst.
- API Change datasets.make_sparse_coded_signal now accepts a parameter data_transposed to explicitly specify the shape of matrix X. The default behavior True is to return a transposed matrix X corresponding to a (n_features, n_samples) shape. The default value will change to False in version 1.3. #21425 by Gabriel Stefanini Vicente.
sklearn.decomposition
- Major Feature Added a new estimator decomposition.MiniBatchNMF. It is a faster but less accurate version of non-negative matrix factorization, better suited for large datasets. #16948 by Chiara Marmo, Patricio Cerda and Jérémie du Boisberranger.
- Enhancement decomposition.dict_learning, decomposition.dict_learning_online and decomposition.sparse_encode preserve dtype for numpy.float32. decomposition.DictionaryLearning, decomposition.MiniBatchDictionaryLearning and decomposition.SparseCoder preserve dtype for numpy.float32. #22002 by Takeshi Oura.
- Enhancement decomposition.PCA exposes a parameter n_oversamples to tune utils.randomized_svd and get accurate results when the number of features is large. #21109 by Smile.
- Enhancement The decomposition.MiniBatchDictionaryLearning and decomposition.dict_learning_online have been refactored and now have a stopping criterion based on a small change of the dictionary or objective function, controlled by the new max_iter, tol and max_no_improvement parameters. In addition, some of their parameters and attributes are deprecated.
  - the n_iter parameter of both is deprecated. Use max_iter instead.
  - the iter_offset, return_inner_stats, inner_stats and return_n_iter parameters of decomposition.dict_learning_online serve internal purpose and are deprecated.
  - the inner_stats_, iter_offset_ and random_state_ attributes of decomposition.MiniBatchDictionaryLearning serve internal purpose and are deprecated.
  - the default value of the batch_size parameter of both will change from 3 to 256 in version 1.3.
- Enhancement decomposition.SparsePCA and decomposition.MiniBatchSparsePCA preserve dtype for numpy.float32. #22111 by Takeshi Oura.
- Enhancement decomposition.TruncatedSVD now allows n_components == n_features, if algorithm='randomized'. #22181 by Zach Deane-Mayer.
- Enhancement Adds get_feature_names_out to all transformers in the decomposition module: decomposition.DictionaryLearning, decomposition.FactorAnalysis, decomposition.FastICA, decomposition.IncrementalPCA, decomposition.KernelPCA, decomposition.LatentDirichletAllocation, decomposition.MiniBatchDictionaryLearning, decomposition.MiniBatchSparsePCA, decomposition.NMF, decomposition.PCA, decomposition.SparsePCA, and decomposition.TruncatedSVD. #21334 by Thomas Fan.
- Enhancement decomposition.TruncatedSVD exposes the parameters n_oversamples and power_iteration_normalizer to tune utils.randomized_svd and get accurate results when the number of features is large, the rank of the matrix is high, or other features of the matrix make low rank approximation difficult. #21705 by Jay S. Stanley III.
- Enhancement decomposition.PCA exposes the parameter power_iteration_normalizer to tune utils.randomized_svd and get more accurate results when low rank approximation is difficult. #21705 by Jay S. Stanley III.
- Fix decomposition.FastICA now validates input parameters in fit instead of __init__. #21432 by Hannah Bohle and Maren Westermann.
- Fix decomposition.FastICA now accepts np.float32 data without silent upcasting. The dtype is preserved by fit and fit_transform and the main fitted attributes use a dtype of the same precision as the training data. #22806 by Jihane Bennis and Olivier Grisel.
- Fix decomposition.FactorAnalysis now validates input parameters in fit instead of __init__. #21713 by Haya and Krum Arnaudov.
- Fix decomposition.KernelPCA now validates input parameters in fit instead of __init__. #21567 by Maggie Chege.
- Fix decomposition.PCA and decomposition.IncrementalPCA more safely calculate precision using the inverse of the covariance matrix if self.noise_variance_ is zero. #22300 by Meekail Zain and #15948 by @sysuresh.
- Fix Greatly reduced peak memory usage in decomposition.PCA when calling fit or fit_transform. #22553 by Meekail Zain.
- API Change decomposition.FastICA now supports unit variance for whitening. The default value of its whiten argument will change from True (which behaves like 'arbitrary-variance') to 'unit-variance' in version 1.3. #19490 by Facundo Ferrin and Julien Jerphanion.
sklearn.discriminant_analysis
- Enhancement Adds get_feature_names_out to discriminant_analysis.LinearDiscriminantAnalysis. #22120 by Thomas Fan.
- Fix discriminant_analysis.LinearDiscriminantAnalysis now uses the correct variance-scaling coefficient, which may result in different model behavior. #15984 by Okon Samuel and #22696 by Meekail Zain.
sklearn.dummy
- Fix dummy.DummyRegressor no longer overrides the constant parameter during fit. #22486 by Thomas Fan.
sklearn.ensemble
- Major Feature Added additional option loss="quantile" to ensemble.HistGradientBoostingRegressor for modelling quantiles. The quantile level can be specified with the new parameter quantile. #21800 and #20567 by Christian Lorentzen.
- Efficiency fit of ensemble.GradientBoostingClassifier and ensemble.GradientBoostingRegressor now calls utils.check_array with parameter force_all_finite=False for non initial warm-start runs as it has already been checked before. #22159 by Geoffrey Paris.
- Enhancement ensemble.HistGradientBoostingClassifier is faster, for binary and in particular for multiclass problems thanks to the new private loss function module. #20811, #20567 and #21814 by Christian Lorentzen.
- Enhancement Adds support to use pre-fit models with cv="prefit" in ensemble.StackingClassifier and ensemble.StackingRegressor. #16748 by Siqi He and #22215 by Meekail Zain.
- Enhancement ensemble.RandomForestClassifier and ensemble.ExtraTreesClassifier have the new criterion="log_loss", which is equivalent to criterion="entropy". #23047 by Christian Lorentzen.
- Enhancement Adds get_feature_names_out to ensemble.VotingClassifier, ensemble.VotingRegressor, ensemble.StackingClassifier, and ensemble.StackingRegressor. #22695 and #22697 by Thomas Fan.
- Enhancement ensemble.RandomTreesEmbedding now has an informative get_feature_names_out function that includes both tree index and leaf index in the output feature names. #21762 by Zhehao Liu and Thomas Fan.
- Efficiency Fitting a ensemble.RandomForestClassifier, ensemble.RandomForestRegressor, ensemble.ExtraTreesClassifier, ensemble.ExtraTreesRegressor, and ensemble.RandomTreesEmbedding is now faster in a multiprocessing setting, especially for subsequent fits with warm_start enabled. #22106 by Pieter Gijsbers.
- Fix Change the parameter validation_fraction in ensemble.GradientBoostingClassifier and ensemble.GradientBoostingRegressor so that an error is raised if anything other than a float is passed in as an argument. #21632 by Genesis Valencia.
- Fix Removed a potential source of CPU oversubscription in ensemble.HistGradientBoostingClassifier and ensemble.HistGradientBoostingRegressor when CPU resource usage is limited, for instance using cgroups quota in a docker container. #22566 by Jérémie du Boisberranger.
- Fix ensemble.HistGradientBoostingClassifier and ensemble.HistGradientBoostingRegressor no longer warn when fitting on a pandas DataFrame with a non-default scoring parameter and early_stopping enabled. #22908 by Thomas Fan.
- Fix Fixes HTML repr for ensemble.StackingClassifier and ensemble.StackingRegressor. #23097 by Thomas Fan.
- API Change The attribute loss_ of ensemble.GradientBoostingClassifier and ensemble.GradientBoostingRegressor has been deprecated and will be removed in version 1.3. #23079 by Christian Lorentzen.
- API Change Changed the default of max_features to 1.0 for ensemble.RandomForestRegressor and to "sqrt" for ensemble.RandomForestClassifier. Note that these give the same fit results as before, but are much easier to understand. The old default value "auto" has been deprecated and will be removed in version 1.3. The same changes are also applied for ensemble.ExtraTreesRegressor and ensemble.ExtraTreesClassifier. #20803 by Brian Sun.
- Efficiency Improve runtime performance of ensemble.IsolationForest by skipping repetitive input checks. #23149 by Zhehao Liu.
sklearn.feature_extraction
- Feature feature_extraction.FeatureHasher now supports PyPy. #23023 by Thomas Fan.
- Fix feature_extraction.FeatureHasher now validates input parameters in transform instead of __init__. #21573 by Hannah Bohle and Maren Westermann.
- Fix feature_extraction.text.TfidfVectorizer now does not create a feature_extraction.text.TfidfTransformer at __init__ as required by our API. #21832 by Guillaume Lemaitre.
sklearn.feature_selection
- Feature Added auto mode to feature_selection.SequentialFeatureSelector. If the argument n_features_to_select is 'auto', select features until the score improvement does not exceed the argument tol. The default value of n_features_to_select changed from None to 'warn' in 1.1 and will become 'auto' in 1.3. None and 'warn' will be removed in 1.3. #20145 by murata-yu.
- Feature Added the ability to pass callables to the max_features parameter of feature_selection.SelectFromModel. Also introduced new attribute max_features_ which is inferred from max_features and the data during fit. If max_features is an integer, then max_features_ = max_features. If max_features is a callable, then max_features_ = max_features(X). #22356 by Meekail Zain.
- Enhancement feature_selection.GenericUnivariateSelect preserves float32 dtype. #18482 by Thierry Gameiro and Daniel Kharsa and #22370 by Meekail Zain.
- Enhancement Add a parameter force_finite to feature_selection.f_regression and feature_selection.r_regression. This parameter makes it possible to force the output to be finite in the case where a feature or the target is constant, or where the feature and target are perfectly correlated (only for the F-statistic). #17819 by Juan Carlos Alfaro Jiménez.
- Efficiency Improve runtime performance of feature_selection.chi2 with boolean arrays. #22235 by Thomas Fan.
- Efficiency Reduced memory usage of feature_selection.chi2. #21837 by Louis Wagner.
sklearn.gaussian_process
- Fix predict and sample_y methods of gaussian_process.GaussianProcessRegressor now return arrays of the correct shape in single-target and multi-target cases, and for both normalize_y=False and normalize_y=True. #22199 by Guillaume Lemaitre, Aidar Shakerimoff and Tenavi Nakamura-Zimmerer.
- Fix gaussian_process.GaussianProcessClassifier raises a more informative error if CompoundKernel is passed via kernel. #22223 by MarcoM.
sklearn.impute
- Enhancement impute.SimpleImputer now warns with feature names when features are skipped due to the lack of any observed values in the training set. #21617 by Christian Ritter.
- Enhancement Added support for pd.NA in impute.SimpleImputer. #21114 by Ying Xiong.
- Enhancement Adds get_feature_names_out to impute.SimpleImputer, impute.KNNImputer, impute.IterativeImputer, and impute.MissingIndicator. #21078 by Thomas Fan.
- API Change The verbose parameter was deprecated for impute.SimpleImputer. A warning will always be raised upon the removal of empty columns. #21448 by Oleh Kozynets and Christian Ritter.
sklearn.inspection
- Feature Add a display to plot the decision boundary of a classifier by using the method inspection.DecisionBoundaryDisplay.from_estimator. #16061 by Thomas Fan.
- Enhancement In inspection.PartialDependenceDisplay.from_estimator, allow kind to accept a list of strings to specify which type of plot to draw for each feature interaction. #19438 by Guillaume Lemaitre.
- Enhancement inspection.PartialDependenceDisplay.from_estimator, inspection.PartialDependenceDisplay.plot, and inspection.plot_partial_dependence now support plotting centered Individual Conditional Expectation (cICE) and centered PDP curves controlled by setting the parameter centered. #18310 by Johannes Elfner and Guillaume Lemaitre.
sklearn.isotonic
- Enhancement Adds get_feature_names_out to isotonic.IsotonicRegression. #22249 by Thomas Fan.
sklearn.kernel_approximation
- Enhancement Adds get_feature_names_out to kernel_approximation.AdditiveChi2Sampler, kernel_approximation.Nystroem, kernel_approximation.PolynomialCountSketch, kernel_approximation.RBFSampler, and kernel_approximation.SkewedChi2Sampler. #22137 and #22694 by Thomas Fan.
sklearn.linear_model
- Feature linear_model.ElasticNet, linear_model.ElasticNetCV, linear_model.Lasso and linear_model.LassoCV support sample_weight for sparse input X. #22808 by Christian Lorentzen.
- Feature linear_model.Ridge with solver="lsqr" now supports fitting sparse input with fit_intercept=True. #22950 by Christian Lorentzen.
- Enhancement linear_model.QuantileRegressor supports sparse input for the HiGHS-based solvers. #21086 by Venkatachalam Natchiappan. In addition, those solvers now use the CSC matrix right from the beginning which speeds up fitting. #22206 by Christian Lorentzen.
- Enhancement linear_model.LogisticRegression is faster for solver="lbfgs" and solver="newton-cg", for binary and in particular for multiclass problems thanks to the new private loss function module. In the multiclass case, the memory consumption has also been reduced for these solvers as the target is now label encoded (mapped to integers) instead of label binarized (one-hot encoded). The more classes, the larger the benefit. #21808, #20567 and #21814 by Christian Lorentzen.
- Enhancement linear_model.GammaRegressor, linear_model.PoissonRegressor and linear_model.TweedieRegressor are faster for solver="lbfgs". #22548, #21808 and #20567 by Christian Lorentzen.
- Enhancement Rename parameter base_estimator to estimator in linear_model.RANSACRegressor to improve readability and consistency. base_estimator is deprecated and will be removed in 1.3. #22062 by Adrian Trujillo.
- Enhancement linear_model.ElasticNet and other linear model classes using coordinate descent show error messages when non-finite parameter weights are produced. #22148 by Christian Ritter and Norbert Preining.
- Enhancement linear_model.ElasticNet and linear_model.Lasso now raise consistent error messages when passed invalid values for l1_ratio, alpha, max_iter and tol. #22240 by Arturo Amor.
- Enhancement linear_model.BayesianRidge and linear_model.ARDRegression now preserve float32 dtype. #9087 by Arthur Imbert and #22525 by Meekail Zain.
- Enhancement linear_model.RidgeClassifier now supports multilabel classification. #19689 by Guillaume Lemaitre.
- Enhancement linear_model.RidgeCV and linear_model.RidgeClassifierCV now raise consistent error messages when passed invalid values for alphas. #21606 by Arturo Amor.
- Enhancement linear_model.Ridge and linear_model.RidgeClassifier now raise consistent error messages when passed invalid values for alpha, max_iter and tol. #21341 by Arturo Amor.
- Enhancement linear_model.orthogonal_mp_gram preserves dtype for numpy.float32. #22002 by Takeshi Oura.
- Fix linear_model.LassoLarsIC now correctly computes AIC and BIC. An error is now raised when n_features > n_samples and when the noise variance is not provided. #21481 by Guillaume Lemaitre and Andrés Babino.
- Fix linear_model.TheilSenRegressor now validates input parameter max_subpopulation in fit instead of __init__. #21767 by Maren Westermann.
- Fix linear_model.ElasticNetCV now produces correct warning when l1_ratio=0. #21724 by Yar Khine Phyo.
- Fix linear_model.LogisticRegression and linear_model.LogisticRegressionCV now set the n_iter_ attribute with a shape that respects the docstring and that is consistent with the shape obtained when using the other solvers in the one-vs-rest setting. Previously, it would record only the maximum of the number of iterations for each binary sub-problem while now all of them are recorded. #21998 by Olivier Grisel.
- Fix The property family of linear_model.TweedieRegressor is not validated in __init__ anymore. Instead, this (private) property is deprecated in linear_model.GammaRegressor, linear_model.PoissonRegressor and linear_model.TweedieRegressor, and will be removed in 1.3. #22548 by Christian Lorentzen.
- Fix The coef_ and intercept_ attributes of linear_model.LinearRegression are now correctly computed in the presence of sample weights when the input is sparse. #22891 by Jérémie du Boisberranger.
- Fix The coef_ and intercept_ attributes of linear_model.Ridge with solver="sparse_cg" and solver="lbfgs" are now correctly computed in the presence of sample weights when the input is sparse. #22899 by Jérémie du Boisberranger.
- Fix linear_model.SGDRegressor and linear_model.SGDClassifier now compute the validation error correctly when early stopping is enabled. #23256 by Zhehao Liu.
- API Change linear_model.LassoLarsIC now exposes noise_variance as a parameter in order to provide an estimate of the noise variance. This is particularly relevant when n_features > n_samples and the estimator of the noise variance cannot be computed. #21481 by Guillaume Lemaitre.
sklearn.manifold
- Feature manifold.Isomap now supports radius-based neighbors via the radius argument. #19794 by Zhehao Liu.
- Enhancement manifold.spectral_embedding and manifold.SpectralEmbedding support np.float32 dtype and will preserve this dtype. #21534 by Andrew Knyazev.
- Enhancement Adds get_feature_names_out to manifold.Isomap and manifold.LocallyLinearEmbedding. #22254 by Thomas Fan.
- Enhancement Added metric_params to manifold.TSNE constructor for additional parameters of distance metric to use in optimization. #21805 by Jeanne Dionisi and #22685 by Meekail Zain.
- Enhancement manifold.trustworthiness raises an error if n_neighbors >= n_samples / 2 to ensure a correct support for the function. #18832 by Hong Shao Yang and #23033 by Meekail Zain.
- Fix manifold.spectral_embedding now uses Gaussian instead of the previous uniform on [0, 1] random initial approximations to eigenvectors in eigen_solvers lobpcg and amg to improve their numerical stability. #21565 by Andrew Knyazev.
sklearn.metrics
- Feature metrics.r2_score and metrics.explained_variance_score have a new force_finite parameter. Setting this parameter to False will return the actual non-finite score in case of perfect predictions or constant y_true, instead of the finite approximation (1.0 and 0.0 respectively) currently returned by default. #17266 by Sylvain Marié.
- Feature metrics.d2_pinball_score and metrics.d2_absolute_error_score calculate the D2 regression score for the pinball loss and the absolute error respectively. metrics.d2_absolute_error_score is a special case of metrics.d2_pinball_score with a fixed quantile parameter alpha=0.5 for ease of use and discovery. The D2 scores are generalizations of the r2_score and can be interpreted as the fraction of deviance explained. #22118 by Ohad Michel.
- Enhancement metrics.top_k_accuracy_score raises an improved error message when y_true is binary and y_score is 2d. #22284 by Thomas Fan.
- Enhancement metrics.roc_auc_score now supports average=None in the multiclass case when multiclass='ovr' which will return the score per class. #19158 by Nicki Skafte.
- Enhancement Adds im_kw parameter to metrics.ConfusionMatrixDisplay.from_estimator, metrics.ConfusionMatrixDisplay.from_predictions, and metrics.ConfusionMatrixDisplay.plot. The im_kw parameter is passed to the matplotlib.pyplot.imshow call when plotting the confusion matrix. #20753 by Thomas Fan.
- Fix metrics.silhouette_score now supports integer input for precomputed distances. #22108 by Thomas Fan.
- Fix Fixed a bug in metrics.normalized_mutual_info_score which could return unbounded values. #22635 by Jérémie du Boisberranger.
- Fix Fixes metrics.precision_recall_curve and metrics.average_precision_score when true labels are all negative. #19085 by Varun Agrawal.
- API Change metrics.SCORERS is now deprecated and will be removed in 1.3. Please use metrics.get_scorer_names to retrieve the names of all available scorers. #22866 by Adrin Jalali.
- API Change Parameters sample_weight and multioutput of metrics.mean_absolute_percentage_error are now keyword-only, in accordance with SLEP009. A deprecation cycle was introduced. #21576 by Paul-Emile Dugnat.
- API Change The "wminkowski" metric of metrics.DistanceMetric is deprecated and will be removed in version 1.3. Instead the existing "minkowski" metric now takes in an optional w parameter for weights. This deprecation aims at remaining consistent with SciPy 1.8 convention. #21873 by Yar Khine Phyo.
- API Change metrics.DistanceMetric has been moved from sklearn.neighbors to sklearn.metrics. Using neighbors.DistanceMetric for imports is still valid for backward compatibility, but this alias will be removed in 1.3. #21177 by Julien Jerphanion.
sklearn.mixture
- Enhancement mixture.GaussianMixture and mixture.BayesianGaussianMixture can now be initialized using k-means++ and random data points. #20408 by Gordon Walsh, Alberto Ceballos and Andres Rios.
- Fix Fixed a bug so that precisions_cholesky_ is now correctly initialized in mixture.GaussianMixture when precisions_init is provided, by taking its square root. #22058 by Guillaume Lemaitre.
- Fix mixture.GaussianMixture now normalizes weights_ more safely, preventing rounding errors when calling mixture.GaussianMixture.sample with n_components=1. #23034 by Meekail Zain.
sklearn.model_selection
- Enhancement It is now possible to pass scoring="matthews_corrcoef" to all model selection tools with a scoring argument to use the Matthews correlation coefficient (MCC). #22203 by Olivier Grisel.
- Enhancement An error is now raised during cross-validation when the fits for all the splits failed. Similarly, an error is raised during grid-search when the fits for all the models and all the splits failed. #21026 by Loïc Estève.
- Fix model_selection.GridSearchCV and model_selection.HalvingGridSearchCV now validate input parameters in fit instead of __init__. #21880 by Mrinal Tyagi.
- Fix model_selection.learning_curve now supports partial_fit with regressors. #22982 by Thomas Fan.
sklearn.multiclass
- Enhancement multiclass.OneVsRestClassifier now supports a verbose parameter so progress on fitting can be seen. #22508 by Chris Combs.
- Fix multiclass.OneVsOneClassifier.predict returns correct predictions when the inner classifier only has a predict_proba. #22604 by Thomas Fan.
sklearn.neighbors
- Enhancement Adds get_feature_names_out to neighbors.RadiusNeighborsTransformer, neighbors.KNeighborsTransformer and neighbors.NeighborhoodComponentsAnalysis. #22212 by Meekail Zain.
- Fix neighbors.KernelDensity now validates input parameters in fit instead of __init__. #21430 by Desislava Vasileva and Lucy Jimenez.
- Fix neighbors.KNeighborsRegressor.predict now works properly when given an array-like input if KNeighborsRegressor is first constructed with a callable passed to the weights parameter. #22687 by Meekail Zain.
sklearn.neural_network
- Enhancement neural_network.MLPClassifier and neural_network.MLPRegressor show error messages when optimizers produce non-finite parameter weights. #22150 by Christian Ritter and Norbert Preining.
- Enhancement Adds get_feature_names_out to neural_network.BernoulliRBM. #22248 by Thomas Fan.
sklearn.pipeline
- Enhancement Added support for "passthrough" in pipeline.FeatureUnion. Setting a transformer to "passthrough" will pass the features unchanged. #20860 by Shubhraneel Pal.
- Fix pipeline.Pipeline now does not validate hyper-parameters in __init__ but in .fit(). #21888 by iofall and Arisa Y..
- Fix pipeline.FeatureUnion does not validate hyper-parameters in __init__. Validation is now handled in .fit() and .fit_transform(). #21954 by iofall and Arisa Y..
- Fix Defines __sklearn_is_fitted__ in pipeline.FeatureUnion to return correct result with utils.validation.check_is_fitted. #22953 by randomgeek78.
sklearn.preprocessing
- Feature preprocessing.OneHotEncoder now supports grouping infrequent categories into a single feature. Grouping infrequent categories is enabled by specifying how to select infrequent categories with min_frequency or max_categories. #16018 by Thomas Fan.
- Enhancement Adds a subsample parameter to preprocessing.KBinsDiscretizer. This allows specifying a maximum number of samples to be used while fitting the model. The option is only available when strategy is set to quantile. #21445 by Felipe Bidu and Amanda Dsouza.
- Enhancement Adds encoded_missing_value to preprocessing.OrdinalEncoder to configure the encoded value for missing data. #21988 by Thomas Fan.
- Enhancement Added the get_feature_names_out method and a new parameter feature_names_out to preprocessing.FunctionTransformer. You can set feature_names_out to 'one-to-one' to use the input feature names as the output feature names, or you can set it to a callable that returns the output feature names. This is especially useful when the transformer changes the number of features. If feature_names_out is None (which is the default), then get_feature_names_out is not defined. #21569 by Aurélien Geron.
- Enhancement Adds get_feature_names_out to preprocessing.Normalizer, preprocessing.KernelCenterer, preprocessing.OrdinalEncoder, and preprocessing.Binarizer. #21079 by Thomas Fan.
- Fix preprocessing.PowerTransformer with method='yeo-johnson' better supports significantly non-Gaussian data when searching for an optimal lambda. #20653 by Thomas Fan.
- Fix preprocessing.LabelBinarizer now validates input parameters in fit instead of __init__. #21434 by Krum Arnaudov.
- Fix preprocessing.FunctionTransformer with check_inverse=True now provides informative error message when input has mixed dtypes. #19916 by Zhehao Liu.
- Fix preprocessing.KBinsDiscretizer handles bin edges more consistently now. #14975 by Andreas Müller and #22526 by Meekail Zain.
- Fix Adds preprocessing.KBinsDiscretizer.get_feature_names_out support when encode="ordinal". #22735 by Thomas Fan.
sklearn.random_projection
- Enhancement Adds an inverse_transform method and a compute_inverse_transform parameter to random_projection.GaussianRandomProjection and random_projection.SparseRandomProjection. When the parameter is set to True, the pseudo-inverse of the components is computed during fit and stored as inverse_components_. #21701 by Aurélien Geron.
- Enhancement random_projection.SparseRandomProjection and random_projection.GaussianRandomProjection preserve dtype for numpy.float32. #22114 by Takeshi Oura.
- Enhancement Adds get_feature_names_out to all transformers in the sklearn.random_projection module: random_projection.GaussianRandomProjection and random_projection.SparseRandomProjection. #21330 by Loïc Estève.
sklearn.svm
- Enhancement svm.OneClassSVM, svm.NuSVC, svm.NuSVR, svm.SVC and svm.SVR now expose n_iter_, the number of iterations of the libsvm optimization routine. #21408 by Juan Martín Loyola.
- Enhancement svm.SVR, svm.SVC, svm.NuSVR, svm.OneClassSVM and svm.NuSVC now raise an error when the dual-gap estimation produces non-finite parameter weights. #22149 by Christian Ritter and Norbert Preining.
- Fix svm.NuSVC, svm.NuSVR, svm.SVC, svm.SVR and svm.OneClassSVM now validate input parameters in fit instead of __init__. #21436 by Haidar Almubarak.
sklearn.tree
- Enhancement tree.DecisionTreeClassifier and tree.ExtraTreeClassifier have the new criterion="log_loss", which is equivalent to criterion="entropy". #23047 by Christian Lorentzen.
- Fix Fix a bug in the Poisson splitting criterion for tree.DecisionTreeRegressor. #22191 by Christian Lorentzen.
- API Change Changed the default value of max_features to 1.0 for tree.ExtraTreeRegressor and to "sqrt" for tree.ExtraTreeClassifier, which will not change the fit result. The original default value "auto" has been deprecated and will be removed in version 1.3. Setting max_features to "auto" is also deprecated for tree.DecisionTreeClassifier and tree.DecisionTreeRegressor. #22476 by Zhehao Liu.
sklearn.utils
- Enhancement utils.check_array and utils.multiclass.type_of_target now accept an input_name parameter to make the error message more informative when passed invalid input data (e.g. with NaN or infinite values). #21219 by Olivier Grisel.
- Enhancement utils.check_array returns a float ndarray with np.nan when passed a Float32 or Float64 pandas extension array with pd.NA. #21278 by Thomas Fan.
- Enhancement utils.estimator_html_repr shows a more helpful error message when running in a jupyter notebook that is not trusted. #21316 by Thomas Fan.
- Enhancement utils.estimator_html_repr displays an arrow on the top left corner of the HTML representation to show how the elements are clickable. #21298 by Thomas Fan.
- Enhancement utils.check_array with dtype=None returns numeric arrays when passed in a pandas DataFrame with mixed dtypes. dtype="numeric" will also better infer the dtype when the DataFrame has mixed dtypes. #22237 by Thomas Fan.
- Enhancement utils.check_scalar now has better messages when displaying the type. #22218 by Thomas Fan.
- Fix Changes the error message of the ValueError raised by utils.check_X_y when y is None so that it is compatible with the check_requires_y_none estimator check. #22578 by Claudio Salvatore Arcidiacono.
- Fix utils.class_weight.compute_class_weight now only requires that all classes in y have a weight in class_weight. An error is still raised when a class is present in y but not in class_weight. #22595 by Thomas Fan.
- Fix utils.estimator_html_repr has an improved visualization for nested meta-estimators. #21310 by Thomas Fan.
- Fix utils.check_scalar raises an error when include_boundaries={"left", "right"} and the boundaries are not set. #22027 by Marie Lanternier.
- Fix utils.metaestimators.available_if correctly returns a bound method that can be pickled. #23077 by Thomas Fan.
- API Change utils.estimator_checks.check_estimator’s argument is now called estimator (previous name was Estimator). #22188 by Mathurin Massias.
- API Change utils.metaestimators.if_delegate_has_method is deprecated and will be removed in version 1.3. Use utils.metaestimators.available_if instead. #22830 by Jérémie du Boisberranger.
scikit-learn 1.1 Released
scikit-learn 1.1 Now Available
scikit-learn is an open source machine learning library that supports supervised and unsupervised learning, and is used by an estimated 80% of data scientists, according to a recent Kaggle survey.
The library contains implementations of many common ML algorithms and models, including the widely-used linear regression, decision tree, and gradient-boosting algorithms. It also provides various tools for model fitting, data preprocessing, model selection and evaluation, and many other utilities.
Highlights include:
- Quantile loss in
ensemble.HistGradientBoostingRegressor get_feature_names_outAvailable in all Transformers- Grouping infrequent categories in
OneHotEncoder - Performance improvements
- MiniBatchNMF: an online version of NMF
- BisectingKMeans: divide and cluster
For more details on the main highlights of the release, please refer to Release Highlights for scikit-learn 1.1.
To install the latest version (with pip):
pip install --upgrade scikit-learn
or with conda:
conda install -c conda-forge scikit-learn
Version 1.1.0
For a short description of the main highlights of the release, please refer to Release Highlights for scikit-learn 1.1.
- Major Feature : something big that you couldn’t do before.
- Feature : something that you couldn’t do before.
- Efficiency : an existing feature now may not require as much computation or memory.
- Enhancement : a miscellaneous minor improvement.
- Fix : something that previously didn’t work as documentated – or according to reasonable expectations – should now work.
- API Change : you will need to change your code to have the same effect in the future; or a feature will be removed in the future.
Version 1.1.0 of scikit-learn requires python 3.8+, numpy 1.17.3+ and scipy 1.3.2+. Optional minimal dependency is matplotlib 3.1.2+.
Changed models
The following estimators and functions, when fit with the same data and parameters, may produce different models from the previous version. This often occurs due to changes in the modelling logic (bug fixes or enhancements), or in random sampling procedures.
- Efficiency
cluster.KMeansnow defaults toalgorithm="lloyd"instead ofalgorithm="auto", which was equivalent toalgorithm="elkan". Lloyd’s algorithm and Elkan’s algorithm converge to the same solution, up to numerical rounding errors, but in general Lloyd’s algorithm uses much less memory, and it is often faster. - Efficiency Fitting
tree.DecisionTreeClassifier,tree.DecisionTreeRegressor,ensemble.RandomForestClassifier,ensemble.RandomForestRegressor,ensemble.GradientBoostingClassifier, andensemble.GradientBoostingRegressoris on average 15% faster than in previous versions thanks to a new sort algorithm to find the best split. Models might be different because of a different handling of splits with tied criterion values: both the old and the new sorting algorithm are unstable sorting algorithms. #22868 by Thomas Fan. - Fix The eigenvectors initialization for
cluster.SpectralClusteringandmanifold.SpectralEmbeddingnow samples from a Gaussian when using the'amg'or'lobpcg'solver. This change improves numerical stability of the solver, but may result in a different model. - Fix
feature_selection.f_regressionandfeature_selection.r_regressionwill now returned finite score by default instead ofnp.nanandnp.inffor some corner case. You can useforce_finite=Falseif you really want to get non-finite values and keep the old behavior. - Fix Panda’s DataFrames with all non-string columns such as a MultiIndex no longer warns when passed into an Estimator. Estimators will continue to ignore the column names in DataFrames with non-string columns. For
feature_names_in_to be defined, columns must be all strings. #22410 by Thomas Fan. - Fix
preprocessing.KBinsDiscretizerchanged handling of bin edges slightly, which might result in a different encoding with the same data. - Fix
calibration.calibration_curvechanged handling of bin edges slightly, which might result in a different output curve given the same data. - Fix
discriminant_analysis.LinearDiscriminantAnalysisnow uses the correct variance-scaling coefficient which may result in different model behavior. - Fix
feature_selection.SelectFromModel.fitandfeature_selection.SelectFromModel.partial_fitcan now be called withprefit=True.estimators_will be a deep copy ofestimatorwhenprefit=True. #23271 by Guillaume Lemaitre.
Changelog
- Efficiency Low-level routines for reductions on pairwise distances for dense float64 datasets have been refactored. The following functions and estimators now benefit from improved performances in terms of hardware scalability and speed-ups:
sklearn.metrics.pairwise_distances_argminsklearn.metrics.pairwise_distances_argmin_minsklearn.cluster.AffinityPropagationsklearn.cluster.Birchsklearn.cluster.MeanShiftsklearn.cluster.OPTICSsklearn.cluster.SpectralClusteringsklearn.feature_selection.mutual_info_regressionsklearn.neighbors.KNeighborsClassifiersklearn.neighbors.KNeighborsRegressorsklearn.neighbors.RadiusNeighborsClassifiersklearn.neighbors.RadiusNeighborsRegressorsklearn.neighbors.LocalOutlierFactorsklearn.neighbors.NearestNeighborssklearn.manifold.Isomapsklearn.manifold.LocallyLinearEmbeddingsklearn.manifold.TSNEsklearn.manifold.trustworthinesssklearn.semi_supervised.LabelPropagationsklearn.semi_supervised.LabelSpreading
For instance
sklearn.neighbors.NearestNeighbors.kneighborsandsklearn.neighbors.NearestNeighbors.radius_neighborscan respectively be up to ×20 and ×5 faster than previously. #21987, #22064, #22065, #22288 and #22320 by Julien Jerphanion. - Enhancement All scikit-learn models now generate a more informative error message when some input contains unexpected
NaNor infinite values. In particular the message contains the input name (“X”, “y” or “sample_weight”) and if an unexpectedNaNvalue is found inX, the error message suggests potential solutions. #21219 by Olivier Grisel. - Enhancement All scikit-learn models now generate a more informative error message when setting invalid hyper-parameters with
set_params. #21542 by Olivier Grisel. - Enhancement Removes random unique identifiers in the HTML representation. With this change, jupyter notebooks are reproducible as long as the cells are run in the same order. #23098 by Thomas Fan.
- Fix Estimators with
non_deterministictag set toTruewill skip bothcheck_methods_sample_order_invarianceandcheck_methods_subset_invariancetests. #22318 by Zhehao Liu. - API Change The option for using the log loss, aka binomial or multinomial deviance, via the
lossparameters was made more consistent. The preferred way is by setting the value to"log_loss". Old option names are still valid and produce the same models, but are deprecated and will be removed in version 1.3.- For
ensemble.GradientBoostingClassifier, thelossparameter name “deviance” is deprecated in favor of the new name “log_loss”, which is now the default. #23036 by Christian Lorentzen. - For
ensemble.HistGradientBoostingClassifier, thelossparameter names “auto”, “binary_crossentropy” and “categorical_crossentropy” are deprecated in favor of the new name “log_loss”, which is now the default. #23040 by Christian Lorentzen. - For
linear_model.SGDClassifier, thelossparameter name “log” is deprecated in favor of the new name “log_loss”. #23046 by Christian Lorentzen.
- For
- API Change Rich html representation of estimators is now enabled by default in Jupyter notebooks. It can be deactivated by setting
display='text'insklearn.set_config. #22856 by Jérémie du Boisberranger. - Enhancement The error message is improved when importing
model_selection.HalvingGridSearchCV,model_selection.HalvingRandomSearchCV, orimpute.IterativeImputerwithout importing the experimental flag. #23194 by Thomas Fan. - Enhancement Added an extension in doc/conf.py to automatically generate the list of estimators that handle NaN values. #23198 by Lise Kleiber, Zhehao Liu and Chiara Marmo.
sklearn.calibration
- Enhancement
calibration.calibration_curveaccepts a parameterpos_labelto specify the positive class label. #21032 by Guillaume Lemaitre. - Enhancement
calibration.CalibratedClassifierCV.fitnow supports passingfit_params, which are routed to thebase_estimator. #18170 by Benjamin Bossan. - Enhancement
calibration.CalibrationDisplayaccepts a parameterpos_labelto add this information to the plot. #21038 by Guillaume Lemaitre. - Fix
calibration.calibration_curvehandles bin edges more consistently now. #14975 by Andreas Müller and #22526 by Meekail Zain. - API Change
calibration.calibration_curve’snormalizeparameter is now deprecated and will be removed in version 1.3. It is recommended that a proper probability (i.e. a classifier’s predict_proba positive class) is used fory_prob. #23095 by Jordan Silke.
sklearn.cluster
- Major Feature
BisectingKMeansintroducing Bisecting K-Means algorithm #20031 by Michal Krawczyk, Tom Dupre la Tour and Jérémie du Boisberranger. - Enhancement
cluster.SpectralClusteringandcluster.spectral_clusteringnow include the new'cluster_qr'method that clusters samples in the embedding space as an alternative to the existing'kmeans'and'discrete'methods. Seecluster.spectral_clusteringfor more details. #21148 by Andrew Knyazev. - Enhancement Adds get_feature_names_out to
cluster.Birch,cluster.FeatureAgglomeration,cluster.KMeans,cluster.MiniBatchKMeans. #22255 by Thomas Fan. - Enhancement
cluster.SpectralClusteringnow raises consistent error messages when passed invalid values forn_clusters,n_init,gamma,n_neighbors,eigen_tolordegree. #21881 by Hugo Vassard. - Enhancement
cluster.AffinityPropagationnow returns cluster centers and labels if they exist, even if the model has not fully converged. When returning these potentially-degenerate cluster centers and labels, a new warning message is shown. If no cluster centers were constructed, then the cluster centers remain an empty list with labels set to-1and the original warning message is shown. #22217 by Meekail Zain. - Efficiency In
cluster.KMeans, the defaultalgorithmis now"lloyd"which is the full classical EM-style algorithm. Both"auto"and"full"are deprecated and will be removed in version 1.3. They are now aliases for"lloyd". The previous default was"auto", which relied on Elkan’s algorithm. Lloyd’s algorithm uses less memory than Elkan’s, it is faster on many datasets, and its results are identical, hence the change. #21735 by Aurélien Geron. - Fix
cluster.KMeans’sinitparameter now properly supports array-like input and NumPy string scalars. #22154 by Thomas Fan.
sklearn.compose
- Fix
compose.ColumnTransformernow removes validation errors from__init__andset_paramsmethods. #22537 by iofall and Arisa Y.. - Fix get_feature_names_out functionality in
compose.ColumnTransformerwas broken when columns were specified usingslice. This is fixed in #22775 and #22913 by randomgeek78.
sklearn.covariance
- Fix
covariance.GraphicalLassoCVnow accepts NumPy array for the parameteralphas. #22493 by Guillaume Lemaitre.
sklearn.cross_decomposition
- Enhancement the
inverse_transformmethod ofcross_decomposition.PLSRegression,cross_decomposition.PLSCanonicalandcross_decomposition.CCAnow allows reconstruction of aXtarget when aYparameter is given. #19680 by Robin Thibaut. - Enhancement Adds get_feature_names_out to all transformers in the
cross_decompositionmodule:cross_decomposition.CCA,cross_decomposition.PLSSVD,cross_decomposition.PLSRegression, andcross_decomposition.PLSCanonical. #22119 by Thomas Fan. - Fix The shape of the coef_ attribute of
cross_decomposition.CCA,cross_decomposition.PLSCanonicalandcross_decomposition.PLSRegressionwill change in version 1.3, from(n_features, n_targets)to(n_targets, n_features), to be consistent with other linear models and to make it work with interface expecting a specific shape forcoef_(e.g.feature_selection.RFE). #22016 by Guillaume Lemaitre. - API Change add the fitted attribute
intercept_tocross_decomposition.PLSCanonical,cross_decomposition.PLSRegression, andcross_decomposition.CCA. The methodpredictis indeed equivalent toY = X @ coef_ + intercept_. #22015 by Guillaume Lemaitre.
sklearn.datasets
- Feature
datasets.load_filesnow accepts a ignore list and an allow list based on file extensions. #19747 by Tony Attalla and #22498 by Meekail Zain. - Enhancement
datasets.make_swiss_rollnow supports the optional argument hole; when set to True, it returns the swiss-hole dataset. #21482 by Sebastian Pujalte. - Enhancement
datasets.make_blobsno longer copies data during the generation process, therefore uses less memory. #22412 by Zhehao Liu. - Enhancement
datasets.load_diabetesnow accepts the parameterscaled, to allow loading unscaled data. The scaled version of this dataset is now computed from the unscaled data, and can produce slightly different results that in previous version (within a 1e-4 absolute tolerance). #16605 by Mandy Gu. - Enhancement
datasets.fetch_openmlnow has two optional argumentsn_retriesanddelay. By default,datasets.fetch_openmlwill retry 3 times in case of a network failure with a delay between each try. #21901 by Rileran. - Fix
datasets.fetch_covtypeis now concurrent-safe: data is downloaded to a temporary directory before being moved to the data directory. #23113 by Ilion Beyst. - API Change
datasets.make_sparse_coded_signalnow accepts a parameterdata_transposedto explicitly specify the shape of matrixX. The default behaviorTrueis to return a transposed matrixXcorresponding to a(n_features, n_samples)shape. The default value will change toFalsein version 1.3. #21425 by Gabriel Stefanini Vicente.
sklearn.decomposition
- Major Feature Added a new estimator
decomposition.MiniBatchNMF. It is a faster but less accurate version of non-negative matrix factorization, better suited for large datasets. #16948 by Chiara Marmo, Patricio Cerda and Jérémie du Boisberranger. - Enhancement
decomposition.dict_learning,decomposition.dict_learning_onlineanddecomposition.sparse_encodepreserve dtype fornumpy.float32.decomposition.DictionaryLearning,decomposition.MiniBatchDictionaryLearninganddecomposition.SparseCoderpreserve dtype fornumpy.float32. #22002 by Takeshi Oura. - Enhancement
decomposition.PCAexposes a parametern_oversamplesto tuneutils.randomized_svdand get accurate results when the number of features is large. #21109 by Smile. - Enhancement The
decomposition.MiniBatchDictionaryLearninganddecomposition.dict_learning_onlinehave been refactored and now have a stopping criterion based on a small change of the dictionary or objective function, controlled by the newmax_iter,tolandmax_no_improvementparameters. In addition, some of their parameters and attributes are deprecated.- the
n_iterparameter of both is deprecated. Usemax_iterinstead. - the
iter_offset,return_inner_stats,inner_statsandreturn_n_iterparameters ofdecomposition.dict_learning_onlineserve internal purpose and are deprecated. - the
inner_stats_,iter_offset_andrandom_state_attributes ofdecomposition.MiniBatchDictionaryLearningserve internal purpose and are deprecated. - the default value of the
batch_sizeparameter of both will change from 3 to 256 in version 1.3.
- the
- Enhancement
decomposition.SparsePCAanddecomposition.MiniBatchSparsePCApreserve dtype fornumpy.float32. #22111 by Takeshi Oura. - Enhancement
decomposition.TruncatedSVDnow allowsn_components == n_features, ifalgorithm='randomized'. #22181 by Zach Deane-Mayer. - Enhancement Adds get_feature_names_out to all transformers in the
decompositionmodule:decomposition.DictionaryLearning,decomposition.FactorAnalysis,decomposition.FastICA,decomposition.IncrementalPCA,decomposition.KernelPCA,decomposition.LatentDirichletAllocation,decomposition.MiniBatchDictionaryLearning,decomposition.MiniBatchSparsePCA,decomposition.NMF,decomposition.PCA,decomposition.SparsePCA, anddecomposition.TruncatedSVD. #21334 by Thomas Fan. - Enhancement
decomposition.TruncatedSVDexposes the parametern_oversamplesandpower_iteration_normalizerto tuneutils.randomized_svdand get accurate results when the number of features is large, the rank of the matrix is high, or other features of the matrix make low rank approximation difficult. #21705 by Jay S. Stanley III. - Enhancement
decomposition.PCAexposes the parameterpower_iteration_normalizerto tuneutils.randomized_svdand get more accurate results when low rank approximation is difficult. #21705 by Jay S. Stanley III. - Fix
decomposition.FastICAnow validates input parameters infitinstead of__init__. #21432 by Hannah Bohle and Maren Westermann. - Fix
decomposition.FastICAnow acceptsnp.float32data without silent upcasting. The dtype is preserved byfitandfit_transformand the main fitted attributes use a dtype of the same precision as the training data. #22806 by Jihane Bennis and Olivier Grisel. - Fix
decomposition.FactorAnalysisnow validates input parameters infitinstead of__init__. #21713 by Haya and Krum Arnaudov. - Fix
decomposition.KernelPCAnow validates input parameters infitinstead of__init__. #21567 by Maggie Chege. - Fix
decomposition.PCAanddecomposition.IncrementalPCAmore safely calculate precision using the inverse of the covariance matrix ifself.noise_variance_is zero. #22300 by Meekail Zain and #15948 by @sysuresh. - Fix Greatly reduced peak memory usage in
decomposition.PCAwhen callingfitorfit_transform. #22553 by Meekail Zain. - API Change
decomposition.FastICAnow supports unit variance for whitening. The default value of itswhitenargument will change fromTrue(which behaves like'arbitrary-variance') to'unit-variance'in version 1.3. #19490 by Facundo Ferrin and Julien Jerphanion.
sklearn.discriminant_analysis
- Enhancement Adds get_feature_names_out to
discriminant_analysis.LinearDiscriminantAnalysis. #22120 by Thomas Fan. - Fix
discriminant_analysis.LinearDiscriminantAnalysisnow uses the correct variance-scaling coefficient which may result in different model behavior. #15984 by Okon Samuel and #22696 by Meekail Zain.
sklearn.dummy
- Fix
dummy.DummyRegressorno longer overrides theconstantparameter duringfit. #22486 by Thomas Fan.
sklearn.ensemble
- Major Feature Added additional option
loss="quantile"toensemble.HistGradientBoostingRegressorfor modelling quantiles. The quantile level can be specified with the new parameterquantile. #21800 and #20567 by Christian Lorentzen. - Efficiency
fitofensemble.GradientBoostingClassifierandensemble.GradientBoostingRegressornow callsutils.check_arraywith parameterforce_all_finite=Falsefor non initial warm-start runs as it has already been checked before. #22159 by Geoffrey Paris. - Enhancement
ensemble.HistGradientBoostingClassifieris faster, for binary and in particular for multiclass problems thanks to the new private loss function module. #20811, #20567 and #21814 by Christian Lorentzen. - Enhancement Adds support to use pre-fit models with
cv="prefit"inensemble.StackingClassifierandensemble.StackingRegressor. #16748 by Siqi He and #22215 by Meekail Zain. - Enhancement
ensemble.RandomForestClassifierandensemble.ExtraTreesClassifierhave the newcriterion="log_loss", which is equivalent tocriterion="entropy". #23047 by Christian Lorentzen. - Enhancement Adds get_feature_names_out to
ensemble.VotingClassifier, ensemble.VotingRegressor, ensemble.StackingClassifier, and ensemble.StackingRegressor. #22695 and #22697 by Thomas Fan.
- Enhancement ensemble.RandomTreesEmbedding now has an informative get_feature_names_out function that includes both tree index and leaf index in the output feature names. #21762 by Zhehao Liu and Thomas Fan.
- Efficiency Fitting ensemble.RandomForestClassifier, ensemble.RandomForestRegressor, ensemble.ExtraTreesClassifier, ensemble.ExtraTreesRegressor, and ensemble.RandomTreesEmbedding is now faster in a multiprocessing setting, especially for subsequent fits with warm_start enabled. #22106 by Pieter Gijsbers.
- Fix Change the parameter validation_fraction in ensemble.GradientBoostingClassifier and ensemble.GradientBoostingRegressor so that an error is raised if anything other than a float is passed in as an argument. #21632 by Genesis Valencia.
- Fix Removed a potential source of CPU oversubscription in ensemble.HistGradientBoostingClassifier and ensemble.HistGradientBoostingRegressor when CPU resource usage is limited, for instance using cgroups quota in a docker container. #22566 by Jérémie du Boisberranger.
- Fix ensemble.HistGradientBoostingClassifier and ensemble.HistGradientBoostingRegressor no longer warn when fitting on a pandas DataFrame with a non-default scoring parameter and early_stopping enabled. #22908 by Thomas Fan.
- Fix Fixes HTML repr for ensemble.StackingClassifier and ensemble.StackingRegressor. #23097 by Thomas Fan.
- API Change The attribute loss_ of ensemble.GradientBoostingClassifier and ensemble.GradientBoostingRegressor has been deprecated and will be removed in version 1.3. #23079 by Christian Lorentzen.
- API Change Changed the default of max_features to 1.0 for ensemble.RandomForestRegressor and to "sqrt" for ensemble.RandomForestClassifier. Note that these give the same fit results as before, but are much easier to understand. The old default value "auto" has been deprecated and will be removed in version 1.3. The same changes are also applied for ensemble.ExtraTreesRegressor and ensemble.ExtraTreesClassifier. #20803 by Brian Sun.
- Efficiency Improve runtime performance of ensemble.IsolationForest by skipping repetitive input checks. #23149 by Zhehao Liu.
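The max_features change above can be illustrated with a minimal sketch that spells out the new values explicitly instead of relying on the deprecated "auto". The synthetic datasets, the small n_estimators value and the variable names are assumptions made for illustration only.

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Synthetic data with 8 features, used only for illustration.
X_clf, y_clf = make_classification(n_samples=100, n_features=8, random_state=0)
X_reg, y_reg = make_regression(n_samples=100, n_features=8, random_state=0)

# Pass the new defaults explicitly rather than the deprecated "auto".
clf = RandomForestClassifier(n_estimators=10, max_features="sqrt", random_state=0)
reg = RandomForestRegressor(n_estimators=10, max_features=1.0, random_state=0)
clf.fit(X_clf, y_clf)
reg.fit(X_reg, y_reg)

# Each fitted tree records how many features it considers per split.
n_clf = clf.estimators_[0].max_features_  # int(sqrt(8)) features
n_reg = reg.estimators_[0].max_features_  # all features
print(n_clf, n_reg)
```

Writing the value out this way keeps the code free of deprecation warnings and makes the number of candidate features per split easier to read off.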
sklearn.feature_extraction
- Feature feature_extraction.FeatureHasher now supports PyPy. #23023 by Thomas Fan.
- Fix feature_extraction.FeatureHasher now validates input parameters in transform instead of __init__. #21573 by Hannah Bohle and Maren Westermann.
- Fix feature_extraction.text.TfidfVectorizer now does not create a feature_extraction.text.TfidfTransformer at __init__ as required by our API. #21832 by Guillaume Lemaitre.
sklearn.feature_selection
- Feature Added auto mode to feature_selection.SequentialFeatureSelector. If the argument n_features_to_select is 'auto', select features until the score improvement does not exceed the argument tol. The default value of n_features_to_select changed from None to 'warn' in 1.1 and will become 'auto' in 1.3. None and 'warn' will be removed in 1.3. #20145 by murata-yu.
- Feature Added the ability to pass callables to the max_features parameter of feature_selection.SelectFromModel. Also introduced new attribute max_features_ which is inferred from max_features and the data during fit. If max_features is an integer, then max_features_ = max_features. If max_features is a callable, then max_features_ = max_features(X). #22356 by Meekail Zain.
- Enhancement feature_selection.GenericUnivariateSelect preserves float32 dtype. #18482 by Thierry Gameiro and Daniel Kharsa and #22370 by Meekail Zain.
- Enhancement Add a parameter force_finite to feature_selection.f_regression and feature_selection.r_regression. This parameter forces the output to be finite when a feature or the target is constant, or when the feature and target are perfectly correlated (only for the F-statistic). #17819 by Juan Carlos Alfaro Jiménez.
- Efficiency Improve runtime performance of feature_selection.chi2 with boolean arrays. #22235 by Thomas Fan.
- Efficiency Reduced memory usage of feature_selection.chi2. #21837 by Louis Wagner.
sklearn.gaussian_process
- Fix predict and sample_y methods of gaussian_process.GaussianProcessRegressor now return arrays of the correct shape in single-target and multi-target cases, and for both normalize_y=False and normalize_y=True. #22199 by Guillaume Lemaitre, Aidar Shakerimoff and Tenavi Nakamura-Zimmerer.
- Fix gaussian_process.GaussianProcessClassifier raises a more informative error if CompoundKernel is passed via kernel. #22223 by MarcoM.
sklearn.impute
- Enhancement impute.SimpleImputer now warns with feature names when features are skipped due to the lack of any observed values in the training set. #21617 by Christian Ritter.
- Enhancement Added support for pd.NA in impute.SimpleImputer. #21114 by Ying Xiong.
- Enhancement Adds get_feature_names_out to impute.SimpleImputer, impute.KNNImputer, impute.IterativeImputer, and impute.MissingIndicator. #21078 by Thomas Fan.
- API Change The verbose parameter was deprecated for impute.SimpleImputer. A warning will always be raised upon the removal of empty columns. #21448 by Oleh Kozynets and Christian Ritter.
sklearn.inspection
- Feature Add a display to plot the decision boundary of a classifier by using the method inspection.DecisionBoundaryDisplay.from_estimator. #16061 by Thomas Fan.
- Enhancement In inspection.PartialDependenceDisplay.from_estimator, allow kind to accept a list of strings to specify which type of plot to draw for each feature interaction. #19438 by Guillaume Lemaitre.
- Enhancement inspection.PartialDependenceDisplay.from_estimator, inspection.PartialDependenceDisplay.plot, and inspection.plot_partial_dependence now support plotting centered Individual Conditional Expectation (cICE) and centered PDP curves controlled by setting the parameter centered. #18310 by Johannes Elfner and Guillaume Lemaitre.
sklearn.isotonic
- Enhancement Adds get_feature_names_out to isotonic.IsotonicRegression. #22249 by Thomas Fan.
sklearn.kernel_approximation
- Enhancement Adds get_feature_names_out to kernel_approximation.AdditiveChi2Sampler, kernel_approximation.Nystroem, kernel_approximation.PolynomialCountSketch, kernel_approximation.RBFSampler, and kernel_approximation.SkewedChi2Sampler. #22137 and #22694 by Thomas Fan.
sklearn.linear_model
- Feature linear_model.ElasticNet, linear_model.ElasticNetCV, linear_model.Lasso and linear_model.LassoCV support sample_weight for sparse input X. #22808 by Christian Lorentzen.
- Feature linear_model.Ridge with solver="lsqr" now supports fitting sparse input with fit_intercept=True. #22950 by Christian Lorentzen.
- Enhancement linear_model.QuantileRegressor supports sparse input for the highs based solvers. #21086 by Venkatachalam Natchiappan. In addition, those solvers now use the CSC matrix right from the beginning which speeds up fitting. #22206 by Christian Lorentzen.
- Enhancement linear_model.LogisticRegression is faster for solver="lbfgs" and solver="newton-cg", for binary and in particular for multiclass problems thanks to the new private loss function module. In the multiclass case, the memory consumption has also been reduced for these solvers as the target is now label encoded (mapped to integers) instead of label binarized (one-hot encoded). The more classes, the larger the benefit. #21808, #20567 and #21814 by Christian Lorentzen.
- Enhancement linear_model.GammaRegressor, linear_model.PoissonRegressor and linear_model.TweedieRegressor are faster for solver="lbfgs". #22548, #21808 and #20567 by Christian Lorentzen.
- Enhancement Rename parameter base_estimator to estimator in linear_model.RANSACRegressor to improve readability and consistency. base_estimator is deprecated and will be removed in 1.3. #22062 by Adrian Trujillo.
- Enhancement linear_model.ElasticNet and other linear model classes using coordinate descent show error messages when non-finite parameter weights are produced. #22148 by Christian Ritter and Norbert Preining.
- Enhancement linear_model.ElasticNet and linear_model.Lasso now raise consistent error messages when passed invalid values for l1_ratio, alpha, max_iter and tol. #22240 by Arturo Amor.
- Enhancement linear_model.BayesianRidge and linear_model.ARDRegression now preserve float32 dtype. #9087 by Arthur Imbert and #22525 by Meekail Zain.
- Enhancement linear_model.RidgeClassifier now supports multilabel classification. #19689 by Guillaume Lemaitre.
- Enhancement linear_model.RidgeCV and linear_model.RidgeClassifierCV now raise consistent error messages when passed invalid values for alphas. #21606 by Arturo Amor.
- Enhancement linear_model.Ridge and linear_model.RidgeClassifier now raise consistent error messages when passed invalid values for alpha, max_iter and tol. #21341 by Arturo Amor.
- Enhancement linear_model.orthogonal_mp_gram preserves dtype for numpy.float32. #22002 by Takeshi Oura.
- Fix linear_model.LassoLarsIC now correctly computes AIC and BIC. An error is now raised when n_features > n_samples and when the noise variance is not provided. #21481 by Guillaume Lemaitre and Andrés Babino.
- Fix linear_model.TheilSenRegressor now validates input parameter max_subpopulation in fit instead of __init__. #21767 by Maren Westermann.
- Fix linear_model.ElasticNetCV now produces correct warning when l1_ratio=0. #21724 by Yar Khine Phyo.
- Fix linear_model.LogisticRegression and linear_model.LogisticRegressionCV now set the n_iter_ attribute with a shape that respects the docstring and that is consistent with the shape obtained when using the other solvers in the one-vs-rest setting. Previously, it would record only the maximum of the number of iterations for each binary sub-problem while now all of them are recorded. #21998 by Olivier Grisel.
- Fix The property family of linear_model.TweedieRegressor is not validated in __init__ anymore. Instead, this (private) property is deprecated in linear_model.GammaRegressor, linear_model.PoissonRegressor and linear_model.TweedieRegressor, and will be removed in 1.3. #22548 by Christian Lorentzen.
- Fix The coef_ and intercept_ attributes of linear_model.LinearRegression are now correctly computed in the presence of sample weights when the input is sparse. #22891 by Jérémie du Boisberranger.
- Fix The coef_ and intercept_ attributes of linear_model.Ridge with solver="sparse_cg" and solver="lbfgs" are now correctly computed in the presence of sample weights when the input is sparse. #22899 by Jérémie du Boisberranger.
- Fix linear_model.SGDRegressor and linear_model.SGDClassifier now compute the validation error correctly when early stopping is enabled. #23256 by Zhehao Liu.
- API Change linear_model.LassoLarsIC now exposes noise_variance as a parameter in order to provide an estimate of the noise variance. This is particularly relevant when n_features > n_samples and the estimator of the noise variance cannot be computed. #21481 by Guillaume Lemaitre.
sklearn.manifold
- Feature manifold.Isomap now supports radius-based neighbors via the radius argument. #19794 by Zhehao Liu.
- Enhancement manifold.spectral_embedding and manifold.SpectralEmbedding support np.float32 dtype and will preserve this dtype. #21534 by Andrew Knyazev.
- Enhancement Adds get_feature_names_out to manifold.Isomap and manifold.LocallyLinearEmbedding. #22254 by Thomas Fan.
- Enhancement Added metric_params to the manifold.TSNE constructor for additional parameters of the distance metric to use in optimization. #21805 by Jeanne Dionisi and #22685 by Meekail Zain.
- Enhancement manifold.trustworthiness raises an error if n_neighbors >= n_samples / 2 to ensure a correct support for the function. #18832 by Hong Shao Yang and #23033 by Meekail Zain.
- Fix manifold.spectral_embedding now uses Gaussian instead of the previous uniform on [0, 1] random initial approximations to eigenvectors in eigen_solvers lobpcg and amg to improve their numerical stability. #21565 by Andrew Knyazev.
sklearn.metrics
- Feature metrics.r2_score and metrics.explained_variance_score have a new force_finite parameter. Setting this parameter to False will return the actual non-finite score in case of perfect predictions or constant y_true, instead of the finite approximation (1.0 and 0.0 respectively) currently returned by default. #17266 by Sylvain Marié.
- Feature metrics.d2_pinball_score and metrics.d2_absolute_error_score calculate the D2 regression score for the pinball loss and the absolute error respectively. metrics.d2_absolute_error_score is a special case of metrics.d2_pinball_score with a fixed quantile parameter alpha=0.5 for ease of use and discovery. The D2 scores are generalizations of the r2_score and can be interpreted as the fraction of deviance explained. #22118 by Ohad Michel.
- Enhancement metrics.top_k_accuracy_score raises an improved error message when y_true is binary and y_score is 2d. #22284 by Thomas Fan.
- Enhancement metrics.roc_auc_score now supports average=None in the multiclass case when multiclass='ovr' which will return the score per class. #19158 by Nicki Skafte.
- Enhancement Adds im_kw parameter to metrics.ConfusionMatrixDisplay.from_estimator, metrics.ConfusionMatrixDisplay.from_predictions, and metrics.ConfusionMatrixDisplay.plot. The im_kw parameter is passed to the matplotlib.pyplot.imshow call when plotting the confusion matrix. #20753 by Thomas Fan.
- Fix metrics.silhouette_score now supports integer input for precomputed distances. #22108 by Thomas Fan.
- Fix Fixed a bug in metrics.normalized_mutual_info_score which could return unbounded values. #22635 by Jérémie du Boisberranger.
- Fix Fixes metrics.precision_recall_curve and metrics.average_precision_score when true labels are all negative. #19085 by Varun Agrawal.
- API Change metrics.SCORERS is now deprecated and will be removed in 1.3. Please use metrics.get_scorer_names to retrieve the names of all available scorers. #22866 by Adrin Jalali.
- API Change Parameters sample_weight and multioutput of metrics.mean_absolute_percentage_error are now keyword-only, in accordance with SLEP009. A deprecation cycle was introduced. #21576 by Paul-Emile Dugnat.
- API Change The "wminkowski" metric of metrics.DistanceMetric is deprecated and will be removed in version 1.3. Instead the existing "minkowski" metric now takes in an optional w parameter for weights. This deprecation aims at remaining consistent with SciPy 1.8 convention. #21873 by Yar Khine Phyo.
- API Change metrics.DistanceMetric has been moved from sklearn.neighbors to sklearn.metrics. Using neighbors.DistanceMetric for imports is still valid for backward compatibility, but this alias will be removed in 1.3. #21177 by Julien Jerphanion.
sklearn.mixture
- Enhancement mixture.GaussianMixture and mixture.BayesianGaussianMixture can now be initialized using k-means++ and random data points. #20408 by Gordon Walsh, Alberto Ceballos and Andres Rios.
- Fix Fix a bug so that precisions_cholesky_ is correctly initialized in mixture.GaussianMixture when providing precisions_init by taking its square root. #22058 by Guillaume Lemaitre.
- Fix mixture.GaussianMixture now normalizes weights_ more safely, preventing rounding errors when calling mixture.GaussianMixture.sample with n_components=1. #23034 by Meekail Zain.
sklearn.model_selection
- Enhancement It is now possible to pass scoring="matthews_corrcoef" to all model selection tools with a scoring argument to use the Matthews correlation coefficient (MCC). #22203 by Olivier Grisel.
- Enhancement Raise an error during cross-validation when the fits for all the splits failed. Similarly raise an error during grid-search when the fits for all the models and all the splits failed. #21026 by Loïc Estève.
- Fix model_selection.GridSearchCV and model_selection.HalvingGridSearchCV now validate input parameters in fit instead of __init__. #21880 by Mrinal Tyagi.
- Fix model_selection.learning_curve now supports partial_fit with regressors. #22982 by Thomas Fan.
sklearn.multiclass
- Enhancement multiclass.OneVsRestClassifier now supports a verbose parameter so progress on fitting can be seen. #22508 by Chris Combs.
- Fix multiclass.OneVsOneClassifier.predict returns correct predictions when the inner classifier only has a predict_proba. #22604 by Thomas Fan.
sklearn.neighbors
- Enhancement Adds get_feature_names_out to neighbors.RadiusNeighborsTransformer, neighbors.KNeighborsTransformer and neighbors.NeighborhoodComponentsAnalysis. #22212 by Meekail Zain.
- Fix neighbors.KernelDensity now validates input parameters in fit instead of __init__. #21430 by Desislava Vasileva and Lucy Jimenez.
- Fix neighbors.KNeighborsRegressor.predict now works properly when given an array-like input if KNeighborsRegressor is first constructed with a callable passed to the weights parameter. #22687 by Meekail Zain.
sklearn.neural_network
- Enhancement neural_network.MLPClassifier and neural_network.MLPRegressor show error messages when optimizers produce non-finite parameter weights. #22150 by Christian Ritter and Norbert Preining.
- Enhancement Adds get_feature_names_out to neural_network.BernoulliRBM. #22248 by Thomas Fan.
sklearn.pipeline
- Enhancement Added support for “passthrough” in pipeline.FeatureUnion. Setting a transformer to “passthrough” will pass the features unchanged. #20860 by Shubhraneel Pal.
- Fix pipeline.Pipeline now does not validate hyper-parameters in __init__ but in .fit(). #21888 by iofall and Arisa Y..
- Fix pipeline.FeatureUnion does not validate hyper-parameters in __init__. Validation is now handled in .fit() and .fit_transform(). #21954 by iofall and Arisa Y..
- Fix Defines __sklearn_is_fitted__ in pipeline.FeatureUnion to return correct result with utils.validation.check_is_fitted. #22953 by randomgeek78.
sklearn.preprocessing
- Feature preprocessing.OneHotEncoder now supports grouping infrequent categories into a single feature. Grouping infrequent categories is enabled by specifying how to select infrequent categories with min_frequency or max_categories. #16018 by Thomas Fan.
- Enhancement Adds a subsample parameter to preprocessing.KBinsDiscretizer. This allows specifying a maximum number of samples to be used while fitting the model. The option is only available when strategy is set to quantile. #21445 by Felipe Bidu and Amanda Dsouza.
- Enhancement Adds encoded_missing_value to preprocessing.OrdinalEncoder to configure the encoded value for missing data. #21988 by Thomas Fan.
- Enhancement Added the get_feature_names_out method and a new parameter feature_names_out to preprocessing.FunctionTransformer. You can set feature_names_out to ‘one-to-one’ to use the input feature names as the output feature names, or you can set it to a callable that returns the output feature names. This is especially useful when the transformer changes the number of features. If feature_names_out is None (which is the default), then get_feature_names_out is not defined. #21569 by Aurélien Geron.
- Enhancement Adds get_feature_names_out to preprocessing.Normalizer, preprocessing.KernelCenterer, preprocessing.OrdinalEncoder, and preprocessing.Binarizer. #21079 by Thomas Fan.
- Fix preprocessing.PowerTransformer with method='yeo-johnson' better supports significantly non-Gaussian data when searching for an optimal lambda. #20653 by Thomas Fan.
- Fix preprocessing.LabelBinarizer now validates input parameters in fit instead of __init__. #21434 by Krum Arnaudov.
- Fix preprocessing.FunctionTransformer with check_inverse=True now provides informative error message when input has mixed dtypes. #19916 by Zhehao Liu.
- Fix preprocessing.KBinsDiscretizer handles bin edges more consistently now. #14975 by Andreas Müller and #22526 by Meekail Zain.
- Fix Adds preprocessing.KBinsDiscretizer.get_feature_names_out support when encode="ordinal". #22735 by Thomas Fan.
sklearn.random_projection
- Enhancement Adds an inverse_transform method and a compute_inverse_transform parameter to random_projection.GaussianRandomProjection and random_projection.SparseRandomProjection. When the parameter is set to True, the pseudo-inverse of the components is computed during fit and stored as inverse_components_. #21701 by Aurélien Geron.
- Enhancement random_projection.SparseRandomProjection and random_projection.GaussianRandomProjection preserve dtype for numpy.float32. #22114 by Takeshi Oura.
- Enhancement Adds get_feature_names_out to all transformers in the sklearn.random_projection module: random_projection.GaussianRandomProjection and random_projection.SparseRandomProjection. #21330 by Loïc Estève.
sklearn.svm
- Enhancement svm.OneClassSVM, svm.NuSVC, svm.NuSVR, svm.SVC and svm.SVR now expose n_iter_, the number of iterations of the libsvm optimization routine. #21408 by Juan Martín Loyola.
- Enhancement svm.SVR, svm.SVC, svm.NuSVR, svm.OneClassSVM and svm.NuSVC now raise an error when the dual-gap estimation produces non-finite parameter weights. #22149 by Christian Ritter and Norbert Preining.
- Fix svm.NuSVC, svm.NuSVR, svm.SVC, svm.SVR and svm.OneClassSVM now validate input parameters in fit instead of __init__. #21436 by Haidar Almubarak.
sklearn.tree
- Enhancement tree.DecisionTreeClassifier and tree.ExtraTreeClassifier have the new criterion="log_loss", which is equivalent to criterion="entropy". #23047 by Christian Lorentzen.
- Fix Fix a bug in the Poisson splitting criterion for tree.DecisionTreeRegressor. #22191 by Christian Lorentzen.
- API Change Changed the default value of max_features to 1.0 for tree.ExtraTreeRegressor and to "sqrt" for tree.ExtraTreeClassifier, which will not change the fit result. The original default value "auto" has been deprecated and will be removed in version 1.3. Setting max_features to "auto" is also deprecated for tree.DecisionTreeClassifier and tree.DecisionTreeRegressor. #22476 by Zhehao Liu.
sklearn.utils
- Enhancement utils.check_array and utils.multiclass.type_of_target now accept an input_name parameter to make the error message more informative when passed invalid input data (e.g. with NaN or infinite values). #21219 by Olivier Grisel.
- Enhancement utils.check_array returns a float ndarray with np.nan when passed a Float32 or Float64 pandas extension array with pd.NA. #21278 by Thomas Fan.
- Enhancement utils.estimator_html_repr shows a more helpful error message when running in a jupyter notebook that is not trusted. #21316 by Thomas Fan.
- Enhancement utils.estimator_html_repr displays an arrow on the top left corner of the HTML representation to show how the elements are clickable. #21298 by Thomas Fan.
- Enhancement utils.check_array with dtype=None returns numeric arrays when passed in a pandas DataFrame with mixed dtypes. dtype="numeric" will also better infer the dtype when the DataFrame has mixed dtypes. #22237 by Thomas Fan.
- Enhancement utils.check_scalar now has better messages when displaying the type. #22218 by Thomas Fan.
- Fix Changes the error message of the ValueError raised by utils.check_X_y when y is None so that it is compatible with the check_requires_y_none estimator check. #22578 by Claudio Salvatore Arcidiacono.
- Fix utils.class_weight.compute_class_weight now only requires that all classes in y have a weight in class_weight. An error is still raised when a class is present in y but not in class_weight. #22595 by Thomas Fan.
- Fix utils.estimator_html_repr has an improved visualization for nested meta-estimators. #21310 by Thomas Fan.
- Fix utils.check_scalar raises an error when include_boundaries={"left", "right"} and the boundaries are not set. #22027 by Marie Lanternier.
- Fix utils.metaestimators.available_if correctly returns a bound method that can be pickled. #23077 by Thomas Fan.
- API Change utils.estimator_checks.check_estimator’s argument is now called estimator (previous name was Estimator). #22188 by Mathurin Massias.
- API Change utils.metaestimators.if_delegate_has_method is deprecated and will be removed in version 1.3. Use utils.metaestimators.available_if instead. #22830 by Jérémie du Boisberranger.