scikit-learn 1.1 Now Available
scikit-learn is an open source machine learning library that supports supervised and unsupervised learning, and is used by an estimated 80% of data scientists, according to a recent Kaggle survey.
The library contains implementations of many common ML algorithms and models, including the widely-used linear regression, decision tree, and gradient-boosting algorithms. It also provides various tools for model fitting, data preprocessing, model selection and evaluation, and many other utilities.
Highlights include:
- Quantile loss in ensemble.HistGradientBoostingRegressor
- get_feature_names_out available in all transformers
- Grouping infrequent categories in OneHotEncoder
- Performance improvements
- MiniBatchNMF: an online version of NMF
- BisectingKMeans: divide and cluster
For more details on the main highlights of the release, please refer to Release Highlights for scikit-learn 1.1.
To install the latest version (with pip):
pip install --upgrade scikit-learn
or with conda:
conda install -c conda-forge scikit-learn
Version 1.1.0
For a short description of the main highlights of the release, please refer to Release Highlights for scikit-learn 1.1.
- Major Feature : something big that you couldn’t do before.
- Feature : something that you couldn’t do before.
- Efficiency : an existing feature now may not require as much computation or memory.
- Enhancement : a miscellaneous minor improvement.
- Fix : something that previously didn’t work as documented – or according to reasonable expectations – should now work.
- API Change : you will need to change your code to have the same effect in the future; or a feature will be removed in the future.
Version 1.1.0 of scikit-learn requires Python 3.8+, NumPy 1.17.3+ and SciPy 1.3.2+. The optional minimal dependency is matplotlib 3.1.2+.
Changed models
The following estimators and functions, when fit with the same data and parameters, may produce different models from the previous version. This often occurs due to changes in the modelling logic (bug fixes or enhancements), or in random sampling procedures.
- Efficiency cluster.KMeans now defaults to algorithm="lloyd" instead of algorithm="auto", which was equivalent to algorithm="elkan". Lloyd’s algorithm and Elkan’s algorithm converge to the same solution, up to numerical rounding errors, but in general Lloyd’s algorithm uses much less memory, and it is often faster.
- Efficiency Fitting tree.DecisionTreeClassifier, tree.DecisionTreeRegressor, ensemble.RandomForestClassifier, ensemble.RandomForestRegressor, ensemble.GradientBoostingClassifier, and ensemble.GradientBoostingRegressor is on average 15% faster than in previous versions thanks to a new sort algorithm to find the best split. Models might be different because of a different handling of splits with tied criterion values: both the old and the new sorting algorithm are unstable sorting algorithms. #22868 by Thomas Fan.
- Fix The eigenvectors initialization for cluster.SpectralClustering and manifold.SpectralEmbedding now samples from a Gaussian when using the 'amg' or 'lobpcg' solver. This change improves numerical stability of the solver, but may result in a different model.
- Fix feature_selection.f_regression and feature_selection.r_regression now return finite scores by default instead of np.nan and np.inf in some corner cases. Use force_finite=False to get non-finite values and keep the old behavior.
- Fix pandas DataFrames whose columns are all non-strings, such as a MultiIndex, no longer warn when passed into an estimator. Estimators will continue to ignore the column names in DataFrames with non-string columns. For feature_names_in_ to be defined, columns must be all strings. #22410 by Thomas Fan.
- Fix preprocessing.KBinsDiscretizer changed handling of bin edges slightly, which might result in a different encoding with the same data.
- Fix calibration.calibration_curve changed handling of bin edges slightly, which might result in a different output curve given the same data.
- Fix discriminant_analysis.LinearDiscriminantAnalysis now uses the correct variance-scaling coefficient which may result in different model behavior.
- Fix feature_selection.SelectFromModel.fit and feature_selection.SelectFromModel.partial_fit can now be called with prefit=True. estimator_ will be a deep copy of estimator when prefit=True. #23271 by Guillaume Lemaitre.
Changelog
- Efficiency Low-level routines for reductions on pairwise distances for dense float64 datasets have been refactored. The following functions and estimators now benefit from improved performance in terms of hardware scalability and speed-ups:
  - sklearn.metrics.pairwise_distances_argmin
  - sklearn.metrics.pairwise_distances_argmin_min
  - sklearn.cluster.AffinityPropagation
  - sklearn.cluster.Birch
  - sklearn.cluster.MeanShift
  - sklearn.cluster.OPTICS
  - sklearn.cluster.SpectralClustering
  - sklearn.feature_selection.mutual_info_regression
  - sklearn.neighbors.KNeighborsClassifier
  - sklearn.neighbors.KNeighborsRegressor
  - sklearn.neighbors.RadiusNeighborsClassifier
  - sklearn.neighbors.RadiusNeighborsRegressor
  - sklearn.neighbors.LocalOutlierFactor
  - sklearn.neighbors.NearestNeighbors
  - sklearn.manifold.Isomap
  - sklearn.manifold.LocallyLinearEmbedding
  - sklearn.manifold.TSNE
  - sklearn.manifold.trustworthiness
  - sklearn.semi_supervised.LabelPropagation
  - sklearn.semi_supervised.LabelSpreading
  For instance, sklearn.neighbors.NearestNeighbors.kneighbors and sklearn.neighbors.NearestNeighbors.radius_neighbors can respectively be up to ×20 and ×5 faster than previously. #21987, #22064, #22065, #22288 and #22320 by Julien Jerphanion.
- Enhancement All scikit-learn models now generate a more informative error message when some input contains unexpected NaN or infinite values. In particular the message contains the input name (“X”, “y” or “sample_weight”) and if an unexpected NaN value is found in X, the error message suggests potential solutions. #21219 by Olivier Grisel.
- Enhancement All scikit-learn models now generate a more informative error message when setting invalid hyper-parameters with set_params. #21542 by Olivier Grisel.
- Enhancement Removes random unique identifiers in the HTML representation. With this change, Jupyter notebooks are reproducible as long as the cells are run in the same order. #23098 by Thomas Fan.
- Fix Estimators with the non_deterministic tag set to True will skip both the check_methods_sample_order_invariance and check_methods_subset_invariance tests. #22318 by Zhehao Liu.
- API Change The option for using the log loss, aka binomial or multinomial deviance, via the loss parameter was made more consistent. The preferred way is by setting the value to "log_loss". Old option names are still valid and produce the same models, but are deprecated and will be removed in version 1.3.
  - For ensemble.GradientBoostingClassifier, the loss parameter name “deviance” is deprecated in favor of the new name “log_loss”, which is now the default. #23036 by Christian Lorentzen.
  - For ensemble.HistGradientBoostingClassifier, the loss parameter names “auto”, “binary_crossentropy” and “categorical_crossentropy” are deprecated in favor of the new name “log_loss”, which is now the default. #23040 by Christian Lorentzen.
  - For linear_model.SGDClassifier, the loss parameter name “log” is deprecated in favor of the new name “log_loss”. #23046 by Christian Lorentzen.
- API Change Rich HTML representation of estimators is now enabled by default in Jupyter notebooks. It can be deactivated by setting display='text' in sklearn.set_config. #22856 by Jérémie du Boisberranger.
- Enhancement The error message is improved when importing model_selection.HalvingGridSearchCV, model_selection.HalvingRandomSearchCV, or impute.IterativeImputer without importing the experimental flag. #23194 by Thomas Fan.
- Enhancement Added an extension in doc/conf.py to automatically generate the list of estimators that handle NaN values. #23198 by Lise Kleiber, Zhehao Liu and Chiara Marmo.
sklearn.calibration
- Enhancement calibration.calibration_curve accepts a parameter pos_label to specify the positive class label. #21032 by Guillaume Lemaitre.
- Enhancement calibration.CalibratedClassifierCV.fit now supports passing fit_params, which are routed to the base_estimator. #18170 by Benjamin Bossan.
- Enhancement calibration.CalibrationDisplay accepts a parameter pos_label to add this information to the plot. #21038 by Guillaume Lemaitre.
- Fix calibration.calibration_curve handles bin edges more consistently now. #14975 by Andreas Müller and #22526 by Meekail Zain.
- API Change calibration.calibration_curve’s normalize parameter is now deprecated and will be removed in version 1.3. It is recommended that a proper probability (i.e. a classifier’s predict_proba positive class) is used for y_prob. #23095 by Jordan Silke.
sklearn.cluster
- Major Feature Added cluster.BisectingKMeans, introducing the Bisecting K-Means algorithm. #20031 by Michal Krawczyk, Tom Dupre la Tour and Jérémie du Boisberranger.
- Enhancement cluster.SpectralClustering and cluster.spectral_clustering now include the new 'cluster_qr' method that clusters samples in the embedding space as an alternative to the existing 'kmeans' and 'discrete' methods. See cluster.spectral_clustering for more details. #21148 by Andrew Knyazev.
- Enhancement Adds get_feature_names_out to cluster.Birch, cluster.FeatureAgglomeration, cluster.KMeans, cluster.MiniBatchKMeans. #22255 by Thomas Fan.
- Enhancement cluster.SpectralClustering now raises consistent error messages when passed invalid values for n_clusters, n_init, gamma, n_neighbors, eigen_tol or degree. #21881 by Hugo Vassard.
- Enhancement cluster.AffinityPropagation now returns cluster centers and labels if they exist, even if the model has not fully converged. When returning these potentially-degenerate cluster centers and labels, a new warning message is shown. If no cluster centers were constructed, then the cluster centers remain an empty list with labels set to -1 and the original warning message is shown. #22217 by Meekail Zain.
- Efficiency In cluster.KMeans, the default algorithm is now "lloyd" which is the full classical EM-style algorithm. Both "auto" and "full" are deprecated and will be removed in version 1.3. They are now aliases for "lloyd". The previous default was "auto", which relied on Elkan’s algorithm. Lloyd’s algorithm uses less memory than Elkan’s, it is faster on many datasets, and its results are identical, hence the change. #21735 by Aurélien Geron.
- Fix cluster.KMeans’s init parameter now properly supports array-like input and NumPy string scalars. #22154 by Thomas Fan.
sklearn.compose
- Fix compose.ColumnTransformer now removes validation errors from __init__ and set_params methods. #22537 by iofall and Arisa Y.
- Fix get_feature_names_out functionality in compose.ColumnTransformer was broken when columns were specified using slice. This is fixed in #22775 and #22913 by randomgeek78.
sklearn.covariance
- Fix covariance.GraphicalLassoCV now accepts NumPy array for the parameter alphas. #22493 by Guillaume Lemaitre.
sklearn.cross_decomposition
- Enhancement The inverse_transform method of cross_decomposition.PLSRegression, cross_decomposition.PLSCanonical and cross_decomposition.CCA now allows reconstruction of a X target when a Y parameter is given. #19680 by Robin Thibaut.
- Enhancement Adds get_feature_names_out to all transformers in the cross_decomposition module: cross_decomposition.CCA, cross_decomposition.PLSSVD, cross_decomposition.PLSRegression, and cross_decomposition.PLSCanonical. #22119 by Thomas Fan.
- Fix The shape of the coef_ attribute of cross_decomposition.CCA, cross_decomposition.PLSCanonical and cross_decomposition.PLSRegression will change in version 1.3, from (n_features, n_targets) to (n_targets, n_features), to be consistent with other linear models and to make it work with interfaces expecting a specific shape for coef_ (e.g. feature_selection.RFE). #22016 by Guillaume Lemaitre.
- API Change Added the fitted attribute intercept_ to cross_decomposition.PLSCanonical, cross_decomposition.PLSRegression, and cross_decomposition.CCA. The method predict is indeed equivalent to Y = X @ coef_ + intercept_. #22015 by Guillaume Lemaitre.
sklearn.datasets
- Feature datasets.load_files now accepts an ignore list and an allow list based on file extensions. #19747 by Tony Attalla and #22498 by Meekail Zain.
- Enhancement datasets.make_swiss_roll now supports the optional argument hole; when set to True, it returns the swiss-hole dataset. #21482 by Sebastian Pujalte.
- Enhancement datasets.make_blobs no longer copies data during the generation process, therefore uses less memory. #22412 by Zhehao Liu.
- Enhancement datasets.load_diabetes now accepts the parameter scaled, to allow loading unscaled data. The scaled version of this dataset is now computed from the unscaled data, and can produce slightly different results than in previous versions (within a 1e-4 absolute tolerance). #16605 by Mandy Gu.
- Enhancement datasets.fetch_openml now has two optional arguments n_retries and delay. By default, datasets.fetch_openml will retry 3 times in case of a network failure with a delay between each try. #21901 by Rileran.
- Fix datasets.fetch_covtype is now concurrent-safe: data is downloaded to a temporary directory before being moved to the data directory. #23113 by Ilion Beyst.
- API Change datasets.make_sparse_coded_signal now accepts a parameter data_transposed to explicitly specify the shape of matrix X. The default behavior True is to return a transposed matrix X corresponding to a (n_features, n_samples) shape. The default value will change to False in version 1.3. #21425 by Gabriel Stefanini Vicente.
sklearn.decomposition
- Major Feature Added a new estimator decomposition.MiniBatchNMF. It is a faster but less accurate version of non-negative matrix factorization, better suited for large datasets. #16948 by Chiara Marmo, Patricio Cerda and Jérémie du Boisberranger.
- Enhancement decomposition.dict_learning, decomposition.dict_learning_online and decomposition.sparse_encode preserve dtype for numpy.float32. decomposition.DictionaryLearning, decomposition.MiniBatchDictionaryLearning and decomposition.SparseCoder preserve dtype for numpy.float32. #22002 by Takeshi Oura.
- Enhancement decomposition.PCA exposes a parameter n_oversamples to tune utils.randomized_svd and get accurate results when the number of features is large. #21109 by Smile.
- Enhancement decomposition.MiniBatchDictionaryLearning and decomposition.dict_learning_online have been refactored and now have a stopping criterion based on a small change of the dictionary or objective function, controlled by the new max_iter, tol and max_no_improvement parameters. In addition, some of their parameters and attributes are deprecated.
  - the n_iter parameter of both is deprecated. Use max_iter instead.
  - the iter_offset, return_inner_stats, inner_stats and return_n_iter parameters of decomposition.dict_learning_online serve internal purpose and are deprecated.
  - the inner_stats_, iter_offset_ and random_state_ attributes of decomposition.MiniBatchDictionaryLearning serve internal purpose and are deprecated.
  - the default value of the batch_size parameter of both will change from 3 to 256 in version 1.3.
- Enhancement decomposition.SparsePCA and decomposition.MiniBatchSparsePCA preserve dtype for numpy.float32. #22111 by Takeshi Oura.
- Enhancement decomposition.TruncatedSVD now allows n_components == n_features, if algorithm='randomized'. #22181 by Zach Deane-Mayer.
- Enhancement Adds get_feature_names_out to all transformers in the decomposition module: decomposition.DictionaryLearning, decomposition.FactorAnalysis, decomposition.FastICA, decomposition.IncrementalPCA, decomposition.KernelPCA, decomposition.LatentDirichletAllocation, decomposition.MiniBatchDictionaryLearning, decomposition.MiniBatchSparsePCA, decomposition.NMF, decomposition.PCA, decomposition.SparsePCA, and decomposition.TruncatedSVD. #21334 by Thomas Fan.
- Enhancement decomposition.TruncatedSVD exposes the parameters n_oversamples and power_iteration_normalizer to tune utils.randomized_svd and get accurate results when the number of features is large, the rank of the matrix is high, or other features of the matrix make low rank approximation difficult. #21705 by Jay S. Stanley III.
- Enhancement decomposition.PCA exposes the parameter power_iteration_normalizer to tune utils.randomized_svd and get more accurate results when low rank approximation is difficult. #21705 by Jay S. Stanley III.
- Fix decomposition.FastICA now validates input parameters in fit instead of __init__. #21432 by Hannah Bohle and Maren Westermann.
- Fix decomposition.FastICA now accepts np.float32 data without silent upcasting. The dtype is preserved by fit and fit_transform and the main fitted attributes use a dtype of the same precision as the training data. #22806 by Jihane Bennis and Olivier Grisel.
- Fix decomposition.FactorAnalysis now validates input parameters in fit instead of __init__. #21713 by Haya and Krum Arnaudov.
- Fix decomposition.KernelPCA now validates input parameters in fit instead of __init__. #21567 by Maggie Chege.
- Fix decomposition.PCA and decomposition.IncrementalPCA more safely calculate precision using the inverse of the covariance matrix if self.noise_variance_ is zero. #22300 by Meekail Zain and #15948 by @sysuresh.
- Fix Greatly reduced peak memory usage in decomposition.PCA when calling fit or fit_transform. #22553 by Meekail Zain.
- API Change decomposition.FastICA now supports unit variance for whitening. The default value of its whiten argument will change from True (which behaves like 'arbitrary-variance') to 'unit-variance' in version 1.3. #19490 by Facundo Ferrin and Julien Jerphanion.
sklearn.discriminant_analysis
- Enhancement Adds get_feature_names_out to discriminant_analysis.LinearDiscriminantAnalysis. #22120 by Thomas Fan.
- Fix discriminant_analysis.LinearDiscriminantAnalysis now uses the correct variance-scaling coefficient which may result in different model behavior. #15984 by Okon Samuel and #22696 by Meekail Zain.
sklearn.dummy
- Fix dummy.DummyRegressor no longer overrides the constant parameter during fit. #22486 by Thomas Fan.
sklearn.ensemble
- Major Feature Added additional option loss="quantile" to ensemble.HistGradientBoostingRegressor for modelling quantiles. The quantile level can be specified with the new parameter quantile. #21800 and #20567 by Christian Lorentzen.
- Efficiency fit of ensemble.GradientBoostingClassifier and ensemble.GradientBoostingRegressor now calls utils.check_array with parameter force_all_finite=False for non initial warm-start runs as it has already been checked before. #22159 by Geoffrey Paris.
- Enhancement ensemble.HistGradientBoostingClassifier is faster, for binary and in particular for multiclass problems thanks to the new private loss function module. #20811, #20567 and #21814 by Christian Lorentzen.
- Enhancement Adds support to use pre-fit models with cv="prefit" in ensemble.StackingClassifier and ensemble.StackingRegressor. #16748 by Siqi He and #22215 by Meekail Zain.
- Enhancement ensemble.RandomForestClassifier and ensemble.ExtraTreesClassifier have the new criterion="log_loss", which is equivalent to criterion="entropy". #23047 by Christian Lorentzen.
- Enhancement Adds get_feature_names_out to ensemble.VotingClassifier, ensemble.VotingRegressor, ensemble.StackingClassifier, and ensemble.StackingRegressor. #22695 and #22697 by Thomas Fan.
- Enhancement ensemble.RandomTreesEmbedding now has an informative get_feature_names_out function that includes both tree index and leaf index in the output feature names. #21762 by Zhehao Liu and Thomas Fan.
- Efficiency Fitting an ensemble.RandomForestClassifier, ensemble.RandomForestRegressor, ensemble.ExtraTreesClassifier, ensemble.ExtraTreesRegressor, and ensemble.RandomTreesEmbedding is now faster in a multiprocessing setting, especially for subsequent fits with warm_start enabled. #22106 by Pieter Gijsbers.
- Fix Change the parameter validation_fraction in ensemble.GradientBoostingClassifier and ensemble.GradientBoostingRegressor so that an error is raised if anything other than a float is passed in as an argument. #21632 by Genesis Valencia.
- Fix Removed a potential source of CPU oversubscription in ensemble.HistGradientBoostingClassifier and ensemble.HistGradientBoostingRegressor when CPU resource usage is limited, for instance using cgroups quota in a docker container. #22566 by Jérémie du Boisberranger.
- Fix ensemble.HistGradientBoostingClassifier and ensemble.HistGradientBoostingRegressor no longer warn when fitting on a pandas DataFrame with a non-default scoring parameter and early_stopping enabled. #22908 by Thomas Fan.
- Fix Fixes HTML repr for ensemble.StackingClassifier and ensemble.StackingRegressor. #23097 by Thomas Fan.
- API Change The attribute loss_ of ensemble.GradientBoostingClassifier and ensemble.GradientBoostingRegressor has been deprecated and will be removed in version 1.3. #23079 by Christian Lorentzen.
- API Change Changed the default of max_features to 1.0 for ensemble.RandomForestRegressor and to "sqrt" for ensemble.RandomForestClassifier. Note that these give the same fit results as before, but are much easier to understand. The old default value "auto" has been deprecated and will be removed in version 1.3. The same changes are also applied for ensemble.ExtraTreesRegressor and ensemble.ExtraTreesClassifier. #20803 by Brian Sun.
- Efficiency Improve runtime performance of ensemble.IsolationForest by skipping repetitive input checks. #23149 by Zhehao Liu.
sklearn.feature_extraction
- Feature feature_extraction.FeatureHasher now supports PyPy. #23023 by Thomas Fan.
- Fix feature_extraction.FeatureHasher now validates input parameters in transform instead of __init__. #21573 by Hannah Bohle and Maren Westermann.
- Fix feature_extraction.text.TfidfVectorizer now does not create a feature_extraction.text.TfidfTransformer at __init__ as required by our API. #21832 by Guillaume Lemaitre.
sklearn.feature_selection
- Feature Added auto mode to feature_selection.SequentialFeatureSelector. If the argument n_features_to_select is 'auto', select features until the score improvement does not exceed the argument tol. The default value of n_features_to_select changed from None to 'warn' in 1.1 and will become 'auto' in 1.3. None and 'warn' will be removed in 1.3. #20145 by murata-yu.
- Feature Added the ability to pass callables to the max_features parameter of feature_selection.SelectFromModel. Also introduced new attribute max_features_ which is inferred from max_features and the data during fit. If max_features is an integer, then max_features_ = max_features. If max_features is a callable, then max_features_ = max_features(X). #22356 by Meekail Zain.
- Enhancement feature_selection.GenericUnivariateSelect preserves float32 dtype. #18482 by Thierry Gameiro and Daniel Kharsa and #22370 by Meekail Zain.
- Enhancement Add a parameter force_finite to feature_selection.f_regression and feature_selection.r_regression. This parameter forces the output to be finite when a feature or the target is constant, or when the feature and target are perfectly correlated (only for the F-statistic). #17819 by Juan Carlos Alfaro Jiménez.
- Efficiency Improve runtime performance of feature_selection.chi2 with boolean arrays. #22235 by Thomas Fan.
- Efficiency Reduced memory usage of feature_selection.chi2. #21837 by Louis Wagner.
sklearn.gaussian_process
- Fix predict and sample_y methods of gaussian_process.GaussianProcessRegressor now return arrays of the correct shape in single-target and multi-target cases, and for both normalize_y=False and normalize_y=True. #22199 by Guillaume Lemaitre, Aidar Shakerimoff and Tenavi Nakamura-Zimmerer.
- Fix gaussian_process.GaussianProcessClassifier raises a more informative error if CompoundKernel is passed via kernel. #22223 by MarcoM.
sklearn.impute
- Enhancement impute.SimpleImputer now warns with feature names when features are skipped due to the lack of any observed values in the training set. #21617 by Christian Ritter.
- Enhancement Added support for pd.NA in impute.SimpleImputer. #21114 by Ying Xiong.
- Enhancement Adds get_feature_names_out to impute.SimpleImputer, impute.KNNImputer, impute.IterativeImputer, and impute.MissingIndicator. #21078 by Thomas Fan.
- API Change The verbose parameter was deprecated for impute.SimpleImputer. A warning will always be raised upon the removal of empty columns. #21448 by Oleh Kozynets and Christian Ritter.
sklearn.inspection
- Feature Add a display to plot the boundary decision of a classifier by using the method inspection.DecisionBoundaryDisplay.from_estimator. #16061 by Thomas Fan.
- Enhancement In inspection.PartialDependenceDisplay.from_estimator, allow kind to accept a list of strings to specify which type of plot to draw for each feature interaction. #19438 by Guillaume Lemaitre.
- Enhancement inspection.PartialDependenceDisplay.from_estimator, inspection.PartialDependenceDisplay.plot, and inspection.plot_partial_dependence now support plotting centered Individual Conditional Expectation (cICE) and centered PDP curves controlled by setting the parameter centered. #18310 by Johannes Elfner and Guillaume Lemaitre.
sklearn.isotonic
- Enhancement Adds get_feature_names_out to isotonic.IsotonicRegression. #22249 by Thomas Fan.
sklearn.kernel_approximation
- Enhancement Adds get_feature_names_out to kernel_approximation.AdditiveChi2Sampler, kernel_approximation.Nystroem, kernel_approximation.PolynomialCountSketch, kernel_approximation.RBFSampler, and kernel_approximation.SkewedChi2Sampler. #22137 and #22694 by Thomas Fan.
sklearn.linear_model
- Feature linear_model.ElasticNet, linear_model.ElasticNetCV, linear_model.Lasso and linear_model.LassoCV support sample_weight for sparse input X. #22808 by Christian Lorentzen.
- Feature linear_model.Ridge with solver="lsqr" now supports fitting sparse input with fit_intercept=True. #22950 by Christian Lorentzen.
- Enhancement linear_model.QuantileRegressor supports sparse input for the highs based solvers. #21086 by Venkatachalam Natchiappan. In addition, those solvers now use the CSC matrix right from the beginning which speeds up fitting. #22206 by Christian Lorentzen.
- Enhancement linear_model.LogisticRegression is faster for solver="lbfgs" and solver="newton-cg", for binary and in particular for multiclass problems thanks to the new private loss function module. In the multiclass case, the memory consumption has also been reduced for these solvers as the target is now label encoded (mapped to integers) instead of label binarized (one-hot encoded). The more classes, the larger the benefit. #21808, #20567 and #21814 by Christian Lorentzen.
- Enhancement linear_model.GammaRegressor, linear_model.PoissonRegressor and linear_model.TweedieRegressor are faster for solver="lbfgs". #22548, #21808 and #20567 by Christian Lorentzen.
- Enhancement Rename parameter base_estimator to estimator in linear_model.RANSACRegressor to improve readability and consistency. base_estimator is deprecated and will be removed in 1.3. #22062 by Adrian Trujillo.
- Enhancement linear_model.ElasticNet and other linear model classes using coordinate descent show error messages when non-finite parameter weights are produced. #22148 by Christian Ritter and Norbert Preining.
- Enhancement linear_model.ElasticNet and linear_model.Lasso now raise consistent error messages when passed invalid values for l1_ratio, alpha, max_iter and tol. #22240 by Arturo Amor.
- Enhancement linear_model.BayesianRidge and linear_model.ARDRegression now preserve float32 dtype. #9087 by Arthur Imbert and #22525 by Meekail Zain.
- Enhancement linear_model.RidgeClassifier now supports multilabel classification. #19689 by Guillaume Lemaitre.
- Enhancement linear_model.RidgeCV and linear_model.RidgeClassifierCV now raise consistent error messages when passed invalid values for alphas. #21606 by Arturo Amor.
- Enhancement linear_model.Ridge and linear_model.RidgeClassifier now raise consistent error messages when passed invalid values for alpha, max_iter and tol. #21341 by Arturo Amor.
- Enhancement linear_model.orthogonal_mp_gram preserves dtype for numpy.float32. #22002 by Takeshi Oura.
- Fix linear_model.LassoLarsIC now correctly computes AIC and BIC. An error is now raised when n_features > n_samples and when the noise variance is not provided. #21481 by Guillaume Lemaitre and Andrés Babino.
- Fix linear_model.TheilSenRegressor now validates input parameter max_subpopulation in fit instead of __init__. #21767 by Maren Westermann.
- Fix linear_model.ElasticNetCV now produces correct warning when l1_ratio=0. #21724 by Yar Khine Phyo.
- Fix linear_model.LogisticRegression and linear_model.LogisticRegressionCV now set the n_iter_ attribute with a shape that respects the docstring and that is consistent with the shape obtained when using the other solvers in the one-vs-rest setting. Previously, it would record only the maximum of the number of iterations for each binary sub-problem while now all of them are recorded. #21998 by Olivier Grisel.
- Fix The property family of linear_model.TweedieRegressor is not validated in __init__ anymore. Instead, this (private) property is deprecated in linear_model.GammaRegressor, linear_model.PoissonRegressor and linear_model.TweedieRegressor, and will be removed in 1.3. #22548 by Christian Lorentzen.
- Fix The coef_ and intercept_ attributes of linear_model.LinearRegression are now correctly computed in the presence of sample weights when the input is sparse. #22891 by Jérémie du Boisberranger.
- Fix The coef_ and intercept_ attributes of linear_model.Ridge with solver="sparse_cg" and solver="lbfgs" are now correctly computed in the presence of sample weights when the input is sparse. #22899 by Jérémie du Boisberranger.
- Fix linear_model.SGDRegressor and linear_model.SGDClassifier now compute the validation error correctly when early stopping is enabled. #23256 by Zhehao Liu.
- API Change linear_model.LassoLarsIC now exposes noise_variance as a parameter in order to provide an estimate of the noise variance. This is particularly relevant when n_features > n_samples and the estimator of the noise variance cannot be computed. #21481 by Guillaume Lemaitre.
sklearn.manifold
- Feature manifold.Isomap now supports radius-based neighbors via the radius argument. #19794 by Zhehao Liu.
- Enhancement manifold.spectral_embedding and manifold.SpectralEmbedding support np.float32 dtype and will preserve this dtype. #21534 by Andrew Knyazev.
- Enhancement Adds get_feature_names_out to manifold.Isomap and manifold.LocallyLinearEmbedding. #22254 by Thomas Fan.
- Enhancement Added metric_params to manifold.TSNE constructor for additional parameters of distance metric to use in optimization. #21805 by Jeanne Dionisi and #22685 by Meekail Zain.
- Enhancement manifold.trustworthiness raises an error if n_neighbors >= n_samples / 2 to ensure a correct support for the function. #18832 by Hong Shao Yang and #23033 by Meekail Zain.
- Fix manifold.spectral_embedding now uses Gaussian instead of the previous uniform on [0, 1] random initial approximations to eigenvectors in eigen_solvers lobpcg and amg to improve their numerical stability. #21565 by Andrew Knyazev.
sklearn.metrics
- Feature metrics.r2_score and metrics.explained_variance_score have a new force_finite parameter. Setting this parameter to False will return the actual non-finite score in case of perfect predictions or constant y_true, instead of the finite approximation (1.0 and 0.0 respectively) currently returned by default. #17266 by Sylvain Marié.
- Feature metrics.d2_pinball_score and metrics.d2_absolute_error_score calculate the D2 regression score for the pinball loss and the absolute error respectively. metrics.d2_absolute_error_score is a special case of metrics.d2_pinball_score with a fixed quantile parameter alpha=0.5 for ease of use and discovery. The D2 scores are generalizations of the r2_score and can be interpreted as the fraction of deviance explained. #22118 by Ohad Michel.
- Enhancement metrics.top_k_accuracy_score raises an improved error message when y_true is binary and y_score is 2d. #22284 by Thomas Fan.
- Enhancement metrics.roc_auc_score now supports average=None in the multiclass case when multiclass='ovr' which will return the score per class. #19158 by Nicki Skafte.
- Enhancement Adds im_kw parameter to metrics.ConfusionMatrixDisplay.from_estimator, metrics.ConfusionMatrixDisplay.from_predictions, and metrics.ConfusionMatrixDisplay.plot. The im_kw parameter is passed to the matplotlib.pyplot.imshow call when plotting the confusion matrix. #20753 by Thomas Fan.
- Fix metrics.silhouette_score now supports integer input for precomputed distances. #22108 by Thomas Fan.
- Fix Fixed a bug in metrics.normalized_mutual_info_score which could return unbounded values. #22635 by Jérémie du Boisberranger.
- Fix Fixes metrics.precision_recall_curve and metrics.average_precision_score when true labels are all negative. #19085 by Varun Agrawal.
- API Change metrics.SCORERS is now deprecated and will be removed in 1.3. Please use metrics.get_scorer_names to retrieve the names of all available scorers. #22866 by Adrin Jalali.
- API Change Parameters sample_weight and multioutput of metrics.mean_absolute_percentage_error are now keyword-only, in accordance with SLEP009. A deprecation cycle was introduced. #21576 by Paul-Emile Dugnat.
- API Change The "wminkowski" metric of metrics.DistanceMetric is deprecated and will be removed in version 1.3. Instead the existing "minkowski" metric now takes in an optional w parameter for weights. This deprecation aims at remaining consistent with SciPy 1.8 convention. #21873 by Yar Khine Phyo.
- API Change metrics.DistanceMetric has been moved from sklearn.neighbors to sklearn.metrics. Using neighbors.DistanceMetric for imports is still valid for backward compatibility, but this alias will be removed in 1.3. #21177 by Julien Jerphanion.
sklearn.mixture
- Enhancement mixture.GaussianMixture and mixture.BayesianGaussianMixture can now be initialized using k-means++ and random data points. #20408 by Gordon Walsh, Alberto Ceballos and Andres Rios.
- Fix Fixes a bug so that precisions_cholesky_ is now correctly initialized in mixture.GaussianMixture when precisions_init is provided, by taking its square root. #22058 by Guillaume Lemaitre.
- Fix mixture.GaussianMixture now normalizes weights_ more safely, preventing rounding errors when calling mixture.GaussianMixture.sample with n_components=1. #23034 by Meekail Zain.
sklearn.model_selection
- Enhancement It is now possible to pass scoring="matthews_corrcoef" to all model selection tools with a scoring argument to use the Matthews correlation coefficient (MCC). #22203 by Olivier Grisel.
- Enhancement Raise an error during cross-validation when the fits for all the splits failed. Similarly raise an error during grid-search when the fits for all the models and all the splits failed. #21026 by Loïc Estève.
- Fix model_selection.GridSearchCV and model_selection.HalvingGridSearchCV now validate input parameters in fit instead of __init__. #21880 by Mrinal Tyagi.
- Fix model_selection.learning_curve now supports partial_fit with regressors. #22982 by Thomas Fan.
sklearn.multiclass
- Enhancement multiclass.OneVsRestClassifier now supports a verbose parameter so progress on fitting can be seen. #22508 by Chris Combs.
- Fix multiclass.OneVsOneClassifier.predict returns correct predictions when the inner classifier only has a predict_proba. #22604 by Thomas Fan.
sklearn.neighbors
- Enhancement Adds get_feature_names_out to neighbors.RadiusNeighborsTransformer, neighbors.KNeighborsTransformer and neighbors.NeighborhoodComponentsAnalysis. #22212 by Meekail Zain.
- Fix neighbors.KernelDensity now validates input parameters in fit instead of __init__. #21430 by Desislava Vasileva and Lucy Jimenez.
- Fix neighbors.KNeighborsRegressor.predict now works properly when given an array-like input if KNeighborsRegressor is first constructed with a callable passed to the weights parameter. #22687 by Meekail Zain.
sklearn.neural_network
- Enhancement neural_network.MLPClassifier and neural_network.MLPRegressor show error messages when optimizers produce non-finite parameter weights. #22150 by Christian Ritter and Norbert Preining.
- Enhancement Adds get_feature_names_out to neural_network.BernoulliRBM. #22248 by Thomas Fan.
sklearn.pipeline
- Enhancement Added support for “passthrough” in pipeline.FeatureUnion. Setting a transformer to “passthrough” will pass the features unchanged. #20860 by Shubhraneel Pal.
- Fix pipeline.Pipeline no longer validates hyper-parameters in __init__ but in .fit(). #21888 by iofall and Arisa Y.
- Fix pipeline.FeatureUnion no longer validates hyper-parameters in __init__. Validation is now handled in .fit() and .fit_transform(). #21954 by iofall and Arisa Y.
- Fix Defines __sklearn_is_fitted__ in pipeline.FeatureUnion to return the correct result with utils.validation.check_is_fitted. #22953 by randomgeek78.
sklearn.preprocessing
- Feature preprocessing.OneHotEncoder now supports grouping infrequent categories into a single feature. Grouping infrequent categories is enabled by specifying how to select infrequent categories with min_frequency or max_categories. #16018 by Thomas Fan.
- Enhancement Adds a subsample parameter to preprocessing.KBinsDiscretizer. This allows specifying a maximum number of samples to be used while fitting the model. The option is only available when strategy is set to quantile. #21445 by Felipe Bidu and Amanda Dsouza.
- Enhancement Adds encoded_missing_value to preprocessing.OrdinalEncoder to configure the encoded value for missing data. #21988 by Thomas Fan.
- Enhancement Added the get_feature_names_out method and a new parameter feature_names_out to preprocessing.FunctionTransformer. You can set feature_names_out to ‘one-to-one’ to use the input feature names as the output feature names, or you can set it to a callable that returns the output feature names. This is especially useful when the transformer changes the number of features. If feature_names_out is None (which is the default), then get_feature_names_out is not defined. #21569 by Aurélien Geron.
- Enhancement Adds get_feature_names_out to preprocessing.Normalizer, preprocessing.KernelCenterer, preprocessing.OrdinalEncoder, and preprocessing.Binarizer. #21079 by Thomas Fan.
- Fix preprocessing.PowerTransformer with method='yeo-johnson' better supports significantly non-Gaussian data when searching for an optimal lambda. #20653 by Thomas Fan.
- Fix preprocessing.LabelBinarizer now validates input parameters in fit instead of __init__. #21434 by Krum Arnaudov.
- Fix preprocessing.FunctionTransformer with check_inverse=True now provides an informative error message when the input has mixed dtypes. #19916 by Zhehao Liu.
- Fix preprocessing.KBinsDiscretizer handles bin edges more consistently now. #14975 by Andreas Müller and #22526 by Meekail Zain.
- Fix Adds preprocessing.KBinsDiscretizer.get_feature_names_out support when encode="ordinal". #22735 by Thomas Fan.
sklearn.random_projection
- Enhancement Adds an inverse_transform method and a compute_inverse_transform parameter to random_projection.GaussianRandomProjection and random_projection.SparseRandomProjection. When the parameter is set to True, the pseudo-inverse of the components is computed during fit and stored as inverse_components_. #21701 by Aurélien Geron.
- Enhancement random_projection.SparseRandomProjection and random_projection.GaussianRandomProjection preserve dtype for numpy.float32. #22114 by Takeshi Oura.
- Enhancement Adds get_feature_names_out to all transformers in the sklearn.random_projection module: random_projection.GaussianRandomProjection and random_projection.SparseRandomProjection. #21330 by Loïc Estève.
sklearn.svm
- Enhancement svm.OneClassSVM, svm.NuSVC, svm.NuSVR, svm.SVC and svm.SVR now expose n_iter_, the number of iterations of the libsvm optimization routine. #21408 by Juan Martín Loyola.
- Enhancement svm.SVR, svm.SVC, svm.NuSVR, svm.OneClassSVM and svm.NuSVC now raise an error when the dual-gap estimation produces non-finite parameter weights. #22149 by Christian Ritter and Norbert Preining.
- Fix svm.NuSVC, svm.NuSVR, svm.SVC, svm.SVR and svm.OneClassSVM now validate input parameters in fit instead of __init__. #21436 by Haidar Almubarak.
sklearn.tree
- Enhancement tree.DecisionTreeClassifier and tree.ExtraTreeClassifier have the new criterion="log_loss", which is equivalent to criterion="entropy". #23047 by Christian Lorentzen.
- Fix Fix a bug in the Poisson splitting criterion for tree.DecisionTreeRegressor. #22191 by Christian Lorentzen.
- API Change Changed the default value of max_features to 1.0 for tree.ExtraTreeRegressor and to "sqrt" for tree.ExtraTreeClassifier, which will not change the fit result. The original default value "auto" has been deprecated and will be removed in version 1.3. Setting max_features to "auto" is also deprecated for tree.DecisionTreeClassifier and tree.DecisionTreeRegressor. #22476 by Zhehao Liu.
sklearn.utils
- Enhancement utils.check_array and utils.multiclass.type_of_target now accept an input_name parameter to make the error message more informative when passed invalid input data (e.g. with NaN or infinite values). #21219 by Olivier Grisel.
- Enhancement utils.check_array returns a float ndarray with np.nan when passed a Float32 or Float64 pandas extension array with pd.NA. #21278 by Thomas Fan.
- Enhancement utils.estimator_html_repr shows a more helpful error message when running in a jupyter notebook that is not trusted. #21316 by Thomas Fan.
- Enhancement utils.estimator_html_repr displays an arrow on the top left corner of the HTML representation to show how the elements are clickable. #21298 by Thomas Fan.
- Enhancement utils.check_array with dtype=None returns numeric arrays when passed in a pandas DataFrame with mixed dtypes. dtype="numeric" will also better infer the dtype when the DataFrame has mixed dtypes. #22237 by Thomas Fan.
- Enhancement utils.check_scalar now has better messages when displaying the type. #22218 by Thomas Fan.
- Fix Changes the error message of the ValidationError raised by utils.check_X_y when y is None so that it is compatible with the check_requires_y_none estimator check. #22578 by Claudio Salvatore Arcidiacono.
- Fix utils.class_weight.compute_class_weight now only requires that all classes in y have a weight in class_weight. An error is still raised when a class is present in y but not in class_weight. #22595 by Thomas Fan.
- Fix utils.estimator_html_repr has an improved visualization for nested meta-estimators. #21310 by Thomas Fan.
- Fix utils.check_scalar raises an error when include_boundaries={"left", "right"} and the boundaries are not set. #22027 by Marie Lanternier.
- Fix utils.metaestimators.available_if correctly returns a bound method that can be pickled. #23077 by Thomas Fan.
- API Change utils.estimator_checks.check_estimator’s argument is now called estimator (previous name was Estimator). #22188 by Mathurin Massias.
- API Change utils.metaestimators.if_delegate_has_method is deprecated and will be removed in version 1.3. Use utils.metaestimators.available_if instead. #22830 by Jérémie du Boisberranger.
Changed models
The following estimators and functions, when fit with the same data and parameters, may produce different models from the previous version. This often occurs due to changes in the modelling logic (bug fixes or enhancements), or in random sampling procedures.
- Efficiency cluster.KMeans now defaults to algorithm="lloyd" instead of algorithm="auto", which was equivalent to algorithm="elkan". Lloyd’s algorithm and Elkan’s algorithm converge to the same solution, up to numerical rounding errors, but in general Lloyd’s algorithm uses much less memory, and it is often faster.
- Efficiency Fitting tree.DecisionTreeClassifier, tree.DecisionTreeRegressor, ensemble.RandomForestClassifier, ensemble.RandomForestRegressor, ensemble.GradientBoostingClassifier, and ensemble.GradientBoostingRegressor is on average 15% faster than in previous versions thanks to a new sort algorithm to find the best split. Models might be different because of a different handling of splits with tied criterion values: both the old and the new sorting algorithm are unstable sorting algorithms. #22868 by Thomas Fan.
- Fix The eigenvectors initialization for cluster.SpectralClustering and manifold.SpectralEmbedding now samples from a Gaussian when using the 'amg' or 'lobpcg' solver. This change improves numerical stability of the solver, but may result in a different model.
- Fix feature_selection.f_regression and feature_selection.r_regression now return finite scores by default instead of np.nan and np.inf in some corner cases. You can use force_finite=False if you really want to get non-finite values and keep the old behavior.
- Fix Pandas DataFrames with all non-string columns, such as a MultiIndex, no longer warn when passed into an Estimator. Estimators will continue to ignore the column names in DataFrames with non-string columns. For feature_names_in_ to be defined, columns must be all strings. #22410 by Thomas Fan.
- Fix preprocessing.KBinsDiscretizer changed handling of bin edges slightly, which might result in a different encoding with the same data.
- Fix calibration.calibration_curve changed handling of bin edges slightly, which might result in a different output curve given the same data.
- Fix discriminant_analysis.LinearDiscriminantAnalysis now uses the correct variance-scaling coefficient which may result in different model behavior.
- Fix feature_selection.SelectFromModel.fit and feature_selection.SelectFromModel.partial_fit can now be called with prefit=True. estimators_ will be a deep copy of estimator when prefit=True. #23271 by Guillaume Lemaitre.
Changelog
- Efficiency Low-level routines for reductions on pairwise distances for dense float64 datasets have been refactored. The following functions and estimators now benefit from improved performances in terms of hardware scalability and speed-ups:
sklearn.metrics.pairwise_distances_argmin
sklearn.metrics.pairwise_distances_argmin_min
sklearn.cluster.AffinityPropagation
sklearn.cluster.Birch
sklearn.cluster.MeanShift
sklearn.cluster.OPTICS
sklearn.cluster.SpectralClustering
sklearn.feature_selection.mutual_info_regression
sklearn.neighbors.KNeighborsClassifier
sklearn.neighbors.KNeighborsRegressor
sklearn.neighbors.RadiusNeighborsClassifier
sklearn.neighbors.RadiusNeighborsRegressor
sklearn.neighbors.LocalOutlierFactor
sklearn.neighbors.NearestNeighbors
sklearn.manifold.Isomap
sklearn.manifold.LocallyLinearEmbedding
sklearn.manifold.TSNE
sklearn.manifold.trustworthiness
sklearn.semi_supervised.LabelPropagation
sklearn.semi_supervised.LabelSpreading
For instance sklearn.neighbors.NearestNeighbors.kneighbors and sklearn.neighbors.NearestNeighbors.radius_neighbors can respectively be up to ×20 and ×5 faster than previously. #21987, #22064, #22065, #22288 and #22320 by Julien Jerphanion.
- Enhancement All scikit-learn models now generate a more informative error message when some input contains unexpected NaN or infinite values. In particular the message contains the input name (“X”, “y” or “sample_weight”) and if an unexpected NaN value is found in X, the error message suggests potential solutions. #21219 by Olivier Grisel.
- Enhancement All scikit-learn models now generate a more informative error message when setting invalid hyper-parameters with set_params. #21542 by Olivier Grisel.
- Enhancement Removes random unique identifiers in the HTML representation. With this change, jupyter notebooks are reproducible as long as the cells are run in the same order. #23098 by Thomas Fan.
- Fix Estimators with non_deterministic tag set to True will skip both check_methods_sample_order_invariance and check_methods_subset_invariance tests. #22318 by Zhehao Liu.
- API Change The option for using the log loss, aka binomial or multinomial deviance, via the loss parameters was made more consistent. The preferred way is by setting the value to "log_loss". Old option names are still valid and produce the same models, but are deprecated and will be removed in version 1.3.
  - For ensemble.GradientBoostingClassifier, the loss parameter name “deviance” is deprecated in favor of the new name “log_loss”, which is now the default. #23036 by Christian Lorentzen.
  - For ensemble.HistGradientBoostingClassifier, the loss parameter names “auto”, “binary_crossentropy” and “categorical_crossentropy” are deprecated in favor of the new name “log_loss”, which is now the default. #23040 by Christian Lorentzen.
  - For linear_model.SGDClassifier, the loss parameter name “log” is deprecated in favor of the new name “log_loss”. #23046 by Christian Lorentzen.
- API Change Rich html representation of estimators is now enabled by default in Jupyter notebooks. It can be deactivated by setting display='text' in sklearn.set_config. #22856 by Jérémie du Boisberranger.
- Enhancement The error message is improved when importing model_selection.HalvingGridSearchCV, model_selection.HalvingRandomSearchCV, or impute.IterativeImputer without importing the experimental flag. #23194 by Thomas Fan.
- Enhancement Added an extension in doc/conf.py to automatically generate the list of estimators that handle NaN values. #23198 by Lise Kleiber, Zhehao Liu and Chiara Marmo.
sklearn.calibration
- Enhancement calibration.calibration_curve accepts a parameter pos_label to specify the positive class label. #21032 by Guillaume Lemaitre.
- Enhancement calibration.CalibratedClassifierCV.fit now supports passing fit_params, which are routed to the base_estimator. #18170 by Benjamin Bossan.
- Enhancement calibration.CalibrationDisplay accepts a parameter pos_label to add this information to the plot. #21038 by Guillaume Lemaitre.
- Fix calibration.calibration_curve handles bin edges more consistently now. #14975 by Andreas Müller and #22526 by Meekail Zain.
- API Change calibration.calibration_curve’s normalize parameter is now deprecated and will be removed in version 1.3. It is recommended that a proper probability (i.e. a classifier’s predict_proba positive class) is used for y_prob. #23095 by Jordan Silke.
sklearn.cluster
- Major Feature BisectingKMeans introduces the Bisecting K-Means algorithm. #20031 by Michal Krawczyk, Tom Dupre la Tour and Jérémie du Boisberranger.
- Enhancement cluster.SpectralClustering and cluster.spectral_clustering now include the new 'cluster_qr' method that clusters samples in the embedding space as an alternative to the existing 'kmeans' and 'discrete' methods. See cluster.spectral_clustering for more details. #21148 by Andrew Knyazev.
- Enhancement Adds get_feature_names_out to cluster.Birch, cluster.FeatureAgglomeration, cluster.KMeans and cluster.MiniBatchKMeans. #22255 by Thomas Fan.
- Enhancement cluster.SpectralClustering now raises consistent error messages when passed invalid values for n_clusters, n_init, gamma, n_neighbors, eigen_tol or degree. #21881 by Hugo Vassard.
- Enhancement cluster.AffinityPropagation now returns cluster centers and labels if they exist, even if the model has not fully converged. When returning these potentially-degenerate cluster centers and labels, a new warning message is shown. If no cluster centers were constructed, then the cluster centers remain an empty list with labels set to -1 and the original warning message is shown. #22217 by Meekail Zain.
- Efficiency In cluster.KMeans, the default algorithm is now "lloyd" which is the full classical EM-style algorithm. Both "auto" and "full" are deprecated and will be removed in version 1.3. They are now aliases for "lloyd". The previous default was "auto", which relied on Elkan’s algorithm. Lloyd’s algorithm uses less memory than Elkan’s, it is faster on many datasets, and its results are identical, hence the change. #21735 by Aurélien Geron.
- Fix cluster.KMeans’s init parameter now properly supports array-like input and NumPy string scalars. #22154 by Thomas Fan.
sklearn.compose
- Fix compose.ColumnTransformer now removes validation errors from __init__ and set_params methods. #22537 by iofall and Arisa Y.
- Fix get_feature_names_out functionality in compose.ColumnTransformer was broken when columns were specified using slice. This is fixed in #22775 and #22913 by randomgeek78.
sklearn.covariance
- Fix covariance.GraphicalLassoCV now accepts NumPy array for the parameter alphas. #22493 by Guillaume Lemaitre.
sklearn.cross_decomposition
- Enhancement The inverse_transform method of cross_decomposition.PLSRegression, cross_decomposition.PLSCanonical and cross_decomposition.CCA now allows reconstruction of an X target when a Y parameter is given. #19680 by Robin Thibaut.
- Enhancement Adds get_feature_names_out to all transformers in the cross_decomposition module: cross_decomposition.CCA, cross_decomposition.PLSSVD, cross_decomposition.PLSRegression, and cross_decomposition.PLSCanonical. #22119 by Thomas Fan.
- Fix The shape of the coef_ attribute of cross_decomposition.CCA, cross_decomposition.PLSCanonical and cross_decomposition.PLSRegression will change in version 1.3, from (n_features, n_targets) to (n_targets, n_features), to be consistent with other linear models and to make it work with interfaces expecting a specific shape for coef_ (e.g. feature_selection.RFE). #22016 by Guillaume Lemaitre.
- API Change Add the fitted attribute intercept_ to cross_decomposition.PLSCanonical, cross_decomposition.PLSRegression, and cross_decomposition.CCA. The method predict is indeed equivalent to Y = X @ coef_ + intercept_. #22015 by Guillaume Lemaitre.
sklearn.datasets
- Feature datasets.load_files now accepts an ignore list and an allow list based on file extensions. #19747 by Tony Attalla and #22498 by Meekail Zain.
- Enhancement datasets.make_swiss_roll now supports the optional argument hole; when set to True, it returns the swiss-hole dataset. #21482 by Sebastian Pujalte.
- Enhancement datasets.make_blobs no longer copies data during the generation process, therefore uses less memory. #22412 by Zhehao Liu.
- Enhancement datasets.load_diabetes now accepts the parameter scaled, to allow loading unscaled data. The scaled version of this dataset is now computed from the unscaled data, and can produce slightly different results than in previous versions (within a 1e-4 absolute tolerance). #16605 by Mandy Gu.
- Enhancement datasets.fetch_openml now has two optional arguments n_retries and delay. By default, datasets.fetch_openml will retry 3 times in case of a network failure with a delay between each try. #21901 by Rileran.
- Fix datasets.fetch_covtype is now concurrent-safe: data is downloaded to a temporary directory before being moved to the data directory. #23113 by Ilion Beyst.
- API Change datasets.make_sparse_coded_signal now accepts a parameter data_transposed to explicitly specify the shape of matrix X. The default behavior True is to return a transposed matrix X corresponding to a (n_features, n_samples) shape. The default value will change to False in version 1.3. #21425 by Gabriel Stefanini Vicente.
sklearn.decomposition
- Major Feature Added a new estimator decomposition.MiniBatchNMF. It is a faster but less accurate version of non-negative matrix factorization, better suited for large datasets. #16948 by Chiara Marmo, Patricio Cerda and Jérémie du Boisberranger.
- Enhancement decomposition.dict_learning, decomposition.dict_learning_online and decomposition.sparse_encode preserve dtype for numpy.float32. decomposition.DictionaryLearning, decomposition.MiniBatchDictionaryLearning and decomposition.SparseCoder preserve dtype for numpy.float32. #22002 by Takeshi Oura.
- Enhancement decomposition.PCA exposes a parameter n_oversamples to tune utils.randomized_svd and get accurate results when the number of features is large. #21109 by Smile.
- Enhancement decomposition.MiniBatchDictionaryLearning and decomposition.dict_learning_online have been refactored and now have a stopping criterion based on a small change of the dictionary or objective function, controlled by the new max_iter, tol and max_no_improvement parameters. In addition, some of their parameters and attributes are deprecated.
  - the n_iter parameter of both is deprecated. Use max_iter instead.
  - the iter_offset, return_inner_stats, inner_stats and return_n_iter parameters of decomposition.dict_learning_online serve internal purposes and are deprecated.
  - the inner_stats_, iter_offset_ and random_state_ attributes of decomposition.MiniBatchDictionaryLearning serve internal purposes and are deprecated.
  - the default value of the batch_size parameter of both will change from 3 to 256 in version 1.3.
- Enhancement decomposition.SparsePCA and decomposition.MiniBatchSparsePCA preserve dtype for numpy.float32. #22111 by Takeshi Oura.
- Enhancement decomposition.TruncatedSVD now allows n_components == n_features, if algorithm='randomized'. #22181 by Zach Deane-Mayer.
- Enhancement Adds get_feature_names_out to all transformers in the decomposition module: decomposition.DictionaryLearning, decomposition.FactorAnalysis, decomposition.FastICA, decomposition.IncrementalPCA, decomposition.KernelPCA, decomposition.LatentDirichletAllocation, decomposition.MiniBatchDictionaryLearning, decomposition.MiniBatchSparsePCA, decomposition.NMF, decomposition.PCA, decomposition.SparsePCA, and decomposition.TruncatedSVD. #21334 by Thomas Fan.
- Enhancement decomposition.TruncatedSVD exposes the parameters n_oversamples and power_iteration_normalizer to tune utils.randomized_svd and get accurate results when the number of features is large, the rank of the matrix is high, or other features of the matrix make low rank approximation difficult. #21705 by Jay S. Stanley III.
- Enhancement decomposition.PCA exposes the parameter power_iteration_normalizer to tune utils.randomized_svd and get more accurate results when low rank approximation is difficult. #21705 by Jay S. Stanley III.
- Fix decomposition.FastICA now validates input parameters in fit instead of __init__. #21432 by Hannah Bohle and Maren Westermann.
- Fix decomposition.FastICA now accepts np.float32 data without silent upcasting. The dtype is preserved by fit and fit_transform and the main fitted attributes use a dtype of the same precision as the training data. #22806 by Jihane Bennis and Olivier Grisel.
- Fix decomposition.FactorAnalysis now validates input parameters in fit instead of __init__. #21713 by Haya and Krum Arnaudov.
- Fix decomposition.KernelPCA now validates input parameters in fit instead of __init__. #21567 by Maggie Chege.
- Fix decomposition.PCA and decomposition.IncrementalPCA more safely calculate precision using the inverse of the covariance matrix if self.noise_variance_ is zero. #22300 by Meekail Zain and #15948 by @sysuresh.
- Fix Greatly reduced peak memory usage in decomposition.PCA when calling fit or fit_transform. #22553 by Meekail Zain.
- API Change decomposition.FastICA now supports unit variance for whitening. The default value of its whiten argument will change from True (which behaves like 'arbitrary-variance') to 'unit-variance' in version 1.3. #19490 by Facundo Ferrin and Julien Jerphanion.
sklearn.discriminant_analysis
- Enhancement Adds get_feature_names_out to discriminant_analysis.LinearDiscriminantAnalysis. #22120 by Thomas Fan.
- Fix discriminant_analysis.LinearDiscriminantAnalysis now uses the correct variance-scaling coefficient which may result in different model behavior. #15984 by Okon Samuel and #22696 by Meekail Zain.
sklearn.dummy
- Fix dummy.DummyRegressor no longer overrides the constant parameter during fit. #22486 by Thomas Fan.
sklearn.ensemble
- Major Feature Added additional option loss="quantile" to ensemble.HistGradientBoostingRegressor for modelling quantiles. The quantile level can be specified with the new parameter quantile. #21800 and #20567 by Christian Lorentzen.
- Efficiency fit of ensemble.GradientBoostingClassifier and ensemble.GradientBoostingRegressor now calls utils.check_array with parameter force_all_finite=False for non initial warm-start runs as it has already been checked before. #22159 by Geoffrey Paris.
- Enhancement ensemble.HistGradientBoostingClassifier is faster, for binary and in particular for multiclass problems thanks to the new private loss function module. #20811, #20567 and #21814 by Christian Lorentzen.
- Enhancement Adds support to use pre-fit models with cv="prefit" in ensemble.StackingClassifier and ensemble.StackingRegressor. #16748 by Siqi He and #22215 by Meekail Zain.
- Enhancement ensemble.RandomForestClassifier and ensemble.ExtraTreesClassifier have the new criterion="log_loss", which is equivalent to criterion="entropy". #23047 by Christian Lorentzen.
- Enhancement Adds get_feature_names_out to ensemble.VotingClassifier, ensemble.VotingRegressor, ensemble.StackingClassifier, and ensemble.StackingRegressor. #22695 and #22697 by Thomas Fan.
- Enhancement ensemble.RandomTreesEmbedding now has an informative get_feature_names_out function that includes both tree index and leaf index in the output feature names. #21762 by Zhehao Liu and Thomas Fan.
- Efficiency Fitting ensemble.RandomForestClassifier, ensemble.RandomForestRegressor, ensemble.ExtraTreesClassifier, ensemble.ExtraTreesRegressor, and ensemble.RandomTreesEmbedding is now faster in a multiprocessing setting, especially for subsequent fits with warm_start enabled. #22106 by Pieter Gijsbers.
- Fix Change the parameter validation_fraction in ensemble.GradientBoostingClassifier and ensemble.GradientBoostingRegressor so that an error is raised if anything other than a float is passed in as an argument. #21632 by Genesis Valencia.
- Fix Removed a potential source of CPU oversubscription in ensemble.HistGradientBoostingClassifier and ensemble.HistGradientBoostingRegressor when CPU resource usage is limited, for instance using cgroups quota in a docker container. #22566 by Jérémie du Boisberranger.
- Fix ensemble.HistGradientBoostingClassifier and ensemble.HistGradientBoostingRegressor no longer warn when fitting on a pandas DataFrame with a non-default scoring parameter and early_stopping enabled. #22908 by Thomas Fan.
- Fix Fixes HTML repr for ensemble.StackingClassifier and ensemble.StackingRegressor. #23097 by Thomas Fan.
- API Change The attribute loss_ of ensemble.GradientBoostingClassifier and ensemble.GradientBoostingRegressor has been deprecated and will be removed in version 1.3. #23079 by Christian Lorentzen.
- API Change Changed the default of max_features to 1.0 for ensemble.RandomForestRegressor and to "sqrt" for ensemble.RandomForestClassifier. Note that these give the same fit results as before, but are much easier to understand. The old default value "auto" has been deprecated and will be removed in version 1.3. The same changes are also applied for ensemble.ExtraTreesRegressor and ensemble.ExtraTreesClassifier. #20803 by Brian Sun.
- Efficiency Improve runtime performance of ensemble.IsolationForest by skipping repetitive input checks. #23149 by Zhehao Liu.
sklearn.feature_extraction
- Feature feature_extraction.FeatureHasher now supports PyPy. #23023 by Thomas Fan.
- Fix feature_extraction.FeatureHasher now validates input parameters in transform instead of __init__. #21573 by Hannah Bohle and Maren Westermann.
- Fix feature_extraction.text.TfidfVectorizer now does not create a feature_extraction.text.TfidfTransformer at __init__ as required by our API. #21832 by Guillaume Lemaitre.
sklearn.feature_selection
- Feature Added auto mode to feature_selection.SequentialFeatureSelector. If the argument n_features_to_select is 'auto', select features until the score improvement does not exceed the argument tol. The default value of n_features_to_select changed from None to 'warn' in 1.1 and will become 'auto' in 1.3. None and 'warn' will be removed in 1.3. #20145 by murata-yu.
- Feature Added the ability to pass callables to the max_features parameter of feature_selection.SelectFromModel. Also introduced new attribute max_features_ which is inferred from max_features and the data during fit. If max_features is an integer, then max_features_ = max_features. If max_features is a callable, then max_features_ = max_features(X). #22356 by Meekail Zain.
- Enhancement feature_selection.GenericUnivariateSelect preserves float32 dtype. #18482 by Thierry Gameiro and Daniel Kharsa and #22370 by Meekail Zain.
- Enhancement Add a parameter force_finite to feature_selection.f_regression and feature_selection.r_regression. This parameter forces the output to be finite in the case where a feature or the target is constant, or where the feature and target are perfectly correlated (only for the F-statistic). #17819 by Juan Carlos Alfaro Jiménez.
- Efficiency Improve runtime performance of feature_selection.chi2 with boolean arrays. #22235 by Thomas Fan.
- Efficiency Reduced memory usage of feature_selection.chi2. #21837 by Louis Wagner.
sklearn.gaussian_process
- Fix The predict and sample_y methods of gaussian_process.GaussianProcessRegressor now return arrays of the correct shape in single-target and multi-target cases, and for both normalize_y=False and normalize_y=True. #22199 by Guillaume Lemaitre, Aidar Shakerimoff and Tenavi Nakamura-Zimmerer.
- Fix gaussian_process.GaussianProcessClassifier raises a more informative error if CompoundKernel is passed via kernel. #22223 by MarcoM.
sklearn.impute
- Enhancement impute.SimpleImputer now includes feature names in the warning raised when features are skipped because they have no observed values in the training set. #21617 by Christian Ritter.
- Enhancement Added support for pd.NA in impute.SimpleImputer. #21114 by Ying Xiong.
- Enhancement Adds get_feature_names_out to impute.SimpleImputer, impute.KNNImputer, impute.IterativeImputer, and impute.MissingIndicator. #21078 by Thomas Fan.
- API Change The verbose parameter was deprecated for impute.SimpleImputer. A warning will always be raised upon the removal of empty columns. #21448 by Oleh Kozynets and Christian Ritter.
sklearn.inspection
- Feature Add a display to plot the decision boundary of a classifier by using the method inspection.DecisionBoundaryDisplay.from_estimator. #16061 by Thomas Fan.
- Enhancement In inspection.PartialDependenceDisplay.from_estimator, allow kind to accept a list of strings to specify which type of plot to draw for each feature interaction. #19438 by Guillaume Lemaitre.
- Enhancement inspection.PartialDependenceDisplay.from_estimator, inspection.PartialDependenceDisplay.plot, and inspection.plot_partial_dependence now support plotting centered Individual Conditional Expectation (cICE) and centered PDP curves controlled by setting the parameter centered. #18310 by Johannes Elfner and Guillaume Lemaitre.
sklearn.isotonic
- Enhancement Adds get_feature_names_out to isotonic.IsotonicRegression. #22249 by Thomas Fan.
sklearn.kernel_approximation
- Enhancement Adds get_feature_names_out to kernel_approximation.AdditiveChi2Sampler, kernel_approximation.Nystroem, kernel_approximation.PolynomialCountSketch, kernel_approximation.RBFSampler, and kernel_approximation.SkewedChi2Sampler. #22137 and #22694 by Thomas Fan.
sklearn.linear_model
- Feature linear_model.ElasticNet, linear_model.ElasticNetCV, linear_model.Lasso and linear_model.LassoCV support sample_weight for sparse input X. #22808 by Christian Lorentzen.
- Feature linear_model.Ridge with solver="lsqr" now supports fitting sparse input with fit_intercept=True. #22950 by Christian Lorentzen.
- Enhancement linear_model.QuantileRegressor supports sparse input for the highs based solvers. #21086 by Venkatachalam Natchiappan. In addition, those solvers now use the CSC matrix right from the beginning which speeds up fitting. #22206 by Christian Lorentzen.
- Enhancement linear_model.LogisticRegression is faster for solver="lbfgs" and solver="newton-cg", for binary and in particular for multiclass problems thanks to the new private loss function module. In the multiclass case, the memory consumption has also been reduced for these solvers as the target is now label encoded (mapped to integers) instead of label binarized (one-hot encoded). The more classes, the larger the benefit. #21808, #20567 and #21814 by Christian Lorentzen.
- Enhancement linear_model.GammaRegressor, linear_model.PoissonRegressor and linear_model.TweedieRegressor are faster for solver="lbfgs". #22548, #21808 and #20567 by Christian Lorentzen.
- Enhancement Rename parameter base_estimator to estimator in linear_model.RANSACRegressor to improve readability and consistency. base_estimator is deprecated and will be removed in 1.3. #22062 by Adrian Trujillo.
- Enhancement linear_model.ElasticNet and other linear model classes using coordinate descent show error messages when non-finite parameter weights are produced. #22148 by Christian Ritter and Norbert Preining.
- Enhancement linear_model.ElasticNet and linear_model.Lasso now raise consistent error messages when passed invalid values for l1_ratio, alpha, max_iter and tol. #22240 by Arturo Amor.
- Enhancement linear_model.BayesianRidge and linear_model.ARDRegression now preserve float32 dtype. #9087 by Arthur Imbert and #22525 by Meekail Zain.
- Enhancement linear_model.RidgeClassifier now supports multilabel classification. #19689 by Guillaume Lemaitre.
- Enhancement linear_model.RidgeCV and linear_model.RidgeClassifierCV now raise consistent error messages when passed invalid values for alphas. #21606 by Arturo Amor.
- Enhancement linear_model.Ridge and linear_model.RidgeClassifier now raise consistent error messages when passed invalid values for alpha, max_iter and tol. #21341 by Arturo Amor.
- Enhancement linear_model.orthogonal_mp_gram preserves dtype for numpy.float32. #22002 by Takeshi Oura.
- Fix linear_model.LassoLarsIC now correctly computes AIC and BIC. An error is now raised when n_features > n_samples and when the noise variance is not provided. #21481 by Guillaume Lemaitre and Andrés Babino.
- Fix linear_model.TheilSenRegressor now validates input parameter max_subpopulation in fit instead of __init__. #21767 by Maren Westermann.
- Fix linear_model.ElasticNetCV now produces correct warning when l1_ratio=0. #21724 by Yar Khine Phyo.
- Fix linear_model.LogisticRegression and linear_model.LogisticRegressionCV now set the n_iter_ attribute with a shape that respects the docstring and that is consistent with the shape obtained when using the other solvers in the one-vs-rest setting. Previously, it would record only the maximum of the number of iterations for each binary sub-problem while now all of them are recorded. #21998 by Olivier Grisel.
- Fix The property family of linear_model.TweedieRegressor is not validated in __init__ anymore. Instead, this (private) property is deprecated in linear_model.GammaRegressor, linear_model.PoissonRegressor and linear_model.TweedieRegressor, and will be removed in 1.3. #22548 by Christian Lorentzen.
- Fix The coef_ and intercept_ attributes of linear_model.LinearRegression are now correctly computed in the presence of sample weights when the input is sparse. #22891 by Jérémie du Boisberranger.
- Fix The coef_ and intercept_ attributes of linear_model.Ridge with solver="sparse_cg" and solver="lbfgs" are now correctly computed in the presence of sample weights when the input is sparse. #22899 by Jérémie du Boisberranger.
- Fix linear_model.SGDRegressor and linear_model.SGDClassifier now compute the validation error correctly when early stopping is enabled. #23256 by Zhehao Liu.
- API Change linear_model.LassoLarsIC now exposes noise_variance as a parameter in order to provide an estimate of the noise variance. This is particularly relevant when n_features > n_samples and the estimator of the noise variance cannot be computed. #21481 by Guillaume Lemaitre.
sklearn.manifold
- Feature manifold.Isomap now supports radius-based neighbors via the radius argument. #19794 by Zhehao Liu.
- Enhancement manifold.spectral_embedding and manifold.SpectralEmbedding support np.float32 dtype and will preserve this dtype. #21534 by Andrew Knyazev.
- Enhancement Adds get_feature_names_out to manifold.Isomap and manifold.LocallyLinearEmbedding. #22254 by Thomas Fan.
- Enhancement Added metric_params to the manifold.TSNE constructor for additional parameters of the distance metric to use in optimization. #21805 by Jeanne Dionisi and #22685 by Meekail Zain.
- Enhancement manifold.trustworthiness raises an error if n_neighbors >= n_samples / 2 to ensure a correct support for the function. #18832 by Hong Shao Yang and #23033 by Meekail Zain.
- Fix manifold.spectral_embedding now uses Gaussian instead of the previous uniform on [0, 1] random initial approximations to eigenvectors in eigen_solvers lobpcg and amg to improve their numerical stability. #21565 by Andrew Knyazev.
sklearn.metrics
- Feature metrics.r2_score and metrics.explained_variance_score have a new force_finite parameter. Setting this parameter to False will return the actual non-finite score in case of perfect predictions or constant y_true, instead of the finite approximation (1.0 and 0.0 respectively) currently returned by default. #17266 by Sylvain Marié.
- Feature metrics.d2_pinball_score and metrics.d2_absolute_error_score calculate the D2 regression score for the pinball loss and the absolute error respectively. metrics.d2_absolute_error_score is a special case of metrics.d2_pinball_score with a fixed quantile parameter alpha=0.5 for ease of use and discovery. The D2 scores are generalizations of the r2_score and can be interpreted as the fraction of deviance explained. #22118 by Ohad Michel.
- Enhancement metrics.top_k_accuracy_score raises an improved error message when y_true is binary and y_score is 2d. #22284 by Thomas Fan.
- Enhancement metrics.roc_auc_score now supports average=None in the multiclass case when multiclass='ovr' which will return the score per class. #19158 by Nicki Skafte.
- Enhancement Adds im_kw parameter to metrics.ConfusionMatrixDisplay.from_estimator, metrics.ConfusionMatrixDisplay.from_predictions, and metrics.ConfusionMatrixDisplay.plot. The im_kw parameter is passed to the matplotlib.pyplot.imshow call when plotting the confusion matrix. #20753 by Thomas Fan.
- Fix metrics.silhouette_score now supports integer input for precomputed distances. #22108 by Thomas Fan.
- Fix Fixed a bug in metrics.normalized_mutual_info_score which could return unbounded values. #22635 by Jérémie du Boisberranger.
- Fix Fixes metrics.precision_recall_curve and metrics.average_precision_score when true labels are all negative. #19085 by Varun Agrawal.
- API Change metrics.SCORERS is now deprecated and will be removed in 1.3. Please use metrics.get_scorer_names to retrieve the names of all available scorers. #22866 by Adrin Jalali.
- API Change Parameters sample_weight and multioutput of metrics.mean_absolute_percentage_error are now keyword-only, in accordance with SLEP009. A deprecation cycle was introduced. #21576 by Paul-Emile Dugnat.
- API Change The "wminkowski" metric of metrics.DistanceMetric is deprecated and will be removed in version 1.3. Instead the existing "minkowski" metric now takes in an optional w parameter for weights. This deprecation aims at remaining consistent with SciPy 1.8 convention. #21873 by Yar Khine Phyo.
- API Change metrics.DistanceMetric has been moved from sklearn.neighbors to sklearn.metrics. Using neighbors.DistanceMetric for imports is still valid for backward compatibility, but this alias will be removed in 1.3. #21177 by Julien Jerphanion.
sklearn.mixture
- Enhancement mixture.GaussianMixture and mixture.BayesianGaussianMixture can now be initialized using k-means++ and random data points. #20408 by Gordon Walsh, Alberto Ceballos and Andres Rios.
- Fix Fixed a bug so that precisions_cholesky_ in mixture.GaussianMixture is correctly initialized when precisions_init is provided, by taking its square root. #22058 by Guillaume Lemaitre.
- Fix mixture.GaussianMixture now normalizes weights_ more safely, preventing rounding errors when calling mixture.GaussianMixture.sample with n_components=1. #23034 by Meekail Zain.
sklearn.model_selection
- Enhancement It is now possible to pass scoring="matthews_corrcoef" to all model selection tools with a scoring argument to use the Matthews correlation coefficient (MCC). #22203 by Olivier Grisel.
- Enhancement Raise an error during cross-validation when the fits for all the splits failed. Similarly raise an error during grid-search when the fits for all the models and all the splits failed. #21026 by Loïc Estève.
- Fix model_selection.GridSearchCV and model_selection.HalvingGridSearchCV now validate input parameters in fit instead of __init__. #21880 by Mrinal Tyagi.
- Fix model_selection.learning_curve now supports partial_fit with regressors. #22982 by Thomas Fan.
sklearn.multiclass
- Enhancement multiclass.OneVsRestClassifier now supports a verbose parameter so progress on fitting can be seen. #22508 by Chris Combs.
- Fix multiclass.OneVsOneClassifier.predict returns correct predictions when the inner classifier only has a predict_proba. #22604 by Thomas Fan.
sklearn.neighbors
- Enhancement Adds get_feature_names_out to neighbors.RadiusNeighborsTransformer, neighbors.KNeighborsTransformer and neighbors.NeighborhoodComponentsAnalysis. #22212 by Meekail Zain.
- Fix neighbors.KernelDensity now validates input parameters in fit instead of __init__. #21430 by Desislava Vasileva and Lucy Jimenez.
- Fix neighbors.KNeighborsRegressor.predict now works properly when given an array-like input if KNeighborsRegressor is first constructed with a callable passed to the weights parameter. #22687 by Meekail Zain.
sklearn.neural_network
- Enhancement neural_network.MLPClassifier and neural_network.MLPRegressor show error messages when optimizers produce non-finite parameter weights. #22150 by Christian Ritter and Norbert Preining.
- Enhancement Adds get_feature_names_out to neural_network.BernoulliRBM. #22248 by Thomas Fan.
sklearn.pipeline
- Enhancement Added support for "passthrough" in pipeline.FeatureUnion. Setting a transformer to "passthrough" will pass the features unchanged. #20860 by Shubhraneel Pal.
- Fix pipeline.Pipeline no longer validates hyper-parameters in __init__ but in .fit(). #21888 by iofall and Arisa Y.
- Fix pipeline.FeatureUnion no longer validates hyper-parameters in __init__. Validation is now handled in .fit() and .fit_transform(). #21954 by iofall and Arisa Y.
- Fix Defines __sklearn_is_fitted__ in pipeline.FeatureUnion to return correct result with utils.validation.check_is_fitted. #22953 by randomgeek78.
sklearn.preprocessing
- Feature preprocessing.OneHotEncoder now supports grouping infrequent categories into a single feature. Grouping infrequent categories is enabled by specifying how to select infrequent categories with min_frequency or max_categories. #16018 by Thomas Fan.
- Enhancement Adds a subsample parameter to preprocessing.KBinsDiscretizer. This allows specifying a maximum number of samples to be used while fitting the model. The option is only available when strategy is set to quantile. #21445 by Felipe Bidu and Amanda Dsouza.
- Enhancement Adds encoded_missing_value to preprocessing.OrdinalEncoder to configure the encoded value for missing data. #21988 by Thomas Fan.
- Enhancement Added the get_feature_names_out method and a new parameter feature_names_out to preprocessing.FunctionTransformer. You can set feature_names_out to 'one-to-one' to use the input feature names as the output feature names, or you can set it to a callable that returns the output feature names. This is especially useful when the transformer changes the number of features. If feature_names_out is None (which is the default), then get_feature_names_out is not defined. #21569 by Aurélien Geron.
- Enhancement Adds get_feature_names_out to preprocessing.Normalizer, preprocessing.KernelCenterer, preprocessing.OrdinalEncoder, and preprocessing.Binarizer. #21079 by Thomas Fan.
- Fix preprocessing.PowerTransformer with method='yeo-johnson' better supports significantly non-Gaussian data when searching for an optimal lambda. #20653 by Thomas Fan.
- Fix preprocessing.LabelBinarizer now validates input parameters in fit instead of __init__. #21434 by Krum Arnaudov.
- Fix preprocessing.FunctionTransformer with check_inverse=True now provides informative error message when input has mixed dtypes. #19916 by Zhehao Liu.
- Fix preprocessing.KBinsDiscretizer handles bin edges more consistently now. #14975 by Andreas Müller and #22526 by Meekail Zain.
- Fix Adds preprocessing.KBinsDiscretizer.get_feature_names_out support when encode="ordinal". #22735 by Thomas Fan.
sklearn.random_projection
- Enhancement Adds an inverse_transform method and a compute_inverse_transform parameter to random_projection.GaussianRandomProjection and random_projection.SparseRandomProjection. When the parameter is set to True, the pseudo-inverse of the components is computed during fit and stored as inverse_components_. #21701 by Aurélien Geron.
- Enhancement random_projection.SparseRandomProjection and random_projection.GaussianRandomProjection preserve dtype for numpy.float32. #22114 by Takeshi Oura.
- Enhancement Adds get_feature_names_out to all transformers in the sklearn.random_projection module: random_projection.GaussianRandomProjection and random_projection.SparseRandomProjection. #21330 by Loïc Estève.
sklearn.svm
- Enhancement svm.OneClassSVM, svm.NuSVC, svm.NuSVR, svm.SVC and svm.SVR now expose n_iter_, the number of iterations of the libsvm optimization routine. #21408 by Juan Martín Loyola.
- Enhancement svm.SVR, svm.SVC, svm.NuSVR, svm.OneClassSVM and svm.NuSVC now raise an error when the dual-gap estimation produces non-finite parameter weights. #22149 by Christian Ritter and Norbert Preining.
- Fix svm.NuSVC, svm.NuSVR, svm.SVC, svm.SVR and svm.OneClassSVM now validate input parameters in fit instead of __init__. #21436 by Haidar Almubarak.
sklearn.tree
- Enhancement tree.DecisionTreeClassifier and tree.ExtraTreeClassifier have the new criterion="log_loss", which is equivalent to criterion="entropy". #23047 by Christian Lorentzen.
- Fix Fix a bug in the Poisson splitting criterion for tree.DecisionTreeRegressor. #22191 by Christian Lorentzen.
- API Change Changed the default value of max_features to 1.0 for tree.ExtraTreeRegressor and to "sqrt" for tree.ExtraTreeClassifier, which will not change the fit result. The original default value "auto" has been deprecated and will be removed in version 1.3. Setting max_features to "auto" is also deprecated for tree.DecisionTreeClassifier and tree.DecisionTreeRegressor. #22476 by Zhehao Liu.
sklearn.utils
- Enhancement utils.check_array and utils.multiclass.type_of_target now accept an input_name parameter to make the error message more informative when passed invalid input data (e.g. with NaN or infinite values). #21219 by Olivier Grisel.
- Enhancement utils.check_array returns a float ndarray with np.nan when passed a Float32 or Float64 pandas extension array with pd.NA. #21278 by Thomas Fan.
- Enhancement utils.estimator_html_repr shows a more helpful error message when running in a jupyter notebook that is not trusted. #21316 by Thomas Fan.
- Enhancement utils.estimator_html_repr displays an arrow on the top left corner of the HTML representation to show how the elements are clickable. #21298 by Thomas Fan.
- Enhancement utils.check_array with dtype=None returns numeric arrays when passed in a pandas DataFrame with mixed dtypes. dtype="numeric" will also better infer the dtype when the DataFrame has mixed dtypes. #22237 by Thomas Fan.
- Enhancement utils.check_scalar now has better messages when displaying the type. #22218 by Thomas Fan.
- Fix Changes the error message of the ValidationError raised by utils.check_X_y when y is None so that it is compatible with the check_requires_y_none estimator check. #22578 by Claudio Salvatore Arcidiacono.
- Fix utils.class_weight.compute_class_weight now only requires that all classes in y have a weight in class_weight. An error is still raised when a class is present in y but not in class_weight. #22595 by Thomas Fan.
- Fix utils.estimator_html_repr has an improved visualization for nested meta-estimators. #21310 by Thomas Fan.
- Fix utils.check_scalar raises an error when include_boundaries={"left", "right"} and the boundaries are not set. #22027 by Marie Lanternier.
- Fix utils.metaestimators.available_if correctly returns a bound method that can be pickled. #23077 by Thomas Fan.
- API Change utils.estimator_checks.check_estimator's argument is now called estimator (previous name was Estimator). #22188 by Mathurin Massias.
- API Change utils.metaestimators.if_delegate_has_method is deprecated and will be removed in version 1.3. Use utils.metaestimators.available_if instead. #22830 by Jérémie du Boisberranger.