multioutput problem #971

Open
kradant opened this issue Dec 3, 2019 · 11 comments

kradant commented Dec 3, 2019

I am working on a multi-output regression problem, that is, the target values have more than one dimension.
A number of regressors from scikit-learn can only be used for multi-output problems when wrapped in the class MultiOutputRegressor (see especially https://scikit-learn.org/stable/modules/multiclass.html#multioutput-regression).

MultiOutputRegressor takes a regressor as an argument and then fits one regressor per target. In this way most single-output regressors can handle multidimensional output. So I want to use it with TPOT.
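
For context, this minimal plain-scikit-learn sketch (independent of TPOT, with an illustrative toy dataset and estimator) shows what MultiOutputRegressor does:

from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV
from sklearn.multioutput import MultiOutputRegressor

# Toy data with a 2-D target (two target columns).
X, y = make_regression(n_samples=200, n_features=5, n_targets=2, random_state=0)

# MultiOutputRegressor clones the wrapped single-output estimator
# and fits one clone per target column.
model = MultiOutputRegressor(ElasticNetCV())
model.fit(X, y)
print(model.predict(X[:3]).shape)  # (3, 2): one column of predictions per target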

Issue #903 deals with the changes that must be applied to base.py in order to work with multiple outputs (I applied them and they worked fine). But neither #747, #810 nor #903 clarifies how to actually use MultiOutputRegressor with several regressors.

This is my config dictionary:
custom_regressor_config_dict = {
    'sklearn.multioutput.MultiOutputRegressor': {
        'estimator': {
            'sklearn.linear_model.ElasticNetCV': {
                'l1_ratio': np.arange(0.0, 1.01, 0.05),
                'tol': [1e-5, 1e-4, 1e-3, 1e-2, 1e-1]
            }
        }
    }
}

Now I want to insert more regressors. I tried putting them in a list or in a dictionary, for example:

'sklearn.multioutput.MultiOutputRegressor': {
    'estimator': [
        {'sklearn.linear_model.ElasticNetCV': {
            'l1_ratio': np.arange(0.0, 1.01, 0.05),
            'tol': [1e-5, 1e-4, 1e-3, 1e-2, 1e-1]
        }},
        {'sklearn.ensemble.AdaBoostRegressor': {
            'n_estimators': [100],
            'learning_rate': [1e-3, 1e-2, 1e-1, 0.5, 1.],
            'loss': ["linear", "square", "exponential"]
        }}
    ]
}

and it always throws an error like:

RuntimeError: There was an error in the TPOT optimization process. This could be because the data was not formatted properly[...]

How do I format it properly so that I can use more models? Is there a workaround? Thank you!

weixuanfu (Contributor) commented:

TPOT currently does not support multi-output regression, and the current configuration does not support more than one estimator option within this kind of meta-estimator (similar to #956). But we are working on adding support for more than one estimator, and I think that is the first step toward supporting multi-output regression.

kradant commented Dec 3, 2019

Do you know of an alternative tool like TPOT that can do multi-output regression? Or a workaround, like running TPOT in a loop and changing the estimator each time?
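
I mean something like this rough sketch (assuming the base.py change from #903 is applied so that TPOT accepts a 2-D target; the candidate estimators, parameter grids, and search settings are only illustrative):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from tpot import TPOTRegressor

X, y = make_regression(n_samples=300, n_features=5, n_targets=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# One candidate inner estimator per run; each config wraps exactly one
# estimator in MultiOutputRegressor, which the current config format allows.
candidates = {
    'sklearn.linear_model.ElasticNetCV': {
        'l1_ratio': np.arange(0.0, 1.01, 0.05),
        'tol': [1e-5, 1e-4, 1e-3, 1e-2, 1e-1],
    },
    'sklearn.ensemble.AdaBoostRegressor': {
        'n_estimators': [100],
        'learning_rate': [1e-3, 1e-2, 1e-1, 0.5, 1.0],
        'loss': ['linear', 'square', 'exponential'],
    },
}

best_score, best_tpot = -np.inf, None
for name, params in candidates.items():
    config = {'sklearn.multioutput.MultiOutputRegressor': {'estimator': {name: params}}}
    tpot = TPOTRegressor(generations=5, population_size=20, cv=3,
                         scoring='neg_mean_squared_error',
                         config_dict=config, random_state=42, verbosity=2)
    tpot.fit(X_train, y_train)          # needs the 2-D-target patch from #903
    score = tpot.score(X_test, y_test)
    if score > best_score:
        best_score, best_tpot = score, tpot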

jhmenke (Contributor) commented Dec 3, 2019

So using only the regressors that natively perform multi-output regression is not viable for now? Because that seems to work already with some slight modifications (changing some sklearn metric, IIRC).
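
For illustration, a config restricted to regressors that natively accept a 2-D target might look like this sketch (the estimator list and parameter grids are illustrative, not TPOT's built-in defaults):

import numpy as np

# Sketch of a TPOT config limited to scikit-learn regressors that handle
# 2-D targets natively (no MultiOutputRegressor wrapper needed).
native_multioutput_config = {
    'sklearn.ensemble.ExtraTreesRegressor': {
        'n_estimators': [100],
        'max_features': np.arange(0.05, 1.01, 0.05),
    },
    'sklearn.ensemble.RandomForestRegressor': {
        'n_estimators': [100],
        'max_features': np.arange(0.05, 1.01, 0.05),
    },
    'sklearn.tree.DecisionTreeRegressor': {
        'max_depth': range(1, 11),
    },
    'sklearn.neighbors.KNeighborsRegressor': {
        'n_neighbors': range(1, 101),
        'weights': ['uniform', 'distance'],
    },
}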

weixuanfu (Contributor) commented:

> So using only the regressors that natively perform multi-output regression is not viable for now? Because that seems to work already with some slight modifications (changing some sklearn metric, IIRC).

Hmm, maybe that is a practical workaround. Could you please share the modifications with a demo via a pull request?

kradant commented Dec 3, 2019

> So using only the regressors that natively perform multi-output regression is not viable for now? Because that seems to work already with some slight modifications (changing some sklearn metric, IIRC).

Now I am a bit confused :) I already implemented the changes to base.py as suggested in #903, so I am already able to run TPOT with the regressors that natively perform multi-output regression. I didn't need to adjust any metrics. Am I doing this wrong?
And well, yes, I wanted to compare most regressors; I thought that was the whole point of TPOT and automated ML: comparing a large set of different pipelines/algorithms.

jhmenke (Contributor) commented Dec 4, 2019

> So using only the regressors that natively perform multi-output regression is not viable for now? Because that seems to work already with some slight modifications (changing some sklearn metric, IIRC).

> Hmm, maybe that is a practical workaround. Could you please share the modifications with a demo via a pull request?

Sure, I'll look it up when I find the time. But I think there were no changes necessary in TPOT directly, just a slight modification of a sklearn metric. I will post once I get around to it.

> Now I am a bit confused :) I already implemented the changes to base.py as suggested in #903, so I am already able to run TPOT with the regressors that natively perform multi-output regression.

Yes, that was my question. At least it works for those.

kradant commented Dec 6, 2019

I am still not getting it :)

  1. In "Flag to allow multioutput" (#903), the code changes to use the "native" multi-output regressors are already presented. So why is there a need to

  > share the modifications with a demo via a pull request?

  when that has already happened?

  2. And if multi-output regressors can be used with the adjustments from "Flag to allow multioutput" (#903), and we can also make use of MultiOutputRegressor (currently with just one estimator), why @weixuanfu are you saying that multi-output isn't supported?

jhmenke (Contributor) commented Dec 6, 2019

That pull request is not merged, so it is not native to TPOT and therefore unsupported.

Also, if there is a solution that does not require a flag, that would of course be better (again, I'm looking into it when I find the time).

weixuanfu (Contributor) commented Dec 6, 2019

Yes, one of the reasons we did not merge #903 is that we hoped for a nice solution without the flag. I forgot to post a comment on that PR.

jhmenke (Contributor) commented Jan 14, 2020

Can someone confirm that changing this line:

tpot/tpot/base.py, line 1160 in aea42a5:

X, y = check_X_y(features, target, accept_sparse=True, dtype=None)

to

X, y = check_X_y(features, target, accept_sparse=True, dtype=None, multi_output=len(target.shape) > 1 and target.shape[1] > 1)

multi-output targets are supported correctly? It seems to work for me and it was the only change I made, but I'd rather see it confirmed by someone before making a proper PR.

Edit: I just looked up PR #903 and it seems to be doing the same change, albeit with a manual flag.
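
For context, here is a standalone sketch of what the multi_output argument changes in scikit-learn's validation, independent of TPOT:

import numpy as np
from sklearn.utils import check_X_y

X = np.random.rand(10, 5)
y = np.random.rand(10, 2)   # 2-D target: two columns

# Without multi_output, check_X_y insists on a 1-D y and raises a ValueError.
try:
    check_X_y(X, y, accept_sparse=True, dtype=None)
except ValueError as err:
    print('rejected:', err)

# With the conditional multi_output flag, a 2-D y passes through
# while a 1-D y still takes the stricter code path.
X_checked, y_checked = check_X_y(
    X, y, accept_sparse=True, dtype=None,
    multi_output=len(y.shape) > 1 and y.shape[1] > 1)
print(y_checked.shape)   # (10, 2)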

windowshopr commented:

Just found this thread myself, and I'm getting the same error as in #747, which is:

  File "C:\Users\...\tpot\base.py", line 1393, in _check_dataset
    "Error: Input data is not in a valid format. Please confirm "
ValueError: Error: Input data is not in a valid format. Please confirm that the input data is scikit-learn compatible. For example, the features must be a 2-D array and target labels must be a 1-D array.

I know it's not officially supported, but would love to be able to use TPOT for a multi-output regression problem.

For a simple reproducible problem, use this code:

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from tpot import TPOTRegressor
from numpy import arange
from sklearn.metrics import r2_score  # needed for the r2_score call below

RANDOM_SEED = 42

X, y = make_regression(n_samples=500,
                       n_features=5,
                       n_informative=2,
                       n_targets=2,
                       shuffle=True,
                       random_state=RANDOM_SEED)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10, random_state=RANDOM_SEED)

regressor_config_dict = {
    'sklearn.multioutput.MultiOutputRegressor': {
        'estimator': {
            'sklearn.ensemble.ExtraTreesRegressor': {
                'n_estimators': [100],
                'max_features': arange(0.05, 1.01, 0.05)
            }
        }
    }
}

tpot = TPOTRegressor(generations=100, 
                     population_size=100,
                     offspring_size=None, 
                     mutation_rate=0.9,
                     crossover_rate=0.1,
                     scoring='neg_mean_squared_error', 
                     cv=3,
                     subsample=1.0, 
                     n_jobs=4,
                     max_time_mins=None, 
                     max_eval_time_mins=5,
                     random_state=None, 
                     config_dict=regressor_config_dict,
                     template=None,
                     warm_start=False,
                     memory=None,
                     use_dask=True,
                     periodic_checkpoint_folder=None,
                     early_stop=2,
                     verbosity=2,
                     disable_update_check=False)

tpot.fit(X_train, y_train)

preds = tpot.predict(X_test)  # predict on the held-out test set (2-D array, 5 features per row)
print(r2_score(y_test, preds))
print(preds)

How can we make this work? Hacky solutions are welcome! :D
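
One hacky sketch along those lines, reusing the data, imports, and RANDOM_SEED from the snippet above: fit a separate TPOTRegressor per target column and stack the column-wise predictions. Each individual search then sees a 1-D target, so no changes to base.py are needed, at the cost of one full TPOT run per target (search settings trimmed for brevity):

import numpy as np
from sklearn.metrics import r2_score

per_target_preds = []
for col in range(y_train.shape[1]):
    # Each column gets its own search; the 1-D target keeps check_X_y happy.
    tpot_col = TPOTRegressor(generations=5, population_size=20, cv=3,
                             scoring='neg_mean_squared_error',
                             random_state=RANDOM_SEED, verbosity=2)
    tpot_col.fit(X_train, y_train[:, col])
    per_target_preds.append(tpot_col.predict(X_test))

preds = np.column_stack(per_target_preds)
print(r2_score(y_test, preds))   # multi-output R^2, averaged over targets by default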
