-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow for Flexible Preprocessing #897
base: development
Are you sure you want to change the base?
Allow for Flexible Preprocessing #897
Conversation
|
||
|
||
|
||
def load_scoring_function(scoring_func): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
help for getting relative imports working would be appreciated here... tpot.drivers.load_scoring_function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, directly importing via from ..driver import load_scoring_function
will cause some conflicts with
from .tpot import TPOTClassifier, TPOTRegressor
from ._version import __version__
A workaround is to move this function to tpot/metrics.py
and then add from ..metrics import load_scoring_function
to tpot/builtins/preprocessing.py
column_transform_dict[k] = config_dict[k] | ||
else: | ||
column_transform_dict[k] = [config_dict[k]] | ||
self._config_dict['tpot.builtins.PreprocessTransformer'] = column_transform_dict |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This injection could be dangerous - do we have opinions on how it is supposed to be handled?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think preprocess_config_dict
should be a argument within PreprocessTransformer instead of TPOT. And users should be able to customize it via config_dict
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes this is certainly possible (and possible right now with no changes to TPOT master technically) via the use of templates - I think the question arises related to #507; if its possible to have a "built-in" configuration with text or not.
Maybe the answer is we can't
) | ||
|
||
# override some settings... | ||
if config_dict.get('impute', None) is False: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line deals with #836 - might be overloading this PR and might be an item for later?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is more related to #889. I think we need add imputation into config_dict
too. We may allow TPOT skip imputation if the pipeline only has XGBClassifier or XGBRegressor.
Ok, Given the higher level comments @weixuanfu how should we proceed. In my opinion, I think this PR is just too big, what we probably want is:
(as per comments above)
versus (as per approach taken in this PR)
|
[please review the Contribution Guidelines prior to submitting your pull request. go ahead and delete this line if you've already reviewed said guidelines.]
What does this PR do?
Too many things - so many things that I'm pretty sure there will be discussions on whether this is the way to proceed. Putting it up here before I invest more time on it. This PR addresses several items (possibly too many again) including
#507
#836
Handling of categorical data #771 - I know its not directly related; but it is the closest issue, and in my mind it would allow for extending it to using different encodings like using this: https://github.com/scikit-learn-contrib/categorical-encoding
Which are all related to how TPOT does preprocessing. This PR does a number of things including re-introducing "RandomTree" option in templates that allows specifying templates in the form
Transformer-RandomTree
; which will then allow things likeMy_Preprocessing-RandomTree
.The high level approach is to inject additional preprocessing steps when
_fit_init
is called which then alters the behaviour of TPOT.Where should the reviewer start?
tpot/base.py
- will add comments in files as part of this PR with my thoughts...Also can't get relative imports working - so that would be appreciated for the
tpot.drivers.load_scoring_function
partHow should this PR be tested?
Happy to add new tests later; when the design is approved...
Any background context you want to provide?
NIL see above
What are the relevant issues?
#507
#836
Handling of categorical data #771
Screenshots (if appropriate)
Example of API using modified Iris dataset
Questions: