Feature dictionary issue #222

Open
diyessi opened this issue Feb 3, 2022 · 2 comments

@diyessi
diyessi commented Feb 3, 2022

The construction of the dictionaries that remap the categorical features is really part of training and should not include the test and validation data. In actual use, feature values will show up that were not in the feature dictionaries, and the test/validation data should reflect that.

Since the data preparation uses 0 for values not supplied for a feature, 0 seems to be the intended index for out-of-vocabulary (OOV), unmapped feature values. However, there is no guarantee that every feature will have a missing value in the training data, so 0 might not even be a key in that feature's dictionary.
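
To make the proposal concrete, here is a minimal sketch of building one feature's dictionary from the training split only, with index 0 always reserved for missing or unseen values. The names `OOV_INDEX`, `build_feature_dict`, and `remap` are hypothetical and not taken from the repository's data preparation code, and representing a missing value as `None` is an assumption.

```python
from typing import Dict, Hashable, Iterable, List, Optional

OOV_INDEX = 0  # assumption: index 0 is reserved for missing/unseen values


def build_feature_dict(train_values: Iterable[Optional[Hashable]]) -> Dict[Hashable, int]:
    """Map each distinct training value to a dense index starting at 1."""
    mapping: Dict[Hashable, int] = {}
    for value in train_values:
        if value is None:
            continue  # missing values are not keys; they fall back to OOV_INDEX
        if value not in mapping:
            mapping[value] = len(mapping) + 1  # index 0 stays reserved
    return mapping


def remap(values: Iterable[Optional[Hashable]], mapping: Dict[Hashable, int]) -> List[int]:
    """Remap raw values; anything not seen in training becomes OOV_INDEX."""
    return [mapping.get(v, OOV_INDEX) for v in values]


train = ["a", "b", "a", "c"]      # no missing value in the training split
test = ["b", "d", None, "a"]      # "d" and None were never seen in training

mapping = build_feature_dict(train)
train_ids = remap(train, mapping)
test_ids = remap(test, mapping)
num_embeddings = len(mapping) + 1  # +1 for the reserved OOV row
```

With this layout the embedding table always has a row 0, whether or not the training split happened to contain a missing value for the feature.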

@mnaumovfb
Contributor

mnaumovfb commented Feb 9, 2022

I'm not sure I completely agree with your first statement; this might depend on the use case and the meaning of the feature. However, you can always write your own pre-processing routine that does it differently.

I don't believe this should be a problem: even if a feature has no missing value and 0 is indeed not a key in its dictionary, the embedding vector at index 0 will simply not be accessed during training. Let me know if I'm missing something.
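
A toy sketch of the "not accessed during training" case, in plain Python rather than the project's actual embedding code. The table size, the batches, and the `sgd_step` update rule are all made up for illustration; the only point is that a row whose index never appears in a training batch keeps its initial values.

```python
import random

random.seed(0)
NUM_ROWS, DIM = 4, 3  # hypothetical sizes; row 0 is the reserved index

# Randomly initialised embedding table, plus a copy for later comparison.
table = [[random.uniform(-1.0, 1.0) for _ in range(DIM)] for _ in range(NUM_ROWS)]
initial = [row[:] for row in table]


def sgd_step(table, indices, lr=0.1):
    """Toy update: shrink every accessed row slightly toward zero."""
    for i in set(indices):
        table[i] = [w - lr * w for w in table[i]]


# Training batches that never contain index 0 (no missing values in training).
train_batches = [[1, 2], [3, 1], [2, 3]]
for batch in train_batches:
    sgd_step(table, batch)

# Rows that still hold their initial values after training.
untouched = [i for i in range(NUM_ROWS) if table[i] == initial[i]]
```

In this sketch `untouched` contains only row 0, so any later lookup that maps to index 0 would read the untrained initial vector.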

@diyessi
Author

diyessi commented Feb 9, 2022

The set of keys that appear in the dictionary is part of the training; what you remap them to is not, since the embedding training that comes afterwards does not depend on the actual remappings, only on how you fold the values before remapping.
Unrelated to that, I actually do have some changes to the data preparation that make it more suitable for running on a less powerful server. I hope to submit a PR or two for that in the next week or two.
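
The remark about the key set versus the remapped values can be illustrated with a small sketch. It assumes unseen values fold to index 0; the two mappings and the `partition` helper are hypothetical. Both mappings share the same key set but assign different indices, and they group the samples identically, which is what determines which samples share an embedding row.

```python
from typing import Dict, List, Sequence


def partition(ids: Sequence[int]) -> List[List[int]]:
    """Group sample positions by remapped id, ignoring the id values themselves."""
    groups: Dict[int, List[int]] = {}
    for pos, i in enumerate(ids):
        groups.setdefault(i, []).append(pos)
    return sorted(groups.values())


# Same key set, different index assignments.
mapping_a = {"a": 1, "b": 2, "c": 3}
mapping_b = {"a": 3, "b": 1, "c": 2}

# "d" and "e" are not keys, so both fold to index 0 under either mapping.
data = ["a", "d", "b", "a", "e", "c"]
ids_a = [mapping_a.get(v, 0) for v in data]
ids_b = [mapping_b.get(v, 0) for v in data]

same_grouping = partition(ids_a) == partition(ids_b)
```

Changing the key set, for example by adding "d" as a key, would change the grouping, whereas permuting the indices does not.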
