Feature dictionary issue #222

diyessi · 2022-02-03T19:56:59Z

Constructing the dictionaries that remap the categorical features is really part of training and should not include the test and validation data. In actual use, feature values will show up that are not in the feature dictionaries, and the test/validation data should reflect that.

Since the data preparation uses 0 for values not supplied for a feature, 0 seems to be the intended out-of-vocabulary (OOV) value for unmapped feature values. However, there is no guarantee that every feature will have a missing value in the training data, so 0 might not even be a key in the feature's dictionary.
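
To make the proposal concrete, here is a minimal sketch of building the dictionaries from the training split only and sending everything else to a reserved index. It is not the repository's preprocessing code: `build_feature_dicts`, `remap`, `OOV_INDEX`, the use of `None` for a missing value, and the toy rows are all hypothetical names and data chosen for illustration.

```python
from collections import defaultdict

OOV_INDEX = 0  # assumption: index 0 is reserved for missing or unseen values


def build_feature_dicts(train_rows):
    """Build one value -> index dictionary per categorical column, from training rows only."""
    dicts = defaultdict(dict)
    for row in train_rows:
        for col, value in enumerate(row):
            if value is None:  # missing values are not given their own key
                continue
            mapping = dicts[col]
            if value not in mapping:
                mapping[value] = len(mapping) + 1  # start at 1 so 0 stays reserved
    return dict(dicts)


def remap(row, dicts):
    """Map each value to its index; missing or unseen values fall back to OOV_INDEX."""
    return [dicts.get(col, {}).get(value, OOV_INDEX) for col, value in enumerate(row)]


# Toy data (hypothetical): two categorical columns.
train_rows = [("a", "x"), ("b", "x"), ("a", "y")]
test_rows = [("a", "z"), ("c", None)]  # "z" and "c" never occur in training

feature_dicts = build_feature_dicts(train_rows)
train_ids = [remap(r, feature_dicts) for r in train_rows]
test_ids = [remap(r, feature_dicts) for r in test_rows]

print(train_ids)
print(test_ids)
```

Because index 0 is reserved up front rather than discovered in the data, it exists for every feature whether or not the training split happens to contain a missing value.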

mnaumovfb · 2022-02-09T06:38:36Z

I'm not sure I completely agree with your first statement; this might depend on the use case and the meaning of the feature. However, you can always write your own pre-processing routine that does it differently.

I don't believe this should be a problem: even if a feature does not have a missing value and 0 is indeed not a key in its dictionary, the embedding vector at index 0 will simply not be accessed during training. Let me know if I'm missing something.
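
One way to check which case applies to a given feature is to compare the indices looked up during training with those looked up at evaluation time. This is only a sketch: the index lists are made-up values, and it assumes unseen evaluation values are mapped to index 0.

```python
from collections import Counter

# Hypothetical remapped indices for one categorical feature.
train_ids = [1, 2, 1, 2]  # no missing values in training, so index 0 never occurs
eval_ids = [1, 0, 0]      # assumption: unseen values are sent to index 0

# Rows of the embedding table that are looked up at least once during training.
train_access_counts = Counter(train_ids)

# Rows that evaluation would look up but training never touched.
untrained_rows_hit = sorted(set(eval_ids) - set(train_access_counts))

print(train_access_counts[0])
print(untrained_rows_hit)
```

If `untrained_rows_hit` is empty, the unused row is harmless. If it is not, evaluation reads an embedding that still holds its initial values.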

diyessi · 2022-02-09T16:02:04Z

The set of keys that appear in the dictionary is part of the training; what you remap them to is not, since the embedding training that comes afterwards does not depend on the actual remappings, just on how you fold them before remapping.

Unrelated to that, I actually do have some changes to the data preparation that make it more suitable for running on a less powerful server. I hope to submit a PR or two for that in the next week or two.
