The construction of the dictionaries that remap the categorical features is really part of the training and should not include the test and validation data. In actual use, feature values would show up that were not in the feature dictionaries, and the test/validation data should reflect that.
Since the data preparation uses 0 for values not supplied for a feature, this seems like the intended OOV unmapped feature value, but there is no guarantee that every feature will have a missing value among the training data, so 0 might not even be a key in the feature's dictionary.
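To make the proposal concrete, here is a minimal sketch of building a feature dictionary from the training split only and folding every missing or unseen value to a reserved index 0. The names `OOV_INDEX`, `build_feature_dict`, and `remap` are hypothetical and not taken from the repository's data preparation code; treating `None` and the empty string as "missing" is also an assumption.

```python
from typing import Dict, Hashable, Iterable, List, Optional

# Assumption: index 0 is reserved for both missing values and values
# never seen in the training split (out-of-vocabulary).
OOV_INDEX = 0


def build_feature_dict(
    train_values: Iterable[Optional[Hashable]],
) -> Dict[Hashable, int]:
    """Map each distinct non-missing training value to an index starting at 1."""
    mapping: Dict[Hashable, int] = {}
    for value in train_values:
        if value is None or value == "":
            continue  # missing values are not stored; they fold to OOV_INDEX
        if value not in mapping:
            mapping[value] = len(mapping) + 1
    return mapping


def remap(
    values: Iterable[Optional[Hashable]], mapping: Dict[Hashable, int]
) -> List[int]:
    """Remap values, sending anything not in the dictionary to OOV_INDEX."""
    return [mapping.get(value, OOV_INDEX) for value in values]


# Hypothetical values for a single categorical feature.
train = ["a1", "b2", "a1", "c3"]
test = ["b2", "zz", ""]  # "zz" is unseen in training, "" is missing

feature_dict = build_feature_dict(train)  # built from the training split only
train_ids = remap(train, feature_dict)
test_ids = remap(test, feature_dict)

# The embedding table needs one extra row for the reserved index 0.
num_embeddings = len(feature_dict) + 1
```

With this layout index 0 always exists as a valid row, whether or not any training example has a missing value for the feature, so unseen test values always have somewhere to land.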
I'm not sure I completely agree with your first statement; this might depend on the use case and the meaning of the feature. However, you can always write your own pre-processing routine that does it differently.
I don't believe this should be a problem: even if a feature has no missing value and 0 is indeed not a key in its dictionary, the embedding vector at index 0 will simply not be accessed during training. Let me know if I'm missing something.
The set of keys that appear in the dictionary is part of the training. What you remap them to is not, since the embedding training that comes afterwards does not depend on the actual remapped values, only on how you fold values before remapping.
Unrelated to that, I actually do have some changes to the data preparation that make it more suitable for running on a less powerful server. I hope to submit a PR or two for that in the next week or two.