
CheckPoint doesn't record the current epoch. #795

Closed
TheAutumnOfRice opened this issue Jul 29, 2021 · 2 comments
Comments

@TheAutumnOfRice
Contributor

If I set a Checkpoint callback together with the max_epochs parameter, max_epochs is no longer the true maximum epoch: training runs until the checkpoint's epoch PLUS max_epochs.

For example, if I set max_epochs=400 and load a checkpoint saved at epoch=125, the fit loop resumes at epoch 125 and ends at epoch 525, which is not expected.
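To make it concrete, here is a rough reproduction sketch. The module, data, and checkpoint directory are made up for illustration, and I'm assuming the checkpoint is resumed via skorch's LoadInitState callback:

import numpy as np
import torch
from torch import nn
from skorch import NeuralNetClassifier
from skorch.callbacks import Checkpoint, LoadInitState

# Toy module and data, purely for illustration.
class MyModule(nn.Module):
    def __init__(self):
        super().__init__()
        self.lin = nn.Linear(20, 2)

    def forward(self, X):
        return torch.softmax(self.lin(X), dim=-1)

X = np.random.randn(100, 20).astype('float32')
y = np.random.randint(0, 2, size=100)

cp = Checkpoint(dirname='exp1')
net = NeuralNetClassifier(
    MyModule,
    max_epochs=400,
    callbacks=[cp, LoadInitState(cp)],
)

# If 'exp1' already holds a checkpoint saved at epoch 125, fit() restores
# it and then loops for 400 *more* epochs, ending at epoch 525 rather
# than stopping once epoch 400 is reached.
net.fit(X, y)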

Note that NeuralNet.fit_loop has a parameter called epochs whose default value is None, in which case it falls back to max_epochs. So training loops for max_epochs iterations no matter what the current epoch is. I think it might be better if the default value were effectively:

max_epochs - net.history[-1, 'epoch']

Or, modify the for-loop in the fit_loop function:

# net.py, line 786
# for _ in range(epochs):
for _ in range(self.history[-1, 'epoch'], epochs):
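
In the meantime, a stop-gap under the current behavior could be to pass the remaining budget explicitly through the epochs parameter of fit_loop (if I read the code right, partial_fit forwards extra keyword arguments down to fit_loop); the bookkeeping around an empty history is my own addition:

# Assumes net.history has already been restored from the checkpoint,
# e.g. by a previous resumed fit() as in the sketch above.
trained = net.history[-1, 'epoch'] if len(net.history) > 0 else 0
remaining = max(net.max_epochs - trained, 0)

# Train only the epochs left in the max_epochs budget, e.g. 275 more
# epochs for a net resumed at epoch 125 with max_epochs=400.
net.partial_fit(X, y, epochs=remaining)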

Version: 0.10.0

@ottonemo
Member

Thanks for the report. I agree that the current behavior of max_epochs is confusing in this regard. I think this is very similar to #674.

If it is all the same to you, I would rather continue the discussion in thread #674.

@TheAutumnOfRice
Contributor Author

@ottonemo Thanks for your reply! I hadn't noticed that thread; yes, it is the same problem.
