
CheckPoint doesn't record the current epoch. #795

Closed
TheAutumnOfRice opened this issue Jul 29, 2021 · 2 comments
Comments

@TheAutumnOfRice
Contributor

If I set a Checkpoint callback together with the max_epochs parameter, max_epochs is no longer the true maximum epoch: training runs until the checkpoint's epoch PLUS max_epochs.

For example, if I set max_epochs=400 and load a checkpoint saved at epoch=125, the fit loop resumes at epoch 125 and ends at epoch 525, which is not expected.
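To make it concrete, here is a rough reproduction sketch. The module, data, and checkpoint directory are made up for illustration, and I'm assuming the checkpoint is resumed via skorch's LoadInitState callback:

import numpy as np
import torch
from torch import nn
from skorch import NeuralNetClassifier
from skorch.callbacks import Checkpoint, LoadInitState

# Toy module and data, purely for illustration.
class MyModule(nn.Module):
    def __init__(self):
        super().__init__()
        self.lin = nn.Linear(20, 2)

    def forward(self, X):
        return torch.softmax(self.lin(X), dim=-1)

X = np.random.randn(100, 20).astype('float32')
y = np.random.randint(0, 2, size=100)

cp = Checkpoint(dirname='exp1')
net = NeuralNetClassifier(
    MyModule,
    max_epochs=400,
    callbacks=[cp, LoadInitState(cp)],
)

# If 'exp1' already holds a checkpoint saved at epoch 125, fit() restores
# it and then loops for 400 *more* epochs, ending at epoch 525 rather
# than stopping once epoch 400 is reached.
net.fit(X, y)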

Note that NeuralNet.fit_loop has a parameter called epochs whose default value is None, in which case it falls back to max_epochs. So training loops for max_epochs iterations no matter what the current epoch is. I think it might be better if the default value were effectively:

max_epochs - net.history[-1, 'epoch']

Or, modify the for-loop in the fit_loop function:

# net.py, line 786
# for _ in range(epochs):
for _ in range(self.history[-1, 'epoch'], epochs):
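
In the meantime, a stop-gap under the current behavior could be to pass the remaining budget explicitly through the epochs parameter of fit_loop (if I read the code right, partial_fit forwards extra keyword arguments down to fit_loop); the bookkeeping around an empty history is my own addition:

# Assumes net.history has already been restored from the checkpoint,
# e.g. by a previous resumed fit() as in the sketch above.
trained = net.history[-1, 'epoch'] if len(net.history) > 0 else 0
remaining = max(net.max_epochs - trained, 0)

# Train only the epochs left in the max_epochs budget, e.g. 275 more
# epochs for a net resumed at epoch 125 with max_epochs=400.
net.partial_fit(X, y, epochs=remaining)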

Version: 0.10.0

@ottonemo
Member

Thanks for the report. I agree that the current behavior of max_epochs is confusing in this regard. I think this is very similar to #674.

If it is all the same to you, I would rather continue the discussion in thread #674.

@TheAutumnOfRice
Contributor Author

@ottonemo Thanks for your reply! I hadn't noticed that thread; yes, it is the same problem.
