Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Service state tests and fixes #22

Open
wants to merge 5 commits into
base: master
Choose a base branch
from
Open

Conversation

ThirteenFish
Copy link
Contributor

There's a handful of minor fixes in here, but the bulk of it is unit tests for Service. Before this gets merged though I'd like to pin down some intended behaviors of Service, which will then need another couple commits on this branch.

  1. Are ServiceState.STOPPING and ServiceState.FAILED supposed to have the same value (currently 3)? Trying to print Service.state will always show STOPPING, even if the state is explicitly set to FAILED which confused me while debugging for a bit. On the other hand the only way FAILED is set is if Service.stop() is explicitly called with _error set. Until that point the Service will be STOPPING. On the third hand, the only user of FAILED (the watchdog thread in oresat-c3-software) seems to depend on the behavior that FAILED and STOPPING are the same. If the two states are made different values, the watchdog won't ever notice a service that raised an unhandled exception in on_loop().
  2. What should happen when an exception occurs in on_loop_error()? The current behavior is to crash the Service thread but otherwise leave it in state RUNNING.
  3. Is a Service supposed to be in state STOPPING until Service.stop() is called from the outside? For example if in on_loop(), service.cancel() is called, the thread will stop but it will remain only in STOPPING, not transitioning to STOPPED until outside action is taken.
  4. Does it make sense to call start() on a STOPPING service to restart it?
  5. What should happen if start() is called twice on the same Service?
  6. What should happen if stop() is called twice on the same Service?
  7. What should happen if start() is called after stop()?

It looks like in 829de29 Updater.updates_cached changed from returning
the length of _cache.files() to just that list directly. The
corresponding test was not updated though because it looks like it will
be skipped unless it's being run as root, which is not the common
configuration.

My local github runner however runs it as root and was failing on this
test.
Users of the Gpio class expect GpioError on exceptional conditions, but
if the class didn't find a particular pin, it would raise a KeyError.

This changes it to GpioError, so that users don't have to except both
KeyError and GpioError.
There are no guarantees which thread __del__() is run in. It could
potentially be run in the Service's thread, where it attempts to
join() itself and RuntimeErrors, as captured by this exception I
witnessed during testing:

Exception ignored in: <function Service.__del__ at 0x719dbf61e8c0>
Traceback (most recent call last):
  File "oresat-olaf/olaf/common/service.py", line 48, in __del__
    self._thread.join()
  File "/usr/lib/python3.10/threading.py", line 1093, in join
    raise RuntimeError("cannot join current thread")
RuntimeError: cannot join current thread

Furthermore it doesn't really make sense to wait for the thread to have
stopped in __del__(). We don't do anything with that information and it
doesn't make sense to pause GC until the thread is stopped. Setting the
_event seems enough, the thread can die at its leisure.
This takes the Service class through its lifecycle, starting, stopping,
and causing errors. Some questions remain about the intended behavior.

In most tests it derives a new class that overrides the specific
behavior that the test requires, but black mangles the formatting of
that so I turn it off for the class creation.
Copy link
Member

@ryanpdx ryanpdx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I though I made a mermaid diagram for service state. I guess not.

  1. They should have different values. But STOPPING should always exist between RUNNING and FAILED/STOPPED
  2. If the child class has overrode it, it should be used to put the service (and any hardware in a nice state after an error, if possible) and then it should go to the FAILED state with the thread stopped.
  3. Umm, idk. state mostly just a flag anyways.
  4. Nope, once the service has reached STOPPING, STOPPED, or FAILED, it should remand in STOPPED or FAILED. No restarts. Trying to keep things simple.
  5. The service should handle it
  6. The service should handle it
  7. Same as 4

@ThirteenFish
Copy link
Contributor Author

  1. They should have different values. But STOPPING should always exist between RUNNING and FAILED/STOPPED

The exact meaning of "always exist" is perhaps a bit unclear to me. Because concurrency is weird it might make the most sense to talk about which hooks run under which state. Certainly on_stop() should be run under STOPPING but what should on_loop_error() be run under? RUNNING? STOPPING? A new state FAILING?

  1. If the child class has overrode it, it should be used to put the service (and any hardware in a nice state after an error, if possible) and then it should go to the FAILED state with the thread stopped.

That sounds a whole lot like what the stuff in stop() does. Looking at the uses of on_loop_error(), they mostly also expect on_stop() to reset the hardware/state. What if then on_stop() and the setting of STOPPED/FAILED were moved into the end of _loop()? That way the state and on_stop() would be set promptly, allowing the C3 watchdog to pick up on FAILED again?

The changes then would look like:

  • After _loop() (and thus the thread) ends the service is only in either STOPPED or FAILED
  • on_stop_before() doesn't get run if on_loop_error() runs.
  • _loop() looks like (sans logging and comments):
def _loop(self):
    self._status = ServiceState.RUNNING
    while not self._event.is_set():
        try:
            self.on_loop()
        except: 
            try:
                self.on_loop_error(e)
            except: 
                pass  # really logger.exception()
            self._event.set()
            self._error = True
    self._status = ServiceState.STOPPING
    try:
        self.on_stop()
    except:
        self._error = True

    self._status = ServiceState.FAILED if self._error else ServiceState.STOPPED
  • stop() looks like:
def stop(self):
    self._status = ServiceState.STOPPING
    try:
        self.on_stop_before()
    except: 
        self._error = True

    self._event.set()
    self._thread.join()

The join ensures on_stop() gets run after on_stop_pre() and before stop() returns.

A few additional ideas I had while writing this:

  1. If on_stop_before() were to be run under RUNNING instead, it would make it so on_loop() doesn't have to consider running under STOPPING.
  2. If on_stop_before() failing doesn't imply the service FAILED then the implementation could be simplified by removing self._error.
  3. on_start() could probably be moved into _loop() for symmetry. It would also make OLAF startup quicker because any IO wouldn't block other service starts.

This fixes confusion between the two states: when printing it would
always show STOPPING even explicitly set to FAILING, and when checking
for FAILING it would be true when set to just STOPPING.

This additionally changes the type from IntEnum to just Enum. IntEnums
are intended for when the members should also be used interchangeably
with normal ints. Arithmetic, ordering comparisons, and the like. It
doesn't really make sense to use the int properties for ServiceState
except as a way to distinguish them so we shouldn't advertise the option
that it could be done.

Finally @unique is a fun decorator to prevent the kind of error that
occurred here. It ensures that each member has a unique value.
@ryanpdx
Copy link
Member

ryanpdx commented Mar 7, 2024

The exact meaning of "always exist" is perhaps a bit unclear to me. Because concurrency is weird it might make the most sense to talk about which hooks run under which state. Certainly on_stop() should be run under STOPPING but what should on_loop_error() be run under? RUNNING? STOPPING? A new state FAILING?

Yea, we could add a FAILING state

That sounds a whole lot like what the stuff in stop() does. Looking at the uses of on_loop_error(), they mostly also expect on_stop() to reset the hardware/state. What if then on_stop() and the setting of STOPPED/FAILED were moved into the end of _loop()? That way the state and on_stop() would be set promptly, allowing the C3 watchdog to pick up on FAILED again?

on_loop_error() is separate from on_stop() as it could do extra things (send a EMCY message, maybe a log, or for the C3, the service may need to send SDOs). That _loop() change makes sense.

  1. If on_stop_before() were to be run under RUNNING instead, it would make it so on_loop() doesn't have to consider running under STOPPING.

It is kinda apart of the stopping process. That does make sense tho.

  1. If on_stop_before() failing doesn't imply the service FAILED then the implementation could be simplified by removing self._error.

It should problaby go into FAILED.

  1. on_start() could probably be moved into _loop() for symmetry. It would also make OLAF startup quicker because any IO wouldn't block other service starts.

Yeah, that makes sense

Copy link
Member

@ryanpdx ryanpdx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This all looks good.

I'm am surprised those asserts in the updater unit-tests passed previously.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Review
Development

Successfully merging this pull request may close these issues.

2 participants