Service state tests and fixes #22

ThirteenFish · 2024-03-01T03:00:08Z

There's a handful of minor fixes in here, but the bulk of it is unit tests for Service. Before this gets merged though I'd like to pin down some intended behaviors of Service, which will then need another couple commits on this branch.

Are ServiceState.STOPPING and ServiceState.FAILED supposed to have the same value (currently 3)? Trying to print Service.state will always show STOPPING, even if the state is explicitly set to FAILED which confused me while debugging for a bit. On the other hand the only way FAILED is set is if Service.stop() is explicitly called with _error set. Until that point the Service will be STOPPING. On the third hand, the only user of FAILED (the watchdog thread in oresat-c3-software) seems to depend on the behavior that FAILED and STOPPING are the same. If the two states are made different values, the watchdog won't ever notice a service that raised an unhandled exception in on_loop().
What should happen when an exception occurs in on_loop_error()? The current behavior is to crash the Service thread but otherwise leave it in state RUNNING.
Is a Service supposed to be in state STOPPING until Service.stop() is called from the outside? For example if in on_loop(), service.cancel() is called, the thread will stop but it will remain only in STOPPING, not transitioning to STOPPED until outside action is taken.
Does it make sense to call start() on a STOPPING service to restart it?
What should happen if start() is called twice on the same Service?
What should happen if stop() is called twice on the same Service?
What should happen if start() is called after stop()?

It looks like in 829de29 Updater.updates_cached changed from returning the length of _cache.files() to just that list directly. The corresponding test was not updated though because it looks like it will be skipped unless it's being run as root, which is not the common configuration. My local github runner however runs it as root and was failing on this test.

Users of the Gpio class expect GpioError on exceptional conditions, but if the class didn't find a particular pin, it would raise a KeyError. This changes it to GpioError, so that users don't have to except both KeyError and GpioError.

There are no guarantees which thread __del__() is run in. It could potentially be run in the Service's thread, where it attempts to join() itself and RuntimeErrors, as captured by this exception I witnessed during testing: Exception ignored in: <function Service.__del__ at 0x719dbf61e8c0> Traceback (most recent call last): File "oresat-olaf/olaf/common/service.py", line 48, in __del__ self._thread.join() File "/usr/lib/python3.10/threading.py", line 1093, in join raise RuntimeError("cannot join current thread") RuntimeError: cannot join current thread Furthermore it doesn't really make sense to wait for the thread to have stopped in __del__(). We don't do anything with that information and it doesn't make sense to pause GC until the thread is stopped. Setting the _event seems enough, the thread can die at its leisure.

This takes the Service class through its lifecycle, starting, stopping, and causing errors. Some questions remain about the intended behavior. In most tests it derives a new class that overrides the specific behavior that the test requires, but black mangles the formatting of that so I turn it off for the class creation.

ryanpdx

I though I made a mermaid diagram for service state. I guess not.

They should have different values. But STOPPING should always exist between RUNNING and FAILED/STOPPED
If the child class has overrode it, it should be used to put the service (and any hardware in a nice state after an error, if possible) and then it should go to the FAILED state with the thread stopped.
Umm, idk. state mostly just a flag anyways.
Nope, once the service has reached STOPPING, STOPPED, or FAILED, it should remand in STOPPED or FAILED. No restarts. Trying to keep things simple.
The service should handle it
The service should handle it
Same as 4

ThirteenFish · 2024-03-06T09:39:59Z

They should have different values. But STOPPING should always exist between RUNNING and FAILED/STOPPED

The exact meaning of "always exist" is perhaps a bit unclear to me. Because concurrency is weird it might make the most sense to talk about which hooks run under which state. Certainly on_stop() should be run under STOPPING but what should on_loop_error() be run under? RUNNING? STOPPING? A new state FAILING?

If the child class has overrode it, it should be used to put the service (and any hardware in a nice state after an error, if possible) and then it should go to the FAILED state with the thread stopped.

That sounds a whole lot like what the stuff in stop() does. Looking at the uses of on_loop_error(), they mostly also expect on_stop() to reset the hardware/state. What if then on_stop() and the setting of STOPPED/FAILED were moved into the end of _loop()? That way the state and on_stop() would be set promptly, allowing the C3 watchdog to pick up on FAILED again?

The changes then would look like:

After _loop() (and thus the thread) ends the service is only in either STOPPED or FAILED
on_stop_before() doesn't get run if on_loop_error() runs.
_loop() looks like (sans logging and comments):

def _loop(self):
    self._status = ServiceState.RUNNING
    while not self._event.is_set():
        try:
            self.on_loop()
        except: 
            try:
                self.on_loop_error(e)
            except: 
                pass  # really logger.exception()
            self._event.set()
            self._error = True
    self._status = ServiceState.STOPPING
    try:
        self.on_stop()
    except:
        self._error = True

    self._status = ServiceState.FAILED if self._error else ServiceState.STOPPED

stop() looks like:

def stop(self):
    self._status = ServiceState.STOPPING
    try:
        self.on_stop_before()
    except: 
        self._error = True

    self._event.set()
    self._thread.join()

The join ensures on_stop() gets run after on_stop_pre() and before stop() returns.

A few additional ideas I had while writing this:

If on_stop_before() were to be run under RUNNING instead, it would make it so on_loop() doesn't have to consider running under STOPPING.
If on_stop_before() failing doesn't imply the service FAILED then the implementation could be simplified by removing self._error.
on_start() could probably be moved into _loop() for symmetry. It would also make OLAF startup quicker because any IO wouldn't block other service starts.

This fixes confusion between the two states: when printing it would always show STOPPING even explicitly set to FAILING, and when checking for FAILING it would be true when set to just STOPPING. This additionally changes the type from IntEnum to just Enum. IntEnums are intended for when the members should also be used interchangeably with normal ints. Arithmetic, ordering comparisons, and the like. It doesn't really make sense to use the int properties for ServiceState except as a way to distinguish them so we shouldn't advertise the option that it could be done. Finally @unique is a fun decorator to prevent the kind of error that occurred here. It ensures that each member has a unique value.

ryanpdx · 2024-03-07T01:59:27Z

The exact meaning of "always exist" is perhaps a bit unclear to me. Because concurrency is weird it might make the most sense to talk about which hooks run under which state. Certainly on_stop() should be run under STOPPING but what should on_loop_error() be run under? RUNNING? STOPPING? A new state FAILING?

Yea, we could add a FAILING state

That sounds a whole lot like what the stuff in stop() does. Looking at the uses of on_loop_error(), they mostly also expect on_stop() to reset the hardware/state. What if then on_stop() and the setting of STOPPED/FAILED were moved into the end of _loop()? That way the state and on_stop() would be set promptly, allowing the C3 watchdog to pick up on FAILED again?

on_loop_error() is separate from on_stop() as it could do extra things (send a EMCY message, maybe a log, or for the C3, the service may need to send SDOs). That _loop() change makes sense.

If on_stop_before() were to be run under RUNNING instead, it would make it so on_loop() doesn't have to consider running under STOPPING.

It is kinda apart of the stopping process. That does make sense tho.

If on_stop_before() failing doesn't imply the service FAILED then the implementation could be simplified by removing self._error.

It should problaby go into FAILED.

on_start() could probably be moved into _loop() for symmetry. It would also make OLAF startup quicker because any IO wouldn't block other service starts.

Yeah, that makes sense

ryanpdx

This all looks good.

I'm am surprised those asserts in the updater unit-tests passed previously.

ThirteenFish added 4 commits February 28, 2024 21:03

Gpio class raises GpioError when a pin isn't found

9917b9f

Users of the Gpio class expect GpioError on exceptional conditions, but if the class didn't find a particular pin, it would raise a KeyError. This changes it to GpioError, so that users don't have to except both KeyError and GpioError.

ThirteenFish requested a review from ryanpdx March 1, 2024 03:00

ryanpdx reviewed Mar 6, 2024

View reviewed changes

ThirteenFish requested a review from ryanpdx March 6, 2024 09:52

ryanpdx reviewed Mar 7, 2024

View reviewed changes

ryanpdx assigned ThirteenFish Aug 25, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Service state tests and fixes #22

Service state tests and fixes #22

ThirteenFish commented Mar 1, 2024

ryanpdx left a comment

ThirteenFish commented Mar 6, 2024

ryanpdx commented Mar 7, 2024

ryanpdx left a comment

Service state tests and fixes #22

Are you sure you want to change the base?

Service state tests and fixes #22

Conversation

ThirteenFish commented Mar 1, 2024

ryanpdx left a comment

Choose a reason for hiding this comment

ThirteenFish commented Mar 6, 2024

ryanpdx commented Mar 7, 2024

ryanpdx left a comment

Choose a reason for hiding this comment