Replies: 8 comments 13 replies
-
P.S. I wrote this in one sitting -- so please be kind about grammar issues. ;) I'll come back and edit it later, but I needed to get something written down.
-
Great write up! 👍
-
I've been pondering this all weekend. One thing I didn't elaborate on much is metrics. I think first-class metrics for async times would be very valuable as a guide for this. Doing the threading work would be pointless if it doesn't noticeably improve the metrics, and metrics would also help guide where the effort is being placed. @benbierens I would be curious about your thoughts on that? Personally I think a histogram of timings would be pretty helpful.
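For illustration, here's a self-contained sketch of what per-operation timing collection could look like. In practice this would probably go through nim-metrics; `ioHistogram`, the bucket layout, and the usage shown are just placeholders I made up:

```nim
import std/[monotimes, times, math]

type
  Histogram = object
    ## log2-scale buckets in milliseconds: <=1, <=2, <=4, ..., <=1024, overflow.
    buckets: array[12, int]

proc record(h: var Histogram, elapsedMs: float) =
  let idx = min(max(ceil(log2(max(elapsedMs, 1.0))).int, 0), h.buckets.high)
  inc h.buckets[idx]

var ioHistogram: Histogram

template timed(h: var Histogram, body: untyped) =
  ## Measure the wall-clock time of `body` and record it in `h`.
  let start = getMonoTime()
  body
  h.record(float((getMonoTime() - start).inMilliseconds))

# Hypothetical usage at the datastore boundary:
#   ioHistogram.timed:
#     await store.get(key)
```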
-
My recommendation would be to completely leave ORC off the table and use manual memory management (i.e. allocate flat shared buffers and pass pointers around). ORC has unknown scope and a significant compiler bug surface, and will easily turn what would normally be a man-week of work into a year-long effort dealing with unrelated fallout. From the description here, it is simple to solve Codex's needs with the same techniques that Nimbus uses, and those took no more than a week to implement. Once ORC is production ready, the threading strategy can easily be swapped out; compared to the manual solution you'll save 10-20 lines of code, but everything else will remain exactly the same as with a manually managed strategy.
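For illustration, a minimal sketch of that manual approach (hypothetical names, not actual Nimbus or Codex code): allocate a flat buffer on the shared heap, hand (pointer, len) to the worker, and free it only after the worker has signalled completion.

```nim
type
  SharedBuf = object
    data: ptr UncheckedArray[byte]
    len: int

proc newSharedBuf(len: int): SharedBuf =
  ## Flat buffer on the shared heap; no GC involvement at all.
  SharedBuf(data: cast[ptr UncheckedArray[byte]](allocShared0(len)), len: len)

proc free(buf: SharedBuf) =
  ## Called by whichever side owns the buffer once the other side is done.
  if buf.data != nil:
    deallocShared(buf.data)

proc processBlock(buf: SharedBuf) {.gcsafe.} =
  ## Worker side: only touches memory through the raw pointer, so no GC'd
  ## memory ever crosses the thread boundary.
  for i in 0 ..< buf.len:
    discard buf.data[i]
```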
-
We've made some progress on this. There are some challenges with the lifetimes of data in Codex and Datastore, particularly if we want to avoid large copies.

There are a few ways to solve this, but the best ones are either copying data between threads or using an atomic shared pointer approach. The current outstanding question is how much of a performance impact copying data between threads would entail. It may be negligible compared to the overhead of verification and disk IO. However, I believe it would be best to try to avoid copying; it's likely to impact performance and to cause later headaches in achieving usable performance for Codex.

In the interest of avoiding trapping ourselves, I believe the best approach is to change the Datastore API. The next important piece is the threadpool choice. The next, more concrete step is to begin changing out the Datastore API accordingly.
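For reference, a rough sketch of what the atomic shared pointer option could look like (hypothetical, not existing Codex code): a byte buffer whose lifetime is managed by an atomic reference count, so whichever thread drops the last reference frees it.

```nim
import std/atomics

type
  SharedBlock = object
    refcount: ptr Atomic[int]
    data: ptr UncheckedArray[byte]
    len: int

proc newSharedBlock(len: int): SharedBlock =
  ## Both the payload and the refcount live on the shared heap.
  result.refcount = cast[ptr Atomic[int]](allocShared0(sizeof(Atomic[int])))
  result.refcount[].store(1)
  result.data = cast[ptr UncheckedArray[byte]](allocShared0(len))
  result.len = len

proc incRef(b: SharedBlock) =
  discard b.refcount[].fetchAdd(1)

proc decRef(b: SharedBlock) =
  ## The thread that releases the last reference frees the memory.
  if b.refcount[].fetchSub(1) == 1:
    deallocShared(b.data)
    deallocShared(b.refcount)
```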
-
I know we explored several possibilities, and at some point I was leaning towards zero copy as well, but I've since become convinced that copying is both the simplest and the safest approach to start with. One reason is that we don't have a good understanding of the performance profile of Codex (and probably won't until we get more real-life feedback); in other words, we don't really know where the large bottlenecks are, and trying to address this first sounds like premature optimization. Also, I don't think that copying will have a large impact on either the memory footprint or the performance overall; a good heuristic for that is asyncdispatch and Chronos, which both resort to copying. Another reason is the potential changes across the codebase this will incur: changing the datastore API will lead to changes all the way down to the network layer, which requires more testing and will most likely lead to more bugs. Overall, it just doesn't sound like it's worth the effort at this point. As for the thread/task pool, we can probably try out
But I think as long as we have two separate instances, we should be fine? @mratsim might have some additional thoughts on this.
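To make the two-instances idea concrete, a rough sketch assuming nim-taskpools; the worker procs are placeholders and the exact proc names should be checked against the taskpools version we pin.

```nim
import taskpools

proc writeBlock(data: pointer, len: int) {.gcsafe.} =
  discard # placeholder: blocking disk IO / sqlite work

proc verifyBlock(data: pointer, len: int) {.gcsafe.} =
  discard # placeholder: hashing / verification work

var
  ioPool = Taskpool.new(numThreads = 1)      # serializes sqlite / disk access
  computePool = Taskpool.new(numThreads = 4) # CPU-bound work

# Work never migrates between the two pools, so a long disk write cannot
# starve verification tasks and vice versa, e.g.:
#   ioPool.spawn writeBlock(p, len)
#   computePool.spawn verifyBlock(p, len)
```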
-
Yes it can, the only tricky parts are:
If you have deadlocks in work-stealing, you would likely also have deadlocks in plain async (i.e. two tasks depending on each other).
-
I started making a prototype "proxy" datastore that runs another datastore on a different thread. The data passing and thread async mechanisms are working beautifully in conjunction with
-
Motivation
Based on discussions with @dryajov, we're at a point where the file access layer is affecting the performance of test nodes. While sqlite and the datastore are fast on average, there can be large tail latencies that result from blocking the main Chronos thread.
We could try to rely on just spinning up multiple independent Chronos threads, but this would just continue causing large tail latencies. The overall throughput might be similar, but the experience would be much worse due to requests being delayed. It may also cause failures on the networking side where keep-alives or quick status updates are delayed past their timeout thresholds.
Proposed solution: threaded Disk-IO and Compute
Overall, Codex nodes (clients) are likely to be bound by IO and compute rather than by networking. This leads to a solution where disk IO and compute are performed in dedicated taskpools, and the results are sent back to the application side (the Chronos networking layer) when ready.
This solution would likely look like:
The main thread would kick off work to the threadpools as an async operation. When the results are ready, the async network handling continues.
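As a rough sketch of that flow, assuming nim-taskpools on the worker side and the Chronos thread-notification primitive from PR 406; `IoRequest`, `readBlockFromDisk`, and the exact module/proc names are assumptions to be verified:

```nim
import taskpools
import chronos, chronos/threadsync

type
  IoRequest = object
    signal: ThreadSignalPtr       # created with ThreadSignalPtr.new(), fired by the worker
    buf: ptr UncheckedArray[byte] # shared, manually managed result buffer
    len: int

proc readBlockFromDisk(req: ptr IoRequest) {.gcsafe.} =
  ## Runs on a taskpool thread: do the blocking disk IO into req.buf,
  ## then wake the Chronos event loop. No GC'd memory is touched here.
  # ... blocking read into req.buf ...
  discard req.signal.fireSync()

var ioPool = Taskpool.new(numThreads = 1)

proc getBlock(req: ptr IoRequest) {.async.} =
  ## Main (Chronos) thread: kick the work off and suspend this future until
  ## the worker signals; other network futures keep running in the meantime.
  ioPool.spawn readBlockFromDisk(req)
  await req.signal.wait()
  # ... continue async network handling with req.buf / req.len ...
```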
Implementation Plan
It seems to make sense to focus on the Disk-IO portion first since that affects the overall client stability the most.
@dryajov and I thought that `datastore` would be a good layer of the API at which to create a bridge between the application layer and the Disk-IO threadpools. The first step is to review the `datastore` API and propose API changes to `datastore`.
Considerations
Starting small, with a single thread each for sqlite and general disk IO, seems like a prudent path.
Asynchronous Notifications in Chronos
The first challenge is how to notify the async networking side that data is ready. This requires plumbing into the async dispatcher, which Chronos recently added in PR 406. It was used in Nimbus recently as well.
This seems sufficient for our needs, and we can build on the Nimbus team's experience. Note that Nim's stdlib asyncdispatch also has similar support for asynchronous, thread-capable notification using a very similar mechanism. I've used the `AsyncFD` approach myself on a couple of projects now and it works well.
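For comparison, a sketch of the stdlib mechanism mentioned above, using asyncdispatch's thread-safe AsyncEvent (the `AsyncFD` route is similar but wraps a pipe/eventfd by hand); illustrative only, not code we'd ship:

```nim
import std/[asyncdispatch, os]

let done = newAsyncEvent()

proc worker(ev: AsyncEvent) {.thread.} =
  sleep(100)     # stand-in for blocking disk IO
  ev.trigger()   # safe to call from a non-dispatcher thread

proc waiter(): Future[void] =
  ## Completes when the event fires; registered on the dispatcher thread.
  let fut = newFuture[void]("waiter")
  addEvent(done) do (fd: AsyncFD) -> bool:
    fut.complete()
    true         # returning true unregisters the callback
  fut

var t: Thread[AsyncEvent]
createThread(t, worker, done)
waitFor waiter()
joinThread(t)
```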
Data Sharing
The primary challenge in multi-threading is figuring out the data types and how to share them between the threads. We must have a zero-copy mechanism for IO since we'll be dealing with large blocks of data.
I haven't fully reviewed how we're passing data currently, but it appears to be a mix of `seq[bytes]` and some buffers wrapping raw pointers. It would be important to make these data types as consistent as possible.
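One possible way to make those representations consistent: a small non-owning view type that both the seq-based and raw-pointer paths can produce. This is a hypothetical sketch, not existing Codex code, and it keeps the zero-copy property since no bytes are duplicated.

```nim
type
  ByteView = object
    data: ptr UncheckedArray[byte]
    len: int

proc view(s: seq[byte]): ByteView =
  ## Non-owning: the caller must keep `s` alive for as long as the view is used.
  if s.len > 0:
    result.data = cast[ptr UncheckedArray[byte]](unsafeAddr s[0])
  result.len = s.len

proc view(p: pointer, len: int): ByteView =
  ## Wraps an existing raw buffer without copying.
  ByteView(data: cast[ptr UncheckedArray[byte]](p), len: len)
```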
Garbage Collection / Shared Heap
The default GC in the Nim 1.6 series is `refc`, which poses some challenges in sharing data between threads. Nimbus used raw pointers and a manually verified "lifetime" to share data with the worker taskpool. This requires that the application side keep the data alive until it's ready.
This works for simple cases, but I could foresee challenges that would occur when, say, futures are canceled, etc. Effectively scaling out the data interactions throughout the system could be error prone.
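A sketch of that "manually verified lifetime" pattern under `refc` (hypothetical helpers, assuming non-empty data): pin the seq before handing a raw pointer to the worker, and unpin it only once the worker has signalled completion, including on cancellation paths.

```nim
proc beginWrite(data: seq[byte]): tuple[p: pointer, len: int] =
  ## Main thread, before spawning the worker.
  GC_ref(data)   # prevent refc from collecting the seq while the worker runs
  (cast[pointer](unsafeAddr data[0]), data.len)

proc endWrite(data: seq[byte]) =
  ## Main thread, after the completion signal (or on future cancellation).
  GC_unref(data) # forgetting this on any exit path leaks the block
```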
It might also necessitate protecting the data with locks. This may prove unnecessary, however, since our data flow is fairly straightforward (put block, get block).
It's important to reason about this upfront to avoid race conditions, which would otherwise plague us with ongoing bugs. I've actually run into similar race conditions with Docker image servers written in Go; it's really not fun.
ORC / ARC
The route I favor, and have the most experience with for multithreading in Nim, is ARC/ORC. It's the default GC in Nim 2.0, which was recently released. I've been running it in production on embedded devices for several years now, though without async.
While ORC itself isn't thread-safe, it offers the ability to implement thread-safe types. This is possible because ORC provides a single shared heap and deterministic destructors. From this, thread-safe data sharing can be done using channels and `Isolate[T]`, or by using smart pointers. Both provide the ability to safely share data between threads.
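To make that concrete, here's a sketch of both options assuming the nim-lang/threading package (threading/channels and threading/smartptrs) on top of std/isolation and --mm:orc; module and proc names should be verified against the versions we'd actually pin.

```nim
import std/isolation
import threading/channels, threading/smartptrs

type BlockData = object
  cid: string
  bytes: seq[byte]

# Option 1: move ownership across threads through a channel.
var chan = newChan[BlockData]()

proc producer() =
  # Construct and move the value straight into the channel; the isolation
  # check guarantees no aliases remain on the sending thread.
  chan.send(isolate(BlockData(cid: "example", bytes: newSeq[byte](1024))))

proc consumer() =
  let blk = chan.recv()   # the value now lives on the receiving thread
  echo blk.cid

# Option 2: share ownership with an atomically ref-counted smart pointer.
let shared = newSharedPtr(BlockData(cid: "example", bytes: newSeq[byte](1024)))
# Copies of `shared` can be handed to other threads; the last copy to be
# destroyed frees the object deterministically.
```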
However, ORC is a large change from prior Nim GCs, as it's somewhat of a move from a Java/JavaScript-style GC to a C++ shared_ptr or Swift-like memory management system with moves and value semantics. As such, some upstream dependencies need to be modified to work well with ORC. The performance characteristics can also differ under ORC.
So far I have gotten the Codex unit tests passing on ARM and am at the point of tracking down some issues that may be endian-related. It's also useful, IMHO, to shake up the code a bit and see which abstractions and libraries are possibly finicky.
My recommendation is that it's worth pursuing ORC and beginning to test Codex with ORC, but we should be cautious about depending on it for non-blocking IO for the end-of-year release. I'm cautiously optimistic that before the end of the year, Chronos and the rest of the ecosystem will be ready for the switch.
Roughly speaking, in that scenario I imagine doing the initial non-blocking IO with raw pointers for some disk IO, and then using ORC datatypes for the bigger multi-threading overhauls.
Summary
Working on this will be fun! There are a few challenges, and I look forward to any thoughts, comments, etc. that others have.