Replies: 8 comments 13 replies
-
P.S. I wrote this in one sitting -- so please be kind about grammar issues. ;) I'll come back and edit it later, but I needed to get something written down.
-
Great write up! 👍
-
I've been pondering this all weekend. One thing I didn't elaborate on much is metrics. I think first-class metrics for async times would be very valuable as a guide for this. Doing the threading work would be pointless if it doesn't noticeably improve the metrics, and metrics would also help guide where the effort is being placed. @benbierens I would be curious about your thoughts on that? Personally I think a histogram of timings would be pretty helpful.
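For illustration, here's a self-contained sketch of what per-operation timing collection could look like. In practice this would probably go through nim-metrics; `ioHistogram`, the bucket layout, and the usage shown are just placeholders I made up:

```nim
import std/[monotimes, times, math]

type
  Histogram = object
    ## log2-scale buckets in milliseconds: <=1, <=2, <=4, ..., <=1024, overflow.
    buckets: array[12, int]

proc record(h: var Histogram, elapsedMs: float) =
  let idx = min(max(ceil(log2(max(elapsedMs, 1.0))).int, 0), h.buckets.high)
  inc h.buckets[idx]

var ioHistogram: Histogram

template timed(h: var Histogram, body: untyped) =
  ## Measure the wall-clock time of `body` and record it in `h`.
  let start = getMonoTime()
  body
  h.record(float((getMonoTime() - start).inMilliseconds))

# Hypothetical usage at the datastore boundary:
#   ioHistogram.timed:
#     await store.get(key)
```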
-
My recommendation would be to completely leave ORC off the table and use manual memory management (i.e. allocate flat shared buffers and pass pointers around). ORC has unknown scope and a significant compiler bug surface, and will easily turn what would normally be a man-week of work into a year-long effort dealing with unrelated fallout. From the description here, it is simple to solve Codex's needs with the same techniques that Nimbus uses, and those took no more than a week to implement. Once ORC is production ready, the threading strategy can easily be swapped out; compared to the manual solution you'll save 10-20 lines of code, but everything else will remain exactly the same as with a manually managed strategy.
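For illustration, a minimal sketch of that manual approach (hypothetical names, not actual Nimbus or Codex code): allocate a flat buffer on the shared heap, hand (pointer, len) to the worker, and free it only after the worker has signalled completion.

```nim
type
  SharedBuf = object
    data: ptr UncheckedArray[byte]
    len: int

proc newSharedBuf(len: int): SharedBuf =
  ## Flat buffer on the shared heap; no GC involvement at all.
  SharedBuf(data: cast[ptr UncheckedArray[byte]](allocShared0(len)), len: len)

proc free(buf: SharedBuf) =
  ## Called by whichever side owns the buffer once the other side is done.
  if buf.data != nil:
    deallocShared(buf.data)

proc processBlock(buf: SharedBuf) {.gcsafe.} =
  ## Worker side: only touches memory through the raw pointer, so no GC'd
  ## memory ever crosses the thread boundary.
  for i in 0 ..< buf.len:
    discard buf.data[i]
```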
-
We've made some progress on this. There are some challenges with the lifetimes of data in Codex and Datastore, particularly if we want to avoid large copies.

There are a few ways to solve this, but the best ones are either copying data between threads or using an atomic shared pointer approach. The current outstanding question is how much of a performance impact copying data between threads would entail. It may be negligible compared to the overhead of verification and disk IO. However, I believe it would be best to try to avoid copying; it's likely to impact performance and to cause later headaches in achieving usable performance for Codex.

In the interest of avoiding trapping ourselves, I believe the best approach is to change the Datastore API. The next important piece is the threadpool choice. The next, more concrete step is to begin changing out the Datastore API accordingly.
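For reference, a rough sketch of what the atomic shared pointer option could look like (hypothetical, not existing Codex code): a byte buffer whose lifetime is managed by an atomic reference count, so whichever thread drops the last reference frees it.

```nim
import std/atomics

type
  SharedBlock = object
    refcount: ptr Atomic[int]
    data: ptr UncheckedArray[byte]
    len: int

proc newSharedBlock(len: int): SharedBlock =
  ## Both the payload and the refcount live on the shared heap.
  result.refcount = cast[ptr Atomic[int]](allocShared0(sizeof(Atomic[int])))
  result.refcount[].store(1)
  result.data = cast[ptr UncheckedArray[byte]](allocShared0(len))
  result.len = len

proc incRef(b: SharedBlock) =
  discard b.refcount[].fetchAdd(1)

proc decRef(b: SharedBlock) =
  ## The thread that releases the last reference frees the memory.
  if b.refcount[].fetchSub(1) == 1:
    deallocShared(b.data)
    deallocShared(b.refcount)
```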
-
I know we explored several possibilities, and at some point I was leaning towards zero copy as well, but I've since become convinced that copying is both the simplest and the safest approach to start with. One reason is that we don't have a good understanding of the performance profile of Codex (and probably won't until we get more real-life feedback); in other words, we don't really know where the large bottlenecks are, and trying to address this first sounds like premature optimization. Also, I don't think that copying will have a large impact on either the memory footprint or the performance overall; a good heuristic for that is asyncdispatch and Chronos, which both resort to copying. Another reason is the potential changes across the codebase this will incur: changing the datastore API will lead to changes all the way down to the network layer, which requires more testing and will most likely lead to more bugs. Overall, it just doesn't sound like it's worth the effort at this point. As for the thread/task pool, we can probably try out
But I think as long as we have two separate instances, we should be fine? @mratsim might have some additional thoughts on this.
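To make the two-instances idea concrete, a rough sketch assuming nim-taskpools; the worker procs are placeholders and the exact proc names should be checked against the taskpools version we pin.

```nim
import taskpools

proc writeBlock(data: pointer, len: int) {.gcsafe.} =
  discard # placeholder: blocking disk IO / sqlite work

proc verifyBlock(data: pointer, len: int) {.gcsafe.} =
  discard # placeholder: hashing / verification work

var
  ioPool = Taskpool.new(numThreads = 1)      # serializes sqlite / disk access
  computePool = Taskpool.new(numThreads = 4) # CPU-bound work

# Work never migrates between the two pools, so a long disk write cannot
# starve verification tasks and vice versa, e.g.:
#   ioPool.spawn writeBlock(p, len)
#   computePool.spawn verifyBlock(p, len)
```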
-
Yes it can, the only tricky parts are:
If you have deadlocks in work-stealing, you would likely also have deadlocks in plain async (i.e. two tasks depending on each other).
-
I started making a prototype "proxy" datastore that runs another datastore on a different thread. The data passing and thread async mechanisms are working beautifully in conjunction with
-
Motivation
Based on discussions with @dryajov, we're at a point where the file access layer is affecting the performance of test nodes. While sqlite and the datastore are fast on average, there can be large tail latencies that result from blocking the main Chronos thread.
We could try to rely on just spinning up multiple independent Chronos threads, but this would just continue causing large tail latencies. The overall throughput might be similar, but the experience would be much worse due to requests being delayed. It may also cause failures on the networking side where keep-alives or quick status updates are delayed past their timeout thresholds.
Proposed solution: threaded Disk-IO and Compute
Overall, Codex nodes (clients) are likely to be bound by IO and compute rather than by networking. This leads to a solution where disk IO and compute are performed in dedicated taskpools, and the results are sent back to the application side (the Chronos networking layer) when ready.
This solution would likely look like:
The main thread would kick off work to the threadpools as an async operation. When the results are ready, the async network handling continues.
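As a rough sketch of that flow, assuming nim-taskpools on the worker side and the Chronos thread-notification primitive from PR 406; `IoRequest`, `readBlockFromDisk`, and the exact module/proc names are assumptions to be verified:

```nim
import taskpools
import chronos, chronos/threadsync

type
  IoRequest = object
    signal: ThreadSignalPtr       # created with ThreadSignalPtr.new(), fired by the worker
    buf: ptr UncheckedArray[byte] # shared, manually managed result buffer
    len: int

proc readBlockFromDisk(req: ptr IoRequest) {.gcsafe.} =
  ## Runs on a taskpool thread: do the blocking disk IO into req.buf,
  ## then wake the Chronos event loop. No GC'd memory is touched here.
  # ... blocking read into req.buf ...
  discard req.signal.fireSync()

var ioPool = Taskpool.new(numThreads = 1)

proc getBlock(req: ptr IoRequest) {.async.} =
  ## Main (Chronos) thread: kick the work off and suspend this future until
  ## the worker signals; other network futures keep running in the meantime.
  ioPool.spawn readBlockFromDisk(req)
  await req.signal.wait()
  # ... continue async network handling with req.buf / req.len ...
```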
Implementation Plan
It seems to make sense to focus on the Disk-IO portion first since that affects the overall client stability the most.
@dryajov and I thought that `datastore` would be a good layer of the API at which to create a bridge between the application layer and the Disk-IO threadpools. The first step is to review the `datastore` API and propose API changes to `datastore`.
Considerations
Starting small, with a single thread each for sqlite and general disk IO, seems like a prudent path.
Asynchronous Notifications in Chronos
The first challenge is how to notify the async networking side that data is ready. This requires plumbing into the async dispatcher, which Chronos recently added in PR 406. It was used in Nimbus recently as well.
This seems sufficient for our needs, and we can build on the Nimbus team's experience. Note that Nim's stdlib asyncdispatch also has similar support for asynchronous, thread-capable notification using a very similar mechanism. I've used the `AsyncFD` approach myself on a couple of projects now and it works well.
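For comparison, a sketch of the stdlib mechanism mentioned above, using asyncdispatch's thread-safe AsyncEvent (the `AsyncFD` route is similar but wraps a pipe/eventfd by hand); illustrative only, not code we'd ship:

```nim
import std/[asyncdispatch, os]

let done = newAsyncEvent()

proc worker(ev: AsyncEvent) {.thread.} =
  sleep(100)     # stand-in for blocking disk IO
  ev.trigger()   # safe to call from a non-dispatcher thread

proc waiter(): Future[void] =
  ## Completes when the event fires; registered on the dispatcher thread.
  let fut = newFuture[void]("waiter")
  addEvent(done) do (fd: AsyncFD) -> bool:
    fut.complete()
    true         # returning true unregisters the callback
  fut

var t: Thread[AsyncEvent]
createThread(t, worker, done)
waitFor waiter()
joinThread(t)
```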
Data Sharing
The primary challenge in multi-threading is figuring out the data types and how to share them between the threads. We must have a zero-copy mechanism for IO since we'll be dealing with large blocks of data.
I haven't fully reviewed how we're passing data currently, but it appears to be a mix of `seq[bytes]` and some buffers wrapping raw pointers. It would be important to make these data types as consistent as possible.
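One possible way to make those representations consistent: a small non-owning view type that both the seq-based and raw-pointer paths can produce. This is a hypothetical sketch, not existing Codex code, and it keeps the zero-copy property since no bytes are duplicated.

```nim
type
  ByteView = object
    data: ptr UncheckedArray[byte]
    len: int

proc view(s: seq[byte]): ByteView =
  ## Non-owning: the caller must keep `s` alive for as long as the view is used.
  if s.len > 0:
    result.data = cast[ptr UncheckedArray[byte]](unsafeAddr s[0])
  result.len = s.len

proc view(p: pointer, len: int): ByteView =
  ## Wraps an existing raw buffer without copying.
  ByteView(data: cast[ptr UncheckedArray[byte]](p), len: len)
```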
Garbage Collection / Shared Heap
The default GC in the Nim 1.6 series is `refc`, which poses some challenges in sharing data between threads. Nimbus used raw pointers and a manually verified "lifetime" to share data with the worker taskpool. This requires that the application side keep the data alive until it's ready.
This works for simple cases, but I could foresee challenges that would occur when, say, futures are canceled, etc. Effectively scaling out the data interactions throughout the system could be error prone.
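A sketch of that "manually verified lifetime" pattern under `refc` (hypothetical helpers, assuming non-empty data): pin the seq before handing a raw pointer to the worker, and unpin it only once the worker has signalled completion, including on cancellation paths.

```nim
proc beginWrite(data: seq[byte]): tuple[p: pointer, len: int] =
  ## Main thread, before spawning the worker.
  GC_ref(data)   # prevent refc from collecting the seq while the worker runs
  (cast[pointer](unsafeAddr data[0]), data.len)

proc endWrite(data: seq[byte]) =
  ## Main thread, after the completion signal (or on future cancellation).
  GC_unref(data) # forgetting this on any exit path leaks the block
```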
It might also necessitate protecting the data with locks. This may prove unnecessary, however, since our data flow is fairly straightforward (put block, get block).
It's important to reason about this upfront to avoid race conditions, which would otherwise plague us with ongoing bugs. I've actually run into similar race conditions with Docker image servers written in Go; it's really not fun.
ORC / ARC
The route I favor, and have the most experience with for multithreading in Nim, is ARC/ORC. It's the default GC in Nim 2.0, which was recently released. I've been running it in production on embedded devices for several years now, though without async.
While ORC itself isn't thread-safe, it offers the ability to implement thread-safe types. This is possible because ORC provides a single shared heap and deterministic destructors. From this, thread-safe data sharing can be done using channels and `Isolate[T]`, or by using smart pointers. Both provide the ability to safely share data between threads.
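To make that concrete, here's a sketch of both options assuming the nim-lang/threading package (threading/channels and threading/smartptrs) on top of std/isolation and --mm:orc; module and proc names should be verified against the versions we'd actually pin.

```nim
import std/isolation
import threading/channels, threading/smartptrs

type BlockData = object
  cid: string
  bytes: seq[byte]

# Option 1: move ownership across threads through a channel.
var chan = newChan[BlockData]()

proc producer() =
  # Construct and move the value straight into the channel; the isolation
  # check guarantees no aliases remain on the sending thread.
  chan.send(isolate(BlockData(cid: "example", bytes: newSeq[byte](1024))))

proc consumer() =
  let blk = chan.recv()   # the value now lives on the receiving thread
  echo blk.cid

# Option 2: share ownership with an atomically ref-counted smart pointer.
let shared = newSharedPtr(BlockData(cid: "example", bytes: newSeq[byte](1024)))
# Copies of `shared` can be handed to other threads; the last copy to be
# destroyed frees the object deterministically.
```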
However, ORC is a large change from prior Nim GCs, as it's somewhat of a move from a Java/JavaScript-style GC to a C++ shared_ptr or Swift-like memory management system with moves and value semantics. As such, some upstream dependencies need to be modified to work well with ORC. The performance characteristics can also differ under ORC.
So far I have gotten the Codex unit tests passing on ARM and am at the point of tracking down some issues that may be endian-related. It's also useful, IMHO, to shake up the code a bit and see which abstractions and libraries are possibly finicky.
My recommendation is that it's worth pursuing ORC and beginning to test Codex with ORC, but we should be cautious about depending on it for non-blocking IO for the end-of-year release. I'm cautiously optimistic that before the end of the year, Chronos and the rest of the ecosystem will be ready for the switch.
Roughly speaking, in that scenario I imagine doing the initial non-blocking IO with raw pointers for some disk IO, and then using ORC datatypes for the bigger multi-threading overhauls.
Summary
Working on this will be fun! There are a few challenges, and I look forward to any thoughts, comments, etc. that others have.