
Make block size configurable for SPT #315

Closed · ricarkol wants to merge 2 commits

Conversation


@ricarkol ricarkol commented Jan 21, 2019

Make the block size configurable for SPT. This is desirable when the FS does I/O in blocks larger than the current 512 bytes. I'm hitting this issue in rumprun; IncludeOS is not affected at the moment, as it only supports FAT (which uses 512-byte blocks by default). Other FSes default to different block sizes: ISOFS to 2048, ext2 to 1024, and LFS to 4096. Splitting FS I/O into pieces is not efficient, and it makes handling consistency a pain (for example, when just one piece of a large write fails).

Ideally, I would like to pass the block size as an argument (like --disk=abc --bsize=1024), but I'm guessing that's not a very good idea for security reasons. Instead, this suggested change sets the block size at compile time. The idea would be to build Solo5 like this: SPT_EXTRA_CFLAGS=-DSPT_BLOCK_SIZE=2048 make.
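
For illustration, a minimal sketch of how such a compile-time default could look in the bindings; SPT_BLOCK_SIZE is the hypothetical macro from this PR, and the 512 fallback matches the current behaviour:

    /* Hypothetical compile-time override: SPT_EXTRA_CFLAGS=-DSPT_BLOCK_SIZE=2048
     * would set it; otherwise fall back to the current 512-byte block size. */
    #ifndef SPT_BLOCK_SIZE
    #define SPT_BLOCK_SIZE 512
    #endif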

Alternatively, we could relax the block size checks (in the bindings and tenders) to accept any multiple of the block size. This would be by far the easiest option, but I'm not sure it's the safest.

Ricardo Koller added 2 commits January 21, 2019 17:05
Signed-off-by: Ricardo Koller <kollerr@us.ibm.com>
Signed-off-by: Ricardo Koller <kollerr@us.ibm.com>
@ricarkol ricarkol requested a review from mato January 21, 2019 22:24

mato commented Jan 22, 2019

Make the block size configurable for SPT. This is desirable when the FS does I/O in blocks larger than the current 512 bytes.

What you're actually asking for here is "Enable block operations for >1 block".

Ideally, I would like to pass the block size as an argument (like --disk=abc --bsize=1024),

The underlying block size is a minimum I/O unit, which should eventually be detected by the tenders (using some ioctl which I don't remember right now). The user should not need to ever touch this manually. Same goes for the -DSPT_BLOCK_SIZE approach.
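
For reference, a minimal sketch of such detection in a tender, assuming BLKSSZGET (the Linux ioctl for the logical sector size) and a 512-byte fallback for regular-file backends; the function itself is only illustrative:

    #include <sys/ioctl.h>
    #include <sys/stat.h>
    #include <linux/fs.h>

    /* Sketch: return the logical sector size of a block-device backend,
     * or fall back to 512 for a regular file. */
    static int detect_block_size(int fd)
    {
        struct stat st;
        int ssz = 512;

        if (fstat(fd, &st) == 0 && S_ISBLK(st.st_mode))
            (void)ioctl(fd, BLKSSZGET, &ssz);
        return ssz;
    }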

Alternatively, we could relax the block size checks (in the bindings and tenders) to accept any multiple of the block size

This (I/O in multiples of the minimum block size) is the correct way forward; however, it needs to be done carefully, for several reasons:

  1. The current spt implementation uses libseccomp which, if multiples of the block size were allowed, does not let us impose a meaningful set of rules to enforce that the guest can't write beyond the end of a file-backed block device. See the discussion at Add "spt" (seccomp) target #310 (comment). This could be solved in the future by constructing our seccomp BPF filter manually, but that is a sizeable chunk of work.
  2. The Solo5 block API semantics mandate that an operation either succeeds in its entirety or fails. This is important. We can guarantee this by simply returning SOLO5_R_EUNSPEC from block operations if the operation did not complete (e.g. the nbytes returned from pwrite() != the nbytes passed in; see the sketch after this list), but I'd like to understand the implications better first.
  3. Our virtio implementation does not support block operations with >1 block. Possible solutions: try to implement it, or just file it under "virtio limitations" and not bother.
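
A sketch of the "all or nothing" check from point 2, assuming the SOLO5_R_OK/SOLO5_R_EUNSPEC result codes from the public Solo5 API; the function shape is illustrative, not the actual tender code:

    #include <unistd.h>
    #include "solo5.h"    /* for solo5_result_t, SOLO5_R_OK, SOLO5_R_EUNSPEC */

    /* Sketch: treat any short write as a failure, so the guest only ever
     * sees "completed in its entirety" or "failed". */
    static solo5_result_t block_write(int fd, off_t offset,
                                      const void *buf, size_t nbytes)
    {
        ssize_t written = pwrite(fd, buf, nbytes, offset);

        if (written < 0 || (size_t)written != nbytes)
            return SOLO5_R_EUNSPEC;
        return SOLO5_R_OK;
    }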

I'm hitting this issue in rumprun; IncludeOS is not affected at the moment

If this is for a FOSDEM demo then I'd suggest either hacking rumprun to use a smaller block size, or demoing with IncludeOS instead. Will discuss with you OOB.

In summary: Right now I'd leave things as they are (as there's a risk of breaking what works), file an issue and then subject the block API to a proper review, including things like adding barriers/fsync.

Does that make sense?

/cc @cfcs

@cfcs
Copy link

cfcs commented Jan 23, 2019

  1. The ioctl() is called BLKSSZGET (EDIT: see Make block size configurable for SPT #315 (comment)), but if we detect the block size dynamically we also need to be able to communicate it to the guest somehow, so thinking about it again I'm not so sure this is the way to go. If we stick to 512 we have a bit of overhead from extra context switching, but it keeps the API simple, and since we're currently using buffered I/O we're not thrashing the underlying storage too much. Note that 512 is very much the "lowest common denominator"; most disks (if not all?) have larger sector sizes today.

  2. I'm not sure I understand the problem correctly; if the file system implementation has hard-coded a different sector size than the one the underlying controller (a.k.a. Solo5 in this case) exposes, it would appear to me that it was designed incorrectly? Shouldn't it just chunk its block-sized writes into a bunch of individual sector-sized writes?

  3. I have a feeling that barriers / fsync() is something we would like to have, somewhere. Not sure if it should be implicit or as a separate operation; having it implicit would make for the cleanest API, but I fear it may also be rather slow since we will be utterly and completely under-utilizing the disk controller write cache pipeline?

  4. AFAICT buffered atomic I/O on Linux is still a WIP. From a quick Google search it seems to be doable in O_DIRECT mode with AIO, but then the cost of a fixed size of 512 really goes through the roof.

  5. Regarding files not growing, it just struck me that we could also use setrlimit(RLIMIT_FSIZE, max_capacity), but I'm not sure how reliable that is in practice (see the sketch after this list).
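
A sketch of the RLIMIT_FSIZE idea from point 5; disk_capacity is a hypothetical parameter, and note that writes past the limit fail with EFBIG (or raise SIGXFSZ), which the tender would still need to handle:

    #include <sys/types.h>
    #include <sys/resource.h>

    /* Sketch: cap the maximum file size this process may write, so a
     * file-backed block device cannot be grown past its capacity. */
    static int limit_disk_growth(off_t disk_capacity)
    {
        struct rlimit rl = {
            .rlim_cur = (rlim_t)disk_capacity,
            .rlim_max = (rlim_t)disk_capacity
        };
        return setrlimit(RLIMIT_FSIZE, &rl);
    }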


ricarkol commented Jan 27, 2019

@mato: I will create an issue about improving the Solo5 block interface: increasing the I/O request size and adding a barrier/sync call.

@mato and @cfcs: I realize that I added a lot of confusion by talking about FS blocks, which are different from disk blocks (a.k.a. sectors). The Solo5 block interface implements a virtual disk with a sector size of 512, but that has nothing to do with the FS block size on top of it. With this PR I was trying to increase the disk block size to match the FS block size. That sounded pretty convenient at first, but after thinking about the atomicity of writes, not so much (more on that at the end).

I've been thinking about this, and also trying some things on a Linux box to check my understanding. I wanted to share some of that.

FSes are optimized to make large I/O requests. One way of doing so is by defining large data and metadata blocks (FS blocks). FSes call this the "block size", and it's usually a format-time argument. For example, I formatted a disk with ext4 and an FS block size of 4096, and observed that every I/O going from that FS to the block layer is aligned to 4096 and has a length that is a multiple of 4096 (usually just 4096).

The underlying disk is addressable in disk blocks (a.k.a. sectors), and accessed via requests spanning multiple disk blocks. The FS makes requests to the I/O scheduler, which submits them to the disk driver. @cfcs: the FS does not have to split its FS-block-sized requests into requests of disk block size, and the I/O scheduler does not split those requests either. The block layer defines a minimum and a maximum request size. The minimum is usually 512, and most FSes have a minimum block size of 512 as well. FSes must query the block layer for the maximum and split requests accordingly, although the maximum is much larger than the minimum (1024 disk sectors on my box). I also observed that the requests coming out of the I/O scheduler queue into the disk driver are the same size as, or larger than (requests can be merged), the ones coming from the FS.

As an example, this data is from an SSD on Linux. This SSD has 512-byte sectors, a maximum request size of 2048 KBytes, and is formatted with ext4 using an FS block size of 4 KBytes:

[kollerr@oc6638465227 ~]$ sudo dumpe2fs /dev/nvme0n1p1 | grep 'Block size'
dumpe2fs 1.42.9 (28-Dec-2013)
Block size:               4096

[kollerr@oc6638465227 ~]$ cat /sys/block/nvme0n1/queue/minimum_io_size
512 # minimum I/O size preferred by the device
[kollerr@oc6638465227 ~]$ cat /sys/block/nvme0n1/queue/max_sectors_kb
512 # NOTICE THE KB: max request size to the IO scheduler queue, i.e. 1024 sectors
[kollerr@oc6638465227 ~]$ cat /sys/block/nvme0n1/queue/max_hw_sectors_kb
2048 # max request size out of the IO scheduler queue into the disk driver
[kollerr@oc6638465227 ~]$ cat /sys/block/nvme0n1/queue/logical_block_size
512 # this was introduced for disks using 4K sector sizes [0]
[kollerr@oc6638465227 ~]$ cat /sys/block/nvme0n1/queue/hw_sector_size
512

Regarding atomicity guarantees for writes: I'm still not sure which FSes and setups ensure atomicity for multi-sector writes. But one thing that is certain is that aligned writes of physical sector size are atomic [1]. So it makes sense for the unikernel to know about and use this atomicity guarantee. We could tell the unikernel "the block size is 8192", and store the virtual disk as a file on a fancy FS that ensures atomic 8192-byte writes. But until we do that, it's best to leave it at 512 (or whatever the disk block size is).

[0] https://lwn.net/Articles/322777/
[1] Although I don't think appends or writes to holes in sparse files can always be atomic.


cfcs commented Jan 28, 2019

@ricarkol

Aligned writes of physical sector size in direct I/O mode are atomic (on most functional hardware, though I don't know if e.g. tape devices work that way).

Our aligned writes of physical sector size are not guaranteed to be atomic, even if they happen to be most of the time. For instance we don't consider alignment_offset (which is pretty much always 0).
I don't know if there's a benefit to trying to make them more atomic or aligned in practice.

Anyway, I still don't really understand what the implementation of this FS has to do with Solo5. Note that since Solo5 is not using direct I/O, we are already taking advantage of a block cache layer here. Multiple 512-byte writes WILL get cached and sent in batches to the storage controller whenever the operating system pleases to do so. The only performance bottleneck you will encounter is the overhead of context switching between the guest and the tender (free with spt) and between the tender and the underlying platform (the Linux kernel in this case). If this is a problem for you, and you don't care so much about precise error information, you might be better off just exposing an mmap, which avoids context switching even more effectively.

The underlying disk is addressable in disk blocks (a.k.a. sectors), and accessed via requests spanning multiple disk blocks.

I found this article helpful to explain a bit of what is going on under the hood:
http://fibrevillage.com/storage/563-storage-i-o-alignment-and-size

Interestingly, it helped correct my assumption that BLKSSZGET would be sufficient; it turns out we would need BLKPBSZGET to get the physical block size as opposed to the logical block size, and to adjust for the alignment offset (a sketch follows below).
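
For completeness, a small sketch of querying all three values on Linux; BLKSSZGET, BLKPBSZGET and BLKALIGNOFF are the relevant ioctls, and the function itself is only illustrative:

    #include <sys/ioctl.h>
    #include <linux/fs.h>

    /* Sketch: query the logical block size, physical block size and alignment
     * offset of an open block device. Error handling omitted for brevity. */
    static void query_geometry(int fd)
    {
        int logical = 0, alignment = 0;
        unsigned int physical = 0;

        ioctl(fd, BLKSSZGET, &logical);      /* logical block size, e.g. 512   */
        ioctl(fd, BLKPBSZGET, &physical);    /* physical block size, e.g. 4096 */
        ioctl(fd, BLKALIGNOFF, &alignment);  /* partition alignment offset     */
    }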


@mato mato left a comment


(Adding a dummy review here, as this has turned into a discussion about the block API, and the proposed change is not part of that. There should be a way to convert a PR into an issue.)

@ricarkol

Moving the discussion to #325

@mato mato closed this Feb 20, 2019