tl;dr - Finally, the results of the benchmarking. You can find the code on GitLab. There are some issues with the benchmarks, but there was enough decent data for me to make a decision, at least. As for which storage plugins I'm going to run: both OpenEBS Mayastor and Ceph via Rook on LVM. I look forward to emails from users/corporations/devrel letting me know how I misused their products, if I did – please file an issue on GitLab!
NOTE: This is a multi-part blog post!
In part 4 we worked through getting the testing tools set up, figuring out how to get our results back out, and now we can finally make use of that beautiful data!
Thanks for sticking with me (those of you that did) all the way here – I’m sure it must have felt like I was doing some sort of “growth hack” drip-feed but I just didn’t want to drop 6 posts all at the same time once I realized how big this series was going to have to be. No one wants to see a scroll bar nub the size of an atom.
NOTE: No, you're not crazy – this post series was reduced from 6 parts to 5; I combined Part 6 into this post so I could be done quicker. I need to actually get back to shipping software.
Results

To keep the tables narrow I used some acronyms, so a legend is required:

- RR IOPS / RW IOPS – random read / random write IOPS
- RR BW / RW BW – random read / random write bandwidth
- Avg RL / Avg WL – average read / write latency (usec)
- SR / SW – sequential read / sequential write
- Mixed RR IOPS / Mixed RW IOPS – random read / write IOPS under a mixed workload
- ??? – due to how fio reports things, some values are un-knowable

Remember, "RW" does not stand for "read/write"!
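For context, here's a minimal sketch of the kind of fio jobs that produce numbers like these. The real job definitions live in the GitLab repo; the flags below are assumptions for illustration, not necessarily the exact ones I ran:

```bash
# Hypothetical fio invocations, roughly matching the table columns.
# --direct=1 corresponds to the O_DIRECT rows; drop it for the buffered rows.

# Random read / random write IOPS (4k blocks)
fio --name=randread --ioengine=libaio --direct=1 --rw=randread \
    --bs=4k --iodepth=64 --numjobs=4 --size=2G --runtime=60 \
    --time_based --group_reporting

fio --name=randwrite --ioengine=libaio --direct=1 --rw=randwrite \
    --bs=4k --iodepth=64 --numjobs=4 --size=2G --runtime=60 \
    --time_based --group_reporting

# Sequential read bandwidth (larger blocks); swap --rw=write for the SW column
fio --name=seqread --ioengine=libaio --direct=1 --rw=read \
    --bs=1M --iodepth=16 --size=2G --runtime=60 --time_based --group_reporting

# Mixed random read/write (75/25 split)
fio --name=mixed --ioengine=libaio --direct=1 --rw=randrw --rwmixread=75 \
    --bs=4k --iodepth=64 --size=2G --runtime=60 --time_based --group_reporting
```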
No replication ("JBOD")

This round of testing is on a single node with a single copy of the data – i.e. your data is not safe, but it probably got there fast. As you might expect, the theoretical upper bound here is a hostPath volume, and the storage plugin that should come closest to that bound is OpenEBS LocalPV Hostpath.

fio
If you’re into tabular data (powered by Simple DataTables):
| Plugin (JBOD, no replicas) | RR IOPS | RR BW | RW IOPS | RW BW | Avg RL | Avg WL | SR | SW | Mixed RR IOPS | Mixed RW IOPS |
|---|---|---|---|---|---|---|---|---|---|---|
| OpenEBS HostPath (O_DIRECT) | 295,000 | 1428 | 318,000 | 2652 | 52.69 | 13.53 | 3378 | 2721 | 252,000 | 83,900 |
| OpenEBS Mayastor ISCSI (O_DIRECT) | 128,000 | 1399 | 76,800 | 2028 | 89.18 | 78.72 | 111 | 2532 | 78,100 | 26,300 |
| OpenEBS Mayastor NVMe-oF (O_DIRECT) | 90,300 | 1398 | 78,500 | 2053 | 88.55 | 86.66 | 2660 | 2105 | 64,700 | 21,600 |
| Rook Ceph LVM (O_DIRECT) | 51,600 | 4141 | 5307 | 499 | 203.46 | ??? | 5419 | 2190 | 4726 | 1582 |
| OpenEBS Hostpath | 19,200 | 6079 | 517,000 | 4135 | 210.88 | ??? | 9943 | 3718 | 18,500 | 6194 |
| LINSTOR drbd9 | 17,600 | 6074 | 473,000 | 3860 | 230.79 | ??? | 9815 | 3389 | 17,200 | 5708 |
| OpenEBS Mayastor ISCSI | 12,300 | 4582 | 522,000 | 5210 | 323.12 | ??? | 105 | 5645 | 11,800 | 3921 |
| OpenEBS Jiva (O_DIRECT) | 6129 | 235 | 6883 | 324 | 682.68 | 570.70 | 409 | 466 | 4470 | 1491 |
| OpenEBS Jiva | 5564 | 218 | 456,000 | 4046 | 738 | ??? | 1169 | 3681 | 3486 | 1157 |
| OpenEBS cStor (O_DIRECT) | 5454 | 159 | 4983 | 180 | 895.86 | 991.47 | 57.3 | 156 | 3942 | 1311 |
| Rook Ceph LVM | 4883 | 268 | 507,000 | 4007 | 828.36 | ??? | 7103 | 3416 | 4544 | 1519 |
| OpenEBS cStor | 2190 | 34.2 | 487,000 | 4142 | 1854.37 | ??? | 118 | 3687 | 923 | 304 |
| OpenEBS Mayastor NVMe-oF | 11,800 | 4640 | 530,000 | 5196 | 341.55 | ??? | 10,200 | 5662 | 11,700 | 3666 |

pgbench
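For the pgbench side, here's a rough sketch of the kind of run that matches the table below (one client, one thread, ten transactions). The actual invocation lives in the benchmark code on GitLab, so treat the host/database names and flags here as illustrative:

```bash
# Initialize the test database, then run a tiny single-client benchmark.
# Host and database names are hypothetical placeholders.
pgbench -i -h postgres.example.svc -U postgres benchdb
pgbench -c 1 -j 1 -t 10 -h postgres.example.svc -U postgres benchdb
```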
And if you’re into tabular data:
| Plugin (JBOD, no replicas) | # clients | # threads | transactions/client | Latency Avg (ms) | tps w/ establish |
|---|---|---|---|---|---|
| LINSTOR drbd9 | 1 | 1 | 10 | 2.482 | 402.97 |
| OpenEBS Jiva | 1 | 1 | 10 | 2.156 | 463.85 |
| OpenEBS LocalPV HostPath | 1 | 1 | 10 | 2.216 | 451.18 |
| OpenEBS Mayastor ISCSI | 1 | 1 | 10 | 2.218 | 450.82 |
| OpenEBS Mayastor (NVMe-oF) | 1 | 1 | 10 | 2.228 | 448.90 |
| Rook Ceph LVM | 1 | 1 | 10 | 2.182 | 458.23 |

Single copy replication (RAID1)

As previously mentioned, this round of testing is single node, so "RAID1" carries its traditional meaning – duplicated data – in this case across 2 disks. The closest we might get to perfection here would be mirrored LVM, but I don't think we have a benchmark that is quite in that area. The closest thing we have is OpenEBS LocalPV ZFS, because the underlying ZFS pool is mirrored – but ZFS has heavier overhead than LVM, so let's just see how the results shape up.
fio
And if you’re into tabular data:
| Plugin (RAID1, 2 replicas 1 node) | RR IOPS | RR BW | RW IOPS | RW BW | Avg RL | Avg WL | SR | SW | Mixed RR IOPS | Mixed RW IOPS |
|---|---|---|---|---|---|---|---|---|---|---|
| OpenEBS cStor | 2205 | 35.30 | 493,000 | 4039 | 1856.88 | ??? | 125 | 3422 | 920 | 303 |
| OpenEBS cStor (O_DIRECT) | 5441 | 159 | 4979 | 185 | 910.12 | 983.81 | 60.90 | 167 | 3979 | 1315 |
| OpenEBS LocalPV ZFS | 148,000 | 2304 | 41,500 | 646 | 27.51 | 105 | 5098 | 629 | 46,700 | 15,600 |
| OpenEBS LocalPV ZFS (O_DIRECT) | 146,000 | 2298 | 30,500 | 473 | 27.31 | 121.63 | 5239 | 634 | 46,800 | 15,600 |
| Rook Ceph LVM | 4837 | 250 | 358,000 | 3404 | 840.48 | ??? | 6978 | 2555 | 4730 | 1581 |
| Rook Ceph LVM (O_DIRECT) | 50,700 | 4076 | 4132 | 370 | 206.70 | ??? | 5330 | 1124 | 2790 | 939 |

pgbench
And if you’re into tabular data:
| Plugin (RAID1, 2 replicas 1 node) | # clients | # threads | transactions/client | Latency Avg (ms) | tps w/ establish |
|---|---|---|---|---|---|
| OpenEBS cStor | 1 | 1 | 10 | 2.193 | 455 |
| OpenEBS LocalPV ZFS | 1 | 1 | 10 | 3.958 | 252.65 |
| Rook Ceph LVM | 1 | 1 | 10 | 2.257 | 443.038 |

I won't get into too much analysis just yet, but this looks like the difference between LVM-level RAID1 and a ZFS-based mirrored setup. ZFS comes in at roughly 55% of the performance – which is about what you might expect from having to write twice the amount of data. I wonder if LVM writes data initially as striped and then does some settling later? It's a bit curious that only ZFS pays this penalty.
Note also that the ZFS results (cStor is also ZFS-based) might be skewed by how ZFS buffers writes in memory ("asynchronous writes") – the O_DIRECT results are probably the better ones to use for eyeball comparisons.
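One hedged way to take the asynchronous-write advantage out of a comparison run would be to force synchronous semantics on the ZFS dataset backing the volume. The dataset name below is hypothetical, and forcing sync will hurt the numbers – that's the point:

```bash
# Check the current sync behavior of the dataset backing the PV
zfs get sync,logbias zfspv-pool/pvc-data

# Force every write to be synchronous for an apples-to-apples comparison run
zfs set sync=always zfspv-pool/pvc-data

# ...run the benchmark, then restore the default
zfs set sync=standard zfspv-pool/pvc-data
```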
Here are some of the thoughts that I shared with my mailing list:
JBOD (single disk) results

A few thoughts on the single disk results and anomalies:
And a few thoughts on the disk-level RAID1 setup:
There's a large number of things left to test/re-run. Getting all of this automated is a good first step (it will make it easier to iterate on any individual test later), but I've left out the following variations, just to start (see the sweep sketch after this list):

- fio block sizes
- fio worker thread count

All of these things could have a huge impact and could change the skew of these results – it's a huge matrix of possibilities. What I need is a framework like the one I built back during the PM2 cluster sizing experiments. I may have to reach over and use that in the future, but for now I'll take these results as just one cubby hole in the matrix.
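A hedged sketch of what that sweep could look like, assuming a test volume mounted at /mnt/pvc and nothing about the real harness in the repo:

```bash
#!/usr/bin/env bash
# Naive parameter sweep over block size and worker count.
# /mnt/pvc is a hypothetical mount point for the PVC under test.
set -euo pipefail

for bs in 4k 16k 64k 256k 1M; do
  for jobs in 1 2 4 8; do
    fio --name="randrw-bs${bs}-jobs${jobs}" \
        --directory=/mnt/pvc \
        --ioengine=libaio --direct=1 --rw=randrw --rwmixread=75 \
        --bs="$bs" --numjobs="$jobs" --iodepth=32 \
        --size=1G --runtime=60 --time_based --group_reporting \
        --output-format=json \
        --output="results-bs${bs}-jobs${jobs}.json"
  done
done
```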
Sidenote: Making graphs is harder than it should be

Repeating it here, but someone needs to make a site where you can paste a table and get a simple bar/line graph out. For these graphs I used the venerable gnuplot. Though I did see some nice options:
Looks like I've got a new side project to do at some point – if this were automated it would be a lot easier to run all the tests, and I could spend more time on the analysis itself.
Not completed: Clustered deployments

There's obviously more to be done here – I need to try out and get some decent numbers on clustered deployments. Multi-node setups are the obvious day 2 step for better disaster recovery and availability, but this analysis was only single-node.
What if: btrfs

ZFS and btrfs are really similar – there are differences, but btrfs is actually somewhat easier to manage in some ways (heterogeneous disks, etc.), and while people used to often cite questions of reliability, Facebook is well known to have switched to btrfs, and btrfs is the default filesystem of Fedora 33. And after all, btrfs has made it into the mainline Linux kernel.
In the past I have run into some posts on ZFS's NVMe performance that show it's definitely not a free lunch, and some care has to be taken when using ZFS in new places (it seems like performance was bad because writing to the ZIL/SLOG was actually slowing writes down, simply because of how fast NVMe is). While I don't have the "problem" of NVMe drives on any of the machines I run right now, it's worth keeping an eye on how the area evolves.
NVMe aside, ZFS's performance penalty seems to be fairly high – there's a paper out there that shows some pretty favorable numbers for btrfs, though it's from back in 2015.
ZFS has some really important features though, in particular:
One thing you might ask is why you'd need Copy-on-Write and journaling – well, ZFS has a feature called the SLOG that lets you keep the write-ahead log (the ZIL) on a completely separate drive! I may need to test ZFS with and without the ZIL in the future.
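For reference, a hedged sketch of moving the ZIL onto a separate log device (the pool and device names are hypothetical):

```bash
# Add a dedicated SLOG device so synchronous writes land on the separate drive
zpool add tank log /dev/disk/by-id/nvme-fast-disk

# Verify the pool now shows a "logs" section
zpool status tank

# To test "without" it later, the log device can be removed again
zpool remove tank /dev/disk/by-id/nvme-fast-disk
```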
But getting back to btrfs – I've heard some mixed reviews over the years, but it's important to note that it's in the kernel already. I found a paper with btrfs doing quite well compared to ext4 and xfs, and some levelheaded, helpful reviews.

Maybe it's worth giving it a shot? Are there any provisioners for btrfs PVCs? Should I just format some local loopback-mounted disks with btrfs (ex. OpenEBS rawfile LocalPV)? The easiest to start with would likely be the latter, especially for testing – loopback drives shouldn't suffer too much of a penalty to at least make a comparison.
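A hedged sketch of what that loopback experiment could look like (file path and sizes are arbitrary placeholders):

```bash
# Create a sparse backing file and attach it as a loop device
truncate -s 20G /var/lib/btrfs-test.img
LOOP_DEV=$(losetup --find --show /var/lib/btrfs-test.img)

# Format it with btrfs and mount it somewhere the local PV provisioner can use
mkfs.btrfs "$LOOP_DEV"
mkdir -p /mnt/btrfs-test
mount "$LOOP_DEV" /mnt/btrfs-test

# ...run fio/pgbench against /mnt/btrfs-test, then clean up
umount /mnt/btrfs-test
losetup -d "$LOOP_DEV"
```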
One thing I've always wanted/needed in btrfs is synchronous remote writes (kind of like Ceph). There's quite a performance hit, but being able to ensure writes are persisted on a remote drive before continuing would be really useful for certain workloads. Unfortunately btrfs doesn't support this either. btrfs does have some very nice features, though (see the sketch after this list):

- better disk size flexibility
- no-downtime pool expanding
- volume shrinking
- automatic data redistribution

ZFS seems to be better at raidzX, but I know that for most dedicated servers I'm going to have 2x of some drive, so I don't think I'm too worried about RAID5/Z5.
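To make those flexibility points concrete, a hedged sketch of the kind of operations btrfs lets you do on a live filesystem (device names and mount point are hypothetical):

```bash
# Mix heterogeneous disks in one filesystem, with mirrored data and metadata
mkfs.btrfs -d raid1 -m raid1 /dev/sdb /dev/sdc

# Grow the pool with zero downtime by adding another device...
btrfs device add /dev/sdd /mnt/pool
# ...and rebalance so existing data is redistributed across all devices
btrfs balance start /mnt/pool

# Shrink a mounted filesystem in place
btrfs filesystem resize -10G /mnt/pool
```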
The last thing to worry about is whether btrfs is stable. The FUD is quite persistent on this front, but there's a whole section on it in the kernel wiki, and Facebook happens to use it. It's also the default for OpenSUSE's enterprise distribution, so honestly it's good enough for me.
One area of optimization on the storage side is considering whether we need the ZIL for certain workloads on ZFS. ZFS is already copy-on-write, so it can make sense to disable similar features at the application level (and various scrubbing/checksumming features) if we know we're getting rock-solid data protection from ZFS. It looks like sometimes it's actually better to have it for performance, but it's worth considering.
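As one example of that kind of application-level de-duplication of effort, a hedged sketch of tuning a Postgres dataset on ZFS – the dataset name and settings are assumptions I haven't benchmarked, not recommendations:

```bash
# Match the dataset record size to Postgres' 8k pages and bias the log for throughput
zfs set recordsize=8k tank/pgdata
zfs set logbias=throughput tank/pgdata

# Since ZFS never overwrites blocks in place, Postgres' torn-page protection
# (full_page_writes) is arguably redundant on top of it
psql -c "ALTER SYSTEM SET full_page_writes = off;"
psql -c "SELECT pg_reload_conf();"
```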
Generate a searchable/static results page

It would be cool to have a simple static results page that we could throw up on GitLab Pages and use to share the results. Maybe using some simple markdown-driven site generator would be a good idea/enough. If we wanted to get really fancy, we could even schedule the tests to run every week or month. I'll leave that for the next time I touch this project (which will probably be when I add clustered testing as well).
If you're going to have lots of OSDs, watch out for aio-max-nr

While reading around the internet I took some notes on a presentation given by some people pushing the boundaries of storage setups in Japan. This slide seemed worth remembering – maybe it's even worth setting this value to something high right off the bat on nodes that are going to be storage-focused.
Looks like Ceph has some documentation on this as well
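For reference, a hedged sketch of bumping the limit – the value itself is an assumption; pick one based on how many OSDs the node will actually run:

```bash
# Check the current limit and how much of it is in use
sysctl fs.aio-max-nr
cat /proc/sys/fs/aio-nr

# Raise it immediately...
sysctl -w fs.aio-max-nr=1048576

# ...and persist it across reboots
echo "fs.aio-max-nr = 1048576" > /etc/sysctl.d/99-aio.conf
sysctl --system
```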
QoS support?

PVC-centric Quality of Service features/limits are lacking across all of the baremetal solutions – hobbyist to enterprise – I've discussed here. Generally, if you want to scale to platform-level service, you're probably going to want to make sure you can avoid/limit the noisy neighbor problem. At the time of this post Kubernetes itself doesn't support hard-drive QoS quite yet, though there's a KEP out.
Longhorn? Nope

Longhorn doesn't support QoS (and neither does OpenEBS Jiva, since it's based on Longhorn).
ZFS (Ceph/ZFS or OpenEBS localpv-zfs)? Nope

QoS features exist in Oracle ZFS, but you know what they say about Oracle: not even once. OpenZFS doesn't seem to have this feature right now, so unless I'm willing to jump in and implement it (or someone else does), it's not there for now.
Ceph/Rook? Yes!

QoS is supported by Ceph, but it's not yet supported or easily modifiable via Rook, and not by ceph-csi either. It would be possible to set up some sort of admission controller or initContainers to set the information on PVCs via raw Ceph commands after creation, though, so I'm going to leave this as "possible".
After setting up a block image you can clearly see the QoS settings:
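As an example, a hedged sketch of what those raw Ceph commands might look like against an RBD image backing a PVC (pool and image names are hypothetical):

```bash
# Inspect the QoS-related config currently applied to the image
rbd config image list replicapool/csi-vol-example | grep qos

# Cap the image at 500 IOPS and ~50MB/s
rbd config image set replicapool/csi-vol-example rbd_qos_iops_limit 500
rbd config image set replicapool/csi-vol-example rbd_qos_bps_limit 52428800
```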
kernel-level support? Yup (with v2 cgroups)

Maybe we could limit IOPS with containerd & v2 cgroups? We know that v1 cgroups couldn't limit properly if you use the Linux writeback cache (i.e. you weren't using O_DIRECT – which postgres doesn't but mysql does, for example), but v2 looks to have some promise – there's an awesome blog post out there by Andre Carvalho outlining how to do it with v2 cgroups.
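The kernel-side mechanics boil down to the cgroup v2 io controller; here's a hedged sketch, assuming a unified cgroup v2 hierarchy with the io controller enabled (device major:minor, limits, and PID are placeholders):

```bash
# Find the major:minor of the block device you want to throttle
lsblk -o NAME,MAJ:MIN /dev/sdb

# Create a cgroup and cap it at 1000 read/write IOPS and 10MB/s writes
mkdir -p /sys/fs/cgroup/io-limited
echo "8:16 riops=1000 wiops=1000 wbps=10485760" > /sys/fs/cgroup/io-limited/io.max

# Move a process (e.g. the container's main PID) into the limited group
echo "$PID" > /sys/fs/cgroup/io-limited/cgroup.procs
```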
Even though the kernel can do it, looks like Kubernetes doesn’t have the ability just yet:
containerd (always the innovator – I remember when it was the first to get alternative non-runc container engines underneath) looks to support v2 cgroups and io limiting, thanks to hard work from Akihiro Suda and others.
Limiting IOPS is a crucial part of being able to provide storage as a service. Up until now I thought the best way was to do it at the storage driver layer, but it looks like the kernel subsystems might be just as good a place, and might work for everything at the same time. Hopefully sometime in the future I can get around to testing this.
Read Write Many (RWX)

Read Write Many is another feature of Kubernetes storage plugins that isn't talked about so much (since it's not available that often), but it's a really nice feature in my mind. A couple of reasons you might want this:
Ceph is very robust software, so it actually has a way to do block storage, object storage, and file storage. For Ceph, CephFS is the piece of the puzzle that supports RWX volumes. Since Rook is managing "our" Ceph instance, the question of whether it's supported by Rook also comes up, so there are a few issues in the Rook repo about this:
It looks like CephFS has been functional since Rook v1.1, though there was one ticket that suggests it may have been broken intermittently between releases.
Someday I’d love to circle back and give this a try – hopefully sometime soon in the future!
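When I do, the claim side should look roughly like this – a hedged sketch assuming a CephFS-backed storage class named rook-cephfs (the name used in Rook's examples; the actual cluster may define something else):

```bash
# Hypothetical RWX claim against a Rook CephFS storage class
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-data
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: rook-cephfs
  resources:
    requests:
      storage: 10Gi
EOF
```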
RWX via OpenEBS + NFS? Possible

Another way to achieve RWX would be to use OpenEBS in combination with NFS. The OpenEBS documentation details setting up RWX PVCs based on NFS very well (it's also detailed on their blog back in 2018), and if you want to go straight to the code, openebs/dynamic-nfs-provisioner is the place to start.
This path seems really easy, but a blog post on it would make things crystal clear – I'm not sure many people have taken this route.
UPDATE: RWX via Longhorn? Yes

Another good option for the RWX problem is Longhorn – as I've mentioned, this is what OpenEBS Jiva is based on, and it's worth saying that the ease with which you can set up OpenEBS Jiva is definitely attributable in part to Longhorn's hard work. I've personally interacted with their team and they're very capable (obviously – Longhorn works so well it was extended), and gracious as well, even when I was lodging what might be seen as a complaint.
Originally I missed this point, and for that I have to thank /u/joshimoo on reddit, who pointed out two salient points:
At this point we have enough solutions that support RWX that it looks like next time I'm going to need to add longhorn to the stable, and start running tests on RWX as well – as soon as I figure out how you're supposed to test it in the first place (somehow fio doesn't seem like the right choice). According to an SO post it looks like I can just check the NFS wiki for decent testing tools, and dbench (I used this the last time I did some benchmarking) or bonnie might be good bets. Maybe I'll be the first one to get some proper NFS vs CephFS performance comparisons. Even dd and iozone seem to be fine for benchmarking NFS, so maybe I don't need to think about it too hard.
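For a first pass, a hedged sketch of the dd/iozone route against an RWX volume mounted into a pod – the mount path is hypothetical, and dd alone only tells you about streaming throughput:

```bash
# Streaming write throughput, bypassing the page cache and forcing a final flush
dd if=/dev/zero of=/mnt/rwx/testfile bs=1M count=1024 oflag=direct conv=fsync

# Streaming read throughput of the same file
dd if=/mnt/rwx/testfile of=/dev/null bs=1M iflag=direct

# A broader automatic sweep of file operations at a fixed record size
iozone -a -s 1g -r 128k -f /mnt/rwx/iozone.tmp
```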
The question of whether to use NFS with some non-Ceph solution versus CephFS is a big one. I've only been able to find one resource that seems to compare the two:
But even this is more about running samba on top of them, which is a bit different from a knock-down, drag-out benchmark.
The numbers I've gotten aren't all that good. They're useful for seeing some trends, but pretty bad in other ways – some of the biases of certain providers are leaking through, making it somewhat hard to get a clean read. As I get more time to go over the benchmarks (or get some nice contributions) I'll work on them more and may revisit, but for now I'm making the following determination:
Given a system with more than one NVMe disk, I’m going to install both OpenEBS Mayastor and Ceph, and use them both, with the bigger empty disk going to Ceph and the smaller partition going to OpenEBS Mayastor.
There are a few reasons I’ve arrived at this decision:
I really want to have ZFS underneath/as a part of my system (and maybe I still can, by running LocalPV ZFS on top of Mayastor or something) for its amazing data safety, but the drastic drop in write speed (which you can see in the TPS reductions) and the added overhead – when I can get some pretty decent availability improvements from the other two solutions – just make it a non-starter. We'll see if I live to regret this decision, because if NVMe is 2x faster than SSD, then maybe it makes sense to use ZFS and go back to ~SSD performance with the benefit of essentially absolute data confidence and a really easy to configure/use system.
Wrapup

Well, this started as one blog post and turned into a 5-part series! It's been a grueling but fun yak shave, and hopefully a worthwhile series to put together (along with the development work behind it).
I'm glad I finally got some time to do some proper testing of these solutions, and hopefully the analysis will help others out there. If this article ends up being anywhere near as useful as the CNI benchmark posts have been, I'll be more than thrilled.