TLDR: Not all architectures are created equal, but perhaps even more importantly, not all build servers we have access to are equal in performance, power, or ability to process builds reliably.
Important: Please do not post here with reports of individual image issues -- we're aware of the overall problem, and this issue is a discussion of solving it generally. Off-topic comments will be deleted.
When we merge an update PR to https://github.com/docker-library/official-images, it triggers Jenkins build jobs over in https://doi-janky.infosiftr.net/job/multiarch/ (see #2289 for more details on our multiarch approach).
Sometimes, we'll have non-amd64 image build jobs finish before their amd64 counterparts, and due to the way we push the manifest list objects to the library namespace on the Docker Hub, that results in amd64-using folks (our primary target users) getting errors of the form "no supported platform found in manifest list" or "no matching manifest for XXX in the manifest list entries" (see linked issues below for several reports from users of this variety).
Thus, manifest lists under the library namespace are "eventually consistent" -- once all arches complete successfully, the manifest lists get updated to include all the relevant sub-architectures.
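To illustrate why a partially populated manifest list produces those errors, here is a minimal sketch of client-side platform selection. The `select_manifest` function, the `partial_list` value, and the digests are hypothetical placeholders; only the overall shape (a `manifests` array whose entries carry a `platform` with `architecture` and `os`) follows the manifest list format, and real clients do more than this (variant matching, for instance).

```python
from typing import Optional


def select_manifest(manifest_list: dict, architecture: str, os: str = "linux") -> Optional[dict]:
    """Return the first entry matching the requested platform, or None.

    A None result corresponds to the "no matching manifest" class of error.
    """
    for entry in manifest_list.get("manifests", []):
        platform = entry.get("platform", {})
        if platform.get("architecture") == architecture and platform.get("os") == os:
            return entry
    return None


# Hypothetical manifest list pushed after the arm32v7 and s390x jobs
# finished but before the amd64 job did. Digests are placeholders.
partial_list = {
    "schemaVersion": 2,
    "mediaType": "application/vnd.docker.distribution.manifest.list.v2+json",
    "manifests": [
        {"digest": "sha256:aaaa", "platform": {"architecture": "arm", "os": "linux", "variant": "v7"}},
        {"digest": "sha256:bbbb", "platform": {"architecture": "s390x", "os": "linux"}},
    ],
}

if __name__ == "__main__":
    for arch in ("amd64", "s390x"):
        found = select_manifest(partial_list, arch)
        print(arch, "->", found["digest"] if found else "no matching manifest")
```

Until the amd64 entry is added to the list, an amd64 client has nothing to select, even though the tag itself exists.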
Our current method for combating the main facet of this problem (missing amd64 images while other arches are successfully built and available) is to trigger amd64 build jobs within an hour after the update PR is merged, and all other arches only within 24 hours. This helps to some degree in ensuring that amd64 builds first, but not always. For example, our arm32vN servers are significantly faster than our AWS-based amd64 server, so if those jobs happen to get queued at the same time as existing amd64 jobs are, they'll usually finish a lot more quickly. Additionally, given the slow IO speed of our AWS-based amd64 build server, the queue for amd64 build jobs piles up really quickly (which also doesn't help with keeping our build window low).
As for triggering jobs more directly, the GitHub webhooks support in Jenkins makes certain assumptions about how jobs and pipelines are structured/triggered, so we can't use GitHub's webhooks to effectively trigger these jobs (without doing additional custom development to sit between the two systems) and instead rely on the built-in Jenkins polling mechanism. This has been fine (we haven't noticed any scalability issues with how often we're polling), and even if we were triggering builds more aggressively, that's only half the problem (since then our build queues would just pile up faster).
One solution that has been proposed is to wait until all architectures successfully build before publishing the relevant manifest list. If a naïve version of this suggestion were implemented right now, we would have no image updates published because our s390x worker is currently down (as an example -- we do frequently lose builder nodes given that all non-amd64 arches are using donated resources). Additionally, as noted above, some architectures build significantly slower than others (before we got our hyper-fast ARM hardware, arm32vN used to take days to build images like python), so it isn't exactly fair to force all architectures to wait for the one slowpoke before providing updated images to our userbase. As a final thought on this solution, some architectures outright fail, and the maintainers don't necessarily notice or even care (for example, mongo:3.6 on windows-amd64 has been failing consistently with a mysterious Windows+Docker graph driver error that we haven't had a chance to look into or escalate, and that it wouldn't be fair to block updated image availability on).
One compromise would be to use the Jenkins Node API (https://doi-janky.infosiftr.net/computer/multiarch-s390x/api/json) to check whether a particular builder is down, and use that to decide whether to block on builds of that architecture. Additionally, we could try to get creative with checking pending builds / queue length for a particular architecture's builds to determine whether or not a given architecture is significantly backlogged and thus a good candidate for not waiting.
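A rough sketch of what that check could look like, assuming the node's JSON response exposes the usual Jenkins `offline` and `temporarilyOffline` booleans. The function names, the `max_queue` threshold, and the idea of feeding in a precomputed per-architecture queue length are all hypothetical; how the queue length would actually be derived is left open here.

```python
import json
from urllib.request import urlopen


def fetch_node_info(url: str, timeout: float = 10.0) -> dict:
    """Fetch a Jenkins node description, e.g. .../computer/multiarch-s390x/api/json."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def node_is_offline(node_info: dict) -> bool:
    """Treat a node as down if Jenkins reports it offline in either sense."""
    return bool(node_info.get("offline") or node_info.get("temporarilyOffline"))


def arches_to_wait_for(node_infos: dict, queue_lengths: dict, max_queue: int = 20) -> list:
    """Return the arches worth blocking on: builder is up and not badly backlogged.

    node_infos maps arch -> parsed node JSON; queue_lengths maps arch -> number
    of pending builds (hypothetical input; a missing arch counts as 0 pending).
    """
    wait_for = []
    for arch, info in node_infos.items():
        if node_is_offline(info):
            continue  # builder is down: don't hold everyone else up
        if queue_lengths.get(arch, 0) > max_queue:
            continue  # significantly backlogged: good candidate for not waiting
        wait_for.append(arch)
    return sorted(wait_for)


if __name__ == "__main__":
    # Hypothetical snapshot; in practice node_infos would come from fetch_node_info().
    nodes = {
        "amd64": {"offline": False},
        "arm32v7": {"offline": False, "temporarilyOffline": False},
        "s390x": {"offline": True},
    }
    print(arches_to_wait_for(nodes, {"amd64": 5, "arm32v7": 50}))
```

In this example snapshot, only amd64 would be waited on: s390x is down and arm32v7 is over the (arbitrary) backlog threshold.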
We could also attempt to determine when a particular tag was added/merged, and set a time limit for some number of hours before we just assume it must be backlogged, failing, or down and move along without that tag, but this is slightly more complicated (since we don't have a modification time for a particular tag directly, and really can only determine that information on an image level without complex Git walking / image manifest file parsing). Perhaps even just a time limit on the image level would be enough, but in the case of our mongo:3.6 example, that would mean all tag updates to mongo (whether they're related to the 3.6 series or not) would wait the maximum amount of time before being updated due to one version+architecture combination failing.
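The image-level time limit could be sketched roughly as follows. Everything here is an assumption for illustration: the `ready_to_publish` name, the six-hour default, and the notion that we already know the image's merge time and which architectures are still pending.

```python
from datetime import datetime, timedelta, timezone


def ready_to_publish(
    merged_at: datetime,
    now: datetime,
    pending_arches: set,
    max_wait: timedelta = timedelta(hours=6),  # hypothetical limit
) -> bool:
    """Decide whether to push the manifest list for an image.

    Publish immediately once nothing is pending; otherwise publish without
    the stragglers only after the image-level time limit has elapsed.
    """
    if not pending_arches:
        return True
    return now - merged_at >= max_wait


if __name__ == "__main__":
    merged = datetime(2018, 1, 1, 12, 0, tzinfo=timezone.utc)
    stuck = {"windows-amd64"}  # e.g. one version+architecture that keeps failing
    print(ready_to_publish(merged, merged + timedelta(hours=1), stuck))
    print(ready_to_publish(merged, merged + timedelta(hours=7), stuck))
    print(ready_to_publish(merged, merged + timedelta(hours=1), set()))
```

This also makes the downside visible: with a single consistently failing combination in `pending_arches`, every update to that image waits out the full `max_wait` before being published.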
Related issues: (non-comprehensive)