Description
I'm in the process of migrating an LXC setup to a Docker Swarm environment. The setup is 50 nodes (32GB RAM each) with about 10k services running. The cluster itself seems to be working as expected, but when I try to start services I get stuck at around 500 - 700 services. After that point, docker service ls shows entries like:
rxlga525wqjd <service-name> replicated 0/1 <container>
and docker service ps <service-name> shows:
ixoggohaa7kc <service-name> <container> prd-pro-16 Running Running 1 second ago
p0gziroccv42 \_ <service-name> <container> prd-pro-24 Shutdown Failed 12 seconds ago "starting container failed: co…"
j0fvbve416gz \_ <service-name> <container> prd-pro-35 Shutdown Failed 2 minutes ago "starting container failed: co…"
873vsi6c80vx \_ <service-name> <container> prd-pro-24 Shutdown Failed 4 minutes ago "starting container failed: co…"
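To see whether the failures cluster on particular nodes, the task history above can be tallied per node. The following is only a rough sketch: failed_by_node is a hypothetical helper name, and it assumes the whitespace-separated column layout shown in the output above (ID, optional \_ marker, name, image, node, desired state, current state) with no spaces inside the image name.

```shell
# Hypothetical helper: count tasks whose current state is "Failed", grouped by node.
# Reads `docker service ps` style output on stdin.
failed_by_node() {
  awk '
    {
      # History rows carry a "\_" marker before the service name,
      # which shifts every later column one position to the right.
      off = ($2 == "\\_") ? 1 : 0
      if ($(6 + off) == "Failed") count[$(4 + off)]++
    }
    END { for (n in count) print count[n], n }
  ' | sort -k1,1nr -k2,2
}

# Usage, with <service-name> replaced by a real service.
# --no-trunc also prints the full error text instead of the truncated "co…":
#   docker service ps --no-trunc <service-name> | failed_by_node
```

For the four tasks listed above this would report two failures on prd-pro-24 and one on prd-pro-35.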
Every container is connected to 2 networks, proxy (10.1.0.0/16) and db (10.2.0.0/16). Running containers are reachable.
Steps to reproduce the issue:
Describe the results you received:
After about 500 containers, creating a new service no longer works.
Describe the results you expected:
Service creation works within 1 or 2 seconds.
Additional information you deem important (e.g. issue happens only occasionally):
When I log into the node where the container is started and run docker ps, the command hangs until the container fails. Creating a new container directly on a node that has failed before works as expected, and container startup takes less than a second.
All machines appear to be idle and there are no tremendous peaks in load/memory from what I can see. The average load per machine is 10 - 15 containers with around 100 - 150 MB memory usage each.
I am aware that debugging this is hard and I'm very happy to do screen sharing or provide any logs that might be needed to get to the bottom of this behaviour. I'm grateful for every input!
Output of docker version:
# docker version
Client:
Version: 1.13.0
API version: 1.25
Go version: go1.7.3
Git commit: 49bf474
Built: Tue Jan 17 09:50:17 2017
OS/Arch: linux/amd64
Server:
Version: 1.13.0
API version: 1.25 (minimum version 1.12)
Go version: go1.7.3
Git commit: 49bf474
Built: Tue Jan 17 09:50:17 2017
OS/Arch: linux/amd64
Experimental: false
Output of docker info:
# docker info
Containers: 1
Running: 1
Paused: 0
Stopped: 0
Images: 1
Server Version: 1.13.0
Storage Driver: aufs
Root Dir: /var/lib/docker/aufs
Backing Filesystem: extfs
Dirs: 7
Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host macvlan null overlay
Swarm: active
NodeID: ssci1al4vgi7nzirptugjrpsl
Is Manager: true
ClusterID: qfi82y6p6rpnnfxkarxxfdca4
Managers: 3
Nodes: 50
Orchestration:
Task History Retention Limit: 5
Raft:
Snapshot Interval: 10000
Number of Old Snapshots to Retain: 0
Heartbeat Tick: 1
Election Tick: 3
Dispatcher:
Heartbeat Period: 5 seconds
CA Configuration:
Expiry Duration: 3 months
Node Address: 10.133.6.234
Manager Addresses:
10.133.6.234:2377
10.133.8.162:2377
10.133.8.89:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 03e5862ec0d8d3b3f750e19fca3ee367e13c090e
runc version: 2f7393a47307a16f8cee44a37b262e8b81021e3e
init version: 949e6fa
Security Options:
apparmor
Kernel Version: 4.4.0-53-generic
Operating System: Ubuntu 14.04.5 LTS
OSType: linux
Architecture: x86_64
CPUs: 12
Total Memory: 31.42 GiB
Name: prd-pro-01
ID: 37WG:BKQR:YXQF:LTMK:P3IQ:CDQ5:ZEM5:PI6L:32DR:3KET:QH4Z:GB5N
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Username: ghostengineering
Registry: https://index.docker.io/v1/
WARNING: No swap limit support
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
Additional environment details (AWS, VirtualBox, physical, etc.):
DigitalOcean