A RetroSearch Logo

Home - News ( United States | United Kingdom | Italy | Germany ) - Football scores

Search Query:

Showing content from https://github.com/StackExchange/StackExchange.Redis/issues/2251 below:

Cluster Connect works partially, with a full timeout. · Issue #2251 · StackExchange/StackExchange.Redis · GitHub

AWS Redis 6.2.5 cluster 2 shards, each with replica, with a config endpoint. FWIW, the config endpoint is a DNS alias of a random cluster node, I believe.

Originally, a .NET Core app uses SE.Redis 2.2.88, configured with that single config endpoint. Everything works.

After upgrading to SE.Redis 2.6.66, random Redis commands started to return non-sensical errors that were reproducible somewhat in 50% of the cases.

E.g. a simple transaction with a single operation against a single random key would give EXECABORT in 50% of the runs. (was not the root issue but the specific reason for that was that in 50% tx would be sent to the server as an empty "multi"/"exec" pair, i.e. missing the actual op, according to redis-cli monitor).

Additionally, the time of the initial ConnectionMultiplexer.Connect would be always around ConnectTimeout, regardless of the value. Very much similar to what is described in this issue here.

All errors/delays would disappear if all 4 endpoints would be spelled out in configuration, i.e. the issue would take place only if the muxer has to discover remaining endpoints from the provided one.

The described issues were reproduced in tests, so what the app does and where it runs seems irrelevant. The configuration is mostly vanilla (dumped below) and didn't change during the upgrade. The app uses custom SSL callback that allows "*.amazonaws.com" certs, not sure if that makes a difference here.

SE.Redis 2.6.66 connection sequence (ConnectTimeout = 30s), from ConnectionMultplexer.Connect

**** Connecting...
01:32:28.2322: Connecting (sync) on .NET 6.0.7 (StackExchange.Redis: v2.6.66.47313)
01:32:28.3465: clustercfg.services.rfikkv.usw2.cache.amazonaws.com:6380,keepAlive=60,syncTimeout=5000,asyncTimeout=5000,allowAdmin=False,version=5.0,connectTimeout=30000,password=*****,tiebreaker=__Booksleeve_TieBreak,ssl=True,sslHost=localhost,configChannel=__Booksleeve_MasterChanged,abortConnect=False,resolveDns=False,connectRetry=3,proxy=None,configCheckSeconds=60,defaultDatabase=0
01:32:28.3548: clustercfg.services.rfikkv.usw2.cache.amazonaws.com:6380/Interactive: Connecting...
01:32:28.3777: clustercfg.services.rfikkv.usw2.cache.amazonaws.com:6380: BeginConnectAsync
01:32:28.4253: 1 unique nodes specified (with tiebreaker)
01:32:28.4301: clustercfg.services.rfikkv.usw2.cache.amazonaws.com:6380: OnConnectedAsync init (State=Connecting)
01:32:28.4350: Allowing 1 endpoint(s) 00:00:30 to respond...
01:32:28.4446: Awaiting 1 available task completion(s) for 30000ms, IOCP: (Busy=0,Free=1000,Min=12,Max=1000), WORKER: (Busy=1,Free=32766,Min=12,Max=32767), POOL: (Threads=5,QueuedItems=0,CompletedItems=5)
01:32:28.5670: Configuring TLS
01:32:28.7636: TLS connection established successfully using protocol: Tls12
01:32:28.7692: clustercfg.services.rfikkv.usw2.cache.amazonaws.com:6380/Interactive: Connected 
01:32:28.7768: clustercfg.services.rfikkv.usw2.cache.amazonaws.com:6380: Server handshake
01:32:28.7811: clustercfg.services.rfikkv.usw2.cache.amazonaws.com:6380: Authenticating (password)
01:32:28.8240: clustercfg.services.rfikkv.usw2.cache.amazonaws.com:6380: Setting client name: W-ITEPLO-TJ5N7(SE.Redis-v2.6.66.47313)
01:32:28.8331: clustercfg.services.rfikkv.usw2.cache.amazonaws.com:6380: Auto-configuring...
01:32:28.8421: clustercfg.services.rfikkv.usw2.cache.amazonaws.com:6380: Requesting tie-break (Key="__Booksleeve_TieBreak")...
01:32:28.8506: clustercfg.services.rfikkv.usw2.cache.amazonaws.com:6380/Interactive: Writing: GET __Booksleeve_TieBreak
01:32:28.8588: clustercfg.services.rfikkv.usw2.cache.amazonaws.com:6380: Sending critical tracer (handshake): ECHO
01:32:28.8620: clustercfg.services.rfikkv.usw2.cache.amazonaws.com:6380/Interactive: Writing: ECHO
01:32:28.8711: clustercfg.services.rfikkv.usw2.cache.amazonaws.com:6380: Flushing outbound buffer
01:32:28.8761: clustercfg.services.rfikkv.usw2.cache.amazonaws.com:6380: OnEstablishingAsync complete
01:32:28.8791: clustercfg.services.rfikkv.usw2.cache.amazonaws.com:6380: Starting read
01:32:28.9559: clustercfg.services.rfikkv.usw2.cache.amazonaws.com:6380: Auto-configured (INFO) role: primary
01:32:28.9617: clustercfg.services.rfikkv.usw2.cache.amazonaws.com:6380: Auto-configured (INFO) version: 6.2.5
01:32:28.9678: clustercfg.services.rfikkv.usw2.cache.amazonaws.com:6380: Auto-configured (INFO) server-type: cluster
01:32:29.0186: Response from clustercfg.services.rfikkv.usw2.cache.amazonaws.com:6380/Interactive / GET __Booksleeve_TieBreak: Error: MOVED 6202 services-0001-002.services.rfikkv.usw2.cache.amazonaws.com:6380
01:32:29.0252: Response from clustercfg.services.rfikkv.usw2.cache.amazonaws.com:6380/Interactive / ECHO: BulkString: 16 bytes
01:32:29.0342: clustercfg.services.rfikkv.usw2.cache.amazonaws.com:6380: OnConnectedAsync completed (From command: ECHO)
01:32:29.0390: All 1 available tasks completed cleanly, IOCP: (Busy=1,Free=999,Min=12,Max=1000), WORKER: (Busy=2,Free=32765,Min=12,Max=32767), POOL: (Threads=10,QueuedItems=0,CompletedItems=20)
01:32:29.0441: Endpoint summary:
01:32:29.0482:   clustercfg.services.rfikkv.usw2.cache.amazonaws.com:6380: Endpoint is ConnectedEstablished
01:32:29.0536:   clustercfg.services.rfikkv.usw2.cache.amazonaws.com:6380: Returned with success as Cluster primary (Source: From command: ECHO)
01:32:29.1106: services-0002-001.services.rfikkv.usw2.cache.amazonaws.com:6380: OnConnectedAsync init (State=Connecting)
01:32:29.1156: services-0001-002.services.rfikkv.usw2.cache.amazonaws.com:6380: OnConnectedAsync init (State=Connecting)
01:32:29.1202: services-0001-001.services.rfikkv.usw2.cache.amazonaws.com:6380: OnConnectedAsync init (State=Connecting)
01:32:29.1236: services-0002-002.services.rfikkv.usw2.cache.amazonaws.com:6380: OnConnectedAsync init (State=Connecting)
01:32:29.1273: Allowing 4 endpoint(s) 00:00:29.3080000 to respond...
01:32:29.1315: Awaiting 4 available task completion(s) for 29308ms, IOCP: (Busy=0,Free=1000,Min=12,Max=1000), WORKER: (Busy=2,Free=32765,Min=12,Max=32767), POOL: (Threads=10,QueuedItems=0,CompletedItems=26)
01:32:58.4418: Not all available tasks completed cleanly (from ReconfigureAsync#1284, timeout 29308ms), IOCP: (Busy=0,Free=1000,Min=12,Max=1000), WORKER: (Busy=2,Free=32765,Min=12,Max=32767), POOL: (Threads=6,QueuedItems=0,CompletedItems=73)
01:32:58.4475:   Server[0] (services-0002-001.services.rfikkv.usw2.cache.amazonaws.com:6380) Status: WaitingForActivation (inst: 12, qs: 0, in: 0, qu: 12, aw: False, in-pipe: -1, out-pipe: -1, bw: Inactive, rs: ReadAsync. ws: Idle)
01:32:58.4510:   Server[1] (services-0001-002.services.rfikkv.usw2.cache.amazonaws.com:6380) Status: WaitingForActivation (inst: 11, qs: 0, in: 0, qu: 11, aw: False, in-pipe: -1, out-pipe: -1, bw: Inactive, rs: ReadAsync. ws: Idle)
01:32:58.4550:   Server[2] (services-0001-001.services.rfikkv.usw2.cache.amazonaws.com:6380) Status: WaitingForActivation (inst: 12, qs: 0, in: 0, qu: 12, aw: False, in-pipe: -1, out-pipe: -1, bw: Inactive, rs: ReadAsync. ws: Idle)
01:32:58.4591:   Server[3] (services-0002-002.services.rfikkv.usw2.cache.amazonaws.com:6380) Status: WaitingForActivation (inst: 11, qs: 0, in: 0, qu: 11, aw: False, in-pipe: -1, out-pipe: -1, bw: Inactive, rs: ReadAsync. ws: Idle)
01:32:58.4632: Endpoint summary:
01:32:58.4668:   services-0002-001.services.rfikkv.usw2.cache.amazonaws.com:6380: Endpoint is ConnectedEstablished
01:32:58.4737:   services-0001-002.services.rfikkv.usw2.cache.amazonaws.com:6380: Endpoint is ConnectedEstablished
01:32:58.4795:   services-0001-001.services.rfikkv.usw2.cache.amazonaws.com:6380: Endpoint is ConnectedEstablished
01:32:58.4833:   services-0002-002.services.rfikkv.usw2.cache.amazonaws.com:6380: Endpoint is ConnectedEstablished
01:32:58.4875:   services-0002-001.services.rfikkv.usw2.cache.amazonaws.com:6380: Did not respond
01:32:58.4908:   services-0001-002.services.rfikkv.usw2.cache.amazonaws.com:6380: Did not respond
01:32:58.4940:   services-0001-001.services.rfikkv.usw2.cache.amazonaws.com:6380: Did not respond
01:32:58.4969:   services-0002-002.services.rfikkv.usw2.cache.amazonaws.com:6380: Did not respond
01:32:58.5004: Cluster: 16384 of 16384 slots covered
01:32:58.5055: Endpoint Summary:
01:32:58.5101:   clustercfg.services.rfikkv.usw2.cache.amazonaws.com:6380: Cluster v6.2.5, primary; keep-alive: 00:01:00; int: ConnectedEstablished; sub: ConnectedEstablished, 1 active
01:32:58.5155:   clustercfg.services.rfikkv.usw2.cache.amazonaws.com:6380: int ops=12, qu=0, qs=0, qc=0, wr=0, socks=1; sub ops=4, qu=0, qs=0, qc=0, wr=0, subs=1, socks=1
01:32:58.5243:   clustercfg.services.rfikkv.usw2.cache.amazonaws.com:6380: Circular op-count snapshot; int: 0+12=12 (1.20 ops/s; spans 10s); sub: 0+4=4 (0.40 ops/s; spans 10s)
01:32:58.5276:   services-0001-002.services.rfikkv.usw2.cache.amazonaws.com:6380: Cluster v6.2.5, primary; keep-alive: 00:01:00; int: ConnectedEstablished; sub: n/a; not in use: DidNotRespond
01:32:58.5302:   services-0001-002.services.rfikkv.usw2.cache.amazonaws.com:6380: int ops=11, qu=0, qs=0, qc=0, wr=0, socks=1; sub ops=0, qu=0, qs=0, qc=0, wr=0
01:32:58.5332:   services-0001-002.services.rfikkv.usw2.cache.amazonaws.com:6380: Circular op-count snapshot; int: 0+11=11 (1.10 ops/s; spans 10s); sub:
01:32:58.5361:   services-0002-002.services.rfikkv.usw2.cache.amazonaws.com:6380: Cluster v6.2.5, primary; keep-alive: 00:01:00; int: ConnectedEstablished; sub: n/a; not in use: DidNotRespond
01:32:58.5385:   services-0002-002.services.rfikkv.usw2.cache.amazonaws.com:6380: int ops=11, qu=0, qs=0, qc=0, wr=0, socks=1; sub ops=0, qu=0, qs=0, qc=0, wr=0
01:32:58.5409:   services-0002-002.services.rfikkv.usw2.cache.amazonaws.com:6380: Circular op-count snapshot; int: 0+11=11 (1.10 ops/s; spans 10s); sub:
01:32:58.5432:   services-0002-001.services.rfikkv.usw2.cache.amazonaws.com:6380: Cluster v6.2.5, replica; keep-alive: 00:01:00; int: ConnectedEstablished; sub: n/a; not in use: DidNotRespond
01:32:58.5456:   services-0002-001.services.rfikkv.usw2.cache.amazonaws.com:6380: int ops=12, qu=0, qs=0, qc=0, wr=0, socks=1; sub ops=0, qu=0, qs=0, qc=0, wr=0
01:32:58.5481:   services-0002-001.services.rfikkv.usw2.cache.amazonaws.com:6380: Circular op-count snapshot; int: 0+12=12 (1.20 ops/s; spans 10s); sub:
01:32:58.5510:   services-0001-001.services.rfikkv.usw2.cache.amazonaws.com:6380: Cluster v6.2.5, replica; keep-alive: 00:01:00; int: ConnectedEstablished; sub: n/a; not in use: DidNotRespond
01:32:58.5535:   services-0001-001.services.rfikkv.usw2.cache.amazonaws.com:6380: int ops=12, qu=0, qs=0, qc=0, wr=0, socks=1; sub ops=0, qu=0, qs=0, qc=0, wr=0
01:32:58.5560:   services-0001-001.services.rfikkv.usw2.cache.amazonaws.com:6380: Circular op-count snapshot; int: 0+12=12 (1.20 ops/s; spans 10s); sub:
01:32:58.5587: Sync timeouts: 0; async timeouts: 0; fire and forget: 0; last heartbeat: -1s ago
01:32:58.5610: Starting heartbeat...
01:32:58.5643: Total connect time: 30,332 ms
**** Connected

After debugging for some time, I believe the issue was introduced in #1947. Specifically this:

Even more specifically, this line, that was later slightly updated in #2001, but not in a way to help here (all my cluster nodes use subscriptions).

Super-specifically, during activation of the discovered endpoints, the connection sequence is awaiting on these tasks. These are TCS tasks that are set completed when this happens.

And this now (well, since 2.5.*) requires a Subscription PhysicalBridge present on that endpoint. For the life of me I don't see it being created during the Reconfigure flow anywhere but here, which clearly happens only for the initially provided endpoints.

How come it's been out for that long and nobody noticed? Things go rather bad in this partially connected state. So I mostly hope that my theory is missing something obvious and all of that is fixable with a simple option flip.


RetroSearch is an open source project built by @garambo | Open a GitHub Issue

Search and Browse the WWW like it's 1997 | Search results from DuckDuckGo

HTML: 3.2 | Encoding: UTF-8 | Version: 0.7.4