Showing content from https://github.com/confluentinc/librdkafka/issues/3109 below:

Robustness and resiliency on Azure · Issue #3109 · confluentinc/librdkafka · GitHub

Hi there,

first things first: Thanks for the tremendous amount of work you are putting into librdkafka, @edenhill. You know who you are.

Introduction

This is not meant to be a specific bug report, as we believe the issues we experienced when using librdkafka to connect to Azure Event Hubs have already been mitigated in librdkafka 1.5.2 and newer. In fact, they might not have been specific to Azure Event Hubs at all, and might also have tripped up others running Apache Kafka or the Confluent Stack on Azure in general.

Instead, we wanted to share our findings as a wrap-up and future reference for others looking into similar issues. In that spirit, apologies for not completing the checklist. The issue can well be closed right away.

The topics span Azure networking (problems) in general, as well as things related to Kafka and Kubernetes.

So, here we go.

General research

Azure LB closing idle network connections

The problem here is that Azure network load-balancing components silently drop idle network connections after 4 minutes.

The LB does not even bother to send RST packets to either of the communication partners, so client and server sockets will most probably try to reuse these dead connections.

In turn, services will either hit their individual socket timeouts, or the kernel will keep retransmitting with backoff for another 15+ minutes until it considers the connection dead.
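One common mitigation for silently dropped idle connections is TCP keepalive with an idle time well below the 4-minute cutoff. The following is a minimal sketch using Python's standard `socket` module, not anything taken from librdkafka itself. The helper name `enable_keepalive` and the timing values are assumptions chosen for illustration, and the sketch assumes the load balancer counts keepalive probes as activity. The fine-grained options are platform-specific, so they are only set where the constants exist.

```python
import socket


def enable_keepalive(sock, idle_s=120, interval_s=30, probes=4):
    """Enable TCP keepalive so probes start well before a 4-minute idle cutoff.

    Hypothetical helper: the name and default timings are illustrative only.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The per-socket tuning knobs are not available on every platform,
    # so apply each one only if this Python build exposes the constant.
    for name, value in (
        ("TCP_KEEPIDLE", idle_s),       # seconds of silence before the first probe
        ("TCP_KEEPINTVL", interval_s),  # seconds between unanswered probes
        ("TCP_KEEPCNT", probes),        # unanswered probes before the socket errors out
    ):
        opt = getattr(socket, name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)
    return sock


sock = enable_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
keepalive_on = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
sock.close()
```

With these illustrative defaults, a dead peer would be noticed after roughly 120 + 4 × 30 seconds instead of after the kernel's much longer retransmission backoff.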

Quotes

TL;DR: Azure has a nasty artificial limitation that results in being unable to use long-lived TCP connections that have >= 4 minutes of radio silence at any given point.

They screwed it up so hard that when connection does timeout, they acknowledge the following TCP packets with an ok flag that makes the sender think “everything is okay - the data I sent was received succesfully”, which is 100 % unacceptable way to handle error conditions.

This caused me so much pain and loss of productive work time.

-- https://joonas.fi/2017/01/23/microsoft-azures-networking-is-fundamentally-broken/

Resources

Using Kafka and Event Hubs on Azure

Quotes

"The problem here is that the producer has two TCP connections that can go idle for > 4 mins - at that point, Azure load balancers close out the idle connections. The Kafka client is unaware that the connections have been closed so it attempts to send a batch on a dead connection, which times out, at which point retry kicks in."

-- https://stackoverflow.com/a/58385324

Magnus Edenhill:

Joakim, we've seen a couple of similar reports for users on Azure and we can't really provide an explanation, something stalls the request/response until it times out, and we believe this to be outside the client, so most likely something in Azure.
I recommend opening an issue with the Azure folks.

Joakim Blach Andersen:

I got an answer from Azure:
The service closes idle connections (idle here means no request received for 10 minutes). It happens even when tcp keep-alive is enabled because that config only keeps the connection open but does not generate protocol requests. In your case, you have only observed the error on the idle event hub, it may be related to the idle connection being closed and the client SDK does not handle that correctly (Kafka).

-- #2845

Magnus Edenhill:

We're trying to work around these Azure weak idle disconnects in the upcoming v1.5.0 release by reusing the least idle connection for things like metadata requests, which should keep that connection alive and not cause these idle disconnect request timeouts.

-- confluentinc/confluent-kafka-dotnet#1305 (comment)
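For clients that cannot rely on that built-in workaround, the quotes above suggest keeping at least one connection busy with real protocol requests, since TCP keepalive alone does not count as activity for the service. Below is a hedged sketch of a client configuration built as a plain Python dict. The property names are documented librdkafka settings, but the helper `azure_friendly_config`, the 60-second safety margin, and the endpoint are assumptions for illustration, not recommendations from this issue. Check the CONFIGURATION.md of your librdkafka version before relying on any of them.

```python
def azure_friendly_config(bootstrap_servers, idle_cutoff_s=240, margin_s=60):
    """Build a librdkafka-style property dict that stays under an idle cutoff.

    Hypothetical helper: name, margin and defaults are illustrative assumptions.
    """
    if idle_cutoff_s <= margin_s:
        raise ValueError("idle_cutoff_s must be larger than margin_s")
    active_window_ms = (idle_cutoff_s - margin_s) * 1000
    return {
        "bootstrap.servers": bootstrap_servers,
        # Helps against the load balancer, but per the quote above it does
        # not generate protocol requests on its own.
        "socket.keepalive.enable": True,
        # Refresh metadata before the cutoff so a real Kafka request is
        # sent on at least one connection within every idle window.
        "metadata.max.age.ms": active_window_ms,
        # Proactively close connections the client itself considers idle,
        # rather than reusing one that was dropped silently.
        # Assumption: only newer librdkafka releases accept this property.
        "connections.max.idle.ms": active_window_ms,
    }


# Placeholder endpoint, not a real namespace.
config = azure_friendly_config("example.servicebus.windows.net:9093")
```

The resulting dict could be passed to a librdkafka-based client constructor, with unsupported properties removed for older versions.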

Resources

With kind regards,
Andreas.
