producer continually refreshing client metadata after broker disconnection · Issue #4577 · confluentinc/librdkafka

Description

We are producing into a topic with 84 partitions spread over 8 brokers. We are seeing an issue where a recoverable error condition (difficult to reproduce, but one example is a transport failure causing a broker connection to drop) triggers the underlying librdkafka library to continually refresh the metadata for this topic at the retry.backoff.max.ms interval, even though the metadata requests succeed and all partitions have leaders.
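For context, here is a minimal sketch of the producer setup and the configuration properties relevant to this behaviour. This is illustrative only, not our actual code: the bootstrap server and values are placeholders, and the retry settings shown are just the librdkafka defaults (retry.backoff.ms=100, retry.backoff.max.ms=1000); debug=metadata,broker,topic is what produces the log lines quoted in this report.

#include <librdkafka/rdkafka.h>
#include <stdio.h>

int main(void) {
        char errstr[512];
        rd_kafka_conf_t *conf = rd_kafka_conf_new();

        /* Placeholder broker; our real cluster has 8 brokers over SSL. */
        rd_kafka_conf_set(conf, "bootstrap.servers", "broker-1:443", errstr, sizeof(errstr));
        rd_kafka_conf_set(conf, "security.protocol", "ssl", errstr, sizeof(errstr));

        /* Defaults made explicit: the fast leader query appears to back off
         * from retry.backoff.ms up to retry.backoff.max.ms (see the
         * intervals annotated in the second log excerpt below). */
        rd_kafka_conf_set(conf, "retry.backoff.ms", "100", errstr, sizeof(errstr));
        rd_kafka_conf_set(conf, "retry.backoff.max.ms", "1000", errstr, sizeof(errstr));

        /* The regular metadata refresh is separate and much less frequent. */
        rd_kafka_conf_set(conf, "topic.metadata.refresh.interval.ms", "300000", errstr, sizeof(errstr));

        /* Emits the METADATA/TOPPAR/STATE debug lines quoted in this issue. */
        rd_kafka_conf_set(conf, "debug", "metadata,broker,topic", errstr, sizeof(errstr));

        rd_kafka_t *rk = rd_kafka_new(RD_KAFKA_PRODUCER, conf, errstr, sizeof(errstr));
        if (!rk) {
                fprintf(stderr, "failed to create producer: %s\n", errstr);
                return 1;
        }
        /* ... produce into the 84-partition topic here ... */
        rd_kafka_flush(rk, 10 * 1000);
        rd_kafka_destroy(rk);
        return 0;
}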

For example, here is the error condition being triggered by a newly established broker connection breaking just ~8s after entering the UP state:

%7|1704816898.865|CONNECT|producer.host#producer-2| [thrd:ssl://broker-6:4]: ssl://broker-6:443/6: Connected to ipv4#10.255.253.3:443
%7|1704816898.865|STATE|producer.host#producer-2| [thrd:ssl://broker-6:4]: ssl://broker-6:443/6: Broker changed state CONNECT -> SSL_HANDSHAKE
%7|1704816898.865|TOPPAR|producer.host#producer-2| [thrd:ssl://broker-6:4]: ssl://broker-6:443/6: produce_topic [51] 6 message(s) queued but broker not up
%7|1704816898.886|TOPPAR|producer.host#producer-2| [thrd:ssl://broker-6:4]: ssl://broker-6:443/6: produce_topic [51] 6 message(s) queued but broker not up
%7|1704816898.888|TOPPAR|producer.host#producer-2| [thrd:ssl://broker-6:4]: ssl://broker-6:443/6: produce_topic [51] 6 message(s) queued but broker not up
%7|1704816898.907|TOPPAR|producer.host#producer-2| [thrd:ssl://broker-6:4]: ssl://broker-6:443/6: produce_topic [51] 6 message(s) queued but broker not up
%7|1704816898.907|CONNECTED|producer.host#producer-2| [thrd:ssl://broker-6:4]: ssl://broker-6:443/6: Connected (#13)
%7|1704816898.907|STATE|producer.host#producer-2| [thrd:ssl://broker-6:4]: ssl://broker-6:443/6: Broker changed state SSL_HANDSHAKE -> APIVERSION_QUERY
%7|1704816898.907|TOPPAR|producer.host#producer-2| [thrd:ssl://broker-6:4]: ssl://broker-6:443/6: produce_topic [51] 6 message(s) queued but broker not up
%7|1704816898.926|STATE|producer.host#producer-2| [thrd:ssl://broker-6:4]: ssl://broker-6:443/6: Broker changed state APIVERSION_QUERY -> UP
%7|1704816898.926|METADATA|producer.host#producer-2| [thrd:ssl://broker-6:4]: Skipping metadata refresh of 1 topic(s): connected: already being requested
%7|1704816906.827|FAIL|producer.host#producer-2| [thrd:ssl://broker-6:4]: ssl://broker-6:443/6: Disconnected (after 7900ms in state UP) (_TRANSPORT)
%6|1704816906.827|FAIL|producer.host#producer-2| [thrd:ssl://broker-6:4]: ssl://broker-6:443/6: Disconnected (after 7900ms in state UP)
%7|1704816906.827|STATE|producer.host#producer-2| [thrd:ssl://broker-6:4]: ssl://broker-6:443/6: Broker changed state UP -> DOWN
%7|1704816906.827|REQERR|producer.host#producer-2| [thrd:ssl://broker-6:4]: ssl://broker-6:443/6: ProduceRequest failed: Local: Broker transport failure: explicit actions Refresh,MsgPossiblyPersisted
%7|1704816906.827|BROKERUA|producer.host#producer-2| [thrd:ssl://broker-6:4]: produce_topic [51]: broker unavailable: produce: Local: Broker transport failure
%7|1704816906.827|FASTQUERY|producer.host#producer-2| [thrd:ssl://broker-6:4]: Starting fast leader query
%7|1704816906.827|METADATA|producer.host#producer-2| [thrd:ssl://broker-6:4]: Requesting metadata for 1/1 topics: broker down
%7|1704816906.827|METADATA|producer.host#producer-2| [thrd:main]: Requesting metadata for 1/1 topics: partition leader query
%7|1704816906.827|METADATA|producer.host#producer-2| [thrd:ssl://broker-6:4]: ssl://broker-5:443/5: Request metadata for 1 topic(s): broker down
%7|1704816906.827|METADATA|producer.host#producer-2| [thrd:main]: ssl://broker-5:443/5: Request metadata for 1 topic(s): partition leader query
%7|1704816906.827|STATE|producer.host#producer-2| [thrd:ssl://broker-6:4]: ssl://broker-6:443/6: Broker changed state DOWN -> INIT
%7|1704816906.827|TOPPAR|producer.host#producer-2| [thrd:ssl://broker-6:4]: ssl://broker-6:443/6: produce_topic [51] 4 message(s) queued but broker not up
%7|1704816906.827|STATE|producer.host#producer-2| [thrd:ssl://broker-6:4]: ssl://broker-6:443/6: Broker changed state INIT -> TRY_CONNECT
%7|1704816906.827|CONNECT|producer.host#producer-2| [thrd:ssl://broker-6:4]: ssl://broker-6:443/6: broker in state TRY_CONNECT connecting
%7|1704816906.827|STATE|producer.host#producer-2| [thrd:ssl://broker-6:4]: ssl://broker-6:443/6: Broker changed state TRY_CONNECT -> CONNECT
%7|1704816906.843|METADATA|producer.host#producer-2| [thrd:main]: ssl://broker-5:443/5: ===== Received metadata (for 1 requested topics): broker down =====
%7|1704816906.858|METADATA|producer.host#producer-2| [thrd:main]: ssl://broker-5:443/5: 1/1 requested topic(s) seen in metadata
%7|1704816906.859|CONNECT|producer.host#producer-2| [thrd:ssl://broker-6:4]: ssl://broker-6:443/6: Connected to ipv4#10.255.253.3:443
%7|1704816906.859|STATE|producer.host#producer-2| [thrd:ssl://broker-6:4]: ssl://broker-6:443/6: Broker changed state CONNECT -> SSL_HANDSHAKE
%7|1704816906.859|TOPPAR|producer.host#producer-2| [thrd:ssl://broker-6:4]: ssl://broker-6:443/6: produce_topic [51] 4 message(s) queued but broker not up
%7|1704816906.878|TOPPAR|producer.host#producer-2| [thrd:ssl://broker-6:4]: ssl://broker-6:443/6: produce_topic [51] 4 message(s) queued but broker not up
%7|1704816906.880|TOPPAR|producer.host#producer-2| [thrd:ssl://broker-6:4]: ssl://broker-6:443/6: produce_topic [51] 4 message(s) queued but broker not up
%7|1704816906.897|TOPPAR|producer.host#producer-2| [thrd:ssl://broker-6:4]: ssl://broker-6:443/6: produce_topic [51] 4 message(s) queued but broker not up
%7|1704816906.897|CONNECTED|producer.host#producer-2| [thrd:ssl://broker-6:4]: ssl://broker-6:443/6: Connected (#14)
%7|1704816906.897|STATE|producer.host#producer-2| [thrd:ssl://broker-6:4]: ssl://broker-6:443/6: Broker changed state SSL_HANDSHAKE -> APIVERSION_QUERY
%7|1704816906.897|TOPPAR|producer.host#producer-2| [thrd:ssl://broker-6:4]: ssl://broker-6:443/6: produce_topic [51] 4 message(s) queued but broker not up
%7|1704816906.912|STATE|producer.host#producer-2| [thrd:ssl://broker-6:4]: ssl://broker-6:443/6: Broker changed state APIVERSION_QUERY -> UP
%7|1704816906.912|METADATA|producer.host#producer-2| [thrd:ssl://broker-6:4]: Skipping metadata refresh of 1 topic(s): connected: already being requested
%7|1704816906.927|METADATA|producer.host#producer-2| [thrd:main]: Requesting metadata for 1/1 topics: partition leader query
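As an aside (illustrative sketch, not our actual code): at the application level a disconnect like the one above only surfaces through the optional error callback as a non-fatal RD_KAFKA_RESP_ERR__TRANSPORT event, since librdkafka retries the produce requests internally. That is consistent with what we observe, as noted below.

#include <librdkafka/rdkafka.h>
#include <stdio.h>

/* Hypothetical error callback: _TRANSPORT here corresponds to the
 * "Disconnected ... (_TRANSPORT)" FAIL lines in the log above. */
static void error_cb(rd_kafka_t *rk, int err, const char *reason, void *opaque) {
        if ((rd_kafka_resp_err_t)err == RD_KAFKA_RESP_ERR__TRANSPORT)
                fprintf(stderr, "transient transport error (client will retry): %s\n", reason);
        else
                fprintf(stderr, "client error %s: %s\n",
                        rd_kafka_err2str((rd_kafka_resp_err_t)err), reason);
}

/* Registered on the configuration object before creating the producer: */
/* rd_kafka_conf_set_error_cb(conf, error_cb); */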

And here we can see that even though every one of these metadata requests returns successfully (I've truncated the responses for readability) and all partitions have leaders, the client keeps refreshing the metadata as if something were broken. Note that the producer keeps working throughout this and messages are delivered fine, but we have noticed on our servers that the number of metadata requests coming from our clients is huge.

%7|1704816906.843|METADATA|producer.host#producer-2| [thrd:main]: ssl://broker-5:443/5: 1/1 requested topic(s) seen in metadata
%7|1704816906.858|METADATA|producer.host#producer-2| [thrd:main]: ssl://broker-5:443/5: ===== Received metadata (for 1 requested topics): partition leader query =====
%7|1704816906.858|METADATA|producer.host#producer-2| [thrd:main]: ssl://broker-5:443/5: 1/1 requested topic(s) seen in metadata
%7|1704816906.927|METADATA|producer.host#producer-2| [thrd:main]: Requesting metadata for 1/1 topics: partition leader query <-- 100ms later
%7|1704816906.943|METADATA|producer.host#producer-2| [thrd:main]: ssl://broker-5:443/5: 1/1 requested topic(s) seen in metadata
%7|1704816907.115|METADATA|producer.host#producer-2| [thrd:main]: Requesting metadata for 1/1 topics: partition leader query <-- 200ms later
%7|1704816907.131|METADATA|producer.host#producer-2| [thrd:main]: ssl://broker-5:443/5: 1/1 requested topic(s) seen in metadata
%7|1704816907.475|METADATA|producer.host#producer-2| [thrd:main]: Requesting metadata for 1/1 topics: partition leader query <-- 400ms later
%7|1704816907.491|METADATA|producer.host#producer-2| [thrd:main]: ssl://broker-5:443/5: 1/1 requested topic(s) seen in metadata
%7|1704816908.227|METADATA|producer.host#producer-2| [thrd:main]: Requesting metadata for 1/1 topics: partition leader query <-- 800ms later
%7|1704816908.243|METADATA|producer.host#producer-2| [thrd:main]: ssl://broker-5:443/5: 1/1 requested topic(s) seen in metadata
%7|1704816909.227|METADATA|producer.host#producer-2| [thrd:main]: Requesting metadata for 1/1 topics: partition leader query <-- 1s later
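The intervals between these "partition leader query" requests double from roughly 100 ms up to 1 s and then stay there, which matches an exponential backoff starting at retry.backoff.ms and capped at retry.backoff.max.ms (the librdkafka defaults being 100 ms and 1000 ms respectively). A minimal sketch of that assumed schedule, just to show it reproduces the timestamps annotated above:

#include <stdio.h>

/* Hypothetical model of the observed schedule: exponential backoff starting
 * at retry.backoff.ms (default 100 ms), doubling each time, capped at
 * retry.backoff.max.ms (default 1000 ms). Once capped, the queries keep
 * firing every ~1 s, which is the continual refresh reported here. */
int main(void) {
        int backoff_ms = 100;      /* assumed start: retry.backoff.ms default */
        const int max_ms = 1000;   /* retry.backoff.max.ms default */
        int elapsed_ms = 0;

        for (int i = 1; i <= 8; i++) {
                elapsed_ms += backoff_ms;
                printf("leader query #%d at +%d ms (interval %d ms)\n",
                       i, elapsed_ms, backoff_ms);
                if (backoff_ms < max_ms)
                        backoff_ms = backoff_ms * 2 > max_ms ? max_ms : backoff_ms * 2;
        }
        return 0;
}

Running this prints intervals of 100, 200, 400, 800 and then 1000 ms repeatedly, matching the log; the problem is not the backoff shape but that the fast leader query never stops even though every response shows leaders for all partitions.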

Could there be some sort of race condition between broker disconnect/reconnect handling and metadata refreshing?

Checklist

IMPORTANT: We will close issues where the checklist has not been completed.

Please provide the following information:

