@jzakaryan
Collaborator

Even in flushless mode, BMM's tasks call flush on the producer when shutting down. We have observed that the producer flush call tends to get stuck indefinitely, which keeps the tasks from shutting down gracefully. This PR addresses the problem by wrapping the producer flush call in a future and blocking on that future with a timeout.

If the producer flush doesn't complete within the timeout window, the task proceeds to commit safe offsets and shut down. The timeout window is exposed through a configuration property.
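
The approach described above can be sketched roughly as follows. This is a minimal, standalone illustration and not the code in this PR: the class `FlushWithTimeout`, the method `flushWithTimeout`, and the `timeoutMs` parameter are hypothetical names, and a `Runnable` stands in for the real producer flush call.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper; names are illustrative and not taken from the PR.
class FlushWithTimeout {

    /**
     * Runs the given flush action on a separate thread and waits at most
     * timeoutMs milliseconds for it to finish.
     *
     * @return true if the flush completed in time, false if it timed out,
     *         failed, or the calling thread was interrupted
     */
    static boolean flushWithTimeout(Runnable flush, long timeoutMs) {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "producer-flush");
            t.setDaemon(true); // a stuck flush should not keep the JVM alive
            return t;
        });
        Future<?> future = executor.submit(flush);
        try {
            future.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the thread running the flush
            return false;
        } catch (ExecutionException e) {
            return false; // the flush itself threw an exception
        } catch (InterruptedException e) {
            future.cancel(true);
            Thread.currentThread().interrupt(); // preserve interrupt status
            return false;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Stand-in for a producer flush that never returns on its own.
        Runnable stuckFlush = () -> {
            try {
                Thread.sleep(Long.MAX_VALUE);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        boolean flushed = flushWithTimeout(stuckFlush, 200L);
        System.out.println("flushed=" + flushed); // prints "flushed=false"
        // At this point the task would go on to commit safe offsets and shut down.
    }
}
```

Whether the flush completes or times out, the caller regains control after at most `timeoutMs`, so shutdown can continue. Note that cancelling the future only interrupts the flush thread; a flush that ignores interrupts would keep running in the background on the daemon thread.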

@jzakaryan jzakaryan changed the title Added a timeout on the producer flush call in KafkaMirrorMakerConnecorTask Added a timeout on the producer flush call in KafkaMirrorMakerConnectorTask Aug 28, 2023
Collaborator

@shrinandthakkar shrinandthakkar left a comment

Why shouldn't we just use the producer config offset.flush.timeout.ms instead of creating a new config to wait on?

ref: https://kafka.apache.org/21/documentation.html#producerconfigs

@jzakaryan
Collaborator Author

> Why shouldn't we just use the producer config offset.flush.timeout.ms instead of creating a new config to wait on?
>
> ref: https://kafka.apache.org/21/documentation.html#producerconfigs

@shrinandthakkar the problem is that the tasks were found to be stuck on flush after 60 seconds, despite the default value of 5 seconds for offset.flush.timeout.ms. Connector logs show that the connector interrupted the producer.flush call to shut down the task.
We can reach out to the Kafka team and ask whether the behavior of LKC with respect to producer flush and config values is the same as in Apache Kafka. But that can happen in parallel with us addressing it.

@shrinandthakkar
Collaborator

> > Why shouldn't we just use the producer config offset.flush.timeout.ms instead of creating a new config to wait on?
> > ref: https://kafka.apache.org/21/documentation.html#producerconfigs
>
> @shrinandthakkar the problem is that the tasks were found to be stuck on flush after 60 seconds, despite the default value of 5 seconds for offset.flush.timeout.ms. Connector logs show that the connector interrupted the producer.flush call to shut down the task. We can reach out to the Kafka team and ask whether the behavior of LKC with respect to producer flush and config values is the same as in Apache Kafka. But that can happen in parallel with us addressing it.

@jzakaryan
Within the EventProducer's flush call, I think we already have a configuration defined (ref). Is the value of that flush timeout config INT_MAX? Is that why we are waiting forever?

Do you think we should instead try to reconfigure that value for MM clusters?
