Infinitic 0.18.0: How We Fixed a Critical Orchestration Pattern Challenge
Solving Critical Message Ordering in Distributed Workflows
TL;DR
Infinitic 0.18.0 fixes critical message-ordering issues with Pulsar that could cause unexpected workflow behavior
We implemented transient storage for out-of-order messages and an optimistic locking pattern
These fixes keep workflow orchestration reliable even during worker rebalancing events
Required action: a database migration is needed to add a version column for optimistic locking
As we release Infinitic 0.18.0 today, I want to share some insights into a fundamental challenge we tackled in this version. If you've been using our workflow orchestration framework, this update will significantly improve reliability, especially when scaling your workers up or down.
The Critical Role of Key-Ordering in Orchestration
When I first designed Infinitic, one of my core principles was building a rock-solid foundation for workflow orchestration. At the heart of any reliable orchestration system lies a seemingly simple but absolutely critical requirement: a key-ordering guarantee.
What exactly is key-ordering and why is it so essential?
In distributed systems like Infinitic, messages are constantly flowing between components. When these messages relate to the same workflow instance (identified by a unique key), the ordering of those messages can mean the difference between correct execution and subtle, dangerous race conditions.
Imagine a workflow that's waiting for two task completions before proceeding to the next step. If both tasks complete at nearly the same time and send their completion messages back to the workflow engine:
Task A completes and sends a completion message
Task B completes and sends a completion message
The workflow should process these in order to correctly update its state
Without key-ordering guarantees, the workflow might process Task B's completion before Task A's, or two workers might process both completions at the same moment. Either case can lead to incorrect state transitions, especially if the workflow logic depends on the order of completion or if concurrent state updates overwrite each other.
This example illustrates why key-ordering isn't just a nice-to-have feature: it's fundamental to predictable workflow behavior. Without it, workflows become non-deterministic, making debugging nearly impossible and reducing reliability to unacceptable levels.
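To make the risk concrete, here is a minimal sketch of the lost-update case described above. It is not Infinitic code: the in-memory `store`, the `load`/`save` helpers, and the state shape are all hypothetical stand-ins for a worker reading and writing workflow state.

```python
import copy

# Hypothetical persisted workflow state, keyed by workflow id (not Infinitic's real schema).
store = {"wf-1": {"completed": []}}

def load(workflow_id):
    # Each handler works on its own snapshot, like a worker reading from a database.
    return copy.deepcopy(store[workflow_id])

def save(workflow_id, state):
    store[workflow_id] = state

def handle_completion(workflow_id, task):
    state = load(workflow_id)
    state["completed"].append(task)
    save(workflow_id, state)

# With per-key ordering, completions for one workflow are handled one after another.
handle_completion("wf-1", "A")
handle_completion("wf-1", "B")
ordered_result = store["wf-1"]["completed"]

# Without it, two workers can interleave: both read before either writes.
store["wf-1"] = {"completed": []}
snapshot_a = load("wf-1")
snapshot_b = load("wf-1")
snapshot_a["completed"].append("A")
snapshot_b["completed"].append("B")
save("wf-1", snapshot_a)
save("wf-1", snapshot_b)  # overwrites the write that recorded Task A
racy_result = store["wf-1"]["completed"]

print(ordered_result)  # ['A', 'B']
print(racy_result)     # ['B']
```

In the interleaved run, Task A's completion is silently lost, so a workflow waiting for both tasks would never proceed.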
Apache Pulsar and the Key-Ordering Challenge
Infinitic has been built on Apache Pulsar since day one. We chose Pulsar for its modern architecture, its scalability, and, crucially, its promised key-ordering guarantee.
In theory, Pulsar should ensure that all messages with the same key are delivered in the order they were published. However, as our users pushed Infinitic harder, we discovered edge cases where Pulsar's key-ordering guarantees weren't as robust as we expected.
The specific issue occurred during certain scaling scenarios. When Pulsar rebalances consumers (which happens during scaling events or after worker failures), there is a small window in which messages with the same key can be processed out of order if they are assigned to different consumers.
For most messaging applications, this brief reordering might be acceptable. But for a workflow orchestration engine like Infinitic, even rare ordering violations can lead to incorrect workflow states and unpredictable behavior.
The Fix in Infinitic 0.18.0
Here's how we approached the problem:
Transient storage for out-of-order messages: We've implemented a system that detects when messages arrive out of sequence and temporarily stores them in a dedicated buffer. This ensures that even if Pulsar delivers messages in the wrong order, Infinitic can reorder them correctly before processing. When the expected preceding messages finally arrive, we can retrieve the buffered messages and process them in the correct sequence.
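As an illustration of the buffering idea, here is a minimal sketch of a reorder buffer. It assumes each message carries a per-workflow sequence number starting at 0; that detail, along with the `ReorderBuffer` class and its `receive` method, is a hypothetical choice for this example and not a description of how Infinitic detects out-of-order messages internally.

```python
from collections import defaultdict

class ReorderBuffer:
    """Holds early messages until every preceding message for the same key has arrived."""

    def __init__(self):
        self.next_seq = defaultdict(int)   # next expected sequence number per key
        self.pending = defaultdict(dict)   # key -> {sequence number: payload}

    def receive(self, key, seq, payload):
        """Return the payloads that are now ready to process, in sequence order."""
        if seq < self.next_seq[key]:
            return []  # already processed: treat as a duplicate and drop it
        self.pending[key][seq] = payload
        ready = []
        # Drain the buffer for as long as the next expected message is available.
        while self.next_seq[key] in self.pending[key]:
            ready.append(self.pending[key].pop(self.next_seq[key]))
            self.next_seq[key] += 1
        return ready

buffer = ReorderBuffer()
early = buffer.receive("wf-1", 1, "task B completed")    # arrives too early, so it is buffered
caught_up = buffer.receive("wf-1", 0, "task A completed")
print(early)      # []
print(caught_up)  # ['task A completed', 'task B completed']
```

The buffer in this sketch lives in memory only; a real transient store would also need to survive a worker restart.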
Optimistic locking pattern for state storage: We've introduced an optimistic locking mechanism when storing workflow state. Here's how it works:
When retrieving a workflow state, we also fetch its current version number
Before saving any updates to that state, we verify the version hasn't changed
If another process has updated the state in the meantime (detected by a version mismatch), we abort the current update, re-fetch the latest state, and retry the operation
This approach prevents multiple workers from overwriting each other's changes and maintains state consistency even when processing concurrent messages.
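The read, verify, and retry steps above can be sketched with a conditional update. This example uses Python's built-in `sqlite3` module and a hypothetical `workflow_state` table; the table layout and the function names are assumptions for illustration, not Infinitic's actual schema or API.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: the version column is what makes optimistic locking possible.
conn.execute(
    "CREATE TABLE workflow_state ("
    "id TEXT PRIMARY KEY, state TEXT NOT NULL, version INTEGER NOT NULL)"
)
conn.execute("INSERT INTO workflow_state VALUES ('wf-1', '', 0)")
conn.commit()

def read_state(conn, workflow_id):
    row = conn.execute(
        "SELECT state, version FROM workflow_state WHERE id = ?", (workflow_id,)
    ).fetchone()
    return row[0], row[1]

def try_save(conn, workflow_id, new_state, expected_version):
    # The write only succeeds if nobody bumped the version since we read it.
    cursor = conn.execute(
        "UPDATE workflow_state SET state = ?, version = version + 1 "
        "WHERE id = ? AND version = ?",
        (new_state, workflow_id, expected_version),
    )
    conn.commit()
    return cursor.rowcount == 1

def update_with_retry(conn, workflow_id, transform, max_attempts=5):
    for _ in range(max_attempts):
        state, version = read_state(conn, workflow_id)
        if try_save(conn, workflow_id, transform(state), version):
            return True
    return False  # give up after repeated conflicts

# Worker 1 reads the state, then worker 2 updates it first.
stale_state, stale_version = read_state(conn, "wf-1")
update_with_retry(conn, "wf-1", lambda s: s + "A")

# Worker 1's write is rejected because its version is stale, so it re-reads and retries.
stale_ok = try_save(conn, "wf-1", stale_state + "B", stale_version)
retried_ok = update_with_retry(conn, "wf-1", lambda s: s + "B")
final = read_state(conn, "wf-1")

print(stale_ok, retried_ok, final)  # False True ('AB', 2)
```

Putting the version check inside the `UPDATE` statement makes the comparison and the write a single atomic step, so there is no gap between verifying the version and saving the state.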
These changes work together to create a significantly more robust orchestration engine that maintains strict key-ordering guarantees even under heavy load, consumer failures, or scaling events.
What This Means for Infinitic Users
If you're upgrading to Infinitic 0.18.0, you'll get more reliable mission-critical workflows, particularly when workers scale up or down or are rebalanced.
Important upgrade note: This release requires a database migration to add a version column to your workflow state tables. This version column is essential for the new optimistic locking mechanism. We've provided migration scripts for all supported databases in our documentation, and the migration process is straightforward. Be sure to run these migrations before deploying the new version to production environments.
Looking Forward
Key-ordering isn't a flashy feature, but it's fundamental to the correctness guarantees that make Infinitic valuable for critical business workflows.
The solutions we've implemented not only fix the immediate issues but strengthen Infinitic's architecture for future scalability.
As always, we welcome your feedback on this release. The real-world scenarios you bring to Infinitic continue to push us to build a better product.