System Design: Forward Retry Mechanism In Distributed Architecture

A Forward Retry is a mechanism in which a failed operation against an external service is automatically attempted again after a delay. Its primary purpose is to handle transient network failures and temporary outages, thereby improving the reliability of the system.

I recently faced this situation at work, where I had to make sure that my service always stays in sync with an external system, even after a temporary failure. I chose this implementation because it is simple and follows an optimistic flow. By optimistic flow, I mean designing the service to proceed on the assumption that the ‘happy path’ will succeed, while ensuring that a safety net (the reconciliation service) is in place for when it doesn’t.

That way, if the call fails because of a temporary outage or network failure, a separate reconciliation service (we will talk about it in a minute) can look at the recorded intent and retry the flow to fix the state.

Why Use this Pattern?

You would use this pattern when you don’t want your system to be out-of-sync and want a reliable way to reconcile the state on failure instead of blocking it.

It’s not a new pattern. It is a pragmatic alternative to heavier solutions like Two-Phase Commit (2PC), which often introduce blocking and high latency in distributed transactions.

This pattern has several variants, which you might know by the following names:

transactional outbox,
reconciliation pattern,
saga/compensating transactions (for multi-step workflows),
CDC –> publisher approaches.

These patterns became widespread as microservices and highly available web systems replaced heavy distributed locking/2PC solutions.

Idempotency

Another important consideration when implementing this design pattern is to make sure the downstream services are idempotent.

When the downstream services are idempotent, it is much easier to repeat a create call without worrying about duplicates. This is what makes retry scenarios like the one we tackle below safe.

stateDiagram-v2
    direction TB
    %% Start
    [*] --> Idle

    %% Flow
    Idle --> ReceiveRequest: "Client → Request (idempotencyKey)"
    ReceiveRequest --> CheckSeen: "Lookup idempotencyKey"

    CheckSeen --> NotSeen: "Key NOT found"
    CheckSeen --> Seen: "Key found"

    NotSeen --> Execute: "Perform action / side-effect"
    Execute --> Record: "Persist result ⟶ (idempotencyKey ↦ response)"
    Record --> ReturnSuccess: "Return success to client"
    ReturnSuccess --> [*]

    Seen --> ReturnStored: "Return stored response (no re-run)"
    ReturnStored --> [*]

    %% Errors & retries
    Execute --> Failure: "Transient error"
    Failure --> Retry: "Retry (backoff)"
    Retry --> Execute

    %% Visual styling
    classDef stateBlue fill:#e8f0ff,stroke:#1565c0,stroke-width:2;
    classDef stateYellow fill:#fff8e6,stroke:#f57c00,stroke-width:2;
    classDef stateGreen fill:#e6fff0,stroke:#2e7d32,stroke-width:2;
    classDef stateRed fill:#fff0f0,stroke:#c62828,stroke-width:2;
    class ReceiveRequest,CheckSeen stateBlue
    class NotSeen,Execute,Record stateYellow
    class ReturnSuccess,ReturnStored stateGreen
    class Failure,Retry stateRed
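
To make the diagram concrete, here is a minimal in-memory sketch of the same flow. The class and method names (IdempotentExecutor, execute, IdempotencyDemo) are hypothetical and not taken from the repository, and a real service would persist the keys in a durable store rather than in a map. If the action throws, nothing is recorded, so a later retry runs it again, which matches the Failure → Retry → Execute loop.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical in-memory idempotency guard (assumption: keys fit in memory).
final class IdempotentExecutor {
    // idempotencyKey -> stored response
    private final Map<String, String> responses = new ConcurrentHashMap<>();

    // Runs the action at most once per key; repeat calls return the stored response.
    // If the action throws, no entry is stored and the caller may retry.
    String execute(String idempotencyKey, Supplier<String> action) {
        return responses.computeIfAbsent(idempotencyKey, key -> action.get());
    }
}

final class IdempotencyDemo {
    // Sends the same request twice and returns how often the side effect really ran.
    static int sideEffectCountForDuplicateRequests() {
        AtomicInteger sideEffects = new AtomicInteger();
        IdempotentExecutor executor = new IdempotentExecutor();
        Supplier<String> createPolicy = () -> "ext-" + sideEffects.incrementAndGet();

        String first = executor.execute("policy-123", createPolicy);
        String retry = executor.execute("policy-123", createPolicy); // simulated retry
        if (!first.equals(retry)) {
            throw new IllegalStateException("retry returned a different response");
        }
        return sideEffects.get();
    }
}
```

Calling IdempotencyDemo.sideEffectCountForDuplicateRequests() exercises the "Key NOT found" path once and the "Key found" path once.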

Takeaway: it’s an industry standard approach when you need reliability and scalability without global transactions.

How can we develop this and see it in action? Let’s assume a mock scenario.

Problem Statement

Scenario: You are a senior engineer on your team, and you are tasked with developing a Policy Management Service that calls external Policy Provider(s) to store and manage policies for the application teams. Since the external provider can change, we are developing a central service to provide that abstraction for our organization.

The Policy Management Service should always be in sync with the external policy provider service.

Demo the following failure scenarios and reconciliation tactics for each:

Policy Management Service crashes before making the call to external service
Policy Management Service crashes after making a successful call to external service
External Service fails to create the record and errors out

The above three scenarios broadly cover the different stages of reconciliation. There could be more stages; add them based on how much robustness you need.

Designing the Architecture

Let’s look at the high-level design first. Then we will break each flow into its own sequence diagram so it’s easier to follow along.

User calls the DNS server with the domain
DNS returns the IP address of the load balancer of your service
User calls the Policy Management Service to create the policy
Policy Management Service creates a Policy:
- Stores it in the local DB
- Triggers the delayed reconciliation workflow
- Calls the external policy provider service to create the policy
External service receives the request and either:
- Creates the policy in its database and returns externalPolicyId
- Fails validation and returns 400 Bad Request
- Crashes and returns 500 Internal Server Error
Reconciliation worker wakes up after, let’s say, 5 seconds, reads the policy from the local database by its policy id, and tries to reconcile. There are 3 cases:
- the policy is in the CREATE_PENDING state
- the policy is in the ACTIVE state
- the policy is in the FAILED state

Here’s a high-level flow diagram (top to bottom).

---
title: High Level Architecture Policy Management Service
---
flowchart TD
    user["User"] -->|compensatingaction.bemyaficionado.com| dns["DNS"]
    dns -->|ip address| user
    user --> lb["Load Balancer"]

    lb --> pms

    pms["Policy Management Service"]
    pms -->|initiate delayed reconciliation workflow| reconciliation_service["Reconciliation Service"]
    pms -->|create policy with status 'pending'| localdb[("Local Policy Store DB")]
    pms -->|create policy| external_provider["External Policy Provider"]


    subgraph "Reconciliation Flow"
    reconciliation_service -->check_status{"Check Status?"}
    check_status -->|CREATE_PENDING|create_policy[["Create Policy"]]
    end

    reconciliation_service -->|fetch transaction state of 'policy'|localdb

    create_policy -->|"Create policy with the same parameters"|external_provider
    create_policy -->|"Update external_id in Local DB"| localdb

    

    subgraph "External Policy Provider"
    external_provider -->|create and store policy| policy_store_db[("Policy Store DB")]
    policy_store_db .->|success| external_provider
    end

Scenario 1 & 2/ Crash Before Making Call to External Policy Provider Service

First, let’s tackle the first scenario where the Policy Management Service crashes before making the call to the external policy provider.

PMS schedules the reconciliation service with a delay of 5 seconds.
- The 5-second delay is arbitrary and chosen for illustration. In reality, if the current sequence takes less than 200ms to complete, then a delay of 500ms or 1000ms is more than enough before triggering the reconciliation process.
- The main point is that the reconciliation process should start after the current process has completed.
Policy Management Service (PMS) creates and stores the Policy object in its local db with status='CREATE_PENDING'
PMS crashes afterwards.
- At this point we don’t know whether the policy was created at the External Provider or not. This is where the idempotency of the services (discussed above) becomes useful. Idempotency in this case means I can repeat this call as many times as needed without any additional side effect.
Reconciliation Service starts after the set delay.
- Reads the status from the local db with the policyId. It finds: status='CREATE_PENDING'
- It calls the external policy provider service to create the policy.
- Updates the external policy id and the status in the database.
  - external_id={ExternalPolicyId}
  - status='ACTIVE'
Reconciliation successful
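
The reconciliation steps above can be sketched with a plain ScheduledExecutorService. This is only an illustration under my own assumptions: PolicyStatus, PolicyRecord, DelayedReconciler and ReconcilerDemo are hypothetical stand-ins rather than the classes from the repository, the local db is a map, and the external provider is modelled as an idempotent function from policyId to externalPolicyId.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.UnaryOperator;

// Hypothetical stand-ins for illustration only.
enum PolicyStatus { CREATE_PENDING, ACTIVE, FAILED }

final class PolicyRecord {
    final String id;
    final String externalId;
    final PolicyStatus status;

    PolicyRecord(String id, String externalId, PolicyStatus status) {
        this.id = id;
        this.externalId = externalId;
        this.status = status;
    }
}

final class DelayedReconciler {
    private final Map<String, PolicyRecord> localDb;
    private final UnaryOperator<String> idempotentCreate; // policyId -> externalPolicyId
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    DelayedReconciler(Map<String, PolicyRecord> localDb, UnaryOperator<String> idempotentCreate) {
        this.localDb = localDb;
        this.idempotentCreate = idempotentCreate;
    }

    // Schedules one reconciliation pass for the policy after the given delay.
    ScheduledFuture<?> schedule(String policyId, long delayMillis) {
        return scheduler.schedule(() -> reconcile(policyId), delayMillis, TimeUnit.MILLISECONDS);
    }

    // Only CREATE_PENDING records need repair; ACTIVE and FAILED are left untouched.
    void reconcile(String policyId) {
        PolicyRecord current = localDb.get(policyId);
        if (current == null || current.status != PolicyStatus.CREATE_PENDING) {
            return;
        }
        // Safe to repeat because the provider call is assumed to be idempotent.
        String externalId = idempotentCreate.apply(policyId);
        localDb.put(policyId, new PolicyRecord(policyId, externalId, PolicyStatus.ACTIVE));
    }

    void shutdown() {
        scheduler.shutdown();
    }
}

final class ReconcilerDemo {
    // Simulates "PMS crashed after writing CREATE_PENDING" and returns the repaired record.
    static PolicyRecord runCrashScenario() {
        Map<String, PolicyRecord> localDb = new ConcurrentHashMap<>();
        DelayedReconciler reconciler =
                new DelayedReconciler(localDb, policyId -> "ext-" + policyId);
        try {
            ScheduledFuture<?> pending = reconciler.schedule("policy-1", 50);
            localDb.put("policy-1", new PolicyRecord("policy-1", "", PolicyStatus.CREATE_PENDING));
            // The PMS "crashes" here: it never calls the provider or updates the record.
            pending.get(2, TimeUnit.SECONDS); // wait for the worker to run
            return localDb.get("policy-1");
        } catch (Exception e) {
            throw new IllegalStateException("reconciliation did not complete", e);
        } finally {
            reconciler.shutdown();
        }
    }
}
```

The 50ms delay in the demo is only there to keep it fast; the reasoning about choosing the delay is the same as for the 5 seconds used in the walkthrough.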

---
title: Policy Management Service Crash Before Making Call to External Policy Provider Service
---
sequenceDiagram
    title: Policy Management Service Crashes before calling external Policy Provider
    participant pms as PolicyManagementService
    participant localdb as LocalDB
    participant policyprovider as ExternalPolicyProviderService
    participant externaldb as ExternalDB
    participant reconciliation as Reconciliation Service

    pms ->> reconciliation: initiate delayed reconciliation<br/> with `PolicyId`<br/>(5 seconds delay)
    pms ->>+ localdb: create policy with ID and Status = 'CREATE_PENDING'
    localdb -->>- pms: success

    rect rgb(230,50,50)
    pms -x policyprovider: crashed
    end

    reconciliation ->>+ localdb: read record by `PolicyId`
    localdb -->>-reconciliation: Policy record with status 'CREATE_PENDING'
    reconciliation ->>+ policyprovider: create policy with same details <br/>(Idempotent)
    policyprovider ->>+ externaldb: create policy
    externaldb -->>- policyprovider: success
    policyprovider -->>- reconciliation: `ExternalPolicyID`
    reconciliation ->>+ localdb: update status='ACTIVE', externalId=`ExternalPolicyID`
    localdb -->>- reconciliation: success

This is the implementation that mimics the PMS crashing after it has written the policy to the local db and called the external service to create the policy. I return null right after calling the external policy provider to mimic the crash.

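public Policy crashAfterCallingExternalService(CreatePolicyRequest createPolicyRequest) {
    String policyId = UUID.randomUUID().toString();
    // Schedule the delayed reconciliation first, so the safety net exists before anything can fail.
    this.reconciliationService.scheduleReconciliation(policyId);
    // Record the intent locally with an empty external id.
    var policy = new Policy(policyId, "", Status.CREATE_PENDING, createPolicyRequest.description(), createPolicyRequest.statement());
    db.put(policyId, policy);
    // The external call succeeds, but its result is never written back to the local db.
    String externalId = this.externalService.createPolicy(policy);

    // Simulated crash: return before the local record can be marked ACTIVE.
    return null;
}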

Here’s the test for that implementation.

@SneakyThrows
@Test
void it_should_mimic_server_crash_when_the_policy_has_been_created_successfully_in_external_service() {
    var externalService = new ExternalService(externalServiceDb, Map.of("CREATE_POLICY", true));
    var reconciliationService = new ReconciliationService(policyServiceDb, externalService);
    var policyService = new PolicyService(policyServiceDb, externalService, reconciliationService);

    CreatePolicyRequest testCreatePolicyRequest = new CreatePolicyRequest("This is a test policy", "permit(principal, action, resource);");
    Policy output = policyService.crashAfterCallingExternalService(testCreatePolicyRequest);

    assertNull(output);
    assertEquals(1, policyServiceDb.estimatedSize());
    assertEquals(1, externalServiceDb.estimatedSize());
    // verify the policy service crashed with 'CREATE_PENDING' state
    var keys = policyServiceDb.asMap().keySet();
    assertFalse(keys.isEmpty());
    String key = keys.stream().findFirst().orElseThrow();
    Policy failedPolicy = policyServiceDb.asMap().get(key);
    assertEquals(Status.CREATE_PENDING, failedPolicy.status());
    assertTrue(failedPolicy.externalID().isEmpty());
    // verify that policy was created by the external service successfully, thus, inconsistent state
    Policy createdPolicy = externalServiceDb.asMap().get(failedPolicy.id());
    assertFalse(createdPolicy.externalID().isBlank());
    assertEquals(Status.ACTIVE, createdPolicy.status());

    // verify that the reconciliation service is working properly to reconcile the state
    awaitSchedulerExecution();
    assertEquals(Status.ACTIVE, policyServiceDb.asMap().get(failedPolicy.id()).status());
}

The above test checks the state at each step: inconsistent right after the simulated crash, and back in sync once reconciliation has run.

Scenario 3/ External Service fails to create the record and errors out

Now, let’s look at the third scenario, where the Policy Management Service was able to call the external policy provider service, but that service failed.

%%{
    init: {
        'theme': 'light', 
        'themeCSS': '.messageLine0:nth-of-type(4) { stroke: red; textcolor: red;};'
    }
}%%
sequenceDiagram
    title: External Policy Provider fails to create the policy
    participant pms as PolicyManagementService
    participant localdb as LocalDB
    participant policyprovider as ExternalPolicyProviderService
    participant externaldb as ExternalDB


    pms ->> localdb: create policy with ID and Status = 'CREATE_PENDING'
    localdb -->> pms: success
    pms ->> policyprovider: create policy
    policyprovider -x pms: FAILURE
    pms ->> localdb: update status as 'FAILED'

Let me write a test case to better explain to you what we are testing here.

class PolicyServiceTest {

    private Cache<String, Policy> policyServiceDb;
    private Cache<String, Policy> externalServiceDb;

    @BeforeEach
    void setUp() {
        policyServiceDb = Caffeine.newBuilder()
                .expireAfterWrite(1, TimeUnit.DAYS)
                .maximumSize(1000)
                .build();

        externalServiceDb = Caffeine.newBuilder()
                .expireAfterWrite(1, TimeUnit.DAYS)
                .maximumSize(1000)
                .build();
    }

    @AfterEach
    void tearDown() {
        policyServiceDb.invalidateAll();
        externalServiceDb.invalidateAll();
    }

    @SneakyThrows
    @Test
    void it_should_update_the_record_as_failed_in_db_if_external_service_fails() {
        var externalService = new ExternalService(externalServiceDb, Map.of("THROW_EXCEPTION", true));
        var reconciliationService = new ReconciliationService(policyServiceDb, externalService);
        var policyService = Mockito.spy(new PolicyService(policyServiceDb, externalService, reconciliationService));

        CreatePolicyRequest testCreatePolicyRequest = new CreatePolicyRequest("This is a test policy", "permit(principal, action, resource);");

        assertThrows(CreatePolicyException.class, () -> {
            Policy output = policyService.compensateActionsIfExternalServiceFailsToCreatePolicy(testCreatePolicyRequest);
        });
        awaitSchedulerExecution();

        assertEquals(1, policyServiceDb.estimatedSize());


        var keys = policyServiceDb.asMap().keySet();
        assertFalse(keys.isEmpty());
        String key = keys.stream().findFirst().orElseThrow();
        Policy failedPolicy = policyServiceDb.asMap().get(key);
        assertEquals(Status.FAILED, failedPolicy.status());
    }
}

Here we make sure that the status of the policy in the database is updated to FAILED. And since the Policy Management Service itself didn’t fail, it can perform the compensating action itself. There is no need for the reconciliation service in this case.

Here’s how the code will work.

public Policy compensateActionsIfExternalServiceFailsToCreatePolicy(CreatePolicyRequest createPolicyRequest) {
    String policyId = UUID.randomUUID().toString();
    this.reconciliationService.scheduleReconciliation(policyId);
    var policy = new Policy(policyId, "", Status.CREATE_PENDING, createPolicyRequest.description(), createPolicyRequest.statement());
    db.put(policyId, policy);

    try {
        String externalId = this.externalService.createPolicy(policy);
        // Happy path: record the external id and mark the policy ACTIVE.
        Policy createdPolicy = policy.withExternalID(externalId).withStatus(Status.ACTIVE);
        this.db.put(policyId, createdPolicy);
        return createdPolicy;
    } catch (CreatePolicyException ex) {
        // Compensating action: the service is still alive, so it marks the record FAILED itself.
        this.db.put(policyId, policy.withStatus(Status.FAILED));
        // Rethrow the original exception so its details are not lost.
        throw ex;
    }
}

And when I run the test it passes. That means the state is correct.

Consideration for Production Systems

1/ Adopt the Outbox Pattern using Change Data Capture (CDC)

Instead of making a call to schedule reconciliation like I did in the example above, it is more reliable to rely on a database-level trigger, such as a Change Data Capture (CDC) mechanism. Whenever a record is inserted in the db, the change is sent to a queue, which triggers the reconciliation pipeline. The reconciliation pipeline gets the data automatically. That is far more robust and reliable than making a service call at the start of your execution.

flowchart TD

    db[("Database")] -->|CDC|queue[/queue/]
    queue -->|"Read CDC records"| ReconciliationService
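
As a rough, in-memory illustration of that flow (not a real CDC setup), the sketch below records a change event in the same step as the insert and lets a consumer drain the events later. ChangeCapturingStore, insertPending, drainTo and CdcDemo are hypothetical names; in production the change log would be the database's own log feeding a real queue.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.function.Consumer;

// Hypothetical stand-in for "database + CDC stream".
final class ChangeCapturingStore {
    private final Map<String, String> rows = new ConcurrentHashMap<>();    // policyId -> status
    private final Queue<String> changeLog = new ConcurrentLinkedQueue<>(); // captured inserts

    // The row and its change event are written together, so no insert goes unnoticed.
    synchronized void insertPending(String policyId) {
        rows.put(policyId, "CREATE_PENDING");
        changeLog.add(policyId);
    }

    String statusOf(String policyId) {
        return rows.get(policyId);
    }

    // Hands every captured change to the reconciler; returns how many were delivered.
    int drainTo(Consumer<String> reconciler) {
        int delivered = 0;
        String policyId;
        while ((policyId = changeLog.poll()) != null) {
            reconciler.accept(policyId);
            delivered++;
        }
        return delivered;
    }
}

final class CdcDemo {
    // Inserts two policies and returns the ids the reconciliation consumer received.
    static List<String> run() {
        ChangeCapturingStore store = new ChangeCapturingStore();
        store.insertPending("policy-1");
        store.insertPending("policy-2");

        List<String> seenByReconciler = new ArrayList<>();
        store.drainTo(seenByReconciler::add);
        return seenByReconciler;
    }
}
```

The point of the design is that the service no longer has to remember to schedule reconciliation: writing the record is enough to get it picked up.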

2/ Ensure Idempotency

Use a stable business idempotency key (e.g. your PolicyId) in calls to the external provider.
The external provider should support idempotent creation (either dedupe by client id or return existing if already created).
Locally, the worker must handle duplicate success responses safely (update with ON CONFLICT/upsert).
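
For the local side, a duplicate-safe update could look roughly like this. It is an in-memory analogue of an upsert, with the hypothetical names ActivationLedger and markActive; a real implementation would use the database's own ON CONFLICT/upsert support.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory analogue of an upsert keyed by policyId.
final class ActivationLedger {
    private final Map<String, String> externalIdByPolicyId = new ConcurrentHashMap<>();

    // Returns true if this call activated the policy, false if it was a harmless duplicate.
    boolean markActive(String policyId, String externalId) {
        String existing = externalIdByPolicyId.putIfAbsent(policyId, externalId);
        if (existing == null) {
            return true; // first success response wins
        }
        if (!existing.equals(externalId)) {
            // Two different external ids for one policy means the provider did not deduplicate.
            throw new IllegalStateException("conflicting external id for " + policyId);
        }
        return false; // same response seen again, nothing to change
    }

    String externalIdOf(String policyId) {
        return externalIdByPolicyId.get(policyId);
    }
}
```

Applying the same success response twice leaves the state unchanged, so the request path and the reconciliation worker can both report success without stepping on each other.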

3/ Retries & backoff

Implement exponential backoff with a max attempts counter.
For persistent failures, move to dead-letter / manual reconciliation queue.
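
A minimal sketch of both points, assuming a doubling delay with a cap and a plain list standing in for the dead-letter queue. RetryWithBackoff, backoffMillis and run are hypothetical names, and the sketch treats every RuntimeException as transient, which a real system would narrow down.

```java
import java.util.List;

// Hypothetical helper: bounded retries with exponential backoff and a dead-letter list.
final class RetryWithBackoff {
    // Delay before the next try after attempt number `attempt` (1-based): base * 2^(attempt-1), capped.
    static long backoffMillis(int attempt, long baseMillis, long capMillis) {
        int exponent = Math.min(Math.max(attempt - 1, 0), 20); // bounded shift, assumes a modest base
        return Math.min(baseMillis << exponent, capMillis);
    }

    // Returns true on success. After maxAttempts failures the task id is dead-lettered.
    static boolean run(String taskId, Runnable task, int maxAttempts,
                       long baseMillis, long capMillis, List<String> deadLetters) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                task.run();
                return true;
            } catch (RuntimeException transientFailure) {
                if (attempt == maxAttempts) {
                    break; // out of attempts, fall through to dead-lettering
                }
                try {
                    Thread.sleep(backoffMillis(attempt, baseMillis, capMillis));
                } catch (InterruptedException interrupted) {
                    Thread.currentThread().interrupt(); // preserve the interrupt and stop retrying
                    break;
                }
            }
        }
        deadLetters.add(taskId); // left for manual reconciliation
        return false;
    }
}
```

With a base of 100ms and a cap of 10 seconds, the delays are 100ms, 200ms, 400ms and so on until they hit the cap. Many production systems also add random jitter to these delays, which this sketch leaves out.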

Conclusion

Today we looked at a widely used mechanism for making a system more robust when you need to keep two services in sync without dealing with costly and complex distributed transactions. There are many variants and flavours of this pattern, which you can adopt based on your requirements.

I’ve not covered all the cases, as our example was quite simple and straightforward, but covering them becomes important when you are dealing with a production use case. Listing all possible failure scenarios up front makes it easier to cover them in your reconciliation service.

You can follow the complete code in my github repository here: Forward Retry Mechanism System Design
