feat: implement per-account token refresh synchronization (!184) · Merge requests · e / os / AccountManager

Description

The token refresh logic in OidcTokenRefresher is improved to minimize the time spent in synchronized blocks, thereby increasing concurrency for token refresh operations across different accounts.

Key changes:

The synchronized block is now only used to check for and register an ongoing refresh operation. The actual network I/O for performTokenRefresh is moved outside of the lock.
This prevents a long-running refresh for one account from blocking other threads that need to check the status or start a refresh for a different account.
Instead of re-reading the AuthState within performTokenRefresh, the state read inside the initial synchronized block is captured and passed along. This avoids potential race conditions where the state could change between the check and the refresh action.
The AuthState is now persisted by writeAuthState only after a successful token refresh, preventing the storage of a failed state.
Cleanup of the ongoingRefreshOperations map is made safer by removing the entry only if it matches the completed operation's future.

Technical details

Hypothetical Simulation

Click to expand

3 Accounts, 4 Threads Each

Let's imagine this scenario:

Account A: Threads A1, A2, A3, A4
Account B: Threads B1, B2, B3, B4
Account C: Threads C1, C2, C3, C4

Timeline:

T0: All 12 threads call refreshAuthState simultaneously

T1: Thread A1 acquires Account A's lock, checks ongoingRefreshOperations["Account A"] which is empty, registers a new operation and proceeds to network call

T2: Thread A2, A3, A4 try to acquire Account A's lock, but wait since A1 has it

T3: Thread B1 acquires Account B's lock (different lock from A), checks ongoingRefreshOperations["Account B"] which is empty, registers a new operation and proceeds to network call

T4: Thread B2, B3, B4 try to acquire Account B's lock, but wait since B1 has it

T5: Thread C1 acquires Account C's lock (different lock from A and B), checks ongoingRefreshOperations["Account C"] which is empty, registers a new operation and proceeds to network call

T6: Thread C2, C3, C4 try to acquire Account C's lock, but wait since C1 has it

T7: Thread A1 performs network operation, refreshes token, returns

T8: Thread A2 gets the lock, checks ongoingRefreshOperations again (now empty), sees the token is already refreshed, returns cached AuthState

T9: Thread A3 gets the lock, sees token is already refreshed, returns cached AuthState

T10: Thread A4 gets the lock, sees token is already refreshed, returns cached AuthState

T11: Thread B1 performs network operation, refreshes token, returns

T12: Thread B2 gets the lock, sees token is already refreshed, returns cached AuthState

T13: Thread B3 gets the lock, sees token is already refreshed, returns cached AuthState

T14: Thread B4 gets the lock, sees token is already refreshed, returns cached AuthState

T15: Thread C1 performs network operation, refreshes token, returns

T16: Thread C2 gets the lock, sees token is already refreshed, returns cached AuthState

T17: Thread C3 gets the lock, sees token is already refreshed, returns cached AuthState

T18: Thread C4 gets the lock, sees token is already refreshed, returns cached AuthState

Result:

Only 3 network calls occurred (one for each account)
All threads complete successfully
No race conditions occurred
Accounts ran in parallel (A, B, C operations happened simultaneously)

Simulation walkthrough

Click to expand

Timeline:

T0 — All 12 threads call refreshAuthState(...)

Each thread computes accountKey = generateAccountKey(account.name, account.type).
Each thread does val accountLock = accountLocks.computeIfAbsent(accountKey) { Any() }. That gives each account its own monitor object. Threads for A share one lock, B share another, C another.

T1 — Thread A1 acquires synchronized(accountLock) for Account A

Inside the short synchronized block the code does:
- existingOp = ongoingRefreshOperations[accountKey] → null (no one started).
- capturedAuthState = readAuthState() reads current state.
- Checks isInvalidGrant(...) → not the case.
- Checks isAuthorized && accessToken != null && !needsTokenRefresh → false (needs refresh).
- Creates newOp = CompletableFuture<AuthState?>() and registers ongoingRefreshOperations[accountKey] = newOp.
- starterOp = newOp.
Synchronized block returns. starterOp is non-null so A1 will perform the actual network refresh outside the lock.

T2 — Threads A2, A3, A4 try to enter synchronized(accountLock) for Account A

They block at the monitor entrance because A1 held the lock at T1 while it executed the short critical section. Once A1 exits the short block, one of them proceeds.

T3 — Thread B1 acquires synchronized(accountLock) for Account B and follows same path as A1 for Account B

B1 registers its own CompletableFuture under ongoingRefreshOperations["B"] and will run the network refresh outside B's lock.

T4 — Threads B2..B4 queue on B's lock.

T5 — Thread C1 acquires C lock, registers new future, releases lock, will run refresh.

T6 — Threads C2..C4 queue on C lock.

Important concurrency invariant achieved so far: each account has its own short synchronized region used only to read state and register a future. Network calls will run outside those synchronized regions. That allows A1, B1, C1 to perform network operations concurrently.

T7 — A1 performs performTokenRefresh(...) outside the lock

performTokenRefresh:
- Uses the captured authState and clientAuth.
- Calls authState.performActionWithFreshTokens(...).
- On success writeAuthState(authState) is called and authStateFuture.complete(authState).
- performTokenRefresh returns the refreshed AuthState.
Back in refreshAuthState, A1 does op.complete(result) where op is its starterOp. Then the finally block removes the future via ongoingRefreshOperations.remove(accountKey, op) (remove only if same future), then returns result.

T8 — A2 acquires A lock (after A1 finished short synchronized registration and released the lock)

Two possible subcases depending on timing:

Case 1: A2 arrives while A1 is still performing network call (but after A1 released the lock).
- Inside synchronized block existingOp = ongoingRefreshOperations["A"] returns the CompletableFuture created by A1. The code sets existingOp and leaves the synchronized block without waiting.
- Outside synchronized block existingOp.join() is called. This waits for A1's future to complete. When A1 calls op.complete(result) the join unblocks and A2 receives the same AuthState. A2 returns that result.
Case 2: A2 arrives after A1 has completed refresh and ongoingRefreshOperations.remove(accountKey, op) completed.
- existingOp = null.
- The code calls capturedAuthState = readAuthState() inside the synchronized short region. Because A1 already wrote the refreshed state via writeAuthState, readAuthState() sees the updated token. The code checks isAuthorized && accessToken != null && !needsTokenRefresh → true. The synchronized block returns capturedAuthState. starterOp is null so method returns the cached/refreshed AuthState immediately without any join or network call.

Either subcase yields the result described in your timeline: A2 either waited on the future or immediately returned the cached refreshed state. Same logic applies to A3 and A4.

T9–T10 — A3 and A4 behave exactly like A2

If they enter while A1 is still running they observe existingOp and join it.
If they enter after the future was removed they read and return cached AuthState.

T11–T14 — B threads follow the same pattern as A threads but independent because B uses its own lock and own ongoingRefreshOperations["B"]. B1 does the refresh concurrently with A1 and C1. B2..B4 either join B1's future or return the cached state after completion.

T15–T18 — C threads follow same pattern as A and B.

Why only 3 network calls happen (one per account)

The first thread for each account creates and publishes a single CompletableFuture while holding the per-account lock. That future acts as the canonical in-flight marker. Other threads either observe that future and join() it, or if the first thread already finished and removed the future, they see the refreshed AuthState and return cached state. The code never starts a second refresh for the same account unless the first future has been removed and the state again requires refresh.

Why accounts run in parallel

Locks are accountLocks[accountKey]. A, B, C use different keys. Synchronized regions are intentionally short and the network call runs outside the synchronized block. So A1, B1, C1 perform performTokenRefresh(...) concurrently.

Why waiting threads do not block other accounts

Waiting threads for account A only contend for A's lock. B and C use different locks and their threads proceed independently.

Why join happens outside synchronization

The code captures existingOp inside the synchronized block and calls existingOp.join() after leaving the synchronized block. That avoids holding the lock while waiting. This allows other threads for the same account to also observe the future and join it concurrently. It also allows other accounts to proceed.

Cleanup correctness

The code removes the future with ongoingRefreshOperations.remove(accountKey, op) in the finally. That guarantees we only remove the future if it is the same one we registered. This prevents accidentally removing a newer future if a subsequent refresh started concurrently after the first completed.

Edge behaviors and how code handles them (brief)

If performTokenRefresh fails and op.completeExceptionally(e) runs, waiting threads un-block with join() that throws CompletionException. The code logs and propagates failure as intended.
If a waiting thread calls readAuthState() after the future was removed but before writeAuthState persisted the updated token, it might see the old state. In this implementation the write to persistent state (writeAuthState) happens inside performTokenRefresh before op.complete(...), so persisted state is updated before observers are unblocked. That ordering ensures readers that check readAuthState() after completion see the updated token.

Summary mapping to the timeline:

T0: all threads compute keys and locks.
T1/T3/T5: A1/B1/C1 register new futures while holding their account locks then release lock and start network refresh.
T2/T4/T6: other threads for those accounts queue briefly for the lock or arrive to find an existing future.
T7/T11/T15: A1/B1/C1 call performTokenRefresh concurrently. On success they writeAuthState, op.complete(result), and remove the future.
T8–T10, T12–T14, T16–T18: remaining threads either join the in-flight future or, if they arrive later, read and return the now-cached AuthState. No duplicate network calls occur.

Tests

Issues

https://gitlab.e.foundation/e/os/backlog/-/issues/3743 and https://gitlab.e.foundation/e/os/backlog/-/issues/3559

10 commandments of code review

👪 ❤️ code review guidelines

Edited Nov 12, 2025 by Fahim M. Choudhury

feat: implement per-account token refresh synchronization