feat: implement per-account token refresh synchronization
Description
The token refresh logic in OidcTokenRefresher is improved to minimize the time spent in synchronized blocks, thereby increasing concurrency for token refresh operations across different accounts.
Key changes:
- The
synchronizedblock is now only used to check for and register an ongoing refresh operation. The actual network I/O forperformTokenRefreshis moved outside of the lock. - This prevents a long-running refresh for one account from blocking other threads that need to check the status or start a refresh for a different account.
- Instead of re-reading the
AuthStatewithinperformTokenRefresh, the state read inside the initialsynchronizedblock is captured and passed along. This avoids potential race conditions where the state could change between the check and the refresh action. - The
AuthStateis now persisted bywriteAuthStateonly after a successful token refresh, preventing the storage of a failed state. - Cleanup of the
ongoingRefreshOperationsmap is made safer by removing the entry only if it matches the completed operation's future.
Technical details
Hypothetical Simulation
Click to expand
3 Accounts, 4 Threads Each
Let's imagine this scenario:
- Account A: Threads A1, A2, A3, A4
- Account B: Threads B1, B2, B3, B4
- Account C: Threads C1, C2, C3, C4
Timeline:
T0: All 12 threads call refreshAuthState simultaneously
T1: Thread A1 acquires Account A's lock, checks ongoingRefreshOperations["Account A"] which is empty, registers a new operation and proceeds to network call
T2: Thread A2, A3, A4 try to acquire Account A's lock, but wait since A1 has it
T3: Thread B1 acquires Account B's lock (different lock from A), checks ongoingRefreshOperations["Account B"] which is empty, registers a new operation and proceeds to network call
T4: Thread B2, B3, B4 try to acquire Account B's lock, but wait since B1 has it
T5: Thread C1 acquires Account C's lock (different lock from A and B), checks ongoingRefreshOperations["Account C"] which is empty, registers a new operation and proceeds to network call
T6: Thread C2, C3, C4 try to acquire Account C's lock, but wait since C1 has it
T7: Thread A1 performs network operation, refreshes token, returns
T8: Thread A2 gets the lock, checks ongoingRefreshOperations again (now empty), sees the token is already refreshed, returns cached AuthState
T9: Thread A3 gets the lock, sees token is already refreshed, returns cached AuthState
T10: Thread A4 gets the lock, sees token is already refreshed, returns cached AuthState
T11: Thread B1 performs network operation, refreshes token, returns
T12: Thread B2 gets the lock, sees token is already refreshed, returns cached AuthState
T13: Thread B3 gets the lock, sees token is already refreshed, returns cached AuthState
T14: Thread B4 gets the lock, sees token is already refreshed, returns cached AuthState
T15: Thread C1 performs network operation, refreshes token, returns
T16: Thread C2 gets the lock, sees token is already refreshed, returns cached AuthState
T17: Thread C3 gets the lock, sees token is already refreshed, returns cached AuthState
T18: Thread C4 gets the lock, sees token is already refreshed, returns cached AuthState
Result:
- Only 3 network calls occurred (one for each account)
- All threads complete successfully
- No race conditions occurred
- Accounts ran in parallel (A, B, C operations happened simultaneously)
Simulation walkthrough
Click to expand
Timeline:
T0 — All 12 threads call refreshAuthState(...)
- Each thread computes
accountKey = generateAccountKey(account.name, account.type). - Each thread does
val accountLock = accountLocks.computeIfAbsent(accountKey) { Any() }. That gives each account its own monitor object. Threads for A share one lock, B share another, C another.
T1 — Thread A1 acquires synchronized(accountLock) for Account A
-
Inside the short synchronized block the code does:
-
existingOp = ongoingRefreshOperations[accountKey]→ null (no one started). -
capturedAuthState = readAuthState()reads current state. - Checks
isInvalidGrant(...)→ not the case. - Checks
isAuthorized && accessToken != null && !needsTokenRefresh→ false (needs refresh). - Creates
newOp = CompletableFuture<AuthState?>()and registersongoingRefreshOperations[accountKey] = newOp. -
starterOp = newOp.
-
-
Synchronized block returns.
starterOpis non-null so A1 will perform the actual network refresh outside the lock.
T2 — Threads A2, A3, A4 try to enter synchronized(accountLock) for Account A
- They block at the monitor entrance because A1 held the lock at T1 while it executed the short critical section. Once A1 exits the short block, one of them proceeds.
T3 — Thread B1 acquires synchronized(accountLock) for Account B and follows same path as A1 for Account B
- B1 registers its own
CompletableFutureunderongoingRefreshOperations["B"]and will run the network refresh outside B's lock.
T4 — Threads B2..B4 queue on B's lock.
T5 — Thread C1 acquires C lock, registers new future, releases lock, will run refresh.
T6 — Threads C2..C4 queue on C lock.
Important concurrency invariant achieved so far: each account has its own short synchronized region used only to read state and register a future. Network calls will run outside those synchronized regions. That allows A1, B1, C1 to perform network operations concurrently.
T7 — A1 performs performTokenRefresh(...) outside the lock
-
performTokenRefresh:- Uses the captured
authStateandclientAuth. - Calls
authState.performActionWithFreshTokens(...). - On success
writeAuthState(authState)is called andauthStateFuture.complete(authState). -
performTokenRefreshreturns the refreshedAuthState.
- Uses the captured
-
Back in
refreshAuthState, A1 doesop.complete(result)whereopis itsstarterOp. Then thefinallyblock removes the future viaongoingRefreshOperations.remove(accountKey, op)(remove only if same future), then returns result.
T8 — A2 acquires A lock (after A1 finished short synchronized registration and released the lock)
-
Two possible subcases depending on timing:
Case 1: A2 arrives while A1 is still performing network call (but after A1 released the lock).
- Inside synchronized block
existingOp = ongoingRefreshOperations["A"]returns theCompletableFuturecreated by A1. The code setsexistingOpand leaves the synchronized block without waiting. - Outside synchronized block
existingOp.join()is called. This waits for A1's future to complete. When A1 callsop.complete(result)the join unblocks and A2 receives the sameAuthState. A2 returns that result.
Case 2: A2 arrives after A1 has completed refresh and
ongoingRefreshOperations.remove(accountKey, op)completed.-
existingOp = null. - The code calls
capturedAuthState = readAuthState()inside the synchronized short region. Because A1 already wrote the refreshed state viawriteAuthState,readAuthState()sees the updated token. The code checksisAuthorized && accessToken != null && !needsTokenRefresh→ true. The synchronized block returnscapturedAuthState.starterOpis null so method returns the cached/refreshedAuthStateimmediately without any join or network call.
- Inside synchronized block
Either subcase yields the result described in your timeline: A2 either waited on the future or immediately returned the cached refreshed state. Same logic applies to A3 and A4.
T9–T10 — A3 and A4 behave exactly like A2
- If they enter while A1 is still running they observe
existingOpand join it. - If they enter after the future was removed they read and return cached
AuthState.
T11–T14 — B threads follow the same pattern as A threads but independent because B uses its own lock and own ongoingRefreshOperations["B"]. B1 does the refresh concurrently with A1 and C1. B2..B4 either join B1's future or return the cached state after completion.
T15–T18 — C threads follow same pattern as A and B.
Why only 3 network calls happen (one per account)
- The first thread for each account creates and publishes a single
CompletableFuturewhile holding the per-account lock. That future acts as the canonical in-flight marker. Other threads either observe that future andjoin()it, or if the first thread already finished and removed the future, they see the refreshedAuthStateand return cached state. The code never starts a second refresh for the same account unless the first future has been removed and the state again requires refresh.
Why accounts run in parallel
- Locks are
accountLocks[accountKey]. A, B, C use different keys. Synchronized regions are intentionally short and the network call runs outside the synchronized block. So A1, B1, C1 performperformTokenRefresh(...)concurrently.
Why waiting threads do not block other accounts
- Waiting threads for account A only contend for A's lock. B and C use different locks and their threads proceed independently.
Why join happens outside synchronization
- The code captures
existingOpinside the synchronized block and callsexistingOp.join()after leaving the synchronized block. That avoids holding the lock while waiting. This allows other threads for the same account to also observe the future and join it concurrently. It also allows other accounts to proceed.
Cleanup correctness
- The code removes the future with
ongoingRefreshOperations.remove(accountKey, op)in thefinally. That guarantees we only remove the future if it is the same one we registered. This prevents accidentally removing a newer future if a subsequent refresh started concurrently after the first completed.
Edge behaviors and how code handles them (brief)
- If
performTokenRefreshfails andop.completeExceptionally(e)runs, waiting threads un-block withjoin()that throwsCompletionException. The code logs and propagates failure as intended. - If a waiting thread calls
readAuthState()after the future was removed but beforewriteAuthStatepersisted the updated token, it might see the old state. In this implementation the write to persistent state (writeAuthState) happens insideperformTokenRefreshbeforeop.complete(...), so persisted state is updated before observers are unblocked. That ordering ensures readers that checkreadAuthState()after completion see the updated token.
Summary mapping to the timeline:
- T0: all threads compute keys and locks.
- T1/T3/T5: A1/B1/C1 register new futures while holding their account locks then release lock and start network refresh.
- T2/T4/T6: other threads for those accounts queue briefly for the lock or arrive to find an existing future.
- T7/T11/T15: A1/B1/C1 call
performTokenRefreshconcurrently. On success theywriteAuthState,op.complete(result), and remove the future. - T8–T10, T12–T14, T16–T18: remaining threads either join the in-flight future or, if they arrive later, read and return the now-cached
AuthState. No duplicate network calls occur.
Tests
Issues
https://gitlab.e.foundation/e/os/backlog/-/issues/3743 and https://gitlab.e.foundation/e/os/backlog/-/issues/3559
