Path CAS Concurrency Checklist

Purpose

Validate that VDA path cache updates are atomic under concurrent SendNextSegment and net-action updates.

Scope

Applies to:

  1. src/Rcs.Infrastructure/Services/Protocol/Vda5050ProtocolService.cs
  2. Redis keys:
    • rcs:vdaPath:{robotId}:{taskId}:{subTaskId}
    • rcs:vdaPath:{robotId}:{taskId}:{subTaskId}:planVersion

Core Assertions

  1. planVersion must increase monotonically.
  2. The version stored in the cache JSON and the value of the :planVersion key must stay consistent (same version).
  3. Concurrent writers must not silently overwrite newer data.
  4. On conflict, the fallback retry should recover in most cases.
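
The assertions above can be illustrated with a minimal in-memory sketch. It is written in Python for brevity and is not the service's implementation: FakePathStore, cas_write, write_with_fallback, and the routeIndex field are hypothetical names, and a lock-protected object stands in for whatever atomic compare-and-set the service performs against Redis.

```python
import json
import threading


class FakePathStore:
    """In-memory stand-in for the cache JSON key and its :planVersion key."""

    def __init__(self):
        self._lock = threading.Lock()  # emulates server-side atomicity
        self.cache_json = json.dumps({"planVersion": 0, "routeIndex": 0})
        self.plan_version = 0

    def read(self):
        with self._lock:
            return json.loads(self.cache_json), self.plan_version

    def cas_write(self, expected_version, new_state):
        """Write only if the stored version still equals expected_version."""
        with self._lock:
            if self.plan_version != expected_version:
                return False  # conflict: a newer version was written
            next_version = expected_version + 1
            # Both values change under the same lock, so they cannot diverge.
            self.cache_json = json.dumps(dict(new_state, planVersion=next_version))
            self.plan_version = next_version
            return True


def write_with_fallback(store, snapshot, version, mutate, max_retries=3):
    """Try a direct CAS; on conflict, re-read the latest state and retry."""
    if store.cas_write(version, mutate(snapshot)):
        return "direct"
    for _ in range(max_retries):
        latest, latest_version = store.read()
        if store.cas_write(latest_version, mutate(latest)):
            return "fallback"
    return "failure"
```

In this sketch a writer holding a stale snapshot never overwrites newer data: its direct attempt is rejected, and the retry applies its change on top of the latest state, so the version only moves forward.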

Case A: Dual SendNextSegment race

  1. Trigger two SendNextSegmentAsync requests for the same robot/task/subTask almost simultaneously.
  2. Expected:
    • one writer wins the direct CAS;
    • the other writer either:
      • fails fast and returns a concurrency-wait response, or
      • succeeds via fallback CAS after reading latest version.
  3. Check logs:
    • VDA5050 - 路径状态并发更新 (in English: "VDA5050 - path state concurrent update")
    • [PathCAS] ... Conflict=... FallbackSuccess=...
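
A race of this shape can be rehearsed outside the service with two threads and a barrier. The sketch below is only an illustration: VersionedCell and race_two_writers are hypothetical names, it exercises the fallback branch only (not the fail-fast response), and the counter names simply mirror the fields of the [PathCAS] log line.

```python
import threading
from collections import Counter


class VersionedCell:
    """Hypothetical stand-in for the cache entry plus its :planVersion key."""

    def __init__(self):
        self._lock = threading.Lock()
        self.version = 0
        self.route_index = 0

    def read(self):
        with self._lock:
            return self.version, self.route_index

    def cas(self, expected, route_index):
        with self._lock:
            if self.version != expected:
                return False
            self.version += 1
            self.route_index = route_index
            return True


def race_two_writers():
    cell = VersionedCell()
    metrics = Counter()
    metrics_lock = threading.Lock()
    both_loaded = threading.Barrier(2)

    def writer(route_index):
        expected, _ = cell.read()
        both_loaded.wait()  # both writers now hold version 0
        ok = cell.cas(expected, route_index)
        with metrics_lock:
            metrics["DirectAttempt"] += 1
            metrics["DirectSuccess" if ok else "Conflict"] += 1
        if not ok:
            latest, _ = cell.read()  # fallback: reload the latest version
            ok = cell.cas(latest, route_index)
            with metrics_lock:
                metrics["FallbackAttempt"] += 1
                metrics["FallbackSuccess" if ok else "Failure"] += 1

    threads = [threading.Thread(target=writer, args=(i,)) for i in (1, 2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return cell, metrics
```

Because both writers load version 0 before either writes, exactly one direct CAS succeeds and the other records a conflict followed by a fallback success.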

Case B: SendNextSegment + NetAction state race

  1. While CheckAndExecuteNetActionsAsync writes Executing/WaitingAsyncResponse, concurrently trigger SendNextSegmentAsync.
  2. Expected:
    • no stale overwrite of NetActionStatus;
    • the final cache contains the latest route index and the expected net-action state.
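
One way to picture the expected outcome: each writer changes only its own field, and on conflict it reapplies that change to the freshly loaded state instead of writing back its stale snapshot. The sketch below assumes this field-level retry; cas, update_field, the RouteIndex field name, and the "Pending" status are hypothetical, and the interleaving is simulated sequentially.

```python
def cas(store, expected_version, new_state):
    """Hypothetical CAS over a plain dict; a real check would run atomically."""
    if store["version"] != expected_version:
        return False
    store["state"] = new_state
    store["version"] = expected_version + 1
    return True


def update_field(store, snapshot, snapshot_version, field, value):
    """Change one field; on conflict, reapply the change to the latest state."""
    if cas(store, snapshot_version, {**snapshot, field: value}):
        return True
    latest, latest_version = store["state"], store["version"]
    return cas(store, latest_version, {**latest, field: value})


store = {"version": 1, "state": {"RouteIndex": 3, "NetActionStatus": "Pending"}}

# Both writers load the same snapshot before either one writes.
snapshot, version = dict(store["state"]), store["version"]

# The net-action writer lands first.
update_field(store, snapshot, version, "NetActionStatus", "Executing")

# The segment writer still holds the stale snapshot ("Pending"), but its
# retry merges the new route index into the latest state.
update_field(store, snapshot, version, "RouteIndex", 4)
```

After both calls the stored state has RouteIndex 4 and NetActionStatus "Executing"; a blind write of the stale snapshot would have reverted the status to "Pending".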

Case C: Hold/Patch conflict with queue update

  1. Force reverse conflict and hold transitions repeatedly.
  2. During hold, trigger patch decision (trim or detour) in parallel.
  3. Expected:
    • route mode transitions are serialized by CAS;
    • HoldNodeCode and HoldUntilUtc are never missing;
    • LastDecisionCode matches one of the latest successful writes.

Case D: PlanVersion key missing recovery

  1. Manually delete only the :planVersion key; keep the cache JSON.
  2. Trigger SendNextSegmentAsync.
  3. Expected:
    • EnsurePathPlanVersionKeyAsync recreates the version key from the version in the loaded cache;
    • subsequent CAS writes proceed.
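
The recovery step can be sketched as follows. This is not the body of EnsurePathPlanVersionKeyAsync: ensure_plan_version_key is a hypothetical helper, a plain dict stands in for Redis, and the JSON field name planVersion is an assumption about the cache layout.

```python
import json


def ensure_plan_version_key(kv, cache_key):
    """Recreate <cache_key>:planVersion from the cached JSON if it is missing."""
    version_key = f"{cache_key}:planVersion"
    if version_key in kv:
        return int(kv[version_key])
    cached = kv.get(cache_key)
    if cached is None:
        return None  # no cache JSON either, so there is nothing to recover from
    version = int(json.loads(cached)["planVersion"])  # assumed field name
    # setdefault keeps a value another writer may have created in the meantime,
    # which is the dict analogue of Redis SET with the NX option.
    return int(kv.setdefault(version_key, str(version)))
```

Seeding the key from the cached version, rather than resetting it to zero, keeps planVersion monotonic across the recovery.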

Operational Signals

Track periodic CAS metric log:

[PathCAS] DirectAttempt={...}, DirectSuccess={...}, Conflict={...}, FallbackAttempt={...}, FallbackSuccess={...}, Failure={...}

Also monitor coordinator decision distribution:

[GlobalNav] DecisionMetrics Continue={...}, Hold={...}, Patch={...}, TopStrategies={...}

Healthy pattern:

  1. DirectSuccess dominates.
  2. Conflict may rise under burst traffic.
  3. FallbackSuccess should be non-zero when conflicts occur.
  4. Failure should remain near zero.
  5. When the short-window conflict ratio is high, the service should emit HighConflictBackoff debug logs and apply a 40-140 ms jitter delay before CAS.
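
The healthy pattern can be checked mechanically from the [PathCAS] line. The sketch below assumes the {...} placeholders render as plain integers; parse_path_cas, looks_healthy, backoff_delay_seconds, and the 1% failure threshold are illustrative choices, not values taken from the service.

```python
import random
import re

FIELD = re.compile(r"(\w+)=(\d+)")


def parse_path_cas(line):
    """Extract integer counters from a '[PathCAS] Name=123, ...' log line."""
    if not line.startswith("[PathCAS]"):
        return None
    return {name: int(value) for name, value in FIELD.findall(line)}


def looks_healthy(m, max_failure_ratio=0.01):
    """Apply the healthy-pattern rules; the 1% threshold is an assumed value."""
    attempts = m["DirectAttempt"]
    if attempts == 0:
        return True  # nothing was written in this window
    direct_dominates = m["DirectSuccess"] > m["Conflict"]
    fallback_recovers = m["Conflict"] == 0 or m["FallbackSuccess"] > 0
    failures_rare = m["Failure"] / attempts <= max_failure_ratio
    return direct_dominates and fallback_recovers and failures_rare


def backoff_delay_seconds(rng=random):
    """Jittered pre-CAS delay drawn uniformly from the 40-140 ms range."""
    return rng.uniform(0.040, 0.140)
```

Randomizing the delay spreads contending writers apart so that they are less likely to collide again on the next attempt.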

Rollback Criteria

Roll back the CAS changes if any of the following occur repeatedly:

  1. planVersion decreases or oscillates.
  2. Cache JSON version != :planVersion key.
  3. Persistently high Failure counts accompanied by stalled dispatch.
  4. Robots stuck in hold/queue due to missing state writes.
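
The first two criteria lend themselves to an automated check over periodic samples. A minimal sketch, assuming each sample is a pair (version in the cache JSON, value of the :planVersion key) collected in observation order; find_version_violations is a hypothetical helper.

```python
def find_version_violations(samples):
    """Return (index, kind) for every mismatch or decrease in the samples."""
    violations = []
    previous = None
    for index, (json_version, key_version) in enumerate(samples):
        if json_version != key_version:
            violations.append((index, "mismatch"))
        if previous is not None and key_version < previous:
            violations.append((index, "decrease"))
        previous = key_version
    return violations
```

If the two values are read with separate commands, a sample taken in the middle of a write may show a transient mismatch, which is one reason to act only on repeated occurrences.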