Global Realtime Avoidance Plan (State-Event Driven)

1. Purpose

This document defines a full implementation plan for:

  1. State-event driven global realtime avoidance.
  2. Compatibility with the existing "send next segment" workflow.
  3. Safely merging the avoidance path with the existing segmented path.

The plan is written for direct implementation by Codex or engineers in this repo.


2. Current Baseline (From Existing Code)

Current code already has these usable capabilities:

  1. State event input and periodic trigger chain:
    • src/Rcs.Infrastructure/Mqtt/MqttMessageHandler.cs
    • HandleStateMessageAsync(...)
    • HandleNewBaseRequestAsync(...)
  2. Segmented path cache and incremental segment sending:
    • src/Rcs.Application/Services/PathFind/Models/VdaSegmentedPathCache.cs
    • src/Rcs.Infrastructure/Services/Protocol/Vda5050ProtocolService.cs
  3. Unified lock service and conflict checks:
    • src/Rcs.Infrastructure/PathFinding/Services/UnifiedTrafficControlService.cs
  4. Existing local navigation/avoidance prototypes:
    • src/Rcs.Infrastructure/PathFinding/Services/NavigationService.cs
    • src/Rcs.Infrastructure/PathFinding/Services/JunctionConflictResolver.cs

Key limitations today:

  1. Avoidance is not globally coordinated as a consistent realtime state machine.
  2. Avoidance decision and segment sending are not integrated through a versioned tail patch protocol.
  3. Some conflict details are still placeholders (for example, retrieval of the robots occupying a junction in JunctionConflictResolver).

3. Design Goals

  1. Event-driven: decisions are re-evaluated on state events, not only at order dispatch time.
  2. Global: decisions based on fleet-wide lock + route snapshot.
  3. Realtime-safe: avoid thrashing and race conditions.
  4. Compatible: keep existing segmented send path and VDA order behavior.
  5. Incremental: no full rewrite of current protocol workflow.

4. Non-goals

  1. Replace VDA message model.
  2. Rewrite all pathfinding algorithms.
  3. Introduce hard dependency on external stream middleware.

5. Target Architecture

5.1 New Components

  1. GlobalNavigationCoordinator (hosted service)
    • Consumes robot navigation events.
    • Maintains per-robot serial decision processing.
    • Produces TailPatch or Hold decision.
  2. NavigationEventPublisher
    • Publishes events from MQTT handlers and send failures.
  3. AvoidanceStrategySelector
    • Chooses among multiple avoidance types.
  4. AvoidanceExecutors (strategy family)
    • WaitNearestExecutor
    • JunctionOccupyExecutor
    • LocalDetourExecutor
    • RetreatExecutor
    • ParkingExecutor
  5. TailPatchApplier
    • Applies patch to unsent tail only.
    • CAS update by plan version.
  6. NavigationSnapshotService
    • Aggregates fleet runtime snapshots from cache + lock state.

5.2 Integration Points

  1. In MqttMessageHandler.HandleStateMessageAsync(...)
    • Publish RobotStateUpdatedEvent.
  2. In Vda5050ProtocolService.SendNextSegmentInternalAsync(...)
    • Call TryAdjustBeforeSend(...) before selecting next segment.
  3. In lock failure branches
    • Publish LockConflictEvent to coordinator.

6. Avoidance Types (Unified Taxonomy)

  1. WaitNearest:
    • Move to nearest wait node or hold in place.
    • Best for short congestion.
  2. JunctionOccupy:
    • Proactive/queue-based occupancy for controlled junction entry.
  3. LocalDetour:
    • Replace local tail section with alternate branch and rejoin.
  4. LateralYield:
    • Side lane temporary yield.
  5. ParkingYield:
    • Mid/long hold at parking/wait node.
  6. Retreat:
    • Reverse to safe upstream node for head-on/deadlock risk.
  7. ReplanTail:
    • Escalated unsent tail replanning.

7. Strategy Selection Matrix

Selection has two layers: hard constraints remove forbidden actions first, then a score ranks the remaining candidates.

7.1 Hard Constraints

  1. If reverse conflict is active and safe gap not met:
    • Forbid forward occupancy action.
  2. If pending net action exists on current boundary:
    • Forbid tail mutation.
  3. If only sent segments remain (no unsent tail is left to mutate):
    • Forbid patch; fall back to hold.

7.2 Preferred Order

  1. Short congestion: WaitNearest.
  2. High-priority robot with a realistic chance of acquiring the junction lock: JunctionOccupy.
  3. Sustained congestion with available branch: LocalDetour.
  4. Head-on narrow corridor: Retreat.
  5. Mid/long congestion with parking: ParkingYield.
  6. Repeated failures: ReplanTail.

7.3 Anti-flap Rules

  1. A new decision takes effect only after it has been selected for N consecutive events.
  2. A hold must last at least min_hold_ms before the robot may leave it.
  3. Current mode gets a temporary score bias to avoid oscillation.
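
The two-layer selection and the anti-flap rules above can be sketched roughly as follows. This is a minimal illustration, not the intended selector: StrategyCandidate, AntiFlapSelector, and the way scores are produced are hypothetical names and assumptions, and min_hold_ms is left out because it is a time-based gate rather than a scoring rule.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical candidate: strategy code, hard-constraint verdict, raw score.
public sealed record StrategyCandidate(string Code, bool Allowed, double Score);

public sealed class AntiFlapSelector
{
    private readonly int _stableEvents;    // N consecutive events required before switching
    private readonly double _currentBias;  // temporary score bonus for the strategy in effect
    private string? _pending;              // challenger currently being observed
    private int _pendingCount;

    public string Current { get; private set; }

    public AntiFlapSelector(string initial, int stableEvents, double currentBias)
    {
        Current = initial;
        _stableEvents = stableEvents;
        _currentBias = currentBias;
    }

    // Layer 1 drops forbidden candidates; layer 2 picks the best biased score.
    // The winner replaces Current only after winning N events in a row.
    public string OnEvent(IEnumerable<StrategyCandidate> candidates)
    {
        var best = candidates
            .Where(c => c.Allowed)
            .OrderByDescending(c => c.Score + (c.Code == Current ? _currentBias : 0.0))
            .FirstOrDefault();

        // No allowed candidate, or the current strategy still wins: reset the challenger.
        if (best is null || best.Code == Current)
        {
            _pending = null;
            _pendingCount = 0;
            return Current;
        }

        _pendingCount = best.Code == _pending ? _pendingCount + 1 : 1;
        _pending = best.Code;

        if (_pendingCount >= _stableEvents)
        {
            Current = best.Code;
            _pending = null;
            _pendingCount = 0;
        }
        return Current;
    }
}
```

With N = 2, a challenger that outscores the current strategy once is ignored; it takes over only on the second consecutive win. A real selector would probably fall back to hold when no candidate is allowed, rather than keeping the current strategy as this sketch does.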

8. State Machine

8.1 Route Modes

  1. Normal
  2. HoldingAtWaitNode
  3. QueueingForJunction
  4. OccupyingJunction
  5. Detouring
  6. Rejoining
  7. ReplanPending
  8. Blocked

8.2 Events

  1. RobotStateUpdated
  2. DispatchNextSegment
  3. LockCheckPass
  4. LockCheckFail
  5. HoldTimeout
  6. DeadlockDetected
  7. PatchApplied
  8. PatchFailed
  9. ReplanDone

8.3 Main Transitions

  1. Normal -> HoldingAtWaitNode on short congestion.
  2. Normal -> OccupyingJunction on high-priority occupancy grant.
  3. Normal -> Detouring on sustained congestion and branch available.
  4. HoldingAtWaitNode -> Rejoining on resume conditions.
  5. HoldingAtWaitNode -> Detouring on timeout tier 1.
  6. HoldingAtWaitNode -> ReplanPending on timeout tier 2.
  7. Detouring -> Rejoining when patch is applied.
  8. Rejoining -> Normal after successful next segment dispatch.
  9. Any mode -> Blocked on repeated unrecoverable failures.
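
One way to keep the transitions above auditable is a lookup table keyed by (mode, trigger). The sketch below is only an illustration: the RouteTrigger names are hypothetical labels for the conditions the list describes in prose, and unknown combinations are assumed to leave the mode unchanged.

```csharp
using System;
using System.Collections.Generic;

public enum RouteMode
{
    Normal, HoldingAtWaitNode, QueueingForJunction, OccupyingJunction,
    Detouring, Rejoining, ReplanPending, Blocked
}

// Hypothetical trigger names; the plan states these conditions in prose only.
public enum RouteTrigger
{
    ShortCongestion, OccupancyGranted, SustainedCongestionWithBranch,
    ResumeConditionsMet, HoldTimeoutTier1, HoldTimeoutTier2,
    PatchApplied, NextSegmentDispatched, RepeatedUnrecoverableFailure
}

public static class RouteStateMachine
{
    private static readonly Dictionary<(RouteMode, RouteTrigger), RouteMode> Table = new()
    {
        [(RouteMode.Normal, RouteTrigger.ShortCongestion)] = RouteMode.HoldingAtWaitNode,
        [(RouteMode.Normal, RouteTrigger.OccupancyGranted)] = RouteMode.OccupyingJunction,
        [(RouteMode.Normal, RouteTrigger.SustainedCongestionWithBranch)] = RouteMode.Detouring,
        [(RouteMode.HoldingAtWaitNode, RouteTrigger.ResumeConditionsMet)] = RouteMode.Rejoining,
        [(RouteMode.HoldingAtWaitNode, RouteTrigger.HoldTimeoutTier1)] = RouteMode.Detouring,
        [(RouteMode.HoldingAtWaitNode, RouteTrigger.HoldTimeoutTier2)] = RouteMode.ReplanPending,
        [(RouteMode.Detouring, RouteTrigger.PatchApplied)] = RouteMode.Rejoining,
        [(RouteMode.Rejoining, RouteTrigger.NextSegmentDispatched)] = RouteMode.Normal,
    };

    public static RouteMode Next(RouteMode current, RouteTrigger trigger)
    {
        // "Any mode -> Blocked" is handled before the table lookup.
        if (trigger == RouteTrigger.RepeatedUnrecoverableFailure) return RouteMode.Blocked;

        // Assumption: a trigger with no listed transition leaves the mode as it is.
        return Table.TryGetValue((current, trigger), out var next) ? next : current;
    }
}
```

The plan lists no transitions out of QueueingForJunction, OccupyingJunction, or ReplanPending, so the table deliberately contains none for them.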

8.4 Resume Trigger (Wait-to-Continue)

Resume only if all of the following hold:

  1. LastNodeId == WaitNodeCode.
  2. Driving == false.
  3. Next-junction lock precheck passes for N consecutive cycles.
  4. min_hold_ms elapsed.

When the gate passes, either apply a patch or continue unchanged; in both cases SendNextSegmentInternalAsync(...) sends the next segment.
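
The four resume conditions reduce to a single pure predicate, which makes the gate easy to unit test. A minimal sketch, assuming a hypothetical ResumeInputs snapshot (the real values would come from the robot state event and the route cache):

```csharp
using System;

// Hypothetical snapshot of the inputs the resume gate needs.
public sealed record ResumeInputs(
    string LastNodeId,
    string WaitNodeCode,
    bool Driving,
    int ConsecutiveLockPrecheckPasses,
    DateTime HoldStartedUtc,
    DateTime NowUtc);

public static class ResumeGate
{
    // True only when all four conditions from section 8.4 pass.
    public static bool CanResume(ResumeInputs s, int requiredPasses, TimeSpan minHold) =>
        s.LastNodeId == s.WaitNodeCode                          // robot is at the wait node
        && !s.Driving                                           // robot is stopped
        && s.ConsecutiveLockPrecheckPasses >= requiredPasses    // precheck stable for N cycles
        && s.NowUtc - s.HoldStartedUtc >= minHold;              // min_hold_ms elapsed
}
```

Passing the current time in as data, instead of reading the clock inside the predicate, keeps the check deterministic.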


9. Data Model Changes

Extend VdaSegmentedPathCache with backward compatibility (new optional fields):

public long PlanVersion { get; set; } = 1;
public string RouteMode { get; set; } = "Normal";
public string? OriginalGoalNodeCode { get; set; }
public Guid? ActivePatchId { get; set; }
public DateTime? HoldUntilUtc { get; set; }
public string? HoldNodeCode { get; set; }
public int StablePassCount { get; set; } = 0;
public string? LastDecisionCode { get; set; }

Extend VdaSegmentCacheItem:

public string SegmentOrigin { get; set; } = "Base"; // Base|Avoidance|Merged

Add new patch model:

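using System;
using System.Collections.Generic;

// Decision returned to the send path before the next segment is selected.
public enum PatchAction { Continue, Hold, PatchTail, ReplanTail }

// A versioned change to the unsent tail of a robot's cached plan.
public class TailPatch
{
    public Guid PatchId { get; set; }
    // PlanVersion the patch was built against; the CAS write fails if the cache has moved on.
    public long ExpectedPlanVersion { get; set; }
    public PatchAction Action { get; set; }
    public string Strategy { get; set; } = string.Empty;
    public string Reason { get; set; } = string.Empty;
    public string? WaitNodeCode { get; set; }
    public string? RejoinNodeCode { get; set; }
    public DateTime? HoldUntilUtc { get; set; }
    // Replacement for the unsent tail, in the same 3-level structure as the cache.
    public List<List<List<PathSegmentWithCode>>> NewTail { get; set; } = new();
}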

10. Redis Keys and Concurrency

Use these keys:

  1. rcs:route:session:{robotId} -> route mode/session metadata.
  2. Existing VDA path key remains primary source:
    • {VdaPathPrefix}:{robotId}:{taskId}:{subTaskId}
  3. rcs:route:queue:{mapCode}:{junctionNodeCode} -> waiting queue.
  4. rcs:route:decision:lock:{robotId} -> decision lock.

CAS update rule:

  1. Read cache with PlanVersion.
  2. Build patch for that version.
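  3. WATCH the path key and compare the stored PlanVersion with the version the patch was built for.
  4. Write the updated cache with PlanVersion+1 in the same transaction (MULTI/EXEC).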
  5. Retry on version mismatch.
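
The read, build, compare, write, retry loop can be illustrated without Redis. The sketch below swaps Redis WATCH/MULTI/EXEC for an in-memory ConcurrentDictionary.TryUpdate, which gives the same compare-and-swap semantics in a single process. CacheSnapshot, InMemoryPlanStore, PlanCas, and the bounded maxRetries are all hypothetical and stand in for the real cache model and store.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical minimal snapshot; the real VdaSegmentedPathCache carries far more.
public sealed record CacheSnapshot(long PlanVersion, string RouteMode);

public sealed class InMemoryPlanStore
{
    private readonly ConcurrentDictionary<string, CacheSnapshot> _data = new();

    public void Seed(string key, CacheSnapshot value) => _data[key] = value;

    public CacheSnapshot Read(string key) => _data[key];

    // Succeeds only if the stored snapshot still equals the one that was read.
    public bool TryReplace(string key, CacheSnapshot expected, CacheSnapshot updated) =>
        _data.TryUpdate(key, updated, expected);
}

public static class PlanCas
{
    // Returns false when every attempt lost the race (assumed bounded retry policy).
    public static bool TryApply(
        InMemoryPlanStore store,
        string key,
        Func<CacheSnapshot, CacheSnapshot> buildPatch,
        int maxRetries = 3)
    {
        for (var attempt = 0; attempt <= maxRetries; attempt++)
        {
            var current = store.Read(key);                        // 1. read with PlanVersion
            var patched = buildPatch(current) with                // 2. build patch for that version
            {
                PlanVersion = current.PlanVersion + 1             // 4. write PlanVersion+1
            };
            if (store.TryReplace(key, current, patched))          // 3. compare version
                return true;
            // 5. version mismatch: loop and rebuild against the fresh snapshot
        }
        return false;
    }
}
```

The patch is rebuilt on every retry, so a decision is never committed against a plan version it was not computed for.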

11. Tail Merge Protocol (Compatibility-critical)

Never mutate sent segments.

11.1 Steps

  1. Locate mutable start from current indexes:
    • CurrentJunctionIndex, CurrentResourceIndex.
  2. Build unsent linear tail from cache.
  3. Align true start with robot LastNodeId.
  4. Generate avoidance branch:
    • current -> wait/detour/rejoin.
  5. Select rejoinNode from the original tail, or use the target node directly.
  6. Compose:
    • NewTailLinear = Branch + OriginalSuffixAfterRejoin.
  7. Call existing SplitSegmentsByBoundary(...) to regenerate 3-level structure.
  8. Replace only unsent tail structure in cache.
  9. Leave LastSentSequenceId unchanged at patch time; the existing send logic continues to increment it.

11.2 Safety Conditions

  1. Patch must preserve node/edge code validity.
  2. Patch must satisfy the existing action continuity rules.
  3. Patch must pass lock precheck before commit.
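
The compose step (NewTailLinear = Branch + OriginalSuffixAfterRejoin) can be sketched on plain node codes. This is a simplified illustration: the real tail holds PathSegmentWithCode items and is re-split by SplitSegmentsByBoundary(...), while TailComposer and its string-based signature are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TailComposer
{
    // originalTail: unsent node codes of the original plan, in travel order.
    // branch: avoidance branch from the current node to rejoinNode, inclusive.
    public static List<string> Compose(
        IReadOnlyList<string> originalTail,
        IReadOnlyList<string> branch,
        string rejoinNode)
    {
        if (branch.Count == 0 || branch[branch.Count - 1] != rejoinNode)
            throw new ArgumentException("Branch must end at the rejoin node.", nameof(branch));

        var rejoinIndex = -1;
        for (var i = 0; i < originalTail.Count; i++)
        {
            if (originalTail[i] == rejoinNode) { rejoinIndex = i; break; }
        }
        if (rejoinIndex < 0)
            throw new ArgumentException("Rejoin node is not on the original tail.", nameof(rejoinNode));

        // Skip the rejoin node itself so it appears only once in the new tail.
        return branch.Concat(originalTail.Skip(rejoinIndex + 1)).ToList();
    }
}
```

For an original tail A, B, C, D, E and a branch A, X, Y, D, the composed tail is A, X, Y, D, E. Rejecting a branch that does not end on the original tail is one cheap way to enforce the node validity condition before the lock precheck runs.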

12. Protocol Layer Integration

Modify SendNextSegmentInternalAsync(...) flow in Vda5050ProtocolService:

  1. Read cache.
  2. Call coordinator:
    • TryAdjustBeforeSend(robot, subTask, cache).
  3. Handle returned action:
    • Hold: return successful wait response, do not move index.
    • PatchTail: reload cache and continue existing send logic.
    • Continue: existing logic unchanged.
    • ReplanTail: apply replan patch then continue.
  4. Continue existing checks:
    • net action gate
    • lock acquisition
    • order publish
    • index increment

Result:

  1. Existing segment send behavior is preserved.
  2. Avoidance affects only future unsent parts.
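
The action handling in step 3 can be expressed as a small mapping from the returned action to what the send path does next. A minimal sketch: PreSendOutcome and PreSendHook are hypothetical, and reloading the cache after ReplanTail is an assumption (the plan only says to apply the replan patch and continue).

```csharp
using System;

public enum PatchAction { Continue, Hold, PatchTail, ReplanTail }

// Hypothetical summary of what the send path should do after the hook returns.
public sealed record PreSendOutcome(bool ProceedToSend, bool ReloadCache);

public static class PreSendHook
{
    public static PreSendOutcome Handle(PatchAction action) => action switch
    {
        // Hold: report a successful wait and leave the segment index untouched.
        PatchAction.Hold => new PreSendOutcome(false, false),
        // PatchTail: the cache changed, so reload it before the existing send logic runs.
        PatchAction.PatchTail => new PreSendOutcome(true, true),
        // ReplanTail: assumed to behave like PatchTail once the replan patch is applied.
        PatchAction.ReplanTail => new PreSendOutcome(true, true),
        // Continue: existing logic unchanged.
        PatchAction.Continue => new PreSendOutcome(true, false),
        _ => throw new ArgumentOutOfRangeException(nameof(action)),
    };
}
```

Only Hold stops the flow early; every other action falls through to the existing net action gate, lock acquisition, order publish, and index increment.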

13. Strategy Executor Contracts

Add interface:

public interface IAvoidanceExecutor
{
    string StrategyCode { get; }

    Task<TailPatch> BuildPatchAsync(AvoidanceContext context, CancellationToken ct);
}

Add selector:

public interface IAvoidanceStrategySelector
{
    Task<string> SelectAsync(AvoidanceContext context, CancellationToken ct);
}

Context contains:

  1. Robot snapshot.
  2. Route cache snapshot.
  3. Fleet/lock snapshot.
  4. Junction metadata.
  5. Timing counters.

14. File-level Implementation Map

14.1 New Files

  1. src/Rcs.Application/Services/PathFind/Realtime/IGlobalNavigationCoordinator.cs
  2. src/Rcs.Infrastructure/PathFinding/Realtime/GlobalNavigationCoordinator.cs
  3. src/Rcs.Infrastructure/PathFinding/Realtime/AvoidanceStrategySelector.cs
  4. src/Rcs.Infrastructure/PathFinding/Realtime/TailPatchApplier.cs
  5. src/Rcs.Infrastructure/PathFinding/Realtime/Executors/*.cs
  6. src/Rcs.Application/Services/PathFind/Realtime/Models/*.cs

14.2 Modify Existing Files

  1. src/Rcs.Infrastructure/Mqtt/MqttMessageHandler.cs
    • publish state events to coordinator.
  2. src/Rcs.Infrastructure/Services/Protocol/Vda5050ProtocolService.cs
    • add TryAdjustBeforeSend(...) hook.
  3. src/Rcs.Application/Services/PathFind/Models/VdaSegmentedPathCache.cs
    • add version/mode fields.
  4. src/Rcs.Infrastructure/PathFinding/Services/JunctionConflictResolver.cs
    • implement actual occupying robot retrieval.
  5. src/Rcs.Infrastructure/Installs/*.cs
    • register new services and resolve DI lifetime conflicts.

15. Rollout Plan

Phase 1: Skeleton and Hold-only

  1. Add coordinator/event contracts.
  2. Integrate pre-send hook.
  3. Implement only Continue/Hold.
  4. Add metrics and logs.

Exit criteria:

  1. No regression in current send flow.
  2. Hold decisions can be made and resumed correctly.

Phase 2: Tail Patch for Wait/JunctionOccupy

  1. Implement WaitNearest and JunctionOccupy.
  2. Implement TailPatchApplier.
  3. Add CAS versioning.

Exit criteria:

  1. Unsent tail replacement works without sent segment mutation.
  2. Sequence and index consistency maintained.

Phase 3: Detour/Retreat/Parking

  1. Implement LocalDetour, Retreat, ParkingYield.
  2. Add queue fairness for junction.
  3. Add anti-flap tuning.

Exit criteria:

  1. Stable behavior under congestion stress.
  2. Reduced deadlock duration and retries.

Phase 4: ReplanTail and Hardening

  1. Add replan escalation.
  2. Add failure recovery and fallback paths.
  3. Complete observability dashboards.

Exit criteria:

  1. Robust under burst traffic and partial failures.
  2. Error handling fully covered by tests.

16. Testing Plan

16.1 Unit Tests

  1. Strategy selector matrix and anti-flap.
  2. State machine transitions.
  3. Tail patch merge correctness.
  4. CAS conflict retries.

16.2 Integration Tests

  1. Two robots at same junction with priority inversion.
  2. Reverse edge conflict in narrow corridor.
  3. Wait node hold and resume.
  4. Detour and rejoin tail replacement.
  5. Net action pending while congestion occurs.

16.3 Soak/Stress

  1. 20+ robots with random congestion.
  2. Intermittent Redis latency.
  3. Repeated lock fail/recover loops.

17. Observability and Metrics

Add counters/timers:

  1. route_decision_total{strategy,action}
  2. route_hold_duration_ms
  3. route_patch_apply_ms
  4. route_patch_conflict_total
  5. route_resume_attempt_total
  6. route_send_blocked_total

Add structured logs with:

  1. robotId
  2. taskId/subTaskId
  3. planVersion
  4. routeMode
  5. strategy
  6. patchId
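
The plan does not name a metrics library. As one possible shape, the sketch below registers two of the instruments above with the built-in System.Diagnostics.Metrics API; that choice, the meter name "Rcs.Route", and the RouteMetrics class are assumptions.

```csharp
using System.Collections.Generic;
using System.Diagnostics.Metrics;

public static class RouteMetrics
{
    // Hypothetical meter name; pick whatever the repo's telemetry setup expects.
    private static readonly Meter RouteMeter = new Meter("Rcs.Route");

    private static readonly Counter<long> DecisionTotal =
        RouteMeter.CreateCounter<long>("route_decision_total");

    private static readonly Histogram<double> HoldDurationMs =
        RouteMeter.CreateHistogram<double>("route_hold_duration_ms", unit: "ms");

    // Counts one decision, tagged with the strategy and action labels.
    public static void RecordDecision(string strategy, string action) =>
        DecisionTotal.Add(1,
            new KeyValuePair<string, object?>("strategy", strategy),
            new KeyValuePair<string, object?>("action", action));

    // Records how long a completed hold lasted.
    public static void RecordHold(double durationMs) =>
        HoldDurationMs.Record(durationMs);
}
```

A call such as RouteMetrics.RecordDecision("WaitNearest", "Hold") at the point where the coordinator commits a decision is enough to feed route_decision_total{strategy,action}.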

18. Known Risks and Mitigations

  1. Risk: strategy oscillation.
    • Mitigation: anti-flap + min hold + stable pass count.
  2. Risk: concurrent patch/send races.
    • Mitigation: per-robot lock + CAS plan version.
  3. Risk: action context break after merge.
    • Mitigation: boundary-safe patch policy in early phases.
  4. Risk: lock leak when patch fails.
    • Mitigation: lock intent transaction and compensating release.

19. Done Criteria

Implementation is done only if all of the following are true:

  1. All avoidance types are selectable by unified selector.
  2. Wait resume is event-driven and deterministic.
  3. Tail patch never mutates sent segments.
  4. Existing send-next-segment behavior remains compatible.
  5. Core scenarios pass integration and stress tests.

20. Immediate Task Checklist (Engineering Backlog)

  1. Create coordinator/event contracts and register DI.
  2. Add route mode/version fields to cache model.
  3. Add pre-send adjustment hook in protocol service.
  4. Implement hold resume gate based on state events.
  5. Implement junction occupancy queue and retrieval logic.
  6. Implement tail patch applier for unsent segments.
  7. Add strategy selector with matrix and anti-flap.
  8. Add metrics, logs, and test suites.