Agent Error Handling Architecture in GraphBit: Detection, Classification, and Mitigation Strategies
Overview: Agent Error Handling Architecture
GraphBit’s agent error handling is implemented primarily in the Rust core with targeted support in the Python bindings:
Rust core (graphbit_core)
- Detection: Agent node execution errors, LLM provider API failures, missing agents, dependency issues
- Classification: Retryable vs non-retryable errors, auth/config-critical failures
- Mitigation: Circuit breaker (per agent), retries with exponential backoff + jitter
- Propagation: Convert to NodeExecutionResult; update WorkflowContext or fail workflow (fail-fast)
- Concurrency: Limits on agent nodes; avoids cascading failures via breaker
Python bindings
- Pre-dispatch validation and global timeout errors
- Tool-call handling: map tool errors; fallback to safe outputs on final LLM failure
- Logging and surfacing of errors to the user-facing WorkflowResult
Key Components and Methods
- core/src/workflow.rs
  - WorkflowExecutor::execute: batch loop, fail-fast checks, context updates
  - WorkflowExecutor::execute_node_with_retry: breaker checks, retries, result construction
  - WorkflowExecutor::execute_agent_node_static / execute_agent_with_tools: agent execution path; delegates to LLM provider; tool-call return path
- core/src/types.rs
  - GraphBitError (via errors.rs)
  - RetryConfig with RetryableErrorType classification and backoff
  - CircuitBreaker with thresholds, half-open handling
  - NodeExecutionResult for structured failure propagation
- core/src/llm/*.rs
  - Providers return GraphBitError::llm_provider on HTTP errors, parse errors, client errors
- python/src/workflow/executor.rs
  - Validation errors; timeout_error; tool-call result wiring and fallback
- python/src/tools/registry.rs
  - Tool execution failures captured and recorded; history capped
Concrete Code Examples (from the repository)
1) Agent failure detection and propagation (per-node)
```rust
// Dispatch by node type (inside execute_node_with_retry)
match &node.node_type {
    NodeType::Agent { .. } => {
        Self::execute_agent_node_static(/* ... */).await
    }
    // ...
}

// Per-attempt result handling
match result {
    Ok(output) => { /* store outputs; breaker.record_success(); */ }
    Err(error) => {
        if let Some(ref config) = retry_config {
            if config.should_retry(&error, attempt) {
                attempt += 1;
                let delay_ms = config.calculate_delay(attempt);
                if delay_ms > 0 {
                    sleep(/* ... */).await;
                }
                continue;
            }
        }
        return Ok(NodeExecutionResult::failure(error.to_string(), node.id.clone())
            .with_duration(start_time.elapsed().as_millis() as u64)
            .with_retry_count(attempt));
    }
}
```
2) Circuit breaker gating and state transitions
```rust
// Gate each attempt on the per-agent breaker
if let Some(ref mut breaker) = circuit_breaker {
    if !breaker.should_allow_request() {
        let error = GraphBitError::workflow_execution(
            "Circuit breaker is open - requests are being rejected",
        );
        return Ok(NodeExecutionResult::failure(error.to_string(), node.id.clone())
            .with_duration(start_time.elapsed().as_millis() as u64)
            .with_retry_count(attempt));
    }
}

// State transitions on failure
pub fn record_failure(&mut self) {
    self.last_failure = Some(chrono::Utc::now());
    match self.state {
        CircuitBreakerState::Closed => {
            self.failure_count += 1;
            if self.failure_count >= self.config.failure_threshold {
                self.state = CircuitBreakerState::Open { opened_at: chrono::Utc::now() };
            }
        }
        CircuitBreakerState::HalfOpen => {
            self.state = CircuitBreakerState::Open { opened_at: chrono::Utc::now() };
            self.failure_count = 1;
            self.success_count = 0;
        }
        CircuitBreakerState::Open { .. } => {}
    }
}
```
3) Retry configuration and classification
```rust
pub fn should_retry(&self, error: &crate::errors::GraphBitError, attempt: u32) -> bool {
    if attempt >= self.max_attempts {
        return false;
    }
    let error_type = RetryableErrorType::from_error(error);
    self.retryable_errors.contains(&error_type)
}

pub fn calculate_delay(&self, attempt: u32) -> u64 {
    if attempt == 0 {
        return 0;
    }
    let base_delay = (self.initial_delay_ms as f64
        * self.backoff_multiplier.powi(attempt as i32 - 1))
        .min(self.max_delay_ms as f64);
    let jitter = (rand::random::<f64>() - 0.5) * 2.0 * self.jitter_factor * base_delay;
    ((base_delay + jitter).max(0.0) as u64).min(self.max_delay_ms)
}

pub fn from_error(error: &crate::errors::GraphBitError) -> Self {
    let s = error.to_string().to_lowercase();
    if s.contains("timeout") {
        Self::TimeoutError
    } else if s.contains("network") {
        Self::NetworkError
    } else if s.contains("rate limit") || s.contains("too many requests") {
        Self::RateLimitError
    } else if s.contains("auth") || s.contains("unauthorized") {
        Self::AuthenticationError
    } else {
        Self::Other
    }
}
```
4) Batch-level fail-fast for agent/auth errors
```rust
let is_auth_error = error_msg.contains("auth")
    || error_msg.contains("key")
    || error_msg.contains("invalid")
    || error_msg.contains("unauthorized")
    || error_msg.contains("permission")
    || error_msg.contains("api error");
if is_auth_error || self.fail_fast {
    should_fail_fast = true;
    failure_message = e.to_string();
    break;
}
```
5) Agent creation failures (e.g., invalid API key/config)
```rust
Err(e) => {
    return Err(GraphBitError::workflow_execution(format!(
        "Failed to create agent '{agent_id_str}': {e}. This may be due to invalid API key or configuration."
    )));
}
```
6) Agent not found errors
```rust
let agent = agents_guard
    .get(agent_id)
    .ok_or_else(|| GraphBitError::agent_not_found(agent_id.to_string()))?
    .clone();
```
7) LLM provider error mapping (OpenAI example)
```rust
let response = req_builder
    .send()
    .await
    .map_err(|e| GraphBitError::llm_provider("openai", format!("Request failed: {e}")))?;
if !response.status().is_success() {
    let error_text = response
        .text()
        .await
        .unwrap_or_else(|_| "Unknown error".to_string());
    return Err(GraphBitError::llm_provider("openai", format!("API error: {error_text}")));
}
```
8) Agent with tool-calls: error flow back to Python
```rust
let llm_response = agent.llm_provider().complete(request).await?;
if !llm_response.tool_calls.is_empty() {
    Ok(serde_json::json!({
        "type": "tool_calls_required",
        "content": llm_response.content,
        "tool_calls": tool_calls_json,
        "original_prompt": prompt
    }))
} else {
    Ok(serde_json::Value::String(llm_response.content))
}
```
9) Tool execution error handling (Python)
```rust
let tool_results = Python::with_gil(|py| {
    execute_production_tool_calls(py, tool_calls_json, node_tools)
})
.map_err(|e| {
    graphbit_core::errors::GraphBitError::workflow_execution(format!(
        "Failed to execute tools: {}",
        e
    ))
})?;
```
10) Final LLM call fallback (Python)
```rust
match llm_provider.complete(final_request).await {
    Ok(final_response) => {
        // response_content is derived from final_response (construction elided);
        // it is also stored under the node name and as a stringified variable.
        context.set_node_output(&node.id, serde_json::Value::String(response_content));
    }
    Err(e) => {
        tracing::error!("Failed to get final LLM response: {}", e);
        // Fall back to the raw tool results so the node still produces output
        context.set_node_output(&node.id, serde_json::Value::String(tool_results.clone()));
    }
}
```
11) Tool registry failure result (captured, logged, recorded)
```rust
Err(e) => {
    let duration = start_time.elapsed().as_millis() as u64;
    let error_msg = format!("Tool execution failed: {}", e);
    let tool_result = ToolResult::failure(name.to_string(), params_str, error_msg, duration);
    self.add_to_history(tool_result.clone())?;
    tool_result
}
```
Step-by-Step Agent Error Handling Flow
- Preparation
  - Agents are auto-registered before execution; creation failures produce workflow_execution errors (e.g., invalid API key). Execution is aborted early with a descriptive message.
- Batch execution begins
  - Nodes execute concurrently as tokio tasks. For agent nodes, concurrency permits are acquired; if the breaker is open, the node immediately returns a NodeExecutionResult::failure.
- Agent execution
  - execute_agent_node_static resolves the prompt and calls the agent. LLM provider failures (HTTP/network/parse) surface as GraphBitError::llm/llm_provider and bubble back immediately to execute_node_with_retry as Err(error).
- Retry and breaker logic
  - execute_node_with_retry classifies the error with RetryConfig::should_retry (text-based classification) and calculates backoff with jitter. While retrying, the breaker records failures; a run of failures may open the breaker, short-circuiting future attempts for that agent. (A consolidated sketch of this loop follows the list.)
- Result construction
  - On final failure (no retries left), NodeExecutionResult::failure is returned (including node_id, duration_ms, retry_count). On success, success(...) is returned and the breaker records the success (which can close a half-open breaker).
- Batch aggregation and fail-fast
  - The batch loop joins task results. If a node returned Ok(Err(e)), it is seen here; the engine checks for auth/configuration criticality via message strings (auth/key/invalid/unauthorized/permission/api error). If fail_fast is enabled (e.g., LowLatency) or the error is auth-critical, the workflow context is set to Failed and returned immediately.
- Context propagation
  - For successful nodes, outputs are stored under the node id and name, and variables are set to stringified outputs (for compatibility). Failed nodes do not store new outputs; the WorkflowContext is marked failed if fail-fast triggers.
- Tool-call specific flow (Python)
  - If an agent returns "tool_calls_required", Python executes the tools and then attempts a final LLM call. If tool execution fails, a workflow_execution error is raised. If the final LLM call fails, the system logs the error and safely falls back to the tool results as that node's output.
- Completion and stats
  - On success, the context completes. On fail-fast, context.fail(message) is set and returned. Concurrency stats and durations are tracked; breaker state is retained and updated in the executor's map.
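To tie the steps together, here is a minimal, self-contained model of the per-node control loop (breaker gate, retry with backoff, structured failure). The MiniBreaker, MiniRetry, and run_node names, the synchronous std::thread::sleep, and the String error type are illustrative simplifications rather than GraphBit APIs; the real loop is async, applies jitter, tracks half-open recovery, and returns NodeExecutionResult.

```rust
use std::time::Duration;

#[derive(Debug)]
enum BreakerState { Closed, Open }

// Simplified breaker: no half-open recovery, just Closed -> Open on threshold.
struct MiniBreaker {
    state: BreakerState,
    failure_count: u32,
    failure_threshold: u32,
}

impl MiniBreaker {
    fn should_allow_request(&self) -> bool { matches!(self.state, BreakerState::Closed) }
    fn record_failure(&mut self) {
        self.failure_count += 1;
        if self.failure_count >= self.failure_threshold {
            self.state = BreakerState::Open;
        }
    }
    fn record_success(&mut self) {
        self.failure_count = 0;
        self.state = BreakerState::Closed;
    }
}

// Simplified retry policy: exponential backoff, no jitter, no error classification.
struct MiniRetry {
    max_attempts: u32,
    initial_delay_ms: u64,
    backoff_multiplier: f64,
}

impl MiniRetry {
    fn delay(&self, attempt: u32) -> Duration {
        let ms = self.initial_delay_ms as f64 * self.backoff_multiplier.powi(attempt as i32 - 1);
        Duration::from_millis(ms as u64)
    }
}

fn run_node(
    breaker: &mut MiniBreaker,
    retry: &MiniRetry,
    mut call: impl FnMut() -> Result<String, String>,
) -> Result<String, String> {
    let mut attempt = 0;
    loop {
        // Breaker gate before each attempt (mirrors should_allow_request above).
        if !breaker.should_allow_request() {
            return Err("circuit breaker is open".into());
        }
        match call() {
            Ok(output) => {
                breaker.record_success();
                return Ok(output);
            }
            Err(e) => {
                breaker.record_failure();
                attempt += 1;
                if attempt >= retry.max_attempts {
                    // Final failure after exhausting retries.
                    return Err(format!("failed after {attempt} attempts: {e}"));
                }
                std::thread::sleep(retry.delay(attempt));
            }
        }
    }
}

fn main() {
    let mut breaker = MiniBreaker { state: BreakerState::Closed, failure_count: 0, failure_threshold: 3 };
    let retry = MiniRetry { max_attempts: 3, initial_delay_ms: 10, backoff_multiplier: 2.0 };
    let mut calls = 0;
    // Fails twice, then succeeds: the breaker stays closed and the loop recovers.
    let result = run_node(&mut breaker, &retry, || {
        calls += 1;
        if calls < 3 { Err("transient provider error".into()) } else { Ok("agent output".into()) }
    });
    println!("{result:?}, breaker state: {:?}", breaker.state);
}
```

The design point the sketch illustrates is the division of labor: retries absorb transient errors within a single node execution, while the breaker remembers failures across executions and cuts off a persistently failing agent.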
Detailed Insights
- Agent-specific error detection and classification
  - Detection is primarily via LLM provider errors and "agent_not_found."
  - Classification uses RetryableErrorType::from_error, which inspects error strings for "timeout," "network," "rate limit," "unauthorized," etc. GraphBitError::is_retryable also marks LLM/Network/RateLimit errors as retryable.
- Circuit breaker thresholds and states
  - Configurable with failure_threshold, recovery_timeout_ms, and success_threshold. Closed → Open after threshold failures; Open → HalfOpen after the recovery timeout; HalfOpen → Closed after success_threshold successes, or back to Open on a failure. (An illustrative configuration sketch follows this list.)
- Retry with exponential backoff + jitter
  - Governed by initial_delay_ms, backoff_multiplier, max_delay_ms, and jitter_factor. attempt increments after each failed try; the delay is applied before the re-attempt.
- Error propagation into workflow context
  - Per-node NodeExecutionResult values are returned up the batch loop; if fail-fast triggers, context.fail(message) is set and execution stops. Successful nodes write outputs and variables; failed nodes do not.
- Fail-fast vs graceful degradation
  - LowLatency uses fail_fast and disables retries by default. HighThroughput keeps retries on and does not fail fast (unless the error is auth-critical). This reduces wasted work in latency-sensitive scenarios.
- LLM provider error handling (API, rate limits, auth)
  - Providers map HTTP/client/parse failures into GraphBitError::llm_provider with the provider name and message embedded (e.g., "API error: ..."). Rate limits and 401/403 responses aren't explicitly branched in code but can be recognized via body/status text; retryable classification uses message text ("too many requests," "rate limit," "unauthorized").
- Tool execution errors within agent workflows
  - The Python registry converts tool exceptions into ToolResult::failure and adds them to history. Workflow tool-call orchestration converts errors to GraphBitError::workflow_execution. Final LLM call failures are logged and fall back to tool results rather than failing the entire workflow by default.
- Recovery and state management
  - The breaker's half-open state allows safe probing after the recovery timeout. When a node/agent recovers, the breaker resets after enough successes. The context maintains a consistent state (Running → Completed/Failed).
- Logging and monitoring
  - Errors and fallbacks are logged via tracing (Rust core) and tracing::error (Python bindings). The tool registry maintains a capped execution history (last 1000 results) for post-mortem analysis.
- Performance impact
  - Retries add delay; backoff reduces provider pressure. The breaker prevents thundering herds on failing providers. Fail-fast cuts wasted work on hard failures. Concurrency permits apply only to agent nodes to reduce overhead elsewhere.
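As a reference point for the knobs named above, here is a small self-contained sketch with illustrative values. The struct shapes, field visibility, the CircuitBreakerConfig name, and the chosen numbers are assumptions, not GraphBit's actual definitions; only the field names come from the sections above.

```rust
// Illustrative mirrors of the documented configuration knobs (assumed shapes).
#[derive(Debug)]
enum RetryableErrorType { TimeoutError, NetworkError, RateLimitError }

#[derive(Debug)]
struct RetryConfig {
    max_attempts: u32,
    initial_delay_ms: u64,
    backoff_multiplier: f64,
    max_delay_ms: u64,
    jitter_factor: f64,
    retryable_errors: Vec<RetryableErrorType>,
}

#[derive(Debug)]
struct CircuitBreakerConfig {
    failure_threshold: u32,   // Closed -> Open after this many failures
    recovery_timeout_ms: u64, // Open -> HalfOpen after this long
    success_threshold: u32,   // HalfOpen -> Closed after this many successes
}

fn main() {
    let retry = RetryConfig {
        max_attempts: 3,
        initial_delay_ms: 500,
        backoff_multiplier: 2.0,
        max_delay_ms: 30_000,
        jitter_factor: 0.1,
        retryable_errors: vec![
            RetryableErrorType::TimeoutError,
            RetryableErrorType::NetworkError,
            RetryableErrorType::RateLimitError,
        ],
    };
    let breaker = CircuitBreakerConfig {
        failure_threshold: 5,
        recovery_timeout_ms: 10_000,
        success_threshold: 2,
    };
    println!("{retry:#?}\n{breaker:#?}");
}
```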
Async Scenarios
- Node tasks run in tokio; errors inside tasks become Ok(Err(...)) at the join site. If joining a task fails (a JoinError, e.g., the task panicked or was cancelled), it is treated as an execution failure; fail-fast may abort the workflow. The engine aggregates all task results and makes a mode-aware decision to continue or fail early.
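A minimal sketch of that join-site pattern, assuming a plain tokio runtime; NodeOutcome and the node_* functions are illustrative stand-ins for GraphBit's NodeExecutionResult handling, not repository code.

```rust
// Requires the tokio crate (e.g., features "rt-multi-thread" + "macros").
type NodeOutcome = Result<String, String>;

async fn node_a() -> NodeOutcome { Ok("node A output".to_string()) }
async fn node_b() -> NodeOutcome { Err("node B: provider timeout".to_string()) }
async fn node_c() -> NodeOutcome { panic!("node C panicked") }

#[tokio::main]
async fn main() {
    let handles = vec![
        tokio::spawn(node_a()),
        tokio::spawn(node_b()),
        tokio::spawn(node_c()),
    ];

    for handle in handles {
        // Awaiting a JoinHandle yields Result<NodeOutcome, JoinError>: the outer
        // Err covers panics/cancellation, the inner Err is the node's own failure.
        let outcome: NodeOutcome = match handle.await {
            Ok(node_result) => node_result,
            Err(join_err) => Err(format!("task join failed: {join_err}")),
        };
        match outcome {
            Ok(out) => println!("success: {out}"),
            Err(msg) => println!("failure recorded: {msg}"),
        }
    }
}
```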
Representative Data Structures for Failures
- NodeExecutionResult (success/failure with duration and retries)
```rust
pub struct NodeExecutionResult {
    pub success: bool,
    pub output: serde_json::Value,
    pub error: Option<String>,
    pub duration_ms: u64,
    pub retry_count: u32,
    pub node_id: NodeId,
}
```
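The constructor calls visible in the excerpts (failure, with_duration, with_retry_count) suggest a builder-style impl roughly like the following; this is an assumed shape for illustration, not the repository's implementation.

```rust
// Assumed builder shape for the struct above; GraphBit's actual code may differ.
impl NodeExecutionResult {
    pub fn failure(error: impl Into<String>, node_id: NodeId) -> Self {
        Self {
            success: false,
            output: serde_json::Value::Null,
            error: Some(error.into()),
            duration_ms: 0,
            retry_count: 0,
            node_id,
        }
    }
    pub fn with_duration(mut self, duration_ms: u64) -> Self {
        self.duration_ms = duration_ms;
        self
    }
    pub fn with_retry_count(mut self, retry_count: u32) -> Self {
        self.retry_count = retry_count;
        self
    }
}
```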
- GraphBitError and helpers
```rust
pub enum GraphBitError {
    Configuration { message: String },
    LlmProvider { provider: String, message: String },
    Llm { message: String },
    // ...
}
```
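The helper constructors used throughout (workflow_execution, llm_provider, agent_not_found) plausibly map onto variants behind the "// ..." above. The sketch below is an assumption: the WorkflowExecution and AgentNotFound variant names, the Into<String> signatures, and the is_retryable body are illustrative, not the repository's definitions.

```rust
// Assumed variants and helper shapes; the real enum carries more variants/data.
impl GraphBitError {
    pub fn llm_provider(provider: impl Into<String>, message: impl Into<String>) -> Self {
        Self::LlmProvider { provider: provider.into(), message: message.into() }
    }
    pub fn workflow_execution(message: impl Into<String>) -> Self {
        Self::WorkflowExecution { message: message.into() } // assumed variant
    }
    pub fn agent_not_found(agent_id: impl Into<String>) -> Self {
        Self::AgentNotFound { agent_id: agent_id.into() } // assumed variant
    }
    pub fn is_retryable(&self) -> bool {
        // Per the insights above, LLM errors are retryable; the real
        // implementation also covers network/rate-limit variants not shown here.
        matches!(self, Self::Llm { .. } | Self::LlmProvider { .. })
    }
}
```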
Summary
GraphBit’s agent error handling blends defensive preparation (validation, auto-registration), robust per-node execution control (circuit breaker + retries), and flexible workflow-level strategies (fail-fast vs graceful). LLM provider errors are mapped uniformly into GraphBitError and considered for retries by classification. Tool-call workflows in Python add pragmatic fallbacks (keep tool results if final LLM fails). The result is a resilient system that avoids cascading failures, surfaces actionable diagnostics, and provides predictable behavior across performance modes.
Suggested follow-up unit/integration tests would:
- Simulate an LLM provider failure to verify retry and breaker transitions
- Assert fail-fast behavior on an “unauthorized” error string
- Verify Python tool-call fallback when final LLM errors out