Agent Error Handling Architecture in GraphBit: Detection, Classification, and Mitigation Strategies

Community Article Published November 27, 2025

Overview: Agent Error Handling Architecture

GraphBit’s agent error handling is implemented primarily in the Rust core, with targeted support in the Python bindings (a simplified, self-contained sketch of how these pieces compose follows the list):

  • Rust core (graphbit_core)
    • Detection: Agent node execution errors, LLM provider API failures, missing agents, dependency issues
    • Classification: Retryable vs non-retryable errors, auth/config-critical failures
    • Mitigation: Circuit breaker (per agent), retries with exponential backoff + jitter
    • Propagation: Convert to NodeExecutionResult; update WorkflowContext or fail the workflow (fail-fast)
    • Concurrency: Limits on agent nodes; the breaker avoids cascading failures
  • Python bindings
    • Pre-dispatch validation and global timeout errors
    • Tool-call handling: map tool errors; fall back to safe outputs on final LLM failure
    • Logging and surfacing of errors to the user-facing WorkflowResult
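
Before diving into the components, here is a simplified, self-contained sketch of how detection, classification, and mitigation compose for a single node. Type names echo the repository (`GraphBitError`, `RetryableErrorType`); the bodies and the `Mitigation` enum are illustrative only, not GraphBit's actual API:

```rust
// Simplified sketch of the per-node pipeline described above.
#[derive(Debug)]
enum GraphBitError {
    LlmProvider { provider: String, message: String },
    AgentNotFound { agent_id: String },
}

#[derive(Debug)]
enum RetryableErrorType { Timeout, Network, RateLimit, Authentication, Other }

#[derive(Debug)]
enum Mitigation {
    RetryWithBackoff, // retryable: go through RetryConfig + circuit breaker
    FailWorkflow,     // auth/config-critical: candidate for fail-fast
    FailNode,         // everything else: structured NodeExecutionResult::failure
}

// Classification: inspect the error text, in the spirit of RetryableErrorType::from_error.
fn classify(error: &GraphBitError) -> RetryableErrorType {
    let s = format!("{error:?}").to_lowercase(); // Debug stands in for Display here
    if s.contains("timeout") { RetryableErrorType::Timeout }
    else if s.contains("network") { RetryableErrorType::Network }
    else if s.contains("rate limit") || s.contains("too many requests") { RetryableErrorType::RateLimit }
    else if s.contains("auth") || s.contains("unauthorized") { RetryableErrorType::Authentication }
    else { RetryableErrorType::Other }
}

// Mitigation: map the classification to a per-node strategy.
fn mitigate(class: &RetryableErrorType) -> Mitigation {
    match class {
        RetryableErrorType::Timeout
        | RetryableErrorType::Network
        | RetryableErrorType::RateLimit => Mitigation::RetryWithBackoff,
        RetryableErrorType::Authentication => Mitigation::FailWorkflow,
        RetryableErrorType::Other => Mitigation::FailNode,
    }
}

fn main() {
    // Detection: a provider call fails and is mapped into GraphBitError.
    let error = GraphBitError::LlmProvider {
        provider: "openai".into(),
        message: "API error: 429 Too Many Requests".into(),
    };
    let class = classify(&error);
    println!("{class:?} -> {:?}", mitigate(&class));
}
```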

Key Components and Methods

  • core/src/workflow.rs
    • WorkflowExecutor::execute: batch loop, fail-fast checks, context updates
    • WorkflowExecutor::execute_node_with_retry: breaker checks, retries, result construction
    • WorkflowExecutor::execute_agent_node_static / execute_agent_with_tools: agent execution path; delegates to LLM provider; tool-call return path
  • core/src/types.rs
    • GraphBitError (via errors.rs)
    • RetryConfig with RetryableErrorType classification and backoff
    • CircuitBreaker with thresholds, half-open handling
    • NodeExecutionResult for structured failure propagation
  • core/src/llm/*.rs
    • Providers return GraphBitError::llm_provider on HTTP errors, parse errors, client errors
  • python/src/workflow/executor.rs
    • Validation errors; timeout_error; tool-call result wiring and fallback
  • python/src/tools/registry.rs
    • Tool execution failures captured and recorded; history capped

Concrete Code Examples (from the repository)

1) Agent failure detection and propagation (per-node)

```rust
match &node.node_type {
    NodeType::Agent { .. } => {
        Self::execute_agent_node_static(...).await
    }
    // ...
}

match result {
    Ok(output) => { /* store outputs; breaker.record_success(); */ }
    Err(error) => {
        if let Some(ref config) = retry_config {
            if config.should_retry(&error, attempt) {
                attempt += 1;
                let delay_ms = config.calculate_delay(attempt);
                if delay_ms > 0 {
                    sleep(...).await;
                }
                continue;
            }
        }
        return Ok(NodeExecutionResult::failure(error.to_string(), node.id.clone())
            .with_duration(start_time.elapsed().as_millis() as u64)
            .with_retry_count(attempt));
    }
}
```

2) Circuit breaker gating and state transitions

```rust
if let Some(ref mut breaker) = circuit_breaker {
    if !breaker.should_allow_request() {
        let error = GraphBitError::workflow_execution(
            "Circuit breaker is open - requests are being rejected",
        );
        return Ok(NodeExecutionResult::failure(error.to_string(), node.id.clone())
            .with_duration(start_time.elapsed().as_millis() as u64)
            .with_retry_count(attempt));
    }
}

pub fn record_failure(&mut self) {
    self.last_failure = Some(chrono::Utc::now());
    match self.state {
        CircuitBreakerState::Closed => {
            self.failure_count += 1;
            if self.failure_count >= self.config.failure_threshold {
                self.state = CircuitBreakerState::Open { opened_at: chrono::Utc::now() };
            }
        }
        CircuitBreakerState::HalfOpen => {
            self.state = CircuitBreakerState::Open { opened_at: chrono::Utc::now() };
            self.failure_count = 1;
            self.success_count = 0;
        }
        CircuitBreakerState::Open { .. } => {}
    }
}
```

3) Retry configuration and classification

```rust
pub fn should_retry(&self, error: &crate::errors::GraphBitError, attempt: u32) -> bool {
    if attempt >= self.max_attempts {
        return false;
    }
    let error_type = RetryableErrorType::from_error(error);
    self.retryable_errors.contains(&error_type)
}

pub fn calculate_delay(&self, attempt: u32) -> u64 {
    if attempt == 0 {
        return 0;
    }
    let base_delay = (self.initial_delay_ms as f64
        * self.backoff_multiplier.powi(attempt as i32 - 1))
        .min(self.max_delay_ms as f64);
    let jitter = (rand::random::<f64>() - 0.5) * 2.0 * self.jitter_factor * base_delay;
    ((base_delay + jitter).max(0.0) as u64).min(self.max_delay_ms)
}

pub fn from_error(error: &crate::errors::GraphBitError) -> Self {
    let s = error.to_string().to_lowercase();
    if s.contains("timeout") {
        Self::TimeoutError
    } else if s.contains("network") {
        Self::NetworkError
    } else if s.contains("rate limit") || s.contains("too many requests") {
        Self::RateLimitError
    } else if s.contains("auth") || s.contains("unauthorized") {
        Self::AuthenticationError
    } else {
        Self::Other
    }
}
```
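
To see what this formula produces, here is a stand-alone reproduction of `calculate_delay` with illustrative configuration values (the tiny xorshift generator only stands in for `rand::random::<f64>()`, and the defaults shown are not necessarily GraphBit's):

```rust
// Stand-alone reproduction of the backoff-with-jitter calculation.
struct RetryConfig {
    initial_delay_ms: u64,
    backoff_multiplier: f64,
    max_delay_ms: u64,
    jitter_factor: f64,
}

// xorshift-style generator: good enough to illustrate jitter, not for production.
fn pseudo_random(state: &mut u64) -> f64 {
    *state ^= *state << 13;
    *state ^= *state >> 7;
    *state ^= *state << 17;
    *state as f64 / u64::MAX as f64
}

impl RetryConfig {
    fn calculate_delay(&self, attempt: u32, rng_state: &mut u64) -> u64 {
        if attempt == 0 { return 0; }
        let base = (self.initial_delay_ms as f64
            * self.backoff_multiplier.powi(attempt as i32 - 1))
            .min(self.max_delay_ms as f64);
        let jitter = (pseudo_random(rng_state) - 0.5) * 2.0 * self.jitter_factor * base;
        ((base + jitter).max(0.0) as u64).min(self.max_delay_ms)
    }
}

fn main() {
    // Illustrative values; the repository's defaults may differ.
    let cfg = RetryConfig {
        initial_delay_ms: 500,
        backoff_multiplier: 2.0,
        max_delay_ms: 10_000,
        jitter_factor: 0.1,
    };
    let mut seed = 0x9E3779B97F4A7C15u64;
    for attempt in 1..=5 {
        println!("attempt {attempt}: {} ms", cfg.calculate_delay(attempt, &mut seed));
    }
}
```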

4) Batch-level fail-fast for agent/auth errors

```rust
let is_auth_error = error_msg.contains("auth")
    || error_msg.contains("key")
    || error_msg.contains("invalid")
    || error_msg.contains("unauthorized")
    || error_msg.contains("permission")
    || error_msg.contains("api error");

if is_auth_error || self.fail_fast {
    should_fail_fast = true;
    failure_message = e.to_string();
    break;
}
```

5) Agent creation failures (e.g., invalid API key/config)

```rust
Err(e) => {
    return Err(GraphBitError::workflow_execution(format!(
        "Failed to create agent '{agent_id_str}': {e}. This may be due to invalid API key or configuration."
    )));
}
```

6) Agent not found errors

```rust
let agent = agents_guard
    .get(agent_id)
    .ok_or_else(|| GraphBitError::agent_not_found(agent_id.to_string()))?
    .clone();
```

7) LLM provider error mapping (OpenAI example)

```rust
let response = req_builder
    .send()
    .await
    .map_err(|e| GraphBitError::llm_provider("openai", format!("Request failed: {e}")))?;

if !response.status().is_success() {
    let error_text = response
        .text()
        .await
        .unwrap_or_else(|_| "Unknown error".to_string());
    return Err(GraphBitError::llm_provider("openai", format!("API error: {error_text}")));
}
```

8) Agent with tool-calls: error flow back to Python

```rust
let llm_response = agent.llm_provider().complete(request).await?;

if !llm_response.tool_calls.is_empty() {
    Ok(serde_json::json!({
        "type": "tool_calls_required",
        "content": llm_response.content,
        "tool_calls": tool_calls_json,
        "original_prompt": prompt
    }))
} else {
    Ok(serde_json::Value::String(llm_response.content))
}
```

9) Tool execution error handling (Python)

```rust
let tool_results = Python::with_gil(|py| {
    execute_production_tool_calls(py, tool_calls_json, node_tools)
})
.map_err(|e| {
    graphbit_core::errors::GraphBitError::workflow_execution(format!("Failed to execute tools: {}", e))
})?;
```

10) Final LLM call fallback (Python)

```rust
match llm_provider.complete(final_request).await {
    Ok(final_response) => {
        context.set_node_output(&node.id, serde_json::Value::String(response_content));
        /* also stored by node name and in variables */
    }
    Err(e) => {
        tracing::error!("Failed to get final LLM response: {}", e);
        context.set_node_output(&node.id, serde_json::Value::String(tool_results.clone()));
    }
}
```

11) Tool registry failure result (captured, logged, recorded)

```rust
Err(e) => {
    let duration = start_time.elapsed().as_millis() as u64;
    let error_msg = format!("Tool execution failed: {}", e);
    let tool_result = ToolResult::failure(name.to_string(), params_str, error_msg, duration);
    self.add_to_history(tool_result.clone())?;
    tool_result
}
```

Step-by-Step Agent Error Handling Flow

  1. Preparation
  • Agents are auto-registered before execution; creation failures produce workflow_execution errors (e.g., invalid API key). Execution is aborted early with a descriptive message.
  2. Batch execution begins
  • Nodes execute concurrently as tokio tasks. For agent nodes, concurrency permits are acquired; if the breaker is open, the node immediately returns a NodeExecutionResult::failure.
  3. Agent execution
  • execute_agent_node_static resolves the prompt and calls the agent. LLM provider failures (HTTP/network/parse) surface as GraphBitError::llm/llm_provider and bubble back to execute_node_with_retry as Err(error).
  4. Retry and breaker logic
  • execute_node_with_retry classifies the error with RetryConfig::should_retry (text-based classification) and calculates backoff with jitter. While retrying, the breaker records failures; a sequence of failures may open the breaker, short-circuiting future attempts for that agent. (A condensed, self-contained sketch of this loop appears after the list.)
  5. Result construction
  • On final failure (no retries left), NodeExecutionResult::failure is returned (including node_id, duration_ms, retry_count). On success, success(...) is returned and the breaker records a success (which can close a half-open breaker).
  6. Batch aggregation and fail-fast
  • The batch loop joins task results; a node that failed earlier arrives here as Ok(Err(e)). The engine checks for auth/configuration criticality via message strings (auth/key/invalid/unauthorized/permission/api error). If fail_fast is enabled (e.g., LowLatency) or the error is auth-critical, the workflow context is set to Failed and returned immediately.
  7. Context propagation
  • For successful nodes, outputs are stored under the node id and name, and variables are set to stringified outputs (for compatibility). Failed nodes do not store new outputs; the WorkflowContext receives the failed state if fail-fast triggers.
  8. Tool-call specific flow (Python)
  • If an agent returns “tool_calls_required,” Python executes the tools and then attempts a final LLM call. If tool execution fails, a workflow_execution error is raised. If the final LLM call fails, the system logs the error and safely falls back to the tool results as the output for that node.
  9. Completion and stats
  • On success, the context completes. On fail-fast, context.fail(message) is set and returned. Concurrency stats and durations are tracked; breaker state is retained and updated in the executor’s map.
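
The following is a condensed, self-contained sketch of steps 2–5 above. It is an illustration only: `call_agent`, the error and result types, and the thresholds are stand-ins rather than GraphBit's actual API, and the breaker is reduced to a simple consecutive-failure counter.

```rust
use std::time::{Duration, Instant};

// Stand-in error and result types for illustration only.
#[derive(Debug)]
struct AgentError(String);

#[derive(Debug)]
enum Outcome {
    Success { output: String, retries: u32 },
    Failure { error: String, retries: u32, duration_ms: u128 },
}

// Hypothetical agent call that fails twice before succeeding.
fn call_agent(attempt: u32) -> Result<String, AgentError> {
    if attempt < 2 {
        Err(AgentError("network timeout".into()))
    } else {
        Ok("agent output".into())
    }
}

fn execute_node_with_retry(max_attempts: u32, failure_threshold: u32) -> Outcome {
    let start = Instant::now();
    let mut attempt = 0;
    let mut consecutive_failures = 0; // plays the circuit breaker's role here

    loop {
        // Breaker gate: once open, stop calling the agent at all (step 2).
        if consecutive_failures >= failure_threshold {
            return Outcome::Failure {
                error: "circuit breaker is open".into(),
                retries: attempt,
                duration_ms: start.elapsed().as_millis(),
            };
        }

        match call_agent(attempt) {
            // Step 5: record success and return a structured success result.
            Ok(output) => return Outcome::Success { output, retries: attempt },
            // Steps 3-4: classify, back off, then retry or give up.
            Err(err) => {
                consecutive_failures += 1;
                if attempt + 1 < max_attempts {
                    attempt += 1;
                    let delay = Duration::from_millis(100 * 2u64.pow(attempt - 1));
                    std::thread::sleep(delay); // stand-in for async sleep with jitter
                    continue;
                }
                return Outcome::Failure {
                    error: err.0,
                    retries: attempt,
                    duration_ms: start.elapsed().as_millis(),
                };
            }
        }
    }
}

fn main() {
    println!("{:?}", execute_node_with_retry(3, 5));
}
```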

Detailed Insights

  • Agent-specific error detection and classification
    • Detection is primarily via LLM provider errors and “agent_not_found.”
    • Classification uses RetryableErrorType::from_error, which inspects error strings for “timeout,” “network,” “rate limit,” “unauthorized,” etc. GraphBitError::is_retryable also marks LLM/Network/RateLimit errors retryable.
  • Circuit breaker thresholds and states
    • Configurable with failure_threshold, recovery_timeout_ms, success_threshold. Closed → Open after threshold failures; Open → HalfOpen after timeout; HalfOpen → Closed on success_threshold successes, else back to Open on a failure. (A sketch of the request-gating side of this state machine appears after this list.)
  • Retry with exponential backoff + jitter
    • initial_delay_ms, backoff_multiplier, max_delay_ms, jitter_factor. attempt increments after each failed try; delay applied before re-attempt.
  • Error propagation into workflow context
    • Per-node NodeExecutionResult returned up the batch loop; if fail-fast triggers, context.fail(message) is set and execution stops. Successful nodes write outputs and variables, failed nodes do not.
  • Fail-fast vs graceful degradation
    • LowLatency uses fail_fast and disables retries by default. HighThroughput keeps retries on and does not fail-fast (unless auth-critical). This reduces wasted work in latency-sensitive scenarios.
  • LLM provider error handling (API, rate-limits, auth)
    • Providers map HTTP/client/parse failures into GraphBitError::llm_provider with the provider name and message embedded (e.g., “API error: ...”). Rate limits and 401/403 responses aren’t explicitly branched on in code, but they can be recognized via body/status text; retryable classification uses message text (“too many requests,” “rate limit,” “unauthorized”).
  • Tool execution errors within agent workflows
    • Python registry converts tool exceptions into ToolResult::failure and adds to history. Workflow tool-call orchestration converts errors to GraphBitError::workflow_execution. Final LLM call failures are logged and fallback to tool results rather than failing the entire workflow by default.
  • Recovery and state management
    • Breaker half-open allows safe probing after timeout. When a node/agent recovers, breaker resets on enough successes. Context maintains a consistent state (Running → Completed/Failed).
  • Logging and monitoring
    • Errors and fallbacks are logged via tracing (Rust) and tracing/error (Python). Tool registry maintains a capped execution history (last 1000 results) for post-mortem analysis.
  • Performance impact
    • Retries add delay; backoff reduces provider pressure. Breaker prevents thundering herds on failing providers. Fail-fast cuts wasted work on hard failures. Concurrency permits only applied to agent nodes to reduce overhead elsewhere.
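
The excerpts earlier show `record_failure`; the other half of the state machine is the request gate. Below is a hedged sketch of what that gating typically looks like. Variant and method names mirror the excerpts (`should_allow_request`, `Open { opened_at }`), but the body is an illustration built on `std::time::Instant`, not the repository code (which uses `chrono` timestamps and a configured `recovery_timeout_ms`).

```rust
use std::time::{Duration, Instant};

// Illustrative circuit breaker gate; not the repository implementation.
#[derive(Clone, Copy)]
enum CircuitBreakerState {
    Closed,
    Open { opened_at: Instant },
    HalfOpen,
}

struct CircuitBreaker {
    state: CircuitBreakerState,
    recovery_timeout: Duration,
}

impl CircuitBreaker {
    fn should_allow_request(&mut self) -> bool {
        if let CircuitBreakerState::Open { opened_at } = self.state {
            if opened_at.elapsed() >= self.recovery_timeout {
                // Recovery timeout elapsed: move to half-open and allow a probe.
                self.state = CircuitBreakerState::HalfOpen;
                return true;
            }
            return false; // still open: reject the request
        }
        // Closed and half-open both allow requests; enough successes in
        // half-open close the breaker again (handled by record_success).
        true
    }
}

fn main() {
    let mut breaker = CircuitBreaker {
        state: CircuitBreakerState::Open { opened_at: Instant::now() },
        recovery_timeout: Duration::from_millis(50),
    };
    assert!(!breaker.should_allow_request()); // still open: rejected
    std::thread::sleep(Duration::from_millis(60));
    assert!(breaker.should_allow_request()); // timeout elapsed: half-open probe allowed
}
```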

Async Scenarios

  • Node tasks run in tokio; errors inside tasks surface as Ok(Err(...)) at the join site. If a task cannot be joined (a JoinError, e.g., the task panicked or was cancelled), it is treated as an execution failure, and fail-fast may abort the workflow. The engine aggregates all task results and makes a mode-aware decision to continue or fail early.
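
A minimal illustration of that shape, assuming the `tokio` runtime; the node body, error type, and printing are placeholders, and GraphBit's real aggregation logic lives in the batch loop of `WorkflowExecutor::execute`:

```rust
// Requires the `tokio` crate (e.g., with the "full" feature set).
// Per-node errors surface as Ok(Err(..)) at the join site, while a
// panicked/cancelled task surfaces as Err(JoinError).

#[derive(Debug)]
struct NodeError(String);

async fn run_node(id: u32) -> Result<String, NodeError> {
    if id == 2 {
        Err(NodeError("llm provider: rate limit".into()))
    } else {
        Ok(format!("node {id} output"))
    }
}

#[tokio::main]
async fn main() {
    let handles: Vec<_> = (1..=3).map(|id| tokio::spawn(run_node(id))).collect();

    for handle in handles {
        match handle.await {
            // Task ran to completion and the node succeeded.
            Ok(Ok(output)) => println!("success: {output}"),
            // Task ran to completion but the node failed:
            // this is where retry/fail-fast decisions apply.
            Ok(Err(node_err)) => println!("node failure: {node_err:?}"),
            // The task itself could not be joined (panic/cancellation):
            // treated as an execution failure as well.
            Err(join_err) => println!("join failure: {join_err}"),
        }
    }
}
```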

Representative Data Structures for Failures

  • NodeExecutionResult (success/failure with duration and retries)

```rust
pub struct NodeExecutionResult {
    pub success: bool,
    pub output: serde_json::Value,
    pub error: Option<String>,
    pub duration_ms: u64,
    pub retry_count: u32,
    pub node_id: NodeId,
}
```
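
The excerpts above chain `failure(..).with_duration(..).with_retry_count(..)`. The following is a builder-style sketch inferred from those call sites, not the actual implementation; `NodeId` is simplified to a `String` and the defaults are guesses (the example depends on the real `serde_json` crate):

```rust
type NodeId = String;

#[derive(Debug)]
pub struct NodeExecutionResult {
    pub success: bool,
    pub output: serde_json::Value,
    pub error: Option<String>,
    pub duration_ms: u64,
    pub retry_count: u32,
    pub node_id: NodeId,
}

impl NodeExecutionResult {
    // Structured failure with sensible defaults for the remaining fields.
    pub fn failure(error: impl Into<String>, node_id: NodeId) -> Self {
        Self {
            success: false,
            output: serde_json::Value::Null,
            error: Some(error.into()),
            duration_ms: 0,
            retry_count: 0,
            node_id,
        }
    }

    pub fn with_duration(mut self, duration_ms: u64) -> Self {
        self.duration_ms = duration_ms;
        self
    }

    pub fn with_retry_count(mut self, retry_count: u32) -> Self {
        self.retry_count = retry_count;
        self
    }
}

fn main() {
    let result = NodeExecutionResult::failure("circuit breaker is open", "node-42".to_string())
        .with_duration(1_250)
        .with_retry_count(3);
    println!("{result:?}");
}
```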

  • GraphBitError and helpers

```rust
pub enum GraphBitError {
    Configuration { message: String },
    LlmProvider { provider: String, message: String },
    Llm { message: String },
    // ...
}
```
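
The helper constructors used throughout the excerpts (`workflow_execution`, `llm_provider`, `agent_not_found`) and the `is_retryable` check mentioned in the insights plausibly reduce to thin wrappers over this enum. A hedged sketch (variant and method names follow the article; the bodies and the exact retryable set are illustrative):

```rust
#[derive(Debug)]
pub enum GraphBitError {
    Configuration { message: String },
    LlmProvider { provider: String, message: String },
    Llm { message: String },
    WorkflowExecution { message: String },
    AgentNotFound { agent_id: String },
}

impl GraphBitError {
    pub fn workflow_execution(message: impl Into<String>) -> Self {
        Self::WorkflowExecution { message: message.into() }
    }

    pub fn llm_provider(provider: impl Into<String>, message: impl Into<String>) -> Self {
        Self::LlmProvider { provider: provider.into(), message: message.into() }
    }

    pub fn agent_not_found(agent_id: impl Into<String>) -> Self {
        Self::AgentNotFound { agent_id: agent_id.into() }
    }

    // LLM/provider (and, in the real code, network/rate-limit) errors are
    // considered retryable; configuration and missing-agent errors are not.
    pub fn is_retryable(&self) -> bool {
        matches!(self, Self::Llm { .. } | Self::LlmProvider { .. })
    }
}

fn main() {
    let err = GraphBitError::llm_provider("openai", "API error: 429 Too Many Requests");
    println!("retryable: {} ({err:?})", err.is_retryable());
}
```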

Summary

GraphBit’s agent error handling blends defensive preparation (validation, auto-registration), robust per-node execution control (circuit breaker + retries), and flexible workflow-level strategies (fail-fast vs graceful). LLM provider errors are mapped uniformly into GraphBitError and considered for retries by classification. Tool-call workflows in Python add pragmatic fallbacks (keep tool results if final LLM fails). The result is a resilient system that avoids cascading failures, surfaces actionable diagnostics, and provides predictable behavior across performance modes.

Natural next steps would be focused unit/integration tests that:

  • Simulate an LLM provider failure to verify retry and breaker transitions
  • Assert fail-fast behavior on an “unauthorized” error string
  • Verify Python tool-call fallback when final LLM errors out
