Agent Error Handling Architecture in GraphBit: Detection, Classification, and Mitigation Strategies
Overview: Agent Error Handling Architecture
GraphBit’s agent error handling is implemented primarily in the Rust core with targeted support in the Python bindings:
Rust core (graphbit_core)
- Detection: Agent node execution errors, LLM provider API failures, missing agents, dependency issues
- Classification: Retryable vs non-retryable errors, auth/config-critical failures
- Mitigation: Circuit breaker (per agent), retries with exponential backoff + jitter
- Propagation: Convert to NodeExecutionResult; update WorkflowContext or fail workflow (fail-fast)
- Concurrency: Limits on agent nodes; avoids cascading failures via breaker
Python bindings
- Pre-dispatch validation and global timeout errors
- Tool-call handling: map tool errors; fallback to safe outputs on final LLM failure
- Logging and surfacing of errors to the user-facing WorkflowResult
Key Components and Methods
- core/src/workflow.rs
  - WorkflowExecutor::execute: batch loop, fail-fast checks, context updates
  - WorkflowExecutor::execute_node_with_retry: breaker checks, retries, result construction
  - WorkflowExecutor::execute_agent_node_static / execute_agent_with_tools: agent execution path; delegates to LLM provider; tool-call return path
- core/src/types.rs
  - GraphBitError (via errors.rs)
  - RetryConfig with RetryableErrorType classification and backoff
  - CircuitBreaker with thresholds, half-open handling
  - NodeExecutionResult for structured failure propagation
- core/src/llm/*.rs
  - Providers return GraphBitError::llm_provider on HTTP errors, parse errors, client errors
- python/src/workflow/executor.rs
  - Validation errors; timeout_error; tool-call result wiring and fallback
- python/src/tools/registry.rs
  - Tool execution failures captured and recorded; history capped
Concrete Code Examples (from the repository)
1) Agent failure detection and propagation (per-node)
```rust
// Dispatch by node type (inside execute_node_with_retry)
match &node.node_type {
    NodeType::Agent { .. } => {
        Self::execute_agent_node_static(/* ... */).await
    }
    // ...
}

// Per-attempt result handling
match result {
    Ok(output) => { /* store outputs; breaker.record_success(); */ }
    Err(error) => {
        if let Some(ref config) = retry_config {
            if config.should_retry(&error, attempt) {
                attempt += 1;
                let delay_ms = config.calculate_delay(attempt);
                if delay_ms > 0 {
                    sleep(/* ... */).await;
                }
                continue;
            }
        }
        return Ok(NodeExecutionResult::failure(error.to_string(), node.id.clone())
            .with_duration(start_time.elapsed().as_millis() as u64)
            .with_retry_count(attempt));
    }
}
```
2) Circuit breaker gating and state transitions
```rust
// Gate each attempt on the per-agent breaker
if let Some(ref mut breaker) = circuit_breaker {
    if !breaker.should_allow_request() {
        let error = GraphBitError::workflow_execution(
            "Circuit breaker is open - requests are being rejected",
        );
        return Ok(NodeExecutionResult::failure(error.to_string(), node.id.clone())
            .with_duration(start_time.elapsed().as_millis() as u64)
            .with_retry_count(attempt));
    }
}

// State transitions on failure
pub fn record_failure(&mut self) {
    self.last_failure = Some(chrono::Utc::now());
    match self.state {
        CircuitBreakerState::Closed => {
            self.failure_count += 1;
            if self.failure_count >= self.config.failure_threshold {
                self.state = CircuitBreakerState::Open { opened_at: chrono::Utc::now() };
            }
        }
        CircuitBreakerState::HalfOpen => {
            self.state = CircuitBreakerState::Open { opened_at: chrono::Utc::now() };
            self.failure_count = 1;
            self.success_count = 0;
        }
        CircuitBreakerState::Open { .. } => {}
    }
}
```
3) Retry configuration and classification
```rust
pub fn should_retry(&self, error: &crate::errors::GraphBitError, attempt: u32) -> bool {
    if attempt >= self.max_attempts {
        return false;
    }
    let error_type = RetryableErrorType::from_error(error);
    self.retryable_errors.contains(&error_type)
}

pub fn calculate_delay(&self, attempt: u32) -> u64 {
    if attempt == 0 {
        return 0;
    }
    let base_delay = (self.initial_delay_ms as f64
        * self.backoff_multiplier.powi(attempt as i32 - 1))
        .min(self.max_delay_ms as f64);
    let jitter = (rand::random::<f64>() - 0.5) * 2.0 * self.jitter_factor * base_delay;
    ((base_delay + jitter).max(0.0) as u64).min(self.max_delay_ms)
}

pub fn from_error(error: &crate::errors::GraphBitError) -> Self {
    let s = error.to_string().to_lowercase();
    if s.contains("timeout") {
        Self::TimeoutError
    } else if s.contains("network") {
        Self::NetworkError
    } else if s.contains("rate limit") || s.contains("too many requests") {
        Self::RateLimitError
    } else if s.contains("auth") || s.contains("unauthorized") {
        Self::AuthenticationError
    } else {
        Self::Other
    }
}
```
4) Batch-level fail-fast for agent/auth errors
```rust
let is_auth_error = error_msg.contains("auth")
    || error_msg.contains("key")
    || error_msg.contains("invalid")
    || error_msg.contains("unauthorized")
    || error_msg.contains("permission")
    || error_msg.contains("api error");
if is_auth_error || self.fail_fast {
    should_fail_fast = true;
    failure_message = e.to_string();
    break;
}
```
5) Agent creation failures (e.g., invalid API key/config)
```rust
Err(e) => {
    return Err(GraphBitError::workflow_execution(format!(
        "Failed to create agent '{agent_id_str}': {e}. This may be due to invalid API key or configuration."
    )));
}
```
6) Agent not found errors
```rust
let agent = agents_guard
    .get(agent_id)
    .ok_or_else(|| GraphBitError::agent_not_found(agent_id.to_string()))?
    .clone();
```
7) LLM provider error mapping (OpenAI example)
```rust
let response = req_builder
    .send()
    .await
    .map_err(|e| GraphBitError::llm_provider("openai", format!("Request failed: {e}")))?;
if !response.status().is_success() {
    let error_text = response
        .text()
        .await
        .unwrap_or_else(|_| "Unknown error".to_string());
    return Err(GraphBitError::llm_provider("openai", format!("API error: {error_text}")));
}
```
8) Agent with tool-calls: error flow back to Python
```rust
let llm_response = agent.llm_provider().complete(request).await?;
if !llm_response.tool_calls.is_empty() {
    Ok(serde_json::json!({
        "type": "tool_calls_required",
        "content": llm_response.content,
        "tool_calls": tool_calls_json,
        "original_prompt": prompt
    }))
} else {
    Ok(serde_json::Value::String(llm_response.content))
}
```
9) Tool execution error handling (Python)
```rust
let tool_results = Python::with_gil(|py| {
    execute_production_tool_calls(py, tool_calls_json, node_tools)
})
.map_err(|e| {
    graphbit_core::errors::GraphBitError::workflow_execution(format!(
        "Failed to execute tools: {}",
        e
    ))
})?;
```
10) Final LLM call fallback (Python)
```rust
match llm_provider.complete(final_request).await {
    Ok(final_response) => {
        // response_content is derived from final_response (construction elided);
        // it is also stored under the node name and as a stringified variable.
        context.set_node_output(&node.id, serde_json::Value::String(response_content));
    }
    Err(e) => {
        tracing::error!("Failed to get final LLM response: {}", e);
        // Fall back to the raw tool results so the node still produces output
        context.set_node_output(&node.id, serde_json::Value::String(tool_results.clone()));
    }
}
```
11) Tool registry failure result (captured, logged, recorded)
```rust
Err(e) => {
    let duration = start_time.elapsed().as_millis() as u64;
    let error_msg = format!("Tool execution failed: {}", e);
    let tool_result = ToolResult::failure(name.to_string(), params_str, error_msg, duration);
    self.add_to_history(tool_result.clone())?;
    tool_result
}
```
Step-by-Step Agent Error Handling Flow
- Preparation
  - Agents are auto-registered before execution; creation failures produce workflow_execution errors (e.g., invalid API key). Execution is aborted early with a descriptive message.
- Batch execution begins
  - Nodes execute concurrently as tokio tasks. For agent nodes, concurrency permits are acquired; if the breaker is open, the node immediately returns a NodeExecutionResult::failure.
- Agent execution
  - execute_agent_node_static resolves the prompt and calls the agent. LLM provider failures (HTTP/network/parse) surface as GraphBitError::llm/llm_provider and bubble back immediately to execute_node_with_retry as Err(error).
- Retry and breaker logic
  - execute_node_with_retry classifies the error with RetryConfig::should_retry (text-based classification) and calculates backoff with jitter. While retrying, the breaker records failures; a run of failures may open the breaker, short-circuiting future attempts for that agent. (A consolidated sketch of this loop follows the list.)
- Result construction
  - On final failure (no retries left), NodeExecutionResult::failure is returned (including node_id, duration_ms, retry_count). On success, success(...) is returned and the breaker records the success (which can close a half-open breaker).
- Batch aggregation and fail-fast
  - The batch loop joins task results. If a node returned Ok(Err(e)), it is seen here; the engine checks for auth/configuration criticality via message strings (auth/key/invalid/unauthorized/permission/api error). If fail_fast is enabled (e.g., LowLatency) or the error is auth-critical, the workflow context is set to Failed and returned immediately.
- Context propagation
  - For successful nodes, outputs are stored under the node id and name, and variables are set to stringified outputs (for compatibility). Failed nodes do not store new outputs; the WorkflowContext is marked failed if fail-fast triggers.
- Tool-call specific flow (Python)
  - If an agent returns "tool_calls_required", Python executes the tools and then attempts a final LLM call. If tool execution fails, a workflow_execution error is raised. If the final LLM call fails, the system logs the error and safely falls back to the tool results as that node's output.
- Completion and stats
  - On success, the context completes. On fail-fast, context.fail(message) is set and returned. Concurrency stats and durations are tracked; breaker state is retained and updated in the executor's map.
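To tie the steps together, here is a minimal, self-contained model of the per-node control loop (breaker gate, retry with backoff, structured failure). The MiniBreaker, MiniRetry, and run_node names, the synchronous std::thread::sleep, and the String error type are illustrative simplifications rather than GraphBit APIs; the real loop is async, applies jitter, tracks half-open recovery, and returns NodeExecutionResult.

```rust
use std::time::Duration;

#[derive(Debug)]
enum BreakerState { Closed, Open }

// Simplified breaker: no half-open recovery, just Closed -> Open on threshold.
struct MiniBreaker {
    state: BreakerState,
    failure_count: u32,
    failure_threshold: u32,
}

impl MiniBreaker {
    fn should_allow_request(&self) -> bool { matches!(self.state, BreakerState::Closed) }
    fn record_failure(&mut self) {
        self.failure_count += 1;
        if self.failure_count >= self.failure_threshold {
            self.state = BreakerState::Open;
        }
    }
    fn record_success(&mut self) {
        self.failure_count = 0;
        self.state = BreakerState::Closed;
    }
}

// Simplified retry policy: exponential backoff, no jitter, no error classification.
struct MiniRetry {
    max_attempts: u32,
    initial_delay_ms: u64,
    backoff_multiplier: f64,
}

impl MiniRetry {
    fn delay(&self, attempt: u32) -> Duration {
        let ms = self.initial_delay_ms as f64 * self.backoff_multiplier.powi(attempt as i32 - 1);
        Duration::from_millis(ms as u64)
    }
}

fn run_node(
    breaker: &mut MiniBreaker,
    retry: &MiniRetry,
    mut call: impl FnMut() -> Result<String, String>,
) -> Result<String, String> {
    let mut attempt = 0;
    loop {
        // Breaker gate before each attempt (mirrors should_allow_request above).
        if !breaker.should_allow_request() {
            return Err("circuit breaker is open".into());
        }
        match call() {
            Ok(output) => {
                breaker.record_success();
                return Ok(output);
            }
            Err(e) => {
                breaker.record_failure();
                attempt += 1;
                if attempt >= retry.max_attempts {
                    // Final failure after exhausting retries.
                    return Err(format!("failed after {attempt} attempts: {e}"));
                }
                std::thread::sleep(retry.delay(attempt));
            }
        }
    }
}

fn main() {
    let mut breaker = MiniBreaker { state: BreakerState::Closed, failure_count: 0, failure_threshold: 3 };
    let retry = MiniRetry { max_attempts: 3, initial_delay_ms: 10, backoff_multiplier: 2.0 };
    let mut calls = 0;
    // Fails twice, then succeeds: the breaker stays closed and the loop recovers.
    let result = run_node(&mut breaker, &retry, || {
        calls += 1;
        if calls < 3 { Err("transient provider error".into()) } else { Ok("agent output".into()) }
    });
    println!("{result:?}, breaker state: {:?}", breaker.state);
}
```

The design point the sketch illustrates is the division of labor: retries absorb transient errors within a single node execution, while the breaker remembers failures across executions and cuts off a persistently failing agent.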
Detailed Insights
- Agent-specific error detection and classification
  - Detection is primarily via LLM provider errors and "agent_not_found."
  - Classification uses RetryableErrorType::from_error, which inspects error strings for "timeout," "network," "rate limit," "unauthorized," etc. GraphBitError::is_retryable also marks LLM/Network/RateLimit errors as retryable.
- Circuit breaker thresholds and states
  - Configurable with failure_threshold, recovery_timeout_ms, and success_threshold. Closed → Open after threshold failures; Open → HalfOpen after the recovery timeout; HalfOpen → Closed after success_threshold successes, or back to Open on a failure. (An illustrative configuration sketch follows this list.)
- Retry with exponential backoff + jitter
  - Governed by initial_delay_ms, backoff_multiplier, max_delay_ms, and jitter_factor. attempt increments after each failed try; the delay is applied before the re-attempt.
- Error propagation into workflow context
  - Per-node NodeExecutionResult values are returned up the batch loop; if fail-fast triggers, context.fail(message) is set and execution stops. Successful nodes write outputs and variables; failed nodes do not.
- Fail-fast vs graceful degradation
  - LowLatency uses fail_fast and disables retries by default. HighThroughput keeps retries on and does not fail fast (unless the error is auth-critical). This reduces wasted work in latency-sensitive scenarios.
- LLM provider error handling (API, rate limits, auth)
  - Providers map HTTP/client/parse failures into GraphBitError::llm_provider with the provider name and message embedded (e.g., "API error: ..."). Rate limits and 401/403 responses aren't explicitly branched in code but can be recognized via body/status text; retryable classification uses message text ("too many requests," "rate limit," "unauthorized").
- Tool execution errors within agent workflows
  - The Python registry converts tool exceptions into ToolResult::failure and adds them to history. Workflow tool-call orchestration converts errors to GraphBitError::workflow_execution. Final LLM call failures are logged and fall back to tool results rather than failing the entire workflow by default.
- Recovery and state management
  - The breaker's half-open state allows safe probing after the recovery timeout. When a node/agent recovers, the breaker resets after enough successes. The context maintains a consistent state (Running → Completed/Failed).
- Logging and monitoring
  - Errors and fallbacks are logged via tracing (Rust core) and tracing::error (Python bindings). The tool registry maintains a capped execution history (last 1000 results) for post-mortem analysis.
- Performance impact
  - Retries add delay; backoff reduces provider pressure. The breaker prevents thundering herds on failing providers. Fail-fast cuts wasted work on hard failures. Concurrency permits apply only to agent nodes to reduce overhead elsewhere.
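As a reference point for the knobs named above, here is a small self-contained sketch with illustrative values. The struct shapes, field visibility, the CircuitBreakerConfig name, and the chosen numbers are assumptions, not GraphBit's actual definitions; only the field names come from the sections above.

```rust
// Illustrative mirrors of the documented configuration knobs (assumed shapes).
#[derive(Debug)]
enum RetryableErrorType { TimeoutError, NetworkError, RateLimitError }

#[derive(Debug)]
struct RetryConfig {
    max_attempts: u32,
    initial_delay_ms: u64,
    backoff_multiplier: f64,
    max_delay_ms: u64,
    jitter_factor: f64,
    retryable_errors: Vec<RetryableErrorType>,
}

#[derive(Debug)]
struct CircuitBreakerConfig {
    failure_threshold: u32,   // Closed -> Open after this many failures
    recovery_timeout_ms: u64, // Open -> HalfOpen after this long
    success_threshold: u32,   // HalfOpen -> Closed after this many successes
}

fn main() {
    let retry = RetryConfig {
        max_attempts: 3,
        initial_delay_ms: 500,
        backoff_multiplier: 2.0,
        max_delay_ms: 30_000,
        jitter_factor: 0.1,
        retryable_errors: vec![
            RetryableErrorType::TimeoutError,
            RetryableErrorType::NetworkError,
            RetryableErrorType::RateLimitError,
        ],
    };
    let breaker = CircuitBreakerConfig {
        failure_threshold: 5,
        recovery_timeout_ms: 10_000,
        success_threshold: 2,
    };
    println!("{retry:#?}\n{breaker:#?}");
}
```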
Async Scenarios
- Node tasks run in tokio; errors inside tasks become Ok(Err(...)) at the join site. If joining a task fails (a JoinError, e.g., the task panicked or was cancelled), it is treated as an execution failure; fail-fast may abort the workflow. The engine aggregates all task results and makes a mode-aware decision to continue or fail early.
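A minimal sketch of that join-site pattern, assuming a plain tokio runtime; NodeOutcome and the node_* functions are illustrative stand-ins for GraphBit's NodeExecutionResult handling, not repository code.

```rust
// Requires the tokio crate (e.g., features "rt-multi-thread" + "macros").
type NodeOutcome = Result<String, String>;

async fn node_a() -> NodeOutcome { Ok("node A output".to_string()) }
async fn node_b() -> NodeOutcome { Err("node B: provider timeout".to_string()) }
async fn node_c() -> NodeOutcome { panic!("node C panicked") }

#[tokio::main]
async fn main() {
    let handles = vec![
        tokio::spawn(node_a()),
        tokio::spawn(node_b()),
        tokio::spawn(node_c()),
    ];

    for handle in handles {
        // Awaiting a JoinHandle yields Result<NodeOutcome, JoinError>: the outer
        // Err covers panics/cancellation, the inner Err is the node's own failure.
        let outcome: NodeOutcome = match handle.await {
            Ok(node_result) => node_result,
            Err(join_err) => Err(format!("task join failed: {join_err}")),
        };
        match outcome {
            Ok(out) => println!("success: {out}"),
            Err(msg) => println!("failure recorded: {msg}"),
        }
    }
}
```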
Representative Data Structures for Failures
- NodeExecutionResult (success/failure with duration and retries)
```rust
pub struct NodeExecutionResult {
    pub success: bool,
    pub output: serde_json::Value,
    pub error: Option<String>,
    pub duration_ms: u64,
    pub retry_count: u32,
    pub node_id: NodeId,
}
```
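The constructor calls visible in the excerpts (failure, with_duration, with_retry_count) suggest a builder-style impl roughly like the following; this is an assumed shape for illustration, not the repository's implementation.

```rust
// Assumed builder shape for the struct above; GraphBit's actual code may differ.
impl NodeExecutionResult {
    pub fn failure(error: impl Into<String>, node_id: NodeId) -> Self {
        Self {
            success: false,
            output: serde_json::Value::Null,
            error: Some(error.into()),
            duration_ms: 0,
            retry_count: 0,
            node_id,
        }
    }
    pub fn with_duration(mut self, duration_ms: u64) -> Self {
        self.duration_ms = duration_ms;
        self
    }
    pub fn with_retry_count(mut self, retry_count: u32) -> Self {
        self.retry_count = retry_count;
        self
    }
}
```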
- GraphBitError and helpers
```rust
pub enum GraphBitError {
    Configuration { message: String },
    LlmProvider { provider: String, message: String },
    Llm { message: String },
    // ...
}
```
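The helper constructors used throughout (workflow_execution, llm_provider, agent_not_found) plausibly map onto variants behind the "// ..." above. The sketch below is an assumption: the WorkflowExecution and AgentNotFound variant names, the Into<String> signatures, and the is_retryable body are illustrative, not the repository's definitions.

```rust
// Assumed variants and helper shapes; the real enum carries more variants/data.
impl GraphBitError {
    pub fn llm_provider(provider: impl Into<String>, message: impl Into<String>) -> Self {
        Self::LlmProvider { provider: provider.into(), message: message.into() }
    }
    pub fn workflow_execution(message: impl Into<String>) -> Self {
        Self::WorkflowExecution { message: message.into() } // assumed variant
    }
    pub fn agent_not_found(agent_id: impl Into<String>) -> Self {
        Self::AgentNotFound { agent_id: agent_id.into() } // assumed variant
    }
    pub fn is_retryable(&self) -> bool {
        // Per the insights above, LLM errors are retryable; the real
        // implementation also covers network/rate-limit variants not shown here.
        matches!(self, Self::Llm { .. } | Self::LlmProvider { .. })
    }
}
```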
Summary
GraphBit’s agent error handling blends defensive preparation (validation, auto-registration), robust per-node execution control (circuit breaker + retries), and flexible workflow-level strategies (fail-fast vs graceful). LLM provider errors are mapped uniformly into GraphBitError and considered for retries by classification. Tool-call workflows in Python add pragmatic fallbacks (keep tool results if final LLM fails). The result is a resilient system that avoids cascading failures, surfaces actionable diagnostics, and provides predictable behavior across performance modes.
Suggested follow-up unit/integration tests would:
- Simulate an LLM provider failure to verify retry and breaker transitions
- Assert fail-fast behavior on an “unauthorized” error string
- Verify Python tool-call fallback when final LLM errors out