|
--- |
|
title: '监控与分析' |
|
description: '全面监控 MCP 服务器性能、健康状况和使用情况' |
|
--- |
|
|
|
## 概述 |
|
|
|
MCPHub 提供全面的监控和分析功能,帮助您跟踪 MCP 服务器的性能、健康状况和使用模式。系统提供实时指标、历史分析和智能警报。 |
|
|
|
## 实时监控 |
|
|
|
### 仪表板概览 |
|
|
|
主监控仪表板显示关键指标: |
|
|
|
```bash |
|
# 访问监控仪表板 |
|
curl -X GET http: |
|
-H "Authorization: Bearer $TOKEN" |
|
``` |
|
|
|
响应示例: |
|
|
|
```json |
|
{ |
|
"overview": { |
|
"totalServers": 12, |
|
"activeServers": 10, |
|
"totalRequests": 15420, |
|
"avgResponseTime": "245ms", |
|
"errorRate": "0.3%", |
|
"uptime": "99.9%" |
|
}, |
|
"realtime": { |
|
"requestsPerMinute": 85, |
|
"activeConnections": 156, |
|
"memoryUsage": "68%", |
|
"cpuUsage": "42%" |
|
} |
|
} |
|
``` |
|
|
|
### 服务器健康状态 |
|
|
|
```bash |
|
# 获取所有服务器状态 |
|
curl -X GET http: |
|
-H "Authorization: Bearer $TOKEN" |
|
``` |
|
|
|
## 性能指标 |
|
|
|
### 响应时间监控 |
|
|
|
```json |
|
{ |
|
"responseTime": { |
|
"metrics": [ |
|
{ |
|
"serverId": "server-1", |
|
"average": "180ms", |
|
"p50": "150ms", |
|
"p95": "300ms", |
|
"p99": "500ms" |
|
} |
|
], |
|
"alerts": { |
|
"slowResponse": { |
|
"threshold": "1s", |
|
"current": "180ms", |
|
"status": "normal" |
|
} |
|
} |
|
} |
|
} |
|
``` |
|
|
|
### 吞吐量分析 |
|
|
|
```bash |
|
# 获取吞吐量统计 |
|
curl -X GET http: |
|
-H "Authorization: Bearer $TOKEN" \ |
|
-G -d "timeRange=1h" -d "granularity=5m" |
|
``` |
|
|
|
响应示例: |
|
|
|
```json |
|
{ |
|
"timeRange": "1h", |
|
"granularity": "5m", |
|
"data": [ |
|
{ |
|
"timestamp": "2024-01-01T12:00:00Z", |
|
"totalRequests": 450, |
|
"successfulRequests": 448, |
|
"failedRequests": 2, |
|
"avgResponseTime": "230ms" |
|
} |
|
], |
|
"summary": { |
|
"peakThroughput": 520, |
|
"averageThroughput": 412, |
|
"totalRequests": 4944 |
|
} |
|
} |
|
``` |
|
|
|
### 资源使用监控 |
|
|
|
```json |
|
{ |
|
"resources": { |
|
"cpu": { |
|
"usage": "42%", |
|
"cores": 8, |
|
"processes": [ |
|
{ "name": "mcp-server-1", "cpu": "15%" }, |
|
{ "name": "mcp-server-2", "cpu": "12%" } |
|
] |
|
}, |
|
"memory": { |
|
"total": "32GB", |
|
"used": "21.6GB", |
|
"available": "10.4GB", |
|
"byProcess": [ |
|
{ "name": "mcp-server-1", "memory": "2.1GB" }, |
|
{ "name": "mcp-server-2", "memory": "1.8GB" } |
|
] |
|
}, |
|
"network": { |
|
"bytesIn": "1.2GB", |
|
"bytesOut": "890MB", |
|
"packetsIn": 1250000, |
|
"packetsOut": 980000 |
|
}, |
|
"disk": { |
|
"total": "1TB", |
|
"used": "340GB", |
|
"available": "660GB", |
|
"iops": 1200 |
|
} |
|
} |
|
} |
|
``` |
|
|
|
## 日志管理 |
|
|
|
### 集中化日志收集 |
|
|
|
配置日志聚合: |
|
|
|
```json |
|
{ |
|
"logging": { |
|
"centralized": true, |
|
"storage": { |
|
"type": "elasticsearch", |
|
"config": { |
|
"hosts": ["localhost:9200"], |
|
"index": "mcphub-logs", |
|
"retention": "30d" |
|
} |
|
}, |
|
"levels": ["error", "warn", "info", "debug"], |
|
"sources": ["application", "mcp-servers", "nginx", "system"] |
|
} |
|
} |
|
``` |
|
|
|
### 实时日志查看 |
|
|
|
```bash |
|
# 获取实时日志流 |
|
curl -X GET http: |
|
-H "Authorization: Bearer $TOKEN" \ |
|
-H "Accept: text/event-stream" |
|
|
|
# 过滤日志 |
|
curl -X GET http: |
|
-H "Authorization: Bearer $TOKEN" \ |
|
-G -d "level=error" -d "server=server-1" -d "limit=100" |
|
``` |
|
|
|
### 日志分析 |
|
|
|
```bash |
|
# 错误模式分析 |
|
curl -X GET http: |
|
-H "Authorization: Bearer $TOKEN" \ |
|
-G -d "type=error-patterns" -d "timeRange=24h" |
|
``` |
|
|
|
响应示例: |
|
|
|
```json |
|
{ |
|
"analysis": { |
|
"errorPatterns": [ |
|
{ |
|
"pattern": "Connection timeout", |
|
"occurrences": 45, |
|
"trend": "increasing", |
|
"affectedServers": ["server-2", "server-3"] |
|
}, |
|
{ |
|
"pattern": "Memory allocation failed", |
|
"occurrences": 12, |
|
"trend": "stable", |
|
"affectedServers": ["server-1"] |
|
} |
|
], |
|
"recommendations": ["检查服务器网络连接", "增加内存限制配置"] |
|
} |
|
} |
|
``` |
|
|
|
## 警报系统 |
|
|
|
### 配置警报规则 |
|
|
|
```json |
|
{ |
|
"alerts": [ |
|
{ |
|
"name": "high-response-time", |
|
"description": "响应时间过高警报", |
|
"condition": "avg(response_time) > 1000ms over 5m", |
|
"severity": "warning", |
|
"notifications": { |
|
"email": ["admin@company.com"], |
|
"slack": "#alerts", |
|
"webhook": "https://hooks.company.com/alerts" |
|
} |
|
}, |
|
{ |
|
"name": "server-down", |
|
"description": "服务器宕机警报", |
|
"condition": "server_status == 'down'", |
|
"severity": "critical", |
|
"notifications": { |
|
"email": ["oncall@company.com"], |
|
"sms": ["+1234567890"], |
|
"pagerduty": "service-key" |
|
} |
|
}, |
|
{ |
|
"name": "high-error-rate", |
|
"description": "错误率过高", |
|
"condition": "error_rate > 5% over 10m", |
|
"severity": "error", |
|
"notifications": { |
|
"slack": "#dev-team" |
|
} |
|
} |
|
] |
|
} |
|
``` |
|
|
|
### 警报通知 |
|
|
|
```bash |
|
# 测试警报通知 |
|
curl -X POST http: |
|
-H "Content-Type: application/json" \ |
|
-H "Authorization: Bearer $TOKEN" \ |
|
-d '{ |
|
"alertName": "high-response-time", |
|
"channel": "email" |
|
}' |
|
``` |
|
|
|
### 警报历史 |
|
|
|
```bash |
|
# 获取警报历史 |
|
curl -X GET http: |
|
-H "Authorization: Bearer $TOKEN" \ |
|
-G -d "timeRange=7d" -d "severity=critical" |
|
``` |
|
|
|
## 用户活动监控 |
|
|
|
### 用户会话跟踪 |
|
|
|
```json |
|
{ |
|
"userActivity": { |
|
"activeSessions": 156, |
|
"totalUsers": 89, |
|
"sessionsByUser": [ |
|
{ |
|
"userId": "user123", |
|
"sessions": 3, |
|
"lastActivity": "2024-01-01T12:30:00Z", |
|
"totalRequests": 245 |
|
} |
|
], |
|
"topUsers": [ |
|
{ "userId": "power-user", "requests": 1250 }, |
|
{ "userId": "regular-user", "requests": 890 } |
|
] |
|
} |
|
} |
|
``` |
|
|
|
### 使用模式分析 |
|
|
|
```bash |
|
# 获取使用模式分析 |
|
curl -X GET http: |
|
-H "Authorization: Bearer $TOKEN" \ |
|
-G -d "timeRange=30d" |
|
``` |
|
|
|
响应示例: |
|
|
|
```json |
|
{ |
|
"patterns": { |
|
"peakHours": [9, 10, 14, 15, 16], |
|
"peakDays": ["tuesday", "wednesday", "thursday"], |
|
"toolUsage": [ |
|
{"tool": "filesystem", "usage": 45%}, |
|
{"tool": "web-search", "usage": 30%}, |
|
{"tool": "database", "usage": 25%} |
|
], |
|
"userBehavior": { |
|
"avgSessionDuration": "45m", |
|
"avgRequestsPerSession": 23, |
|
"returnRate": "78%" |
|
} |
|
} |
|
} |
|
``` |
|
|
|
## 容量规划 |
|
|
|
### 资源预测 |
|
|
|
```json |
|
{ |
|
"capacityPlanning": { |
|
"currentCapacity": { |
|
"cpu": "42%", |
|
"memory": "68%", |
|
"network": "35%", |
|
"storage": "34%" |
|
}, |
|
"predictions": { |
|
"timeHorizon": "30d", |
|
"cpuForecast": [ |
|
{ "date": "2024-01-07", "usage": "48%" }, |
|
{ "date": "2024-01-14", "usage": "52%" }, |
|
{ "date": "2024-01-21", "usage": "56%" }, |
|
{ "date": "2024-01-28", "usage": "61%" } |
|
], |
|
"recommendations": [ |
|
"考虑在第3周增加CPU资源", |
|
"内存使用稳定,暂不需要扩容", |
|
"建议监控网络带宽趋势" |
|
] |
|
} |
|
} |
|
} |
|
``` |
|
|
|
### 自动扩缩容 |
|
|
|
```json |
|
{ |
|
"autoScaling": { |
|
"enabled": true, |
|
"policies": [ |
|
{ |
|
"metric": "cpu_usage", |
|
"scaleUp": { |
|
"threshold": "80%", |
|
"duration": "5m", |
|
"action": "add_instance" |
|
}, |
|
"scaleDown": { |
|
"threshold": "30%", |
|
"duration": "15m", |
|
"action": "remove_instance" |
|
} |
|
}, |
|
{ |
|
"metric": "request_queue_length", |
|
"scaleUp": { |
|
"threshold": 100, |
|
"duration": "2m", |
|
"action": "add_instance" |
|
} |
|
} |
|
] |
|
} |
|
} |
|
``` |
|
|
|
## 性能分析报告 |
|
|
|
### 生成性能报告 |
|
|
|
```bash |
|
# 生成周报 |
|
curl -X POST http: |
|
-H "Content-Type: application/json" \ |
|
-H "Authorization: Bearer $TOKEN" \ |
|
-d '{ |
|
"type": "performance", |
|
"timeRange": "7d", |
|
"format": "pdf", |
|
"sections": [ |
|
"overview", |
|
"response_times", |
|
"error_analysis", |
|
"resource_usage", |
|
"recommendations" |
|
], |
|
"recipients": ["manager@company.com"] |
|
}' |
|
``` |
|
|
|
### 自定义报告 |
|
|
|
```json |
|
{ |
|
"customReport": { |
|
"name": "monthly-summary", |
|
"schedule": "0 0 1 * *", |
|
"template": "executive-summary", |
|
"data": { |
|
"kpis": ["uptime", "avg_response_time", "total_requests", "error_rate", "user_satisfaction"], |
|
"comparisons": { |
|
"previousMonth": true, |
|
"yearOverYear": true |
|
} |
|
}, |
|
"distribution": { |
|
"email": ["executives@company.com"], |
|
"dashboard": true, |
|
"archive": true |
|
} |
|
} |
|
} |
|
``` |
|
|
|
## 监控 API |
|
|
|
### 指标查询 |
|
|
|
```bash |
|
# 查询自定义指标 |
|
curl -X POST http: |
|
-H "Content-Type: application/json" \ |
|
-H "Authorization: Bearer $TOKEN" \ |
|
-d '{ |
|
"query": "avg(response_time) by (server_id)", |
|
"timeRange": "1h", |
|
"step": "5m" |
|
}' |
|
``` |
|
|
|
### 事件跟踪 |
|
|
|
```bash |
|
# 记录自定义事件 |
|
curl -X POST http: |
|
-H "Content-Type: application/json" \ |
|
-H "Authorization: Bearer $TOKEN" \ |
|
-d '{ |
|
"type": "deployment", |
|
"title": "服务器更新", |
|
"description": "部署新版本 v2.1.0", |
|
"tags": ["deployment", "version-2.1.0"], |
|
"metadata": { |
|
"version": "2.1.0", |
|
"servers": ["server-1", "server-2"] |
|
} |
|
}' |
|
``` |
|
|
|
## 第三方集成 |
|
|
|
### Prometheus 集成 |
|
|
|
```yaml |
|
# prometheus.yml |
|
global: |
|
scrape_interval: 15s |
|
|
|
scrape_configs: |
|
- job_name: 'mcphub' |
|
static_configs: |
|
- targets: ['localhost:3000'] |
|
metrics_path: '/api/monitoring/prometheus' |
|
scrape_interval: 30s |
|
``` |
|
|
|
### Grafana 仪表板 |
|
|
|
```json |
|
{ |
|
"dashboard": { |
|
"title": "MCPHub 监控", |
|
"panels": [ |
|
{ |
|
"title": "请求率", |
|
"type": "graph", |
|
"targets": [ |
|
{ |
|
"expr": "rate(mcphub_requests_total[5m])", |
|
"legendFormat": "{{server_id}}" |
|
} |
|
] |
|
}, |
|
{ |
|
"title": "响应时间", |
|
"type": "graph", |
|
"targets": [ |
|
{ |
|
"expr": "histogram_quantile(0.95, mcphub_response_time_histogram)", |
|
"legendFormat": "95th percentile" |
|
} |
|
] |
|
} |
|
] |
|
} |
|
} |
|
``` |
|
|
|
### ELK Stack 集成 |
|
|
|
```json |
|
{ |
|
"logstash": { |
|
"input": { |
|
"beats": { |
|
"port": 5044 |
|
} |
|
}, |
|
"filter": { |
|
"if": "[fields][service] == 'mcphub'", |
|
"json": { |
|
"source": "message" |
|
}, |
|
"date": { |
|
"match": ["timestamp", "ISO8601"] |
|
} |
|
}, |
|
"output": { |
|
"elasticsearch": { |
|
"hosts": ["localhost:9200"], |
|
"index": "mcphub-logs-%{+YYYY.MM.dd}" |
|
} |
|
} |
|
} |
|
} |
|
``` |
|
|
|
## 故障排除 |
|
|
|
### 监控问题诊断 |
|
|
|
```bash |
|
# 检查监控服务状态 |
|
curl -X GET http: |
|
-H "Authorization: Bearer $TOKEN" |
|
|
|
# 验证指标收集 |
|
curl -X GET http: |
|
-H "Authorization: Bearer $TOKEN" |
|
``` |
|
|
|
### 性能问题分析 |
|
|
|
```bash |
|
# 性能瓶颈分析 |
|
curl -X GET http: |
|
-H "Authorization: Bearer $TOKEN" \ |
|
-G -d "timeRange=1h" |
|
``` |
|
|
|
### 常见监控问题 |
|
|
|
1. **指标丢失** |
|
|
|
- 检查收集器配置 |
|
- 验证网络连接 |
|
- 检查存储空间 |
|
|
|
2. **警报不触发** |
|
|
|
- 验证警报规则语法 |
|
- 检查通知配置 |
|
- 测试通知渠道 |
|
|
|
3. **仪表板加载缓慢** |
|
- 优化查询时间范围 |
|
- 增加数据聚合 |
|
- 检查数据库性能 |
|
|
|
## 最佳实践 |
|
|
|
1. **监控层次化**: 建立应用、基础设施和业务三层监控 |
|
2. **合理的警报**: 避免警报疲劳,设置合适的阈值 |
|
3. **数据保留**: 根据业务需求设置适当的数据保留期 |
|
4. **安全监控**: 监控安全相关的指标和事件 |
|
5. **持续优化**: 定期审查和优化监控配置 |
|
|
|
有关更多信息,请参阅 [智能路由](/zh/features/smart-routing) 和 [服务器管理](/zh/features/server-management) 文档。 |
|
|