---
title: HuggingFace EDA MCP Server
short_description: MCP server to explore and analyze HuggingFace datasets
emoji: π
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 6.0.0
app_file: src/app.py
pinned: false
license: apache-2.0
app_port: 7860
tags:
  - building-mcp-track-enterprise
  - building-mcp-track-consumer
---
# π HuggingFace EDA MCP Server

> π Submission for the Gradio MCP 1st Birthday Hackathon

An MCP server that gives AI assistants the ability to explore and analyze any of the 500,000+ datasets on the HuggingFace Hub.
Whether you're an ML engineer, data scientist, or researcher, dataset exploration is a critical part of the workflow. This server automates the tedious parts, such as fetching metadata, sampling data, and computing statistics, so you can focus on what matters: finding and understanding the right data for your task.
**Use cases:**

- **Dataset discovery**
  - Inspect metadata, schemas, and samples to evaluate datasets before use
  - Use it in conjunction with the HuggingFace MCP `search_dataset` tool for even more powerful dataset discovery
- **Exploratory data analysis**
  - Analyze feature distributions, detect missing values, and review statistics
  - Ask your AI assistant to build reports and visualizations
- **Content search**: Find specific examples in datasets using text search
## MCP Client Configuration

Connect your MCP client to the hosted server. A HuggingFace token is required to access private/gated datasets and to use the Dataset Viewer API.

**Hosted endpoint:** `https://mcp-1st-birthday-hf-eda-mcp.hf.space/gradio_api/mcp/`
### With URL

```json
{
  "mcpServers": {
    "hf-eda-mcp": {
      "url": "https://mcp-1st-birthday-hf-eda-mcp.hf.space/gradio_api/mcp/",
      "headers": {
        "hf-api-token": "<HF_TOKEN>"
      }
    }
  }
}
```
### With mcp-remote

```json
{
  "mcpServers": {
    "hf-eda-mcp": {
      "command": "npx",
      "args": [
        "mcp-remote",
        "https://mcp-1st-birthday-hf-eda-mcp.hf.space/gradio_api/mcp/",
        "--transport",
        "streamable-http",
        "--header",
        "hf-api-token: <HF_TOKEN>"
      ]
    }
  }
}
```
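
You can also talk to the server programmatically. The sketch below is a minimal example using the MCP Python SDK, assuming the `mcp` package is installed and `HF_TOKEN` is exported; it connects over streamable HTTP and lists the available tools. Treat it as a sketch rather than an official client for this Space.

```python
# Minimal connection sketch using the MCP Python SDK (`pip install mcp`).
# Assumes HF_TOKEN is exported; the header name matches the configs above.
import asyncio
import os

from mcp import ClientSession
from mcp.client.streamable_http import streamablehttp_client

SERVER_URL = "https://mcp-1st-birthday-hf-eda-mcp.hf.space/gradio_api/mcp/"


async def main() -> None:
    headers = {"hf-api-token": os.environ["HF_TOKEN"]}
    async with streamablehttp_client(SERVER_URL, headers=headers) as (read, write, _):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            print([tool.name for tool in tools.tools])


if __name__ == "__main__":
    asyncio.run(main())
```
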
## Available Tools

### `get_dataset_metadata`

Retrieve comprehensive metadata about a HuggingFace dataset.

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `dataset_id` | string | Yes | - | HuggingFace dataset identifier (e.g., `imdb`, `squad`, `glue`) |
| `config_name` | string | No | `None` | Configuration name for multi-config datasets |

**Returns:** Dataset size, features schema, splits info, configurations, download stats, tags, download size, description, and more.
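
For illustration, a call from an MCP client (such as the `session` in the connection sketch above) could look like the snippet below; `imdb` is just an example dataset.

```python
# Assumes an initialized MCP ClientSession named `session` (see the sketch above).
result = await session.call_tool(
    "get_dataset_metadata",
    {"dataset_id": "imdb"},  # config_name is optional and defaults to None
)
print(result.content)
```
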
### `get_dataset_sample`

Retrieve sample rows from a dataset for quick exploration.

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `dataset_id` | string | Yes | - | HuggingFace dataset identifier |
| `split` | string | No | `train` | Dataset split to sample from |
| `num_samples` | int | No | `10` | Number of samples to retrieve (max: 10,000) |
| `config_name` | string | No | `None` | Configuration name for multi-config datasets |
| `streaming` | bool | No | `True` | Use streaming mode for efficient loading |

**Returns:** Sample data rows with schema information and sampling metadata.
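
A sampling call sketch, again assuming an initialized `session`; the values below simply exercise the defaults and limits listed in the table.

```python
# Assumes an initialized MCP ClientSession named `session` (see the sketch above).
result = await session.call_tool(
    "get_dataset_sample",
    {
        "dataset_id": "imdb",
        "split": "test",      # defaults to "train"
        "num_samples": 25,    # capped at 10,000
        "streaming": True,    # default; avoids downloading the full dataset
    },
)
```
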
### `analyze_dataset_features`

Perform exploratory data analysis on dataset features with automatic optimization.

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `dataset_id` | string | Yes | - | HuggingFace dataset identifier |
| `split` | string | No | `train` | Dataset split to analyze |
| `sample_size` | int | No | `1000` | Number of samples for analysis (max: 50,000) |
| `config_name` | string | No | `None` | Configuration name for multi-config datasets |

**Returns:** Feature types, statistics (mean, std, min, max for numerical features), distributions, histograms, and missing value analysis. Supports numerical, categorical, text, image, and audio data types.
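
An example analysis call, assuming an initialized `session`; based on the data loading strategy described below, `sample_size` should only matter when the server falls back to sample-based analysis.

```python
# Assumes an initialized MCP ClientSession named `session` (see the sketch above).
result = await session.call_tool(
    "analyze_dataset_features",
    {
        "dataset_id": "imdb",
        "split": "train",
        "sample_size": 5000,  # capped at 50,000; used when falling back to sampling
    },
)
```
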
### `search_text_in_dataset`

Search for text in dataset columns using the Dataset Viewer API.

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `dataset_id` | string | Yes | - | Full dataset identifier (e.g., `stanfordnlp/imdb`) |
| `config_name` | string | Yes | - | Configuration name |
| `split` | string | Yes | - | Split name |
| `query` | string | Yes | - | Search query |
| `offset` | int | No | `0` | Pagination offset |
| `length` | int | No | `10` | Number of results to return |

**Returns:** Matching rows with highlighted search results. Only works on parquet datasets with text columns.
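
A search call sketch, assuming an initialized `session`; `plain_text` is used here as an example config name for `stanfordnlp/imdb`, and you can confirm it with `get_dataset_metadata` if unsure.

```python
# Assumes an initialized MCP ClientSession named `session` (see the sketch above).
result = await session.call_tool(
    "search_text_in_dataset",
    {
        "dataset_id": "stanfordnlp/imdb",
        "config_name": "plain_text",
        "split": "train",
        "query": "masterpiece",
        "offset": 0,
        "length": 5,
    },
)
```
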
## How It Works

### API Integrations

The server leverages multiple HuggingFace APIs (sketched briefly after the table):
| API | Used For |
|---|---|
| Hub API | Dataset metadata, repository info, download stats |
| Dataset Viewer API | Full dataset statistics, text search, parquet row access |
| datasets library | Streaming data loading, sample extraction |
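
To see what these integrations amount to at the API level, here is an illustrative sketch that calls the Hub API via `huggingface_hub` and the Dataset Viewer `/statistics` endpoint directly; it is not the server's code, and the `plain_text` config is only an example value.

```python
# Illustrative sketch of the underlying API calls (not the server's code).
import requests
from huggingface_hub import HfApi

# Hub API: repository-level metadata, tags, download stats.
info = HfApi().dataset_info("stanfordnlp/imdb")
print(info.id, info.downloads, info.tags[:5])

# Dataset Viewer API: pre-computed per-column statistics for parquet datasets.
stats = requests.get(
    "https://datasets-server.huggingface.co/statistics",
    params={"dataset": "stanfordnlp/imdb", "config": "plain_text", "split": "train"},
    timeout=30,
)
stats.raise_for_status()
print(list(stats.json().keys()))
```
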
### Data Loading Strategy

- **Streaming mode (default)**: Uses `datasets.load_dataset(..., streaming=True)` to avoid downloading entire datasets. Samples are taken from an iterator, minimizing the memory footprint (see the sketch below).
- **Statistics API**: For parquet datasets, `analyze_dataset_features` first attempts to fetch pre-computed statistics from the Dataset Viewer API (`/statistics` endpoint), providing full dataset coverage without sampling.
- **Fallback**: If statistics aren't available, analysis falls back to sample-based computation.
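
The streaming path is roughly equivalent to the following use of the `datasets` library; this is a simplified sketch of the approach, not the server's actual sampling code.

```python
# Simplified sketch of streaming-based sampling with the `datasets` library.
from itertools import islice

from datasets import load_dataset

# streaming=True returns an IterableDataset, so nothing is downloaded up front.
stream = load_dataset("stanfordnlp/imdb", split="train", streaming=True)

# Take the first N rows from the iterator; memory use stays bounded.
samples = list(islice(stream, 10))
print(len(samples), list(samples[0].keys()))
```
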
### Caching

Results are cached locally to reduce API calls:
| Cache Type | TTL | Location |
|---|---|---|
| Metadata | 1 hour | ~/.cache/hf_eda_mcp/metadata/ |
| Samples | 1 hour | ~/.cache/hf_eda_mcp/samples/ |
| Statistics | 1 hour | ~/.cache/hf_eda_mcp/statistics/ |
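
Conceptually the cache is a TTL-guarded file store keyed by the request. The sketch below shows the general pattern under that assumption; it is not the actual `dataset_service.py` implementation.

```python
# Illustrative TTL file-cache pattern (not the actual dataset_service.py code).
import hashlib
import json
import time
from pathlib import Path

CACHE_DIR = Path.home() / ".cache" / "hf_eda_mcp" / "metadata"
TTL_SECONDS = 3600  # 1 hour, matching the table above


def cached_fetch(key: str, fetch):
    """Return a cached JSON payload if it is still fresh, otherwise refetch it."""
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    path = CACHE_DIR / (hashlib.sha256(key.encode()).hexdigest() + ".json")
    if path.exists() and time.time() - path.stat().st_mtime < TTL_SECONDS:
        return json.loads(path.read_text())
    payload = fetch()
    path.write_text(json.dumps(payload))
    return payload
```
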
### Parquet Requirements

Some features require datasets with `builder_name="parquet"` (a quick way to check is sketched after this list):

- **Text search** (`search_text_in_dataset`): Only parquet datasets are searchable
- **Full statistics**: Pre-computed stats are only available for parquet datasets
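
One way to check up front whether a dataset supports these features is the Dataset Viewer `/is-valid` endpoint; the snippet below is a sketch that assumes the response exposes boolean `search` and `statistics` flags.

```python
# Sketch: ask the Dataset Viewer API which features a dataset supports.
import requests

resp = requests.get(
    "https://datasets-server.huggingface.co/is-valid",
    params={"dataset": "stanfordnlp/imdb"},
    timeout=10,
)
resp.raise_for_status()
caps = resp.json()
print("search:", caps.get("search"), "statistics:", caps.get("statistics"))
```
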
### Error Handling

- Automatic retry with exponential backoff for transient network errors (pattern sketched below)
- Graceful fallback from statistics API to sample-based analysis
- Descriptive error messages with suggestions for common issues
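
The retry behavior mentioned above follows the standard exponential-backoff pattern; the sketch below illustrates that pattern in isolation and is not the project's `error_handling.py`.

```python
# Generic retry-with-exponential-backoff pattern (illustrative only).
import random
import time


def with_retries(func, max_attempts: int = 3, base_delay: float = 1.0):
    """Call `func`, retrying transient failures with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except (ConnectionError, TimeoutError):
            if attempt == max_attempts:
                raise
            # Back off 1s, 2s, 4s, ... plus a little jitter.
            time.sleep(base_delay * 2 ** (attempt - 1) + random.random() * 0.1)
```
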
## Project Structure

```
src/hf_eda_mcp/
├── server.py                      # Gradio app with MCP server setup
├── config.py                      # Server configuration (env vars, defaults)
├── validation.py                  # Input validation for all tools
├── error_handling.py              # Retry logic, error formatting
├── tools/                         # MCP tools (exposed via Gradio)
│   ├── metadata.py                # get_dataset_metadata
│   ├── sampling.py                # get_dataset_sample
│   ├── analysis.py                # analyze_dataset_features
│   └── search.py                  # search_text_in_dataset
├── services/                      # Business logic layer
│   └── dataset_service.py         # Caching, data loading, statistics
└── integrations/
    ├── dataset_viewer_adapter.py  # Dataset Viewer API client
    └── hf_client.py               # HuggingFace Hub API wrapper (HfApi)
```
## Local Development

### Setup

```bash
# Install pdm
brew install pdm

# Clone the repository
git clone https://huggingface.co/spaces/MCP-1st-Birthday/hf-eda-mcp
cd hf-eda-mcp

# Install dependencies
pdm install

# Set your HuggingFace token
export HF_TOKEN=hf_xxx
# or create a .env file with HF_TOKEN=hf_xxx (see config.example.env)

# Run the server
pdm run hf-eda-mcp
```

The server starts at `http://localhost:7860` with the MCP endpoint at `/gradio_api/mcp/`.
## License

Apache License 2.0