---
title: HuggingFace EDA MCP Server
short_description: MCP server to explore and analyze HuggingFace datasets
emoji: π
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 6.0.0
app_file: src/app.py
pinned: false
license: apache-2.0
app_port: 7860
tags:
  - building-mcp-track-enterprise
  - building-mcp-track-consumer
---
# π HuggingFace EDA MCP Server

> π Submission for the Gradio MCP 1st Birthday Hackathon

An MCP server that gives AI assistants the ability to explore and analyze any of the 500,000+ datasets on the HuggingFace Hub.
Whether you're an ML engineer, data scientist, or researcher, dataset exploration is a critical part of the workflow. This server automates the tedious parts, such as fetching metadata, sampling data, and computing statistics, so you can focus on what matters: finding and understanding the right data for your task.
**Use cases:**

- **Dataset discovery**
  - Inspect metadata, schemas, and samples to evaluate datasets before use
  - Use it in conjunction with the HuggingFace MCP `search_dataset` tool for even more powerful dataset discovery
- **Exploratory data analysis**
  - Analyze feature distributions, detect missing values, and review statistics
  - Ask your AI assistant to build reports and visualizations
- **Content search**: Find specific examples in datasets using text search
## MCP Client Configuration

Connect your MCP client to the hosted server. A HuggingFace token is required to access private/gated datasets and to use the Dataset Viewer API.

**Hosted endpoint:** `https://mcp-1st-birthday-hf-eda-mcp.hf.space/gradio_api/mcp/`
### With URL

```json
{
  "mcpServers": {
    "hf-eda-mcp": {
      "url": "https://mcp-1st-birthday-hf-eda-mcp.hf.space/gradio_api/mcp/",
      "headers": {
        "hf-api-token": "<HF_TOKEN>"
      }
    }
  }
}
```
### With mcp-remote

```json
{
  "mcpServers": {
    "hf-eda-mcp": {
      "command": "npx",
      "args": [
        "mcp-remote",
        "https://mcp-1st-birthday-hf-eda-mcp.hf.space/gradio_api/mcp/",
        "--transport",
        "streamable-http",
        "--header",
        "hf-api-token: <HF_TOKEN>"
      ]
    }
  }
}
```
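
You can also talk to the server programmatically. The sketch below is a minimal example using the MCP Python SDK, assuming the `mcp` package is installed and `HF_TOKEN` is exported; it connects over streamable HTTP and lists the available tools. Treat it as a sketch rather than an official client for this Space.

```python
# Minimal connection sketch using the MCP Python SDK (`pip install mcp`).
# Assumes HF_TOKEN is exported; the header name matches the configs above.
import asyncio
import os

from mcp import ClientSession
from mcp.client.streamable_http import streamablehttp_client

SERVER_URL = "https://mcp-1st-birthday-hf-eda-mcp.hf.space/gradio_api/mcp/"


async def main() -> None:
    headers = {"hf-api-token": os.environ["HF_TOKEN"]}
    async with streamablehttp_client(SERVER_URL, headers=headers) as (read, write, _):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            print([tool.name for tool in tools.tools])


if __name__ == "__main__":
    asyncio.run(main())
```
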
## Available Tools

### `get_dataset_metadata`

Retrieve comprehensive metadata about a HuggingFace dataset.

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `dataset_id` | string | Yes | - | HuggingFace dataset identifier (e.g., `imdb`, `squad`, `glue`) |
| `config_name` | string | No | `None` | Configuration name for multi-config datasets |

**Returns:** Dataset size, features schema, splits info, configurations, download stats, tags, download size, description, and more.
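
For illustration, a call from an MCP client (such as the `session` in the connection sketch above) could look like the snippet below; `imdb` is just an example dataset.

```python
# Assumes an initialized MCP ClientSession named `session` (see the sketch above).
result = await session.call_tool(
    "get_dataset_metadata",
    {"dataset_id": "imdb"},  # config_name is optional and defaults to None
)
print(result.content)
```
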
### `get_dataset_sample`

Retrieve sample rows from a dataset for quick exploration.

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `dataset_id` | string | Yes | - | HuggingFace dataset identifier |
| `split` | string | No | `train` | Dataset split to sample from |
| `num_samples` | int | No | `10` | Number of samples to retrieve (max: 10,000) |
| `config_name` | string | No | `None` | Configuration name for multi-config datasets |
| `streaming` | bool | No | `True` | Use streaming mode for efficient loading |

**Returns:** Sample data rows with schema information and sampling metadata.
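
A sampling call sketch, again assuming an initialized `session`; the values below simply exercise the defaults and limits listed in the table.

```python
# Assumes an initialized MCP ClientSession named `session` (see the sketch above).
result = await session.call_tool(
    "get_dataset_sample",
    {
        "dataset_id": "imdb",
        "split": "test",      # defaults to "train"
        "num_samples": 25,    # capped at 10,000
        "streaming": True,    # default; avoids downloading the full dataset
    },
)
```
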
### `analyze_dataset_features`

Perform exploratory data analysis on dataset features with automatic optimization.

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `dataset_id` | string | Yes | - | HuggingFace dataset identifier |
| `split` | string | No | `train` | Dataset split to analyze |
| `sample_size` | int | No | `1000` | Number of samples for analysis (max: 50,000) |
| `config_name` | string | No | `None` | Configuration name for multi-config datasets |

**Returns:** Feature types, statistics (mean, std, min, max for numerical features), distributions, histograms, and missing value analysis. Supports numerical, categorical, text, image, and audio data types.
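
An example analysis call, assuming an initialized `session`; based on the data loading strategy described below, `sample_size` should only matter when the server falls back to sample-based analysis.

```python
# Assumes an initialized MCP ClientSession named `session` (see the sketch above).
result = await session.call_tool(
    "analyze_dataset_features",
    {
        "dataset_id": "imdb",
        "split": "train",
        "sample_size": 5000,  # capped at 50,000; used when falling back to sampling
    },
)
```
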
### `search_text_in_dataset`

Search for text in dataset columns using the Dataset Viewer API.

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `dataset_id` | string | Yes | - | Full dataset identifier (e.g., `stanfordnlp/imdb`) |
| `config_name` | string | Yes | - | Configuration name |
| `split` | string | Yes | - | Split name |
| `query` | string | Yes | - | Search query |
| `offset` | int | No | `0` | Pagination offset |
| `length` | int | No | `10` | Number of results to return |

**Returns:** Matching rows with highlighted search results. Only works on parquet datasets with text columns.
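
A search call sketch, assuming an initialized `session`; `plain_text` is used here as an example config name for `stanfordnlp/imdb`, and you can confirm it with `get_dataset_metadata` if unsure.

```python
# Assumes an initialized MCP ClientSession named `session` (see the sketch above).
result = await session.call_tool(
    "search_text_in_dataset",
    {
        "dataset_id": "stanfordnlp/imdb",
        "config_name": "plain_text",
        "split": "train",
        "query": "masterpiece",
        "offset": 0,
        "length": 5,
    },
)
```
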
## How It Works

### API Integrations

The server leverages multiple HuggingFace APIs (sketched briefly after the table):
| API | Used For |
|---|---|
| Hub API | Dataset metadata, repository info, download stats |
| Dataset Viewer API | Full dataset statistics, text search, parquet row access |
| datasets library | Streaming data loading, sample extraction |
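
To see what these integrations amount to at the API level, here is an illustrative sketch that calls the Hub API via `huggingface_hub` and the Dataset Viewer `/statistics` endpoint directly; it is not the server's code, and the `plain_text` config is only an example value.

```python
# Illustrative sketch of the underlying API calls (not the server's code).
import requests
from huggingface_hub import HfApi

# Hub API: repository-level metadata, tags, download stats.
info = HfApi().dataset_info("stanfordnlp/imdb")
print(info.id, info.downloads, info.tags[:5])

# Dataset Viewer API: pre-computed per-column statistics for parquet datasets.
stats = requests.get(
    "https://datasets-server.huggingface.co/statistics",
    params={"dataset": "stanfordnlp/imdb", "config": "plain_text", "split": "train"},
    timeout=30,
)
stats.raise_for_status()
print(list(stats.json().keys()))
```
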
### Data Loading Strategy

- **Streaming mode (default)**: Uses `datasets.load_dataset(..., streaming=True)` to avoid downloading entire datasets. Samples are taken from an iterator, minimizing the memory footprint (see the sketch below).
- **Statistics API**: For parquet datasets, `analyze_dataset_features` first attempts to fetch pre-computed statistics from the Dataset Viewer API (`/statistics` endpoint), providing full dataset coverage without sampling.
- **Fallback**: If statistics aren't available, analysis falls back to sample-based computation.
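
The streaming path is roughly equivalent to the following use of the `datasets` library; this is a simplified sketch of the approach, not the server's actual sampling code.

```python
# Simplified sketch of streaming-based sampling with the `datasets` library.
from itertools import islice

from datasets import load_dataset

# streaming=True returns an IterableDataset, so nothing is downloaded up front.
stream = load_dataset("stanfordnlp/imdb", split="train", streaming=True)

# Take the first N rows from the iterator; memory use stays bounded.
samples = list(islice(stream, 10))
print(len(samples), list(samples[0].keys()))
```
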
### Caching

Results are cached locally to reduce API calls:
| Cache Type | TTL | Location |
|---|---|---|
| Metadata | 1 hour | ~/.cache/hf_eda_mcp/metadata/ |
| Samples | 1 hour | ~/.cache/hf_eda_mcp/samples/ |
| Statistics | 1 hour | ~/.cache/hf_eda_mcp/statistics/ |
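
Conceptually the cache is a TTL-guarded file store keyed by the request. The sketch below shows the general pattern under that assumption; it is not the actual `dataset_service.py` implementation.

```python
# Illustrative TTL file-cache pattern (not the actual dataset_service.py code).
import hashlib
import json
import time
from pathlib import Path

CACHE_DIR = Path.home() / ".cache" / "hf_eda_mcp" / "metadata"
TTL_SECONDS = 3600  # 1 hour, matching the table above


def cached_fetch(key: str, fetch):
    """Return a cached JSON payload if it is still fresh, otherwise refetch it."""
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    path = CACHE_DIR / (hashlib.sha256(key.encode()).hexdigest() + ".json")
    if path.exists() and time.time() - path.stat().st_mtime < TTL_SECONDS:
        return json.loads(path.read_text())
    payload = fetch()
    path.write_text(json.dumps(payload))
    return payload
```
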
### Parquet Requirements

Some features require datasets with `builder_name="parquet"` (a quick way to check is sketched after this list):

- **Text search** (`search_text_in_dataset`): Only parquet datasets are searchable
- **Full statistics**: Pre-computed stats are only available for parquet datasets
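
One way to check up front whether a dataset supports these features is the Dataset Viewer `/is-valid` endpoint; the snippet below is a sketch that assumes the response exposes boolean `search` and `statistics` flags.

```python
# Sketch: ask the Dataset Viewer API which features a dataset supports.
import requests

resp = requests.get(
    "https://datasets-server.huggingface.co/is-valid",
    params={"dataset": "stanfordnlp/imdb"},
    timeout=10,
)
resp.raise_for_status()
caps = resp.json()
print("search:", caps.get("search"), "statistics:", caps.get("statistics"))
```
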
### Error Handling

- Automatic retry with exponential backoff for transient network errors (pattern sketched below)
- Graceful fallback from statistics API to sample-based analysis
- Descriptive error messages with suggestions for common issues
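
The retry behavior mentioned above follows the standard exponential-backoff pattern; the sketch below illustrates that pattern in isolation and is not the project's `error_handling.py`.

```python
# Generic retry-with-exponential-backoff pattern (illustrative only).
import random
import time


def with_retries(func, max_attempts: int = 3, base_delay: float = 1.0):
    """Call `func`, retrying transient failures with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except (ConnectionError, TimeoutError):
            if attempt == max_attempts:
                raise
            # Back off 1s, 2s, 4s, ... plus a little jitter.
            time.sleep(base_delay * 2 ** (attempt - 1) + random.random() * 0.1)
```
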
## Project Structure

```
src/hf_eda_mcp/
├── server.py                      # Gradio app with MCP server setup
├── config.py                      # Server configuration (env vars, defaults)
├── validation.py                  # Input validation for all tools
├── error_handling.py              # Retry logic, error formatting
├── tools/                         # MCP tools (exposed via Gradio)
│   ├── metadata.py                # get_dataset_metadata
│   ├── sampling.py                # get_dataset_sample
│   ├── analysis.py                # analyze_dataset_features
│   └── search.py                  # search_text_in_dataset
├── services/                      # Business logic layer
│   └── dataset_service.py         # Caching, data loading, statistics
└── integrations/
    ├── dataset_viewer_adapter.py  # Dataset Viewer API client
    └── hf_client.py               # HuggingFace Hub API wrapper (HfApi)
```
## Local Development

### Setup

```bash
# Install pdm
brew install pdm

# Clone the repository
git clone https://huggingface.co/spaces/MCP-1st-Birthday/hf-eda-mcp
cd hf-eda-mcp

# Install dependencies
pdm install

# Set your HuggingFace token
export HF_TOKEN=hf_xxx
# or create a .env file with HF_TOKEN=hf_xxx (see config.example.env)

# Run the server
pdm run hf-eda-mcp
```

The server starts at `http://localhost:7860` with the MCP endpoint at `/gradio_api/mcp/`.
## License

Apache License 2.0