# Adding a New Dataset Parser

This guide explains how to add a new dataset parser to the llmdataparser library. The library is designed to make it easy to add support for new datasets while maintaining consistent interfaces and functionality.
## Step-by-Step Guide
### 1. Create a New Parser Class

Create a new file `your_dataset_parser.py` in the `llmdataparser` folder. Your parser should inherit from `HuggingFaceDatasetParser[T]`, where `T` is your custom entry type.
```python
from dataclasses import dataclass

from llmdataparser.base_parser import (
    DatasetDescription,
    EvaluationMetric,
    HuggingFaceDatasetParser,
    HuggingFaceParseEntry,
)


@dataclass(frozen=True, kw_only=True, slots=True)
class YourDatasetParseEntry(HuggingFaceParseEntry):
    """Custom entry class for your dataset."""

    # Add any additional fields specific to your dataset
    custom_field: str

    @classmethod
    def create(
        cls,
        question: str,
        answer: str,
        raw_question: str,
        raw_answer: str,
        task_name: str,
        custom_field: str,
    ) -> "YourDatasetParseEntry":
        return cls(
            question=question,
            answer=answer,
            raw_question=raw_question,
            raw_answer=raw_answer,
            task_name=task_name,
            custom_field=custom_field,
        )


class YourDatasetParser(HuggingFaceDatasetParser[YourDatasetParseEntry]):
    """Parser for your dataset."""

    # Required class variables
    _data_source = "huggingface/your-dataset"
    _default_task = "default"
    _task_names = ["task1", "task2", "task3"]
```
### 2. Implement Required Methods

Your parser needs to implement these key methods:
```python
def process_entry(
    self,
    row: dict[str, Any],
    task_name: str | None = None,
    **kwargs: Any,
) -> YourDatasetParseEntry:
    """Process a single dataset entry."""
    # Extract data from the row
    raw_question = row["question"]
    raw_answer = row["answer"]
    task = task_name or self._get_current_task(row)

    question = f"Question: {raw_question}\nAnswer:"

    return YourDatasetParseEntry.create(
        question=question,
        answer=raw_answer,
        raw_question=raw_question,
        raw_answer=raw_answer,
        task_name=task,
        custom_field=row["custom_field"],
    )


def get_dataset_description(self) -> DatasetDescription:
    """Returns description of your dataset."""
    return DatasetDescription.create(
        name="Your Dataset Name",
        purpose="Purpose of the dataset",
        source="Dataset source/URL",
        language="Dataset language",
        format="Data format (e.g., multiple choice, free text)",
        characteristics="Key characteristics of the dataset",
        citation="Dataset citation if available",
    )


def get_evaluation_metrics(self) -> list[EvaluationMetric]:
    """Returns recommended evaluation metrics."""
    return [
        EvaluationMetric.create(
            name="metric_name",
            type="metric_type",
            description="Metric description",
            implementation="implementation_details",
            primary=True,
        )
    ]
```
### 3. Add Example Usage

Add example usage at the bottom of your parser file:
```python
if __name__ == "__main__":
    # Example usage
    parser = YourDatasetParser()
    parser.load()
    parser.parse()

    # Get parsed data
    parsed_data = parser.get_parsed_data

    # Print example entry
    if parsed_data:
        example = parsed_data[0]
        print("\nExample parsed entry:")
        print(f"Question: {example.raw_question}")
        print(f"Answer: {example.answer}")
```
### 4. Create Tests

Create a test file `tests/test_your_dataset_parser.py`:
```python
import pytest

from llmdataparser.your_dataset_parser import (
    YourDatasetParseEntry,
    YourDatasetParser,
)


def test_parser_initialization():
    parser = YourDatasetParser()
    assert parser._data_source == "huggingface/your-dataset"
    assert parser._default_task == "default"
    assert "task1" in parser._task_names


def test_process_entry():
    parser = YourDatasetParser()
    sample_row = {
        "question": "Sample question",
        "answer": "Sample answer",
        "custom_field": "Custom value",
    }

    entry = parser.process_entry(sample_row)
    assert isinstance(entry, YourDatasetParseEntry)
    assert entry.raw_question == "Sample question"
    assert entry.custom_field == "Custom value"
```
## Best Practices
- Type Safety: Use type hints consistently and ensure your parser is properly typed.
- Documentation: Add clear docstrings and comments explaining your parser's functionality.
- Error Handling: Include appropriate error checking and validation.
- Testing: Write comprehensive tests covering different scenarios.
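The error-handling practice above can be sketched as a small validation helper that a parser calls before extracting values, so malformed rows fail with a clear message instead of an opaque `KeyError` deep in formatting code. `validate_row` and the `REQUIRED_FIELDS` list are hypothetical, not part of llmdataparser:

```python
from typing import Any

# Hypothetical list of fields your dataset's rows must contain
REQUIRED_FIELDS = ("question", "answer", "custom_field")


def validate_row(row: dict[str, Any]) -> None:
    """Raise a descriptive error if a dataset row is malformed."""
    missing = [field for field in REQUIRED_FIELDS if field not in row]
    if missing:
        raise KeyError(f"Row is missing required fields: {missing}")
    if not str(row["question"]).strip():
        raise ValueError("Row has an empty 'question' field")
```

A `process_entry` implementation could call `validate_row(row)` as its first statement, keeping the validation logic in one place and out of the formatting code.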
## Examples

Look at existing parsers for reference:

- `mmlu_parser.py` for multiple-choice questions
- `gsm8k_parser.py` for math word problems
- `humaneval_parser.py` for code generation tasks
## Common Patterns

- Parse Entry Class: Create a custom parse entry class if you need additional fields.
- Task Names: Define all available tasks in `_task_names`.
- Process Entry: Handle data extraction and formatting in `process_entry`.
- Dataset Description: Provide comprehensive dataset information.
- Evaluation Metrics: Define appropriate metrics for your dataset.
## Testing Your Parser
- Run the example usage code to verify basic functionality
- Run pytest to execute your test cases
- Try different dataset splits and tasks
- Verify the parsed output format
- Check error handling with invalid inputs
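The last check above, error handling with invalid inputs, can be smoke-tested even without the full library. This sketch uses a hypothetical stand-in for `process_entry` (plain dict indexing, so a missing key raises `KeyError`); in a real test file you would instead assert with `pytest.raises(KeyError)` against your actual parser:

```python
def process_entry(row: dict) -> dict:
    """Hypothetical stand-in: dict indexing raises KeyError on missing fields."""
    return {
        "question": f"Question: {row['question']}\nAnswer:",
        "answer": row["answer"],
    }


def check_rejects(row: dict) -> bool:
    """Return True if process_entry raises KeyError for this (invalid) row."""
    try:
        process_entry(row)
        return False
    except KeyError:
        return True


# A row with no "answer" key should be rejected
print(check_rejects({"question": "What is 2 + 2?"}))  # → True
```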