# Adding a New Dataset Parser

This guide explains how to add a new dataset parser to the llmdataparser library. The library is designed to make it easy to add support for new datasets while maintaining consistent interfaces and functionality.

## Step-by-Step Guide

### 1. Create a New Parser Class

Create a new file `your_dataset_parser.py` in the `llmdataparser` folder. Your parser should inherit from `HuggingFaceDatasetParser[T]` where T is your custom entry type.

```python
from dataclasses import dataclass

from llmdataparser.base_parser import (
    DatasetDescription,
    EvaluationMetric,
    HuggingFaceDatasetParser,
    HuggingFaceParseEntry,
)

@dataclass(frozen=True, kw_only=True, slots=True)
class YourDatasetParseEntry(HuggingFaceParseEntry):
    """Custom entry class for your dataset."""
    # Add any additional fields specific to your dataset
    custom_field: str

    @classmethod
    def create(cls, question: str, answer: str, raw_question: str,
               raw_answer: str, task_name: str, custom_field: str) -> "YourDatasetParseEntry":
        return cls(
            question=question,
            answer=answer,
            raw_question=raw_question,
            raw_answer=raw_answer,
            task_name=task_name,
            custom_field=custom_field
        )

class YourDatasetParser(HuggingFaceDatasetParser[YourDatasetParseEntry]):
    """Parser for your dataset."""

    # Required class variables
    _data_source = "huggingface/your-dataset"
    _default_task = "default"
    _task_names = ["task1", "task2", "task3"]
```

### 2. Implement Required Methods

Your parser needs to implement these key methods:

```python
from typing import Any

def process_entry(
    self,
    row: dict[str, Any],
    task_name: str | None = None,
    **kwargs: Any
) -> YourDatasetParseEntry:
    """Process a single dataset entry."""
    # Extract data from the row
    raw_question = row["question"]
    raw_answer = row["answer"]
    task = task_name or self._get_current_task(row)

    question = f"Question: {raw_question}\nAnswer:"

    return YourDatasetParseEntry.create(
        question=question,
        answer=raw_answer,
        raw_question=raw_question,
        raw_answer=raw_answer,
        task_name=task,
        custom_field=row["custom_field"]
    )

def get_dataset_description(self) -> DatasetDescription:
    """Returns description of your dataset."""
    return DatasetDescription.create(
        name="Your Dataset Name",
        purpose="Purpose of the dataset",
        source="Dataset source/URL",
        language="Dataset language",
        format="Data format (e.g., multiple choice, free text)",
        characteristics="Key characteristics of the dataset",
        citation="Dataset citation if available"
    )

def get_evaluation_metrics(self) -> list[EvaluationMetric]:
    """Returns recommended evaluation metrics."""
    return [
        EvaluationMetric.create(
            name="metric_name",
            type="metric_type",
            description="Metric description",
            implementation="implementation_details",
            primary=True
        )
    ]
```

### 3. Add Example Usage

Add example usage at the bottom of your parser file:

```python
if __name__ == "__main__":
    # Example usage
    parser = YourDatasetParser()
    parser.load()
    parser.parse()

    # Get parsed data
    parsed_data = parser.get_parsed_data

    # Print example entry
    if parsed_data:
        example = parsed_data[0]
        print("\nExample parsed entry:")
        print(f"Question: {example.raw_question}")
        print(f"Answer: {example.answer}")
```

### 4. Create Tests

Create a test file `tests/test_your_dataset_parser.py`:

```python
import pytest
from llmdataparser.your_dataset_parser import YourDatasetParser, YourDatasetParseEntry

def test_parser_initialization():
    parser = YourDatasetParser()
    assert parser._data_source == "huggingface/your-dataset"
    assert parser._default_task == "default"
    assert "task1" in parser._task_names

def test_process_entry():
    parser = YourDatasetParser()
    sample_row = {
        "question": "Sample question",
        "answer": "Sample answer",
        "custom_field": "Custom value"
    }

    entry = parser.process_entry(sample_row)
    assert isinstance(entry, YourDatasetParseEntry)
    assert entry.raw_question == "Sample question"
    assert entry.custom_field == "Custom value"
```

## Best Practices

1. **Type Safety**: Use type hints consistently and ensure your parser is properly typed.
1. **Documentation**: Add clear docstrings and comments explaining your parser's functionality.
1. **Error Handling**: Include appropriate error checking and validation.
1. **Testing**: Write comprehensive tests covering different scenarios.

## Examples

Look at existing parsers for reference:

- `mmlu_parser.py` for multiple-choice questions
- `gsm8k_parser.py` for math word problems
- `humaneval_parser.py` for code generation tasks

## Common Patterns

1. **Parse Entry Class**: Create a custom parse entry class if you need additional fields.
1. **Task Names**: Define all available tasks in `_task_names`.
1. **Process Entry**: Handle data extraction and formatting in `process_entry`.
1. **Dataset Description**: Provide comprehensive dataset information.
1. **Evaluation Metrics**: Define appropriate metrics for your dataset.

## Testing Your Parser

1. Run the example usage code to verify basic functionality
1. Run pytest to execute your test cases
1. Try different dataset splits and tasks
1. Verify the parsed output format
1. Check error handling with invalid inputs
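Step 5 (checking error handling with invalid inputs) is commonly written with `pytest.raises`. The sketch below uses a stand-in `process_entry` function, since the parser class in this guide is a placeholder:

```python
import pytest


def process_entry(row: dict) -> dict:
    """Hypothetical stand-in: reject rows missing required fields."""
    if "question" not in row or "answer" not in row:
        raise KeyError("row must contain 'question' and 'answer'")
    return row


def test_process_entry_rejects_invalid_input():
    # An incomplete row should fail loudly, not return a partial entry
    with pytest.raises(KeyError):
        process_entry({"question": "only a question"})


test_process_entry_rejects_invalid_input()
```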