Commit
•
9f2d78b
1
Parent(s):
f057b32
Update README.md
README.md
CHANGED
@@ -10,10 +10,13 @@ datasets:
|
|
10 |
---
|
11 |
|
12 |
# 😵‍💫🦙 Alpaca HalluciHunter
|
13 |
-
<img src="front-image.png" alt="Alpaca Cleaned" width="200" height="150" >
|
14 |
|
|
|
15 |
|
16 |
-
|
|
|
|
|
|
|
17 |
|
18 |
The model has been fine-tuned with 1,000 labeled examples from the AlpacaCleaned dataset. It leverages a multilingual sentence transformer `paraphrase-multilingual-mpnet-base-v2`, inspired by the findings from the SetFit paper (Section 6. Multilingual experiments.), where they trained models in English that performed well across languages.
|
19 |
|
@@ -23,8 +26,6 @@ It's a binary classifier with two labels:
|
|
23 |
- `BAD INSTRUCTION`, there's an issue with the instruction, and/or input and output.
|
24 |
|
25 |
|
26 |
-
This model can greatly speed up the validation of Alpaca Datasets, flagging examples that need to be fixed or simply discarded.
|
27 |
-
|
28 |
## Usage
|
29 |
|
30 |
To use this model for inference, first install the SetFit library:
|
@@ -79,7 +80,7 @@ def get_predictions(texts):
|
|
79 |
ds = ds.map(lambda batch: {"prediction": list(get_predictions(batch["text"]))}, batched=True)
|
80 |
```
|
81 |
|
82 |
-
Load the data into Argilla for exploration and validation.
|
83 |
```python
|
84 |
# Replace api_url with the url to your HF Spaces URL if using Spaces
|
85 |
# Replace api_key if you configured a custom API key
|
@@ -92,8 +93,37 @@ rg_dataset = rg.DatasetForTextClassification().from_datasets(ds)
|
|
92 |
rg.log(records=rg_dataset, name="alpaca_to_clean")
|
93 |
```
|
94 |
|
95 |
## Examples
|
96 |
|
97 |
|
98 |
|
99 |
## BibTeX entry and citation info
|
|
|
10 |
---
|
11 |
|
12 |
# 😵‍💫🦙 Alpaca HalluciHunter
|
|
|
13 |
|
14 |
+
This is a cross-lingual [SetFit model](https://github.com/huggingface/setfit) that detects potentially bad instructions in Alpaca datasets. It can greatly speed up the validation of Alpaca datasets by flagging examples that need to be fixed or simply discarded.
|
15 |
|
16 |
+
|
17 |
+
<div style="text-align:center;width:50%">
|
18 |
+
<img src="https://huggingface.co/argilla/alpaca-hallucihunter-multilingual/resolve/main/front-image.png" alt="Alpaca Cleaned">
|
19 |
+
</div>
|
20 |
|
21 |
The model has been fine-tuned with 1,000 labeled examples from the AlpacaCleaned dataset. It leverages a multilingual sentence transformer `paraphrase-multilingual-mpnet-base-v2`, inspired by the findings from the SetFit paper (Section 6. Multilingual experiments.), where they trained models in English that performed well across languages.
|
22 |
|
|
|
26 |
- `BAD INSTRUCTION`, there's an issue with the instruction, and/or input and output.
|
27 |
|
28 |
|
|
|
|
|
29 |
## Usage
|
30 |
|
31 |
To use this model for inference, first install the SetFit library:
|
|
|
80 |
ds = ds.map(lambda batch: {"prediction": list(get_predictions(batch["text"]))}, batched=True)
|
81 |
```
|
82 |
|
83 |
+
Load the data into Argilla for exploration and validation. First, you [need to launch Argilla](https://www.argilla.io/blog/launching-argilla-huggingface-hub). Then run:
|
84 |
```python
|
85 |
# Replace api_url with the url to your HF Spaces URL if using Spaces
|
86 |
# Replace api_key if you configured a custom API key
|
|
|
93 |
rg.log(records=rg_dataset, name="alpaca_to_clean")
|
94 |
```
|
95 |
|
96 |
+
## Live demo
|
97 |
+
|
98 |
+
You can explore the dataset using this Space (credentials: `argilla` / `1234`):
|
99 |
+
|
100 |
+
[https://huggingface.co/spaces/argilla/alpaca-hallucihunter](https://huggingface.co/spaces/argilla/alpaca-hallucihunter)
|
101 |
+
|
102 |
## Examples
|
103 |
|
104 |
+
This model has been tested with English, German, and Spanish. This approach will be used by ongoing efforts for improving the quality of Alpaca-based datasets, and updates will be reflected here.
|
105 |
+
|
106 |
+
Here are some of the highest-scoring examples of `BAD INSTRUCTION`.
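As a rough illustration of how such top-scoring examples could be pulled out of the predictions, here is a minimal sketch. It assumes each record carries a `prediction` list of `{"label": ..., "score": ...}` entries; that shape, the helper names `bad_instruction_score` and `top_bad_instructions`, the placeholder label `OTHER`, and the toy records are all hypothetical and not taken from this repository.

```python
from typing import Dict, List

BAD_LABEL = "BAD INSTRUCTION"  # label name as given in this README


def bad_instruction_score(prediction: List[Dict[str, object]]) -> float:
    """Return the score assigned to BAD_LABEL, or 0.0 if the label is absent."""
    for entry in prediction:
        if entry["label"] == BAD_LABEL:
            return float(entry["score"])
    return 0.0


def top_bad_instructions(records: List[dict], k: int = 3) -> List[dict]:
    """Return the k records with the highest BAD_LABEL score, best first."""
    ranked = sorted(
        records,
        key=lambda record: bad_instruction_score(record["prediction"]),
        reverse=True,
    )
    return ranked[:k]


# Toy records with made-up scores, only to show the expected input shape.
# "OTHER" is a placeholder for the second label, not the model's real label name.
records = [
    {"text": "a", "prediction": [{"label": BAD_LABEL, "score": 0.10},
                                 {"label": "OTHER", "score": 0.90}]},
    {"text": "b", "prediction": [{"label": BAD_LABEL, "score": 0.95},
                                 {"label": "OTHER", "score": 0.05}]},
    {"text": "c", "prediction": [{"label": BAD_LABEL, "score": 0.60},
                                 {"label": "OTHER", "score": 0.40}]},
]

for record in top_bad_instructions(records, k=2):
    print(record["text"], bad_instruction_score(record["prediction"]))
```

If your predictions are stored differently (for example, as a single label per record), only `bad_instruction_score` would need to change.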
|
107 |
+
|
108 |
+
|
109 |
+
### English
|
110 |
+
|
111 |
+
<div style="text-align:center;width:50%">
|
112 |
+
<img src="https://huggingface.co/argilla/alpaca-hallucihunter-multilingual/resolve/main/front-image.png" alt="Alpaca Cleaned">
|
113 |
+
</div>
|
114 |
+
|
115 |
+
### German
|
116 |
+
|
117 |
+
<div style="text-align:center;width:50%">
|
118 |
+
<img src="https://huggingface.co/argilla/alpaca-hallucihunter-multilingual/resolve/main/german-alpaca.png" alt="Alpaca Cleaned">
|
119 |
+
</div>
|
120 |
+
|
121 |
+
### Spanish
|
122 |
+
<div style="text-align:center;width:50%">
|
123 |
+
<img src="https://huggingface.co/argilla/alpaca-hallucihunter-multilingual/resolve/main/spanish-alpaca.png" alt="Alpaca Cleaned">
|
124 |
+
</div>
|
125 |
+
|
126 |
+
|
127 |
|
128 |
|
129 |
## BibTeX entry and citation info
|