dvilasuero (HF staff) committed
Commit 81ab2c6
1 Parent(s): 0b3bb4e

Update README.md

Files changed (1):
  1. README.md +68 -9

README.md CHANGED
@@ -7,12 +7,19 @@ tags:
 pipeline_tag: text-classification
 ---
 
-# dvilasuero/alpaca-gigo-detector-setfit-multilingual
-
-This is a [SetFit model](https://github.com/huggingface/setfit) that can be used for text classification. The model has been trained using an efficient few-shot learning technique that involves:
-
-1. Fine-tuning a [Sentence Transformer](https://www.sbert.net) with contrastive learning.
-2. Training a classification head with features from the fine-tuned Sentence Transformer.
 
 ## Usage
 
@@ -22,17 +29,69 @@ To use this model for inference, first install the SetFit library:
 python -m pip install setfit
 ```
 
-You can then run inference as follows:
 
 ```python
 from setfit import SetFitModel
 
-# Download from Hub and run inference
-model = SetFitModel.from_pretrained("dvilasuero/alpaca-gigo-detector-setfit-multilingual")
-# Run inference
-preds = model(["i loved the spiderman movie!", "pineapple on pizza is the worst 🤮"])
 ```
 
 ## BibTeX entry and citation info
 
 ```bibtex
 pipeline_tag: text-classification
 ---
 
+# 😵‍💫🦙 Alpaca HalluciHunter
 
+This is a cross-lingual [SetFit model](https://github.com/huggingface/setfit) for detecting potentially bad instructions in Alpaca (and likely other synthetically generated instruction datasets).
 
+The model has been fine-tuned with 1,000 labeled examples from the AlpacaCleaned dataset. It leverages the multilingual sentence transformer `paraphrase-multilingual-mpnet-base-v2`, following the findings of the SetFit paper (Section 6, multilingual experiments), where models trained in English performed well across languages.
+
+It's a binary classifier with two labels:
+
+- `ALL GOOD`: the given instruction, input, and output are correct,
+- `BAD INSTRUCTION`: there's an issue with the instruction and/or the input and output.
+
+This model can greatly speed up the validation of Alpaca datasets, flagging examples that need to be fixed or simply discarded.
 
 ## Usage
 
 python -m pip install setfit
 ```
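
The walkthrough below also imports `datasets`, `pandas`, and `argilla`, which are not pulled in by the command above; if they are missing from your environment, something like the following should cover them (package names inferred from the imports used below):

```shell
# Install the extra libraries used in the snippets below
python -m pip install datasets pandas argilla
```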
 
+Load your Alpaca dataset:
+
+```python
+import pandas as pd
+
+from datasets import Dataset
+
+# this can be a translation (e.g., Spanish Alpaca, Camoscio Italian Alpaca, etc.)
+dataset = pd.read_json("https://github.com/gururise/AlpacaDataCleaned/raw/main/alpaca_data_cleaned.json")
+
+dataset["id"] = list(range(len(dataset)))
+
+ds = Dataset.from_pandas(dataset)
+```
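
If you'd rather not fetch the JSON over the network (or want to smoke-test the pipeline first), the same record structure can be built from a local file; a stdlib-only sketch with toy records (the file name and contents are illustrative, not part of the dataset):

```python
import json
import os
import tempfile

# Toy records following the Alpaca schema: instruction / input / output
toy = [
    {"instruction": "Translate to French.", "input": "Hello", "output": "Bonjour"},
    {"instruction": "Add the numbers.", "input": "2 and 3", "output": "5"},
]
path = os.path.join(tempfile.mkdtemp(), "alpaca_subset.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(toy, f)

# Read it back, as you would a local copy of alpaca_data_cleaned.json
with open(path, encoding="utf-8") as f:
    records = json.load(f)

# Mirror the dataset["id"] column from the snippet above
for i, r in enumerate(records):
    r["id"] = i
```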
+
+Create a text field containing the instruction, input, and output to use for inference:
+
+```python
+def transform(r):
+    return {
+        "text": f"INSTRUCTION:\n{r['instruction']}\nINPUT:\n{r['input']}\nOUTPUT:\n{r['output']}\n"
+    }
+
+ds = ds.map(transform)
+```
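
As a quick sanity check, the formatting above yields one flat string per record; a minimal standalone run on a toy row (the row values are invented):

```python
def transform(r):
    # Same formatting as the map() snippet above: one flat text field per record
    return {
        "text": f"INSTRUCTION:\n{r['instruction']}\nINPUT:\n{r['input']}\nOUTPUT:\n{r['output']}\n"
    }

row = {"instruction": "Translate to French.", "input": "Hello", "output": "Bonjour"}
text = transform(row)["text"]
print(text)
```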
+
+Load the model:
 
 ```python
 from setfit import SetFitModel
 
+# Download from Hub
+model = SetFitModel.from_pretrained("argilla/alpaca-hallucihunter-multilingual")
+```
+
+Perform inference and add a prediction column to your dataset:
+
+```python
+labels = ["ALL GOOD", "BAD INSTRUCTION"]
+
+def get_predictions(texts):
+    probas = model.predict_proba(texts, as_numpy=True)
+    for pred in probas:
+        yield [{"label": label, "score": score} for label, score in zip(labels, pred)]
+
+ds = ds.map(lambda batch: {"prediction": list(get_predictions(batch["text"]))}, batched=True)
 ```
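
To triage only the flagged rows, you can keep the records whose top-scoring label is `BAD INSTRUCTION`; a stdlib sketch over toy scores (the prediction shape matches what `get_predictions` yields above, but the scores here are invented):

```python
# One [{"label": ..., "score": ...}, ...] list per record, as produced above
predictions = [
    [{"label": "ALL GOOD", "score": 0.92}, {"label": "BAD INSTRUCTION", "score": 0.08}],
    [{"label": "ALL GOOD", "score": 0.20}, {"label": "BAD INSTRUCTION", "score": 0.80}],
]

def top_label(pred):
    # The highest-scoring label wins
    return max(pred, key=lambda d: d["score"])["label"]

flagged_ids = [i for i, pred in enumerate(predictions) if top_label(pred) == "BAD INSTRUCTION"]
print(flagged_ids)
```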
 
+Load the data into Argilla for exploration and validation. You [need to launch Argilla](https://www.argilla.io/blog/launching-argilla-huggingface-hub):
+
+```python
+import argilla as rg
+
+# Replace api_url with your HF Spaces URL if using Spaces
+# Replace api_key if you configured a custom API key
+rg.init(
+    api_url="https://your-argilla-instance.hf.space",
+    api_key="team.apikey"
+)
+
+rg_dataset = rg.DatasetForTextClassification.from_datasets(ds)
+rg.log(records=rg_dataset, name="alpaca_to_clean")
+```
+
+## Examples
+
 ## BibTeX entry and citation info
 
 ```bibtex