argilla/alpaca-garbage-collector-multilingual

Text Classification

sentence-transformers

PyTorch

setfit

xlm-roberta

dvilasuero (HF staff) committed on Apr 3, 2023

Commit 81ab2c6 (1 parent: 0b3bb4e)

Update README.md

Files changed (1)

README.md +68 -9

@@ -7,12 +7,19 @@ tags:
 pipeline_tag: text-classification
 ---
-# dvilasuero/alpaca-gigo-detector-setfit-multilingual
-This is a [SetFit model](https://github.com/huggingface/setfit) that can be used for text classification. The model has been trained using an efficient few-shot learning technique that involves:
-1. Fine-tuning a [Sentence Transformer](https://www.sbert.net) with contrastive learning.
-2. Training a classification head with features from the fine-tuned Sentence Transformer.
 ## Usage
@@ -22,17 +29,69 @@ To use this model for inference, first install the SetFit library:
 python -m pip install setfit
 ```
-You can then run inference as follows:
 ```python
 from setfit import SetFitModel
-# Download from Hub and run inference
-model = SetFitModel.from_pretrained("dvilasuero/alpaca-gigo-detector-setfit-multilingual")
-# Run inference
-preds = model(["i loved the spiderman movie!", "pineapple on pizza is the worst 🤮"])
 ```
 ## BibTeX entry and citation info
 ```bibtex

 pipeline_tag: text-classification
 ---
+# 😵‍💫🦙 Alpaca HalluciHunter
+This is a cross-lingual [SetFit model](https://github.com/huggingface/setfit) that detects potentially bad instructions in Alpaca (and likely other synthetically generated instruction datasets).
+The model was fine-tuned on 1,000 labeled examples from the AlpacaCleaned dataset. It builds on the multilingual sentence transformer `paraphrase-multilingual-mpnet-base-v2`, inspired by the findings of the SetFit paper (Section 6, Multilingual experiments), where models trained in English performed well across languages.
+It's a binary classifier with two labels:
+- `ALL GOOD`: the given instruction, input, and output are correct.
+- `BAD INSTRUCTION`: there is an issue with the instruction, the input, and/or the output.
+This model can greatly speed up the validation of Alpaca Datasets, flagging examples that need to be fixed or simply discarded.
 ## Usage
 python -m pip install setfit
 ```
+Load your Alpaca Dataset:
+```python
+from datasets import Dataset, load_dataset
+import pandas as pd
+# this can be a translation (e.g., Spanish, Camoscio Italian Alpaca, etc.)
+dataset = pd.read_json("https://github.com/gururise/AlpacaDataCleaned/raw/main/alpaca_data_cleaned.json")
+dataset["id"] = [i for i in range(len(dataset))]
+ds = Dataset.from_pandas(dataset)
+```
+Create a text field containing the instruction, input and output to use for inference:
+```python
+def transform(r):
+  return {
+      "text": f"INSTRUCTION:\n{r['instruction']}\nINPUT:\n{r['input']}\nOUTPUT:\n{r['output']}\n"
+  }
+ds = ds.map(transform)
+```
+Load the model:
 ```python
 from setfit import SetFitModel
+# Download from Hub
+model = SetFitModel.from_pretrained("argilla/alpaca-hallucihunter-multilingual")
+```
+Run inference and add a prediction column to your dataset:
+```python
+labels = ["ALL GOOD", "BAD INSTRUCTION"]
+def get_predictions(texts):
+    probas = model.predict_proba(texts, as_numpy=True)
+    for pred in probas:
+        yield [{"label": label, "score": score} for label, score in zip(labels, pred)]
+ds = ds.map(lambda batch: {"prediction": list(get_predictions(batch["text"]))}, batched=True)
 ```
+Load the data into Argilla for exploration and validation. You [need to launch Argilla](https://www.argilla.io/blog/launching-argilla-huggingface-hub):
+```python
+import argilla as rg
+# Replace api_url with your HF Spaces URL if using Spaces
+# Replace api_key if you configured a custom API key
+rg.init(
+    api_url="https://your-argilla-instance.hf.space",
+    api_key="team.apikey"
+)
+rg_dataset = rg.DatasetForTextClassification.from_datasets(ds)
+rg.log(records=rg_dataset, name="alpaca_to_clean")
+```
+## Examples
 ## BibTeX entry and citation info
 ```bibtex