---
tags:
- setfit
- sentence-transformers
- text-classification
pipeline_tag: text-classification
datasets:
- mserras/alpaca-es-hackaton
- somosnlp/somos-clean-alpaca-es
language:
- es
---

# mserras/setfit-alpaca-es-unprocessable-sample-detection

This is a [SetFit model](https://github.com/huggingface/setfit) that can be used for filtering the Alpaca ES instruction dataset. 

The base model is the multilingual [paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) model from Sentence Transformers.

This model was developed during the 2023 Hackathon organized by [SomosNLP](https://somosnlp.org/) ([HF Card](https://huggingface.co/somosnlp)), using GPUs provided by [Q Blocks](https://www.qblocks.cloud).
 
This model was trained on "unprocessable" samples of the translated [Clean Alpaca Es](https://huggingface.co/datasets/somosnlp/somos-clean-alpaca-es) dataset, annotated in
the HF [Argilla](https://argilla.io) space https://huggingface.co/spaces/mserras/somos-alpaca-es.

To this end, a custom tag is proposed: "unprocessable". It marks instruction/input/output triplets that require processing images, fetching information from the
open web, or similar tasks that the LLM has no capability to perform, and which therefore end in hallucinations or strange outputs.

As this model was trained on samples of Alpaca, which were generated using OpenAI's models, this model **cannot be used for commercial purposes or to compete against OpenAI**.

## Usage

To use this model for inference, first install the SetFit library:

```bash
python -m pip install setfit
```

You can then run inference as follows:

```python
import argilla as rg
from setfit import SetFitModel

# Download the model from the Hub
model = SetFitModel.from_pretrained("mserras/setfit-alpaca-es-unprocessable-sample-detection")


def instruct_fields_to_text(field_instruction: str, field_input: str, field_output: str) -> str:
    """Given the instruction, input and output fields, return the text to be used by SetFit."""
    return f"INSTRUCTION:\n{field_instruction}\nINPUT:\n{field_input}\nOUTPUT:\n{field_output}\n"


def sample_to_text(sample: rg.TextClassificationRecord) -> str:
    """Convert an Argilla TextClassificationRecord to the text to be used by SetFit."""
    return instruct_fields_to_text(sample.inputs["1-instruction"], sample.inputs["2-input"], sample.inputs["3-output"])


# `argilla_record` is assumed to be an rg.TextClassificationRecord that you have already loaded.
# The probability at index 1 is taken as the "unprocessable" score.
unprocessable_score = model.predict_proba([sample_to_text(argilla_record)])[0].tolist()[1]
```
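If your samples are plain dictionaries rather than Argilla records, the same text format can be built directly and the scores used to split the dataset. The following is a minimal sketch, not part of the original training code: the dictionary keys (`instruction`, `input`, `output`), the helper name `split_by_score` and the `0.5` threshold are all assumptions chosen for illustration. The scoring function is passed in as an argument so the logic can be tried without downloading the model.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Sample = Dict[str, str]


def instruct_fields_to_text(field_instruction: str, field_input: str, field_output: str) -> str:
    """Build the text format used in the inference example of this card."""
    return f"INSTRUCTION:\n{field_instruction}\nINPUT:\n{field_input}\nOUTPUT:\n{field_output}\n"


def split_by_score(
    samples: Sequence[Sample],
    score_fn: Callable[[List[str]], Sequence[float]],
    threshold: float = 0.5,  # hypothetical cut-off, tune it on your own data
) -> Tuple[List[Sample], List[Sample]]:
    """Return (kept, dropped); a sample is dropped when its score is >= threshold."""
    # The keys below are assumed Alpaca-style field names, not Argilla input names.
    texts = [instruct_fields_to_text(s["instruction"], s["input"], s["output"]) for s in samples]
    scores = score_fn(texts)
    kept: List[Sample] = []
    dropped: List[Sample] = []
    for sample, score in zip(samples, scores):
        (dropped if score >= threshold else kept).append(sample)
    return kept, dropped
```

With the model loaded as in the example above, a possible scoring function would be `lambda texts: [p[1] for p in model.predict_proba(texts).tolist()]`, which again assumes that index 1 holds the "unprocessable" probability.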

## Evaluation

*Disclaimer*: No formal evaluation was done; the data and the model outputs were only inspected manually by the team.

## Changelog

- [09/04/2023] SQL code generation, date conversion, percentage discounts and renewable energies are no longer detected as unprocessable.
- [06/04/2023] It no longer detects password generation as unprocessable.