mserras committed
Commit e9f6807
1 Parent(s): c2b344d

Update README.md

Files changed (1)
  1. README.md +43 -22
README.md CHANGED
@@ -1,18 +1,31 @@
  ---
- license: apache-2.0
  tags:
  - setfit
  - sentence-transformers
  - text-classification
  pipeline_tag: text-classification
  ---

- # hackathon-somos-nlp-2023/setfit-alpaca-es-unprocessable-sample-detection

- This is a [SetFit model](https://github.com/huggingface/setfit) that can be used for text classification. The model has been trained using an efficient few-shot learning technique that involves:

- 1. Fine-tuning a [Sentence Transformer](https://www.sbert.net) with contrastive learning.
- 2. Training a classification head with features from the fine-tuned Sentence Transformer.
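For context, the two steps above map onto the `setfit` library roughly as follows. This is a minimal sketch assuming the classic `SetFitTrainer` API; the base checkpoint, toy dataset and hyperparameters are illustrative placeholders, not the configuration actually used for the model described in this card:

```python
from datasets import Dataset
from sentence_transformers.losses import CosineSimilarityLoss
from setfit import SetFitModel, SetFitTrainer

# Start from a pretrained Sentence Transformer checkpoint (placeholder choice)
model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-multilingual-mpnet-base-v2")

# A tiny, made-up few-shot dataset: label 1 = "unprocessable", label 0 = processable
train_dataset = Dataset.from_dict({
    "text": ["Describe la imagen adjunta", "Resume el siguiente texto: ..."],
    "label": [1, 0],
})

trainer = SetFitTrainer(
    model=model,
    train_dataset=train_dataset,
    loss_class=CosineSimilarityLoss,  # step 1: contrastive fine-tuning of the embeddings
    num_iterations=20,                # number of contrastive pairs generated per sample
    batch_size=16,
)
# train() fine-tunes the embedding body on the generated pairs,
# then fits the classification head on the resulting embeddings (step 2)
trainer.train()
```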
  ## Usage
 
@@ -26,24 +39,32 @@ You can then run inference as follows:
  ```python
  from setfit import SetFitModel

  # Download from Hub and run inference
- model = SetFitModel.from_pretrained("hackathon-somos-nlp-2023/setfit-alpaca-es-unprocessable-sample-detection")
- # Run inference
- preds = model(["i loved the spiderman movie!", "pineapple on pizza is the worst 🤮"])
- ```

- ## BibTeX entry and citation info
-
- ```bibtex
- @article{https://doi.org/10.48550/arxiv.2209.11055,
-   doi = {10.48550/ARXIV.2209.11055},
-   url = {https://arxiv.org/abs/2209.11055},
-   author = {Tunstall, Lewis and Reimers, Nils and Jo, Unso Eun Seo and Bates, Luke and Korat, Daniel and Wasserblat, Moshe and Pereg, Oren},
-   keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences, FOS: Computer and information sciences},
-   title = {Efficient Few-Shot Learning Without Prompts},
-   publisher = {arXiv},
-   year = {2022},
-   copyright = {Creative Commons Attribution 4.0 International}
- }
  ```
 
  ---
  tags:
  - setfit
  - sentence-transformers
  - text-classification
  pipeline_tag: text-classification
+ datasets:
+ - mserras/alpaca-es-hackaton
+ - somosnlp/somos-clean-alpaca-es
+ language:
+ - es
  ---
 
+ # mserras/setfit-alpaca-es-unprocessable-sample-detection

+ This is a [SetFit model](https://github.com/huggingface/setfit) that can be used for filtering the Alpaca ES instruction dataset.

+ The base model is the multilingual [paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) model from Sentence Transformers.
+
+ This model was developed during the 2023 hackathon organized by [SomosNLP](https://somosnlp.org/) ([HF card](https://huggingface.co/somosnlp)), with GPUs provided by [Q Blocks](https://www.qblocks.cloud).
+
+ It was trained on "unprocessable" samples of the translated [Clean Alpaca Es](https://huggingface.co/datasets/somosnlp/somos-clean-alpaca-es) dataset from
+ the HF [Argilla](https://argilla.io) space https://huggingface.co/spaces/mserras/somos-alpaca-es.
+
+ To this end, a custom tag, "unprocessable", is proposed: it marks instruction/input/output triplets that require processing images, fetching information from the
+ open web, or performing similar tasks the LLM has no way to act on, and which therefore end in hallucinations or strange outcomes.
+
+ As this model was trained on samples of Alpaca, which were generated using OpenAI's models, it **cannot be used for commercial purposes or to compete against OpenAI**.

  ## Usage
 
  ```python
  from setfit import SetFitModel
+ import argilla as rg

  # Download from Hub and run inference
+ model = SetFitModel.from_pretrained("mserras/setfit-alpaca-es-unprocessable-sample-detection")
+
+ def instruct_fields_to_text(field_instruction: str, field_input: str, field_output: str):
+     """Given the instruction, input and output fields, return a text to be used by setfit"""
+     return f"INSTRUCTION:\n{field_instruction}\nINPUT:\n{field_input}\nOUTPUT:\n{field_output}\n"
+
+ def sample_to_text(sample: rg.TextClassificationRecord) -> str:
+     """Converts an Argilla TextClassificationRecord to a text to be used by setfit"""
+     return instruct_fields_to_text(sample.inputs["1-instruction"], sample.inputs["2-input"], sample.inputs["3-output"])
+
+ # For a given Argilla record (here called argilla_record), the "unprocessable" probability is:
+ unprocessable_score = model.predict_proba([sample_to_text(argilla_record)])[0].tolist()[1]
  ```
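To filter a whole dataset rather than a single record, the same score can be computed per record and thresholded. Below is a minimal sketch; it assumes an Argilla client already configured via `rg.init(...)`, an Argilla text-classification dataset whose records use the same `1-instruction`/`2-input`/`3-output` input fields, and an illustrative dataset name and 0.5 cut-off:

```python
import argilla as rg
from setfit import SetFitModel

model = SetFitModel.from_pretrained("mserras/setfit-alpaca-es-unprocessable-sample-detection")

# Illustrative dataset name: replace with the Argilla dataset you want to filter
records = rg.load("somos-clean-alpaca-es")

kept, unprocessable = [], []
for record in records:
    # Same text format as sample_to_text above
    text = (
        f"INSTRUCTION:\n{record.inputs['1-instruction']}\n"
        f"INPUT:\n{record.inputs['2-input']}\n"
        f"OUTPUT:\n{record.inputs['3-output']}\n"
    )
    score = model.predict_proba([text])[0].tolist()[1]
    # 0.5 is an arbitrary threshold chosen for illustration
    (unprocessable if score >= 0.5 else kept).append(record)

print(f"{len(unprocessable)} records flagged as unprocessable, {len(kept)} kept")
```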
+
+ ## Evaluation
+
+ *Disclaimer*: No formal evaluation was carried out, just a bunch of people looking at the data and the outcomes.
+
+ ## Changelog
+
+ - [09/04/2023] SQL code generation, date conversion, percentage discounts and renewable energies are no longer detected as unprocessable.
+ - [06/04/2023] Password generation is no longer detected as unprocessable.