pauri32 committed on
Commit 8749751
1 Parent(s): bf8e84d

Update README.md

Files changed (1):
  1. README.md +1 -0
README.md CHANGED
@@ -47,6 +47,7 @@ The most important improvement would be to fine-tune the model on the specific t
* For the language detection task, any collection of documents in the targeted languages would be useful (provided we know which language each document is in). These documents would be split at the sentence level, and I would create a dataset of instructions with the same format as the current shots (see the first sketch after this list).
* For the second task, the best option would be to find an existing dataset, such as [CONLLPP](https://huggingface.co/datasets/conllpp). With this dataset and the prompt template already designed, we could generate a fine-tuning dataset (see the second sketch after this list).
* Another option for the second task, in case there is no dataset (for example, if we want to target a specific domain or a language with fewer resources), would be to generate the dataset by 'distilling' information from another LLM. For example, if the task required entity recognition in Catalan and no dataset existed, we could infer a few examples from a bigger model like GPT-4 (see the third sketch after this list). However, this solution has three main drawbacks: (1) we may have to pay for the models; (2) their licenses do not always allow this; (3) even the best models make mistakes, so we would be accepting some error in our dataset.
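A minimal sketch of the first idea, assuming labeled documents and a hypothetical prompt template (the real one should mirror the few-shot format used in the current code):

```python
# Sketch: build a language-detection instruction dataset from labeled documents.
import json
import re

# Hypothetical input: (document, language) pairs.
docs = [
    ("The weather is nice today. I will go for a walk.", "en"),
    ("Hace mucho calor hoy. Prefiero quedarme en casa.", "es"),
]

# Hypothetical template; it should match the format of the current shots.
PROMPT = "Classify the language of the following sentence as en, es or fr.\nSentence: {sentence}\nLanguage:"

records = []
for text, lang in docs:
    # Naive sentence splitter; a library such as spaCy would be more robust.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        if sentence.strip():
            records.append({"prompt": PROMPT.format(sentence=sentence), "completion": " " + lang})

with open("langid_dataset.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```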
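For the second idea, a sketch with the `datasets` library; the template below is again a hypothetical stand-in for the prompt actually designed:

```python
# Sketch: turn CONLLPP examples into instruction-style records.
# Requires: pip install datasets
from datasets import load_dataset

# Depending on the datasets version, trust_remote_code=True may be required.
dataset = load_dataset("conllpp", split="train")

TEMPLATE = "List the named entities in the following sentence.\nSentence: {sentence}\nEntities:"

def to_instruction(example):
    # ner_tags are class indices; mapping them to entity strings for the
    # completion side is omitted here.
    return {"prompt": TEMPLATE.format(sentence=" ".join(example["tokens"]))}

instructions = dataset.map(to_instruction)
print(instructions[0]["prompt"])
```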
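And a sketch of the distillation option, using the openai client; the model name, prompt, and example sentence are illustrative:

```python
# Sketch: 'distill' Catalan NER examples from a larger model.
# Requires: pip install openai, plus OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

sentences = ["La Mercè viu a Barcelona i treballa a la Generalitat."]

for sentence in sentences:
    response = client.chat.completions.create(
        model="gpt-4",  # illustrative; any sufficiently strong model
        messages=[{
            "role": "user",
            "content": f"List the named entities in this Catalan sentence: {sentence}",
        }],
    )
    print(sentence, "->", response.choices[0].message.content)
```

As noted above, outputs produced this way should be reviewed, since even the best models make mistakes.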
 
Fine-tuning the models on these specific tasks should make them perform better. Another variation would be to remove the few-shot examples used in the current code. Although curated shots are usually helpful to the model, they also increase inference time. It should be studied whether the improvement from the shots outweighs the higher latency (this would depend on the requirements of the project); a quick way to measure it is sketched below.
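A minimal sketch of that measurement, where `generate` and `build_prompt` are hypothetical stand-ins for the project's inference call and prompt builder:

```python
# Sketch: average latency with and without few-shot examples.
import time

def average_latency(generate, prompt, runs=20):
    start = time.perf_counter()
    for _ in range(runs):
        generate(prompt)
    return (time.perf_counter() - start) / runs

# Hypothetical usage:
# with_shots = average_latency(generate, build_prompt(sentence, shots=SHOTS))
# without_shots = average_latency(generate, build_prompt(sentence, shots=[]))
# print(f"with shots: {with_shots:.3f}s, without: {without_shots:.3f}s")
```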
A third improvement would be to use 'forced decoding' for the classification task. This would guarantee that one of the three language identifiers is generated, so the answer would never be 'unknown'. During fine-tuning, the model could also be taught to generate the codes 'en', 'es' and 'fr' directly. One way to implement this is sketched below.
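A sketch with transformers, scoring only the three allowed label tokens instead of decoding freely; the model name and prompt are placeholders:

```python
# Sketch: 'forced decoding' by restricting the choice to the label tokens.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder for the project's actual model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

labels = ["en", "es", "fr"]
# First token id of each label; assumes the three first tokens are distinct.
label_ids = [tokenizer.encode(" " + label)[0] for label in labels]

prompt = "Sentence: Il fait beau aujourd'hui.\nLanguage:"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # next-token logits

# Argmax over the three allowed codes only, so 'unknown' is impossible.
print(labels[int(torch.argmax(logits[label_ids]))])
```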
Finally, for the next project I would like to try Gradio, since it seems a nice way to build interfaces for testing models; a minimal example is sketched below.
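A minimal sketch, where `classify_language` is a hypothetical stand-in for the project's classification function:

```python
# Sketch: minimal Gradio interface to try the classifier interactively.
# Requires: pip install gradio
import gradio as gr

def classify_language(sentence: str) -> str:
    # Placeholder; the real function would run the model on the sentence.
    return "en"

demo = gr.Interface(fn=classify_language, inputs="text", outputs="text",
                    title="Language detection demo")
demo.launch()
```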
 