pauri32 committed on
Commit bf8e84d
1 Parent(s): 998ede2

Update README.md

Files changed (1): README.md +10 -11
README.md CHANGED
 
## Reasoning of the LLM design choices

### Model selected: BloomZ
The model selected for this project is BloomZ with 1.1B parameters. The size was decided according to my hardware limitations: this is the largest model I could fit without a GPU, since 4-bit and 8-bit quantization are not possible on CPU. The type of model was chosen considering the following characteristics:
* BloomZ is a model fine-tuned on instructions, with Bloom as its base model. Bloom was trained on 46 different languages, which is highly relevant for the language detection task. A model trained on a smaller subset of languages might have been a good option, but BloomZ makes it easier to scale to highly multilingual classification.
* The model was trained primarily on English, so it should still be strong on the entity recognition task. This is also why the instructions are written in English.
* The model comes in a wide range of sizes, up to 176B parameters. Given a capable GPU, it would be easy to scale the model up without many further modifications.
* A model fine-tuned to follow instructions performs better on unseen tasks. Since the model has not been fine-tuned for the specific tasks targeted here, good zero-shot and few-shot performance is important.

### Task 1 design: Language detection
* The model receives the instruction to identify whether the language of the string is English, Spanish or French.
* The template follows the format `<input_sentence>(language_id)`, selected after a quick prompt-engineering process.
* The languages are identified with 'english' for English, 'español' for Spanish and 'française' for French. My reasoning behind this decision is the hypothesis that the model, because it was trained on these languages, is more likely to keep the language of the previous tokens when generating the next one, since sequences in the training data rarely switch languages within a sentence. Simply having the language names in their own language therefore helps with the classification.
* Three shots are added, one sentence for each language (the shots should be more carefully curated for a production application).
* If none of the language identifiers is detected, the language is classified as unknown. This is not ideal, but a method to avoid it is suggested later. A sketch of this prompt-and-parse flow follows this list.
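
As a concrete illustration, here is a minimal sketch of the flow described above, assuming the 1.1B BloomZ checkpoint. The instruction wording and shot sentences are illustrative assumptions, not the repository's actual template:

```python
from transformers import pipeline

# Sketch only: the real template and shots live in the repository's code.
generator = pipeline("text-generation", model="bigscience/bloomz-1b1")

FEW_SHOT_PROMPT = (
    "Identify if the language of the sentence is English, Spanish or French.\n"
    "The weather is nice today. (english)\n"
    "Me gusta leer por la noche. (español)\n"
    "J'aime beaucoup ce restaurant. (française)\n"
)

def detect_language(sentence: str) -> str:
    prompt = FEW_SHOT_PROMPT + f"{sentence} ("
    completion = generator(prompt, max_new_tokens=5,
                           return_full_text=False)[0]["generated_text"]
    # Parse the generated identifier; fall back to 'unknown' as described above.
    for identifier in ("english", "español", "française"):
        if identifier in completion.lower():
            return identifier
    return "unknown"
```
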
### Task 2 design: Named entity recognition
* The model is asked to identify entities related to locations, people and organizations.
* The LLM generates the entities in the format `<entity>(entity_type)`, selected after a quick prompt-engineering process. A hypothetical parser for this format is sketched below.
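
The following parser is an assumption based on the format just described (comma-separated `<entity>(entity_type)` spans); the repository's actual parsing logic may differ:

```python
import re

# Matches spans like 'Paris (location)' or 'Marie Curie (person)'.
ENTITY_PATTERN = re.compile(r"([^,()]+)\s*\((location|person|organization)\)")

def parse_entities(completion: str) -> list[tuple[str, str]]:
    """Extract (entity, type) pairs from a completion such as
    'Paris (location), Marie Curie (person)'."""
    return [(text.strip(), etype)
            for text, etype in ENTITY_PATTERN.findall(completion)]
```
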
 
  ## Further improvements
The most important improvement would be to fine-tune the model on the specific tasks. To do this, I would take the following steps:
* For the language detection task, any collection of documents in the targeted languages would be useful, provided we know the language of every document. I would split the documents at sentence level and create a dataset of instructions in the same format as the current shots.
* For the second task, the best option would be to find an existing dataset, such as [CONLLPP](https://huggingface.co/datasets/conllpp). With this dataset and the prompt template designed, we could generate an instruction dataset (see the sketch after this list).
* Another option for the second task, in case no dataset exists (for example, if we target a specific domain or a lower-resource language), is to generate the dataset by 'distilling' information from another LLM. For example, if the task required entity recognition in Catalan and no dataset existed, we could infer a few examples from a bigger model such as GPT-4. However, this solution has three main drawbacks: (1) we may have to pay for the model; (2) its license does not always allow this; (3) even the best models make mistakes, so we would be accepting some error in our dataset.
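
A sketch of the CONLLPP conversion mentioned above. The tag-id mapping assumes the standard CoNLL-2003 label order and the instruction wording is made up; check the dataset card before relying on either:

```python
from datasets import load_dataset

# Odd ids are B- tags and even ids are I- tags in the CoNLL-2003 label order;
# MISC entities (ids 7 and 8) are ignored since the task only targets
# locations, people and organizations.
LABELS = {1: "person", 2: "person", 3: "organization",
          4: "organization", 5: "location", 6: "location"}

def to_instruction_example(example):
    entities, span, span_type = [], [], None
    for token, tag in zip(example["tokens"], example["ner_tags"]):
        starts_new = tag in LABELS and tag % 2 == 1
        if span and (tag not in LABELS or starts_new or LABELS[tag] != span_type):
            entities.append(f"{' '.join(span)} ({span_type})")
            span, span_type = [], None
        if tag in LABELS:
            span.append(token)
            span_type = LABELS[tag]
    if span:
        entities.append(f"{' '.join(span)} ({span_type})")
    sentence = " ".join(example["tokens"])
    return {"prompt": f"List the entities in: {sentence}",
            "completion": ", ".join(entities) if entities else "none"}

# Script-based dataset: recent `datasets` versions may need trust_remote_code=True.
train = load_dataset("conllpp", split="train").map(to_instruction_example)
```
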
Fine-tuning the model for the specific tasks would make it perform better on them. Another variation would be to remove the few-shot examples used in the current code. Although curated shots are usually helpful to the model, they also increase inference time. Whether the improvement from the shots justifies the higher latency should be studied (this would depend on the requirements of the project).

A third improvement would be to use 'forced decoding' for the classification task. This would guarantee that one of the three language identifiers is generated, so the answer would never be 'unknown'; a sketch follows below. During fine-tuning, the model could also be taught to generate the codes 'en', 'es' and 'fr' directly.
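
A minimal sketch of this forced-decoding idea, again assuming the 1.1B BloomZ checkpoint and a prompt that ends just before the language identifier: instead of free generation, score each candidate under the model and pick the most likely one.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloomz-1b1")
model = AutoModelForCausalLM.from_pretrained("bigscience/bloomz-1b1")

def classify(prompt: str, candidates=("english", "español", "française")) -> str:
    scores = {}
    for cand in candidates:
        ids = tokenizer(prompt + cand, return_tensors="pt").input_ids
        cand_len = len(tokenizer(cand, add_special_tokens=False).input_ids)
        with torch.no_grad():
            logits = model(ids).logits
        # Log-probability of each token given its prefix.
        log_probs = logits[:, :-1].log_softmax(-1)
        token_lp = log_probs.gather(2, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
        # Sum over the candidate's tokens only (tokenization at the
        # prompt/candidate boundary can merge tokens; fine for a sketch).
        scores[cand] = token_lp[0, -cand_len:].sum().item()
    return max(scores, key=scores.get)
```
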
Finally, for the next project I would like to try Gradio, since it looks like a nice way to build interfaces for testing models.

## Evaluation proposal
For both the language detection and NER tasks, the best evaluation metric would be the weighted F1-score (precision, recall and the confusion matrix can also be checked for better insight).
To evaluate LLMs I would use EleutherAI's [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness), which I am currently using for other projects and highly recommend; it is also the framework behind the well-known Open LLM Leaderboard. With it, creating a custom task and evaluating with the metrics described above would be very easy. For example, the forced-decoding evaluation suggested for the language detection task is straightforward: the framework gives you the logits of the candidate tokens, and you simply pick the maximum in every case.
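
For the metric itself, scikit-learn already provides everything needed; a toy example with made-up labels:

```python
from sklearn.metrics import classification_report, f1_score

gold = ["en", "es", "fr", "en", "es"]
pred = ["en", "es", "en", "en", "fr"]

print(f1_score(gold, pred, average="weighted"))  # weighted F1-score
print(classification_report(gold, pred))         # per-class precision/recall
```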