---
datasets:
- hackathon-somos-nlp-2023/podcasts-ner-es
license: mit
language:
- es
pipeline_tag: text-generation
---

# Named-entity recognition for Spanish Podcasts

This model is a version of the Spanish [bertin-project/bertin-gpt-j-6B](https://huggingface.co/bertin-project/bertin-gpt-j-6B) checkpoint fine-tuned for named-entity recognition. It was developed during the 2023 Hackathon organized by SomosNLP, using RTX 3090 GPUs provided by Q Blocks.

## Motivation of the project

Podcasts are an incredible source of information and inspiration. We can listen to them while commuting, practising sport or cooking our favourite recipe. However, it can be difficult to retain the specific facts, dates or people mentioned in them. The aim of this project was to explore how to capture those facts using named-entity recognition. Instead of using a language model fine-tuned with a dedicated NER head, we reframed the problem as text generation from a prompt of the following kind:

```
text: Yo hoy voy a hablar de mujeres en el mundo del arte, porque me ha leído un libro fantástico que se llama Historia del arte sin hombres, de Katie Hesel.
entities: (people, Katie Hesel), (books, Historia del arte sin hombres)
```

By fine-tuning a large generative model on prompts of this form, we are able to capture the entities mentioned in the podcast. We fine-tuned [bertin-gpt-j-6B](https://huggingface.co/bertin-project/bertin-gpt-j-6B) following this strategy.

Similar projects with podcasts have been conducted by Andrej Karpathy (https://karpathy.ai/lexicap/) and Aleksa Gordic (https://www.hubermantranscripts.com/).

## Dataset creation

For full details of the dataset, check [this page](https://huggingface.co/datasets/hackathon-somos-nlp-2023/podcasts-ner-es).
A brief summary is:

1) Transcribe the audio from a YouTube playlist using Whisper ([check this notebook to understand how we did it](https://github.com/sergiopperez/hackathon_podcast/blob/main/src/NER/get_transcriptions.ipynb)). The podcast we chose is "Deforme Semanal", and the audio comes from this [playlist](https://www.youtube.com/playlist?list=PLLbN7SMQhMVZoXhtQ00AyebQE_-ttDrs9).
2) Gather all the transcriptions, unify them into a single dataset, and split them into segments of 512 characters.
3) For each segment, label the entities in it using the `text-davinci-003` API from OpenAI ([check this notebook to understand how we did it](https://github.com/sergiopperez/hackathon_podcast/blob/main/src/NER/create_entities_json.ipynb)).

## Fine-tuning

Training was performed on an RTX 3090 kindly provided by Q Blocks and took 2h20m. We employed the Low-Rank Adaptation (LoRA) strategy to substantially reduce the number of trainable parameters while maintaining model quality. The pre-trained checkpoint was [bertin-project/bertin-gpt-j-6B](https://huggingface.co/bertin-project/bertin-gpt-j-6B). We did not perform an extensive hyperparameter sweep, so there is room for improvement. Check [this notebook](https://github.com/sergiopperez/hackathon_podcast/blob/main/src/NER/peft-gpt-j.ipynb) to understand how we did it.

## Evaluation

Disclaimer: no formal evaluation of the fine-tuned model was performed.

## Team members

- [David Mora](https://huggingface.co/DavidFM43)
- [Sergio Perez](https://huggingface.co/sergiopperez)
- [Alberto Fernandez](https://huggingface.co/AlbertoFH98)
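
## Example: prompt format and output parsing

The following is a minimal sketch of how the `text:` / `entities:` prompt format described in this card could be built and how a generated `entities:` line could be parsed back into pairs. It is not the code used in the project notebooks. The helper names (`chunk_text`, `build_prompt`, `parse_entities`), the newline between the two fields, and the regular expression are all assumptions made for illustration; only the 512-character segment size and the `(type, value)` layout come from the text above.

```python
import re

# Segment length in characters, as stated in the dataset summary.
CHUNK_SIZE = 512

# Assumed layout of one entity: "(type, value)". The value may contain commas.
ENTITY_PATTERN = re.compile(r"\(\s*([^,()]+?)\s*,\s*([^()]+?)\s*\)")


def chunk_text(text, size=CHUNK_SIZE):
    """Split a transcription into consecutive segments of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def build_prompt(segment):
    """Format a segment as a prompt, leaving the entities field open for generation."""
    # Assumption: the two fields are separated by a single newline.
    return f"text: {segment}\nentities:"


def parse_entities(completion):
    """Extract (type, value) pairs from the text generated after 'entities:'."""
    return [(kind, value) for kind, value in ENTITY_PATTERN.findall(completion)]


if __name__ == "__main__":
    sample = "(people, Katie Hesel), (books, Historia del arte sin hombres)"
    for kind, value in parse_entities(sample):
        print(f"{kind}: {value}")
```

In this sketch, each segment returned by `chunk_text` would be passed through `build_prompt`, sent to the fine-tuned model for generation, and the generated continuation handed to `parse_entities`. Malformed output (for example an unclosed parenthesis) is silently skipped by the regular expression, which may or may not be the desired behaviour.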