
Max embedding size

#4
by abacaj - opened

Hi, I think this model is still limited to the original 512 positions, and the tokenizer's maximum length is also set to 512.

Hello @abacaj

I think it works fine. I did a quick check using this Space: https://huggingface.co/spaces/ybelkada/i-like-flan-ul2 and fed it an input of 966 tokens:

Summarize the following text: On February 6, 2023, earthquakes measuring 7.7 and 7.6 hit South Eastern Turkey, affecting 10 cities and resulting in more than 42,000 deaths and 120,000 injured as of February 21.

A few hours after the earthquake, a group of programmers started a Discord server to roll out an application called afetharita, literally meaning "disaster map". This application would help search & rescue teams and volunteers find survivors and bring them help. The need for such an app arose when survivors posted screenshots of texts with their addresses and what they needed (including rescue) on social media. Some survivors also tweeted what they needed so their relatives knew they were alive and that they needed rescue. Needing to extract information from these tweets, we developed various applications to turn them into structured data and raced against time in developing and deploying these apps.

When I got invited to the Discord server, there was quite a lot of chaos regarding how we (volunteers) would operate and what we would do. We decided to collaboratively train models, so we needed a model and dataset registry. We opened a Hugging Face organization account and collaborated through pull requests to build ML-based applications to receive and process information.

We had been told by volunteers in other teams that there was a need for an application to post screenshots, extract information from the screenshots, structure it, and write the structured information to the database. We started developing an application that would take a given image, extract the text first, then from the text extract a name, telephone number, and address, and write this information to a database that would be handed to the authorities. After experimenting with various open-source OCR tools, we started using easyocr for the OCR part and Gradio for building an interface for this application. We were asked to build a standalone application for OCR as well, so we opened endpoints from the interface. The text output from OCR is parsed using a fine-tuned transformers-based NER model.

To collaborate on and improve the application, we hosted it on Hugging Face Spaces and received a GPU grant to keep it up and running. The Hugging Face Hub team set up a CI bot that gave us an ephemeral environment, so we could see how a pull request would affect the Space, which helped us during pull request reviews.

Later on, we were given labeled content from various channels (e.g., Twitter, Discord) with raw tweets of survivors' calls for help, along with the addresses and personal information extracted from them. We started experimenting both with few-shot prompting of closed-source models and with fine-tuning our own token classification model from transformers. We used bert-base-turkish-cased as the base model for token classification and came up with the first address extraction model.

The model was later used in afetharita to extract addresses. The parsed addresses would be sent to a geocoding API to obtain longitude and latitude, and the geolocation would then be displayed on the front-end map. For inference, we used the Inference API, which hosts models for inference and is automatically enabled when a model is pushed to the Hugging Face Hub. Using the Inference API for serving saved us from pulling the model, writing an app, building a Docker image, setting up CI/CD, and deploying the model to a cloud instance, which would have been extra overhead for the DevOps and cloud teams as well. The Hugging Face teams provided us with more replicas so that there would be no downtime and the application would be robust under heavy traffic.

Later on, we were asked if we could extract what earthquake survivors needed from a given tweet. We were given data with multiple labels for multiple needs in a given tweet, and these needs could be shelter, food, or logistics, as it was freezing cold over there. We started with zero-shot experiments using open-source NLI models on the Hugging Face Hub and few-shot experiments with closed-source generative model endpoints. We tried xlm-roberta-large-xnli and convbert-base-turkish-mc4-cased-allnli_tr.

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("google/flan-ul2")
>>> len(tokenizer(text).input_ids)
966

(By the way, the text comes from this blog post: https://huggingface.co/blog/using-ml-for-disasters)

When I fed it to the model, I got:
[screenshot of the model output: Screenshot 2023-03-03 at 17.25.56.png]

Note that you can't pass an input that is larger than 1000 tokens on inference endpoints
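
If you want to reproduce this locally instead of through the Space or the hosted inference, here is a minimal sketch (the dtype, device_map, and max_new_tokens values are illustrative assumptions, and the ~20B-parameter model needs on the order of 40 GB of memory in bfloat16):

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-ul2")
model = AutoModelForSeq2SeqLM.from_pretrained(
    "google/flan-ul2",
    torch_dtype=torch.bfloat16,  # illustrative; fp32 roughly doubles the memory footprint
    device_map="auto",           # requires accelerate; spreads the weights over available devices
)

# `text` is the ~966-token summarization prompt pasted above
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))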

Hi @ybelkada, thanks for following up. Looks like it is working; maybe I loaded the wrong config. I was seeing this warning:

Token indices sequence length is longer than the specified maximum sequence length for this model (1348 > 512). Running this sequence through the model will result in indexing errors
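
For reference, that warning is emitted by the tokenizer based on the model_max_length value it ships with (presumably in tokenizer_config.json), not by the model weights themselves. A quick way to check it is something like:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-ul2")
# the configured limit that triggers the warning; it reported 512 before the fix discussed below
print(tokenizer.model_max_length)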

Hi, for me it still shows 512 as the max sequence length. How do you fix this? The model card for Flan-UL2 says it supports 2048 tokens! I am running it on my local machine.

@ybelkada I believe changing this line to 2048 sorts the issue :) After changing it, everything seems to work well, and there is no (misleading) warning.

I've opened a PR with the fix here
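
In the meantime, if you are running a snapshot that still carries the old value locally, a possible workaround (just a sketch, not an official recommendation) is to override model_max_length when loading the tokenizer instead of editing tokenizer_config.json by hand:

from transformers import AutoTokenizer

# override the 512 default shipped with the tokenizer (assumed workaround until the PR is merged)
tokenizer = AutoTokenizer.from_pretrained("google/flan-ul2", model_max_length=2048)
print(tokenizer.model_max_length)  # 2048, so the warning no longer fires for long inputs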


Thanks! Indeed, I think the change proposed by @joaogante should fix the issue.
