jskinner215 commited on
Commit
414bc96
·
1 Parent(s): 512f2de

Still getting tokenization and chunking errors


Errors:
The warning "Token indices sequence length is longer than the specified maximum sequence length for this model (5183 > 512)" indicates that the tokenized sequence exceeds the model's limit, which is 512 tokens for TAPAS.

The exception "ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length." suggests enabling truncation to cut the token sequence down to the model's maximum.

Fix:
Enable truncation in the tokenizer.
If that alone doesn't resolve the issue, reduce the chunk size as well.

Notice the truncation=True argument in the tokenizer call. This truncates the token sequence to fit the model's maximum input size. I've also wrapped the call in a try-except block to capture any exceptions raised during tokenization.
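If truncation still discards too much of the table, the other lever is shrinking the chunks themselves, so less table text reaches the tokenizer per call. A minimal sketch of row-based chunking, using the MAX_ROWS_PER_CHUNK = 200 constant from app.py (the chunk_rows helper name is illustrative, not part of app.py):

```python
MAX_ROWS_PER_CHUNK = 200  # same limit app.py uses

def chunk_rows(rows, max_rows_per_chunk=MAX_ROWS_PER_CHUNK):
    """Split a list of table rows into consecutive chunks of at most
    max_rows_per_chunk rows each; the last chunk may be shorter."""
    return [rows[i:i + max_rows_per_chunk]
            for i in range(0, len(rows), max_rows_per_chunk)]

# Example: 450 rows split into chunks of up to 200 rows
chunks = chunk_rows(list(range(450)))
print([len(c) for c in chunks])  # [200, 200, 50]
```

Lowering max_rows_per_chunk (say, to 50) trades more tokenizer calls for a better chance that each chunk stays under the 512-token limit without truncation losing answer cells.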

Files changed (1): app.py (+6 -1)
@@ -10,7 +10,11 @@ model = AutoModelForTableQuestionAnswering.from_pretrained("google/tapas-large-f
 
 def ask_llm_chunk(chunk, questions):
     chunk = chunk.astype(str)
-    inputs = tokenizer(table=chunk, queries=questions, padding="max_length", return_tensors="pt")
+    try:
+        inputs = tokenizer(table=chunk, queries=questions, padding="max_length", truncation=True, return_tensors="pt")
+    except Exception as e:
+        st.write(f"An error occurred: {e}")
+        return ["Error occurred while tokenizing"] * len(questions)
 
     # Check for token limit
     if inputs["input_ids"].shape[1] > 512:
@@ -34,6 +38,7 @@ def ask_llm_chunk(chunk, questions):
         answers.append(", ".join(cell_values))
     return answers
 
+
 MAX_ROWS_PER_CHUNK = 200
 
 def summarize_map_reduce(data, questions):