Still getting tokenization and chunking errors
Errors:
The error `Token indices sequence length is longer than the specified maximum sequence length for this model (5183 > 512)` indicates that the tokenized sequence exceeds the model's limit, which is 512 tokens for TAPAS.
The exception `ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length.` suggests enabling truncation to cut the token sequence down to the model's maximum size.
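To confirm the diagnosis before changing anything, one can tokenize without tensors and inspect the raw sequence lengths; a minimal sketch, assuming `tokenizer`, the string-cast `chunk` DataFrame, and the `questions` list from the diff below:

```python
# Diagnostic sketch: tokenize without return_tensors, so ragged
# per-question lists are returned instead of raising the ValueError.
enc = tokenizer(table=chunk, queries=questions)
print(max(len(ids) for ids in enc["input_ids"]))  # e.g. 5183, far past TAPAS's 512-token limit
```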
Fix:
1. Enable truncation in the tokenizer.
2. Reduce the chunk size if the first step alone doesn't resolve the issue (see the sketch after this list).
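For step 2, a minimal sketch of row-based chunking, assuming the data is a pandas DataFrame and reusing the `MAX_ROWS_PER_CHUNK` constant from the diff below; `chunk_dataframe` is a hypothetical helper name, not part of the Space's code:

```python
import pandas as pd

MAX_ROWS_PER_CHUNK = 200  # lower this if chunks still exceed 512 tokens

def chunk_dataframe(data: pd.DataFrame, rows_per_chunk: int = MAX_ROWS_PER_CHUNK):
    """Yield successive row slices small enough to tokenize."""
    for start in range(0, len(data), rows_per_chunk):
        yield data.iloc[start:start + rows_per_chunk]

# usage: for chunk in chunk_dataframe(data): answers = ask_llm_chunk(chunk, questions)
```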
Notice the `truncation=True` argument in the tokenizer call. This truncates the token sequence to fit the model's maximum input size. A `try-except` block is also included to capture any exceptions that may occur during tokenization.
```diff
@@ -10,7 +10,11 @@ model = AutoModelForTableQuestionAnswering.from_pretrained("google/tapas-large-f
 
 def ask_llm_chunk(chunk, questions):
     chunk = chunk.astype(str)
-    inputs = tokenizer(table=chunk, queries=questions, padding="max_length", return_tensors="pt")
+    try:
+        inputs = tokenizer(table=chunk, queries=questions, padding="max_length", truncation=True, return_tensors="pt")
+    except Exception as e:
+        st.write(f"An error occurred: {e}")
+        return ["Error occurred while tokenizing"] * len(questions)
 
     # Check for token limit
     if inputs["input_ids"].shape[1] > 512:
@@ -34,6 +38,7 @@ def ask_llm_chunk(chunk, questions):
         answers.append(", ".join(cell_values))
     return answers
 
+
 MAX_ROWS_PER_CHUNK = 200
 
 def summarize_map_reduce(data, questions):
```
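For reference, a self-contained sketch of the patched tokenizer call, assuming the checkpoint is `google/tapas-large-finetuned-wtq` (the hunk header above truncates the actual name):

```python
import pandas as pd
from transformers import AutoTokenizer, AutoModelForTableQuestionAnswering

name = "google/tapas-large-finetuned-wtq"  # assumed; the diff truncates the checkpoint name
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForTableQuestionAnswering.from_pretrained(name)

# TAPAS expects every cell as a string.
table = pd.DataFrame({"city": ["Paris", "Berlin"],
                      "population": [2148000, 3769000]}).astype(str)

# truncation=True drops table rows until the sequence fits the 512-token limit.
inputs = tokenizer(table=table, queries=["Which city has the larger population?"],
                   padding="max_length", truncation=True, return_tensors="pt")
outputs = model(**inputs)
# Cell coordinates can then be decoded with tokenizer.convert_logits_to_predictions.
```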