jskinner215 commited on
Commit
414bc96
·
1 Parent(s): 512f2de

Still getting tokenization and chunking errors


Errors:
The warning "Token indices sequence length is longer than the specified maximum sequence length for this model (5183 > 512)" indicates that the tokenized sequence exceeds the model's limit, which is 512 tokens for TAPAS.

The exception "ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length." suggests enabling truncation to cut the token sequence down to the model's maximum.

Fix:
Enable truncation in the tokenizer.
If that alone doesn't resolve the issue, reduce the chunk size as well.

Notice the truncation=True argument in the tokenizer call. This truncates the token sequence to fit the model's maximum input size. I've also wrapped the call in a try-except block to capture any exceptions raised during tokenization.
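If truncation still discards too much of the table, the other lever is shrinking the chunks themselves, so less table text reaches the tokenizer per call. A minimal sketch of row-based chunking, using the MAX_ROWS_PER_CHUNK = 200 constant from app.py (the chunk_rows helper name is illustrative, not part of app.py):

```python
MAX_ROWS_PER_CHUNK = 200  # same limit app.py uses

def chunk_rows(rows, max_rows_per_chunk=MAX_ROWS_PER_CHUNK):
    """Split a list of table rows into consecutive chunks of at most
    max_rows_per_chunk rows each; the last chunk may be shorter."""
    return [rows[i:i + max_rows_per_chunk]
            for i in range(0, len(rows), max_rows_per_chunk)]

# Example: 450 rows split into chunks of up to 200 rows
chunks = chunk_rows(list(range(450)))
print([len(c) for c in chunks])  # [200, 200, 50]
```

Lowering max_rows_per_chunk (say, to 50) trades more tokenizer calls for a better chance that each chunk stays under the 512-token limit without truncation losing answer cells.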

Files changed (1): app.py (+6 -1)
@@ -10,7 +10,11 @@ model = AutoModelForTableQuestionAnswering.from_pretrained("google/tapas-large-f
 
 def ask_llm_chunk(chunk, questions):
     chunk = chunk.astype(str)
-    inputs = tokenizer(table=chunk, queries=questions, padding="max_length", return_tensors="pt")
+    try:
+        inputs = tokenizer(table=chunk, queries=questions, padding="max_length", truncation=True, return_tensors="pt")
+    except Exception as e:
+        st.write(f"An error occurred: {e}")
+        return ["Error occurred while tokenizing"] * len(questions)
 
     # Check for token limit
     if inputs["input_ids"].shape[1] > 512:
@@ -34,6 +38,7 @@ def ask_llm_chunk(chunk, questions):
         answers.append(", ".join(cell_values))
     return answers
 
+
 MAX_ROWS_PER_CHUNK = 200
 
 def summarize_map_reduce(data, questions):