Love it!

#1
by OccultDemonCassette - opened

I absolutely love that this model can actually handle things longer than one sentence. Would there be any way to increase the total amount of input text it can handle when using it in Google Colab, so that it can process a multi-paragraph document of up to 2000 words (or 10,000 characters) or so?

Hi @OccultDemonCassette, glad you like it 😌 Currently, the model should be able to handle up to 1024 tokens at a time (though I would have to go back and look at the training data to see how many tokens it was practically seeing at most during training; this may take some time). You can see how many tokens your input has for this model via the following:

from transformers import AutoTokenizer

# load the tokenizer used by the grammar-synthesis model
tokenizer = AutoTokenizer.from_pretrained("pszemraj/flan-t5-large-grammar-synthesis")

input_text = "I love that this model can handle things longer than one sentence. Would there be any way to increase the total amount of input-text it can handle when using it in google colab so that it can process a multi-paragraph document of up to 2000 words (or 10,000 characters) or so?"

# tokenize without truncation so the full length is counted
encoded_input = tokenizer(
    input_text,
    truncation=False,
    return_tensors="pt",
)
num_tokens = len(encoded_input.input_ids[0])
print(f"my input text is {num_tokens} tokens long")

AFAIK there should be diminishing importance of context "multiple sentences away" when correcting the grammar of a given sentence, so I think it should be fine to correct the text via a sliding token-batch window and then aggregate the outputs. You can see how I implement this for summarization in the summarize_via_tokenbatches function; you would need to change the call to summarize_and_score to one that uses this model instead. Another thing that might be worth investigating is ensuring that the token batches are split in a "smart" way, so that the model is not fed half of a sentence in one batch and the other half in the next.
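If it helps, here is a rough sketch of what I mean (this is not the exact summarize_via_tokenbatches code; correct_via_tokenbatches is just a hypothetical helper that uses the standard generate API, with settings you would want to tune):

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "pszemraj/flan-t5-large-grammar-synthesis"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def correct_via_tokenbatches(text, batch_size=96):
    """Correct a long text by sliding over it in fixed-size token batches."""
    token_ids = tokenizer(text, truncation=False).input_ids
    corrected = []
    for i in range(0, len(token_ids), batch_size):
        # decode the batch back to text and run it through the model
        batch_text = tokenizer.decode(token_ids[i : i + batch_size], skip_special_tokens=True)
        inputs = tokenizer(batch_text, return_tensors="pt")
        outputs = model.generate(**inputs, max_length=256, num_beams=4)
        corrected.append(tokenizer.decode(outputs[0], skip_special_tokens=True))
    return " ".join(corrected)

print(correct_via_tokenbatches("your multi-paragraph document goes here ..."))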

I hope that helps!

Small update: token lengths in the training data (T5 tokenizer):

tl;dr I would use batches of 64-96 tokens at a time to ensure "in-distribution" correction, but it may (and probably does) work for longer sequences too.

count    180080.000000
mean         78.201888
std          94.282064
min           2.000000
25%          19.000000
50%          40.000000
75%          95.000000
max         761.000000
Name: input_token_length, dtype: float64

(attached: histogram of token lengths)

pszemraj changed discussion status to closed

Another thing that might be worth investigating is ensuring that the token batches are split in a "smart" way such that the model is not fed half of the sentence on one batch and the other half on the next.

I hope that helps!

Hello! Sorry for the constant messages. I'm unsure if there's a better way to contact you, or if you could recommend any forums that might be able to help a newcomer figure some of this stuff out.

So, about the following code for def chunks:
def chunks(lst, n):
    """Yield successive n-sized chunks from lst."""
    for i in range(0, len(lst), n):
        yield lst[i : i + n]

I suppose this part splits the input text into equally sized chunks?

Do you know if there would be a way to split these chunks line-by-line instead of into equally sized batches of text?

Say, if I took my text file, and I ran a regex on it to insert a new line after every period, question mark, exclamation point, etc. Basically so that every single sentence is on its own line. Would there be a way to then tokenize a batch of input based specifically on each individual line from the text file?

No worries! You can try Discord (mrshadow773#0840), or the Hugging Face Discord channel would be good for this :)

Re: chunks, yes, it splits the input (as evenly as possible) into token batches.
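For example (a toy illustration, not from the notebook), on a list of token ids it just yields consecutive slices of length n, with whatever is left over in the final slice:

def chunks(lst, n):
    """Yield successive n-sized chunks from lst."""
    for i in range(0, len(lst), n):
        yield lst[i : i + n]

token_ids = list(range(10))  # stand-in for real token ids
print(list(chunks(token_ids, 4)))
# [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]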

Say, if I took my text file, and I ran a regex on it to insert a new line after every period, question mark, exclamation point, etc.

Yes, you would then call either the tokenizer or the pipeline object on each line. The tricky part is that you probably want to maximize the number of tokens/lines you pass to the model at a time, to save on compute/speed.
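Something along these lines might work; it's only a sketch (the regex splitter, the pack_sentences helper, and the my_document.txt filename are all placeholders you would adapt):

import re
from transformers import pipeline

corrector = pipeline(
    "text2text-generation",
    model="pszemraj/flan-t5-large-grammar-synthesis",
)
tokenizer = corrector.tokenizer

def split_into_sentences(text):
    """Naive regex split: start a new 'line' after ., ?, or ! followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def pack_sentences(sentences, max_tokens=96):
    """Group whole sentences into batches that stay under a token budget."""
    batches, current, current_len = [], [], 0
    for sent in sentences:
        n = len(tokenizer(sent).input_ids)
        if current and current_len + n > max_tokens:
            batches.append(" ".join(current))
            current, current_len = [], 0
        current.append(sent)
        current_len += n
    if current:
        batches.append(" ".join(current))
    return batches

text = open("my_document.txt").read()  # hypothetical input file
batches = pack_sentences(split_into_sentences(text))
corrected = [corrector(b, max_length=256)[0]["generated_text"] for b in batches]
print(" ".join(corrected))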

No worries! You can try Discord (mrshadow773#0840), or the Hugging Face Discord channel would be good for this :)

Thanks for the references! I don't know why, but I never even thought about the Hugging Face Discord, haha. I'm really loving this long-document/fiction-book/narrative-language summarization stuff!

On a side note: one thing that I would love to figure out is what data it is pulling "The Underground Man" or "The UM" from. I've tested multiple fiction novels which are written in a mostly first-person narrative, and across all of them, if it ever forgets what the narrator's name is and doesn't default to "the narrator", it will then default to "the underground man". So strange, but so fun! It only happens about once in every 50 iterations or so, though.

Some examples:
In this chapter, the Underground Man explains his philosophy of existence and how it differs from other theories of self.
In this short chapter, the Underground Man tries to make his case for why we should all be afraid.
In this chapter, the Underground Man sums up his argument against being a human.
In this chapter, the Underground Man discusses several other Dracula's who have appeared throughout the novel before their souls goes straight to hell.

Hmm, is that happening with this model (grammar synthesis)? If it's summarization, it's perhaps better to continue that in a thread for one of those models. tl;dr I would use SBERT to search the source data if this is related to BookSum.
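In case it's useful, here's a minimal sketch of that SBERT idea using the sentence-transformers package; the passages list is just a placeholder for however you load the actual source/training data:

from sentence_transformers import SentenceTransformer, util

# placeholder: replace with passages loaded from the real source data
passages = [
    "In this chapter, the narrator reflects on his isolation.",
    "The Underground Man explains his philosophy of existence.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
passage_emb = embedder.encode(passages, convert_to_tensor=True)
query_emb = embedder.encode("the underground man", convert_to_tensor=True)

# rank passages by cosine similarity to the query
hits = util.semantic_search(query_emb, passage_emb, top_k=3)[0]
for hit in hits:
    print(round(hit["score"], 3), passages[hit["corpus_id"]])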
