Sliding window vs Running Chunks Separately

#21
by MrBeardo4 - opened

Hello, I am running large texts through NuExtract1.5 in Python, extracting any mentions of habitat. When I use the sliding window, I always get good results for the first chunk and very poor or no results for subsequent chunks. This is not an error in my Python code, as I get the same outcome in the online Playground. However, if I run those chunks of text completely separately, the model extracts habitats successfully in the later chunks. Below I have shown my result when running the full text. If I instead run the chunks of text separately, I get 6 more habitats in my result.

These should return the same result, since the chunks are processed almost identically, except that the sliding window adds an overlap of tokens between chunks.
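For clarity, this is roughly the overlapping chunking I mean (a sketch only: I use a whitespace split as a stand-in for the real tokenizer, and the window/overlap values are placeholders, not NuExtract's defaults):

```python
def sliding_chunks(tokens, window=2000, overlap=200):
    """Yield overlapping windows of tokens; consecutive windows
    share `overlap` tokens."""
    step = window - overlap
    for start in range(0, len(tokens), step):
        yield tokens[start:start + window]
        if start + window >= len(tokens):
            break

text = "one two three four five six seven eight nine ten"
tokens = text.split()  # stand-in for a real tokenizer
chunks = list(sliding_chunks(tokens, window=4, overlap=1))
# Each chunk after the first begins with the last token of the previous one.
```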

Why is the model performing so badly on larger texts? It seems like the sliding window does not work properly. I have replicated this with a much simpler template containing just species name and habitat, but the same problem persists.

{
  "Species habitat use": [
    {
      "Species name": "Myotis myotis",
      "Country": "Germany",
      "Continent": "Europe",
      "Habitat": {
        "Uses": [
          "forest habitats with open ground",
          "grassland habitats",
          "freshly mown meadows",
          "short-grazed pastures",
          "bare fields"
        ],
        "Does not use": []
      }
    }
  ]
}

I'd really appreciate some feedback, as this research will go towards a scientific publication using NuExtract1.5, but currently the model is not performing to expectations.

Tom

Can you provide the code you're using along with a source text example?

Hello William, thank you for your response!

Below is a drive link to Text files, Output from model, and Python script.

https://drive.google.com/drive/folders/1af6rUDL52ZG0sq1K3K6jCk_Ruew7oqdK

To make it simple, I have provided a full text, "Test_text.txt", which is a txt file of a scientific article about the bat species Myotis myotis. This text is derived from the HTML. I have then divided it into smaller texts: Abstract, Introduction, Results, Discussion.

You will see from the results that I get 4 habitats when running the full text, but when I run sections of the text individually, I get more habitats overall. When running the full text, only habitats in the Introduction are recognised.

You can replicate this result by copying and pasting the texts plus template into the Playground. In this example I used a sliding window of 2000 tokens, but the result is consistent when varying this parameter.

Thanks in advance for any help!!!
Tom

Hi Tom,

What you're describing sounds like a known limitation we saw occasionally with the sliding-window approach on this model (we discuss it briefly in the blogpost). Essentially, if the model makes an error on any chunk (deletes information, breaks the JSON, etc.), that error propagates to the following chunks, and the model can't recover because of its limited context. As the total number of chunks increases, so does the chance of this happening. This is the main reason we paused support for this feature in the later models we've been developing.

My suggestion would be: if it makes sense for your task, split the text into separate (non-sequential) chunks that can be combined afterwards.
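For instance, if each section is extracted independently, the per-chunk JSON outputs can be combined afterwards with something like this (just a sketch: the field names mirror your template, and the merge logic, a simple order-preserving union of the "Uses" lists, is illustrative):

```python
def merge_habitats(chunk_outputs):
    """Union the 'Uses' lists from independently extracted chunks,
    dropping duplicates while preserving first-seen order."""
    merged, seen = [], set()
    for output in chunk_outputs:
        for entry in output.get("Species habitat use", []):
            for use in entry["Habitat"]["Uses"]:
                if use not in seen:
                    seen.add(use)
                    merged.append(use)
    return merged

# Two hypothetical per-chunk outputs with one overlapping habitat:
chunk_outputs = [
    {"Species habitat use": [{"Habitat": {"Uses": ["grassland habitats"]}}]},
    {"Species habitat use": [{"Habitat": {"Uses": ["grassland habitats",
                                                   "bare fields"]}}]},
]
print(merge_habitats(chunk_outputs))  # ['grassland habitats', 'bare fields']
```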

I'd also encourage you to modify the sliding-window code a bit to inspect the output of each chunk and get a better idea of where the problem occurs (e.g. print pred on each iteration, or skip handle_broken_output()).
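Something along these lines, where `predict` is a placeholder for the actual model call in your script:

```python
def run_with_logging(chunks, predict):
    """Run `predict` on each chunk and print the raw output so you can
    see at which chunk the extraction starts to degrade."""
    preds = []
    for i, chunk in enumerate(chunks):
        pred = predict(chunk)
        print(f"chunk {i}: {pred!r}")  # inspect raw per-chunk output
        preds.append(pred)
    return preds
```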

Thank you for the help and advice with the code!

Yes, I am analysing numerous scientific articles, but I will code the task to run on separate chunks (either by IMRaD structure or by a set token count).
