Shopping Receipts

#2
by douglasg14b - opened

This seems to do incredibly poorly on shopping receipts from grocery stores, big box stores, hardware stores, etc.

Is this to be expected? Is there a good dataset of receipts that are focused on these sorts of businesses?

Just mulling

Yes, the results are not satisfactory because there are many errors in the original dataset (which was produced with OCR without verification). I am gradually correcting the errors manually, but it takes a long time to correct ~1500 receipts. On HuggingFace, aside from the original CORD v2 dataset and the one I use, there is not much else, at least not for free.

I wonder how effective something like https://labelstud.io might be in assisting with re-tagging. Combine that with Mechanical Turk 🤔

Kind of related: any ideas on how to get hold of more receipts (classified or not)? There's definitely an under-representation of non-restaurants in the dataset, and some businesses, Safeway for example, have additional receipt syntax that throws off document extraction software. The more samples, the better.

That could be a good idea. There are a few vision models that could help with a proper prompt (I also tried GPT-4V), but ultimately you'll need to correct some mistakes here and there, so I don't think this process can be fully automated. To get more receipts, I can think of Roboflow and also Kaggle. I didn't go for them because their receipts are not tagged at all (or tagged badly) compared to the ones on HF.

I can recommend the SROIE dataset, although it's not tagged as well as CORD. I guess it also depends on what data you need from the receipts.

We get around 85% accuracy (text similarity) when trying out GPT-4 Vision on our dataset at work, so I wouldn't suggest simply using its output as ground truth.

But then again - I still find mistakes in our test data every other time I sample it, so who knows 🤷
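
For anyone wanting to put a rough number on "text similarity" or to decide which receipts to re-check by hand, here is a minimal sketch. It is not the metric used in the comment above (that isn't specified); it assumes Python's standard-library `difflib.SequenceMatcher` as a stand-in, and the function names `text_similarity` and `flag_for_review` as well as the 0.85 default threshold are made up for illustration.

```python
from difflib import SequenceMatcher


def text_similarity(predicted: str, reference: str) -> float:
    """Return a similarity ratio in [0, 1] between two strings.

    Both strings are lowercased and have runs of whitespace collapsed,
    so layout differences do not count as errors.
    """
    a = " ".join(predicted.lower().split())
    b = " ".join(reference.lower().split())
    return SequenceMatcher(None, a, b).ratio()


def flag_for_review(pairs, threshold=0.85):
    """Return indices of (predicted, reference) pairs scoring below threshold.

    The threshold is an arbitrary illustrative value, not a recommendation.
    """
    return [
        i
        for i, (predicted, reference) in enumerate(pairs)
        if text_similarity(predicted, reference) < threshold
    ]


if __name__ == "__main__":
    samples = [
        ("MILK 2.99", "MILK 2.99"),
        ("BRED 1.S0", "BREAD 1.50"),
    ]
    for predicted, reference in samples:
        print(predicted, "|", reference, "|", round(text_similarity(predicted, reference), 3))
    print("needs review:", flag_for_review(samples, threshold=0.9))
```

A low score only says that the model output and the label disagree, not which of the two is wrong, so flagged receipts still need a human look.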
