Thank you for sharing! Question/Note about larger dataframes...

#4
by PaulBFB - opened

First of all, thank you for hosting this, it's really very helpful. A quick note about applying this to larger dataframes: if I pass a larger list (10,000 items and above) to SentimentModel.predict_sentiment(), my process (terminal) usually freezes and then gets terminated by the OS. Next I tried pandas.DataFrame.apply(...) to add the sentiment row-wise, which I ended up cancelling after 5 hours.

What ultimately worked for me was splitting the dataframe into chunks with numpy.array_split, wrapping a generator around it to yield the sentiment chunk by chunk, and then concatenating the results (see the sketch below).
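For reference, a minimal sketch of what I mean; the chunk size, the helper name and the example dataframe are just illustrative, not part of the library:

```python
import numpy as np
import pandas as pd
from germansentiment import SentimentModel

model = SentimentModel()

def predict_in_chunks(texts, chunk_size=50):
    # Split the texts into roughly equal chunks and yield the predictions
    # one chunk at a time, so memory usage stays bounded.
    for chunk in np.array_split(texts, max(1, len(texts) // chunk_size)):
        yield model.predict_sentiment(list(chunk))

# Hypothetical dataframe with a "text" column holding the raw texts.
df = pd.DataFrame({"text": ["Das ist super", "Das ist furchtbar"] * 5})
df["sentiment"] = [label
                   for chunk in predict_in_chunks(df["text"].tolist())
                   for label in chunk]
```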

Is there a better/more elegant way to do this? And thank you again for hosting this here.

Hi @PaulBFB ,
thanks for the feedback! I think chunking the data like you did is the best approach. It would be even better if we included this in the Python package, at this line:

https://github.com/oliverguhr/german-sentiment-lib/blob/master/germansentiment/sentimentmodel.py#L26
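Something along these lines could work inside the package. This is only a rough sketch, not the actual implementation: it assumes the current prediction logic is wrapped in a hypothetical `_predict_batch` helper and adds a `batch_size` parameter on top.

```python
from typing import List

def predict_sentiment(self, texts: List[str], batch_size: int = 50) -> List[str]:
    # Process the input in batches so that very large lists do not
    # exhaust memory; each batch runs through the existing model code.
    results: List[str] = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        results.extend(self._predict_batch(batch))  # hypothetical helper wrapping the current logic
    return results
```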

If you'd like to, feel free to create a pull request with your solution.

Best
Oliver

Is there any experience regarding the speed and the best configuration (for example the size of the array chunks) when working with large datasets?
I am planning to use the model to predict the sentiment of over 500,000 text elements and I am a bit afraid of the expected run time.

The batch size largely depends on the CPU / GPU you have at hand. The other factor is the length of the documents you want to classify.

I made a little colab notebook to test this:
https://colab.research.google.com/drive/1ecUTh_TmEOjdIK6-rFqamifd9Jzk7uDA?usp=sharing

500,000 documents in batches of 5 take about 10 hours on the CPU and 20 minutes on the GPU.
If you increase the batch size to 50, it takes 4 minutes to process the dataset.
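If you want to reproduce this kind of measurement outside the notebook, a rough timing sketch could look like this (the sample text, the number of texts and the batch sizes are placeholders):

```python
import time
from germansentiment import SentimentModel

model = SentimentModel()
texts = ["Mit dem Professor bin ich sehr zufrieden."] * 500  # placeholder sample

for batch_size in (5, 50, 100):
    start = time.perf_counter()
    # Run the whole sample through the model in batches of the given size.
    for i in range(0, len(texts), batch_size):
        model.predict_sentiment(texts[i:i + batch_size])
    elapsed = time.perf_counter() - start
    print(f"batch_size={batch_size}: {elapsed:.1f}s for {len(texts)} texts")
```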

So you can process your dataset for free with Colab.

Hi @oliverguhr ,

Thank you very much for what you are doing: Providing this model AND being so responsive and helpful here!

Your Colab notebook is golden for testing the performance and choosing the optimal batch size. I inserted texts that are realistic for my use case (an average of 90 words per text) and tested different batch sizes.
Interestingly, a batch size of 50 seems to be a pretty good choice, even though there would be enough GPU memory for a batch size of 100; 50 is around 10 minutes faster than 100.

Anyway, I can see that my dataset is no problem. Thanks again!

Thank you for your feedback @oberbus, sometimes a nice word brightens the day.

oliverguhr changed discussion status to closed

PS: Feel free to reopen in case something is unclear.
