The next version of Gradio will be significantly more efficient (as well as a bit faster) for anyone who uses Gradio's streaming features. Looking at you, chatbot developers @oobabooga @pseudotensor :)
The major change we're making: when you stream data, Gradio used to resend the entire payload at every token. That's the most robust way to ensure all the data gets through, but it's wasteful. We've now switched to sending "diffs" --> at each time step, we automatically compute the difference between the latest update and the previous one, and send only the new token (or whatever the diff happens to be). Coupled with our move from WebSockets to SSE, which handles flaky connections more gracefully (SSE clients automatically reconnect and resume if the connection drops), we should have the best of both worlds: efficient *and* robust streaming.
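Here's a minimal sketch of the diff idea (illustrative only, not Gradio's actual implementation): since streamed LLM output is usually append-only, the server can send just the suffix that changed, with a full-payload fallback for non-append updates.

```python
# Illustrative sketch of diff-based streaming (not Gradio's actual code):
# the server sends only what changed; the client rebuilds the full text.

def compute_diff(previous: str, current: str) -> tuple[str, str]:
    """Return a (op, data) diff turning `previous` into `current`."""
    if current.startswith(previous):
        # Common case for token streaming: the update only appends text.
        return ("append", current[len(previous):])
    # Fallback: the update rewrote earlier text, so send the full payload.
    return ("replace", current)

def apply_diff(text: str, diff: tuple[str, str]) -> str:
    """Client side: apply a received diff to the text seen so far."""
    op, data = diff
    return text + data if op == "append" else data

# Example: the server's response grows token by token.
server_seen = ""
client_text = ""
for payload in ["Hello", "Hello,", "Hello, wor", "Hello, world!"]:
    diff = compute_diff(server_seen, payload)  # only the new part is "sent"
    server_seen = payload
    client_text = apply_diff(client_text, diff)

assert client_text == "Hello, world!"
```

The payoff: each message is the size of the new token rather than the whole (ever-growing) response, so bandwidth stays flat instead of growing quadratically over a long generation.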
The authors claim that this simple method is the best-performing heuristic for detecting hallucinations. The beauty is that it uses only the probabilities of the generated tokens, so it can be implemented directly at inference time ⚡
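To make the "inference-time" point concrete, here's a generic sketch of one such probability-based heuristic (my own illustrative formulation, not necessarily the authors' exact method, and the threshold is hypothetical): flag a generation when the model's average confidence over its own tokens is low.

```python
import math

# Generic token-probability hallucination heuristic (illustrative only,
# not the paper's exact method): low average confidence => flag output.

def mean_log_prob(token_probs: list[float]) -> float:
    """Average log-probability of the generated tokens."""
    return sum(math.log(p) for p in token_probs) / len(token_probs)

def flag_hallucination(token_probs: list[float],
                       threshold: float = math.log(0.5)) -> bool:
    """Flag the generation if mean token confidence falls below a
    (hypothetical) threshold -- here, a mean probability under 0.5."""
    return mean_log_prob(token_probs) < threshold

confident = [0.9, 0.8, 0.95, 0.85]  # high-probability tokens
uncertain = [0.3, 0.2, 0.4, 0.25]   # low-probability tokens

assert not flag_hallucination(confident)
assert flag_hallucination(uncertain)
```

Since these probabilities are already computed during decoding, a check like this adds essentially no overhead on top of generation.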
With the Google announcement last week, I think we're now officially the only AI startup out there that has commercial collaborations with all the major cloud providers (AWS, GCP, Azure) and hardware providers (Nvidia, AMD, Intel, Qualcomm, ...), making our vision of being the independent, agnostic platform for all AI builders truer than ever!