Bo Wang

bwang0911

AI & ML interests

information retrieval, representation learning

bwang0911's activity

posted an update about 1 month ago
In the vector search setup, we normally combine a fast embedding model and an accurate but slow reranker model.

The newly released @jinaai rerankers are small and almost as accurate as our base reranker. This means that, under a fixed time budget, they can score more candidate documents from the embedding model, which gives a better chance of feeding the LLM the correct context for RAG generation.
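To make the two-stage setup concrete, here is a minimal sketch in plain Python. Everything in it is an illustrative assumption: the 2-dimensional vectors, the `retrieve_then_rerank` helper, and the `toy_reranker_scores` table that stands in for a real reranker model. It is not how any particular library implements the pipeline.

```python
import math

def cosine(u, v):
    """Cosine similarity between two plain-Python vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def retrieve_then_rerank(query_vec, doc_vecs, rerank_score, top_k):
    """Return document indices: a fast shortlist, reordered by a slower scorer."""
    # Stage 1 (fast): shortlist the top_k documents by embedding similarity.
    by_similarity = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine(query_vec, doc_vecs[i]),
        reverse=True,
    )
    shortlist = by_similarity[:top_k]
    # Stage 2 (slow): only shortlisted documents are scored by the reranker.
    return sorted(shortlist, key=rerank_score, reverse=True)

# Toy data: four documents with made-up embeddings.
doc_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.5, 0.5]]
# Hypothetical reranker scores, keyed by document index.
toy_reranker_scores = {0: 0.2, 1: 0.9, 2: 0.7, 3: 0.5}

small_budget = retrieve_then_rerank([1.0, 0.0], doc_vecs, toy_reranker_scores.get, top_k=3)
large_budget = retrieve_then_rerank([1.0, 0.0], doc_vecs, toy_reranker_scores.get, top_k=4)
```

In this toy data, document 2 scores well with the reranker but poorly with the embedding model, so it only reaches the final ranking when `top_k` is large enough. A faster reranker lets you afford that larger `top_k` within the same time budget.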

These models are available on Hugging Face and have been integrated into the latest SentenceTransformers 2.7.0. Check them out!

jinaai/jina-reranker-v1-turbo-en
jinaai/jina-reranker-v1-tiny-en
posted an update 3 months ago
At @jinaai, we've recently launched an interesting model: jinaai/jina-colbert-v1-en. In this post, I'd like to give you a quick introduction to ColBERT, a multi-vector search and late-interaction retriever.

As you may already know, we've been developing embedding models such as jinaai/jina-embeddings-v2-base-en for some time. These models, often called 'dense retrievers', generate a single representation for each document.

Embedding models like Jina-v2 have the advantage of quick integration with vector databases and good performance within a specific domain.

By "within a specific domain", I mean that embedding models perform very well when the data follows a distribution similar to what they saw during training. The flip side is that they might only perform "okay" on tasks outside of that domain and require fine-tuning.

Now, let's delve into multi-vector search and late-interaction models. The idea is quite simple:

1. During model training, you apply dimensionality reduction to shrink each token embedding from 768 to 128 dimensions, which saves storage.
2. Given one query and one document, you match the first query token embedding against every token embedding in the document and keep the maximum similarity score. Repeat this for each remaining query token, from the second to the last, and then sum up all the maximum similarity scores.

This process is called multi-vector search because if your query has 5 tokens, you keep 5 token embeddings of 128 dimensions each instead of a single vector. The "max similarity sum-up" procedure is termed late interaction.
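The "max similarity sum-up" can be sketched in a few lines of plain Python. This is a minimal illustration, assuming dot-product similarity and made-up 2-dimensional token embeddings (a real model would produce 128-dimensional ones); the function name `maxsim_score` is hypothetical and not part of any library.

```python
def dot(u, v):
    """Dot product of two plain-Python vectors."""
    return sum(a * b for a, b in zip(u, v))

def maxsim_score(query_vecs, doc_vecs):
    """Late-interaction score: for each query token, take the best-matching
    document token, then sum those maxima over all query tokens."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

# Toy example: a 2-token query scored against two documents.
query_vecs = [[1.0, 0.0], [0.0, 1.0]]
doc_a_vecs = [[0.9, 0.1], [0.2, 0.8]]              # covers both query tokens
doc_b_vecs = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3]]  # covers only the first one

score_a = maxsim_score(query_vecs, doc_a_vecs)
score_b = maxsim_score(query_vecs, doc_b_vecs)
```

Here `doc_a_vecs` gets the higher score because each query token finds a close match somewhere in the document, which is exactly what the sum of maxima rewards.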

Multi-vector and late-interaction retrievers have two advantages:

1. Excellent performance outside of a specific domain since they match at a token-level granularity.
2. Explainability: you can interpret your token-level matching and understand why the score is higher/lower.
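As a rough sketch of what token-level explainability can look like, the snippet below reports, for each query token, which document token it matched best and with what score. The tokens, the 2-dimensional vectors, and the `explain_maxsim` helper are all made up for illustration; dot-product similarity is assumed.

```python
def explain_maxsim(query_tokens, query_vecs, doc_tokens, doc_vecs):
    """Return one (query_token, best_doc_token, score) row per query token."""
    rows = []
    for q_tok, q_vec in zip(query_tokens, query_vecs):
        # Similarity of this query token to every document token.
        scores = [sum(a * b for a, b in zip(q_vec, d_vec)) for d_vec in doc_vecs]
        best = max(range(len(scores)), key=scores.__getitem__)
        rows.append((q_tok, doc_tokens[best], scores[best]))
    return rows

# Toy tokens and embeddings, invented for this example.
query_tokens = ["fast", "reranker"]
query_vecs = [[1.0, 0.0], [0.0, 1.0]]
doc_tokens = ["quick", "model", "rerank"]
doc_vecs = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]

breakdown = explain_maxsim(query_tokens, query_vecs, doc_tokens, doc_vecs)
for q_tok, d_tok, score in breakdown:
    print(f"{q_tok} -> {d_tok} ({score:.2f})")
```

Summing the third column of `breakdown` gives the late-interaction score, so a low overall score can be traced back to the specific query tokens that found no good match.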

Try our first multi-vector search model at jinaai/jina-colbert-v1-en and share your feedback!
posted an update 4 months ago
We've been busy cooking up some interesting models at @jinaai, with a recent highlight being the release of our first batch of bilingual embedding models.

Internally labeled as X+EN, where X represents the target language and EN stays fixed, these models specialize in both monolingual tasks and cross-lingual retrieval tasks, crossing from X to EN.

You can find these models on Hugging Face:
1. German-English bilingual embedding: jinaai/jina-embeddings-v2-base-de
2. Chinese-English bilingual embedding: jinaai/jina-embeddings-v2-base-zh

We're also excited to announce that a Spanish bilingual embedding will be released in approximately two weeks.

Our evaluation across various MLM tasks has demonstrated that the Bilingual Backbone consistently outperforms state-of-the-art Multilingual Backbones like XLM-Roberta (given its focus on just two languages).

Despite being three times smaller than the leading multilingual model (e5-multilingual-large), our released bilingual embedding models outperform it in both monolingual and cross-lingual search tasks.

Currently, we're putting the finishing touches on the technical report, which should be available on arXiv by next week.

Looking ahead, the embedding team is gearing up for jina-embeddings-v3, with some initial groundwork already underway. Stay tuned for more updates!