Does LangChain support using the bge-m3 model?

#25
by Nicole828 - opened

Hello, I'd like to ask: can the bge-m3 model currently be used with LangChain?

Beijing Academy of Artificial Intelligence org

You can use bge-m3's dense vectors in LangChain; just load the model with the HuggingFaceEmbeddings module or with sentence-transformers.
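
For example, a minimal sketch of the sentence-transformers route (the sample text and the printed shape are illustrative assumptions, not from this thread):

from sentence_transformers import SentenceTransformer

# Loading bge-m3 this way yields its dense embeddings.
model = SentenceTransformer('BAAI/bge-m3')
# encode() takes a list of texts and returns one dense vector per text.
vectors = model.encode(['An example sentence to embed.'])
print(vectors.shape)  # should be (1, 1024): bge-m3's dense vectors are 1024-dimensional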

Thanks! One more question: should I use HuggingFaceBgeEmbeddings, or HuggingFaceEmbeddings directly for the dense vectors?

Beijing Academy of Artificial Intelligence org

BGE-M3 does not need an instruction prefix added to queries, so HuggingFaceEmbeddings is enough.
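
For example, a minimal sketch (import path per langchain-community 0.0.x, as in the versions listed below; the sample strings are placeholders; HuggingFaceBgeEmbeddings differs in that it prepends a query instruction, which bge-m3 does not need):

from langchain_community.embeddings import HuggingFaceEmbeddings

# HuggingFaceEmbeddings prepends no instruction, matching bge-m3's intended usage.
embeddings = HuggingFaceEmbeddings(model_name='BAAI/bge-m3')
query_vec = embeddings.embed_query('What is BGE-M3?')
doc_vecs = embeddings.embed_documents(['BGE-M3 produces dense embeddings.'])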

I got an error when loading it with HuggingFaceBgeEmbeddings:
Code:
from langchain.embeddings import HuggingFaceBgeEmbeddings
embeddings = HuggingFaceBgeEmbeddings(model_name='BAAI/bge-m3')

Error message:
Traceback (most recent call last):
  File "Untitled-1.py", line 8, in <module>
    embeddings = HuggingFaceBgeEmbeddings(model_name='BAAI/bge-m3', model_kwargs = {
  File "/python3.9/site-packages/langchain_community/embeddings/huggingface.py", line 257, in __init__
    self.client = sentence_transformers.SentenceTransformer(
  File "/python3.9/site-packages/sentence_transformers/SentenceTransformer.py", line 190, in __init__
    modules = self._load_sbert_model(
  File "/python3.9/site-packages/sentence_transformers/SentenceTransformer.py", line 1162, in _load_sbert_model
    module = Transformer(model_name_or_path, cache_dir=cache_folder, **kwargs)
  File "/python3.9/site-packages/sentence_transformers/models/Transformer.py", line 38, in __init__
    self.tokenizer = AutoTokenizer.from_pretrained(
  File "/python3.9/site-packages/transformers/models/auto/tokenization_auto.py", line 751, in from_pretrained
    return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/python3.9/site-packages/transformers/tokenization_utils_base.py", line 2017, in from_pretrained
    return cls._from_pretrained(
  File "/python3.9/site-packages/transformers/tokenization_utils_base.py", line 2242, in _from_pretrained
    init_kwargs[key] = added_tokens_map.get(init_kwargs[key], init_kwargs[key])
TypeError: unhashable type: 'dict'

Loading BAAI/bge-large-zh-v1.5 does not trigger this error.

Versions:
langchain 0.1.12 pypi_0 pypi
langchain-community 0.0.28 pypi_0 pypi
langchain-core 0.1.31 pypi_0 pypi
langchain-experimental 0.0.54 pypi_0 pypi
langchain-text-splitters 0.0.1 pypi_0 pypi
sentence-transformers 2.5.1 pypi_0 pypi
transformers 4.34.1 pypi_0 pypi

Beijing Academy of Artificial Intelligence org

Your transformers version is too old; you need to upgrade it, e.g. to 4.37.0.
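
For example (the exact pin is only a suggestion; any release at or above 4.37.0 should do):

pip install -U "transformers>=4.37.0"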

Loading the bge-m3 model with HuggingFaceEmbeddings works fine, but vectorization runs out of GPU memory:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 32.39 GiB. GPU 1 has a total capacty of 79.15 GiB of which 19.82 GiB is free. Process 48388 has 958.00 MiB memory in use. Process 48381 has 958.00 MiB memory in use. Process 821924 has 1.73 GiB memory in use. Process 3871686 has 3.85 GiB memory in use. Process 3954586 has 14.79 GiB memory in use. Process 4057619 has 37.03 GiB memory in use. Of the allocated memory 36.54 GiB is allocated by PyTorch, and 10.01 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Beijing Academy of Artificial Intelligence org

@CVer2022 , the max length of bge-m3 is 8192, which is larger than that of bge-*-v1.5 (512). When dealing with long texts, memory consumption increases sharply.
If there are not enough resources to handle texts of length 8192, you can reduce the batch size by passing encode_kwargs = {'batch_size': 1} to HuggingFaceEmbeddings, or you can truncate the texts before encoding them.
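
For example, a sketch of both workarounds (the 1000-character cutoff is an arbitrary illustration, not a recommended value, and `texts` stands in for your list of input strings):

from langchain_community.embeddings import HuggingFaceEmbeddings

# Workaround 1: encode one text per batch to cap peak GPU memory.
embeddings = HuggingFaceEmbeddings(
    model_name='BAAI/bge-m3',
    encode_kwargs={'batch_size': 1},
)

# Workaround 2: truncate long texts before encoding them.
texts = ['...']  # your documents
texts = [t[:1000] for t in texts]  # arbitrary character cutoff for illustration
vectors = embeddings.embed_documents(texts)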
