Missing tokenize method?

#6
by DamianS89 - opened

Hey,

I tried to fine-tune this embedding model for my specific use case, basically by putting the model's name into code that already works with several other models. It's essentially, as you recommended, using SentenceTransformer's model.fit.

The error is:

  File "xxx/torch/utils/data/_utils/fetch.py", line 54, in fetch
    return self.collate_fn(data)
  File "xxx/sentence_transformers/SentenceTransformer.py", line 551, in smart_batching_collate
    tokenized = self.tokenize(texts[idx])
  File "xxx/sentence_transformers/SentenceTransformer.py", line 319, in tokenize
    return self._first_module().tokenize(texts)
AttributeError: 'NoneType' object has no attribute 'tokenize'

Am I doing something wrong here, or is this method simply missing (as the error states)?

Do you have any recommendations?

Best,

Damian

Jina AI org

hi @DamianS89 can you give more context, such as your fine-tuning code?

Sure,
I am using SentenceTransformer (most of the time) to fine-tune my embedding models:

Simplified code:

from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, evaluation, losses

# batch_size, device, num_epochs, warmup_steps, steps_per_epoch, learning_rate,
# val_callback, base_path and ft_model_id are defined elsewhere in the full script

examples = []
for i in range(len(data['pos'])):  # data holds a query plus lists of positive/negative passages
    examples.append(InputExample(texts=[data['query'], data['pos'][i], data['neg'][i]]))

train_examples = examples[:1000]
eval_examples = examples[1000:]  # hold-out triplets for the evaluator below
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=batch_size)

model = SentenceTransformer(
    "jinaai/jina-embeddings-v2-base-de",  # the jina-v2 checkpoint, e.g. the base-de variant
    device=device,
)

evaluator = evaluation.TripletEvaluator.from_input_examples(eval_examples, name='eval', batch_size=batch_size)
train_loss = losses.TripletLoss(model=model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=num_epochs,
    warmup_steps=warmup_steps,
    steps_per_epoch=steps_per_epoch,
    optimizer_params={'lr': learning_rate},
    weight_decay=0,
    show_progress_bar=True,
    callback=val_callback,
    evaluator=evaluator,
    save_best_model=True,
)

model.save(f"{base_path}/fine-tuning/emb-models/{ft_model_id}")

Best,

Damian

Jina AI org

hi @DamianS89 since the sentence-transformers release on Jan 30th, 2024, jina-v2 is officially supported by sbert. I'm not sure about the exact cause of your error, but most likely it's because previous sbert versions did not support trust_remote_code. Since yesterday you can do:

!pip install -U sentence-transformers
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer(
    "jinaai/jina-embeddings-v2-base-de", # switch to en/zh for English or Chinese
    trust_remote_code=True,  # NEEDED
)

# control your input sequence length up to 8192
model.max_seq_length = 1024

embeddings = model.encode([
    'How is the weather today?',
    'Wie ist das Wetter heute?'
])
print(cos_sim(embeddings[0], embeddings[1]))
>>> tensor([[0.9602]])

So please upgrade sbert, set trust_remote_code=True, and give it another try.
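Roughly, your simplified script from above should then only change in the model construction; something like this (untested sketch, using the base-de checkpoint as an example):

model = SentenceTransformer(
    "jinaai/jina-embeddings-v2-base-de",  # or whichever variant you are actually fine-tuning
    device=device,
    trust_remote_code=True,  # required so sbert loads jina-v2's custom model code
)

# DataLoader, TripletLoss, evaluator and model.fit(...) stay exactly as in your snippet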

Hey,
yep, I know. I actually wanted to open an issue about it, and while I was writing it they released 2.3.0 ^^
Before that, I hacked the huggingface package and statically added basically "this.client.max_seq_length = xxx" for testing purposes.
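For reference, with sentence-transformers >= 2.3.0 the non-hacky equivalent should simply be setting max_seq_length on the loaded model (the value below is just illustrative):

model = SentenceTransformer("jinaai/jina-embeddings-v2-base-de", trust_remote_code=True)
model.max_seq_length = 1024  # any value up to 8192 works for jina-v2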
Thank you for your responses.
Best,
Damian

DamianS89 changed discussion status to closed
