
Hungarian Experimental Sentence-BERT

The pre-trained huBERT model was fine-tuned on the Hunglish 2.0 parallel corpus to mimic the bert-base-nli-stsb-mean-tokens model provided by UKPLab. Sentence embeddings are obtained by applying mean pooling to the huBERT token outputs. The data was split into training (98%) and validation (2%) sets; at the end of training, the mean squared error on the validation set was 0.106. Our code is based on the Sentence-Transformers library. The model was trained for 2 epochs on a single GTX 1080 Ti GPU with a batch size of 32, and training took approximately 15 hours.
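
The mimicking setup described above corresponds to the multilingual knowledge-distillation recipe in Sentence-Transformers: the teacher model embeds one side of each parallel sentence pair, and the student is trained with an MSE objective to reproduce those embeddings for both languages. The sketch below shows one way such a run can be set up; the huBERT checkpoint name, the Hunglish file name, and the warmup step count are illustrative assumptions, not details taken from this card.

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, losses
from sentence_transformers.datasets import ParallelSentencesDataset

# Teacher: the English SBERT model whose embeddings the student should mimic
teacher = SentenceTransformer('bert-base-nli-stsb-mean-tokens')

# Student: huBERT with mean pooling on top (checkpoint name is an assumption)
word_embedding_model = models.Transformer('SZTAKI-HLT/hubert-base-cc', max_seq_length=128)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(),
                               pooling_mode_mean_tokens=True)
student = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# Parallel data: tab-separated "English sentence<TAB>Hungarian sentence" lines
# (the file name is a placeholder for the Hunglish 2.0 training split)
train_data = ParallelSentencesDataset(student_model=student, teacher_model=teacher)
train_data.load_data('hunglish2-train.tsv')
train_dataloader = DataLoader(train_data, shuffle=True, batch_size=32)

# Student embeddings are regressed onto the teacher embeddings with MSE loss
train_loss = losses.MSELoss(model=student)
student.fit(train_objectives=[(train_dataloader, train_loss)],
            epochs=2,
            warmup_steps=1000)  # warmup value is an assumption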

Limitations

  • max_seq_length = 128 (longer inputs are truncated; see the snippet below)
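
The sequence-length limit can be inspected on the loaded model via the standard SentenceTransformer attribute; this is a minimal illustration, not part of the original card.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('NYTK/sentence-transformers-experimental-hubert-hungarian')
print(model.max_seq_length)  # 128; inputs longer than this are truncated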

Usage

from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence", "Each sentence is converted"]

# Load the model from the Hugging Face Hub
model = SentenceTransformer('NYTK/sentence-transformers-experimental-hubert-hungarian')

# Encode the sentences into fixed-size embedding vectors
embeddings = model.encode(sentences)
print(embeddings)
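
The embeddings can be compared directly, for example with the cosine-similarity helper shipped with Sentence-Transformers. This is a hedged illustration; the Hungarian sentences are arbitrary examples.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('NYTK/sentence-transformers-experimental-hubert-hungarian')
embeddings = model.encode(["Ez egy példa mondat.",    # "This is an example sentence."
                           "Ez egy másik mondat."])   # "This is another sentence."

# Cosine similarity between the two sentence embeddings
print(util.cos_sim(embeddings[0], embeddings[1]))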

Citation

If you use this model, please cite the following paper:

@article{bertopic,
    title = {Analyzing Narratives of Patient Experiences: A BERT Topic Modeling Approach},
    journal = {Acta Polytechnica Hungarica},
    year = {2023},
    author = {Osváth, Mátyás and Yang, Zijian Győző and Kósa, Karolina},
    pages = {153--171},
    volume = {20},
    number = {7}
}