Multilingual E5 for Document Classification (DocLayNet)

This model is a fine-tuned version of intfloat/multilingual-e5-large for document text classification based on the DocLayNet dataset.

Model description

  • Base model: intfloat/multilingual-e5-large
  • Task: Document text classification
  • Languages: Multilingual

Training data

The labels are six DocLayNet document categories, mapped to integer IDs:

{
    'financial_reports': 0,
    'government_tenders': 1,
    'laws_and_regulations': 2,
    'manuals': 3,
    'patents': 4,
    'scientific_articles': 5
}
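As a minimal sketch, the mapping above can be inverted to turn a predicted class ID back into a category name. The variable names `label2id` and `id2label` follow the usual Transformers config convention and are an assumption here; the card does not show how the mapping is stored in the checkpoint.

```python
# Label mapping copied from the model card.
label2id = {
    "financial_reports": 0,
    "government_tenders": 1,
    "laws_and_regulations": 2,
    "manuals": 3,
    "patents": 4,
    "scientific_articles": 5,
}

# Inverse mapping: integer class ID -> category name.
id2label = {idx: name for name, idx in label2id.items()}

# Sanity check: IDs should be contiguous from 0 so they can index the logits.
assert sorted(id2label) == list(range(len(label2id)))

print(id2label[4])
```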

Training procedure

Trained on a single GPU for 2 epochs, which took approximately 20 minutes.

Hyperparameters:

{
    'batch_size': 8,
    'num_epochs': 10,
    'learning_rate': 2e-5,
    'weight_decay': 0.01,
    'warmup_ratio': 0.1,
    'gradient_clip': 1.0,
    'label_smoothing': 0.1,
    'optimizer': 'AdamW',
    'scheduler': 'cosine_with_warmup'
}
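To illustrate how 'learning_rate', 'warmup_ratio' and 'scheduler' interact, here is a rough sketch of a cosine schedule with linear warmup. It assumes 'cosine_with_warmup' means linear warmup from zero followed by a half-cosine decay to zero, which is how such schedules are commonly defined; the card does not state the exact implementation. The function name and the total step count of 1000 are hypothetical.

```python
import math


def cosine_with_warmup(step, total_steps, base_lr=2e-5, warmup_ratio=0.1):
    """Learning rate at `step`: linear warmup, then cosine decay to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr over the warmup phase.
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the decay phase completed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Hypothetical run length of 1000 optimizer steps.
for step in (0, 50, 100, 550, 1000):
    print(step, cosine_with_warmup(step, total_steps=1000))
```

With a warmup ratio of 0.1, the learning rate peaks at 2e-5 after the first 10% of steps and then decays smoothly for the rest of training.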

Evaluation results

Test Loss: 0.5192, Test Acc: 0.9719
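A test loss near 0.5 alongside 97% accuracy may look surprising. One possible explanation, assuming the test loss was computed with the same label smoothing of 0.1 as in training (the card does not say), is that smoothed cross-entropy has a non-zero lower bound. The sketch below computes that bound for six classes, assuming the common convention in which the smoothing mass is spread uniformly over all classes, including the true one. The function name is hypothetical.

```python
import math


def smoothed_loss_floor(num_classes, smoothing):
    """Lowest cross-entropy reachable against a label-smoothed target.

    The minimum is reached when the prediction equals the smoothed
    target, so the floor is the entropy of that target distribution.
    """
    on_value = 1.0 - smoothing + smoothing / num_classes  # true class
    off_value = smoothing / num_classes                   # every other class
    return -(
        on_value * math.log(on_value)
        + (num_classes - 1) * off_value * math.log(off_value)
    )


floor = smoothed_loss_floor(num_classes=6, smoothing=0.1)
print(f"loss floor: {floor:.4f}")
```

Under these assumptions the floor is about 0.42, so a reported loss of 0.5192 would sit fairly close to the best achievable value.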

Usage

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-classification", model="kaixkhazaki/multilingual-e5-doclaynet")

prediction = pipe("This is some text from a financial report")
print(prediction)
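The pipeline returns a list of dicts with 'label' and 'score' keys. Depending on how the checkpoint's config was saved, the label may be a category name or a generic "LABEL_<id>" string; which one applies here is not stated in the card, so the sketch below handles both. The helper names `readable_label` and `classify` are hypothetical, and `truncation=True` is passed on the assumption that inputs may exceed the model's maximum sequence length.

```python
import re

# ID -> category name, taken from the label mapping in this card.
ID2LABEL = {
    0: "financial_reports",
    1: "government_tenders",
    2: "laws_and_regulations",
    3: "manuals",
    4: "patents",
    5: "scientific_articles",
}


def readable_label(prediction):
    """Map a pipeline prediction dict to a category name."""
    label = prediction["label"]
    match = re.fullmatch(r"LABEL_(\d+)", label)
    if match:
        # Generic label such as "LABEL_3": look up the name by ID.
        return ID2LABEL[int(match.group(1))]
    # Otherwise assume the config already stores the category name.
    return label


def classify(text, model_name="kaixkhazaki/multilingual-e5-doclaynet"):
    """Classify one text and return (category, score)."""
    # Imported lazily so the helper above works without transformers.
    from transformers import pipeline

    pipe = pipeline("text-classification", model=model_name)
    prediction = pipe(text, truncation=True)[0]
    return readable_label(prediction), prediction["score"]
```

For example, `classify("This is some text from a financial report")` would return a category name together with its confidence score.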

Model size: 560M params (Safetensors, tensor type F32)