---
language_model:
  - causal
license: apache-2.0
tags:
  - multilingual
  - arabic
  - darija
  - transformers
  - text-generation
model-index:
  - name: Darija-LM
    results: []
---

# Darija-LM

This is a multilingual language model trained on Arabic and Darija (Moroccan Arabic) Wikipedia datasets.

## Model Description

[TODO: Add a detailed description of your model here.] For example, you can include:

- Model architecture: GPT-like Transformer
- Training data: Arabic and Darija Wikipedia (20231101 snapshot)
- Tokenizer: SentencePiece (BPE, vocab size: 32000); a training sketch follows this list
- Training parameters: [Specify hyperparameters such as learning rate, batch size, layers, heads, etc.]
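
As a concrete reference for the tokenizer bullet above, here is a minimal sketch of how a SentencePiece BPE tokenizer with a 32k vocabulary could be trained. The corpus file name and the `character_coverage` value are illustrative assumptions, not the released configuration.

```python
# Minimal sketch, assuming a plain-text corpus of Arabic + Darija Wikipedia.
# The input file name and character_coverage are assumptions.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="wiki_ar_ary_corpus.txt",  # hypothetical corpus file, one sentence per line
    model_prefix="darija_sp",        # writes darija_sp.model and darija_sp.vocab
    vocab_size=32000,                # vocab size stated above
    model_type="bpe",                # BPE, as stated above
    character_coverage=0.9995,       # assumed; a common default for non-Latin scripts
)

# Round-trip a sample to sanity-check the trained tokenizer.
sp = spm.SentencePieceProcessor(model_file="darija_sp.model")
print(sp.encode("مرحبا بالعالم", out_type=str))
```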

## Intended Uses & Limitations

[TODO: Describe the intended uses and limitations of this model.] For example:

- Intended use cases: text generation, research in multilingual NLP, exploring low-resource language models.
- Potential limitations: may not be suitable for production environments without further evaluation and fine-tuning; potential biases inherited from the Wikipedia training data.

## How to Use

[TODO: Add instructions on how to load and use the model.]

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Duino/Darija-LM"  # or the path to your saved model locally
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Example generation (adapt as needed for your model and tokenizer)
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

input_text = "مرحبا بالعالم"  # example Arabic/Darija input: "Hello, world"
input_ids = tokenizer.encode(input_text, return_tensors="pt").to(device)
output = model.generate(
    input_ids,
    max_length=50,
    num_beams=5,
    no_repeat_ngram_size=2,
    early_stopping=True,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
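
Beam search with `no_repeat_ngram_size` keeps the output from looping; for more varied generations, sampling (`do_sample=True` with `top_p` or `temperature`) is a common alternative.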

## Training Details

[TODO: Provide details about the training process.]

- Training data preprocessing: [Describe tokenization, data splitting, etc.] (a data-loading sketch follows this list)
- Training procedure: [Optimizer, learning rate schedule, number of iterations, etc.]
- Hardware: [Specify GPUs or TPUs used]
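
For the preprocessing bullet above, a minimal data-loading sketch is shown below. It assumes the 20231101 snapshot refers to the Hugging Face `wikimedia/wikipedia` dumps, with `ary` as the config for Moroccan Arabic; neither detail is confirmed by this card.

```python
# Hypothetical data-loading step; the dataset IDs assume the Hugging Face
# "wikimedia/wikipedia" dumps for the 20231101 snapshot named above.
from datasets import load_dataset

arabic = load_dataset("wikimedia/wikipedia", "20231101.ar", split="train")
darija = load_dataset("wikimedia/wikipedia", "20231101.ary", split="train")  # ary = Moroccan Arabic

# Flatten article text into one plain-text corpus for tokenizer and LM training.
with open("wiki_ar_ary_corpus.txt", "w", encoding="utf-8") as f:
    for ds in (arabic, darija):
        for article in ds:
            f.write(article["text"] + "\n")
```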

## Evaluation

[TODO: Include evaluation metrics if you have them.]

- [Metrics and results on a validation set or benchmark.] (a perplexity template follows)
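
Until benchmark numbers are filled in, held-out perplexity is a standard starting metric for a causal LM. The sketch below is a template only; the sample text is a placeholder, and no official results are implied.

```python
# Perplexity template for a causal LM; the sample text is a placeholder.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Duino/Darija-LM")
model = AutoModelForCausalLM.from_pretrained("Duino/Darija-LM").eval()

text = "..."  # replace with held-out Arabic/Darija text
enc = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    # With labels equal to input_ids, the model returns the mean cross-entropy loss.
    loss = model(**enc, labels=enc["input_ids"]).loss
print(f"perplexity: {math.exp(loss.item()):.2f}")
```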

## Citation

[TODO: Add citation information if applicable.]

## Model Card Contact

[TODO: Add your contact information.]

- [Your name/organization]
- [Your email/website/Hugging Face profile]