---
license: apache-2.0
base_model: BEE-spoke-data/smol_llama-220M-GQA
tags:
  - generated_from_trainer
metrics:
  - accuracy
inference:
  parameters:
    max_new_tokens: 64
    do_sample: true
    renormalize_logits: true
    repetition_penalty: 1.05
    no_repeat_ngram_size: 6
    temperature: 0.9
    top_p: 0.95
    epsilon_cutoff: 0.0008
widget:
  - text: In beekeeping, the term "queen excluder" refers to
    example_title: Queen Excluder
  - text: One way to encourage a honey bee colony to produce more honey is by
    example_title: Increasing Honey Production
  - text: The lifecycle of a worker bee consists of several stages, starting with
    example_title: Lifecycle of a Worker Bee
  - text: Varroa destructor is a type of mite that
    example_title: Varroa Destructor
  - text: In the world of beekeeping, the acronym PPE stands for
    example_title: Beekeeping PPE
  - text: The term "robbing" in beekeeping refers to the act of
    example_title: Robbing in Beekeeping
  - text: |-
      Question: What's the primary function of drone bees in a hive?
      Answer:
    example_title: Role of Drone Bees
  - text: To harvest honey from a hive, beekeepers often use a device known as a
    example_title: Honey Harvesting Device
  - text: >-
      Problem: You have a hive that produces 60 pounds of honey per year. You
      decide to split the hive into two. Assuming each hive now produces at a
      70% rate compared to before, how much honey will you get from both hives
      next year?

      To calculate
    example_title: Beekeeping Math Problem
  - text: In beekeeping, "swarming" is the process where
    example_title: Swarming
pipeline_tag: text-generation
datasets:
  - BEE-spoke-data/bees-internal
language:
  - en
---

# smol_llama-220M-bees-internal

This model is a fine-tuned version of [BEE-spoke-data/smol_llama-220M-GQA](https://huggingface.co/BEE-spoke-data/smol_llama-220M-GQA) on the BEE-spoke-data/bees-internal dataset. It achieves the following results on the evaluation set:

- Loss: 2.6892
- Accuracy: 0.4610
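
For quick testing, the snippet below loads the model and generates text with the same sampling parameters as the widget config above. It is a minimal sketch assuming the standard `transformers` pipeline API; the repo id is inferred from the model name and may need adjusting.

```python
from transformers import pipeline

# Load the fine-tuned checkpoint (repo id inferred from the model name).
pipe = pipeline("text-generation", model="BEE-spoke-data/smol_llama-220M-bees-internal")

prompt = 'In beekeeping, the term "queen excluder" refers to'
out = pipe(
    prompt,
    max_new_tokens=64,
    do_sample=True,
    renormalize_logits=True,
    repetition_penalty=1.05,
    no_repeat_ngram_size=6,
    temperature=0.9,
    top_p=0.95,
    epsilon_cutoff=0.0008,
)
print(out[0]["generated_text"])
```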

## Model description

A ~220M-parameter `smol_llama` decoder with grouped-query attention (GQA), fine-tuned for beekeeping-domain text generation.

## Intended uses & limitations

Intended for generating and completing beekeeping-related text (see the widget examples above). As a small 220M-parameter model, its outputs may be inaccurate and should not be treated as authoritative beekeeping advice.

## Training and evaluation data

Fine-tuned and evaluated on the BEE-spoke-data/bees-internal dataset listed in the metadata above.

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:

- learning_rate: 0.0001
- train_batch_size: 4
- eval_batch_size: 2
- seed: 27634
- gradient_accumulation_steps: 8
- total_train_batch_size: 32
- optimizer: Adam with betas=(0.9,0.95) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.05
- num_epochs: 2.0
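
These settings correspond roughly to the `TrainingArguments` sketched below. This is a hypothetical reconstruction rather than the actual training script, and the output path is a placeholder. Note that the effective batch size of 32 is train_batch_size × gradient_accumulation_steps = 4 × 8.

```python
from transformers import TrainingArguments

# Hypothetical reconstruction of the hyperparameters above; not the original script.
args = TrainingArguments(
    output_dir="./smol_llama-220M-bees-internal",  # placeholder
    learning_rate=1e-4,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=8,  # 4 * 8 = 32 effective train batch size
    seed=27634,
    adam_beta1=0.9,
    adam_beta2=0.95,
    adam_epsilon=1e-8,
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
    num_train_epochs=2.0,
)
```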

### Training results

| Training Loss | Epoch | Step | Validation Loss | Accuracy |
|:-------------:|:-----:|:----:|:---------------:|:--------:|
| 3.0959        | 0.1   | 50   | 2.9671          | 0.4245   |
| 2.9975        | 0.19  | 100  | 2.8691          | 0.4371   |
| 2.8938        | 0.29  | 150  | 2.8271          | 0.4419   |
| 2.9027        | 0.39  | 200  | 2.7973          | 0.4457   |
| 2.8983        | 0.49  | 250  | 2.7719          | 0.4489   |
| 2.8789        | 0.58  | 300  | 2.7519          | 0.4515   |
| 2.8672        | 0.68  | 350  | 2.7366          | 0.4535   |
| 2.8369        | 0.78  | 400  | 2.7230          | 0.4558   |
| 2.8271        | 0.88  | 450  | 2.7118          | 0.4569   |
| 2.7775        | 0.97  | 500  | 2.7034          | 0.4587   |
| 2.671         | 1.07  | 550  | 2.6996          | 0.4592   |
| 2.695         | 1.17  | 600  | 2.6965          | 0.4598   |
| 2.6962        | 1.27  | 650  | 2.6934          | 0.4601   |
| 2.6034        | 1.36  | 700  | 2.6916          | 0.4605   |
| 2.716         | 1.46  | 750  | 2.6901          | 0.4609   |
| 2.6968        | 1.56  | 800  | 2.6896          | 0.4608   |
| 2.6626        | 1.66  | 850  | 2.6893          | 0.4609   |
| 2.6881        | 1.75  | 900  | 2.6891          | 0.4610   |
| 2.7339        | 1.85  | 950  | 2.6891          | 0.4610   |
| 2.6729        | 1.95  | 1000 | 2.6892          | 0.4610   |
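
The accuracy column is most likely token-level next-token prediction accuracy, as reported by the standard causal-LM training scripts; the sketch below illustrates how such a metric is typically computed (an assumption, not the exact evaluation code).

```python
import numpy as np

def token_accuracy(logits: np.ndarray, labels: np.ndarray, ignore_index: int = -100) -> float:
    """Fraction of positions where the argmax prediction matches the next token."""
    preds = logits.argmax(axis=-1)[:, :-1]  # token t predicts token t+1
    targets = labels[:, 1:]
    mask = targets != ignore_index  # skip padding / ignored positions
    return float((preds[mask] == targets[mask]).mean())
```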

### Framework versions

- Transformers 4.36.2
- Pytorch 2.1.0
- Datasets 2.16.1
- Tokenizers 0.15.0
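
To reproduce this environment, pinning the versions above should be sufficient (PyPI package names assumed):

```bash
pip install transformers==4.36.2 torch==2.1.0 datasets==2.16.1 tokenizers==0.15.0
```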