AstroLLaMA-3-8B-Chat_AIC

AstroLLaMA-3-8B-Chat_AIC is a specialized chat model for astronomy, developed by the AstroMLab team by fine-tuning the AstroLLaMA-3-8B-Base_AIC model. It is designed for instruction-following and chat-based interactions in the astronomy domain.

Model Details

Base Architecture: LLaMA-3-8B
Base Model: AstroLLaMA-3-8B-Base_AIC (trained on Abstract, Introduction, and Conclusion sections from arXiv's astro-ph category papers)
Fine-tuning Method: Supervised Fine-Tuning (SFT)
SFT Dataset:
- 10,356 astronomy-centered conversations generated from arXiv abstracts by GPT-4
- Full content of LIMA dataset
- 10,000 samples from Open Orca dataset
- 10,000 samples from UltraChat dataset
Training Details:
- Learning rate: 3 × 10⁻⁷
- Training epochs: 1
- Total batch size: 48
- Maximum token length: 2048
- Warmup ratio: 0.03
- Cosine decay schedule for learning rate reduction
Primary Use: Instruction-following and chat-based interactions for astronomy-related queries
Reference: Pan et al. 2024
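
To make the learning-rate settings above concrete, here is a minimal sketch of a warmup-plus-cosine schedule using the listed peak rate (3 × 10⁻⁷) and warmup ratio (0.03). The linear shape of the warmup, the decay all the way to zero, and the `total_steps` value are assumptions made for illustration; the model card does not state them, and `lr_at_step` is a hypothetical helper rather than the training code that was actually used.

```python
import math

PEAK_LR = 3e-7        # learning rate listed under Training Details
WARMUP_RATIO = 0.03   # warmup ratio listed under Training Details


def lr_at_step(step, total_steps, peak_lr=PEAK_LR, warmup_ratio=WARMUP_RATIO):
    """Learning rate at a given optimizer step (hypothetical helper).

    Assumes linear warmup to the peak rate followed by cosine decay to zero.
    """
    warmup_steps = max(1, round(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Linear ramp from peak_lr / warmup_steps up to peak_lr
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay from peak_lr (progress = 0) to zero (progress = 1)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


total_steps = 1000  # hypothetical; the real step count depends on the dataset and batch size
for step in (0, 29, 30, 515, 999):
    print(f"step {step:4d}: lr = {lr_at_step(step, total_steps):.3e}")
```

With these assumptions the rate reaches its peak at the end of warmup (3% of the run) and falls smoothly afterwards, so most of the single training epoch runs below the peak value.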

Using the model for chat

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("AstroMLab/astrollama-3-8b-chat_aic")
model = AutoModelForCausalLM.from_pretrained("AstroMLab/astrollama-3-8b-chat_aic", device_map="auto")

# Function to generate a response
def generate_response(prompt, max_new_tokens=512):
    # Wrap the prompt in the "###Human: ... ###Assistant:" format
    full_prompt = f"###Human: {prompt}\n\n###Assistant:"
    inputs = tokenizer(full_prompt, return_tensors="pt").to(model.device)

    # Stop on the regular EOS token or when the model starts a new "###Human:" turn
    stop_token_id = tokenizer.encode("###Human:", add_special_tokens=False)[0]

    # Generate a response
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,  # caps the reply only, so a long prompt cannot use up the budget
            num_return_sequences=1,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
            eos_token_id=[tokenizer.eos_token_id, stop_token_id],
        )
    
    # Decode and return the response
    response = tokenizer.decode(outputs[0], skip_special_tokens=False)
    
    # Extract only the Assistant's response
    assistant_response = response.split("###Assistant:")[-1].strip()
    return assistant_response

# Example usage
user_input = "What are the main components of a galaxy?"
response = generate_response(user_input)
print(f"Human: {user_input}")
print(f"Assistant: {response}")
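
The prompt format and the response extraction used above can also be tried without loading the model. The sketch below isolates both steps as plain string helpers. `build_prompt` and `extract_reply` are hypothetical names introduced here, and trimming trailing `#` characters is an assumption about what a partially generated stop marker leaves behind.

```python
HUMAN_TAG = "###Human:"
ASSISTANT_TAG = "###Assistant:"


def build_prompt(user_message):
    """Wrap a user message in the single-turn prompt format used above."""
    return f"{HUMAN_TAG} {user_message}\n\n{ASSISTANT_TAG}"


def extract_reply(decoded_text):
    """Return the assistant's reply from the decoded model output."""
    # Keep only what follows the last assistant tag
    reply = decoded_text.split(ASSISTANT_TAG)[-1]
    # Drop anything from the start of a new human turn onward
    reply = reply.split(HUMAN_TAG)[0]
    # Assumption: a cut-off stop marker may leave stray '#' characters at the end
    return reply.rstrip("# \n").strip()


# Simulated decoded output: the prompt, a reply, and a partial stop marker
decoded = build_prompt("What is a pulsar?") + " A rotating neutron star.\n\n###"
print(extract_reply(decoded))
```

Note that this cleanup would also trim a reply that legitimately ends in `#`, which is a trade-off of the simple stripping rule.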

Model Limitations and Biases

This model is specifically trained on astronomy literature and conversation data, and may not generalize well to other domains. Users should be aware of potential biases in the training data, which may reflect historical trends and biases in astronomical research publications and the datasets used for fine-tuning.

Several key limitations have been identified:

Base Model Training: Training solely on astro-ph data may not be sufficient to significantly improve performance over the base model, especially for the already highly performant LLaMA-3 series.
SFT Dataset Limitations: The current Supervised Fine-Tuning (SFT) dataset, inherited from the original AstroLLaMA series, has proven inadequate. With only about 30,000 Q&As, many of which are not astronomy-focused, it leaves the instruct model performing worse than the base model.
Performance Degradation: The full instruct score (61.8%) is well below the base model's token prediction score (71.9%), a drop of about ten percentage points attributable to the SFT process.
General Knowledge vs. Specialized Knowledge: The current SFT process appears to shift the model toward general answers, potentially at the cost of specialized astronomical knowledge.
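
As a rough illustration of how small the astronomy share of the SFT mixture is, the sketch below tallies the components listed under Model Details. The LIMA size of 1,000 examples is an assumption taken from the LIMA paper's description of its training set, not a number stated in this model card.

```python
# SFT mixture from the Model Details section
sft_mixture = {
    "Astronomy conversations (GPT-4, arXiv abstracts)": 10_356,
    "LIMA (full dataset; size is an assumption)": 1_000,
    "Open Orca sample": 10_000,
    "UltraChat sample": 10_000,
}

total = sum(sft_mixture.values())
for name, count in sft_mixture.items():
    share = 100 * count / total
    print(f"{name:<50} {count:>6,} {share:5.1f}%")

# Share of the mixture that is explicitly astronomy-centered
astro_share = 100 * sft_mixture["Astronomy conversations (GPT-4, arXiv abstracts)"] / total
print(f"Total examples: {total:,}; astronomy share: {astro_share:.1f}%")
```

Under that assumption, about one third of the examples are astronomy-centered, which is consistent with the limitation described above.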

The table below compares models on the astronomical benchmarking Q&A described in Ting et al. 2024:

Model	Score (%)
AstroSage-LLaMA-3.1-8B (AstroMLab)	80.9
LLaMA-3.1-8B	73.7
LLaMA-3-8B	72.9
AstroLLaMA-3-8B-Base_AIC (AstroMLab)	71.9
Gemma-2-9B	71.5
Qwen-2.5-7B	70.4
Yi-1.5-9B	68.4
InternLM-2.5-7B	64.5
Mistral-7B-v0.3	63.9
ChatGLM3-6B	50.4
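
For readers who want to quantify the gaps, the following sketch places the chat model's full instruct score (61.8%, quoted in the limitations above) next to the table entries. The comparison is only indicative, since the instruct score and the base-model token prediction scores come from different evaluation methods. The variable names are illustrative.

```python
# Scores (%) copied from the table above
scores = {
    "AstroSage-LLaMA-3.1-8B (AstroMLab)": 80.9,
    "LLaMA-3.1-8B": 73.7,
    "LLaMA-3-8B": 72.9,
    "AstroLLaMA-3-8B-Base_AIC (AstroMLab)": 71.9,
    "Gemma-2-9B": 71.5,
    "Qwen-2.5-7B": 70.4,
    "Yi-1.5-9B": 68.4,
    "InternLM-2.5-7B": 64.5,
    "Mistral-7B-v0.3": 63.9,
    "ChatGLM3-6B": 50.4,
}
CHAT_MODEL_SCORE = 61.8  # full instruct score of AstroLLaMA-3-8B-Chat_AIC

base_score = scores["AstroLLaMA-3-8B-Base_AIC (AstroMLab)"]

# Drop attributed to SFT, in percentage points
sft_drop = round(base_score - CHAT_MODEL_SCORE, 1)

# Gap between the astro-ph-trained base model and plain LLaMA-3-8B
base_gap = round(scores["LLaMA-3-8B"] - base_score, 1)

# Table entries that score higher than the chat model
models_above = [name for name, score in scores.items() if score > CHAT_MODEL_SCORE]

print(f"SFT drop: {sft_drop} points")
print(f"Base model vs. LLaMA-3-8B: {base_gap} points lower")
print(f"Table entries above the chat model: {len(models_above)} of {len(scores)}")
```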

These limitations underscore the challenges in developing specialized models and the critical importance of both the quantity and quality of training data, especially for the SFT process.

This model is released primarily for reproducibility purposes, allowing researchers to track the development process and compare different iterations of AstroLLaMA models.

For optimal performance and the most up-to-date capabilities in astronomy-related tasks, we recommend using AstroSage-8B (listed in the table above as AstroSage-LLaMA-3.1-8B), where these limitations have been addressed. The newer model incorporates expanded training data beyond astro-ph and a greatly expanded fine-tuning process, and it scores 80.9% on the same benchmark.

Ethical Considerations

While this model is designed for scientific use, users should be mindful of potential misuse, such as generating misleading scientific content. Always verify model outputs against peer-reviewed sources for critical applications.

Citation

If you use this model in your research, please cite:

@ARTICLE{2024arXiv240919750P,
       author = {{Pan}, Rui and {Dung Nguyen}, Tuan and {Arora}, Hardik and {Accomazzi}, Alberto and {Ghosal}, Tirthankar and {Ting}, Yuan-Sen},
        title = "{AstroMLab 2: AstroLLaMA-2-70B Model and Benchmarking Specialised LLMs for Astronomy}",
      journal = {arXiv e-prints},
     keywords = {Astrophysics - Instrumentation and Methods for Astrophysics, Computer Science - Computation and Language},
         year = 2024,
        month = sep,
          eid = {arXiv:2409.19750},
        pages = {arXiv:2409.19750},
          doi = {10.48550/arXiv.2409.19750},
archivePrefix = {arXiv},
       eprint = {2409.19750},
 primaryClass = {astro-ph.IM},
       adsurl = {https://ui.adsabs.harvard.edu/abs/2024arXiv240919750P},
      adsnote = {Provided by the SAO/NASA Astrophysics Data System}
}