Model Card for ScholaWrite-Llama3.1-8B-Classifier
Model Details
Model Description
This model is referred to as LLAMA-8B-SW-PRED in the paper. It is fine-tuned from the 4-bit quantized Llama-3.1-8B-Instruct released by Unsloth on the Hugging Face Hub, using the train split of the ScholaWrite dataset. The sole purpose of this model is to perform next intention prediction in the Iterative Self-Writing task.
- Developed by: *Linghe Wang, *Minhwa Lee, Ross Volkov, Luan Chau, Dongyeop Kang
- Language: English
- Finetuned from model: Meta-Llama-3.1-8B-Instruct-bnb-4bit
Model Sources
- Repository: ScholaWrite Github Repository
- Paper: https://arxiv.org/abs/2502.02904
- Demo: https://minnesotanlp.github.io/scholawrite/
Uses
Direct Use
The model is intended to be used for next intention prediction in the Iterative Self-Writing task.
The Iterative Self-Writing task generates scholarly text from scratch in an iterative fashion, mirroring the human writing process. The task examines how well a model trained on our dataset can replicate the actual iterative writing and thinking process of scholars, and thus produce better scholarly text than a model not trained on our dataset.
Iterative self-writing involves two subtasks. (1) Next intention prediction: the model takes an input prompt with task instructions and the "before" text, and generates the next writing intention based on the "before" text. (2) "After-text" generation: the model takes an input prompt with task instructions, a verbalizer derived from human-annotated labels, and the "before" text, and generates the "after" text given the verbalizer and the "before" text.
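As a minimal sketch, the two subtasks can be chained into a single loop. The driver below is only illustrative; `predict_intention` and `generate_after_text` stand for wrappers around this classifier and the companion scholawrite-llama3.1-8b-writing model, and are not functions shipped with either checkpoint.

```python
from typing import Callable

def iterative_self_writing(
    seed_text: str,
    predict_intention: Callable[[str], str],         # subtask (1): this classifier
    generate_after_text: Callable[[str, str], str],  # subtask (2): the writing model
    n_iterations: int = 100,
) -> str:
    """Illustrative driver for the iterative self-writing loop."""
    before_text = seed_text
    for _ in range(n_iterations):
        intention = predict_intention(before_text)
        before_text = generate_after_text(intention, before_text)
    return before_text
```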
Out-of-Scope Use
The model is fine-tuned only for next writing intention prediction and was inferenced in a closed environment. Its main goal is to examine the usefulness of our dataset. It is suitable for academic use, but not for production, general public use, or consumer-oriented services. In addition, using this model for tasks other than next writing intention prediction on LaTeX paper drafts may not work well.
Bias and Limitations
The bias and limitations of this model mainly come from the dataset (ScholaWrite) it was fine-tuned on.
First, the ScholaWrite dataset is currently limited to the computer science domain, as LaTeX is predominantly used in computer science journals and conferences. This domain-specific focus may restrict the model's generalizability to other scientific disciplines. Future work could address this limitation by collecting keystroke data from a broader range of fields with diverse writing conventions and tools, such as the humanities or biological sciences. For example, students in the humanities usually write book-length papers and integrate more sources, which could affect the cognitive complexity of their writing.
Second, all participants were early-career researchers (e.g., PhD students) at an R1 university in the United States, which means the model may not learn the professional writing behaviors and cognitive processes of experts. Expanding the dataset to include senior researchers, such as post-doctoral fellows and professors, could offer valuable insights into how writing strategies and revision behaviors evolve with research experience and expertise.
Third, the dataset is exclusive to English-language writing, which restricts the model's capability to predict the next writing intention in multilingual or non-English contexts. Expanding to multilingual settings could reveal unique cognitive and linguistic insights into writing across languages.
How to Get Started with the Model
import os
from unsloth import FastLanguageModel
from dotenv import load_dotenv
from huggingface_hub import login
load_dotenv()
login(os.getenv("HUGGINGFACE_TOKEN"))
model_name = "minnesotanlp/scholawrite-llama3.1-8b-classifier"
# Build the chat-format input: a list with a single user message that contains
# the task instruction and the "before" text.
messages = [
    {"role": "user", "content": "<your prompt containing the task instruction and the before text>"}
]

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name,
    max_seq_length=4096,
    load_in_4bit=True,
    dtype=None,
)
FastLanguageModel.for_inference(model)

input_ids = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(input_ids, max_new_tokens=100, do_sample=True, top_k=50, top_p=0.95)
response = tokenizer.batch_decode(outputs)
# Keep only the assistant's generated text (the predicted writing intention).
response = response[0].split("<|start_header_id|>assistant<|end_header_id|>")[1].strip()
Fine-tuning Details
Fine-tuning Data
This model is fine-tuned on the train split of the minnesotanlp/scholawrite dataset. The dataset consists of keystroke logs of an end-to-end scholarly writing process, with thorough annotations of the cognitive writing intentions behind each keystroke. No additional pre-processing or filtering was performed on the dataset.
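For reference, the split can be loaded directly with the Hugging Face datasets library (see the dataset card for the exact column schema):

```python
from datasets import load_dataset

# Load the ScholaWrite train split used for fine-tuning.
train_data = load_dataset("minnesotanlp/scholawrite", split="train")
print(train_data)
```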
Fine-tuning Procedure
In the dataset, the "before text" column contains the "before" text and the "label" column contains the writing intention.
For each entry in the dataset, we construct a prompt ready for fine-tuning: the task instruction followed by the before text is placed in the user message, and the writing intention is placed in the assistant message of a predefined prompt template.
We mask out the system and user messages of the prompt with -100, so that the model is trained on the assistant response only.
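A minimal sketch of this label masking, assuming the prompt has already been tokenized into a system-plus-user prefix and an assistant response (the exact template and training code for the released model may differ):

```python
IGNORE_INDEX = -100  # tokens with this label are ignored by the cross-entropy loss

def build_training_example(prompt_ids: list[int], response_ids: list[int]) -> dict:
    """Mask the system/user prefix so only the assistant response contributes to the loss."""
    input_ids = prompt_ids + response_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return {"input_ids": input_ids, "labels": labels}
```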
Fine-tuning Hyperparameters
- Fine-tuning regime: QLoRA
- max_seq_length 4099
- learning_rate 2e-4
- lr_scheduler_type linear
- per_device_train_batch_size 2
- gradient_accumulation_steps 4
- num_train_epochs 1
- fp16 False
- bf16 True
- logging_steps 10
- optim adamw_8bit
- weight_decay 0.01
- warmup_steps 5
- seed 3407
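As a rough illustration only, the values above map onto a Hugging Face TrainingArguments configuration as follows (reconstructed from the list, not the exact training script; max_seq_length is passed to the model/trainer separately):

```python
from transformers import TrainingArguments

# Reconstruction of the listed hyperparameters (illustrative, not the released script).
training_args = TrainingArguments(
    output_dir="outputs",  # placeholder output directory
    learning_rate=2e-4,
    lr_scheduler_type="linear",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    fp16=False,
    bf16=True,
    logging_steps=10,
    optim="adamw_8bit",
    weight_decay=0.01,
    warmup_steps=5,
    seed=3407,
)
```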
Machine Specs
- Hardware: Nvidia RTX 5000
- Software: Unsloth
- Hours used: 8 hrs
- Compute Region: Minnesota
Testing Procedure
Testing Data in Next Intention Prediction
This model is tested on the test_small split of the minnesotanlp/scholawrite dataset.
For each entry in the dataset, the task instruction followed by the before text is placed in the user message of the predefined prompt.
We leave the assistant message blank so that the model generates the next writing intention.
Metrics in Next Intention Prediction
The class distribution is imbalanced in both the training and testing splits, so we use the weighted F1 score to measure performance.
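For reference, the metric can be computed with scikit-learn; the label strings below are illustrative placeholders, not gold data:

```python
from sklearn.metrics import f1_score

# Illustrative placeholders; in practice y_true comes from the test_small split
# and y_pred from the model's generated intentions.
y_true = ["Text Production", "Clarity", "Text Production"]
y_pred = ["Text Production", "Text Production", "Text Production"]

weighted_f1 = f1_score(y_true, y_pred, average="weighted")
print(f"Weighted F1: {weighted_f1:.2f}")
```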
Results in Next Intention Prediction
Model | BERT | RoBERTa | Llama-8B-Instruct | GPT-4o |
---|---|---|---|---|
Base | 0.04 | 0.02 | 0.12 | 0.08 |
+ SW | 0.64 | 0.64 | 0.13 | - |
Summary in Next Intention Prediction
The table above presents the weighted F1 scores for predicting writing intentions across baseline and fine-tuned models. All models fine-tuned on ScholaWrite show improved performance compared to their baselines. BERT and RoBERTa achieved the largest improvements, while Llama-8B-Instruct showed a modest improvement after fine-tuning. These results demonstrate the effectiveness of the ScholaWrite dataset in aligning language models with writers' intentions.
Testing Data in Iterative Writing
Beyond the held-out test set, this model also performs the next intention prediction subtask in the Iterative Self-Writing task (see the Direct Use section for details). We picked 4 seed documents as starting points for the Iterative Self-Writing task, derived from 4 award-winning NLP papers spanning different topics: Zeng et al., 2024; Lu et al., 2024b; Du et al., 2022a; Etxaniz et al., 2024.
Metrics in Iterative Writing
- Lexical diversity: the ratio of unique to total tokens in the final iteration
- Topic consistency: cosine similarity between the seed document and the final output
- Intention coverage: the proportion of unique writing-intention labels used across 100 iterations, out of the 15 labels in our taxonomy
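A rough sketch of how these automatic metrics could be computed; the TF-IDF representation used for the cosine similarity is an illustrative choice, not necessarily the embedding used in the paper:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def lexical_diversity(final_text: str) -> float:
    # Ratio of unique to total (whitespace) tokens in the final iteration.
    tokens = final_text.split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def topic_consistency(seed_text: str, final_text: str) -> float:
    # Cosine similarity between the seed document and the final output.
    vectors = TfidfVectorizer().fit_transform([seed_text, final_text])
    return float(cosine_similarity(vectors[0], vectors[1])[0, 0])

def intention_coverage(predicted_labels: list[str], n_taxonomy_labels: int = 15) -> float:
    # Proportion of taxonomy labels used across the 100 iterations.
    return len(set(predicted_labels)) / n_taxonomy_labels
```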
Furthermore, inspired by Chang et al. (2023), we conducted a human evaluation with three native English speakers experienced in Overleaf (see the paper for a more detailed description of the entire evaluation process). They assessed the outputs based on the following metrics:
- Accuracy: alignment with the predicted intention
- Alignment: how closely the model's process resembled human writing style
- Fluency: grammatical correctness of the final writing
- Coherence: logical structure
- Relevance: connection to the seed paper's contents
Accuracy was evaluated at each iteration, while alignment, fluency, and coherence were assessed through pairwise comparisons of the final iteration.
Results in Iterative Writing
Note: "Llama-8b-sw" and "Finetuned" in the tables below refer to the combination of our two models, where scholawrite-llama3.1-8b-classifier (this model) is responsible for next writing intention prediction and scholawrite-llama3.1-8b-writing is responsible for after-text generation. "Llama-3b-instruct" and "Baseline" refer to a combination of two Meta-Llama-3.1-8B-Instruct-bnb-4bit models, one running next writing intention prediction and the other after-text generation.
Auto Evaluation Results for Seed 1
Metric | Llama-8b-sw | Llama-3b-instruct | Llama-8b-instruct | GPT4o |
---|---|---|---|---|
Lexical Diversity | 0.4985 | 0.2197 | 0.2268 | 0.3405 |
Cosine Similarity | 0.8197 | 0.7839 | 0.4494 | 0.6516 |
Auto Evaluation Results for Seed 2
Metric | Llama-8b-sw | Llama-3b-instruct | Llama-8b-instruct | GPT4o |
---|---|---|---|---|
Lexical Diversity | 0.4262 | 0.164 | 0.23 | 0.3113 |
Cosine Similarity | 0.8644 | 0.7467 | 0.8319 | 0.6585 |
Auto Evaluation Results for Seed 3
Metric | Llama-8b-sw | Llama-3b-instruct | Llama-8b-instruct | GPT4o |
---|---|---|---|---|
Lexical Diversity | 0.457 | 0.2127 | 0.1784 | 0.3093 |
Cosine Similarity | 0.7772 | 0.8416 | 0.8367 | 0.4037 |
Auto Evaluation Results for Seed 4
Metric | Llama-8b-sw | Llama-3b-instruct | Llama-8b-instruct | GPT4o |
---|---|---|---|---|
Lexical Diversity | 0.359 | 0.1802 | 0.1824 | 0.3139 |
Cosine Similarity | 0.2147 | 0.5009 | 0.5353 | 0.6500 |
Human Evaluation Results for Seed 1
Metrics | Model | Evaluator 1 | Evaluator 2 | Evaluator 3 |
---|---|---|---|---|
Accuracy | Finetuned | 43 | 3 | 17 |
Accuracy | Baseline | 47 | 22 | 38 |
Alignment | Finetuned | | | |
Alignment | Baseline | X | X | X |
Fluency | Finetuned | | | |
Fluency | Baseline | X | X | X |
Coherence | Finetuned | | | |
Coherence | Baseline | X | X | X |
Relevance | Finetuned | Yes | No | No |
Relevance | Baseline | Yes | Yes | Yes |
Human Evaluation Results for Seed 2
Metrics | Model | Evaluator 1 | Evaluator 2 | Evaluator 3 |
---|---|---|---|---|
Accuracy | Finetuned | 26 | 0 | 5 |
Accuracy | Baseline | 48 | 12 | 29 |
Alignment | Finetuned | | | |
Alignment | Baseline | X | X | X |
Fluency | Finetuned | | | |
Fluency | Baseline | X | X | X |
Coherence | Finetuned | | | |
Coherence | Baseline | X | X | X |
Relevance | Finetuned | Yes | Yes | Yes |
Relevance | Baseline | Yes | Yes | Yes |
Human Evaluation Results for Seed 3
Metrics | Model | Evaluator 1 | Evaluator 2 | Evaluator 3 |
---|---|---|---|---|
Accuracy | Finetuned | 52 | 0 | 3 |
Accuracy | Baseline | 70 | 23 | 43 |
Alignment | Finetuned | | | |
Alignment | Baseline | X | X | X |
Fluency | Finetuned | | | |
Fluency | Baseline | X | X | X |
Coherence | Finetuned | | | |
Coherence | Baseline | X | X | X |
Relevance | Finetuned | Yes | Yes | No |
Relevance | Baseline | Yes | Yes | Yes |
Human Evaluation Results for Seed 4
Metrics | Model | Evaluator 1 | Evaluator 2 | Evaluator 3 |
---|---|---|---|---|
Accuracy | Finetuned | 37 | 3 | 6 |
Accuracy | Baseline | 60 | 22 | 48 |
Alignment | Finetuned | | | |
Alignment | Baseline | X | X | X |
Fluency | Finetuned | | | |
Fluency | Baseline | X | X | X |
Coherence | Finetuned | | | |
Coherence | Baseline | X | X | X |
Relevance | Finetuned | Yes | No | No |
Relevance | Baseline | Yes | Yes | Yes |
Summary for Iterative Writing
The Auto Evaluation Results tables illustrate the quality of the final writing output produced by each model across all four seed documents. Notably, our models (scholawrite-llama3.1-8b-writing and scholawrite-llama3.1-8b-classifier) consistently used the most lexically diverse words in their final outputs. Moreover, they generated content that was semantically most aligned with the seed document for Seeds 1 and 2. They also covered the highest number of writing intentions in our taxonomy for all seeds except Seed 3. These results underscore the effectiveness of ScholaWrite as a valuable resource for enhancing the quality of scholarly writing generated by language models.
Despite their remarkable performance on automatic evaluation metrics, LLMs still exhibit limitations in learning human writing behaviors and scholarly thinking processes. According to our human evaluation, our model generated fewer instances of after text that aligned with the predicted intentions from the previous step during the 100 iterations across all four seed documents. Furthermore, all three evaluators unanimously agreed that the baseline model demonstrated more human-like writing behaviors throughout the iterations. Its final outputs were also perceived as more grammatically correct and as containing stronger logical claims compared to our models.
However, the evaluators also noted that the final outputs from our models contained more relevant content for Seeds 2 and 3. This observation aligns with the trend in the topic consistency scores shown in the Auto Evaluation Results for Seeds 2 and 3, further highlighting the usefulness of the ScholaWrite dataset in certain contexts.
BibTeX
@misc{wang2025scholawritedatasetendtoendscholarly,
title={ScholaWrite: A Dataset of End-to-End Scholarly Writing Process},
author={Linghe Wang and Minhwa Lee and Ross Volkov and Luan Tuyen Chau and Dongyeop Kang},
year={2025},
eprint={2502.02904},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2502.02904},
}