Added links to the paper and the training dataset
README.md CHANGED

@@ -111,14 +111,15 @@ to the [CAP (Comparative Agendas Project) schema](https://www.comparativeagendas
 
 This classification model is based on the multilingual parliamentary [XLM-R-Parla](https://huggingface.co/classla/xlm-r-parla) BERT-like model,
 which is a XLM-RoBERTa-large model that was additionally pre-trained on texts of parliamentary proceedings.
-To develop the ParlaCAP model, XLM-R-Parla was additionally fine-tuned on 29,779 instances (speeches) from
+To develop the ParlaCAP model, XLM-R-Parla was additionally fine-tuned on the [ParlaCAP-train dataset](http://hdl.handle.net/11356/2093): 29,779 instances (speeches) from
 29 [ParlaMint 4.1](http://hdl.handle.net/11356/1912) datasets
 containing transcriptions of parliamentary debates of 29 European countries and autonomous regions.
 The speeches were automatically annotated with 22 CAP labels (21 major topics and a label "Other") using the GPT-4o model
 in a zero-shot prompting fashion
 following the [LLM teacher-student framework](https://ieeexplore.ieee.org/document/10900365).
 Evaluation of the GPT model has shown that its annotation performance is
-comparable to those of human annotators.
+comparable to that of human annotators. For more information, see the paper ["Supercharging Agenda Setting Research:
+The ParlaCAP Dataset of 28 European Parliaments and a Scalable Multilingual LLM-Based Classification"](https://doi.org/10.48550/arXiv.2602.16516) (Kuzman Pungeršek et al., 2026).
 
 The fine-tuned ParlaCAP model achieves 0.723 in macro-F1 on an English test set,
 0.686 in macro-F1 on a Croatian test set, 0.710 in macro-F1 on a Serbian test set and 0.646 in macro-F1 on a Bosnian test set

@@ -186,7 +187,23 @@ To apply the model to the text corpora in the ParlaMint TXT format
 
 ## How to Cite
 
-
+Please cite the paper presenting the model:
+
+```
+@article{pungersek2026parlacap-paper,
+title={{Supercharging Agenda Setting Research: The ParlaCAP Dataset of 28 European Parliaments and a Scalable Multilingual LLM-Based Classification}},
+author={Kuzman Punger{\v s}ek, Taja and Rupnik, Peter and {\v S}irini{\'c}, Daniela and Ljube{\v s}i{\'c}, Nikola},
+year={2026},
+eprint={2602.16516},
+archivePrefix={arXiv},
+primaryClass={cs.CL},
+url={https://arxiv.org/abs/2602.16516},
+journal={arXiv preprint arXiv:2602.16516},
+}
+```
+
+You can also cite the model as follows:
+
 ```
 @misc{parlacap_model,
 author = {Kuzman Punger{\v s}ek, Taja and Ljube{\v s}i{\'c}, Nikola},