Disclaimer: This model is still under testing and may change in the future. We will try to keep backwards compatibility. For any questions, reach us at info@cvcio.org
MediaWatch News Topics (Greek)
Fine-tuned model for multi-label text classification (SequenceClassification), based on roberta-el-news, using Hugging Face's Transformers library. The model classifies news in real time on up to 33 topics: AFFAIRS, AGRICULTURE, ARTS_AND_CULTURE, BREAKING_NEWS, BUSINESS, COVID, ECONOMY, EDUCATION, ELECTIONS, ENTERTAINMENT, ENVIRONMENT, FOOD, HEALTH, INTERNATIONAL, LAW_AND_ORDER, MILITARY, NON_PAPER, OPINION, POLITICS, REFUGEE, REGIONAL, RELIGION, SCIENCE, SOCIAL_MEDIA, SOCIETY, SPORTS, TECH, TOURISM, TRANSPORT, TRAVEL, WEATHER, CRIME, JUSTICE.
How to use
You can use this model directly with a pipeline for text-classification:
from transformers import pipeline
pipe = pipeline(
    task="text-classification",
    model="cvcio/mediawatch-el-topics",
    tokenizer="cvcio/roberta-el-news"  # or cvcio/mediawatch-el-topics
)
topics = pipe(
    "Η βιασύνη αρκετών χωρών να άρουν τους περιορισμούς κατά του κορονοϊού, " +
    "αν όχι να κηρύξουν το τέλος της πανδημίας, με το σκεπτικό ότι έφτασε " +
    "πλέον η ώρα να συμβιώσουμε με την Covid-19, έχει κάνει μερικούς πιο " +
    "επιφυλακτικούς επιστήμονες να προειδοποιούν ότι πρόκειται μάλλον " +
    "για «ενδημική αυταπάτη» και ότι είναι πρόωρη τέτοια υπερβολική " +
    "χαλάρωση. Καθώς τα κρούσματα της Covid-19, μετά το αιφνιδιαστικό " +
    "μαζικό κύμα της παραλλαγής Όμικρον, εμφανίζουν τάση υποχώρησης σε " +
    "Ευρώπη και Βόρεια Αμερική, όπου περισσεύει η κόπωση μεταξύ των " +
    "πολιτών μετά από δύο χρόνια πανδημίας, ειδικοί και μη αδημονούν να " +
    "«ξεμπερδέψουν» με τον κορονοϊό.",
    padding=True,
    truncation=True,
    max_length=512,
    return_all_scores=True
)
print(topics)
# outputs
[
[
{'label': 'AFFAIRS', 'score': 0.0018806682201102376},
{'label': 'AGRICULTURE', 'score': 0.00014653144171461463},
{'label': 'ARTS_AND_CULTURE', 'score': 0.0012948638759553432},
{'label': 'BREAKING_NEWS', 'score': 0.0001729220530251041},
{'label': 'BUSINESS', 'score': 0.0028276608791202307},
{'label': 'COVID', 'score': 0.4407998025417328},
{'label': 'ECONOMY', 'score': 0.039826102554798126},
{'label': 'EDUCATION', 'score': 0.0019098613411188126},
{'label': 'ELECTIONS', 'score': 0.0003333651984576136},
{'label': 'ENTERTAINMENT', 'score': 0.004249618388712406},
{'label': 'ENVIRONMENT', 'score': 0.0015828514005988836},
{'label': 'FOOD', 'score': 0.0018390495097264647},
{'label': 'HEALTH', 'score': 0.1204477995634079},
{'label': 'INTERNATIONAL', 'score': 0.25892165303230286},
{'label': 'LAW_AND_ORDER', 'score': 0.07646272331476212},
{'label': 'MILITARY', 'score': 0.00033025629818439484},
{'label': 'NON_PAPER', 'score': 0.011991199105978012},
{'label': 'OPINION', 'score': 0.16166265308856964},
{'label': 'POLITICS', 'score': 0.0008890336030162871},
{'label': 'REFUGEE', 'score': 0.0011504743015393615},
{'label': 'REGIONAL', 'score': 0.0008734092116355896},
{'label': 'RELIGION', 'score': 0.0009001944563351572},
{'label': 'SCIENCE', 'score': 0.05075162276625633},
{'label': 'SOCIAL_MEDIA', 'score': 0.00039615994319319725},
{'label': 'SOCIETY', 'score': 0.0043518817983567715},
{'label': 'SPORTS', 'score': 0.002416545059531927},
{'label': 'TECH', 'score': 0.0007818648009561002},
{'label': 'TOURISM', 'score': 0.011870541609823704},
{'label': 'TRANSPORT', 'score': 0.0009422845905646682},
{'label': 'TRAVEL', 'score': 0.03004464879631996},
{'label': 'WEATHER', 'score': 0.00040286066359840333},
{'label': 'CRIME', 'score': 0.0005416403291746974},
{'label': 'JUSTICE', 'score': 0.000990519649349153}
]
]
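Because the model is multi-label, the scores are independent per topic and several topics can apply to one article. The card does not state a decision threshold, so the sketch below assumes a hypothetical cut-off of 0.1 and a hypothetical helper named `select_topics`; both are illustrative and should be tuned on your own data.

```python
def select_topics(scores, threshold=0.1):
    """Return (label, score) pairs at or above `threshold`, highest score first.

    `threshold` is an assumed value for illustration, not one published
    with the model.
    """
    kept = [
        (item["label"], item["score"])
        for item in scores
        if item["score"] >= threshold
    ]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# A few entries copied from the sample output above (truncated for brevity).
sample = [
    {"label": "COVID", "score": 0.4407998025417328},
    {"label": "ECONOMY", "score": 0.039826102554798126},
    {"label": "HEALTH", "score": 0.1204477995634079},
    {"label": "INTERNATIONAL", "score": 0.25892165303230286},
    {"label": "OPINION", "score": 0.16166265308856964},
    {"label": "SPORTS", "score": 0.002416545059531927},
]

selected = select_topics(sample, threshold=0.1)
print([label for label, _ in selected])
# ['COVID', 'INTERNATIONAL', 'OPINION', 'HEALTH']
```

With the pipeline call shown earlier, the per-article list is the first element of the result, so `select_topics(topics[0])` would apply the same filter to the full set of 33 scores.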
Labels
All labels except NON_PAPER were retrieved from the source articles during the data collection step, without any preprocessing, on the assumption that journalists and newsrooms assign correct tags to their articles. We discarded all articles with more than 6 tags to reduce bias and tag manipulation.
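The tag filter described above, followed by a multi-hot encoding of the remaining tags, could look roughly like the sketch below. The article structure (a dict with `id` and `tags`), the helper names `filter_articles` and `encode_tags`, and the four-label subset are assumptions made for illustration; the actual data pipeline is not published in this card.

```python
MAX_TAGS = 6  # articles with more than 6 tags are discarded, as stated in the text


def filter_articles(articles, max_tags=MAX_TAGS):
    """Keep only articles that carry at most `max_tags` tags."""
    return [a for a in articles if len(a["tags"]) <= max_tags]


def encode_tags(tags, label_names):
    """Multi-hot encode `tags` against the ordered list `label_names`."""
    tag_set = set(tags)
    return [1.0 if name in tag_set else 0.0 for name in label_names]


# Hypothetical records; real articles would carry text, source, and so on.
articles = [
    {"id": 1, "tags": ["COVID", "HEALTH"]},
    {"id": 2, "tags": ["COVID", "HEALTH", "ECONOMY", "POLITICS",
                       "SPORTS", "TECH", "TRAVEL"]},  # 7 tags, dropped
]

# A subset of the 33 labels, to keep the example short.
label_names = ["COVID", "ECONOMY", "HEALTH", "POLITICS"]

kept = filter_articles(articles)
targets = [encode_tags(a["tags"], label_names) for a in kept]
print(targets)
# [[1.0, 0.0, 1.0, 0.0]]
```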
label | roc_auc | samples |
---|---|---|
AFFAIRS | 0.9872 | 6,314 |
AGRICULTURE | 0.9799 | 1,254 |
ARTS_AND_CULTURE | 0.9838 | 15,968 |
BREAKING_NEWS | 0.9675 | 827 |
BUSINESS | 0.9811 | 6,507 |
COVID | 0.9620 | 50,000 |
CRIME | 0.9885 | 34,421 |
ECONOMY | 0.9765 | 45,474 |
EDUCATION | 0.9865 | 10,111 |
ELECTIONS | 0.9940 | 7,571 |
ENTERTAINMENT | 0.9925 | 23,323 |
ENVIRONMENT | 0.9847 | 23,060 |
FOOD | 0.9934 | 3,712 |
HEALTH | 0.9723 | 16,852 |
INTERNATIONAL | 0.9624 | 50,000 |
JUSTICE | 0.9862 | 4,860 |
LAW_AND_ORDER | 0.9177 | 50,000 |
MILITARY | 0.9838 | 6,536 |
NON_PAPER | 0.9595 | 4,589 |
OPINION | 0.9624 | 6,296 |
POLITICS | 0.9773 | 50,000 |
REFUGEE | 0.9949 | 4,536 |
REGIONAL | 0.9520 | 50,000 |
RELIGION | 0.9922 | 11,533 |
SCIENCE | 0.9837 | 1,998 |
SOCIAL_MEDIA | 0.9910 | 6,212 |
SOCIETY | 0.9439 | 50,000 |
SPORTS | 0.9939 | 31,396 |
TECH | 0.9923 | 8,225 |
TOURISM | 0.9900 | 8,081 |
TRANSPORT | 0.9879 | 3,211 |
TRAVEL | 0.9832 | 4,638 |
WEATHER | 0.9950 | 19,931 |
loss | 0.0533 | - |
roc_auc | 0.9855 | - |
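For readers unfamiliar with the metric, the sketch below shows one way a micro-averaged ROC AUC can be computed for multi-label predictions: every (article, label) pair is pooled into a single binary problem, and AUC is the probability that a positive pair outscores a negative one. It is a plain-Python illustration on toy data, not the evaluation code used for the table; in practice a library routine such as scikit-learn's `roc_auc_score` would typically be used. The function names are hypothetical.

```python
def roc_auc(y_true, y_score):
    """Binary ROC AUC as the fraction of (positive, negative) pairs
    ranked correctly, counting ties as half a win."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def roc_auc_micro(y_true_rows, y_score_rows):
    """Micro-average: flatten all (sample, label) cells, then compute one AUC."""
    flat_true = [t for row in y_true_rows for t in row]
    flat_score = [s for row in y_score_rows for s in row]
    return roc_auc(flat_true, flat_score)


# Toy data: 2 articles x 3 labels (values invented for illustration).
y_true = [[1, 0, 0], [0, 1, 1]]
y_score = [[0.9, 0.2, 0.1], [0.3, 0.8, 0.25]]

print(round(roc_auc_micro(y_true, y_score), 4))
# 0.8889
```

A per-label value would apply `roc_auc` to a single column of the same matrices instead of the flattened cells.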
Pretraining
The model was pretrained on an NVIDIA A10 GPU for 15 epochs (approximately 59K steps, 8 hours of training) with a batch size of 128. The optimizer was Adam with a learning rate of 1e-5 and a weight decay of 0.01. We used roc_auc_micro to evaluate the results.
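The reported figures allow a rough back-of-the-envelope estimate of the training set size and throughput. The arithmetic below assumes the 59K steps are optimizer steps on a single GPU with no gradient accumulation; none of this is confirmed by the card, so treat the results as approximations only.

```python
# Figures reported in the text.
epochs = 15
total_steps = 59_000      # "approximately 59K steps"
batch_size = 128
training_hours = 8

# Derived estimates (assumption: one optimizer step per batch, no accumulation).
steps_per_epoch = total_steps / epochs
examples_per_epoch = steps_per_epoch * batch_size
seconds_per_step = training_hours * 3600 / total_steps

print(f"steps per epoch:    ~{steps_per_epoch:,.0f}")
print(f"examples per epoch: ~{examples_per_epoch:,.0f}")
print(f"seconds per step:   ~{seconds_per_step:.2f}")
```

Under these assumptions, each epoch covers roughly half a million training examples at about half a second per step.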
Framework versions
- Transformers 4.13.0
- PyTorch 1.9.0+cu111
- Datasets 1.16.1
- Tokenizers 0.10.3
Authors
Dimitris Papaevagelou - @andefined
About Us
Civic Information Office is a non-profit organization based in Athens, Greece, focused on creating technology and research products for the public interest.