README.md · hse-teddy-bear/xlm-roberta-russian-stock-sentiment at 203263b199391e81b8ab6a936f1845575f6d4203

metadata

language:
  - ru
  - en
license: mit
tags:
  - finance
  - sentiment
  - stocks
metrics:
  - accuracy
widget:
  - text: Нуу, эту папиру надо лонговать!
    example_title: long sentiment
  - text: Не уверен. Нужно подумать, перед тем, как брать.
    example_title: neutral sentiment
  - text: Такое только хомяки берут. Нужно сливать эту бумажку поскорее.
    example_title: short sentiment

Model Details

Model Description

Developed by: Alexander Nikitin
Model type: XLM-RoBERTa-base fine-tuned on the author's labelled dataset
Language(s) (NLP): Russian, English
License: MIT
Finetuned from model: FacebookAI/xlm-roberta-base

Dataset

This transformer model was fine-tuned on parsed comments from "Tinkoff Pulse".

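First step: preprocessing. For each stock ticker mentioned in a comment, the part of the comment that refers to that ticker was extracted. Example: "{$GAZP} {$TCSG} {$RTKM} По газрому все хорошо. По Ростелекому не очень. Тинек идет вниз!" -> "{$GAZP} По газрому все хорошо." (In English: "Gazprom is doing fine. Rostelecom, not so much. Tinkoff is heading down!" -> "{$GAZP} Gazprom is doing fine.")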
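The extraction code itself is not part of this card. The sketch below is only one plausible way to do it, assuming a hand-written table of alias stems per ticker (`ALIASES` and `extract_subcomment` are hypothetical names, not the author's). It splits the comment into sentences and keeps those that mention the requested ticker.

```python
import re

# Hypothetical alias stems per ticker; the real alias list is not published.
ALIASES = {
    "GAZP": ["газпром", "газром"],
    "RTKM": ["ростелеком"],
    "TCSG": ["тинек", "тиньк"],
}

# Matches ticker tags such as {$GAZP}.
TICKER_TAG = re.compile(r"\{\$[A-Z0-9]+\}")


def extract_subcomment(comment: str, ticker: str) -> str:
    """Keep only the sentences that mention `ticker`, prefixed with its tag."""
    # Remove every ticker tag, then split on sentence-ending punctuation.
    body = TICKER_TAG.sub("", comment).strip()
    sentences = re.split(r"(?<=[.!?])\s+", body)
    stems = ALIASES.get(ticker, [])
    kept = [s for s in sentences if any(stem in s.lower() for stem in stems)]
    return ("{$" + ticker + "} " + " ".join(kept)).strip()


comment = (
    "{$GAZP} {$TCSG} {$RTKM} По газрому все хорошо. "
    "По Ростелекому не очень. Тинек идет вниз!"
)
print(extract_subcomment(comment, "GAZP"))
```

Substring matching on stems is a deliberate simplification: it tolerates Russian case endings ("газрому") but will miss nicknames that are not in the table.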

Next step: labelling. A dataset of 10K preprocessed comments, evenly distributed across 10 Russian stocks, was labelled with the Mistral-7B LLM into 3 categories: "buy" if the author wants or encourages others to buy (long), "sell" if the author wants or encourages others to sell or short, and "neutral" if the comment is news or the sentiment cannot be determined with confidence. Plans for further research: label 100k comments and train on them.
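The prompt used with Mistral-7B is not given here. As a rough illustration of the labelling step, the sketch below builds a hypothetical prompt from the three category definitions and parses the LLM's free-text answer into a class id. The prompt wording, the function names, and the fallback to "neutral" are all assumptions; the id mapping follows the label list in this card (SELL = 0, NEUTRAL = 1, BUY = 2).

```python
# Class ids follow the label list in this card: SELL=0, NEUTRAL=1, BUY=2.
LABEL_TO_ID = {"sell": 0, "neutral": 1, "buy": 2}


def build_prompt(comment: str) -> str:
    """Build a hypothetical labelling prompt; the original prompt is unknown."""
    return (
        "Classify the stock comment into exactly one category.\n"
        "buy: the author wants or encourages others to buy (long).\n"
        "sell: the author wants or encourages others to sell or short.\n"
        "neutral: the comment is news or the sentiment is unclear.\n"
        f"Comment: {comment}\n"
        "Answer with one word: buy, sell or neutral."
    )


def parse_label(response: str) -> int:
    """Map the LLM's answer to a class id, defaulting to neutral (assumption)."""
    text = response.strip().lower()
    for label, label_id in LABEL_TO_ID.items():
        if text.startswith(label):
            return label_id
    return LABEL_TO_ID["neutral"]


print(build_prompt("{$GAZP} По газрому все хорошо."))
print(parse_label("Buy."))
```

Falling back to "neutral" on an unparseable answer matches the "cannot be determined" definition, but it also hides LLM formatting failures, so counting fallbacks separately would be a sensible extension.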

Bias, Risks, and Limitations

The model is trained on Russian and English comments;
The model is not good at extracting sentiment from comments that contain strong keywords pointing in opposite directions, like "I wanna sell. But probably I should buy back later.";
The model performs well on short to medium-length texts such as comments, which are usually skewed to one side (strong buy or strong sell).

Recommendations

How to Get Started with the Model

Load the model with the Hugging Face transformers pipeline and use it.

Labels:

LABEL_0 = SELL
LABEL_1 = NEUTRAL
LABEL_2 = BUY
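A minimal sketch of that workflow, assuming the `transformers` library (with a PyTorch backend) is installed and that the model returns the raw `LABEL_n` names listed above. The helper names `load_classifier`, `readable` and `demo` are illustrative, not part of the model.

```python
# Mapping taken from the label list above.
ID_TO_NAME = {"LABEL_0": "SELL", "LABEL_1": "NEUTRAL", "LABEL_2": "BUY"}

MODEL_ID = "hse-teddy-bear/xlm-roberta-russian-stock-sentiment"


def load_classifier():
    """Download the model from the Hub and wrap it in a pipeline."""
    # Imported here so the helpers below work without transformers installed.
    from transformers import pipeline

    return pipeline("text-classification", model=MODEL_ID)


def readable(prediction: dict) -> dict:
    """Replace a raw LABEL_n name with SELL / NEUTRAL / BUY."""
    name = ID_TO_NAME.get(prediction["label"], prediction["label"])
    return {"label": name, "score": prediction["score"]}


def demo() -> None:
    """Classify the widget examples from this card and print the results."""
    classifier = load_classifier()
    texts = [
        "Нуу, эту папиру надо лонговать!",
        "Не уверен. Нужно подумать, перед тем, как брать.",
        "Такое только хомяки берут. Нужно сливать эту бумажку поскорее.",
    ]
    for text, prediction in zip(texts, classifier(texts)):
        print(readable(prediction), text)
```

Calling `demo()` downloads the weights on first use, so it needs network access; `readable` can be applied to any prediction dict the pipeline returns.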

Evaluation

Accuracy on the validation dataset: 0.786
Note: this accuracy was measured on ~1.5k comments.
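To give a feel for how much a validation set of this size constrains the figure, the sketch below computes a normal-approximation 95% interval for the accuracy. It assumes exactly 1500 independent validation comments, which is only a rounding of the "~1.5k" above, so the result is indicative rather than a reported metric.

```python
import math


def accuracy_interval(accuracy: float, n: int, z: float = 1.96) -> tuple:
    """Normal-approximation confidence interval for a proportion."""
    # Standard error of a binomial proportion estimated from n samples.
    std_err = math.sqrt(accuracy * (1.0 - accuracy) / n)
    return accuracy - z * std_err, accuracy + z * std_err


# n = 1500 is an assumption standing in for "~1.5k comments".
low, high = accuracy_interval(0.786, 1500)
print(f"{low:.3f} .. {high:.3f}")
```

Under these assumptions the interval is roughly plus or minus 0.02 around 0.786, so differences of a point or so between model variants would not be distinguishable on this set.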

Model Card Authors

https://t.me/pivo_txt