Dataset and Precision and Recall metrics

#2
by minanne - opened

Hi there! I am wondering whether you could share the source of the dataset used for training and testing this model, as well as the precision and recall metrics that were evaluated.

I have been using this model to measure confusion-matrix metrics (accuracy and F1 score) on other datasets, but at the moment it is not giving me ideal performance.

Hi minanne!
Thanks for your interest!
The dataset was scraped from Reddit over the course of a university project. There is a paper, which will be linked after some finishing touches.
We want to retrain this model with a bigger sample size and will then upload it again.

Hi JanSt!

Thanks for the prompt response. That's amazing. We have been trying to measure its F1 score and accuracy using the dataset from Kaggle (https://www.kaggle.com/datasets/datasnaek/mbti-type) with different preprocessing methods. The best accuracy we have managed to get is about 10%. I look forward to your updates. Thank you very much.
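For anyone reproducing this kind of evaluation, here is a minimal sketch of how accuracy, per-class precision and recall, and macro-averaged F1 can be computed from predicted and true type labels. It is plain Python with no dependencies; the function names and the tiny label lists are made up for illustration and are not taken from the model or the paper.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def per_class_scores(y_true, y_pred):
    """Return {label: (precision, recall, f1)} for every label seen."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1)
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_scores(y_true, y_pred)
    return sum(f1 for _, _, f1 in scores.values()) / len(scores)


# Hypothetical labels, for illustration only.
y_true = ["INTP", "INTP", "ENFJ", "ISTJ"]
y_pred = ["INTP", "ENTP", "ENFJ", "INTP"]
print(accuracy(y_true, y_pred))
print(macro_f1(y_true, y_pred))
```

Macro averaging weights all 16 types equally, so rare types count as much as common ones; on an imbalanced dataset it can differ a lot from accuracy.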

Hello @minanne,
You have to understand that the Kaggle dataset is really, really bad and very small.
You should also know that half of all people misclassify themselves (this has been shown).
Taking both into account, 10% accuracy with 16 types is still a lot better than random chance (1/16, or about 6%).
We achieved 35% accuracy, which is extremely good (again considering that half of all people misclassify themselves).
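To make the comparison with chance concrete, here is a small sketch of the arithmetic. It assumes a uniform baseline of 1/16 over the 16 types; the helper names are hypothetical. If a test set is imbalanced, always predicting the most common type can score higher than 1/16, so the second helper computes that majority-class baseline from the labels.

```python
from collections import Counter

NUM_TYPES = 16  # the 16 MBTI types


def lift_over_chance(acc, num_classes=NUM_TYPES):
    """How many times better than uniform random guessing an accuracy is."""
    return acc / (1 / num_classes)


def majority_baseline(labels):
    """Accuracy of always predicting the most frequent label."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


# Accuracies quoted in this thread.
for name, acc in [("Kaggle test", 0.10), ("authors' result", 0.35)]:
    print(f"{name}: {acc:.0%} is {lift_over_chance(acc):.1f}x uniform chance")
```

Comparing against both baselines gives a fairer picture than accuracy alone, especially when some types are much more common than others in the data.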
If you took your personality test on 16personalities.com, the result is most likely wrong!
If you want to test yourself, do so on https://personalityjunkie.com/ . If you like, I can DM you some ebooks for basic MBTI education.

We have a huge, high-quality dataset of (I believe) 500,000 samples.
Find the link to the preliminary paper here:
https://bachstelze.gitlab.io/trilinux/MBTI_classification.pdf
Also, please keep in mind that this is a preliminary model. I have trained dozens of models to figure out the best approach. ALBERT stands for "A Lite BERT".
Again, this is a tiny model that we trained with only about 10k samples.
Once I get some spare time, I will retrain it with a larger language model and ALL of our collected data, but that training will run for multiple days.
We were in a rush for the deadline, so I did not have enough time left to train our best approach more properly.

I am pretty confident I can push the accuracy to around 50%, which again is really good.
Once you understand how the functional stacks (each type has 4 functions) relate between types, you will also find that you usually get misclassified as a personality type close to yours.
(Please read our paper to learn more about the theory and our approaches.)
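As a rough illustration of how types can be "close", here is a sketch that derives the four-function stack from a type code using the standard MBTI convention and counts the functions two types share. This is my own illustrative helper, not the method from the paper, and counting shared functions is only one assumed way to define closeness.

```python
OPPOSITE = {"S": "N", "N": "S", "T": "F", "F": "T"}


def function_stack(mbti_type):
    """Return [dominant, auxiliary, tertiary, inferior], e.g. 'Ti', 'Ne'."""
    ei, perceiving, judging, jp = mbti_type.upper()
    # J/P tells which function is extraverted: judging (T/F) or perceiving (S/N).
    if jp == "J":
        extraverted, introverted = judging, perceiving
    else:
        extraverted, introverted = perceiving, judging
    # Extraverts lead with the extraverted function, introverts with the introverted one.
    if ei == "E":
        dominant, auxiliary = extraverted + "e", introverted + "i"
    else:
        dominant, auxiliary = introverted + "i", extraverted + "e"

    def flip(function):
        """Opposite function with the opposite attitude, e.g. Ne -> Si."""
        letter, attitude = function
        return OPPOSITE[letter] + ("i" if attitude == "e" else "e")

    return [dominant, auxiliary, flip(auxiliary), flip(dominant)]


def shared_functions(type_a, type_b):
    """Set of functions that appear in both stacks (assumed closeness measure)."""
    return set(function_stack(type_a)) & set(function_stack(type_b))


print(function_stack("INTP"))
print(len(shared_functions("INTP", "ENTP")))
print(len(shared_functions("INTP", "INTJ")))
```

Under this measure INTP and ENTP share all four functions in a different order, while INTP and INTJ share none despite differing by a single letter, which is one way to see why a letter-by-letter comparison can be misleading.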

If you want access to our FULL dataset from the Reddit MBTI subreddit, please DM me or JanSt.
Thank you.
Robookwus

@minanne :
Look:

image.png

I am an INTP, but I have a well-developed second function (extraverted intuition), which is why I can seem extraverted in conversations.
The classifier works, IMHO; I have been studying MBTI for about 7 years and am very deep into the theory.

If you want to reach out to me, chat more, and learn more, join our Signal conversation group (head over to trilinux.org to do so).

I hope this helps!
Yours sincerely,
Robookwus
