rbroc committed on
Commit fe55222
1 Parent(s): 7c324b8

Create README.md

Files changed (1)
  1. README.md +45 -0
README.md ADDED

---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- speech
- pytorch
---

### Contrastive user encoder
This model is a `DistilBertModel` trained by fine-tuning `distilbert-base-uncased` with an author-based triplet loss.

#### Details
Training and evaluation details are provided in our EMNLP Findings paper:
- Rocca, R., & Yarkoni, T. (2022). Language models as user encoders: Self-supervised learning of user encodings using transformers. To appear in *Findings of the Association for Computational Linguistics: EMNLP 2022*.

#### Training
We fine-tuned DistilBERT on triplets (a construction sketch follows this list) consisting of:
- a Reddit submission from a given user (the "anchor");
- an additional post from the same user (the "positive example");
- a post from a different, randomly selected user (the "negative example").
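
A minimal sketch of this sampling scheme, assuming a hypothetical `posts_by_user` mapping from user IDs to lists of their Reddit posts (not part of any released code):

```python
import random

def sample_triplet(posts_by_user, user_id):
    """Sample an (anchor, positive, negative) triplet for one user."""
    # Two distinct posts from the same user: the anchor and the positive example.
    anchor, positive = random.sample(posts_by_user[user_id], 2)
    # One post from a different, randomly selected user: the negative example.
    other_user = random.choice([u for u in posts_by_user if u != user_id])
    negative = random.choice(posts_by_user[other_user])
    return anchor, positive, negative
```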

To compute the loss, we use the [CLS] encodings of the anchor, the positive example, and the negative example from the last layer of the DistilBERT encoder. We optimize for \\( \max(||f(a) - f(p)|| - ||f(a) - f(n)|| + \alpha, 0) \\)

where:
- \\( f(a) \\) is the [CLS] encoding of the anchor;
- \\( f(p) \\) is the [CLS] encoding of the positive example;
- \\( f(n) \\) is the [CLS] encoding of the negative example;
- \\( \alpha \\) is a tunable parameter called margin, which we tuned to \\( \alpha = 1.0 \\).
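
A minimal PyTorch sketch of this objective (not the authors' training code; it starts from the base checkpoint named above, and batching and optimization details are omitted):

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Start from the base checkpoint that this model fine-tunes.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
encoder = AutoModel.from_pretrained("distilbert-base-uncased")

def cls_encoding(texts):
    # [CLS] encoding = first token of the last hidden layer.
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    return encoder(**batch).last_hidden_state[:, 0, :]

f_a = cls_encoding(["anchor post by one user"])         # f(a)
f_p = cls_encoding(["another post by the same user"])   # f(p)
f_n = cls_encoding(["a post by a different user"])      # f(n)

# max(||f(a) - f(p)|| - ||f(a) - f(n)|| + alpha, 0), with alpha = 1.0
loss = torch.nn.functional.triplet_margin_loss(f_a, f_p, f_n, margin=1.0, p=2)
```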

#### Evaluation and usage
The model yields performance advantages on downstream user-based classification tasks.

We encourage usage and benchmarking on tasks involving:
- prediction of user traits (e.g., personality);
- extraction of user-aware text encodings (e.g., style modeling; see the sketch below);
- contextualized text modeling, where standard text representations are complemented with compact user representations.
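
As one possible starting point for such tasks, here is a sketch that builds a compact user representation by mean-pooling [CLS] encodings over a user's posts. The hub id below is a placeholder for this model's repository name, and mean pooling is our assumption rather than a documented recipe:

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Placeholder hub id: replace with this model's actual repository name.
MODEL_ID = "rbroc/contrastive-user-encoder"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
encoder = AutoModel.from_pretrained(MODEL_ID)
encoder.eval()

@torch.no_grad()
def user_representation(posts):
    """Mean-pool the [CLS] encodings of a user's posts into a single vector."""
    batch = tokenizer(posts, padding=True, truncation=True, return_tensors="pt")
    cls = encoder(**batch).last_hidden_state[:, 0, :]
    return cls.mean(dim=0)

user_vec = user_representation([
    "First post by this user.",
    "Another post by the same user.",
])
print(user_vec.shape)  # torch.Size([768]) for a DistilBERT-sized encoder
```

The resulting vector can then be concatenated with standard text representations or fed to a downstream classifier (e.g., for trait prediction).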

#### Limitations
Being exclusively trained on Reddit data, our models probably overfit to linguistic markers and traits which are relevant to characterizing the Reddit user population, but less salient in the general population. Domain-specific fine-tuning may be required before deployment.

Furthermore, our self-supervised approach enforces little or no control over biases, which models may actively use as part of their heuristics in contrastive and downstream tasks.