simonschoe committed
Commit 2d84e3f
1 Parent(s): 20746b1

update model card

Files changed (1)
  1. README.md +70 -2
README.md CHANGED
@@ -2,9 +2,9 @@
 language:
 - en
 library_name: fasttext
+pipeline_tag: text-classification
 tags:
 - text
-- text-classification
 - semantic-similarity
 - earnings-call-transcripts
 - word2vec
@@ -20,4 +20,72 @@ widget:
   example_title: "disruption"
 ---

-### Model Card
+# EarningsCall2Vec
+This is a [fastText](https://fasttext.cc/) model trained via [`Gensim`](https://radimrehurek.com/gensim/): It maps each token in the vocabulary (i.e., unigrams as well as frequently co-occurring bi-, tri-, and four-grams) to a dense, 300-dimensional vector space, designed for performing **semantic search**. It has been trained on a corpus of ~160k earnings call transcripts, in particular the executive remarks within the Q&A section of these transcripts (13m sentences).
+
+## Usage (API)
+You can query the hosted model over HTTP, e.g., with the `requests` library (a minimal sketch; any HTTP client works):
+```
+pip install -U requests
+```
+Then you can use the model like this (the model id and API token are placeholders):
+```python
+import requests
+
+# inference endpoint of the hosted model (<MODEL_ID> is a placeholder)
+API_URL = "https://api-inference.huggingface.co/models/<MODEL_ID>"
+headers = {"Authorization": "Bearer <HF_API_TOKEN>"}
+
+# send a search query, e.g., the widget example "disruption"
+response = requests.post(API_URL, headers=headers, json={"inputs": "disruption"})
+print(response.json())
+```
+
+## Usage (Gensim)
+```
+pip install -U gensim
+```
+Then you can load the binary model file and query it like this (a minimal sketch; the model path is a placeholder):
+```python
+from gensim.models.fasttext import load_facebook_model
+
+# load the model from its binary fastText format
+model = load_facebook_model(<PATH_TO_MODEL>)
+
+# retrieve the ten nearest neighbours of a search query,
+# e.g., the widget example "disruption"
+model.wv.most_similar("disruption", topn=10)
+```
+
+## Background
+
+Context on the project.
+
+
+## Intended Uses
+
+Our model is intended to be used for semantic search on the token level: It encodes search queries (i.e., tokens) in a dense vector space and finds semantic neighbours, i.e., tokens which frequently occur within similar contexts in the underlying training data. Note that this search is only feasible for individual tokens and may produce deficient results for out-of-vocabulary tokens.
+
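+When searching via Gensim, you can check upfront whether a query is part of the model vocabulary (a sketch; the second query token is a hypothetical out-of-vocabulary example):
+```python
+# in-vocabulary queries yield reliable neighbours
+"disruption" in model.wv.key_to_index       # True
+
+# out-of-vocabulary queries fall back on subword information
+# and may produce deficient results
+"disruptometer" in model.wv.key_to_index    # False (hypothetical)
+```
+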
+## Training procedure
+
+```python
+import logging
+
+from gensim.models import FastText
+from gensim.models.word2vec import LineSentence
+from gensim.models.fasttext import save_facebook_model
+
+logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
+
+# init skip-gram model (sg=1) with negative sampling (hs=0, negative=5)
+model = FastText(
+    vector_size=300,
+    window=5,
+    min_count=10,
+    alpha=0.025,
+    negative=5,
+    seed=2021,
+    sample=0.001,
+    sg=1,
+    hs=0,
+    max_vocab_size=None,
+    workers=10,
+)
+
+# build vocab from the line-delimited training corpus
+model.build_vocab(corpus_iterable=LineSentence(<PATH_TRAIN_DATA>))
+
+# train model
+model.train(
+    corpus_iterable=LineSentence(<PATH_TRAIN_DATA>),
+    total_words=model.corpus_total_words,
+    total_examples=model.corpus_count,
+    epochs=50,
+)
+
+# save to binary fastText format
+save_facebook_model(model, <PATH_MOD_SAVE>)
+```
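+
+After training, the saved binary can be reloaded for a quick sanity check (a sketch, mirroring the load call from the Usage (Gensim) section):
+```python
+from gensim.models.fasttext import load_facebook_model
+
+# reload the model and verify the embedding dimensionality
+model = load_facebook_model(<PATH_MOD_SAVE>)
+assert model.wv.vector_size == 300
+```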
+
+## Training Data
+
+The model was trained on the executive remarks within the Q&A section of ~160k earnings call transcripts, i.e., roughly 13m sentences (see the model description above).