todd-cook committed
Commit 971409a
1 Parent(s): b3fa48a

updated for paper publication

Files changed (5)
  1. README.md +12 -8
  2. config.json +3 -3
  3. pytorch_model.bin +2 -2
  4. tf_model.h5 +2 -2
  5. vocab.txt +0 -0
README.md CHANGED
@@ -8,13 +8,15 @@ datasets:
 - Tesserae
 - Phi5
 - Thomas Aquinas
+- Patrologia Latina
 ---
 
 # Cicero-Similis
 
 ## Model description
 
-A Latin Language Model, trained on classical Latin texts that are reasonably close to Cicero's range of vocabulary as described in the forthcoming paper "What Would Cicero Write?".
+A Latin Language Model, trained on Latin texts and evaluated using the corpus of Cicero, as described in the paper _What Would Cicero Write? -- Examining Critical Textual Decisions with a Language Model_ by Todd Cook,
+published in Ciceroniana On Line, Vol. V, #2.
 
 ## Intended uses & limitations
 
@@ -26,7 +28,7 @@ Normalize text using JV Replacement and tokenize using CLTK to separate enclitic
 from transformers import BertForMaskedLM, AutoTokenizer, FillMaskPipeline
 tokenizer = AutoTokenizer.from_pretrained("cook/cicero-similis")
 model = BertForMaskedLM.from_pretrained("cook/cicero-similis")
-fill_mask = FillMaskPipeline(model=model, tokenizer=tokenizer)
+fill_mask = FillMaskPipeline(model=model, tokenizer=tokenizer, top_k=10_000)
 # Cicero, De Re Publica, VI, 32, 2
 # "animal" is found in A, Q, PhD manuscripts
 # 'anima' H^1 Macr. et codd. Tusc.
@@ -35,21 +37,23 @@ results = fill_mask("inanimum est enim omne quod pulsu agitatur externo; quod au
 
 #### Limitations and bias
 
-Biased towards Cicero, but that weakness is the model's strength; it's not aimed to be a one-size fits all model.
+Currently the model training data excludes modern and 19th-century texts, but that weakness is the model's strength; it's not aimed to be a one-size-fits-all model.
 
 ## Training data
 
-Trained on the corpora Phi5, Tesserae, and Thomas Aquinas--excluding documents that went outside the scope of Cicero's expected unknown vocabulary probabilities.
+Trained on the corpora Phi5, Tesserae, Thomas Aquinas, and Patrologia Latina.
 
 
 ## Training procedure
 
-5 epochs, masked language modeling .45, effective batch size 32
+5 epochs, masked language modeling with a masking probability of 0.15, effective batch size 32
 
 
 ## Eval results
-A novel evaluation metric is proposed in the forthcoming paper "What Would Cicero Write?"
+A novel evaluation metric is proposed in the paper _What Would Cicero Write? -- Examining Critical Textual Decisions with a Language Model_ by Todd Cook,
+published in Ciceroniana On Line, Vol. V, #2.
 
 ### BibTeX entry and citation info
-
-A paper will be published in Cicero Digitalis in 2021.
+TODO
+_What Would Cicero Write? -- Examining Critical Textual Decisions with a Language Model_ by Todd Cook,
+published in Ciceroniana On Line, Vol. V, #2.
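Note: the README's usage hunk above stops just before the `results = fill_mask(...)` call visible in the last hunk header. As a rough, self-contained illustration of how that pipeline output might be inspected for the two competing manuscript readings ("animal" vs. "anima"), here is a sketch. It is not part of the published example: the plain j/v character swap stands in for the CLTK JV replacement the README mentions, the full Latin sentence is restored from the standard text of the passage rather than copied from the truncated diff line, and the score lookup assumes both readings are single vocabulary tokens.

```python
from transformers import AutoTokenizer, BertForMaskedLM, FillMaskPipeline

tokenizer = AutoTokenizer.from_pretrained("cook/cicero-similis")
model = BertForMaskedLM.from_pretrained("cook/cicero-similis")
fill_mask = FillMaskPipeline(model=model, tokenizer=tokenizer, top_k=10_000)

def jv_normalize(text: str) -> str:
    """Crude stand-in for CLTK's JV replacement: j -> i, v -> u."""
    return text.translate(str.maketrans("jvJV", "iuIU"))

# Cicero, De Re Publica, VI, 32, 2 -- the disputed word is masked out.
masked = (
    jv_normalize("inanimum est enim omne quod pulsu agitatur externo; quod autem est ")
    + tokenizer.mask_token
    + jv_normalize(", id motu cietur interiore et suo.")
)

# FillMaskPipeline returns a list of {"score", "token", "token_str", "sequence"} dicts.
results = fill_mask(masked)
scores = {r["token_str"]: r["score"] for r in results}
for reading in ("animal", "anima"):
    print(reading, scores.get(reading))
```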
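The training procedure summarized in the README (5 epochs, masked language modeling with a masking probability of 0.15, effective batch size 32) maps naturally onto the transformers Trainer API. The sketch below is only one way to read that summary: the architecture numbers are copied from the config.json diff below, while the per-device batch size / gradient-accumulation split, the output directory, and the dataset handling are assumptions, not details published with the model.

```python
from transformers import (
    AutoTokenizer,
    BertConfig,
    BertForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Tokenizer taken from the published repo for convenience; the original run
# would have trained its own WordPiece vocabulary over the Latin corpora.
tokenizer = AutoTokenizer.from_pretrained("cook/cicero-similis")

# Architecture mirroring the config.json diff below: 6 layers, hidden size 768,
# vocab size 25,000, dropout 0.2; everything else is a transformers default.
config = BertConfig(
    vocab_size=25_000,
    hidden_size=768,
    num_hidden_layers=6,
    hidden_dropout_prob=0.2,
    attention_probs_dropout_prob=0.2,
    type_vocab_size=1,
)
model = BertForMaskedLM(config)

# Masked language modeling with 15% of tokens masked, per the README.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

# Effective batch size 32, here as 8 per device x 4 accumulation steps
# (any split with the same product would satisfy the README).
args = TrainingArguments(
    output_dir="cicero-similis-train",  # hypothetical path
    num_train_epochs=5,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
)

# train_dataset would be the tokenized Phi5 / Tesserae / Thomas Aquinas /
# Patrologia Latina corpora; preparing it is outside this sketch.
# trainer = Trainer(model=model, args=args, data_collator=collator,
#                   train_dataset=train_dataset)
# trainer.train()
```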
config.json CHANGED
@@ -1,10 +1,10 @@
 {
-  "_name_or_path": "/Users/todd/PycharmProjects/cicero-similis",
+  "_name_or_path": "models/final",
   "architectures": [
     "BertForMaskedLM"
   ],
   "attention_probs_dropout_prob": 0.2,
-  "gradient_checkpointing": false,
+  "classifier_dropout": null,
   "hidden_act": "gelu",
   "hidden_dropout_prob": 0.2,
   "hidden_size": 768,
@@ -17,7 +17,7 @@
   "num_hidden_layers": 6,
   "pad_token_id": 0,
   "position_embedding_type": "absolute",
-  "transformers_version": "4.3.2",
+  "transformers_version": "4.15.0",
   "type_vocab_size": 1,
   "use_cache": true,
   "vocab_size": 25000
pytorch_model.bin CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:0fc3843041c1a1a046c2c4bbf5d94a58ee08316a969628b974d179f80d59b0b0
-size 250989083
+oid sha256:15a9661486ed016a2ad717e37b7949d5617dca271e491359bc3ad260bb13f542
+size 253348914
tf_model.h5 CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:f994bbe8e4ec14eb6dfa0d619f190b0e5a23e58d09205b7545f14385c244c47e
-size 327909792
+oid sha256:469f4cb1f094d290ffa8a66f77eabec58f22c57cd56fc536ee0a6a30a6b2aae0
+size 329499728
vocab.txt CHANGED
The diff for this file is too large to render. See raw diff