---
license: mit
datasets:
- jrahn/yolochess_lichess-elite_2211
library_name: transformers
tags:
- chess
widget:
- text: "rnbqkbnr/pppppppp/8/8/8/[MASK]/PPPPPPPP/RNBQKBNR w KQkq - 0 1"
  example_title: "MLM: Masked = 8"
- text: "6k1/8/8/1pB3[MASK]P/1P3P2/8/8/8 w - - 1 74"
  example_title: "MLM: Masked = K"
---
# Model Card for yolochess_mlm_azure-cloud-35

<!-- Provide a quick summary of what the model is/does. -->

This 66M-parameter model is pre-trained from scratch with Masked Language Modeling on chess positions in [FEN](https://en.wikipedia.org/wiki/Forsyth%E2%80%93Edwards_Notation) format.  
It is intended for downstream fine-tuning, e.g. text classification of human moves.

# Model Details

## Model Description

<!-- Provide a longer summary of what this model is. -->



- **Developed by:** Jonathan Rahn
- **Model type:** DistilBERT
- **Language(s) (NLP):** Chess [FEN](https://en.wikipedia.org/wiki/Forsyth%E2%80%93Edwards_Notation)
- **License:** MIT

# Uses

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->

## Direct Use

<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->

This model is pre-trained from scratch with Masked Language Modeling on chess positions in FEN format and can be used directly for fill-mask predictions on FEN strings.

## Downstream Use

<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->

It is intended for downstream fine-tuning, e.g. text classification of human moves, as sketched below.
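
A hedged sketch of such a downstream setup (the `num_labels` value is a hypothetical placeholder, not part of this checkpoint): the pre-trained encoder is loaded with a freshly initialized classification head and then fine-tuned on labeled FEN positions.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("jrahn/yolochess_mlm_azure-cloud-35")

# num_labels is illustrative only: it depends on how the downstream
# move-prediction task encodes its labels.
model = AutoModelForSequenceClassification.from_pretrained(
    "jrahn/yolochess_mlm_azure-cloud-35",
    num_labels=128,
)
# The classification head is randomly initialized and must be fine-tuned
# on (FEN, label) pairs, e.g. with the Trainer API.
```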

## Out-of-Scope Use

<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->

Anything other than Chess Positions in standard [FEN](https://en.wikipedia.org/wiki/Forsyth%E2%80%93Edwards_Notation) format.

# Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

n/a

## Recommendations

<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->

n/a

## How to Get Started with the Model

Use the code below to get started with the model.

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Load the tokenizer and the pre-trained MLM checkpoint from the Hub.
tokenizer = AutoTokenizer.from_pretrained("jrahn/yolochess_mlm_azure-cloud-35")
model = AutoModelForMaskedLM.from_pretrained("jrahn/yolochess_mlm_azure-cloud-35")
```

```python
from transformers import pipeline

# Fill-mask: predict the most likely token for [MASK] in a FEN position.
pipe = pipeline("fill-mask", "jrahn/yolochess_mlm_azure-cloud-35")
pipe("6k1/8/8/1pB3[MASK]P/1P3P2/8/8/8 w - - 1 74")
```

# Training Details

## Training Data

<!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

[Lichess-Elite 22-11 Dataset](https://huggingface.co/datasets/jrahn/yolochess_lichess-elite_2211)
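
The dataset can be loaded with the `datasets` library; the `train` split and `fen` column referenced under Preprocessing below are used like this:

```python
from datasets import load_dataset

# Lichess-Elite 22-11 dataset of chess positions in FEN format.
data = load_dataset("jrahn/yolochess_lichess-elite_2211")
print(data["train"][0]["fen"])  # one FEN string per example
```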

## Training Procedure

<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->

Masked Language Modeling objective with a 15% masked-token ratio.
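
A minimal sketch of this setup with the Hugging Face `transformers` API, assuming a from-scratch DistilBERT configuration and the dynamic-masking data collator (the exact training configuration is not documented in this card):

```python
from transformers import (
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    DistilBertConfig,
    DistilBertForMaskedLM,
)

# Default distilbert-base-cased tokenizer, as noted under Preprocessing.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")

# Randomly initialized DistilBERT for from-scratch pre-training.
model = DistilBertForMaskedLM(DistilBertConfig(vocab_size=len(tokenizer)))

# Dynamic masking at the 15% token ratio stated above.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
```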

### Preprocessing

Tokenize `data["train"]["fen"]` with the default `distilbert-base-cased` tokenizer, padding to a max length of 200 tokens.
This is inefficient: most of the vocabulary never occurs in FEN strings, wasting embedding parameters.
Because FENs are shorter than 90 characters, the 200-token sequence length used for both the model's position embeddings and the data preprocessing produces a lot of padding and wasted parameters.
Experiments with a reduced tokenization max-length show performance gains.
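
A sketch of this tokenization step (the mapping function name is illustrative):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")
data = load_dataset("jrahn/yolochess_lichess-elite_2211")

def tokenize_fen(batch):
    # Pads every FEN to 200 tokens; since FENs stay under 90 characters,
    # a smaller max_length would avoid most of this padding.
    return tokenizer(
        batch["fen"], padding="max_length", max_length=200, truncation=True
    )

tokenized = data.map(tokenize_fen, batched=True)
```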

### Speeds, Sizes, Times

<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->

Training for 172,500 steps at batch size 128 (~22M examples, 1 epoch) took ~10 hours on 1x RTX 4090, using 20 GB VRAM, with a final MLM loss of 0.2567.

# Environmental Impact

<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** 1x RTX 4090
- **Hours used:** 10
- **Cloud Provider:** local
- **Compute Region:** local
- **Carbon Emitted:** 1.5 kg CO2eq

# Technical Specifications

## Model Architecture and Objective

DistilBERT architecture with a Masked Language Modeling objective.
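
A quick way to inspect the architecture and confirm the parameter count reported in the summary above:

```python
from transformers import AutoConfig, AutoModelForMaskedLM

config = AutoConfig.from_pretrained("jrahn/yolochess_mlm_azure-cloud-35")
print(config.model_type)  # "distilbert"

model = AutoModelForMaskedLM.from_pretrained("jrahn/yolochess_mlm_azure-cloud-35")
print(f"{model.num_parameters():,} parameters")  # roughly 66M, per the summary
```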