---
language:
- sv
widget:
- text: "Den i HandelstidniDgens g&rdagsnnmmer omtalade hvalfisken, sorn fångats i Frölnndaviken"
  example_title: "News article #1"
- text: "En Gosse fur plats nu genast ! inetallyrkc, JU 83 Drottninggatan."
  example_title: "News article #2"
- text: "AfgäiigStiden bestämmes wid fartyget» hltkomst."
  example_title: "News article #3"
- text: "Elt godt Fortepiano om 6 octaver, ifräân Contra-F till o< med fyrſtrukna F, förfäljes af underte>uad för 260 R:dr Rgs."
  example_title: "Long-s piano ad"
---

(Work in progress)

# Swedish OCR correction

This model corrects OCR errors in Swedish text.

## Model Description

This model is a fine-tuned version of [byt5-small](https://huggingface.co/google/byt5-small), a byte-level multilingual transformer. It is fine-tuned on OCR samples from Swedish 19th- and 20th-century newspapers and historical texts.

## Training Data

The base model, ByT5, is pre-trained on [mc4](https://huggingface.co/datasets/mc4). This fine-tuned version is further trained on:

- Swedish newspapers from 1818 to 2018. Parts of the dataset are available from Språkbanken Text: [Swedish newspapers 1818-1870](https://spraakbanken.gu.se/en/resources/svenska-tidningar-1818-1870), [Swedish newspapers 1871-1906](https://spraakbanken.gu.se/resurser/svenska-tidningar-1871-1906).
- Swedish blackletter documents from 1626 to 1816, available from Språkbanken Text: [Swedish fraktur 1626-1816](https://spraakbanken.gu.se/resurser/svensk-fraktur-1626-1816).

This data includes characters not used in Swedish today, such as the long s (ſ) and the eszett ligature (ß).

## Usage

The model accepts input sequences of at most 128 UTF-8 bytes; longer sequences are truncated to this limit. 128 UTF-8 bytes correspond to slightly fewer than 128 characters of Swedish text, since most characters use one byte but Å, Ä and Ö use two bytes each.
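
Because the limit is counted in bytes rather than characters, it can help to measure and split text before sending it to the model. The following is a minimal sketch, not part of the model's own code: the helper names `utf8_len` and `split_by_bytes` are hypothetical, and splitting on whitespace is just one possible strategy. The 128-byte default comes from the limit stated above; whether that limit also counts special tokens is not stated here, so a smaller `max_bytes` may be safer.

```python
def utf8_len(text: str) -> int:
    """Return the number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


def split_by_bytes(text: str, max_bytes: int = 128) -> list[str]:
    """Greedily pack whitespace-separated words into chunks of at most max_bytes."""
    chunks: list[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if utf8_len(candidate) <= max_bytes:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A single word longer than the limit is kept whole here,
            # so the model would still truncate it.
            current = word
    if current:
        chunks.append(current)
    return chunks


# "fångats" has 7 characters but 8 bytes, because "å" takes two bytes.
print(len("fångats"), utf8_len("fångats"))
```

Each returned chunk can then be corrected separately and the results joined with spaces. Note that this normalizes runs of whitespace, which may or may not be acceptable for a given corpus.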

[Demo code here]
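
Until the demo code is added, here is a rough sketch of how a fine-tuned ByT5 checkpoint is typically called through the `transformers` library. It is an assumption, not the author's demo: the function name `correct_ocr` is hypothetical, `model_id` is a placeholder the caller must fill in with this model's Hub identifier (not given here), and the sketch assumes the model takes the raw OCR string with no task prefix. It needs `transformers` and PyTorch installed.

```python
def correct_ocr(text: str, model_id: str, max_bytes: int = 128) -> str:
    """Run one input string through a seq2seq OCR-correction checkpoint.

    model_id is a placeholder for the Hub identifier of this model.
    """
    # Imported inside the function so the helper can be defined
    # even where transformers is not installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

    # ByT5 tokenizes to UTF-8 bytes, so max_length is effectively a byte budget
    # (assumption: no task prefix is needed for this checkpoint).
    inputs = tokenizer(
        text, return_tensors="pt", truncation=True, max_length=max_bytes
    )
    output_ids = model.generate(**inputs, max_new_tokens=max_bytes)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

The function reloads the model on every call for simplicity; when correcting many strings, loading the tokenizer and model once and reusing them would be faster.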