Commit 9d03b06 by Tonic
1 Parent(s): d74ddfc

Update README.md

Files changed (1)
  1. README.md +5 -180
README.md CHANGED
@@ -1,185 +1,10 @@
  ---
  license: mit
- title: Tonic's Texify
- emoji: ✖️➖➕➗🟰
- colorFrom: indigo
- colorTo: pink
  pinned: true
  sdk: gradio
  app_file: app.py
- ---
- # Texify
-
- Texify is an OCR model that converts images or PDFs containing math into markdown and LaTeX that can be rendered by MathJax ($$ and $ are delimiters). It can run on CPU, GPU, or MPS.
-
- https://github.com/VikParuchuri/texify/assets/913340/882022a6-020d-4796-af02-67cb77bc084c
-
- Texify can work with block equations, or equations mixed with text (inline). It will convert both the equations and the text.
-
- The closest open source comparisons to texify are [pix2tex](https://github.com/lukas-blecher/LaTeX-OCR) and [nougat](https://github.com/facebookresearch/nougat), although they're designed for different purposes:
-
- - Pix2tex is designed only for block LaTeX equations, and hallucinates more on text.
- - Nougat is designed to OCR entire pages, and hallucinates more on small images only containing math.
-
- Pix2tex is trained on im2latex, and nougat is trained on arxiv. Texify is trained on a more diverse set of web data, and works on a range of images.
-
- See more details in the [benchmarks](#benchmarks) section.
-
- ## Community
-
- [Discord](https://discord.gg//KuZwXNGnfH) is where we discuss future development.
-
- ## Examples
-
- **Note** I added spaces after _ symbols because [GitHub math formatting is broken](https://github.com/github/markup/issues/1575).
-
- ![Example 0](data/examples/0.png)
-
- **Detected Text** The potential $V_{i}$ of cell $\mathcal{C}_ {j}$ centred at position $\mathbf{r}_ {i}$ is related to the surface charge densities $\sigma_ {j}$ of cells $\mathcal{E}_ {j}$ $j\in[1,N]$ through the superposition principle as:
-
- $$V_ {i}\,=\,\sum_ {j=0}^{N}\,\frac{\sigma_ {j}}{4\pi\varepsilon_ {0}}\,\int_{\mathcal{E}_ {j}}\frac{1}{\left|\mathbf{r}_ {i}-\mathbf{r}^{\prime}\right|}\,\mathrm{d}^{2}\mathbf{r}^{\prime}\,=\,\sum_{j=0}^{N}\,Q_ {ij}\,\sigma_{j},$$
-
- where the integral over the surface of cell $\mathcal{C}_ {j}$ only depends on $\mathcal{C}_ {j}$ shape and on the relative position of the target point $\mathbf{r}_ {i}$ with respect to $\mathcal{C}_ {j}$ location, as $\sigma_ {j}$ is assumed constant over the whole surface of cell $\mathcal{C}_ {j}$.
-
- | Image | OCR Markdown |
- |----------------------------|---------------------------|
- | [1](data/examples/100.png) | [1](data/examples/100.md) |
- | [2](data/examples/300.png) | [2](data/examples/300.md) |
- | [3](data/examples/400.png) | [3](data/examples/400.md) |
-
- # Installation
-
- You'll need Python 3.10+ and PyTorch. You may need to install the CPU version of torch first if you're not using a Mac or a GPU machine. See [here](https://pytorch.org/get-started/locally/) for more details.
-
- Install with:
-
- ```
- pip install texify
- ```
-
- Model weights will automatically download the first time you run it.
-
- # Usage
-
- - Inspect the settings in `texify/settings.py`. You can override any settings with environment variables.
- - Your torch device will be automatically detected, but you can override this. For example, `TORCH_DEVICE=cuda` or `TORCH_DEVICE=mps`.
-
- ## Usage tips
-
- - Don't make your boxes too small or too large. See the examples and the video above for good crops.
- - Texify is sensitive to how you draw the box around the text you want to OCR. If you get bad results, try selecting a slightly different box, or splitting the box into 2+. You can also try changing the `TEMPERATURE` setting.
- - Sometimes, KaTeX won't be able to render an equation (red error), but it will still be valid LaTeX. You can copy the LaTeX and render it elsewhere.
-
- ## App for interactive conversion
-
- I've included a streamlit app that lets you interactively select and convert equations from images or PDF files. Run it with:
-
- ```
- texify_gui
- ```
-
- The app will allow you to select the specific equations you want to convert on each page, then render the results with KaTeX and enable easy copying.
-
- ## Convert images
-
- You can OCR a single image or a folder of images with:
-
- ```
- texify /path/to/folder_or_file --max 8 --json_path results.json
- ```
-
- - `--max` is how many images in the folder to convert at most. Omit this to convert all images in the folder.
- - `--json_path` is an optional path to a json file where the results will be saved. If you omit this, the results will be saved to `data/results.json`.
-
- ## Import and run
-
- You can import texify and run it in python code:
-
- ```
- from texify.inference import batch_inference
- from texify.model.model import load_model
- from texify.model.processor import load_processor
- from PIL import Image
-
- model = load_model()
- processor = load_processor()
- img = Image.open("test.png") # Your image name here
- results = batch_inference([img], model, processor)
- ```
-
- # Manual install
-
- If you want to develop texify, you can install it manually:
-
- - `git clone https://github.com/VikParuchuri/texify.git`
- - `cd texify`
- - `poetry install` # Installs main and dev dependencies
-
- # Limitations
-
- OCR is complicated, and texify is not perfect. Here are some known limitations:
-
- - The OCR is dependent on how you crop the image. If you get bad results, try a different selection/crop. Or try changing the `TEMPERATURE` setting.
- - Texify will OCR equations and surrounding text, but is not good for general purpose OCR. Think sections of a page instead of a whole page.
- - Texify was mostly trained with 96 DPI images, and only at a max 420x420 resolution. Very wide or very tall images may not work well.
- - It works best with English, although it should support other languages with similar character sets.
- - The output format will be markdown with embedded LaTeX for equations (close to GitHub Flavored Markdown). It will not be pure LaTeX.
-
- # Benchmarks
-
- Benchmarking OCR quality is hard - you ideally need a parallel corpus that models haven't been trained on. I sampled from arxiv and im2latex to create the benchmark set.
-
- ![Benchmark results](data/images/texify_bench.png)
-
- Each model is trained on one of the benchmark tasks:
-
- - Nougat was trained on arxiv, possibly the images in the benchmark.
- - Pix2tex was trained on im2latex.
- - Texify was trained on im2latex. It was trained on arxiv, but not the images in the benchmark.
-
- Although this makes the benchmark results biased, it does seem like a good compromise, since nougat and pix2tex don't work as well out of domain. Note that neither pix2tex nor nougat is really designed for this task (OCR inline equations and text), so this is not a perfect comparison.
-
- | Model | BLEU ⬆ | METEOR ⬆ | Edit Distance ⬇ |
- |---------|--------------|--------------|-----------------|
- | pix2tex | 0.382659 | 0.543363 | 0.352533 |
- | nougat | 0.697667 | 0.668331 | 0.288159 |
- | texify | **0.842349** | **0.885731** | **0.0651534** |
-
- ## Running your own benchmarks
-
- You can benchmark the performance of texify on your machine.
-
- - Follow the manual install instructions above.
- - If you want to use pix2tex, run `pip install pix2tex`
- - If you want to use nougat, run `pip install nougat-ocr`
- - Download the benchmark data [here](https://drive.google.com/file/d/1dbY0kBq2SUa885gmbLPUWSRzy5K7O5XJ/view?usp=sharing) and put it in the `data` folder.
- - Run `benchmark.py` like this:
-
- ```
- python benchmark.py --max 100 --pix2tex --nougat --data_path data/bench_data.json --result_path data/bench_results.json
- ```
-
- This will benchmark texify against pix2tex and nougat. It will do batch inference with texify and nougat, but not with pix2tex, since I couldn't find an option for batching.
-
- - `--max` is how many benchmark images to convert at most.
- - `--data_path` is the path to the benchmark data. If you omit this, it will use the default path.
- - `--result_path` is the path to the benchmark results. If you omit this, it will use the default path.
- - `--pix2tex` specifies whether to run pix2tex (Latex-OCR) or not.
- - `--nougat` specifies whether to run nougat or not.
-
- # Training
-
- Texify was trained on LaTeX images and paired equations from across the web. It includes the [im2latex](https://github.com/guillaumegenthial/im2latex) dataset. Training happened on 4x A6000s for 2 days (~6 epochs).
-
- # Commercial usage
-
- This model is trained on top of the openly licensed [Donut](https://huggingface.co/naver-clova-ix/donut-base) model, and thus can be used for commercial purposes. Model weights are licensed under the [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/) license.
-
- # Thanks
-
- This work would not have been possible without lots of amazing open source work. I particularly want to acknowledge [Lukas Blecher](https://github.com/lukas-blecher), whose work on Nougat and pix2tex was key for this project. I learned a lot from his code, and used parts of it for texify.
-
- - [im2latex](https://github.com/guillaumegenthial/im2latex) - one of the datasets used for training
- - [Donut](https://huggingface.co/naver-clova-ix/donut-base) from Naver, the base model for texify
- - [Nougat](https://github.com/facebookresearch/nougat) - I used the tokenizer from Nougat
- - [Latex-OCR](https://github.com/lukas-blecher/LaTeX-OCR) - The original open source Latex OCR project

  ---
  license: mit
+ title: Nexus🐦‍⬛Raven
+ emoji: 🐦‍⬛🔬🤖
+ colorFrom: yellow
+ colorTo: purple
  pinned: true
  sdk: gradio
  app_file: app.py
+ ---
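
After this commit the README consists only of the front matter block, so the effective change is four metadata fields. As a rough way to check that, the sketch below compares the old and new front matter taken from the diff above. It assumes a flat `key: value` block delimited by `---` lines; `parse_front_matter` is a hypothetical helper written for this illustration, not part of Gradio or any Hugging Face tooling, and it does not handle nested YAML.

```python
def parse_front_matter(text: str) -> dict[str, str]:
    """Parse a flat `key: value` block delimited by `---` lines (hypothetical helper)."""
    lines = text.strip().splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields: dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":  # closing delimiter ends the block
            break
        key, sep, value = line.partition(":")
        if sep:  # skip lines that are not `key: value`
            fields[key.strip()] = value.strip()
    return fields


# Front matter before and after the commit, copied from the diff.
OLD = """---
license: mit
title: Tonic's Texify
emoji: ✖️➖➕➗🟰
colorFrom: indigo
colorTo: pink
pinned: true
sdk: gradio
app_file: app.py
---
"""

NEW = """---
license: mit
title: Nexus🐦‍⬛Raven
emoji: 🐦‍⬛🔬🤖
colorFrom: yellow
colorTo: purple
pinned: true
sdk: gradio
app_file: app.py
---
"""

old_fields = parse_front_matter(OLD)
new_fields = parse_front_matter(NEW)

# Map each changed key to its (old, new) pair.
changed = {
    key: (old_fields.get(key), value)
    for key, value in new_fields.items()
    if old_fields.get(key) != value
}

for key, (before, after) in changed.items():
    print(f"{key}: {before} -> {after}")
```

Only `title`, `emoji`, `colorFrom`, and `colorTo` differ; `license`, `pinned`, `sdk`, and `app_file` are unchanged context lines in the diff.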