---
license: mit
tags:
- donut
- image-to-text
- vision
datasets:
- shreyanshu09/Block_Diagram
- shreyanshu09/BD-EnKo
language:
- en
- ko
---

# Donut (base-sized model, pre-trained only)

Donut is a pre-trained-only model. It was introduced in the paper [OCR-free Document Understanding Transformer](https://arxiv.org/abs/2111.15664) by Geewook Kim et al. and first released in [this repository](https://github.com/clovaai/donut).

## Model description

Donut consists of a vision encoder (Swin Transformer) and a text decoder (BART). Given an image, the encoder first encodes it into a tensor of embeddings of shape `(batch_size, seq_len, hidden_size)`, after which the decoder autoregressively generates text conditioned on that encoding.

![model image](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/donut_architecture.jpg)
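
As a rough illustration of the shapes involved, the sketch below uses placeholder dimensions (they are assumptions for illustration, not the actual configuration of any Donut checkpoint):

```python
import torch

# Placeholder dimensions for illustration only; the real values depend
# on the checkpoint's Swin encoder configuration and input resolution
batch_size, seq_len, hidden_size = 1, 1200, 1024

# The vision encoder produces one embedding per image patch token;
# the decoder attends over this tensor while generating text
encoder_output = torch.randn(batch_size, seq_len, hidden_size)
print(encoder_output.shape)  # torch.Size([1, 1200, 1024])
```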

## Intended uses & limitations

This model is meant to be fine-tuned on a downstream task, such as document image classification or document parsing. See the [model hub](https://huggingface.co/models?search=donut) for fine-tuned versions on a task that interests you.

## Training dataset

- 558K filtered image-text pairs from LAION/CC/SBU, captioned by BLIP.
- 158K GPT-generated multimodal instruction-following data.
- 450K academic-task-oriented VQA data mixture.
- 40K ShareGPT data.

### How to use

Here is how to use this model in PyTorch:

```python
import os

import torch
from PIL import Image
from donut import DonutModel

# Load the pre-trained model
model = DonutModel.from_pretrained("shreyanshu09/block_diagram_global_information")

# Move the model to GPU if available
if torch.cuda.is_available():
    model.half()
    device = torch.device("cuda:0")
    model.to(device)

# Function to process a single image
def process_image(image_path):
    # Load the image
    image = Image.open(image_path).convert("RGB")

    # Derive the task name from the dataset folder; normpath strips the
    # trailing slash so basename returns the directory name itself
    task_name = os.path.basename(os.path.normpath('/block_diagram_global_information/dataset/c2t_data/'))
    result = model.inference(image=image, prompt=f"<s_{task_name}>")["predictions"][0]

    # Extract the relevant information from the result
    if 'c2t' in result:
        return result['c2t']
    else:
        return result['text_sequence']
```
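
The task prompt passed to `model.inference` is just the dataset folder's name wrapped in a start token. A minimal sketch of that derivation, using the path assumed in the snippet above (note that `os.path.basename` returns `''` for a path with a trailing slash, so the path is normalized first):

```python
import os

# Dataset folder path from the snippet above
path = '/block_diagram_global_information/dataset/c2t_data/'

# normpath drops the trailing slash so basename yields the folder name
task_name = os.path.basename(os.path.normpath(path))
prompt = f"<s_{task_name}>"
print(prompt)  # <s_c2t_data>
```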