shreyanshu09/block_diagram_global_information

shreyanshu09 committed on May 25

Commit 8187697 • 1 Parent(s): 0de896c

Update README.md

Files changed (1)

README.md +68 -3

@@ -1,3 +1,68 @@
----
-license: mit
----

+---
+license: mit
+tags:
+- donut
+- image-to-text
+- vision
+datasets:
+- shreyanshu09/Block_Diagram
+- shreyanshu09/BD-EnKo
+language:
+- en
+- ko
+---
+# Donut (base-sized model, pre-trained only)
+This is the Donut model in its pre-trained-only form. It was introduced in the paper [OCR-free Document Understanding Transformer](https://arxiv.org/abs/2111.15664) by Geewook Kim et al. and first released in [this repository](https://github.com/clovaai/donut).
+## Model description
+Donut consists of a vision encoder (Swin Transformer) and a text decoder (BART). Given an image, the encoder first encodes it into a tensor of embeddings of shape `(batch_size, seq_len, hidden_size)`. The decoder then autoregressively generates text, conditioned on the encoder's output.
+![model image](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/donut_architecture.jpg)
+## Intended uses & limitations
+This model is meant to be fine-tuned on a downstream task, like document image classification or document parsing. See the [model hub](https://huggingface.co/models?search=donut) to look for fine-tuned versions on a task that interests you.
+## Training dataset
+- 558K filtered image-text pairs from LAION/CC/SBU, captioned by BLIP.
+- 158K GPT-generated multimodal instruction-following data.
+- 450K academic-task-oriented VQA data mixture.
+- 40K ShareGPT data.
+## How to use
+Here is how to use this model in PyTorch:
+```python
+import os
+from PIL import Image
+import torch
+from donut import DonutModel
+# Load the pre-trained model
+model = DonutModel.from_pretrained("shreyanshu09/block_diagram_global_information")
+# Move the model to GPU if available
+if torch.cuda.is_available():
+    model.half()
+    device = torch.device("cuda:0")
+    model.to(device)
+# Function to process a single image
+def process_image(image_path):
+    # Load and process the image
+    image = Image.open(image_path)
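+    # basename() only parses the string, so this folder does not need to exist.
+    # Note: because the path ends in "/", basename() returns "" and the prompt below is "<s_>".
+    task_name = os.path.basename('/block_diagram_global_information/dataset/c2t_data/')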
+    result = model.inference(image=image, prompt=f"<s_{task_name}>")["predictions"][0]
+    # Extract the relevant information from the result
+    if 'c2t' in result:
+        return result['c2t']
+    else:
+        return result['text_sequence']
+```
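
A note on the prompt built in `process_image`: the task name comes from `os.path.basename` applied to a path that ends in a slash. The sketch below uses `posixpath` so the result is the same on every operating system, and shows what that call yields alongside a hypothetical variant that strips the trailing slash first. The variable names are illustrative and not part of the README. The commit does not say which of the two prompts the checkpoint expects, so treat this as a diagnostic, not a fix.

```python
import posixpath

# Path copied from the README snippet; posixpath behaves identically on all platforms.
dataset_path = "/block_diagram_global_information/dataset/c2t_data/"

# basename() returns whatever follows the final slash, which is nothing here.
task_name_as_written = posixpath.basename(dataset_path)

# Hypothetical variant (an assumption, not the author's code): drop the trailing slash first.
task_name_stripped = posixpath.basename(dataset_path.rstrip("/"))

# Build the prompt the same way the README snippet does.
prompt_as_written = f"<s_{task_name_as_written}>"
prompt_stripped = f"<s_{task_name_stripped}>"

print(repr(prompt_as_written))
print(repr(prompt_stripped))
```

If the model returns unexpected output, comparing both prompts against a known diagram is a quick way to find out which task token was used during training.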