Upload folder using huggingface_hub
- README.md +47 -0
- clip_encoder.pth +3 -0
- config.json +29 -0
- projector.pth +3 -0
- sam_encoder.pth +3 -0
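The commit title above is the default message emitted by `huggingface_hub` when pushing a local directory; a minimal sketch of how a folder of files like these is typically uploaded (the repo id and local path below are placeholders, not the actual repo):

```python
from huggingface_hub import HfApi

# Sketch of how a commit like this one is usually produced.
api = HfApi()
api.upload_folder(
    repo_id="your-username/DeepEncoder",  # placeholder repo id
    folder_path="./deepencoder_export",   # placeholder local folder holding the files listed above
    commit_message="Upload folder using huggingface_hub",
)
```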
README.md
ADDED
@@ -0,0 +1,47 @@
# DeepEncoder (Extracted from DeepSeek-OCR)

## Overview
This repository contains the encoder components extracted from DeepSeek-OCR: the SAM ViT-B encoder, the CLIP-Large encoder, and the linear projector.

## Model Files
- `sam_encoder.pth`: SAM ViT-B encoder (95,569,152 params, 364.6 MB)
- `clip_encoder.pth`: CLIP-Large encoder (303,177,728 params, 1156.6 MB)
- `projector.pth`: Linear projector (2,622,720 params, 10.0 MB)
- `config.json`: Model configuration

**Total:** 401,369,600 parameters

## Architecture
```
Image (1024×1024) → SAM (95M) → 16× Conv → CLIP (303M) → Projector (3M) → 256 vision tokens
```
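For orientation, the 256 tokens follow from the numbers above: SAM ViT-B splits the 1024×1024 input into 16×16-pixel patches, and the convolutional stage then shrinks the token count a further 16×. A minimal sketch of the arithmetic; treating "16× Conv" as a 16× token-count reduction is an assumption based on the figures above:

```python
# Token arithmetic implied by the pipeline diagram above.
# Assumption: "16x Conv" means a 16x reduction in token count, not channel width.
image_size, patch_size = 1024, 16               # values from config.json ("sam" block)
patch_tokens = (image_size // patch_size) ** 2  # 64 * 64 = 4096 patch tokens out of SAM
vision_tokens = patch_tokens // 16              # 16x compression -> 256 tokens
print(patch_tokens, vision_tokens)              # 4096 256
```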
## Usage
```python
import torch
from deepencoder import build_sam_vit_b, build_clip_l, MlpProjector
from easydict import EasyDict as adict

# Load models
sam = build_sam_vit_b(checkpoint=None)
sam.load_state_dict(torch.load('sam_encoder.pth'))

clip = build_clip_l()
clip.load_state_dict(torch.load('clip_encoder.pth'))

projector_cfg = adict({'projector_type': 'linear', 'input_dim': 2048, 'n_embed': 1280})
projector = MlpProjector(projector_cfg)
projector.load_state_dict(torch.load('projector.pth'))

# Run encoder (encode() stands in for the full SAM -> 16x Conv -> CLIP -> projector pipeline)
vision_tokens = encode(image)  # [1, 256, 1280]
```
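After loading, a quick sanity check is that the three modules account for the parameter total listed above (a minimal sketch reusing the `sam`, `clip`, and `projector` objects from the snippet):

```python
# Sum parameters across the three loaded components; README and config.json list 401,369,600.
total = sum(p.numel() for module in (sam, clip, projector) for p in module.parameters())
print(f"{total:,}")  # expected: 401,369,600
```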
## Training
These weights are:
- Initialized from pretrained SAM (SA-1B) and CLIP (LAION-2B) checkpoints
- Fine-tuned jointly on optical compression/OCR tasks
- Optimized to preserve text content in compressed form

## Source
Extracted from: [deepseek-ai/DeepSeek-OCR](https://huggingface.co/deepseek-ai/DeepSeek-OCR)
clip_encoder.pth
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:53f61b63263cd928dae17d5cf5d6f9c0b6b4ff3f31d4b08c65141b950ae10b4f
size 1212819919
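The `.pth` files are stored as Git LFS pointers like the one above (version line, `oid`, `size`). After downloading the actual weights, the recorded digest and byte count can be checked locally; a minimal sketch, assuming the file sits in the current directory under its repo name:

```python
import hashlib
import os

def verify_lfs_file(path: str, expected_sha256: str, expected_size: int) -> None:
    """Compare a downloaded file against the oid/size recorded in its LFS pointer."""
    assert os.path.getsize(path) == expected_size, "size mismatch"
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    assert digest.hexdigest() == expected_sha256, "sha256 mismatch"

verify_lfs_file(
    "clip_encoder.pth",
    "53f61b63263cd928dae17d5cf5d6f9c0b6b4ff3f31d4b08c65141b950ae10b4f",
    1212819919,
)
```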
config.json
ADDED
@@ -0,0 +1,29 @@
{
  "sam": {
    "params": 95569152,
    "architecture": "SAM ViT-B",
    "image_size": 1024,
    "patch_size": 16,
    "embed_dim": 768,
    "depth": 12,
    "num_heads": 12
  },
  "clip": {
    "params": 303177728,
    "architecture": "CLIP-Large",
    "image_size": 224,
    "patch_size": 14,
    "width": 1024,
    "layers": 24,
    "heads": 16
  },
  "projector": {
    "params": 2622720,
    "type": "linear",
    "input_dim": 2048,
    "output_dim": 1280
  },
  "total_params": 401369600,
  "output_tokens": 256,
  "output_dim": 1280
}
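Because `config.json` records the same shapes and counts quoted in the README, it can drive a quick consistency check; a minimal sketch using only the standard library:

```python
import json

with open("config.json") as f:
    cfg = json.load(f)

# Per-component parameter counts should add up to the recorded total.
assert sum(cfg[k]["params"] for k in ("sam", "clip", "projector")) == cfg["total_params"]
print(cfg["output_tokens"], cfg["output_dim"])  # 256 1280
```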
projector.pth
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:62dd2f2e01ca17b94b1778b37cd34a0c24194342a84d514151f3a663ab5ad4db
size 10492853
sam_encoder.pth
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:5e51a1b5e63ec43400bd25afb739588a4e024847970f2e30f8aa77f5b4e58428
size 382336317