Volkopat committed
Commit 51d127a · verified · 1 Parent(s): 69b8fa5

Upload folder using huggingface_hub

Files changed (5):
  1. README.md +47 -0
  2. clip_encoder.pth +3 -0
  3. config.json +29 -0
  4. projector.pth +3 -0
  5. sam_encoder.pth +3 -0
README.md ADDED
@@ -0,0 +1,47 @@
+ # DeepEncoder (Extracted from DeepSeek-OCR)
+
+ ## Overview
+ This directory contains the encoder components extracted from DeepSeek-OCR.
+
+ ## Model Files
+ - `sam_encoder.pth`: SAM ViT-B encoder (95,569,152 params, 364.6 MB)
+ - `clip_encoder.pth`: CLIP-Large encoder (303,177,728 params, 1156.6 MB)
+ - `projector.pth`: Linear projector (2,622,720 params, 10.0 MB)
+ - `config.json`: Model configuration
+
+ **Total:** 401,369,600 parameters (see the quick check below)
+
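+ A quick way to confirm the stated totals (this assumes each `.pth` file holds a plain
+ PyTorch `state_dict`, as the Usage example below implies, and that the LFS files have
+ been downloaded into this directory):
+
+ ```python
+ import torch
+
+ total = 0
+ for name in ["sam_encoder.pth", "clip_encoder.pth", "projector.pth"]:
+     state_dict = torch.load(name, map_location="cpu")
+     n = sum(t.numel() for t in state_dict.values())
+     print(f"{name}: {n:,} params")
+     total += n
+ print(f"total: {total:,}")  # expected: 401,369,600
+ ```
+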
+ ## Architecture
+ ```
+ Image (1024×1024) → SAM (95M) → 16× Conv → CLIP (303M) → Projector (3M) → 256 vision tokens
+ ```
+
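+ The token count follows from the configuration: a 1024×1024 input with SAM's 16-pixel patches
+ yields 64 × 64 = 4096 patch tokens, the 16× convolutional compressor reduces these to 256
+ tokens, and the projector maps each token to 1280 dimensions. The arithmetic, spelled out:
+
+ ```python
+ # Shape arithmetic implied by config.json (not model code)
+ image_size, sam_patch = 1024, 16
+ sam_tokens = (image_size // sam_patch) ** 2   # 64 * 64 = 4096 patch tokens
+ vision_tokens = sam_tokens // 16              # 16x conv compression -> 256 tokens
+ output_dim = 1280                             # projector output dimension
+ print(vision_tokens, output_dim)              # 256 1280, i.e. [1, 256, 1280]
+ ```
+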
+ ## Usage
+ ```python
+ import torch
+ from deepencoder import build_sam_vit_b, build_clip_l, MlpProjector
+ from easydict import EasyDict as adict
+
+ # Load models (weights are loaded on CPU; move to GPU afterwards if needed)
+ sam = build_sam_vit_b(checkpoint=None)
+ sam.load_state_dict(torch.load('sam_encoder.pth', map_location='cpu'))
+
+ clip = build_clip_l()
+ clip.load_state_dict(torch.load('clip_encoder.pth', map_location='cpu'))
+
+ projector_cfg = adict({'projector_type': 'linear', 'input_dim': 2048, 'n_embed': 1280})
+ projector = MlpProjector(projector_cfg)
+ projector.load_state_dict(torch.load('projector.pth', map_location='cpu'))
+
+ # Run the encoder on a preprocessed image tensor of shape [1, 3, 1024, 1024];
+ # `encode` is not defined in this snippet (see the sketch below for one possible wiring)
+ vision_tokens = encode(image)  # [1, 256, 1280]
+ ```
+
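+ One plausible definition of `encode`, shown only as a sketch: the forward interfaces of the SAM
+ and CLIP modules and the concatenation of their features (1024 + 1024 = 2048 dims, matching the
+ projector's `input_dim`) are assumptions; consult the DeepSeek-OCR source for the actual wiring.
+
+ ```python
+ @torch.no_grad()
+ def encode(image):                      # image: [1, 3, 1024, 1024], normalized
+     sam_feats = sam(image)              # assumed [1, 256, 1024] after the 16x conv compressor
+     clip_feats = clip(sam_feats)        # assumed [1, 256, 1024] global features
+     fused = torch.cat([sam_feats, clip_feats], dim=-1)  # [1, 256, 2048]
+     return projector(fused)             # [1, 256, 1280] vision tokens
+ ```
+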
+ ## Training
+ These weights are:
+ - Initialized from pretrained SAM (SA-1B) + CLIP (LAION-2B)
+ - Fine-tuned together on optical compression/OCR tasks
+ - Optimized for text preservation in compressed form
+
+ ## Source
+ Extracted from: [deepseek-ai/DeepSeek-OCR](https://huggingface.co/deepseek-ai/DeepSeek-OCR)
clip_encoder.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:53f61b63263cd928dae17d5cf5d6f9c0b6b4ff3f31d4b08c65141b950ae10b4f
+ size 1212819919
config.json ADDED
@@ -0,0 +1,29 @@
+ {
+   "sam": {
+     "params": 95569152,
+     "architecture": "SAM ViT-B",
+     "image_size": 1024,
+     "patch_size": 16,
+     "embed_dim": 768,
+     "depth": 12,
+     "num_heads": 12
+   },
+   "clip": {
+     "params": 303177728,
+     "architecture": "CLIP-Large",
+     "image_size": 224,
+     "patch_size": 14,
+     "width": 1024,
+     "layers": 24,
+     "heads": 16
+   },
+   "projector": {
+     "params": 2622720,
+     "type": "linear",
+     "input_dim": 2048,
+     "output_dim": 1280
+   },
+   "total_params": 401369600,
+   "output_tokens": 256,
+   "output_dim": 1280
+ }
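The projector settings hard-coded in the README's Usage example can also be read from this file; a minimal sketch (the mapping of `output_dim` to the `n_embed` key mirrors the values used in the Usage snippet):

```python
import json
from easydict import EasyDict as adict

with open("config.json") as f:
    cfg = json.load(f)

# Rebuild the projector config used in the README's Usage example
projector_cfg = adict({
    "projector_type": cfg["projector"]["type"],   # "linear"
    "input_dim": cfg["projector"]["input_dim"],   # 2048
    "n_embed": cfg["projector"]["output_dim"],    # 1280
})
```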
projector.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:62dd2f2e01ca17b94b1778b37cd34a0c24194342a84d514151f3a663ab5ad4db
+ size 10492853
sam_encoder.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:5e51a1b5e63ec43400bd25afb739588a4e024847970f2e30f8aa77f5b4e58428
+ size 382336317