Upload folder using huggingface_hub
- README.md +47 -0
- clip_encoder.pth +3 -0
- config.json +29 -0
- projector.pth +3 -0
- sam_encoder.pth +3 -0
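The commit title above is the default message emitted by `huggingface_hub` when pushing a local directory; a minimal sketch of how a folder of files like these is typically uploaded (the repo id and local path below are placeholders, not the actual repo):

```python
from huggingface_hub import HfApi

# Sketch of how a commit like this one is usually produced.
api = HfApi()
api.upload_folder(
    repo_id="your-username/DeepEncoder",  # placeholder repo id
    folder_path="./deepencoder_export",   # placeholder local folder holding the files listed above
    commit_message="Upload folder using huggingface_hub",
)
```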
README.md
ADDED
@@ -0,0 +1,47 @@
# DeepEncoder (Extracted from DeepSeek-OCR)

## Overview
This repository contains the encoder components extracted from DeepSeek-OCR: the SAM ViT-B encoder, the CLIP-Large encoder, and the linear projector.

## Model Files
- `sam_encoder.pth`: SAM ViT-B encoder (95,569,152 params, 364.6 MB)
- `clip_encoder.pth`: CLIP-Large encoder (303,177,728 params, 1156.6 MB)
- `projector.pth`: Linear projector (2,622,720 params, 10.0 MB)
- `config.json`: Model configuration

**Total:** 401,369,600 parameters

## Architecture
```
Image (1024×1024) → SAM (95M) → 16× Conv → CLIP (303M) → Projector (3M) → 256 vision tokens
```
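For orientation, the 256 tokens follow from the numbers above: SAM ViT-B splits the 1024×1024 input into 16×16-pixel patches, and the convolutional stage then shrinks the token count a further 16×. A minimal sketch of the arithmetic; treating "16× Conv" as a 16× token-count reduction is an assumption based on the figures above:

```python
# Token arithmetic implied by the pipeline diagram above.
# Assumption: "16x Conv" means a 16x reduction in token count, not channel width.
image_size, patch_size = 1024, 16               # values from config.json ("sam" block)
patch_tokens = (image_size // patch_size) ** 2  # 64 * 64 = 4096 patch tokens out of SAM
vision_tokens = patch_tokens // 16              # 16x compression -> 256 tokens
print(patch_tokens, vision_tokens)              # 4096 256
```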
## Usage
```python
import torch
from deepencoder import build_sam_vit_b, build_clip_l, MlpProjector
from easydict import EasyDict as adict

# Load models
sam = build_sam_vit_b(checkpoint=None)
sam.load_state_dict(torch.load('sam_encoder.pth'))

clip = build_clip_l()
clip.load_state_dict(torch.load('clip_encoder.pth'))

projector_cfg = adict({'projector_type': 'linear', 'input_dim': 2048, 'n_embed': 1280})
projector = MlpProjector(projector_cfg)
projector.load_state_dict(torch.load('projector.pth'))

# Run encoder (encode() stands in for the full SAM -> 16x Conv -> CLIP -> projector pipeline)
vision_tokens = encode(image)  # [1, 256, 1280]
```
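After loading, a quick sanity check is that the three modules account for the parameter total listed above (a minimal sketch reusing the `sam`, `clip`, and `projector` objects from the snippet):

```python
# Sum parameters across the three loaded components; README and config.json list 401,369,600.
total = sum(p.numel() for module in (sam, clip, projector) for p in module.parameters())
print(f"{total:,}")  # expected: 401,369,600
```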
## Training
These weights are:
- Initialized from pretrained SAM (SA-1B) and CLIP (LAION-2B) checkpoints
- Fine-tuned jointly on optical compression/OCR tasks
- Optimized to preserve text content in compressed form

## Source
Extracted from: [deepseek-ai/DeepSeek-OCR](https://huggingface.co/deepseek-ai/DeepSeek-OCR)
clip_encoder.pth
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:53f61b63263cd928dae17d5cf5d6f9c0b6b4ff3f31d4b08c65141b950ae10b4f
size 1212819919
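The `.pth` files are stored as Git LFS pointers like the one above (version line, `oid`, `size`). After downloading the actual weights, the recorded digest and byte count can be checked locally; a minimal sketch, assuming the file sits in the current directory under its repo name:

```python
import hashlib
import os

def verify_lfs_file(path: str, expected_sha256: str, expected_size: int) -> None:
    """Compare a downloaded file against the oid/size recorded in its LFS pointer."""
    assert os.path.getsize(path) == expected_size, "size mismatch"
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    assert digest.hexdigest() == expected_sha256, "sha256 mismatch"

verify_lfs_file(
    "clip_encoder.pth",
    "53f61b63263cd928dae17d5cf5d6f9c0b6b4ff3f31d4b08c65141b950ae10b4f",
    1212819919,
)
```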
config.json
ADDED
@@ -0,0 +1,29 @@
{
  "sam": {
    "params": 95569152,
    "architecture": "SAM ViT-B",
    "image_size": 1024,
    "patch_size": 16,
    "embed_dim": 768,
    "depth": 12,
    "num_heads": 12
  },
  "clip": {
    "params": 303177728,
    "architecture": "CLIP-Large",
    "image_size": 224,
    "patch_size": 14,
    "width": 1024,
    "layers": 24,
    "heads": 16
  },
  "projector": {
    "params": 2622720,
    "type": "linear",
    "input_dim": 2048,
    "output_dim": 1280
  },
  "total_params": 401369600,
  "output_tokens": 256,
  "output_dim": 1280
}
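Because `config.json` records the same shapes and counts quoted in the README, it can drive a quick consistency check; a minimal sketch using only the standard library:

```python
import json

with open("config.json") as f:
    cfg = json.load(f)

# Per-component parameter counts should add up to the recorded total.
assert sum(cfg[k]["params"] for k in ("sam", "clip", "projector")) == cfg["total_params"]
print(cfg["output_tokens"], cfg["output_dim"])  # 256 1280
```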
projector.pth
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:62dd2f2e01ca17b94b1778b37cd34a0c24194342a84d514151f3a663ab5ad4db
size 10492853
sam_encoder.pth
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:5e51a1b5e63ec43400bd25afb739588a4e024847970f2e30f8aa77f5b4e58428
size 382336317