Michaelq committed · Commit cf39aa0 (verified) · 1 Parent(s): 84fd40b

Update README.md

Files changed (1)
  1. README.md +115 -3
README.md CHANGED
@@ -1,3 +1,115 @@
- ---
- license: apache-2.0
- ---
+ ---
+ language: en
+ tags:
+ - codestral
+ - vision-language
+ - code-generation
+ - multimodal
+ - mlx
+ license: other
+ library_name: mlx
+ inference: false
+ license_name: mnpl
+ license_link: https://mistral.ai/licences/MNPL-0.1.md
+ ---
+
+ # Codestral-ViT
+
+ A multimodal code generation model that combines vision and language understanding. Built on MLX for Apple Silicon, it pairs a CLIP vision encoder with Codestral's code generation.
+
+ ## Overview
+
+ Codestral-ViT extends the Codestral language model with visual understanding. It can:
+ - Generate code from text descriptions
+ - Explain code shown in screenshots
+ - Suggest improvements to code based on visual context
+ - Process multiple images using overlapping crops and tiling
+
+ ## Technical Details
+
+ - **Base Models:**
+   - Language: Codestral-22B (4-bit quantized)
+   - Vision: CLIP ViT-Large/14
+   - Framework: MLX (Apple Silicon)
+
+ - **Architecture:**
+   - Vision encoder processes images into 512-dim embeddings (as stated here; stock CLIP ViT-L/14 image embeddings are 768-dim)
+   - Learned projection layer maps vision features to language space
+   - Dynamic RoPE scaling for 32K context window
+   - Support for overlapping image crops and tiling
+
+ - **Input Processing:**
+   - Images: 224x224 pixels, CLIP normalization
+   - Text: Up to 32,768 tokens
+   - Special tokens for image-text fusion
+
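The overlapping-crop tiling and CLIP normalization listed above can be sketched as follows. This is a minimal, dependency-free illustration and not the model's actual preprocessing code: the names `tile_boxes` and `normalize_pixel`, the 32-pixel default overlap, and the policy of placing the last tile flush with the image edge are all assumptions. Only the 224-pixel tile size and the use of CLIP normalization come from the README.

```python
from typing import List, Tuple

# CLIP's published per-channel normalization statistics (RGB order).
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)

Box = Tuple[int, int, int, int]  # (left, top, right, bottom)


def _starts(length: int, tile: int, stride: int) -> List[int]:
    """Start offsets along one axis so that tiles cover [0, length)."""
    if length <= tile:
        return [0]
    starts = list(range(0, length - tile, stride))
    starts.append(length - tile)  # last tile sits flush with the far edge
    return starts


def tile_boxes(width: int, height: int, tile: int = 224, overlap: int = 32) -> List[Box]:
    """Return crop boxes that cover the image with overlapping tiles.

    Boxes are clipped to the image, so an image smaller than one tile
    yields a single box that the caller would resize to tile x tile.
    """
    if not 0 <= overlap < tile:
        raise ValueError("overlap must satisfy 0 <= overlap < tile")
    stride = tile - overlap
    return [
        (x, y, min(x + tile, width), min(y + tile, height))
        for y in _starts(height, tile, stride)
        for x in _starts(width, tile, stride)
    ]


def normalize_pixel(rgb: Tuple[int, int, int]) -> Tuple[float, float, float]:
    """Scale an 8-bit RGB pixel to [0, 1], then apply CLIP mean/std."""
    return tuple((c / 255.0 - m) / s for c, m, s in zip(rgb, CLIP_MEAN, CLIP_STD))
```

Each box can be passed to Pillow's `Image.crop`, and in practice the normalization would be applied to whole arrays rather than pixel by pixel.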
+ ## Example Usage
+
+ ```python
+ from PIL import Image
+ from src.model import MultimodalCodestral
+
+ model = MultimodalCodestral()
+
+ # Explain code shown in a screenshot
+ image = Image.open("code_screenshot.png")
+ response = model.generate_with_images(
+     prompt="Explain this code and suggest improvements",
+     images=[image],
+ )
+
+ # Compare code across multiple screenshots
+ images = [Image.open(f) for f in ["img1.png", "img2.png"]]
+ response = model.generate_with_images(
+     prompt="Compare these code implementations",
+     images=images,
+ )
+ ```
+
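When the screenshots come from a mixed list of files, it may help to filter and order the paths before opening them. The sketch below is only an illustration: `select_image_paths` and the suffix set are hypothetical and not part of the repository, and the README does not say which image formats are supported.

```python
import os
from typing import Iterable, List

# Suffixes treated as images here. This set is an assumption for the
# example, not a list taken from the model card.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".webp"}


def select_image_paths(paths: Iterable[str]) -> List[str]:
    """Keep paths with an image suffix, ordered by case-insensitive file name."""
    images = [p for p in paths if os.path.splitext(p)[1].lower() in IMAGE_SUFFIXES]
    return sorted(images, key=lambda p: os.path.basename(p).lower())
```

The selected paths can then be opened with `Image.open` and passed as the `images` list shown in the usage example.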
+ ## Capabilities
+
+ - **Code Understanding:**
+   - Analyzes code structure from screenshots
+   - Identifies patterns and anti-patterns
+   - Suggests contextual improvements
+
+ - **Image Processing:**
+   - Handles multiple image inputs
+   - Supports various image formats
+   - Advanced crop and resize strategies
+
+ - **Generation Features:**
+   - Context-aware code completion
+   - Documentation generation
+   - Code refactoring suggestions
+   - Bug identification and fixes
+
+ ## Requirements
+
+ - Apple Silicon hardware (M1/M2/M3)
+ - 32GB+ RAM recommended
+ - MLX framework
+ - Python 3.8+
+
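The requirements above can be checked before loading the model. The sketch below uses only the Python standard library. The function name `preflight_issues` is hypothetical, and the check assumes that Apple Silicon shows up as macOS on `arm64`, which is not true for a Python interpreter running under Rosetta. RAM is not checked.

```python
import importlib.util
import platform
import sys
from typing import List, Tuple


def preflight_issues(min_python: Tuple[int, int] = (3, 8)) -> List[str]:
    """Return human-readable problems; an empty list means all checks passed."""
    issues = []

    # Python version requirement stated in the README.
    if sys.version_info < min_python:
        issues.append(
            "Python %d.%d+ required, found %d.%d"
            % (min_python + tuple(sys.version_info[:2]))
        )

    # Assumption: native Apple Silicon reports Darwin + arm64.
    if not (platform.system() == "Darwin" and platform.machine() == "arm64"):
        issues.append("Apple Silicon (macOS on arm64) not detected")

    # find_spec returns None when a top-level package is not installed.
    if importlib.util.find_spec("mlx") is None:
        issues.append("the mlx package is not installed")

    return issues
```

A caller might print the returned list and exit early if it is not empty.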
+ ## Limitations
+
+ - Apple Silicon only (no CPU/CUDA support)
+ - Memory intensive for large images/codebases
+ - Visual understanding bounded by CLIP's capabilities
+ - Generation quality depends on input clarity
+
+ ## License
+
+ This model is released under the Mistral AI Non-Production License (MNPL). See [license details](https://mistral.ai/licences/MNPL-0.1.md).
+
+ ## Citation
+
+ ```bibtex
+ @software{codestral-vit,
+   author = {Mike Casale},
+   title = {Codestral-ViT: A Vision-Language Model for Code Generation},
+   year = {2023},
+   publisher = {Hugging Face},
+   url = {https://huggingface.co/casale-xyz/codestral-vit}
+ }
+ ```