---
language:
- en
library_name: transformers
---
# Model Card for Lulu

<!-- Provide a quick summary of what the model is/does. -->

## Model Details

- 245M parameters
- 4 layers
- Hidden size 1280
- 16 MoE experts
- 8 KV heads
- FP32 ONNX export, 2.3 GB

Trained on only 20B tokens of web-text data.

Fine-tuned on 80K UltraChat samples, with no LoRA or similar tricks.
### Model Description

# Lulu Local Android Demo

**Lulu Local** is an offline Android AI demo by **Open Machine**.

This release runs a local Lulu language model directly on an Android phone using **ONNX Runtime CPU inference**.

No cloud. No server. No GPU. No NPU. No internet required after install.

Runs on the Samsung A25 5G.

This is a raw, early proof that a custom local model can run directly on consumer Android hardware.

For the record, this is a literally unoptimized model: a heavy Python loop exported to a pure 2.3 GB FP32 ONNX graph. It currently runs on the CPU; we haven't touched the NPU, Vulkan, or anything else yet.

Generation currently takes about three minutes (a full forward pass over the 128-token context; as mentioned, it's unoptimized). The APK is available here, with GitHub repositories for the ONNX model and the Android app to follow. Again, no custom runtimes: just the standard ONNX format loaded straight into Android memory.

This runs on the phone's Exynos CPU. After a 10-minute chat, the battery level didn't move and no heating occurred.

We completed everything in the last two days: training, benchmarks, fine-tuning, and the ONNX runtime, all for less than €1000.
**Why this is interesting**

Most mobile LLM demos rely on one or more of the following:

- heavily quantized models
- GPU acceleration
- NPU acceleration
- server-side inference
- vendor SDKs
- cloud APIs

This demo is intentionally simple and direct:

- Android app
- ONNX Runtime
- local tokenizer
- local ONNX model
- CPU only

The current model is not small, not heavily optimized, and not using mobile accelerator tricks. That is the point of the demo.
**Model architecture note**

The Android build uses a stateful, single-token-step ONNX export.

The runtime loop is:

    token_id + position + cache tensors
    → ONNX step model
    → logits + updated cache tensors
    → sample next token
    → repeat

This replaced the earlier full-sequence ONNX path, which was much slower and used much more memory during generation.
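As an illustration, the step loop above can be sketched in Python with NumPy. This is a hedged sketch, not the app's actual code: `run_step` is a hypothetical stand-in for the real ONNX Runtime session call, wired to emit dummy "next token = token + 1" logits so the loop is runnable without the actual model.

```python
import numpy as np

VOCAB, LAYERS = 32000, 24
CACHE_SHAPE = (1, 16, 128, 80)   # [batch, kv_heads, max_ctx, head_dim]

def run_step(token_id, pos, cache):
    # Real version: feed token_id [1,1] int64, pos [1] int64, and the k/v
    # tensors into the ONNX step model; read back logits and updated cache.
    logits = np.zeros((1, VOCAB), dtype=np.float32)
    logits[0, (int(token_id[0, 0]) + 1) % VOCAB] = 1.0  # dummy logits
    return logits, cache

def generate(prompt_ids, max_new_tokens):
    # One k and one v tensor per layer, zero-initialized at the start.
    cache = {f"{kv}_{i}": np.zeros(CACHE_SHAPE, dtype=np.float32)
             for i in range(LAYERS) for kv in ("k", "v")}
    logits = None
    for pos, tok in enumerate(prompt_ids):   # prefill, one token at a time
        logits, cache = run_step(np.array([[tok]], dtype=np.int64),
                                 np.array([pos], dtype=np.int64), cache)
    out = list(prompt_ids)
    for pos in range(len(prompt_ids), len(prompt_ids) + max_new_tokens):
        nxt = int(logits.argmax())           # greedy pick of next token
        out.append(nxt)
        logits, cache = run_step(np.array([[nxt]], dtype=np.int64),
                                 np.array([pos], dtype=np.int64), cache)
    return out

print(generate([1, 2, 3], 4))   # → [1, 2, 3, 4, 5, 6, 7]
```

The key property is the one shown in the diagram: each call consumes a single token plus the cache and returns logits plus an updated cache, so generation never re-runs the full sequence.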
**Current ONNX interface**

Inputs:

- `token_id`: [1, 1] int64
- `pos`: [1] int64
- `k_0, v_0 ... k_23, v_23`

Outputs:

- `logits`: [1, 32000] float32
- `out_k_0, out_v_0 ... out_k_23, out_v_23`

Cache shape per K/V tensor: `[1, 16, 128, 80]`

Total runtime cache is about 31 MB.
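The quoted cache figure checks out against the shapes above; a quick sanity check, taking the 24 K/V pairs implied by `k_0 ... k_23` and 4 bytes per FP32 element:

```python
# Sanity check on the ~31 MB cache figure: 24 layers, K and V per layer,
# each tensor [1, 16, 128, 80] in FP32 (4 bytes per element).
elems_per_tensor = 1 * 16 * 128 * 80        # 163,840 elements
bytes_per_tensor = elems_per_tensor * 4     # 655,360 bytes
total_bytes = bytes_per_tensor * 24 * 2     # K and V for 24 layers
print(round(total_bytes / 1e6, 1))          # → 31.5 (MB)
```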
- **Developed by:** The Open Machine
- **Model type:** The Open Machine Transformers version
- **Language(s) (NLP):** English
- **License:** Apache 2.0

### Model Sources [optional]

<!-- Provide the basic links for the model. -->

- **Repository:** Will be provided in the upcoming days
- **Paper [optional]:** Coming Soon
- **Demo [optional]:** [More Information Needed]

## Uses

**Demo highlights**

- Fully offline Android assistant
- Runs on mobile CPU only
- Stateful single-token ONNX generation
- Live token-streaming UI
- Battery / RAM / speed display
- Cool / Turbo mode
  - Cool: 2 CPU threads
  - Turbo: 4 CPU threads
- No GPU acceleration
- No NPU acceleration
- No network calls required for inference

**Tested device**

Early demo testing was done on a Samsung A25-class Android phone.

Observed behavior:

- Model loads locally from app storage
- Generation works fully offline
- CPU-only generation is slow but usable for demo purposes
- Example speed observed around 0.20 tok/s, depending on temperature, prompt length, and thread mode

This is not yet optimized.
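To put the observed rate in perspective, a quick back-of-envelope (the 60-token reply length is just an illustrative choice, not a measured value):

```python
# What ~0.20 tok/s means in practice for the CPU-only FP32 demo.
rate = 0.20                       # observed tokens per second
seconds_per_token = 1 / rate      # 5 seconds per token
reply_tokens = 60                 # illustrative reply length
minutes = reply_tokens * seconds_per_token / 60
print(minutes)                    # → 5.0 (minutes for a 60-token reply)
```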

**Install**

Download the APK:

`LuluLocal-Android-CPU-fp32.apk`

On Android:

1. Open the APK file.
2. Allow install from unknown sources if Android asks.
3. Install.
4. Open Lulu.
5. Wait for the model to load.
6. Ask a question.

First load may take longer because the app prepares the local ONNX model.

### Direct Use

<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->

**Privacy**

Inference is local.

The demo is designed so prompts are processed on-device. No cloud inference is required.

If you build or modify the app, review the source code and Android permissions yourself.

### Out-of-Scope Use

**Important warning**

This is an experimental local AI demo.

The model may:

- hallucinate
- answer incorrectly
- repeat itself
- generate incomplete text
- be slow on low-end hardware
- consume significant battery and RAM

Do not use this for medical, legal, financial, emergency, or safety-critical decisions.

## Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

**Current limitations**

- CPU only
- FP32 ONNX model is large
- no NPU backend yet
- no GPU/Vulkan backend yet
- no quantization yet
- context length currently limited
- APK size is large
- generation quality is still experimental

## Model Card Authors [optional]

**Credits**

Built by Open Machine.

Lulu is an experimental local AI assistant project focused on running useful AI directly on personal devices.

## Model Card Contact

Open Machine
info@theopenmachine.com