---
language:
- en
library_name: transformers
---
# Model Card for Lulu

<!-- Provide a quick summary of what the model is/does. -->

## Model Details

- 245M parameters
- 4 layers
- Hidden size 1280
- 16 MoE experts
- 8 KV heads
- FP32 ONNX export, 2.3 GB

Trained on only 20B tokens of web-text data.

Fine-tuned on 80K UltraChat samples, with no LoRA or similar tricks.
### Model Description

# Lulu Local Android Demo

**Lulu Local** is an offline Android AI demo by **Open Machine**.

This release runs a local Lulu language model directly on an Android phone using **ONNX Runtime CPU inference**.

No cloud. No server. No GPU. No NPU. No internet required after install.

Runs on the Samsung A25 5G.

This is a raw, early proof that a custom local model can run directly on consumer Android hardware.

For the record, this is a literally unoptimized model: a heavy Python loop exported to a pure 2.3 GB FP32 ONNX graph. It currently runs on the CPU; we haven't touched the NPU, Vulkan, or anything else yet.

Generation currently takes about three minutes (a full forward pass over the 128-token context; as mentioned, it's unoptimized). The APK is available here, with GitHub repositories for the ONNX model and the Android app to follow. Again, no custom runtimes: just the standard ONNX format loaded straight into Android memory.

This runs on the phone's Exynos CPU. After a 10-minute chat, the battery level didn't move and no heating occurred.

We completed everything in the last two days: training, benchmarks, fine-tuning, and the ONNX runtime, all for less than €1000.
**Why this is interesting**

Most mobile LLM demos rely on one or more of the following:

- heavily quantized models
- GPU acceleration
- NPU acceleration
- server-side inference
- vendor SDKs
- cloud APIs

This demo is intentionally simple and direct:

- Android app
- ONNX Runtime
- local tokenizer
- local ONNX model
- CPU only

The current model is not small, not heavily optimized, and not using mobile accelerator tricks. That is the point of the demo.
**Model architecture note**

The Android build uses a stateful, single-token-step ONNX export.

The runtime loop is:

    token_id + position + cache tensors
    → ONNX step model
    → logits + updated cache tensors
    → sample next token
    → repeat

This replaced the earlier full-sequence ONNX path, which was much slower and used much more memory during generation.
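As an illustration, the step loop above can be sketched in Python with NumPy. This is a hedged sketch, not the app's actual code: `run_step` is a hypothetical stand-in for the real ONNX Runtime session call, wired to emit dummy "next token = token + 1" logits so the loop is runnable without the actual model.

```python
import numpy as np

VOCAB, LAYERS = 32000, 24
CACHE_SHAPE = (1, 16, 128, 80)   # [batch, kv_heads, max_ctx, head_dim]

def run_step(token_id, pos, cache):
    # Real version: feed token_id [1,1] int64, pos [1] int64, and the k/v
    # tensors into the ONNX step model; read back logits and updated cache.
    logits = np.zeros((1, VOCAB), dtype=np.float32)
    logits[0, (int(token_id[0, 0]) + 1) % VOCAB] = 1.0  # dummy logits
    return logits, cache

def generate(prompt_ids, max_new_tokens):
    # One k and one v tensor per layer, zero-initialized at the start.
    cache = {f"{kv}_{i}": np.zeros(CACHE_SHAPE, dtype=np.float32)
             for i in range(LAYERS) for kv in ("k", "v")}
    logits = None
    for pos, tok in enumerate(prompt_ids):   # prefill, one token at a time
        logits, cache = run_step(np.array([[tok]], dtype=np.int64),
                                 np.array([pos], dtype=np.int64), cache)
    out = list(prompt_ids)
    for pos in range(len(prompt_ids), len(prompt_ids) + max_new_tokens):
        nxt = int(logits.argmax())           # greedy pick of next token
        out.append(nxt)
        logits, cache = run_step(np.array([[nxt]], dtype=np.int64),
                                 np.array([pos], dtype=np.int64), cache)
    return out

print(generate([1, 2, 3], 4))   # → [1, 2, 3, 4, 5, 6, 7]
```

The key property is the one shown in the diagram: each call consumes a single token plus the cache and returns logits plus an updated cache, so generation never re-runs the full sequence.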
**Current ONNX interface**

Inputs:

- `token_id`: [1, 1] int64
- `pos`: [1] int64
- `k_0, v_0 ... k_23, v_23`

Outputs:

- `logits`: [1, 32000] float32
- `out_k_0, out_v_0 ... out_k_23, out_v_23`

Cache shape per K/V tensor: `[1, 16, 128, 80]`

Total runtime cache is about 31 MB.
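The quoted cache figure checks out against the shapes above; a quick sanity check, taking the 24 K/V pairs implied by `k_0 ... k_23` and 4 bytes per FP32 element:

```python
# Sanity check on the ~31 MB cache figure: 24 layers, K and V per layer,
# each tensor [1, 16, 128, 80] in FP32 (4 bytes per element).
elems_per_tensor = 1 * 16 * 128 * 80        # 163,840 elements
bytes_per_tensor = elems_per_tensor * 4     # 655,360 bytes
total_bytes = bytes_per_tensor * 24 * 2     # K and V for 24 layers
print(round(total_bytes / 1e6, 1))          # → 31.5 (MB)
```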
- **Developed by:** The Open Machine
- **Model type:** The Open Machine Transformers version
- **Language(s) (NLP):** English
- **License:** Apache 2.0

### Model Sources [optional]

<!-- Provide the basic links for the model. -->

- **Repository:** Will be provided in the upcoming days
- **Paper [optional]:** Coming Soon
- **Demo [optional]:** [More Information Needed]

## Uses

**Demo highlights**

- Fully offline Android assistant
- Runs on mobile CPU only
- Stateful single-token ONNX generation
- Live token-streaming UI
- Battery / RAM / speed display
- Cool / Turbo mode
  - Cool: 2 CPU threads
  - Turbo: 4 CPU threads
- No GPU acceleration
- No NPU acceleration
- No network calls required for inference

**Tested device**

Early demo testing was done on a Samsung A25-class Android phone.

Observed behavior:

- Model loads locally from app storage
- Generation works fully offline
- CPU-only generation is slow but usable for demo purposes
- Example speed observed around 0.20 tok/s, depending on temperature, prompt length, and thread mode

This is not yet optimized.
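To put the observed rate in perspective, a quick back-of-envelope (the 60-token reply length is just an illustrative choice, not a measured value):

```python
# What ~0.20 tok/s means in practice for the CPU-only FP32 demo.
rate = 0.20                       # observed tokens per second
seconds_per_token = 1 / rate      # 5 seconds per token
reply_tokens = 60                 # illustrative reply length
minutes = reply_tokens * seconds_per_token / 60
print(minutes)                    # → 5.0 (minutes for a 60-token reply)
```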

**Install**

Download the APK:

`LuluLocal-Android-CPU-fp32.apk`

On Android:

1. Open the APK file.
2. Allow install from unknown sources if Android asks.
3. Install.
4. Open Lulu.
5. Wait for the model to load.
6. Ask a question.

First load may take longer because the app prepares the local ONNX model.

### Direct Use

<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->

**Privacy**

Inference is local.

The demo is designed so prompts are processed on-device. No cloud inference is required.

If you build or modify the app, review the source code and Android permissions yourself.

### Out-of-Scope Use

**Important warning**

This is an experimental local AI demo.

The model may:

- hallucinate
- answer incorrectly
- repeat itself
- generate incomplete text
- be slow on low-end hardware
- consume significant battery and RAM

Do not use this for medical, legal, financial, emergency, or safety-critical decisions.

## Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

**Current limitations**

- CPU only
- FP32 ONNX model is large
- no NPU backend yet
- no GPU/Vulkan backend yet
- no quantization yet
- context length currently limited
- APK size is large
- generation quality is still experimental

## Model Card Authors [optional]

**Credits**

Built by Open Machine.

Lulu is an experimental local AI assistant project focused on running useful AI directly on personal devices.

## Model Card Contact

Open Machine
info@theopenmachine.com