Mistral AI EAP

company

mistralai-eap's activity

ArthurZ

posted an update about 1 month ago

Native tensor parallelism has landed in transformers! See https://github.com/huggingface/transformers/pull/34184. Thanks a lot to the torch team for their support!

Contributions are welcome to support more models! 🔥
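
For readers new to the idea, tensor parallelism splits a layer's weight matrix across devices so that each one computes only a slice of the output. The sketch below illustrates the column-wise variant in plain Python. It is a conceptual illustration only, not the code from the linked pull request, and the helper names (`matmul`, `split_columns`, `column_parallel_linear`) are made up for this example.

```python
def matmul(x, w):
    """Multiply a (rows x k) matrix by a (k x cols) matrix, both given as lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def split_columns(w, n_shards):
    """Split weight matrix w into n_shards contiguous column blocks."""
    n_cols = len(w[0])
    assert n_cols % n_shards == 0, "columns must divide evenly across shards"
    step = n_cols // n_shards
    return [[row[s * step:(s + 1) * step] for row in w] for s in range(n_shards)]


def column_parallel_linear(x, w, n_shards):
    """Each simulated device multiplies x by its own column block; outputs are concatenated."""
    partials = [matmul(x, shard) for shard in split_columns(w, n_shards)]
    # Stitch the per-device output slices back together, row by row.
    return [sum((p[i] for p in partials), []) for i in range(len(x))]


x = [[1.0, 2.0], [3.0, 4.0]]                      # toy activations
w = [[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 1.0, 3.0]]  # toy weight matrix
full = matmul(x, w)
sharded = column_parallel_linear(x, w, n_shards=2)
print(full == sharded)
```

In a real setup each shard lives on a different GPU and the concatenation is a collective communication step; here everything runs in one process purely to show that the sharded result matches the unsharded one.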

pstock

authored a paper 3 months ago

Pixtral 12B

Paper • 2410.07073 • Published Oct 9 • 62

timlacroix

authored a paper 3 months ago

Pixtral 12B

Paper • 2410.07073 • Published Oct 9 • 62

pcuenq

posted an update 8 months ago

OpenELM in Core ML

Apple recently released a set of efficient LLMs in sizes varying between 270M and 3B parameters. Their quality, according to benchmarks, is similar to OLMo models of comparable size, but they required half the pre-training tokens because they use layer-wise scaling, where the number of attention heads increases in deeper layers.
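
To make the layer-wise scaling idea concrete, here is a minimal sketch in which the head count grows linearly with depth. The linear interpolation rule and every number below (model width, head dimension, scaling range, layer count) are illustrative assumptions, not the published OpenELM configuration.

```python
def heads_per_layer(n_layers, d_model, d_head, alpha_min, alpha_max):
    """Return one attention-head count per layer, growing linearly with depth."""
    counts = []
    for i in range(n_layers):
        # Interpolate the scaling factor from alpha_min (first layer) to alpha_max (last layer).
        t = i / (n_layers - 1) if n_layers > 1 else 0.0
        alpha = alpha_min + (alpha_max - alpha_min) * t
        # Round to the nearest whole head, keeping at least one head per layer.
        counts.append(max(1, round(alpha * d_model / d_head)))
    return counts


# Hypothetical values chosen only for illustration.
heads = heads_per_layer(n_layers=8, d_model=1024, d_head=64, alpha_min=0.5, alpha_max=1.0)
print(heads)
```

Shallow layers get fewer heads and deep layers get more, so the parameter budget is spent unevenly across depth rather than uniformly.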

I converted these models to Core ML, for use on Apple Silicon, using this script: https://gist.github.com/pcuenca/23cd08443460bc90854e2a6f0f575084. The converted models are uploaded to this community on the Hub for anyone who wants to integrate them into their apps: corenet-community/openelm-core-ml-6630c6b19268a5d878cfd194

The conversion was done with the following parameters:
- Precision: float32.
- Sequence length: fixed to 128.

With swift-transformers (https://github.com/huggingface/swift-transformers), I'm getting about 56 tok/s with the 270M model on my M1 Max, and 6.5 tok/s with the largest (3B) model. These speeds could be improved by converting to float16. However, there's some precision loss somewhere, and generation doesn't work in float16 mode yet. I'm looking into this and will keep you posted! Or take a look at this issue if you'd like to help: https://github.com/huggingface/swift-transformers/issues/95

I'm also looking at optimizing inference using an experimental KV cache in swift-transformers. It's a bit tricky because the layers have a varying number of attention heads, but I'm curious to see how much this feature can speed up inference in this model family :)
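
One way to see why the cache is tricky: with a different head count in every layer, the key and value tensors no longer share a single shape, so each layer needs its own buffer. The sketch below computes per-layer cache shapes and the total size for the fixed sequence length of 128 mentioned above. The head counts, head dimension, and function names are hypothetical, and this is not the swift-transformers implementation.

```python
def kv_cache_shapes(heads_per_layer, d_head, seq_len, batch=1):
    """Return the (batch, heads, seq_len, d_head) shape of each layer's key (or value) buffer."""
    return [(batch, h, seq_len, d_head) for h in heads_per_layer]


def kv_cache_bytes(shapes, bytes_per_value=4):
    """Total bytes for keys plus values across all layers (4 bytes per value for float32)."""
    total = 0
    for b, h, s, d in shapes:
        total += 2 * b * h * s * d * bytes_per_value  # factor 2: one key and one value buffer
    return total


# Hypothetical per-layer head counts; a uniform model would repeat one number.
shapes = kv_cache_shapes(heads_per_layer=[8, 10, 12, 16], d_head=64, seq_len=128)
size = kv_cache_bytes(shapes)
print(shapes, size)
```

Because every entry in `shapes` differs, the cache cannot be stored as one stacked tensor with a leading layer dimension; a list of per-layer buffers (or padding to the largest head count) is needed instead.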

Regarding the instruct fine-tuned models, I don't know which chat template was used. The models use the Llama 2 tokenizer, but neither the Llama 2 chat template nor the default one from the Alignment Handbook, which was used for training, is recognized. Any ideas on this are welcome!

ArthurZ

authored a paper 9 months ago

RecurrentGemma: Moving Past Transformers for Efficient Open Language Models

Paper • 2404.07839 • Published Apr 11 • 43

ArthurZ

posted an update 10 months ago

Mamba is now available in transformers. Thanks to @tridao and @albertgu for this brilliant model, and for the amazing mamba-ssm kernels powering it! 🚀
Check out the collection here:
state-spaces/transformers-compatible-mamba-65e7b40ab87e5297e45ae406