Update README.md
README.md CHANGED
@@ -21,7 +21,7 @@ tags:
 
 Today, we are officially open-sourcing Ring-mini-linear-2.0.
 
-This model continues to employ a hybrid architecture that combines linear attention and standard attention mechanisms, striking a balance between performance and efficiency. Inheriting the efficient MoE (Mixture-of-Experts) design from the Ling 2.0 series, and through architectural optimizations such as a 1/32 expert activation ratio and MTP layers, Ring-mini-linear achieves the performance of an ~8B dense model while activating only 1.4B of its 16B total parameters. This model was converted from [Ling-mini-base-2.0](https://huggingface.co/inclusionAI/Ling-mini-base-2.0-20T), continually trained on an additional 600B tokens. In terms of performance, the hybrid linear model is comparable in overall performance to standard attention models of a similar size (e.g., Ring-mini-2) and surpasses other open-source MoE and Dense models of the same class on several challenging benchmarks.
+This model continues to employ a hybrid architecture that combines linear attention and standard attention mechanisms, striking a balance between performance and efficiency. Inheriting the efficient MoE (Mixture-of-Experts) design from the Ling 2.0 series, and through architectural optimizations such as a 1/32 expert activation ratio and MTP layers, Ring-mini-linear achieves the performance of an ~8B dense model while activating only 1.4B of its 16B total parameters. This model was converted from [Ling-mini-base-2.0](https://huggingface.co/inclusionAI/Ling-mini-base-2.0-20T), continually trained on an additional 600B tokens. In terms of performance, the hybrid linear model is comparable in overall performance to standard attention models of a similar size (e.g., Ring-mini-2) and surpasses other open-source MoE and Dense models of the same class on several challenging benchmarks. Additionally, we support a 512k long context window, achieved by extrapolating the window 4x using YaRN. This provides superior speed, especially on tasks involving long inputs and outputs.
 
 <div style="display: flex; justify-content: center;">
 <div style="text-align: center;">
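The sentence added in this hunk highlights the hybrid linear/standard attention layout and the 512k context window. As a rough, non-authoritative illustration of what a "hybrid architecture" means here, the toy sketch below interleaves an O(T^2) softmax-attention layer with O(T) linear-attention layers. The layer count, the interleaving ratio (`full_attn_interval`), the feature map, and the single-head, non-causal simplification are all assumptions made for this sketch and do not describe Ring-mini-linear-2.0's actual implementation.

```python
# Toy sketch of a hybrid attention stack (illustration only; the real
# Ring-mini-linear-2.0 layer layout and kernels are not specified here).
import numpy as np

def softmax_attention(q, k, v):
    # Standard attention: cost grows quadratically with sequence length T.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def linear_attention(q, k, v):
    # Kernelized linear attention: associate phi(k)^T v first, so the cost
    # grows linearly with sequence length T.
    phi = lambda x: np.maximum(x, 0.0) + 1e-6   # simple positive feature map (assumption)
    kv = phi(k).T @ v                           # (d, d) summary of keys/values
    z = phi(k).sum(axis=0)                      # (d,) normalizer
    return (phi(q) @ kv) / (phi(q) @ z)[:, None]

def hybrid_stack(x, num_layers=8, full_attn_interval=4):
    # Assumed interleaving: one softmax-attention layer every
    # `full_attn_interval` layers, linear attention everywhere else.
    for i in range(num_layers):
        attn = softmax_attention if (i + 1) % full_attn_interval == 0 else linear_attention
        x = x + attn(x, x, x)                   # residual connection
    return x

tokens = np.random.randn(16, 64)                # (seq_len, hidden_dim)
print(hybrid_stack(tokens).shape)               # -> (16, 64)
```

In a stack like this, most layers avoid the quadratic attention cost, which is where the speed advantage on long inputs and outputs comes from, while the occasional full-attention layers help retain the quality of standard attention.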
@@ -182,7 +182,7 @@ from vllm import LLM, SamplingParams
 
 tokenizer = AutoTokenizer.from_pretrained("inclusionAI/Ring-mini-linear-2.0")
 
-sampling_params = SamplingParams(temperature=0.
+sampling_params = SamplingParams(temperature=0.6, top_p=1.0, max_tokens=16384)
 
 llm = LLM(model="inclusionAI/Ring-mini-linear-2.0", dtype='bfloat16', enable_prefix_caching=False, max_num_seqs=128)
 prompt = "Give me a short introduction to large language models."
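This hunk only updates the `SamplingParams` line of the README's vLLM example. For completeness, a typical continuation of that snippet (not shown in this diff, so treat it as an assumed sketch rather than the README's exact code) would format the prompt with the chat template and run generation, reusing the `tokenizer`, `llm`, `prompt`, and `sampling_params` defined above:

```python
# Hypothetical continuation of the snippet above (not part of this diff).
messages = [{"role": "user", "content": prompt}]

# Format the user message with the model's chat template.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

# Run generation with the sampling parameters set in the README.
outputs = llm.generate([text], sampling_params)
print(outputs[0].outputs[0].text)
```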