🤗 Hugging Face • 🤖 ModelScope • 👾 Wisemodel • 💬 WeChat • 📜 Tech Report

Project Introduction

Skywork-MoE is a high-performance mixture-of-experts (MoE) model with 146 billion parameters, 16 experts, and 22 billion activated parameters. This model is initialized from the pre-existing dense checkpoints of our Skywork-13B model.

We introduce two innovative techniques: Gating Logit Normalization, which enhances expert diversification, and Adaptive Auxiliary Loss Coefficients, which allow for layer-specific adjustment of auxiliary loss coefficients.
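The paragraph above names Gating Logit Normalization without giving a formula. The sketch below shows one plausible reading, and it is an assumption rather than the Skywork implementation: each token's gating logits are standardized to zero mean and unit variance, multiplied by a scale factor, and then passed through softmax. The names `normalized_gate`, `scale`, and `eps` are illustrative only.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def normalized_gate(logits, scale=1.0, eps=1e-6):
    """Standardize gating logits, rescale them, then apply softmax.

    Assumption: `scale` is a hypothetical hyperparameter that controls how
    sharply the gate separates experts; `eps` avoids division by zero.
    """
    mean = sum(logits) / len(logits)
    variance = sum((z - mean) ** 2 for z in logits) / len(logits)
    std = math.sqrt(variance) + eps
    standardized = [scale * (z - mean) / std for z in logits]
    return softmax(standardized)


# Toy gating logits for one token over four experts.
small_logits = [0.1, 0.2, 0.3, 0.4]
large_logits = [1.0, 2.0, 3.0, 4.0]

soft_gate = normalized_gate(small_logits, scale=1.0)
sharp_gate = normalized_gate(small_logits, scale=4.0)
rescaled_gate = normalized_gate(large_logits, scale=1.0)

print("scale=1:", soft_gate)
print("scale=4:", sharp_gate)
```

Under this reading, the gate output no longer depends on the raw magnitude of the logits (the two toy inputs give nearly identical distributions), and the scale factor alone decides how peaked the distribution over experts is.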

Skywork-MoE demonstrates comparable or superior performance to models with more total parameters or more activated parameters, such as Grok-1, DBRX, Mixtral 8x22B, and DeepSeek-V2.

News and Updates

  • 2024.6.3 We release the Skywork-MoE-Base model.

Download URL

| Model | HuggingFace Model | ModelScope Model | Wisemodel Model |
|---|---|---|---|
| Skywork-MoE-Base | 🤗 Skywork-MoE-Base | 🤖 Skywork-MoE-Base | 👾 Skywork-MoE-Base |
| Skywork-MoE-Base-FP8 | 🤗 Skywork-MoE-Base-FP8 | 🤖 Skywork-MoE-Base-FP8 | 👾 Skywork-MoE-Base-FP8 |
| Skywork-MoE-Chat | 😊 Coming Soon | 🤖 | 👾 |

Benchmark Results

We evaluated the Skywork-MoE-Base model on a range of popular benchmarks, including C-Eval, MMLU, CMMLU, GSM8K, MATH, and HumanEval. [Image: benchmark results]

Demonstration of Hugging Face Model Inference

Base Model Inference

Inference with the Skywork-MoE-Base model (16x13B) using Hugging Face Transformers requires 8x A100/A800 GPUs or a higher hardware configuration.

from transformers import AutoModelForCausalLM, AutoTokenizer

# The repo ships custom model code, hence trust_remote_code=True.
# device_map='auto' spreads the weights across the available GPUs.
model = AutoModelForCausalLM.from_pretrained("Skywork/Skywork-MoE-Base", trust_remote_code=True, device_map='auto')
tokenizer = AutoTokenizer.from_pretrained("Skywork/Skywork-MoE-Base", trust_remote_code=True)

# Prompt: "The capital of Shaanxi is Xi'an"
inputs = tokenizer('陕西的省会是西安', return_tensors='pt').to(model.device)
response = model.generate(inputs.input_ids, max_length=128)
print(tokenizer.decode(response.cpu()[0], skip_special_tokens=True))

# Prompt: "The capital of Shaanxi is Xi'an, the capital of Gansu is Lanzhou, the capital of Henan is Zhengzhou"
inputs = tokenizer('陕西的省会是西安,甘肃的省会是兰州,河南的省会是郑州', return_tensors='pt').to(model.device)
response = model.generate(inputs.input_ids, max_length=128)
print(tokenizer.decode(response.cpu()[0], skip_special_tokens=True))
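The second prompt above strings together several statements of the same shape, which gives a base (non-chat) model a pattern to continue. As a small illustration, the hypothetical helper below builds that kind of prompt from a list of pairs. The function name, the template, and the English wording are assumptions for this sketch and are not part of the Skywork code.

```python
def build_pattern_prompt(pairs, template="The capital of {region} is {capital}"):
    """Join same-shaped statements so a base model can continue the pattern."""
    return ", ".join(template.format(region=r, capital=c) for r, c in pairs)


examples = [("Shaanxi", "Xi'an"), ("Gansu", "Lanzhou"), ("Henan", "Zhengzhou")]
prompt = build_pattern_prompt(examples)
print(prompt)
```

The resulting string can be passed to the tokenizer in place of the literal prompts used above.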

Chat Model Inference

Coming soon.

Demonstration of vLLM Model Inference

Quickstart with vLLM

We provide a way to quickly deploy the Skywork-MoE-Base model with vLLM.

With FP8 precision, Skywork-MoE-Base can run on just 8x RTX 4090 GPUs.

The source code is available in Skywork's vLLM fork (https://github.com/SkyworkAI/vllm).

The FP8 model is available as Skywork-MoE-Base-FP8.
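A rough back-of-the-envelope check helps explain why FP8 makes the 8-GPU RTX 4090 setup feasible. The sketch below counts weight memory only (no KV cache, activations, or framework overhead) and assumes 24 GB per RTX 4090, so treat it as an estimate rather than a guarantee.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Memory needed for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


TOTAL_PARAMS = 146e9          # total parameter count stated in this model card
RTX_4090_MEMORY_GB = 24       # assumption: 24 GB per RTX 4090
NUM_GPUS = 8

fp8_gb = weight_memory_gb(TOTAL_PARAMS, 1)    # 1 byte per parameter
bf16_gb = weight_memory_gb(TOTAL_PARAMS, 2)   # 2 bytes per parameter
cluster_gb = NUM_GPUS * RTX_4090_MEMORY_GB

print(f"FP8 weights:  {fp8_gb:.0f} GB")
print(f"BF16 weights: {bf16_gb:.0f} GB")
print(f"8x RTX 4090:  {cluster_gb} GB total")
print("FP8 fits:", fp8_gb < cluster_gb, "| BF16 fits:", bf16_gb < cluster_gb)
```

Under these assumptions the FP8 weights (about 146 GB) fit within the 192 GB available across eight cards, while 16-bit weights (about 292 GB) do not.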

Based on a local environment

FP8 on the RTX 4090 is currently supported only in PyTorch nightly builds, so you need to install a nightly build or a newer release of PyTorch.

# for cuda12.1
pip3 install --pre torch pytorch-triton --index-url https://download.pytorch.org/whl/nightly/cu121
# for cuda12.4
pip3 install --pre torch pytorch-triton --index-url https://download.pytorch.org/whl/nightly/cu124

Some other dependencies also need to be installed:

MAX_JOBS=8 pip3 install git+https://github.com/facebookresearch/xformers.git # this build can take a long time
pip3 install vllm-flash-attn --no-deps

Then clone the vLLM fork provided by Skywork:

git clone https://github.com/SkyworkAI/vllm.git
cd vllm

Then compile and install vllm:

pip3 install -r requirements-build.txt
pip3 install -r requirements-cuda.txt
MAX_JOBS=8 python3 setup.py install
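After the installation steps above, a quick sanity check can confirm that the environment looks usable before loading the model. The sketch below is an assumption about what is worth checking, not an official Skywork script: it reports whether `torch` and `vllm` are importable, whether the installed PyTorch exposes a float8 dtype, and what the first GPU's compute capability is.

```python
import importlib.util


def fp8_environment_report():
    """Collect a few facts relevant to running FP8 inference on this machine."""
    report = {
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "vllm_installed": importlib.util.find_spec("vllm") is not None,
    }
    if not report["torch_installed"]:
        return report

    import torch

    report["torch_version"] = torch.__version__
    # float8 dtypes are only exposed by sufficiently recent PyTorch builds.
    report["has_float8_dtype"] = hasattr(torch, "float8_e4m3fn")
    report["cuda_available"] = torch.cuda.is_available()
    if report["cuda_available"]:
        report["gpu_count"] = torch.cuda.device_count()
        # Returned as a (major, minor) tuple for device 0.
        report["compute_capability"] = torch.cuda.get_device_capability(0)
    return report


env_report = fp8_environment_report()
for key, value in env_report.items():
    print(f"{key}: {value}")
```

If `has_float8_dtype` is reported as False, the installed PyTorch is likely too old for the FP8 path described here.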

Based on Docker

You can use the Docker image provided by Skywork to run vLLM directly:

docker pull registry.cn-wulanchabu.aliyuncs.com/triple-mu/skywork-moe-vllm:v1

Then start the container, setting model_path and workspace to your local model directory and working directory:

docker run \
    --runtime nvidia \
    --gpus all \
    -it \
    --rm \
    --shm-size=1t \
    --ulimit memlock=-1 \
    --privileged=true \
    --ulimit stack=67108864 \
    --ipc=host \
    -v ${model_path}:/Skywork-MoE-Base-FP8 \
    -v ${workspace}:/workspace \
    registry.cn-wulanchabu.aliyuncs.com/triple-mu/skywork-moe-vllm:v1

Now, you can run the Skywork MoE model for fun!

Text Completion

from vllm import LLM, SamplingParams

model_path = 'Skywork/Skywork-MoE-Base-FP8'
prompts = [
    "The president of the United States is",
    "The capital of France is",
]

sampling_params = SamplingParams(temperature=0.3, max_tokens=256)

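# NOTE: the argument list of this call is truncated in the source. The arguments
# below are assumptions inferred from the surrounding text (FP8 checkpoint,
# 8 GPUs, custom model code); adjust them to your setup.
llm = LLM(
    model=model_path,
    quantization='fp8',
    tensor_parallel_size=8,
    trust_remote_code=True,
)
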
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
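The example above sets `temperature=0.3` in `SamplingParams`. As a rough illustration of what that parameter does (this is a conceptual sketch, not vLLM's actual implementation), dividing the logits by a temperature below 1 before softmax concentrates probability on the most likely tokens. The function name and the toy logits are made up for this example.

```python
import math


def apply_temperature(logits, temperature):
    """Return softmax(logits / temperature) as a list of probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens
probs_default = apply_temperature(toy_logits, 1.0)
probs_low_temp = apply_temperature(toy_logits, 0.3)

print("temperature=1.0:", probs_default)
print("temperature=0.3:", probs_low_temp)
```

With the lower temperature, the top candidate receives a larger share of the probability mass, so sampled completions vary less from run to run.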

Declaration and License Agreement


We hereby declare that the Skywork model should not be used for any activities that pose a threat to national or societal security or engage in unlawful actions. Additionally, we request users not to deploy the Skywork model for internet services without appropriate security reviews and records. We hope that all users will adhere to this principle to ensure that technological advancements occur in a regulated and lawful environment.

We have done our utmost to ensure the compliance of the data used during the model's training process. However, despite our extensive efforts, due to the complexity of the model and data, there may still be unpredictable risks and issues. Therefore, if any problems arise as a result of using the Skywork open-source model, including but not limited to data security issues, public opinion risks, or any risks and problems arising from the model being misled, abused, disseminated, or improperly utilized, we will not assume any responsibility.

License Agreement

Community use of the Skywork model requires the Skywork Community License. The Skywork model supports commercial use. If you plan to use the Skywork model or its derivatives for commercial purposes, you must abide by the terms and conditions of the Skywork Community License.

Contact Us and Citation

If you find our work helpful, please feel free to cite our papers:

Tianwen Wei, Bo Zhu, Liang Zhao, Cheng Cheng, Biye Li, Weiwei Lü, Peng Cheng, Jianhao Zhang, Xiaoyu Zhang, Liang Zeng, Xiaokun Wang, Yutuan Ma, Rui Hu, Shuicheng Yan, Han Fang, Yahui Zhou. Skywork-MoE: A Deep Dive into Training Techniques for Mixture-of-Experts Language Models.

Liang Zhao, Tianwen Wei, Liang Zeng, Cheng Cheng, Liu Yang, Peng Cheng, Lijie Wang, Chenxia Li, Xuejie Wu, Bo Zhu, and others. LongSkywork: A Training Recipe for Efficiently Extending Context Length in Large Language Models. arXiv preprint arXiv:2406.00605.