---
license: mit
pipeline_tag: text-generation
library_name: transformers
tags:
- safetensors
---
# Arsh-V1 Model Card
We introduce our first reasoning and chat model, built on Phi weights. The Arsh architecture incorporates recent techniques that make the model well suited to performant, extensible chat applications, and it was trained on high-quality datasets to ensure reliable and robust performance.

We continue to ship frequent updates and enhancements that blend ideas from multiple architectures, and we remain dedicated to refining Arsh-V1 further.
## Model Overview

The Arsh-V1 model belongs to the causal language model family and is designed for text generation tasks. It comprises a stack of transformer decoder blocks and is optimized for high throughput and low latency in language generation.
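A minimal loading and generation sketch is shown below. The repository id is a placeholder, and because `ArshForCausalLM` is a custom architecture, `trust_remote_code=True` is assumed to be required:

```python
# Minimal sketch: load the model with transformers and generate a reply.
# "arsh/Arsh-V1" is a placeholder repository id; adjust to the published one.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "arsh/Arsh-V1"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

prompt = "Explain rotary position embeddings in one paragraph."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```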
### Architecture
- Model Name: ArshForCausalLM
- Model Type: Llama
- Hidden Size: 5120
- Number of Layers: 40
- Attention Heads: 40
- Maximum Sequence Length: 16384
- Vocabulary Size: 100352
- Activation Function: SiLU (Sigmoid Linear Unit)
This architecture enables the model to capture complex language patterns, making it suitable for conversation and reasoning tasks.
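For reference, the numbers above roughly correspond to a configuration like the following sketch. The import path and exact field names are assumptions (Llama-style naming) and may differ from the actual modeling code:

```python
# Sketch of a configuration matching the architecture listed above.
# Import path and field names are assumed; adjust to the shipped modeling code.
from modeling_arsh import ArshConfig  # hypothetical import path

config = ArshConfig(
    hidden_size=5120,
    num_hidden_layers=40,
    num_attention_heads=40,
    max_position_embeddings=16384,
    vocab_size=100352,
    hidden_act="silu",
)
```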
## Key Components
### 1. RMS Normalization

The `ArshRMSNorm` layer is employed for normalization within the model, enhancing training stability and speed.

```python
import torch

# Normalize a batch of hidden states with RMS normalization.
norm = ArshRMSNorm(hidden_size=5120)
x = torch.randn(1, 10, 5120)   # (batch, sequence, hidden_size)
output = norm(x)               # same shape as the input
```
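As a rough illustration of what RMS normalization computes, here is a sketch of the standard formulation (not necessarily the exact `ArshRMSNorm` implementation):

```python
import torch
import torch.nn as nn

class RMSNormSketch(nn.Module):
    """Standard RMSNorm: scale x by the reciprocal of its root mean square."""
    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Mean of squares over the hidden dimension, then rsqrt for scaling.
        variance = x.pow(2).mean(-1, keepdim=True)
        return self.weight * x * torch.rsqrt(variance + self.eps)
```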
### 2. Rotary Position Embedding

`ArshRotaryEmbedding` leverages rotary embeddings for efficient positional encoding in transformer architectures.

```python
# Build a rotary-embedding module from the model configuration.
# It typically produces (cos, sin) tables that are applied to queries and keys.
config = ArshConfig(max_position_embeddings=16384)
rotary_emb = ArshRotaryEmbedding(config)
```
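The sketch below illustrates the standard way rotary embeddings are applied to query and key vectors (the general technique, not necessarily the exact Arsh implementation):

```python
import torch

def rotate_half(x: torch.Tensor) -> torch.Tensor:
    # Swap and negate the two halves of the last dimension: (a, b) -> (-b, a).
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rotary(q, k, cos, sin):
    # Rotate query/key vectors by position-dependent angles, encoding
    # relative positions directly in the attention dot products.
    q_rot = q * cos + rotate_half(q) * sin
    k_rot = k * cos + rotate_half(k) * sin
    return q_rot, k_rot
```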
### 3. Gated MLP Block

The `ArshMLP` component is responsible for non-linear transformations, incorporating a gating mechanism.

```python
import torch

# Apply the gated feed-forward block to a batch of hidden states
# (config is assumed to carry hidden_size=5120 as listed above).
mlp = ArshMLP(config)
x = torch.randn(1, 10, 5120)
output = mlp(x)
```
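A sketch of the usual gated (SwiGLU-style) feed-forward pattern this component is described as using; the projection names follow the common Llama convention and are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedMLPSketch(nn.Module):
    """Gated MLP: down_proj(silu(gate_proj(x)) * up_proj(x))."""
    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SiLU-gated expansion followed by a projection back to hidden_size.
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))
```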
### 4. Multi-Head Attention

The `ArshAttention` layer implements multi-head attention with support for rotary positional embeddings, enhancing context understanding.

```python
import torch

# Run a single attention layer over a batch of hidden states. Depending on
# the implementation, rotary position embeddings and an attention mask may
# also need to be passed to forward().
attention_layer = ArshAttention(config, layer_idx=0)
hidden_states = torch.randn(1, 10, 5120)
attn_output, attn_weights = attention_layer(hidden_states)
```
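For intuition, the core of multi-head attention is the scaled dot-product shown below (a generic sketch of the standard computation, not the Arsh-specific code):

```python
import math
import torch

def attention_core_sketch(q, k, v, mask=None):
    # q, k, v: (batch, heads, seq, head_dim). Standard scaled dot-product.
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = scores.softmax(dim=-1)
    return weights @ v, weights
```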
### 5. Transformer Decoder Layer

The `ArshDecoderLayer` integrates the self-attention and feed-forward components in series.

```python
# Apply one full decoder block (attention + gated MLP with residuals).
# Depending on the implementation, forward() may return a tuple whose first
# element is the updated hidden states.
decoder_layer = ArshDecoderLayer(config, layer_idx=0)
hidden_states = decoder_layer(hidden_states)
```
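Conceptually, a pre-norm decoder block of this kind combines the components above as follows (a sketch of the standard Llama-style residual structure, assumed here rather than taken from the Arsh code):

```python
# Sketch of a pre-norm decoder block built from the components above.
def decoder_block(hidden_states, input_norm, attention, post_attn_norm, mlp):
    # Self-attention sub-block with a residual connection.
    residual = hidden_states
    hidden_states = attention(input_norm(hidden_states))
    hidden_states = residual + hidden_states

    # Gated-MLP sub-block with a second residual connection.
    residual = hidden_states
    hidden_states = mlp(post_attn_norm(hidden_states))
    return residual + hidden_states
```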
## License

The model is released under the MIT license, which permits modification, redistribution, fine-tuning, and commercial use. We appreciate your contributions to making Arsh-V1 better!