Report

by SafeAI-HF - opened Apr 24

Apr 24

I am requesting a moderation review of the Hugging Face repository i3-lab/i3-200m-v2.

The model card appears to make several claims that do not match the implementation, including:

A reported parameter count of roughly 169.85M parameters
A reported perplexity of 5.2
Claims of efficient RWKV/Mamba-style architecture
A claimed BPE vocabulary size of 32,000

Based on the available code and configuration, these claims appear inconsistent with the actual architecture, tokenizer, and training setup.

Parameter Count Appears Significantly Overstated

The repository claims a model size of approximately 169.85M parameters, using a hidden dimension of 512 across 16 layers.

However, the implementation appears to use low-rank projections through a LoRPtLinear class with a fixed rank of 64. This greatly reduces the number of parameters compared with standard dense linear layers.

Feed-Forward Layers

A standard feed-forward layer with expansion factor 4 would use a 512 → 2048 → 512 structure.

A standard dense version would contain approximately:

(512 × 2048 + 2048) + (2048 × 512 + 512)
= 2,099,712 parameters

But with low-rank projections using rank 64, the count is closer to:

(2048 × 64 + 64 × 512 + 2048)
+
(512 × 64 + 64 × 2048 + 512)
= 330,240 parameters

This is far smaller than a standard dense feed-forward block.
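The two counts can be recomputed from the layer shapes alone. The sketch below is illustrative only: `dense_params` and `low_rank_params` are hypothetical helpers, not code from the repository, and it assumes each factorized layer is a pair of matrices with a single bias on the output side, which may not match the actual LoRPtLinear layout.

```python
def dense_params(d_in: int, d_out: int, bias: bool = True) -> int:
    """Parameter count of a standard dense linear layer."""
    return d_in * d_out + (d_out if bias else 0)


def low_rank_params(d_in: int, d_out: int, rank: int, bias: bool = True) -> int:
    """Parameter count of a factorized layer: d_in -> rank -> d_out.

    Assumption: one bias vector on the output, none on the rank-sized bottleneck.
    """
    return d_in * rank + rank * d_out + (d_out if bias else 0)


D_MODEL, D_FF, RANK = 512, 2048, 64

# Feed-forward block: up-projection followed by down-projection.
ffn_dense = dense_params(D_MODEL, D_FF) + dense_params(D_FF, D_MODEL)
ffn_low_rank = (
    low_rank_params(D_MODEL, D_FF, RANK) + low_rank_params(D_FF, D_MODEL, RANK)
)

print(f"dense FFN:    {ffn_dense:,}")
print(f"low-rank FFN: {ffn_low_rank:,}")
print(f"ratio:        {ffn_dense / ffn_low_rank:.1f}x")
```

Under these assumptions the low-rank block has roughly 6x fewer parameters than the dense one.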

Attention Layers

The attention projections also appear to use low-rank forms rather than full dense 512 × 512 matrices.

A rough estimate for the low-rank Q, K, V, and output projections is:

3 × (512 × 64 + 64 × 512) + 65,536
= 262,144 parameters

Estimated Total

Using the apparent vocabulary size of 4,466, the embedding table is approximately:

4,466 × 512 = 2,286,592 parameters

Across 16 layers, a rough upper estimate is:

16 × (330,240 + 262,144 + additional state parameters)
≈ 10M–12M layer parameters

That puts the likely model size around 12M–15M parameters, not 169.85M.

This suggests the reported parameter count may be inflated by more than an order of magnitude.
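A back-of-the-envelope total can be scripted as follows. This is a sketch under stated assumptions, not a count taken from a checkpoint: it uses the shapes quoted in this report, ignores normalization and recurrent-state parameters, and treats the presence of an untied output head as an open question. All variable names are illustrative.

```python
VOCAB, D_MODEL, D_FF, RANK, N_LAYERS = 4466, 512, 2048, 64, 16
CLAIMED = 169_850_000  # figure from the model card

embedding = VOCAB * D_MODEL

# Low-rank feed-forward block (two factorized projections plus output biases).
ffn = 2 * (D_MODEL * RANK + RANK * D_FF) + D_FF + D_MODEL

# Low-rank Q, K, V and output projections, assumed bias-free.
attn = 4 * (D_MODEL * RANK + RANK * D_MODEL)

layers = N_LAYERS * (ffn + attn)

# Lower bound: tied (or absent) output head.
total_tied = embedding + layers
# Assumption: an untied output head would add another vocab x d_model matrix.
total_untied = total_tied + embedding

for name, value in [("tied head", total_tied), ("untied head", total_untied)]:
    print(f"{name:>12}: {value:>12,}  (claimed / estimate = {CLAIMED / value:.1f}x)")
```

The definitive check is to load the published checkpoint and sum the element counts of its tensors; the estimate here only shows that the quoted shapes cannot plausibly reach the claimed size.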

Reported Perplexity Appears Unlikely Given the Training Setup

The model card reports a final loss of about 1.6 and a perplexity of roughly 5.2. (Strictly, exp(1.6) ≈ 4.95; a perplexity of 5.2 corresponds to a loss closer to 1.65.)

However, the training logs reportedly show a throughput of around 300 tokens per second over approximately 2 hours.

That means the model would have processed roughly:

300 tokens/sec × 7,200 sec = 2,160,000 tokens

A model of this size trained on only about 2.16M tokens would be very unlikely to reach a perplexity of 5.2 on broad datasets such as Wikitext and TinyStories unless the evaluation setup was unusually narrow, leaked, or otherwise not comparable.
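The arithmetic behind this concern is easy to reproduce. The snippet below is a sketch that assumes the loss is a mean per-token cross-entropy in nats, so that perplexity is exp(loss); the throughput and duration are the figures quoted above, and the 15M parameter figure is this report's own estimate, not a measured value.

```python
import math

TOKENS_PER_SEC = 300
TRAIN_SECONDS = 2 * 60 * 60  # approximately 2 hours

tokens_seen = TOKENS_PER_SEC * TRAIN_SECONDS


def perplexity(loss_nats: float) -> float:
    """Perplexity implied by a mean per-token cross-entropy loss in nats."""
    return math.exp(loss_nats)


print(f"tokens seen:           {tokens_seen:,}")
print(f"perplexity at loss 1.6: {perplexity(1.6):.2f}")
print(f"loss for perplexity 5.2: {math.log(5.2):.3f}")

# Tokens per parameter, for the claimed size and for this report's estimate.
for label, n_params in [("claimed 169.85M", 169_850_000), ("estimated 15M", 15_000_000)]:
    print(f"tokens per parameter ({label}): {tokens_seen / n_params:.3f}")
```

Either way the model would have seen well under one token per parameter, which is why the reported metric needs a documented evaluation setup to be credible.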

At minimum, the repository should provide:

The exact evaluation script
The evaluation dataset split
Whether training and evaluation data overlapped
The tokenizer used during evaluation
The checkpoint used for the reported metric

Without that information, the perplexity claim does not appear well supported.

Efficiency Claims Do Not Match the Implementation

The repository describes the model as using a high-efficiency RWKV/Mamba-style hybrid.

However, the implementation appears to use a standard Python/PyTorch loop over the sequence length:

for t in range(T):
    x_t = x[:, t, :]
    h = self.w_mix * h + (1 - self.w_mix) * x_t

This is not equivalent to an optimized Mamba-style scan implementation.

Efficient Mamba implementations rely on custom kernels or parallel scan methods to avoid slow step-by-step execution over the sequence dimension. A plain loop over T forces sequential processing and limits GPU parallelism.

For a sequence length of 256 and 16 layers, this means 256 × 16 = 4,096 sequential recurrence steps per forward pass, each one waiting on the previous step. That makes the implementation much less efficient than the model card suggests.

The repository should clarify that this is a simple sequential recurrence, not an optimized Mamba-style implementation.
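To illustrate the difference, the recurrence in the quoted loop is a linear (affine) update, so it can in principle be evaluated with a parallel prefix scan in about log2(T) rounds instead of T dependent steps. The sketch below is a generic Hillis-Steele scan written in plain Python for clarity. It is not Mamba's selective-scan kernel and not code from the repository, and it assumes a fixed scalar mixing weight `w` (the per-channel case is the same computation applied element-wise).

```python
def sequential(xs, w, h0=0.0):
    """Step-by-step recurrence h_t = w * h_{t-1} + (1 - w) * x_t."""
    h, out = h0, []
    for x in xs:
        h = w * h + (1 - w) * x
        out.append(h)
    return out


def combine(f, g):
    """Compose two affine maps h -> a*h + b, applying f first and then g."""
    a1, b1 = f
    a2, b2 = g
    return (a1 * a2, a2 * b1 + b2)


def scan(xs, w, h0=0.0):
    """Same outputs via an inclusive prefix scan over affine maps."""
    elems = [(w, (1 - w) * x) for x in xs]
    step = 1
    while step < len(elems):
        # Within one round every position is independent of the others,
        # so parallel hardware could update all of them at once.
        elems = [
            combine(elems[i - step], elems[i]) if i >= step else elems[i]
            for i in range(len(elems))
        ]
        step *= 2
    return [a * h0 + b for a, b in elems]


xs = [1.0, 2.0, 3.0, 4.0, 5.0]
seq_out = sequential(xs, w=0.5)
scan_out = scan(xs, w=0.5)
print(all(abs(a - b) < 1e-9 for a, b in zip(seq_out, scan_out)))
```

For T = 256 the scan needs 8 rounds where the plain loop needs 256 dependent steps. An optimized implementation would run each round as a fused GPU kernel, which is the kind of work the quoted loop does not do.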

Tokenizer Claims Appear Inconsistent

The model card reportedly claims a BPE vocabulary size of 32,000.

However, the tokenizer implementation appears to use a custom ChunkTokenizer, based on a small set of predefined trigrams and simple character chunks. This does not appear to be a standard BPE tokenizer with learned merge rules.

The configuration also appears to reference a vocabulary size of 4,466, not 32,000.

This creates a clear mismatch between the model card and the implementation.

The repository should clarify:

The actual tokenizer used
The actual vocabulary size
Whether any BPE merge file exists
Whether the reported metrics were generated using this tokenizer

Requested Action

I am requesting that Hugging Face review the repository i3-lab/i3-200m-v2 for potentially inaccurate or misleading model claims.

Specifically, I ask that the moderation team verify:

Whether the model actually contains 169.85M parameters
Whether the reported perplexity of 5.2 is reproducible
Whether the tokenizer is truly a 32,000-token BPE tokenizer
Whether the architecture should be described as an efficient RWKV/Mamba-style implementation

If these claims cannot be verified, I request that the model card be corrected to reflect the actual implementation and reproducible results.

The goal is not to punish experimentation, but to make sure public model metadata is accurate and not misleading.

FlameF0X

i3-lab org Apr 24

Idk what I am supposed to do in this position.

FlameF0X

i3-lab org Apr 24

My day has been shit anyway 乁( •_• )ㄏ
So I lost interest in doing anything.

FlameF0X

i3-lab org Apr 24

Feel free to tell me what I am supposed to do.

FlameF0X

i3-lab org Apr 24 • edited Apr 24

Okay, so, I finally managed to read the whole report you sent.

The model is indeed 169.85M parameters, as far as I can remember from the training script.
As for the model's efficiency, I think it is quite okay. If you tried to train a model of this size on a single P100 it would hit an OOM (though that depends on the ctx window; I don't remember how large the ctx window is, probably something small, so I think you can train a model of this size, but not in 2-4 hours, I'm not sure). In the WandB logs I have about 12 runs of the same training code that show the same results.
Also, the model is an Attention-RWKV hybrid; I accidentally wrote Mamba/RWKV.
I don't remember the size of the tokenizer.
I'm very sure that the PPL is wrong; I think the data leaked over.

There are a lot of things I forgot about the model architecture, since I haven't maintained the project for a while and have been doing other things. There is a chance that I DID accidentally make some mistakes that I'm not aware of, which are human errors.
