HandleAtlas-166m

A fine-tune of GLiNER small v2.1 (~166M parameters) that extracts social-media handles from short bios. It was trained on Twitter/X bios; the same handle patterns appear on other platforms, though accuracy outside Twitter/X-style bios is expected to be lower.

Labels

  • instagram_username
  • snapchat_username
  • youtube_username
  • twitch_username
  • tiktok_username
  • discord_username
  • x_username
  • cashapp_username
  • onlyfans_username
  • tumblr_username
  • github_username
  • kofi_username
  • patreon_username
  • roblox_username
  • generic_username

generic_username is a fallback for handle-shaped strings without a clear platform indicator.

Usage

from gliner import GLiNER

model = GLiNER.from_pretrained("LumeData/HandleAtlas-166m")

labels = [
    "instagram_username", "snapchat_username", "youtube_username",
    "twitch_username", "tiktok_username", "discord_username",
    "x_username", "cashapp_username", "onlyfans_username",
    "tumblr_username", "github_username", "kofi_username",
    "patreon_username", "roblox_username", "generic_username",
]

text = "Insta: foodgrammer | Snap: chefchef | DC: gamer420 | $cashtag"
for ent in model.predict_entities(text, labels, threshold=0.5):
    print(f"{ent['text']!r} -> {ent['label']} ({ent['score']:.2f})")

Training

  • Base: urchade/gliner_small-v2.1
  • Real data: ~1,000 hand-labeled Twitter bios
  • Synthetic data: ~2,200 generated bios (template-based + IG→Discord text rewriting for the discord_username class)
  • Case augmentation: each training record is emitted in original + fully-lowercased form so the model is robust to casing of platform prefixes (Dc:/dc:/DC: etc.)
  • 5 epochs, batch size 4 with gradient accumulation 2 (effective batch size 8), learning rate 5e-6 (encoder) / 1e-5 (heads), cosine schedule
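
The case-augmentation step above can be sketched as follows. This is a minimal illustration, not the training script used for this model: it assumes records in GLiNER's token-level training format (a `tokenized_text` list plus `ner` spans given as `[start_token, end_token, label]`, end assumed inclusive), and the function name `augment_case` is hypothetical.

```python
def augment_case(records):
    """Emit each record twice: as-is and with every token lowercased.

    Assumes token-level spans, so lowercasing never shifts span indices.
    """
    augmented = []
    for rec in records:
        augmented.append(rec)
        augmented.append({
            "tokenized_text": [tok.lower() for tok in rec["tokenized_text"]],
            # Spans index tokens, not characters, so they carry over unchanged.
            "ner": [list(span) for span in rec["ner"]],
        })
    return augmented


# Illustrative record (not taken from the training set).
sample = [{
    "tokenized_text": ["DC", ":", "Gamer420"],
    "ner": [[2, 2, "discord_username"]],
}]
augmented = augment_case(sample)
```

Working at the token level avoids a subtle pitfall of character offsets: lowercasing can change the length of some Unicode strings, which would shift character-based spans.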

Eval

On a 100-record held-out slice of real Twitter bios:

Metric      Value
Precision   0.849
Recall      0.929
F1          0.887

Per-label F1 is strong for instagram (0.95), youtube (1.00), tiktok (1.00), twitch (1.00), onlyfans (1.00), generic (0.88), cashapp (0.86), and snapchat (0.80).
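
As a quick consistency check, the reported F1 is the harmonic mean of the reported precision and recall. The helper name `f1_score` below is illustrative only.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


print(round(f1_score(0.849, 0.929), 3))  # 0.887
```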

Recommended thresholds

  • Default: threshold=0.5
  • For generic_username, bump to 0.65 to reduce false positives; it's the catch-all label and over-fires at the default threshold.
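
One way to apply these thresholds is to run the model once at the default threshold and then filter per label. This is a minimal sketch, assuming the entity dicts returned by `predict_entities` carry `label` and `score` keys, as in Usage. The helper `apply_label_thresholds` and the `PER_LABEL_THRESHOLDS` mapping are hypothetical names, not part of GLiNER.

```python
DEFAULT_THRESHOLD = 0.5
# Stricter cut-off for the catch-all label, per the recommendation above.
PER_LABEL_THRESHOLDS = {"generic_username": 0.65}


def apply_label_thresholds(entities, per_label=PER_LABEL_THRESHOLDS,
                           default=DEFAULT_THRESHOLD):
    """Keep entities whose score meets the threshold for their label."""
    return [
        ent for ent in entities
        if ent["score"] >= per_label.get(ent["label"], default)
    ]


# With the model from Usage (not executed here):
#   raw = model.predict_entities(text, labels, threshold=DEFAULT_THRESHOLD)
#   entities = apply_label_thresholds(raw)

# Stand-in predictions; the scores are invented for illustration.
example = [
    {"text": "foodgrammer", "label": "instagram_username", "score": 0.91},
    {"text": "cashtag", "label": "generic_username", "score": 0.58},
    {"text": "chefchef", "label": "generic_username", "score": 0.70},
]
kept = apply_label_thresholds(example)
```

The model call should use the lowest threshold in play (here the default, 0.5) so that no label is filtered too early.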

Limitations

  • Trained on patterns common in Twitter/X bios; performance on other domains (LinkedIn-style profiles, Reddit, forum signatures) is likely to be lower.
  • discord_invite is not a supported label: invite codes are either classified as discord_username or skipped.
  • Multi-line bios with many handles can occasionally confuse adjacent URL labels (e.g., patreon.com/x | github.com/x chains).
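
Because invite codes can surface as discord_username, a rule-based post-filter is one possible mitigation. The sketch below is not part of the model and rests on assumptions: entities carry `start`/`end` character offsets (as GLiNER's `predict_entities` output does), and invites appear as `discord.gg/<code>` or `discord.com/invite/<code>` links. Bare invite codes without a URL prefix are not caught. `drop_invite_codes` and `make_entity` are hypothetical helpers.

```python
import re

# Invite-link prefix that ends exactly where the entity starts.
INVITE_PREFIX_AT_END = re.compile(
    r"(?:discord\.gg/|discord(?:app)?\.com/invite/)\Z", re.IGNORECASE
)
# Invite link anywhere inside the entity text itself.
INVITE_URL = re.compile(
    r"discord\.gg/|discord(?:app)?\.com/invite/", re.IGNORECASE
)


def drop_invite_codes(text, entities):
    """Remove discord_username entities that look like invite codes."""
    kept = []
    for ent in entities:
        if ent["label"] == "discord_username":
            preceded = INVITE_PREFIX_AT_END.search(text[: ent["start"]])
            embedded = INVITE_URL.search(ent["text"])
            if preceded or embedded:
                continue
        kept.append(ent)
    return kept


bio = "join discord.gg/abc123 | DC: gamer420"


def make_entity(value):
    # Build a stand-in prediction; the score is invented for illustration.
    start = bio.index(value)
    return {"text": value, "label": "discord_username",
            "start": start, "end": start + len(value), "score": 0.8}


filtered = drop_invite_codes(bio, [make_entity("abc123"), make_entity("gamer420")])
```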