Typosquat T5 detector
Model Details
Model Description
This model is an encoder-decoder fine-tuned to detect typosquatting of domain names, built on the flan-t5-large transformer. It can be used to classify whether a domain name is a typographical variant (typosquat) of another domain.
- Developed by: Anvilogic
- Model type: Encoder-Decoder
- Maximum Sequence Length: 512 tokens
- Language(s) (NLP): Multilingual
- License: MIT
- Finetuned from model: google/flan-t5-large
Usage
Direct Usage (Transformers)
This model can be used directly in cybersecurity applications to identify malicious typosquatting domains by analyzing a domain name's similarity to a legitimate one.
To get started, load and test the model with the following code:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("Anvilogic/Flan-T5-typosquat-detect")
model = AutoModelForSeq2SeqLM.from_pretrained("Anvilogic/Flan-T5-typosquat-detect")

# Example input: a candidate domain and the legitimate domain to compare it against
typosquat_candidate = "goog1e.com"
legitimate_domain = "google.com"
input_text = f"Is the first domain a typosquat of the second: {typosquat_candidate} {legitimate_domain}"

# Generate and decode the model's true/false answer
input_ids = tokenizer(input_text, return_tensors="pt").input_ids
outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# Output: false
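The same check can also be run through the Transformers pipeline API, which is convenient for scoring several candidate pairs in a loop. This is a minimal sketch; the domain pairs below are illustrative.

from transformers import pipeline

# Wrap the model in a text2text-generation pipeline
detector = pipeline("text2text-generation", model="Anvilogic/Flan-T5-typosquat-detect")

# Illustrative candidate/legitimate pairs
pairs = [("goog1e.com", "google.com"), ("example.com", "example.com")]
for candidate, legitimate in pairs:
    prompt = f"Is the first domain a typosquat of the second: {candidate} {legitimate}"
    verdict = detector(prompt)[0]["generated_text"]
    print(f"{candidate} vs {legitimate}: {verdict}")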
Downstream Usage
This model can be paired with an embedding model to enhance typosquatting detection: the embedding model first retrieves similar domains from a database of legitimate domains, and this encoder-decoder then labels each retrieved pair, confirming whether a domain is a typosquat and identifying its original source.
For embedding, consider using: Anvilogic/Embedder-typosquat-detect
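A minimal sketch of this two-stage setup is shown below. It assumes the embedder loads with the sentence-transformers library; the list of legitimate domains is an illustrative stand-in for a real database.

from sentence_transformers import SentenceTransformer, util
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Stage 1 model: embeds domains for nearest-neighbour retrieval
embedder = SentenceTransformer("Anvilogic/Embedder-typosquat-detect")
# Stage 2 model: labels candidate/legitimate pairs
tokenizer = AutoTokenizer.from_pretrained("Anvilogic/Flan-T5-typosquat-detect")
model = AutoModelForSeq2SeqLM.from_pretrained("Anvilogic/Flan-T5-typosquat-detect")

# Illustrative stand-in for a real database of legitimate domains
legitimate_domains = ["google.com", "microsoft.com", "amazon.com"]
corpus_embeddings = embedder.encode(legitimate_domains, convert_to_tensor=True)

candidate = "goog1e.com"
candidate_embedding = embedder.encode(candidate, convert_to_tensor=True)

# Retrieve the closest legitimate domains by cosine similarity
hits = util.semantic_search(candidate_embedding, corpus_embeddings, top_k=3)[0]

# Let the encoder-decoder confirm each retrieved pair
for hit in hits:
    legitimate = legitimate_domains[hit["corpus_id"]]
    prompt = f"Is the first domain a typosquat of the second: {candidate} {legitimate}"
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    outputs = model.generate(input_ids)
    print(f"{candidate} vs {legitimate}: {tokenizer.decode(outputs[0], skip_special_tokens=True)}")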
Bias, Risks, and Limitations
Users are advised to use this model as a supportive tool rather than a sole indicator for domain security. Regular updates may be needed to maintain its performance against new and evolving types of domain spoofing.
Training Details
Framework Versions
- Python: 3.10.14
- Transformers: 4.46.2
- PyTorch: 2.2.2
- Tokenizers: 0.20.3
Training Data
The model was fine-tuned using Anvilogic/T5-Typosquat-Training-Dataset, which contains pairs of domain names and the expected response.
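To inspect the data, the dataset can be loaded with the datasets library; the snippet below prints the available splits and one record rather than assuming a particular column schema.

from datasets import load_dataset

dataset = load_dataset("Anvilogic/T5-Typosquat-Training-Dataset")
print(dataset)                          # available splits and column names
first_split = list(dataset.keys())[0]
print(dataset[first_split][0])          # one domain pair with its expected response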
Training Procedure
The model was optimized with a cross-entropy loss over the decoder logits (PyTorch's CrossEntropyLoss), using the binary true/false answer as the target sequence. A minimal training sketch follows the hyperparameter list below.
Training Hyperparameters
- Model Architecture: Encoder-Decoder fine-tuned from flan-t5-large
- Batch Size: 8
- Epochs: 5
- Learning Rate: 5e-5
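The sketch below shows how these hyperparameters map onto a standard Seq2SeqTrainer run. The "input_text" and "target_text" column names are hypothetical placeholders; adapt them to the actual schema of the training dataset.

from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large")
dataset = load_dataset("Anvilogic/T5-Typosquat-Training-Dataset")

def preprocess(batch):
    # Tokenize prompts and expected responses; with `labels` set, the model
    # computes the cross-entropy loss over the decoder logits internally.
    # "input_text" / "target_text" are placeholder column names.
    model_inputs = tokenizer(batch["input_text"], max_length=512, truncation=True)
    labels = tokenizer(text_target=batch["target_text"], max_length=8, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset.map(preprocess, batched=True)

args = Seq2SeqTrainingArguments(
    output_dir="flan-t5-typosquat",
    per_device_train_batch_size=8,
    num_train_epochs=5,
    learning_rate=5e-5,
    save_strategy="epoch",   # keep one checkpoint per epoch
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],   # assumes a "train" split
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()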
Evaluation
Training and validation loss
| Epoch | Training loss | Validation loss |
|---|---|---|
| 1 | 0.0807 | 0.016496 |
| 2 | 0.0270 | 0.018645 |
| 3 | 0.0034 | 0.016577 |
| 4 | 0.0002 | 0.012842 |
| 5 | 0.0407 | 0.014530 |
We kept only the epoch 4 checkpoint, as it exhibits the lowest validation loss.