Add model documentation
README_model.md
ADDED
@@ -0,0 +1,55 @@
---
language: en
tags:
- log-analysis
- pythia
- hdfs
license: mit
datasets:
- honicky/log-analysis-hdfs-preprocessed
metrics:
- cross-entropy
- perplexity
base_model: EleutherAI/pythia-70m
---

# pythia-70m-hdfs-logs

Fine-tuned Pythia-70m model for HDFS log analysis, specifically for anomaly detection.

## Model Description

This model is fine-tuned from `EleutherAI/pythia-70m` for analyzing HDFS log sequences. It is designed to learn and predict patterns in
HDFS log data so that we can detect anomalies using the perplexity of a log sequence. The HDFS dataset is handy because it is labeled,
so we can use it to validate that the model can detect anomalies.

We will use this model to understand how well a small model can predict anomalies in a specific dataset. We will study model scale
and experiment with tokenization, initialization, dataset size, etc. to find a configuration that is minimal in size and fast, but can
still effectively predict anomalies. We will then attempt to build a model that is more robust to different log formats.

- Hugging Face model: [honicky/pythia-14m-hdfs-logs](https://huggingface.co/honicky/pythia-14m-hdfs-logs)
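
The perplexity-based detection described in this section can be sketched roughly as follows. This is a minimal illustration, not the code used for this model: it assumes the per-token log-probabilities have already been obtained from the model, and the function names and the idea of a single fixed threshold are hypothetical.

```python
import math


def sequence_perplexity(token_log_probs):
    """Perplexity is exp(mean negative log-likelihood per token)."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


def flag_anomalies(perplexities, threshold):
    """Mark a sequence as anomalous when its perplexity exceeds the threshold.

    The threshold is an assumption; in practice it would be tuned on labeled data.
    """
    return [p > threshold for p in perplexities]


def precision_recall(predicted, labels):
    """Compare predicted anomaly flags against ground-truth labels."""
    tp = sum(1 for p, y in zip(predicted, labels) if p and y)
    fp = sum(1 for p, y in zip(predicted, labels) if p and not y)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Because the HDFS data is labeled, `precision_recall` can be evaluated at several candidate thresholds to check whether perplexity actually separates normal sequences from anomalous ones.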

## Training Details
- Base model: EleutherAI/pythia-70m
- Dataset: https://zenodo.org/records/8196385/files/HDFS_v1.zip?download=1, plus preprocessed data at `honicky/log-analysis-hdfs-preprocessed`
- Batch size: 16
- Max sequence length: 405
- Learning rate: 0.0001
- Training steps: 2000
- Weights and Biases run: https://wandb.ai/honicky/log-analysis-pythia/runs/jomdv9lz
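
To give a sense of the training budget these settings imply, the sketch below collects them in a small config object and derives two totals. The `TrainingConfig` class is hypothetical, and the arithmetic assumes that the batch size applies per training step with no gradient accumulation and a single device, which the list above does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    # Values copied from the training details above.
    base_model: str = "EleutherAI/pythia-70m"
    batch_size: int = 16
    max_seq_len: int = 405
    learning_rate: float = 1e-4
    train_steps: int = 2000

    @property
    def sequences_seen(self) -> int:
        # Assumes one batch per step (no gradient accumulation, one device).
        return self.batch_size * self.train_steps

    @property
    def max_tokens_seen(self) -> int:
        # Upper bound: real sequences may be shorter than max_seq_len.
        return self.sequences_seen * self.max_seq_len


cfg = TrainingConfig()
print(cfg.sequences_seen)   # 32000
print(cfg.max_tokens_seen)  # 12960000
```

Under those assumptions, training covers 32,000 sequences and at most about 13 million tokens.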

## Special Tokens
- Added a `<|sep|>` token to separate event IDs
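
As an illustration of how such a separator might be used, the sketch below joins event IDs into one string and registers the token with a tokenizer. The event IDs (`E5`, `E22`, ...) and both helper functions are hypothetical, and the exact input format of the preprocessed dataset may differ. `add_special_tokens` and `resize_token_embeddings` are existing Hugging Face `transformers` methods, but whether this model was prepared exactly this way is an assumption.

```python
SEP_TOKEN = "<|sep|>"


def format_event_sequence(event_ids):
    """Join event IDs into a single string, separated by the special token."""
    return SEP_TOKEN.join(event_ids)


def register_sep_token(tokenizer, model):
    """Add the separator to a transformers tokenizer and grow the embeddings.

    `tokenizer` and `model` are expected to be Hugging Face objects;
    nothing is imported here so the sketch stays dependency-free.
    """
    added = tokenizer.add_special_tokens(
        {"additional_special_tokens": [SEP_TOKEN]}
    )
    if added:
        # New token IDs need matching rows in the embedding matrix.
        model.resize_token_embeddings(len(tokenizer))
    return added


print(format_event_sequence(["E5", "E22", "E11"]))
```

Registering the separator as a special token keeps it from being split into several sub-word pieces, so each boundary between event IDs costs exactly one token.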

## Intended Use
This model is intended for:
- Analyzing HDFS log sequences
- Detecting anomalies in log patterns
- Understanding system behavior through log analysis

## Limitations
- Model is specifically trained on HDFS logs and may not generalize to other log formats
- Limited to the context window size of 405 tokens
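
One possible way to handle sequences longer than the 405-token window is to split them into fixed-size chunks and score each chunk separately. The sketch below shows that idea only; it is an assumption, not how this model was trained or evaluated, and the helper name is hypothetical.

```python
MAX_SEQ_LEN = 405  # context window stated above


def split_into_windows(token_ids, max_len=MAX_SEQ_LEN):
    """Split a token ID list into consecutive, non-overlapping windows."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    return [
        token_ids[start:start + max_len]
        for start in range(0, len(token_ids), max_len)
    ]


windows = split_into_windows(list(range(1000)))
print([len(w) for w in windows])  # [405, 405, 190]
```

Non-overlapping windows are the simplest choice, but each window loses the context that precedes it, which may raise perplexity at window boundaries.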