---
library_name: transformers
inference: false
license: cc-by-sa-4.0
base_model:
- nqzfaizal77ai/swiftstrike-aero-init-580m
---

**Swiftstrike Aero Model (Pruned Falcon Model)**

This model is a fine-tuned version of the Swiftstrike Aero model, tailored for context-aware keyword search on culture-related text. It is designed to process 1-block contexts: approximately 384 tokens, or a single Wikipedia paragraph of common length.
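
A quick way to check whether an input fits this 1-block budget is to count tokens with the model's tokenizer. Below is a minimal sketch, assuming the tokenizer loads from this repository; the `BLOCK_TOKEN_BUDGET` constant and the `fits_one_block` helper are illustrative, not part of the model's API:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "nqzfaizal77ai/sa-145m-en-wikipedia-culture-part1-1bc",
    trust_remote_code=True,
)

BLOCK_TOKEN_BUDGET = 384  # approximate size of a 1-block context (figure taken from this card)

def fits_one_block(paragraph: str) -> bool:
    """Return True if the paragraph tokenizes to at most one block."""
    n_tokens = len(tokenizer(paragraph)["input_ids"])
    return n_tokens <= BLOCK_TOKEN_BUDGET

print(fits_one_block("The cultural impact of the internet is broad."))  # True for short text
```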

**Training Data (Part 1: Culture Context from Wikipedia)**

The model was trained on a multi-stage dataset derived from Wikipedia's culture-related content (example row shapes for each stage are sketched after the list):

1. **Base Dataset:** 
   - 13,000 rows of capitalized and lowercase words extracted from Wikipedia's culture sentences.
2. **Sentence-Level Dataset:** 
   - 2,300 rows of full sentences from Wikipedia's culture data.
3. **1-Block Context Dataset:**
   - 500 rows of 1-block contexts (approximately one paragraph each) from Wikipedia's culture data.
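
The row contents below are hypothetical examples meant only to illustrate the shape of each stage; they are not actual rows from the dataset:

```python
# Hypothetical examples of rows at each training stage (not actual dataset rows).
word_rows = ["Culture", "culture", "Renaissance", "renaissance"]            # stage 1: words
sentence_rows = ["Culture encompasses the social behavior of societies."]   # stage 2: sentences
block_rows = [
    "Culture encompasses the social behavior, institutions, and norms of "
    "human societies, as well as the knowledge, beliefs, arts, and customs "
    "of the individuals in these groups."                                   # stage 3: 1-block contexts
]
```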

**Dataset Organization**

The dataset is organized hierarchically, with each tier representing a higher level of complexity (see the sketch after this list):

1. **Part:** Individual components or elements.
2. **Merge Part:** Combination of two or more parts.
3. **Fragment:** Combination of two or more merge parts.
4. **Sub-Unit:** Combination of two or more fragments.
5. **Unit:** Combination of two or more sub-units.
6. **Super-Unit:** Combination of two or more units.
7. **Mega-Unit:** Combination of two or more super-units.
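
As a rough illustration of how tiers compose, the sketch below combines consecutive items from one tier into items of the next. The pairwise grouping rule and the `compose` helper are assumptions; the card does not specify how combinations are formed:

```python
from itertools import islice

def compose(lower_tier: list[str], group_size: int = 2) -> list[str]:
    """Combine consecutive groups of lower-tier items into single higher-tier items."""
    it = iter(lower_tier)
    out = []
    while group := list(islice(it, group_size)):
        out.append(" ".join(group))
    return out

parts = ["culture", "art", "music", "ritual"]
merge_parts = compose(parts)      # ['culture art', 'music ritual']
fragments = compose(merge_parts)  # ['culture art music ritual']
```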

**How to Use**

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from IPython.display import display, HTML

model_name = "nqzfaizal77ai/sa-145m-en-wikipedia-culture-part1-1bc"

# Load the model and tokenizer from the Hub
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

torch.manual_seed(3077)

input_text = "The cultural impact of the internet is"
inputs = tokenizer(input_text, return_tensors="pt")

def print_with_border(text):
    """Display the given text inside a bordered box (for notebook environments)."""
    display(HTML(f"<div style='border: 1px solid black; padding: 10px;'>{text}</div>"))

# Example usage: stochastic decoding
output = model.generate(
    **inputs,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    repetition_penalty=1.2,
    max_length=100,
)
generated_text = tokenizer.decode(output[0], skip_special_tokens=True).replace("\n", "<br>")
print_with_border(generated_text)

# Example usage: greedy decoding
output = model.generate(
    **inputs,
    do_sample=False,
    max_length=100,
)
generated_text = tokenizer.decode(output[0], skip_special_tokens=True).replace("\n", "<br>")
print_with_border(generated_text)
```