
Waseem AlShikh

wassemgtk

AI & ML interests

Multi-modal, Palmyra LLMs, Knowledge Graph


Organizations

Writer, Social Post Explorers

wassemgtk's activity

replied to their post 8 days ago

Applying iRoPE to an existing model like LLaMA 3.2-3B is very possible! Interleaved local (RoPE) and global (temperature-scaled) attention boosts long-context handling (10M tokens). With chunking and weight transfer, it's adaptable to "any" transformer model.
Infinite context feels closer 🤯

https://github.com/wassemgtk/iRoPE-try
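
For the chunking piece, a minimal sketch of a chunked local-attention mask; the chunk size and the boolean-mask convention are illustrative assumptions, not taken from the repo:

import torch

def chunked_local_mask(seq_len, chunk_size=8192):
    # Causal attention restricted to fixed-size chunks: token i may attend to token j
    # only if j <= i and both fall in the same chunk. This is the "local" half of the
    # interleaving; the global layers see the full sequence instead.
    pos = torch.arange(seq_len)
    same_chunk = (pos[:, None] // chunk_size) == (pos[None, :] // chunk_size)
    causal = pos[:, None] >= pos[None, :]
    return same_chunk & causal  # (seq_len, seq_len), True = allowed to attend

mask = chunked_local_mask(seq_len=16, chunk_size=4)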

posted an update 8 days ago
I’ve been diving into the iRoPE architecture from Llama 4—a game-changer for long-context models! It interleaves local attention (with RoPE) for short contexts and global attention (with inference-time temp scaling) for long-range reasoning, aiming for infinite context. I’m going to try writing iRoPE—who wants to help?

Code: https://github.com/wassemgtk/iRoPE-try/blob/main/iRoPE.ipynb
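
As a starting point, a rough sketch of the two ingredients described above; the interleaving period and the attn_scale / floor_scale constants are my own illustrative assumptions, not necessarily what the notebook uses:

import torch

def is_global_layer(layer_idx, interleave_every=4):
    # Interleave: most layers use chunked local attention with RoPE; every Nth layer
    # is a global attention layer without positional encoding.
    return (layer_idx + 1) % interleave_every == 0

def global_query_temperature(positions, attn_scale=0.1, floor_scale=8192.0):
    # Inference-time temperature for the global layers: grows logarithmically with
    # absolute position so attention logits stay sharp at very long contexts.
    return 1.0 + attn_scale * torch.log(torch.floor(positions / floor_scale) + 1.0)

# Example: scale the queries of a global layer before computing attention scores
q = torch.randn(1, 8, 16, 64)                      # (batch, heads, seq, head_dim)
pos = torch.arange(16, dtype=torch.float32)
q = q * global_query_temperature(pos).view(1, 1, -1, 1)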
posted an update 20 days ago
For fun, a new project: SuperTokenizer! A byte-level BPE tokenizer trained on C4, aiming to beat GPT-4's tokenizer. A100-powered and open-source. Messing around with tokens!
https://github.com/wassemgtk/SuperTokenizer
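
For reference, a rough sketch of what byte-level BPE training on C4 can look like with the Hugging Face tokenizers and datasets libraries; the vocab size, special tokens, and document count here are assumptions, not the repo's actual settings:

from datasets import load_dataset
from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers

dataset = load_dataset("allenai/c4", "en", split="train", streaming=True)

def text_iterator(n_docs=1_000_000):
    # Stream a bounded number of C4 documents rather than downloading the full corpus
    for i, example in enumerate(dataset):
        if i >= n_docs:
            break
        yield example["text"]

tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=100_000,
    special_tokens=["<|endoftext|>"],
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)
tokenizer.train_from_iterator(text_iterator(), trainer=trainer)
tokenizer.save("supertokenizer.json")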
replied to their post about 1 month ago

AMAZING WORK! Based on the updated model snippet and results, I'll provide additional suggestions to further refine AdaptiveGESAL, targeting an RMSE of 10–16 cycles while maintaining efficiency and scalability.
The accuracy (±50 cycles) of 100.0% is excellent, indicating robust generalization within the ±50-cycle tolerance, but the RMSE/MAE show there is still room to improve precision.

The temporal layers (Conv1d, LSTM) are working well, but I believe deeper or more specialized layers could capture finer degradation patterns.

Include parallel Conv1d layers with different kernel sizes (e.g., 3, 5, 7) to capture short- and long-term trends, then concatenate outputs before the LSTM:

self.conv1d_short = nn.Conv1d(input_dim, hidden_dim // 3, kernel_size=3, padding=1)
self.conv1d_med = nn.Conv1d(input_dim, hidden_dim // 3, kernel_size=5, padding=2)
self.conv1d_long = nn.Conv1d(input_dim, hidden_dim // 3, kernel_size=7, padding=3)

def forward(self, x):
    # x: (batch, features); treat the sensor features as Conv1d channels
    x = x.unsqueeze(2)  # (batch, input_dim, 1)
    short = self.activation(self.conv1d_short(x))
    med = self.activation(self.conv1d_med(x))
    long = self.activation(self.conv1d_long(x))
    # Concatenated width is 3 * (hidden_dim // 3); pick hidden_dim divisible by 3
    # or size the LSTM input accordingly
    x = torch.cat([short, med, long], dim=1).transpose(1, 2)  # (batch, 1, channels) for the LSTM
    x, _ = self.lstm(x)
    # Continue with SVF and output layers

A bidirectional LSTM also improves temporal context, reducing MAE:

self.lstm = nn.LSTM(hidden_dim, hidden_dim // 2, batch_first=True, bidirectional=True, num_layers=1)
x, _ = self.lstm(x)  # Output: (batch, seq_len, hidden_dim); the two directions concatenate back to hidden_dim
x = x.squeeze(1)     # (batch, hidden_dim), already the original width, so no extra scaling needed

Then increase model capacity for complex patterns while maintaining efficiency via SVF, like below:

original_fc1 = nn.Linear(256, 128)
original_fc2 = nn.Linear(128, 64)
original_fc3 = nn.Linear(64, 32)
self.svf1 = SVFLinear(original_fc1, dropout_rate=0.2, l2_lambda=0.01)
self.svf2 = SVFLinear(original_fc2, dropout_rate=0.2, l2_lambda=0.01)
self.svf3 = SVFLinear(original_fc3, dropout_rate=0.2, l2_lambda=0.01)
self.output_layer = nn.Linear(32, 1)
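
For readers unfamiliar with SVFLinear: assuming SVF here means singular-value fine-tuning (a frozen base Linear whose singular values get learnable scales), a hypothetical version of the wrapper might look like the following; the real class in the repo may differ:

import torch
import torch.nn as nn
import torch.nn.functional as F

class SVFLinear(nn.Module):
    # Hypothetical sketch: frozen base Linear with learnable per-singular-value scales
    def __init__(self, base_linear: nn.Linear, dropout_rate=0.2, l2_lambda=0.01):
        super().__init__()
        # Decompose the frozen weight once: W = U diag(S) V^T
        U, S, Vh = torch.linalg.svd(base_linear.weight.detach(), full_matrices=False)
        self.register_buffer("U", U)
        self.register_buffer("S", S)
        self.register_buffer("Vh", Vh)
        self.bias = base_linear.bias                    # reuse the original bias
        self.scale = nn.Parameter(torch.ones_like(S))   # only these scales are trained
        self.dropout = nn.Dropout(dropout_rate)
        self.l2_lambda = l2_lambda

    def forward(self, x):
        weight = self.U @ torch.diag(self.S * self.scale) @ self.Vh
        return self.dropout(F.linear(x, weight, self.bias))

    def l2_penalty(self):
        # Regularize the learned scales toward 1 (the identity adaptation)
        return self.l2_lambda * ((self.scale - 1.0) ** 2).sum()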
replied to their post about 1 month ago

One more idea: add temporal layers. Integrate 1D convolutional layers or LSTM layers before the SVFLinear layers to capture temporal dependencies in the sensor data over cycles. Something like:

class AdaptiveGESAL(nn.Module):
    def __init__(self, input_dim=21, hidden_dim=128, num_nodes=50):
        super().__init__()
        self.conv1d = nn.Conv1d(input_dim, hidden_dim, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        # Use nn.ModuleList so the SVF layers' parameters are registered with the module
        self.svf_layers = nn.ModuleList(SVFLinear(nn.Linear(hidden_dim, hidden_dim)) for _ in range(2))
        self.output_layer = nn.Linear(hidden_dim, 1)  # RUL prediction
        # Graph and SVF initialization as before

Also, replace MSE (implicit in RMSE) with a hybrid loss combining MSE and a quantile loss (e.g., the 0.9 quantile for conservative RUL estimates); this penalizes underestimation, aligning with conservative RUL needs.
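
A minimal sketch of that hybrid objective (the 0.5 mixing weight is an arbitrary illustration; q and alpha would need tuning):

import torch

def hybrid_rul_loss(pred, target, q=0.9, alpha=0.5):
    mse = torch.mean((pred - target) ** 2)
    err = target - pred
    # Pinball/quantile loss: at q=0.9 it weights underestimation errors (pred below target)
    # more heavily, matching the suggestion above
    quantile = torch.mean(torch.maximum(q * err, (q - 1.0) * err))
    return alpha * mse + (1.0 - alpha) * quantile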

replied to their post about 1 month ago

A few more suggestions (a rough feature-engineering sketch follows the list):

  • Use correlation analysis or techniques like Principal Component Analysis (PCA) to identify the most predictive features (e.g., vibration, temperature, pressure) and reduce noise from less relevant sensors.

  • Transform features into time-series statistics (e.g., rolling averages, standard deviations, or slopes over cycles) to capture degradation trends. For example, compute a 10-cycle rolling mean for T30 (total temperature at LPC outlet) and Nf (physical fan speed).

  • Normalize or standardize features (e.g., z-scores or min-max scaling) per engine to account for individual variability, ensuring AdaptiveGESAL’s embeddings better distinguish degradation states.
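
A rough pandas sketch of the rolling-statistics and per-engine normalization ideas; column names like "unit", "cycle", "T30", and "Nf" follow the C-MAPSS convention, and the exact dataframe layout is an assumption:

import pandas as pd

def add_degradation_features(df, sensors=("T30", "Nf"), window=10):
    df = df.sort_values(["unit", "cycle"]).copy()
    for s in sensors:
        grouped = df.groupby("unit")[s]
        # Rolling statistics over cycles to expose degradation trends
        df[f"{s}_roll_mean"] = grouped.transform(lambda x: x.rolling(window, min_periods=1).mean())
        df[f"{s}_roll_std"] = grouped.transform(lambda x: x.rolling(window, min_periods=1).std())
        # Per-engine z-score so each unit's baseline and variability are normalized out
        df[f"{s}_z"] = grouped.transform(lambda x: (x - x.mean()) / (x.std() + 1e-8))
    return df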

replied to their post about 2 months ago

The accuracy drop likely ties to prompt inconsistency, so standardization is key. If your setup can handle more nodes and data, focus on tuning distance_threshold.

Try these tweaks (sketched as a config dict below the list):

  • Prompt: "Given Engine X with [params], predict ‘replace’, ‘maintenance’, or ‘check’ based on wear."

  • Hyperparameters: temperature=0.4, top_k=20, distance_threshold=0.25, lr=0.005, buffer_size=10.

  • Scale: Batch 500 engines, aiming for 10–15 nodes.
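
Sketched as a plain config dict; the key names are illustrative and GESAL's actual configuration interface may differ:

config = {
    "prompt_template": (
        "Given Engine X with [params], predict 'replace', 'maintenance', "
        "or 'check' based on wear."
    ),
    "temperature": 0.4,
    "top_k": 20,
    "distance_threshold": 0.25,  # main knob to tune as the node/data budget grows
    "lr": 0.005,
    "buffer_size": 10,
    "batch_engines": 500,        # aim for roughly 10-15 graph nodes at this scale
}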

replied to their post about 2 months ago

@oieieio This is awesome! What is your primary feedback on how I can improve it? I haven't had a chance to run it on a larger evaluation yet.