---
license: apache-2.0
base_model:
- openai/whisper-large-v3-turbo
pipeline_tag: automatic-speech-recognition
tags:
- custom
- whisper
- echo
- rotary
- givens
- learned
- dynamic
- meta-learning
- Multi-Dimensional Adaptation
- self-tuning
---

Self-Adjusting Parameters: The model dynamically adjusts the rotary `base` parameter in response to the training loss, updating the positional embeddings on the fly. This adaptive mechanism lets the model tune itself during training, which can improve both performance and efficiency.
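A minimal sketch of what this could look like (the adjustment rule, the `factor` value, and the class name are assumptions for illustration, not the repository's exact code): after each step the rotary `base` is nudged according to the change in loss, and the inverse-frequency table used by the rotary embedding is rebuilt from it.

```python
import torch

# Illustrative sketch only, not the exact implementation in this repository.
class AdaptiveBase:
    def __init__(self, dim, base=10000.0, factor=1.0025):
        self.dim = dim
        self.base = base
        self.factor = factor
        self.prev_loss = None

    def step(self, loss):
        if self.prev_loss is not None:
            # Loss went down -> gently raise the base; loss went up -> lower it.
            self.base *= self.factor if loss < self.prev_loss else 1.0 / self.factor
        self.prev_loss = loss
        return self.inv_freq()

    def inv_freq(self):
        # Standard rotary inverse frequencies, recomputed from the current base.
        return 1.0 / (self.base ** (torch.arange(0, self.dim, 2).float() / self.dim))
```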
Combined Rotary, Orthogonally Initialized, and Givens Rotation Matrices: This component integrates rotary embeddings, an orthogonally initialized rotation matrix, and Givens rotation matrices. This robust combination provides stable and effective positional embeddings, enhancing the model's capacity to represent positional information accurately. The integration helps capture dependencies and relationships within the feature space through rotational symmetry.
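The sketch below illustrates the general idea under assumptions about the composition order (the helper `givens_matrix` and the expression `x @ g @ ortho` are illustrative, not the exact module): a Givens rotation in one plane and an orthogonally initialized matrix are applied to rotary-embedded features.

```python
import torch

# Givens rotation by angle theta in the (i, j) plane of a dim-dimensional space.
def givens_matrix(dim, i, j, theta):
    g = torch.eye(dim)
    c, s = torch.cos(theta), torch.sin(theta)
    g[i, i], g[j, j] = c, c
    g[i, j], g[j, i] = -s, s
    return g

dim = 64
ortho = torch.empty(dim, dim)
torch.nn.init.orthogonal_(ortho)             # orthogonally initialized rotation matrix
g = givens_matrix(dim, 0, 1, torch.tensor(0.1))

x = torch.randn(10, dim)                     # e.g. rotary-embedded features
y = x @ g @ ortho                            # combined rotation block (illustrative order)
```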
Learned Sinusoidal Embeddings with Checkpointing: This unique integration of sinusoidal embeddings with optional checkpointing helps manage memory efficiently during training while maintaining stable embedding magnitudes through L2 normalization.
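A sketch of how such a module might look, assuming a sinusoidally initialized learnable table, L2 normalization on lookup, and PyTorch's gradient checkpointing (class and argument names are illustrative):

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.checkpoint import checkpoint

class LearnedSinusoidalEmbedding(nn.Module):
    def __init__(self, n_ctx, dim, use_checkpoint=False):
        super().__init__()
        pos = torch.arange(n_ctx).unsqueeze(1)
        div = torch.exp(torch.arange(0, dim, 2) * (-math.log(10000.0) / dim))
        table = torch.zeros(n_ctx, dim)
        table[:, 0::2] = torch.sin(pos * div)
        table[:, 1::2] = torch.cos(pos * div)
        self.weight = nn.Parameter(table)          # learned, sinusoidally initialized
        self.use_checkpoint = use_checkpoint

    def _lookup(self, positions):
        # L2 normalization keeps embedding magnitudes stable as the table is learned.
        return F.normalize(self.weight, dim=-1)[positions]

    def forward(self, positions):
        if self.use_checkpoint:
            # Optional checkpointing trades compute for memory during training.
            return checkpoint(self._lookup, positions, use_reentrant=False)
        return self._lookup(positions)

emb = LearnedSinusoidalEmbedding(1500, 1024, use_checkpoint=True)(torch.arange(1500))
```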
Dynamic Positional Bias: The attention mechanism supports rotary embeddings and adds a relative positional bias, helping it capture dependencies between nearby and distant positions. Its base parameter is adjusted dynamically during training, so the positional encoding adapts automatically rather than staying fixed.
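As a rough illustration (names and shapes are assumptions), a relative-position bias clamped to `max_rel_dist` can be added to the attention logits before the softmax:

```python
import torch
import torch.nn as nn

class RelativePositionalBias(nn.Module):
    def __init__(self, n_head, max_rel_dist=15):
        super().__init__()
        self.max_rel_dist = max_rel_dist
        self.bias = nn.Embedding(2 * max_rel_dist - 1, n_head)

    def forward(self, q_len, k_len):
        rel = torch.arange(q_len)[:, None] - torch.arange(k_len)[None, :]
        rel = rel.clamp(-self.max_rel_dist + 1, self.max_rel_dist - 1)
        rel = rel + self.max_rel_dist - 1            # shift into [0, 2*max_rel_dist-2]
        return self.bias(rel).permute(2, 0, 1)       # (n_head, q_len, k_len)

scores = torch.randn(2, 16, 100, 100)                # (batch, head, query, key)
scores = scores + RelativePositionalBias(n_head=16)(100, 100)
```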
Combining Local and Global Attention: This component leverages both local and global attention mechanisms, ensuring that the model captures both fine-grained and broad context. The sliding window approach for local attention enhances its ability to process long sequences efficiently.
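A toy sketch of the mechanics, assuming a sliding-window mask for the local branch and an unmasked softmax for the global branch (the 50/50 mixing at the end is purely illustrative):

```python
import torch

def sliding_window_mask(seq_len, window):
    idx = torch.arange(seq_len)
    dist = (idx[None, :] - idx[:, None]).abs()
    return dist <= window                            # True where attention is allowed

mask = sliding_window_mask(seq_len=8, window=2)
scores = torch.randn(8, 8)                           # unnormalized attention scores
local_attn = scores.masked_fill(~mask, float("-inf")).softmax(dim=-1)   # local branch
global_attn = scores.softmax(dim=-1)                                    # global branch
combined = 0.5 * local_attn + 0.5 * global_attn      # mixing weights are illustrative
```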
Integrating Convolution and Attention: This component enriches feature representation by combining convolutional layers with attention mechanisms, enabling the model to extract local context while attending to global information simultaneously.
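One common way to realize this pattern, sketched here under assumptions (depthwise 1-D convolution for local context, standard multi-head self-attention for global context, residual fusion):

```python
import torch
import torch.nn as nn

class ConvAttentionBlock(nn.Module):
    def __init__(self, dim, n_head, kernel_size=3):
        super().__init__()
        # Depthwise convolution extracts local context per channel.
        self.conv = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2, groups=dim)
        self.attn = nn.MultiheadAttention(dim, n_head, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                            # x: (batch, seq, dim)
        local = self.conv(x.transpose(1, 2)).transpose(1, 2)
        global_, _ = self.attn(x, x, x)
        return self.norm(x + local + global_)        # fuse local and global context

x = torch.randn(2, 100, 1024)
y = ConvAttentionBlock(1024, 16)(x)
```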
Rotation Block Output
The transformation applied to the tensor using the Combined Rotary Embedding, Givens Rotation Matrix, and Rotary Orthogonal Matrix can be summarized by an equation of the following form (the exact composition order shown here is an assumption):
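$$
y \;=\; \mathrm{RoPE}(x)\, G(i, j, \theta)\, R
$$

where $\mathrm{RoPE}(x)$ denotes the combined rotary embedding of the input tensor $x$ (treated as row vectors), $G(i, j, \theta)$ is a Givens rotation by angle $\theta$ in the $(i, j)$ plane, and $R$ is the orthogonally initialized rotation matrix.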
Performance Improvements
Applying these transformations to a tensor can improve the model's performance in several ways:
Rotational Symmetry: These transformations exploit rotational symmetry in the feature space, which can help the model recognize patterns and relationships that are invariant to rotations. This is particularly useful in tasks where rotation invariance is important, such as image and signal processing, as well as natural language processing where word order may vary.
Enhanced Representational Power: By introducing rotations, the model can create more complex and nuanced feature representations. This allows the model to capture richer and more detailed information from the input data, leading to better performance on tasks like classification, regression, and generation.
Dimensional Decorrelation: Applying orthogonal transformations helps in decorrelating the dimensions of the feature space. This can reduce redundancy and improve the efficiency of the learned representations, making the model more robust and less prone to overfitting.
Stable Training: Orthogonal matrices preserve the Euclidean norm of the vectors they transform, which can help in maintaining numerical stability during training. This is beneficial for gradient-based optimization algorithms, as it prevents gradients from exploding or vanishing.
Efficient Computations: A Givens rotation acts on only two coordinates at a time, so rotations in high-dimensional spaces can be computed cheaply. This can lead to faster training and lower computational cost than other ways of achieving similar transformations; a short numerical check of this and the norm-preservation property above follows after this list.
Overall, these transformations can help in enhancing the model's ability to learn meaningful patterns from the data, resulting in improved performance and generalization on a variety of tasks.
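A self-contained check of those last two properties (the helper below is purely illustrative and not part of the model code):

```python
import torch

# A Givens rotation touches only two coordinates, so it can be applied without
# materializing a full matrix, and, being orthogonal, it preserves the norm.
def apply_givens(x, i, j, theta):
    c, s = torch.cos(theta), torch.sin(theta)
    out = x.clone()
    out[..., i] = c * x[..., i] - s * x[..., j]
    out[..., j] = s * x[..., i] + c * x[..., j]
    return out

x = torch.randn(4, 1024)
y = apply_givens(x, 3, 7, torch.tensor(0.5))
print(torch.allclose(x.norm(dim=-1), y.norm(dim=-1)))   # True: norms are preserved
```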
The Hugging Face Whisper config parameters can be ignored; they are included only for compatibility. For now, the model should be loaded by initializing it from the model code, as shown below. Note that the model is untrained and is roughly "medium"-sized.
```python
import torch

# WhisperConfig and Echo are defined in the model code that ships with this
# repository; import them from there before running this snippet.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

config = WhisperConfig(
    n_mels=80,
    n_audio_ctx=1500,
    n_audio_state=1024,
    n_audio_head=16,
    n_audio_layer=24,
    vocab_size=51865,
    n_text_ctx=448,
    n_text_state=1024,
    n_text_head=16,
    n_text_layer=20,
    max_rel_dist=15,
    cross_attention=True,
    checkpointing=True,
    base=10000,
    bos_token_id=50257,
    eos_token_id=50257,
    pad_token_id=50257,
    decoder_start_token_id=50258,
    is_encoder_decoder=True,
    init_std=0.02,
)

model = Echo(config).to(device)
model.apply_initialization()
model.save_pretrained("./models/echo2")
model = Echo.from_pretrained("./models/echo2")
```
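A quick sanity check after loading, assuming only that `Echo` is a regular `torch.nn.Module` subclass:

```python
# Rough parameter count, to confirm the configuration above builds a roughly
# "medium"-sized model (nothing here is Echo-specific beyond nn.Module).
n_params = sum(p.numel() for p in model.parameters())
print(f"parameters: {n_params / 1e6:.1f}M")
```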