Ethan Shen committed
Commit dda1539
1 Parent(s): bc4858e

Initial commit

Files changed (41)
  1. .gitignore +3 -0
  2. LICENSE +126 -0
  3. app.py +97 -0
  4. params/g15_d3_mixed.json +27 -0
  5. params/g20_d3_mixed.json +27 -0
  6. params/g5_d3_mixed.json +27 -0
  7. params/p15_d10_mixed.json +26 -0
  8. params/p15_d2_mixed.json +26 -0
  9. params/p15_d3_mixed.json +26 -0
  10. params/p15_d3_ngram4_mixed.json +22 -0
  11. params/p15_d4_mixed.json +26 -0
  12. params/p15_d5_mixed.json +26 -0
  13. params/p15_d6_mixed.json +26 -0
  14. params/p25_d3_mixed.json +26 -0
  15. params/p40_d3_mixed.json +12 -0
  16. params/p5_d3_mixed.json +26 -0
  17. requirements.txt +11 -0
  18. superposed/llama/__init__.py +6 -0
  19. superposed/llama/__pycache__/__init__.cpython-312.pyc +0 -0
  20. superposed/llama/__pycache__/generation.cpython-312.pyc +0 -0
  21. superposed/llama/__pycache__/model.cpython-312.pyc +0 -0
  22. superposed/llama/__pycache__/superpose.cpython-312.pyc +0 -0
  23. superposed/llama/__pycache__/superposed_generation.cpython-312.pyc +0 -0
  24. superposed/llama/__pycache__/superposed_model.cpython-312.pyc +0 -0
  25. superposed/llama/__pycache__/tokenizer.cpython-312.pyc +0 -0
  26. superposed/llama/__pycache__/utils.cpython-312.pyc +0 -0
  27. superposed/llama/generation.py +268 -0
  28. superposed/llama/metrics.py +109 -0
  29. superposed/llama/model.py +548 -0
  30. superposed/llama/superpose.py +328 -0
  31. superposed/llama/superposed_generation.py +198 -0
  32. superposed/llama/superposed_model.py +515 -0
  33. superposed/llama/tokenizer.py +68 -0
  34. superposed/llama/utils.py +70 -0
  35. superposed/ngrams/__pycache__/ngram_models.cpython-312.pyc +0 -0
  36. superposed/ngrams/make_corpus.py +268 -0
  37. superposed/ngrams/ngram_models.py +115 -0
  38. superposed/ngrams/test.json +8 -0
  39. superposed/notebooks/custom.ipynb +289 -0
  40. superposed/notebooks/nq.ipynb +417 -0
  41. superposed/notebooks/triviaqa.ipynb +404 -0
.gitignore ADDED
@@ -0,0 +1,3 @@
+ .env
+ weights
+ ckpts-200k
LICENSE ADDED
@@ -0,0 +1,126 @@
+ LLAMA 2 COMMUNITY LICENSE AGREEMENT
+ Llama 2 Version Release Date: July 18, 2023
+
+ "Agreement" means the terms and conditions for use, reproduction, distribution and
+ modification of the Llama Materials set forth herein.
+
+ "Documentation" means the specifications, manuals and documentation
+ accompanying Llama 2 distributed by Meta at ai.meta.com/resources/models-and-
+ libraries/llama-downloads/.
+
+ "Licensee" or "you" means you, or your employer or any other person or entity (if
+ you are entering into this Agreement on such person or entity's behalf), of the age
+ required under applicable laws, rules or regulations to provide legal consent and that
+ has legal authority to bind your employer or such other person or entity if you are
+ entering in this Agreement on their behalf.
+
+ "Llama 2" means the foundational large language models and software and
+ algorithms, including machine-learning model code, trained model weights,
+ inference-enabling code, training-enabling code, fine-tuning enabling code and other
+ elements of the foregoing distributed by Meta at ai.meta.com/resources/models-and-
+ libraries/llama-downloads/.
+
+ "Llama Materials" means, collectively, Meta's proprietary Llama 2 and
+ Documentation (and any portion thereof) made available under this Agreement.
+
+ "Meta" or "we" means Meta Platforms Ireland Limited (if you are located in or, if you
+ are an entity, your principal place of business is in the EEA or Switzerland) and Meta
+ Platforms, Inc. (if you are located outside of the EEA or Switzerland).
+
+ By clicking "I Accept" below or by using or distributing any portion or element of the
+ Llama Materials, you agree to be bound by this Agreement.
+
+ 1. License Rights and Redistribution.
+
+ a. Grant of Rights. You are granted a non-exclusive, worldwide, non-
+ transferable and royalty-free limited license under Meta's intellectual property or
+ other rights owned by Meta embodied in the Llama Materials to use, reproduce,
+ distribute, copy, create derivative works of, and make modifications to the Llama
+ Materials.
+
+ b. Redistribution and Use.
+
+ i. If you distribute or make the Llama Materials, or any derivative works
+ thereof, available to a third party, you shall provide a copy of this Agreement to such
+ third party.
+ ii. If you receive Llama Materials, or any derivative works thereof, from
+ a Licensee as part of an integrated end user product, then Section 2 of this
+ Agreement will not apply to you.
+
+ iii. You must retain in all copies of the Llama Materials that you
+ distribute the following attribution notice within a "Notice" text file distributed as a
+ part of such copies: "Llama 2 is licensed under the LLAMA 2 Community License,
+ Copyright (c) Meta Platforms, Inc. All Rights Reserved."
+
+ iv. Your use of the Llama Materials must comply with applicable laws
+ and regulations (including trade compliance laws and regulations) and adhere to the
+ Acceptable Use Policy for the Llama Materials (available at
+ https://ai.meta.com/llama/use-policy), which is hereby incorporated by reference into
+ this Agreement.
+
+ v. You will not use the Llama Materials or any output or results of the
+ Llama Materials to improve any other large language model (excluding Llama 2 or
+ derivative works thereof).
+
+ 2. Additional Commercial Terms. If, on the Llama 2 version release date, the
+ monthly active users of the products or services made available by or for Licensee,
+ or Licensee's affiliates, is greater than 700 million monthly active users in the
+ preceding calendar month, you must request a license from Meta, which Meta may
+ grant to you in its sole discretion, and you are not authorized to exercise any of the
+ rights under this Agreement unless or until Meta otherwise expressly grants you
+ such rights.
+
+ 3. Disclaimer of Warranty. UNLESS REQUIRED BY APPLICABLE LAW, THE
+ LLAMA MATERIALS AND ANY OUTPUT AND RESULTS THEREFROM ARE
+ PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND,
+ EITHER EXPRESS OR IMPLIED, INCLUDING, WITHOUT LIMITATION, ANY
+ WARRANTIES OF TITLE, NON-INFRINGEMENT, MERCHANTABILITY, OR
+ FITNESS FOR A PARTICULAR PURPOSE. YOU ARE SOLELY RESPONSIBLE
+ FOR DETERMINING THE APPROPRIATENESS OF USING OR REDISTRIBUTING
+ THE LLAMA MATERIALS AND ASSUME ANY RISKS ASSOCIATED WITH YOUR
+ USE OF THE LLAMA MATERIALS AND ANY OUTPUT AND RESULTS.
+
+ 4. Limitation of Liability. IN NO EVENT WILL META OR ITS AFFILIATES BE
+ LIABLE UNDER ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, TORT,
+ NEGLIGENCE, PRODUCTS LIABILITY, OR OTHERWISE, ARISING OUT OF THIS
+ AGREEMENT, FOR ANY LOST PROFITS OR ANY INDIRECT, SPECIAL,
+ CONSEQUENTIAL, INCIDENTAL, EXEMPLARY OR PUNITIVE DAMAGES, EVEN
+ IF META OR ITS AFFILIATES HAVE BEEN ADVISED OF THE POSSIBILITY OF
+ ANY OF THE FOREGOING.
+
+ 5. Intellectual Property.
+
+ a. No trademark licenses are granted under this Agreement, and in
+ connection with the Llama Materials, neither Meta nor Licensee may use any name
+ or mark owned by or associated with the other or any of its affiliates, except as
+ required for reasonable and customary use in describing and redistributing the
+ Llama Materials.
+
+ b. Subject to Meta's ownership of Llama Materials and derivatives made by or
+ for Meta, with respect to any derivative works and modifications of the Llama
+ Materials that are made by you, as between you and Meta, you are and will be the
+ owner of such derivative works and modifications.
+
+ c. If you institute litigation or other proceedings against Meta or any entity
+ (including a cross-claim or counterclaim in a lawsuit) alleging that the Llama
+ Materials or Llama 2 outputs or results, or any portion of any of the foregoing,
+ constitutes an infringement of intellectual property or other rights owned or licensable
+ by you, then any licenses granted to you under this Agreement shall terminate as of
+ the date such litigation or claim is filed or instituted. You will indemnify and hold
+ harmless Meta from and against any claim by any third party arising out of or related
+ to your use or distribution of the Llama Materials.
+
+ 6. Term and Termination. The term of this Agreement will commence upon your
+ acceptance of this Agreement or access to the Llama Materials and will continue in
+ full force and effect until terminated in accordance with the terms and conditions
+ herein. Meta may terminate this Agreement if you are in breach of any term or
+ condition of this Agreement. Upon termination of this Agreement, you shall delete
+ and cease use of the Llama Materials. Sections 3, 4 and 7 shall survive the
+ termination of this Agreement.
+
+ 7. Governing Law and Jurisdiction. This Agreement will be governed and
+ construed under the laws of the State of California without regard to choice of law
+ principles, and the UN Convention on Contracts for the International Sale of Goods
+ does not apply to this Agreement. The courts of California shall have exclusive
+ jurisdiction of any dispute arising out of this Agreement.
+
app.py ADDED
@@ -0,0 +1,97 @@
+ import gradio as gr
+ import json
+ import os
+ import spaces
+ import torch
+
+ from dotenv import load_dotenv
+ from huggingface_hub import login, snapshot_download
+
+ from superposed.llama.superposed_generation import SuperposedLlama
+ from superposed.llama.tokenizer import Tokenizer
+ from superposed.ngrams.ngram_models import make_models
+
+ # load_dotenv()
+ # print(os.getenv("HF_ACCESS_TOKEN"))
+ login(os.getenv("HF_ACCESS_TOKEN"))
+ if not os.path.exists("./weights/"):
+     os.mkdir("./weights/")
+     snapshot_download(repo_id="meta-llama/Llama-2-7b", local_dir="./weights/")
+ weight_path = "./weights/"
+ # Load params
+ param_file = "params/p15_d3_mixed.json"
+ with open(param_file, "r") as f:
+     params = json.load(f)
+ alpha = params["alpha"]
+ temp = params["temp"]
+ n_drafts = params["n_drafts"]
+ prompt_len = params["prompt_len"]
+ n_token_sample = params["n_token_sample"]
+ i_weights = params["i_weights"]
+ i_length = params["i_length"]
+ # Load main model
+ model = SuperposedLlama.build(ckpt_dir=weight_path,
+                               tokenizer_path=f'{weight_path}/tokenizer.model',
+                               max_seq_len=100,
+                               max_batch_size=32,
+                               model_parallel_size=1)
+ tokenizer = Tokenizer(f'{weight_path}/tokenizer.model')
+ # Create ngram models
+ ngrams = make_models("ckpts-200k", bigram=True, trigram=True, fourgram=True, fivegram=True, sixgram=True, sevengram=False)
+
+ def decode(tokenizer, encoding):
+     """
+     Args:
+         tokenizer (Any): Tokenizer
+         encoding (torch.Tensor): Encoding
+     Returns:
+         decoding (str)
+     """
+     eos_locs = (encoding == tokenizer.eos_id).nonzero()
+     if len(eos_locs > 0):
+         encoding = encoding[:eos_locs[0]]
+     return tokenizer.decode(encoding.to(torch.int32).tolist())
+
+ @spaces.GPU
+ def update_options(input, num_tokens):
+     tokenized_prompts = tokenizer.encode([input], True, False)
+     alive_gens, _ = model.sup_generate(prompt_tokens=tokenized_prompts,
+                                        smoothing="geom",
+                                        max_gen_len=num_tokens,
+                                        n_token_sample=n_token_sample,
+                                        alpha=alpha,
+                                        temp=temp,
+                                        n_drafts=n_drafts,
+                                        i_weights=i_weights,
+                                        i_length=i_length,
+                                        ngrams=ngrams,
+                                        get_time=False,
+                                        penalty=200)
+     gens = alive_gens[0].reshape(n_drafts, -1)
+     return decode(tokenizer, gens[0]), decode(tokenizer, gens[1]), decode(tokenizer, gens[2])
+
+ with gr.Blocks(theme=gr.themes.Soft()) as demo:
+     gr.Markdown(
+         """
+         # Superposed Decoding
+         Start typing below to see suggestions.
+         """)
+     slider = gr.Slider(minimum=1, maximum=10, step=1, label="Generation length", value=10)
+     inp = gr.Textbox(placeholder="Type anything!", lines=3)
+     option1 = gr.Button(value="Option 1")
+     option2 = gr.Button(value="Option 2")
+     option3 = gr.Button(value="Option 3")
+     inp.change(update_options, inputs=[inp, slider], outputs=[option1, option2, option3])
+     # Button updates
+     @option1.click(inputs=[inp, option1], outputs=inp)
+     def option1_click(curr, txt):
+         return curr + txt
+     @option2.click(inputs=[inp, option2], outputs=inp)
+     def option2_click(curr, txt):
+         return curr + txt
+     @option3.click(inputs=[inp, option3], outputs=inp)
+     def option3_click(curr, txt):
+         return curr + txt
+
+ if __name__ == "__main__":
+     demo.launch(debug=True)
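
For readers skimming the diff, here is a minimal self-contained sketch of the EOS-truncation step that decode() above performs; the eos_id value and token ids are made up for illustration only.

    import torch

    eos_id = 2  # hypothetical EOS token id, for illustration
    encoding = torch.tensor([5, 9, 11, 2, 0, 0])

    # Find the first EOS position and cut the draft there, as decode() does
    eos_locs = (encoding == eos_id).nonzero()
    if len(eos_locs) > 0:
        encoding = encoding[: int(eos_locs[0])]
    print(encoding.tolist())  # [5, 9, 11]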
params/g15_d3_mixed.json ADDED
@@ -0,0 +1,27 @@
+ {
+     "alpha": 0.48,
+     "temp": 0.06,
+     "n_drafts": 3,
+     "prompt_len": 15,
+     "n_token_sample": 15,
+     "max_gen_len": 15,
+     "n_token_consider": 32000,
+     "mixing_method": "sample_new_weights_with_score",
+     "smoothing": "geom",
+     "sample_tokens": 0,
+     "sample_beams": 0,
+     "i_weights": [
+         0.01,
+         0.04,
+         0.15,
+         0.18,
+         0.12
+     ],
+     "i_length": [
+         1,
+         2,
+         3,
+         4,
+         5
+     ]
+ }
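
A rough sketch of how a parameter file like this is consumed, mirroring the loading code in app.py above (the path and variable names follow that script; the inline comments only restate values from this file):

    import json

    with open("params/g15_d3_mixed.json") as f:
        params = json.load(f)

    # Decoding hyperparameters handed to sup_generate in app.py
    alpha = params["alpha"]                    # 0.48 in this file
    temp = params["temp"]                      # 0.06 in this file
    n_drafts = params["n_drafts"]              # number of superposed drafts (3)
    n_token_sample = params["n_token_sample"]
    i_weights = params["i_weights"]            # interpolation weights, paired with i_length
    i_length = params["i_length"]              # n-gram orders 1 through 5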
params/g20_d3_mixed.json ADDED
@@ -0,0 +1,27 @@
+ {
+     "alpha": 0.5,
+     "temp": 0.04,
+     "n_drafts": 3,
+     "prompt_len": 15,
+     "n_token_sample": 15,
+     "max_gen_len": 20,
+     "n_token_consider": 32000,
+     "mixing_method": "sample_new_weights_with_score",
+     "smoothing": "geom",
+     "sample_tokens": 0,
+     "sample_beams": 0,
+     "i_weights": [
+         0.01,
+         0.04,
+         0.15,
+         0.18,
+         0.12
+     ],
+     "i_length": [
+         1,
+         2,
+         3,
+         4,
+         5
+     ]
+ }
params/g5_d3_mixed.json ADDED
@@ -0,0 +1,27 @@
+ {
+     "alpha": 0.52,
+     "temp": 0.06,
+     "n_drafts": 3,
+     "prompt_len": 15,
+     "n_token_sample": 15,
+     "max_gen_len": 5,
+     "n_token_consider": 32000,
+     "mixing_method": "sample_new_weights_with_score",
+     "smoothing": "geom",
+     "sample_tokens": 0,
+     "sample_beams": 0,
+     "i_weights": [
+         0.01,
+         0.04,
+         0.15,
+         0.18,
+         0.12
+     ],
+     "i_length": [
+         1,
+         2,
+         3,
+         4,
+         5
+     ]
+ }
params/p15_d10_mixed.json ADDED
@@ -0,0 +1,26 @@
+ {
+     "alpha": 0.54,
+     "temp": 0.12,
+     "n_drafts": 10,
+     "prompt_len": 15,
+     "n_token_sample": 30,
+     "n_token_consider": 32000,
+     "mixing_method": "sample_new_weights_with_score",
+     "smoothing": "geom",
+     "sample_tokens": 0,
+     "sample_beams": 0,
+     "i_weights": [
+         0.01,
+         0.04,
+         0.15,
+         0.18,
+         0.12
+     ],
+     "i_length": [
+         1,
+         2,
+         3,
+         4,
+         5
+     ]
+ }
params/p15_d2_mixed.json ADDED
@@ -0,0 +1,26 @@
+ {
+     "alpha": 0.62,
+     "temp": 0.06,
+     "n_drafts": 2,
+     "prompt_len": 15,
+     "n_token_sample": 6,
+     "n_token_consider": 32000,
+     "mixing_method": "sample_new_weights_with_score",
+     "smoothing": "geom",
+     "sample_tokens": 0,
+     "sample_beams": 0,
+     "i_weights": [
+         0.01,
+         0.04,
+         0.15,
+         0.18,
+         0.12
+     ],
+     "i_length": [
+         1,
+         2,
+         3,
+         4,
+         5
+     ]
+ }
params/p15_d3_mixed.json ADDED
@@ -0,0 +1,26 @@
+ {
+     "alpha": 0.54,
+     "temp": 0.06,
+     "n_drafts": 3,
+     "prompt_len": 15,
+     "n_token_sample": 9,
+     "n_token_consider": 32000,
+     "mixing_method": "sample_new_weights_with_score",
+     "smoothing": "geom",
+     "sample_tokens": 0,
+     "sample_beams": 0,
+     "i_weights": [
+         0.01,
+         0.04,
+         0.15,
+         0.18,
+         0.12
+     ],
+     "i_length": [
+         1,
+         2,
+         3,
+         4,
+         5
+     ]
+ }
params/p15_d3_ngram4_mixed.json ADDED
@@ -0,0 +1,22 @@
+ {
+     "alpha": 0.55,
+     "temp": 0.1,
+     "n_drafts": 3,
+     "prompt_len": 15,
+     "n_token_sample": 9,
+     "n_token_consider": 32000,
+     "mixing_method": "sample_new_weights_with_score",
+     "smoothing": "geom",
+     "sample_tokens": 0,
+     "sample_beams": 0,
+     "i_weights": [
+         0.01,
+         0.04,
+         0.15
+     ],
+     "i_length": [
+         1,
+         2,
+         3
+     ]
+ }
params/p15_d4_mixed.json ADDED
@@ -0,0 +1,26 @@
+ {
+     "alpha": 0.52,
+     "temp": 0.06,
+     "n_drafts": 4,
+     "prompt_len": 15,
+     "n_token_sample": 12,
+     "n_token_consider": 32000,
+     "mixing_method": "sample_new_weights_with_score",
+     "smoothing": "geom",
+     "sample_tokens": 0,
+     "sample_beams": 0,
+     "i_weights": [
+         0.01,
+         0.04,
+         0.15,
+         0.18,
+         0.12
+     ],
+     "i_length": [
+         1,
+         2,
+         3,
+         4,
+         5
+     ]
+ }
params/p15_d5_mixed.json ADDED
@@ -0,0 +1,26 @@
+ {
+     "alpha": 0.6,
+     "temp": 0.06,
+     "n_drafts": 5,
+     "prompt_len": 15,
+     "n_token_sample": 15,
+     "n_token_consider": 32000,
+     "mixing_method": "sample_new_weights_with_score",
+     "smoothing": "geom",
+     "sample_tokens": 0,
+     "sample_beams": 0,
+     "i_weights": [
+         0.01,
+         0.04,
+         0.15,
+         0.18,
+         0.12
+     ],
+     "i_length": [
+         1,
+         2,
+         3,
+         4,
+         5
+     ]
+ }
params/p15_d6_mixed.json ADDED
@@ -0,0 +1,26 @@
+ {
+     "alpha": 0.52,
+     "temp": 0.06,
+     "n_drafts": 6,
+     "prompt_len": 15,
+     "n_token_sample": 18,
+     "n_token_consider": 32000,
+     "mixing_method": "sample_new_weights_with_score",
+     "smoothing": "geom",
+     "sample_tokens": 0,
+     "sample_beams": 0,
+     "i_weights": [
+         0.01,
+         0.04,
+         0.15,
+         0.18,
+         0.12
+     ],
+     "i_length": [
+         1,
+         2,
+         3,
+         4,
+         5
+     ]
+ }
params/p25_d3_mixed.json ADDED
@@ -0,0 +1,26 @@
+ {
+     "alpha": 0.5,
+     "temp": 0.12,
+     "n_drafts": 3,
+     "prompt_len": 25,
+     "n_token_sample": 15,
+     "n_token_consider": 32000,
+     "mixing_method": "sample_new_weights_with_score",
+     "smoothing": "geom",
+     "sample_tokens": 0,
+     "sample_beams": 0,
+     "i_weights": [
+         0.01,
+         0.04,
+         0.15,
+         0.18,
+         0.12
+     ],
+     "i_length": [
+         1,
+         2,
+         3,
+         4,
+         5
+     ]
+ }
params/p40_d3_mixed.json ADDED
@@ -0,0 +1,12 @@
+ {
+     "alpha": 0.55,
+     "temp": 0.1,
+     "prompt_len": 40,
+     "mixing_method": "sample_new_weights_with_score",
+     "smoothing": "geom",
+     "sample_tokens": 0,
+     "sample_beams": 0,
+     "i_weights": [0.01, 0.04, 0.15, 0.18, 0.12],
+     "i_length": [1, 2, 3, 4, 5],
+     "ckpt_path": "../ckpts-200k"
+ }
params/p5_d3_mixed.json ADDED
@@ -0,0 +1,26 @@
+ {
+     "alpha": 0.34,
+     "temp": 0.12,
+     "n_drafts": 3,
+     "prompt_len": 5,
+     "n_token_sample": 15,
+     "n_token_consider": 32000,
+     "mixing_method": "sample_new_weights_with_score",
+     "smoothing": "geom",
+     "sample_tokens": 0,
+     "sample_beams": 0,
+     "i_weights": [
+         0.01,
+         0.04,
+         0.15,
+         0.18,
+         0.12
+     ],
+     "i_length": [
+         1,
+         2,
+         3,
+         4,
+         5
+     ]
+ }
requirements.txt ADDED
@@ -0,0 +1,11 @@
+ datasets==2.19.0
+ fairscale==0.4.13
+ loguru==0.7.2
+ nltk==3.8.1
+ numpy==1.26.4
+ Requests==2.32.2
+ sentencepiece==0.2.0
+ setuptools==58.2.0
+ torch==2.3.0
+ tqdm==4.66.4
+ transformers==4.37.2
superposed/llama/__init__.py ADDED
@@ -0,0 +1,6 @@
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
+ # This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
+
+ from .generation import Llama, Dialog
+ from .model import ModelArgs, Transformer
+ from .tokenizer import Tokenizer
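
Given these re-exports, downstream code can import the core classes directly from the package, for example:

    from superposed.llama import Llama, ModelArgs, Tokenizer, Transformer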
superposed/llama/__pycache__/__init__.cpython-312.pyc ADDED
Binary file (335 Bytes)
superposed/llama/__pycache__/generation.cpython-312.pyc ADDED
Binary file (13.9 kB)
superposed/llama/__pycache__/model.cpython-312.pyc ADDED
Binary file (26.7 kB)
superposed/llama/__pycache__/superpose.cpython-312.pyc ADDED
Binary file (19.1 kB)
superposed/llama/__pycache__/superposed_generation.cpython-312.pyc ADDED
Binary file (10.1 kB)
superposed/llama/__pycache__/superposed_model.cpython-312.pyc ADDED
Binary file (25.9 kB)
superposed/llama/__pycache__/tokenizer.cpython-312.pyc ADDED
Binary file (3.26 kB)
superposed/llama/__pycache__/utils.cpython-312.pyc ADDED
Binary file (3.97 kB)
superposed/llama/generation.py ADDED
@@ -0,0 +1,268 @@
1
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
2
+ # This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
3
+
4
+ import json
5
+ import os
6
+ import sys
7
+ import time
8
+ from pathlib import Path
9
+ from typing import List, Literal, Optional, Tuple, TypedDict
10
+
11
+ import torch
12
+ import torch.nn.functional as F
13
+ from fairscale.nn.model_parallel.initialize import (
14
+ get_model_parallel_rank,
15
+ initialize_model_parallel,
16
+ model_parallel_is_initialized,
17
+ )
18
+
19
+ from superposed.llama.model import ModelArgs, Transformer
20
+ from superposed.llama.tokenizer import Tokenizer
21
+ from superposed.llama.utils import *
22
+
23
+ Role = Literal["system", "user", "assistant"]
24
+
25
+
26
+ class Message(TypedDict):
27
+ role: Role
28
+ content: str
29
+
30
+
31
+ class CompletionPrediction(TypedDict, total=False):
32
+ generation: str
33
+ tokens: List[str] # not required
34
+ logprobs: List[float] # not required
35
+
36
+
37
+ class ChatPrediction(TypedDict, total=False):
38
+ generation: Message
39
+ tokens: List[str] # not required
40
+ logprobs: List[float] # not required
41
+
42
+
43
+ Dialog = List[Message]
44
+
45
+ B_INST, E_INST = "[INST]", "[/INST]"
46
+ B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"
47
+
48
+ SPECIAL_TAGS = [B_INST, E_INST, "<<SYS>>", "<</SYS>>"]
49
+ UNSAFE_ERROR = "Error: special tags are not allowed as part of the prompt."
50
+
51
+
52
+ class Llama:
53
+ @staticmethod
54
+ def build(
55
+ ckpt_dir: str,
56
+ tokenizer_path: str,
57
+ max_seq_len: int,
58
+ max_batch_size: int,
59
+ device: None,
60
+ model_parallel_size: Optional[int] = None,
61
+ seed: int = 1,
62
+ ) -> "Llama":
63
+ """
64
+ Build a Llama instance by initializing and loading a pre-trained model.
65
+
66
+ Args:
67
+ ckpt_dir (str): Path to the directory containing checkpoint files.
68
+ tokenizer_path (str): Path to the tokenizer file.
69
+ max_seq_len (int): Maximum sequence length for input text.
70
+ max_batch_size (int): Maximum batch size for inference.
71
+ mixed (bool): Whether to mix embeddings or not
72
+ model_parallel_size (Optional[int], optional): Number of model parallel processes.
73
+ If not provided, it's determined from the environment. Defaults to None.
74
+
75
+ Returns:
76
+ Llama: An instance of the Llama class with the loaded model and tokenizer.
77
+
78
+ Raises:
79
+ AssertionError: If there are no checkpoint files in the specified directory,
80
+ or if the model parallel size does not match the number of checkpoint files.
81
+
82
+ Note:
83
+ This method initializes the distributed process group, sets the device to CUDA,
84
+ and loads the pre-trained model and tokenizer.
85
+
86
+ """
87
+ if not torch.distributed.is_initialized():
88
+ torch.distributed.init_process_group("nccl")
89
+ if not model_parallel_is_initialized():
90
+ if model_parallel_size is None:
91
+ model_parallel_size = int(os.environ.get("WORLD_SIZE", 1))
92
+ initialize_model_parallel(model_parallel_size)
93
+
94
+ local_rank = int(os.environ.get("LOCAL_RANK", 0))
95
+ print(local_rank)
96
+ # torch.cuda.set_device(local_rank)
97
+ if device == None:
98
+ torch.cuda.set_device(local_rank)
99
+ device = f"cuda:{local_rank}"
100
+ # seed must be the same in all processes
101
+ torch.manual_seed(seed)
102
+
103
+ if local_rank > 0:
104
+ sys.stdout = open(os.devnull, "w")
105
+
106
+ start_time = time.time()
107
+ checkpoints = sorted(Path(ckpt_dir).glob("*.pth"))
108
+ assert len(checkpoints) > 0, f"no checkpoint files found in {ckpt_dir}"
109
+ assert model_parallel_size == len(
110
+ checkpoints
111
+ ), f"Loading a checkpoint for MP={len(checkpoints)} but world size is {model_parallel_size}"
112
+ ckpt_path = checkpoints[get_model_parallel_rank()]
113
+ checkpoint = torch.load(ckpt_path, map_location="cpu")
114
+ with open(Path(ckpt_dir) / "params.json", "r") as f:
115
+ params = json.loads(f.read())
116
+
117
+ model_args: ModelArgs = ModelArgs(
118
+ max_seq_len=max_seq_len,
119
+ max_batch_size=max_batch_size,
120
+ **params,
121
+ )
122
+ tokenizer = Tokenizer(model_path=tokenizer_path)
123
+ model_args.vocab_size = tokenizer.n_words
124
+ torch.set_default_tensor_type(torch.cuda.HalfTensor)
125
+ model = Transformer(model_args)
126
+ model.load_state_dict(checkpoint, strict=False)
127
+ print(f"Loaded in {time.time() - start_time:.2f} seconds")
128
+ return Llama(model, tokenizer, device)
129
+
130
+ def __init__(self, model: Transformer, tokenizer: Tokenizer, device):
131
+ self.model = model.to(device).eval()
132
+ self.tokenizer = tokenizer
133
+ self.device = device
134
+
135
+ @torch.inference_mode()
136
+ def generate(
137
+ self,
138
+ prompt_tokens: List[List[int]],
139
+ max_gen_len: int,
140
+ temperature: float = 0.6,
141
+ top_p: float = 0.9,
142
+ logprobs: bool = True,
143
+ grade: bool = False
144
+ ) -> Tuple[List[List[int]], Optional[List[List[float]]]]:
145
+ """
146
+ Generate text sequences based on provided prompts using the language generation model.
147
+
148
+ Args:
149
+ prompt_tokens (List[List[int]]): List of tokenized prompts, where each prompt is represented as a list of integers.
150
+ max_gen_len (int): Maximum length of the generated text sequence.
151
+ temperature (float, optional): Temperature value for controlling randomness in sampling. Defaults to 0.6.
152
+ top_p (float, optional): Top-p probability threshold for nucleus sampling. Defaults to 0.9.
153
+ logprobs (bool, optional): Flag indicating whether to compute token log probabilities. Defaults to False.
154
+ echo (bool, optional): Flag indicating whether to include prompt tokens in the generated output. Defaults to False.
155
+
156
+ Returns:
157
+ Tuple[List[List[int]], Optional[List[List[float]]]]: A tuple containing generated token sequences and, if logprobs is True, corresponding token log probabilities.
158
+
159
+ Note:
160
+ This method uses the provided prompts as a basis for generating text. It employs nucleus sampling to produce text with controlled randomness.
161
+ If logprobs is True, token log probabilities are computed for each generated token.
162
+
163
+ """
164
+ params = self.model.params
165
+ bsz = len(prompt_tokens)
166
+ assert bsz <= params.max_batch_size, (bsz, params.max_batch_size)
167
+
168
+ min_prompt_len = min(len(t) for t in prompt_tokens)
169
+ max_prompt_len = max(len(t) for t in prompt_tokens)
170
+ # assert min_prompt_len == max_prompt_len
171
+ prompt_len = min_prompt_len
172
+ assert max_prompt_len <= params.max_seq_len
173
+ total_len = min(params.max_seq_len, max_gen_len + max_prompt_len)
174
+
175
+ pad_id = self.tokenizer.pad_id
176
+ tokens = torch.full((bsz, total_len), pad_id, dtype=torch.long, device=self.device)
177
+ for k, t in enumerate(prompt_tokens):
178
+ tokens[k, : len(t)] = torch.tensor(t, dtype=torch.long, device=self.device)
179
+ if logprobs:
180
+ token_logprobs = torch.zeros_like(tokens, dtype=torch.float)
181
+ prev_pos = 0
182
+ eos_reached = torch.tensor([False] * bsz, device=self.device)
183
+ input_text_mask = tokens != pad_id
184
+ if grade:
185
+ pad_mask = tokens == pad_id
186
+ tokens = torch.where(tokens == pad_id, 0, tokens)
187
+ logits = self.model.forward(tokens, prev_pos, False)
188
+ tokens[pad_mask] = pad_id
189
+ token_logprobs = -F.cross_entropy(
190
+ input=logits[:, :-1, :].transpose(1, 2),
191
+ target=tokens[:, 1:],
192
+ reduction="none",
193
+ ignore_index=pad_id,
194
+ )
195
+ #if pad_id in tokens:
196
+ # print(pad_id)
197
+ # print(tokens)
198
+ # print(token_logprobs)
199
+ return token_logprobs
200
+
201
+ for cur_pos in range(min_prompt_len, total_len):
202
+ logits = self.model.forward(tokens[:, prev_pos:cur_pos], prev_pos, False)
203
+ if temperature > 0:
204
+ probs = torch.softmax(logits[:, -1] / temperature, dim=-1)
205
+ next_token = sample_top_p(probs, top_p)
206
+ else:
207
+ next_token = torch.argmax(logits[:, -1], dim=-1)
208
+
209
+ next_token = next_token.reshape(-1)
210
+ # only replace token if prompt has already been generated
211
+ next_token = torch.where(
212
+ input_text_mask[:, cur_pos], tokens[:, cur_pos], next_token
213
+ )
214
+ tokens[:, cur_pos] = next_token
215
+ if logprobs:
216
+ token_logprobs[:, prev_pos + 1 : cur_pos + 1] = -F.cross_entropy(
217
+ input=logits.transpose(1, 2),
218
+ target=tokens[:, prev_pos + 1 : cur_pos + 1],
219
+ reduction="none",
220
+ ignore_index=pad_id,
221
+ )
222
+ eos_reached |= (~input_text_mask[:, cur_pos]) & (
223
+ next_token == self.tokenizer.eos_id
224
+ )
225
+ prev_pos = cur_pos
226
+ if all(eos_reached):
227
+ break
228
+
229
+ # seq_len = torch.sum(tokens != pad_id, dim=1)
230
+ # return tokens, torch.exp(-1 * torch.sum(logprobs, dim=1) / (seq_len - prompt_len)), torch.exp(-1 * torch.sum(custom_logprobs, dim=1) / )
231
+ if logprobs:
232
+ token_logprobs = token_logprobs.tolist()
233
+
234
+ out_ppl = []
235
+ for i, toks in enumerate(tokens.tolist()):
236
+ if logprobs:
237
+ probs = token_logprobs[i][prompt_len : len(prompt_tokens[i]) + max_gen_len]
238
+ # cut to eos tok if any
239
+ if self.tokenizer.eos_id in toks:
240
+ eos_idx = toks.index(self.tokenizer.eos_id)
241
+ probs = probs[:eos_idx] if logprobs else None
242
+ out_ppl.append(torch.exp(-1 * torch.sum(torch.tensor(probs)) / len(probs)))
243
+ return tokens, torch.tensor(out_ppl) if logprobs else None
244
+
245
+ def sample_top_p(probs, p, s=1):
246
+ """
247
+ Perform top-p (nucleus) sampling on a probability distribution.
248
+
249
+ Args:
250
+ probs (torch.Tensor): Probability distribution tensor.
251
+ p (float): Probability threshold for top-p sampling.
252
+
253
+ Returns:
254
+ torch.Tensor: Sampled token indices.
255
+
256
+ Note:
257
+ Top-p sampling selects the smallest set of tokens whose cumulative probability mass
258
+ exceeds the threshold p. The distribution is renormalized based on the selected tokens.
259
+
260
+ """
261
+ probs_sort, probs_idx = torch.sort(probs, dim=-1, descending=True)
262
+ probs_sum = torch.cumsum(probs_sort, dim=-1)
263
+ mask = probs_sum - probs_sort > p
264
+ probs_sort[mask] = 0.0
265
+ probs_sort.div_(probs_sort.sum(dim=-1, keepdim=True))
266
+ next_token = torch.multinomial(probs_sort, num_samples=s)
267
+ next_token = torch.gather(probs_idx, -1, next_token)
268
+ return next_token
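
The sample_top_p helper defined at the end of generation.py implements nucleus sampling. Below is a small self-contained sketch of the same steps on a toy distribution; the probability values and the threshold p = 0.8 are illustrative only.

    import torch

    probs = torch.tensor([[0.50, 0.30, 0.10, 0.05, 0.05]])
    p = 0.8

    # Keep the smallest set of tokens whose cumulative mass exceeds p, then renormalize
    probs_sort, probs_idx = torch.sort(probs, dim=-1, descending=True)
    probs_sum = torch.cumsum(probs_sort, dim=-1)
    probs_sort[probs_sum - probs_sort > p] = 0.0          # zeroes out the two 0.05 tail tokens
    probs_sort.div_(probs_sort.sum(dim=-1, keepdim=True))
    next_token = torch.gather(probs_idx, -1, torch.multinomial(probs_sort, num_samples=1))
    print(next_token)  # always one of the three highest-probability tokens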
superposed/llama/metrics.py ADDED
@@ -0,0 +1,109 @@
+ import torch
+ import nltk
+ from nltk.translate.bleu_score import SmoothingFunction
+ from tqdm import tqdm
+
+ def calculate_perplexity(model, tokens, prompt_len, bsz=1, marker=False):
+     """
+     Calculate perplexity of given tokens using provided model, ignoring padding tokens.
+     Args:
+         model: Llama model
+         tokens (List[List[int]] or torch.Tensor): Input tokens (n_prompt * n_draft, seqlen)
+         prompt_len (int): Prefix length
+         bsz (int): Batch size
+         marker (bool): Whether to show progress bar
+     Returns:
+         Perplexity across all generations (n_prompt * n_drafts)
+     """
+     it = range(0, len(tokens), bsz)
+     if marker:
+         it = tqdm(it)
+     start = 0
+     ppl = torch.zeros(len(tokens))
+     for start in it:
+         end = start + bsz
+         data = tokens[start : end]
+         if not isinstance(data, list):
+             data = data.tolist()
+         # Remove any padding tokens (-1) in generations
+         for d_idx in range(len(data)):
+             cur = data[d_idx]
+             if -1 in cur:
+                 data[d_idx] = cur[:cur.index(-1)]
+         # Calculate cross entropy loss on tokens
+         ce_loss = model.generate(data, max_gen_len=0, temperature=-1, top_p=-1, grade=True)
+         # Cut off everything past `prompt_len`
+         ce_loss = ce_loss[:, prompt_len-1:]  # Subtract 1 because the first token (start token) is removed
+         # Calculate perplexity
+         lengths = (ce_loss != 0).sum(dim=-1)
+         mean = ce_loss.sum(dim=-1) / lengths
+         ppl[start : end] = torch.exp(-1 * mean)
+     return ppl
+
+ def calculate_diversity(generations, k=4):
+     """
+     Calculate diversity of generations using SELF-BLEU.
+     Args:
+         generations (List[List[List[int]]]): Tokenized input
+         k (int, Optional): Number of n-grams to use for bleu
+     Returns:
+         Average diversity across all generations (float)
+     """
+     nltk.download('punkt')  # Can be deleted once downloaded
+     smooth = SmoothingFunction()
+     bleus = []
+
+     for drafts in generations:
+         tokenized_drafts = []
+         # Stringify tokens
+         for d in drafts:
+             if -1 in d:
+                 d = d[:d.index(-1)]
+             tokenized_drafts.append([str(n) for n in d])
+         # Calculate SELF-BLEU
+         minlength = min([len(g) for g in tokenized_drafts])
+         minlength = min(minlength, k)
+         weights = tuple((1. / minlength for _ in range(minlength)))
+         for i in range(len(drafts)):
+             # Create source and reference (all other drafts)
+             src = tokenized_drafts[i]
+             ref = tokenized_drafts[:i] + tokenized_drafts[i+1:]
+             tmp = nltk.translate.bleu_score.sentence_bleu(references=ref,
+                                                           hypothesis=src,
+                                                           weights=weights,
+                                                           smoothing_function=smooth.method1)
+             bleus.append(tmp)
+     bleus = torch.Tensor(bleus)
+     return torch.mean(bleus)
+
+
+ def calculate_ngram_repetition(sequences):
+     """
+     Calculate uniqueness scores of `sequences`.
+     Args:
+         sequences (List[List[int]]): Generated sequences
+     Returns:
+         (unigram_uniqueness, bigram_uniqueness, trigram_uniqueness)
+     """
+     u_total = 0
+     b_total = 0
+     t_total = 0
+     # Iterate through all sequences indiscriminately
+     for gen in sequences:
+         if -1 in gen:
+             gen = gen[:gen.index(-1)]
+         unigrams, bigrams, trigrams = [], [], []
+         o = [str(i) for i in gen]
+         # Create lists of n-grams for the generation
+         for i in range(len(o)):
+             unigrams.append(o[i])
+         for i in range(len(o) - 1):
+             bigrams.append(o[i] + '_' + o[i + 1])
+         for i in range(len(o) - 2):
+             trigrams.append(o[i] + '_' + o[i + 1] + '_' + o[i + 2])
+         # Calculate uniqueness of the generation
+         u, b, t = len(set(unigrams)) / len(unigrams), len(set(bigrams)) / len(bigrams), len(set(trigrams)) / len(trigrams)
+         u_total += u
+         b_total += b
+         t_total += t
+     return u_total / len(sequences), b_total / len(sequences), t_total / len(sequences)
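
As a quick worked example of the uniqueness scores computed by calculate_ngram_repetition, consider a single toy sequence (the token ids are made up):

    gen = [4, 7, 4, 7, 9]
    o = [str(i) for i in gen]
    unigrams = o
    bigrams = [o[i] + "_" + o[i + 1] for i in range(len(o) - 1)]
    trigrams = [o[i] + "_" + o[i + 1] + "_" + o[i + 2] for i in range(len(o) - 2)]
    # 3 unique of 5 unigrams, 3 unique of 4 bigrams, 3 unique of 3 trigrams
    print(len(set(unigrams)) / len(unigrams),   # 0.6
          len(set(bigrams)) / len(bigrams),     # 0.75
          len(set(trigrams)) / len(trigrams))   # 1.0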
superposed/llama/model.py ADDED
@@ -0,0 +1,548 @@
1
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
2
+ # This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
3
+
4
+ import math
5
+ from dataclasses import dataclass
6
+ from typing import Optional, Tuple
7
+
8
+ import fairscale.nn.model_parallel.initialize as fs_init
9
+ import torch
10
+ import torch.nn.functional as F
11
+ from fairscale.nn.model_parallel.layers import (
12
+ ColumnParallelLinear,
13
+ ParallelEmbedding,
14
+ RowParallelLinear,
15
+ )
16
+ from torch import nn
17
+
18
+
19
+ @dataclass
20
+ class ModelArgs:
21
+ dim: int = 4096
22
+ n_layers: int = 32
23
+ n_heads: int = 32
24
+ n_kv_heads: Optional[int] = None
25
+ vocab_size: int = -1 # defined later by tokenizer
26
+ multiple_of: int = 256 # make SwiGLU hidden layer size multiple of large power of 2
27
+ ffn_dim_multiplier: Optional[float] = None
28
+ norm_eps: float = 1e-5
29
+
30
+ max_batch_size: int = 32
31
+ max_seq_len: int = 2048
32
+
33
+
34
+ class RMSNorm(torch.nn.Module):
35
+ def __init__(self, dim: int, eps: float = 1e-6):
36
+ """
37
+ Initialize the RMSNorm normalization layer.
38
+
39
+ Args:
40
+ dim (int): The dimension of the input tensor.
41
+ eps (float, optional): A small value added to the denominator for numerical stability. Default is 1e-6.
42
+
43
+ Attributes:
44
+ eps (float): A small value added to the denominator for numerical stability.
45
+ weight (nn.Parameter): Learnable scaling parameter.
46
+
47
+ """
48
+ super().__init__()
49
+ self.eps = eps
50
+ self.weight = nn.Parameter(torch.ones(dim))
51
+
52
+ def _norm(self, x):
53
+ """
54
+ Apply the RMSNorm normalization to the input tensor.
55
+
56
+ Args:
57
+ x (torch.Tensor): The input tensor.
58
+
59
+ Returns:
60
+ torch.Tensor: The normalized tensor.
61
+
62
+ """
63
+ return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
64
+
65
+ def forward(self, x):
66
+ """
67
+ Forward pass through the RMSNorm layer.
68
+
69
+ Args:
70
+ x (torch.Tensor): The input tensor.
71
+
72
+ Returns:
73
+ torch.Tensor: The output tensor after applying RMSNorm.
74
+
75
+ """
76
+ output = self._norm(x.float()).type_as(x)
77
+ return output * self.weight
78
+
79
+
80
+ def precompute_freqs_cis(dim: int, end: int, theta: float = 10000.0):
81
+ """
82
+ Precompute the frequency tensor for complex exponentials (cis) with given dimensions.
83
+
84
+ This function calculates a frequency tensor with complex exponentials using the given dimension 'dim'
85
+ and the end index 'end'. The 'theta' parameter scales the frequencies.
86
+ The returned tensor contains complex values in complex64 data type.
87
+
88
+ Args:
89
+ dim (int): Dimension of the frequency tensor.
90
+ end (int): End index for precomputing frequencies.
91
+ theta (float, optional): Scaling factor for frequency computation. Defaults to 10000.0.
92
+
93
+ Returns:
94
+ torch.Tensor: Precomputed frequency tensor with complex exponentials.
95
+
96
+
97
+
98
+
99
+ """
100
+ freqs = 1.0 / (theta ** (torch.arange(0, dim, 2)[: (dim // 2)].float() / dim))
101
+ t = torch.arange(end, device=freqs.device) # type: ignore
102
+ freqs = torch.outer(t, freqs).float() # type: ignore
103
+ freqs_cis = torch.polar(torch.ones_like(freqs), freqs) # complex64
104
+ return freqs_cis
105
+
106
+
107
+ def reshape_for_broadcast(freqs_cis: torch.Tensor, x: torch.Tensor):
108
+ """
109
+ Reshape frequency tensor for broadcasting it with another tensor.
110
+
111
+ This function reshapes the frequency tensor to have the same shape as the target tensor 'x'
112
+ for the purpose of broadcasting the frequency tensor during element-wise operations.
113
+
114
+ Args:
115
+ freqs_cis (torch.Tensor): Frequency tensor to be reshaped.
116
+ x (torch.Tensor): Target tensor for broadcasting compatibility.
117
+
118
+ Returns:
119
+ torch.Tensor: Reshaped frequency tensor.
120
+
121
+ Raises:
122
+ AssertionError: If the frequency tensor doesn't match the expected shape.
123
+ AssertionError: If the target tensor 'x' doesn't have the expected number of dimensions.
124
+ """
125
+ ndim = x.ndim
126
+ assert 0 <= 1 < ndim
127
+ assert freqs_cis.shape == (x.shape[1], x.shape[-1])
128
+ shape = [d if i == 1 or i == ndim - 1 else 1 for i, d in enumerate(x.shape)]
129
+ return freqs_cis.view(*shape)
130
+
131
+
132
+ def apply_rotary_emb(
133
+ xq: torch.Tensor,
134
+ xk: torch.Tensor,
135
+ freqs_cis: torch.Tensor,
136
+ ) -> Tuple[torch.Tensor, torch.Tensor]:
137
+ """
138
+ Apply rotary embeddings to input tensors using the given frequency tensor.
139
+
140
+ This function applies rotary embeddings to the given query 'xq' and key 'xk' tensors using the provided
141
+ frequency tensor 'freqs_cis'. The input tensors are reshaped as complex numbers, and the frequency tensor
142
+ is reshaped for broadcasting compatibility. The resulting tensors contain rotary embeddings and are
143
+ returned as real tensors.
144
+
145
+ Args:
146
+ xq (torch.Tensor): Query tensor to apply rotary embeddings.
147
+ xk (torch.Tensor): Key tensor to apply rotary embeddings.
148
+ freqs_cis (torch.Tensor): Precomputed frequency tensor for complex exponentials.
149
+
150
+ Returns:
151
+ Tuple[torch.Tensor, torch.Tensor]: Tuple of modified query tensor and key tensor with rotary embeddings.
152
+
153
+
154
+
155
+ """
156
+ xq_ = torch.view_as_complex(xq.float().reshape(*xq.shape[:-1], -1, 2))
157
+ xk_ = torch.view_as_complex(xk.float().reshape(*xk.shape[:-1], -1, 2))
158
+ freqs_cis = reshape_for_broadcast(freqs_cis, xq_)
159
+ xq_out = torch.view_as_real(xq_ * freqs_cis).flatten(3)
160
+ xk_out = torch.view_as_real(xk_ * freqs_cis).flatten(3)
161
+ return xq_out.type_as(xq), xk_out.type_as(xk)
162
+
163
+
164
+ def repeat_kv(x: torch.Tensor, n_rep: int) -> torch.Tensor:
165
+ """torch.repeat_interleave(x, dim=2, repeats=n_rep)"""
166
+ bs, slen, n_kv_heads, head_dim = x.shape
167
+ if n_rep == 1:
168
+ return x
169
+ return (
170
+ x[:, :, :, None, :]
171
+ .expand(bs, slen, n_kv_heads, n_rep, head_dim)
172
+ .reshape(bs, slen, n_kv_heads * n_rep, head_dim)
173
+ )
174
+
175
+
176
+ class Attention(nn.Module):
177
+ """Multi-head attention module."""
178
+ def __init__(self, args: ModelArgs):
179
+ """
180
+ Initialize the Attention module.
181
+
182
+ Args:
183
+ args (ModelArgs): Model configuration parameters.
184
+
185
+ Attributes:
186
+ n_kv_heads (int): Number of key and value heads.
187
+ n_local_heads (int): Number of local query heads.
188
+ n_local_kv_heads (int): Number of local key and value heads.
189
+ n_rep (int): Number of repetitions for local heads.
190
+ head_dim (int): Dimension size of each attention head.
191
+ wq (ColumnParallelLinear): Linear transformation for queries.
192
+ wk (ColumnParallelLinear): Linear transformation for keys.
193
+ wv (ColumnParallelLinear): Linear transformation for values.
194
+ wo (RowParallelLinear): Linear transformation for output.
195
+ cache_k (torch.Tensor): Cached keys for attention.
196
+ cache_v (torch.Tensor): Cached values for attention.
197
+
198
+ """
199
+ super().__init__()
200
+ self.n_kv_heads = args.n_heads if args.n_kv_heads is None else args.n_kv_heads
201
+ model_parallel_size = fs_init.get_model_parallel_world_size()
202
+ self.n_local_heads = args.n_heads // model_parallel_size
203
+ self.n_local_kv_heads = self.n_kv_heads // model_parallel_size
204
+ self.n_rep = self.n_local_heads // self.n_local_kv_heads
205
+ self.head_dim = args.dim // args.n_heads
206
+
207
+ self.wq = ColumnParallelLinear(
208
+ args.dim,
209
+ args.n_heads * self.head_dim,
210
+ bias=False,
211
+ gather_output=False,
212
+ init_method=lambda x: x,
213
+ )
214
+ self.wk = ColumnParallelLinear(
215
+ args.dim,
216
+ self.n_kv_heads * self.head_dim,
217
+ bias=False,
218
+ gather_output=False,
219
+ init_method=lambda x: x,
220
+ )
221
+ self.wv = ColumnParallelLinear(
222
+ args.dim,
223
+ self.n_kv_heads * self.head_dim,
224
+ bias=False,
225
+ gather_output=False,
226
+ init_method=lambda x: x,
227
+ )
228
+ self.wo = RowParallelLinear(
229
+ args.n_heads * self.head_dim,
230
+ args.dim,
231
+ bias=False,
232
+ input_is_parallel=True,
233
+ init_method=lambda x: x,
234
+ )
235
+
236
+ self.cache_k = torch.zeros(
237
+ (
238
+ args.max_batch_size,
239
+ args.max_seq_len,
240
+ self.n_local_kv_heads,
241
+ self.head_dim,
242
+ )
243
+ ).cuda()
244
+ self.cache_v = torch.zeros(
245
+ (
246
+ args.max_batch_size,
247
+ args.max_seq_len,
248
+ self.n_local_kv_heads,
249
+ self.head_dim,
250
+ )
251
+ ).cuda()
252
+
253
+ def forward(
254
+ self,
255
+ x: torch.Tensor,
256
+ start_pos: int,
257
+ freqs_cis: torch.Tensor,
258
+ mask: Optional[torch.Tensor],
259
+ beam: Optional[bool] = None,
260
+ n_beams: Optional[int] = None,
261
+ attention_change_ids: Optional[torch.Tensor] = None
262
+ ):
263
+ """
264
+ Forward pass of the attention module.
265
+
266
+ Args:
267
+ x (torch.Tensor): Input tensor.
268
+ start_pos (int): Starting position for caching.
269
+ freqs_cis (torch.Tensor): Precomputed frequency tensor.
270
+ mask (torch.Tensor, optional): Attention mask tensor.
271
+
272
+ Returns:
273
+ torch.Tensor: Output tensor after attention.
274
+
275
+ """
276
+ bsz, seqlen, _ = x.shape
277
+ _, max_seq_len, n_local_kv_heads, head_dim = self.cache_k.shape
278
+ # KV Cache updates for beam search
279
+ if beam:
280
+ # Extract used cache values
281
+ used_cache_k = self.cache_k[:bsz]
282
+ used_cache_v = self.cache_v[:bsz]
283
+ # Reshape to apply change ids
284
+ t_cache_k = used_cache_k.reshape(bsz // n_beams, n_beams, max_seq_len, n_local_kv_heads, head_dim)
285
+ t_cache_v = used_cache_v.reshape(bsz // n_beams, n_beams, max_seq_len, n_local_kv_heads, head_dim)
286
+ used_cache_k = torch.take_along_dim(t_cache_k, attention_change_ids.reshape(-1, n_beams, 1, 1, 1), 1)
287
+ used_cache_v = torch.take_along_dim(t_cache_v, attention_change_ids.reshape(-1, n_beams, 1, 1, 1), 1)
288
+ # Update cache
289
+ self.cache_k[:bsz] = used_cache_k.reshape(bsz, max_seq_len, n_local_kv_heads, head_dim)
290
+ self.cache_v[:bsz] = used_cache_v.reshape(bsz, max_seq_len, n_local_kv_heads, head_dim)
291
+
292
+ xq, xk, xv = self.wq(x), self.wk(x), self.wv(x)
293
+
294
+ xq = xq.view(bsz, seqlen, self.n_local_heads, self.head_dim)
295
+ xk = xk.view(bsz, seqlen, self.n_local_kv_heads, self.head_dim)
296
+ xv = xv.view(bsz, seqlen, self.n_local_kv_heads, self.head_dim)
297
+
298
+ xq, xk = apply_rotary_emb(xq, xk, freqs_cis=freqs_cis)
299
+
300
+ self.cache_k = self.cache_k.to(xq)
301
+ self.cache_v = self.cache_v.to(xq)
302
+
303
+ self.cache_k[:bsz, start_pos : start_pos + seqlen] = xk
304
+ self.cache_v[:bsz, start_pos : start_pos + seqlen] = xv
305
+
306
+ keys = self.cache_k[:bsz, : start_pos + seqlen]
307
+ values = self.cache_v[:bsz, : start_pos + seqlen]
308
+
309
+ # repeat k/v heads if n_kv_heads < n_heads
310
+ keys = repeat_kv(keys, self.n_rep) # (bs, seqlen, n_local_heads, head_dim)
311
+ values = repeat_kv(values, self.n_rep) # (bs, seqlen, n_local_heads, head_dim)
312
+
313
+ xq = xq.transpose(1, 2) # (bs, n_local_heads, seqlen, head_dim)
314
+ keys = keys.transpose(1, 2) # (bs, n_local_heads, seqlen, head_dim)
315
+ values = values.transpose(1, 2)
316
+ scores = torch.matmul(xq, keys.transpose(2, 3)) / math.sqrt(self.head_dim) # (bs, n_local_heads, seqlen, seqlen)
317
+ if mask is not None:
318
+ scores = scores + mask # (bs, n_local_heads, seqlen, seqlen)
319
+ scores = F.softmax(scores.float(), dim=-1).type_as(xq) # (bs, n_local_heads, seqlen, seqlen)
320
+ output = torch.matmul(scores, values) # (bs, n_local_heads, seqlen, head_dim)
321
+ output = output.transpose(1, 2).contiguous().view(bsz, seqlen, -1)
322
+ return self.wo(output)
323
+
324
+
325
+ class FeedForward(nn.Module):
326
+ def __init__(
327
+ self,
328
+ dim: int,
329
+ hidden_dim: int,
330
+ multiple_of: int,
331
+ ffn_dim_multiplier: Optional[float],
332
+ ):
333
+ """
334
+ Initialize the FeedForward module.
335
+
336
+ Args:
337
+ dim (int): Input dimension.
338
+ hidden_dim (int): Hidden dimension of the feedforward layer.
339
+ multiple_of (int): Value to ensure hidden dimension is a multiple of this value.
340
+ ffn_dim_multiplier (float, optional): Custom multiplier for hidden dimension. Defaults to None.
341
+
342
+ Attributes:
343
+ w1 (ColumnParallelLinear): Linear transformation for the first layer.
344
+ w2 (RowParallelLinear): Linear transformation for the second layer.
345
+ w3 (ColumnParallelLinear): Linear transformation for the third layer.
346
+
347
+ """
348
+ super().__init__()
349
+ hidden_dim = int(2 * hidden_dim / 3)
350
+ # custom dim factor multiplier
351
+ if ffn_dim_multiplier is not None:
352
+ hidden_dim = int(ffn_dim_multiplier * hidden_dim)
353
+ hidden_dim = multiple_of * ((hidden_dim + multiple_of - 1) // multiple_of)
354
+
355
+ self.w1 = ColumnParallelLinear(
356
+ dim, hidden_dim, bias=False, gather_output=False, init_method=lambda x: x
357
+ )
358
+ self.w2 = RowParallelLinear(
359
+ hidden_dim, dim, bias=False, input_is_parallel=True, init_method=lambda x: x
360
+ )
361
+ self.w3 = ColumnParallelLinear(
362
+ dim, hidden_dim, bias=False, gather_output=False, init_method=lambda x: x
363
+ )
364
+
365
+ def forward(self, x):
366
+ return self.w2(F.silu(self.w1(x)) * self.w3(x))
367
+
368
+
369
+ class TransformerBlock(nn.Module):
370
+ def __init__(self, layer_id: int, args: ModelArgs):
371
+ """
372
+ Initialize a TransformerBlock.
373
+
374
+ Args:
375
+ layer_id (int): Identifier for the layer.
376
+ args (ModelArgs): Model configuration parameters.
377
+
378
+ Attributes:
379
+ n_heads (int): Number of attention heads.
380
+ dim (int): Dimension size of the model.
381
+ head_dim (int): Dimension size of each attention head.
382
+ attention (Attention): Attention module.
383
+ feed_forward (FeedForward): FeedForward module.
384
+ layer_id (int): Identifier for the layer.
385
+ attention_norm (RMSNorm): Layer normalization for attention output.
386
+ ffn_norm (RMSNorm): Layer normalization for feedforward output.
387
+
388
+ """
389
+ super().__init__()
390
+ self.n_heads = args.n_heads
391
+ self.dim = args.dim
392
+ self.head_dim = args.dim // args.n_heads
393
+ self.attention = Attention(args)
394
+ self.feed_forward = FeedForward(
395
+ dim=args.dim,
396
+ hidden_dim=4 * args.dim,
397
+ multiple_of=args.multiple_of,
398
+ ffn_dim_multiplier=args.ffn_dim_multiplier,
399
+ )
400
+ self.layer_id = layer_id
401
+ self.attention_norm = RMSNorm(args.dim, eps=args.norm_eps)
402
+ self.ffn_norm = RMSNorm(args.dim, eps=args.norm_eps)
403
+
404
+ def forward(
405
+ self,
406
+ x: torch.Tensor,
407
+ start_pos: int,
408
+ freqs_cis: torch.Tensor,
409
+ mask: Optional[torch.Tensor],
410
+ beam: Optional[bool],
411
+ n_beams: Optional[int] = None,
412
+ attention_change_ids: Optional[torch.Tensor] = None
413
+ ):
414
+ """
415
+ Perform a forward pass through the TransformerBlock.
416
+
417
+ Args:
418
+ x (torch.Tensor): Input tensor.
419
+ start_pos (int): Starting position for attention caching.
420
+ freqs_cis (torch.Tensor): Precomputed cosine and sine frequencies.
421
+ mask (torch.Tensor, optional): Masking tensor for attention. Defaults to None.
422
+
423
+ Returns:
424
+ torch.Tensor: Output tensor after applying attention and feedforward layers.
425
+
426
+ """
427
+ if beam:
428
+ h = x + self.attention.forward(
429
+ self.attention_norm(x), start_pos, freqs_cis, mask, beam, n_beams, attention_change_ids
430
+ )
431
+ else:
432
+ h = x + self.attention.forward(
433
+ self.attention_norm(x), start_pos, freqs_cis, mask
434
+ )
435
+ out = h + self.feed_forward.forward(self.ffn_norm(h))
436
+ return out
437
+
438
+
439
+ class Transformer(nn.Module):
440
+ def __init__(self, params: ModelArgs):
441
+ """
442
+ Initialize a Transformer model.
443
+
444
+ Args:
445
+ params (ModelArgs): Model configuration parameters.
446
+
447
+ Attributes:
448
+ params (ModelArgs): Model configuration parameters.
449
+ vocab_size (int): Vocabulary size.
450
+ n_layers (int): Number of layers in the model.
451
+ tok_embeddings (ParallelEmbedding): Token embeddings.
452
+ layers (torch.nn.ModuleList): List of Transformer blocks.
453
+ norm (RMSNorm): Layer normalization for the model output.
454
+ output (ColumnParallelLinear): Linear layer for final output.
455
+ freqs_cis (torch.Tensor): Precomputed cosine and sine frequencies.
456
+
457
+ """
458
+ super().__init__()
459
+ self.params = params
460
+ self.vocab_size = params.vocab_size
461
+ self.n_layers = params.n_layers
462
+
463
+ self.tok_embeddings = ParallelEmbedding(
464
+ params.vocab_size, params.dim, init_method=lambda x: x
465
+ )
466
+
467
+ self.layers = torch.nn.ModuleList()
468
+ for layer_id in range(params.n_layers):
469
+ self.layers.append(TransformerBlock(layer_id, params))
470
+
471
+ self.norm = RMSNorm(params.dim, eps=params.norm_eps)
472
+ self.output = ColumnParallelLinear(
473
+ params.dim, params.vocab_size, bias=False, init_method=lambda x: x
474
+ )
475
+
476
+ self.freqs_cis = precompute_freqs_cis(
477
+ # Note that self.params.max_seq_len is multiplied by 2 because the token limit for the Llama 2 generation of models is 4096.
478
+ # Adding this multiplier instead of using 4096 directly allows for dynamism of token lengths while training or fine-tuning.
479
+ self.params.dim // self.params.n_heads, self.params.max_seq_len * 2
480
+ )
481
+
482
+
483
+ @torch.inference_mode()
484
+ def forward(self,
485
+ tokens: torch.Tensor,
486
+ start_pos: int,
487
+ beam: bool,
488
+ n_beams: Optional[int] = None,
489
+ attention_change_ids: Optional[torch.Tensor] = None,
490
+ verbose: Optional[bool] = False):
491
+ """
492
+ Perform a forward pass through the Transformer model.
493
+
494
+ Args:
495
+ tokens (torch.Tensor): Input token indices.
496
+ start_pos (int): Starting position for attention caching.
497
+ verbose (bool): Whether to return intermediate hidden layer states
498
+
499
+ Returns:
500
+ torch.Tensor or (torch.Tensor, Dict): output logits after applying the Transformer model.
501
+
502
+ """
503
+ ### ANALYSIS CODE ###
504
+ if verbose:
505
+ states = {"layers": [], "tokens": tokens}
506
+ #
507
+
508
+ _bsz, seqlen = tokens.shape
509
+ h = self.tok_embeddings(tokens)
510
+ self.freqs_cis = self.freqs_cis.to(h.device)
511
+ freqs_cis = self.freqs_cis[start_pos : start_pos + seqlen]
512
+
513
+ ### ANALYSIS CODE ###
514
+ if verbose:
515
+ states["layers"].append(h)
516
+ #
517
+
518
+ mask = None
519
+ if seqlen > 1:
520
+ mask = torch.full(
521
+ (1, 1, seqlen, seqlen), float("-inf"), device=tokens.device
522
+ )
523
+ mask = torch.triu(mask, diagonal=start_pos + 1).type_as(h)
524
+
525
+ for layer in self.layers:
526
+ if not beam:
527
+ h = layer(h, start_pos, freqs_cis, mask, beam)
528
+ else:
529
+ h = layer(h, start_pos, freqs_cis, mask, beam, n_beams, attention_change_ids)
530
+ ### ANALYSIS CODE ###
531
+ if verbose:
532
+ states["layers"].append(h)
533
+ #
534
+ h = self.norm(h)
535
+ # if want differences, at end, subtract differences from [-1] position of embedding vectors each iteration
536
+
537
+ ### ANALYSIS CODE ###
538
+ if verbose:
539
+ states["layers"].append(h)
540
+ #
541
+
542
+ output = self.output(h).float()
543
+
544
+ if verbose:
545
+ return output, states
546
+ else:
547
+ return output
548
+
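# Illustrative sketch (not part of the repository): the causal mask built in the
# forward pass above is an upper-triangular matrix of -inf, shifted by start_pos,
# so each position can attend to the cache and to itself but not to future tokens.
# The sizes below are hypothetical example values.
import torch

seqlen, start_pos = 4, 0
mask = torch.full((1, 1, seqlen, seqlen), float("-inf"))
mask = torch.triu(mask, diagonal=start_pos + 1)
print(mask[0, 0])
# tensor([[0., -inf, -inf, -inf],
#         [0., 0., -inf, -inf],
#         [0., 0., 0., -inf],
#         [0., 0., 0., 0.]])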
superposed/llama/superpose.py ADDED
@@ -0,0 +1,328 @@
1
+ # Implementation loosely based on https://github.com/tensorflow/tensor2tensor/blob/bafdc1b67730430d38d6ab802cbd51f9d053ba2e/tensor2tensor/utils/beam_search.py#L554
2
+ import requests
3
+ import time
4
+ from datetime import datetime, timedelta
5
+ from typing import Optional, Literal
6
+
7
+ import torch
8
+ import torch.nn as nn
9
+ from transformers import LlamaTokenizer
10
+
11
+ from superposed.llama.utils import *
12
+ from superposed.ngrams.ngram_models import NGram
13
+
14
+ INF = 1. * 1e7
15
+
16
+ # TODO: test by scaling the number of beams and verifying the results
17
+ class Superpose(nn.Module):
18
+ def __init__(self,
19
+ initial_tokens,
20
+ tokenizer,
21
+ vocab_size,
22
+ smoothing: Optional[Literal["geom", "all"]] = None,
23
+ alpha = None,
24
+ verbose = False,
25
+ i_weights = None,
26
+ i_length = None,
27
+ ngrams = None,
28
+ sample_beams = False,
29
+ sample_tokens = False,
30
+ get_time = False,
31
+ penalty = 200): # default no effect
32
+ """
33
+ Initialize a beam search class.
34
+
35
+ Args:
36
+ initial_tokens (torch.Tensor): Initial tokens (n_prompts, n_drafts, seqlen)
37
38
+ tokenizer (Tokenizer): Llama tokenizer
39
+ vocab_size (int): Total vocab size
40
+ smoothing (str): Smoothing method ("geom" for default, "all" for only ngram, None for no ngram)
41
+ ngrams (Tuple): Ngram models, ordered from bigram upward
42
+ alpha (float): Alpha parameter
43
+ verbose (bool): Whether to print debugging information
+ i_weights (List[float]): Ngram interpolation weights
+ i_length (List[int]): Ngram orders to interpolate (1 for bigram, 2 for trigram, etc.)
+ penalty (float): Penalty applied to drafts with no ngram interpolation
44
+ """
45
+ super().__init__()
46
+ # primary parameters
47
+ self.n_prompts, self.n_drafts, _ = initial_tokens.shape
48
+ self.tokenizer = tokenizer
49
+ self.vocab_size = vocab_size
50
+ self.alive_seq = initial_tokens
51
+ self.fin_seq = initial_tokens
52
+ self.smoothing = smoothing
53
+ self.alive_log_probs = torch.zeros(self.n_prompts, self.n_drafts)
54
+ self.fin_log_probs = torch.full((self.n_prompts, self.n_drafts), float("-inf"))
55
+ self.alpha = alpha
56
+ self.verbose = verbose
57
+ self.penalty = penalty
58
+ # devices
59
+ self.cpu = torch.device('cpu')
60
+ self.gpu = torch.device('cuda')
61
+ # Interpolation length and weights
62
+ self.interpolation_weights = i_weights
63
+ self.i_length = i_length
64
+ # N-grams
65
+ self.bigram = ngrams[0] if len(ngrams) >= 1 else None
66
+ self.trigram = ngrams[1] if len(ngrams) >= 2 else None
67
+ self.fourgram = ngrams[2] if len(ngrams) >= 3 else None
68
+ self.fivegram = ngrams[3] if len(ngrams) >= 4 else None
69
+ self.sixgram = ngrams[4] if len(ngrams) >= 5 else None
70
+ self.sevengram = ngrams[5] if len(ngrams) >= 6 else None
71
+ # Timing
72
+ self.get_time = get_time
73
+ self.lookup_time = None
74
+
75
+ def forward(self, probs, still_prompt, is_first, cur_pos, n_token_sample):
76
+ """
77
+ Apply beam decoding to update generations.
78
+
79
+ Args:
80
+ probs (torch.Tensor): Next token probability distribution
81
+ still_prompt (torch.Tensor): Flags of prompts that should not generate yet (n_prompts, )
82
+ is_first (torch.Tensor): Flags of prompts that are on their first generation (n_prompts, )
83
+ cur_pos (int): Current generation position
84
+ n_token_sample (int): Number of tokens from model distribution to use
85
+
86
+ Return:
87
+ if standard beam search:
88
+ attention_change_ids (torch.Tensor): New indices in kv cache (n_prompts, n_drafts)
89
+ if mixed:
90
+ token_weights (torch.Tensor): Mixing weights (n_prompts, vocab_size)
91
+ """
92
+ # Adjust input probabilities
93
+ probs = self.get_top_k(probs, 32000, n_token_sample)
94
+ reshaped_probs = probs.reshape(self.n_prompts, 1, -1)
95
+ reshaped_probs = reshaped_probs.repeat(1, self.n_drafts, 1)
96
+ # Ngram smoothing
97
+ if self.smoothing is not None:
98
+ if self.smoothing == "geom":
99
+ ngram_probs = self.ngram_probs(self.alive_seq, cur_pos, probs=probs)
100
+ # Make mask and normalize
101
+ prob_mask = reshaped_probs != 0
102
+ ngram_probs *= prob_mask
103
+ # Calculate logprobs and interpolate distributions
104
+ llm_log_probs = torch.log(reshaped_probs)
105
+ ngram_log_probs = torch.log(ngram_probs)
106
+ log_probs = (1 - self.alpha) * llm_log_probs + self.alpha * ngram_log_probs
107
+ # Apply penalty to drafts where no interpolation occurred
108
+ is_all_inf = (log_probs != float("-inf")).sum(dim=-1, keepdims=True) == 0
109
+ log_probs = torch.where(is_all_inf, (1 - self.alpha) * llm_log_probs - self.penalty, log_probs)
110
+ elif self.smoothing == "all":
111
+ ngram_probs = self.ngram_probs(self.alive_seq, cur_pos, probs=None)
112
+ log_probs = torch.log(ngram_probs)
113
+ else:
114
+ log_probs = torch.log(reshaped_probs)
115
+ curr_log_probs = self.alive_log_probs.unsqueeze(dim=2) + log_probs # [n_prompts, n_drafts, vocab_size]
116
+ # Warning if nan
117
+ if (torch.any(torch.isnan(curr_log_probs)).item()):
118
+ raise RuntimeWarning("nan in sequence log probs")
119
+ # Potential Sequences
120
+ flat_curr_log_probs = curr_log_probs.reshape(-1, self.vocab_size*self.n_drafts)
121
+ topk_log_probs, topk_idx = torch.topk(flat_curr_log_probs, 2 * self.n_drafts, dim=-1)
122
+ topk_beam_id = topk_idx // self.vocab_size # [n_prompts, 2 * n_drafts]
123
+ topk_idx = topk_idx % self.vocab_size # [n_prompts, 2 * n_drafts]
124
+ # First timestep uses top-k next tokens
125
+ is_first_idx = is_first.nonzero(as_tuple=True)[0]
126
+ if len(is_first_idx) != 0:
127
+ first_time_log_probs = log_probs[is_first_idx][:, 0, :].squeeze(dim=1)
128
+ first_time_log_probs, first_time_topk_idx = torch.topk(first_time_log_probs, 2 * self.n_drafts, dim=1)
129
+ topk_idx[is_first_idx] = first_time_topk_idx
130
+ topk_log_probs[is_first_idx] = self.alive_log_probs[is_first_idx, 0].unsqueeze(dim=1) + first_time_log_probs
131
+ # New sequences
132
+ topk_seq = torch.take_along_dim(self.alive_seq, topk_beam_id.unsqueeze(2), dim=1) # [n_prompts, 2 * n_drafts, vocab_size]
133
+ topk_seq[:, :, cur_pos] = topk_idx
134
+ topk_finished = topk_idx == self.tokenizer.eos_id
135
+ # Only update sequences for those that have begun generating
136
+ new_alive_seq, new_alive_log_probs = self.grow_alive(topk_seq, topk_log_probs, topk_finished)
137
+ new_fin_seq, new_fin_log_probs = self.grow_fin(topk_seq, topk_log_probs, topk_finished)
138
+ still_prompt_probs = still_prompt.reshape(-1, 1)
139
+ still_prompt_seqs = still_prompt.reshape(-1, 1, 1)
140
+ self.alive_seq = torch.where(still_prompt_seqs, self.alive_seq, new_alive_seq)
141
+ self.alive_log_probs = torch.where(still_prompt_probs, self.alive_log_probs, new_alive_log_probs)
142
+ self.fin_seq = torch.where(still_prompt_seqs, self.fin_seq, new_fin_seq)
143
+ self.fin_log_probs = torch.where(still_prompt_probs, self.fin_log_probs, new_fin_log_probs)
144
+ # Create superposition matrix and return it
145
+ topk_idx = self.alive_seq[:, :, cur_pos].reshape(self.n_prompts, -1)
146
+ token_weights = self.superposition_matrix(topk_idx)
147
+ return token_weights
148
+
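# Illustrative sketch (not part of the repository): the "geom" smoothing above
# interpolates model and ngram distributions in log space,
# log p = (1 - alpha) * log p_llm + alpha * log p_ngram,
# i.e. a weighted geometric mean of the two probabilities up to normalization.
# The numbers below are hypothetical example values.
import torch

alpha = 0.3
llm_log_probs = torch.log(torch.tensor([0.5, 0.4, 0.1]))
ngram_log_probs = torch.log(torch.tensor([0.2, 0.7, 0.1]))
mixed_log_probs = (1 - alpha) * llm_log_probs + alpha * ngram_log_probs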
149
+ def grow_alive(self, topk_seq, topk_log_probs, topk_finished):
150
+ """
151
+ Extend running generations.
152
+ Args:
153
+ topk_seq (torch.Tensor): Top k sequences (n_prompts, 2 * n_drafts, vocab_size)
154
+ topk_log_probs (torch.Tensor): Log probabilities (n_prompts, 2 * n_drafts)
155
+ topk_finished (torch.Tensor): Whether a sequence is finished (n_prompts, 2 * n_drafts)
156
+ Returns:
157
+ new_alive_seq, new_alive_log_probs
158
+ """
159
+ topk_log_probs = topk_log_probs + topk_finished * -INF
160
+ new_alive_log_probs, new_alive_idx = torch.topk(topk_log_probs, self.n_drafts, dim=1)
161
+ new_alive_seq = torch.take_along_dim(topk_seq, new_alive_idx.unsqueeze(2), dim=1)
162
+ return new_alive_seq, new_alive_log_probs
163
+
164
+ def grow_fin(self, topk_seq, topk_log_probs, topk_finished):
165
+ """
166
+ Update stopped generations.
167
+ Args:
168
+ topk_seq (torch.Tensor): Top k sequences (n_prompts, 2 * n_drafts, vocab_size)
169
+ topk_log_probs (torch.Tensor): Log probabilities (n_prompts, 2 * n_drafts)
170
+ topk_finished (torch.Tensor): Whether a sequence is finished (n_prompts, 2 * n_drafts)
171
+
172
+ Returns:
173
+ new_fin_seq, new_fin_log_probs
174
+ """
175
+ topk_log_probs = topk_log_probs + ~topk_finished * -INF
176
+ new_fin_seq = torch.cat([self.fin_seq, topk_seq], dim=1)
177
+ new_fin_log_probs = torch.cat([self.fin_log_probs, topk_log_probs], dim=1)
178
+ new_fin_log_probs, new_fin_idx = torch.topk(new_fin_log_probs, self.n_drafts, dim=1)
179
+ new_fin_seq = torch.take_along_dim(new_fin_seq, new_fin_idx.unsqueeze(2), dim=1)
180
+ return new_fin_seq, new_fin_log_probs
181
+
182
+ def get_top_k(self, probs, m, k):
183
+ """
184
+ Zero out all but top-k tokens in a probability distribution.
185
+ Args:
186
+ probs (torch.Tensor): Probability distribution tensor.
187
+ m (int): Number of tokens to consider (only relevant when sampling).
188
+ k (int): Number of tokens to sample/keep.
189
+ Returns:
190
+ torch.Tensor: New probability distribution based on renormalized probabilities.
191
+ """
192
+ n_prompts, _ = probs.shape
193
+ probs_sort, probs_idx = torch.sort(probs, dim=-1, descending=True)
194
+ top_k_mask = torch.arange(probs.shape[-1])
195
+ top_k_mask = top_k_mask.expand(probs.shape[0], -1)
196
+ top_k_mask = top_k_mask >= m # True for positions past the first m elements
197
+ probs_sort[top_k_mask] = 0.0 # Zero wherever mask = 1
198
+ probs_sort.div_(probs_sort.sum(dim=-1, keepdim=True))
199
+ next_token = torch.gather(probs_idx, -1, torch.topk(probs_sort, k, dim=-1)[1])
200
+ # Set all other probs to 0
201
+ new_probs_map = torch.zeros(probs.shape).bool()
202
+ new_probs_map[torch.repeat_interleave(torch.arange(n_prompts), k), torch.flatten(next_token)] = True
203
+ new_probs = torch.where(new_probs_map, probs, 0)
204
+ # Renormalize
205
+ new_probs.div_(new_probs.sum(dim=-1, keepdim=True))
206
+ return new_probs
207
+
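# Illustrative sketch (not part of the repository): for the non-sampling case
# (m at least the vocabulary size), get_top_k amounts to keeping the k most
# probable tokens per row and renormalizing. Example numbers only.
import torch

probs = torch.tensor([[0.5, 0.3, 0.15, 0.05]])
k = 2
topk_vals, topk_idx = torch.topk(probs, k, dim=-1)
filtered = torch.zeros_like(probs).scatter_(-1, topk_idx, topk_vals)
filtered = filtered / filtered.sum(dim=-1, keepdim=True)
# filtered -> tensor([[0.6250, 0.3750, 0.0000, 0.0000]])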
208
+ def superposition_matrix(self, tokens):
209
+ """
210
+ Create superposition matrix based on provided tokens.
211
+ Args:
212
+ tokens (torch.Tensor): Tokens to mix (n_prompts, n_drafts)
213
+ Returns:
214
+ Superposition matrix (n_prompts, vocab_size)
215
+ """
216
+ # Create superposition matrix
217
+ mixing_matrix = torch.zeros(self.n_prompts, self.vocab_size)
218
+ # Convert draft log probs to probabilities
219
+ weightings = log_prob_to_prob(self.alive_log_probs)
220
+ # Update probabilities in superposition matrix with draft probabilities
221
+ for p_idx in range(self.n_prompts):
222
+ for d_idx in range(self.n_drafts):
223
+ tok_idx = tokens[p_idx][d_idx]
224
+ mixing_matrix[p_idx][tok_idx] += weightings[p_idx][d_idx]
225
+ # Renormalize
226
+ mixing_matrix.div_(mixing_matrix.sum(dim=-1, keepdims=True))
227
+ return mixing_matrix
228
+
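# Illustrative sketch (not part of the repository): the nested Python loop above
# can also be expressed with scatter_add_, accumulating each draft's weight onto
# the id of the token it proposes; sizes below are hypothetical.
import torch

n_prompts, n_drafts, vocab_size = 2, 3, 10
tokens = torch.randint(0, vocab_size, (n_prompts, n_drafts))
weightings = torch.rand(n_prompts, n_drafts)
mixing_matrix = torch.zeros(n_prompts, vocab_size)
mixing_matrix.scatter_add_(1, tokens, weightings)
mixing_matrix.div_(mixing_matrix.sum(dim=-1, keepdim=True))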
229
+ def ngram_probs(self, alive_seq, cur_pos, probs):
230
+ """
231
+ Calculate and return next token distribution using ngram models.
232
+ Args:
233
+ alive_seq (torch.Tensor): Current drafts (n_prompts, n_drafts, seqlen)
234
+ cur_pos (int): Current timestep
235
+ probs (torch.Tensor): Current next probability distribution from model (n_prompts, vocab_size).
236
+ As described in the paper, only tokens w/nonzero probability in `prob` are considered for the
237
+ ngram distribution. However, passing in `None` as `probs` will consider all tokens.
238
+ Returns:
239
+ Next token distribution for each draft (n_prompts, n_drafts, vocab_size)
240
+ """
241
+ if self.get_time:
242
+ # Start timer
243
+ start_time = datetime.now()
244
+ # Create distribution matrix
245
+ next_token_probs = torch.zeros(self.n_prompts, self.n_drafts, 32000)
246
+ if probs is not None:
247
+ # Loop over all prefixes
248
+ for p_idx in range(len(alive_seq)):
249
+ # List of possible tokens for the prefix
250
+ nz = torch.nonzero(probs[p_idx, :], as_tuple=True)[0].tolist()
251
+ # Generate next token distribution
252
+ for draft_idx in range(self.n_drafts):
253
+ i_mask = torch.sum(torch.tensor(self.i_length) <= cur_pos)
254
+ new_i_weights = self.interpolation_weights[:i_mask]
255
+ new_i_length = self.i_length[:i_mask]
256
+ # For each next token
257
+ for nt in nz:
258
+ # Calculate probability using ngram interpolation
259
+ for i, weight in zip(new_i_length, new_i_weights):
260
+ if cur_pos - i >= 0:
261
+ key = tuple(alive_seq[p_idx, draft_idx, cur_pos-i:cur_pos].tolist())
262
+ if i == 1:
263
+ prob = self.bigram.prob(key, nt)
264
+ elif i == 2:
265
+ prob = self.trigram.prob(key, nt)
266
+ elif i == 3:
267
+ prob = self.fourgram.prob(key, nt)
268
+ elif i == 4:
269
+ prob = self.fivegram.prob(key, nt)
270
+ elif i == 5:
271
+ prob = self.sixgram.prob(key, nt)
272
+ elif i == 6:
273
+ prob = self.sevengram.prob(key, nt)
274
+ if prob >= 0:
275
+ next_token_probs[p_idx, draft_idx, nt] += weight * prob
276
+ else:
277
+ for p_idx in range(len(alive_seq)):
278
+ for draft_idx in range(self.n_drafts):
279
+ i_mask = torch.sum(torch.tensor(self.i_length) <= cur_pos)
280
+ new_i_weights = self.interpolation_weights[:i_mask]
281
+ new_i_length = self.i_length[:i_mask]
282
+ for i, weight in zip(new_i_length, new_i_weights):
283
+ if cur_pos - i >= 0:
284
+ key = tuple(alive_seq[p_idx, draft_idx, cur_pos-i:cur_pos].tolist())
285
+ if i == 1:
286
+ ntd = self.bigram.ntd(key)
287
+ elif i == 2:
288
+ ntd = self.trigram.ntd(key)
289
+ elif i == 3:
290
+ ntd = self.fourgram.ntd(key)
291
+ elif i == 4:
292
+ ntd = self.fivegram.ntd(key)
293
+ elif i == 5:
294
+ ntd = self.sixgram.ntd(key)
295
+ elif i == 6:
296
+ ntd = self.sevengram.ntd(key)
297
+ if ntd is not None:
298
+ next_token_probs[p_idx, draft_idx, :] += weight * ntd
299
+ if self.get_time:
300
+ total_time = datetime.now() - start_time
301
+ self.lookup_time = total_time if self.lookup_time is None else self.lookup_time + total_time
302
+ return next_token_probs
303
+
304
+ def return_results(self, prompt_len=None):
305
+ """
306
+ Return generations and perplexities
307
+
308
+ Args:
309
+ prompt_len (int): Length of prompt in tokens. If is None, then ppl is not calculated.
310
+ Returns:
311
+ (self.alive_seq, alive_ppl), (self.fin_seq, fin_ppl)
312
+ OR
313
+ (self.alive_seq, alive_ppl), (self.fin_seq, fin_ppl), self.lookup_time
314
+ """
315
+ # PPL
316
+ alive_ppl = 0
317
+ fin_ppl = 0
318
+ if prompt_len is not None:
319
+ alive_ppl = torch.exp(self.alive_log_probs / (-1 * (self.alive_seq.size(dim=-1)-prompt_len)))
320
+ # Fin ppl
321
+ fin_seq_lengths = (self.fin_seq != self.tokenizer.pad_id).sum(dim=-1)
322
+ fin_ppl = torch.exp(self.fin_log_probs / (-1 * (fin_seq_lengths - prompt_len)))
323
+ fin_ppl = torch.where(fin_ppl == 0, torch.full_like(fin_ppl, float("inf")), fin_ppl)
324
+ # Return results, including ngram lookup time if requested
325
+ if not self.get_time:
326
+ return (self.alive_seq.to(torch.long), alive_ppl), (self.fin_seq.to(torch.long), fin_ppl)
327
+ else:
328
+ return (self.alive_seq.to(torch.long), alive_ppl), (self.fin_seq.to(torch.long), fin_ppl), self.lookup_time
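# Illustrative sketch (not part of the repository): the perplexities returned by
# return_results are exp(-log_prob / generated_length). Example numbers only.
import math

seq_log_prob = -12.0   # summed log-probability of the generated continuation
n_generated = 8        # tokens generated after the prompt
ppl = math.exp(-seq_log_prob / n_generated)   # ~4.48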
superposed/llama/superposed_generation.py ADDED
@@ -0,0 +1,198 @@
1
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
2
+ # This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
3
+
4
+ import json
5
+ import os
6
+ import sys
7
+ import time
8
+ from pathlib import Path
9
+ from typing import List, Optional
10
+
11
+ import torch
12
+ import torch.nn.functional as F
13
+ from fairscale.nn.model_parallel.initialize import (
14
+ get_model_parallel_rank,
15
+ initialize_model_parallel,
16
+ model_parallel_is_initialized,
17
+ )
18
+
19
+ from superposed.llama.model import ModelArgs
20
+ from superposed.llama.superposed_model import SuperposedTransformer
21
+ from superposed.llama.tokenizer import Tokenizer
22
+ from superposed.llama.superpose import Superpose
23
+ from superposed.llama.utils import *
24
+ from superposed.ngrams.ngram_models import make_models
25
+
26
+ class SuperposedLlama:
27
+ @staticmethod
28
+ def build(
29
+ ckpt_dir: str,
30
+ tokenizer_path: str,
31
+ max_seq_len: int,
32
+ max_batch_size: int,
33
+ device = None,
34
+ model_parallel_size: Optional[int] = None,
35
+ seed: int = 1,
36
+ ):
37
+ if not torch.distributed.is_initialized():
38
+ torch.distributed.init_process_group("nccl")
39
+ if not model_parallel_is_initialized():
40
+ if model_parallel_size is None:
41
+ model_parallel_size = int(os.environ.get("WORLD_SIZE", 1))
42
+ initialize_model_parallel(model_parallel_size)
43
+
44
+ local_rank = int(os.environ.get("LOCAL_RANK", 0))
45
+ if device is None:
46
+ torch.cuda.set_device(local_rank)
47
+ device = torch.cuda.current_device()
48
+ torch.manual_seed(seed)
49
+
50
+ if local_rank > 0:
51
+ sys.stdout = open(os.devnull, "w")
52
+
53
+ start_time = time.time()
54
+ checkpoints = sorted(Path(ckpt_dir).glob("*.pth"))
55
+ assert len(checkpoints) > 0, f"no checkpoint files found in {ckpt_dir}"
56
+ assert model_parallel_size == len(
57
+ checkpoints
58
+ ), f"Loading a checkpoint for MP={len(checkpoints)} but world size is {model_parallel_size}"
59
+ ckpt_path = checkpoints[get_model_parallel_rank()]
60
+ checkpoint = torch.load(ckpt_path, map_location="cpu")
61
+ with open(Path(ckpt_dir) / "params.json", "r") as f:
62
+ params = json.loads(f.read())
63
+
64
+ model_args: ModelArgs = ModelArgs(
65
+ max_seq_len=max_seq_len,
66
+ max_batch_size=max_batch_size,
67
+ **params,
68
+ )
69
+ tokenizer = Tokenizer(model_path=tokenizer_path)
70
+ model_args.vocab_size = tokenizer.n_words
71
+ torch.set_default_tensor_type(torch.cuda.HalfTensor)
72
+ # Set up superposed decoding
73
+ model = SuperposedTransformer(model_args)
74
+ model.load_state_dict(checkpoint, strict=False)
75
+ print(f"Loaded in {time.time() - start_time:.2f} seconds")
76
+ return SuperposedLlama(model, tokenizer, device)
77
+
78
+ def __init__(self, model: SuperposedTransformer, tokenizer: Tokenizer, device):
79
+ print(device)
80
+ self.model = model.to(device).eval()
81
+ self.tokenizer = tokenizer
82
+ self.device = device
83
+
84
+ @torch.inference_mode()
85
+ def sup_generate(
86
+ self,
87
+ prompt_tokens: List[List[int]],
88
+ smoothing,
89
+ max_gen_len: int,
90
+ n_token_sample: int,
91
+ alpha: float, # weight on ngram probs
92
+ temp: float,
93
+ n_drafts: int = 1, # number of beams
94
+ verbose: bool = False,
95
+ i_weights = None,
96
+ i_length = None,
97
+ ngrams = None,
98
+ get_time: bool = False,
99
+ penalty = 200
100
+ ):
101
+ """
102
+ Run multi-sequence generation using superposed embeddings.
103
+ Args:
104
+ prompt_tokens (List[List[int]]): Initial tokenized prompts
105
+ max_gen_len (int): Maximum numbers of tokens to generate
106
+ alpha (float): Alpha value
107
+ temp (float): Temperature
108
+ n_drafts (int): Number of drafts
109
+ verbose (bool): Whether to save intermediate embeddings for analysis
110
+ bsz (int): Batch size (default = 16)
111
+ i_weights (List[float]): Ngram interpolation weights
112
+ i_length (List[int]): Ngram models to interpolate (1 for bigram, 2 for trigram, etc.)
113
+ ngrams (Tuple): Ngram models
114
+ get_time (bool): Return information on time spent doing Ngram lookup
115
+ penalty (float): Penalty on uninterpolated drafts
116
+ Returns:
117
+ (alive_seq, alive_ppl), (fin_seq, fin_ppl): Tuple of (n_prompts, n_drafts, seqlen),
118
+ (n_prompts, n_drafts) for sequences still generating and sequences that have finished.
119
+ """
120
+ # Check batch size and prompt lengths
121
+ params = self.model.params
122
+ bsz = len(prompt_tokens)
123
+ assert bsz <= params.max_batch_size, (bsz, params.max_batch_size)
124
+
125
+ min_prompt_len = min(len(t) for t in prompt_tokens)
126
+ max_prompt_len = max(len(t) for t in prompt_tokens)
127
+ prompt_len = min_prompt_len
128
+ assert max_prompt_len <= params.max_seq_len
129
+ total_len = min(params.max_seq_len, max_gen_len + max_prompt_len)
130
+ pad_id = self.tokenizer.pad_id
131
+
132
+ # Initialize token tensor and pad where necessary
133
+ tokens = torch.full((bsz, total_len), pad_id, dtype=torch.long, device=self.device)
134
+ for k, t in enumerate(prompt_tokens):
135
+ tokens[k, :len(t)] = torch.tensor(t, dtype=torch.long, device=self.device)
136
+
137
+ # If no generation is possible
138
+ if min_prompt_len == total_len:
139
+ raise RuntimeError("no generation possible")
140
+
141
+ # Initialize decoding object
142
+ initial_tokens = tokens.unsqueeze(1).repeat(1, n_drafts, 1)
143
+ superpose = Superpose(initial_tokens,
144
+ tokenizer=self.tokenizer,
145
+ vocab_size=params.vocab_size,
146
+ smoothing=smoothing,
147
+ alpha=alpha,
148
+ i_weights=i_weights,
149
+ i_length=i_length,
150
+ ngrams=ngrams,
151
+ get_time=get_time,
152
+ penalty=penalty)
153
+ unseen_first = torch.ones(bsz)
154
+ # Superposition matrix
155
+ token_weights = torch.zeros(bsz, self.model.vocab_size)
156
+ if verbose:
157
+ state_list = []
158
+ prev_pos = 0
159
+ # Begin inference
160
+ for cur_pos in range(min_prompt_len, total_len):
161
+ input_text_mask = tokens != pad_id
162
+ # Take model step
163
+ if cur_pos == min_prompt_len:
164
+ token_weights = None
165
+ logits = self.model.forward(tokens[:, prev_pos:cur_pos],
166
+ start_pos=prev_pos,
167
+ token_weights=token_weights,
168
+ verbose=verbose)
169
+ if verbose:
170
+ logits, states = logits
171
+ # Softmax
172
+ if temp > 0:
173
+ probs = torch.softmax(logits[:, -1] / temp, dim=-1)
174
+ else:
175
+ raise RuntimeError("Temperature must be greater than 0 while mixing")
176
+ if verbose:
177
+ states["end_probs"] = probs
178
+ state_list.append(states)
179
+ # Flag prompts on first generation
180
+ is_first = torch.mul(tokens[:, cur_pos] == pad_id, unseen_first)
181
+ unseen_first[is_first.nonzero(as_tuple=True)[0]] = 0
182
+ # Flag prompts not yet generating
183
+ still_prompt = input_text_mask[:, cur_pos]
184
+ # Superposition pass
185
+ token_weights = superpose(probs, still_prompt, is_first, cur_pos, n_token_sample)
186
+ # Do not superpose for prompts not yet generating
187
+ keep_idx = input_text_mask[:, cur_pos].ravel().nonzero()
188
+ keep_token_weights = torch.zeros_like(token_weights)
189
+ keep_token_weights[keep_idx, tokens[keep_idx, cur_pos]] = 1
190
+ token_weights = torch.where(input_text_mask[:, cur_pos].unsqueeze(1).expand(-1, self.model.vocab_size),
191
+ keep_token_weights, token_weights)
192
+ prev_pos = cur_pos
193
+ results = superpose.return_results(prompt_len)
194
+ if verbose:
195
+ torch.save(state_list, "../embeddings.pt")
196
+ return results
197
+ else:
198
+ return results
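# Illustrative usage sketch (not part of the repository): paths, sizes, and
# hyperparameters below are hypothetical, and build() assumes the distributed /
# model-parallel environment it initializes is available.
from superposed.llama.superposed_generation import SuperposedLlama
from superposed.llama.tokenizer import Tokenizer

model = SuperposedLlama.build(
    ckpt_dir="weights/llama-2-7b",             # hypothetical checkpoint dir
    tokenizer_path="weights/tokenizer.model",  # hypothetical tokenizer path
    max_seq_len=512,
    max_batch_size=4,
)
tokenizer = Tokenizer(model_path="weights/tokenizer.model")
prompts = [tokenizer.encode("The capital of France is", bos=True, eos=False)]
(alive_seq, alive_ppl), (fin_seq, fin_ppl) = model.sup_generate(
    prompt_tokens=prompts,
    smoothing=None,      # no ngram interpolation in this sketch
    max_gen_len=16,
    n_token_sample=9,
    alpha=0.0,
    temp=1.0,
    n_drafts=3,
    ngrams=[],           # no ngram models needed when smoothing is None
)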
superposed/llama/superposed_model.py ADDED
@@ -0,0 +1,515 @@
1
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
2
+ # This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
3
+
4
+ import math
5
+ from dataclasses import dataclass
6
+ from typing import Optional, Tuple
7
+
8
+ import fairscale.nn.model_parallel.initialize as fs_init
9
+ import torch
10
+ import torch.nn.functional as F
11
+ from fairscale.nn.model_parallel.layers import (
12
+ ColumnParallelLinear,
13
+ ParallelEmbedding,
14
+ RowParallelLinear,
15
+ )
16
+ from torch import nn
17
+
18
+
19
+ @dataclass
20
+ class ModelArgs:
21
+ dim: int = 4096
22
+ n_layers: int = 32
23
+ n_heads: int = 32
24
+ n_kv_heads: Optional[int] = None
25
+ vocab_size: int = -1 # defined later by tokenizer
26
+ multiple_of: int = 256 # make SwiGLU hidden layer size multiple of large power of 2
27
+ ffn_dim_multiplier: Optional[float] = None
28
+ norm_eps: float = 1e-5
29
+
30
+ max_batch_size: int = 32
31
+ max_seq_len: int = 2048
32
+
33
+
34
+ class RMSNorm(torch.nn.Module):
35
+ def __init__(self, dim: int, eps: float = 1e-6):
36
+ """
37
+ Initialize the RMSNorm normalization layer.
38
+
39
+ Args:
40
+ dim (int): The dimension of the input tensor.
41
+ eps (float, optional): A small value added to the denominator for numerical stability. Default is 1e-6.
42
+
43
+ Attributes:
44
+ eps (float): A small value added to the denominator for numerical stability.
45
+ weight (nn.Parameter): Learnable scaling parameter.
46
+
47
+ """
48
+ super().__init__()
49
+ self.eps = eps
50
+ self.weight = nn.Parameter(torch.ones(dim))
51
+
52
+ def _norm(self, x):
53
+ """
54
+ Apply the RMSNorm normalization to the input tensor.
55
+
56
+ Args:
57
+ x (torch.Tensor): The input tensor.
58
+
59
+ Returns:
60
+ torch.Tensor: The normalized tensor.
61
+
62
+ """
63
+ return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
64
+
65
+ def forward(self, x):
66
+ """
67
+ Forward pass through the RMSNorm layer.
68
+
69
+ Args:
70
+ x (torch.Tensor): The input tensor.
71
+
72
+ Returns:
73
+ torch.Tensor: The output tensor after applying RMSNorm.
74
+
75
+ """
76
+ output = self._norm(x.float()).type_as(x)
77
+ k = output * self.weight
78
+ return k
79
+
80
+
81
+ def precompute_freqs_cis(dim: int, end: int, theta: float = 10000.0):
82
+ """
83
+ Precompute the frequency tensor for complex exponentials (cis) with given dimensions.
84
+
85
+ This function calculates a frequency tensor with complex exponentials using the given dimension 'dim'
86
+ and the end index 'end'. The 'theta' parameter scales the frequencies.
87
+ The returned tensor contains complex values in complex64 data type.
88
+
89
+ Args:
90
+ dim (int): Dimension of the frequency tensor.
91
+ end (int): End index for precomputing frequencies.
92
+ theta (float, optional): Scaling factor for frequency computation. Defaults to 10000.0.
93
+
94
+ Returns:
95
+ torch.Tensor: Precomputed frequency tensor with complex exponentials.
96
+
97
+
98
+
99
+
100
+ """
101
+ freqs = 1.0 / (theta ** (torch.arange(0, dim, 2)[: (dim // 2)].float() / dim))
102
+ t = torch.arange(end, device=freqs.device) # type: ignore
103
+ freqs = torch.outer(t, freqs).float() # type: ignore
104
+ freqs_cis = torch.polar(torch.ones_like(freqs), freqs) # complex64
105
+ return freqs_cis
106
+
107
+
108
+ def reshape_for_broadcast(freqs_cis: torch.Tensor, x: torch.Tensor):
109
+ """
110
+ Reshape frequency tensor for broadcasting it with another tensor.
111
+
112
+ This function reshapes the frequency tensor to have the same shape as the target tensor 'x'
113
+ for the purpose of broadcasting the frequency tensor during element-wise operations.
114
+
115
+ Args:
116
+ freqs_cis (torch.Tensor): Frequency tensor to be reshaped.
117
+ x (torch.Tensor): Target tensor for broadcasting compatibility.
118
+
119
+ Returns:
120
+ torch.Tensor: Reshaped frequency tensor.
121
+
122
+ Raises:
123
+ AssertionError: If the frequency tensor doesn't match the expected shape.
124
+ AssertionError: If the target tensor 'x' doesn't have the expected number of dimensions.
125
+ """
126
+ ndim = x.ndim
127
+ assert 0 <= 1 < ndim
128
+ assert freqs_cis.shape == (x.shape[1], x.shape[-1])
129
+ shape = [d if i == 1 or i == ndim - 1 else 1 for i, d in enumerate(x.shape)]
130
+ return freqs_cis.view(*shape)
131
+
132
+
133
+ def apply_rotary_emb(
134
+ xq: torch.Tensor,
135
+ xk: torch.Tensor,
136
+ freqs_cis: torch.Tensor,
137
+ ) -> Tuple[torch.Tensor, torch.Tensor]:
138
+ """
139
+ Apply rotary embeddings to input tensors using the given frequency tensor.
140
+
141
+ This function applies rotary embeddings to the given query 'xq' and key 'xk' tensors using the provided
142
+ frequency tensor 'freqs_cis'. The input tensors are reshaped as complex numbers, and the frequency tensor
143
+ is reshaped for broadcasting compatibility. The resulting tensors contain rotary embeddings and are
144
+ returned as real tensors.
145
+
146
+ Args:
147
+ xq (torch.Tensor): Query tensor to apply rotary embeddings.
148
+ xk (torch.Tensor): Key tensor to apply rotary embeddings.
149
+ freqs_cis (torch.Tensor): Precomputed frequency tensor for complex exponentials.
150
+
151
+ Returns:
152
+ Tuple[torch.Tensor, torch.Tensor]: Tuple of modified query tensor and key tensor with rotary embeddings.
153
+
154
+
155
+
156
+ """
157
+ xq_ = torch.view_as_complex(xq.float().reshape(*xq.shape[:-1], -1, 2))
158
+ xk_ = torch.view_as_complex(xk.float().reshape(*xk.shape[:-1], -1, 2))
159
+ freqs_cis = reshape_for_broadcast(freqs_cis, xq_)
160
+ xq_out = torch.view_as_real(xq_ * freqs_cis).flatten(3)
161
+ xk_out = torch.view_as_real(xk_ * freqs_cis).flatten(3)
162
+ return xq_out.type_as(xq), xk_out.type_as(xk)
163
+
164
+
165
+ def repeat_kv(x: torch.Tensor, n_rep: int) -> torch.Tensor:
166
+ """torch.repeat_interleave(x, dim=2, repeats=n_rep)"""
167
+ bs, slen, n_kv_heads, head_dim = x.shape
168
+ if n_rep == 1:
169
+ return x
170
+ return (
171
+ x[:, :, :, None, :]
172
+ .expand(bs, slen, n_kv_heads, n_rep, head_dim)
173
+ .reshape(bs, slen, n_kv_heads * n_rep, head_dim)
174
+ )
175
+
176
+
177
+ class Attention(nn.Module):
178
+ """Multi-head attention module."""
179
+ def __init__(self, args: ModelArgs):
180
+ """
181
+ Initialize the Attention module.
182
+
183
+ Args:
184
+ args (ModelArgs): Model configuration parameters.
185
+
186
+ Attributes:
187
+ n_kv_heads (int): Number of key and value heads.
188
+ n_local_heads (int): Number of local query heads.
189
+ n_local_kv_heads (int): Number of local key and value heads.
190
+ n_rep (int): Number of repetitions for local heads.
191
+ head_dim (int): Dimension size of each attention head.
192
+ wq (ColumnParallelLinear): Linear transformation for queries.
193
+ wk (ColumnParallelLinear): Linear transformation for keys.
194
+ wv (ColumnParallelLinear): Linear transformation for values.
195
+ wo (RowParallelLinear): Linear transformation for output.
196
+ cache_k (torch.Tensor): Cached keys for attention.
197
+ cache_v (torch.Tensor): Cached values for attention.
198
+
199
+ """
200
+ super().__init__()
201
+ self.n_kv_heads = args.n_heads if args.n_kv_heads is None else args.n_kv_heads
202
+ model_parallel_size = fs_init.get_model_parallel_world_size()
203
+ self.n_local_heads = args.n_heads // model_parallel_size
204
+ self.n_local_kv_heads = self.n_kv_heads // model_parallel_size
205
+ self.n_rep = self.n_local_heads // self.n_local_kv_heads
206
+ self.head_dim = args.dim // args.n_heads
207
+
208
+ self.wq = ColumnParallelLinear(
209
+ args.dim,
210
+ args.n_heads * self.head_dim,
211
+ bias=False,
212
+ gather_output=False,
213
+ init_method=lambda x: x,
214
+ )
215
+ self.wk = ColumnParallelLinear(
216
+ args.dim,
217
+ self.n_kv_heads * self.head_dim,
218
+ bias=False,
219
+ gather_output=False,
220
+ init_method=lambda x: x,
221
+ )
222
+ self.wv = ColumnParallelLinear(
223
+ args.dim,
224
+ self.n_kv_heads * self.head_dim,
225
+ bias=False,
226
+ gather_output=False,
227
+ init_method=lambda x: x,
228
+ )
229
+ self.wo = RowParallelLinear(
230
+ args.n_heads * self.head_dim,
231
+ args.dim,
232
+ bias=False,
233
+ input_is_parallel=True,
234
+ init_method=lambda x: x,
235
+ )
236
+
237
+ self.cache_k = torch.zeros(
238
+ (
239
+ args.max_batch_size,
240
+ args.max_seq_len,
241
+ self.n_local_kv_heads,
242
+ self.head_dim,
243
+ )
244
+ ).cuda()
245
+ self.cache_v = torch.zeros(
246
+ (
247
+ args.max_batch_size,
248
+ args.max_seq_len,
249
+ self.n_local_kv_heads,
250
+ self.head_dim,
251
+ )
252
+ ).cuda()
253
+
254
+ def forward(
255
+ self,
256
+ x: torch.Tensor,
257
+ start_pos: int,
258
+ freqs_cis: torch.Tensor,
259
+ mask: Optional[torch.Tensor]
260
+ ):
261
+ """
262
+ Forward pass of the attention module.
263
+
264
+ Args:
265
+ x (torch.Tensor): Input tensor.
266
+ start_pos (int): Starting position for caching.
267
+ freqs_cis (torch.Tensor): Precomputed frequency tensor.
268
+ mask (torch.Tensor, optional): Attention mask tensor.
269
+
270
+ Returns:
271
+ torch.Tensor: Output tensor after attention.
272
+
273
+ """
274
+ bsz, seqlen, _ = x.shape
275
+
276
+ xq, xk, xv = self.wq(x), self.wk(x), self.wv(x)
277
+
278
+ xq = xq.view(bsz, seqlen, self.n_local_heads, self.head_dim)
279
+ xk = xk.view(bsz, seqlen, self.n_local_kv_heads, self.head_dim)
280
+ xv = xv.view(bsz, seqlen, self.n_local_kv_heads, self.head_dim)
281
+
282
+ xq, xk = apply_rotary_emb(xq, xk, freqs_cis=freqs_cis)
283
+
284
+ self.cache_k = self.cache_k.to(xq)
285
+ self.cache_v = self.cache_v.to(xq)
286
+
287
+ self.cache_k[:bsz, start_pos : start_pos + seqlen] = xk
288
+ self.cache_v[:bsz, start_pos : start_pos + seqlen] = xv
289
+
290
+ keys = self.cache_k[:bsz, : start_pos + seqlen]
291
+ values = self.cache_v[:bsz, : start_pos + seqlen]
292
+
293
+ # repeat k/v heads if n_kv_heads < n_heads
294
+ keys = repeat_kv(keys, self.n_rep) # (bs, seqlen, n_local_heads, head_dim)
295
+ values = repeat_kv(values, self.n_rep) # (bs, seqlen, n_local_heads, head_dim)
296
+
297
+ xq = xq.transpose(1, 2) # (bs, n_local_heads, seqlen, head_dim)
298
+ keys = keys.transpose(1, 2)
299
+ values = values.transpose(1, 2)
300
+ scores = torch.matmul(xq, keys.transpose(2, 3)) / math.sqrt(self.head_dim)
301
+ if mask is not None:
302
+ scores = scores + mask # (bs, n_local_heads, seqlen, cache_len + seqlen)
303
+ scores = F.softmax(scores.float(), dim=-1).type_as(xq)
304
+ output = torch.matmul(scores, values) # (bs, n_local_heads, seqlen, head_dim)
305
+ output = output.transpose(1, 2).contiguous().view(bsz, seqlen, -1)
306
+ return self.wo(output)
307
+
308
+
309
+ class FeedForward(nn.Module):
310
+ def __init__(
311
+ self,
312
+ dim: int,
313
+ hidden_dim: int,
314
+ multiple_of: int,
315
+ ffn_dim_multiplier: Optional[float],
316
+ ):
317
+ """
318
+ Initialize the FeedForward module.
319
+
320
+ Args:
321
+ dim (int): Input dimension.
322
+ hidden_dim (int): Hidden dimension of the feedforward layer.
323
+ multiple_of (int): Value to ensure hidden dimension is a multiple of this value.
324
+ ffn_dim_multiplier (float, optional): Custom multiplier for hidden dimension. Defaults to None.
325
+
326
+ Attributes:
327
+ w1 (ColumnParallelLinear): Linear transformation for the first layer.
328
+ w2 (RowParallelLinear): Linear transformation for the second layer.
329
+ w3 (ColumnParallelLinear): Linear transformation for the third layer.
330
+
331
+ """
332
+ super().__init__()
333
+ hidden_dim = int(2 * hidden_dim / 3)
334
+ # custom dim factor multiplier
335
+ if ffn_dim_multiplier is not None:
336
+ hidden_dim = int(ffn_dim_multiplier * hidden_dim)
337
+ hidden_dim = multiple_of * ((hidden_dim + multiple_of - 1) // multiple_of)
338
+
339
+ self.w1 = ColumnParallelLinear(
340
+ dim, hidden_dim, bias=False, gather_output=False, init_method=lambda x: x
341
+ )
342
+ self.w2 = RowParallelLinear(
343
+ hidden_dim, dim, bias=False, input_is_parallel=True, init_method=lambda x: x
344
+ )
345
+ self.w3 = ColumnParallelLinear(
346
+ dim, hidden_dim, bias=False, gather_output=False, init_method=lambda x: x
347
+ )
348
+
349
+ def forward(self, x):
350
+ return self.w2(F.silu(self.w1(x)) * self.w3(x))
351
+
352
+
353
+ class MixedTransformerBlock(nn.Module):
354
+ def __init__(self, layer_id: int, args: ModelArgs):
355
+ """
356
+ Initialize a TransformerBlock.
357
+
358
+ Args:
359
+ layer_id (int): Identifier for the layer.
360
+ args (ModelArgs): Model configuration parameters.
361
+
362
+ Attributes:
363
+ n_heads (int): Number of attention heads.
364
+ dim (int): Dimension size of the model.
365
+ head_dim (int): Dimension size of each attention head.
366
+ attention (Attention): Attention module.
367
+ feed_forward (FeedForward): FeedForward module.
368
+ layer_id (int): Identifier for the layer.
369
+ attention_norm (RMSNorm): Layer normalization for attention output.
370
+ ffn_norm (RMSNorm): Layer normalization for feedforward output.
371
+
372
+ """
373
+ super().__init__()
374
+ self.n_heads = args.n_heads
375
+ self.dim = args.dim
376
+ self.head_dim = args.dim // args.n_heads
377
+ self.attention = Attention(args)
378
+ self.feed_forward = FeedForward(
379
+ dim=args.dim,
380
+ hidden_dim=4 * args.dim,
381
+ multiple_of=args.multiple_of,
382
+ ffn_dim_multiplier=args.ffn_dim_multiplier,
383
+ )
384
+ self.layer_id = layer_id
385
+ self.attention_norm = RMSNorm(args.dim, eps=args.norm_eps)
386
+ self.ffn_norm = RMSNorm(args.dim, eps=args.norm_eps)
387
+
388
+ def forward(
389
+ self,
390
+ x: torch.Tensor,
391
+ start_pos: int,
392
+ freqs_cis: torch.Tensor,
393
+ mask: Optional[torch.Tensor]
394
+ ):
395
+ """
396
+ Perform a forward pass through the TransformerBlock.
397
+
398
+ Args:
399
+ x (torch.Tensor): Input tensor.
400
+ start_pos (int): Starting position for attention caching.
401
+ freqs_cis (torch.Tensor): Precomputed cosine and sine frequencies.
402
+ mask (torch.Tensor, optional): Masking tensor for attention. Defaults to None.
403
+
404
+ Returns:
405
+ torch.Tensor: Output tensor after applying attention and feedforward layers.
406
+
407
+ """
408
+ h = x + self.attention.forward(
409
+ self.attention_norm(x), start_pos, freqs_cis, mask
410
+ )
411
+ out = h + self.feed_forward.forward(self.ffn_norm(h))
412
+ return out
413
+
414
+ class SuperposedTransformer(nn.Module):
415
+ def __init__(self, params: ModelArgs):
416
+ """
417
+ Initialize a Transformer model.
418
+
419
+ Args:
420
+ params (ModelArgs): Model configuration parameters.
421
+
422
+ Attributes:
423
+ params (ModelArgs): Model configuration parameters.
424
+ vocab_size (int): Vocabulary size.
425
+ n_layers (int): Number of layers in the model.
426
+ tok_embeddings (ParallelEmbedding): Token embeddings.
427
+ layers (torch.nn.ModuleList): List of Transformer blocks.
428
+ norm (RMSNorm): Layer normalization for the model output.
429
+ output (ColumnParallelLinear): Linear layer for final output.
430
+ freqs_cis (torch.Tensor): Precomputed cosine and sine frequencies.
431
+
432
+ """
433
+ super().__init__()
434
+ self.params = params
435
+ self.vocab_size = params.vocab_size
436
+ self.n_layers = params.n_layers
437
+
438
+ self.tok_embeddings = ParallelEmbedding(
439
+ params.vocab_size, params.dim, init_method=lambda x: x
440
+ )
441
+
442
+ self.tok_mixing_embeddings = ColumnParallelLinear(
443
+ params.vocab_size, params.dim, bias=False, init_method=lambda x: x
444
+ ) # dims here are a formality; the weight is overwritten below
445
+ self.tok_mixing_embeddings.weight = nn.Parameter(self.tok_embeddings.weight.T)
446
+
447
+ self.layers = torch.nn.ModuleList()
448
+ for layer_id in range(params.n_layers):
449
+ self.layers.append(MixedTransformerBlock(layer_id, params))
450
+
451
+ self.norm = RMSNorm(params.dim, eps=params.norm_eps)
452
+ self.output = ColumnParallelLinear(
453
+ params.dim, params.vocab_size, bias=False, init_method=lambda x: x
454
+ )
455
+
456
+ self.freqs_cis = precompute_freqs_cis(
457
+ # Note that self.params.max_seq_len is multiplied by 2 because the token limit for the Llama 2 generation of models is 4096.
458
+ # Adding this multiplier instead of using 4096 directly allows for dynamism of token lengths while training or fine-tuning.
459
+ self.params.dim // self.params.n_heads, self.params.max_seq_len * 2
460
+ )
461
+
462
+ @torch.inference_mode()
463
+ def forward(self,
464
+ tokens: torch.Tensor,
465
+ start_pos: int,
466
+ token_weights: Optional[torch.Tensor],
467
+ verbose: Optional[bool] = False):
468
+ """
469
+ Perform a forward pass through the Transformer model.
470
+
471
+ Args:
472
+ tokens (torch.Tensor): Input token indices.
473
+ start_pos (int): Starting position for attention caching.
474
+ token_weights (torch.Tensor): Superposition matrix.
475
+ verbose (bool): Whether to return intermediate hidden layer states
476
+
477
+ Returns:
478
+ torch.Tensor or (torch.Tensor, Dict): Output logits after applying the Transformer model.
479
+
480
+ """
481
+ if verbose:
482
+ states = {"layers": [], "weights": None}
483
+ _bsz, seqlen = tokens.shape
484
+ if token_weights is not None:
485
+ h = self.tok_mixing_embeddings(token_weights.half()).unsqueeze(1)
486
+ else:
487
+ h = self.tok_embeddings(tokens)
488
+ self.freqs_cis = self.freqs_cis.to(h.device)
489
+ freqs_cis = self.freqs_cis[start_pos : start_pos + seqlen]
490
+ if verbose:
491
+ states["layers"].append(h)
492
+ states["weights"] = token_weights
493
+
494
+ mask = None
495
+ if seqlen > 1:
496
+ mask = torch.full(
497
+ (1, 1, seqlen, seqlen), float("-inf"), device=tokens.device
498
+ )
499
+ mask = torch.triu(mask, diagonal=start_pos + 1).type_as(h)
500
+
501
+ for layer in self.layers:
502
+ h = layer(h, start_pos, freqs_cis, mask)
503
+ if verbose:
504
+ states["layers"].append(h)
505
+
506
+ h = self.norm(h)
507
+ if verbose:
508
+ states["layers"].append(h)
509
+
510
+ output = self.output(h).float()
511
+
512
+ if verbose:
513
+ return output, states
514
+ else:
515
+ return output
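# Illustrative sketch (not part of the repository): the superposed input
# embedding produced by tok_mixing_embeddings is a probability-weighted mixture
# of ordinary token embeddings, h = token_weights @ E, with E the
# (vocab_size, dim) embedding table. Sizes below are hypothetical.
import torch

vocab_size, dim = 8, 4
E = torch.randn(vocab_size, dim)          # stand-in for tok_embeddings.weight
token_weights = torch.zeros(1, vocab_size)
token_weights[0, 2], token_weights[0, 5] = 0.7, 0.3
h = token_weights @ E                     # (1, dim) superposed embedding
assert torch.allclose(h, 0.7 * E[2] + 0.3 * E[5])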
superposed/llama/tokenizer.py ADDED
@@ -0,0 +1,68 @@
1
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
2
+ # This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
3
+
4
+ import os
5
+ from logging import getLogger
6
+ from typing import List
7
+
8
+ from sentencepiece import SentencePieceProcessor
9
+
10
+
11
+ logger = getLogger()
12
+
13
+
14
+ class Tokenizer:
15
+ """tokenizing and encoding/decoding text using SentencePiece."""
16
+ def __init__(self, model_path: str):
17
+ """
18
+ Initializes the Tokenizer with a SentencePiece model.
19
+
20
+ Args:
21
+ model_path (str): The path to the SentencePiece model file.
22
+ """
23
+ # reload tokenizer
24
+ assert os.path.isfile(model_path), model_path
25
+ self.sp_model = SentencePieceProcessor(model_file=model_path)
26
+ logger.info(f"Reloaded SentencePiece model from {model_path}")
27
+
28
+ # BOS / EOS token IDs
29
+ self.n_words: int = self.sp_model.vocab_size()
30
+ self.bos_id: int = self.sp_model.bos_id()
31
+ self.eos_id: int = self.sp_model.eos_id()
32
+ self.pad_id: int = self.sp_model.pad_id()
33
+ logger.info(
34
+ f"#words: {self.n_words} - BOS ID: {self.bos_id} - EOS ID: {self.eos_id}"
35
+ )
36
+ assert self.sp_model.vocab_size() == self.sp_model.get_piece_size()
37
+
38
+ def encode(self, s: str, bos: bool, eos: bool) -> List[int]:
39
+ """
40
+ Encodes a string into a list of token IDs.
41
+
42
+ Args:
43
+ s (str): The input string to be encoded.
44
+ bos (bool): Whether to prepend the beginning-of-sequence token.
45
+ eos (bool): Whether to append the end-of-sequence token.
46
+
47
+ Returns:
48
+ List[int]: A list of token IDs.
49
+ """
50
+ assert type(s) is str
51
+ t = self.sp_model.encode(s)
52
+ if bos:
53
+ t = [self.bos_id] + t
54
+ if eos:
55
+ t = t + [self.eos_id]
56
+ return t
57
+
58
+ def decode(self, t: List[int]) -> str:
59
+ """
60
+ Decodes a list of token IDs into a string.
61
+
62
+ Args:
63
+ t (List[int]): The list of token IDs to be decoded.
64
+
65
+ Returns:
66
+ str: The decoded string.
67
+ """
68
+ return self.sp_model.decode(t)
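# Illustrative usage sketch (not part of the repository); the model path is
# hypothetical.
from superposed.llama.tokenizer import Tokenizer

tok = Tokenizer(model_path="weights/tokenizer.model")   # hypothetical path
ids = tok.encode("Hello world", bos=True, eos=False)    # [bos_id, ...]
text = tok.decode(ids)                                  # "Hello world"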
superposed/llama/utils.py ADDED
@@ -0,0 +1,70 @@
1
+ import torch
2
+
3
+ def log_prob_to_prob(log_probs, temp=1):
4
+ """
5
+ Convert log probabilities to probability distribution and normalize.
6
+ Args:
7
+ log_probs (torch.Tensor): Log probs (n_prompts, n_drafts, vocab_size)
8
+ Returns:
9
+ Probability distribution (n_prompts, n_drafts, vocab_size)
10
+ """
11
+ # stability constant
12
+ log_probs = log_probs - torch.max(log_probs, dim=-1, keepdim=True)[0]
13
+ probs = torch.softmax(log_probs / temp, dim=-1)
14
+ return probs
15
+
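# Illustrative sketch (not part of the repository): log_prob_to_prob turns draft
# log-probabilities into normalized weights via a (temperature-scaled) softmax.
# Example numbers only.
import torch

draft_log_probs = torch.tensor([[-1.0, -2.0, -3.0]])
weights = torch.softmax(draft_log_probs, dim=-1)
# tensor([[0.6652, 0.2447, 0.0900]]) -- one weight per draft, summing to 1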
16
+ def decode(tokenizer, encoding):
17
+ """
18
+ Decode a list of tokens to a string
19
+ Args:
20
+ tokenizer (Any): Tokenizer
21
+ encoding (torch.Tensor): Encoding
22
+ Returns:
23
+ decoding (str)
24
+ """
25
+ pad_locs = (encoding == -1).nonzero()
26
+ if len(pad_locs) > 0:
27
+ encoding = encoding[:pad_locs[0].item()]
28
+ return tokenizer.decode(encoding.to(torch.int32).tolist())
29
+
30
+ def print_gen(gens, logprobs, tokenizer, n_drafts, prompt_len, output_file):
31
+ """
32
+ Print out generations for debugging.
33
+ Args:
34
+ gens (n_prompts * n_drafts, seq_len): Generations to print
35
+ logprobs (n_prompts * n_drafts): Log probs of each generation
36
+ tokenizer (any): Tokenizer
37
+ n_drafts (int): Number of drafts per prompt
38
+ prompt_len (int): Number of tokens in prompt
39
+ """
40
+ n_prompts, n_drafts, seq_len = gens.shape
41
+ gens = gens.reshape(-1, seq_len)
42
+ logprobs = logprobs.flatten()
43
+ count = 0
44
+ for i in range(len(gens)):
45
+ d = decode(tokenizer, gens[i])
46
+ # first draft of this prompt
47
+ if i % n_drafts == 0:
48
+ count = 0
49
+ print("---------------", file=output_file)
50
+ prompt = decode(tokenizer, gens[i][:prompt_len])
51
+ print(f"prompt: {prompt}", file=output_file)
52
+ print(f"logprob: {logprobs[i]} {count}: {d}", file=output_file)
53
+ count += 1
54
+
55
+ def print_probs(next_probs, tokenizer, output_file):
56
+ """
57
+ Print out next token options and probabilities for debugging
58
+ Args:
59
+ next_probs (torch.Tensor): Next token probabilities (n_prompts, n_drafts, vocab_size)
60
+ tokenizer (any): Tokenizer
61
+ """
62
+ print("\tReminder: At most first n_drafts from seq can be selected.", file=output_file)
63
+ n_prompts, n_drafts, vocab_size = next_probs.shape
64
+ for p_idx in range(n_prompts):
65
+ print(f"\tPrompt {p_idx}:", file=output_file)
66
+ for d_idx in range(n_drafts):
67
+ next_token_probs, next_token_idx = next_probs[p_idx, d_idx].topk(n_drafts+2, dim=-1)
68
+ print(f"\t\tTokens: {[tokenizer.decode([i.item()]) for i in next_token_idx]}", file=output_file)
69
+ print(f"\t\tLog Probs: {torch.log(next_token_probs)}", file=output_file)
70
+ print(f"\t\tProbs: {next_token_probs}", file=output_file)
superposed/ngrams/__pycache__/ngram_models.cpython-312.pyc ADDED
Binary file (5.53 kB). View file
 
superposed/ngrams/make_corpus.py ADDED
@@ -0,0 +1,268 @@
1
+ import multiprocessing
2
+ import argparse
3
+ import os
4
+ import pickle
5
+ import glob
6
+ import json
7
+ from datasets import load_dataset
8
+ from tqdm import tqdm
9
+ from transformers import AutoTokenizer, LlamaTokenizer
10
+ from loguru import logger
11
+
12
+
13
+ def create_corpuses(
14
+ ckpt_path,
15
+ start_doc,
16
+ end_doc,
17
+ dataset,
18
+ tokenizer,
19
+ train_bigram: bool,
20
+ train_trigram: bool,
21
+ train_fourgram: bool,
22
+ train_fivegram: bool,
23
+ train_sixgram: bool,
24
+ train_sevengram: bool
25
+ ):
26
+ bigram_corpus = {}
27
+ trigram_corpus = {}
28
+ fourgram_corpus = {}
29
+ fivegram_corpus = {}
30
+ sixgram_corpus = {}
31
+ sevengram_corpus = {}
32
+
33
+ bigram_corpus_counts = {}
34
+ trigram_corpus_counts = {}
35
+ fourgram_corpus_counts = {}
36
+ fivegram_corpus_counts = {}
37
+ sixgram_corpus_counts = {}
38
+ sevengram_corpus_counts = {}
39
+
40
+ iterations = end_doc - start_doc
41
+ for i in tqdm(range(iterations)):
42
+ t = dataset[start_doc + i]["text"]
43
+ encoded_text = tokenizer.encode(t)
44
+ for start_idx in range(1, len(encoded_text)): # count from first real to eos
45
+ pOne = encoded_text[start_idx-1] if start_idx >= 1 else None
46
+ pTwo = encoded_text[start_idx-2] if start_idx >= 2 else None
47
+ pThree = encoded_text[start_idx-3] if start_idx >= 3 else None
48
+ pFour = encoded_text[start_idx-4] if start_idx >= 4 else None
49
+ pFive = encoded_text[start_idx-5] if start_idx >= 5 else None
50
+ pSix = encoded_text[start_idx-6] if start_idx >= 6 else None
51
+
52
+ token = encoded_text[start_idx]
53
+ # bigram
54
+ if train_bigram and start_idx >= 1:
55
+ prior = pOne
56
+ if prior not in bigram_corpus:
57
+ bigram_corpus[prior] = {}
58
+ bigram_corpus_counts[prior] = 0
59
+ bigram_corpus[prior][token] = bigram_corpus[prior].get(token, 0) + 1
60
+ bigram_corpus_counts[prior] += 1
61
+ # trigram
62
+ if train_trigram and start_idx >= 2:
63
+ prior = (pTwo, pOne)
64
+ if prior not in trigram_corpus:
65
+ trigram_corpus[prior] = {}
66
+ trigram_corpus_counts[prior] = 0
67
+ trigram_corpus[prior][token] = trigram_corpus[prior].get(token, 0) + 1
68
+ trigram_corpus_counts[prior] += 1
69
+ # fourgram
70
+ if train_fourgram and start_idx >= 3:
71
+ prior = (pThree, pTwo, pOne)
72
+ if prior not in fourgram_corpus:
73
+ fourgram_corpus[prior] = {}
74
+ fourgram_corpus_counts[prior] = 0
75
+ fourgram_corpus[prior][token] = fourgram_corpus[prior].get(token, 0) + 1
76
+ fourgram_corpus_counts[prior] += 1
77
+ # fivegram
78
+ if train_fivegram and start_idx >= 4:
79
+ prior = (pFour, pThree, pTwo, pOne)
80
+ if prior not in fivegram_corpus:
81
+ fivegram_corpus[prior] = {}
82
+ fivegram_corpus_counts[prior] = 0
83
+ fivegram_corpus[prior][token] = fivegram_corpus[prior].get(token, 0) + 1
84
+ fivegram_corpus_counts[prior] += 1
85
+ # sixgram
86
+ if train_sixgram and start_idx >= 5:
87
+ prior = (pFive, pFour, pThree, pTwo, pOne)
88
+ if prior not in sixgram_corpus:
89
+ sixgram_corpus[prior] = {}
90
+ sixgram_corpus_counts[prior] = 0
91
+ sixgram_corpus[prior][token] = sixgram_corpus[prior].get(token, 0) + 1
92
+ sixgram_corpus_counts[prior] += 1
93
+ # sevengram
94
+ if train_sevengram and start_idx >= 6:
95
+ prior = (pSix, pFive, pFour, pThree, pTwo, pOne)
96
+ if prior not in sevengram_corpus:
97
+ sevengram_corpus[prior] = {}
98
+ sevengram_corpus_counts[prior] = 0
99
+ sevengram_corpus[prior][token] = sevengram_corpus[prior].get(token, 0) + 1
100
+ sevengram_corpus_counts[prior] += 1
101
+ save_corpus(ckpt_path, bigram_corpus, trigram_corpus, fourgram_corpus, fivegram_corpus, sixgram_corpus, sevengram_corpus, start_doc, end_doc)
102
+ save_counts(ckpt_path, bigram_corpus_counts, trigram_corpus_counts, fourgram_corpus_counts, fivegram_corpus_counts, sixgram_corpus_counts, sevengram_corpus_counts, start_doc, end_doc)
103
+
104
+ def merge_corpus_helper(c1, c2):
105
+ """
106
+ Merge the corpuses c1 and c2, returning the merged result.
107
+ """
108
+ for prior in c2:
109
+ # if share prior
110
+ if prior in c1:
111
+ c1_prior = c1[prior]
112
+ c2_prior = c2[prior]
113
+ for token in c2_prior:
114
+ # if share token
115
+ if token in c1_prior:
116
+ c1_prior[token] += c2_prior[token]
117
+ # else just use c2's
118
+ else:
119
+ c1_prior[token] = c2_prior[token]
120
+ else:
121
+ # else just use c2's
122
+ c1[prior] = c2[prior]
123
+ return c1
124
+
125
+ def merge_counts_helper(c1, c2):
126
+ """
127
+ Merge the count corpuses c1 and c2, returning the merged result.
128
+ """
129
+ for prior in c2:
130
+ if prior in c1:
131
+ c1[prior] += c2[prior]
132
+ else:
133
+ c1[prior] = c2[prior]
134
+ return c1
135
+
136
+ def save_corpus(save_dir, b_d, t_d, fo_d, fi_d, si_d, se_d, start_doc, end_doc):
137
+ """
138
+ Save corpuses b_d (bigram) to se_d (sevengram), where the corpus contains mappings
139
+ {prefix : {next_token1: ct, next_token2: ct, ...}}.
140
+ """
141
+ prefixes = ["b_d", "t_d", "fo_d", "fi_d", "si_d", "se_d"]
142
+ for p, corpus in zip(prefixes, [b_d, t_d, fo_d, fi_d, si_d, se_d]):
143
+ with open(f"{save_dir}/{p}{start_doc}-{end_doc}.pkl", "wb") as f:
144
+ pickle.dump(corpus, f)
145
+
146
+ def save_counts(save_dir, b_ct, t_ct, fo_ct, fi_ct, si_ct, se_ct, start_doc, end_doc):
147
+ """
148
+ Save count corpuses b_ct (bigram) to se_ct (sevengram), where each count
149
+ corpus contains mappings {prefix : total}.
150
+ """
151
+ prefixes = ["b_ct", "t_ct", "fo_ct", "fi_ct", "si_ct", "se_ct"]
152
+ for p, corpus in zip(prefixes, [b_ct, t_ct, fo_ct, fi_ct, si_ct, se_ct]):
153
+ with open(f"{save_dir}/{p}{start_doc}-{end_doc}.pkl", "wb") as f:
154
+ pickle.dump(corpus, f)
155
+
156
+ def merge_corpuses(ckpt_path):
157
+ """
158
+ Helper to merge corpuses in `ckpt_path`, where `ckpt_path` might contain
159
+ multiple bigram, trigram, etc. corpuses from each process.
160
+ """
161
+ prefixes = ["b_d", "t_d", "fo_d", "fi_d", "si_d", "se_d"]
162
+ for prefix in prefixes:
163
+ if os.path.exists(f"{ckpt_path}/{prefix}_final.pkl"):
164
+ os.remove(f"{ckpt_path}/{prefix}_final.pkl")
165
+ corpus = None
166
+ for filepath in glob.glob(f"{ckpt_path}/{prefix}*"):
167
+ with open(filepath, "rb") as f:
168
+ current = pickle.load(f)
169
+ if corpus is None:
170
+ corpus = current
171
+ else:
172
+ corpus = merge_corpus_helper(corpus, current)
173
+ os.remove(filepath)
174
+ with open(f"{ckpt_path}/{prefix}_final.pkl", "wb") as f:
175
+ pickle.dump(corpus, f)
176
+
177
+ def merge_counts(ckpt_path):
178
+ """
179
+ Helper to merge count corpuses in `ckpt_path`, where `ckpt_path` might contain
180
+ multiple bigram, trigram, etc. count corpuses from each process.
181
+ """
182
+ prefixes = ["b_ct", "t_ct", "fo_ct", "fi_ct", "si_ct", "se_ct"]
183
+ for prefix in prefixes:
184
+ if os.path.exists(f"{ckpt_path}/{prefix}_final.pkl"):
185
+ os.remove(f"{ckpt_path}/{prefix}_final.pkl")
186
+
187
+ counts = None
188
+ for filepath in glob.glob(f"{ckpt_path}/{prefix}*"):
189
+ with open(filepath, "rb") as f:
190
+ current = pickle.load(f)
191
+ if counts is None:
192
+ counts = current
193
+ else:
194
+ counts = merge_counts_helper(counts, current)
195
+ os.remove(filepath)
196
+ with open(f"{ckpt_path}/{prefix}_final.pkl", "wb") as f:
197
+ pickle.dump(counts, f)
198
+
199
+
200
+ if __name__ == "__main__":
201
+ # Input arguments
202
+ parser = argparse.ArgumentParser()
203
+ parser.add_argument("ckpt_path", type=str, help="Path to store ngram models")
204
+ parser.add_argument("start_doc", type=str, help="# of first document")
205
+ parser.add_argument("end_doc", type=str, help="# of last document")
206
+ parser.add_argument("c", type=int, help="number of processes")
207
+ parser.add_argument("--tok_name", type=str, help="name of HF tokenizer, or llama", default="llama")
208
+ for arg_name in ["--bigram", "--trigram", "--fourgram", "--fivegram", "--sixgram", "--sevengram"]:
209
+ parser.add_argument(arg_name, type=str, help=f"Whether to make a {arg_name} model")
210
+ parser.add_argument("--dset_name", type=str, help="name of HF dataset")
211
+ parser.add_argument("--dset_path", type=str, help="path to dataset")
212
+ # Parse arguments
213
+ args = parser.parse_args()
214
+ start_doc_ovr = int(args.start_doc)
215
+ end_doc_ovr = int(args.end_doc)
216
+ n_cores = args.c
217
+ tok_name = args.tok_name
218
+ ckpt_path = args.ckpt_path
219
+ dset_name = args.dset_name
220
+ dset_path = args.dset_path
221
+ if not dset_name and not dset_path:
222
+ raise RuntimeError("Please provide a dataset")
223
+ if not os.path.exists(ckpt_path):
224
+ os.makedirs(ckpt_path)
225
+ logger.info(f"{start_doc_ovr} {end_doc_ovr} {n_cores}")
226
+
227
+ # Load dataset and tokenizer
228
+ if dset_name:
229
+ ds = load_dataset(dset_name, cache_dir="../../../datasets/")["train"].shuffle(seed=42)
230
+ else:
231
+ with open(dset_path, "r") as f:
232
+ ds = json.load(f)["train"]
233
+ if tok_name == "llama":
234
+ # REPLACE WITH YOUR OWN PATH
235
+ tokenizer = LlamaTokenizer.from_pretrained("../../7B_HF", add_bos_token=False)
236
+ else:
237
+ tokenizer = AutoTokenizer.from_pretrained(tok_name)
238
+
239
+ # Start running
240
+ num_processes = n_cores
241
+ total_docs = end_doc_ovr - start_doc_ovr
242
+ docs_per_c = (total_docs) // num_processes
243
+ processes = []
244
+ for core in range(n_cores):
245
+ start_doc = core * docs_per_c # relative start doc
246
+ end_doc = (core + 1) * docs_per_c if core < n_cores - 1 else total_docs # relative end doc
247
+ logger.info(f"Starting core {core} from document {start_doc} to {end_doc}")
248
+ process = multiprocessing.Process(target=create_corpuses,
249
+ args=(ckpt_path,
250
+ start_doc_ovr + start_doc,
251
+ start_doc_ovr + end_doc,
252
+ ds, tokenizer,
253
+ args.bigram,
254
+ args.trigram,
255
+ args.fourgram,
256
+ args.fivegram,
257
+ args.sixgram,
258
+ args.sevengram))
259
+ processes.append(process)
260
+ process.start()
261
+ for process in processes:
262
+ process.join()
263
+ logger.info("Finished Saving")
264
+ logger.info("Merging...")
265
+ merge_corpuses(ckpt_path)
266
+ merge_counts(ckpt_path)
267
+ logger.info("Merged.")
268
+
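A sketch of the assumed corpus layout (the token IDs are illustrative, not from the repo): based on the save helpers above and on NGram.prob/ntd in ngram_models.py below, each n-gram corpus appears to map a prefix of token IDs to a dict of next-token counts, while the matching count corpus maps the same prefix to its total.

    # Assumed trigram corpus entry: prefix (two token IDs) -> {next token ID: count}
    trigram_corpus = {
        (306, 626): {263: 12, 278: 7},
    }
    # Assumed matching count-corpus entry: prefix -> total occurrences of the prefix
    trigram_counts = {
        (306, 626): 19,
    }
    # Bigram corpuses appear to be keyed by a single token ID rather than a tuple
    # (NGram.prob unwraps one-element keys before lookup).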
superposed/ngrams/ngram_models.py ADDED
@@ -0,0 +1,115 @@
1
+ import pickle
2
+ import sys
3
+
4
+ import torch
5
+
6
+ class NGram():
7
+ def __init__(self, corpus, corpus_counts, type):
8
+ self.corpus = corpus
9
+ self.counts = corpus_counts
10
+ self.type = type
11
+
12
+ def prob(self, key, next):
13
+ """
14
+ Args:
15
+ key (tuple): tuple of token IDs forming the prior
16
+ next (int): ID of the candidate next token
17
+ """
18
+ l = len(key)
19
+ if self.type == "bigram":
20
+ assert l == 1
21
+ key = key[0]
22
+ elif self.type == "trigram":
23
+ assert l == 2
24
+ elif self.type == "fourgram":
25
+ assert l == 3
26
+ elif self.type == "fivegram":
27
+ assert l == 4
28
+ elif self.type == "sixgram":
29
+ assert l == 5
30
+ elif self.type == "sevengram":
31
+ assert l == 6
32
+
33
+ count = 0
34
+ if key in self.corpus:
35
+ count = self.corpus[key].get(next, 0)
36
+ total = sum(self.corpus[key].values())
37
+ return count / total
38
+ else:
39
+ return -1
40
+
41
+ def ntd(self, key, vocab_size=32000):
42
+ """
43
+ Args:
44
+ key (tuple): tuple of token IDs forming the prior
45
+ Returns:
46
+ prob_tensor (torch.Tensor): (vocab_size, ) of full next token probabilities
47
+ """
48
+ if key in self.corpus:
49
+ prob_tensor = torch.zeros(vocab_size)
50
+ total = sum(self.corpus[key].values())
51
+ for next_token in self.corpus[key]:
52
+ prob_tensor[next_token] = self.corpus[key][next_token] / total
53
+ return prob_tensor
54
+ else:
55
+ return None
56
+
57
+ def make_models(ckpt_path, bigram, trigram, fourgram, fivegram, sixgram, sevengram):
58
+ """
59
+ Loads and returns a list of n-gram models (bigram through sevengram), containing
60
+ only the models whose parameters are `True`. See below for expected corpus names.
61
+ Args:
62
+ ckpt_path (str): Location of ngram models
63
+ bigram-sevengram: Which models to load
64
+ Returns:
65
+ List of n-gram models
66
+ """
67
+ models = []
68
+ if bigram:
69
+ print("Making bigram...")
70
+ with open(f"{ckpt_path}/b_d_final.pkl", "rb") as f:
71
+ bigram = pickle.load(f)
72
+ bigram_model = NGram(bigram, None, "bigram")
73
+ models.append(bigram_model)
74
+ print(sys.getsizeof(bigram))
75
+
76
+ if trigram:
77
+ print("Making trigram...")
78
+ with open(f"{ckpt_path}/t_d_final.pkl", "rb") as f:
79
+ trigram = pickle.load(f)
80
+ trigram_model = NGram(trigram, None, "trigram")
81
+ models.append(trigram_model)
82
+ print(sys.getsizeof(trigram))
83
+
84
+ if fourgram:
85
+ print("Making fourgram...")
86
+ with open(f"{ckpt_path}/fo_d_final.pkl", "rb") as f:
87
+ fourgram = pickle.load(f)
88
+ fourgram_model = NGram(fourgram, None, "fourgram")
89
+ models.append(fourgram_model)
90
+ print(sys.getsizeof(fourgram))
91
+
92
+ if fivegram:
93
+ print("Making fivegram...")
94
+ with open(f"{ckpt_path}/fi_d_final.pkl", "rb") as f:
95
+ fivegram = pickle.load(f)
96
+ fivegram_model = NGram(fivegram, None, "fivegram")
97
+ models.append(fivegram_model)
98
+ print(sys.getsizeof(fivegram))
99
+
100
+ if sixgram:
101
+ print("Making sixgram...")
102
+ with open(f"{ckpt_path}/si_d_final.pkl", "rb") as f:
103
+ sixgram = pickle.load(f)
104
+ sixgram_model = NGram(sixgram, None, "sixgram")
105
+ models.append(sixgram_model)
106
+ print(sys.getsizeof(sixgram))
107
+
108
+ if sevengram:
109
+ print("Making sevengram...")
110
+ with open(f"{ckpt_path}/se_d_final.pkl", "rb") as f:
111
+ sevengram = pickle.load(f)
112
+ sevengram_model = NGram(sevengram, None, "sevengram")
113
+ models.append(sevengram_model)
114
+
115
+ return models
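A minimal usage sketch for make_models, assuming the *_final.pkl corpuses produced by make_corpus.py are present (the checkpoint directory name and token IDs are placeholders):

    from superposed.ngrams.ngram_models import make_models

    # Load only the bigram and trigram models from a checkpoint directory.
    bigram_model, trigram_model = make_models(
        "ckpts-200k", bigram=True, trigram=True,
        fourgram=False, fivegram=False, sixgram=False, sevengram=False)

    p = trigram_model.prob((306, 626), 263)  # P(263 | prefix), or -1 if the prefix is unseen
    dist = trigram_model.ntd((306, 626))     # (vocab_size,) next-token distribution, or None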
superposed/ngrams/test.json ADDED
@@ -0,0 +1,8 @@
1
+ {
2
+ "train": [
3
+ {"text": "Hi my name is"},
4
+ {"text": "This is a story of"},
5
+ {"text": "In many cases, the architecture you want to use can be guessed from the name or the path of the pretrained model you are supplying"},
6
+ {"text": "There is one class of AutoModel for each task, and for each backend (PyTorch, TensorFlow, or Flax)."}
7
+ ]
8
+ }
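test.json mirrors the layout make_corpus.py reads when --dset_path is given: a top-level "train" list of records, each carrying a "text" field. A quick sanity check, assuming it is run from the repo root:

    import json

    with open("superposed/ngrams/test.json", "r") as f:
        ds = json.load(f)["train"]
    print(len(ds), ds[0]["text"])  # 4 Hi my name is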
superposed/notebooks/custom.ipynb ADDED
@@ -0,0 +1,289 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "code",
5
+ "execution_count": 2,
6
+ "id": "119805f4-8589-4379-ad87-a7bad4c0e658",
7
+ "metadata": {},
8
+ "outputs": [
9
+ {
10
+ "name": "stderr",
11
+ "output_type": "stream",
12
+ "text": [
13
+ "/gscratch/raivn/ethans/miniconda3/envs/llms_12.1/lib/python3.11/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
14
+ " from .autonotebook import tqdm as notebook_tqdm\n",
15
+ "<frozen importlib._bootstrap>:241: RuntimeWarning: pyarrow.lib.IpcWriteOptions size changed, may indicate binary incompatibility. Expected 72 from C header, got 88 from PyObject\n",
16
+ "<frozen importlib._bootstrap>:241: RuntimeWarning: pyarrow.lib.IpcReadOptions size changed, may indicate binary incompatibility. Expected 96 from C header, got 104 from PyObject\n",
17
+ "<frozen importlib._bootstrap>:241: RuntimeWarning: pyarrow._fs.FileInfo size changed, may indicate binary incompatibility. Expected 64 from C header, got 88 from PyObject\n",
18
+ "<frozen importlib._bootstrap>:241: RuntimeWarning: pyarrow._fs.FileSelector size changed, may indicate binary incompatibility. Expected 48 from C header, got 72 from PyObject\n",
19
+ "2024-05-30 03:09:58.230601: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.\n",
20
+ "2024-05-30 03:09:58.280835: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.\n",
21
+ "To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.\n",
22
+ "2024-05-30 03:10:03.250651: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT\n"
23
+ ]
24
+ }
25
+ ],
26
+ "source": [
27
+ "%load_ext autoreload\n",
28
+ "%autoreload 2\n",
29
+ "\n",
30
+ "import json\n",
31
+ "import os\n",
32
+ "import pickle\n",
33
+ "from datetime import datetime\n",
34
+ "\n",
35
+ "import evaluate\n",
36
+ "import torch\n",
37
+ "from tqdm import tqdm\n",
38
+ "\n",
39
+ "from eval import *\n",
40
+ "from superposed.llama.metrics import *\n",
41
+ "from superposed.llama.generation import Llama\n",
42
+ "from superposed.llama.superposed_generation import SuperposedLlama\n",
43
+ "from superposed.llama.tokenizer import Tokenizer\n",
44
+ "from superposed.ngrams.ngram_models import make_models"
45
+ ]
46
+ },
47
+ {
48
+ "cell_type": "code",
49
+ "execution_count": 4,
50
+ "id": "51c15900-c8b8-46d9-a884-6842a391ef48",
51
+ "metadata": {},
52
+ "outputs": [],
53
+ "source": [
54
+ "sup_device = torch.device(\"cuda:0\")\n",
55
+ "tokenizer = Tokenizer('../../7B/tokenizer.model')"
56
+ ]
57
+ },
58
+ {
59
+ "cell_type": "code",
60
+ "execution_count": 5,
61
+ "id": "9817d9a4-ad64-41c6-b87b-b1e422b836a9",
62
+ "metadata": {},
63
+ "outputs": [
64
+ {
65
+ "name": "stdout",
66
+ "output_type": "stream",
67
+ "text": [
68
+ "Parameters: {'alpha': 0.54, 'temp': 0.06, 'n_drafts': 3, 'prompt_len': 15, 'n_token_sample': 9, 'n_token_consider': 32000, 'mixing_method': 'sample_new_weights_with_score', 'smoothing': 'geom', 'sample_tokens': 0, 'sample_beams': 0, 'i_weights': [0.01, 0.04, 0.15, 0.18, 0.12], 'i_length': [1, 2, 3, 4, 5]}\n"
69
+ ]
70
+ }
71
+ ],
72
+ "source": [
73
+ "# Params\n",
74
+ "param_file = \"../../params/p15_d3_mixed.json\"\n",
75
+ "with open(param_file, \"r\") as f:\n",
76
+ " params = json.load(f)\n",
77
+ " print(f\"Parameters: {params}\")\n",
78
+ "alpha = params[\"alpha\"]\n",
79
+ "temp = params[\"temp\"]\n",
80
+ "n_drafts = params[\"n_drafts\"]\n",
81
+ "prompt_len = params[\"prompt_len\"]\n",
82
+ "n_token_sample = params[\"n_token_sample\"]\n",
83
+ "i_weights = params[\"i_weights\"]\n",
84
+ "i_length = params[\"i_length\"]"
85
+ ]
86
+ },
87
+ {
88
+ "cell_type": "code",
89
+ "execution_count": 6,
90
+ "id": "9c99098e-a38b-4c78-a0e9-8c80309830bb",
91
+ "metadata": {},
92
+ "outputs": [
93
+ {
94
+ "name": "stdout",
95
+ "output_type": "stream",
96
+ "text": [
97
+ "Making bigram...\n",
98
+ "1310800\n",
99
+ "Making trigram...\n",
100
+ "671088728\n",
101
+ "Making fourgram...\n",
102
+ "2684354648\n",
103
+ "Making fivegram...\n",
104
+ "5368709200\n",
105
+ "Making sixgram...\n",
106
+ "5368709200\n"
107
+ ]
108
+ }
109
+ ],
110
+ "source": [
111
+ "# Create ngram models\n",
112
+ "ngrams = make_models(\"../../ckpts-200k\", bigram=True, trigram=True, fourgram=True, fivegram=True, sixgram=True, sevengram=False)"
113
+ ]
114
+ },
115
+ {
116
+ "cell_type": "code",
117
+ "execution_count": 7,
118
+ "id": "c3331332-242c-4e98-9f11-58c6dc0ef581",
119
+ "metadata": {},
120
+ "outputs": [
121
+ {
122
+ "name": "stdout",
123
+ "output_type": "stream",
124
+ "text": [
125
+ "> initializing model parallel with size 1\n",
126
+ "> initializing ddp with size 1\n",
127
+ "> initializing pipeline with size 1\n"
128
+ ]
129
+ },
130
+ {
131
+ "name": "stderr",
132
+ "output_type": "stream",
133
+ "text": [
134
+ "/gscratch/raivn/ethans/miniconda3/envs/llms_12.1/lib/python3.11/site-packages/torch/__init__.py:614: UserWarning: torch.set_default_tensor_type() is deprecated as of PyTorch 2.1, please use torch.set_default_dtype() and torch.set_default_device() as alternatives. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:451.)\n",
135
+ " _C._set_default_tensor_type(t)\n"
136
+ ]
137
+ },
138
+ {
139
+ "name": "stdout",
140
+ "output_type": "stream",
141
+ "text": [
142
+ "Loaded in 25.15 seconds\n",
143
+ "cuda:0\n"
144
+ ]
145
+ }
146
+ ],
147
+ "source": [
148
+ "weight_path = \"../../7B/\"\n",
149
+ "model = SuperposedLlama.build(ckpt_dir=weight_path, \n",
150
+ " tokenizer_path=f'{weight_path}/tokenizer.model', \n",
151
+ " max_seq_len=100, \n",
152
+ " max_batch_size=32,\n",
153
+ " device=sup_device,\n",
154
+ " model_parallel_size=1)"
155
+ ]
156
+ },
157
+ {
158
+ "cell_type": "markdown",
159
+ "id": "e2b48c23-d6a3-43b1-ad4c-54524aacfda6",
160
+ "metadata": {},
161
+ "source": [
162
+ "# Inference"
163
+ ]
164
+ },
165
+ {
166
+ "cell_type": "code",
167
+ "execution_count": 11,
168
+ "id": "5093373b-bf76-47e3-8f99-1045b60f29c3",
169
+ "metadata": {},
170
+ "outputs": [],
171
+ "source": [
172
+ "def decode(tokenizer, encoding):\n",
173
+ " \"\"\"\n",
174
+ " Args:\n",
175
+ " tokenizer (Any): Tokenizer\n",
176
+ " encoding (torch.Tensor): Encoding\n",
177
+ " Returns:\n",
178
+ " decoding (str)\n",
179
+ " \"\"\"\n",
180
+ " eos_locs = (encoding == tokenizer.eos_id).nonzero()\n",
181
+ " if len(eos_locs > 0):\n",
182
+ " encoding = encoding[:eos_locs[0]]\n",
183
+ " return tokenizer.decode(encoding.to(torch.int32).tolist())"
184
+ ]
185
+ },
186
+ {
187
+ "cell_type": "code",
188
+ "execution_count": 22,
189
+ "id": "18703b19-f3e9-46e4-ab1c-c6d3b403c6d2",
190
+ "metadata": {},
191
+ "outputs": [],
192
+ "source": [
193
+ "prompts = [\n",
194
+ " \"Hi my name is\",\n",
195
+ " \"The Seattle Seahawks were Super Bowl\",\n",
196
+ " \"Penguins are birds native to\"\n",
197
+ "]\n",
198
+ "tokenized_prompts = tokenizer.encode(prompts, True, False)"
199
+ ]
200
+ },
201
+ {
202
+ "cell_type": "code",
203
+ "execution_count": 23,
204
+ "id": "d39cd735-9480-4979-ac92-bbd470f75570",
205
+ "metadata": {},
206
+ "outputs": [],
207
+ "source": [
208
+ "alive_gens, _ = model.sup_generate(prompt_tokens=tokenized_prompts, \n",
209
+ " smoothing=\"geom\",\n",
210
+ " max_gen_len=10, \n",
211
+ " n_token_sample=n_token_sample,\n",
212
+ " alpha=alpha, \n",
213
+ " temp=temp,\n",
214
+ " n_drafts=n_drafts,\n",
215
+ " i_weights=i_weights,\n",
216
+ " i_length=i_length,\n",
217
+ " ngrams=ngrams,\n",
218
+ " get_time=False,\n",
219
+ " penalty=200)"
220
+ ]
221
+ },
222
+ {
223
+ "cell_type": "code",
224
+ "execution_count": 24,
225
+ "id": "cfefa793-e49e-483a-a504-5cc9e23f619d",
226
+ "metadata": {},
227
+ "outputs": [],
228
+ "source": [
229
+ "gens = alive_gens[0].reshape(len(prompts) * n_drafts, -1)"
230
+ ]
231
+ },
232
+ {
233
+ "cell_type": "code",
234
+ "execution_count": 25,
235
+ "id": "5abf87ab-2ee0-4204-868b-1215abf0c8aa",
236
+ "metadata": {},
237
+ "outputs": [
238
+ {
239
+ "name": "stdout",
240
+ "output_type": "stream",
241
+ "text": [
242
+ "Hi\n",
243
+ "my name\n",
244
+ "is L\n",
245
+ "inda,\n",
246
+ "I am\n",
247
+ "a \n",
248
+ "40\n",
249
+ "year old\n",
250
+ "woman who\n"
251
+ ]
252
+ }
253
+ ],
254
+ "source": [
255
+ "for i in gens:\n",
256
+ " print(decode(tokenizer, i))"
257
+ ]
258
+ },
259
+ {
260
+ "cell_type": "code",
261
+ "execution_count": null,
262
+ "id": "e73dc3cc-baa5-468d-bdd1-827465bdeb62",
263
+ "metadata": {},
264
+ "outputs": [],
265
+ "source": []
266
+ }
267
+ ],
268
+ "metadata": {
269
+ "kernelspec": {
270
+ "display_name": "Python 3 (ipykernel)",
271
+ "language": "python",
272
+ "name": "python3"
273
+ },
274
+ "language_info": {
275
+ "codemirror_mode": {
276
+ "name": "ipython",
277
+ "version": 3
278
+ },
279
+ "file_extension": ".py",
280
+ "mimetype": "text/x-python",
281
+ "name": "python",
282
+ "nbconvert_exporter": "python",
283
+ "pygments_lexer": "ipython3",
284
+ "version": "3.11.5"
285
+ }
286
+ },
287
+ "nbformat": 4,
288
+ "nbformat_minor": 5
289
+ }
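The cells above flatten all drafts before decoding. A small follow-on sketch that keeps the prompt/draft grouping visible, assuming (as the reshape in the notebook implies) that alive_gens[0] stores n_drafts consecutive drafts per prompt:

    per_prompt = alive_gens[0].reshape(len(prompts), n_drafts, -1)
    for p_idx, prompt in enumerate(prompts):
        print(prompt)
        for d_idx in range(n_drafts):
            print("   ", decode(tokenizer, per_prompt[p_idx, d_idx]))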
superposed/notebooks/nq.ipynb ADDED
@@ -0,0 +1,417 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "code",
5
+ "execution_count": 2,
6
+ "metadata": {},
7
+ "outputs": [
8
+ {
9
+ "name": "stdout",
10
+ "output_type": "stream",
11
+ "text": [
12
+ "The autoreload extension is already loaded. To reload it, use:\n",
13
+ " %reload_ext autoreload\n"
14
+ ]
15
+ }
16
+ ],
17
+ "source": [
18
+ "%load_ext autoreload\n",
19
+ "%autoreload 2\n",
20
+ "\n",
21
+ "import json\n",
22
+ "import os\n",
23
+ "import re\n",
24
+ "from datetime import datetime\n",
25
+ "\n",
26
+ "import torch\n",
27
+ "from datasets import load_dataset\n",
28
+ "from tqdm import tqdm\n",
29
+ "\n",
30
+ "from eval import *\n",
31
+ "from superposed.llama.metrics import *\n",
32
+ "from superposed.llama.generation import Llama\n",
33
+ "from superposed.llama.superposed_generation import SuperposedLlama\n",
34
+ "from superposed.llama.tokenizer import Tokenizer\n",
35
+ "from superposed.ngrams.ngram_models import make_models"
36
+ ]
37
+ },
38
+ {
39
+ "cell_type": "markdown",
40
+ "metadata": {},
41
+ "source": [
42
+ "# Setup"
43
+ ]
44
+ },
45
+ {
46
+ "cell_type": "code",
47
+ "execution_count": 3,
48
+ "metadata": {},
49
+ "outputs": [],
50
+ "source": [
51
+ "nq = load_dataset(\"nq_open\")[\"validation\"]"
52
+ ]
53
+ },
54
+ {
55
+ "cell_type": "code",
56
+ "execution_count": 6,
57
+ "metadata": {},
58
+ "outputs": [
59
+ {
60
+ "name": "stdout",
61
+ "output_type": "stream",
62
+ "text": [
63
+ "Parameters: {'alpha': 0.54, 'temp': 0.06, 'n_drafts': 3, 'prompt_len': 15, 'n_token_sample': 9, 'n_token_consider': 32000, 'mixing_method': 'sample_new_weights_with_score', 'smoothing': 'geom', 'sample_tokens': 0, 'sample_beams': 0, 'i_weights': [0.01, 0.04, 0.15, 0.18, 0.12], 'i_length': [1, 2, 3, 4, 5]}\n"
64
+ ]
65
+ }
66
+ ],
67
+ "source": [
68
+ "# Params\n",
69
+ "param_file = \"../../params/p15_d3_mixed.json\"\n",
70
+ "with open(param_file, \"r\") as f:\n",
71
+ " params = json.load(f)\n",
72
+ " print(f\"Parameters: {params}\")\n",
73
+ "alpha = params[\"alpha\"]\n",
74
+ "temp = params[\"temp\"]\n",
75
+ "n_drafts = params[\"n_drafts\"]\n",
76
+ "prompt_len = params[\"prompt_len\"]\n",
77
+ "n_token_sample = params[\"n_token_sample\"]\n",
78
+ "i_weights = params[\"i_weights\"]\n",
79
+ "i_length = params[\"i_length\"]"
80
+ ]
81
+ },
82
+ {
83
+ "cell_type": "markdown",
84
+ "metadata": {},
85
+ "source": [
86
+ "# Create Models"
87
+ ]
88
+ },
89
+ {
90
+ "cell_type": "code",
91
+ "execution_count": 7,
92
+ "metadata": {},
93
+ "outputs": [
94
+ {
95
+ "name": "stdout",
96
+ "output_type": "stream",
97
+ "text": [
98
+ "Making bigram...\n",
99
+ "1310800\n",
100
+ "Making trigram...\n",
101
+ "671088728\n",
102
+ "Making fourgram...\n",
103
+ "2684354648\n",
104
+ "Making fivegram...\n",
105
+ "5368709200\n",
106
+ "Making sixgram...\n",
107
+ "5368709200\n"
108
+ ]
109
+ }
110
+ ],
111
+ "source": [
112
+ "ngrams = make_models(\"../../ckpts-200k\", bigram=True, trigram=True, fourgram=True, fivegram=True, sixgram=True, sevengram=False)"
113
+ ]
114
+ },
115
+ {
116
+ "cell_type": "code",
117
+ "execution_count": 9,
118
+ "metadata": {},
119
+ "outputs": [],
120
+ "source": [
121
+ "sup_device = torch.device(\"cuda:0\")\n",
122
+ "reg_device = torch.device(\"cuda:1\")"
123
+ ]
124
+ },
125
+ {
126
+ "cell_type": "code",
127
+ "execution_count": 11,
128
+ "metadata": {},
129
+ "outputs": [
130
+ {
131
+ "name": "stdout",
132
+ "output_type": "stream",
133
+ "text": [
134
+ "> initializing model parallel with size 1\n",
135
+ "> initializing ddp with size 1\n",
136
+ "> initializing pipeline with size 1\n"
137
+ ]
138
+ },
139
+ {
140
+ "name": "stderr",
141
+ "output_type": "stream",
142
+ "text": [
143
+ "/gscratch/raivn/ethans/miniconda3/envs/llms_12.1/lib/python3.11/site-packages/torch/__init__.py:614: UserWarning: torch.set_default_tensor_type() is deprecated as of PyTorch 2.1, please use torch.set_default_dtype() and torch.set_default_device() as alternatives. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:451.)\n",
144
+ " _C._set_default_tensor_type(t)\n"
145
+ ]
146
+ },
147
+ {
148
+ "name": "stdout",
149
+ "output_type": "stream",
150
+ "text": [
151
+ "Loaded in 33.68 seconds\n",
152
+ "cuda:0\n"
153
+ ]
154
+ }
155
+ ],
156
+ "source": [
157
+ "# load superposed\n",
158
+ "weight_path = \"../../7B/\"\n",
159
+ "sup_model = SuperposedLlama.build(ckpt_dir=weight_path, \n",
160
+ " tokenizer_path=f'{weight_path}/tokenizer.model', \n",
161
+ " max_seq_len=1000, \n",
162
+ " max_batch_size=16,\n",
163
+ " device=sup_device,\n",
164
+ " model_parallel_size=1)"
165
+ ]
166
+ },
167
+ {
168
+ "cell_type": "code",
169
+ "execution_count": 12,
170
+ "metadata": {},
171
+ "outputs": [
172
+ {
173
+ "name": "stdout",
174
+ "output_type": "stream",
175
+ "text": [
176
+ "0\n",
177
+ "Loaded in 22.47 seconds\n"
178
+ ]
179
+ }
180
+ ],
181
+ "source": [
182
+ "# load regular\n",
183
+ "reg_model = Llama.build(ckpt_dir=weight_path, \n",
184
+ " tokenizer_path=f'{weight_path}/tokenizer.model', \n",
185
+ " max_seq_len=1000, \n",
186
+ " max_batch_size=16,\n",
187
+ " device=reg_device, # reg_device,\n",
188
+ " model_parallel_size=1)"
189
+ ]
190
+ },
191
+ {
192
+ "cell_type": "code",
193
+ "execution_count": 13,
194
+ "metadata": {},
195
+ "outputs": [],
196
+ "source": [
197
+ "tokenizer = Tokenizer(f\"{weight_path}/tokenizer.model\")"
198
+ ]
199
+ },
200
+ {
201
+ "cell_type": "markdown",
202
+ "metadata": {},
203
+ "source": [
204
+ "# Evaluation"
205
+ ]
206
+ },
207
+ {
208
+ "cell_type": "code",
209
+ "execution_count": 14,
210
+ "metadata": {},
211
+ "outputs": [],
212
+ "source": [
213
+ "model_types = [\"greedy\", \"superposed\", \"regular\"]\n",
214
+ "model_type = model_types[1]"
215
+ ]
216
+ },
217
+ {
218
+ "cell_type": "code",
219
+ "execution_count": 17,
220
+ "metadata": {},
221
+ "outputs": [],
222
+ "source": [
223
+ "def evaluate_nq(model_type, question, max_gen_len):\n",
224
+ " question = \"Answer these questions:\\n\\nQ: \" + question + \"?\\nA:\"\n",
225
+ " text_len = len(question) # for truncating\n",
226
+ " prompt_len = len(tokenizer.encode([question], True, False)[0]) # for model\n",
227
+ " if model_type == \"regular\" or model_type == \"greedy\":\n",
228
+ " if model_type == \"regular\":\n",
229
+ " input = [question for _ in range(n_drafts)]\n",
230
+ " print(input)\n",
231
+ " sequences, _ = evaluate_nucleus_losses(data=input,\n",
232
+ " model=reg_model,\n",
233
+ " tokenizer=tokenizer,\n",
234
+ " prompt_len=prompt_len,\n",
235
+ " max_gen_len=max_gen_len,\n",
236
+ " temp=0.6,\n",
237
+ " bsz=8,\n",
238
+ " marker=False)\n",
239
+ " else:\n",
240
+ " sequences, _ = evaluate_nucleus_losses(data=[question],\n",
241
+ " model=reg_model,\n",
242
+ " tokenizer=tokenizer,\n",
243
+ " prompt_len=prompt_len,\n",
244
+ " max_gen_len=max_gen_len,\n",
245
+ " temp=0,\n",
246
+ " bsz=8,\n",
247
+ " marker=False)\n",
248
+ " n_pd, seq_len = sequences.shape\n",
249
+ " elif model_type == \"superposed\":\n",
250
+ " sequences, _ = evaluate_mixed_losses(data=[question],\n",
251
+ " model=sup_model,\n",
252
+ " tokenizer=tokenizer,\n",
253
+ " prompt_len=prompt_len,\n",
254
+ " max_gen_len=max_gen_len,\n",
255
+ " alpha=alpha,\n",
256
+ " temp=temp,\n",
257
+ " n_drafts=n_drafts,\n",
258
+ " n_token_sample=n_token_sample,\n",
259
+ " smoothing=None, # Use greedy\n",
260
+ " bsz=8,\n",
261
+ " i_weights=i_weights,\n",
262
+ " i_length=i_length,\n",
263
+ " ngrams=ngrams,\n",
264
+ " marker=False)\n",
265
+ " n_p, n_d, seq_len = sequences.shape\n",
266
+ " # Process results\n",
267
+ " sequences = sequences.reshape(-1, seq_len).tolist()\n",
268
+ " for d_idx in range(len(sequences)):\n",
269
+ " draft = sequences[d_idx]\n",
270
+ " if -1 in draft:\n",
271
+ " draft = draft[:draft.index(-1)]\n",
272
+ " sequences[d_idx] = draft\n",
273
+ " decoded_seq = tokenizer.decode(sequences)\n",
274
+ " answers = []\n",
275
+ " for s in decoded_seq:\n",
276
+ " answers.append(re.split(\"[,.\\n]\", s[text_len:].strip())[0])\n",
277
+ " return answers\n",
278
+ " "
279
+ ]
280
+ },
281
+ {
282
+ "cell_type": "code",
283
+ "execution_count": null,
284
+ "metadata": {},
285
+ "outputs": [],
286
+ "source": [
287
+ "# Run evaluation\n",
288
+ "predictions = []\n",
289
+ "print(f\"Precision from 1 to {n_drafts}\")\n",
290
+ "for sample in tqdm(nq):\n",
291
+ " # Adaptively determine max generation length\n",
292
+ " longest = 0\n",
293
+ " shortest = 1000\n",
294
+ " for answer in sample[\"answer\"]:\n",
295
+ " tmp = tokenizer.encode([answer], False, False)[0]\n",
296
+ " if len(tmp) > longest:\n",
297
+ " longest = len(tmp)\n",
298
+ " if len(tmp) < shortest:\n",
299
+ " shortest = len(tmp)\n",
300
+ " question = sample[\"question\"]\n",
301
+ " answer = evaluate_nq(model_type, question, max_gen_len=shortest+3)\n",
302
+ " predictions.append({\"question\": question, \"answer\": answer})"
303
+ ]
304
+ },
305
+ {
306
+ "cell_type": "code",
307
+ "execution_count": 52,
308
+ "metadata": {},
309
+ "outputs": [],
310
+ "source": [
311
+ "# Separate results into precisions\n",
312
+ "precisions = {}\n",
313
+ "for i in range(1, n_drafts+1):\n",
314
+ " prec = str(i)\n",
315
+ " responses = []\n",
316
+ " for result in predictions:\n",
317
+ " responses.append({\"question\": result[\"question\"], \"answer\": result[\"answer\"][:i]})\n",
318
+ " precisions[prec] = responses"
319
+ ]
320
+ },
321
+ {
322
+ "cell_type": "code",
323
+ "execution_count": 53,
324
+ "metadata": {},
325
+ "outputs": [
326
+ {
327
+ "name": "stdout",
328
+ "output_type": "stream",
329
+ "text": [
330
+ "{'question': 'when was the last time anyone was on the moon', 'answer': ['2019', '2019', '2019-', '2019-', '1019']}\n",
331
+ "================\n",
332
+ "{'question': \"who wrote he ain't heavy he's my brother lyrics\", 'answer': ['The song was written by', 'The lyr was written by', 'The Hol was written by', 'Neil song was written by', 'Neil lyr was written by']}\n",
333
+ "================\n",
334
+ "{'question': 'how many seasons of the bastard executioner are there', 'answer': ['1', 'There1', 'there1', '1', 'There1']}\n",
335
+ "================\n",
336
+ "{'question': 'when did the eagles win last super bowl', 'answer': ['2018', 'The2018', '1018', '2017', 'the2018']}\n",
337
+ "================\n",
338
+ "{'question': \"who won last year's ncaa women's basketball\", 'answer': ['the university of connecticut', 'The university of connecticut', 'university of connecticut', 'the University of connecticut', 'The University of connecticut']}\n",
339
+ "================\n",
340
+ "{'question': 'when did the isle of wight become an island', 'answer': ['1207', 'when1207', '1287', '1277', 'when1287']}\n",
341
+ "================\n",
342
+ "{'question': 'love yourself by justin bieber is about who', 'answer': ['love yourself by justin b', 'love yourself is justin b', 'Justin yourself by justin b', 'Justin yourself is justin b', 'It yourself by justin b']}\n",
343
+ "================\n",
344
+ "{'question': 'who was the ruler of england in 1616', 'answer': ['James I', 'James I of', 'King I', 'j I', 'James I']}\n",
345
+ "================\n",
346
+ "{'question': 'what is the hot coffee mod in san andreas', 'answer': ['The Hot Coffee mod is a modification for Grand', 'The Hot Coffee mod is a mod for Grand', 'The hot Coffee mod is a modification for Grand', 'The Hot Coffee mod is a modification that Grand', 'It Hot Coffee mod is a modification for Grand']}\n",
347
+ "================\n",
348
+ "{'question': 'what is the maximum data rate for the 802.11a standard select one', 'answer': ['54 Mbps', '54Mbps', '54 mbps', '54 Mbps', '54 Mbps']}\n",
349
+ "================\n"
350
+ ]
351
+ }
352
+ ],
353
+ "source": [
354
+ "# Print some results\n",
355
+ "counter = 0\n",
356
+ "for k in predictions:\n",
357
+ " if counter >= 10:\n",
358
+ " break\n",
359
+ " print(k)\n",
360
+ " counter += 1\n",
361
+ " print(\"================\")"
362
+ ]
363
+ },
364
+ {
365
+ "cell_type": "markdown",
366
+ "metadata": {},
367
+ "source": [
368
+ "# Saving"
369
+ ]
370
+ },
371
+ {
372
+ "cell_type": "code",
373
+ "execution_count": 54,
374
+ "metadata": {},
375
+ "outputs": [
376
+ {
377
+ "name": "stdout",
378
+ "output_type": "stream",
379
+ "text": [
380
+ "dict_keys(['1', '2', '3', '4', '5'])\n"
381
+ ]
382
+ }
383
+ ],
384
+ "source": [
385
+ "# Save results\n",
386
+ "os.makedirs(\"../../nq/\", exist_ok=True)\n",
387
+ "print(precisions.keys())\n",
388
+ "for prec in range(1, n_drafts+1):\n",
389
+ " out_path = f\"../nq/eval_{model_type}_{prec}_test.jsonl\"\n",
390
+ " with open(out_path, \"w\") as f:\n",
391
+ " for obj in precisions[str(prec)]: \n",
392
+ " f.write(json.dumps(obj) + \"\\n\")"
393
+ ]
394
+ }
395
+ ],
396
+ "metadata": {
397
+ "kernelspec": {
398
+ "display_name": "Python 3 (ipykernel)",
399
+ "language": "python",
400
+ "name": "python3"
401
+ },
402
+ "language_info": {
403
+ "codemirror_mode": {
404
+ "name": "ipython",
405
+ "version": 3
406
+ },
407
+ "file_extension": ".py",
408
+ "mimetype": "text/x-python",
409
+ "name": "python",
410
+ "nbconvert_exporter": "python",
411
+ "pygments_lexer": "ipython3",
412
+ "version": "3.11.5"
413
+ }
414
+ },
415
+ "nbformat": 4,
416
+ "nbformat_minor": 4
417
+ }
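The notebook above only writes per-precision prediction files; it does not score them. A hypothetical scoring pass over those files could use simple exact match; the normalize/score_file helpers and the gold-answer dict below are illustrative only and not part of the commit:

    import json
    import re
    import string

    def normalize(s):
        # Lowercase, drop punctuation, and collapse whitespace before comparing.
        s = "".join(ch for ch in s.lower() if ch not in string.punctuation)
        return re.sub(r"\s+", " ", s).strip()

    def score_file(path, gold):
        # gold: {question: [acceptable answer strings]}
        hits = total = 0
        with open(path) as f:
            for line in f:
                rec = json.loads(line)
                preds = {normalize(a) for a in rec["answer"]}
                refs = {normalize(a) for a in gold.get(rec["question"], [])}
                hits += bool(preds & refs)
                total += 1
        return hits / total if total else 0.0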
superposed/notebooks/triviaqa.ipynb ADDED
@@ -0,0 +1,404 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "code",
5
+ "execution_count": 1,
6
+ "metadata": {},
7
+ "outputs": [
8
+ {
9
+ "name": "stderr",
10
+ "output_type": "stream",
11
+ "text": [
12
+ "/gscratch/raivn/ethans/miniconda3/envs/llms_12.1/lib/python3.11/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
13
+ " from .autonotebook import tqdm as notebook_tqdm\n",
14
+ "<frozen importlib._bootstrap>:241: RuntimeWarning: pyarrow.lib.IpcWriteOptions size changed, may indicate binary incompatibility. Expected 72 from C header, got 88 from PyObject\n",
15
+ "<frozen importlib._bootstrap>:241: RuntimeWarning: pyarrow.lib.IpcReadOptions size changed, may indicate binary incompatibility. Expected 96 from C header, got 104 from PyObject\n",
16
+ "<frozen importlib._bootstrap>:241: RuntimeWarning: pyarrow._fs.FileInfo size changed, may indicate binary incompatibility. Expected 64 from C header, got 88 from PyObject\n",
17
+ "<frozen importlib._bootstrap>:241: RuntimeWarning: pyarrow._fs.FileSelector size changed, may indicate binary incompatibility. Expected 48 from C header, got 72 from PyObject\n",
18
+ "2024-05-30 01:35:17.813978: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.\n",
19
+ "2024-05-30 01:35:20.452213: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.\n",
20
+ "To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.\n",
21
+ "2024-05-30 01:35:41.833487: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT\n"
22
+ ]
23
+ }
24
+ ],
25
+ "source": [
26
+ "%load_ext autoreload\n",
27
+ "%autoreload 2\n",
28
+ "\n",
29
+ "import copy\n",
30
+ "import json\n",
31
+ "import pickle\n",
32
+ "import os\n",
33
+ "import random\n",
34
+ "import re\n",
35
+ "import string\n",
36
+ "import math\n",
37
+ "from datetime import datetime\n",
38
+ "\n",
39
+ "import evaluate\n",
40
+ "import torch\n",
41
+ "import numpy as np\n",
42
+ "from datasets import load_dataset\n",
43
+ "from transformers import LlamaTokenizer\n",
44
+ "from tqdm import tqdm\n",
45
+ "\n",
46
+ "from eval import *\n",
47
+ "from superposed.llama.metrics import *\n",
48
+ "from superposed.llama.generation import Llama\n",
49
+ "from superposed.llama.superposed_generation import SuperposedLlama\n",
50
+ "from superposed.llama.tokenizer import Tokenizer\n",
51
+ "from superposed.ngrams.ngram_models import make_models"
52
+ ]
53
+ },
54
+ {
55
+ "cell_type": "markdown",
56
+ "metadata": {},
57
+ "source": [
58
+ "# Setup"
59
+ ]
60
+ },
61
+ {
62
+ "cell_type": "code",
63
+ "execution_count": 3,
64
+ "metadata": {},
65
+ "outputs": [
66
+ {
67
+ "name": "stdout",
68
+ "output_type": "stream",
69
+ "text": [
70
+ "Parameters: {'alpha': 0.54, 'temp': 0.06, 'n_drafts': 3, 'prompt_len': 15, 'n_token_sample': 9, 'n_token_consider': 32000, 'mixing_method': 'sample_new_weights_with_score', 'smoothing': 'geom', 'sample_tokens': 0, 'sample_beams': 0, 'i_weights': [0.01, 0.04, 0.15, 0.18, 0.12], 'i_length': [1, 2, 3, 4, 5]}\n"
71
+ ]
72
+ }
73
+ ],
74
+ "source": [
75
+ "# Params\n",
76
+ "param_file = \"../../params/p15_d3_mixed.json\"\n",
77
+ "with open(param_file, \"r\") as f:\n",
78
+ " params = json.load(f)\n",
79
+ " print(f\"Parameters: {params}\")\n",
80
+ "alpha = params[\"alpha\"]\n",
81
+ "temp = params[\"temp\"]\n",
82
+ "n_drafts = params[\"n_drafts\"]\n",
83
+ "prompt_len = params[\"prompt_len\"]\n",
84
+ "n_token_sample = params[\"n_token_sample\"]\n",
85
+ "i_weights = params[\"i_weights\"]\n",
86
+ "i_length = params[\"i_length\"]"
87
+ ]
88
+ },
89
+ {
90
+ "cell_type": "code",
91
+ "execution_count": 5,
92
+ "metadata": {
93
+ "scrolled": true
94
+ },
95
+ "outputs": [
96
+ {
97
+ "name": "stdout",
98
+ "output_type": "stream",
99
+ "text": [
100
+ "Making bigram...\n",
101
+ "1310800\n",
102
+ "Making trigram...\n",
103
+ "671088728\n",
104
+ "Making fourgram...\n",
105
+ "2684354648\n",
106
+ "Making fivegram...\n",
107
+ "5368709200\n",
108
+ "Making sixgram...\n",
109
+ "5368709200\n"
110
+ ]
111
+ }
112
+ ],
113
+ "source": [
114
+ "ngrams = make_models(\"../../ckpts-200k\", bigram=True, trigram=True, fourgram=True, fivegram=True, sixgram=True, sevengram=False)"
115
+ ]
116
+ },
117
+ {
118
+ "cell_type": "code",
119
+ "execution_count": 10,
120
+ "metadata": {},
121
+ "outputs": [],
122
+ "source": [
123
+ "sup_device = torch.device(\"cuda:0\")\n",
124
+ "reg_device = torch.device(\"cuda:1\")"
125
+ ]
126
+ },
127
+ {
128
+ "cell_type": "code",
129
+ "execution_count": 11,
130
+ "metadata": {},
131
+ "outputs": [
132
+ {
133
+ "name": "stdout",
134
+ "output_type": "stream",
135
+ "text": [
136
+ "> initializing model parallel with size 1\n",
137
+ "> initializing ddp with size 1\n",
138
+ "> initializing pipeline with size 1\n"
139
+ ]
140
+ },
141
+ {
142
+ "name": "stderr",
143
+ "output_type": "stream",
144
+ "text": [
145
+ "/gscratch/raivn/ethans/miniconda3/envs/llms_12.1/lib/python3.11/site-packages/torch/__init__.py:614: UserWarning: torch.set_default_tensor_type() is deprecated as of PyTorch 2.1, please use torch.set_default_dtype() and torch.set_default_device() as alternatives. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:451.)\n",
146
+ " _C._set_default_tensor_type(t)\n"
147
+ ]
148
+ },
149
+ {
150
+ "name": "stdout",
151
+ "output_type": "stream",
152
+ "text": [
153
+ "Loaded in 22.07 seconds\n",
154
+ "cuda:0\n"
155
+ ]
156
+ }
157
+ ],
158
+ "source": [
159
+ "weight_path = \"../../7B/\"\n",
160
+ "sup_model = SuperposedLlama.build(ckpt_dir=weight_path, \n",
161
+ " tokenizer_path=f'{weight_path}/tokenizer.model', \n",
162
+ " max_seq_len=1000, \n",
163
+ " max_batch_size=16,\n",
164
+ " device=sup_device,\n",
165
+ " model_parallel_size=1)"
166
+ ]
167
+ },
168
+ {
169
+ "cell_type": "code",
170
+ "execution_count": 12,
171
+ "metadata": {},
172
+ "outputs": [
173
+ {
174
+ "name": "stdout",
175
+ "output_type": "stream",
176
+ "text": [
177
+ "0\n",
178
+ "Loaded in 22.76 seconds\n"
179
+ ]
180
+ }
181
+ ],
182
+ "source": [
183
+ "reg_model = Llama.build(ckpt_dir=weight_path, \n",
184
+ " tokenizer_path=f'{weight_path}/tokenizer.model', \n",
185
+ " max_seq_len=1000, \n",
186
+ " max_batch_size=16,\n",
187
+ " device=reg_device,\n",
188
+ " model_parallel_size=1)"
189
+ ]
190
+ },
191
+ {
192
+ "cell_type": "code",
193
+ "execution_count": 18,
194
+ "metadata": {},
195
+ "outputs": [],
196
+ "source": [
197
+ "tokenizer = Tokenizer(f\"{weight_path}/tokenizer.model\")"
198
+ ]
199
+ },
200
+ {
201
+ "cell_type": "markdown",
202
+ "metadata": {},
203
+ "source": [
204
+ "# Evaluation"
205
+ ]
206
+ },
207
+ {
208
+ "cell_type": "code",
209
+ "execution_count": 13,
210
+ "metadata": {},
211
+ "outputs": [
212
+ {
213
+ "name": "stdout",
214
+ "output_type": "stream",
215
+ "text": [
216
+ "Length: 7993\n"
217
+ ]
218
+ }
219
+ ],
220
+ "source": [
221
+ "trivia_path = \"../../../datasets/qa/wikipedia-dev.json\"\n",
222
+ "with open(trivia_path, \"r\") as f:\n",
223
+ " triviaqa = json.load(f)[\"Data\"]\n",
224
+ "print(f\"Length: {len(triviaqa)}\")"
225
+ ]
226
+ },
227
+ {
228
+ "cell_type": "code",
229
+ "execution_count": 14,
230
+ "metadata": {},
231
+ "outputs": [],
232
+ "source": [
233
+ "torch.set_default_dtype(torch.float32)"
234
+ ]
235
+ },
236
+ {
237
+ "cell_type": "code",
238
+ "execution_count": 15,
239
+ "metadata": {},
240
+ "outputs": [],
241
+ "source": [
242
+ "model_types = [\"superposed\", \"regular\"]\n",
243
+ "model_type = model_types[0]"
244
+ ]
245
+ },
246
+ {
247
+ "cell_type": "code",
248
+ "execution_count": 16,
249
+ "metadata": {},
250
+ "outputs": [],
251
+ "source": [
252
+ "# https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/triviaqa/default.yaml\n",
253
+ "def evaluate_trivia(model_type, question, max_gen_len):\n",
254
+ " question = \"Question: \" + question + \"\\nAnswer:\"\n",
255
+ " text_len = len(question) # for truncating\n",
256
+ " prompt_len = len(tokenizer.encode([question], True, False)[0]) # for model\n",
257
+ " if model_type == \"regular\":\n",
258
+ " input = [question for _ in range(n_drafts)]\n",
259
+ " sequences, _ = evaluate_nucleus_losses(data=input,\n",
260
+ " model=reg_model,\n",
261
+ " tokenizer=tokenizer,\n",
262
+ " prompt_len=prompt_len,\n",
263
+ " max_gen_len=max_gen_len,\n",
264
+ " temp=0.6, # Set to 0 for greedy\n",
265
+ " bsz=8,\n",
266
+ " marker=False)\n",
267
+ " n_pd, seq_len = sequences.shape\n",
268
+ " elif model_type == \"superposed\":\n",
269
+ " sequences, _ = evaluate_mixed_losses(data=[question],\n",
270
+ " model=sup_model,\n",
271
+ " tokenizer=tokenizer,\n",
272
+ " prompt_len=prompt_len,\n",
273
+ " max_gen_len=max_gen_len,\n",
274
+ " alpha=alpha,\n",
275
+ " temp=temp,\n",
276
+ " n_drafts=n_drafts,\n",
277
+ " n_token_sample=n_token_sample,\n",
278
+ " smoothing=None, # greedy\n",
279
+ " bsz=8,\n",
280
+ " i_weights=i_weights,\n",
281
+ " i_length=i_length,\n",
282
+ " ngrams=ngrams,\n",
283
+ " marker=False)\n",
284
+ " n_p, n_d, seq_len = sequences.shape\n",
285
+ " # Process results\n",
286
+ " sequences = sequences.reshape(-1, seq_len).tolist()\n",
287
+ " for d_idx in range(len(sequences)):\n",
288
+ " draft = sequences[d_idx]\n",
289
+ " if -1 in draft:\n",
290
+ " draft = draft[:draft.index(-1)]\n",
291
+ " sequences[d_idx] = draft\n",
292
+ " decoded_seq = tokenizer.decode(sequences)\n",
293
+ " answers = []\n",
294
+ " for s in decoded_seq:\n",
295
+ " # print(s)\n",
296
+ " answers.append(re.split(\"[,.\\n]\", s[text_len:].strip())[0])\n",
297
+ " return answers\n",
298
+ " "
299
+ ]
300
+ },
301
+ {
302
+ "cell_type": "code",
303
+ "execution_count": null,
304
+ "metadata": {},
305
+ "outputs": [],
306
+ "source": [
307
+ "questions = {}\n",
308
+ "predictions = {}\n",
309
+ "print(f\"Precision from 1 to {n_drafts}\")\n",
310
+ "for sample in tqdm(triviaqa):\n",
311
+ " # Adaptively select generation length\n",
312
+ " longest = 0\n",
313
+ " shortest = 1000\n",
314
+ " total = 0\n",
315
+ " for answer in sample[\"Answer\"][\"Aliases\"]:\n",
316
+ " tmp = tokenizer.encode([answer], False, False)[0]\n",
317
+ " if len(tmp) > longest:\n",
318
+ " longest = len(tmp)\n",
319
+ " if len(tmp) < shortest:\n",
320
+ " shortest = len(tmp)\n",
321
+ " total += len(tmp)\n",
322
+ " # Evaluation code\n",
323
+ " id = sample[\"QuestionId\"]\n",
324
+ " question = sample[\"Question\"]\n",
325
+ " answer = evaluate_trivia(model_type, question, max_gen_len=longest + 3)\n",
326
+ " predictions[id] = answer\n",
327
+ " questions[id] = question"
328
+ ]
329
+ },
330
+ {
331
+ "cell_type": "code",
332
+ "execution_count": null,
333
+ "metadata": {},
334
+ "outputs": [],
335
+ "source": [
336
+ "# Save precisions\n",
337
+ "precisions = {}\n",
338
+ "for i in range(1, n_drafts+1):\n",
339
+ " prec = str(i)\n",
340
+ " responses = {k: v[:i] for k, v in predictions.items()}\n",
341
+ " precisions[prec] = responses"
342
+ ]
343
+ },
344
+ {
345
+ "cell_type": "code",
346
+ "execution_count": null,
347
+ "metadata": {},
348
+ "outputs": [],
349
+ "source": [
350
+ "# Print some results\n",
351
+ "counter = 0\n",
352
+ "for k in predictions:\n",
353
+ " if counter >= 10:\n",
354
+ " break\n",
355
+ " print(questions[k])\n",
356
+ " print(predictions[k])\n",
357
+ " counter += 1\n",
358
+ " print(\"================\")"
359
+ ]
360
+ },
361
+ {
362
+ "cell_type": "code",
363
+ "execution_count": null,
364
+ "metadata": {},
365
+ "outputs": [],
366
+ "source": [
367
+ "# Save results\n",
368
+ "os.makedirs(\"../../trivia/\", exist_ok=True)\n",
369
+ "for prec in range(1, n_drafts+1):\n",
370
+ " out_path = f\"../nucleus_extra/trivia_extra/ngram_4trivia_{model_type}_{prec}_4.json\"\n",
371
+ " with open(out_path, \"w\") as f:\n",
372
+ " json.dump(precisions[str(prec)], f, indent=4)"
373
+ ]
374
+ },
375
+ {
376
+ "cell_type": "code",
377
+ "execution_count": null,
378
+ "metadata": {},
379
+ "outputs": [],
380
+ "source": []
381
+ }
382
+ ],
383
+ "metadata": {
384
+ "kernelspec": {
385
+ "display_name": "Python 3 (ipykernel)",
386
+ "language": "python",
387
+ "name": "python3"
388
+ },
389
+ "language_info": {
390
+ "codemirror_mode": {
391
+ "name": "ipython",
392
+ "version": 3
393
+ },
394
+ "file_extension": ".py",
395
+ "mimetype": "text/x-python",
396
+ "name": "python",
397
+ "nbconvert_exporter": "python",
398
+ "pygments_lexer": "ipython3",
399
+ "version": "3.11.5"
400
+ }
401
+ },
402
+ "nbformat": 4,
403
+ "nbformat_minor": 4
404
+ }