Custom handler.py and requirements.txt

#36
by michael-newsrx-com - opened

Does anyone have an example handler.py and requirements.txt?

We keep getting:

RuntimeError: The size of tensor a (2048) must match the size of tensor b (2049) at non-singleton dimension 3

This is the current handler.py we are testing that is failing:

class EndpointHandler:
    def __init__(self, path="", force_cpu: bool = False):
        import torch
        from transformers import pipeline
        from transformers import AutoTokenizer
        from transformers import AutoModelForCausalLM
        if force_cpu:
            torch.cuda.is_available = _force_not_available
            self.generate_text = pipeline(model=path,  #
                                          torch_dtype=torch.bfloat16,  #
                                          trust_remote_code=True,  #
                                          # low_cpu_mem_usage=True,  #
                                          )
        else:
            self.tokenizer = AutoTokenizer.from_pretrained(  #
                    path, padding_side="left")
            self.model = AutoModelForCausalLM.from_pretrained(  #
                    path,  #
                    torch_dtype=torch.float16,  #
                    trust_remote_code=True,  #
                    load_in_8bit=True,  #
                    device_map="auto",  #
                    low_cpu_mem_usage=True,  #
            )
            from instruct_pipeline import InstructionTextGenerationPipeline
            self.generate_text = InstructionTextGenerationPipeline(  #
                    model=self.model,  #
                    tokenizer=self.tokenizer,  #
            )

    def __call__(self, data: Dict[str, Any]) -> List[Dict[str, any]]:
        # process input
        inputs = data.pop("inputs", data)
        parameters = data.pop("parameters", None)

        # pass inputs with all kwargs in data
        if parameters is not None:
            output = self.generate_text(inputs, **parameters)
        else:
            output = self.generate_text(inputs)

        # return_value: List[Dict[str, any]] = list()
        # postprocess the prediction
        gpu_info = report_gpu_usage()
        output["gpu_info"] = gpu_info
        # return_value.append({"generated_text": prediction, "gpu_info": gpu_info})

        # return return_value
        return output

class BlockTimer(object):
    def __enter__(self):
        import time
        self.start = time.perf_counter()
        return self

    def __exit__(self, typ, value, traceback):
        import time
        self.duration = time.perf_counter() - self.start


def _force_not_available() -> bool:
    return False


def test() -> None:
    import textwrap
    with BlockTimer() as timer:
        print("Model load")
        handler = EndpointHandler(path="databricks/dolly-v2-7b", force_cpu=False)
    print(f"Elapsed: {round(timer.duration, 2)}")
    print()
    parameters: dict[str, any] = {"max_new_tokens": 256,  #
                                  "min_length": 16,  #
                                  }  # parameters for text generation
    payload = {"inputs": f"{wall_of_text()}", "parameters": parameters}
    with BlockTimer() as timer:
        print("Inference")
        results = handler.__call__(payload)
    print(f"Elapsed: {round(timer.duration, 2)}")
    print()
    for entry in results[0].items():
        print()
        print(f"=== {entry[0]}")
        if entry[0] == "gpu_info":
            gpu_info_lines = entry[1].split("\n")
            for line in gpu_info_lines:
                if "Default |" in line:
                    print(line)
        else:
            print(textwrap.fill(str(
                    entry[1]), 140, drop_whitespace=False, replace_whitespace=False))


def wall_of_text() -> str:
    return """
The present invention relates to compositions and methods for the treatment of the 
    Charcot-Marie-Tooth disease and related disorders. Charcot-Marie-Tooth disease (“CMT 
    Mining 
    of publicly available data, describing molecular mechanisms and pathological 
    manifestations 
    of the CMT1A disease, allowed us to prioritize a few functional cellular 
    modules-transcriptional regulation of PMP22 gene, PMP22 protein folding/degradation, 
    Schwann cell proliferation and apoptosis, death of neurons, extra-cellular matrix 
    deposition 
    and remodelling, immune response-as potential legitimate targets for CMT-relevant 
    therapeutic interventions.
""".replace("\n", " ")


if __name__ == '__main__':
    test()

I'm facing the same error on really long input texts.

Even when I specify max_length=2048 and truncation=True for the tokenizer:

inputs = tokenizer(
    prompt,
    return_tensors="pt",
    max_length=2048,
    truncation=True,
)

It's weird because the base model accepts 5120 tokens if you look at the config.json file.

Databricks org

It's 2048 actually, see https://huggingface.co/databricks/dolly-v2-12b/discussions/10 for discussion of the issue though

Sorry if I wasn't clear, I talked about the model config.json file: https://huggingface.co/databricks/dolly-v2-12b/blob/6d35f0d536712a5fd765b028b1a61af924d3d94b/config.json#L16

It's similar to the one used by the EleutherAI/pythia-12b model which accepts 5120 tokens in input.

I saw this tokenizer parameter, but it is useless. Keeping this number during tokenization means there is no max_length, which is false because you will get an error if you try to feed the EleutherAI base model with an input that is more than 5120 tokens.

Dolly v2 12B seems to be fine-tuned on 2048 tokens inputs, so now the model accepts a maximum of 2048 tokens even if the hidden_size layer is still 5120.

The problem I'm trying to understand is why I keep getting an input tensor of 2049 when I specify a max_length of 2048 to my tokenizer. 🤗

CleanShot 2023-04-20 at 15.56.20.png

Databricks org

Isn't hidden_size just the dimension of the encoding layers? I don't think that's the same thing.
I think we can fold this into https://huggingface.co/databricks/dolly-v2-12b/discussions/10

srowen changed discussion status to closed

So exactly what is the actual fix that I can implement in a handler.py then? I've never gotten this to work, even setting truncation to 1024 tokens in the tokenizer configuration.

michael-newsrx-com changed discussion status to open
Databricks org

Did you see the discussion in the other thread? Not sure how to change your current code, but it explains why you're getting this. You can't use 2048 tokens even, due to prompting and generation needs too.

@srowen Sorry for the confusion here. I mixed the hidden_size parameter and the max_position_embeddings parameter!

Did you see the discussion in the other thread? Not sure how to change your current code, but it explains why you're getting this. You can't use 2048 tokens even, due to prompting and generation needs too.

Can this be set in the tokenizer for truncation or something? Or how do I go about figuring out the actual tokenized length the model is getting so that I can test things?

Databricks org

You can set the pipeline to truncate or truncate yourself. The context window is a fixed property of the model though

I have this in my current code and I'm still getting the 2049 vs 2048 issue?

            self.tokenizer = AutoTokenizer.from_pretrained(  #
                    path,  #
                    padding_side="left",  #
                    truncation=True,  #
                    max_length=1024)
            self.model = AutoModelForCausalLM.from_pretrained(  #
                    path,  #
                    torch_dtype=torch.float16,  #
                    trust_remote_code=True,  #
                    # load_in_8bit=True,  #
                    device_map="auto",  #
                    low_cpu_mem_usage=True,  #
            )
            from instruct_pipeline import InstructionTextGenerationPipeline
            self.generate_text = InstructionTextGenerationPipeline(  #
                    model=self.model,  #
                    tokenizer=self.tokenizer,  #
            )
Databricks org

What's your input like when this fails and how long is the output? I wouldn't really expect you'd bump up against the context window limit with these settings.

It seems the tokenizer is ignoring the max_length parameter and isn't truncating? The following is generating an input_ids size of 1998 for the below text.

self.tokenizer = AutoTokenizer.from_pretrained(  #
                    path,  #
                    padding_side="left",  #
                    truncation=True,  #
                    max_length=1024)
def wall_of_text() -> str:
    return """
    Create a ten to fifteen word intriguing headline for the following article.
    
    The present invention relates to compositions and methods for the treatment of the 
    Charcot-Marie-Tooth disease and related disorders. Charcot-Marie-Tooth disease (“CMT 
    Mining 
    of publicly available data, describing molecular mechanisms and pathological 
    manifestations 
    of the CMT1A disease, allowed us to prioritize a few functional cellular 
    modules-transcriptional regulation of PMP22 gene, PMP22 protein folding/degradation, 
    Schwann cell proliferation and apoptosis, death of neurons, extra-cellular matrix 
    deposition 
    and remodelling, immune response-as potential legitimate targets for CMT-relevant 
    therapeutic interventions. The combined impact of these deregulated functional modules on 
    onset and progression of pathological manifestations of Charcot-Marie-Tooth justifies a 
    potential efficacy of combinatorial CMT treatment. International patent application No. 
    PCT/EP2008/066457 describes a method of identifying drug candidates for the treatment of 
    the 
    Charcot-Marie-Tooth disease by building a dynamic model of the pathology and targeting 
    functional cellular pathways which are relevant in the regulation of CMT disease. 
    International patent application No. PCT/EP2008/066468 describes compositions for the 
    treatment of the Charcot-Marie-Tooth disease which comprise at least two compounds 
    selected 
    from the group of multiple drug candidates. The purpose of the present invention is to 
    provide new therapeutic combinations for treating CMT and related disorders. The invention 
    thus relates to compositions and methods for treating CMT and related disorders, 
    in particular toxic or traumatic neuropathy and amyotrophic lateral sclerosis, 
    using particular drug combinations. An object of this invention more specifically 
    relates to 
    a composition comprising baclofen, sorbitol and a compound selected from pilocarpine, 
    methimazole, mifepristone, naltrexone, rapamycin, flurbiprofen and ketoprofen, salts or 
    prodrugs thereof, for simultaneous, separate or sequential administration to a mammalian 
    subject. A particular object of the present invention relates to a composition comprising 
    baclofen, sorbitol and naltrexone, for simultaneous, separate or sequential administration 
    to a mammalian subject. Another object of the invention relates to a composition 
    comprising 
    (a) rapamycin, (b) mifepristone or naltrexone, and © a PMP22 modulator, for simultaneous, 
    separate or sequential administration to a mammalian subject. In a particular embodiment, 
    the PMP22 modulator is selected from acetazolamide, albuterol, amiloride, 
    aminoglutethimide, 
    amiodarone, aztreonam, baclofen, balsalazide, betaine, bethanechol, bicalutamide, 
    bromocriptine, bumetanide, buspirone, carbachol, carbamazepine, carbimazole, cevimeline, 
    ciprofloxacin, clonidine, curcumin, cyclosporine A, diazepam, diclofenac, dinoprostone, 
    disulfiram, D-sorbitol, dutasteride, estradiol, exemestane, felbamate, fenofibrate, 
    finasteride, flumazenil, flunitrazepam, flurbiprofen, furosemide, gabapentingabapentin, 
    galantamine, haloperidol, ibuprofen, isoproterenol, ketoconazole, ketoprofen, L-carnitine, 
    liothyronine (T3), lithium, losartan, loxapine, meloxicam, metaproterenol, metaraminol, 
    metformin, methacholine, methimazole, methylergonovine, metoprolol, metyrapone, 
    miconazole, 
    mifepristone, nadolol, naloxone, naltrexone; norfloxacin, pentazocine, phenoxybenzamine, 
    phenylbutyrate, pilocarpine, pioglitazone, prazosin, propylthiouracil, raloxifene, 
    rapamycin, rifampin, simvastatin, spironolactone, tacrolimus, tamoxifen, trehalose, 
    trilostane, valproic acid, salts or prodrugs thereof. 1. A method of improving nerve 
    regeneration in a human subject suffering from amyotrophic lateral sclerosis, 
    or a neuropathy selected from an idiopathic neuropathy, diabetic neuropathy, 
    a toxic neuropathy, a neuropathy induced by a drug treatment, a neuropathy provoked by 
    HIV, 
    a neuropathy provoked by radiation, a neuropathy provoked by heavy metals, a neuropathy 
    provoked by vitamin deficiency states, or a traumatic neuropathy, comprising administering 
    to the human subject an amount of a composition effective to improve nerve regeneration; 
    and 
    wherein the composition comprises baclofen or a pharmaceutically acceptable salt thereof 
    in 
    an amount from 1 to 300 mg/kg of the human subject per day; D-sorbitol or a 
    pharmaceutically 
    acceptable salt thereof; and naltrexone or a pharmaceutically acceptable salt thereof in 
    an 
    amount from 1 to 100 mg/kg of the human subject per day. 2. The method of claim 1, 
    wherein the composition further comprises a pharmaceutically suitable excipient or 
    carrier. 
    3. The method of claim 2, wherein the composition is formulated with a drug eluting 
    polymer, 
    a biomolecule, a micelle or liposome-forming lipids or oil in water emulsions, 
    or pegylated 
    or solid nanoparticles or microparticles for oral or parenteral or intrathecal 
    administration. 4. The method of claim 1, wherein the subject suffers from a traumatic 
    neuropathy arising from brain injury, spinal cord injury, or an injury to peripheral 
    nerves. 
    5. The method of claim 1, wherein the D-sorbitol or a pharmaceutically acceptable salt 
    thereof is D-sorbitol. 6. The method of claim 1, wherein the composition is formulated for 
    oral administration. 7. The method of claim 6, wherein the composition is a liquid 
    formulation. 8. The method of claim 1, wherein baclofen or a pharmaceutically acceptable 
    salt thereof, D-sorbitol or a pharmaceutically acceptable salt thereof, and naltrexone 
    or a 
    pharmaceutically acceptable salt thereof are the sole active ingredients. 9. The method of 
    claim 1, comprising administering to the human subject baclofen or a pharmaceutically 
    acceptable salt thereof in an amount from 10 to 200 mg/kg of the human subject per day and 
    naltrexone or a pharmaceutically acceptable salt thereof in an amount from 1 to 50 mg/kg 
    of 
    the human subject per day. 10. The method of claim 1, comprising administering to the 
    human 
    subject baclofen or a pharmaceutically acceptable salt thereof in an amount from 10 to 200 
    mg/kg of the human subject per day and naltrexone or a pharmaceutically acceptable salt 
    thereof in an amount from 1 to 50 mg/kg of the human subject per day. 11. The method of 
    claim 1, comprising administering to the human subject baclofen or a pharmaceutically 
    acceptable salt thereof in an amount from 60 mg to 18 mg per day and naltrexone or a 
    pharmaceutically acceptable salt thereof in an amount from 60 mg to 6 mg per day. 12. The 
    method of claim 1, comprising administering to the human subject baclofen or a 
    pharmaceutically acceptable salt thereof in an amount from 60 mg to 12 mg per day and 
    naltrexone or a pharmaceutically acceptable salt thereof in an amount from 60 mg to 3 mg 
    per 
    day. 13. The method of claim 10, wherein baclofen or a pharmaceutically acceptable salt 
    thereof, D-sorbitol or a pharmaceutically acceptable salt thereof, and naltrexone or a 
    pharmaceutically acceptable salt thereof are administered orally to the human subject. 14. 
    The method of claim 10, wherein baclofen or a pharmaceutically acceptable salt thereof, 
    D-sorbitol or a pharmaceutically acceptable salt thereof, and naltrexone or a 
    pharmaceutically acceptable salt thereof are administered separately to the human subject. 
    15. The method of claim 13, wherein baclofen or a pharmaceutically acceptable salt 
    thereof, 
    D-sorbitol or a pharmaceutically acceptable salt thereof, and naltrexone or a 
    pharmaceutically acceptable salt thereof are formulated in a liquid formulation. 16. The 
    method of claim 15, wherein baclofen or a pharmaceutically acceptable salt thereof, 
    D-sorbitol or a pharmaceutically acceptable salt thereof, and naltrexone or a 
    pharmaceutically acceptable salt thereof are administered to the human subject in divided 
    doses. 17. The method of claim 15, wherein baclofen or a pharmaceutically acceptable salt 
    thereof, D-sorbitol or a pharmaceutically acceptable salt thereof, and naltrexone or a 
    pharmaceutically acceptable salt thereof are administered to the human subject in divided 
    doses two times daily.
""".replace("\n", " ")
Databricks org

Yeah, what comes out of the tokenizer in this case, its length?

1998 for input_ids

Databricks org

And I haven't counted but your input is longer than that in tokens right? it's not limiting to 1024 tokens though, clearly. This I honestly don't quite know, but I'm aware that this setting has caused some questions: https://huggingface.co/databricks/dolly-v2-12b/blob/main/tokenizer_config.json#L5 Seems like it should be lower, and we've discussed this elsewhere. But I wonder if you somehow need to set model_max_length instead to 1024? this is new territory for me but it's a decent next guess

Yes, the tokens I'm submitting are 1998 in size, when combined with the built-in prompt and chained output exceed the 2048 limit.

I tried that, after creating the object, it seems to ignore that as well. I'm a bit perplexed trying to figure out to configure the tokenizer to actually do this.

I'm trying to decipher the InstructPipeline this morning to see if I can find a way to properly trim the instruction and context to a length that takes into account the pipeline injected prompt and the max new tokens output count.

So, based on my lack of progress, I'm guessing I'll need to truncate manually to some max value < 2048 based on my max output desired.

srowen changed discussion status to closed

Sign up or log in to comment