abrakjamson committed
Commit 434becc · Parent: bd9fdbb

Reducing default max new tokens to speed up responses on CPU

Files changed (1)
  1. app.py +2 -2
app.py CHANGED

@@ -568,7 +568,7 @@ with gr.Blocks(
     else:
         gr.Markdown("""# 🧠 LLM Mind Control ((Llama 3.2 1B))
 
-        *Warning: although using a small model, running on CPU will still be very slow*""")
+        *Warning: although using a small model, running on CPU will still be very slow (30+ seconds to first token)*""")
     gr.Markdown("""Unlike prompting, direct weight manipulation lets you fine-tune the amount of a personality
     trait or topic. Enabled through [Representation Engineering](https://arxiv.org/abs/2310.01405)
     via the [repeng](https://pypi.org/project/repeng) library.
@@ -670,7 +670,7 @@ with gr.Blocks(
     </div>
     """)
     max_new_tokens = gr.Number(
-        value=192,
+        value=128,
         precision=0,
         step=10,
         show_label=False
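For context on the change: `max_new_tokens` caps how many tokens the model may generate per response, and on CPU each new token costs a full forward pass, so decoding time grows roughly linearly with that cap. Lowering the default from 192 to 128 therefore trims worst-case response latency by about a third. Below is a minimal sketch of how a `gr.Number` value like this typically feeds a Hugging Face `generate` call; the model ID, `generate_response` function, and wiring are illustrative assumptions, not the actual app.py code:

```python
# Illustrative sketch (not the actual app.py wiring): how a Gradio Number
# value typically bounds generation length in a transformers pipeline.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Llama-3.2-1B-Instruct"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

def generate_response(prompt: str, max_new_tokens: int) -> str:
    """Generate a reply capped at max_new_tokens newly generated tokens."""
    inputs = tokenizer(prompt, return_tensors="pt")
    # On CPU, generation time scales with max_new_tokens, which is why
    # reducing the default from 192 to 128 speeds up responses.
    outputs = model.generate(**inputs, max_new_tokens=int(max_new_tokens))
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```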