Space: HF-test-lab/bulk_embeddings

nbroad (HF staff) committed on Jul 15, 2023

Commit 595c4bf (1 parent: 9f6b9a6)

add option to download

Files changed (1)

app.py +34 -15

@@ -39,7 +39,23 @@ desc2opt = {v: k for k, v in opt2desc.items()}
 optimization_options = list(opt2desc.values())
-def run(
     ds_name,
     ds_config,
     column_name,
@@ -84,14 +100,10 @@ with gr.Blocks(title="Bulk embeddings") as demo:
         """
         This Space allows you to embed a large dataset easily. For instance, this can easily create vectors for Wikipedia \
         articles -- taking about __ hours and costing approximately $__.
         This utilizes state-of-the-art open-source embedding models, \
         and optimizes them for inference using Hugging Face [optimum](https://github.com/huggingface/optimum). There are various \
         levels of optimization that can be applied; the quality of the embeddings will degrade as the optimization level increases.
         Currently available options: O2/O3/O4 on T4/A10 GPUs using ONNX Runtime.
         Future options:
           - OpenVINO for CPU inference
           - TensorRT for GPU inference
@@ -100,22 +112,16 @@ with gr.Blocks(title="Bulk embeddings") as demo:
           - Text splitting options
           - More control over which rows to embed (skip some, stop early)
           - Dynamic padding
         ## Steps
         1. Upload the dataset to the Hugging Face Hub.
         2. Enter dataset details into the form below.
         3. Choose a model. These are taken from the top of the [MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard).
         4. Enter optimization level. See [here](https://huggingface.co/docs/optimum/onnxruntime/usage_guides/optimization#optimization-configuration) for details.
         5. Choose a name for the new dataset.
         6. Hit run!
         ### Note:
         If you have short documents, O3 will be faster than O4. If you have long documents, O4 will be faster than O3. \
             O4 requires the tokenized documents to be padded to max length.
         """
     )
@@ -172,12 +178,25 @@ with gr.Blocks(title="Bulk embeddings") as demo:
         )
     with gr.Row():
-        btn = gr.Button(value="Embed texts!")
         last = gr.Textbox(value="")
-    btn.click(
-        fn=run,
         inputs=[
             ds_name,
             ds_config,
@@ -194,4 +213,4 @@ with gr.Blocks(title="Bulk embeddings") as demo:
 if __name__ == "__main__":
-    demo.queue(concurrency_count=20).launch(show_error=True)

 optimization_options = list(opt2desc.values())
+def download(
+    ds_name,
+    ds_config,
+    ds_split,
+    progress=gr.Progress(),
+):
+    if progress is not None:
+        progress(0.5, "Loading dataset...")
+    ds = load_hf_dataset(ds_name, ds_config, ds_split)
+    return f"Downloaded! It has {len(ds)} docs."
+def embed(
     ds_name,
     ds_config,
     column_name,
         """
         This Space allows you to embed a large dataset easily. For instance, this can easily create vectors for Wikipedia \
         articles -- taking about __ hours and costing approximately $__.
         This utilizes state-of-the-art open-source embedding models, \
         and optimizes them for inference using Hugging Face [optimum](https://github.com/huggingface/optimum). There are various \
         levels of optimization that can be applied; the quality of the embeddings will degrade as the optimization level increases.
         Currently available options: O2/O3/O4 on T4/A10 GPUs using ONNX Runtime.
         Future options:
           - OpenVINO for CPU inference
           - TensorRT for GPU inference
           - Text splitting options
           - More control over which rows to embed (skip some, stop early)
           - Dynamic padding
         ## Steps
         1. Upload the dataset to the Hugging Face Hub.
         2. Enter dataset details into the form below.
         3. Choose a model. These are taken from the top of the [MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard).
         4. Enter optimization level. See [here](https://huggingface.co/docs/optimum/onnxruntime/usage_guides/optimization#optimization-configuration) for details.
         5. Choose a name for the new dataset.
         6. Hit run!
         ### Note:
         If you have short documents, O3 will be faster than O4. If you have long documents, O4 will be faster than O3. \
             O4 requires the tokenized documents to be padded to max length.
         """
     )
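
Review note on the O3/O4 remark in the description: O4 is said to require padding every tokenized document to max length. A minimal sketch of what that padding costs, assuming made-up token lengths, a batch size of 4, and a hypothetical `max_length` of 512 (none of these values come from this Space). It only counts padded tokens and does not model any per-token speedup from O4, which is why O4 can still come out ahead on long documents.

```python
def padded_token_count(lengths, batch_size, max_length=None):
    """Total tokens a model processes after padding each batch.

    If max_length is given, every sequence is padded (or truncated) to it.
    Otherwise each batch is padded to its own longest sequence
    (the "dynamic padding" listed under future options).
    """
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        width = max_length if max_length is not None else max(batch)
        total += width * len(batch)
    return total


# Illustrative token lengths only, not measurements.
short_docs = [12, 30, 25, 18, 40, 22, 16, 35]
long_docs = [480, 512, 500, 490, 512, 470, 505, 498]

for name, docs in [("short", short_docs), ("long", long_docs)]:
    fixed = padded_token_count(docs, batch_size=4, max_length=512)
    dynamic = padded_token_count(docs, batch_size=4)
    print(f"{name}: max-length padding={fixed} tokens, dynamic padding={dynamic} tokens")
```

With these numbers, short documents are padded to far more tokens under max-length padding than under dynamic padding, while the long documents are already close to the limit, so the two strategies process the same amount.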
         )
     with gr.Row():
+        download_btn = gr.Button(value="Download dataset!")
+        embed_btn = gr.Button(value="Embed texts!")
         last = gr.Textbox(value="")
+    download_btn.click(
+        fn=download,
+        inputs=[
+            ds_name,
+            ds_config,
+            ds_split,
+        ],
+        outputs=last,
+    )
+    embed_btn.click(
+        fn=embed,
         inputs=[
             ds_name,
             ds_config,
 if __name__ == "__main__":
+    demo.queue(concurrency_count=20).launch(show_error=True, debug=True)
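
Review note on the new `download` callback: it reports progress, loads the dataset, and returns a status string for the `last` textbox. The sketch below mirrors that shape so it can be run without Gradio or the Hub. The `loader` parameter and `fake_loader` are hypothetical stand-ins for `load_hf_dataset`, whose body is not shown in this diff, and the dataset name is a placeholder.

```python
def download(ds_name, ds_config, ds_split, loader, progress=None):
    """Load a dataset and return a short status message.

    `loader` is injected here only so the sketch runs offline; in app.py
    the callback calls load_hf_dataset directly.
    """
    if progress is not None:
        # Same call shape as in the commit: a fraction plus a description.
        progress(0.5, "Loading dataset...")
    ds = loader(ds_name, ds_config, ds_split)
    return f"Downloaded! It has {len(ds)} docs."


def fake_loader(ds_name, ds_config, ds_split):
    # Hypothetical stand-in that returns three in-memory rows.
    return [{"text": "a"}, {"text": "b"}, {"text": "c"}]


events = []
message = download(
    "user/my-dataset",  # placeholder dataset name
    "default",
    "train",
    fake_loader,
    progress=lambda fraction, desc: events.append((fraction, desc)),
)
print(message)
print(events)
```

When a callback like this is wired to a Gradio button, the `inputs` list should supply one component per positional parameter (three here: name, config, split). The `progress` argument is filled in by Gradio from the `gr.Progress()` default rather than from `inputs`.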