{% extends "page.html" %} {% block stylesheet %}
<style>
  .left-align {
    text-align: left;
  }
  .center-align {
    text-align: center;
  }
  .container {
    margin: 20px auto;
    max-width: 800px;
  }
  h2,
  h3,
  h4 {
    margin-top: 20px;
  }
  p {
    line-height: 1.6;
  }
  ul {
    margin-left: 20px;
  }
</style>
{% endblock %} {% block site %}
<div id="jupyter-main-app" class="container">
  <div class="center-align">
    <img
      src="https://huggingface.co/datasets/davanstrien/assets/resolve/main/logo.jpg"
      alt="Space Logo"
      style="width: 80%"
    />
    <p>
      This Space gives you an easy way to get started generating synthetic
      datasets, using Spaces compute to host open LLMs. The Space comes with a
      ready-to-go environment and a series of notebooks showing various
      examples of generating synthetic datasets.
    </p>
    <p>
      You can read more about the aims of the Space in this <a href="https://huggingface.co/blog/davanstrien/synthetic-data-workshop" target="_blank">blog post</a>.
    </p>
  </div>
  <div class="left-align">
    <h2>What's covered?</h2>
    <p>Currently this Space has notebooks covering the following topics:</p>
    <h3>Creating synthetic text similarity datasets</h3>
    <p>
      A set of notebooks covering the steps for creating a synthetic dataset for
      fine-tuning a sentence similarity model. These notebooks cover:
    </p>
    <ul>
      <li>
        How to do structured generation using the
        <a href="https://github.com/outlines-dev/outlines" target="_blank">outlines</a> library
        to gain more control over the outputs generated by an LLM.
      </li>
      <li>
        How to use
        <a href="https://docs.llamaindex.ai/en/stable/" target="_blank">LlamaIndex</a> to chunk
        texts to fit into the context length of sentence embedding models.
      </li>
      <li>
        How to use <a href="https://github.com/vllm-project/vllm" target="_blank">vLLM</a> to
        efficiently create a dataset that can be used to fine-tune a sentence
        similarity model.
      </li>
    </ul>
  </div>
  <div class="center-align">
    <h2>Using the Space</h2>
    <p>
      To use this Space, you should <a href="https://huggingface.co/spaces/davanstrien/synthetic-data-workshop?duplicate=true" target="_blank">duplicate it</a>.
      To ensure your work is saved, consider enabling persistent storage for your Space.
      To start, you may want to use a smaller GPU such as the T4, then switch to a larger GPU when you want to run larger LLMs or generate more data.
      <b>Reminder:</b> you can preview the notebooks in the Space without running
      them. You can find the Jupyter notebooks in the <a href="https://huggingface.co/spaces/davanstrien/synthetic-data-workshop/tree/main/notebooks" target="_blank">notebooks folder</a>.
    </p>
    <h2>Duplicate the Space to run your own instance</h2>
    <br />
    <a
      class="duplicate-button"
      style="display: inline-block"
      target="_blank"
      href="https://huggingface.co/spaces/davanstrien/synthetic-data-workshop?duplicate=true"
    >
      <img
        style="margin: 0"
        src="https://img.shields.io/badge/-Duplicate%20Space-blue?labelColor=white&amp;style=flat&amp;logo=data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAAAAXNSR0IArs4c6QAAAP5JREFUOE+lk7FqAkEURY+ltunEgFXS2sZGIbXfEPdLlnxJyDdYB62sbbUKpLbVNhyYFzbrrA74YJlh9r079973psed0cvUD4A+4HoCjsA85X0Dfn/RBLBgBDxnQPfAEJgBY+A9gALA4tcbamSzS4xq4FOQAJgCDwV2CPKV8tZAJcAjMMkUe1vX+U+SMhfAJEHasQIWmXNN3abzDwHUrgcRGmYcgKe0bxrblHEB4E/pndMazNpSZGcsZdBlYJcEL9Afo75molJyM2FxmPgmgPqlWNLGfwZGG6UiyEvLzHYDmoPkDDiNm9JR9uboiONcBXrpY1qmgs21x1QwyZcpvxt9NS09PlsPAAAAAElFTkSuQmCC&amp;logoWidth=14"
        alt="Duplicate Space"
      />
    </a>
    <br />
    <br />
    <h4>The default token is <span style="color: orange">huggingface</span></h4>
  </div>
  {% if login_available %}
  <div class="center-align">
    <form
      action="{{base_url}}login?next={{next}}"
      method="post"
      class="form-inline"
    >
      {{ xsrf_form_html() | safe }} {% if token_available %}
      <label for="password_input"
        ><strong>{% trans %}Token:{% endtrans %}</strong></label
      >
      {% else %}
      <label for="password_input"
        ><strong>{% trans %}Password:{% endtrans %}</strong></label
      >
      {% endif %}
      <input
        type="password"
        name="password"
        id="password_input"
        class="form-control"
      />
      <button type="submit" class="btn btn-default" id="login_submit">
        {% trans %}Log in{% endtrans %}
      </button>
    </form>
  </div>
  {% else %}
  <div class="center-align">
    <p>
      {% trans %}No login available, you shouldn't be seeing this page.{%
      endtrans %}
    </p>
  </div>
  {% endif %}
  <div class="center-align" style="font-size: 0.8em; color: #888">
    <p>
      This template was created by
      <a href="https://twitter.com/camenduru" target="_blank">camenduru</a> and
      <a href="https://huggingface.co/nateraw" target="_blank">nateraw</a>, with
      contributions from
      <a href="https://huggingface.co/osanseviero" target="_blank"
        >osanseviero</a
      >
      and <a href="https://huggingface.co/azzr" target="_blank">azzr</a>
    </p>
  </div>
  {% if message %}
  <div class="row">
    {% for key in message %}
    <div class="message {{key}}">{{message[key]}}</div>
    {% endfor %}
  </div>
  {% endif %} {% if token_available %} {% block token_message %} {% endblock
  token_message %} {% endif %}
</div>
{% endblock %} {% block script %} {% endblock %}