{% extends "page.html" %} {% block stylesheet %}
<style>
  .left-align {
    text-align: left;
  }
  .center-align {
    text-align: center;
  }
  .container {
    margin: 20px auto;
    max-width: 800px;
  }
  h2,
  h3,
  h4 {
    margin-top: 20px;
  }
  p {
    line-height: 1.6;
  }
  ul {
    margin-left: 20px;
  }
</style>
{% endblock %} {% block site %}
<div id="jupyter-main-app" class="container">
  <div class="center-align">
    <img
      src="https://huggingface.co/datasets/davanstrien/assets/resolve/main/logo.jpg"
      alt="Space Logo"
      style="width: 75%"
    />
    <p>
      This Space gives you an easy way to start generating synthetic datasets,
      using Spaces compute to host open LLMs. It comes with a ready-to-go
      environment and a series of notebooks showing examples of synthetic
      dataset generation.
    </p>
  </div>
  <div class="left-align">
    <h2>What's covered?</h2>
    <p>Currently this Space has notebooks covering the following topics:</p>
    <h3>Creating synthetic text similarity datasets</h3>
    <p>
      A set of notebooks covering the steps for creating a synthetic dataset for
      fine-tuning a sentence similarity model. These notebooks cover:
    </p>
    <ul>
      <li>
        How to do structured generation using the
        <a href="https://github.com/outlines-dev/outlines">outlines</a> library
        to have more control over the outputs generated by an LLM.
      </li>
      <li>
        How to use
        <a href="https://docs.llamaindex.ai/en/stable/">LlamaIndex</a> to chunk
        texts to fit into the context length of sentence embedding models.
      </li>
      <li>
        How to use <a href="https://github.com/vllm-project/vllm">vLLM</a> to
        efficiently create a dataset that can be used to fine-tune a sentence
        similarity model.
      </li>
    </ul>
  </div>
  <div class="center-align">
    <h2>Using the Space</h2>
    <p>
      To use this Space, duplicate it with the duplicate button. Enable
      persistent storage so you can save your work. To start, you may want to
      use a smaller GPU such as the T4, then switch to a larger GPU when you
      want to generate data with larger models.
    </p>
    <h2>Duplicate the Space to run your own instance</h2>
    <h4>The default token is <span style="color: orange">huggingface</span></h4>
  </div>
  {% if login_available %}
  <div class="center-align">
    <form
      action="{{base_url}}login?next={{next}}"
      method="post"
      class="form-inline"
    >
      {{ xsrf_form_html() | safe }} {% if token_available %}
      <label for="password_input"
        ><strong>{% trans %}Token:{% endtrans %}</strong></label
      >
      {% else %}
      <label for="password_input"
        ><strong>{% trans %}Password:{% endtrans %}</strong></label
      >
      {% endif %}
      <input
        type="password"
        name="password"
        id="password_input"
        class="form-control"
      />
      <button type="submit" class="btn btn-default" id="login_submit">
        {% trans %}Log in{% endtrans %}
      </button>
    </form>
  </div>
  {% else %}
  <div class="center-align">
    <p>
      {% trans %}No login available, you shouldn't be seeing this page.{%
      endtrans %}
    </p>
  </div>
  {% endif %}
  <div class="center-align" style="font-size: 0.8em; color: #888">
    <p>
      This template was created by
      <a href="https://twitter.com/camenduru" target="_blank">camenduru</a> and
      <a href="https://huggingface.co/nateraw" target="_blank">nateraw</a>, with
      contributions from
      <a href="https://huggingface.co/osanseviero" target="_blank"
        >osanseviero</a
      >
      and <a href="https://huggingface.co/azzr" target="_blank">azzr</a>.
    </p>
  </div>
  {% if message %}
  <div class="row">
    {% for key in message %}
    <div class="message {{key}}">{{message[key]}}</div>
    {% endfor %}
  </div>
  {% endif %} {% if token_available %} {% block token_message %} {% endblock
  token_message %} {% endif %}
</div>
{% endblock %} {% block script %} {% endblock %}