{% extends "page.html" %} {% block stylesheet %}
<style>
  .left-align {
    text-align: left;
  }
  .center-align {
    text-align: center;
  }
  .container {
    margin: 20px auto;
    max-width: 800px;
  }
  h2,
  h3,
  h4 {
    margin-top: 20px;
  }
  p {
    line-height: 1.6;
  }
  ul {
    margin-left: 20px;
  }
</style>
{% endblock %} {% block site %}
<div id="jupyter-main-app" class="container">
  <div class="center-align">
    <img
      src="https://huggingface.co/datasets/davanstrien/assets/resolve/main/logo.jpg"
      alt="Space Logo"
      style="width: 75%"
    />
    <p>
      This Space gives you an easy way to start generating synthetic datasets,
      using Spaces compute to host open LLMs. It comes with a ready-to-go
      environment and a series of notebooks showing various examples of
      generating synthetic datasets.
    </p>
  </div>
  <div class="left-align">
    <h2>What's covered?</h2>
    <p>Currently this Space has notebooks covering the following topics:</p>
    <h3>Creating synthetic text similarity datasets</h3>
    <p>
      A set of notebooks covering the steps for creating a synthetic dataset for
      fine-tuning a sentence similarity model. These notebooks cover:
    </p>
    <ul>
      <li>
        How to do structured generation using the
        <a href="https://github.com/outlines-dev/outlines">outlines</a> library
        to gain more control over the outputs generated by an LLM.
      </li>
      <li>
        How to use
        <a href="https://docs.llamaindex.ai/en/stable/">LlamaIndex</a> to chunk
        texts to fit into the context length of sentence embedding models.
      </li>
      <li>
        How to use <a href="https://github.com/vllm-project/vllm">vLLM</a> to
        efficiently create a dataset that can be used to fine-tune a sentence
        similarity model.
      </li>
     
    </ul>
  </div>
  <div class="center-align">
    <h2>Using the Space</h2>
    <p>
      To use this Space, duplicate it with the Duplicate button. Enable
      persistent storage so you can save your work. To start, you may want to
      use a smaller GPU such as the T4, then switch to a larger GPU when you
      want to generate data with bigger models.
    </p>
    <h2>Duplicate the Space to run your own instance</h2>
    <h4>The default token is <span style="color: orange">huggingface</span></h4>
  </div>
  {% if login_available %}
  <div class="center-align">
    <form
      action="{{base_url}}login?next={{next}}"
      method="post"
      class="form-inline"
    >
      {{ xsrf_form_html() | safe }} {% if token_available %}
      <label for="password_input"
        ><strong>{% trans %}Token:{% endtrans %}</strong></label
      >
      {% else %}
      <label for="password_input"
        ><strong>{% trans %}Password:{% endtrans %}</strong></label
      >
      {% endif %}
      <input
        type="password"
        name="password"
        id="password_input"
        class="form-control"
      />
      <button type="submit" class="btn btn-default" id="login_submit">
        {% trans %}Log in{% endtrans %}
      </button>
    </form>
  </div>
  {% else %}
  <div class="center-align">
    <p>
      {% trans %}No login available, you shouldn't be seeing this page.{%
      endtrans %}
    </p>
  </div>
  {% endif %}
  <div class="center-align" style="font-size: 0.8em; color: #888">
    <p>
      This template was created by
      <a href="https://twitter.com/camenduru" target="_blank">camenduru</a> and
      <a href="https://huggingface.co/nateraw" target="_blank">nateraw</a>, with
      contributions from
      <a href="https://huggingface.co/osanseviero" target="_blank"
        >osanseviero</a
      >
      and <a href="https://huggingface.co/azzr" target="_blank">azzr</a>.
    </p>
  </div>
  {% if message %}
  <div class="row">
    {% for key in message %}
    <div class="message {{key}}">{{message[key]}}</div>
    {% endfor %}
  </div>
  {% endif %} {% if token_available %} {% block token_message %} {% endblock
  token_message %} {% endif %}
</div>
{% endblock %} {% block script %} {% endblock %}