{% extends "page.html" %} {% block stylesheet %}
<style>
  .left-align {
    text-align: left;
  }
  .center-align {
    text-align: center;
  }
  .container {
    margin: 20px auto;
    max-width: 800px;
  }
  h2,
  h3,
  h4 {
    margin-top: 20px;
  }
  p {
    line-height: 1.6;
  }
  ul {
    margin-left: 20px;
  }
</style>
{% endblock %} {% block site %}
<div id="jupyter-main-app" class="container">
  <div class="center-align">
    <img
      src="https://huggingface.co/datasets/davanstrien/assets/resolve/main/logo.jpg"
      alt="Space Logo"
      style="width: 80%"
    />
    <p>
      This Space gives you an easy way to get started generating synthetic
      datasets, using Spaces compute to host open LLMs. It comes with a
      ready-to-go environment and a series of notebooks showing various
      examples of synthetic dataset generation.
    </p>
    <p>
      You can read more about the aims of the Space in this <a href="https://huggingface.co/blog/davanstrien/synthetic-data-workshop" target="_blank">blog post</a>.
    </p>
  </div>
  <div class="left-align">
    <h2>What's covered?</h2>
    <p>Currently this Space has notebooks covering the following topics:</p>
    <h3>Creating synthetic text similarity datasets</h3>
    <p>
      A set of notebooks covering the steps for creating a synthetic dataset for
      fine-tuning a sentence similarity model. These notebooks cover:
    </p>
    <ul>
      <li>
        How to do structured generation using the
        <a href="https://github.com/outlines-dev/outlines" target="_blank">outlines</a> library
        to gain more control over the outputs generated by an LLM.
      </li>
      <li>
        How to use
        <a href="https://docs.llamaindex.ai/en/stable/" target="_blank">LlamaIndex</a> to chunk
        texts to fit into the context length of sentence embedding models.
      </li>
      <li>
        How to use <a href="https://github.com/vllm-project/vllm" target="_blank">vLLM</a> to
        efficiently create a dataset that can be used to fine-tune a sentence
        similarity model.
      </li>
    </ul>
  </div>
  <div class="center-align">
    <h2>Using the Space</h2>
    <p>
      To use this Space, you should <a href="https://huggingface.co/spaces/davanstrien/synthetic-data-workshop?duplicate=true" target="_blank">duplicate it</a>.
      To ensure your work is saved, we suggest enabling persistent storage for your Space.
      To start, you may want to use a smaller GPU like the T4 and switch to a bigger GPU when you want to run larger LLMs or generate more data.
    </p>
    <p>
      <b>Reminder:</b> you can preview the notebooks in the Space without running
      them. You can find the Jupyter notebooks in the <a href="https://huggingface.co/spaces/davanstrien/synthetic-data-workshop/tree/main/notebooks" target="_blank">notebooks folder</a>.
    </p>
    <h2>Duplicate the Space to run your own instance</h2>
    <br />
    <a
      class="duplicate-button"
      style="display: inline-block"
      target="_blank"
      href="https://huggingface.co/spaces/davanstrien/synthetic-data-workshop?duplicate=true"
    >
      <img
        style="margin: 0"
        src="https://img.shields.io/badge/-Duplicate%20Space-blue?labelColor=white&amp;style=flat&amp;logo=data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAAAAXNSR0IArs4c6QAAAP5JREFUOE+lk7FqAkEURY+ltunEgFXS2sZGIbXfEPdLlnxJyDdYB62sbbUKpLbVNhyYFzbrrA74YJlh9r079973psed0cvUD4A+4HoCjsA85X0Dfn/RBLBgBDxnQPfAEJgBY+A9gALA4tcbamSzS4xq4FOQAJgCDwV2CPKV8tZAJcAjMMkUe1vX+U+SMhfAJEHasQIWmXNN3abzDwHUrgcRGmYcgKe0bxrblHEB4E/pndMazNpSZGcsZdBlYJcEL9Afo75molJyM2FxmPgmgPqlWNLGfwZGG6UiyEvLzHYDmoPkDDiNm9JR9uboiONcBXrpY1qmgs21x1QwyZcpvxt9NS09PlsPAAAAAElFTkSuQmCC&amp;logoWidth=14"
        alt="Duplicate Space"
      />
    </a>
    <br />
    <br />
    <h4>The default token is <span style="color: orange">huggingface</span></h4>
  </div>
  {% if login_available %}
  <div class="center-align">
    <form
      action="{{base_url}}login?next={{next}}"
      method="post"
      class="form-inline"
    >
      {{ xsrf_form_html() | safe }} {% if token_available %}
      <label for="password_input"
        ><strong>{% trans %}Token:{% endtrans %}</strong></label
      >
      {% else %}
      <label for="password_input"
        ><strong>{% trans %}Password:{% endtrans %}</strong></label
      >
      {% endif %}
      <input
        type="password"
        name="password"
        id="password_input"
        class="form-control"
      />
      <button type="submit" class="btn btn-default" id="login_submit">
        {% trans %}Log in{% endtrans %}
      </button>
    </form>
  </div>
  {% else %}
  <div class="center-align">
    <p>
      {% trans %}No login available, you shouldn't be seeing this page.{%
      endtrans %}
    </p>
  </div>
  {% endif %}
  <div class="center-align" style="font-size: 0.8em; color: #888">
    <p>
      This template was created by
      <a href="https://twitter.com/camenduru" target="_blank">camenduru</a> and
      <a href="https://huggingface.co/nateraw" target="_blank">nateraw</a>, with
      contributions from
      <a href="https://huggingface.co/osanseviero" target="_blank"
        >osanseviero</a
      >
      and <a href="https://huggingface.co/azzr" target="_blank">azzr</a>.
    </p>
  </div>
  {% if message %}
  <div class="row">
    {% for key in message %}
    <div class="message {{key}}">{{message[key]}}</div>
    {% endfor %}
  </div>
  {% endif %} {% if token_available %} {% block token_message %} {% endblock
  token_message %} {% endif %}
</div>
{% endblock %} {% block script %} {% endblock %}