{% extends "page.html" %} {% block stylesheet %}
<style>
.left-align {
text-align: left;
}
.center-align {
text-align: center;
}
.container {
margin: 20px auto;
max-width: 800px;
}
h2,
h3,
h4 {
margin-top: 20px;
}
p {
line-height: 1.6;
}
ul {
margin-left: 20px;
}
</style>
{% endblock %} {% block site %}
<div id="jupyter-main-app" class="container">
<div class="center-align">
<img
src="https://huggingface.co/datasets/davanstrien/assets/resolve/main/logo.jpg"
alt="Space Logo"
style="width: 80%"
/>
<p>
This Space gives you an easy way to start generating synthetic datasets,
using Spaces compute to host open LLMs. It comes with a ready-to-go
environment and a series of notebooks showing various examples of
generating synthetic datasets.
You can read more about the aims of the Space in this <a href="https://huggingface.co/blog/davanstrien/synthetic-data-workshop" target="_blank">blog post</a>.
</p>
</div>
<div class="left-align">
<h2>What's covered?</h2>
<p>Currently this Space has notebooks covering the following topics:</p>
<h3>Creating synthetic text similarity datasets</h3>
<p>
A set of notebooks covering the steps for creating a synthetic dataset for
fine-tuning a sentence similarity model. These notebooks cover:
</p>
<ul>
<li>
How to do structured generation using the
<a href="https://github.com/outlines-dev/outlines" target="_blank">outlines</a> library
to gain more control over the outputs generated by an LLM.
</li>
<li>
How to use
<a href="https://docs.llamaindex.ai/en/stable/" target="_blank">LlamaIndex</a> to chunk
texts to fit into the context length of sentence embedding models.
</li>
<li>
Using <a href="https://github.com/vllm-project/vllm" target="_blank">vLLM</a> to
efficiently create a dataset that can be used to fine-tune a sentence
similarity model.
</li>
</ul>
</div>
<div class="center-align">
<h2>Using the Space</h2>
<p>
To use this Space, you should <a href="https://huggingface.co/spaces/davanstrien/synthetic-data-workshop?duplicate=true" target="_blank">duplicate it</a>.
To ensure your work is saved, we suggest enabling persistent storage for your Space.
To start, you may want to use a smaller GPU like the T4 and switch to a larger GPU when you want to run larger LLMs or generate more data.
<b>Reminder:</b> you can preview the notebooks in the Space without running
them. You can find the Jupyter Notebooks in the <a href="https://huggingface.co/spaces/davanstrien/synthetic-data-workshop/tree/main/notebooks" target="_blank">notebooks folder</a>.
</p>
<h2>Duplicate the Space to run your own instance</h2>
<br />
<a
class="duplicate-button"
style="display: inline-block"
target="_blank"
href="https://huggingface.co/spaces/davanstrien/synthetic-data-workshop?duplicate=true"
>
<img
style="margin: 0"
src="https://img.shields.io/badge/-Duplicate%20Space-blue?labelColor=white&style=flat&logo=data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAAAAXNSR0IArs4c6QAAAP5JREFUOE+lk7FqAkEURY+ltunEgFXS2sZGIbXfEPdLlnxJyDdYB62sbbUKpLbVNhyYFzbrrA74YJlh9r079973psed0cvUD4A+4HoCjsA85X0Dfn/RBLBgBDxnQPfAEJgBY+A9gALA4tcbamSzS4xq4FOQAJgCDwV2CPKV8tZAJcAjMMkUe1vX+U+SMhfAJEHasQIWmXNN3abzDwHUrgcRGmYcgKe0bxrblHEB4E/pndMazNpSZGcsZdBlYJcEL9Afo75molJyM2FxmPgmgPqlWNLGfwZGG6UiyEvLzHYDmoPkDDiNm9JR9uboiONcBXrpY1qmgs21x1QwyZcpvxt9NS09PlsPAAAAAElFTkSuQmCC&logoWidth=14"
alt="Duplicate Space"
/>
</a>
<br />
<br />
<h4>The default token is <span style="color: orange">huggingface</span></h4>
</div>
{% if login_available %}
<div class="center-align">
<form
action="{{base_url}}login?next={{next}}"
method="post"
class="form-inline"
>
{{ xsrf_form_html() | safe }} {% if token_available %}
<label for="password_input"
><strong>{% trans %}Token:{% endtrans %}</strong></label
>
{% else %}
<label for="password_input"
><strong>{% trans %}Password:{% endtrans %}</strong></label
>
{% endif %}
<input
type="password"
name="password"
id="password_input"
class="form-control"
/>
<button type="submit" class="btn btn-default" id="login_submit">
{% trans %}Log in{% endtrans %}
</button>
</form>
</div>
{% else %}
<div class="center-align">
<p>
{% trans %}No login available, you shouldn't be seeing this page.{%
endtrans %}
</p>
</div>
{% endif %}
<div class="center-align" style="font-size: 0.8em; color: #888">
<p>
This template was created by
<a href="https://twitter.com/camenduru" target="_blank">camenduru</a> and
<a href="https://huggingface.co/nateraw" target="_blank">nateraw</a>, with
contributions from
<a href="https://huggingface.co/osanseviero" target="_blank"
>osanseviero</a
>
and <a href="https://huggingface.co/azzr" target="_blank">azzr</a>.
</p>
</div>
{% if message %}
<div class="row">
{% for key in message %}
<div class="message {{key}}">{{message[key]}}</div>
{% endfor %}
</div>
{% endif %} {% if token_available %} {% block token_message %} {% endblock
token_message %} {% endif %}
</div>
{% endblock %} {% block script %} {% endblock %}