|
{% extends "page.html" %} {% block stylesheet %} |
|
<style> |
|
.left-align { |
|
text-align: left; |
|
} |
|
.center-align { |
|
text-align: center; |
|
} |
|
.container { |
|
margin: 20px auto; |
|
max-width: 800px; |
|
} |
|
h2, |
|
h3, |
|
h4 { |
|
margin-top: 20px; |
|
} |
|
p { |
|
line-height: 1.6; |
|
} |
|
ul { |
|
margin-left: 20px; |
|
} |
|
</style> |
|
{% endblock %} {% block site %} |
|
<div id="jupyter-main-app" class="container"> |
|
<div class="center-align"> |
|
<img |
|
src="https://huggingface.co/datasets/davanstrien/assets/resolve/main/logo.jpg" |
|
alt="Space Logo" |
|
style="width: 80%" |
|
/> |
|
<p> |
|
This Space is designed to provide you with an easy way to get started |
|
generating synthetic datasets using Spaces compute to host open LLMs. The |
|
Space comes with a ready-to-go environment and a series of notebooks |
|
showing various examples of generating synthetic datasets. |
|
</p> |
|
</div> |
|
<div class="left-align"> |
|
<h2>What's covered?</h2> |
|
<p>Currently, this Space has notebooks covering the following topics:</p> |
|
<h4>Creating synthetic text similarity datasets:</h4> |
|
<p> |
|
A series of notebooks covering the steps for creating a synthetic dataset for |
|
fine-tuning a sentence similarity model. These notebooks cover: |
|
</p> |
|
<ul> |
|
<li> |
|
How to do structured generation using the |
|
<a href="https://github.com/outlines-dev/outlines">outlines</a> library |
|
to gain more control over the outputs generated by an LLM. |
|
</li> |
|
<li> |
|
How to use |
|
<a href="https://docs.llamaindex.ai/en/stable/">LlamaIndex</a> to chunk |
|
texts to fit into the context length of sentence embedding models. |
|
</li> |
|
<li> |
|
Using <a href="https://github.com/vllm-project/vllm">vLLM</a> to |
|
efficiently create a dataset that can be used to fine-tune a sentence |
|
similarity model. |
|
</li> |
|
</ul> |
|
</div> |
|
<div class="center-align"> |
|
<h2>Using the Space</h2> |
|
<p> |
|
To use this Space, you should <a href="https://huggingface.co/spaces/davanstrien/synthetic-data-workshop?duplicate=true">duplicate it</a>. |
|
To ensure your work is saved, consider enabling persistent storage for your Space. |
|
To start, you may want to use a smaller GPU like the T4 and switch to a larger GPU when you want to run larger LLMs or generate more data. |
|
<b>Reminder:</b> you can preview the notebooks in the Space without running |
|
them. You can find the Jupyter Notebooks in the <a href="https://huggingface.co/spaces/davanstrien/synthetic-data-workshop/tree/main/notebooks">notebooks folder</a>. |
|
</p> |
|
<h2>Duplicate the Space to run your own instance</h2> |
|
<br /> |
|
<a |
|
class="duplicate-button" |
|
style="display: inline-block" |
|
target="_blank" |
|
href="https://huggingface.co/spaces/davanstrien/synthetic-data-workshop?duplicate=true" |
|
> |
|
<img |
|
style="margin: 0" |
|
src="https://img.shields.io/badge/-Duplicate%20Space-blue?labelColor=white&style=flat&logo=data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAAAAXNSR0IArs4c6QAAAP5JREFUOE+lk7FqAkEURY+ltunEgFXS2sZGIbXfEPdLlnxJyDdYB62sbbUKpLbVNhyYFzbrrA74YJlh9r079973psed0cvUD4A+4HoCjsA85X0Dfn/RBLBgBDxnQPfAEJgBY+A9gALA4tcbamSzS4xq4FOQAJgCDwV2CPKV8tZAJcAjMMkUe1vX+U+SMhfAJEHasQIWmXNN3abzDwHUrgcRGmYcgKe0bxrblHEB4E/pndMazNpSZGcsZdBlYJcEL9Afo75molJyM2FxmPgmgPqlWNLGfwZGG6UiyEvLzHYDmoPkDDiNm9JR9uboiONcBXrpY1qmgs21x1QwyZcpvxt9NS09PlsPAAAAAElFTkSuQmCC&logoWidth=14" |
|
alt="Duplicate Space" |
|
/> |
|
</a> |
|
<br /> |
|
<br /> |
|
<h4>The default token is <span style="color: orange">huggingface</span></h4> |
|
</div> |
|
{% if login_available %} |
|
<div class="center-align"> |
|
<form |
|
action="{{base_url}}login?next={{next}}" |
|
method="post" |
|
class="form-inline" |
|
> |
|
{{ xsrf_form_html() | safe }} {% if token_available %} |
|
<label for="password_input" |
|
><strong>{% trans %}Token:{% endtrans %}</strong></label |
|
> |
|
{% else %} |
|
<label for="password_input" |
|
><strong>{% trans %}Password:{% endtrans %}</strong></label |
|
> |
|
{% endif %} |
|
<input |
|
type="password" |
|
name="password" |
|
id="password_input" |
|
class="form-control" |
|
/> |
|
<button type="submit" class="btn btn-default" id="login_submit"> |
|
{% trans %}Log in{% endtrans %} |
|
</button> |
|
</form> |
|
</div> |
|
{% else %} |
|
<div class="center-align"> |
|
<p> |
|
{% trans %}No login available, you shouldn't be seeing this page.{% |
|
endtrans %} |
|
</p> |
|
</div> |
|
{% endif %} |
|
<div class="center-align" style="font-size: 0.8em; color: #888"> |
|
<p> |
|
This template was created by |
|
<a href="https://twitter.com/camenduru" target="_blank">camenduru</a> and |
|
<a href="https://huggingface.co/nateraw" target="_blank">nateraw</a>, with |
|
contributions from |
|
<a href="https://huggingface.co/osanseviero" target="_blank" |
|
>osanseviero</a |
|
> |
|
and <a href="https://huggingface.co/azzr" target="_blank">azzr</a>. |
|
</p> |
|
</div> |
|
{% if message %} |
|
<div class="row"> |
|
{% for key in message %} |
|
<div class="message {{key}}">{{message[key]}}</div> |
|
{% endfor %} |
|
</div> |
|
{% endif %} {% if token_available %} {% block token_message %} {% endblock |
|
token_message %} {% endif %} |
|
</div> |
|
{% endblock %} {% block script %} {% endblock %} |
|
|