File size: 9,019 Bytes
8773ff3
 
32014a1
fc828f1
8773ff3
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
32014a1
8773ff3
32014a1
 
 
8773ff3
fc828f1
 
 
 
798f8ba
 
3379fc5
fc828f1
3379fc5
 
 
 
 
 
 
fc828f1
 
3379fc5
 
 
fc828f1
798f8ba
8773ff3
 
3379fc5
 
 
 
8773ff3
 
798f8ba
 
 
 
 
 
 
fc828f1
 
 
798f8ba
 
fc828f1
 
 
798f8ba
fc828f1
 
 
 
798f8ba
 
 
 
 
 
 
8773ff3
fc828f1
 
 
 
 
 
 
 
 
3379fc5
 
 
 
fc828f1
 
 
3379fc5
 
 
 
 
 
 
 
 
 
 
fc828f1
 
798f8ba
fc828f1
7503ca9
fc828f1
3379fc5
 
 
 
 
 
 
fc828f1
 
 
3379fc5
 
 
 
798f8ba
 
3379fc5
 
 
 
8773ff3
 
 
 
 
 
3379fc5
8773ff3
 
798f8ba
8773ff3
32014a1
 
8773ff3
 
 
32014a1
 
 
 
fc828f1
 
 
 
 
 
 
32014a1
 
 
 
fc828f1
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
dfd3683
 
32014a1
3379fc5
 
32014a1
8773ff3
32014a1
3379fc5
32014a1
fc828f1
3379fc5
 
32014a1
8773ff3
3379fc5
32014a1
 
3379fc5
fc828f1
9675a52
32014a1
3379fc5
 
 
dfd3683
 
 
 
 
fc828f1
 
32014a1
 
 
3379fc5
 
32014a1
8773ff3
7503ca9
 
 
 
 
 
 
 
 
 
 
 
 
 
32014a1
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
import streamlit as st

from defaults import ARGILLA_URL
from hub import push_pipeline_params
from utils import project_sidebar

st.set_page_config(
    page_title="Domain Data Grower",
    page_icon="🧑‍🌾",
)

project_sidebar()

################################################################################
# HEADER
################################################################################

st.header("🧑‍🌾 Domain Data Grower")
st.divider()
st.subheader("Step 3. Run the pipeline to generate synthetic data")
st.write("Define the distilabel pipeline for generating the dataset.")

hub_username = st.session_state.get("hub_username")
project_name = st.session_state.get("project_name")
hub_token = st.session_state.get("hub_token")

###############################################################
# CONFIGURATION
###############################################################

st.divider()

st.markdown("## 🧰 Data Generation Pipeline")

st.markdown(
    """
            Now we need to define the configuration for the pipeline that will generate the synthetic data.
            The pipeline will generate synthetic data by combining self-instruction and domain expert responses.
            The self-instruction step generates instructions based on seed terms, and the domain expert step generates \
            responses to those instructions. Take a look at the [distilabel docs](https://distilabel.argilla.io/latest/sections/learn/tasks/text_generation/#self-instruct) for more information.
            """
)

###############################################################
# INFERENCE
###############################################################

st.markdown("#### 🤖 Inference configuration")

st.write(
    """Add the url of the Huggingface inference API or endpoint that your pipeline should use to generate instruction and response pairs. \
    Some domain tasks may be challenging for smaller models, so you may need to iterate over your task definition and model selection. \
    This is a part of the process of generating high-quality synthetic data, human feedback is key to this process. \
    You can find compatible models here:"""
)

with st.expander("🤗 Recommended Models"):
    st.write("All inference endpoint compatible models can be found via the link below")
    st.link_button(
        "🤗 Inference compaptible models on the hub",
        "https://huggingface.co/models?pipeline_tag=text-generation&other=endpoints_compatible&sort=trending",
    )
    st.write("🔋Projects with sufficient resources could take advantage of LLama3 70b")
    st.code(
        "https://api-inference.huggingface.co/models/meta-llama/Meta-Llama-3-8B-Instruct"
    )

    st.write("🪫Projects with less resources could take advantage of LLama 3 8b")
    st.code(
        "https://api-inference.huggingface.co/models/meta-llama/Meta-Llama-3-8B-Instruct"
    )

    st.write("🍃Projects with even less resources could use Phi-3-mini-4k-instruct")
    st.code(
        "https://api-inference.huggingface.co/models/microsoft/Phi-3-mini-4k-instruct"
    )

    st.write("Note Hugggingface Pro gives access to more compute resources")
    st.link_button(
        "🤗 Huggingface Pro",
        "https://huggingface.co/pricing",
    )


self_instruct_base_url = st.text_input(
    label="Model base URL for instruction generation",
    value="https://api-inference.huggingface.co/models/microsoft/Phi-3-mini-4k-instruct",
)
domain_expert_base_url = st.text_input(
    label="Model base URL for domain expert response",
    value="https://api-inference.huggingface.co/models/microsoft/Phi-3-mini-4k-instruct",
)

###############################################################
# PARAMETERS
###############################################################

st.divider()
st.markdown("#### 🧮 Parameters configuration")

st.write(
    "⚠️ Model and parameter choices significantly affect the quality of the generated data. \
    We reccomend that you start with generating a few samples and review the data. Then scale up from there. \
    You can run the pipeline multiple times with different configurations and append it to the same Argilla dataset."
)

st.markdown(
    "Number of generations are the samples that each model will generate for each seed term, \
    so if you have 10 seed terms, 2 instruction generations, and 2 response generations, you will have 40 samples in total."
)

self_intruct_num_generations = st.slider(
    "Number of generations for self-instruction", 1, 10, 2
)
domain_expert_num_generations = st.slider(
    "Number of generations for domain expert response", 1, 10, 2
)

st.markdown(
    "Temperature is a hyperparameter that controls the randomness of the generated text. \
        Lower temperatures will generate more deterministic text, while higher temperatures \
        will add more variation to generations."
)

self_instruct_temperature = st.slider("Temperature for self-instruction", 0.1, 1.0, 0.9)
domain_expert_temperature = st.slider("Temperature for domain expert", 0.1, 1.0, 0.9)

###############################################################
# ARGILLA API
###############################################################

st.divider()
st.markdown("#### 🔬 Argilla API details to push the generated dataset")
st.markdown(
    "Here you can define the Argilla API details to push the generated dataset to your Argilla space. \
        These are the defaults that were set up for the project. You can change them if needed."
)
argilla_url = st.text_input("Argilla API URL", ARGILLA_URL)
argilla_api_key = st.text_input("Argilla API Key", "owner.apikey")
argilla_dataset_name = st.text_input("Argilla Dataset Name", project_name)
st.divider()

###############################################################
# Pipeline Run
###############################################################

st.markdown("## Run the pipeline")

st.markdown(
    "Once you've defined the pipeline configuration above, you can run the pipeline from your local machine."
)


if all(
    [
        argilla_api_key,
        argilla_url,
        self_instruct_base_url,
        domain_expert_base_url,
        self_intruct_num_generations,
        domain_expert_num_generations,
        self_instruct_temperature,
        domain_expert_temperature,
        hub_username,
        project_name,
        hub_token,
        argilla_dataset_name,
    ]
) and st.button("💾 Save Pipeline Config"):
    with st.spinner("Pushing pipeline to the Hub..."):
        push_pipeline_params(
            pipeline_params={
                "argilla_api_key": argilla_api_key,
                "argilla_api_url": argilla_url,
                "argilla_dataset_name": argilla_dataset_name,
                "self_instruct_base_url": self_instruct_base_url,
                "domain_expert_base_url": domain_expert_base_url,
                "self_instruct_temperature": self_instruct_temperature,
                "domain_expert_temperature": domain_expert_temperature,
                "self_intruct_num_generations": self_intruct_num_generations,
                "domain_expert_num_generations": domain_expert_num_generations,
            },
            hub_username=hub_username,
            hub_token=hub_token,
            project_name=project_name,
        )

    st.success(
        f"Pipeline configuration pushed to the dataset repo {hub_username}/{project_name} on the Hub."
    )

    st.markdown(
        "To run the pipeline locally, you need to have the `distilabel` library installed. \
            You can install it using the following command:"
    )

    st.code(
        body="""
        # Install the distilabel library
        pip install distilabel
        """,
        language="bash",
    )

    st.markdown("Next, you'll need to clone the pipeline code and install dependencies:")

    st.code(
        """
        git clone https://github.com/huggingface/data-is-better-together
        cd data-is-better-together/domain-specific-datasets/distilabel_pipelines
        pip install -r requirements.txt
        huggingface-cli login
        """,
        language="bash",
    )

    st.markdown("Finally, you can run the pipeline using the following command:")

    st.code(
        f"""
        python domain_expert_pipeline.py {hub_username}/{project_name}""",
        language="bash",
    )
    st.markdown(
        "👩‍🚀 If you want to customise the pipeline take a look in `domain_expert_pipeline.py` \
            and the [distilabel docs](https://distilabel.argilla.io/)"
    )

    st.markdown(
        "🚀 Once you've run the pipeline your records will be available in the Argilla space"
    )

    st.link_button("🔗 Argilla Space", argilla_url)

    st.markdown("Once you've reviewed the data, you can publish it on the next page:")

    st.page_link(
        page="pages/4_🔍 Review Generated Data.py",
        label="Review Generated Data",
        icon="🔍",
    )

else:
    st.info("Please fill all the required fields.")