import gradio as gr


def create_help_tab():
    """Render the static help text (readme) for the metrics explorer UI."""
    gr.Markdown(
        label="Readme",
        value="""
# Dataset Metrics Explorer

## Features:

- Inspect datasets through various metrics computed using datatrove
- Search for datasets containing certain metrics

## Metrics View Usage:

1) Specify the Metrics location (Stats block `output_folder`) and click "Fetch Datasets"
2) Select the datasets you are interested in using the dropdown or the regex filter
3) Specify the Grouping (histogram/summary/fqdn/suffix) and the Metric name
4) Click "Render Metric", adjust the Graph settings and see the result

### Groupings:

- **histogram**: Creates a line plot of values with their frequencies.
    * normalize: Normalize the histogram to sum to 1
    * CDF: Show the plot as a cumulative distribution function
    * %: Show the plot as a percentage of the total
- **(fqdn/suffix)**: Creates a bar plot of the avg. values of the metric for each fully qualified domain name/domain suffix.
    * k: The number of groups to show
    * Top/Bottom/Most frequent (n_docs): Show the groups with the top k values, the bottom k values, or the most documents
- **summary**: Shows the average value of the given metric for every dataset
    * show_stds: Show the standard deviation from the mean for every dataset

## Reverse Metrics Search Usage:

To search for datasets containing a given grouping and metric, use the Reverse search section.
Specify the search parameters and click "Search". The matching datasets are listed in the "Found datasets" textbox.
You can modify the selection after the search by removing unwanted lines and clicking "Add to selection".

## Note:

The data might not be 100% representative, due to the sampling and the optimistic merging of the metrics (fqdn/suffix).
""",
    )