metadata

title: DockerTester
emoji: 📚
colorFrom: red
colorTo: blue
sdk: docker
pinned: false
license: mit

RedPajama Dataset API

A FastAPI-based Application for Exploring the RedPajama-Data-1T Dataset

Overview

This application provides an API for interacting with the RedPajama-Data-1T dataset. Built with FastAPI, it lets users retrieve data chunks, run keyword searches, and view dataset summaries. It is intended for researchers and developers working with large-scale language model datasets.

Features

1. Retrieve Dataset Chunks: Fetch smaller, manageable subsets of the dataset to explore or preprocess.
2. Search Data: Search for specific keywords in the dataset and retrieve relevant results.
3. Dataset Summary: Get an overview of the dataset’s structure, including available splits.
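The first two features can be illustrated with a minimal, standard-library sketch. It is not the application's actual code: the helper names `take_chunk` and `search`, the `text` field on each record, and the case-insensitive matching are all assumptions made for illustration.

```python
from itertools import islice
from typing import Dict, Iterable, List

Record = Dict[str, str]  # assumed record shape: {"text": "..."}


def take_chunk(records: Iterable[Record], chunk_size: int = 10) -> List[Record]:
    """Return at most chunk_size records, reading no further than needed."""
    return list(islice(records, chunk_size))


def search(records: Iterable[Record], keyword: str, max_results: int = 10) -> List[Record]:
    """Return up to max_results records whose text contains keyword.

    Matching is case-insensitive here; that is an assumption, not documented behavior.
    """
    needle = keyword.lower()
    matches = (r for r in records if needle in r.get("text", "").lower())
    return list(islice(matches, max_results))


# Hypothetical in-memory stand-in for the dataset.
sample = [
    {"text": "An example sentence."},
    {"text": "Nothing to see here."},
    {"text": "Another EXAMPLE entry."},
]
```

Because both helpers use `islice`, they stop reading as soon as enough records are collected, which matters when the source is a large streamed dataset rather than a list.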

Endpoints

| Endpoint | Method | Parameters | Description |
|---|---|---|---|
| / | GET | None | Displays a welcome message. |
| /get_data/ | GET | chunk_size (int, default: 10) | Fetches a subset of the dataset. |
| /search_data/ | GET | keyword (str, required), max_results (int, default: 10) | Searches for entries containing the given keyword. |
| /data_summary/ | GET | None | Displays a summary of the dataset. |
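To make the parameter types and defaults concrete, here is a small standard-library sketch that applies them to a raw query string. FastAPI normally performs this kind of parsing from type hints, so this only mimics the documented behavior; `ENDPOINT_PARAMS`, `REQUIRED`, and `resolve_params` are hypothetical names, and the error handling is an assumption.

```python
from urllib.parse import parse_qs

# Sentinel marking a parameter that has no default and must be supplied.
REQUIRED = object()

# Parameter spec transcribed from the endpoint list: name -> (type, default).
ENDPOINT_PARAMS = {
    "/": {},
    "/get_data/": {"chunk_size": (int, 10)},
    "/search_data/": {"keyword": (str, REQUIRED), "max_results": (int, 10)},
    "/data_summary/": {},
}


def resolve_params(path: str, query: str = "") -> dict:
    """Apply the documented types and defaults to a raw query string."""
    spec = ENDPOINT_PARAMS[path]  # raises KeyError for an unknown endpoint
    raw = parse_qs(query)
    resolved = {}
    for name, (cast, default) in spec.items():
        if name in raw:
            # parse_qs returns a list per key; use the first value.
            resolved[name] = cast(raw[name][0])
        elif default is REQUIRED:
            raise ValueError(f"missing required parameter: {name}")
        else:
            resolved[name] = default
    return resolved
```

For instance, `resolve_params("/get_data/")` falls back to the default `chunk_size`, while calling `resolve_params("/search_data/")` without a `keyword` raises `ValueError`.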

Getting Started

Prerequisites

• Python 3.8+
• pip for dependency management

Setup

1. Clone the repository:

git clone https://huggingface.co/spaces/Canstralian/DockerTester
cd DockerTester

2.	Install dependencies:

pip install -r requirements.txt

3.	Run the application:

uvicorn app:app --host 0.0.0.0 --port 8000

4.	Access the API in your browser or using tools like Postman at:

http://127.0.0.1:8000

Example Usage

1. Retrieve a Small Chunk of Data

Fetch 5 examples from the dataset:

curl "http://127.0.0.1:8000/get_data/?chunk_size=5"

2.	Search the Dataset

Search for the keyword example and return up to 3 results:

curl "http://127.0.0.1:8000/search_data/?keyword=example&max_results=3"

3.	View Dataset Summary

Get an overview of available splits:

curl "http://127.0.0.1:8000/data_summary/"
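The same requests can be issued from Python. The sketch below uses only the standard library and assumes the server is running at the local address shown above and that each endpoint returns JSON (FastAPI's default, but not confirmed here). `build_request` and `call_api` are hypothetical helper names.

```python
import json
from typing import Any, Mapping, Optional
from urllib.parse import urlencode
from urllib.request import Request, urlopen

BASE_URL = "http://127.0.0.1:8000"  # assumed local address of the running app


def build_request(path: str, params: Optional[Mapping[str, Any]] = None,
                  base_url: str = BASE_URL) -> Request:
    """Build a GET request for an endpoint path plus optional query parameters."""
    query = urlencode(params or {})
    url = f"{base_url}{path}" + (f"?{query}" if query else "")
    return Request(url, headers={"Accept": "application/json"}, method="GET")


def call_api(path: str, params: Optional[Mapping[str, Any]] = None,
             base_url: str = BASE_URL, timeout: float = 10.0) -> Any:
    """Send the request and decode the body as JSON (assumes a JSON response)."""
    with urlopen(build_request(path, params, base_url), timeout=timeout) as response:
        return json.load(response)
```

With the server running, `call_api("/get_data/", {"chunk_size": 5})` corresponds to the first curl command, and `call_api("/search_data/", {"keyword": "example", "max_results": 3})` to the second.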

Technologies Used

• FastAPI: For building the API.
• Hugging Face Datasets: To access and process the RedPajama-Data-1T dataset.
• Uvicorn: For running the ASGI server.
• Python: Backend language.

Future Enhancements

• Add support for advanced filtering (e.g., by metadata or specific fields).
• Implement user authentication for restricted dataset access.
• Add visualization endpoints for dataset insights.

License

This project uses the Apache 2.0 License. Refer to the LICENSE file for more details.

Feel free to reach out for questions, feature requests, or contributions!