title: DockerTester
emoji: 📚
colorFrom: red
colorTo: blue
sdk: docker
pinned: false
license: mit
RedPajama Dataset API
A FastAPI-based Application for Exploring the RedPajama-Data-1T Dataset
Overview
This application provides an API for interacting with the RedPajama-Data-1T dataset. Built with FastAPI, it lets users retrieve data chunks, run keyword searches, and view dataset summaries. It is intended for researchers and developers working with large-scale language model datasets.
Features
1. Retrieve Dataset Chunks
Fetch smaller, manageable subsets of the dataset to explore or preprocess.
2. Search Data
Search for specific keywords in the dataset and retrieve relevant results.
3. Dataset Summary
Get an overview of the dataset’s structure, including available splits.
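The three features can be pictured with a small standard-library sketch. It is not the application's code: the in-memory `SAMPLE_SPLITS` mapping, the `text` field, the case-insensitive matching, and the function names `get_chunk`, `search_records`, and `summarize` are all assumptions made for illustration, standing in for the real dataset and its handlers.

```python
from itertools import islice
from typing import Dict, Iterable, List

# Hypothetical stand-in for the dataset: split name -> list of records.
SAMPLE_SPLITS: Dict[str, List[dict]] = {
    "train": [
        {"text": "An example document about llamas."},
        {"text": "Another record with no match."},
        {"text": "A second Example, differently cased."},
    ],
}


def get_chunk(records: Iterable[dict], chunk_size: int = 10) -> List[dict]:
    """Return at most chunk_size records without consuming the rest."""
    return list(islice(records, chunk_size))


def search_records(
    records: Iterable[dict], keyword: str, max_results: int = 10
) -> List[dict]:
    """Return up to max_results records whose text contains keyword.

    Matching is case-insensitive here; that is an assumption of this sketch.
    """
    needle = keyword.lower()
    matches = (r for r in records if needle in r.get("text", "").lower())
    return list(islice(matches, max_results))


def summarize(splits: Dict[str, List[dict]]) -> dict:
    """Describe the available splits and how many records each one holds."""
    return {
        "splits": sorted(splits),
        "num_records": {name: len(rows) for name, rows in splits.items()},
    }
```

Using `islice` keeps both retrieval and search lazy, so the same functions would also work on a streamed iterable where loading everything into memory is not practical.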
Endpoints
| Endpoint | Method | Parameters | Description |
| --- | --- | --- | --- |
| / | GET | None | Displays a welcome message. |
| /get_data/ | GET | chunk_size (int, default: 10) | Fetches a subset of the dataset. |
| /search_data/ | GET | keyword (str, required), max_results (int, default: 10) | Searches for entries containing the given keyword. |
| /data_summary/ | GET | None | Displays a summary of the dataset. |
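To make the parameter types and defaults listed above concrete, here is a rough standard-library sketch of how a query string maps onto them. FastAPI normally derives this parsing from the handler signatures, so this is only an illustration of the semantics; the function names `parse_get_data` and `parse_search_data` are hypothetical, and raising `ValueError` for a missing `keyword` is an assumption, not the application's documented behavior.

```python
from urllib.parse import parse_qs


def parse_get_data(query: str) -> dict:
    """Parse the /get_data/ query string; chunk_size defaults to 10."""
    params = parse_qs(query)
    return {"chunk_size": int(params.get("chunk_size", ["10"])[0])}


def parse_search_data(query: str) -> dict:
    """Parse the /search_data/ query string.

    keyword is required; max_results defaults to 10.
    """
    params = parse_qs(query)
    if "keyword" not in params:
        # Assumption: this sketch signals the missing parameter with ValueError.
        raise ValueError("keyword is required")
    return {
        "keyword": params["keyword"][0],
        "max_results": int(params.get("max_results", ["10"])[0]),
    }
```

For example, `parse_search_data("keyword=example&max_results=3")` yields the keyword `example` with a limit of 3, while `parse_get_data("")` falls back to a chunk size of 10.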
Getting Started
Prerequisites
• Python 3.8+
• Pip for dependency management
Setup
1. Clone the repository:
git clone https://huggingface.co/spaces/Canstralian/DockerTester
cd DockerTester
2. Install dependencies:
pip install -r requirements.txt
3. Run the application:
uvicorn app:app --host 0.0.0.0 --port 8000
4. Access the API in your browser or with a tool such as Postman at http://127.0.0.1:8000 (the host and port used in the examples below).
Example Usage
1. Retrieve a Small Chunk of Data
Fetch 5 examples from the dataset:
curl "http://127.0.0.1:8000/get_data/?chunk_size=5"
2. Search the Dataset
Search for the keyword example and return up to 3 results:
curl "http://127.0.0.1:8000/search_data/?keyword=example&max_results=3"
3. View Dataset Summary
Get an overview of available splits:
curl "http://127.0.0.1:8000/data_summary/"
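The same three requests can also be made from Python. The following is a minimal sketch using only the standard library; the helper names `build_url` and `fetch_json` are hypothetical, and it assumes the server is running locally on port 8000 and returns JSON bodies, which the README does not state explicitly.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://127.0.0.1:8000"  # same host and port as the curl examples


def build_url(path: str, **params) -> str:
    """Join the base URL, an endpoint path, and optional query parameters."""
    query = urlencode(params)
    return f"{BASE_URL}{path}?{query}" if query else f"{BASE_URL}{path}"


def fetch_json(path: str, **params):
    """Send a GET request and decode the body, assuming it is JSON."""
    with urlopen(build_url(path, **params), timeout=10) as response:
        return json.load(response)
```

With the server running, `fetch_json("/get_data/", chunk_size=5)`, `fetch_json("/search_data/", keyword="example", max_results=3)`, and `fetch_json("/data_summary/")` correspond to the three curl commands above.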
Technologies Used
• FastAPI: For building the API.
• Hugging Face Datasets: To access and process the RedPajama-Data-1T dataset.
• Uvicorn: For running the ASGI server.
• Python: Backend language.
Future Enhancements
• Add support for advanced filtering (e.g., by metadata or specific fields).
• Implement user authentication for restricted dataset access.
• Add visualization endpoints for dataset insights.
License
This project uses the Apache 2.0 License. Refer to the LICENSE file for more details.
Feel free to reach out for questions, feature requests, or contributions!