title: "FAKE NEWS DETECTION" | |
emoji: "π" | |
colorFrom: "#FF69B4" | |
colorTo: "#FF1493" | |
sdk: "gradio" | |
sdk_version: "5.8.0" | |
app_file: "application.py" | |
pinned: false | |
# [Text] SimLLM: Detecting Sentences Generated by Large Language Models Using Similarity between the Generation and its Re-Generation

## Getting Started
1. **Clone the repository:**
   ```bash
   git clone https://github.com/Tokyo-Techies/prj-nict-ai-content-detection
   ```
2. **Set up the environment:**
   Using a virtual environment:
   ```bash
   python -m venv .venv
   source .venv/bin/activate
   ```
3. **Install dependencies:**
   - Torch: follow the platform-specific instructions at https://pytorch.org/get-started/locally/
   - Other dependencies:
     ```bash
     pip install -r requirements.txt
     ```
4. **API keys** (optional)
   - Obtain API keys for the corresponding models and insert them into the `SimLLM.py` file (a hypothetical sketch of what this might look like follows this list):
     - ChatGPT: [OpenAI API](https://openai.com/index/openai-api/)
     - Gemini: [Google Gemini API](https://ai.google.dev/gemini-api/docs/api-key)
     - Other LLMs: [Together API](https://api.together.ai/)
5. **Run the project:**
   - Text only:
     ```bash
     python SimLLM.py
     ```
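For step 4, the keys are placed directly in `SimLLM.py`. The variable names below are purely hypothetical placeholders for illustration; check `SimLLM.py` itself for the actual names and locations the script expects.

```python
# Hypothetical placeholders -- the real variable names in SimLLM.py may differ.
OPENAI_API_KEY = "sk-..."      # ChatGPT (OpenAI API)
GEMINI_API_KEY = "AIza..."     # Gemini (Google Gemini API)
TOGETHER_API_KEY = "..."       # other LLMs served via the Together API
```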
### Parameters
- `LLMs`: List of large language models to use. Available models include 'ChatGPT', 'Yi', 'OpenChat', 'Gemini', 'LLaMa', 'Phi', 'Mixtral', 'QWen', 'OLMO', 'WizardLM', and 'Vicuna'. Default is `['ChatGPT', 'Yi', 'OpenChat']`.
- `train_indexes`: Indexes (into the `LLMs` list) of the models used for training. Default is `[0, 1, 2]`.
- `test_indexes`: Indexes (into the `LLMs` list) of the models used for testing. Default is `[0]`.
- `num_samples`: Number of samples. Default is `5000`.
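The index parameters refer to positions in the `LLMs` list. The snippet below illustrates that mapping as inferred from the default values above; it is an illustration only, not code from the repository.

```python
# Illustration of how --train_indexes / --test_indexes select models
# from the --LLMs list (inferred from the default values above).
llms = ["ChatGPT", "Yi", "OpenChat"]   # --LLMs
train_indexes = [0, 1, 2]              # --train_indexes
test_indexes = [0]                     # --test_indexes

train_models = [llms[i] for i in train_indexes]
test_models = [llms[i] for i in test_indexes]
print(train_models)  # ['ChatGPT', 'Yi', 'OpenChat']
print(test_models)   # ['ChatGPT']
```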
### Examples
- Running with default parameters:
  `python SimLLM.py`
- Running with custom parameters:
  `python SimLLM.py --LLMs ChatGPT --train_indexes 0 --test_indexes 0`
## Dataset
The `dataset.csv` file contains human-written texts along with texts generated by 12 large language models:
ChatGPT, GPT-4o, Yi, OpenChat, Gemini, LLaMa, Phi, Mixtral, QWen, OLMO, WizardLM, and Vicuna.
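A quick way to inspect the dataset is to load it with pandas. This is a minimal sketch, assuming `dataset.csv` sits in the repository root; the exact column layout can be checked from the output.

```python
# Minimal sketch for inspecting dataset.csv; requires pandas.
import pandas as pd

df = pd.read_csv("dataset.csv")
print(df.shape)             # number of rows and columns
print(df.columns.tolist())  # column names
print(df.head())            # first few rows
```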
## Citation
```bibtex
@inproceedings{nguyen2024SimLLM,
  title={SimLLM: Detecting Sentences Generated by Large Language Models Using Similarity between the Generation and its Re-generation},
  author={Nguyen-Son, Hoang-Quoc and Dao, Minh-Son and Zettsu, Koji},
  booktitle={The Conference on Empirical Methods in Natural Language Processing},
  year={2024}
}
```
## Acknowledgements
- BARTScore: [BARTScore GitHub Repository](https://github.com/neulab/BARTScore)