SA-Yur-or commited on
Commit
acb2824
1 Parent(s): b9844ba

[fix]: change license and remove code from preformance table

Files changed (2)
  1. LICENSE +24 -0
  2. README.md +63 -17
LICENSE ADDED
@@ -0,0 +1,24 @@
+ SuperAnnotate AI Public License (SAIPL)
+ Version 1.0
+ This License governs the use, modification, and distribution of AI models provided by SuperAnnotate AI on the Hugging Face platform, ensuring open access and collaboration in the AI community under the principles of the GNU Affero General Public License.
+ 1. Definitions
+ "This License" refers to version 1.0 of the SuperAnnotate AI Public License.
+ "The Model" refers to the AI model, including scripts, data, documentation, and any associated media provided by SuperAnnotate AI under this License.
+ "Modify" means to adapt or change the Model to create a derivative work.
+ "You" means any individual or entity exercising permissions granted by this License.
+ 2. Source Code
+ The source code includes all the contents that SuperAnnotate AI provides to modify the Model, including trained models, training scripts, and relevant data sets.
+ You must make all source code of the Model, and of any modifications, available to any user interacting with the Model remotely through a network.
+ 3. Commercial Use
+ Commercial use of the Model is permitted under this License. You may charge a fee for the physical act of transferring a copy, and you may also offer support or warranty protection for a fee.
+ If you use the Model commercially, you must disclose the source and make the entire source code available to your users, either commercially or freely under the terms of this License.
+ 4. Copyleft
+ All modified versions of the Model, and any derivative works thereof, must be licensed under this License or a compatible open source license that includes the same conditions, particularly regarding the availability of source code.
+ 5. Network Use is Distribution
+ If you make the Model or any modified version available to interact with users over a network, you must provide all users with access to the source code of the Model and any modifications, under the terms of this License.
+ 6. Attribution
+ You must give appropriate credit to SuperAnnotate AI, provide a link to this license, and indicate if changes were made. Such notice may not be removed or altered from any source distribution.
+ 7. Disclaimer of Warranty and Limitation of Liability
+ There is no warranty for the Model, to the extent permitted by applicable law. Except when otherwise stated in writing, the copyright holders and/or other parties provide the Model "as is" without warranty of any kind, either expressed or implied, including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose. The entire risk as to the quality and performance of the Model is with you. Should the Model prove defective, you assume the cost of all necessary servicing, repair, or correction.
+ In no event unless required by applicable law or agreed to in writing will any copyright holder, or any other party who modifies and/or conveys the Model as permitted above, be liable to you for damages, including any general, special, incidental, or consequential damages arising out of the use or inability to use the Model (including but not limited to loss of data or data being rendered inaccurate or losses sustained by you or third parties or a failure of the Model to operate with any other programs), even if such holder or other party has been advised of the possibility of such damages.
+ End of Terms and Conditions
README.md CHANGED
@@ -1,5 +1,5 @@
  ---
- license: apache-2.0
  language:
  - en
  pipeline_tag: text-classification
@@ -11,7 +11,6 @@ tags:
  datasets:
  - Hello-SimpleAI/HC3
  - tum-nlp/IDMGSP
- - mlabonne/Evol-Instruct-Python-26k
  library_name: transformers
  ---
 
@@ -28,7 +27,8 @@ Fine-Tuned RoBERTa Large<br/>
  ## Description
 
  The model designed to detect generated/synthetic text. \
- At the moment, such functionality is critical for check your training data and detecting fraud and cheating in scientific and educational areas.
 
  ## Model Details
 
@@ -43,18 +43,20 @@ At the moment, such functionality is critical for check your training data and d
 
  - **Repository:** [GitHub](https://github.com/superannotateai/generated_text_detector) for HTTP service
 
- ### Training data
 
- The training data was sourced from three open datasets with different proportions and underwent filtering:
 
- 1. [**HC3**](https://huggingface.co/datasets/Hello-SimpleAI/HC3) | **50%**
- 1. [**IDMGSP**](https://huggingface.co/datasets/tum-nlp/IDMGSP) | **30%**
- 1. [**Evol-Instruct-Python-26k**](https://huggingface.co/datasets/mlabonne/Evol-Instruct-Python-26k) | **20%**
 
- As a result, the training dataset contained approximately ***25k*** pairs of text-label with an approximate balance of classes. \
  It's worth noting that the dataset's texts follow a logical structure: \
  Human-written and model-generated texts refer to a single prompt/instruction, though the prompts themselves were not used during training.
 
  ### Peculiarity
 
  During training, one of the priorities was not only maximizing the quality of predictions but also avoiding overfitting and obtaining an adequately confident predictor. \
@@ -64,7 +66,51 @@ We are pleased to achieve the following state of model calibration:
 
  ## Usage
 
- TODO
 
  ## Performance
 
@@ -73,10 +119,10 @@ However, there are no direct intersections of samples between the training data
  The benchmark comprises 1k samples, with 200 samples per category. \
  The model's performance is compared with open-source solutions and popular API detectors in the table below:
 
- | Model/API | Wikipedia | Reddit QA | SA instruction | Papers | Code | Average |
- |--------------------------------------------------------------------------------------------------|----------:|----------:|---------------:|-------:|-------:|--------:|
- | [Hello-SimpleAI](https://huggingface.co/Hello-SimpleAI/chatgpt-detector-roberta) | **0.97**| 0.95 | 0.82 | 0.69 | 0.47 | 0.78 |
- | [RADAR](https://huggingface.co/spaces/TrustSafeAI/RADAR-AI-Text-Detector) | 0.47 | 0.84 | 0.59 | 0.82 | 0.65 | 0.68 |
- | [GPTZero](https://gptzero.me) | 0.72 | 0.79 | **0.90**| 0.67 | 0.74 | 0.76 |
- | [Originality.ai](https://originality.ai) | 0.91 | **0.97**| 0.77 |**0.93**| 0.46 | 0.81 |
- | [LLM content detector](https://huggingface.co/SuperAnnotate/roberta-large-llm-content-detector) | 0.88 | 0.95 | 0.84 | 0.81 |**0.96**| **0.89**|
  ---
+ license: other
  language:
  - en
  pipeline_tag: text-classification
 
  datasets:
  - Hello-SimpleAI/HC3
  - tum-nlp/IDMGSP
  library_name: transformers
  ---
 
 
  ## Description
 
  The model designed to detect generated/synthetic text. \
+ At the moment, such functionality is critical for determining the authorship of a text: it matters when checking your training data and when detecting fraud and cheating in scientific and educational settings. \
+ A couple of articles about this problem: [*Problems with Synthetic Data*](https://www.aitude.com/problems-with-synthetic-data/) | [*Risk of LLMs in Education*](https://publish.illinois.edu/teaching-learninghub-byjen/risk-of-llms-in-education/)
 
  ## Model Details
 
 
 
  - **Repository:** [GitHub](https://github.com/superannotateai/generated_text_detector) for HTTP service
 
+ ### Training Data
 
+ The training data was sourced from two open datasets in different proportions and underwent filtering:
 
+ 1. [**HC3**](https://huggingface.co/datasets/Hello-SimpleAI/HC3) | **63%**
+ 1. [**IDMGSP**](https://huggingface.co/datasets/tum-nlp/IDMGSP) | **37%**
 
+ As a result, the training dataset contained approximately ***20k*** text-label pairs with an approximate balance of classes. \
  It's worth noting that the dataset's texts follow a logical structure: \
  Human-written and model-generated texts refer to a single prompt/instruction, though the prompts themselves were not used during training.
 
+ > [!NOTE]
+ > Furthermore, key n-grams (n ranging from 2 to 5) that exhibited the highest correlation with the target labels were identified with the chi-squared test and then removed from the training data.
+
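The n-gram filtering described in the note above can be sketched roughly as follows. This is an editorial illustration only, not the authors' pipeline: the whitespace tokenization, the lowercasing, the `top_k` cut-off, and every function name here are assumptions. It scores each n-gram's presence against the binary label with the 2x2 chi-squared statistic and strips the highest-scoring n-grams from every text.

```python
from collections import Counter


def ngrams(tokens, n_min=2, n_max=5):
    """Yield every n-gram (as a tuple of tokens) with n_min <= n <= n_max."""
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def chi2_scores(texts, labels):
    """Chi-squared statistic of each n-gram's presence against a 0/1 label."""
    n_docs, n_pos = len(texts), sum(labels)
    n_neg = n_docs - n_pos
    pos_df, neg_df = Counter(), Counter()  # document frequency per class
    for text, label in zip(texts, labels):
        seen = set(ngrams(text.lower().split()))
        (pos_df if label == 1 else neg_df).update(seen)

    scores = {}
    for gram in set(pos_df) | set(neg_df):
        a, b = pos_df[gram], neg_df[gram]   # n-gram present: label 1 / label 0
        c, d = n_pos - a, n_neg - b         # n-gram absent:  label 1 / label 0
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        scores[gram] = n_docs * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


def strip_ngrams(tokens, banned):
    """Drop every occurrence of a banned n-gram, trying the longest match first."""
    max_n = max((len(g) for g in banned), default=0)
    out, i = [], 0
    while i < len(tokens):
        for n in range(max_n, 1, -1):
            if tuple(tokens[i:i + n]) in banned:
                i += n
                break
        else:
            out.append(tokens[i])
            i += 1
    return out


def remove_top_ngrams(texts, labels, top_k=10):
    """Remove the top_k n-grams most associated with the label from every text."""
    scores = chi2_scores(texts, labels)
    ranked = sorted(scores, key=lambda g: (-scores[g], g))  # ties broken alphabetically
    banned = set(ranked[:top_k])
    return [" ".join(strip_ngrams(t.lower().split(), banned)) for t in texts]
```

Calling `remove_top_ngrams(texts, labels)` on a list of texts with 0/1 labels returns the cleaned texts. The point of such a step is that phrases which almost perfectly give away the class cannot be used by the classifier as a shortcut.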
  ### Peculiarity
 
  During training, one of the priorities was not only maximizing the quality of predictions but also avoiding overfitting and obtaining an adequately confident predictor. \
 
 
  ## Usage
 
+ **Prerequisites**: install `generated_text_detector` \
+ Run the following command: `pip install generated_text_detector@git+https://github.com/superannotateai/generated_text_detector@v1.0.0`
+
+ ```python
+ import torch
+ from generated_text_detector.utils.model.roberta_classificator import RobertaClassificator
+ from transformers import AutoTokenizer
+
+
+ # Load the fine-tuned classifier and its tokenizer from the Hugging Face Hub
+ model = RobertaClassificator.from_pretrained("SuperAnnotate/roberta-large-llm-content-detector")
+ tokenizer = AutoTokenizer.from_pretrained("SuperAnnotate/roberta-large-llm-content-detector")
+
+ text_example = "It's not uncommon for people to develop allergies or intolerances to certain foods as they get older. It's possible that you have always had a sensitivity to lactose (the sugar found in milk and other dairy products), but it only recently became a problem for you. This can happen because our bodies can change over time and become more or less able to tolerate certain things. It's also possible that you have developed an allergy or intolerance to something else that is causing your symptoms, such as a food additive or preservative. In any case, it's important to talk to a doctor if you are experiencing new allergy or intolerance symptoms, so they can help determine the cause and recommend treatment."
+
+ # Tokenize the text, truncating it to at most 512 tokens
+ tokens = tokenizer.encode_plus(
+     text_example,
+     add_special_tokens=True,
+     max_length=512,
+     padding='longest',
+     truncation=True,
+     return_token_type_ids=True,
+     return_tensors="pt"
+ )
+
+ # The model returns a tuple; the second element holds the logits
+ _, logits = model(**tokens)
+
+ # Squash the single logit into a score between 0 and 1
+ proba = torch.sigmoid(logits).squeeze(1).item()
+
+ print(proba)
+ ```
+
+ ## Training Details
+
+ A custom architecture was chosen for its ability to perform binary classification while providing a single model output, as well as for its customizable settings for smoothing integrated into the loss function.
+
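To make the idea of smoothing built into the loss more concrete, here is a minimal sketch of binary cross-entropy on a single logit with a smoothed target. The README does not spell out the exact formula, so the smoothing rule used here (moving the hard 0/1 target towards 0.5 by `smoothing`) and the function name `smoothed_bce` are assumptions for illustration.

```python
import math


def smoothed_bce(logit, label, smoothing=0.1):
    """Binary cross-entropy on one logit against a label-smoothed target.

    Assumed smoothing rule: 1 -> 1 - smoothing / 2, 0 -> smoothing / 2.
    """
    target = label * (1.0 - smoothing) + 0.5 * smoothing
    # log(sigmoid(x)), computed in a numerically stable way
    if logit >= 0:
        log_p = -math.log1p(math.exp(-logit))
    else:
        log_p = logit - math.log1p(math.exp(logit))
    # log(1 - sigmoid(x)) = log(sigmoid(x)) - x
    log_not_p = log_p - logit
    return -(target * log_p + (1.0 - target) * log_not_p)
```

With `smoothing=0.1` and `label=1` the loss is smallest where the predicted probability equals 0.95, so pushing the logit ever higher is penalized rather than rewarded. That is one way a loss can discourage overconfident predictions.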
+ **Training Arguments**:
+
+ - **Base Model**: [FacebookAI/roberta-large](https://huggingface.co/FacebookAI/roberta-large)
+ - **Epochs**: 10
+ - **Learning Rate**: 5e-04
+ - **Weight Decay**: 0.05
+ - **Label Smoothing**: 0.1
+ - **Warmup Epochs**: 4
+ - **Optimizer**: SGD
+ - **Scheduler**: Linear schedule with warmup
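As an illustration of what a linear schedule with warmup does to the learning rate under the arguments listed above, the following sketch computes the rate at a given optimizer step. It follows the common warmup-then-linear-decay shape (the same shape as `get_linear_schedule_with_warmup` in `transformers`), but the README does not say which implementation was used. The function name and the figure of 100 steps per epoch in the usage note are hypothetical.

```python
def linear_warmup_lr(step, total_steps, warmup_steps, base_lr=5e-4):
    """Learning rate at `step`: linear ramp up to base_lr, then linear decay to 0."""
    if step < warmup_steps:
        # Warmup phase: grow from 0 to base_lr
        return base_lr * step / max(1, warmup_steps)
    # Decay phase: shrink from base_lr to 0 at total_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)
```

For example, with a hypothetical 100 steps per epoch, 10 epochs give `total_steps=1000` and 4 warmup epochs give `warmup_steps=400`: the rate climbs to 5e-04 at step 400 and then falls linearly to zero at step 1000.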
114
 
115
  ## Performance
116
 
 
119
  The benchmark comprises 1k samples, with 200 samples per category. \
120
  The model's performance is compared with open-source solutions and popular API detectors in the table below:
121
 
122
+ | Model/API | Wikipedia | Reddit QA | SA instruction | Papers | Average |
123
+ |--------------------------------------------------------------------------------------------------|----------:|----------:|---------------:|-------:|--------:|
124
+ | [Hello-SimpleAI](https://huggingface.co/Hello-SimpleAI/chatgpt-detector-roberta) | **0.97**| 0.95 | 0.82 | 0.69 | 0.86 |
125
+ | [RADAR](https://huggingface.co/spaces/TrustSafeAI/RADAR-AI-Text-Detector) | 0.47 | 0.84 | 0.59 | 0.82 | 0.68 |
126
+ | [GPTZero](https://gptzero.me) | 0.72 | 0.79 | **0.90**| 0.67 | 0.77 |
127
+ | [Originality.ai](https://originality.ai) | 0.91 | **0.97**| 0.77 |**0.93**|**0.89** |
128
+ | [LLM content detector](https://huggingface.co/SuperAnnotate/roberta-large-llm-content-detector) | 0.88 | 0.95 | 0.84 | 0.81 | 0.87 |