nguyenbh committed
Commit
c36d6a2
1 Parent(s): 8f3232c

Update README.md

Files changed (1): README.md +14 -0
README.md CHANGED
@@ -247,6 +247,20 @@ Developers should apply responsible AI best practices, including mapping, measur
 
  * Generation of Harmful Content: Developers should assess outputs for their context and use available safety classifiers or custom solutions appropriate for their use case.
  * Misuse: Other forms of misuse such as fraud, spam, or malware production may be possible, and developers should ensure that their applications do not violate applicable laws and regulations.

+ ## Safety Evaluation and Red-Teaming
+
+ We leveraged a variety of evaluation techniques, including red teaming, adversarial conversation simulation, and multilingual safety evaluation benchmark datasets, to
+ evaluate the Phi-3.5 models' propensity to produce undesirable outputs across multiple languages and risk categories.
+ Several approaches were used to compensate for the limitations of any one approach alone. Findings across the various evaluation methods indicate that safety
+ post-training, performed as detailed in the [Phi-3 Safety Post-Training paper](https://arxiv.org/pdf/2407.13833), had a positive impact across multiple languages and risk categories, as observed by
+ refusal rates (refusal to produce undesirable outputs) and robustness to jailbreak techniques. Note, however, that while comprehensive red-team evaluations were conducted
+ across all models in the prior release of Phi models, red teaming for this release largely focused on Phi-3.5 MOE across multiple languages and risk categories, as
+ it is the largest and most capable of the three models. Details on prior red-team evaluations across Phi models can be found in the [Phi-3 Safety Post-Training paper](https://arxiv.org/pdf/2407.13833).
+ For this release, insights from red teaming indicate that the models may refuse to generate undesirable outputs in English even when the request for undesirable output
+ is in another language. Models may also be more susceptible to longer multi-turn jailbreak techniques across both English and non-English languages. These findings
+ highlight the need for industry-wide investment in the development of high-quality safety evaluation datasets across multiple languages, including low-resource languages,
+ and risk areas that account for cultural nuances where those languages are spoken.
+
  ## Software
  * [PyTorch](https://github.com/pytorch/pytorch)
  * [Transformers](https://github.com/huggingface/transformers)
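The added section reports safety results as refusal rates, i.e. the fraction of prompts for which the model refuses to produce the undesirable output, broken down by language. As a minimal sketch of how such a tally might be computed, the snippet below aggregates hypothetical labeled evaluation results; the `(language, risk_category, refused)` tuple schema and the `refusal_rates` helper are illustrative assumptions, not the actual Phi-3.5 evaluation harness.

```python
from collections import defaultdict

def refusal_rates(results):
    """Compute per-language refusal rates from labeled eval results.

    `results` is an iterable of (language, risk_category, refused) tuples --
    a hypothetical schema for illustration, not the real evaluation format.
    Returns a dict mapping language -> fraction of prompts refused.
    """
    counts = defaultdict(lambda: [0, 0])  # language -> [refusals, total]
    for language, _risk_category, refused in results:
        counts[language][0] += int(refused)
        counts[language][1] += 1
    return {lang: refusals / total for lang, (refusals, total) in counts.items()}

# Toy illustration with made-up labels:
sample = [
    ("en", "harmful_content", True),
    ("en", "harmful_content", True),
    ("en", "misuse", False),
    ("fr", "harmful_content", True),
]
rates = refusal_rates(sample)
print(rates)
```

A real evaluation would also track rates per risk category and per jailbreak technique, which is why the tuples carry a `risk_category` field even though this sketch only groups by language.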