CSET (2025): Research on Methods for Assessing the Potential for Malicious Use of Advanced AI Systems
The simplest way to test whether there is a risk of system X being used for malicious behavior Y is to see if X can do Y, just once. Red-teamers and stress testers adopt an adversary's mindset and probe an AI system for "identification of harmful capabilities, outputs, or infrastructure threats."6 If a model does not produce harmful behavior on the first try, the next step is to iterate. Researchers use different techniques, including improving prompts (the input fed to the model).
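
As a rough illustration of this test-once-then-iterate loop, the sketch below cycles through successively refined prompts and stops at the first output a reviewer flags as harmful. The names (query_model, flags_harmful) and the toy prompt variants are assumptions made for illustration; they are not drawn from the report or from any particular model API.

# A minimal sketch, assuming a callable model interface and a harmfulness check.
# All identifiers here are illustrative, not part of the CSET report.

from typing import Callable, Iterable, Optional

def red_team_probe(
    query_model: Callable[[str], str],      # sends a prompt to the system under test
    flags_harmful: Callable[[str], bool],   # reviewer or classifier judging the output
    prompt_variants: Iterable[str],         # successively refined prompts for behavior Y
) -> Optional[str]:
    """Return the first prompt that elicits harmful behavior, or None."""
    for prompt in prompt_variants:
        output = query_model(prompt)
        if flags_harmful(output):
            # A single success is enough to show the risk exists.
            return prompt
    return None

if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end.
    def query_model(prompt: str) -> str:
        return f"[model refuses]: {prompt}"

    def flags_harmful(output: str) -> bool:
        return "refuses" not in output

    variants = ["direct request", "reworded request", "request framed as fiction"]
    print(red_team_probe(query_model, flags_harmful, variants))

The point of the loop is that a negative result on any single prompt says little; the tester keeps refining the input until either the harmful behavior appears once or the variants are exhausted.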