The evaluation employs datasets from the WMDP benchmark for biology and cybersecurity. The threat model assumes white-box access to an unlearned model, allowing weight modification and activation ...
Some results have been hidden because they may be inaccessible to you