AI Safety Prize

Winners

Four final submissions received prizes. Below are summaries of the winning submissions.

Synthetic Cancer - Augmenting Worms with LLMs

David Zollikofer and Benjamin Zimmerman (CHF 7'500)

Abstract

With the rise of LLMs, a completely new threat landscape has emerged, ranging from LLMs used for fraudulent purposes (Erzberger, 2023) to initial ideas of integrating LLMs in malware (Labs, 2023), and even first prototypes for LLM-based worms (Bil, 2023).
We propose a novel type of metamorphic malware that utilizes LLMs in two key areas:
(I) code rewriting, and (II) targeted spreading combined with social engineering.

Disclosure The findings presented in this paper are intended solely for scientific and
research purposes. We are fully aware that this paper presents a malware type with great potential for abuse. We are publishing this in good faith and in an effort to raise awareness. We strictly prohibit any non-scientific use of these findings.

Universal Jailbreak Backdoors from Poisoned Human Feedback

Javier Rando and Florian Tramèr (CHF 1'500)

Abstract

Reinforcement Learning from Human Feedback (RLHF) is used to align large language
models to produce helpful and harmless responses. Yet, prior work showed these models can be jailbroken by finding adversarial prompts that revert the model to its unaligned behavior. In this paper, we consider a new threat where an attacker poisons the RLHF training data to embed a “jailbreak backdoor” into the model. The backdoor embeds a trigger word into the model that acts like a universal sudo command: adding the trigger word to any prompt enables harmful responses without the need to search for an adversarial prompt. Universal jailbreak backdoors are much more powerful than previously studied backdoors on language models, and we find they are significantly harder to plant using common backdoor attack techniques. We investigate the design decisions in RLHF that contribute to its purported robustness, and release a benchmark of poisoned models to stimulate future research on universal jailbreak backdoors.

Vulnerability of Face Recognition Systems to Template Inversion Attacks: From Simulation to Practical Evaluation

Hatef Otroshi Shahreza and Sébastien Marcel (CHF 1'500)

Abstract

Face recognition systems are increasingly being used in different applications. In
such systems, some features (also known as embeddings or templates) are extracted
from each face image. Then, the extracted templates are stored in the system’s
database during the enrollment stage and are later used for recognition. In this
work, we focus on template inversion attacks against face recognition systems,
where an adversary gains access to the templates stored in the database of the
system and tries to reconstruct underlying face from facial templates. We introduce
a novel method (dubbed GaFaR) to reconstruct 3D face from facial templates using
a geometry-aware generator network based on generative neural radiance fields
(GNeRF). We learn a mapping from facial templates to the intermediate latent
space of a pretrained generator network with a semi-supervised learning approach,
using real and synthetic images simultaneously, within a Generative Adversarial
Network (GAN) based framework. In addition, during the inference stage, we
also propose optimization on the camera parameters to generate face images to
improve the success attack rate. We evaluate the performance of our method in
the whitebox and blackbox attacks against state-of-the-art face recognition models
on different datasets. Moreover, we perform practical presentation attacks on real
face recognition systems using the digital screen replay and printed photographs,
and evaluate the vulnerability of face recognition systems to template inversion

attacks. As a matter of fact, the reconstructed face images jeopardize both secu-
rity and privacy of users: the adversary can use the reconstructed face image to

impersonate and enter the system (security threat). In addition, the reconstructed
face image not only reveal privacy-sensitive information of the enrolled user, such
as age, gender, ethnicity, etc, but also provide a good estimation of subject’s
face (privacy threat). Our experimental results demonstrate a critical vulnerability
in face recognition systems, and encourage the scientific community to develop
and propose the next generation of safe and protected face recognition systems.
The project page is available at: https://www.idiap.ch/paper/gafar

Adversarial Attacks on GPT-4 via Simple Random Search

Maksym Andriushchenko (CHF 500)

Abstract

In a recent announcement from December 15th, OpenAI made the predicted probabilities of their models available via API. In this short paper, we use them to implement an adversarial attack on the latest GPT-4 Turbo (gpt-4-1106-preview) model based on simple random search. We append a short adversarial string to a harmful request that is by default rejected by the model with high probability due to safety,
ethical or legal concerns. This is sufficient to “jailbreak” the model and make it answer the harmful or undesirable request. We show examples of corresponding conversations without and with adversarial suffixes. Interestingly, iterative optimization via simple random search is highly effective: we can iteratively increase the probability of a desired starting token from ≈ 1% to above 50%. We note that this is a very simple approach that only scratches the surface of what is possible with much more advanced optimization methods and, especially, full white-box access to the model. Finally, we discuss implications and potential defenses against such attacks. Our code notebook is available at https://github.com/max-andr/adversarial-random-search-gpt4.

AI Safety Prize

Winners

Synthetic Cancer - Augmenting Worms with LLMs

David Zollikofer and Benjamin Zimmerman (CHF 7'500)

Universal Jailbreak Backdoors from Poisoned Human Feedback

Javier Rando and Florian Tramèr (CHF 1'500)

Vulnerability of Face Recognition Systems to Template Inversion Attacks: From Simulation to Practical Evaluation

Hatef Otroshi Shahreza and Sébastien Marcel (CHF 1'500)

Adversarial Attacks on GPT-4 via Simple Random Search

Maksym Andriushchenko (CHF 500)

Partners