AI Safety Prize

Submission Guidelines

Vulnerabilities in large language models

LEARN MORE

Vulnerabilities in classical and specialised machine learning models

LEARN MORE

Language models

Vulnerabilities in large language models

Large language models are trained to adhere to safety and ethics guidelines (see also the background section below). Nonetheless, it has been demonstrated that they can produce undesired behaviour in violation of said guidelines. Submissions in this category reveal novel vulnerabilities in large language models which lead to undesired behaviour.

Instructions

1. Suggested reading: We suggest you read the GPT-4 system card (especially Section 2). You may also want to browse our FAQs and additional resources.

2. Brainstorm ideas for how to find vulnerabilities in a large language model, and how you might test those ideas.

3. Write down your best idea. This should include up to a 500-word summary, written in English, with:

a) Your idea for finding a vulnerability in a large language model.

b) Documentation that demonstrates the idea in action (e.g. how GPT-4 reacts to the vulnerability)

c) A suggestion of how the model provider addresses the vulnerability.

d) A description of the limitations of your idea, assumptions it relies on, and ways it could fail.

In addition to your summary, you may submit a PDF with a longer text with up to 5’000 words, and any further material you deem necessary (prompts, code, graphics etc.).

4. Submit your proposals by clicking on the button below. The three best final submissions will be awarded as follows: up to CHF 10’000 for the winner, up to CHF 2’000 each for second and third place.

SUBMIT

Background

Software vulnerabilities are weaknesses or flaws in computer programs that can be exploited by malicious actors to compromise the integrity, confidentiality, or availability of a system. Just as software can have vulnerabilities that hackers exploit, large language models possess vulnerabilities due to the complexity of language comprehension and generation. These vulnerabilities arise from factors such as biased training data, limited context understanding, susceptibility to adversarial attacks, and challenges in verifying the accuracy of information generated. Similar to how software vulnerabilities can lead to security breaches, the vulnerabilities in large language models can result in biased, misleading, or harmful outputs, emphasizing the need for ongoing research and safeguards to enhance their reliability and ethical use.

Before releasing GPT-4, OpenAI subjected the model to safety testing, which included harmful content that they tried to train the model to avoid. This includes:

Advice or encouragement for self harm behaviours
Graphic material such as erotic or violent content
Harassing, demeaning, and hateful content
Content useful for planning attacks or violence
Instructions for finding illegal content

The above list can be taken as examples of what vulnerabilities could be.

Alternatively, vulnerabilities could relate to evidence of the following misbehaviours:

Specification gaming: When AI models find shortcuts or loopholes to achieve their goals, often leading to unintended consequences.
Deception: When AI models purposefully mislead users by providing false information or making misleading claims.
Sandbagging: When AI models are more likely to support widely-held misconceptions if they perceive their user to be less knowledgeable.
Sycophancy: When AI models answer subjective questions in a way that aligns with and flatters the user’s stated beliefs, regardless of their accuracy.
Bias: When AI models unintentionally favour certain groups or ideas over others, leading to unfair outcomes.

Submission Criteria

Our judging panel will judge submissions based on the following criteria:

01

Novelty, severity and complexity of the vulnerability in a large language model. Novelty and complexity of the method used to elicit the vulnerability. We are most interested in the newer models per company or provider (e.g. GPT-4 or GPT-3.5, Claude, Llama-2). See the background section above for examples of vulnerabilities.

Novelty: Whether the judges have seen similar vulnerabilities, or methods used to elicit vulnerabilities, before.
Severity: How costly the consequences of this vulnerability would be if left unfixed.
Complexity: How technically advanced the judges deem the method to be.

02

Degree of dissemination of the model, globally and in Switzerland.

Potential deployment of the model in critical infrastructure.

03

Optional: suggestions on how to address the vulnerability. (No demonstration needed).

An example of the kind of finding that might win a prize could be a prompt to GPT-4 which elicits an explanation for how to assemble an explosive device. See our FAQ (below) for more information.

We require English as the submission language. You are welcome to reach our to us at info@safetyprize.ch if you have any further questions.

Legal recourse is excluded. Pour Demain is not responsible for any damage or liability in the context of the prize. This includes all stages of the prize, including work on identifying submissions. Pour Demain does not condone any activity that goes against terms of services or legal sources. See our official rules for more information.

Submissions are dealt with according to standard data privacy laws in Switzerland. See our legal disclaimer for more information. See also this factsheet for ethical hackers, provided by the Federal Data Protection and Information Commissioner.

Prizes and honourable mentions

Rough Draft Prize and honourable mentions:

The winner of the prize for best Rough Draft submission is Daniel Paleka, with his entry “Adversarial Decoding – Preliminary Exploration”. (CHF 800)

Three further Rough Draft submissions receive honourable mentions for their impressive ideas:

Fan Shi for his work on the control problem on robots.
David Zollikofer, Paul Loisel and Benjamin Zimmermann for their idea on augmenting worms with LLMs.
Hatef Otroshi Shahreza and Sébastien Marcel for their work on vulnerabilities of face recognition systems.

Main prizes: See the winners of the Final Prize here.

Prizes

Additional Resources

Machine learing

Vulnerabilities in classical and specialised machine learning models

Various classical and specialised machine learning models are currently deployed in Swiss and European critical infrastructure. This range of models also suffers from various vulnerabilities. Submissions in this category reveal novel vulnerabilities in a classical or specialised ML model, with a particular focus on those used in critical infrastructure such as healthcare (e.g. an image classifier in radiology), energy (e.g. an LSTM network for anomaly detection in power plants) and government (e.g. a decision tree for fraud detection).

Instructions

1. Suggested reading: We suggest you read this paper on adversarial examples. You may also want to browse our FAQs and additional resources.

2. Brainstorm ideas for how to find vulnerabilities in a classical or specialised ML model, and how you might test those ideas.

3. Write down your best idea. This should include up to a 500-word summary, written in English, with:

a) Your idea for finding a vulnerability in a classical or specialised ML model.

b) Documentation that demonstrates the idea in action (e.g. how the model reacts to the vulnerability)

c) A suggestion of how to address the vulnerability.

d) A description of the limitations of your idea, assumptions it relies on, and ways it could fail.

In addition to your summary, you may submit a PDF with a longer text with up to 5’000 words, and any further material you deem necessary (prompts, code, graphics etc.).

4. Submit your proposals by clicking on the button below. The three best final submissions will be awarded as follows: up to CHF 10’000 for the winner, up to CHF 2’000 each for second and third place.

SUBMIT

Background

Software vulnerabilities are weaknesses or flaws in computer programs that can be exploited by malicious actors to compromise the integrity, confidentiality, or availability of a system. Just as software can have vulnerabilities that hackers exploit, classical and specialised machine learning models come with their own vulnerabilities to consider. These include factors such as adversarial attacks, data poisoning or lack of robustness amongst others.

Similar to how software vulnerabilities can lead to security breaches, the vulnerabilities in classical and specialised ML models can result in biased, misleading, or harmful outputs, emphasizing the need for ongoing research and safeguards to enhance their reliability and ethical use.

To provide further clarity regarding the type of submission we are looking for, we provide a few examples:

An algorithm used in smart grids that might malfunction due to a lack of robustness against variations in input data.
An image recognition system that provides incorrect predictions or classifications due to an adversarial attack involving the introduction of carefully crafted perturbations to input data.

Clarification on the terms “classical ML” and specialised ML”:

For the scope of this AI Safety Prize, by “classical ML” we are referring to traditional machine learning algorithms and techniques that predate the widespread use of deep learning methods. This category includes algorithms like linear regression, decision trees, random forests, support vector machines, k-nearest neighbors, gradient boosting machines (such as XGBoost and LightGBM), and various clustering algorithms, amongst others.
For the scope of this AI Safety Prize, by “specialised ML” we are referring to ML systems which encompass a diverse range of models that serve distinct purposes and functions, differentiating them from the broad capabilities of large language models. These models are designed to excel in specific domains or tasks, leveraging focused architectures and tailored training methodologies. Some examples include:
- Image recognition: convolutional neural networks for object detection and classification in images and videos.
- Natural Language Processing (NLP) tasks: specialized models for sentiment analysis, named entity recognition, machine translation, etc.

Submission Criteria

Our judging panel will judge submissions based on the following criteria:

01

Novelty, severity and complexity of the vulnerability in a classical or specialised ML model. Novelty and complexity of the method used to elicit the vulnerability. See the background section above for examples of vulnerabilities.

Novelty: Whether the judges have seen similar vulnerabilities, or methods used to elicit vulnerabilities, before.
Severity: How costly the consequences of this vulnerability would be if left unfixed.
Complexity: How technically advanced the judges deem the method to be.

02

Degree of dissemination of the model, globally and in Switzerland.

Potential deployment of the model in critical infrastructure.

03

Optional: suggestions on how to address the vulnerability. (No demonstration needed).

See our FAQ (below) for more information. We require English as the submission language. You are welcome to reach our to us at info@safetyprize.ch if you have any further questions.

Legal recourse is excluded. Pour Demain is not responsible for any damage or liability in the context of the prize. This includes all stages of the prize, including work on identifying submissions. Pour Demain does not condone any activity that goes against terms of services or legal sources. See our official rules for more information.

Submissions are dealt with according to standard data privacy laws in Switzerland. See our legal disclaimer for more information. See also this factsheet for ethical hackers, provided by the Federal Data Protection and Information Commissioner.

Prizes and honourable mentions

Rough Draft Prize and honourable mentions:

The winner of the prize for best Rough Draft submission is Daniel Paleka, with his entry “Adversarial Decoding – Preliminary Exploration”. (CHF 800)

Three further Rough Draft submissions receive honourable mentions for their impressive ideas:

Fan Shi for his work on the control problem on robots.
David Zollikofer, Paul Loisel and Benjamin Zimmermann for their idea on augmenting worms with LLMs.
Hatef Otroshi Shahreza and Sébastien Marcel for their work on vulnerabilities of face recognition systems.

Main prizes: See the winners of the Final Prize here.

Additional Resources

FAQ

While companies are working tirelessly to develop ever more powerful AI models and applications, leading AI experts from research and business are surprised and increasingly concerned by the speed of these developments and warn of the potential negative impact on society.