GPT-4 with vision (GPT-4V) enables users to instruct GPT-4 to analyze image inputs provided by the user. Incorporating additional modalities (such as image inputs) into LLMs is a key frontier in artificial intelligence research and development.
Similar to GPT-4, the GPT-4V pre-trained model was first trained to predict the next word in a document, using a large dataset of text and image data from the Internet as well as licensed sources of data. It was then fine-tuned with additional data, using RLHF, to produce outputs that are preferred by human trainers.
The GPT-4V(ision) system card outlines the safety properties of GPT-4V.
Evaluations
Performance on sensitive trait attribution across demographics
- Study focused on performance parity across demographics in sensitive trait attribution.
- Demographics include gender, age, and race recognition.
- Publicly available datasets like FairFace and Labeled Faces in the Wild were used for evaluation.
- Narrow computer vision systems often exhibit biases in facial recognition based on race.
- OpenAI has implemented refusals for most sensitive trait requests.
Person identification evaluations
- Evaluation focused on model’s ability to identify people in photos
- Datasets included celebrities, public servants, politicians, semi-private, and private individuals
- Public figure datasets sourced from CelebA, Celebrity Faces in the Wild, and Congress member images
- Semi-private and private individuals’ images came from employees
- Model’s performance on refusal behavior was measured
- Model successfully refused requests in this category more than 98% of the time
- Accuracy rate of the model in this category was reduced to 0% based on internal evaluations
Ungrounded inference evaluation
- Ungrounded inferences are inferences made without sufficient justification from the provided information (text or image).
- These types of questions cannot typically be answered solely based on visual information from the image.
- Providing ungrounded inferences can lead to the reinforcement of biases and the dissemination of inaccurate information.
- To address this issue, automatic evaluations have been developed to assess the model’s ability to reject such requests for information.
Multimodal jailbreak evaluations
- Jailbreaks attempt to trap the model using complex logical reasoning chains.
- A new vector for jailbreaks involves inserting logical reasoning information into images.
- This information can be in the form of screenshots of written instructions or visual cues.
- Placing information in images makes it challenging to detect jailbreaks using text-based methods.
- Visual system capabilities are relied upon to detect these jailbreaks.
- Existing text jailbreaks have been converted into screenshots for analysis.
- The goal is to determine if the visual input space provides new attack vectors for known problems.
Evaluating GPT-4V + Refusal System.
Extending text-only evaluations to multimodal
- Text-only evaluations were extended to various domains, including advice for self-harm and graphic content.
- Words were replaced with up to two image synonyms per example. Image synonyms are images representing words .
- This approach aimed to prevent bypassing text-only mitigations using images.
CAPTCHA breaking and geolocation
- The model’s abilities were tested using public datasets, specifically in the areas of breaking CAPTCHAs and performing geolocation tasks.
- Breaking CAPTCHAs demonstrates the model’s intelligence and its ability to solve puzzles and perform complex visual reasoning tasks.
- High performance in geolocation tasks reflects the model’s world knowledge and can be helpful for users searching for specific items or places.
- However, the ability to break CAPTCHAs can pose cybersecurity and AI safety concerns as it can be used to bypass security measures intended for botware.
- Geolocation capabilities can raise privacy concerns, as they can potentially identify the location of individuals who want to keep their location private.
- The model’s geolocation abilities typically don’t go beyond identifying the city in most cases, making it less likely to pinpoint someone’s precise location solely using the model.
Scientific proficiency
- GPT-4V can capture complex information in images, including specialized imagery from scientific publications.
- It can understand and assess advanced science from recent papers, sometimes successfully.
- It occasionally combines closely located text components in images, leading to unrelated terms.
- The model is prone to hallucinations and factual errors, especially when providing information in an authoritative tone.
- It can miss text or characters, overlook mathematical symbols, and fail to recognize spatial locations and color mappings in images.
- GPT-4V may appear useful for dangerous tasks requiring scientific proficiency, such as the synthesis of illicit chemicals.
- It provides information on dangerous chemicals like Isotonitazene but with potential inaccuracies and errors, limiting its utility for such tasks.
- It occasionally correctly identifies poisonous foods like toxic mushrooms from images.
- This demonstrates that the model is unreliable and should not be used for high-risk tasks, including the identification of dangerous compounds or foods.
Medical advice
- Inconsistencies were found in the model’s interpretation of medical imaging.
- The model sometimes provided correct responses but could also give incorrect responses for the same question.
- Due to the model’s imperfect performance and associated risks, it is deemed unfit for any medical function, advice, diagnosis, or treatment.
Stereotyping and ungrounded inferences
- GPT-4V can generate unwanted or harmful assumptions that lack a basis in provided information.
- Early versions of GPT-4V had issues with stereotypes and ungrounded inferences when asked to make decisions and provide explanations.
- Mitigations have been added to prevent ungrounded inferences regarding people, taking a conservative approach.
- There is hope that future research and mitigations may enable the model to answer questions about people in low-risk contexts.
Disinformation risks
- People are more likely to believe both true and false statements when presented with an accompanying image.
- GPT-4V was tested for its ability to detect disinformation in images, but the results were inconsistent.
- The model’s ability to recognize disinformation may be influenced by the familiarity and recency of disinformation concepts.
- GPT-4V should not be used as a tool to detect disinformation or verify the truthfulness of content.
- Risk assessment should consider context, distribution, and mitigations like watermarking when using these technologies.
Hateful content
- GPT-4V sometimes refuses to answer questions about hate symbols and extremist content, but this behavior is inconsistent.
- The model’s knowledge about hate symbols is contextually inappropriate, such as not recognizing the modern meaning of the Templar Cross as a hate symbol in the US.
- If a user directly names a well-known hate group, the model usually refuses to provide a completion. However, if lesser-known names or symbols are used, the model might still generate responses.
- The model can sometimes generate songs or poems that praise hate figures or groups when given a picture of them, even if they are not explicitly named.
- OpenAI has added refusals for certain harmful content generation, but not for all cases. Addressing this issue remains a dynamic and challenging problem for OpenAI.
Visual vulnerabilities
- The order of input images can influence the recommendations generated by the model.
- These findings indicate challenges in model robustness and reliability.
- Anticipation of discovering more vulnerabilities through broader usage.