How To Use OpenAI GPT Turbo Vision API With Python


The OpenAI GPT Vision API exposes the latest and (arguably) most powerful model, announced on November 6, 2023 during OpenAI’s DevDay presentation. It became the talk of social media within hours of becoming available.

Developers have already created apps that actively recognize what’s happening during a web live stream in real-time.

Another developer combined just about every OpenAI API to analyze a Messi highlights video with the gpt-4-vision-preview model, create a voiceover script based on the video frames, and then generate audio using OpenAI Text-to-Speech.

Using GPT Vision API With Python

First, install the openai package with pip:

pip install --upgrade openai #to ensure we are using the latest version

Next, we create an OpenAI client instance with our API key.

How to obtain your OpenAI API key?

Go to API keys – OpenAI API
Click “Create new secret key”
Provide a name and be sure to copy your new key.

from openai import OpenAI

client = OpenAI(api_key="sk-YOUR_OPENAI_KEY")
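Pasting the key into the source works for a quick test, but it is easy to commit by accident. A minimal sketch of reading it from the environment instead is shown here; `get_api_key` is a hypothetical helper name, and the `OPENAI_API_KEY` variable name is simply the one used later in this article.

```python
import os


def get_api_key(var_name: str = "OPENAI_API_KEY") -> str:
    """Return the API key stored in an environment variable, or fail loudly."""
    key = os.environ.get(var_name, "").strip()
    if not key:
        # Failing early gives a clearer error than a rejected API request later
        raise RuntimeError(f"Set the {var_name} environment variable first")
    return key
```

The result can then be passed to the client, for example `client = OpenAI(api_key=get_api_key())`.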

Making your first call with gpt-4-vision-preview Model

import os
from openai import OpenAI
from dotenv import load_dotenv
import base64
import mimetypes
load_dotenv()

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

def image_to_base64(image_path):
    # Guess the MIME type of the image
    mime_type, _ = mimetypes.guess_type(image_path)
    
    if not mime_type or not mime_type.startswith('image'):
        raise ValueError("The file type is not recognized as an image")
    
    # Read the image binary data
    with open(image_path, 'rb') as image_file:
        encoded_string = base64.b64encode(image_file.read()).decode('utf-8')
    
    # Format the result with the appropriate prefix
    image_base64 = f"data:{mime_type};base64,{encoded_string}"
    
    return image_base64

base64_string = image_to_base64("image1.jpg")


response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe the attached image"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": base64_string,
                        "detail": "low"
                    }
                },
            ],
        }
    ],
    max_tokens=300,
)

print(response.choices[0].message.content)

Stepping Through the Code

The image_to_base64 function is defined to convert an image file to a base64-encoded string. It first checks if the file is an image and then reads and encodes the binary data of the image. Use this option to submit local images.

def image_to_base64(image_path):
    ...
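To make the returned value concrete, here is a small sketch that applies the same encoding steps to bytes already in memory. `bytes_to_data_url` is a hypothetical name, and the three placeholder bytes stand in for real image data.

```python
import base64
import mimetypes


def bytes_to_data_url(data: bytes, filename: str) -> str:
    """Build a data URL from raw bytes, guessing the MIME type from the file name."""
    mime_type, _ = mimetypes.guess_type(filename)
    if not mime_type or not mime_type.startswith("image"):
        raise ValueError("The file type is not recognized as an image")
    encoded = base64.b64encode(data).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"


# Placeholder bytes, not a real image; only the output format matters here
print(bytes_to_data_url(b"abc", "photo.png"))
# data:image/png;base64,YWJj
```

The "data:<mime type>;base64," prefix is what lets the API treat the string like any other image URL.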

You may also submit image URLs using the below example:

response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[
        {
            "role": "user",
            "content": [    
                {
                    "type": "text",
                    "text": "What’s in this image?",
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://www.goodfreephotos.com/albums/other-landscapes/rover-and-landscape-scenery.jpg",
                    },
                }
            ],
        }
    ],
    max_tokens=300,
)

print(response.choices[0].message.content)
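A single user message can carry more than one image, which is useful for comparison prompts. The sketch below builds such a message; `build_messages` is a hypothetical helper, and the example.com URLs are placeholders rather than real images.

```python
def build_messages(prompt: str, image_urls: list[str], detail: str = "low") -> list[dict]:
    """Return a messages list with one text part followed by one image part per URL."""
    content = [{"type": "text", "text": prompt}]
    for url in image_urls:
        content.append(
            {"type": "image_url", "image_url": {"url": url, "detail": detail}}
        )
    return [{"role": "user", "content": content}]


messages = build_messages(
    "What's in these images? Is there any difference between them?",
    ["https://example.com/before.jpg", "https://example.com/after.jpg"],  # placeholders
)
print(len(messages[0]["content"]))  # 3 parts: one text, two images
```

The result can be passed as the `messages` argument of `client.chat.completions.create`. Each image is billed separately, so the `detail` setting matters more as the list grows.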

The Completion parameters

You may be familiar with the messages object if you have used the OpenAI API in the past. With the release of the gpt-4-vision-preview model, OpenAI introduced a new content type: “image_url”.

The url parameter accepts either an image URL or a Base64-encoded image formatted as a data URL.

Adjusting the ‘detail’ parameter in the GPT-4 Vision API affects the resolution and token budget used to interpret images. Setting ‘detail’ to ‘low’ restricts the model to a 512 x 512 pixel, low-resolution version of the image, using 85 tokens. This mode is quicker and more token-efficient, suitable for scenarios where detailed analysis isn’t critical. Opting for ‘high’ engages the high-resolution mode, which first presents the model with the low-resolution image and then with high-resolution 512px segments of the image, each segment being allocated a 170-token budget for a thorough examination. These figures match the pricing below: 85 tokens for a ‘low’ image, and 85 + 170 = 255 tokens for a single-segment ‘high’ image.

messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe the attached image"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": base64_string,
                        "detail": "low"
                    }
                },
            ],
        }
    ]

GPT Vision API Pricing

As shown below, OpenAI prices the vision capability based on image resolution. For example, a 512×512 image costs $0.00255 in “high” detail mode, or $0.00085 in “low” detail mode.

GPT-4 Vision API Pricing
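The two prices quoted above are consistent with 85 tokens for the low-resolution overview plus 170 tokens per 512px segment, at a rate of $0.01 per 1,000 input tokens. The sketch below estimates the cost for other sizes under those assumptions. The resizing steps (fit within 2048 x 2048, then shrink the shortest side to 768px, rounding to whole pixels) follow OpenAI's published description as best understood here and should be treated as an approximation; the function names are hypothetical.

```python
import math

BASE_TOKENS = 85               # low-resolution overview, charged in both modes
TILE_TOKENS = 170              # assumed budget per 512px segment in high detail
PRICE_PER_TOKEN = 0.01 / 1000  # assumed rate, implied by $0.00085 for 85 tokens


def estimate_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate the tokens billed for one image of the given pixel size."""
    if detail == "low":
        return BASE_TOKENS
    # Assumed step 1: shrink (never enlarge) to fit within 2048 x 2048
    scale = min(1.0, 2048 / max(width, height))
    width, height = round(width * scale), round(height * scale)
    # Assumed step 2: shrink (never enlarge) so the shortest side is 768px
    scale = min(1.0, 768 / min(width, height))
    width, height = round(width * scale), round(height * scale)
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return BASE_TOKENS + TILE_TOKENS * tiles


def estimate_image_cost(width: int, height: int, detail: str = "high") -> float:
    """Estimated cost in US dollars for one image."""
    return estimate_image_tokens(width, height, detail) * PRICE_PER_TOKEN


print(estimate_image_tokens(512, 512, "high"))        # 255
print(f"{estimate_image_cost(512, 512, 'high'):.5f}")  # 0.00255
print(f"{estimate_image_cost(512, 512, 'low'):.5f}")   # 0.00085
```

Multiplying the per-image estimate by the expected number of images gives a rough budget before a batch job is started.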

This may not seem like a lot but cost does add up when processing hundreds or thousands of images. To control your input image sizes, use the below code to resize them:

Ensure you have the Pillow library installed:

pip install Pillow

from PIL import Image
import os

def resize_image(image_path, new_width, new_height):
    with Image.open(image_path) as img:
        # Resize the image (Image.ANTIALIAS was removed in Pillow 10; LANCZOS is the same filter)
        img = img.resize((new_width, new_height), Image.Resampling.LANCZOS)

        # Save the resized image
        base, ext = os.path.splitext(image_path)
        new_image_path = f"{base}_resized{ext}"
        img.save(new_image_path)

        print(f"Image saved as {new_image_path}")
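resize_image stretches the picture to whatever width and height it is given, so the caller has to pick dimensions that keep the aspect ratio. One way to do that is sketched below; `fit_within` is a hypothetical helper, and the 512px default is chosen only because it matches the low-detail size mentioned earlier.

```python
def fit_within(width: int, height: int, max_side: int = 512) -> tuple[int, int]:
    """Scale (width, height) down so the longer side is at most max_side."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # already small enough; never upscale
    scale = max_side / longest
    # Round to whole pixels and keep each side at least 1px
    return max(1, round(width * scale)), max(1, round(height * scale))


print(fit_within(2048, 1024))  # (512, 256)
print(fit_within(300, 200))    # (300, 200)
```

The original size can be read with Pillow's `Image.open(path).size`, and the result passed on as `resize_image(path, *fit_within(width, height))`.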

Real World Use of GPT-4 Vision API: Enhancing Web Experience with a Chrome Extension

GPT-4 Vision API is cool and all – people have used it to create soccer highlight commentary and interact with webcams – but let’s put gpt-4-vision-preview to the test and see how it fares with real-world problems.

Browser Extension – GPT Vision Assistant

To explore various use cases of the new GPT-4 Vision API, we built a small Chrome Extension allowing us to capture screenshots of websites and submit them to OpenAI with a prompt to analyze and ask questions.

The Chrome extension harnesses the GPT-4 Vision API in a streamlined three-step process:

First, it captures a screenshot of the current tab.

Second, the user is prompted to input a specific question or instruction, articulating what they seek to learn or accomplish with regard to the captured image.

Finally, this input, along with the screenshot, is sent as a request to the OpenAI API. The gpt-4-vision-preview model returns its findings directly within the extension interface, displaying the results for the user to review. This integration offers a powerful tool for real-time image analysis and interaction, all without leaving the browser tab.
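The extension's own source is not shown in this article, but steps two and three can be approximated in Python. The sketch below assumes the screenshot is already available as PNG bytes; `build_screenshot_request` is a hypothetical helper that only assembles the request arguments and does not call the API.

```python
import base64


def build_screenshot_request(
    screenshot_png: bytes,
    question: str,
    model: str = "gpt-4-vision-preview",
    max_tokens: int = 300,
) -> dict:
    """Combine a PNG screenshot and the user's question into chat completion arguments."""
    encoded = base64.b64encode(screenshot_png).decode("utf-8")
    data_url = f"data:image/png;base64,{encoded}"
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {"type": "image_url", "image_url": {"url": data_url, "detail": "high"}},
                ],
            }
        ],
        "max_tokens": max_tokens,
    }
```

With a client created as earlier in the article, the request could be sent with `client.chat.completions.create(**build_screenshot_request(png_bytes, "What does this page sell?"))`. The "high" detail setting is a guess that suits text-heavy screenshots; "low" would be cheaper.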

Other Use Cases With GPT-4 Vision API

The GPT-4 Vision API’s capabilities extend beyond simple image recognition and analysis; they open up a world of possibilities for enhancing and streamlining our interactions with digital content. The Chrome Extension we built on top of this API is a testament to its practical utility in everyday web usage. Here are other use cases:

Quality Assurance for Websites: Website developers and QA testers can use the extension to take screenshots of web pages and submit them directly to the GPT-4 Vision API with prompts like “Identify any visual inconsistencies across these screenshots” or “Does this layout comply with web accessibility standards?” The API can analyze the images for color contrast, element alignment, responsive design issues, and more, providing quick feedback that would traditionally require meticulous manual review.
UX/UI Design Feedback: Designers can capture snapshots of their work and ask the model questions such as “What improvements can be made to this user interface to enhance user experience?” or “What are the best practices missing from this design?” This not only speeds up the iterative process of design but also injects an objective, data-driven perspective into creative workflows.
Content Management and Moderation: For content managers and online moderators, the extension could be used to screen website content. By taking screenshots of various posts, images, or comments and querying the GPT-4 Vision API, the system could assist in identifying inappropriate content or copyright issues, streamlining the moderation process.
Educational Content Interaction: Students and educators could use the extension to capture diagrams, equations, or charts from educational websites and ask the model to explain or solve them. This interactive approach could enhance online learning by providing instant assistance and explanations for complex visual information.

GPT-4 Vision API answering a math question

Competitor Analysis: Marketing professionals might employ the extension to capture the layout of competitors’ websites, asking the model to analyze and compare branding consistency, messaging clarity, and call-to-action placements. This competitive intelligence can be invaluable for strategic planning.
Accessibility Auditing: The extension could also be used to assist with ensuring web accessibility. Users could take snapshots of websites and ask the model to check if the images contain proper alt text or if the color schemes used are suitable for color-blind individuals.
Automated Documentation: IT professionals and developers could use the extension to document the behavior of web applications. By taking screenshots and prompting the API to describe the process flow or detect anomalies, they could generate documentation or troubleshooting guides more efficiently.
Code Generation from Designs: A revolutionary use case would be for front-end developers to send UX designs to the OpenAI API and ask it to generate the necessary HTML/CSS/JavaScript code. This could potentially reduce development time by providing a starting point for building out web interfaces.


If interested in the extension or other custom solutions using OpenAI, please email info@nextideated.com

Zak Elmeghni