OpenAI’s GPT-4 Vision API is the latest and (arguably) most powerful model from OpenAI. It was released on November 6, 2023 during OpenAI’s DevDay presentation, and it became the talk of social media within hours of becoming available.
Developers have already created apps that recognize what’s happening in a live web stream in real time.
Another developer combined just about every OpenAI API to analyze a Messi highlights video with the gpt-4-vision-preview model, create a voiceover script based on the video frames, and then generate the audio using OpenAI Text-to-Speech.
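First, install the openai pip module, along with python-dotenv, which the example below uses to load the API key from a .env file:
pip install --upgrade openai python-dotenv  # --upgrade ensures we are using the latest version
Next, we create an OpenAI client instance with our API key.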
How to obtain your OpenAI API key?
client = OpenAI(api_key="sk_YOUR_OPENAI_KEY")
import os
from openai import OpenAI
from dotenv import load_dotenv
import base64
import mimetypes
load_dotenv()
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
def image_to_base64(image_path):
    # Guess the MIME type of the image
    mime_type, _ = mimetypes.guess_type(image_path)
    if not mime_type or not mime_type.startswith('image'):
        raise ValueError("The file type is not recognized as an image")
    # Read the image binary data
    with open(image_path, 'rb') as image_file:
        encoded_string = base64.b64encode(image_file.read()).decode('utf-8')
    # Format the result with the appropriate prefix
    image_base64 = f"data:{mime_type};base64,{encoded_string}"
    return image_base64
base64_string = image_to_base64("image1.jpg")
response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe the attached image"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": base64_string,
                        "detail": "low"
                    }
                },
            ],
        }
    ],
    max_tokens=300,
)
print(response.choices[0].message.content)
Stepping Through the Code
The image_to_base64 function converts an image file to a base64-encoded data URL. It first checks that the file is an image, then reads and encodes the image’s binary data. Use this option to submit local images.
def image_to_base64(image_path):
...
You may also submit image URLs using the below example:
response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What’s in this image?",
                },
                {
                    "type": "image_url",
                    # Pass the URL inside an object, matching the base64 example
                    "image_url": {
                        "url": "https://www.goodfreephotos.com/albums/other-landscapes/rover-and-landscape-scenery.jpg",
                    },
                },
            ],
        }
    ],
    max_tokens=300,
)
The Completion parameters
You may be familiar with the messages object if you have used the OpenAI API in the past. With the release of the gpt-4-vision-preview model, OpenAI has introduced a new content type for messages: “image_url”.
The url parameter accepts either a base64-encoded image (as a data URL) or a regular image URL.
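Since the url field takes both forms, one small helper can cover local files and remote images. The sketch below is only an illustration: build_image_part is a hypothetical name, not part of the OpenAI library, and it assumes that anything starting with http:// or https:// is a remote image while everything else is a local file path.

```python
import base64
import mimetypes
from pathlib import Path


def build_image_part(source, detail="low"):
    """Return an "image_url" content part for a remote URL or a local file.

    Hypothetical helper: the name and the URL-detection rule are assumptions.
    """
    if source.startswith(("http://", "https://")):
        # Remote image: pass the URL through unchanged
        url = source
    else:
        # Local image: check the type, then embed the bytes as a data URL
        mime_type, _ = mimetypes.guess_type(source)
        if not mime_type or not mime_type.startswith("image"):
            raise ValueError(f"Not recognized as an image: {source}")
        encoded = base64.b64encode(Path(source).read_bytes()).decode("utf-8")
        url = f"data:{mime_type};base64,{encoded}"
    return {"type": "image_url", "image_url": {"url": url, "detail": detail}}
```

The returned dictionary can be placed in the "content" list of a user message next to a {"type": "text", ...} part.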
Adjusting the ‘detail’ parameter in the GPT-4 Vision API affects the resolution and token budget used to interpret images. Setting ‘detail’ to ‘low’ restricts the model to a 512 x 512 pixel, low-resolution version of the image, billed at a flat 85 tokens. This mode is quicker and more token-efficient, suitable for scenarios where detailed analysis isn’t critical. Opting for ‘high’ engages the high-resolution mode, which first presents the model with the low-resolution image and then with high-resolution 512px segments of the image, each billed at 170 tokens for a thorough examination. These are the token counts behind the prices quoted below.
messages=[
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe the attached image"},
            {
                "type": "image_url",
                "image_url": {
                    "url": base64_string,
                    "detail": "low"
                }
            },
        ],
    }
]
OpenAI prices the vision capability based on image resolution. For example, a 512×512 image costs $0.00255 in “high” detail mode, or $0.00085 in “low” detail mode.
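To see where those two prices come from, here is a rough estimator. It is a sketch under stated assumptions, not an official calculator: it assumes the tiling rules OpenAI documented for this model (fit the image within 2048 x 2048, scale the shortest side down to 768px, then count 512px tiles at 170 tokens each on top of a base 85 tokens) and an input price of $0.01 per 1K tokens. The function names are hypothetical, and current pricing should be checked before relying on the numbers.

```python
import math

BASE_TOKENS = 85             # flat cost of the low-resolution pass (assumed)
TILE_TOKENS = 170            # cost per 512px high-detail tile (assumed)
PRICE_PER_1K_TOKENS = 0.01   # assumed input price in USD


def estimate_image_tokens(width, height, detail="high"):
    """Estimate the tokens billed for one image of the given pixel size."""
    if detail == "low":
        return BASE_TOKENS
    # Step 1: shrink (never enlarge) so the image fits within 2048 x 2048
    scale = min(1.0, 2048 / max(width, height))
    width, height = round(width * scale), round(height * scale)
    # Step 2: shrink (never enlarge) so the shortest side is at most 768px
    scale = min(1.0, 768 / min(width, height))
    width, height = round(width * scale), round(height * scale)
    # Step 3: count the 512px tiles needed to cover the scaled image
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return BASE_TOKENS + TILE_TOKENS * tiles


def estimate_image_cost(width, height, detail="high"):
    """Estimate the USD cost of one image at the assumed token price."""
    return estimate_image_tokens(width, height, detail) * PRICE_PER_1K_TOKENS / 1000
```

Under these assumptions a 512×512 image is one tile, so “high” detail comes to 85 + 170 = 255 tokens and “low” detail to 85 tokens, which lines up with the $0.00255 and $0.00085 figures above.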
This may not seem like a lot, but the cost adds up when processing hundreds or thousands of images. To control your input image sizes, use the code below to resize them.
Ensure you have the Pillow library installed:
pip install Pillow
from PIL import Image
import os

def resize_image(image_path, new_width, new_height):
    with Image.open(image_path) as img:
        # Resize the image (Image.ANTIALIAS was removed in Pillow 10; LANCZOS is the same filter)
        img = img.resize((new_width, new_height), Image.LANCZOS)
        # Save the resized image next to the original
        base, ext = os.path.splitext(image_path)
        new_image_path = f"{base}_resized{ext}"
        img.save(new_image_path)
        print(f"Image saved as {new_image_path}")
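Because resize_image takes an explicit width and height, it can distort the picture if the new dimensions don’t match the original aspect ratio. One possible way to pick the dimensions is sketched below. fit_within is a hypothetical helper, and the 512px default is an assumption chosen to match the low-detail resolution mentioned earlier.

```python
def fit_within(width, height, max_side=512):
    """Return (new_width, new_height) that fit in a max_side square.

    Keeps the aspect ratio and never enlarges the image.
    """
    longest = max(width, height)
    if longest <= max_side:
        # Already small enough: leave the dimensions unchanged
        return width, height
    scale = max_side / longest
    # Round to whole pixels and keep each side at least 1px
    return max(1, round(width * scale)), max(1, round(height * scale))
```

For a 1024×768 photo, this gives (512, 384), so the call could look like resize_image("image1.jpg", *fit_within(1024, 768)).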
OK, so the GPT-4 Vision API is cool and all. People have used it to create soccer highlight commentary and to interact with webcams. But let’s put gpt-4-vision-preview to the test and see how it fares with real-world problems.
To explore various use cases of the new GPT-4 Vision API, we built a small Chrome Extension allowing us to capture screenshots of websites and submit them to OpenAI with a prompt to analyze and ask questions.
The Chrome extension harnesses the GPT-4 Vision API in a streamlined three-step process:
First, it captures a screenshot of the current tab.
Second, the user is prompted to input a specific question or instruction, articulating what they seek to learn or accomplish with regard to the captured image.
Finally, this input, along with the screenshot, is sent as a request to the OpenAI API. The gpt-4-vision-preview model returns its findings directly within the extension interface, where the user can review them. This integration offers a powerful tool for real-time image analysis and interaction, all without leaving the browser tab.
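The third step can be sketched in Python for illustration. This is not the extension’s actual code: build_vision_request is a hypothetical name, and the sketch assumes the screenshot arrives as raw PNG bytes and that the request mirrors the chat completion calls shown earlier.

```python
import base64


def build_vision_request(screenshot_png, question, detail="high", max_tokens=300):
    """Combine a screenshot and the user's question into request arguments.

    Hypothetical helper; assumes screenshot_png holds raw PNG bytes.
    """
    if not question.strip():
        raise ValueError("A question or instruction is required")
    # Embed the screenshot as a base64 data URL
    encoded = base64.b64encode(screenshot_png).decode("utf-8")
    return {
        "model": "gpt-4-vision-preview",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/png;base64,{encoded}",
                            "detail": detail,
                        },
                    },
                ],
            }
        ],
        "max_tokens": max_tokens,
    }
```

With the client created earlier, the result could be sent as client.chat.completions.create(**build_vision_request(png_bytes, "Summarize this page")), where png_bytes is a placeholder for the captured screenshot.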
The GPT-4 Vision API’s capabilities extend beyond simple image recognition and analysis; they open up a world of possibilities for enhancing and streamlining our interactions with digital content. The Chrome Extension we built on top of this API is a testament to its practical utility in everyday web usage. Here are other use cases:
If you are interested in the extension or other custom solutions using OpenAI, please email info@nextideated.com