OpenAI PDF Extraction and Processing

Mar 28, 2024

This post demonstrates how to use the OpenAI API to extract and process data from PDF files. We'll cover setting up the environment, reading PDF files, and using OpenAI's GPT models to analyze the content.

Setup

First, let's install the necessary libraries:

pip install --upgrade openai PyPDF2

Next, import the required modules:

import openai
from openai import OpenAI
import PyPDF2
import os

Setting up OpenAI Client

Set up your OpenAI API credentials:

client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY")
)

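The client reads the key from the OPENAI_API_KEY environment variable, so set that variable in your shell or notebook environment before running the code. Avoid pasting the key directly into the source.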
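If the variable is missing, the client only fails later with an authentication error. A minimal sketch of one way to fail early with a clearer message follows; the helper name `get_api_key` and its `env` parameter are illustrative assumptions, not part of the OpenAI library.

```python
import os


def get_api_key(env=None, var_name="OPENAI_API_KEY"):
    """Return the API key from the environment, or raise a clear error.

    `env` defaults to os.environ; passing a plain dict makes the helper easy to test.
    """
    env = os.environ if env is None else env
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"{var_name} is not set; export it before running this script.")
    return key
```

The result could then be passed to the client as `OpenAI(api_key=get_api_key())`.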

Defining the Prompt

Create a prompt that instructs the AI on how to process the PDF content:

deployment_id = "#ID#"
prompt = """
Read the below content and Generate Table
"freight"
columns origin_icd, origin_port, via_port, destination_port, destination_icd, service_type, cargo_type, commodity, transit_time, origin_free_time_type, origin_free_time, destination_free_time_type, destination_free_time, inclusions, contract_number, 20GP, 40GP, 40HC, currency, remarks, start_date, and expiry_date.
 "charges"
columns charge_description, charges_leg, 20GP, 40GP, 40HC, currency, origin_port and destination_port.
Follow the below instructions:
...
"""

Reading PDF Content

Here's a function to read the text content of a PDF. It uses the PyPDF2 3.x names (PdfReader, pages, extract_text); the older PdfFileReader, numPages, getPage, and extractText were removed in that release:

def read_pdf(file_path):
    with open(file_path, "rb") as file:
        reader = PyPDF2.PdfReader(file)
        content = ""
        for page in reader.pages:
            # extract_text() can return nothing for image-only pages
            content += page.extract_text() or ""
    return content

file_path = "/path/to/your/pdf/file.pdf"  # replace with the path to your PDF
file_content = read_pdf(file_path)
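Long PDFs can produce more text than fits in a single request. The sketch below shows one simple way to split the extracted text into smaller pieces that could be sent separately. The function name `split_into_chunks` and the default limit are assumptions for illustration, and the limit is counted in characters, which is only a rough stand-in for the model's token limit.

```python
def split_into_chunks(text, max_chars=12000):
    """Split text into pieces of at most max_chars, breaking on line boundaries where possible."""
    chunks, current = [], ""
    for line in text.splitlines(keepends=True):
        # A single line longer than the limit is cut into fixed-size slices.
        while len(line) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(line[:max_chars])
            line = line[max_chars:]
        # Start a new chunk when adding this line would exceed the limit.
        if len(current) + len(line) > max_chars:
            chunks.append(current)
            current = ""
        current += line
    if current:
        chunks.append(current)
    return chunks
```

Splitting on line boundaries keeps most table rows intact, but a table that spans a chunk boundary will still be divided, so results from separate chunks may need to be merged afterwards.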

Processing with OpenAI

Send the PDF content to OpenAI for processing:

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": prompt},
        {"role": "user", "content": file_content},
    ]
)

generated_text = response.choices[0].message.content
print("ChatGPT:", generated_text)
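The reply is plain text, so it usually needs parsing before it can be used as data. The sketch below assumes the model returned its tables as Markdown pipe tables, which the prompt does not guarantee; the function name `parse_markdown_tables` is hypothetical. It returns one list of row dictionaries per table found.

```python
def parse_markdown_tables(text):
    """Parse Markdown pipe tables in text into a list of tables (each a list of row dicts)."""
    tables, header, rows = [], None, []
    # The extra empty string flushes a table that ends on the last line.
    for raw in text.splitlines() + [""]:
        line = raw.strip()
        if len(line) > 1 and line.startswith("|") and line.endswith("|"):
            cells = [c.strip() for c in line[1:-1].split("|")]
            # Skip the separator row, e.g. |---|:---:|
            if all(c and set(c) <= {"-", ":"} for c in cells):
                continue
            if header is None:
                header = cells
            elif len(cells) == len(header):
                rows.append(dict(zip(header, cells)))
        elif header is not None:
            # A non-table line ends the current table.
            tables.append(rows)
            header, rows = None, []
    return tables
```

Rows whose cell count does not match the header are dropped silently here; in practice it may be safer to log them, since they often indicate a malformed reply.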

Follow-up Processing

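You can ask the model to verify its own output. The chat completions API does not remember earlier calls, so the follow-up request must include the original messages and the model's previous reply:

response1 = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": prompt},
        {"role": "user", "content": file_content},
        # The previous reply, so the model knows what "the last output" refers to
        {"role": "assistant", "content": generated_text},
        {"role": "user", "content": "Was the last output only from the content shared? If not, update it properly and reshare the output"}
    ]
)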

generated_text1 = response1.choices[0].message.content
print("ChatGPT1:", generated_text1)

Creating an OpenAI Assistant

You can also create an OpenAI assistant for more specialized tasks:

assistant = client.beta.assistants.create(
  name="MSC PDF 1",
  instructions="#Instructions#",
  tools=[{"type": "code_interpreter"}],
  model="gpt-4o",
)
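Creating the assistant does not run it. A minimal sketch of sending text to the assistant and reading its reply follows, assuming an openai 1.x release recent enough to provide `runs.create_and_poll` in the beta Assistants API. The function name `ask_assistant` is hypothetical, and the code assumes the first content item of the latest message is text.

```python
def ask_assistant(client, assistant_id, text):
    """Send text to an assistant on a fresh thread and return its latest reply."""
    thread = client.beta.threads.create()
    client.beta.threads.messages.create(
        thread_id=thread.id, role="user", content=text
    )
    # Start a run and block until it reaches a terminal state.
    run = client.beta.threads.runs.create_and_poll(
        thread_id=thread.id, assistant_id=assistant_id
    )
    if run.status != "completed":
        raise RuntimeError(f"Run ended with status {run.status!r}")
    # Messages are listed newest first by default.
    messages = client.beta.threads.messages.list(thread_id=thread.id)
    return messages.data[0].content[0].text.value
```

With the objects defined earlier, this could be called as `ask_assistant(client, assistant.id, file_content)`.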

Conclusion

This guide demonstrates how to use OpenAI's API to process and extract data from PDF files. Remember to handle your API keys securely and not expose them in public repositories or shared notebooks. Also, consider the structure of your PDF files when crafting prompts and adjust the processing code accordingly.