Introduction
This post demonstrates how to use LlamaParse to parse Excel files and structure the extracted content using Python. We'll cover the process of setting up LlamaParse, parsing an Excel file, and analyzing the structure of the parsed data.
Setup
First, let's install the necessary libraries:
pip install llama-index
pip install llama-parse
Next, we'll import the required modules and set up our environment:
import nest_asyncio
from llama_parse import LlamaParse
nest_asyncio.apply()
api_key = "#token#" # get from cloud.llamaindex.ai
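Hardcoding the key is fine for a quick experiment, but a safer pattern is to read it from an environment variable. A minimal sketch, assuming you have exported LLAMA_CLOUD_API_KEY in your shell beforehand:
import os

# Read the key from an environment variable instead of hardcoding it
# (assumes LLAMA_CLOUD_API_KEY has been exported beforehand)
api_key = os.environ["LLAMA_CLOUD_API_KEY"]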
Parsing Excel Files with LlamaParse
Let's create a LlamaParse instance and use it to parse an Excel file:
parser = LlamaParse(
    api_key=api_key,
    result_type="markdown"
)
docs = parser.load_data("/Users/Apple/Downloads/Excel.xlsx")
This will parse the Excel file and return the content in markdown format.
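Depending on the file and the parser settings, load_data() may return one or more Document objects, so it is worth a quick look at what came back before indexing into the list:
# How many Document objects did the parser return?
print(len(docs))

# Size of each parsed document's text content
for i, doc in enumerate(docs):
    print(i, len(doc.get_content()))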
Analyzing the Parsed Content
Let's examine the structure of the parsed content:
print(type(docs[0]))
Output:
<class 'llama_index.core.schema.Document'>
We can see that each parsed document is an instance of the Document class from llama_index.
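Each Document also exposes a metadata dictionary, which can be useful for tracing where a piece of content came from (the exact keys depend on the parser and its settings):
# Inspect the metadata attached to the first document
print(docs[0].metadata)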
Accessing Parsed Content
You can access the parsed content using the get_content() method:
content = docs[0].get_content()
print(content[:1000]) # Print first 1000 characters
Output:
# Rate
|CAN : |FAK | | | | |
|---|---|---|---|---|---|
|Validity: |01.09.2024 - 30.09.2024 | | | | |
| | | | | | |
|ORIGIN |DESTINATION |T/S mode |20'DC |40'DC,HC |CUR |
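Since the print above only shows the first 1000 characters, it can also help to write the full markdown to a file and inspect the whole table in an editor. A minimal sketch (the filename is arbitrary):
# Dump the complete parsed markdown to disk for inspection
with open("parsed_excel.md", "w") as f:
    f.write(content)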
Structuring the Content
The parsed content is in a markdown table format. We can further structure this data by splitting it into rows and columns:
# Split the content into lines
lines = content.split('\n')
# Find the header row
header_row = next(i for i, line in enumerate(lines) if line.startswith('|ORIGIN'))
# Extract headers
headers = [header.strip() for header in lines[header_row].split('|') if header.strip()]
# Extract data rows
data_rows = [
    [cell.strip() for cell in line.split('|') if cell.strip()]
    for line in lines[header_row + 1:]
    if line.startswith('|') and not all(cell.strip() == '' for cell in line.split('|'))
]
# Create a list of dictionaries
structured_data = [dict(zip(headers, row)) for row in data_rows]
# Print the first few structured data entries
import json
print(json.dumps(structured_data[:3], indent=2))
Output:
[
  {
    "ORIGIN": "Suhraa",
    "DESTINATION": "Turieb *)",
    "T/S mode": "feeder only",
    "20'DC": "1100",
    "40'DC,HC": "1250",
    "CUR": "REU"
  },
  ...
]
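If you prefer working with a tabular API, the same rows can be loaded into a pandas DataFrame instead of a list of dictionaries. A sketch, assuming pandas is installed:
import pandas as pd

# Build a DataFrame directly from the structured rows
df = pd.DataFrame(structured_data)
print(df.head())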
Analyzing the Structured Data
Now that we have structured the data, we can perform various analyses:
- Count the number of unique origins:
unique_origins = set(row['ORIGIN'] for row in structured_data)
print(f"Number of unique origins: {len(unique_origins)}")
- Find the average rate for 20'DC containers (a more defensive parsing sketch follows this list):
rates_20dc = [float(row["20'DC"]) for row in structured_data if row["20'DC"].isdigit()]
avg_rate_20dc = sum(rates_20dc) / len(rates_20dc)
print(f"Average rate for 20'DC containers: {avg_rate_20dc:.2f}")
- List all destinations for a specific origin:
origin = "Suhraa"
destinations = [row['DESTINATION'] for row in structured_data if row['ORIGIN'] == origin]
print(f"Destinations from {origin}: {', '.join(destinations)}")
Conclusion
This guide demonstrates how to use LlamaParse to extract content from Excel files and structure the parsed data for further analysis. By converting the markdown table format into a list of dictionaries, we can easily perform various data analysis tasks on the structured data.
Remember to handle your API keys securely and not expose them in public repositories or shared notebooks. Also, consider the structure of your Excel files when parsing and adjust the structuring code accordingly.
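As a starting point for that adjustment, here is a small sketch that parameterises the header prefix and keeps empty cells so that values stay aligned with their column names; the header_prefix default matches the example above and will need to be changed for other sheets:
def structure_markdown_table(content, header_prefix="|ORIGIN"):
    """Turn a markdown table produced by LlamaParse into a list of dicts.

    header_prefix identifies the row holding the column names; adjust it to
    match the header of your own sheet.
    """
    lines = content.split("\n")
    header_row = next(i for i, line in enumerate(lines) if line.startswith(header_prefix))
    headers = [h.strip() for h in lines[header_row].strip("|").split("|")]
    rows = []
    for line in lines[header_row + 1:]:
        if not line.startswith("|"):
            continue
        cells = [c.strip() for c in line.strip("|").split("|")]
        if any(cells):  # skip completely empty rows
            rows.append(dict(zip(headers, cells)))
    return rows
For fully populated rows like the ones shown earlier this gives the same result as the inline code; for rows with empty cells it keeps each value paired with its column name instead of shifting them.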