Forecasting with Prophet: A PySpark and Pandas Approach

Sep 8, 2024

Introduction

In this article, we'll walk through the process of loading data from a Spark SQL database, filtering it, and then using Facebook's Prophet library to create a time series forecast. We'll be utilizing PySpark, Pandas, and Prophet to accomplish this task, demonstrating a powerful approach to big data time series analysis.

Understanding Univariate Prediction

Before diving into the code, let's briefly discuss univariate prediction (a small illustrative example follows this list):

  • Univariate prediction focuses on forecasting future values of a single variable based on its past values.
  • It assumes future values depend on past values and the passage of time.
  • Methods like Prophet can capture trends, seasonal patterns, and holiday effects.
  • While simpler than multivariate models, univariate prediction doesn't account for external factors.
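
To make this concrete, here is a tiny, purely hypothetical univariate series: one observed value of a single variable per day, with no other explanatory columns. The column names ds and y anticipate the format Prophet expects later on:

import pandas as pd

# A toy univariate series: one observation of a single variable per day
toy_series = pd.DataFrame({
    'ds': pd.date_range('2024-01-01', periods=5, freq='D'),  # timestamps
    'y': [120, 132, 128, 141, 150]                            # observed values
})
print(toy_series)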

Setting Up the Environment

First, let's import the necessary libraries and create a Spark session:

import pandas as pd
from prophet import Prophet
from pyspark.sql import SparkSession
from prophet.plot import plot_plotly, plot_components_plotly

# Create a Spark session
spark = SparkSession.builder.appName("DataFrameFilter").getOrCreate()

Loading and Filtering Data

Now, we'll load data from a Spark SQL database and filter it:

# SQL query to select all data from the specified table
query = """
SELECT * FROM `db_name`.`table_name`
"""

# Execute the query and convert to Pandas DataFrame
df = spark.sql(query)
pandas_df = df.toPandas()

# Filter the DataFrame for rows matching a specific value (placeholder column and value)
filtered_df = pandas_df[pandas_df['port_combination'] == 'filter_value']

# Display the first few rows of the filtered DataFrame
print(filtered_df.head())

# Convert the filtered DataFrame back to a Spark DataFrame and create a temporary view
filtered_df_spark = spark.createDataFrame(filtered_df)
filtered_df_spark.createOrReplaceTempView("filtered_table_name")
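
Note that the filter above is applied only after the full table has been pulled into Pandas. For large tables it is usually cheaper to push the filter into Spark and convert only the (much smaller) result. A minimal sketch, assuming the same placeholder database, table, and column names:

# Filter in Spark first, then convert only the matching rows to Pandas
filtered_df_spark = spark.sql("""
    SELECT * FROM `db_name`.`table_name`
    WHERE port_combination = 'filter_value'
""")
filtered_df = filtered_df_spark.toPandas()
filtered_df_spark.createOrReplaceTempView("filtered_table_name")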

Preparing Data for Prophet

Prophet requires data in a specific format. Let's prepare our data:

# Select relevant columns and rename them for Prophet
filtered_df1 = filtered_df.iloc[:, [3, 5]].copy()  # copy to avoid modifying a view of filtered_df
filtered_df1.columns = ['ds', 'y']  # Prophet expects 'ds' (datestamp) and 'y' (value to forecast)

# Display the first few rows of the prepared data
print(filtered_df1.head())
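
Prophet requires ds to be a datestamp (datetime) column and y to be numeric. Selecting columns by position works, but it breaks silently if the table schema changes; selecting by name and coercing types explicitly is safer. A small sketch, assuming hypothetical column names event_date and metric_value:

# Select by column name instead of position and enforce the expected types
# ('event_date' and 'metric_value' are hypothetical names; replace with the real ones)
filtered_df1 = filtered_df[['event_date', 'metric_value']].copy()
filtered_df1.columns = ['ds', 'y']
filtered_df1['ds'] = pd.to_datetime(filtered_df1['ds'])  # Prophet requires datetimes in ds
filtered_df1['y'] = pd.to_numeric(filtered_df1['y'])     # and numeric values in y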

Creating and Fitting the Prophet Model

Now we can create and fit our Prophet model:

# Create and fit the Prophet model
m = Prophet()
m.fit(filtered_df1)
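
The default Prophet() settings work well for many daily series, but the model exposes several options worth knowing about. Here is a sketch of a more explicitly configured model; the choices below are illustrative, not recommendations for any particular dataset:

# An explicitly configured model: multiplicative seasonality and US holiday effects
m = Prophet(
    seasonality_mode='multiplicative',  # seasonal swings scale with the trend level
    yearly_seasonality=True,
    weekly_seasonality=True,
    daily_seasonality=False,
    changepoint_prior_scale=0.05        # controls trend flexibility (this is the default)
)
m.add_country_holidays(country_name='US')  # built-in holiday calendar
m.fit(filtered_df1)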

Making Predictions

Let's create a future dataframe and make predictions:

# Create a future dataframe for 120 periods ahead
future = m.make_future_dataframe(periods=120)

# Make predictions
forecast = m.predict(future)

# Display the last few rows of the forecast
print(forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail())
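
By default make_future_dataframe() generates daily periods; the freq argument controls the step size. It is also good practice to back-test the model before trusting the forecast. A brief sketch using Prophet's built-in cross-validation utilities (the window sizes are illustrative and assume the series has enough history):

from prophet.diagnostics import cross_validation, performance_metrics

# Forecast 120 days ahead explicitly (freq defaults to 'D')
future = m.make_future_dataframe(periods=120, freq='D')

# Rolling-origin evaluation: train on an initial window, then forecast repeatedly
df_cv = cross_validation(m, initial='365 days', period='90 days', horizon='120 days')
print(performance_metrics(df_cv)[['horizon', 'rmse', 'mae']].head())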

Visualizing the Results

Finally, let's visualize our results:

# Plot the forecast
fig1 = m.plot(forecast)

# Plot the forecast components
fig2 = m.plot_components(forecast)

# Create an interactive Plotly visualization
plot_plotly(m, forecast)
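
The plot_components_plotly helper imported earlier produces an interactive version of the components plot, and the Matplotlib figure returned by m.plot() can be saved to disk. A small sketch (the file name is arbitrary):

# Interactive components plot (trend, weekly and yearly seasonality, holidays)
plot_components_plotly(m, forecast)

# Persist the static forecast plot; fig1 is a regular Matplotlib figure
fig1.savefig('prophet_forecast.png', dpi=150, bbox_inches='tight')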

Conclusion

In this article, we've demonstrated a powerful approach to time series forecasting that combines the big data capabilities of PySpark with the forecasting capabilities of Prophet. This method enables scalable univariate time series predictions, and it is particularly useful for large datasets whose series exhibit clear trends and stable seasonal patterns.

Key takeaways:

  1. PySpark enables efficient handling of large datasets from SQL databases.
  2. Prophet automates the detection of trends and seasonality in time series data.
  3. Univariate prediction can be powerful but doesn't account for external factors.
  4. Visualizations are crucial for understanding forecast results and components.

Remember, while this approach is effective for many scenarios, always consider whether other variables or external factors might significantly impact your predictions. In such cases, exploring multivariate prediction methods might be beneficial.
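
If external drivers are already available as columns in your data, Prophet offers a middle ground via additional regressors; note that future values of each regressor must be known (or forecast separately). A rough sketch, assuming a hypothetical temperature column:

# Hypothetical example: add an external regressor before fitting
m = Prophet()
m.add_regressor('temperature')   # 'temperature' is a hypothetical column in the training frame
m.fit(filtered_df1)              # filtered_df1 would need that column alongside 'ds' and 'y'

future = m.make_future_dataframe(periods=120)
# 'future' must also contain 'temperature' values for the forecast horizon before calling m.predict()

This keeps the Prophet workflow intact while letting known external factors inform the forecast.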