Merge Script

Feb 27, 2020

Overview

This Jupyter notebook contains a Python script for merging and processing CSV files related to shipping and logistics data. The script reads multiple CSV files from different directories, merges them, and exports the result to a single CSV file.

Dependencies

The script relies on the following Python libraries:

  • pandas (third-party; must be installed)
  • glob (part of the Python standard library)

File Structure

The notebook expects CSV files with a similar structure, grouped into separate directories. Each directory is read in turn, and the files inside it are combined into one DataFrame.

Key Functions and Operations

Reading CSV Files

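import glob

import pandas as pd

def read_csv_files(directory):
    """Read every CSV file in `directory` and stack the rows into one DataFrame."""
    all_files = sorted(glob.glob(f"{directory}/*.csv"))  # sorted for a stable row order
    if not all_files:
        raise FileNotFoundError(f"No CSV files found in {directory}")
    frames = []
    for filename in all_files:
        # skiprows drops the lines at zero-based positions 0-3 and 9 of each file.
        # delimiter='\n' keeps each remaining line as a single field; newer pandas
        # releases may reject '\n' as a delimiter, so this argument may need changing.
        df = pd.read_csv(filename, delimiter='\n', skiprows=[0, 1, 2, 3, 9])
        frames.append(df)
    return pd.concat(frames, axis=0, ignore_index=True, sort=True)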
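
The effect of the skiprows argument can be checked on a small in-memory example. This is only a sketch: the file contents below are invented for illustration, and it assumes comma-separated data read with the default delimiter rather than the '\n' delimiter used above.

```python
import io

import pandas as pd

# Hypothetical file contents: four preamble lines, a header line,
# four data rows, one stray line at zero-based position 9, and a final row.
raw = "\n".join([
    "preamble 0",
    "preamble 1",
    "preamble 2",
    "preamble 3",
    "port,price",   # line 4: first line kept, so it becomes the header
    "A,10",         # lines 5-8: data rows
    "B,20",
    "C,30",
    "D,40",
    "stray line",   # line 9: skipped
    "E,50",         # line 10: data row
])

# skiprows takes zero-based line positions counted from the top of the file.
demo = pd.read_csv(io.StringIO(raw), skiprows=[0, 1, 2, 3, 9])
print(demo)
```

Because the positions are fixed, a file with a different number of preamble lines would be parsed incorrectly, so it may be worth inspecting one raw file before reusing these values.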

Merging DataFrames

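# f, f1, f2, and f3 are the per-directory DataFrames (for example, results of read_csv_files).
merged = [f, f1, f2, f3]
# sort=False keeps columns in order of first appearance instead of sorting them by name.
result = pd.concat(merged, sort=False)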
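
When the frames do not share exactly the same columns, pd.concat keeps the union of the columns and fills the gaps with NaN. Without ignore_index=True, the original row labels are also kept, so they can repeat. A small sketch with invented data (the column names here are hypothetical):

```python
import pandas as pd

a = pd.DataFrame({"carrier": ["X", "Y"], "price": [100, 120]})
b = pd.DataFrame({"carrier": ["Z"], "route": ["north"]})

# Same call shape as the notebook: row labels are kept as-is.
stacked = pd.concat([a, b], sort=False)

# With ignore_index=True the rows are renumbered from zero.
renumbered = pd.concat([a, b], sort=False, ignore_index=True)

print(stacked)
print(renumbered)
```

Repeated row labels are harmless for a plain export, but they can surprise later label-based lookups, which is one reason to consider ignore_index=True in the merge step.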

Exporting Result

result.to_csv('inlandCarrier.csv')
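
By default, to_csv also writes the DataFrame index as an unnamed first column. If that column is not wanted in 'inlandCarrier.csv', passing index=False leaves it out. The sketch below uses invented data and returns the CSV text as a string instead of writing a file:

```python
import pandas as pd

frame = pd.DataFrame({"carrier": ["X", "Y"], "price": [100, 120]})

# With no path argument, to_csv returns the CSV text.
with_index = frame.to_csv()
without_index = frame.to_csv(index=False)

print(with_index.splitlines()[0])     # header line including the empty index name
print(without_index.splitlines()[0])  # header line with data columns only
```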

Usage

  1. Ensure all required CSV files are in their respective directories.
  2. Run the notebook cells sequentially.
  3. The script will read all CSV files, merge them, and export the result to 'inlandCarrier.csv'.

Additional Notes

  • The notebook includes operations on a file named 'Documents.csv', which contains JSON data. These operations are incomplete in the current version.
  • Some file paths and naming conventions are hardcoded and may need adjustment for different environments.
  • The notebook contains print statements and DataFrame displays, likely used for debugging and data inspection.
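
The structure of 'Documents.csv' is not described here, so the following is only a sketch of one common way to handle JSON stored in a CSV column. It assumes a hypothetical column named "payload" holding one JSON object per row; both the column name and the field names are invented for illustration.

```python
import json

import pandas as pd

# Hypothetical stand-in for Documents.csv: one JSON string per row.
docs = pd.DataFrame({
    "payload": [
        '{"carrier": {"name": "X"}, "price": 100}',
        '{"carrier": {"name": "Y"}, "price": 120}',
    ]
})

# Parse each JSON string into a Python dict.
parsed = docs["payload"].map(json.loads)

# Flatten nested objects into dotted column names such as "carrier.name".
flat = pd.json_normalize(parsed.tolist())
print(flat)
```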

Potential Improvements

  1. Parameterize file paths and directory names for better flexibility.
  2. Add error handling for file reading and merging operations.
  3. Implement logging instead of print statements for better debugging.
  4. Complete the processing of 'Documents.csv' if required for the overall workflow.
  5. Add data validation steps to ensure data integrity after merging.
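
Several of these improvements can be sketched together. The version below is not the notebook's code: the function names, the logger name, and the row-count check are assumptions chosen for illustration, and read options such as skiprows are passed through as keyword arguments rather than hardcoded.

```python
import logging
from pathlib import Path

import pandas as pd

logger = logging.getLogger("merge_script")  # hypothetical logger name


def read_directory(directory, **read_kwargs):
    """Read every CSV in `directory`, logging and skipping files that fail."""
    frames = []
    for path in sorted(Path(directory).glob("*.csv")):
        try:
            frames.append(pd.read_csv(path, **read_kwargs))
        except (pd.errors.ParserError, pd.errors.EmptyDataError, OSError) as exc:
            logger.warning("Skipping %s: %s", path, exc)
    return frames


def merge_directories(directories, output_path, **read_kwargs):
    """Merge the CSV files found in `directories` and write one output file."""
    frames = []
    for directory in directories:
        found = read_directory(directory, **read_kwargs)
        logger.info("Read %d file(s) from %s", len(found), directory)
        frames.extend(found)
    if not frames:
        raise ValueError("No readable CSV files found in any directory")

    result = pd.concat(frames, ignore_index=True, sort=False)

    # Basic validation: concatenation should neither drop nor duplicate rows.
    expected_rows = sum(len(frame) for frame in frames)
    if len(result) != expected_rows:
        raise ValueError(f"Expected {expected_rows} rows, got {len(result)}")

    result.to_csv(output_path, index=False)
    logger.info("Wrote %d rows to %s", len(result), output_path)
    return result
```

A call might look like merge_directories(["dir_a", "dir_b"], "inlandCarrier.csv", skiprows=[0, 1, 2, 3, 9]), where the directory names are placeholders. Calling logging.basicConfig(level=logging.INFO) beforehand makes the progress messages visible.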

Conclusion

This notebook provides a foundation for merging multiple CSV files containing shipping and logistics data. It can be expanded and refined to fit into a larger data processing pipeline for analyzing shipping routes, prices, or other logistics-related information.