Read and write Apache Parquet files using MuleSoft

UPDATE: 10/24/2022 – The connector has been updated to support InputStream, so there’s no need to read from a local file store. You can read from or write to a stream directly, for example against an endpoint like AWS S3.

Apache Parquet is a file format designed to support fast data processing for complex data. Unlike row-based formats like CSV, Parquet is column-oriented – meaning the values of each table column are stored next to each other, rather than the values of each record. Each Parquet file also carries metadata describing its schema and structure, which makes the format self-describing.
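To make the row-versus-column distinction concrete, here is a minimal sketch in plain Python. It does not read or write real Parquet; the `orders` records, the `to_columns` helper, and the `schema` dictionary are hypothetical names made up for illustration. It only shows the same data pivoted from a record-by-record layout into a column-by-column one, with a small schema kept alongside.

```python
# Three records of a hypothetical "orders" table, stored row by row
# (the way a CSV file would lay them out).
rows = [
    {"id": 1, "country": "US", "amount": 9.5},
    {"id": 2, "country": "US", "amount": 12.0},
    {"id": 3, "country": "DE", "amount": 7.25},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


# Column-oriented layout: all values of one column sit next to each other.
columns = to_columns(rows)

# A self-describing format keeps the schema with the data. Here that is
# just a mapping of column name to Python type name (an illustration only,
# not Parquet's actual metadata format).
schema = {name: type(values[0]).__name__ for name, values in columns.items()}

print(columns)
print(schema)
```

In the columnar form, everything under `country` is one contiguous list, which is the property the rest of this post relies on.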

Using the Parquet format has two advantages:

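  • Reduced storage: values of the same type are stored together, so they compress well
  • Query performance: a query can read only the columns it needs instead of every record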
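Both advantages can be sketched in a few lines of plain Python. This is an illustration under assumptions, not Parquet's actual encoder: the `run_length_encode` helper and the sample `country` and `amount` columns are hypothetical, and run-length encoding stands in here for the family of column encodings a columnar format can apply.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive repeated values into (value, count) pairs."""
    return [(value, sum(1 for _ in group)) for value, group in groupby(values)]


# Hypothetical columns from a table with six records.
country = ["US", "US", "US", "US", "DE", "DE"]
amount = [9.5, 12.0, 3.0, 4.5, 7.25, 1.75]

# Reduced storage: repeated values in one column collapse to a few pairs.
encoded_country = run_length_encode(country)

# Query performance: summing amounts touches only the "amount" column;
# the "country" values never have to be read.
total_amount = sum(amount)

print(encoded_country)
print(total_amount)
```

In a row-based file the `US` values would be interleaved with ids and amounts, so the repeats could not be collapsed this way, and a sum over `amount` would have to scan every record.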

Out of the box, MuleSoft doesn’t currently support reading and writing Parquet files, but the flexibility of the platform allows developers to easily extend it to do so. Using the Mule SDK and the Apache Parquet libraries, I created a community connector, which you can find here:

https://github.com/djuang1/parquet

Check out this video that walks you through an example Mule application that leverages the Parquet connector.

