Fixing small files performance issues in Apache Spark, using DataFlint

Lecturer: Meni Shmueli | 10.4.2024

One of the big challenges in big data is interacting with the storage layer, especially in a data lake, where we are the ones managing the files and partitions.

One of the most common performance problems in data lakes is working with small files.

In this lecture we will learn about:
* Why it’s important to read and write files at best-practice sizes
* How Apache Spark interacts with files under the hood, and how this relates to Spark tasks
* How to easily detect and fix small-files problems (using the open-source library DataFlint)
* How to handle small-files problems when using table formats such as Delta Lake & Iceberg
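As a taste of the "best-practice size" topic above: the common guidance is to aim for output files of roughly 128 MB to 1 GB. A minimal sketch (plain Python; the helper name and the 128 MB default are illustrative, not from the lecture) of how you might size a compaction job before calling Spark's `DataFrame.repartition`:

```python
import math

def target_partition_count(total_bytes: int,
                           target_file_bytes: int = 128 * 1024 * 1024) -> int:
    """Number of output partitions so each written file lands near the target size."""
    return max(1, math.ceil(total_bytes / target_file_bytes))

# A 10 GiB dataset compacted into ~128 MiB files needs 80 partitions:
print(target_partition_count(10 * 1024**3))  # → 80

# In a Spark job this would drive something like (hypothetical usage):
#   df.repartition(target_partition_count(size_in_bytes)).write.parquet(path)
```

Because each Spark write task produces (at least) one file per partition, controlling the partition count before the write is the usual lever for avoiding both too-many-tiny-files and too-few-huge-files.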

Lecturer: Meni Shmueli, founder and author of DataFlint (https://github.com/dataflint/spark).
Ex-Unit 81, ex-ZipRecruiter and ex-Granulate.
Passionate about everything related to big data, and about working with data teams to solve their day-to-day challenges.
Over the years he has helped dozens of companies improve performance, debug issues and increase dev velocity in the big data world, and he is currently working on performance observability for big data with DataFlint.
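For readers who want to try DataFlint before the lecture: per its README, it is enabled by adding the package and registering its Spark plugin. The exact coordinates and version below are an assumption from memory; check the repository linked above for the current ones.

```shell
# Attach DataFlint to a Spark job (version number is illustrative; see the repo):
spark-submit \
  --packages io.dataflint:spark_2.12:0.2.6 \
  --conf spark.plugins=io.dataflint.spark.SparkDataflintPlugin \
  my_job.py
```

Once enabled, DataFlint adds its own tab to the Spark UI, which is where the small-files diagnostics discussed in the lecture show up.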

Video

Slides


——————————————————————————————————————————
I put a lot of thought into these blogs so I can share the information in a clear and useful way.
If you have any comments, thoughts or questions, or you need someone to consult with,

feel free to contact me via LinkedIn: Omid Vahdaty
