Spark parse modes comparison

FAILFAST, PERMISSIVE, and DROPMALFORMED all behave differently.

Hard choice!

For all the use cases below we will use this CSV file:
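The original post shows the file as a screenshot, so the exact contents are not reproduced here; a hypothetical file matching the description (two header columns, extra separators in rows 4 and 6) could look like this:

```csv
age,listen_mom
10,yes
12,no
8,yes,please
16,no
5,yes,mommy
```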

We can spot that, against the two header columns, rows 4 and 6 have extra separators, which will break our parsing 🚫

We will tell Spark to use the schema `age STRING, listen_mom STRING`, which should certainly cause some trouble. Let's see how the different modes parse it.

In FAILFAST mode we expect an exception to be thrown on the first malformed record.

Yes, Spark and Kotlin — friends forever

And there we go: Spark itself recommends that we use PERMISSIVE mode!

That’s called `die trying!`
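You don't need a running Spark cluster to see the semantics. Below is a minimal plain-Python sketch of FAILFAST behaviour; the row data is a hypothetical stand-in for the file above, and the real Spark call is shown in a comment:

```python
# In real Spark this would be roughly:
#   spark.read.option("mode", "FAILFAST")
#        .schema("age STRING, listen_mom STRING").csv(path)
# Below is a plain-Python simulation of the same semantics.
EXPECTED_COLUMNS = 2  # schema: age STRING, listen_mom STRING

ROWS = [
    "10,yes",
    "12,no",
    "8,yes,please",  # row 4 of the file: extra separator
    "16,no",
    "5,yes,mommy",   # row 6 of the file: extra separator
]

def parse_failfast(lines):
    parsed = []
    for line in lines:
        fields = line.split(",")
        if len(fields) != EXPECTED_COLUMNS:
            # FAILFAST throws on the very first malformed record
            raise ValueError(f"Malformed record: {line!r}")
        parsed.append(fields)
    return parsed

try:
    parse_failfast(ROWS)
except ValueError as err:
    print(err)  # stops at "8,yes,please"; row 6 is never even reached
```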

This mode will at least output something, though.

Based on your business case, you can decide whether that's acceptable or not.
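Here is the same plain-Python sketch for PERMISSIVE, again over hypothetical rows: every record survives, but the malformed ones come out as nulls (Spark fills the schema columns with null and can also stash the raw line in a `_corrupt_record` column):

```python
# In real Spark: spark.read.option("mode", "PERMISSIVE") ...
EXPECTED_COLUMNS = 2  # schema: age STRING, listen_mom STRING

ROWS = [
    "10,yes",
    "12,no",
    "8,yes,please",  # extra separator
    "16,no",
    "5,yes,mommy",   # extra separator
]

def parse_permissive(lines):
    parsed = []
    for line in lines:
        fields = line.split(",")
        if len(fields) == EXPECTED_COLUMNS:
            parsed.append(fields)
        else:
            # malformed record: keep the row, but null out every field
            parsed.append([None] * EXPECTED_COLUMNS)
    return parsed

for row in parse_permissive(ROWS):
    print(row)  # all 5 rows come back; rows 4 and 6 are [None, None]
```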

This mode drops the portion of the file that doesn't match the schema.

Use it when you want to be sure no one gets bad data from you.

Spark says, Spark does! It dropped the rows that didn't comply with the schema.
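And the same sketch for DROPMALFORMED, on the same hypothetical rows: malformed records simply disappear from the result, so only 3 of the 5 rows survive:

```python
# In real Spark: spark.read.option("mode", "DROPMALFORMED") ...
EXPECTED_COLUMNS = 2  # schema: age STRING, listen_mom STRING

ROWS = [
    "10,yes",
    "12,no",
    "8,yes,please",  # extra separator -> dropped
    "16,no",
    "5,yes,mommy",   # extra separator -> dropped
]

def parse_dropmalformed(lines):
    # keep only records whose field count matches the schema
    return [fields for fields in (line.split(",") for line in lines)
            if len(fields) == EXPECTED_COLUMNS]

print(parse_dropmalformed(ROWS))  # the two extra-separator rows are gone
```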

Use parse modes wisely:

  • Critical data -> FAILFAST
  • Quick, rough analysis -> PERMISSIVE
  • Only valid data, no noise -> DROPMALFORMED