FAILFAST, PERMISSIVE, and DROPMALFORMED all behave differently
For all the use cases below we will use this CSV file:
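The original file isn't reproduced here, so the snippet below is a hypothetical stand-in consistent with the description that follows: two declared columns, with an extra separator on lines 4 and 6.

```
age,listen_mom
18,yes
25,no
31,yes,maybe
42,yes
57,no,sometimes
64,yes
```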
We can spot that rows 4 and 6 have an extra separator beyond the two header columns, which will break our parsing 🚫
Spark will be told to use the schema `age STRING, listen_mom STRING`, which should definitely cause some trouble. Let's see how the different modes parse it.
In FAILFAST mode we expect an exception to be thrown on the first malformed record.
And there we go: in the exception message, Spark even recommends that we use PERMISSIVE mode!
PERMISSIVE mode, on the other hand, will always give you some output: instead of failing the job, Spark loads the malformed records with whatever could be salvaged.
Based on your business case you can decide whether that's acceptable or not.
DROPMALFORMED silently discards the bad rows. Use it when you want to be sure no bad data reaches anyone downstream.
Use parse modes wisely:
- Critical data -> FAILFAST
- Quick, rough analysis -> PERMISSIVE
- Only valid data needed -> DROPMALFORMED