Using Spark on Saagie means launching Spark jobs on Kubernetes. An example of a correct job submission on Kubernetes would be:
spark-submit \
--driver-memory 2G \
--class <ClassName of the Spark Application to launch> \
--conf spark.executor.memory=3G \
--conf spark.executor.cores=4 \
--conf spark.executor.instances=3 \
<path to the application JAR>
- spark.executor.memory represents the amount of memory for each executor (both request and limit)
- spark.executor.cores represents the number of CPU cores requested for each executor
- spark.kubernetes.executor.limit.cores represents the limit of CPU cores for each executor
- spark.executor.instances represents the number of executors for the whole application
In the example above, the total provisioned for the application would be 3 executors with 4 cores and 3 GB of memory each, i.e. 12 CPU cores and 9 GB of memory in total.
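Note that spark.kubernetes.executor.limit.cores does not appear in the submission above, so executors are only bounded by their CPU request. If you want an explicit limit, you could add the options below to your spark-submit command; this is a minimal sketch and the values are illustrative, not a recommendation:

--conf spark.executor.cores=2 \
--conf spark.kubernetes.executor.limit.cores=4 \

With this pair of options, each executor pod requests 2 cores but is allowed to burst up to 4 when its node has spare capacity.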
- CPU: it is a known good practice to provision between 2 and 4 cores per executor, depending on your cluster topology. If your cluster only has nodes with 4 CPUs, Kubernetes will have trouble finding a node with 4 fully free cores to schedule an executor, so you might want to limit executors to 2 cores in that case (see the example submission after this list).
- Memory: ideally, provision at least 4 GB of memory per executor.
- Driver memory: unless you are fetching large amounts of data from the executors back to the driver, you do not need to change the default configuration, as the driver's role is only to orchestrate the different jobs in your Spark application.
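Putting these recommendations together, a submission tuned for a cluster of 4-CPU nodes might look like the sketch below. The class name, executor count, and JAR path are placeholders to adapt to your application:

spark-submit \
--class <ClassName of the Spark Application to launch> \
--conf spark.executor.memory=4G \
--conf spark.executor.cores=2 \
--conf spark.kubernetes.executor.limit.cores=2 \
--conf spark.executor.instances=3 \
<path to the application JAR>

This configuration would request 3 executors with 2 cores and 4 GB of memory each (6 CPU cores and 12 GB in total), and leaves the driver memory at its default.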
Here is a list of useful articles that can help you understand performance tuning in Spark, detect performance issues, and apply best practices to avoid slowness or bottlenecks in your workflows.