Resource configuration
Spark on Saagie uses Mesos as its orchestrator in V1, and Kubernetes in V2 (Projects and Jobs). In Spark, configuration parameters differ depending on whether you're using Mesos, Kubernetes, Standalone or YARN. You should therefore be careful when configuring the resource usage of your job in your spark-submit command, depending on the Saagie version you're using.
Mesos (V1)
In V1, Saagie is configured to launch Spark on Mesos in coarse-grained mode. By default, a Spark application will be launched with:
- Driver memory = 1G
- Driver core = 1
- Executor memory = 4G
- Executor cores = all the available cores on the worker in Mesos coarse-grained mode.
- Number of executors = as many as the available resources in Mesos allow
As you can see, forgetting to configure the executor resources might result in overuse of the cluster's resources. It is thus good practice to specify, for each application, the resources needed with regard to the volume of data to be processed.
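For example, a submission with no executor limits at all, like the sketch below, would let the application claim every available core of the Mesos cluster in coarse-grained mode:
spark-submit \
--class <ClassName of the Spark Application to launch> \
{file}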
When launching your Spark job, you can pass parameters to Spark with either of the following two syntaxes:
--executor-memory 2G
or
--conf spark.executor.memory=2G
We recommend the second syntax, as the option name matches exactly the property name in the Spark documentation: https://spark.apache.org/docs/latest/configuration.html#available-properties (for example, the option --total-executor-cores corresponds to --conf spark.cores.max).
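For instance, both of the lines below request the same total of 12 cores for the application (the value is purely illustrative):
--total-executor-cores 12
--conf spark.cores.max=12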
An example of a correct job submission with Mesos would be:
spark-submit \
--driver-memory 2G \
--class <ClassName of the Spark Application to launch> \
--conf spark.executor.memory=3G \
--conf spark.executor.cores=4 \
--conf spark.cores.max=12 \
{file}
where:
- spark.executor.memory represents the amount of memory for each executor
- spark.executor.cores represents the number of CPU cores for each executor
- spark.cores.max represents the total number of CPU cores for the whole application
In the example above, the total provisioned would be 3 executors (spark.cores.max / spark.executor.cores = 12 / 4) of 4 cores and 3G of memory each, i.e. 12 CPU cores and 9G of memory in total.
Other properties are available (see the official documentation) and must be specified with the same syntax as the example above.
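For instance, the default parallelism property from that page would be passed the same way (the value of 24 is purely illustrative):
--conf spark.default.parallelism=24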
Important: your Saagie job memory must be configured with at least the same amount of memory as configured in your command line (with --driver-memory). Forgetting to do so might result in performance issues or timeout errors.
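For example (the 2G value is illustrative), if your command line contains
--driver-memory 2G
then the Saagie job memory must be set to at least 2 GB in the job settings.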
Kubernetes (V2)
In V2 (any job created within Projects and Jobs), Saagie is configured to launch Spark on Kubernetes. An example of a correct job submission with Kubernetes would be:
spark-submit \
--driver-memory 2G \
--class <ClassName of the Spark Application to launch> \
--conf spark.executor.memory=3G \
--conf spark.executor.cores=4 \
--conf spark.executor.instances=3 \
{file}
where:
- spark.executor.memory represents the amount of memory for each executor
- spark.executor.cores represents the number of CPU cores for each executor
- spark.executor.instances represents the number of executors for the whole application
In the example above, the total provisioned would be 3 executors of 4 cores and 3G of memory each, i.e. 12 CPU cores and 9G of memory in total.
Recommendations
- CPU: it is good practice to provision between 2 and 4 cores per executor.
- Memory: ideally, a minimum of 4 GB per executor should be provisioned.
- Driver memory: unless you are fetching large amounts of data from the executors back to the driver, you don't need to change the default configuration, as the driver's role is only to orchestrate the different jobs in your Spark application. Don't forget to configure your Saagie job with at least 1 GB of memory. A submission combining these recommendations is sketched below.
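Putting these recommendations together, a Kubernetes (V2) submission could look like the following sketch (driver memory is left at its default, and the executor count of 3 is purely illustrative):
spark-submit \
--class <ClassName of the Spark Application to launch> \
--conf spark.executor.memory=4G \
--conf spark.executor.cores=4 \
--conf spark.executor.instances=3 \
{file}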
Performance tuning
Here's a list of useful articles that can help you understand performance tuning in Spark, detect performance issues, and apply best practices to avoid slowness or bottlenecks in your workflow.
- Spark best practices
- Spark performance tuning tips
- Joins in Spark, Part 1 - Part 2 - Part 3