Spark 2.x External Packages
The downside of using bleeding-edge technology is that information about new features in the latest version is scarce or hard to find. We at Unnati use bleeding-edge releases of many data science tools for various research and production systems. In this post we explain how to add external jars to an Apache Spark 2.x application.
In Spark 2.x, we can use the --packages option to pass additional jars to spark-submit. Spark looks through the local ivy2 repository for the jar; if it is missing, Spark pulls the dependency from the central Maven server.
$SPARK_HOME/bin/spark-submit --packages org.mongodb.spark:mongo-spark-connector_2.10:2.0.0 <py-file>
In the above example, we are adding the mongodb-spark connector. This works perfectly fine. However, there are scenarios where Spark is used as part of a Python application. In this case, we use SparkContext to specify the configuration.
There is no way to set the packages option this way. Instead, we need to use spark-defaults.conf to specify the external jar. Add the following to the file:
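As a sketch of what that entry could look like, assuming the standard spark.jars.packages property and the same connector coordinates as in the spark-submit example above (the file normally lives at $SPARK_HOME/conf/spark-defaults.conf; adjust the coordinates to your own dependency):

```
# $SPARK_HOME/conf/spark-defaults.conf
# Assumed entry: comma-separated Maven coordinates, equivalent to --packages
spark.jars.packages  org.mongodb.spark:mongo-spark-connector_2.10:2.0.0
```

Several packages can be listed on the same line, separated by commas.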
Now run your PySpark application as usual.
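If the connector classes are still not found at runtime, one thing worth checking is whether the entry is actually present in spark-defaults.conf. Below is a minimal, stdlib-only sketch of such a check. It assumes the entry uses the spark.jars.packages key and that keys and values are separated by whitespace or "=". The function names read_spark_defaults and configured_packages are hypothetical helpers, not part of Spark.

```python
import re
from pathlib import Path


def read_spark_defaults(path):
    """Parse a spark-defaults.conf style file into a dict of property -> value.

    Simplified reader: blank lines and lines starting with '#' are skipped,
    and the key ends at the first '=' or run of whitespace.
    """
    props = {}
    for raw in Path(path).read_text().splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = re.split(r"\s*=\s*|\s+", line, maxsplit=1)
        key = parts[0]
        value = parts[1] if len(parts) > 1 else ""
        props[key] = value
    return props


def configured_packages(path):
    """Return the Maven coordinates listed under spark.jars.packages, if any."""
    value = read_spark_defaults(path).get("spark.jars.packages", "")
    return [pkg.strip() for pkg in value.split(",") if pkg.strip()]
```

Calling configured_packages on the path to your spark-defaults.conf returns a list of coordinates. An empty list suggests the entry is missing or was added to a different file than the one Spark reads.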