AWS Big Data Blog

Using Python 3.4 on EMR Spark Applications

Bruno Faria is a Big Data Support Engineer for Amazon Web Services

Many data scientists choose Python when developing on Spark. With last month’s Amazon EMR release 4.6, we’ve made it even easier to use Python: Python 3.4 is installed on your EMR cluster by default. You’ll still find Python 2.6 and 2.7 on your cluster, but the inclusion of 3.4 means you no longer have to configure custom bootstrap actions to install Python 3 on EMR.

An EMR 4.6 cluster running Spark 1.6.1 still uses Python 2.7 as the default interpreter. To change this, set the environment variable PYSPARK_PYTHON=python34. You can do this when you launch a cluster by using the configurations API and supplying the configuration shown in the snippet below:

[
    {
        "Classification": "spark-env",
        "Properties": {},
        "Configurations": [
            {
                "Classification": "export",
                "Properties": {
                    "PYSPARK_PYTHON": "python34"
                },
                "Configurations": []
            }
        ]
    }
]
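As a sketch of how this configuration could be supplied at launch time, the AWS CLI command below assumes the JSON above has been saved locally as myConfig.json; the key pair name, instance type, and instance count are placeholders you would replace with your own values.

```shell
# Launch an EMR 4.6 cluster with Spark, applying the spark-env
# configuration saved in the local file myConfig.json.
# KeyName, instance type, and count below are placeholder values.
aws emr create-cluster --release-label emr-4.6.0 \
    --applications Name=Spark \
    --configurations file://./myConfig.json \
    --ec2-attributes KeyName=myKey \
    --instance-type m3.xlarge \
    --instance-count 3 \
    --use-default-roles
```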

This will set the PYSPARK_PYTHON environment variable in the /etc/spark/conf/spark-env.sh file. After that, Spark Python applications will use Python 3.4 as the default interpreter. The screenshot below shows PySpark using Python 3.4 on an EMR 4.6 cluster:
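If you want to confirm which interpreter your application is running under, a quick check of sys.version_info works; the minimal sketch below runs on the driver, and wrapping the same function in an RDD operation (for example, sc.parallelize(range(1)).map(lambda _: interpreter_version()).collect() on an active SparkContext named sc) would report the executors' interpreter as well.

```python
import sys

def interpreter_version():
    """Return the running interpreter's version as 'major.minor'."""
    # sys.version_info reflects the Python executing this code; on EMR,
    # PYSPARK_PYTHON controls which interpreter PySpark launches.
    return "{}.{}".format(sys.version_info.major, sys.version_info.minor)

# On an EMR 4.6 cluster configured as above, this would report 3.4.
print(interpreter_version())
```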

If you have questions or suggestions, please leave a comment below.

----

Related

Crunching Statistics at Scale with SparkR on Amazon EMR