How to install PySpark

2022-04-19

2022-04-22

pyspark

Preparation

Installing Spark requires Python 3.
If you are new to Python, install Anaconda.

Installing Java

Installer: Java SE 8 Archive Downloads (JDK 8u211 and later)
You need to log in to an Oracle account to download it.
Run the downloaded file as administrator → click Next → change the installation path (a space in the path, as in Program Files, can cause problems during installation).
Changing the path
Make the same change for the Java Runtime Environment folder (click 'Change' and edit the path).
Create a jre folder directly under the C drive and select it as the destination.
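
The space-in-path warning above can be checked with a few lines of Python. This is only a minimal sketch: the helper name `path_has_space` is hypothetical, and it assumes JAVA_HOME has already been set (which happens in the environment variable step later in this post).

```python
import os


def path_has_space(path: str) -> bool:
    """Return True if the path contains a space character."""
    return " " in path


if __name__ == "__main__":
    # JAVA_HOME is read from the environment; it may still be unset at this point.
    java_home = os.environ.get("JAVA_HOME", "")
    if not java_home:
        print("JAVA_HOME is not set yet")
    elif path_has_space(java_home):
        print(f"Warning: JAVA_HOME contains a space: {java_home}")
    else:
        print(f"JAVA_HOME has no spaces: {java_home}")
```

The same helper can be pointed at any other install folder you want to check.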

Installing Spark

Download site: https://spark.apache.org/downloads.html
Download the installation file
- Click the Download Spark link, [spark-3.2.0-bin-hadoop3.2.tgz](https://www.apache.org/dyn/closer.lua/spark/spark-3.2.0/spark-3.2.0-bin-hadoop3.2.tgz), then download the file from the link at the bottom of the HTTP section on the page that opens.
  - Download URL: https://www.apache.org/dyn/closer.lua/spark/spark-3.2.0/spark-3.2.0-bin-hadoop3.2.tgz (as of 2022.01)
Download WinRAR
- WinRAR is needed to extract the .tgz file.
- Download page: https://www.rarlab.com/download.htm
- Install the version that matches your computer.
Create a spark folder and move the files
- Moving the files
  - Copy all the files in the spark-3.2.0-bin-hadoop3.2 folder.
- Then create a spark folder directly under the C drive and move all of them into it.
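
If you would rather not install WinRAR, Python's standard `tarfile` module can also unpack a .tgz file. The sketch below is one possible way to do the extract-and-move step, not the method used in this post. The function name `extract_spark` and the temporary folder name `_extract_tmp` are made up for illustration, and it assumes the archive contains a single top-level folder.

```python
import shutil
import tarfile
from pathlib import Path


def extract_spark(archive: str, dest: str) -> Path:
    """Extract a .tgz and move the contents of its top-level folder into dest."""
    dest_path = Path(dest)
    dest_path.mkdir(parents=True, exist_ok=True)
    staging = dest_path / "_extract_tmp"  # hypothetical scratch folder
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(staging)
    # Assumption: the archive holds one top-level folder,
    # e.g. spark-3.2.0-bin-hadoop3.2.
    top_level = next(p for p in staging.iterdir() if p.is_dir())
    for item in top_level.iterdir():
        shutil.move(str(item), str(dest_path / item.name))
    shutil.rmtree(staging)
    return dest_path


if __name__ == "__main__":
    # Assumed download location; adjust to where the file was saved.
    archive = Path(r"C:\Users\me\Downloads\spark-3.2.0-bin-hadoop3.2.tgz")
    if archive.exists():
        print("Extracted to", extract_spark(str(archive), r"C:\spark"))
    else:
        print("Archive not found:", archive)
```

Only extract archives downloaded from a source you trust, since `extractall` writes whatever paths the archive contains.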

Modify log4j.properties file

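- In the conf folder, open the log4j.properties file (if only log4j.properties.template exists, copy it and rename the copy to log4j.properties).

Open the file in Notepad and change INFO to ERROR, as in the example below.

With this change, only error messages are printed to the console while Spark runs.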

# Set everything to be logged to the console
# log4j.rootCategory=INFO, console
log4j.rootCategory=ERROR, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
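
The same INFO → ERROR edit can be scripted instead of done by hand in Notepad. This is a rough sketch under the assumption that the file uses the `log4j.rootCategory=` key shown above. The helper name `set_root_log_level` is hypothetical, and it works on the file's text rather than touching the file directly.

```python
import re


def set_root_log_level(properties_text: str, level: str = "ERROR") -> str:
    """Replace the level on the active log4j.rootCategory line.

    Lines starting with '#' are comments and are left untouched.
    """
    pattern = re.compile(r"^(log4j\.rootCategory=)\w+", flags=re.MULTILINE)
    return pattern.sub(lambda m: m.group(1) + level, properties_text)


if __name__ == "__main__":
    sample = (
        "# Set everything to be logged to the console\n"
        "log4j.rootCategory=INFO, console\n"
    )
    print(set_root_log_level(sample))
```

To apply it to the real file, read conf\log4j.properties into a string, pass it through the function, and write the result back.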

Installing winutils

Next, we need a program that makes Spark treat the local computer as a Hadoop environment.
- Download from: https://github.com/cdarlint/winutils
  - Download the winutils build that matches your installed version.
- I downloaded version 3.2.0.
Create a winutils\bin folder on the C drive and save the downloaded file there.
Make sure the file has the permissions it needs so that no errors occur when running Spark.
- To do this, open CMD as administrator and run the file as shown below.
- If a ChangeFileModeByMask error (3) occurs, create a tmp\hive folder directly under the C drive.

C:\Windows\system32>cd c:\winutils\bin
c:\winutils\bin> winutils.exe chmod 777 \tmp\hive
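
The folder creation and the chmod call can also be driven from Python. This is just a sketch of the same two steps: the helper names `ensure_hive_tmp` and `chmod_command` are hypothetical, and the paths assume the C:\winutils\bin layout used in this post. It still needs to be run from a prompt opened as administrator.

```python
import subprocess
from pathlib import Path


def ensure_hive_tmp(drive_root: str) -> Path:
    """Create <drive_root>/tmp/hive if it does not exist and return its path."""
    hive_dir = Path(drive_root) / "tmp" / "hive"
    hive_dir.mkdir(parents=True, exist_ok=True)
    return hive_dir


def chmod_command(winutils_exe: str, target: str) -> list:
    """Build the winutils chmod command as an argument list."""
    return [winutils_exe, "chmod", "777", target]


if __name__ == "__main__":
    exe = Path(r"C:\winutils\bin\winutils.exe")  # assumed install location
    if exe.exists():
        ensure_hive_tmp("C:\\")
        subprocess.run(chmod_command(str(exe), r"\tmp\hive"), check=True)
    else:
        print("winutils.exe not found; nothing was run")
```

Creating tmp\hive first avoids the ChangeFileModeByMask error mentioned above, since the folder exists before chmod is called.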

Setting environment variables

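Set the system environment variables
- For each user account, click the New button under User variables.
Set the SPARK_HOME variable (the Spark folder, C:\spark in this guide)
Set the JAVA_HOME variable (the folder where the JDK was installed)
Set the HADOOP_HOME variable (the winutils folder, C:\winutils in this guide)
Edit the PATH variable and add the entries below.
- %SPARK_HOME%\bin
- %JAVA_HOME%\bin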
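
After saving the variables, open a new CMD window so the changes are picked up. The settings can then be checked with a small script like the one below. It is a sketch only: the function name `missing_settings` is hypothetical, and it assumes Windows has already expanded %SPARK_HOME% and %JAVA_HOME% inside PATH by the time Python reads it.

```python
import os

REQUIRED_VARS = ("SPARK_HOME", "JAVA_HOME", "HADOOP_HOME")


def missing_settings(env) -> list:
    """Return a list of problems found in an environment mapping."""
    problems = [f"{name} is not set" for name in REQUIRED_VARS if not env.get(name)]
    # Windows separates PATH entries with ';' and ignores case in paths.
    path_entries = [
        p.strip().rstrip("\\/").lower() for p in env.get("PATH", "").split(";")
    ]
    for name in ("SPARK_HOME", "JAVA_HOME"):
        home = env.get(name)
        if home:
            expected = (home.rstrip("\\/") + "\\bin").lower()
            if expected not in path_entries:
                problems.append(f"PATH does not contain {name}\\bin")
    return problems


if __name__ == "__main__":
    found = missing_settings(os.environ)
    if found:
        for problem in found:
            print(problem)
    else:
        print("All three variables are set and PATH contains both bin folders")
```

Passing the environment in as an argument keeps the check easy to try out with a plain dictionary before running it against the real `os.environ`.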

Testing Spark

Open CMD and change the directory to the c:\spark folder.
- Run pyspark; if the Spark logo appears, the installation succeeded.

Check whether the code below works

>>> rd = sc.textFile("README.md")
>>> rd.count()
109
>>>
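
The same check can be written as a standalone script instead of being typed into the shell. This is a sketch under a few assumptions: the file name check_install.py is made up, the script is launched with spark-submit (or in some other way that makes `pyspark` importable), and the plain-Python count is included only as a reference value to compare against.

```python
import sys


def count_lines_plain(path: str) -> int:
    """Count lines with plain Python, as a reference value."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)


def count_lines_spark(path: str) -> int:
    """Count lines with Spark, mirroring sc.textFile(path).count() in the shell."""
    # Imported here so the plain helper still works where pyspark is missing.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[*]")
        .appName("install-check")
        .getOrCreate()
    )
    try:
        return spark.sparkContext.textFile(path).count()
    finally:
        spark.stop()


if __name__ == "__main__":
    # Only runs when a file path is passed on the command line.
    if len(sys.argv) > 1:
        target = sys.argv[1]
        print("plain Python:", count_lines_plain(target))
        print("Spark:", count_lines_spark(target))
```

From the c:\spark folder it could be run as `bin\spark-submit check_install.py README.md`. If Spark is working, the two numbers should normally agree.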

Python, PySpark