Apache Spark 02 - Introduction to Dataframe

Jupyter Notebook

import findspark
findspark.init("/opt/spark") # spark home

from pyspark.sql import SparkSession
Create a Spark session
  • application name (required)
  • local mode (required)
  • local[2] = 2 cores/threads
spark = SparkSession.builder \
    .appName("Introduction to Dataframe") \
    .master("local[2]") \
    .getOrCreate()
Create dataframe
df = spark.createDataFrame(
  [("Java", 20000), ("Python", 100000), ("Scala", 3000)], #rows
  ["language","users_count"] # columns
)
Show dataframe
df.show()

+--------+-----------+
|language|users_count|
+--------+-----------+
|    Java|      20000|
|  Python|     100000|
|   Scala|       3000|
+--------+-----------+
Show with a row count
df.show(1)

+--------+-----------+
|language|users_count|
+--------+-----------+
|    Java|      20000|
+--------+-----------+
only showing top 1 row
Show with truncate
df.show(n=3, truncate=False)

+--------+-----------+
|language|users_count|
+--------+-----------+
|    Java|      20000|
|  Python|     100000|
|   Scala|       3000|
+--------+-----------+
Print schema
df.printSchema()

root
 |-- language: string (nullable = true)
 |-- users_count: long (nullable = true)
Convert to a pandas dataframe

A Spark dataframe is distributed across the cluster.

A pandas dataframe is not distributed; it lives entirely in the driver's memory.

Warning! Use limit() before toPandas(), otherwise all the data rushes to the driver and can exhaust its memory.

df_pd = df.limit(5).toPandas()
df_pd
  language  users_count
0     Java        20000
1   Python       100000
2    Scala         3000
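Once converted, the result is an ordinary in-memory pandas dataframe, so regular pandas operations run locally on the driver. A small sketch (rebuilding the same data directly in pandas so it runs without a Spark session):

```python
import pandas as pd

# Same rows and columns as the Spark dataframe above, built directly
# in pandas so this snippet needs no Spark installation.
df_pd = pd.DataFrame(
    [("Java", 20000), ("Python", 100000), ("Scala", 3000)],
    columns=["language", "users_count"],
)

# Ordinary pandas operations now run locally on the driver.
total = df_pd["users_count"].sum()                           # 123000
top = df_pd.loc[df_pd["users_count"].idxmax(), "language"]   # "Python"
print(total, top)
```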
Pandas dataframe length
len(df_pd)

    3
Pandas dataframe type (class)
type(df_pd)

    pandas.core.frame.DataFrame
Spark dataframe type (class)
type(df)

    pyspark.sql.dataframe.DataFrame
Spark session stop
spark.stop()
This post is licensed under CC BY 4.0 by the author.