Apache Spark 02 - Introduction to Dataframe
import findspark
findspark.init("/opt/spark")  # Spark home directory
from pyspark.sql import SparkSession
Create Spark session
- application name (required)
- local mode as master (required)
- 2 cores/threads
spark = SparkSession.builder \
    .appName("Introduction to Dataframe") \
    .master("local[2]") \
    .getOrCreate()
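As an optional sanity check, the session object exposes the Spark version and the master it was built with (the printed version depends on your installation):

print(spark.version)              # e.g. 3.x, whatever is installed under /opt/spark
print(spark.sparkContext.master)  # local[2]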
Create dataframe
df = spark.createDataFrame(
    [("Java", 20000), ("Python", 100000), ("Scala", 3000)],  # rows
    ["language", "users_count"]  # columns
)
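Here Spark infers the column types from the data. If you want to pin them down explicitly, createDataFrame() also accepts a schema; a minimal sketch with the same rows (df_typed is just an illustrative name):

from pyspark.sql.types import StructType, StructField, StringType, LongType

schema = StructType([
    StructField("language", StringType(), True),     # nullable string
    StructField("users_count", LongType(), True)     # nullable 64-bit integer
])

df_typed = spark.createDataFrame(
    [("Java", 20000), ("Python", 100000), ("Scala", 3000)],
    schema=schema
)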
Show dataframe
df.show()
+--------+-----------+
|language|users_count|
+--------+-----------+
| Java| 20000|
| Python| 100000|
| Scala| 3000|
+--------+-----------+
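When a row has many columns the table layout gets hard to read; show() also accepts vertical=True, which prints each row as a column/value list instead:

df.show(n=1, vertical=True)  # one record, one line per column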
Show a specific number of rows
df.show(1)
+--------+-----------+
|language|users_count|
+--------+-----------+
| Java| 20000|
+--------+-----------+
only showing top 1 row
Show with truncate
df.show(n=3, truncate=False)
+--------+-----------+
|language|users_count|
+--------+-----------+
| Java| 20000|
| Python| 100000|
| Scala| 3000|
+--------+-----------+
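With this small dataframe truncate changes nothing, because every value is shorter than 20 characters, the default cut-off. A sketch with a made-up long string to make the difference visible (df_long is hypothetical):

df_long = spark.createDataFrame(
    [("Python", "A general-purpose language widely used for data engineering")],
    ["language", "description"]
)
df_long.show(truncate=True)   # description is cut at 20 characters, ending with ...
df_long.show(truncate=False)  # the full value is printed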
Print schema
df.printSchema()
root
|-- language: string (nullable = true)
|-- users_count: long (nullable = true)
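printSchema() is meant for human eyes; the same information is available programmatically through columns, dtypes, and schema:

print(df.columns)  # ['language', 'users_count']
print(df.dtypes)   # [('language', 'string'), ('users_count', 'bigint')]
print(df.schema)   # the full StructType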
Convert to pandas dataframe
A Spark dataframe is distributed across the cluster; a pandas dataframe lives entirely on one machine.
Warning! Use limit() before calling toPandas(); otherwise all of the data rushes to the driver and can exhaust its memory.
df_pd = df.limit(5).toPandas()
df_pd
  language  users_count
0     Java        20000
1   Python       100000
2    Scala         3000
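The conversion also works in the other direction: createDataFrame() accepts a pandas dataframe and infers the schema from its dtypes (df_back is just an illustrative name):

df_back = spark.createDataFrame(df_pd)  # pandas -> Spark round trip
df_back.show()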
Pandas dataframe length
len(df_pd)
3
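len() works here because df_pd is a local pandas object. The Spark-side equivalent is count(), which triggers a distributed Spark job:

df.count()  # 3, computed by Spark rather than collected to pandas first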
Pandas dataframe type (class)
type(df_pd)
pandas.core.frame.DataFrame
Spark dataframe type (class)
type(df)
pyspark.sql.dataframe.DataFrame
Stop Spark session
spark.stop()
This post is licensed under CC BY 4.0 by the author.