Apache Spark 02 - Introduction to Dataframe
import findspark
findspark.init("/opt/spark")  # Spark home directory
from pyspark.sql import SparkSession
Create Spark session
- application name (required)
- local mode as master (required)
- 2 cores/threads
spark = SparkSession.builder \
    .appName("Introduction to Dataframe") \
    .master("local[2]") \
    .getOrCreate()
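As an optional sanity check, the session object exposes the Spark version and the master it was built with (the printed version depends on your installation):

print(spark.version)              # e.g. 3.x, whatever is installed under /opt/spark
print(spark.sparkContext.master)  # local[2]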
Create dataframe
df = spark.createDataFrame(
    [("Java", 20000), ("Python", 100000), ("Scala", 3000)],  # rows
    ["language", "users_count"]  # columns
)
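Here Spark infers the column types from the data. If you want to pin them down explicitly, createDataFrame() also accepts a schema; a minimal sketch with the same rows (df_typed is just an illustrative name):

from pyspark.sql.types import StructType, StructField, StringType, LongType

schema = StructType([
    StructField("language", StringType(), True),     # nullable string
    StructField("users_count", LongType(), True)     # nullable 64-bit integer
])

df_typed = spark.createDataFrame(
    [("Java", 20000), ("Python", 100000), ("Scala", 3000)],
    schema=schema
)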
Show dataframe
df.show()
+--------+-----------+
|language|users_count|
+--------+-----------+
| Java| 20000|
| Python| 100000|
| Scala| 3000|
+--------+-----------+
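When a row has many columns the table layout gets hard to read; show() also accepts vertical=True, which prints each row as a column/value list instead:

df.show(n=1, vertical=True)  # one record, one line per column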
Show a specific number of rows
df.show(1)
+--------+-----------+
|language|users_count|
+--------+-----------+
| Java| 20000|
+--------+-----------+
only showing top 1 row
Show with truncate
df.show(n=3, truncate=False)
+--------+-----------+
|language|users_count|
+--------+-----------+
| Java| 20000|
| Python| 100000|
| Scala| 3000|
+--------+-----------+
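With this small dataframe truncate changes nothing, because every value is shorter than 20 characters, the default cut-off. A sketch with a made-up long string to make the difference visible (df_long is hypothetical):

df_long = spark.createDataFrame(
    [("Python", "A general-purpose language widely used for data engineering")],
    ["language", "description"]
)
df_long.show(truncate=True)   # description is cut at 20 characters, ending with ...
df_long.show(truncate=False)  # the full value is printed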
Print schema
df.printSchema()
root
|-- language: string (nullable = true)
|-- users_count: long (nullable = true)
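printSchema() is meant for human eyes; the same information is available programmatically through columns, dtypes, and schema:

print(df.columns)  # ['language', 'users_count']
print(df.dtypes)   # [('language', 'string'), ('users_count', 'bigint')]
print(df.schema)   # the full StructType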
Convert to pandas dataframe
A Spark dataframe is distributed across the cluster; a pandas dataframe lives entirely on one machine.
Warning! Use limit() before calling toPandas(); otherwise all of the data rushes to the driver and can exhaust its memory.
df_pd = df.limit(5).toPandas()
df_pd
  language  users_count
0     Java        20000
1   Python       100000
2    Scala         3000
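The conversion also works in the other direction: createDataFrame() accepts a pandas dataframe and infers the schema from its dtypes (df_back is just an illustrative name):

df_back = spark.createDataFrame(df_pd)  # pandas -> Spark round trip
df_back.show()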
Pandas dataframe length
len(df_pd)
3
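len() works here because df_pd is a local pandas object. The Spark-side equivalent is count(), which triggers a distributed Spark job:

df.count()  # 3, computed by Spark rather than collected to pandas first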
Pandas dataframe type (class)
type(df_pd)
pandas.core.frame.DataFrame
Spark dataframe type (class)
type(df)
pyspark.sql.dataframe.DataFrame
Stop Spark session
spark.stop()
This post is licensed under CC BY 4.0 by the author.