Chapter 16 : Join operations¶
Chapter Learning Objectives¶
Various join operations on data frame.
Chapter Outline¶
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName("Python Spark SQL basic example") \
.config("spark.some.config.option", "some-value") \
.getOrCreate()
from IPython.display import display_html
import pandas as pd
import numpy as np
def display_side_by_side(*args):
html_str=''
for df in args:
html_str+=df.to_html(index=False)
html_str+= "\xa0\xa0\xa0"*10
display_html(html_str.replace('table','table style="display:inline"'),raw=True)
space = "\xa0" * 10
import panel as pn
css = """
div.special_table + table, th, td {
border: 3px solid orange;
}
"""
pn.extension(raw_css=[css])
Chapter Outline - Gallery¶
click on |
any image |
---|---|
Lets first understand the syntax
Syntax
pyspark.sql.DataFrame.join(other, on=None, how=None)
Joins with another DataFrame, using the given join expression.
Parameters:
other – Right side of the join
on – a string for the join column name, a list of column names, a join expression (Column), or a list of Columns. If on is a string or a list of strings indicating the name of the join column(s), the column(s) must exist on both sides, and this performs an equi-join.
how – str, default inner. Must be one of: inner, cross, outer, full, fullouter, full_outer, left, leftouter, left_outer, right, rightouter, right_outer, semi, leftsemi, left_semi, anti, leftanti and left_anti.
‘’’
At the top level there are mainly 3 types of joins:
INNER OUTER CROSS
INNER JOIN - fetches data if present in both the tables.
OUTER JOIN are of 3 types:
LEFT OUTER JOIN - fetches data if present in the left table. RIGHT OUTER JOIN - fetches data if present in the right table. FULL OUTER JOIN - fetches data if present in either of the two tables.
CROSS JOIN, as the name suggests, does [n X m] that joins everything to everything. Similar to scenario where we simply lists the tables for joining (in the FROM clause of the SELECT statement), using commas to separate them.
1a. Inner Join¶
df_left = spark.createDataFrame([(1001,1,100),(1002,2,200),(1003,3,300),
(1004,1,200),(1005,6,200)
],
["order_id","customer_id","amount"])
df_left.show(truncate=False)
df_right = spark.createDataFrame([(1,"john"), (2,"mike"),(3,"tony"),(4,"kent")],
["customer_id","name"])
df_right.show()
+--------+-----------+------+
|order_id|customer_id|amount|
+--------+-----------+------+
|1001 |1 |100 |
|1002 |2 |200 |
|1003 |3 |300 |
|1004 |1 |200 |
|1005 |6 |200 |
+--------+-----------+------+
+-----------+----+
|customer_id|name|
+-----------+----+
| 1|john|
| 2|mike|
| 3|tony|
| 4|kent|
+-----------+----+
Inner Join
The inner join is the default join in Spark SQL. It selects rows that have matching values in both relations.
df_left.join(df_right,on="customer_id",how="inner").show()
+-----------+--------+------+----+
|customer_id|order_id|amount|name|
+-----------+--------+------+----+
| 1| 1001| 100|john|
| 1| 1004| 200|john|
| 3| 1003| 300|tony|
| 2| 1002| 200|mike|
+-----------+--------+------+----+
1b. Left Join¶
Left Join
A left join returns all values from the left relation and the matched values from the right relation, or appends NULL if there is no match. It is also referred to as a left outer join. LEFT JOIN and LEFT OUTER JOIN are equivalent.
LEFT JOIN - fetches data if present in the left table and only matching records from the right table.
df_left.join(df_right,on="customer_id",how="left").toPandas()#show()
customer_id | order_id | amount | name | |
---|---|---|---|---|
0 | 6 | 1005 | 200 | None |
1 | 1 | 1001 | 100 | john |
2 | 1 | 1004 | 200 | john |
3 | 3 | 1003 | 300 | tony |
4 | 2 | 1002 | 200 | mike |
df_left.join(df_right,on="customer_id",how="left_outer").show()
+-----------+--------+------+----+
|customer_id|order_id|amount|name|
+-----------+--------+------+----+
| 6| 1005| 200|null|
| 1| 1001| 100|john|
| 1| 1004| 200|john|
| 3| 1003| 300|tony|
| 2| 1002| 200|mike|
+-----------+--------+------+----+
df_left.join(df_right,on="customer_id",how="leftouter").show()
+-----------+--------+------+----+
|customer_id|order_id|amount|name|
+-----------+--------+------+----+
| 6| 1005| 200|null|
| 1| 1001| 100|john|
| 1| 1004| 200|john|
| 3| 1003| 300|tony|
| 2| 1002| 200|mike|
+-----------+--------+------+----+
1c. Right Join¶
Right Join
RIGHT JOIN - fetches data if present in the right table even if there is no matching records in the right table.
df_left.join(df_right,on="customer_id",how="right").toPandas()#show()
customer_id | order_id | amount | name | |
---|---|---|---|---|
0 | 1 | 1001.0 | 100.0 | john |
1 | 1 | 1004.0 | 200.0 | john |
2 | 3 | 1003.0 | 300.0 | tony |
3 | 2 | 1002.0 | 200.0 | mike |
4 | 4 | NaN | NaN | kent |
df_left.join(df_right,on="customer_id",how="right_outer").show()
+-----------+--------+------+----+
|customer_id|order_id|amount|name|
+-----------+--------+------+----+
| 1| 1001| 100|john|
| 1| 1004| 200|john|
| 3| 1003| 300|tony|
| 2| 1002| 200|mike|
| 4| null| null|kent|
+-----------+--------+------+----+
df_left.join(df_right,on="customer_id",how="rightouter").show()
+-----------+--------+------+----+
|customer_id|order_id|amount|name|
+-----------+--------+------+----+
| 1| 1001| 100|john|
| 1| 1004| 200|john|
| 3| 1003| 300|tony|
| 2| 1002| 200|mike|
| 4| null| null|kent|
+-----------+--------+------+----+
1d. Full Join¶
FULL JOIN - fetches data if present in either of the two tables.
df_left.join(df_right,on="customer_id",how="full").toPandas()#show()
customer_id | order_id | amount | name | |
---|---|---|---|---|
0 | 6 | 1005.0 | 200.0 | None |
1 | 1 | 1001.0 | 100.0 | john |
2 | 1 | 1004.0 | 200.0 | john |
3 | 3 | 1003.0 | 300.0 | tony |
4 | 2 | 1002.0 | 200.0 | mike |
5 | 4 | NaN | NaN | kent |
df_left.join(df_right,on="customer_id",how="fullouter").show()
+-----------+--------+------+----+
|customer_id|order_id|amount|name|
+-----------+--------+------+----+
| 6| 1005| 200|null|
| 1| 1001| 100|john|
| 1| 1004| 200|john|
| 3| 1003| 300|tony|
| 2| 1002| 200|mike|
| 4| null| null|kent|
+-----------+--------+------+----+
df_left.join(df_right,on="customer_id",how="full_outer").show()
+-----------+--------+------+----+
|customer_id|order_id|amount|name|
+-----------+--------+------+----+
| 6| 1005| 200|null|
| 1| 1001| 100|john|
| 1| 1004| 200|john|
| 3| 1003| 300|tony|
| 2| 1002| 200|mike|
| 4| null| null|kent|
+-----------+--------+------+----+
df_left.join(df_right,on="customer_id",how="outer").show()
+-----------+--------+------+----+
|customer_id|order_id|amount|name|
+-----------+--------+------+----+
| 6| 1005| 200|null|
| 1| 1001| 100|john|
| 1| 1004| 200|john|
| 3| 1003| 300|tony|
| 2| 1002| 200|mike|
| 4| null| null|kent|
+-----------+--------+------+----+
1e. Cross Join¶
cross join
spark.conf.get("spark.sql.crossJoin.enabled")
'true'
spark.conf.set("spark.sql.crossJoin.enabled", "true")
df_left.crossJoin(df_right).toPandas()#show()
order_id | customer_id | amount | customer_id | name | |
---|---|---|---|---|---|
0 | 1001 | 1 | 100 | 1 | john |
1 | 1001 | 1 | 100 | 2 | mike |
2 | 1001 | 1 | 100 | 3 | tony |
3 | 1001 | 1 | 100 | 4 | kent |
4 | 1002 | 2 | 200 | 1 | john |
5 | 1002 | 2 | 200 | 2 | mike |
6 | 1002 | 2 | 200 | 3 | tony |
7 | 1002 | 2 | 200 | 4 | kent |
8 | 1003 | 3 | 300 | 1 | john |
9 | 1003 | 3 | 300 | 2 | mike |
10 | 1003 | 3 | 300 | 3 | tony |
11 | 1003 | 3 | 300 | 4 | kent |
12 | 1004 | 1 | 200 | 1 | john |
13 | 1004 | 1 | 200 | 2 | mike |
14 | 1004 | 1 | 200 | 3 | tony |
15 | 1004 | 1 | 200 | 4 | kent |
16 | 1005 | 6 | 200 | 1 | john |
17 | 1005 | 6 | 200 | 2 | mike |
18 | 1005 | 6 | 200 | 3 | tony |
19 | 1005 | 6 | 200 | 4 | kent |
df_left.join(df_right,on="customer_id",how="semi").show()
#semi, leftsemi, left_semi, anti, leftanti and left_anti.
+-----------+--------+------+
|customer_id|order_id|amount|
+-----------+--------+------+
| 1| 1001| 100|
| 1| 1004| 200|
| 3| 1003| 300|
| 2| 1002| 200|
+-----------+--------+------+
df_left.join(df_right,on="customer_id",how="leftsemi").show()
+-----------+--------+------+
|customer_id|order_id|amount|
+-----------+--------+------+
| 1| 1001| 100|
| 1| 1004| 200|
| 3| 1003| 300|
| 2| 1002| 200|
+-----------+--------+------+
df_left.join(df_right,on="customer_id",how="left_semi").show()
+-----------+--------+------+
|customer_id|order_id|amount|
+-----------+--------+------+
| 1| 1001| 100|
| 1| 1004| 200|
| 3| 1003| 300|
| 2| 1002| 200|
+-----------+--------+------+
1f. Anti Join¶
df_left.join(df_right,on="customer_id",how="anti").show()
+-----------+--------+------+
|customer_id|order_id|amount|
+-----------+--------+------+
| 6| 1005| 200|
+-----------+--------+------+
df_left.join(df_right,on="customer_id",how="leftanti").show()
+-----------+--------+------+
|customer_id|order_id|amount|
+-----------+--------+------+
| 6| 1005| 200|
+-----------+--------+------+
df_left.join(df_right,on="customer_id",how="left_anti").show()
+-----------+--------+------+
|customer_id|order_id|amount|
+-----------+--------+------+
| 6| 1005| 200|
+-----------+--------+------+