Chapter 9: Map Column
Chapter Learning Objectives
Various data operations on columns containing maps.
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()
from IPython.display import display_html
import pandas as pd
import numpy as np

def display_side_by_side(*args):
    html_str = ''
    for df in args:
        html_str += df.to_html(index=False)
        html_str += "\xa0\xa0\xa0" * 10
    display_html(html_str.replace('table', 'table style="display:inline"'), raw=True)

space = "\xa0" * 10

import panel as pn
css = """
div.special_table + table, th, td {
  border: 3px solid orange;
}
"""
pn.extension(raw_css=[css])
1a. How to create a column of map type?
df = spark.createDataFrame([({"a":1,"b": 2,"c":3},)],["data"])
df.show(truncate=False)
print(df.dtypes)
+------------------------+
|data |
+------------------------+
|[a -> 1, b -> 2, c -> 3]|
+------------------------+
[('data', 'map<string,bigint>')]
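Besides passing Python dictionaries to createDataFrame, a map column can be built from literal key/value pairs with pyspark.sql.functions.create_map, which takes an even-sized sequence of alternating keys and values. A minimal sketch (the df_lit name and lit literals are illustrative):
from pyspark.sql.functions import create_map, lit

# create_map pairs up alternating key and value expressions.
df_lit = spark.range(1).select(
    create_map(lit("a"), lit(1), lit("b"), lit(2), lit("c"), lit(3)).alias("data")
)
df_lit.show(truncate=False)  # [a -> 1, b -> 2, c -> 3]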
1b. How to read individual elements of a map column?
Input: Spark dataframe containing a map column
df1 = spark.createDataFrame([({"a":1,"b": 2,"c":3},)],["data"])
df1.show(truncate=False)
+------------------------+
|data |
+------------------------+
|[a -> 1, b -> 2, c -> 3]|
+------------------------+
Output: Spark dataframe with each map key as its own column holding the corresponding value
df_map = df1.select(df1.data.a.alias("a"), df1.data.b.alias("b"), df1.data.c.alias("c"))
df_map.show()
+---+---+---+
| a| b| c|
+---+---+---+
| 1| 2| 3|
+---+---+---+
Summary:
print("Input ", "Output")
display_side_by_side(df1.toPandas(),df_map.toPandas())
Input:

| data                     |
|--------------------------|
| {'a': 1, 'b': 2, 'c': 3} |

Output:

| a | b | c |
|---|---|---|
| 1 | 2 | 3 |
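Dot access as above only works for keys that are valid attribute names; Column.getItem (or Python's bracket syntax) handles arbitrary keys, and a missing key yields null rather than an error. A small sketch:
# getItem and [] both look a key up in the map.
df1.select(
    df1.data.getItem("a").alias("a"),
    df1.data["b"].alias("b"),
    df1.data["z"].alias("z"),  # key absent from the map -> null
).show()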
1c. How to extract the keys from a map column?
Let's first understand the syntax:
Syntax
pyspark.sql.functions.map_keys(col)
Returns an unordered array containing the keys of the map.
Parameters:
col – name of column or expression
Input: Spark dataframe consisting of a map column
df2 = spark.createDataFrame([({"a":1,"b":"2","c":3},)],["data"])
df2.show(truncate=False)
+----------------------+
|data |
+----------------------+
|[a -> 1, b ->, c -> 3]|
+----------------------+
Note that key b maps to null: the map's value type is inferred as a numeric type from the other entries, so the string "2" cannot be stored and becomes null.
Output: Spark dataframe consisting of a column of keys
from pyspark.sql.functions import map_keys
df_keys = df2.select(map_keys(df2.data).alias("keys"))
df_keys.show()
+---------+
| keys|
+---------+
|[a, b, c]|
+---------+
Summary:
print("input ", "output")
display_side_by_side(df2.toPandas(),df_keys.toPandas())
Input:

| data                        |
|-----------------------------|
| {'a': 1, 'b': None, 'c': 3} |

Output:

| keys      |
|-----------|
| [a, b, c] |
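If one row per key is more convenient than an array of keys, map_keys can be combined with explode; a minimal sketch:
from pyspark.sql.functions import explode, map_keys

# explode turns the array of keys into one row per element.
df2.select(explode(map_keys(df2.data)).alias("key")).show()
# three rows: a, b, c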
1d. How to extract the values from a map column?
Let's first understand the syntax:
Syntax
pyspark.sql.functions.map_values(col)
Collection function: Returns an unordered array containing the values of the map.
Parameters
col – name of column or expression
Input: Spark dataframe consisting of a map column (as in 1c, the string value for key b becomes null)
df3 = spark.createDataFrame([({"a":1,"b":"2","c":3},)],["data"])
df3.show(truncate=False)
+----------------------+
|data |
+----------------------+
|[a -> 1, b ->, c -> 3]|
+----------------------+
Output: Spark dataframe consisting of a column of values
from pyspark.sql.functions import map_values
df_values = df3.select(map_values(df3.data).alias("values"))
df_values.show()
+-------+
| values|
+-------+
|[1,, 3]|
+-------+
Summary:
print("Input ", "Output")
display_side_by_side(df3.toPandas(),df_values.toPandas())
Input:

| data                        |
|-----------------------------|
| {'a': 1, 'b': None, 'c': 3} |

Output:

| values       |
|--------------|
| [1, None, 3] |
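To extract keys and values together, explode can also be applied to the map column itself; for a map column it produces two columns named key and value, one row per entry. A sketch:
from pyspark.sql.functions import explode

# Exploding a map yields columns `key` and `value`, one row per entry.
df3.select(explode(df3.data)).show()
# rows: (a, 1), (b, null), (c, 3)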
1e. How to convert a map column into an array column?
Let's first understand the syntax:
Syntax
pyspark.sql.functions.map_entries(col)
Collection function: Returns an unordered array of all entries in the given map.
Parameters
col – name of column or expression
Input: Spark dataframe with a map column
df4 = spark.createDataFrame([({"a":1,"b": 2,"c":3},)],["data"])
df4.show(truncate=False)
+------------------------+
|data |
+------------------------+
|[a -> 1, b -> 2, c -> 3]|
+------------------------+
Output: Spark dataframe containing an array column of key/value structs
from pyspark.sql.functions import map_entries
df_array = df4.select(map_entries(df4.data).alias("array"))
df_array.show(truncate=False)
+------------------------+
|array |
+------------------------+
|[[a, 1], [b, 2], [c, 3]]|
+------------------------+
Summary:
print("Input ", "Output")
display_side_by_side(df4.toPandas(),df_array.toPandas())
Input:

| data                     |
|--------------------------|
| {'a': 1, 'b': 2, 'c': 3} |

Output:

| array                    |
|--------------------------|
| [(a, 1), (b, 2), (c, 3)] |
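Each element of the resulting array is a struct with key and value fields, so individual entries can be addressed by position, e.g. with element_at (1-based indexing) and getField. A sketch:
from pyspark.sql.functions import element_at

# element_at uses 1-based indexing; each entry is a struct with key/value fields.
df_array.select(
    element_at("array", 1).getField("key").alias("first_key"),
    element_at("array", 1).getField("value").alias("first_value"),
).show()
# first_key: a, first_value: 1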
1f. How to create a map column from multiple array columns?
Let's first understand the syntax:
Syntax
pyspark.sql.functions.map_from_arrays(col1, col2)
Creates a new map from two arrays.
Parameters
col1 – name of column containing a set of keys; all of its elements must be non-null
col2 – name of column containing a set of values
Input: Spark dataframe with two array columns, one holding keys and the other holding values
df5 = spark.createDataFrame([([2, 5], ['a', 'b'])], ['k', 'v'])
df5.show()
+------+------+
| k| v|
+------+------+
|[2, 5]|[a, b]|
+------+------+
Output: Spark dataframe with a map column built by pairing the keys array with the values array
from pyspark.sql.functions import map_from_arrays
df_map1 = df5.select(map_from_arrays(df5.k, df5.v).alias("map"))
df_map1.show()
+----------------+
| map|
+----------------+
|[2 -> a, 5 -> b]|
+----------------+
Summary:
print("Input ", "Output")
display_side_by_side(df5.toPandas(),df_map1.toPandas())
Input:

| k      | v      |
|--------|--------|
| [2, 5] | [a, b] |

Output:

| map              |
|------------------|
| {5: 'b', 2: 'a'} |
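The reverse direction of 1e is also available: pyspark.sql.functions.map_from_entries (Spark 2.4+) rebuilds a map from an array of key/value structs, so map_entries and map_from_entries round-trip. A sketch reusing df_array from 1e:
from pyspark.sql.functions import map_from_entries

# Convert the array of structs produced by map_entries back into a map.
df_array.select(map_from_entries("array").alias("map")).show(truncate=False)
# [a -> 1, b -> 2, c -> 3]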
1g. How to combine multiple map columns into one?
Let's first understand the syntax:
Syntax
pyspark.sql.functions.map_concat(*cols)
Returns the union of all the given maps.
Parameters
cols – names of the map columns to combine
Input: Spark dataframe with multiple map columns
df6 = spark.sql("SELECT map(1, 'a', 2, 'b') as map1, map(3, 'c') as map2")
df6.show()
+----------------+--------+
| map1| map2|
+----------------+--------+
|[1 -> a, 2 -> b]|[3 -> c]|
+----------------+--------+
Output: Spark dataframe with a single map column containing the union of the input maps
from pyspark.sql.functions import map_concat
df_com = df6.select(map_concat("map1", "map2").alias("combined_map"))
df_com.show(truncate=False)
+------------------------+
|combined_map |
+------------------------+
|[1 -> a, 2 -> b, 3 -> c]|
+------------------------+
Summary:
print("Input ", "Output")
display_side_by_side(df6.toPandas(),df_com.toPandas())
Input:

| map1             | map2     |
|------------------|----------|
| {1: 'a', 2: 'b'} | {3: 'c'} |

Output:

| combined_map             |
|--------------------------|
| {1: 'a', 2: 'b', 3: 'c'} |
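One caveat worth knowing: since Spark 3.0, map_concat raises an error if the same key occurs in more than one input map. The spark.sql.mapKeyDedupPolicy configuration can be set to LAST_WIN so the value from the last map is kept instead; a sketch (df_dup is an illustrative name):
# Default policy (EXCEPTION) fails on duplicate keys; LAST_WIN keeps the last value.
spark.conf.set("spark.sql.mapKeyDedupPolicy", "LAST_WIN")
df_dup = spark.sql("SELECT map(1, 'a') as m1, map(1, 'z') as m2")
df_dup.select(map_concat("m1", "m2").alias("merged")).show()
# [1 -> z]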