Chapter 8: Array Columns¶
Chapter Learning Objectives¶
Various data operations on columns containing arrays.
Chapter Outline¶
1. How to deal with Array columns?
1d. How to find the list of elements in column A, but not in column B without duplicates?
1f. How to create an array from a column value repeated many times?
1g. How to remove all elements equal to an element from the given array in a column?
1k. How to convert a column of nested arrays into a map column?
1l. How to sort an array in a column in ascending or descending order?
1o. How to create an array column containing elements with sequence(start, stop, step)?
1p. How to reverse the order (not reverse sort) of an array in a column?
1u. How to concatenate the elements of an array in a column?
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()
from IPython.display import display_html
import pandas as pd
import numpy as np
def display_side_by_side(*args):
    html_str = ''
    for df in args:
        html_str += df.to_html(index=False)
        html_str += "\xa0\xa0\xa0" * 10
    display_html(html_str.replace('table', 'table style="display:inline"'), raw=True)
space = "\xa0" * 10
import panel as pn
css = """
div.special_table + table, th, td {
border: 3px solid orange;
}
"""
pn.extension(raw_css=[css])
1a. How to create an array column from multiple columns?¶
Let's first understand the syntax
Syntax
pyspark.sql.functions.array(*cols)
Creates a new array column.
Parameters
cols – list of column names (string) or list of Column expressions that have the same data type.
Input: Spark data frame with multiple columns
df_mul = spark.createDataFrame([('John', 'Seattle', 60, True, 1.7, '1960-01-01'),
('Tony', 'Cupertino', 30, False, 1.8, '1990-01-01'),
('Mike', 'New York', 40, True, 1.65, '1980-01-01')],['name', 'city', 'age', 'smoker','height', 'birthdate'])
df_mul.show()
+----+---------+---+------+------+----------+
|name| city|age|smoker|height| birthdate|
+----+---------+---+------+------+----------+
|John| Seattle| 60| true| 1.7|1960-01-01|
|Tony|Cupertino| 30| false| 1.8|1990-01-01|
|Mike| New York| 40| true| 1.65|1980-01-01|
+----+---------+---+------+------+----------+
Output: Spark data frame with an array column
from pyspark.sql.functions import array
df_array = df_mul.select(array(df_mul.age,df_mul.height,df_mul.city).alias("array_column"))
df_array.show()
+--------------------+
| array_column|
+--------------------+
| [60, 1.7, Seattle]|
|[30, 1.8, Cupertino]|
|[40, 1.65, New York]|
+--------------------+
Summary:
print("Input ", "Output")
display_side_by_side(df_mul.toPandas(),df_array.toPandas())
Input Output
name | city | age | smoker | height | birthdate |
---|---|---|---|---|---|
John | Seattle | 60 | True | 1.70 | 1960-01-01 |
Tony | Cupertino | 30 | False | 1.80 | 1990-01-01 |
Mike | New York | 40 | True | 1.65 | 1980-01-01 |
array_column |
---|
[60, 1.7, Seattle] |
[30, 1.8, Cupertino] |
[40, 1.65, New York] |
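Note that array() also accepts column names as plain strings. As a minimal sketch (reusing df_mul from above, output not shown):
from pyspark.sql.functions import array
# passing column names as strings is equivalent to passing Column objects
df_array2 = df_mul.select(array("age", "height").alias("numeric_array"))
df_array2.printSchema()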
1b. How to remove duplicate values from an array column?¶
Let's first understand the syntax
Syntax
pyspark.sql.functions.array_distinct(col)
removes duplicate values from the array
Parameters:
col – name of column or expression
Input: Spark data frame with an array column with duplicates
df_array = spark.createDataFrame([([1, 2, 3, 2, 4],), ([4, 5, 5, 4, 6],)], ['data'])
df_array.show()
+---------------+
| data|
+---------------+
|[1, 2, 3, 2, 4]|
|[4, 5, 5, 4, 6]|
+---------------+
Output : Spark data frame with a array column with no duplicates
from pyspark.sql.functions import array_distinct
df_array_no = df_array.select(array_distinct(df_array.data).alias("array_no_dup"))
df_array_no.show()
+------------+
|array_no_dup|
+------------+
|[1, 2, 3, 4]|
| [4, 5, 6]|
+------------+
Summary:
print("Input ", "Output")
display_side_by_side(df_array.toPandas(),df_array_no.toPandas())
Input Output
data |
---|
[1, 2, 3, 2, 4] |
[4, 5, 5, 4, 6] |
array_no_dup |
---|
[1, 2, 3, 4] |
[4, 5, 6] |
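A common follow-up is counting the distinct elements per row by chaining array_distinct with size; a minimal sketch on the same df_array (output not shown):
from pyspark.sql.functions import array_distinct, size
# number of distinct elements in each array
df_array.select(size(array_distinct("data")).alias("num_distinct")).show()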
1c. How to check if a value is in an array column?¶
Let's first understand the syntax
Syntax
pyspark.sql.functions.array_contains(col, value)
returns null if the array is null, true if the array contains the given value, and false otherwise.
Parameters:
col – name of column containing array
value – value or column to check for in array
Input: Spark data frame with an array column
df1 = spark.createDataFrame([([1, 2, 3],), ([],),([None, None],)], ['data'])
df1.show()
+---------+
| data|
+---------+
|[1, 2, 3]|
| []|
| [,]|
+---------+
Output : Spark data frame with a column to indicate if a value exists
from pyspark.sql.functions import array_contains
df2 = df1.select(array_contains(df1.data, 1).alias("if_1_exists"))
df2.show()
+-----------+
|if_1_exists|
+-----------+
| true|
| false|
| null|
+-----------+
Summary:
print("input ", "output")
display_side_by_side(df1.toPandas(),df2.toPandas())
input output
data |
---|
[1, 2, 3] |
[] |
[None, None] |
if_1_exists |
---|
True |
False |
None |
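Because the result is a boolean column, array_contains also works directly as a filter predicate; a minimal sketch on the same df1 (output not shown):
from pyspark.sql.functions import array_contains
# keep only the rows whose array contains the value 1
df1.filter(array_contains(df1.data, 1)).show()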
1d. How to find the list of elements in column A, but not in column B without duplicates?¶
Let's first understand the syntax
Syntax
pyspark.sql.functions.array_except(col1, col2)
returns an array of the elements in col1 but not in col2, without duplicates.
Parameters
col1 – name of column containing array
col2 – name of column containing array
Input: Spark data frame with 2 array columns
df3 = spark.createDataFrame([([1, 2, 3, 4, 5],[6, 7, 8, 9, 10]), ([4, 5, 5, 4, 6],[6, 2, 3, 2, 4])], ['A', 'B'])
df3.show()
+---------------+----------------+
| A| B|
+---------------+----------------+
|[1, 2, 3, 4, 5]|[6, 7, 8, 9, 10]|
|[4, 5, 5, 4, 6]| [6, 2, 3, 2, 4]|
+---------------+----------------+
Output : Spark data frame with a result array column
from pyspark.sql.functions import array_except
df4 = df3.select(array_except(df3.A, df3.B).alias("in_A_not_in_B"))
df4.show()
+---------------+
| in_A_not_in_B|
+---------------+
|[1, 2, 3, 4, 5]|
| [5]|
+---------------+
Summary:
print("Input ", "Output")
display_side_by_side(df3.toPandas(),df4.toPandas())
Input Output
A | B |
---|---|
[1, 2, 3, 4, 5] | [6, 7, 8, 9, 10] |
[4, 5, 5, 4, 6] | [6, 2, 3, 2, 4] |
in_A_not_in_B |
---|
[1, 2, 3, 4, 5] |
[5] |
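Related set operations exist for the intersection and the union of two array columns; a minimal sketch on the same df3 (output not shown):
from pyspark.sql.functions import array_intersect, array_union
df3.select(array_intersect(df3.A, df3.B).alias("in_both"),
           array_union(df3.A, df3.B).alias("in_either")).show(truncate=False)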
1e. How to sort an array in a column in ascending order?¶
Let's first understand the syntax
Syntax
pyspark.sql.functions.array_sort(col)
sorts the input array in ascending order. The elements of the input array must be orderable. Null elements will be placed at the end of the returned array.
Parameters
col – name of column or expression
Input: Spark data frame with an array column
df_arr = spark.createDataFrame([([2, 1, None, 3, 8, 3, 5],),([1],),([],)], ['data'])
df_arr.show()
+-------------------+
| data|
+-------------------+
|[2, 1,, 3, 8, 3, 5]|
| [1]|
| []|
+-------------------+
Output : Spark data frame with a sorted array column
from pyspark.sql.functions import array_sort
df_sort =df_arr.select(array_sort(df_arr.data).alias('sort'))
df_sort.show()
+-------------------+
| sort|
+-------------------+
|[1, 2, 3, 3, 5, 8,]|
| [1]|
| []|
+-------------------+
Summary:
print("Input ", "Output")
display_side_by_side(df_arr.toPandas(),df_sort.toPandas())
Input Output
data |
---|
[2, 1, None, 3, 8, 3, 5] |
[1] |
[] |
sort |
---|
[1, 2, 3, 3, 5, 8, None] |
[1] |
[] |
1f. How to create an array from a column value repeated many times?¶
Let's first understand the syntax
Syntax
pyspark.sql.functions.array_repeat(col, count)
Collection function: creates an array containing a column repeated count times.
Input: Spark data frame with a column
df_val = spark.createDataFrame([(5,)], ['data'])
df_val.show()
+----+
|data|
+----+
| 5|
+----+
Output : Spark data frame with a column of array of repeated values
from pyspark.sql.functions import array_repeat
df_repeat = df_val.select(array_repeat(df_val.data, 3).alias('repeat'))
df_repeat.show()
+---------+
| repeat|
+---------+
|[5, 5, 5]|
+---------+
Summary:
print("Input ", "Output")
display_side_by_side(df_val.toPandas(),df_repeat.toPandas())
Input Output
data |
---|
5 |
repeat |
---|
[5, 5, 5] |
1g. How to remove all elements equal to an element from the given array in a column?¶
Let's first understand the syntax
Syntax
pyspark.sql.functions.array_remove(col, element)
Removes all elements that are equal to the given element from the array.
Parameters
col – name of column containing array
element – element to be removed from the array
Input: Spark data frame with an array column
df_arr2 = spark.createDataFrame([([1, 2, 3, 8, 4],), ([4, 5, 32, 32, 6],)], ['data'])
df_arr2.show()
+-----------------+
| data|
+-----------------+
| [1, 2, 3, 8, 4]|
|[4, 5, 32, 32, 6]|
+-----------------+
Output : Spark data frame with an array column with an element removed
from pyspark.sql.functions import array_remove
df_arr3 = df_arr2.select(array_remove(df_arr2.data, 4).alias("array_remove_4"))
df_arr3.show()
+--------------+
|array_remove_4|
+--------------+
| [1, 2, 3, 8]|
|[5, 32, 32, 6]|
+--------------+
Summary:
print("Input ", "Output")
display_side_by_side(df_arr2.toPandas(),df_arr3.toPandas())
Input Output
data |
---|
[1, 2, 3, 8, 4] |
[4, 5, 32, 32, 6] |
array_remove_4 |
---|
[1, 2, 3, 8] |
[5, 32, 32, 6] |
1h. How to locate the position of the first occurrence of a given value in an array in a column?¶
Let's first understand the syntax
Syntax
pyspark.sql.functions.array_position(col, value)
Collection function: Locates the position of the first occurrence of the given value in the given array. Returns null if either of the arguments are null.
Input: Spark data frame with an array column
df_pos1 = spark.createDataFrame([([1, 2, 3, 8, 4],), ([4, 5, 32, 32, 6],)], ['data'])
df_pos1.show()
+-----------------+
| data|
+-----------------+
| [1, 2, 3, 8, 4]|
|[4, 5, 32, 32, 6]|
+-----------------+
Output : Spark data frame with column giving the position of the element
from pyspark.sql.functions import array_position
df_pos2 = df_pos1.select(array_position(df_pos1.data, 4).alias("array_position_4"))
df_pos2.show()
+----------------+
|array_position_4|
+----------------+
| 5|
| 1|
+----------------+
Summary:
print("Input ", "Output")
display_side_by_side(df_pos1.toPandas(),df_pos2.toPandas())
Input Output
data |
---|
[1, 2, 3, 8, 4] |
[4, 5, 32, 32, 6] |
array_position_4 |
---|
5 |
1 |
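Note that the position is 1-based and array_position returns 0 when the value is not found; a minimal sketch on the same df_pos1 (output not shown):
from pyspark.sql.functions import array_position
# 99 does not occur in either array, so both rows return 0
df_pos1.select(array_position(df_pos1.data, 99).alias("position_of_99")).show()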
1i. How to find the minimum value of an array in a column?¶
Let's first understand the syntax
Syntax
pyspark.sql.functions.array_min(col)
returns the minimum value of the array.
Parameters
col – name of column or expression
Input: Spark data frame with an array column
df_arr = spark.createDataFrame([([1, 2, 3, 8, 4],), ([4, 5, 32, 32, 6],)], ['data'])
df_arr.show()
+-----------------+
| data|
+-----------------+
| [1, 2, 3, 8, 4]|
|[4, 5, 32, 32, 6]|
+-----------------+
Output : Spark data frame with a column
from pyspark.sql.functions import array_min
df_min = df_arr.select(array_min(df_arr.data).alias("array_min"))
df_min.show()
+---------+
|array_min|
+---------+
| 1|
| 4|
+---------+
Summary:
print("Input ", "Output")
display_side_by_side(df_arr.toPandas(),df_min.toPandas())
Input Output
data |
---|
[1, 2, 3, 8, 4] |
[4, 5, 32, 32, 6] |
array_min |
---|
1 |
4 |
1j. How to find the maximum value of an array in a column?¶
Let's first understand the syntax
Syntax
pyspark.sql.functions.array_max(col)
returns the maximum value of the array.
Parameters
col – name of column or expression
Input: Spark data frame with an array column
df = spark.createDataFrame([([1, 2, 3, 8, 4],), ([4, 5, 32, 32, 6],)], ['data'])
df.show()
+-----------------+
| data|
+-----------------+
| [1, 2, 3, 8, 4]|
|[4, 5, 32, 32, 6]|
+-----------------+
Output : Spark data frame with a column
from pyspark.sql.functions import array_max
df_max = df.select(array_max(df.data).alias("array_max"))
df_max.show()
+---------+
|array_max|
+---------+
| 8|
| 32|
+---------+
Summary:
print("input ", "output")
display_side_by_side(df.toPandas(),df_max.toPandas())
input output
data |
---|
[1, 2, 3, 8, 4] |
[4, 5, 32, 32, 6] |
array_max |
---|
8 |
32 |
1k. How to convert a column of nested arrays into a map column?¶
Let's first understand the syntax
Syntax
pyspark.sql.functions.map_from_entries(col)
Returns a map created from the given array of entries.
Parameters
col – name of column containing an array of key-value structs (entries)
Input: Spark data frame with a column of an array of structs
df = spark.sql("SELECT array(struct(1, 'a'), struct(2, 'b')) as data")
df.show()
+----------------+
| data|
+----------------+
|[[1, a], [2, b]]|
+----------------+
Output: Spark data frame with a map column
from pyspark.sql.functions import map_from_entries
df_map = df.select(map_from_entries("data").alias("map"))
df_map.show()
+----------------+
| map|
+----------------+
|[1 -> a, 2 -> b]|
+----------------+
Summary:
print("Input ", "Output")
display_side_by_side(df.toPandas(),df_map.toPandas())
Input Output
data |
---|
[(1, a), (2, b)] |
map |
---|
{1: 'a', 2: 'b'} |
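Once the map column exists, the keys and values can be pulled back out with map_keys and map_values; a minimal sketch on df_map (output not shown):
from pyspark.sql.functions import map_keys, map_values
df_map.select(map_keys("map").alias("keys"), map_values("map").alias("values")).show()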
1l. How to sort an array in a column in ascending or descending order?¶
Let's first understand the syntax
Syntax
pyspark.sql.functions.sort_array(col, asc=True)
sorts the input array in ascending or descending order according to the natural ordering of the array elements. Null elements will be placed at the beginning of the returned array in ascending order or at the end of the returned array in descending order.
Parameters
col – name of column or expression
Input: Spark data frame with an array column
df = spark.createDataFrame([([1, 2, 3, 8, 4],), ([4, 5, 32, 32, 6],)], ['data'])
df.show()
+-----------------+
| data|
+-----------------+
| [1, 2, 3, 8, 4]|
|[4, 5, 32, 32, 6]|
+-----------------+
Output : Spark data frame with a sorted array column
from pyspark.sql.functions import sort_array
df_asc = df.select(sort_array(df.data, asc=True).alias('asc'))
df_asc.show()
+-----------------+
| asc|
+-----------------+
| [1, 2, 3, 4, 8]|
|[4, 5, 6, 32, 32]|
+-----------------+
from pyspark.sql.functions import sort_array
df_desc = df.select(sort_array(df.data, asc=False).alias('desc'))
df_desc.show()
+-----------------+
| desc|
+-----------------+
| [8, 4, 3, 2, 1]|
|[32, 32, 6, 5, 4]|
+-----------------+
Summary:
print("Input ", "Output")
display_side_by_side(df.toPandas(),df_asc.toPandas(), df_desc.toPandas())
Input Output
data |
---|
[1, 2, 3, 8, 4] |
[4, 5, 32, 32, 6] |
asc |
---|
[1, 2, 3, 4, 8] |
[4, 5, 6, 32, 32] |
desc |
---|
[8, 4, 3, 2, 1] |
[32, 32, 6, 5, 4] |
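The practical difference from array_sort (section 1e) is null placement: sort_array puts nulls first in ascending order and last in descending order, whereas array_sort always puts them last. A minimal sketch (output not shown):
from pyspark.sql.functions import sort_array, array_sort
df_null = spark.createDataFrame([([2, None, 1],)], ['data'])
df_null.select(sort_array('data').alias('sort_array_asc'),
               array_sort('data').alias('array_sort')).show()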
1m. How to slice an array in a column?¶
Let's first understand the syntax
Syntax
pyspark.sql.functions.slice(x, start, length)
Returns an array containing all the elements in x from index start (array indices start at 1, or from the end if start is negative) with the specified length.
Parameters
x – the array to be sliced
start – the starting index
length – the length of the slice
Input: Spark data frame with an array column
df = spark.createDataFrame([([1, 2, 3, 8, 4],), ([4, 5, 32, 32, 6],)], ['data'])
df.show()
+-----------------+
| data|
+-----------------+
| [1, 2, 3, 8, 4]|
|[4, 5, 32, 32, 6]|
+-----------------+
Output : Spark data frame with an array column
from pyspark.sql.functions import slice
df_slice = df.select(slice(df.data, 2, 3).alias('slice'))
df_slice.show()
+-----------+
| slice|
+-----------+
| [2, 3, 8]|
|[5, 32, 32]|
+-----------+
Summary:
print("Input ", "Output")
display_side_by_side(df.toPandas(),df_map.toPandas())
Input Output
data |
---|
[1, 2, 3, 8, 4] |
[4, 5, 32, 32, 6] |
map |
---|
{1: 'a', 2: 'b'} |
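Per the slice semantics above, a negative start counts from the end of the array; a minimal sketch taking the last two elements of each array (output not shown):
from pyspark.sql.functions import slice
df.select(slice(df.data, -2, 2).alias('last_two')).show()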
1n. How to shuffle a column containing an array?¶
Syntax
pyspark.sql.functions.shuffle(col)
Generates a random permutation of the given array.
Input: Spark data frame with an array column
df = spark.createDataFrame([([1, 2, 3, 8, 4],), ([4, 5, 32, 32, 6],)], ['data'])
df.show()
+-----------------+
| data|
+-----------------+
| [1, 2, 3, 8, 4]|
|[4, 5, 32, 32, 6]|
+-----------------+
Output : Spark data frame with shuffled array column
from pyspark.sql.functions import shuffle
df_shu = df.select(shuffle(df.data).alias('shuffle'))
df_shu.show()
+-----------------+
| shuffle|
+-----------------+
| [4, 8, 3, 2, 1]|
|[32, 6, 4, 32, 5]|
+-----------------+
Summary:
print("Input ", "Output")
display_side_by_side(df.toPandas(),df_shu.toPandas())
Input Output
data |
---|
[1, 2, 3, 8, 4] |
[4, 5, 32, 32, 6] |
shuffle |
---|
[4, 8, 3, 2, 1] |
[32, 6, 4, 32, 5] |
1o. How to create an array column containing elements with sequence(start, stop, step)?¶
Syntax
pyspark.sql.functions.sequence(start, stop, step=None)
Generate a sequence of integers from start to stop, incrementing by step. If step is not set, incrementing by 1 if start is less than or equal to stop, otherwise -1.
Input: Spark data frame
df = spark.createDataFrame([(-2, 2)], ('A', 'B'))
df.show()
+---+---+
| A| B|
+---+---+
| -2| 2|
+---+---+
Output : Spark data frame with an array sequence
from pyspark.sql.functions import sequence
df_seq = df.select(sequence('A', 'B').alias('seq'))
df_seq.show()
+-----------------+
| seq|
+-----------------+
|[-2, -1, 0, 1, 2]|
+-----------------+
Summary:
print("Input ", "Output")
display_side_by_side(df.toPandas(),df_seq.toPandas())
Input Output
A | B |
---|---|
-2 | 2 |
seq |
---|
[-2, -1, 0, 1, 2] |
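The step argument is also a column, so literal values need to be wrapped with lit; a minimal sketch generating the odd numbers 1 to 9 (output not shown):
from pyspark.sql.functions import sequence, lit
spark.range(1).select(sequence(lit(1), lit(9), lit(2)).alias('odds')).show(truncate=False)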
1p. How to reverse the order (not reverse sort) of an array in a column?¶
Let's first understand the syntax
Syntax
pyspark.sql.functions.reverse(col)
Returns a reversed string or an array with the elements in reverse order.
Parameters
col – name of column or expression
Input: Spark data frame with an array column
df_arr = spark.createDataFrame([([1, 2, 3, 8, 4],), ([4, 5, 32, 32, 6],)], ['data'])
df_arr.show()
+-----------------+
| data|
+-----------------+
| [1, 2, 3, 8, 4]|
|[4, 5, 32, 32, 6]|
+-----------------+
Output : Spark data frame with a reverse ordered array column
from pyspark.sql.functions import reverse
df_rev = df_arr.select(reverse(df_arr.data).alias('reverse_order'))
df_rev.show(truncate=False)
+-----------------+
|reverse_order |
+-----------------+
|[4, 8, 3, 2, 1] |
|[6, 32, 32, 5, 4]|
+-----------------+
Summary:
print("Input ", "Output")
display_side_by_side(df_arr.toPandas(),df_rev.toPandas())
Input Output
data |
---|
[1, 2, 3, 8, 4] |
[4, 5, 32, 32, 6] |
reverse_order |
---|
[4, 8, 3, 2, 1] |
[6, 32, 32, 5, 4] |
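reverse is not limited to arrays; applied to a string column it reverses the characters. A minimal sketch (output not shown):
from pyspark.sql.functions import reverse
spark.createDataFrame([('spark',)], ['word']).select(reverse('word').alias('reversed')).show()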
1q. How to combine two array columns into a map column?¶
Let's first understand the syntax
Syntax
pyspark.sql.functions.map_from_arrays(col1, col2)
Creates a new map from two arrays.
Parameters
col1 – name of column containing a set of keys. All elements should not be null
col2 – name of column containing a set of values
Input: Spark data frame with 2 array columns
df_arrm = spark.createDataFrame([([1, 2, 3, 4, 5],[6, 7, 8, 9, 10]), ([4, 5, 6, 7, 8],[6, 2, 3, 9, 4])], ['A','B'])
df_arrm.show()
+---------------+----------------+
| A| B|
+---------------+----------------+
|[1, 2, 3, 4, 5]|[6, 7, 8, 9, 10]|
|[4, 5, 6, 7, 8]| [6, 2, 3, 9, 4]|
+---------------+----------------+
Output : Spark data frame with a map column
from pyspark.sql.functions import map_from_arrays
df_map = df_arrm.select(map_from_arrays(df_arrm.A, df_arrm.B).alias('map'))
df_map.show(truncate=False)
+-----------------------------------------+
|map |
+-----------------------------------------+
|[1 -> 6, 2 -> 7, 3 -> 8, 4 -> 9, 5 -> 10]|
|[4 -> 6, 5 -> 2, 6 -> 3, 7 -> 9, 8 -> 4] |
+-----------------------------------------+
Summary:
print("Input ", "Output")
display_side_by_side(df_arrm.toPandas(),df_map.toPandas())
Input Output
A | B |
---|---|
[1, 2, 3, 4, 5] | [6, 7, 8, 9, 10] |
[4, 5, 6, 7, 8] | [6, 2, 3, 9, 4] |
map |
---|
{1: 6, 2: 7, 3: 8, 4: 9, 5: 10} |
{8: 4, 4: 6, 5: 2, 6: 3, 7: 9} |
1r. How to concatenate two array columns into a single array column?¶
Let's first understand the syntax
Syntax
pyspark.sql.functions.concat(*cols)
Concatenates multiple input columns together into a single column. The function works with strings, binary and compatible array columns.
Input: Spark data frame with 2 array columns
df_arr1 = spark.createDataFrame([([1, 2, 3, 4, 5],[6, 7, 8, 9, 10]), ([4, 5, 6, 7, 8],[6, 2, 3, 9, 4])], ['A','B'])
df_arr1.show()
+---------------+----------------+
| A| B|
+---------------+----------------+
|[1, 2, 3, 4, 5]|[6, 7, 8, 9, 10]|
|[4, 5, 6, 7, 8]| [6, 2, 3, 9, 4]|
+---------------+----------------+
Output: Spark data frame with a concatenated array column
from pyspark.sql.functions import concat
df_con = df_arr1.select(concat(df_arr1.A, df_arr1.B).alias("concatenate"))
df_con.show(2,False)
+-------------------------------+
|concatenate |
+-------------------------------+
|[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]|
|[4, 5, 6, 7, 8, 6, 2, 3, 9, 4] |
+-------------------------------+
Summary:
print("Input ", "Output")
display_side_by_side(df_arr1.toPandas(),df_con.toPandas())
Input Output
A | B |
---|---|
[1, 2, 3, 4, 5] | [6, 7, 8, 9, 10] |
[4, 5, 6, 7, 8] | [6, 2, 3, 9, 4] |
concatenate |
---|
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10] |
[4, 5, 6, 7, 8, 6, 2, 3, 9, 4] |
1s. How to find overlap between 2 array columns?¶
Let's first understand the syntax
Syntax
pyspark.sql.functions.arrays_overlap(a1, a2)
Collection function: returns true if the arrays contain any common non-null element; if not, returns null if both the arrays are non-empty and any of them contains a null element; returns false otherwise
Input: Spark data frame with array columns
df_over = spark.createDataFrame([(["a", "b"], ["b", "c"],), (["a"], ["b", "c"],),(["a", None], ["b", None],) ], ['A', 'B'])
df_over.show()
+------+------+
| A| B|
+------+------+
|[a, b]|[b, c]|
| [a]|[b, c]|
| [a,]| [b,]|
+------+------+
Output : Spark data frame
from pyspark.sql.functions import arrays_overlap
df_overlap = df_over.select(arrays_overlap(df_over.A, df_over.B).alias("overlap"))
df_overlap.show()
+-------+
|overlap|
+-------+
| true|
| false|
| null|
+-------+
Summary:
print("Input ", "Output")
display_side_by_side(df_over.toPandas(),df_overlap.toPandas())
Input Output
A | B |
---|---|
[a, b] | [b, c] |
[a] | [b, c] |
[a, None] | [b, None] |
overlap |
---|
True |
False |
None |
1t. How to flatten a column containing nested arrays?¶
Let's first understand the syntax
Syntax
pyspark.sql.functions.flatten(col)
creates a single array from an array of arrays. If a structure of nested arrays is deeper than two levels, only one level of nesting is removed.
Input: Spark data frame with nested array column
df4 = spark.createDataFrame([([[1, 2, 3, 8, 4],[6,8, 10]],), ([[4, 5, 32, 32, 6]],)], ['data'])
df4.show(truncate=False)
+-----------------------------+
|data |
+-----------------------------+
|[[1, 2, 3, 8, 4], [6, 8, 10]]|
|[[4, 5, 32, 32, 6]] |
+-----------------------------+
Output: Spark data frame with a flattened array column
from pyspark.sql.functions import flatten
df_flat = df4.select(flatten(df4.data).alias('flatten'))
df_flat.show(truncate=False)
+-------------------------+
|flatten |
+-------------------------+
|[1, 2, 3, 8, 4, 6, 8, 10]|
|[4, 5, 32, 32, 6] |
+-------------------------+
Summary:
print("Input ", "Output")
display_side_by_side(df4.toPandas(),df_flat.toPandas())
Input Output
data |
---|
[[1, 2, 3, 8, 4], [6, 8, 10]] |
[[4, 5, 32, 32, 6]] |
flatten |
---|
[1, 2, 3, 8, 4, 6, 8, 10] |
[4, 5, 32, 32, 6] |
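As the description notes, only one level of nesting is removed, so flattening a three-level array still yields an array of arrays; a minimal sketch (output not shown):
from pyspark.sql.functions import flatten
df_deep = spark.createDataFrame([([[[1, 2], [3]], [[4]]],)], ['data'])
df_deep.select(flatten('data').alias('one_level_removed')).show(truncate=False)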
1u. How to concatenate the elements of an array in a column?¶
Let's first understand the syntax
Syntax
pyspark.sql.functions.array_join(col, delimiter, null_replacement=None)
Concatenates the elements of column using the delimiter. Null values are replaced with null_replacement if set, otherwise they are ignored.
Input: Spark data frame with an array column
df_a1 = spark.createDataFrame([([1, 2, 3, 4, 5],), ([4, 5, None, 4, 6],)], ['A'])
df_a1.show()
+---------------+
| A|
+---------------+
|[1, 2, 3, 4, 5]|
| [4, 5,, 4, 6]|
+---------------+
Output : Spark data frame with a concatenated array element column
from pyspark.sql.functions import array_join
df_j1 = df_a1.select(array_join(df_a1.A,',').alias("array_join"))
df_j1.show()
+----------+
|array_join|
+----------+
| 1,2,3,4,5|
| 4,5,4,6|
+----------+
df_j2 = df_a1.select(array_join(df_a1.A,',', null_replacement="NA").alias("array_join"))
df_j2.show()
+----------+
|array_join|
+----------+
| 1,2,3,4,5|
|4,5,NA,4,6|
+----------+
Summary:
print("Input ", "Output")
display_side_by_side(df_a1.toPandas(),df_j1.toPandas(), df_j2.toPandas())
Input Output
A |
---|
[1, 2, 3, 4, 5] |
[4, 5, None, 4, 6] |
array_join |
---|
1,2,3,4,5 |
4,5,4,6 |
array_join |
---|
1,2,3,4,5 |
4,5,NA,4,6 |
1v. How to zip 2 array columns ?¶
Let's first understand the syntax
Syntax
pyspark.sql.functions.arrays_zip(*cols)
Collection function: Returns a merged array of structs in which the N-th struct contains all N-th values of input arrays.
Parameters cols – columns of arrays to be merged.
Input: Spark data frame with 2 array columns
dfz = spark.createDataFrame([([1, 2, 3], [4, 5, 6])], ['A', 'B'])
dfz.show()
+---------+---------+
| A| B|
+---------+---------+
|[1, 2, 3]|[4, 5, 6]|
+---------+---------+
Output : Spark data frame with a zipped array column
from pyspark.sql.functions import arrays_zip
df_zip = dfz.select(arrays_zip(dfz.A, dfz.B).alias('zipped'))
df_zip.show(truncate=False)
+------------------------+
|zipped |
+------------------------+
|[[1, 4], [2, 5], [3, 6]]|
+------------------------+
Summary:
print("Input ", "Output")
display_side_by_side(dfz.toPandas(),df_zip.toPandas())
Input Output
A | B |
---|---|
[1, 2, 3] | [4, 5, 6] |
zipped |
---|
[(1, 4), (2, 5), (3, 6)] |
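A common follow-up is to explode the zipped array so that each struct becomes its own row; a minimal sketch on df_zip (output not shown):
from pyspark.sql.functions import explode
df_zip.select(explode('zipped').alias('pair')).select('pair.*').show()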