Chapter 5: String Columns
Chapter Learning Objectives
Various data operations on columns containing strings.
Chapter Outline
1a. How to split a string?
1b. How to slice a string?
1c. How to convert uppercase to lowercase?
1d. How to convert lowercase to uppercase?
1e. How to extract a specific group matched by a Java regex?
1f. How to replace a specific group matched by a Java regex?
import pyspark
from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession for the examples in this chapter
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

from IPython.display import display_html
import pandas as pd
import numpy as np

# Helper that renders several pandas DataFrames side by side in the notebook
def display_side_by_side(*args):
    html_str = ''
    for df in args:
        html_str += df.to_html(index=False)
        html_str += "\xa0\xa0\xa0" * 10
    display_html(html_str.replace('table', 'table style="display:inline"'), raw=True)

space = "\xa0" * 10
1a. How to split a string?
Let's first understand the syntax.
Syntax
pyspark.sql.functions.split(str, pattern, limit=-1)
Splits str around matches of the given pattern.
Parameters:
str : a string expression to split
pattern : a string representing a regular expression. The regex string should be a Java regular expression.
limit : an integer which controls the number of times pattern is applied.
limit > 0: The resulting array’s length will not be more than limit, and the resulting array’s last entry will contain all input beyond the last matched pattern.
limit <= 0: pattern will be applied as many times as possible, and the resulting array can be of any size.
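As a quick, minimal sketch of the limit parameter (reusing the spark session created above; the DataFrame and column names are made up for illustration), limit=2 stops splitting after the first match and keeps the rest of the string in the last element:
from pyspark.sql.functions import split
df_limit = spark.createDataFrame([('abc__def__ghc',)], ['string'])
# With limit=2 the pattern is applied only once, so everything after the first
# '__' stays together in the last element: [abc, def__ghc]
df_limit.select(split(df_limit.string, '__', 2).alias('split_limited')).show()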
Input: Spark DataFrame with a column containing a string
df_string = spark.createDataFrame([('abc__def__ghc',)], ['string',])
df_string.show()
+-------------+
| string|
+-------------+
|abc__def__ghc|
+-------------+
Output: Spark DataFrame with a column containing the split string (an array)
from pyspark.sql.functions import split
df_split = df_string.select(split(df_string.string,'__').alias('split_string'))
df_split.show()
+---------------+
| split_string|
+---------------+
|[abc, def, ghc]|
+---------------+
Summary:
print("Input ", "Output")
display_side_by_side(df_string.toPandas(),df_split.toPandas())
Input Output
| string | split_string |
|---|---|
| abc__def__ghc | [abc, def, ghc] |
1b. How to slice a string?
Let's first understand the syntax.
Syntax
pyspark.sql.functions.substring(str, pos, len)
Returns the substring of str that starts at position pos and is of length len. Note that the position is 1-based, not 0-based.
Parameters:
str : the string column to slice
pos : the starting position (1-based)
len : the length of the substring
A related collection function, pyspark.sql.functions.slice(x, start, length), does the same for array columns: it returns the elements of x from index start (array indices start at 1, or count from the end if start is negative) with the specified length.
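As a minimal sketch of slice() on an array column (a made-up example that reuses the spark session above together with the split() function from 1a), we can first split a string and then keep the first two elements of the resulting array:
from pyspark.sql import functions as F
df_arr = spark.createDataFrame([('abc__def__ghc',)], ['string'])
# split() produces an array column; slice() then keeps elements 1..2 of that
# array, giving [abc, def]
df_arr.select(F.slice(F.split('string', '__'), 1, 2).alias('first_two')).show()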
Input: Spark DataFrame with a column containing a string
df_string = spark.createDataFrame([('abcdefghi',)], ['string',])
df_string.show()
+---------+
| string|
+---------+
|abcdefghi|
+---------+
Output: Spark DataFrame with a column containing the sliced string
from pyspark.sql.functions import substring
df_sub = df_string.select(substring(df_string.string,1,4).alias('substring'))
df_sub.show()
+---------+
|substring|
+---------+
| abcd|
+---------+
Summary:
print("Input ", "Output")
display_side_by_side(df_string.toPandas(),df_sub.toPandas())
Input Output
| string | substring |
|---|---|
| abcdefghi | abcd |
1c. How to convert uppercase to lowercase?
Let's first understand the syntax.
Syntax
pyspark.sql.functions.lower(col)
Converts a string expression to lower case.
Parameters:
col : column
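A common use of lower() is case-insensitive comparison. Here is a minimal sketch (the DataFrame, column name, and values are made up for illustration; it reuses the spark session above):
from pyspark.sql.functions import lower
df_names = spark.createDataFrame([('Alice',), ('ALICE',), ('Bob',)], ['name'])
# Lower-casing the column before comparing matches both 'Alice' and 'ALICE'
df_names.filter(lower(df_names.name) == 'alice').show()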
Input: Spark DataFrame with a column containing an uppercase string
df_upper = spark.createDataFrame([('ABCDEFGHI',)], ['uppercase',])
df_upper.show()
+---------+
|uppercase|
+---------+
|ABCDEFGHI|
+---------+
Output: Spark DataFrame with a column containing a lowercase string
from pyspark.sql.functions import lower
df_lower= df_upper.select(lower(df_upper.uppercase).alias('lowercase'))
df_lower.show()
+---------+
|lowercase|
+---------+
|abcdefghi|
+---------+
Summary:
print("input ", "output")
display_side_by_side(df_upper.toPandas(),df_lower.toPandas())
Input Output
| uppercase | lowercase |
|---|---|
| ABCDEFGHI | abcdefghi |
1d. How to convert lowercase to uppercase?
Let's first understand the syntax.
Syntax
pyspark.sql.functions.upper(col)
Converts a string expression to upper case.
Parameters:
col : column
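As a minimal sketch (a made-up DataFrame that reuses the spark session above), withColumn() can add the upper-cased value next to the original column instead of replacing it:
from pyspark.sql.functions import upper
df_mixed = spark.createDataFrame([('spark sql',)], ['lowercase'])
# Keeps the original 'lowercase' column and adds an 'uppercase' column
df_mixed.withColumn('uppercase', upper(df_mixed.lowercase)).show()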
Input: Spark DataFrame with a column containing a lowercase string
df_string = spark.createDataFrame([('abcdefghc',)], ['lowercase',])
df_string.show()
+---------+
|lowercase|
+---------+
|abcdefghc|
+---------+
Output: Spark DataFrame with a column containing an uppercase string
from pyspark.sql.functions import upper
df_upper= df_string.select(upper(df_string.lowercase).alias('uppercase'))
df_upper.show()
+---------+
|uppercase|
+---------+
|ABCDEFGHC|
+---------+
Summary:
print("Input ", "Output")
display_side_by_side(df_string.toPandas(),df_upper.toPandas())
Input Output
| lowercase | uppercase |
|---|---|
| abcdefghc | ABCDEFGHC |
1e. How to extract a specific group matched by a Java regex?
Let's first understand the syntax.
Syntax
pyspark.sql.functions.regexp_extract(str, pattern, idx)
Extracts a specific group matched by a Java regex from the specified string column. If the regex did not match, or the specified group did not match, an empty string is returned.
Parameters:
str : a string expression (the column to match against)
pattern : a Java regular expression
idx : the index of the group to extract (0 returns the entire match)
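A minimal sketch of how idx selects the group (reusing the spark session above; the DataFrame mirrors the example below):
from pyspark.sql.functions import regexp_extract
df_rx = spark.createDataFrame([('100-200',)], ['str'])
df_rx.select(
    regexp_extract('str', r'(\d+)-(\d+)', 0).alias('full_match'),    # the whole match: 100-200
    regexp_extract('str', r'(\d+)-(\d+)', 2).alias('second_group')   # the second group: 200
).show()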
Input: Spark DataFrame with a column containing a string
df = spark.createDataFrame([('100-200',)], ['str'])
df.show()
+-------+
| str|
+-------+
|100-200|
+-------+
Output: Spark DataFrame with a column containing the extracted group
from pyspark.sql.functions import regexp_extract
df_regex1 = df.select(regexp_extract('str', r'(\d+)-(\d+)', 1).alias('regex'))
df_regex1.show()
+-----+
|regex|
+-----+
| 100|
+-----+
Summary:
print("Input ", "Output")
display_side_by_side(df.toPandas(),df_regex1.toPandas())
Input Output
| str | regex |
|---|---|
| 100-200 | 100 |
1f. How to replace a specific group matched by a Java regex?
Let's first understand the syntax.
Syntax
pyspark.sql.functions.regexp_replace(str, pattern, replacement)
Replaces all substrings of the specified string value that match pattern with replacement.
Parameters:
str : a string expression (the column to match against)
pattern : a Java regular expression
replacement : the string to substitute for each match
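Because the pattern is a Java regex, the replacement may reference captured groups with $1, $2, and so on. A minimal sketch (reusing the spark session above; the DataFrame mirrors the example below) that swaps the two numbers:
from pyspark.sql.functions import regexp_replace
df_rr = spark.createDataFrame([('100-200',)], ['string'])
# $2 and $1 refer to the second and first captured groups, giving 200-100
df_rr.select(regexp_replace('string', r'(\d+)-(\d+)', '$2-$1').alias('swapped')).show()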
Input: Spark DataFrame with a column containing a string
df = spark.createDataFrame([('100-200',)], ['string'])
df.show()
+-------+
| string|
+-------+
|100-200|
+-------+
Output: Spark DataFrame with a column containing the replaced string
from pyspark.sql.functions import regexp_replace
df_regex2 = df.select(regexp_replace('string', r'(\d+)', '--').alias('replace'))
df_regex2.show()
+-------+
|replace|
+-------+
| -----|
+-------+
Summary:
print("Input ", "Output")
display_side_by_side(df.toPandas(),df_regex2.toPandas())
Input Output
| string | replace |
|---|---|
| 100-200 | ----- |