Chapter 6: Number Columns
Chapter Learning Objectives
Perform common operations on columns containing numbers: minimum, maximum, sum, rounding, mean, and standard deviation.
Chapter Outline
1a. How to calculate the minimum value in a column?
1b. How to calculate the maximum value in a column?
1c. How to calculate the sum of a column?
1d. How to round values in a column?
1e. How to calculate the mean of a column?
1f. How to calculate the standard deviation of a column?
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()
from IPython.display import display_html
import pandas as pd
import numpy as np
def display_side_by_side(*args):
    html_str = ''
    for df in args:
        html_str += df.to_html(index=False)
        html_str += "\xa0\xa0\xa0" * 10
    display_html(html_str.replace('table', 'table style="display:inline"'), raw=True)
space = "\xa0" * 10
import panel as pn
css = """
div.special_table + table, th, td {
border: 3px solid orange;
}
"""
pn.extension(raw_css=[css])
1a. How to calculate the minimum value in a column?
Let's first understand the syntax
Syntax
pyspark.sql.functions.min(col)
Aggregate function: returns the minimum value of the expression in a group.
Parameters:
col : column
Input: Spark data frame with a column of numbers
df_min = spark.createDataFrame([(1,),(2,),(3,)],['data'])
df_min.show()
+----+
|data|
+----+
| 1|
| 2|
| 3|
+----+
Output: the minimum value of the column
from pyspark.sql.functions import min
df_min.select(min(df_min.data)).first()[0]
1
Summary:
print("Input ", "Output")
display_side_by_side(df_min.toPandas(),df_min.select(min(df_min.data)).toPandas())
Input                Output
data                 min(data)
1                    1
2
3
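The same aggregate can also be requested through DataFrame.agg, which keeps the result as a one-row data frame instead of unwrapping it with first()[0]. A minimal sketch, reusing the df_min frame from above:
from pyspark.sql.functions import min
# agg() returns a one-row data frame holding the aggregate
df_min.agg(min("data")).show()
+---------+
|min(data)|
+---------+
|        1|
+---------+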
1b. How to calculate the maximum value in a column?
Let's first understand the syntax
Syntax
pyspark.sql.functions.max(col)
Aggregate function: returns the maximum value of the expression in a group
Parameters:
col : column
Input: Spark data frame with a column of numbers
df_max = spark.createDataFrame([(1,),(2,),(3,)],['data'])
df_max.show()
+----+
|data|
+----+
| 1|
| 2|
| 3|
+----+
Output: the maximum value of the column
from pyspark.sql.functions import max
df_max.select(max(df_max.data)).first()[0]
3
Summary:
print("Input ", "Output")
display_side_by_side(df_max.toPandas(),df_max.select(max(df_max.data)).toPandas())
Input                Output
data                 max(data)
1                    3
2
3
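Several aggregates can be combined in a single select, which computes them in one pass over the data. A minimal sketch, reusing df_max from above:
from pyspark.sql.functions import min, max
# compute both aggregates together instead of running two separate queries
df_max.select(min("data"), max("data")).show()
+---------+---------+
|min(data)|max(data)|
+---------+---------+
|        1|        3|
+---------+---------+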
1c. How to calculate the sum of a column?
Let's first understand the syntax
Syntax
pyspark.sql.functions.sum(col)
Aggregate function: returns the sum of all values in the expression.
Parameters:
col : column
Input: Spark data frame with a column of numbers
df_sum = spark.createDataFrame([(1,),(2,),(3,)],['data'])
df_sum.show()
+----+
|data|
+----+
| 1|
| 2|
| 3|
+----+
Output: the sum of the values in the column
from pyspark.sql.functions import sum
df_sum.select(sum(df_sum.data)).first()[0]
6
Summary:
print("input ", "output")
display_side_by_side(df_sum.toPandas(),df_sum.select(sum(df_sum.data)).toPandas())
input                output
data                 sum(data)
1                    6
2
3
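sum is often applied per group rather than to the whole column. A minimal sketch with a hypothetical two-column frame (df_grp, key, and the sample rows are made-up for illustration):
from pyspark.sql.functions import sum
# hypothetical frame: a grouping key plus the numeric column
df_grp = spark.createDataFrame([('a', 1), ('a', 2), ('b', 3)], ['key', 'data'])
# one sum per distinct key; the order of the groups in the output may vary
df_grp.groupBy('key').agg(sum('data')).show()
+---+---------+
|key|sum(data)|
+---+---------+
|  a|        3|
|  b|        3|
+---+---------+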
1d. How to round values in a column?
Let's first understand the syntax
Syntax
pyspark.sql.functions.round(col, scale=0)
Round the given value to scale decimal places using HALF_UP rounding mode if scale >= 0 or at integral part when scale < 0.
Parameters:
col : column
Input: Spark data frame with a column of decimal numbers
df_round = spark.createDataFrame([(1.453,),(2.65433,),(3.765,),(2.985,)],['data'])
df_round.show()
+-------+
| data|
+-------+
| 1.453|
|2.65433|
| 3.765|
| 2.985|
+-------+
Output: Spark data frame with the values rounded to two decimal places
from pyspark.sql.functions import round
df_round1 = df_round.select(round(df_round.data,2))
df_round1.show()
+--------------+
|round(data, 2)|
+--------------+
| 1.45|
| 2.65|
| 3.77|
| 2.99|
+--------------+
Summary:
print("Input ", "Output")
display_side_by_side(df_round.toPandas(),df_round1.toPandas())
Input                Output
data                 round(data, 2)
1.45300              1.45
2.65433              2.65
3.76500              3.77
2.98500              2.99
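As the syntax notes, scale defaults to 0, which rounds to the nearest whole number (still returned as a double for a double column). A minimal sketch on the df_round frame from above:
from pyspark.sql.functions import round
# HALF_UP rounding to zero decimal places
df_round.select(round(df_round.data)).show()
+--------------+
|round(data, 0)|
+--------------+
|           1.0|
|           3.0|
|           4.0|
|           3.0|
+--------------+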
1e. How to calculate the mean of a column?
Let's first understand the syntax
Syntax
pyspark.sql.functions.mean(col)
Aggregate function: returns the average of the values in a group
Parameters:
col : column
Input: Spark data frame with a column of numbers
df_mean = spark.createDataFrame([(1,),(2,),(3,)],['data'])
df_mean.show()
+----+
|data|
+----+
| 1|
| 2|
| 3|
+----+
Output: the mean (average) of the column
from pyspark.sql.functions import mean
df_mean.select(mean(df_mean.data)).first()[0]
2.0
Summary:
print("Input ", "Output")
display_side_by_side(df_mean.toPandas(),df_mean.select(mean(df_mean.data)).toPandas())
Input                Output
data                 avg(data)
1                    2.0
2
3
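The result column is labelled avg(data) because mean is simply an alias for avg; importing either name gives the same aggregate. A minimal sketch:
from pyspark.sql.functions import avg
# identical to mean(df_mean.data)
df_mean.select(avg(df_mean.data)).first()[0]
2.0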
1f. How to calculate the standard deviation of a column?
Let's first understand the syntax
Syntax
pyspark.sql.functions.stddev(col)
Aggregate function: alias for stddev_samp.
Parameters:
col : column
Input: Spark data frame with a column of numbers
df_stddev = spark.createDataFrame([(1,),(2,),(3,)],['data'])
df_stddev.show()
+----+
|data|
+----+
| 1|
| 2|
| 3|
+----+
Output: the sample standard deviation of the column
from pyspark.sql.functions import stddev
df_stddev.select(stddev(df_stddev.data)).first()[0]
1.0
Summary:
print("Input ", "Output")
display_side_by_side(df_stddev.toPandas(),df_stddev.select(stddev(df_stddev.data)).toPandas())
Input                Output
data                 stddev_samp(data)
1                    1.0
2
3
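Since stddev is an alias for stddev_samp (the sample standard deviation, dividing by n - 1), pyspark.sql.functions also provides stddev_pop for the population form (dividing by n). A minimal sketch comparing the two on the df_stddev frame from above:
from pyspark.sql.functions import stddev_samp, stddev_pop
# for the values 1, 2, 3: sample std dev is 1.0, population std dev is about 0.816
df_stddev.select(stddev_samp("data"), stddev_pop("data")).show()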