Chapter 7 : Date Columns¶
Chapter Learning Objectives¶
Various data operations on columns containing date strings, dates, and timestamps.
Chapter Outline¶
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()
from IPython.display import display_html
import pandas as pd
import numpy as np
def display_side_by_side(*args):
    html_str = ''
    for df in args:
        html_str += df.to_html(index=False)
        html_str += "\xa0\xa0\xa0" * 10
    display_html(html_str.replace('table', 'table style="display:inline"'), raw=True)
space = "\xa0" * 10
import panel as pn
css = """
div.special_table + table, th, td {
border: 3px solid orange;
}
"""
pn.extension(raw_css=[css])
#<div class="special_table"></div>
Chapter Outline - Gallery¶
Timezone format conversions¶
Addition and subtraction on date columns¶
Extract year, month, date, hour, minute, second, day, weekday, week/quarter of year from a datetime column ¶
1a. How to convert a string timestamp to date format?¶
Let's first understand the syntax.
Syntax
pyspark.sql.functions.to_date(col, format=None)
Converts a Column into pyspark.sql.types.DateType using the optionally specified format.
Specify formats according to the datetime pattern. By default, it follows casting rules to pyspark.sql.types.DateType if the format is omitted.
Equivalent to col.cast("date").
Input: Spark data frame with a column containing a timestamp string
df_str = spark.createDataFrame([('2020-07-15 16:52:44',)], ['string_timestamp'])
df_str.show()
+-------------------+
| string_timestamp|
+-------------------+
|2020-07-15 16:52:44|
+-------------------+
Output: Spark data frame with a date column
from pyspark.sql.functions import to_date
df_date = df_str.select(to_date(df_str.string_timestamp).alias('date'))
df_date.show()
+----------+
| date|
+----------+
|2020-07-15|
+----------+
Summary:
print("Input ", "Output")
display_side_by_side(df_str.toPandas(),df_date.toPandas())
Input

| string_timestamp |
|---|
| 2020-07-15 16:52:44 |

Output

| date |
|---|
| 2020-07-15 |
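When the input string does not follow the default yyyy-MM-dd pattern, pass an explicit datetime pattern as the second argument. A minimal sketch (the frame and column names here are illustrative):
from pyspark.sql.functions import to_date
# A date stored as day/month/year; to_date needs the matching pattern.
df_dmy = spark.createDataFrame([('15/07/2020',)], ['string_date'])
df_dmy.select(to_date('string_date', 'dd/MM/yyyy').alias('date')).show()
+----------+
|      date|
+----------+
|2020-07-15|
+----------+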
1b. How to convert a string timestamp to datetime format?¶
Let's first understand the syntax.
Syntax
pyspark.sql.functions.to_timestamp(col, format=None)
Converts a Column into pyspark.sql.types.TimestampType using the optionally specified format.
Specify formats according to the datetime pattern. By default, it follows casting rules to pyspark.sql.types.TimestampType if the format is omitted.
Equivalent to col.cast("timestamp").
Input: Spark data frame with a column containing a timestamp string
df_string = spark.createDataFrame([('2020-07-15 16:52:44',)], ['string_timestamp'])
df_string.show()
+-------------------+
| string_timestamp|
+-------------------+
|2020-07-15 16:52:44|
+-------------------+
Output: Spark data frame with a timestamp column
from pyspark.sql.functions import to_timestamp
df_datetime = df_string.select(to_timestamp(df_string.string_timestamp).alias('datetime'))
df_datetime.show()
+-------------------+
| datetime|
+-------------------+
|2020-07-15 16:52:44|
+-------------------+
Summary:
print("Input ", "Output")
display_side_by_side(df_string.toPandas(),df_datetime.toPandas())
Input

| string_timestamp |
|---|
| 2020-07-15 16:52:44 |

Output

| datetime |
|---|
| 2020-07-15 16:52:44 |
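Like to_date, to_timestamp accepts an explicit pattern when the string is not in the default format. A minimal sketch (names are illustrative):
from pyspark.sql.functions import to_timestamp
# A timestamp stored as day-month-year; supply the matching pattern.
df_custom = spark.createDataFrame([('15-07-2020 16:52:44',)], ['ts_string'])
df_custom.select(to_timestamp('ts_string', 'dd-MM-yyyy HH:mm:ss').alias('datetime')).show()
+-------------------+
|           datetime|
+-------------------+
|2020-07-15 16:52:44|
+-------------------+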
1c. How to convert a Unix timestamp to a string timestamp?¶
Let's first understand the syntax.
Syntax
pyspark.sql.functions.from_unixtime(timestamp, format='yyyy-MM-dd HH:mm:ss')
Converts the number of seconds from the Unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone, in the given format.
Input: Spark data frame with a numeric column representing Unix time
df_unix = spark.createDataFrame([(1594846364,)], ['unix_time'])
df_unix.show()
+----------+
| unix_time|
+----------+
|1594846364|
+----------+
Output: Spark data frame with a timestamp string column
from pyspark.sql.functions import from_unixtime
df_time = df_unix.select(from_unixtime(df_unix.unix_time).alias("timestamp_string"))
df_time.show(truncate=False)
+-------------------+
|timestamp_string |
+-------------------+
|2020-07-15 16:52:44|
+-------------------+
Summary:
print("Input ", "Output")
display_side_by_side(df_unix.toPandas(),df_time.toPandas())
Input

| unix_time |
|---|
| 1594846364 |

Output

| timestamp_string |
|---|
| 2020-07-15 16:52:44 |
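The second argument of from_unixtime controls the output pattern. A minimal sketch reusing df_unix from above (the pattern is just an example; the rendered wall-clock time depends on the session time zone):
# Render the same Unix time with a custom pattern.
df_time2 = df_unix.select(from_unixtime(df_unix.unix_time, 'MM/dd/yyyy HH:mm').alias('formatted'))
df_time2.show(truncate=False)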
1d. How to convert a timestamp to a Unix timestamp?¶
Let's first understand the syntax.
Syntax
pyspark.sql.functions.unix_timestamp(timestamp=None, format='yyyy-MM-dd HH:mm:ss')
Converts a time string with the given pattern ('yyyy-MM-dd HH:mm:ss' by default) to a Unix timestamp (in seconds), using the default timezone and the default locale. Returns null on failure.
Input: Spark data frame consisting of a timestamp column in UTC
from pyspark.sql.functions import current_timestamp
df = spark.range(1).select(current_timestamp().alias("timestamp"))
df.show(truncate=False)
+----------------------+
|timestamp |
+----------------------+
|2021-03-13 21:42:38.01|
+----------------------+
Output: Spark data frame consisting of a Unix timestamp column
from pyspark.sql.functions import unix_timestamp
df_unix = df.select(unix_timestamp(df.timestamp).alias("unixtimestamp"))
df_unix.show(truncate=False)
+-------------+
|unixtimestamp|
+-------------+
|1615689758 |
+-------------+
Summary:
print("Input ", "Output")
display_side_by_side(df.toPandas(),df_unix.toPandas())
Input

| timestamp |
|---|
| 2021-03-13 21:42:38.659 |

Output

| unixtimestamp |
|---|
| 1615689758 |

(The timestamps shown by show() and toPandas() differ slightly because current_timestamp() is re-evaluated on each action.)
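unix_timestamp can also parse a plain string column when given a matching pattern; note that the result depends on the session time zone. A minimal sketch (names are illustrative):
from pyspark.sql.functions import unix_timestamp
# Parse a timestamp string directly into Unix time (seconds).
df_s = spark.createDataFrame([('2020-07-15 16:52:44',)], ['ts_string'])
df_s.select(unix_timestamp(df_s.ts_string, 'yyyy-MM-dd HH:mm:ss').alias('unix')).show()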
1e. How to convert a timestamp in the UTC time zone to a given time zone?¶
Let's first understand the syntax.
Syntax
pyspark.sql.functions.from_utc_timestamp(timestamp, tz)
This is a common function for databases supporting TIMESTAMP WITHOUT TIMEZONE. It takes a timezone-agnostic timestamp, interprets it as a timestamp in UTC, and renders that timestamp as a timestamp in the given time zone.
However, a timestamp in Spark represents the number of microseconds from the Unix epoch, which is not timezone-agnostic. So in Spark this function just shifts the timestamp value from the UTC timezone to the given timezone.
This function may return a confusing result if the input is a string with a timezone, e.g. '2018-03-13T06:18:23+00:00'. The reason is that Spark first casts the string to a timestamp according to the timezone in the string, and finally displays the result by converting the timestamp to a string according to the session-local timezone.
Parameters
timestamp – the column that contains timestamps
tz – a string detailing the time zone ID that the input should be adjusted to. It should be in the format of either region-based zone IDs or zone offsets. Region IDs must have the form 'area/city', such as 'America/Los_Angeles'. Zone offsets must be in the format '(+|-)HH:mm', for example '-08:00' or '+01:00'. 'UTC' and 'Z' are also supported as aliases of '+00:00'. Other short names are not recommended because they can be ambiguous.
Input: Spark data frame consisting of a timestamp column in UTC time zone
# Please note that the timestamp below is in UTC.
df_utc = spark.range(1).select(current_timestamp().alias("timestamp_UTC"))
df_utc.show(truncate=False)
+-----------------------+
|timestamp_UTC |
+-----------------------+
|2021-03-13 21:42:38.959|
+-----------------------+
Output: Spark data frame consisting of a timestamp column in a different time zone
from pyspark.sql.functions import from_utc_timestamp
df_tz = df_utc.select(from_utc_timestamp(df_utc.timestamp_UTC, '-04:00').alias("timestamp in local TZ"))
df_tz.show(truncate=False)
+-----------------------+
|timestamp in local TZ |
+-----------------------+
|2021-03-13 17:42:39.213|
+-----------------------+
Summary:
print("Input ", "Output")
display_side_by_side(df_utc.toPandas(),df_tz.toPandas())
Input

| timestamp_UTC |
|---|
| 2021-03-13 21:42:39.362 |

Output

| timestamp in local TZ |
|---|
| 2021-03-13 17:42:39.487 |
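Region-based zone IDs are generally safer than fixed offsets because they follow daylight saving rules. A minimal sketch with a hypothetical literal timestamp:
from pyspark.sql.functions import from_utc_timestamp
# 2021-03-13 was before the US DST switch, so New York is UTC-5 here.
df_lit = spark.createDataFrame([('2021-03-13 21:42:38',)], ['ts_utc'])
df_lit.select(from_utc_timestamp(df_lit.ts_utc, 'America/New_York').alias('ts_eastern')).show()
+-------------------+
|         ts_eastern|
+-------------------+
|2021-03-13 16:42:38|
+-------------------+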
1f. How to convert from a given time zone to a UTC timestamp?¶
Lets first understand the syntax
{admonition} Syntax pyspark.sql.functions.to_utc_timestamp(timestamp, tz)
This is a common function for databases supporting TIMESTAMP WITHOUT TIMEZONE. This function takes a timestamp which is timezone-agnostic, and interprets it as a timestamp in the given timezone, and renders that timestamp as a timestamp in UTC.
However, timestamp in Spark represents number of microseconds from the Unix epoch, which is not timezone-agnostic. So in Spark this function just shift the timestamp value from the given timezone to UTC timezone.
This function may return confusing result if the input is a string with timezone, e.g. ‘2018-03-13T06:18:23+00:00’. The reason is that, Spark firstly cast the string to timestamp according to the timezone in the string, and finally display the result by converting the timestamp to string according to the session local timezone.
Parameters
timestamp – the column that contains timestamps
tz – A string detailing the time zone ID that the input should be adjusted to. It should be in the format of either region-based zone IDs or zone offsets. Region IDs must have the form ‘area/city’, such as ‘America/Los_Angeles’. Zone offsets must be in the format ‘(+|-)HH:mm’, for example ‘-08:00’ or ‘+01:00’. Also ‘UTC’ and ‘Z’ are supported as aliases of ‘+00:00’. Other short names are not recommended to use because they can be ambiguous.
Input: Spark data frame consisting of a timestamp column in the local time zone
# Please note that the timestamp below is treated as local time in UTC-4.
df_est = spark.range(1).select(current_timestamp().alias("timestamp"))
df_est.show(truncate=False)
+-----------------------+
|timestamp |
+-----------------------+
|2021-03-13 21:42:39.695|
+-----------------------+
Output : Spark data frame consisting of a timestamp column in different timezone
from pyspark.sql.functions import to_utc_timestamp
df_utc = df_est.select(to_utc_timestamp(df_est.timestamp, '-04:00').alias("timestamp in UTC"))
df_utc.show(truncate=False)
+-----------------------+
|timestamp in UTC |
+-----------------------+
|2021-03-14 01:42:40.042|
+-----------------------+
Summary:
print("Input ", "Output")
display_side_by_side(df_est.toPandas(),df_utc.toPandas())
Input

| timestamp |
|---|
| 2021-03-13 21:42:40.310 |

Output

| timestamp in UTC |
|---|
| 2021-03-14 01:42:40.500 |
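The same region-based IDs work for the reverse conversion. A minimal sketch with a hypothetical literal timestamp:
from pyspark.sql.functions import to_utc_timestamp
# Interpret the value as New York local time and shift it to UTC.
df_lit = spark.createDataFrame([('2021-03-13 16:42:38',)], ['ts_eastern'])
df_lit.select(to_utc_timestamp(df_lit.ts_eastern, 'America/New_York').alias('ts_utc')).show()
+-------------------+
|             ts_utc|
+-------------------+
|2021-03-13 21:42:38|
+-------------------+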
1g. How to convert the date format?¶
Let's first understand the syntax.
Syntax
pyspark.sql.functions.date_format(date, format)
Converts a date/timestamp/string to a string value in the format specified by the date format given as the second argument.
A pattern could be, for instance, dd.MM.yyyy, and would return a string like '18.03.1993'. All pattern letters of the datetime pattern can be used.
Parameters
date – the column of dates/timestamps to format
Input: Spark data frame with a string date column
df_format1 = spark.createDataFrame([('2020-07-15',)], ['date_format1'])
df_format1.show()
+------------+
|date_format1|
+------------+
| 2020-07-15|
+------------+
Output: Spark data frame with a formatted date string
from pyspark.sql.functions import date_format
df_format2 = df_format1.select(date_format('date_format1', 'MM/dd/yyyy').alias('date_format2'))
df_format2.show()
+------------+
|date_format2|
+------------+
| 07/15/2020|
+------------+
Summary:
print("input ", "output")
display_side_by_side(df_format1.toPandas(),df_format2.toPandas())
Input

| date_format1 |
|---|
| 2020-07-15 |

Output

| date_format2 |
|---|
| 07/15/2020 |
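date_format also supports text pattern letters, not just numeric reordering. A sketch reusing df_format1 (the pattern is illustrative):
# 'MMM' renders the abbreviated month name.
df_format1.select(date_format('date_format1', 'dd MMM yyyy').alias('pretty_date')).show()
+-----------+
|pretty_date|
+-----------+
|15 Jul 2020|
+-----------+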
2a. How to find the number of days between two dates?¶
Let's first understand the syntax.
Syntax
pyspark.sql.functions.datediff(end, start)
Returns the number of days from start to end.
Input: Spark data frame consisting of date columns
df_dates = spark.createDataFrame([('2020-07-15','2020-06-10')], ['date1', 'date2'])
df_dates.show()
+----------+----------+
| date1| date2|
+----------+----------+
|2020-07-15|2020-06-10|
+----------+----------+
Output: Spark data frame with the difference in days
from pyspark.sql.functions import datediff
df_diff = df_dates.select(datediff(df_dates.date1,df_dates.date2).alias('diff'))
df_diff.show()
+----+
|diff|
+----+
| 35|
+----+
Summary:
print("Input ", "Output")
display_side_by_side(df_dates.toPandas(),df_diff.toPandas())
Input

| date1 | date2 |
|---|---|
| 2020-07-15 | 2020-06-10 |

Output

| diff |
|---|
| 35 |
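datediff is signed, so swapping the arguments gives a negative count. A quick check with the same frame:
# end before start yields a negative number of days.
df_dates.select(datediff(df_dates.date2, df_dates.date1).alias('diff')).show()
+----+
|diff|
+----+
| -35|
+----+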
2b. How to find the number of months between two dates?¶
Let's first understand the syntax.
Syntax
pyspark.sql.functions.months_between(date1, date2, roundOff=True)
Returns the number of months between date1 and date2. If date1 is later than date2, the result is positive. If date1 and date2 are on the same day of the month, or both are the last day of the month, the result is a whole number (the time of day is ignored). The result is rounded off to 8 digits unless roundOff is set to False.
Input: Spark data frame with date columns
df_month = spark.createDataFrame([('2020-07-15 11:32:00', '2020-01-30')], ['date1', 'date2'])
df_month.show()
+-------------------+----------+
| date1| date2|
+-------------------+----------+
|2020-07-15 11:32:00|2020-01-30|
+-------------------+----------+
Output: Spark data frame with the number of months between the two dates
from pyspark.sql.functions import months_between
df_month_diff = df_month.select(months_between(df_month.date1, df_month.date2))
df_month_diff.show()
+----------------------------------+
|months_between(date1, date2, true)|
+----------------------------------+
| 5.53163082|
+----------------------------------+
Summary:
print("Input ", "Output")
display_side_by_side(df_month.toPandas(),df_month_diff.toPandas())
Input

| date1 | date2 |
|---|---|
| 2020-07-15 11:32:00 | 2020-01-30 |

Output

| months_between(date1, date2, true) |
|---|
| 5.531631 |
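Setting roundOff=False returns the unrounded value. A sketch reusing df_month:
# Skip the default 8-digit rounding.
df_month.select(months_between(df_month.date1, df_month.date2, roundOff=False).alias('months')).show(truncate=False)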
2c. How to add a number of days to a date?¶
Let's first understand the syntax.
Syntax
pyspark.sql.functions.date_add(start, days)
Returns the date that is days days after start.
Input: Spark data frame with a string date column
df_today = spark.createDataFrame([('2020-07-15',)], ['date'])
df_today.show()
+----------+
| date|
+----------+
|2020-07-15|
+----------+
Output: Spark data frame with a new date
from pyspark.sql.functions import date_add
df_tomo = df_today.select(date_add(df_today.date, 1).alias('next_day'))
df_tomo.show()
+----------+
| next_day|
+----------+
|2020-07-16|
+----------+
Summary:
print("input ", "output")
display_side_by_side(df_today.toPandas(),df_tomo.toPandas())
Input

| date |
|---|
| 2020-07-15 |

Output

| next_day |
|---|
| 2020-07-16 |
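date_add also accepts negative values, which makes it a drop-in replacement for date_sub. A quick sketch reusing df_today:
# A negative offset moves the date backwards.
df_today.select(date_add(df_today.date, -1).alias('previous_day')).show()
+------------+
|previous_day|
+------------+
|  2020-07-14|
+------------+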
2d. How to subtract a number of days from a date?¶
Let's first understand the syntax.
Syntax
pyspark.sql.functions.date_sub(start, days)
Returns the date that is days days before start.
Input: Spark data frame with a date column
df_date = spark.createDataFrame([('2020-07-15',)], ['date'])
df_date.show()
+----------+
| date|
+----------+
|2020-07-15|
+----------+
Output: Spark data frame with the new date column
from pyspark.sql.functions import date_sub
df_sub = df_date.select(date_sub(df_date.date, 1).alias('yesterday'))
df_sub.show()
+----------+
| yesterday|
+----------+
|2020-07-14|
+----------+
Summary:
print("Input ", "Output")
display_side_by_side(df_date.toPandas(),df_sub.toPandas())
Input

| date |
|---|
| 2020-07-15 |

Output

| yesterday |
|---|
| 2020-07-14 |
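A common variant is subtracting from today's date rather than from a literal. A minimal sketch (the result depends on the current date, so no output is shown):
from pyspark.sql.functions import current_date, date_sub
# The date one week before today, evaluated at query time.
spark.range(1).select(date_sub(current_date(), 7).alias('week_ago')).show()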
2e. How to add a number of months to a date?¶
Let's first understand the syntax.
Syntax
pyspark.sql.functions.add_months(start, months)
Returns the date that is months months after start.
Input: Spark data frame consisting of a date column
df = spark.createDataFrame([('2020-07-15',)], ['date'])
df.show()
+----------+
| date|
+----------+
|2020-07-15|
+----------+
Output: Spark data frame consisting of the new date column
from pyspark.sql.functions import add_months
df1 = df.select(add_months(df.date, 1).alias('next_month'))
df1.show()
+----------+
|next_month|
+----------+
|2020-08-15|
+----------+
Summary:
print("Input ", "Output")
display_side_by_side(df.toPandas(),df1.toPandas())
Input

| date |
|---|
| 2020-07-15 |

Output

| next_month |
|---|
| 2020-08-15 |
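add_months clamps to the last day of shorter months, which matters for end-of-month dates. A sketch with a hypothetical end-of-January date:
# January 31 plus one month lands on the last day of February.
df_eom = spark.createDataFrame([('2020-01-31',)], ['date'])
df_eom.select(add_months(df_eom.date, 1).alias('next_month')).show()
+----------+
|next_month|
+----------+
|2020-02-29|
+----------+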
3. How to extract year, month, date, hour, minute, second, day, weekday, week/quarter of year from a datetime column?¶
Input: Spark data frame consisting of a datetime column
df = spark.range(1).select(current_timestamp().alias("timestamp"))
df.show(truncate=False)
+-----------------------+
|timestamp |
+-----------------------+
|2021-03-13 21:42:44.661|
+-----------------------+
Output: Spark data frame with the extracted parts as separate columns
from pyspark.sql.functions import year, month, dayofmonth, dayofweek, dayofyear, hour, minute, second, weekofyear, quarter
df2 = df.select(df.timestamp,
                year(df.timestamp).alias("year"),
                month(df.timestamp).alias("month"),
                dayofmonth(df.timestamp).alias("dayofmonth"),
                dayofweek(df.timestamp).alias("dayofweek"),
                dayofyear(df.timestamp).alias("dayofyear"),
                hour(df.timestamp).alias("hour"),
                minute(df.timestamp).alias("minute"),
                second(df.timestamp).alias("second"),
                weekofyear(df.timestamp).alias("weekofyear"),
                quarter(df.timestamp).alias("quarter"))
df2.show(truncate=False)
+-----------------------+----+-----+----------+---------+---------+----+------+------+----------+-------+
|timestamp |year|month|dayofmonth|dayofweek|dayofyear|hour|minute|second|weekofyear|quarter|
+-----------------------+----+-----+----------+---------+---------+----+------+------+----------+-------+
|2021-03-13 21:42:44.997|2021|3 |13 |7 |72 |21 |42 |44 |10 |1 |
+-----------------------+----+-----+----------+---------+---------+----+------+------+----------+-------+
Summary:
print("Input ", "Output")
display_side_by_side(df.toPandas(),df2.toPandas())
Input

| timestamp |
|---|
| 2021-03-13 21:42:45.180 |

Output

| timestamp | year | month | dayofmonth | dayofweek | dayofyear | hour | minute | second | weekofyear | quarter |
|---|---|---|---|---|---|---|---|---|---|---|
| 2021-03-13 21:42:45.282 | 2021 | 3 | 13 | 7 | 72 | 21 | 42 | 45 | 10 | 1 |
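dayofweek returns a number (1 = Sunday, 7 = Saturday); to get the weekday name mentioned in the section title, date_format with a text pattern can be used. A minimal sketch reusing df (for the timestamp above this would print Saturday; the 'EEEE' pattern is assumed to be enabled in your Spark version's datetime patterns):
from pyspark.sql.functions import date_format
# 'EEEE' renders the full day-of-week name.
df.select(date_format(df.timestamp, 'EEEE').alias('weekday_name')).show()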