Generating random days between two dates

Categories: cheatsheet, pandas
Author: Teresa Kubacka
Published: July 30, 2022

import pandas as pd 
import numpy as np
start_date = '2010-03-23'
end_date = '2013-07-19'

Generate all days between two dates:

pd.date_range(start_date, end_date, freq='D')
DatetimeIndex(['2010-03-23', '2010-03-24', '2010-03-25', '2010-03-26',
               '2010-03-27', '2010-03-28', '2010-03-29', '2010-03-30',
               '2010-03-31', '2010-04-01',
               ...
               '2013-07-10', '2013-07-11', '2013-07-12', '2013-07-13',
               '2013-07-14', '2013-07-15', '2013-07-16', '2013-07-17',
               '2013-07-18', '2013-07-19'],
              dtype='datetime64[ns]', length=1215, freq='D')

Generate N equally spaced dates:

N = 4

pd.date_range(start_date, end_date, periods=N).normalize()
DatetimeIndex(['2010-03-23', '2011-05-01', '2012-06-09', '2013-07-19'], dtype='datetime64[ns]', freq=None)
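
The `.normalize()` call matters here because the span usually does not divide into a whole number of days. A minimal sketch of what happens without it, reusing the dates from above (the names `spaced` and `step` are only illustrative):

```python
import pandas as pd

start_date = '2010-03-23'
end_date = '2013-07-19'
N = 4

# Equal spacing splits the full span into N - 1 intervals
spaced = pd.date_range(start_date, end_date, periods=N)
step = (pd.Timestamp(end_date) - pd.Timestamp(start_date)) / (N - 1)

print(step)                # interval between consecutive points
print(spaced)              # inner points carry a time of day
print(spaced.normalize())  # times reset to midnight
```

Normalizing truncates each timestamp to midnight, so the resulting dates may be spaced unevenly by up to one day.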

Generate a random sample of N dates between two dates:

Methods 1 and 2 generate the whole date range first and then sample from it. Methods 3 and 4 draw random integers with numpy and construct the dates from them. Method 2 doesn’t require importing numpy explicitly. Method 4 also gives you the time of day for free and appears to be the fastest according to the benchmark in the original Stack Overflow thread.
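
To check the speed difference on your own machine, a rough timing sketch along these lines could be used. The function names, the sample size `N = 1000`, and the repeat count are assumptions made for illustration, and the epoch arithmetic is one possible way to get Unix seconds rather than the code used in the thread:

```python
import timeit

import numpy as np
import pandas as pd

start_date = '2010-03-23'
end_date = '2013-07-19'
N = 1000  # hypothetical sample size, larger than in the examples
rng = np.random.default_rng(0)  # arbitrary seed

epoch = pd.Timestamp('1970-01-01')
one_second = pd.Timedelta(seconds=1)


def via_date_range():
    # Methods 1 and 2: materialize every day, then sample from it
    return rng.choice(pd.date_range(start_date, end_date), N, replace=True)


def via_unix_seconds():
    # Method 4: draw integer seconds and convert them to timestamps
    start_u = (pd.Timestamp(start_date) - epoch) // one_second
    end_u = (pd.Timestamp(end_date) - epoch) // one_second
    return pd.to_datetime(rng.integers(start_u, end_u, N, endpoint=True), unit='s')


timings = {
    f.__name__: timeit.timeit(f, number=20)
    for f in (via_date_range, via_unix_seconds)
}
print(timings)  # total seconds for 20 calls of each function
```

The absolute numbers depend on the machine, the library versions, and the length of the date range, so treat them as a relative indication only.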

Method 1:

np.random.choice src

# old syntax

N = 4

pd.Series(
    np.random.choice(
        pd.date_range(start_date, end_date), 
        N, 
        replace=True # replace=True -> the same date can be drawn more than once
    )
) 
0   2010-12-07
1   2010-05-20
2   2011-12-17
3   2013-02-24
dtype: datetime64[ns]

# new syntax

rng = np.random.default_rng()

N = 4

pd.Series(
    rng.choice(
        pd.date_range(start_date, end_date), 
        N, 
        replace=True # replace=True -> the same date can be drawn more than once
    )
) 
0   2013-03-16
1   2010-11-18
2   2012-05-27
3   2011-03-24
dtype: datetime64[ns]
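
If every sampled date should be distinct, the same call can be run with `replace=False`. A minimal sketch, assuming N is not larger than the number of days in the range (the seed and the name `unique_sample` are arbitrary):

```python
import numpy as np
import pandas as pd

start_date = '2010-03-23'
end_date = '2013-07-19'
N = 4

rng = np.random.default_rng(42)  # arbitrary seed, only for repeatable runs
all_days = pd.date_range(start_date, end_date, freq='D')

# replace=False -> each date is drawn at most once (needs N <= len(all_days))
unique_sample = pd.Series(rng.choice(all_days, N, replace=False))
print(unique_sample)
```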

Method 2:

pd.Series.sample()

N = 4

(
    pd.Series(pd.date_range(start_date, end_date, freq='D'))
    .sample(N, replace=True)
    .reset_index(drop=True)
)
0   2012-05-05
1   2012-10-20
2   2013-02-04
3   2011-01-03
dtype: datetime64[ns]
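
Since this method doesn’t touch numpy directly, reproducibility can be handled through the `random_state` argument of `sample()`. A small sketch, where the seed value 0 and the variable names are arbitrary choices:

```python
import pandas as pd

start_date = '2010-03-23'
end_date = '2013-07-19'
N = 4

days = pd.Series(pd.date_range(start_date, end_date, freq='D'))

# The same random_state gives the same draw on every call
first = days.sample(N, replace=True, random_state=0).reset_index(drop=True)
second = days.sample(N, replace=True, random_state=0).reset_index(drop=True)

print(first)
print(first.equals(second))
```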

Method 3:

pd.to_timedelta

# old syntax

N = 4

max_days = (pd.to_datetime(end_date) - pd.to_datetime(start_date)).days
delta_days = pd.to_timedelta(
    np.random.randint(0, max_days+1, N), 
    unit='D')

pd.to_datetime(start_date) + delta_days
DatetimeIndex(['2013-05-10', '2011-07-19', '2011-05-14', '2013-01-30'], dtype='datetime64[ns]', freq=None)

# new syntax

rng = np.random.default_rng()

N = 4

max_days = (pd.to_datetime(end_date) - pd.to_datetime(start_date)).days
delta_days = pd.to_timedelta(
    rng.integers(0, max_days, size=N, endpoint=True), 
    unit='D')

pd.to_datetime(start_date) + delta_days
DatetimeIndex(['2013-02-17', '2012-02-21', '2011-04-13', '2013-02-20'], dtype='datetime64[ns]', freq=None)
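
Both variants are meant to include the end date: `randint` needs `max_days+1` because its upper bound is exclusive, while `integers` gets the same effect from `endpoint=True`. A quick sanity check on a deliberately short, hypothetical range makes this visible:

```python
import numpy as np
import pandas as pd

# Hypothetical three-day range so both endpoints are easy to spot
start = pd.Timestamp('2022-07-28')
end = pd.Timestamp('2022-07-30')
max_days = (end - start).days

rng = np.random.default_rng(0)  # arbitrary seed
# endpoint=True makes max_days itself a possible offset
offsets = rng.integers(0, max_days, size=1000, endpoint=True)
dates = start + pd.to_timedelta(offsets, unit='D')

print(dates.min(), dates.max())
print(pd.Series(dates).value_counts().sort_index())
```

With 1000 draws over three possible days, each day should show up roughly a third of the time.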

Method 4:

unix timestamps src

# old syntax

N = 4

start_u = pd.to_datetime(start_date).value//int(1e9)
end_u = pd.to_datetime(end_date).value//int(1e9)
# randint excludes the upper bound, so +1 matches endpoint=True in the new syntax
pd.to_datetime(np.random.randint(start_u, end_u + 1, N), unit='s').normalize()
DatetimeIndex(['2010-07-15', '2011-05-11', '2010-10-30', '2010-05-19'], dtype='datetime64[ns]', freq=None)

# new syntax

rng = np.random.default_rng()

N = 4

start_u = pd.to_datetime(start_date).value//int(1e9)
end_u = pd.to_datetime(end_date).value//int(1e9)
pd.to_datetime(rng.integers(start_u, end_u, N, endpoint=True), unit='s').normalize()
DatetimeIndex(['2011-03-22', '2011-08-23', '2012-12-26', '2010-08-05'], dtype='datetime64[ns]', freq=None)
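
Two details of this method are worth a sketch. Dropping `.normalize()` keeps the randomly drawn time of day. Also, because `end_u` is midnight at the start of the end date, that date is almost never drawn after normalizing. The variant below extends the upper bound by one day so that every calendar day, including the last, covers the same number of seconds. This is an assumption about the intended behaviour, not part of the original method, and the epoch arithmetic is just one way to obtain Unix seconds:

```python
import numpy as np
import pandas as pd

start_date = '2010-03-23'
end_date = '2013-07-19'
N = 4

rng = np.random.default_rng(7)  # arbitrary seed

epoch = pd.Timestamp('1970-01-01')
one_second = pd.Timedelta(seconds=1)
start_u = (pd.Timestamp(start_date) - epoch) // one_second
# Assumption: add one day so the end date gets a full day of seconds
end_u = (pd.Timestamp(end_date) + pd.Timedelta(days=1) - epoch) // one_second

# The upper bound is exclusive by default, so all draws fall on or before end_date
seconds = rng.integers(start_u, end_u, size=N)

with_times = pd.to_datetime(seconds, unit='s')  # random dates with time of day
dates_only = with_times.normalize()             # same dates at midnight

print(with_times)
print(dates_only)
```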