import pandas as pd
import numpy as np

start_date = '2010-03-23'
end_date = '2013-07-19'

Generate all days between two dates:
pd.date_range(start_date, end_date, freq='D')

DatetimeIndex(['2010-03-23', '2010-03-24', '2010-03-25', '2010-03-26',
'2010-03-27', '2010-03-28', '2010-03-29', '2010-03-30',
'2010-03-31', '2010-04-01',
...
'2013-07-10', '2013-07-11', '2013-07-12', '2013-07-13',
'2013-07-14', '2013-07-15', '2013-07-16', '2013-07-17',
'2013-07-18', '2013-07-19'],
dtype='datetime64[ns]', length=1215, freq='D')
Generate N dates equally spaced:
N = 4
pd.date_range(start_date, end_date, periods=N).normalize()

DatetimeIndex(['2010-03-23', '2011-05-01', '2012-06-09', '2013-07-19'], dtype='datetime64[ns]', freq=None)
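To see why `.normalize()` is there, here is a minimal sketch that reuses the same `start_date`, `end_date` and `N` as above. The evenly spaced points generally do not land on midnight, so the raw result carries a time of day; `normalize()` resets each timestamp to midnight. The variable names `raw` and `dates` are only for illustration.

```python
import pandas as pd

start_date = '2010-03-23'
end_date = '2013-07-19'
N = 4

# Evenly spaced points between the two dates, time of day included
raw = pd.date_range(start_date, end_date, periods=N)

# Same points with the time part reset to 00:00:00
dates = raw.normalize()

print(raw)
print(dates)
```

If you actually want the intermediate times, just leave `.normalize()` out.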
Generate a random subsample of N dates between two dates
Methods 1 and 2 generate the whole date range first, then sample from it. Methods 3 and 4 use NumPy random generators and construct the dates from the generated numbers. Method 2 doesn't require importing NumPy explicitly. Method 4 also gives you random times for free (drop the .normalize() call to keep them) and seems to be the fastest according to the benchmark in the original SO thread.
Method 1:
np.random.choice
# old syntax
N = 4
pd.Series(
    np.random.choice(
        pd.date_range(start_date, end_date),
        N,
        replace=True  # replace=True -> 1 value can appear multiple times
    )
)

0   2010-12-07
1   2010-05-20
2   2011-12-17
3   2013-02-24
dtype: datetime64[ns]
# new syntax
rng = np.random.default_rng()
N = 4
pd.Series(
    rng.choice(
        pd.date_range(start_date, end_date),
        N,
        replace=True  # replace=True -> 1 value can appear multiple times
    )
)

0   2013-03-16
1   2010-11-18
2   2012-05-27
3   2011-03-24
dtype: datetime64[ns]
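If you need N distinct dates instead, the same call works with `replace=False`. The sketch below is one way to do it, not the only one; the seed value and the names `all_days` and `unique_dates` are assumptions made for illustration, and the final sort is optional.

```python
import numpy as np
import pandas as pd

start_date = '2010-03-23'
end_date = '2013-07-19'
N = 4

# The seed is arbitrary; it only makes the draw reproducible
rng = np.random.default_rng(seed=0)
all_days = pd.date_range(start_date, end_date, freq='D')

# replace=False -> every drawn date is unique (requires N <= len(all_days))
unique_dates = pd.Series(
    rng.choice(all_days, N, replace=False)
).sort_values(ignore_index=True)

print(unique_dates)
```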
Method 2:
pd.Series.sample()
N = 4
pd.Series(
    pd.date_range(start_date, end_date, freq='D')
)\
    .sample(N, replace=True)\
    .reset_index(drop=True)

0   2012-05-05
1   2012-10-20
2   2013-02-04
3   2011-01-03
dtype: datetime64[ns]
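`Series.sample` also accepts a `random_state` argument, which is handy when the sample has to be reproducible. A small sketch under the same setup; the seed 42 and the names `days`, `first` and `second` are arbitrary choices for illustration.

```python
import pandas as pd

start_date = '2010-03-23'
end_date = '2013-07-19'
N = 4

days = pd.Series(pd.date_range(start_date, end_date, freq='D'))

# random_state=42 is an arbitrary seed; any fixed integer works
first = days.sample(N, replace=True, random_state=42).reset_index(drop=True)
second = days.sample(N, replace=True, random_state=42).reset_index(drop=True)

# Both draws use the same seed, so they contain the same dates
print(first.equals(second))
```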
Method 3:
pd.to_timedelta
# old syntax
N = 4
max_days = (pd.to_datetime(end_date) - pd.to_datetime(start_date)).days
delta_days = pd.to_timedelta(
    np.random.randint(0, max_days+1, N),
    unit='D')
pd.to_datetime(start_date) + delta_days

DatetimeIndex(['2013-05-10', '2011-07-19', '2011-05-14', '2013-01-30'], dtype='datetime64[ns]', freq=None)
# new syntax
rng = np.random.default_rng()
N = 4
max_days = (pd.to_datetime(end_date) - pd.to_datetime(start_date)).days
delta_days = pd.to_timedelta(
    rng.integers(0, max_days, size=N, endpoint=True),
    unit='D')
pd.to_datetime(start_date) + delta_days

DatetimeIndex(['2013-02-17', '2012-02-21', '2011-04-13', '2013-02-20'], dtype='datetime64[ns]', freq=None)
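Method 3 is easy to wrap into a reusable function. The sketch below is one possible packaging; the function name `random_dates` and its `seed` parameter are hypothetical, not something from pandas or NumPy.

```python
import numpy as np
import pandas as pd

def random_dates(start, end, n, seed=None):
    """Hypothetical helper: draw n random dates in [start, end], with replacement."""
    start_ts = pd.to_datetime(start)
    max_days = (pd.to_datetime(end) - start_ts).days
    rng = np.random.default_rng(seed)
    # endpoint=True makes max_days a possible draw, so `end` itself can appear
    offsets = rng.integers(0, max_days, size=n, endpoint=True)
    return start_ts + pd.to_timedelta(offsets, unit='D')

sample = random_dates('2010-03-23', '2013-07-19', 4, seed=1)
print(sample)
```

Passing `seed=None` (the default) gives a different sample on every call; a fixed integer makes the result repeatable.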
Method 4:
Unix timestamps
# old syntax
N = 4
start_u = pd.to_datetime(start_date).value//int(1e9)
end_u = pd.to_datetime(end_date).value//int(1e9)
pd.to_datetime(np.random.randint(start_u, end_u, N), unit='s').normalize()

DatetimeIndex(['2010-07-15', '2011-05-11', '2010-10-30', '2010-05-19'], dtype='datetime64[ns]', freq=None)
# new syntax
rng = np.random.default_rng()
N = 4
start_u = pd.to_datetime(start_date).value//int(1e9)
end_u = pd.to_datetime(end_date).value//int(1e9)
pd.to_datetime(rng.integers(start_u, end_u, N, endpoint=True), unit='s').normalize()

DatetimeIndex(['2011-03-22', '2011-08-23', '2012-12-26', '2010-08-05'], dtype='datetime64[ns]', freq=None)
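Since Method 4 draws whole seconds, the random time of day is already there before `.normalize()` removes it. The sketch below keeps both versions side by side; the seed and the name `stamps` are assumptions for illustration. Note that `end_u` is midnight at the start of `end_date`, so the end date itself only appears if the generator hits that exact second.

```python
import numpy as np
import pandas as pd

start_date = '2010-03-23'
end_date = '2013-07-19'
N = 4

rng = np.random.default_rng(seed=0)  # arbitrary seed, for reproducibility only

# Unix timestamps in whole seconds (.value is in nanoseconds)
start_u = pd.to_datetime(start_date).value // 10**9
end_u = pd.to_datetime(end_date).value // 10**9

stamps = pd.to_datetime(rng.integers(start_u, end_u, N, endpoint=True), unit='s')

print(stamps)              # random dates with random times of day
print(stamps.normalize())  # the same draws with the time part dropped
```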