import pandas as pd
import numpy as np
start_date = '2010-03-23'
end_date = '2013-07-19'
Generate all days between two dates:
pd.date_range(start_date, end_date, freq='D')
DatetimeIndex(['2010-03-23', '2010-03-24', '2010-03-25', '2010-03-26',
'2010-03-27', '2010-03-28', '2010-03-29', '2010-03-30',
'2010-03-31', '2010-04-01',
...
'2013-07-10', '2013-07-11', '2013-07-12', '2013-07-13',
'2013-07-14', '2013-07-15', '2013-07-16', '2013-07-17',
'2013-07-18', '2013-07-19'],
dtype='datetime64[ns]', length=1215, freq='D')
Generate N dates equally spaced:
N = 4
pd.date_range(start_date, end_date, periods=N).normalize()
DatetimeIndex(['2010-03-23', '2011-05-01', '2012-06-09', '2013-07-19'], dtype='datetime64[ns]', freq=None)
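The `.normalize()` call matters here: with `periods=N`, pandas spaces the timestamps linearly between the two endpoints, so the interior values generally land at a time other than midnight. A minimal sketch comparing the two results (the names `raw` and `days` are illustrative only):

```python
import pandas as pd

start_date = '2010-03-23'
end_date = '2013-07-19'
N = 4

# Linearly spaced timestamps; interior points may carry a time-of-day component
raw = pd.date_range(start_date, end_date, periods=N)

# normalize() resets every timestamp to midnight, leaving plain dates
days = raw.normalize()

print(raw)
print(days)
```

If you need the exact spacing rather than calendar days, keep `raw` and skip the normalization.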
Generate a random subsample of N dates between the dates
Methods 1 and 2 generate the whole date range first, then sample from it. Methods 3 and 4 use numpy's random number generators and construct dates from the generated numbers. Method 2 doesn't require explicitly importing numpy. Method 4 gives you the times for free as well and seems to be the fastest, according to the benchmark in the original SO thread.
Method 1: np.random.choice (src)
# old syntax (legacy np.random functions)
N = 4
pd.Series(
    np.random.choice(
        pd.date_range(start_date, end_date),
        N,
        replace=True  # replace=True -> 1 value can appear multiple times
    )
)
0 2010-12-07
1 2010-05-20
2 2011-12-17
3 2013-02-24
dtype: datetime64[ns]
# new syntax (np.random.Generator)
rng = np.random.default_rng()
N = 4
pd.Series(
    rng.choice(
        pd.date_range(start_date, end_date),
        N,
        replace=True  # replace=True -> 1 value can appear multiple times
    )
)
0 2013-03-16
1 2010-11-18
2 2012-05-27
3 2011-03-24
dtype: datetime64[ns]
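Because the draws are random, the outputs shown here change on every run. A sketch of two common variations on Method 1: seeding the generator so the sample is repeatable, and passing `replace=False` so each date appears at most once. The seed value 42 and the name `all_days` are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

start_date = '2010-03-23'
end_date = '2013-07-19'
N = 4

rng = np.random.default_rng(42)  # arbitrary seed, used only to make the run repeatable

all_days = pd.date_range(start_date, end_date)

# replace=False -> every sampled date is distinct
sample = pd.Series(rng.choice(all_days, N, replace=False))
print(sample)
```

Note that `replace=False` requires `N` to be no larger than the number of days in the range.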
Method 2: pd.Series.sample()
N = 4
pd.Series(
    pd.date_range(start_date, end_date, freq='D')
)\
.sample(N, replace=True)\
.reset_index(drop=True)
0 2012-05-05
1 2012-10-20
2 2013-02-04
3 2011-01-03
dtype: datetime64[ns]
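`Series.sample()` also accepts a `random_state` argument, which makes the draw repeatable without touching numpy directly. A minimal sketch, where the seed 0 and the names `first` and `second` are arbitrary:

```python
import pandas as pd

start_date = '2010-03-23'
end_date = '2013-07-19'
N = 4

days = pd.Series(pd.date_range(start_date, end_date, freq='D'))

# Same random_state -> same sample on every call
first = days.sample(N, replace=True, random_state=0).reset_index(drop=True)
second = days.sample(N, replace=True, random_state=0).reset_index(drop=True)

print(first.equals(second))
```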
Method 3: pd.to_timedelta
# old syntax (legacy np.random functions)
N = 4
max_days = (pd.to_datetime(end_date) - pd.to_datetime(start_date)).days
delta_days = pd.to_timedelta(
    np.random.randint(0, max_days+1, N),  # upper bound is exclusive, hence +1
    unit='D'
)
pd.to_datetime(start_date) + delta_days
DatetimeIndex(['2013-05-10', '2011-07-19', '2011-05-14', '2013-01-30'], dtype='datetime64[ns]', freq=None)
# new syntax (np.random.Generator)
rng = np.random.default_rng()
N = 4
max_days = (pd.to_datetime(end_date) - pd.to_datetime(start_date)).days
delta_days = pd.to_timedelta(
    rng.integers(0, max_days, size=N, endpoint=True),  # endpoint=True includes max_days
    unit='D'
)
pd.to_datetime(start_date) + delta_days
DatetimeIndex(['2013-02-17', '2012-02-21', '2011-04-13', '2013-02-20'], dtype='datetime64[ns]', freq=None)
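The two variants of Method 3 handle the upper bound differently: `np.random.randint` excludes its upper bound, hence `max_days+1`, while `rng.integers(..., endpoint=True)` includes it. A quick sanity check, assuming a deliberately tiny three-day range and many draws so that both endpoints are all but certain to show up (the dates, seed, and draw count here are illustrative, not from the original):

```python
import numpy as np
import pandas as pd

start = pd.to_datetime('2010-03-23')
end = pd.to_datetime('2010-03-25')  # tiny range, chosen only for this check
max_days = (end - start).days

rng = np.random.default_rng(0)  # arbitrary seed

# endpoint=True makes max_days itself a possible draw
offsets = rng.integers(0, max_days, size=1000, endpoint=True)
dates = start + pd.to_timedelta(offsets, unit='D')

# Both the first and the last day of the range should be present
print(dates.min(), dates.max())
```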
Method 4: unix timestamps (src)
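# old syntax (legacy np.random functions)
N = 4
# Timestamp.value is in nanoseconds; integer-divide to get whole seconds
start_u = pd.to_datetime(start_date).value//int(1e9)
end_u = pd.to_datetime(end_date).value//int(1e9)
pd.to_datetime(np.random.randint(start_u, end_u, N), unit='s').normalize()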
DatetimeIndex(['2010-07-15', '2011-05-11', '2010-10-30', '2010-05-19'], dtype='datetime64[ns]', freq=None)
# new syntax (np.random.Generator)
rng = np.random.default_rng()
N = 4
# Timestamp.value is in nanoseconds; integer-divide to get whole seconds
start_u = pd.to_datetime(start_date).value//int(1e9)
end_u = pd.to_datetime(end_date).value//int(1e9)
pd.to_datetime(rng.integers(start_u, end_u, N, endpoint=True), unit='s').normalize()
DatetimeIndex(['2011-03-22', '2011-08-23', '2012-12-26', '2010-08-05'], dtype='datetime64[ns]', freq=None)
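Dropping `.normalize()` from Method 4 keeps the randomly drawn time of day, which is what "times for free" refers to. A minimal sketch under that reading (the name `stamps` is illustrative):

```python
import numpy as np
import pandas as pd

start_date = '2010-03-23'
end_date = '2013-07-19'
N = 4

rng = np.random.default_rng()

# Timestamp.value is nanoseconds since the Unix epoch; convert to whole seconds
start_u = pd.to_datetime(start_date).value // 10**9
end_u = pd.to_datetime(end_date).value // 10**9

# Without normalize(), each result keeps its random hour, minute and second
stamps = pd.to_datetime(rng.integers(start_u, end_u, N, endpoint=True), unit='s')
print(stamps)
```

Since `end_date` parses to midnight, the last day of the range contributes only its first second; extend `end_u` by one day's worth of seconds if the whole final day should be eligible.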