In the wild, some data come as messy CSV or TXT files. Luckily, we can customize `pd.read_csv()` to deal with many of the surprises that await us.
Configure the file imports
To set an explicit file encoding, for the case when non-Latin characters were saved in a weird way. Pick your favorite one (`ansi`, `utf-8` etc.):

```python
pd.read_csv('messyfile.txt', encoding='ansi')
```
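As a quick sketch of what the `encoding` parameter buys you, here is a hypothetical in-memory file (via `io.BytesIO`, standing in for a file on disk) with non-Latin characters saved as UTF-8:

```python
import io
import pandas as pd

# hypothetical bytes with non-Latin characters, saved as UTF-8
raw = "city,population\nŁódź,700000\n".encode('utf-8')
df = pd.read_csv(io.BytesIO(raw), encoding='utf-8')
print(df['city'].iloc[0])  # Łódź
```

With the wrong encoding the city name would come out mangled (or the read would fail outright).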
Delimiter other than a comma. Common delimiters are: `\t` for tab, `|` for pipe etc.

```python
pd.read_csv('messyfile.tsv', delimiter='\t')
pd.read_csv('messyfile.tsv', sep='\t')
```
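A minimal sketch with a hypothetical tab-separated snippet (in memory via `io.StringIO`, standing in for `messyfile.tsv`):

```python
import io
import pandas as pd

# a hypothetical tab-separated snippet standing in for messyfile.tsv
tsv = io.StringIO("a\tb\tc\n1\t2\t3\n4\t5\t6\n")
df = pd.read_csv(tsv, sep='\t')
print(df.shape)          # (2, 3)
print(list(df.columns))  # ['a', 'b', 'c']
```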
Your CSV has commentary on top of your data and you know how many lines it takes (here, 9 lines of commentary):

```python
pd.read_csv('messyfile.csv', skiprows=9)
```
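To see `skiprows` in action, here is a hypothetical file with two lines of commentary before the table (so `skiprows=2` instead of 9):

```python
import io
import pandas as pd

# hypothetical file: 2 lines of commentary before the table
data = io.StringIO(
    "report generated last Tuesday\n"
    "all values are approximate\n"
    "item,price\n"
    "pen,1\n"
)
df = pd.read_csv(data, skiprows=2)
print(list(df.columns))  # ['item', 'price']
print(len(df))           # 1
```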
Your CSV has commentary above/in/below your data and it is preceded with a `#`:

```python
pd.read_csv('messyfile.csv', comment='#')
```
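A small sketch with hypothetical in-memory data showing that `comment='#'` skips commented lines both above and inside the table:

```python
import io
import pandas as pd

# hypothetical file with '#' commentary above and inside the table
data = io.StringIO(
    "# commentary above the data\n"
    "col1,col2\n"
    "1,2\n"
    "# a stray comment in the middle\n"
    "3,4\n"
)
df = pd.read_csv(data, comment='#')
print(len(df))  # 2
```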
Your CSV has data which look like `NaN`, but are actually proper data points, e.g. the shortcut `NA` for North America (this will also switch off all the other ways of fishing out `NaN` values):

```python
pd.read_csv('messyfile.csv', na_filter=False)
```
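A side-by-side sketch with hypothetical data, where `NA` means North America: by default pandas converts it to `NaN`, while `na_filter=False` keeps it as a string:

```python
import io
import pandas as pd

# hypothetical data where 'NA' means North America, not a missing value
raw = "region,sales\nNA,100\nEU,200\n"

df_default = pd.read_csv(io.StringIO(raw))
df_kept = pd.read_csv(io.StringIO(raw), na_filter=False)

print(df_default['region'].isna().iloc[0])  # True -- 'NA' became NaN
print(df_kept['region'].iloc[0])            # NA -- kept as a string
```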
Your CSV has textual data which contain a valid double-quote character as part of the string, escaped with `\` (note that the backslash itself has to be escaped in Python source code):

```python
pd.read_csv('messyfile.csv', escapechar='\\')
```
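A minimal sketch with a hypothetical in-memory row containing an escaped double quote inside a quoted field:

```python
import io
import pandas as pd

# the raw CSV row is:  "say \"hi\"",1
raw = 'text,val\n"say \\"hi\\"",1\n'
df = pd.read_csv(io.StringIO(raw), escapechar='\\')
print(df['text'].iloc[0])  # say "hi"
```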
You know that the table is at the beginning of the file and takes 11 lines, and below the table there is an unknown volume of garbage/commentary:

```python
pd.read_csv('messyfile.csv', nrows=11)
```
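To sketch this with hypothetical in-memory data (a 2-row table followed by garbage, so `nrows=2` instead of 11):

```python
import io
import pandas as pd

# hypothetical file: a 2-row table, then garbage below it
data = io.StringIO(
    "x,y\n"
    "1,2\n"
    "3,4\n"
    "some trailing garbage\n"
)
df = pd.read_csv(data, nrows=2)
print(df.shape)  # (2, 2)
```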
How to snoop the file before reading
Sometimes the mess is big and we want to snoop the file first to decide on an import scenario.
For example, we have a set of CSVs and the headers are missing in some of them. The columns should be the same in each file. We know what the default header should be, and we can store it in a variable:

```python
headerline = 'col1,col2,col3'
header_names = headerline.split(',')
```
If your column names are messy, you may want to strip whitespace from them etc. in this step.
Then we can open each CSV file separately, snoop the beginning and decide on the import scenario. It is important to move the cursor back to the beginning after performing a read operation.
```python
with open('single_file.csv') as tmp_file:
    # snoop the first line and move the file pointer back to the file beginning
    x = tmp_file.tell()               # get the initial position
    first_line = tmp_file.readline()  # snoop (text mode already yields str)
    tmp_file.seek(x)                  # move the pointer back to the initial position
    # do something depending on what we found out
    if first_line.strip() != headerline:  # the header line is missing at the beginning
        df_tmp = pd.read_csv(tmp_file, delimiter=',',
                             header=None, names=header_names)  # set the header manually
        # here you can do something else to your data to clean it
    else:
        df_tmp = pd.read_csv(tmp_file, delimiter=',')  # read as usual
```
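The same tell/readline/seek pattern can be exercised against a hypothetical in-memory file (`io.StringIO` standing in for `single_file.csv`), which makes it easy to test:

```python
import io
import pandas as pd

headerline = 'col1,col2,col3'
header_names = headerline.split(',')

# a hypothetical headerless CSV, snooped in memory instead of on disk
tmp_file = io.StringIO("1,2,3\n4,5,6\n")
pos = tmp_file.tell()
first_line = tmp_file.readline()
tmp_file.seek(pos)

if first_line.strip() != headerline:  # header line missing
    df_tmp = pd.read_csv(tmp_file, header=None, names=header_names)
else:
    df_tmp = pd.read_csv(tmp_file)

print(list(df_tmp.columns))  # ['col1', 'col2', 'col3']
print(len(df_tmp))           # 2
```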
I had a real case like this, where in a bunch of CSVs the header was sometimes above and sometimes below the data table. I want to believe that the person who created this data set was having a very, very bad day.
To merge the CSVs into one big dataframe
The best way is to use `pd.concat()`:
```python
df_list = []
for f in list_of_files:
    df_tmp = pd.read_csv(f)  # here you can use your customized read function like in the snooping example
    df_list.append(df_tmp)
big_df = pd.concat(df_list).reset_index(drop=True)
```
Remember to `reset_index()` (with `drop=True`, unless you want to keep the old indices as an extra column), as all the dataframes will come with their own index, and duplicate indices may screw up your `loc` operations later on.
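A runnable sketch of the merge, with three hypothetical in-memory CSVs standing in for a list of file paths:

```python
import io
import pandas as pd

# three hypothetical in-memory CSVs standing in for a list of file paths
list_of_files = [io.StringIO("a,b\n{},{}\n".format(i, i + 1)) for i in range(3)]

df_list = []
for f in list_of_files:
    df_list.append(pd.read_csv(f))
big_df = pd.concat(df_list).reset_index(drop=True)

print(len(big_df))             # 3
print(big_df.index.is_unique)  # True -- no duplicate indices left
```

Without the `reset_index`, each of the three rows would carry index 0 and `big_df.loc[0]` would return all of them.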