Sometimes it is useful to use dictionaries in pandas workflows. For example, when using a df.replace()
or df.rename()
command. Sometimes it is useful when those dictionaries are generated from another dataframe.
Concrete example could be working with messy country names. Let’s say I have a dataframe containing some values per country:
= pd.DataFrame([['Poland', 'Jan Nowak', 3], ['POL', 'Martyna Kowalska', 15], ['PL', 'Joanna Byk', 19]],
df =['country', 'athlete', 'score']) columns
and another dataframe which contains attribution of a messy country name to a proper country name:
= pd.DataFrame(['Poland', 'PL'], ['POL', 'PL'], ['PL', 'PL'], columns=['messy', 'proper']) countries
Here countries
is generated programmatically, but the benefit of using a countries
dataframe is that it can be maintained in a form of an external csv or xlsx, which contains a hand-curated list of synonyms.
What we want to get is a dictionary of a form: {'messy_name': 'proper_name'}
.
We can achieve it in a following way:
= countries.drop_duplicates().set_index('messy')['proper'].to_dict() names_dict
We need to drop duplicates, otherwise we will only get the first entry (keys are unique in a dictionary). Optimally we would do it at the earlier stage.
To clean the messy names, we can then do:
'country'].replace(names_dict) df[
to get a disambiguated column. We can overwrite the original or create a new column, e.g. country_clean
.