Topics
Reshaping data is common task in data analysis. Pandas offers pivot and pivot_table for this. Both rearrange data based on unique values in specified columns, creating new index, columns, and values.
df.pivot
Reshapes DataFrame based on column values. Requires three columns: one for index, one for columns, one for values.
index: Column to use to make new frame indexcolumns: Column to use to make new frame columnsvalues: Column to use for filling new frame values
Warning
Cannot handle duplicate entries for same index/column combination. If duplicates exist,
pivotwill raise an error.
import pandas as pd
df = pd.DataFrame({
'date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02'],
'city': ['New York', 'Los Angeles', 'New York', 'Los Angeles'],
'temperature': [32, 65, 30, 68]
})
pivot_df = df.pivot(index='date', columns='city', values='temperature')
print(pivot_df)
city Los Angeles New York
date
2023-01-01 65 32
2023-01-02 68 30
df.pivot_table
Useful when combinations of
indexandcolumnsvalues are not unique in original data.pivot_tablewill group rows with same index/column combination and applyaggfunc.
More general function than pivot. Also reshapes data but can handle duplicate entries by aggregating them. Essentially a generalization of pivot that can perform aggregation.
values: Column to aggregateindex: Column or list of columns for indexcolumns: Column or list of columns for columnsaggfunc: Function or list of functions to aggregate values (default ‘mean’). Can be string name (‘sum’, ‘mean’, ‘count’) or function (np.sum)
pivot_tableis essentially a fanciergroupby().agg()followed by reshaping.
import pandas as pd
import numpy as np
df = pd.DataFrame({
'date': ['2023-01-01', '2023-01-01', '2023-01-01', '2023-01-02'],
'city': ['New York', 'New York', 'Los Angeles', 'New York'],
'temperature': [32, 35, 65, 30],
'humidity': [80, 75, 20, 85]
})
# Using pivot_table with duplicate date/city ('2023-01-01', 'New York')
# Default aggfunc='mean'
pivot_table_df = df.pivot_table(index='date', columns='city',
values='temperature')
print(pivot_table_df)
# Using multiple values and different aggfunc
pivot_table_multi = df.pivot_table(index='date', columns='city',
values=['temperature', 'humidity'],
aggfunc={'temperature': np.mean, 'humidity': np.max})
print(pivot_table_multi)city Los Angeles New York
date
2023-01-01 65.0 33.5
2023-01-02 NaN 30.0
humidity temperature
city Los Angeles New York Los Angeles New York
date
2023-01-01 20.0 80.0 65.0 33.5
2023-01-02 NaN 85.0 NaN 30.0