Topics

Basic Filtering

Label vs position selection: Use .loc for label-based indexing, .iloc for positional indexing.

  • .loc: Index labels or boolean masks
  • .iloc: Integer positions
df.loc['2025-01-01']           # select rows with index label "2025-01-01"
df.loc[df['value'] > 0, 'id']  # filter rows by condition, get 'id'
df.iloc[0:5, [1, 3]]           # first 5 rows, columns at 1 and 3

For single-value access, .at (label-based) and .iat (position-based) are faster than .loc/.iloc.
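
A minimal sketch, reusing the date-indexed frame with a 'value' column from the examples above:

df.at['2025-01-01', 'value']          # scalar read by row label and column label
df.iat[0, 1]                          # scalar read by row position and column position
df.at['2025-01-01', 'value'] = 99.0   # .at / .iat also do fast scalar writes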

Boolean filtering: Boolean masks select rows of a DataFrame. Combine conditions with & (AND) and | (OR), and wrap each condition in parentheses (the bitwise operators bind more tightly than comparisons).

mask = (df['age'] > 30) & (df['income'] < 50_000)
subset = df[mask]
subset = df[df['category'] != 'X']

.isin performs membership checks; .between filters a numeric range:

df[df['country'].isin(['USA', 'Canada'])]
df[df['score'].between(0, 100)]    # numeric range filter (inclusive)

Query strings: Filter with DataFrame.query(expr), which evaluates a pandas expression string against the columns. It is concise and can use the numexpr engine for faster evaluation (when numexpr is installed).

df.query('age > 30 and income < 50000')
min_val = 5
df.query('value >= @min_val or category == "A"')  # use @ for external vars

Tip

query/eval can be faster than chained boolean indexing on large frames because they avoid building large intermediate boolean arrays.

Filtering with eval: df.eval(expr) evaluates an expression string against the DataFrame's columns (referenced by name) and can return a boolean mask for filtering. Vectorized evaluation improves performance for complex expressions.

mask = df.eval('sin(a) > 0')   # columns referenced by name; math functions like sin are supported
df[mask]                       # fast filtering

Multi-index slicing: Use .loc with tuples or pd.IndexSlice to slice a MultiIndexed DataFrame by partial levels (see the sketch after this list).

  • Example: df.loc[('StoreA', slice('2025-01','2025-02')), :] selects data for StoreA in Jan-Feb 2025
  • Use df.xs() (cross-section) to select along a level by label: df.xs('StoreA', level='store')
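
A minimal sketch, assuming a frame indexed by ('store', 'date') levels where the 'date' level is a DatetimeIndex (so partial strings like '2025-01' slice by month); the store labels are illustrative:

import pandas as pd

idx = pd.IndexSlice
df = df.sort_index()                                  # partial-level slicing needs a sorted MultiIndex
df.loc[('StoreA', slice('2025-01', '2025-02')), :]    # tuple + slice on the date level
df.loc[idx[:, '2025-01':'2025-02'], :]                # IndexSlice: all stores, Jan-Feb 2025
df.xs('StoreA', level='store')                        # cross-section; drops the 'store' level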

Select and Assign

Conditional assignment: Use np.where for vectorized if-else logic. It chooses elements from two options based on a boolean condition array:

df['sign'] = np.where(df['balance'] >= 0, 'Positive', 'Negative')

For multiple conditions, use np.select:

conditions = [df['score'] >= 90, df['score'] >= 60]
choices = ['A', 'B']
df['grade'] = np.select(conditions, choices, default='C')

These are much faster than iterating or using apply row by row.
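
For contrast, the same grading logic written as a row-by-row apply (shown only to illustrate the slower pattern; grade is a hypothetical helper):

def grade(row):                        # called once per row in Python
    if row['score'] >= 90:
        return 'A'
    elif row['score'] >= 60:
        return 'B'
    return 'C'

df['grade'] = df.apply(grade, axis=1)  # same result as np.select, but a Python-level loop over rows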

DataFrame.where and mask: df.where(cond, other) keeps original values where cond is True and uses other where False. df.mask(cond, other) does the inverse (replaces where condition is True). This is useful for bulk value setting:

df['adjusted'] = df['value'].where(df['value'] >= 0, 0)  # set negatives to 0
# equivalent using mask:
df['adjusted'] = df['value'].mask(df['value'] < 0, 0)

Tip

You can also fill entire DataFrame cells conditionally (e.g., replace all negative entries with NaN: df.mask(df < 0, np.nan)).
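
A minimal sketch on a small, made-up numeric frame:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, -2, 3], 'b': [-4, 5, -6]})
df.mask(df < 0, np.nan)    # negatives become NaN, everything else is kept
df.where(df >= 0)          # equivalent: keep non-negatives, NaN elsewhere (other defaults to NaN)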