Pandas for Data Work

The table library every data scientist reaches for. Learn the 20 operations that cover 95% of your data-cleaning life.

Python beginner #python #pandas #dataframes #data-cleaning
Prereqs: Python basics, NumPy basics

What pandas is for

A pandas DataFrame is a spreadsheet you can script. You load it, clean it, filter it, group it, join it, and hand the result to a model. Half of ML engineering is pandas.
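That whole loop can be sketched on a toy frame (the column names here are made up for illustration):

```python
import pandas as pd

# Hypothetical orders data; in practice this comes from pd.read_csv.
df = pd.DataFrame({
    'user_id': [1, 1, 2, 3, 3],
    'country': ['US', 'US', 'DE', 'US', 'DE'],
    'revenue': [10.0, None, 25.0, 5.0, 40.0],
})

clean = df.dropna(subset=['revenue'])               # clean
us = clean[clean['country'] == 'US']                # filter
by_user = us.groupby('user_id')['revenue'].sum()    # group
```

`by_user` is now a Series of per-user US revenue, ready to hand off.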

The operations that matter

Loading

import pandas as pd

df = pd.read_csv('data.csv')
df = pd.read_parquet('data.parquet')   # faster, preserves dtypes
df = pd.read_json('data.jsonl', lines=True)
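All three readers share the same interface. A quick way to see `read_csv` in action without touching disk is to parse an in-memory string (toy data, for illustration):

```python
import io
import pandas as pd

csv_text = "user_id,revenue\n1,10.0\n2,25.0\n"
# StringIO stands in for a file handle; the parser is identical.
df = pd.read_csv(io.StringIO(csv_text))
```

The result is a 2-row, 2-column DataFrame with inferred numeric dtypes.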

Inspecting

df.head()          # first 5 rows
df.shape           # (rows, cols)
df.dtypes          # column types
df.describe()      # numeric summaries
df.isna().sum()    # null count per column
df['col'].value_counts()

Selecting

df['col']                      # one column as Series
df[['a', 'b']]                 # multiple columns
df.loc[df.age > 30]            # rows by condition
df.loc[:, 'a':'c']             # column range
df.iloc[5, 2]                  # positional
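The label-versus-position split is the part that trips people up. On a toy frame (made-up columns):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'age': [25, 40, 35]})

over_30 = df.loc[df.age > 30]   # .loc filters rows by a boolean mask
cell = df.iloc[1, 0]            # .iloc is purely positional: row 1, column 0
cols = df.loc[:, 'a':'b']       # label slices are inclusive on BOTH ends
```

Note that `'a':'b'` includes column `'b'`, unlike ordinary Python slicing.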

Cleaning

df.drop_duplicates(subset=['user_id'])
df.dropna(subset=['target'])
df.fillna({'age': df.age.median()})
df['text'] = df['text'].str.lower().str.strip()
df = df[df['text'].str.len() > 10]
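Chained together on a toy frame (hypothetical data), the cleaning steps above look like:

```python
import pandas as pd

df = pd.DataFrame({
    'user_id': [1, 1, 2],
    'text': ['  Hello World  ', 'hi', 'A longer comment here'],
})

df = df.drop_duplicates(subset=['user_id'])      # keeps the first row per user
df['text'] = df['text'].str.lower().str.strip()  # normalize whitespace/case
df = df[df['text'].str.len() > 10]               # drop too-short rows
```

Both surviving rows pass the length filter; the duplicate `user_id` row is gone.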

Grouping and aggregation

df.groupby('category')['revenue'].sum()
df.groupby(['country', 'month']).agg({'orders': 'count', 'revenue': 'mean'})
df.groupby('user_id').apply(lambda g: g.sort_values('date').tail(10))
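On a toy frame, the first of those collapses to one number per group:

```python
import pandas as pd

df = pd.DataFrame({
    'category': ['a', 'a', 'b'],
    'revenue': [10, 20, 5],
})

# A Series indexed by category: the two 'a' rows sum to 30.
totals = df.groupby('category')['revenue'].sum()
```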

Joining

merged = df.merge(other, on='user_id', how='left')

how options: 'inner' (the default), 'left', 'right', 'outer'. Nine times out of ten you want 'left'.
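The difference is easiest to see on toy data (hypothetical tables) where one key has no match:

```python
import pandas as pd

df = pd.DataFrame({'user_id': [1, 2, 3], 'orders': [5, 2, 7]})
other = pd.DataFrame({'user_id': [1, 2], 'country': ['US', 'DE']})

left = df.merge(other, on='user_id', how='left')    # all 3 rows kept;
                                                    # user 3's country is NaN
inner = df.merge(other, on='user_id', how='inner')  # only the 2 matched rows
```

A left join never loses rows from your main table, which is usually what you want when attaching lookup data.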

Transforming

df['log_price'] = np.log1p(df.price)
df['is_weekend'] = df.date.dt.dayofweek >= 5
df['bucket'] = pd.cut(df.age, bins=[0, 18, 35, 65, 120])
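The same three transforms run end to end on a toy frame (dates and ages made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'price': [9.0, 99.0],
    'date': pd.to_datetime(['2024-01-06', '2024-01-08']),  # a Saturday, a Monday
    'age': [17, 40],
})

df['log_price'] = np.log1p(df.price)
df['is_weekend'] = df.date.dt.dayofweek >= 5   # Mon=0 ... Sat=5, Sun=6
df['bucket'] = pd.cut(df.age, bins=[0, 18, 35, 65, 120])
```

`pd.cut` puts 17 in the `(0, 18]` bucket and 40 in `(35, 65]` (intervals are right-closed by default).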

Reshaping

df.pivot_table(index='user', columns='month', values='revenue', aggfunc='sum')
df.melt(id_vars=['user'], var_name='metric', value_name='value')
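Pivot and melt are near-inverses, which is easiest to see round-tripping a toy frame (made-up data):

```python
import pandas as pd

df = pd.DataFrame({
    'user': ['a', 'a', 'b'],
    'month': ['jan', 'feb', 'jan'],
    'revenue': [10, 20, 5],
})

# Long -> wide: one row per user, one column per month.
wide = df.pivot_table(index='user', columns='month',
                      values='revenue', aggfunc='sum')

# Wide -> long: back to one row per (user, month) pair,
# including a NaN row for the (b, feb) cell that never existed.
long = wide.reset_index().melt(id_vars=['user'],
                               var_name='month', value_name='revenue')
```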

The mistakes that will bite you

  1. Chained assignment — df[df.x > 0]['y'] = 1 assigns into a temporary copy and silently does nothing. Use df.loc[df.x > 0, 'y'] = 1.
  2. apply on rows — often 100× slower than the vectorized equivalent. Use .apply(..., axis=1) only as a last resort.
  3. Ignoring dtypes — storing low-cardinality string columns as object instead of category can waste 10× the memory.
  4. inplace=True — deprecated in spirit. Reassign instead.
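Mistakes 1 and 3 are easy to demonstrate on toy data:

```python
import pandas as pd

df = pd.DataFrame({'x': [-1, 2, 3], 'y': [0, 0, 0]})

# Mistake 1: chained assignment writes into a temporary copy.
# df[df.x > 0]['y'] = 1    # would silently leave df unchanged
df.loc[df.x > 0, 'y'] = 1  # the correct single-indexer form

# Mistake 3: category dtype for a low-cardinality string column.
s_obj = pd.Series(['alpha', 'beta', 'alpha'] * 1000)
s_cat = s_obj.astype('category')
# s_cat stores 2 strings plus small integer codes, so its deep
# memory_usage is a fraction of s_obj's.
```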

When to stop using pandas

  • Data bigger than RAM → Polars (similar API, much faster) or DuckDB (SQL over files).
  • Distributed → Spark.
  • Streaming → Kafka + Flink, not pandas.

Pandas is a fantastic single-machine tool. Know when to graduate.