Pandas for Data Work
The table library every data scientist reaches for. Learn the 20 operations that cover 95% of your data-cleaning life.
Python beginner
#python
#pandas
#dataframes
#data-cleaning
Prereqs: Python basics, NumPy basics
What pandas is for
A pandas DataFrame is a spreadsheet you can script. You load it, clean it, filter it, group it, join it, and hand the result to a model. Half of ML engineering is pandas.
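The whole loop fits in a few lines. A minimal sketch of load → clean → filter → group, with an in-memory frame standing in for a real CSV (the column names are invented for illustration):

```python
import pandas as pd

# In a real pipeline this frame would come from pd.read_csv(...)
df = pd.DataFrame({
    "user": ["a", "a", "b", "b", "b"],
    "revenue": [10.0, None, 5.0, 7.0, 3.0],
})

df = df.dropna(subset=["revenue"])              # clean: drop rows with no revenue
df = df[df["revenue"] > 4]                      # filter: keep meaningful rows
per_user = df.groupby("user")["revenue"].sum()  # group: one number per user
```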
The operations that matter
Loading
import pandas as pd

df = pd.read_csv('data.csv')
df = pd.read_parquet('data.parquet') # faster, preserves dtypes
df = pd.read_json('data.jsonl', lines=True)
Inspecting
df.head() # first 5 rows
df.shape # (rows, cols)
df.dtypes # column types
df.describe() # numeric summaries
df.isna().sum() # null count per column
df['col'].value_counts()
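On a toy frame (invented data), a few of these inspection calls look like this; note that `value_counts()` excludes nulls by default:

```python
import pandas as pd

df = pd.DataFrame({"col": ["x", "x", None], "n": [1, 2, 3]})

shape = df.shape                   # (rows, cols) -> (3, 2)
nulls = df.isna().sum()            # one null count per column
counts = df["col"].value_counts()  # NaN excluded by default
```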
Selecting
df['col'] # one column as Series
df[['a', 'b']] # multiple columns
df.loc[df.age > 30] # rows by condition
df.loc[:, 'a':'c'] # column range
df.iloc[5, 2] # positional
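A quick sketch of how these selectors differ, on a toy frame with invented columns. One surprise worth internalizing: `.loc` label slices include both endpoints, unlike Python slicing:

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [4, 5, 6],
    "c": [7, 8, 9],
    "age": [25, 40, 31],
})

over_30 = df.loc[df.age > 30]  # boolean mask: rows where age > 30
cols = df.loc[:, "a":"c"]      # label slice, INCLUSIVE: a, b, and c
cell = df.iloc[1, 2]           # purely positional: row 1, column 2
```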
Cleaning
df.drop_duplicates(subset=['user_id'])
df.dropna(subset=['target'])
df.fillna({'age': df.age.median()})
df['text'] = df['text'].str.lower().str.strip()
df = df[df['text'].str.len() > 10]
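The cleaning lines above compose into a pipeline. A sketch on invented data; order matters, since each step changes what the next one sees:

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "target": [0.5, 0.5, None, 0.9],
    "age": [30.0, 30.0, None, 50.0],
    "text": ["  Hello World, Pandas  ", "  Hello World, Pandas  ",
             "hi", "a longer piece of text"],
})

df = df.drop_duplicates(subset=["user_id"])      # one row per user (keeps first)
df = df.dropna(subset=["target"])                # the label is required
df = df.fillna({"age": df.age.median()})         # impute remaining ages
df["text"] = df["text"].str.lower().str.strip()  # normalize text
df = df[df["text"].str.len() > 10]               # drop too-short text
```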
Grouping and aggregation
df.groupby('category')['revenue'].sum()
df.groupby(['country', 'month']).agg({'orders': 'count', 'revenue': 'mean'})
df.groupby('user_id').apply(lambda g: g.sort_values('date').tail(10))
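On a toy frame (invented data), the first two patterns above produce a Series and a two-column summary respectively:

```python
import pandas as pd

df = pd.DataFrame({
    "category": ["a", "a", "b"],
    "country": ["US", "US", "DE"],
    "revenue": [100, 50, 30],
    "orders": [7, 8, 9],
})

total = df.groupby("category")["revenue"].sum()  # Series indexed by category
summary = df.groupby("country").agg({"orders": "count", "revenue": "mean"})
```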
Joining
merged = df.merge(other, on='user_id', how='left')
`how` options: `'inner'` (the default), `'left'`, `'right'`, `'outer'`. Nine times out of ten you want `'left'`.
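The difference shows up as soon as a key is missing on one side. A sketch with invented frames: `'left'` keeps every user and fills unmatched columns with NaN, while `'inner'` silently drops the user with no orders:

```python
import pandas as pd

users = pd.DataFrame({"user_id": [1, 2, 3], "name": ["ann", "bob", "cat"]})
orders = pd.DataFrame({"user_id": [1, 1, 3], "amount": [10, 20, 5]})

left = users.merge(orders, on="user_id", how="left")    # bob kept, amount NaN
inner = users.merge(orders, on="user_id", how="inner")  # bob dropped
```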
Transforming
df['log_price'] = np.log1p(df.price)
df['is_weekend'] = df.date.dt.dayofweek >= 5
df['bucket'] = pd.cut(df.age, bins=[0, 18, 35, 65, 120])
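A runnable sketch of the three transforms (dates and values invented; 2024-06-01 happened to be a Saturday):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "price": [0.0, 99.0],
    "date": pd.to_datetime(["2024-06-01", "2024-06-03"]),  # Sat, Mon
    "age": [17, 40],
})

df["log_price"] = np.log1p(df.price)          # log(1 + x), safe at price = 0
df["is_weekend"] = df.date.dt.dayofweek >= 5  # Mon = 0 ... Sun = 6
df["bucket"] = pd.cut(df.age, bins=[0, 18, 35, 65, 120])
```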
Reshaping
df.pivot_table(index='user', columns='month', values='revenue', aggfunc='sum')
df.melt(id_vars=['user'], var_name='metric', value_name='value')
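Roughly speaking, `pivot_table` and `melt` are inverses: one goes long → wide, the other wide → long. A sketch on invented data:

```python
import pandas as pd

df = pd.DataFrame({
    "user": ["a", "a", "b"],
    "month": ["jan", "feb", "jan"],
    "revenue": [10, 20, 5],
})

wide = df.pivot_table(index="user", columns="month",
                      values="revenue", aggfunc="sum")  # one row per user
long = wide.reset_index().melt(id_vars=["user"],
                               var_name="month", value_name="revenue")
```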
The mistakes that will bite you
- Chained assignment — `df[df.x > 0]['y'] = 1` silently does nothing. Use `df.loc[df.x > 0, 'y'] = 1`.
- `apply` on rows — 100× slower than vectorized operations. Use `.apply(..., axis=1)` only as a last resort.
- Ignoring dtypes — loading strings as `object` instead of `category` wastes 10× memory on string columns.
- `inplace=True` — deprecated in spirit. Reassign instead.
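The chained-assignment trap is worth seeing once. A sketch: the chained form writes into a throwaway copy so `df` is untouched, while the single `.loc` step is the one that sticks (under copy-on-write, the pandas 3.0 default, the chained form is guaranteed to do nothing):

```python
import pandas as pd

df = pd.DataFrame({"x": [-1, 2], "y": [0, 0]})

df[df.x > 0]["y"] = 1      # modifies a temporary copy; df is unchanged

df.loc[df.x > 0, "y"] = 1  # one indexing operation: this one works
```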
When to stop using pandas
- Data bigger than RAM → Polars (similar API, much faster) or DuckDB (SQL over files).
- Distributed → Spark.
- Streaming → Kafka + Flink, not pandas.
Pandas is a fantastic single-machine tool. Know when to graduate.