Pandas Cheatsheet
Pandas is a great tool for analyzing, manipulating, and visualizing data. It is built on top of Python.
You can also watch my Youtube series on 5 Days of Pandas.
Importing Data
pd.read_csv(filename)
| From a CSV filepd.read_table(filename)
| From a delimited text file (like TSV)pd.read_excel(filename)
| From an Excel filepd.read_json(json_string)
| Read from a JSON formatted string, URL or file.pd.read_html(url)
| Parses an html URL, string or file and extracts tables to a list of dataframes pd.DataFrame(dict)
| From a dict, keys for columns names, values for data as listspd.read_clipboard()
| Takes the contents of your clipboard and passes it to read_table()
Exporting Data
df.to_csv(filename)
| Write to a CSV filedf.to_excel(filename)
| Write to an Excel filedf.to_json(filename)
| Write to a file in JSON format
Viewing/Inspecting Data
df.head(n)
| First n rows of the DataFramedf.tail(n)
| Last n rows of the DataFramedf.shape
| Number of rows and columnsdf.info()
| Index, Datatype and Memory informationdf.describe()
| Summary statistics for numerical columnss.value_counts(dropna=False)
| View unique values and countsdf.apply(pd.Series.value_counts)
| Unique values and counts for all columns
Selection
df[col]
| Returns column with label col as Seriesdf[[col1, col2]]
| Returns columns as a new DataFramedf.iloc[0]
| Selection by positiondf.loc['index_one']
| Selection by indexdf.loc[1000:1005, 'Salary']
| selecting salaries from certain indicesdf.iloc[0]
| First rowdf.iloc[0,0]
| First element of first column df.at[3, 'Salary']
| select a value at Row 3 column salary
Data Cleaning
df.columns = ['a','b','c']
| Rename columnsdf.isnull()
| Checks for null Values, Returns Boolean Arrraydf.isnull.sum() |
retrurns the number of null valuesdf[df['Salary'].isna()]|
returns the rows containing nulldf.notnull()
| Opposite of pd.isnull()df.dropna()
| Drop all rows that contain null valuesdf.dropna(axis=1)
| Drop all columns that contain null valuesdf.dropna(axis=1,thresh=n)
| Drop all rows have have less than n non null valuesdf.fillna(x)
| Replace all null values with xs.fillna(s.mean())
| Replace all null values with the means.astype(float)
| Convert the datatype of the series to floats.replace(1,'one')
| Replace all values equal to 1 with ‘one’s.replace([1,3],['one','three'])
| Replace all 1 with ‘one’ and 3 with ‘three’df.rename(columns=lambda x: x + 1)
| Mass renaming of columnsdf.rename(columns={'old_name': 'new_ name'})
| Selective renamingdf.set_index('column_one')
| Change the indexdf.rename(index=lambda x: x + 1)
| Mass renaming of index
Filter, Sort, and Groupby
df[df[col] > 0.5]
| Rows where the column col
is greater than 0.5
df[(df[col] > 0.5) & (df[col] < 0.7)]
| Rows where 0.7 > col > 0.5
df.sort_values(col1)
| Sort values by col1 in ascending orderdf.sort_values(col2,ascending=False)
| Sort values by col2
in descending orderdf.sort_values([col1,col2],ascending=[True,False])
| Sort values by col1
in ascending order then col2
in descending orderdf.groupby(col)
| Returns a groupby object for values from one columndf.groupby([col1,col2])
| Returns groupby object for values from multiple columnsdf.groupby(col1)[col2]
| Returns the mean of the values in col2
, grouped by the values in col1
(mean can be replaced with almost any function from the statistics module)df.pivot_table(index=col1,values=[col2,col3],aggfunc=mean)
| Create a pivot table that groups by col1
and calculates the mean of col2
and col3
df.groupby(col1).agg(np.mean)
| Find the average across all columns for every unique col1 groupdf.apply(np.mean)
| Apply the function np.mean()
across each columnnf.apply(np.max,axis=1)
| Apply the function np.max()
across each row
Join/Combine
df1.append(df2)
| Add the rows in df1
to the end of df2
(columns should be identical)pd.concat([df1, df2],axis=1)
| Add the columns in df1
to the end of df2
(rows should be identical)df1.join(df2,on=col1,how='inner')
| SQL-style join the columns in df1
with the columns on df2
where the rows for col
have identical values. 'how'
can be one of 'left'
, 'right'
, 'outer'
, 'inner'
Statistics
df.describe()
| Summary statistics for numerical columnsdf.mean()
| Returns the mean of all columnsdf.corr()
| Returns the correlation between columns in a DataFramedf.count()
| Returns the number of non-null values in each DataFrame columndf.max()
| Returns the highest value in each columndf.min()
| Returns the lowest value in each columndf.median()
| Returns the median of each columndf.std()
| Returns the standard deviation of each column