REF: Pandas Cheat Sheet
Creating tables
There are two main objects in pandas: DataFrame (a table) and Series (a sequence of values, such as a single column of a DataFrame). A table has row keys (the index) and column keys (the columns). By default, the index is integer-based. Row indexes don't have to be unique: if several rows share a label, accessing it with df.loc[row] returns a table with multiple rows.
The key difference from NumPy arrays is that the columns of a pandas DataFrame can have different types (including class instances). A table can be transposed with df.T, so there is no fundamental difference between rows and columns. However, if the columns have different types, transposing converts them all to object.
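Both points can be checked directly; a minimal sketch:

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': list('xyz')},
                  index=[0, 0, 1])  # non-unique row index
df.loc[0]                           # DataFrame with the two rows labelled 0
print(df.dtypes)                    # a: int64, b: object
print(df.T.dtypes)                  # all object: mixed types collapse on transpose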
import pandas as pd

df = pd.DataFrame()              # create an empty table
print(df.index, df.columns)      # Index([]) Index([])

df = pd.DataFrame([1,3,2],       # one column with 3 numbers, named A
                  columns=['A']) # default row indexes [0,1,2]
print(df.index)                  # RangeIndex(start=0, stop=3, step=1)
print(df.columns)                # Index(['A'])

df = pd.DataFrame()              # same as above:
df['A'] = [1,2,3]                # create a column in the empty table

df = pd.DataFrame([[1,2],        # table from a 2D array:
                   [3,4]])       # first row [1,2], second row [3,4]
A table can also be defined using a dictionary (its keys become the column names). If a key has a scalar value, it is broadcast to every row. When specifying row keys (index), you can mix str, int, Timestamp, etc.:
import numpy as np

cat = pd.Categorical(["a","b","a"])  # categorical (not strings!)
df = pd.DataFrame(
    {"Num" : 1,                      #   | Num Name Age  Cat
     "Name": ["Jon", "Mia", "Sam"],  # ----------------------
     "Age" : [43, np.nan, 56],       # m |  1  Jon   43   a
     "Cat" : cat},                   # f |  1  Mia  NaN   b
    index=['m', 'f', 0])             # 0 |  1  Sam   56   a

A list of dictionaries represents a list of rows. Each new key generates a column:
                                          #   | b  c    a
pd.DataFrame([{        'b': 1, 'c': 2},   # ---------------
              {'a': 3, 'b': 4, 'c': 5}])  # 0 | 1  2  NaN
                                          # 1 | 4  5    3
df1 = df.copy()  # deep copy of the table (objects inside cells are still shared)
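A quick check that the copy is independent; a minimal sketch:

df = pd.DataFrame({'a': [1, 2, 3]})
df1 = df.copy()        # copies data and index (deep=True is the default)
df1.loc[0, 'a'] = 99   # modifying the copy...
print(df.loc[0, 'a'])  # 1  ...leaves the original untouched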
Working with files
When loading from a CSV file, you can set a specific column (by name or number) as the index column (row names) using index_col:

df = pd.read_csv("name.csv", index_col="id")

Reading large files in chunks:
chunksize = 10 ** 6
with pd.read_csv(filename, chunksize=chunksize) as reader:
    for chunk in reader:
        process(chunk)
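A self-contained version of the same pattern, where the hypothetical process() just counts rows (the file name "big.csv" is a placeholder):

import pandas as pd

total = 0
with pd.read_csv("big.csv", chunksize=100_000) as reader:
    for chunk in reader:     # each chunk is an ordinary DataFrame
        total += len(chunk)
print("rows:", total)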
df.to_csv("submission.csv", index=False) # save without index df.to_csv('probs.zip', index=False, # save as a zip file compression=dict(method='zip', archive_name='probs.csv') )If you don't specify index=False, the CSV will include a first unnamed column with the indices.
Summary information
print('shape:', df.shape)    # shape (rows, cols)
display(df)                  # print the table (if not the last command in a cell)
df.head(7)                   # first 7 rows (5 by default)
df.tail(2)                   # last 2 rows (5 by default)
df.sample(2)                 # 2 random rows
df.describe()                # statistics for the columns
df.col.describe()            # statistics for the column named 'col'
df.describe(include=[bool])  # only boolean columns (also: float, object, category)
df.col.mean()                # mean value of the 'col' column
df.col.unique()              # array of unique values
df.col.value_counts()        # table of values and their frequency in 'col'
df.info()                    # columns with their data types and non-null counts
df.duplicated().sum()        # number of duplicate rows
df.isna().sum()              # count of missing values in each column

The output format for describe, etc.:
pd.set_option('display.float_format', lambda x: '%.2f' % x)
Access to elements
The at indexer gives access to a single table element, while loc selects slices by index labels (both rows and columns). The same applies to iat and iloc, but indexing is done by row and column numbers (integers and slices, similar to NumPy). For loc, slices include both boundaries, whereas iloc slices follow the usual Python convention (the upper boundary is excluded).
The at[rowID, colID] property is faster than loc[rowID, colID] when accessing a single element:

df.at[0,'a'] = 5

The loc indexer can extract one row, a list of rows, or a slice of rows, as well as a two-dimensional slice over index and columns. Note that the last label is included in the slice (!):
df.loc[ 0 ]               # Series (row 0)
df.loc[ 0 ] = 5           # assign 5 to all elements of row 0
df.loc[ [1,3] ]           # DataFrame with 2 rows
df.loc[ :2 ]              # DataFrame with 3 rows
df.loc[0, 'a']            # same as 'at', but slower
df.loc[1:3, 'a']          # Series (3 elements)
df.a.loc[1:3]             # same as above (works with string indexes as well)
df.loc[1:3, ['a','b']]    # DataFrame (3 rows, 2 columns)
df.loc[[1,3], ['a','b']]  # DataFrame (2 rows, 2 columns)
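For comparison, the positional counterparts; a minimal sketch (note the half-open slices):

df.iloc[0]            # first row as a Series
df.iloc[1:3]          # rows 1 and 2 (the upper boundary is excluded)
df.iloc[0, 1]         # element in row 0, column 1
df.iat[0, 1]          # same, but the fastest single-element access
df.iloc[:, 0:2]       # all rows, first two columns
df.iloc[[0, 2], [1]]  # rows 0 and 2, second column (DataFrame)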
To retrieve a column, df['col_name'] returns a Series. The same can be achieved with df.col_name (only if the column already exists and its name is a valid identifier). You can also pass a list of column names: df[['a','c']] returns a DataFrame instead of a Series.
df['a'] = 5        # change all values in the column to 5
df.a = 5           # same as above (only for an existing column)
df['a'] = [1,2,3]  # set column values (the list length must match the row count)
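The single/double bracket distinction in one place; a minimal sketch:

df = pd.DataFrame({'a': [1,2,3], 'c': [4,5,6]})
type(df['a'])    # Series
type(df[['a']])  # DataFrame with one column
df[['a','c']]    # DataFrame with the selected columns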
Boolean indexing
df.loc[ df.a > 1 ]                          # rows where values in column 'a' > 1
df.loc[ (df.a > 1) & (df.a < 5) ]           # each condition needs parentheses
df.loc[ df.c.isin(['aa','bb']) ]            # rows where column 'c' has one of the values
df.loc[ df.isin(['aa','bb']).any(axis=1) ]  # rows where any column has one of the values
df.loc[ lambda df: df.a == 8 ]              # rows that satisfy the function
df[df > 0]                                  # all elements greater than zero (others become NaN)
df[df > 0] = -df                            # negate all positive elements
Modifying tables
Sorting:
df = df.sort_values(by="A")                  # sort by column
df = df.sort_values(by=["A", "B"])           # sort by two columns
df = df.sort_index(axis=1, ascending=False)  # sort columns by name
df = df.sort_index(axis=0)                   # sort row indexes (must be of the same type!)
Renaming columns and row indexes:
df.rename(columns={'points': 'score'})                # returns a new DataFrame
df.rename(index={0: 'firstEntry', 1: 'secondEntry'})  # (the original is unchanged)
Adding rows and columns
Adding a row using loc or by concatenating tables:

df = pd.DataFrame({'a':[1,2], 'b':[-1,-2]})  #   | a  b
                                             # ----------
df.loc[len(df.index)] = [3, -3]              # 0 | 1 -1
                                             # 1 | 2 -2
da = pd.DataFrame([{'a':4, 'b':-4}])         # 2 | 3 -3
df = pd.concat([df, da])                     # 3 | 4 -4
df = df.reset_index(drop=True)               # make row indexes unique again (!)

Adding a column with the length of the string in the "name" column:
df["len"] = df["name"].map(lambda x: len(x))
Undefined cells
Some cells can be undefined (None or np.nan), which is displayed as NaN. The isna method returns a DataFrame of boolean values (its counterpart is notna):

df = pd.DataFrame({'a':[1,2], 'b':[3,None]})  #   |   a      b
                                              # ----------------
                                              # 0 | False  False
df.isna()                                     # 1 | False  True
You can create a table with undefined cells (when generating a large table, it is more efficient to create the undefined rows first and then fill them than to add rows one by one). Cell values can be arbitrary objects (lists, etc.).
df = pd.DataFrame(index=range(2),  #   |   a
                  columns=['a'])   # --------
df.at[0,'a'] = [1,2]               # 0 | [1,2]
df                                 # 1 | NaN

However, after saving to a file such cells become their string representations, and you'll need to parse them again when reading the file.
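One way to parse such cells back when reading, assuming the lists were written in Python literal form (ast.literal_eval handles that):

import ast
import pandas as pd

df = pd.DataFrame({'a': [[1, 2], [3]]})
df.to_csv("lists.csv", index=False)  # cells are written as the strings "[1, 2]", "[3]"

df2 = pd.read_csv("lists.csv",
                  converters={'a': ast.literal_eval})  # parse back into lists
print(df2.at[0, 'a'], type(df2.at[0, 'a']))            # [1, 2] <class 'list'>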
df.dropna(how="any") # drop all rows with missing values df.fillna(value=5) # fill missing values
Calculations in tables
df['c'] = df.a + df.b              # create or modify a column
df.a = df.a.map(lambda x: x - 50)  # modify a column using a function
Aggregation methods of Series and DataFrame:

method   | meaning
---------|----------------------------
count    | number of non-NA values
sum      | sum
mean     | average value
mad      | mean absolute deviation
median   | median
min      | minimum
max      | maximum
mode     | mode
abs      | absolute value
prod     | product
std      | standard deviation
var      | unbiased variance
sem      | standard error of the mean
skew     | skewness (3rd moment)
kurt     | kurtosis (4th moment)
quantile | quantile (%)
cumsum   | cumulative sum
cumprod  | cumulative product
cummax   | cumulative maximum
cummin   | cumulative minimum
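A few of these applied to one column; a minimal sketch:

import pandas as pd

s = pd.Series([3, 1, 4, 1, 5])
s.sum()          # 14
s.mean()         # 2.8
s.quantile(0.5)  # 3.0 (the median)
s.cumsum()       # 3, 4, 8, 9, 14
s.abs().max()    # 5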
Aggregated information
Instead of the standard describe, you can manually build summary tables for columns, passing either the textual names of pandas functions (like 'sum' for df.sum) or the functions themselves:

def fun(x): return sum(x)**0.5
fun.__name__ = 'sqrt'  # assign any name for display purposes

df.agg(['sum', 'mean', 'std', max, fun])
Groups
df = pd.DataFrame({'A': ['a','b','a','b','b'],
                   'B': ['0','1','0','1','0'],
                   'X': [ 1,  2,  3,  4,  5 ],
                   'Y': [ 0,  1,  1,  0,  0 ]})

Let's group rows with the same value in column A and iterate through all the groups (each group is a DataFrame):
groups = df.groupby("A") # val=a # | A B X Y for val_A, df_A in groups: # -------------- print("val=",val_A) # 0 | a 0 1 0 display(df_A) # 2 | a 0 3 1 #...Get the sum for all numeric columns in each group (the index is named index.name='A'):
# | X Y df.groupby("A").sum() # A ------- # a | 4 1 # b | 11 1 df.groupby("A").X.sum() # get Series of sums in groups for column X
First group by A, then by B:
# | X Y df.groupby(["A","B"]).sum() # A B |----- # a 0 | 4 1 # b 0 | 5 0 # 1 6 | 1Now the index is not a single number but a pair of values:
MultiIndex([('a', '0'), ('b', '0'), ('b', '1')], names=['A', 'B'])
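Rows of such a table are selected by tuples (a sketch, using the grouped table from above):

gr = df.groupby(["A","B"]).sum()
gr.loc[('a','0')]      # Series: the row with the index pair ('a','0')
gr.loc['b']            # DataFrame: all rows with A == 'b'
gr.xs('0', level='B')  # cross-section: all rows with B == '0'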
Group aggregation
df = pd.DataFrame({'s': [0,0,0,1,1,2],
                   'i': [1,2,1,1,3,2],
                   't': [0,0,1,0,1,0]})

Creating a table where column s is unique and all i values are collected into a set (the second example also renames the result columns: "set" → "s1", "min" → "mi"):
df.groupby('s').i.agg([set, min])                # s | set    min
                                                 # 0 | {1,2}   1
df.groupby('s').agg(s1=('i',set), mi=('i',min))  # 1 | {1,3}   1
                                                 # 2 | {2}     2

After grouping, the table index takes the name index.name == 's'.
gr0 = df[df.t==0].groupby('s').agg(s1=('i',set))  # s | s1     s2
gr1 = df[df.t==1].groupby('s').agg(s2=('i',set))  # 0 | {1,2}  {1}
                                                  # 1 | {1}    {3}
pd.concat([gr0, gr1], axis=1)                     # 2 | {2}    NaN
Using lookup tables
Linking tables is similar to SQL joins (df_ref is a reference table that contributes its values to df based on the key i):

df_ref = pd.DataFrame({'i':[0, 1, 3, 4],        #   | i   v    s
                       'v':[10,20,30,40],       # -----------------
                       's':['a','b','c','d']})  # 0 | 0   10   a
                                                # 1 | 0   10   a
df = pd.DataFrame({"i":[0,0,1,3,2]})            # 2 | 1   20   b
                                                # 3 | 3   30   c
df.merge(df_ref, on='i', how="left")            # 4 | 2  NaN  NaN

The same works if the key columns have different names:
df.merge(df_ref, left_on='i1', right_on='i', how="left")
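A self-contained version, assuming the key in df is named i1 (df_ref is the reference table from above):

df = pd.DataFrame({'i1': [0, 0, 1, 3, 2]})
df.merge(df_ref, left_on='i1', right_on='i', how="left")  # keeps both i1 and i
# drop the redundant key column if it is not needed:
df.merge(df_ref, left_on='i1', right_on='i', how="left").drop(columns='i')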
Working with string data
Using the str accessor, you can apply all string methods elementwise to a Series:

s.str.lower()
s.str.lower().str.strip()
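A few more common str methods; a minimal sketch:

s = pd.Series([' Jon ', 'Mia', 'SAM'])
s.str.strip()            # remove surrounding whitespace
s.str.contains('a')      # boolean mask (case-sensitive): False, True, False
s.str.replace('M', 'N')  # elementwise replacement
s.str.len()              # string lengths: 5, 3, 3
s.str.split('a')         # list of parts per element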
Working with time
Time objects and time ranges:

ts = pd.Timestamp('2022-11-13 00:00:00', tz=None)
ts.to_pydatetime()   # datetime.datetime(2022, 11, 13, 0, 0)

rng = pd.date_range('2022-11-13', periods=3, freq='D')
rng.to_pydatetime()  # array of 3 datetime objects

If the date-time in a CSV file is saved as a Unix timestamp (seconds since 1970-01-01), it can be converted:
df = pd.DataFrame({"date": [1659304800, # 2022-07-31 22:00:00 1659304904]}) # 2022-07-31 22:01:44 df.date = pd.to_datetime(df.date, unit='s') # D,s,ms,us,ns
t1 = pd.to_datetime(df.ts.min(), unit='s')
t2 = pd.to_datetime(df.ts.max(), unit='s')
print(t1, t2, pd.Timedelta(t2 - t1))
df['time'] = pd.to_datetime(df.date, unit='s')  # df.time = ... would not create a new column
df['hour'] = df.time.dt.hour                    # hour number
df['weekday'] = df.time.dt.weekday              # day of the week number (Monday == 0)
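The dt accessor exposes more components, and datetime columns work directly in boolean filters; a minimal sketch:

df['month'] = df.time.dt.month       # month number (1..12)
df['date2'] = df.time.dt.date        # calendar date (datetime.date objects)
evening = df[df.time.dt.hour >= 18]  # rows with timestamps from 18:00 onward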
