REF: Pandas Cheat Sheet


Creating tables

There are two main objects in pandas: DataFrame (a table) and Series (a sequence of values, such as a single column of a DataFrame). A table has row keys (the index) and column keys (the columns). By default, the index is integer-based. Row indexes don't have to be unique: if several rows share an index value, df.loc[row] returns a table with multiple rows.
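A small sketch of this behaviour (with pandas already imported as pd; the data is made up for illustration):

df = pd.DataFrame({'a': [1, 2, 3]}, index=['x', 'x', 'y'])
df.loc['x']                      # DataFrame with 2 rows (index 'x' repeats)
df.loc['y']                      # Series (a single row matches)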

The key difference from numpy arrays is that pandas DataFrame columns can be of different types (including class instances). A table can be transposed with df.T, so there's no fundamental difference between rows and columns. However, if the columns are of different types, transposing will convert all types into object.
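For example, a table with int and str columns loses its dtypes when transposed:

df = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})
df.dtypes                        # a: int64, b: object
df.T.dtypes                      # both columns become object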

import pandas as pd

df = pd.DataFrame()              # create an empty table
print(df.index, df.columns)      # Index([])  Index([])  (RangeIndex in pandas >= 2.0)

df = pd.DataFrame([1,3,2],       # one column with 3 numbers named A 
                  columns=['A']) # default row indexes [0,1,2]
print(df.index)                  # RangeIndex(start=0,stop=3,step=1)
print(df.columns)                # Index(['A'])

df = pd.DataFrame()              # same as above
df['A'] = [1,3,2]                # create a column in the empty table

df = pd.DataFrame([[1,2],        # table from a 2D array
                   [3,4]])       # first row [1,2], second row [3,4]

The table can also be defined using a dictionary whose keys become the column names. If a key has a scalar value, it is broadcast to every row. When specifying row keys (index), you can mix str, int, Timestamp, etc.:

import numpy as np

cat = pd.Categorical(["a","b","a"])         # categorical (not strings!)

df  = pd.DataFrame(
          {"Num":  1,                       #    | Num Name Age Cat
           "Name": ["Jon", "Mia", "Sam"],   #  --------------------
           "Age" : [43,    np.nan, 56],     #  m | 1   Jon  43   a
           "Cat":  cat},                    #  f | 1   Mia  NaN  b
          index=['m', 'f', 0])              #  0 | 1   Sam  56   a

A list of dictionaries represents a list of rows. Each new key generates a column:
                                            #    |  b   c   a
pd.DataFrame([{        'b': 1, 'c': 2},     #  ----------------
              {'a': 3, 'b': 4, 'c': 5}])    #  0 |  1   2   NaN
                                            #  1 |  3   4   5
df1 = df.copy()                             # deep copy of the table (deep=True is the default)

Working with files

When loading from a CSV file, you can make a specific column (given by name or number) the index column (row names) using the index_col parameter:
df = pd.read_csv("name.csv", index_col="id")
Reading large files in chunks:
chunksize = 10 ** 6
with pd.read_csv(filename, chunksize=chunksize) as reader:
    for chunk in reader:
        process(chunk)
Saving to a CSV file:
df.to_csv("submission.csv", index=False)   # save without index

df.to_csv('probs.zip', index=False,        # save as a zip file 
           compression=dict(method='zip', archive_name='probs.csv') )
If you don't specify index=False, the CSV will include a first unnamed column with the indices.
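Conversely, to keep the index across a save/load round trip (the file name here is arbitrary):

df.to_csv("tmp.csv")                       # index goes into the first column
df = pd.read_csv("tmp.csv", index_col=0)   # read it back as the index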

Summary information

print('shape:', df.shape)       # output shape (rows, cols)
display(df)                     # pretty-print in Jupyter (if df is not the last expression in a cell)

df.head(7)                      # first 7 rows (5 by default)
df.tail(2)                      # last 2 rows (5 by default)
df.sample(2)                    # 2 random rows

df.describe()                   # statistics for all columns
df.col.describe()               # statistics for the column named 'col'
df.describe(include=[bool])     # only boolean columns (also works with float, object, category)

df.col.mean()                   # mean value for the 'col' column
df.col.unique()                 # array of unique values
df.col.value_counts()           # table of values and their frequency in 'col'

df.info()                       # list of columns with their data types and non-null counts
df.duplicated().sum()           # number of duplicate rows
df.isna().sum()                 # count of missing values in each column
Set the float output format for describe and friends:
pd.set_option('display.float_format', lambda x: '%.2f' % x)

Access to elements

The at indexer gives access to single table elements, while loc selects slices by index labels (both rows and columns). The same applies to iat and iloc, except that indexing is done by row and column numbers (integers and slices, as in NumPy). For loc, slices include both boundaries, whereas iloc slices follow the usual Python convention (the upper boundary is excluded).
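A small sketch of the difference (here the integer row labels coincide with positions):

df = pd.DataFrame({'a': [10, 20, 30, 40],
                   'b': [ 1,  2,  3,  4]})
df.loc[1:3]                     # rows 1,2,3 - upper bound included
df.iloc[1:3]                    # rows 1,2   - upper bound excluded, as in NumPy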

The at[rowID, colID] property is faster than loc[rowID, colID] when accessing a single element:
df.at[0,'a'] = 5

The loc method allows you to extract one row, a list of rows, or their slice, as well as a slice of the table (index and columns). In this case, the last index is included in the slice (!):
df.loc[ 0 ]                     # Series    (row 0)
df.loc[ 0 ] = 5                 # Assign value 5 to all elements in row 0
df.loc[ [1,3] ]                 # DataFrame with 2 rows 
df.loc[ :2]                     # DataFrame with 3 rows

df.loc[0,   'a']                # Similar to 'at', but slower
df.loc[1:3, 'a']                # Result is a Series (3 elements)
df.a.loc[1:3]                   # Same as above (works with string indexes as well)

df.loc[1:3,   ['a','b']]        # DataFrame (3 rows, 2 columns)
df.loc[[1,3], ['a','b']]        # DataFrame (2 rows, 2 columns)

To retrieve a column, df['col_name'] returns a Series. The same can be achieved with df.col_name (works only for existing columns whose names are valid identifiers and don't clash with DataFrame attributes). You can also pass a list of column names: df[['a','c']] returns a DataFrame instead of a Series.
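For instance (assuming columns 'a' and 'c' exist):

type(df['a'])                   # Series
type(df[['a','c']])             # DataFrame (note the double brackets)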

df['a'] = 5          # Change all values in column to 5
df.a    = 5          # Same as above (only if column 'a' already exists)
df['a'] = [1,2,3]    # Set column with specific values (must match row count)

Boolean indexing

df.loc[ df.a > 1 ]              # DataFrame where values in column 'a' > 1
df.loc[ (df.a > 1) & (df.a < 5) ]

df.loc[df.c.isin(['aa','bb'])]            # Rows where column 'c' has one of the values
df.loc[df.isin(['aa','bb']).any(axis=1)]  # Rows where any column has one of the values

df.loc[lambda df: df.a == 8]    # Select rows that satisfy the function

df[df > 0]                      # All elements greater than zero (others become NaN)
df[df > 0] = -df                # Make all positive elements negative

Modifying tables

Sorting:

df = df.sort_values(by="A")                   # Sort by column
df = df.sort_values(by=["A", "B"])

df = df.sort_index(axis=1, ascending=False)   # Sort columns
df = df.sort_index(axis=0)                    # Sort row indexes (same type!)

Renaming columns and row indexes:

df = df.rename(columns = {'points': 'score'})
df = df.rename(index   = {0: 'firstEntry', 1: 'secondEntry'})

Adding rows and columns

Adding a row using loc, or by concatenating tables with pd.concat:
df = pd.DataFrame({'a':[1,2], 'b':[-1,-2]}) #   | a b
                                            # ----------
df.loc[len(df.index)] = [3, -3]             # 0 | 1 -1
                                            # 1 | 2 -2
da = pd.DataFrame([{'a':4, 'b':-4}])        # 2 | 3 -3
df = pd.concat([df, da])                    # 3 | 4 -4
df = df.reset_index(drop=True)              # reset row indexes to be unique (!)
Adding a column with the length of the string in the "name" column:
df["len"] = df["name"].map(lambda x: len(x))

Undefined cells

Some cells can be undefined (None or np.nan), which will appear as NaN when displayed. The isna method returns a DataFrame with logical values (its counterpart is notna):
df = pd.DataFrame({'a':[1,2], 'b':[3,None]})   #    | a     b
                                               #  ---------------
                                               #  0 | False False
df.isna()                                      #  1 | False True

You can create a table with undefined cells (for generating large tables, it's more efficient to create undefined rows first and then fill them, rather than adding rows). The values in the cells can be objects (lists, etc.).

df = pd.DataFrame(index=range(2),              #     | a
                  columns=['a'])               #  --------
df.at[0,'a'] = [1,2]                           #   0 | [1,2]
df                                             #   1 | NaN
However, after saving to a CSV file, such cells are stored as their string representations, and you'll need to parse them back when reading the file.
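A minimal round-trip sketch (the file name is just an example); ast.literal_eval turns the stored string "[1, 2]" back into a list, while the isinstance guard skips NaN cells:

import ast

df.to_csv('cells.csv', index=False)
df2 = pd.read_csv('cells.csv')
df2['a'] = df2['a'].map(lambda v: ast.literal_eval(v) if isinstance(v, str) else v)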

df.dropna(how="any")           # drop all rows with missing values
df.fillna(value=5)             # fill missing values 

Calculations in tables

df['c'] = df.a + df.b               # create or modify a column

df.a = df.a.map(lambda x: x - 50)   # modify a column using a function
Aggregation methods available for columns (Series) and whole tables:

count      number of non-NA values
sum        sum
mean       average value
mad        mean absolute deviation
median     median
min        minimum
max        maximum
mode       mode
abs        absolute value
prod       product
std        standard deviation
var        unbiased variance
sem        standard error of the mean
skew       skewness (3rd moment)
kurt       kurtosis (4th moment)
quantile   quantile (%)
cumsum     cumulative sum
cumprod    cumulative product
cummax     cumulative maximum
cummin     cumulative minimum
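A quick illustration of a few of these (NaN values are skipped by default):

df = pd.DataFrame({'x': [1.0, 2.0, None, 4.0]})
df.x.count()                    # 3    (non-NA values only)
df.x.sum()                      # 7.0  (NaN is skipped)
df.x.cumsum()                   # 1.0, 3.0, NaN, 7.0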

Aggregated information

Instead of the standard describe, you can build summary tables manually, passing pandas method names as strings ('sum', 'mean', ...), built-in functions, or your own functions:
def fun(x):
    return sum(x)**0.5

fun.__name__ = 'sqrt'     # assign any name for display purposes

df.agg(['sum', 'mean', 'std', max, fun])

Groups

df = pd.DataFrame({
    'A': ['a','b','a','b','b'],
    'B': ['0','1','0','1','0'],
    'X': [ 1,  2,  3,  4,  5 ],
    'Y': [ 0,  1,  1,  0,  0 ]
} )
Let's group equal values in column A and iterate over all the groups (each group is a DataFrame):
groups = df.groupby("A")          # val=a
                                  #   | A  B  X  Y
for val_A, df_A in groups:        # --------------
    print("val=",val_A)           # 0 | a  0  1  0
    display(df_A)                 # 2 | a  0  3  1
                                  #...
Get the sum of all numeric columns in each group (the resulting index is named: index.name == 'A'; in pandas >= 2.0, pass numeric_only=True to skip the string column B):
                                  #   | X   Y
df.groupby("A").sum()             # A -------
                                  # a | 4   1
                                  # b | 11  1

df.groupby("A").X.sum()           # get Series of sums in groups for column X

First group by A, then by B:

                                  #     | X  Y
df.groupby(["A","B"]).sum()       # A B |-----
                                  # a 0 | 4  1
                                  # b 0 | 5  0
                                  #   1 | 6  1
Now the index is not a single number but a pair of values:
MultiIndex([('a', '0'), ('b', '0'), ('b', '1')], names=['A', 'B'])
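A short sketch of working with this MultiIndex:

g = df.groupby(["A","B"]).sum()
g.loc[('a','0')]                  # one group, selected by the (A, B) pair
g.reset_index()                   # flatten: A and B become ordinary columns again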


Group aggregation

df = pd.DataFrame({'s': [0,0,0,1,1,2],
                   'i': [1,2,1,1,3,2],
                   't': [0,0,1,0,1,0]})
Creating a table where column s is unique and all i values are collected into a set (the second example also renames the columns: "set" → "s1", "min" → "mi"):
df.groupby('s').i.agg([set, min])                  # s  set    min
                                                   # 0  {1,2}    1
df.groupby('s').agg(s1=('i',set), mi=('i',min))    # 1  {1,3}    1
                                                   # 2  {2}      2
After grouping, the table index takes the name of the grouping column: index.name == 's'.
gr0 = df[df.t==0].groupby('s').agg(s1=('i',set))   # s   s1     s2
gr1 = df[df.t==1].groupby('s').agg(s2=('i',set))   # 0   {1,2}  {1}
                                                   # 1   {1}    {3}
pd.concat([gr0, gr1], axis=1)                      # 2   {2}    NaN

Using lookup tables

Linking tables works like a SQL join (df_ref is a reference table whose values are pulled into df by the key i):
df_ref = pd.DataFrame({'i':[0,  1, 3, 4],      #    | i   v    s
                       'v':[10,20,30,40],      #  ----------------
                       's':['a','b','c','d']}) #  0 |  0  10   a
                                               #  1 |  0  10   a
df = pd.DataFrame({"i":[0,0,1,3,2] })          #  2 |  1  20   b
                                               #  3 |  3  30   c
df.merge(df_ref, on='i', how="left")           #  4 |  2  NaN  NaN
The same works if the join columns have different names in the two tables:
df.merge(df_ref, left_on='i1', right_on='i', how="left")

Working with string data

Using the str accessor, you can apply all str-type methods element-wise to a Series:
s = pd.Series(["  Cat", "dOg "])
s.str.lower()                    # "  cat", "dog "
s.str.lower().str.strip()        # "cat", "dog"

Working with time

Time objects and time ranges:
ts = pd.Timestamp('2022-11-13 00:00:00', tz=None)
ts.to_pydatetime()           # datetime.datetime(2022, 11, 13, 0, 0)

rng = pd.date_range('2022-11-13', periods=3, freq='D')
rng.to_pydatetime()          # array of 3 datetime objects
If date-times in a CSV are stored as Unix timestamps (seconds elapsed since 1970-01-01), they can be converted:
df = pd.DataFrame({"date": [1659304800,      # 2022-07-31 22:00:00
                            1659304904]})    # 2022-07-31 22:01:44
df.date = pd.to_datetime(df.date, unit='s')  # D,s,ms,us,ns
t1 = pd.to_datetime(df.ts.min(), unit='s')
t2 = pd.to_datetime(df.ts.max(), unit='s')
print(t1, t2, pd.Timedelta(t2 - t1) )
df.time = pd.to_datetime(df.date, unit='s')   # D,s,ms,us,ns

df['hour']    = df.time.dt.hour               # hour number
df['weekday'] = df.time.dt.weekday            # day of the week number