Functions for dataframes

These functions are designed for use with the DataFrames.jl package. Because DataFrames.jl is a heavy dependency, it is not shipped with BigRiverJunbi.jl. Instead, these functions live in a package extension, which is loaded when you install DataFrames.jl separately and run using DataFrames in the REPL or your code before calling them.

Imputation

BigRiverJunbi.imputeKNN Method

imputeKNN(df::DataFrame; k = 5, threshold = 0.2, start_col = 1, end_col = size(df, 2))

Replaces missing elements based on k-nearest neighbors (KNN) imputation.

Arguments

df: dataframe with missing values.
k: number of nearest neighbors to use for imputation.
threshold: threshold for the number of missing neighbors above which the imputation is skipped.
start_col: column index to start imputing from.
end_col: column index to end imputing at.
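
For intuition, KNN imputation can be sketched on a plain matrix using only Base Julia and the Statistics standard library. This is a minimal illustration, not the package's implementation: the helper names overlap_distance and knn_impute_sketch, the distance measure, and the choice to average the neighbors' values are all assumptions, and the threshold argument is not modeled.

```julia
using Statistics

# Mean squared difference over the columns observed in both rows
# (Inf when the two rows share no observed column).
function overlap_distance(a, b)
    shared = [i for i in eachindex(a) if !ismissing(a[i]) && !ismissing(b[i])]
    isempty(shared) && return Inf
    return mean((a[i] - b[i])^2 for i in shared)
end

# Hypothetical helper: fill each missing cell with the mean of that column
# over the k nearest rows in which the column is observed.
function knn_impute_sketch(X::AbstractMatrix; k::Int = 2)
    out = collect(Union{Missing, Float64}, X)   # fresh copy, X is left untouched
    for i in axes(X, 1), j in axes(X, 2)
        ismissing(X[i, j]) || continue
        donors = [r for r in axes(X, 1) if r != i && !ismissing(X[r, j])]
        isempty(donors) && continue             # no row can donate a value
        sort!(donors; by = r -> overlap_distance(X[i, :], X[r, :]))
        nearest = donors[1:min(k, length(donors))]
        out[i, j] = mean(X[r, j] for r in nearest)
    end
    return out
end

X = [1.0 2.0 missing;
     1.1 2.1 3.0;
     0.9 1.9 2.8;
     9.0 9.0 9.0]

imputed = knn_impute_sketch(X; k = 2)
```

In this toy matrix the missing cell in row 1 is filled from rows 2 and 3, which are close to row 1 in the observed columns, while the distant row 4 is ignored.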

BigRiverJunbi.impute_half_min Method

impute_half_min(df::DataFrame; start_col::Int64 = 1, end_col::Int64 = size(df, 2))

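Replaces missing elements in the specified columns with half of the minimum of the non-missing elements in the corresponding variable. In the example below, each row is treated as one variable, and the halved minimum is kept as an integer (half of 1 becomes 0, half of 3 becomes 1).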

Arguments

df: dataframe with missing values.
start_col: column index to start imputing from.
end_col: column index to end imputing at.

Example

julia> df = DataFrame(A = [1, 2, 3],
                 B = [missing, missing, missing],
                 C = [missing, 4, 5],
                 D = [6, missing, 7],
                 E = [missing, missing, 10])
3×5 DataFrame
 Row │ A      B        C        D        E
     │ Int64  Missing  Int64?   Int64?   Int64?
─────┼───────────────────────────────────────────
   1 │     1  missing  missing        6  missing
   2 │     2  missing        4  missing  missing
   3 │     3  missing        5        7       10

julia> BigRiverJunbi.impute_half_min(df)
3×5 DataFrame
 Row │ A       B       C       D       E      
     │ Int64?  Int64?  Int64?  Int64?  Int64? 
─────┼────────────────────────────────────────
   1 │      1       0       0       6       0
   2 │      2       1       4       1       1
   3 │      3       1       5       7      10
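
The output above is consistent with taking the minimum across each row and halving it with integer division. The sketch below reproduces those numbers on a plain matrix. half_min_by_row is a hypothetical helper, not part of BigRiverJunbi.jl, and the row-wise, integer-division reading is inferred from the example rather than from the source code.

```julia
# Hypothetical helper: fill the missing entries of each row with half of that
# row's smallest observed value, using integer division.
function half_min_by_row(X::AbstractMatrix)
    out = collect(Union{Missing, Int}, X)       # fresh copy of the input
    for i in axes(X, 1)
        observed = collect(skipmissing(X[i, :]))
        isempty(observed) && continue           # leave fully missing rows alone
        fill_value = div(minimum(observed), 2)
        for j in axes(X, 2)
            ismissing(X[i, j]) && (out[i, j] = fill_value)
        end
    end
    return out
end

X = [1 missing missing 6       missing;
     2 missing 4       missing missing;
     3 missing 5       7       10]

filled = half_min_by_row(X)
```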

BigRiverJunbi.impute_median_cat Method

impute_median_cat(df_missing::DataFrame; start_col::Int64 = 1, end_col::Int64 = size(df_missing, 2))

Returns an imputed dataframe based on a categorical imputation:

0: Missing values
1: Values below the median
2: Values equal to or above the median

Arguments

df_missing: dataframe with missing values.
start_col: column index to start imputing from.
end_col: column index to end imputing at.

Example

julia> df = DataFrame(A = [1, 2, 3],
                 B = [missing, missing, missing],
                 C = [missing, 4, 5],
                 D = [6, missing, 7],
                 E = [missing, missing, 10])
3×5 DataFrame
 Row │ A      B        C        D        E
     │ Int64  Missing  Int64?   Int64?   Int64?
─────┼───────────────────────────────────────────
   1 │     1  missing  missing        6  missing
   2 │     2  missing        4  missing  missing
   3 │     3  missing        5        7       10

julia> BigRiverJunbi.impute_median_cat(df)
3×5 DataFrame
 Row │ A       B       C       D       E      
     │ Int64?  Int64?  Int64?  Int64?  Int64? 
─────┼────────────────────────────────────────
   1 │      1       0       0       1       0
   2 │      2       0       1       0       0
   3 │      2       0       2       2       2
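
The coding rule can be sketched on a plain matrix as follows. The example output above matches a median computed per column over the non-missing values, so the sketch assumes that. median_category is a hypothetical helper name, and treating a fully missing column as all zeros is an assumption that happens to agree with column B.

```julia
using Statistics

# Hypothetical helper: code one column as 0 (missing), 1 (below the median)
# or 2 (equal to or above the median of the observed values).
function median_category(col)
    observed = collect(skipmissing(col))
    isempty(observed) && return zeros(Int, length(col))   # nothing observed
    m = median(observed)
    return [ismissing(x) ? 0 : (x < m ? 1 : 2) for x in col]
end

X = [1 missing missing 6       missing;
     2 missing 4       missing missing;
     3 missing 5       7       10]

# Code every column, then glue the coded columns back into a matrix.
categories = reduce(hcat, [median_category(c) for c in eachcol(X)])
```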

BigRiverJunbi.impute_min Method

impute_min(df::DataFrame; start_col::Int64 = 1, end_col::Int64 = size(df, 2))

Replaces missing elements in the specified columns with the minimum of the non-missing elements in the corresponding variable. In the example below, each row is treated as one variable.

Arguments

df: dataframe with missing values.
start_col: column index to start imputing from.
end_col: column index to end imputing at.

Example

julia> df = DataFrame(A = [1, 2, 3],
                 B = [missing, missing, missing],
                 C = [missing, 4, 5],
                 D = [6, missing, 7],
                 E = [missing, missing, 10])
3×5 DataFrame
 Row │ A      B        C        D        E
     │ Int64  Missing  Int64?   Int64?   Int64?
─────┼───────────────────────────────────────────
   1 │     1  missing  missing        6  missing 
   2 │     2  missing        4  missing  missing 
   3 │     3  missing        5        7       10

julia> BigRiverJunbi.impute_min(df)
3×5 DataFrame
 Row │ A       B       C       D       E      
     │ Int64?  Int64?  Int64?  Int64?  Int64? 
─────┼────────────────────────────────────────
   1 │      1       1       1       6       1
   2 │      2       2       4       2       2
   3 │      3       3       5       7      10
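
The same result can be obtained on a plain matrix with Base Julia's coalesce and broadcasting. This is only a sketch: it assumes, as the example output suggests, that the minimum is taken across each row, and it does not use BigRiverJunbi.jl itself.

```julia
X = [1 missing missing 6       missing;
     2 missing 4       missing missing;
     3 missing 5       7       10]

# Smallest observed value in every row.
row_min = [minimum(skipmissing(row)) for row in eachrow(X)]

# Broadcasting a vector against a matrix pairs row_min[i] with row i,
# and coalesce keeps X[i, j] unless it is missing.
filled = coalesce.(X, row_min)
```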

BigRiverJunbi.impute_zero Method

impute_zero(df::DataFrame; start_col::Int64 = 1, end_col::Int64 = size(df, 2))

Replaces missing elements in the specified columns with zero.

Arguments

df: dataframe with missing values.
start_col: column index to start imputing from.
end_col: column index to end imputing at.

Example

julia> df = DataFrame(A = [1, 2, 3],
                 B = [missing, missing, missing],
                 C = [missing, 4, 5],
                 D = [6, missing, 7],
                 E = [missing, missing, 10])
3×5 DataFrame
 Row │ A      B        C        D        E
     │ Int64  Missing  Int64?   Int64?   Int64?
─────┼───────────────────────────────────────────
   1 │     1  missing  missing        6  missing
   2 │     2  missing        4  missing  missing
   3 │     3  missing        5        7       10

julia> BigRiverJunbi.impute_zero(df)
3×5 DataFrame
 Row │ A       B       C       D       E      
     │ Int64?  Int64?  Int64?  Int64?  Int64? 
─────┼────────────────────────────────────────
   1 │      1       0       0       6       0
   2 │      2       0       4       0       0
   3 │      3       0       5       7      10
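
The start_col and end_col arguments restrict the operation to a range of columns. The sketch below shows that idea on a plain matrix. zero_fill_sketch is a hypothetical helper, and the assumption that both bounds are inclusive is inferred from the argument descriptions.

```julia
# Hypothetical helper: replace missing values with zero, but only in the
# columns start_col through end_col (both assumed inclusive).
function zero_fill_sketch(X::AbstractMatrix; start_col::Int = 1, end_col::Int = size(X, 2))
    out = copy(X)
    cols = start_col:end_col
    out[:, cols] = coalesce.(X[:, cols], 0)
    return out
end

X = [1 missing missing 6       missing;
     2 missing 4       missing missing;
     3 missing 5       7       10]

full = zero_fill_sketch(X)                    # every column is filled
partial = zero_fill_sketch(X; start_col = 3)  # columns 1 and 2 are left as they are
```

With start_col = 3, the second column keeps its missing values, which is useful when the leading columns hold identifiers or metadata.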

Normalization

BigRiverJunbi.huberize Method

huberize(df::DataFrame; alpha::Real = 1,
         error_on_zero_mad::Bool = true,
         start_col::Int64 = 1, end_col::Int64 = size(df, 2))

Performs Huberization for sample intensities.

Arguments

df: The dataframe to normalize.
alpha: The alpha parameter for Huberization. Default is 1.
error_on_zero_mad: Whether to throw an error if the MAD is zero. Default is true.
start_col: The column to start normalizing from. Default is 1.
end_col: The column to end normalizing at. Default is the last column.

Warning

If you set error_on_zero_mad to false, this function will return a result with NaN values if the MAD is zero. This can be useful if you are expecting this behavior and want to handle it yourself, but should be used with caution.
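
To see why a zero MAD leads to NaN, consider a generic robust scaling step that divides by the median absolute deviation. The Huberization formula itself is not given in this section, so the sketch below is not the package's algorithm. mad_sketch and robust_scale are hypothetical helpers that only illustrate the division by a zero MAD.

```julia
using Statistics

# Unscaled median absolute deviation about the median.
mad_sketch(x) = median(abs.(x .- median(x)))

# Robust z-score: divides the centered values by the MAD.
robust_scale(x) = (x .- median(x)) ./ mad_sketch(x)

varied = [1.0, 2.0, 3.0, 4.0, 100.0]
mostly_constant = [5.0, 5.0, 5.0, 5.0, 7.0]   # more than half the values are identical

scaled_ok = robust_scale(varied)              # finite values
scaled_bad = robust_scale(mostly_constant)    # 0/0 gives NaN, 2/0 gives Inf
```

Whenever more than half of the values are identical the MAD is zero, so checking for any(isnan, result) after calling with error_on_zero_mad = false is a reasonable precaution.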

BigRiverJunbi.intnorm Method

intnorm(df::DataFrame; lambda::Float64 = 1.0,
        start_col::Int64 = 1, end_col::Int64 = size(df, 2))

Total Area Normalization for each row or column. By default, it normalizes each row. This requires that the matrix has all positive values.

Arguments

df: The dataframe to normalize.
lambda: The lambda parameter for the normalization. Default is 1.
start_col: The column to start normalizing from. Default is 1.
end_col: The column to end normalizing at. Default is the last column.
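
As a rough illustration, total area normalization by row divides every entry by the sum of its row. The sketch below works on a plain matrix. total_area_rows is a hypothetical helper, and the role given to lambda here (the value each row sums to afterwards) is an assumption rather than something this page states.

```julia
# Hypothetical helper: scale every row so that its entries sum to lambda.
total_area_rows(X::AbstractMatrix; lambda::Real = 1.0) = lambda .* X ./ sum(X; dims = 2)

X = [1.0  3.0;
     2.0  2.0;
     10.0 30.0]

normalized = total_area_rows(X)
```

Rows 1 and 3 differ only by a factor of ten, so they become identical after normalization, which is the point of the method.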

BigRiverJunbi.pqnorm Method

pqnorm(df::DataFrame; lambda::Real = 1,
       start_col::Int64 = 1, end_col::Int64 = size(df, 2))

Performs a probabilistic quotient normalization (PQN) for sample intensities. This assumes that the matrix is organized as samples x features and requires that the matrix have all positive values.

Arguments

df: The dataframe to normalize.
lambda: The lambda parameter for the normalization. Default is 1.
start_col: The column to start normalizing from. Default is 1.
end_col: The column to end normalizing at. Default is the last column.
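
The usual textbook form of PQN can be sketched on a samples x features matrix as follows. This is a minimal sketch under stated assumptions, not the package's code: pqn_sketch is a hypothetical helper, the reference profile is taken as the per-feature median, each sample is first scaled to unit total area, and lambda is left out.

```julia
using Statistics

# Hypothetical helper implementing the common PQN recipe on a
# samples x features matrix of positive values.
function pqn_sketch(X::AbstractMatrix)
    scaled = X ./ sum(X; dims = 2)            # total-area normalize each sample
    reference = median(scaled; dims = 1)      # 1 x features reference profile
    quotients = scaled ./ reference           # feature-wise ratio to the reference
    dilution = median(quotients; dims = 2)    # one dilution factor per sample
    return scaled ./ dilution
end

X = [1.0 2.0 4.0;
     2.0 4.0 8.0;    # same profile as sample 1, twice as concentrated
     1.0 3.0 4.0]

normalized = pqn_sketch(X)
```

Samples 1 and 2 differ only by dilution, so they end up with the same normalized profile.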

BigRiverJunbi.quantilenorm Method

quantilenorm(df::DataFrame; start_col::Int64 = 1, end_col::Int64 = size(df, 2))

Performs quantile normalization for sample intensities. This assumes that the matrix is organized as samples x features.

Arguments

df: The dataframe to normalize.
start_col: The column to start normalizing from. Default is 1.
end_col: The column to end normalizing at. Default is the last column.
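
Quantile normalization can be sketched on a samples x features matrix as shown below. The sketch assumes that each sample (row) is mapped onto a common distribution formed by averaging the sorted rows, and it ignores special handling of ties. quantile_normalize_rows is a hypothetical helper, not the package's function.

```julia
using Statistics

# Hypothetical helper: give every row the same set of values (the mean of the
# sorted rows) while preserving the rank order within each row.
function quantile_normalize_rows(X::AbstractMatrix{<:Real})
    sorted_rows = sort(X; dims = 2)                 # sort each sample independently
    rank_means = vec(mean(sorted_rows; dims = 1))   # mean value at each rank
    out = similar(X, Float64)
    for i in axes(X, 1)
        order = sortperm(X[i, :])                   # positions from smallest to largest
        out[i, order] = rank_means                  # smallest entry gets the first mean, etc.
    end
    return out
end

X = [5.0 2.0 3.0;
     4.0 1.0 4.5;
     3.0 4.0 6.0]

normalized = quantile_normalize_rows(X)
```

After normalization every row contains exactly the same values, arranged according to the original ordering within that row.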

BigRiverJunbi.standardize Method

standardize(df::DataFrame; center::Bool = true,
            start_col::Int64 = 1, end_col::Int64 = size(df, 2))

Standardizes a dataframe, i.e. scales it to unit variance, with optional centering.

Arguments

df: The dataframe to standardize.
center: Whether to center the data. Default is true.
start_col: The column to start standardizing from. Default is 1.
end_col: The column to end standardizing at. Default is the last column.
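
The effect of the center option can be sketched on a plain matrix. standardize_columns is a hypothetical helper, and applying the scaling column by column with the sample standard deviation is an assumption, not something this page states.

```julia
using Statistics

# Hypothetical helper: divide each column by its standard deviation,
# subtracting the column mean first when center is true.
function standardize_columns(X::AbstractMatrix; center::Bool = true)
    mu = center ? mean(X; dims = 1) : zeros(1, size(X, 2))
    sigma = std(X; dims = 1)
    return (X .- mu) ./ sigma
end

X = [1.0 10.0;
     2.0 20.0;
     3.0 30.0]

centered = standardize_columns(X)                    # zero mean, unit variance
uncentered = standardize_columns(X; center = false)  # unit variance only
```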

Transformation

BigRiverJunbi.log_tx Method

log_tx(df::DataFrame; base::Real = 2, constant::Real = 0,
       start_col::Int64 = 1, end_col::Int64 = size(df, 2))

Computes the logarithm of the values in a dataframe after adding a constant to all of them (for instance, to avoid log(0)). The default base is 2 and the default constant is 0.

Arguments

df: The dataframe to transform.
base: The base of the logarithm. Default is 2.
constant: The constant to add to all values. Default is 0.
start_col: The column to start transforming from. Default is 1.
end_col: The column to end transforming at. Default is the last column.
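
In plain Julia, the transformation described above amounts to a broadcast call to log(base, x). The sketch below uses a hypothetical helper named log_shift on an ordinary matrix to show why a non-zero constant matters when the data contain zeros.

```julia
# Hypothetical helper: add a constant, then take the logarithm in the given base.
log_shift(X; base::Real = 2, constant::Real = 0) = log.(base, X .+ constant)

X = [0 1  3;
     7 15 31]

shifted = log_shift(X; constant = 1)   # approximately [0 1 2; 3 4 5]
unshifted = log_shift(X)               # the zero entry becomes -Inf
```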

BigRiverJunbi.meancenter_tx Method

meancenter_tx(df::DataFrame; start_col::Int64 = 1, end_col::Int64 = size(df, 2))

Mean centers a dataframe across the specified dimension. This requires that the matrix has all positive values.

Arguments

df: The dataframe to transform.
dims: The dimension to mean center across. Default is 1.
start_col: The column to start transforming from. Default is 1.
end_col: The column to end transforming at. Default is the last column.
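
Mean centering itself is a one-line broadcast in plain Julia. The sketch below uses a hypothetical helper, mean_center, on an ordinary matrix. Its dims keyword follows the description in the argument list above, with 1 meaning that each column has its own mean subtracted.

```julia
using Statistics

# Hypothetical helper: subtract the mean taken along the given dimension.
mean_center(X::AbstractMatrix; dims::Int = 1) = X .- mean(X; dims = dims)

X = [1.0 10.0;
     2.0 20.0;
     3.0 60.0]

centered = mean_center(X)              # each column now has mean zero
row_centered = mean_center(X; dims = 2)  # each row now has mean zero
```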

Utility functions

DataFramesExt.missing_percentages Method

missing_percentages(df::DataFrame)

Returns the percentage of missing values in each column and each row, as well as the total percentage of missing values in the dataframe. The values are expressed as fractions between 0 and 1, as in the example below.

Arguments

df: The dataframe to calculate the missing percentages for.

Returns

pmissing_cols: A Vector of the percentage of missing values in each column.
pmissing_rows: A Vector of the percentage of missing values in each row.
total_missing: The total percentage of missing values in the dataframe.

Example

julia> df = DataFrame(A = [1, 2, 3],
                 B = [missing, missing, missing],
                 C = [missing, 4, 5],
                 D = [6, missing, 7],
                 E = [missing, missing, 10])
3×5 DataFrame
 Row │ A      B        C        D        E
     │ Int64  Missing  Int64?   Int64?   Int64?
─────┼───────────────────────────────────────────
   1 │     1  missing  missing        6  missing
   2 │     2  missing        4  missing  missing
   3 │     3  missing        5        7       10

julia> BigRiverJunbi.missing_percentages(df)
([0.0, 1.0, 0.3333333333333333, 0.3333333333333333, 0.6666666666666666], [0.6, 0.6, 0.2], 0.4666666666666667)
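
The three returned quantities are simple averages of a missingness mask. The sketch below recomputes the numbers from the example on a plain matrix with Base Julia and the Statistics standard library, without calling BigRiverJunbi.jl. The variable names mirror the ones listed under Returns.

```julia
using Statistics

X = [1 missing missing 6       missing;
     2 missing 4       missing missing;
     3 missing 5       7       10]

mask = ismissing.(X)                         # true wherever a value is missing

pmissing_cols = vec(mean(mask; dims = 1))    # fraction missing in each column
pmissing_rows = vec(mean(mask; dims = 2))    # fraction missing in each row
total_missing = mean(mask)                   # fraction missing overall (7 of 15 cells)
```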

DataFramesExt.missing_summary Method

missing_summary(df::DataFrame)

Adds a row and column to the dataframe that contains the percentage of missing values in each column and row. Returns a pretty table with the percentage of missing values in the last row and column highlighted.

Warning

This function will not preserve the type of the dataframe, as it converts everything to a string for the pretty table. It is primarily used for quick visualizations. For getting the actual missing percentages, use the missing_percentages function instead.

Arguments

df: The dataframe to add the missing summary to.

Example

julia> df = DataFrame(A = [1, 2, 3],
                 B = [missing, missing, missing],
                 C = [missing, 4, 5],
                 D = [6, missing, 7],
                 E = [missing, missing, 10])
3×5 DataFrame
 Row │ A      B        C        D        E
     │ Int64  Missing  Int64?   Int64?   Int64?
─────┼───────────────────────────────────────────
   1 │     1  missing  missing        6  missing
   2 │     2  missing        4  missing  missing
   3 │     3  missing        5        7       10

julia> BigRiverJunbi.missing_summary(df)
┌───────────────┬────────┬─────────┬─────────┬─────────┬─────────┬───────────────┐
│               │      A │       B │       C │       D │       E │ pmissing_rows │
│               │ String │  String │  String │  String │  String │        String │
├───────────────┼────────┼─────────┼─────────┼─────────┼─────────┼───────────────┤
│             1 │      1 │ missing │ missing │       6 │ missing │           0.6 │
│             2 │      2 │ missing │       4 │ missing │ missing │           0.6 │
│             3 │      3 │ missing │       5 │       7 │      10 │           0.2 │
├───────────────┼────────┼─────────┼─────────┼─────────┼─────────┼───────────────┤
│ pmissing_cols │    0.0 │     1.0 │    0.33 │    0.33 │    0.67 │          0.47 │
└───────────────┴────────┴─────────┴─────────┴─────────┴─────────┴───────────────┘
