
Functions for dataframes

These functions are designed to be used with the DataFrames.jl package. They are part of a package extension that is loaded once DataFrames is loaded, for example by running using DataFrames in the REPL or in your code.

BigRiverJunbi.huberize Method
julia
huberize(df::DataFrame; alpha::Real = 1,
         error_on_zero_mad::Bool = true,
         start_col::Int64 = 1, end_col::Int64 = size(df, 2))

Performs Huberization for sample intensities.

Arguments

  • df: The dataframe to normalize.

  • alpha: The alpha parameter for Huberization. Default is 1.

  • error_on_zero_mad: Whether to throw an error if the MAD is zero. Default is true.

  • start_col: The column to start normalizing from. Default is 1.

  • end_col: The column to end normalizing at. Default is the last column.

Warning

If you set error_on_zero_mad to false, this function will return a result with NaN values if the MAD is zero. This can be useful if you are expecting this behavior and want to handle it yourself, but should be used with caution.
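
As a rough illustration of the zero-MAD case, the sketch below works on a plain vector. It is not BigRiverJunbi's implementation: huberize_sketch is a hypothetical helper that assumes values are turned into robust z-scores (median and unscaled MAD), clamped to [-alpha, alpha], and mapped back. Under that assumption a zero MAD produces 0/0, hence NaN, which is the situation the warning describes.

```julia
using Statistics

# Hypothetical helper, not part of BigRiverJunbi.
# Assumption: Huberization is approximated by clamping robust z-scores.
function huberize_sketch(x::AbstractVector{<:Real}; alpha::Real = 1,
                         error_on_zero_mad::Bool = true)
    med = median(x)
    mad = median(abs.(x .- med))      # unscaled median absolute deviation
    if mad == 0 && error_on_zero_mad
        throw(ArgumentError("MAD is zero; cannot rescale"))
    end
    z = (x .- med) ./ mad             # 0/0 yields NaN when mad == 0
    return med .+ mad .* clamp.(z, -alpha, alpha)
end

# The outlier 100.0 is pulled in to one MAD above the median.
clipped = huberize_sketch([1.0, 2.0, 3.0, 4.0, 100.0])

# More than half of the values are identical, so the MAD is zero.
with_nan = huberize_sketch([5.0, 5.0, 5.0, 7.0]; error_on_zero_mad = false)
```

With error_on_zero_mad left at true, the second call would throw instead of returning NaN entries.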

BigRiverJunbi.imputeKNN Method
julia
imputeKNN(df::DataFrame; k = 5, threshold = 0.2, start_col = 1, end_col = size(df, 2))

Replaces missing elements based on k-nearest neighbors (KNN) imputation.

Arguments

  • df: dataframe with missing values.

  • k: number of nearest neighbors to use for imputation.

  • threshold: threshold for the number of missing neighbors above which the imputation is skipped.

  • start_col: column index to start imputing from.

  • end_col: column index to end imputing at.

BigRiverJunbi.impute_half_min Method
julia
impute_half_min(df::DataFrame; start_col::Int64 = 1, end_col::Int64 = size(df, 2))

Replaces missing elements in the specified columns with half of the minimum of non-missing elements in the corresponding variable.

Arguments

  • df: dataframe with missing values.

  • start_col: column index to start imputing from.

  • end_col: column index to end imputing at.

Examples

julia
julia> df = DataFrame(A = [1, 2, 3],
                 B = [missing, missing, missing],
                 C = [missing, 4, 5],
                 D = [6, missing, 7],
                 E = [missing, missing, 10])
3×5 DataFrame
 Row │ A      B        C        D        E
     │ Int64  Missing  Int64?   Int64?   Int64?
─────┼───────────────────────────────────────────
   1 │     1  missing  missing        6  missing
   2 │     2  missing        4  missing  missing
   3 │     3  missing        5        7       10

julia> BigRiverJunbi.impute_half_min(df)
3×5 DataFrame
 Row │ A       B       C       D       E      
     │ Int64?  Int64?  Int64?  Int64?  Int64? 
─────┼────────────────────────────────────────
   1 │      1       0       0       6       0
   2 │      2       1       4       1       1
   3 │      3       1       5       7      10
BigRiverJunbi.impute_median_cat Method
julia
impute_median_cat(df_missing::DataFrame; start_col::Int64 = 1, end_col::Int64 = size(df_missing, 2))

Returns an imputed dataframe based on a categorical imputation:

  • 0: Missing values

  • 1: Values below the median

  • 2: Values equal to or above the median

Arguments

  • df_missing: dataframe with missing values.

  • start_col: column index to start imputing from.

  • end_col: column index to end imputing at.

Examples

julia
julia> df = DataFrame(A = [1, 2, 3],
                 B = [missing, missing, missing],
                 C = [missing, 4, 5],
                 D = [6, missing, 7],
                 E = [missing, missing, 10])
3×5 DataFrame
 Row │ A      B        C        D        E
     │ Int64  Missing  Int64?   Int64?   Int64?
─────┼───────────────────────────────────────────
   1 │     1  missing  missing        6  missing
   2 │     2  missing        4  missing  missing
   3 │     3  missing        5        7       10

julia> BigRiverJunbi.impute_median_cat(df)
3×5 DataFrame
 Row │ A       B       C       D       E      
     │ Int64?  Int64?  Int64?  Int64?  Int64? 
─────┼────────────────────────────────────────
   1 │      1       0       0       1       0
   2 │      2       0       1       0       0
   3 │      2       0       2       2       2
BigRiverJunbi.impute_min Method
julia
impute_min(df::DataFrame; start_col::Int64 = 1, end_col::Int64 = size(df, 2))

Replaces missing elements in the specified columns with the minimum of non-missing elements in the corresponding variable.

Arguments

  • df: dataframe with missing values.

  • start_col: column index to start imputing from.

  • end_col: column index to end imputing at.

Examples

julia
julia> df = DataFrame(A = [1, 2, 3],
                 B = [missing, missing, missing],
                 C = [missing, 4, 5],
                 D = [6, missing, 7],
                 E = [missing, missing, 10])
3×5 DataFrame
 Row │ A      B        C        D        E
     │ Int64  Missing  Int64?   Int64?   Int64?
─────┼───────────────────────────────────────────
   1 │     1  missing  missing        6  missing
   2 │     2  missing        4  missing  missing
   3 │     3  missing        5        7       10

julia> BigRiverJunbi.impute_min(df)
3×5 DataFrame
 Row │ A       B       C       D       E      
     │ Int64?  Int64?  Int64?  Int64?  Int64? 
─────┼────────────────────────────────────────
   1 │      1       1       1       6       1
   2 │      2       2       4       2       2
   3 │      3       3       5       7      10
BigRiverJunbi.impute_zero Method
julia
impute_zero(df::DataFrame; start_col::Int64 = 1, end_col::Int64 = size(df, 2))

Replaces missing elements in the specified columns with zero.

Arguments

  • df: dataframe with missing values.

  • start_col: column index to start imputing from.

  • end_col: column index to end imputing at.

Examples

julia
julia> df = DataFrame(A = [1, 2, 3],
                 B = [missing, missing, missing],
                 C = [missing, 4, 5],
                 D = [6, missing, 7],
                 E = [missing, missing, 10])
3×5 DataFrame
 Row │ A      B        C        D        E
     │ Int64  Missing  Int64?   Int64?   Int64?
─────┼───────────────────────────────────────────
   1 │     1  missing  missing        6  missing
   2 │     2  missing        4  missing  missing
   3 │     3  missing        5        7       10

julia> BigRiverJunbi.impute_zero(df)
3×5 DataFrame
 Row │ A       B       C       D       E      
     │ Int64?  Int64?  Int64?  Int64?  Int64? 
─────┼────────────────────────────────────────
   1 │      1       0       0       6       0
   2 │      2       0       4       0       0
   3 │      3       0       5       7      10
BigRiverJunbi.intnorm Method
julia
intnorm(df::DataFrame; lambda::Float64 = 1.0,
        start_col::Int64 = 1, end_col::Int64 = size(df, 2))

Total Area Normalization for each row or column. By default, it normalizes each row. This requires that the matrix has all positive values.

Arguments

  • df: The dataframe to normalize.

  • lambda: The lambda parameter for the normalization. Default is 1.

  • start_col: The column to start normalizing from. Default is 1.

  • end_col: The column to end normalizing at. Default is the last column.
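
To make the row-wise default concrete, here is a minimal sketch on a plain matrix. It is not the package's code: intnorm_sketch is a hypothetical helper, and it assumes that lambda is the total each row is rescaled to.

```julia
# Hypothetical helper, not part of BigRiverJunbi.
# Assumption: each row is divided by its sum and multiplied by lambda.
function intnorm_sketch(X::AbstractMatrix{<:Real}; lambda::Real = 1.0)
    return lambda .* X ./ sum(X; dims = 2)
end

X_area = [1.0 3.0; 2.0 2.0; 10.0 30.0]
N_area = intnorm_sketch(X_area)

# Rows 1 and 3 differ only by a factor of 10, so they end up identical.
row_totals = vec(sum(N_area; dims = 2))
```

Because each row is divided by its own sum, a row that sums to zero would produce NaN or Inf entries, which is one reason the input is expected to be positive.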

BigRiverJunbi.log_tx Method
julia
log_tx(df::DataFrame; base::Real = 2, constant::Real = 0,
       start_col::Int64 = 1, end_col::Int64 = size(df, 2))

Computes the logarithm of each value in a dataframe after adding a constant to all values (for instance, to avoid log(0)). Default base is 2, default constant is 0.

Arguments

  • df: The dataframe to transform.

  • base: The base of the logarithm. Default is 2.

  • constant: The constant to add to all values. Default is 0.

  • start_col: The column to start transforming from. Default is 1.

  • end_col: The column to end transforming at. Default is the last column.
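
The effect of the constant is easiest to see on a small matrix. The following is a sketch using only Base functions; log_tx_sketch is a hypothetical name and the code is an assumption about the transform, not the package's implementation.

```julia
# Hypothetical helper, not part of BigRiverJunbi.
# Adds `constant` to every value, then takes the logarithm in the given base.
log_tx_sketch(X; base = 2, constant = 0) = log.(base, X .+ constant)

X_log = [0.0 1.0; 3.0 7.0]

# With the default constant of 0, the zero entry maps to -Inf.
raw = log_tx_sketch(X_log)

# Shifting by 1 keeps every result finite: the inputs become 1, 2, 4, 8.
shifted = log_tx_sketch(X_log; constant = 1)
```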

BigRiverJunbi.meancenter_tx Method
julia
meancenter_tx(df::DataFrame; start_col::Int64 = 1, end_col::Int64 = size(df, 2))

Mean centers a dataframe across the specified dimension. This requires that the matrix has all positive values.

Arguments

  • df: The dataframe to transform.

  • dims: The dimension to mean center across. Default is 1.

  • start_col: The column to start transforming from. Default is 1.

  • end_col: The column to end transforming at. Default is the last column.
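
As a sketch of what mean centering does, assuming dims = 1 means that each column has its own mean subtracted (meancenter_sketch is a hypothetical helper working on a plain matrix, not the package's function):

```julia
using Statistics

# Hypothetical helper, not part of BigRiverJunbi.
# Subtracts the mean taken along `dims`; dims = 1 centers each column.
meancenter_sketch(X::AbstractMatrix{<:Real}; dims::Int = 1) = X .- mean(X; dims = dims)

X_center = [1.0 10.0; 2.0 20.0; 3.0 30.0]
centered = meancenter_sketch(X_center)

# After centering, every column mean is zero.
col_means = vec(mean(centered; dims = 1))
```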

BigRiverJunbi.pqnorm Method
julia
pqnorm(df::DataFrame; lambda::Real = 1,
       start_col::Int64 = 1, end_col::Int64 = size(df, 2))

Performs a probabilistic quotient normalization (PQN) for sample intensities. This assumes that the matrix is organized as samples x features and requires that the matrix have all positive values.

Arguments

  • df: The dataframe to normalize.

  • lambda: The lambda parameter for the normalization. Default is 1.

  • start_col: The column to start normalizing from. Default is 1.

  • end_col: The column to end normalizing at. Default is the last column.
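
The usual textbook form of PQN can be sketched in a few lines on a samples x features matrix. This is an assumption about the general technique, not the package's code: pqn_sketch is a hypothetical helper, it uses the feature-wise median as the reference profile, and it leaves out the lambda parameter.

```julia
using Statistics

# Hypothetical helper, not part of BigRiverJunbi.
function pqn_sketch(X::AbstractMatrix{<:Real})
    N = X ./ sum(X; dims = 2)               # 1. total-area normalize each sample (row)
    ref = median(N; dims = 1)               # 2. reference profile: median of each feature
    quotients = N ./ ref                    # 3. quotient of each value to the reference
    factors = median(quotients; dims = 2)   # 4. one dilution factor per sample
    return N ./ factors
end

# Sample 2 is sample 1 at double concentration; sample 3 has one elevated feature.
X_pqn = [1.0 2.0 3.0; 2.0 4.0 6.0; 1.0 2.0 9.0]
P = pqn_sketch(X_pqn)
```

In this toy input, samples 1 and 2 come out identical, and the two unchanged features of sample 3 line up with sample 1 again, while its elevated feature stays elevated. Total-area normalization alone would have pushed those two features down.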

BigRiverJunbi.quantilenorm Method
julia
quantilenorm(df::DataFrame; start_col::Int64 = 1, end_col::Int64 = size(df, 2))

Performs quantile normalization for sample intensities. This assumes that the matrix is organized as samples x features.

Arguments

  • df: The dataframe to normalize.

  • start_col: The column to start normalizing from. Default is 1.

  • end_col: The column to end normalizing at. Default is the last column.
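
A minimal sketch of classic quantile normalization on a samples x features matrix follows. It is an illustration under stated assumptions rather than the package's implementation: quantilenorm_sketch is a hypothetical helper, each row is treated as one sample, and ties are not handled specially.

```julia
using Statistics

# Hypothetical helper, not part of BigRiverJunbi.
function quantilenorm_sketch(X::AbstractMatrix{<:Real})
    sorted = sort(X; dims = 2)                  # sort the values within each sample (row)
    rank_means = vec(mean(sorted; dims = 1))    # mean of the k-th smallest values across samples
    out = Matrix{Float64}(undef, size(X))
    for i in axes(X, 1)
        order = sortperm(X[i, :])               # positions of the row's values in ascending order
        out[i, order] = rank_means              # the k-th smallest value receives the k-th rank mean
    end
    return out
end

X_q = [1.0 5.0 3.0; 4.0 2.0 6.0]
Q = quantilenorm_sketch(X_q)
```

After the transform every sample contains the same set of values (1.5, 3.5, 5.5 here), arranged according to the original ordering within that sample.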

BigRiverJunbi.standardize Method
julia
standardize(df::DataFrame; center::Bool = true,
            start_col::Int64 = 1, end_col::Int64 = size(df, 2))

Standardizes a dataframe, i.e. scales it to unit variance, with optional centering.

Arguments

  • df: The dataframe to standardize.

  • center: Whether to center the data. Default is true.

  • start_col: The column to start standardizing from. Default is 1.

  • end_col: The column to end standardizing at. Default is the last column.
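
The difference between center = true and center = false can be sketched as follows. The helper standardize_sketch is hypothetical and assumes column-wise scaling by the sample standard deviation; the package may differ in details.

```julia
using Statistics

# Hypothetical helper, not part of BigRiverJunbi.
# Scales each column to unit variance; optionally subtracts the column mean first.
function standardize_sketch(X::AbstractMatrix{<:Real}; center::Bool = true)
    s = std(X; dims = 1)
    return center ? (X .- mean(X; dims = 1)) ./ s : X ./ s
end

X_std = [1.0 10.0; 2.0 20.0; 3.0 30.0]
Z_centered = standardize_sketch(X_std)
Z_scaled = standardize_sketch(X_std; center = false)
```

Both results have unit variance in every column; only the centered version also has zero column means.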

DataFramesExt.missing_percentages Method
julia
missing_percentages(df::DataFrame)

Returns the percentage of missing values in each column and row, as well as the total percentage of missing values in the dataframe.

Arguments

  • df: The dataframe to calculate the missing percentages for.

Returns

  • pmissing_cols: A Vector of the percentage of missing values in each column.

  • pmissing_rows: A Vector of the percentage of missing values in each row.

  • total_missing: The total percentage of missing values in the dataframe.

Examples

julia
julia> df = DataFrame(A = [1, 2, 3],
                 B = [missing, missing, missing],
                 C = [missing, 4, 5],
                 D = [6, missing, 7],
                 E = [missing, missing, 10])
3×5 DataFrame
 Row │ A      B        C        D        E
     │ Int64  Missing  Int64?   Int64?   Int64?
─────┼───────────────────────────────────────────
   1 │     1  missing  missing        6  missing
   2 │     2  missing        4  missing  missing
   3 │     3  missing        5        7       10

julia> BigRiverJunbi.missing_percentages(df)
([0.0, 1.0, 0.3333333333333333, 0.3333333333333333, 0.6666666666666666], [0.6, 0.6, 0.2], 0.4666666666666667)
DataFramesExt.missing_summary Method
julia
missing_summary(df::DataFrame)

Adds a row and column to the dataframe that contains the percentage of missing values in each column and row. Returns a pretty table with the percentage of missing values in the last row and column highlighted.

Warning

This function will not preserve the type of the dataframe, as it converts everything to a string for the pretty table. It is primarily used for quick visualizations. For getting the actual missing percentages, use the missing_percentages function instead.

Arguments

  • df: The dataframe to add the missing summary to.

Examples

julia
julia> df = DataFrame(A = [1, 2, 3],
                 B = [missing, missing, missing],
                 C = [missing, 4, 5],
                 D = [6, missing, 7],
                 E = [missing, missing, 10])
3×5 DataFrame
 Row │ A      B        C        D        E
     │ Int64  Missing  Int64?   Int64?   Int64?
─────┼───────────────────────────────────────────
   1 │     1  missing  missing        6  missing
   2 │     2  missing        4  missing  missing
   3 │     3  missing        5        7       10

julia> BigRiverJunbi.missing_summary(df)
┌───────────────┬────────┬─────────┬─────────┬─────────┬─────────┬───────────────┐
│               │      A │       B │       C │       D │       E │ pmissing_rows │
│               │ String │  String │  String │  String │  String │        String │
├───────────────┼────────┼─────────┼─────────┼─────────┼─────────┼───────────────┤
│             1 │      1 │ missing │ missing │       6 │ missing │           0.6 │
│             2 │      2 │ missing │       4 │ missing │ missing │           0.6 │
│             3 │      3 │ missing │       5 │       7 │      10 │           0.2 │
├───────────────┼────────┼─────────┼─────────┼─────────┼─────────┼───────────────┤
│ pmissing_cols │    0.0 │     1.0 │    0.33 │    0.33 │    0.67 │          0.47 │
└───────────────┴────────┴─────────┴─────────┴─────────┴─────────┴───────────────┘