BigRiverJunbi

Documentation for BigRiverJunbi.

BigRiverJunbi.check_mad - Method
check_mad(mat::Matrix{T}; dims::Int = 2) where {T <: Real}

Checks whether the MAD (median absolute deviation) is zero for each column of a matrix. If any column has zero MAD, throws an error that lists those columns.

Arguments

  • mat::Matrix{T}: The matrix to check the MAD for.
  • dims::Int: The dimension along which the MAD is computed. Default is 2.

BigRiverJunbi.check_mad - Method
check_mad(x::Vector{T}) where {T <: Real}

Checks whether the MAD (median absolute deviation) of a vector is zero and throws an error if it is.

Arguments

  • x::Vector{T}: The vector to check the MAD for.
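
A minimal sketch of such a check, assuming the raw (unscaled) MAD; raw_mad and assert_nonzero_mad are hypothetical names, not the package's implementation:

```julia
using Statistics

# Hypothetical helper: raw MAD, i.e. the median of the absolute deviations
# from the median, with no scaling constant applied.
raw_mad(x::AbstractVector{<:Real}) = median(abs.(x .- median(x)))

# Hypothetical check: throw if the MAD is zero, otherwise return nothing.
function assert_nonzero_mad(x::AbstractVector{<:Real})
    raw_mad(x) == 0 && throw(ArgumentError("MAD is zero"))
    return nothing
end

assert_nonzero_mad([1.0, 2.0, 4.0, 7.0])    # passes: deviations are 2, 1, 1, 4
# assert_nonzero_mad([3.0, 3.0, 3.0, 5.0])  # would throw: deviations are 0, 0, 0, 2
```

A zero MAD occurs whenever more than half of the values equal the median, which is why a downstream step that divides by the MAD needs this guard.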

BigRiverJunbi.huberize - Method
huberize(mat::Matrix{T}; alpha::Float64 = 1.0,
    error_on_zero_mad::Bool = true) where {T <: Real}

Performs Huberization for sample intensities.

Arguments

  • mat: The matrix to normalize.
  • alpha: The alpha parameter for Huberization. Default is 1.0.
  • error_on_zero_mad: Whether to throw an error if the MAD is zero. Default is true.
Warning

If you set error_on_zero_mad to false, this function will return a result with NaN values if the MAD is zero. This can be useful if you are expecting this behavior and want to handle it yourself, but should be used with caution.

Examples

julia> mat = [0.5 1 2 3 3.5;
              7 3 5 1.5 4.5;
              8 2 7 6 9]
3×5 Matrix{Float64}:
 0.5  1.0  2.0  3.0  3.5
 7.0  3.0  5.0  1.5  4.5
 8.0  2.0  7.0  6.0  9.0

julia> BigRiverJunbi.huberize(mat)
3×5 Matrix{Float64}:
 2.86772  1.0  2.0002  3.0      3.5
 7.0      3.0  5.0     1.5      4.5
 8.0      2.0  7.0     5.89787  7.83846

BigRiverJunbi.huberize - Method
huberize(x::Vector{T}; alpha::Float64 = 1.0,
    error_on_zero_mad::Bool = true) where {T <: Real}

Performs Huberization for a single vector.

Arguments

  • x: The vector to Huberize.
  • alpha: The alpha parameter for the Huberization. Default is 1.0.
  • error_on_zero_mad: Whether to throw an error if the MAD is zero. Default is true.

BigRiverJunbi.huberloss - Method
huberloss(x::Real; alpha::Float64 = 1.0)

Computes the Huber loss for a given value. This is defined as:

\[L(x) = \begin{cases} \frac{1}{2}x^2 & \text{if } |x| \leq \alpha \\ \alpha \left(|x| - \frac{\alpha}{2}\right) & \text{if } |x| > \alpha \end{cases}\]

Arguments

  • x: The value to compute the Huber loss for.
  • alpha: The alpha parameter for the Huber loss. Default is 1.0.
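
A minimal sketch of the standard Huber loss, in which the linear branch is alpha * (|x| - alpha / 2) so that both branches agree at |x| = alpha. The name huber_sketch is hypothetical, and this is not taken from the package source:

```julia
# Hypothetical re-implementation of the standard Huber loss, for illustration.
function huber_sketch(x::Real; alpha::Float64 = 1.0)
    ax = abs(x)
    # Quadratic near zero, linear in the tails.
    return ax <= alpha ? 0.5 * x^2 : alpha * (ax - 0.5 * alpha)
end

huber_sketch(0.5)                # quadratic branch
huber_sketch(3.0)                # linear branch
huber_sketch(2.0; alpha = 2.0)   # boundary case |x| == alpha
```

Small residuals are penalized quadratically, while large ones grow only linearly, which limits the influence of outliers.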

BigRiverJunbi.imputeKNN! - Method
imputeKNN!(
    data::AbstractMatrix{Union{Missing, Float64}},
    k::Int,
    threshold::Float64,
    dims::Union{Nothing, Int},
    distance::M
) where {M <: NearestNeighbors.MinkowskiMetric}

Replaces missing elements based on k-nearest neighbors (KNN) imputation. Modifies the original matrix in place. This method is almost an exact copy of the KNN imputation method from Impute.jl.

Arguments

  • data: matrix of omics values, e.g., metabolomics matrix, where the rows are the samples and the columns are the features.
  • k: number of nearest neighbors to use for imputation.
  • threshold: threshold for the number of missing neighbors.
  • dims: dimension along which the statistic is calculated.
  • distance: distance metric to use for the nearest neighbors search, taken from Distances.jl. Default is Euclidean(). This must be one of the Minkowski metrics, i.e., Euclidean, Cityblock, Minkowski, or Chebyshev.

BigRiverJunbi.imputeKNN - Method
imputeKNN(df::DataFrame; k = 5, threshold = 0.2, start_col = 1)

Replaces missing elements based on k-nearest neighbors (KNN) imputation.

Arguments

  • df: dataframe with missing values.
  • k: number of nearest neighbors to use for imputation.
  • threshold: threshold for the number of missing neighbors.
  • start_col: column index to start imputing from.

BigRiverJunbi.impute_QRILC - Method
impute_QRILC(
    data::Matrix{Union{Missing, Float64}};
    tune_sigma::Float64 = 1.0,
    eps::Float64 = 0.005
)

Returns an imputed matrix based on the "Quantile regression Imputation for left-censored data" (QRILC) method. The function is based on impute.QRILC from the imputeLCMD R package, with one difference: the default value of eps is 0.005 instead of 0.001.

Arguments

  • data: matrix of omics values, e.g., metabolomics matrix, where the rows are the samples and the columns are the features.
  • tune_sigma: coefficient that controls the sd of the MNAR distribution. Default is 1.0.
      - 1 if the complete data distribution is assumed to be Gaussian.
      - 0 < tune_sigma < 1 if the complete data distribution is assumed to be left-censored.
  • eps: small value added to the quantile for stability.

BigRiverJunbi.impute_cat! - Method
impute_cat!(data::Matrix{Union{Missing, Float64}})

Imputes missing elements based on a categorical imputation:

  • 0: Missing values
  • 1: Values below the median
  • 2: Values equal to or above the median

Modifies the original matrix in place.

Arguments

  • data: matrix of omics values, e.g., metabolomics matrix, where the rows are the samples and the columns are the features.

BigRiverJunbi.impute_cat - Method
impute_cat(df_missing::DataFrame; start_col::Int64 = 1)

Returns an imputed dataframe based on a categorical imputation:

  • 0: Missing values
  • 1: Values below the median
  • 2: Values equal to or above the median

Arguments

  • df_missing: dataframe with missing values.
  • start_col: column index to start imputing from.

Examples

julia> df = DataFrame(A = [1, 2, 3],
                 B = [missing, missing, missing],
                 C = [missing, 4, 5],
                 D = [6, missing, 7],
                 E = [missing, missing, 10])
3×5 DataFrame
 Row │ A      B        C        D        E
     │ Int64  Missing  Int64?   Int64?   Int64?
─────┼───────────────────────────────────────────
   1 │     1  missing  missing        6  missing
   2 │     2  missing        4  missing  missing
   3 │     3  missing        5        7       10

julia> BigRiverJunbi.impute_cat(df)
3×5 DataFrame
 Row │ A         B         C         D         E
     │ Float64?  Float64?  Float64?  Float64?  Float64?
─────┼──────────────────────────────────────────────────
   1 │      1.0       0.0       0.0       1.0       0.0
   2 │      2.0       0.0       1.0       0.0       0.0
   3 │      2.0       0.0       2.0       2.0       2.0
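
The coding in the example above is consistent with comparing each value against the median of the observed values in its own column. A minimal sketch under that assumption, using a plain matrix instead of a DataFrame; cat_code is a hypothetical helper, not part of the package:

```julia
using Statistics

# Hypothetical helper: code one column as 0.0 (missing), 1.0 (below the
# column median) or 2.0 (equal to or above the column median).
function cat_code(col::AbstractVector{Union{Missing, Float64}})
    observed = collect(skipmissing(col))
    # A column with no observed values has no median; everything is coded 0.
    isempty(observed) && return zeros(Float64, length(col))
    m = median(observed)
    return [ismissing(v) ? 0.0 : (v < m ? 1.0 : 2.0) for v in col]
end

data = Union{Missing, Float64}[1 missing missing 6 missing;
                               2 missing 4 missing missing;
                               3 missing 5 7 10]

# Code every column and stitch the results back into a matrix.
coded = reduce(hcat, [cat_code(data[:, j]) for j in axes(data, 2)])
```

Here coded reproduces the values of the DataFrame result shown above.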

BigRiverJunbi.impute_cat - Method
impute_cat(data::Matrix{Union{Missing, Float64}})

Imputes missing elements based on a categorical imputation:

  • 0: Missing values
  • 1: Values below the median
  • 2: Values equal to or above the median

Returns a new matrix without modifying the original matrix.

Arguments

  • data: matrix of omics values, e.g., metabolomics matrix, where the rows are the samples and the columns are the features.

BigRiverJunbi.impute_half_min - Method
impute_half_min(df::DataFrame; start_col::Int64 = 1)

Replaces missing elements in the specified columns with half of the minimum of non-missing elements in the corresponding variable.

Arguments

  • df: dataframe with missing values.
  • start_col: column index to start imputing from.

Examples

julia> df = DataFrame(A = [1, 2, 3],
                 B = [missing, missing, missing],
                 C = [missing, 4, 5],
                 D = [6, missing, 7],
                 E = [missing, missing, 10])
3×5 DataFrame
 Row │ A      B        C        D        E
     │ Int64  Missing  Int64?   Int64?   Int64?
─────┼───────────────────────────────────────────
   1 │     1  missing  missing        6  missing
   2 │     2  missing        4  missing  missing
   3 │     3  missing        5        7       10

julia> BigRiverJunbi.impute_half_min(df)
3×5 DataFrame
 Row │ A         B         C         D         E
     │ Float64?  Float64?  Float64?  Float64?  Float64?
─────┼──────────────────────────────────────────────────
   1 │      1.0       0.5       0.5       6.0       0.5
   2 │      2.0       1.0       4.0       1.0       1.0
   3 │      3.0       1.5       5.0       7.0      10.0
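
The output above is consistent with taking the minimum across each row: row 1 has minimum 1, so its missing entries become 0.5. A minimal sketch under that assumption, on a plain matrix; half_min_rows is a hypothetical helper, not the package's implementation:

```julia
# Hypothetical helper: fill each row's missing entries with half of that
# row's smallest observed value. Rows with no observed values are left as-is.
function half_min_rows(data::AbstractMatrix{Union{Missing, Float64}})
    out = copy(data)
    for i in axes(out, 1)
        observed = collect(skipmissing(out[i, :]))
        isempty(observed) && continue
        fill_value = minimum(observed) / 2
        for j in axes(out, 2)
            ismissing(out[i, j]) && (out[i, j] = fill_value)
        end
    end
    return out
end

data = Union{Missing, Float64}[1 missing missing 6 missing;
                               2 missing 4 missing missing;
                               3 missing 5 7 10]
filled = half_min_rows(data)
```

With this input, filled matches the values in the DataFrame result shown above.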

BigRiverJunbi.impute_min - Method
impute_min(df::DataFrame; start_col::Int64 = 1)

Replaces missing elements in the specified columns with the minimum of non-missing elements in the corresponding variable.

Arguments

  • df: dataframe with missing values.
  • start_col: column index to start imputing from.

Examples

julia> df = DataFrame(A = [1, 2, 3],
                 B = [missing, missing, missing],
                 C = [missing, 4, 5],
                 D = [6, missing, 7],
                 E = [missing, missing, 10])
3×5 DataFrame
 Row │ A      B        C        D        E
     │ Int64  Missing  Int64?   Int64?   Int64?
─────┼───────────────────────────────────────────
   1 │     1  missing  missing        6  missing 
   2 │     2  missing        4  missing  missing 
   3 │     3  missing        5        7       10

julia> BigRiverJunbi.impute_min(df)
3×5 DataFrame
 Row │ A         B         C         D         E
     │ Float64?  Float64?  Float64?  Float64?  Float64? 
─────┼──────────────────────────────────────────────────
   1 │      1.0       1.0       1.0       6.0       1.0
   2 │      2.0       2.0       4.0       2.0       2.0
   3 │      3.0       3.0       5.0       7.0      10.0

BigRiverJunbi.impute_min_prob - Function
impute_min_prob(data::Matrix{Union{Missing, Float64}}, q = 0.01; tune_sigma = 1)

Replaces missing values with random draws from a Gaussian distribution centered on the minimum value observed and with standard deviation equal to the median value of the population of line-wise standard deviations. Returns a new matrix without modifying the original matrix.

Arguments

  • data: matrix of omics values, e.g., metabolomics matrix, where the rows are the samples and the columns are the features.
  • q: quantile of the minimum values to use for imputation. Default is 0.01.
  • tune_sigma: coefficient that controls the sd of the MNAR distribution. Default is 1.0.
      - 1 if the complete data distribution is assumed to be Gaussian.
      - 0 < tune_sigma < 1 if the complete data distribution is assumed to be left-censored.

BigRiverJunbi.impute_min_prob! - Function
impute_min_prob!(data::Matrix{Union{Missing, Float64}}, q = 0.01; tune_sigma = 1)

Replaces missing values with random draws from a Gaussian distribution centered on the minimum value observed and with standard deviation equal to the median value of the population of line-wise standard deviations. Modifies the original matrix in place.

Arguments

  • data: matrix of omics values, e.g., metabolomics matrix, where the rows are the samples and the columns are the features.
  • q: quantile of the minimum values to use for imputation. Default is 0.01.
  • tune_sigma: coefficient that controls the sd of the MNAR distribution. Default is 1.0.
      - 1 if the complete data distribution is assumed to be Gaussian.
      - 0 < tune_sigma < 1 if the complete data distribution is assumed to be left-censored.

BigRiverJunbi.impute_min_prob - Method
impute_min_prob(df::DataFrame; start_col::Int64 = 1, q = 0.01, tune_sigma = 1)

Replaces missing values in the specified columns with random draws from a Gaussian distribution centered on the minimum value observed and with standard deviation equal to the median value of the population of line-wise standard deviations.

Arguments

  • df: dataframe with missing values.
  • start_col: column index to start imputing from.
  • q: quantile of the minimum values to use for imputation. Default is 0.01.
  • tune_sigma: coefficient that controls the sd of the MNAR distribution. Default is 1.0.
      - 1 if the complete data distribution is assumed to be Gaussian.
      - 0 < tune_sigma < 1 if the complete data distribution is assumed to be left-censored.

BigRiverJunbi.impute_zero! - Method
impute_zero!(data::Matrix{Union{Missing, Float64}})

Modifies the original matrix in place to replace missing elements with zero.

Arguments

  • data: matrix of omics values, e.g., metabolomics matrix, where the rows are the samples and the columns are the features.

BigRiverJunbi.impute_zero - Method
impute_zero(df::DataFrame; start_col::Int64 = 1)

Replaces missing elements in the specified columns with zero.

Arguments

  • df: dataframe with missing values.
  • start_col: column index to start imputing from.

Examples

julia> df = DataFrame(A = [1, 2, 3],
                 B = [missing, missing, missing],
                 C = [missing, 4, 5],
                 D = [6, missing, 7],
                 E = [missing, missing, 10])
3×5 DataFrame
 Row │ A      B        C        D        E
     │ Int64  Missing  Int64?   Int64?   Int64?
─────┼───────────────────────────────────────────
   1 │     1  missing  missing        6  missing
   2 │     2  missing        4  missing  missing
   3 │     3  missing        5        7       10

julia> BigRiverJunbi.impute_zero(df)
3×5 DataFrame
 Row │ A         B         C         D         E
     │ Float64?  Float64?  Float64?  Float64?  Float64?
─────┼──────────────────────────────────────────────────
   1 │      1.0       0.0       0.0       6.0       0.0
   2 │      2.0       0.0       4.0       0.0       0.0
   3 │      3.0       0.0       5.0       7.0      10.0

BigRiverJunbi.impute_zero - Method
impute_zero(data::Matrix{Union{Missing, Float64}})

Returns a matrix with missing elements replaced with zero without modifying the original matrix.

Arguments

  • data: matrix of omics values, e.g., metabolomics matrix, where the rows are the samples and the columns are the features.

BigRiverJunbi.intnorm - Method
intnorm(mat::Matrix{T}; dims::Int64 = 2, lambda::Float64 = 1.0) where T <: Real

Total Area Normalization for each row or column. By default, it normalizes each row. This requires that the matrix has all positive values.

Arguments

  • mat: The matrix to normalize.
  • dims: The dimension to normalize across. Default is 2.
  • lambda: The lambda parameter for the normalization. Default is 1.0.

Examples

julia> mat = [0.5 1 2 3 3.5;
              7 3 5 1.5 4.5;
              8 2 7 6 9]
3×5 Matrix{Float64}:
 0.5  1.0  2.0  3.0  3.5
 7.0  3.0  5.0  1.5  4.5
 8.0  2.0  7.0  6.0  9.0

julia> BigRiverJunbi.intnorm(mat)
3×5 Matrix{Float64}:
 0.05      0.1       0.2       0.3        0.35
 0.333333  0.142857  0.238095  0.0714286  0.214286
 0.25      0.0625    0.21875   0.1875     0.28125
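
With the defaults, the result above matches dividing each row by its own sum. A minimal sketch of that step in plain Julia; the role of lambda is not covered here, and row_normalized is a hypothetical name:

```julia
mat = [0.5 1 2 3 3.5;
       7 3 5 1.5 4.5;
       8 2 7 6 9]

# Divide every entry by the total of its row, so each row sums to 1.
# sum(mat, dims = 2) is a 3×1 matrix that broadcasts across the columns.
row_normalized = mat ./ sum(mat, dims = 2)
```

Because each row is rescaled to unit total, samples with different overall intensity become directly comparable.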

BigRiverJunbi.log2_tx - Method
log2_tx(mat::Matrix{Float64}; eps::Float64 = 1.0)

Computes the base-2 logarithm of a matrix, adding a constant to all values to avoid log(0). This requires that the matrix has no negative values; zeros are allowed, as in the example below.

Arguments

  • mat: The matrix to transform.
  • eps: The constant to add to all values. Default is 1.0.

Examples

julia> mat = [0.5 1 2 3 3.5;
             7 3 5 0 3.5;
             8 2 5 6 0]
3×5 Matrix{Float64}:
 0.5  1.0  2.0  3.0  3.5
 7.0  3.0  5.0  0.0  3.5
 8.0  2.0  5.0  6.0  0.0

julia> BigRiverJunbi.log2_tx(mat)
3×5 Matrix{Float64}:
 0.584963  1.0      1.58496  2.0      2.16993
 3.0       2.0      2.58496  0.0      2.16993
 3.16993   1.58496  2.58496  2.80735  0.0
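
The transformation above amounts to an elementwise log2 of the shifted values. A minimal sketch, where shift_log2 is a hypothetical name rather than the package function:

```julia
# Hypothetical helper: add eps to every entry, then take log2 elementwise.
shift_log2(mat::AbstractMatrix{<:Real}; eps::Real = 1.0) = log2.(mat .+ eps)

mat = [0.5 1 2 3 3.5;
       7 3 5 0 3.5;
       8 2 5 6 0]
transformed = shift_log2(mat)
```

With eps = 1.0, a zero intensity maps to log2(1) = 0, so zeros stay at zero after the transform.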

BigRiverJunbi.meancenter_tx - Method
meancenter_tx(mat::Matrix{Float64}, dims::Int64 = 1)

Mean center a matrix across the specified dimension. This requires that the matrix has all positive values.

Arguments

  • mat: The matrix to transform.
  • dims: The dimension to mean center across. Default is 1.

Examples

julia> mat = [0.5 1 2 3 3.5;
             7 3 5 0 3.5;
             8 2 5 6 0]
3×5 Matrix{Float64}:
 0.5  1.0  2.0  3.0  3.5
 7.0  3.0  5.0  0.0  3.5
 8.0  2.0  5.0  6.0  0.0

julia> BigRiverJunbi.meancenter_tx(mat)
3×5 Matrix{Float64}:
 -4.66667  -1.0  -2.0   0.0   1.16667
  1.83333   1.0   1.0  -3.0   1.16667
  2.83333   0.0   1.0   3.0  -2.33333
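
The result above corresponds to subtracting each column's mean from that column. A minimal sketch with the Statistics standard library; center is a hypothetical name:

```julia
using Statistics

# Hypothetical helper: subtract the mean taken along `dims` from every entry.
center(mat::AbstractMatrix{<:Real}; dims::Int = 1) = mat .- mean(mat, dims = dims)

mat = [0.5 1 2 3 3.5;
       7 3 5 0 3.5;
       8 2 5 6 0]
centered = center(mat)
```

After centering with dims = 1, every column of centered has mean zero up to floating-point error.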

BigRiverJunbi.missing_percentages - Method
missing_percentages(df::DataFrame)

Returns the percentage of missing values in each column and each row, as well as the total percentage of missing values in the dataframe. As the example below shows, each value is expressed as a fraction between 0 and 1.

Arguments

  • df::DataFrame: The dataframe to calculate the missing percentages for.

Returns

  • pmissing_cols::Vector{Float64}: The percentage of missing values in each column.
  • pmissing_rows::Vector{Float64}: The percentage of missing values in each row.
  • total_missing::Float64: The total percentage of missing values in the dataframe.

Examples

julia> df = DataFrame(A = [1, 2, 3],
                 B = [missing, missing, missing],
                 C = [missing, 4, 5],
                 D = [6, missing, 7],
                 E = [missing, missing, 10])
3×5 DataFrame
 Row │ A      B        C        D        E
     │ Int64  Missing  Int64?   Int64?   Int64?
─────┼───────────────────────────────────────────
   1 │     1  missing  missing        6  missing
   2 │     2  missing        4  missing  missing
   3 │     3  missing        5        7       10

julia> BigRiverJunbi.missing_percentages(df)
([0.0, 1.0, 0.3333333333333333, 0.3333333333333333, 0.6666666666666666], [0.6, 0.6, 0.2], 0.4666666666666667)
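
The same three quantities can be sketched on a plain matrix by averaging a Boolean missingness mask. This is an illustration with the Statistics standard library, not the package's implementation:

```julia
using Statistics

data = Union{Missing, Int}[1 missing missing 6 missing;
                           2 missing 4 missing missing;
                           3 missing 5 7 10]

# true where a value is missing; the mean of a Boolean mask is a fraction.
mask = ismissing.(data)

pmissing_cols = vec(mean(mask, dims = 1))   # one fraction per column
pmissing_rows = vec(mean(mask, dims = 2))   # one fraction per row
total_missing = mean(mask)                  # 7 of the 15 cells are missing
```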

BigRiverJunbi.missing_summary - Method
missing_summary(df::DataFrame)

Adds a row and column to the dataframe that contains the percentage of missing values in each column and row. Returns a pretty table with the percentage of missing values in the last row and column highlighted.

Warning

This function will not preserve the type of the dataframe, as it converts everything to a string for the pretty table. It is primarily used for quick visualizations. For getting the actual missing percentages, use the missing_percentages function instead.

Arguments

  • df::DataFrame: The dataframe to add the missing summary to.

Examples

julia> df = DataFrame(A = [1, 2, 3],
                 B = [missing, missing, missing],
                 C = [missing, 4, 5],
                 D = [6, missing, 7],
                 E = [missing, missing, 10])
3×5 DataFrame
 Row │ A      B        C        D        E
     │ Int64  Missing  Int64?   Int64?   Int64?
─────┼───────────────────────────────────────────
   1 │     1  missing  missing        6  missing
   2 │     2  missing        4  missing  missing
   3 │     3  missing        5        7       10

julia> BigRiverJunbi.missing_summary(df)
┌───────────────┬────────┬─────────┬─────────┬─────────┬─────────┬───────────────┐
│               │      A │       B │       C │       D │       E │ pmissing_rows │
│               │ String │  String │  String │  String │  String │        String │
├───────────────┼────────┼─────────┼─────────┼─────────┼─────────┼───────────────┤
│             1 │      1 │ missing │ missing │       6 │ missing │           0.6 │
│             2 │      2 │ missing │       4 │ missing │ missing │           0.6 │
│             3 │      3 │ missing │       5 │       7 │      10 │           0.2 │
├───────────────┼────────┼─────────┼─────────┼─────────┼─────────┼───────────────┤
│ pmissing_cols │    0.0 │     1.0 │    0.33 │    0.33 │    0.67 │          0.47 │
└───────────────┴────────┴─────────┴─────────┴─────────┴─────────┴───────────────┘

BigRiverJunbi.pqnorm - Method
pqnorm(mat::Matrix{Float64})

Performs a probabilistic quotient normalization (PQN) for sample intensities. This assumes that the matrix is organized as samples x features and requires that the matrix have all positive values.

Arguments

  • mat: The matrix to normalize.

Examples

julia> mat = [0.5 1 2 3 3.5;
              7 3 5 1.5 4.5;
              8 2 7 6 9]
3×5 Matrix{Float64}:
 0.5  1.0  2.0  3.0  3.5
 7.0  3.0  5.0  1.5  4.5
 8.0  2.0  7.0  6.0  9.0

julia> BigRiverJunbi.pqnorm(mat)
3×5 Matrix{Float64}:
 0.05     0.1      0.2      0.3       0.35
 0.30625  0.13125  0.21875  0.065625  0.196875
 0.25     0.0625   0.21875  0.1875    0.28125
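
A minimal sketch of one common PQN formulation that reproduces the example above: total-area normalize each sample, build a reference profile from per-feature medians, then divide each sample by the median of its quotients against that reference. The name pqn_sketch and the choice of reference are assumptions, not taken from the package source:

```julia
using Statistics

# Hypothetical PQN sketch for a samples x features matrix.
function pqn_sketch(mat::AbstractMatrix{<:Real})
    scaled = mat ./ sum(mat, dims = 2)        # total-area normalization per sample
    reference = median(scaled, dims = 1)      # reference profile: median per feature
    quotients = scaled ./ reference           # feature-wise quotients vs. reference
    factors = median(quotients, dims = 2)     # one dilution factor per sample
    return scaled ./ factors
end

mat = [0.5 1 2 3 3.5;
       7 3 5 1.5 4.5;
       8 2 7 6 9]
normalized = pqn_sketch(mat)
```

In this example only the second sample has a dilution factor different from 1, which is why rows 1 and 3 equal their total-area normalized values.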

BigRiverJunbi.quantilenorm - Method
quantilenorm(data::Matrix{T}) where T <: Real

Performs quantile normalization for sample intensities. This assumes that the matrix is organized as samples x features.

Arguments

  • data: The matrix to normalize.

Examples

julia> mat = [0.5 1 2 3 3.5;
              7 3 5 1.5 4.5;
              8 2 7 6 9]
3×5 Matrix{Float64}:
 0.5  1.0  2.0  3.0  3.5
 7.0  3.0  5.0  1.5  4.5
 8.0  2.0  7.0  6.0  9.0

julia> BigRiverJunbi.quantilenorm(mat)
3×5 Matrix{Float64}:
 1.7  1.7  1.7  4.3  1.7
 4.3  6.6  4.3  1.7  4.3
 6.6  4.3  6.6  6.6  6.6
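
The output above is consistent with sorting each column, averaging the values that share a rank across columns, and writing those averages back in each column's original order. A minimal sketch under that assumption; quantile_sketch is a hypothetical name, and ties are not handled specially:

```julia
using Statistics

# Hypothetical quantile normalization sketch: every column ends up with the
# same set of values, namely the rank-wise means across all columns.
function quantile_sketch(mat::AbstractMatrix{<:Real})
    sorted = sort(mat, dims = 1)                 # sort each column independently
    rank_means = vec(mean(sorted, dims = 2))     # mean of the k-th smallest values
    out = similar(mat, Float64)
    for j in axes(mat, 2)
        # The k-th smallest entry of column j receives the k-th rank mean.
        out[sortperm(mat[:, j]), j] = rank_means
    end
    return out
end

mat = [0.5 1 2 3 3.5;
       7 3 5 1.5 4.5;
       8 2 7 6 9]
normalized = quantile_sketch(mat)
```

For this matrix the rank means are 1.7, 4.3 and 6.6, and each column of normalized is a permutation of those three values.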

BigRiverJunbi.substitute! - Method
substitute!(
    data::AbstractArray{Union{Missing, Float64}},
    statistic::Function;
    dims::Union{Nothing, Int} = nothing
)

Substitutes missing values with the value calculated by the statistic function along the specified dimension and modifies the original array in place.

Arguments

  • data: array of values. One example: matrix of metabolomics data, where the rows are the features and the columns are the samples.
  • statistic: function that calculates the value to substitute the missing values.
  • dims: dimension along which the statistic is calculated.

BigRiverJunbi.substitute - Method
substitute(
    data::AbstractArray{Union{Missing, Float64}},
    statistic::Function;
    dims::Union{Nothing, Int} = nothing
)

Substitutes missing values with the value calculated by the statistic function along the specified dimension and returns a new array without modifying the original array.

Arguments

  • data: array of values. One example: matrix of metabolomics data, where the rows are the features and the columns are the samples.
  • statistic: function that calculates the value to substitute the missing values.
  • dims: dimension along which the statistic is calculated.