BigRiverJunbi
Documentation for BigRiverJunbi.
BigRiverJunbi.check_mad
BigRiverJunbi.huberize
BigRiverJunbi.huberloss
BigRiverJunbi.imputeKNN
BigRiverJunbi.imputeKNN!
BigRiverJunbi.impute_QRILC
BigRiverJunbi.impute_cat
BigRiverJunbi.impute_cat!
BigRiverJunbi.impute_half_min
BigRiverJunbi.impute_min
BigRiverJunbi.impute_min_prob
BigRiverJunbi.impute_min_prob!
BigRiverJunbi.impute_zero
BigRiverJunbi.impute_zero!
BigRiverJunbi.intnorm
BigRiverJunbi.log2_tx
BigRiverJunbi.meancenter_tx
BigRiverJunbi.missing_percentages
BigRiverJunbi.missing_summary
BigRiverJunbi.pqnorm
BigRiverJunbi.quantilenorm
BigRiverJunbi.substitute
BigRiverJunbi.substitute!
BigRiverJunbi.check_mad — Method
check_mad(mat::Matrix{T}; dims::Int = 2) where {T <: Real}
Checks whether the MAD (median absolute deviation) is zero for each column of a matrix. If any column has zero MAD, throws an error that lists those columns.
Arguments
mat::Matrix{T}: The matrix to check the MAD for.
BigRiverJunbi.check_mad — Method
check_mad(x::Vector{T}) where {T <: Real}
Checks whether the MAD (median absolute deviation) of a vector is zero. If it is, throws an error.
Arguments
x::Vector{T}: The vector to check the MAD for.
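The zero-MAD condition can be illustrated with the Statistics standard library alone. This is a minimal sketch, not the package's implementation: `mad_sketch` and `zero_mad_columns` are hypothetical names, and the sketch returns the offending column indices instead of throwing an error.

```julia
using Statistics

# Raw (unscaled) median absolute deviation of a vector.
mad_sketch(x::AbstractVector{<:Real}) = median(abs.(x .- median(x)))

# Hypothetical helper: indices of the columns whose MAD is exactly zero.
zero_mad_columns(mat::AbstractMatrix{<:Real}) =
    [j for j in axes(mat, 2) if iszero(mad_sketch(view(mat, :, j)))]

mat = [1.0 5.0;
       1.0 6.0;
       1.0 9.0;
       2.0 7.0]

# More than half of column 1 equals its median, so its MAD is zero.
zero_mad_columns(mat)
```

A column has zero MAD whenever more than half of its values coincide with the median, which is common for sparse or heavily imputed features.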
BigRiverJunbi.huberize — Method
huberize(mat::Matrix{T}; alpha::Float64 = 1.0,
    error_on_zero_mad::Bool = true) where {T <: Real}
Performs Huberization for sample intensities.
Arguments
mat: The matrix to normalize.
alpha: The alpha parameter for Huberization. Default is 1.0.
error_on_zero_mad: Whether to throw an error if the MAD is zero. Default is true.
If you set error_on_zero_mad to false, this function returns a result containing NaN values when the MAD is zero. This can be useful if you expect this behavior and want to handle it yourself, but use it with caution.
Examples
julia> mat = [0.5 1 2 3 3.5;
7 3 5 1.5 4.5;
8 2 7 6 9]
3×5 Matrix{Float64}:
0.5 1.0 2.0 3.0 3.5
7.0 3.0 5.0 1.5 4.5
8.0 2.0 7.0 6.0 9.0
julia> BigRiverJunbi.huberize(mat)
3×5 Matrix{Float64}:
2.86772 1.0 2.0002 3.0 3.5
7.0 3.0 5.0 1.5 4.5
8.0 2.0 7.0 5.89787 7.83846
BigRiverJunbi.huberize — Method
huberize(x::Vector{T}; alpha::Float64 = 1.0,
    error_on_zero_mad::Bool = true) where {T <: Real}
Performs Huberization for a single vector.
Arguments
x: The vector to Huberize.
alpha: The alpha parameter for the Huberization. Default is 1.0.
error_on_zero_mad: Whether to throw an error if the MAD is zero. Default is true.
BigRiverJunbi.huberloss — Method
huberloss(x::Real; alpha::Float64 = 1.0)
Computes the Huber loss for a given value. This is defined as:
\[L(x) = \begin{cases} \frac{1}{2}x^2 & \text{if } |x| \leq \alpha \\ \alpha \left(|x| - \frac{\alpha}{2}\right) & \text{if } |x| > \alpha \end{cases}\]
Arguments
x: The value to compute the Huber loss for.
alpha: The alpha parameter for the Huber loss. Default is 1.0.
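The relationship between the Huber loss and Huberization can be sketched as follows. This is an illustration under stated assumptions, not the package's source: `huberloss_sketch`, `huberize_sketch`, and `MAD_SCALE` are hypothetical names, the loss is written in its standard form (the two branches agree at |x| = alpha), and the MAD is assumed to be scaled by the usual normal-consistency constant.

```julia
using Statistics

# Standard Huber loss: quadratic near zero, linear in the tails.
function huberloss_sketch(x::Real; alpha::Float64 = 1.0)
    d = abs(x)
    return d <= alpha ? d^2 / 2 : alpha * (d - alpha / 2)
end

# Assumed scale factor that makes the MAD comparable to a standard
# deviation for normally distributed data.
const MAD_SCALE = 1.4826022185056018

# Hypothetical vector Huberization: standardize by the median and the
# scaled MAD, shrink through the Huber loss, then map back.
function huberize_sketch(x::AbstractVector{<:Real}; alpha::Float64 = 1.0)
    m = median(x)
    s = MAD_SCALE * median(abs.(x .- m))
    iszero(s) && error("MAD is zero; cannot Huberize")
    z = (x .- m) ./ s
    return m .+ s .* sign.(z) .* sqrt.(2 .* huberloss_sketch.(z; alpha = alpha))
end

# First column of the huberize example matrix.
huberize_sketch([0.5, 7.0, 8.0])
```

Values whose standardized distance from the median is at most alpha pass through unchanged, because sqrt(2 * z^2 / 2) = |z|. Only the outlying entry 0.5 is pulled toward the median, to roughly 2.8677, which agrees with the first column of the huberize example.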
BigRiverJunbi.imputeKNN! — Method
imputeKNN!(
    data::AbstractMatrix{Union{Missing, Float64}},
    k::Int,
    threshold::Float64,
    dims::Union{Nothing, Int},
    distance::M
) where {M <: NearestNeighbors.MinkowskiMetric}
Replaces missing elements based on k-nearest neighbors (KNN) imputation. Modifies the original matrix in place. This method is almost an exact copy of the KNN imputation method from Impute.jl.
Arguments
data: matrix of omics values, e.g., a metabolomics matrix, where the rows are the samples and the columns are the features.
k: number of nearest neighbors to use for imputation.
threshold: threshold for the number of missing neighbors.
dims: dimension along which the statistic is calculated.
distance: distance metric to use for the nearest neighbors search, taken from Distances.jl. Default is Euclidean(). This can only be one of the Minkowski metrics, i.e., Euclidean, Cityblock, Minkowski, and Chebyshev.
BigRiverJunbi.imputeKNN — Method
imputeKNN(df::DataFrame; k = 5, threshold = 0.2, start_col = 1)
Replaces missing elements based on k-nearest neighbors (KNN) imputation.
Arguments
df: dataframe with missing values.
k: number of nearest neighbors to use for imputation.
threshold: threshold for the number of missing neighbors.
BigRiverJunbi.impute_QRILC — Method
impute_QRILC(
    data::Matrix{Union{Missing, Float64}};
    tune_sigma::Float64 = 1.0,
    eps::Float64 = 0.005
)
Returns an imputed matrix based on the "Quantile regression Imputation for left-censored data" (QRILC) method. The function is based on the function impute.QRILC from the imputeLCMD.R package, with one difference: the default value of eps is set to 0.005 instead of 0.001.
Arguments
data: matrix of omics values, e.g., a metabolomics matrix, where the rows are the samples and the columns are the features.
tune_sigma: coefficient that controls the standard deviation of the MNAR distribution. Default is 1.0.
  - 1 if the complete data distribution is assumed to be Gaussian.
  - 0 < tune_sigma < 1 if the complete data distribution is assumed to be left-censored.
eps: small value added to the quantile for stability.
BigRiverJunbi.impute_cat! — Method
impute_cat!(data::Matrix{Union{Missing, Float64}})
Imputes missing elements based on a categorical imputation:
- 0: Missing values
- 1: Values below the median
- 2: Values equal to or above the median
Modifies the original matrix in place.
Arguments
data: matrix of omics values, e.g., a metabolomics matrix, where the rows are the samples and the columns are the features.
BigRiverJunbi.impute_cat — Method
impute_cat(df_missing::DataFrame; start_col::Int64 = 1)
Returns an imputed dataframe based on a categorical imputation:
- 0: Missing values
- 1: Values below the median
- 2: Values equal to or above the median
Arguments
df_missing: dataframe with missing values.
start_col: column index to start imputing from.
Examples
julia> df = DataFrame(A = [1, 2, 3],
B = [missing, missing, missing],
C = [missing, 4, 5],
D = [6, missing, 7],
E = [missing, missing, 10])
3×5 DataFrame
Row │ A B C D E
│ Int64 Missing Int64? Int64? Int64?
─────┼───────────────────────────────────────────
1 │ 1 missing missing 6 missing
2 │ 2 missing 4 missing missing
3 │ 3 missing 5 7 10
julia> BigRiverJunbi.impute_cat(df)
3×5 DataFrame
Row │ A B C D E
│ Float64? Float64? Float64? Float64? Float64?
─────┼──────────────────────────────────────────────────
1 │ 1.0 0.0 0.0 1.0 0.0
2 │ 2.0 0.0 1.0 0.0 0.0
3 │ 2.0 0.0 2.0 2.0 2.0
BigRiverJunbi.impute_cat — Method
impute_cat(data::Matrix{Union{Missing, Float64}})
Imputes missing elements based on a categorical imputation:
- 0: Missing values
- 1: Values below the median
- 2: Values equal to or above the median
Returns a new matrix without modifying the original matrix.
Arguments
data: matrix of omics values, e.g., a metabolomics matrix, where the rows are the samples and the columns are the features.
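The 0 / 1 / 2 coding can be sketched for a single feature column using only the Statistics standard library. `categorize_sketch` is a hypothetical helper for illustration; treating an all-missing column as all zeros is an assumption taken from column B of the DataFrame example.

```julia
using Statistics

# Hypothetical helper: code one feature column as 0 / 1 / 2.
function categorize_sketch(col::AbstractVector{Union{Missing, Float64}})
    # Assumption: a column with no observed values is coded as all zeros.
    all(ismissing, col) && return zeros(length(col))
    med = median(skipmissing(col))   # median of the observed values only
    return [ismissing(v) ? 0.0 : (v < med ? 1.0 : 2.0) for v in col]
end

col = Union{Missing, Float64}[missing, 4.0, 5.0]

# The observed median is 4.5, so 4.0 is coded 1 and 5.0 is coded 2.
categorize_sketch(col)
```

Applied to `[missing, 4.0, 5.0]` this returns `[0.0, 1.0, 2.0]`, matching column C of the DataFrame example.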
BigRiverJunbi.impute_half_min — Method
impute_half_min(df::DataFrame; start_col::Int64 = 1)
Replaces missing elements in the specified columns with half of the minimum of non-missing elements in the corresponding variable.
Arguments
df: dataframe with missing values.
start_col: column index to start imputing from.
Examples
julia> df = DataFrame(A = [1, 2, 3],
B = [missing, missing, missing],
C = [missing, 4, 5],
D = [6, missing, 7],
E = [missing, missing, 10])
3×5 DataFrame
Row │ A B C D E
│ Int64 Missing Int64? Int64? Int64?
─────┼───────────────────────────────────────────
1 │ 1 missing missing 6 missing
2 │ 2 missing 4 missing missing
3 │ 3 missing 5 7 10
julia> BigRiverJunbi.impute_half_min(df)
3×5 DataFrame
Row │ A B C D E
│ Float64? Float64? Float64? Float64? Float64?
─────┼──────────────────────────────────────────────────
1 │ 1.0 0.5 0.5 6.0 0.5
2 │ 2.0 1.0 4.0 1.0 1.0
3 │ 3.0 1.5 5.0 7.0 10.0
BigRiverJunbi.impute_min — Method
impute_min(df::DataFrame; start_col::Int64 = 1)
Replaces missing elements in the specified columns with the minimum of non-missing elements in the corresponding variable.
Arguments
df: dataframe with missing values.
start_col: column index to start imputing from.
Examples
julia> df = DataFrame(A = [1, 2, 3],
B = [missing, missing, missing],
C = [missing, 4, 5],
D = [6, missing, 7],
E = [missing, missing, 10])
3×5 DataFrame
Row │ A B C D E
│ Int64 Missing Int64? Int64? Int64?
─────┼───────────────────────────────────────────
1 │ 1 missing missing 6 missing
2 │ 2 missing 4 missing missing
3 │ 3 missing 5 7 10
julia> BigRiverJunbi.impute_min(df)
3×5 DataFrame
Row │ A B C D E
│ Float64? Float64? Float64? Float64? Float64?
─────┼──────────────────────────────────────────────────
1 │ 1.0 1.0 1.0 6.0 1.0
2 │ 2.0 2.0 4.0 2.0 2.0
3 │ 3.0 3.0 5.0 7.0 10.0
BigRiverJunbi.impute_min_prob — Function
impute_min_prob(data::Matrix{Union{Missing, Float64}}, q = 0.01; tune_sigma = 1)
Replaces missing values with random draws from a Gaussian distribution centered on the minimum observed value, with standard deviation equal to the median of the line-wise standard deviations. Returns a new matrix without modifying the original matrix.
Arguments
data: matrix of omics values, e.g., a metabolomics matrix, where the rows are the samples and the columns are the features.
q: quantile of the minimum values to use for imputation. Default is 0.01.
tune_sigma: coefficient that controls the standard deviation of the MNAR distribution. Default is 1.0.
  - 1 if the complete data distribution is assumed to be Gaussian.
  - 0 < tune_sigma < 1 if the complete data distribution is assumed to be left-censored.
BigRiverJunbi.impute_min_prob! — Function
impute_min_prob!(data::Matrix{Union{Missing, Float64}}, q = 0.01; tune_sigma = 1)
Replaces missing values with random draws from a Gaussian distribution centered on the minimum observed value, with standard deviation equal to the median of the line-wise standard deviations. Modifies the original matrix in place.
Arguments
data: matrix of omics values, e.g., a metabolomics matrix, where the rows are the samples and the columns are the features.
q: quantile of the minimum values to use for imputation. Default is 0.01.
tune_sigma: coefficient that controls the standard deviation of the MNAR distribution. Default is 1.0.
  - 1 if the complete data distribution is assumed to be Gaussian.
  - 0 < tune_sigma < 1 if the complete data distribution is assumed to be left-censored.
BigRiverJunbi.impute_min_prob — Method
impute_min_prob(df::DataFrame; start_col::Int64 = 1, q = 0.01, tune_sigma = 1)
Replaces missing values in the specified columns with random draws from a Gaussian distribution centered on the minimum observed value, with standard deviation equal to the median of the line-wise standard deviations.
Arguments
df: dataframe with missing values.
start_col: column index to start imputing from.
q: quantile of the minimum values to use for imputation. Default is 0.01.
tune_sigma: coefficient that controls the standard deviation of the MNAR distribution. Default is 1.0.
  - 1 if the complete data distribution is assumed to be Gaussian.
  - 0 < tune_sigma < 1 if the complete data distribution is assumed to be left-censored.
BigRiverJunbi.impute_zero! — Method
impute_zero!(data::Matrix{Union{Missing, Float64}})
Modifies the original matrix in place to replace missing elements with zero.
Arguments
data: matrix of omics values, e.g., a metabolomics matrix, where the rows are the samples and the columns are the features.
BigRiverJunbi.impute_zero — Method
impute_zero(df::DataFrame; start_col::Int64 = 1)
Replaces missing elements in the specified columns with zero.
Arguments
df: dataframe with missing values.
start_col: column index to start imputing from.
Examples
julia> df = DataFrame(A = [1, 2, 3],
B = [missing, missing, missing],
C = [missing, 4, 5],
D = [6, missing, 7],
E = [missing, missing, 10])
3×5 DataFrame
Row │ A B C D E
│ Int64 Missing Int64? Int64? Int64?
─────┼───────────────────────────────────────────
1 │ 1 missing missing 6 missing
2 │ 2 missing 4 missing missing
3 │ 3 missing 5 7 10
julia> BigRiverJunbi.impute_zero(df)
3×5 DataFrame
Row │ A B C D E
│ Float64? Float64? Float64? Float64? Float64?
─────┼──────────────────────────────────────────────────
1 │ 1.0 0.0 0.0 6.0 0.0
2 │ 2.0 0.0 4.0 0.0 0.0
3 │ 3.0 0.0 5.0 7.0 10.0
BigRiverJunbi.impute_zero — Method
impute_zero(data::Matrix{Union{Missing, Float64}})
Returns a matrix with missing elements replaced with zero without modifying the original matrix.
Arguments
data: matrix of omics values, e.g., a metabolomics matrix, where the rows are the samples and the columns are the features.
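The difference between the in-place and the copying variants can be illustrated with Base Julia alone. This sketch does not call the package; `filled` and `work` are illustrative variable names.

```julia
data = Union{Missing, Float64}[1.0 missing; missing 4.0]

# Non-mutating: build a new matrix and leave `data` untouched.
filled = coalesce.(data, 0.0)

# Mutating: overwrite the missing entries of a working copy in place.
work = copy(data)
work[ismissing.(work)] .= 0.0
```

The copying form is the safer default when the original missingness pattern is still needed afterwards, for example to compute missing percentages.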
BigRiverJunbi.intnorm — Method
intnorm(mat::Matrix{T}; dims::Int64 = 2, lambda::Float64 = 1.0) where T <: Real
Total Area Normalization for each row or column. By default, it normalizes each row. This requires that the matrix has all positive values.
Arguments
mat: The matrix to normalize.
dims: The dimension to normalize across. Default is 2.
lambda: The lambda parameter for the normalization. Default is 1.0.
Examples
julia> mat = [0.5 1 2 3 3.5;
7 3 5 1.5 4.5;
8 2 7 6 9]
3×5 Matrix{Float64}:
0.5 1.0 2.0 3.0 3.5
7.0 3.0 5.0 1.5 4.5
8.0 2.0 7.0 6.0 9.0
julia> BigRiverJunbi.intnorm(mat)
3×5 Matrix{Float64}:
0.05 0.1 0.2 0.3 0.35
0.333333 0.142857 0.238095 0.0714286 0.214286
0.25 0.0625 0.21875 0.1875 0.28125
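With the default settings, the output above is consistent with dividing each entry by its row total. The following is a minimal sketch of that reading in Base Julia; it does not model the lambda parameter, whose exact role is not described here.

```julia
mat = [0.5 1 2 3 3.5;
       7 3 5 1.5 4.5;
       8 2 7 6 9]

# Divide every entry by the total of its row, so each row sums to 1.
rownorm = mat ./ sum(mat, dims = 2)
```

For instance, the first row sums to 10, so its first entry becomes 0.5 / 10 = 0.05.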
BigRiverJunbi.log2_tx — Method
log2_tx(mat::Matrix{Float64}; eps::Float64 = 1.0)
Computes logarithm base 2 on a matrix, adding a constant to all values to avoid log(0). This requires that the matrix has all positive values.
Arguments
mat: The matrix to transform.
eps: The constant to add to all values. Default is 1.0.
Examples
julia> mat = [0.5 1 2 3 3.5;
7 3 5 0 3.5;
8 2 5 6 0]
3×5 Matrix{Float64}:
0.5 1.0 2.0 3.0 3.5
7.0 3.0 5.0 0.0 3.5
8.0 2.0 5.0 6.0 0.0
julia> BigRiverJunbi.log2_tx(mat)
3×5 Matrix{Float64}:
0.584963 1.0 1.58496 2.0 2.16993
3.0 2.0 2.58496 0.0 2.16993
3.16993 1.58496 2.58496 2.80735 0.0
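The transform shown above corresponds to an elementwise shifted logarithm. A minimal sketch in Base Julia, where `shift` is an illustrative name standing in for the eps keyword:

```julia
mat = [0.5 1 2 3 3.5;
       7 3 5 0 3.5;
       8 2 5 6 0]

shift = 1.0                      # plays the role of `eps`
logged = log2.(mat .+ shift)     # zeros map to log2(1) = 0
```

With a shift of 1, zero intensities map to exactly 0, as in the example output.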
BigRiverJunbi.meancenter_tx — Method
meancenter_tx(mat::Matrix{Float64}, dims::Int64 = 1)
Mean center a matrix across the specified dimension. This requires that the matrix has all positive values.
Arguments
mat: The matrix to transform.
dims: The dimension to mean center across. Default is 1.
Examples
julia> mat = [0.5 1 2 3 3.5;
7 3 5 0 3.5;
8 2 5 6 0]
3×5 Matrix{Float64}:
0.5 1.0 2.0 3.0 3.5
7.0 3.0 5.0 0.0 3.5
8.0 2.0 5.0 6.0 0.0
julia> BigRiverJunbi.meancenter_tx(mat)
3×5 Matrix{Float64}:
-4.66667 -1.0 -2.0 0.0 1.16667
1.83333 1.0 1.0 -3.0 1.16667
2.83333 0.0 1.0 3.0 -2.33333
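With dims = 1, the output above is consistent with subtracting each column's mean from that column. A minimal sketch using the Statistics standard library:

```julia
using Statistics

mat = [0.5 1 2 3 3.5;
       7 3 5 0 3.5;
       8 2 5 6 0]

# Subtract the mean of each column (dims = 1) from that column's entries.
centered = mat .- mean(mat, dims = 1)
```

After centering, every column mean is zero up to floating-point error.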
BigRiverJunbi.missing_percentages — Method
missing_percentages(df::DataFrame)
Returns the percentage of missing values in each column and row, as well as the total percentage of missing values in the dataframe. As the example shows, the values are expressed as fractions between 0 and 1.
Arguments
df::DataFrame: The dataframe to calculate the missing percentages for.
Returns
pmissing_cols::Vector{Float64}: The percentage of missing values in each column.
pmissing_rows::Vector{Float64}: The percentage of missing values in each row.
total_missing::Float64: The total percentage of missing values in the dataframe.
Examples
julia> df = DataFrame(A = [1, 2, 3],
B = [missing, missing, missing],
C = [missing, 4, 5],
D = [6, missing, 7],
E = [missing, missing, 10])
3×5 DataFrame
Row │ A B C D E
│ Int64 Missing Int64? Int64? Int64?
─────┼───────────────────────────────────────────
1 │ 1 missing missing 6 missing
2 │ 2 missing 4 missing missing
3 │ 3 missing 5 7 10
julia> BigRiverJunbi.missing_percentages(df)
([0.0, 1.0, 0.3333333333333333, 0.3333333333333333, 0.6666666666666666], [0.6, 0.6, 0.2], 0.4666666666666667)
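The three return values can be reproduced on a plain matrix by averaging a Boolean missingness mask. This is a sketch with the Statistics standard library, using a matrix with the same contents as the DataFrame above; `mask` is an illustrative name.

```julia
using Statistics

data = [1 missing missing 6 missing;
        2 missing 4 missing missing;
        3 missing 5 7 10]

mask = ismissing.(data)                      # true where a value is missing
pmissing_cols = vec(mean(mask, dims = 1))    # fraction missing per column
pmissing_rows = vec(mean(mask, dims = 2))    # fraction missing per row
total_missing = mean(mask)                   # overall fraction missing
```

Seven of the fifteen cells are missing, so the overall fraction is 7/15, or about 0.467.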
BigRiverJunbi.missing_summary — Method
missing_summary(df::DataFrame)
Adds a row and a column to the dataframe containing the percentage of missing values in each column and row. Returns a pretty table with these percentages highlighted in the last row and column.
This function does not preserve the types of the dataframe, because it converts everything to strings for the pretty table. It is intended for quick visualization. To get the actual missing percentages, use the missing_percentages function instead.
Arguments
df::DataFrame: The dataframe to add the missing summary to.
Examples
julia> df = DataFrame(A = [1, 2, 3],
B = [missing, missing, missing],
C = [missing, 4, 5],
D = [6, missing, 7],
E = [missing, missing, 10])
3×5 DataFrame
Row │ A B C D E
│ Int64 Missing Int64? Int64? Int64?
─────┼───────────────────────────────────────────
1 │ 1 missing missing 6 missing
2 │ 2 missing 4 missing missing
3 │ 3 missing 5 7 10
julia> BigRiverJunbi.missing_summary(df)
┌───────────────┬────────┬─────────┬─────────┬─────────┬─────────┬───────────────┐
│ │ A │ B │ C │ D │ E │ pmissing_rows │
│ │ String │ String │ String │ String │ String │ String │
├───────────────┼────────┼─────────┼─────────┼─────────┼─────────┼───────────────┤
│ 1 │ 1 │ missing │ missing │ 6 │ missing │ 0.6 │
│ 2 │ 2 │ missing │ 4 │ missing │ missing │ 0.6 │
│ 3 │ 3 │ missing │ 5 │ 7 │ 10 │ 0.2 │
├───────────────┼────────┼─────────┼─────────┼─────────┼─────────┼───────────────┤
│ pmissing_cols │ 0.0 │ 1.0 │ 0.33 │ 0.33 │ 0.67 │ 0.47 │
└───────────────┴────────┴─────────┴─────────┴─────────┴─────────┴───────────────┘
BigRiverJunbi.pqnorm — Method
pqnorm(mat::Matrix{Float64})
Performs a probabilistic quotient normalization (PQN) for sample intensities. This assumes that the matrix is organized as samples x features and requires that the matrix have all positive values.
Arguments
mat: The matrix to normalize.
Examples
julia> mat = [0.5 1 2 3 3.5;
7 3 5 1.5 4.5;
8 2 7 6 9]
3×5 Matrix{Float64}:
0.5 1.0 2.0 3.0 3.5
7.0 3.0 5.0 1.5 4.5
8.0 2.0 7.0 6.0 9.0
julia> BigRiverJunbi.pqnorm(mat)
3×5 Matrix{Float64}:
0.05 0.1 0.2 0.3 0.35
0.30625 0.13125 0.21875 0.065625 0.196875
0.25 0.0625 0.21875 0.1875 0.28125
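The usual PQN recipe can be sketched in a few lines with the Statistics standard library. `pqn_sketch` is a hypothetical name, and the choice of the per-feature median as the reference profile is an assumption; it is not taken from the package source, although it agrees with the example output above.

```julia
using Statistics

function pqn_sketch(mat::AbstractMatrix{<:Real})
    normed = mat ./ sum(mat, dims = 2)       # 1. total-area normalize each sample (row)
    reference = median(normed, dims = 1)     # 2. reference profile: per-feature median
    quotients = normed ./ reference          # 3. quotients against the reference
    factors = median(quotients, dims = 2)    # 4. one dilution factor per sample
    return normed ./ factors                 # 5. rescale each sample by its factor
end

mat = [0.5 1 2 3 3.5;
       7 3 5 1.5 4.5;
       8 2 7 6 9]

normalized = pqn_sketch(mat)
```

On this matrix only the second sample has a median quotient different from 1 (160/147, about 1.088), which is why only the second row differs from plain total-area normalization; its first entry becomes 0.30625.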
BigRiverJunbi.quantilenorm — Method
quantilenorm(data::Matrix{T}) where T <: Real
Performs quantile normalization for sample intensities. This assumes that the matrix is organized as samples x features.
Arguments
data: The matrix to normalize.
Examples
julia> mat = [0.5 1 2 3 3.5;
7 3 5 1.5 4.5;
8 2 7 6 9]
3×5 Matrix{Float64}:
0.5 1.0 2.0 3.0 3.5
7.0 3.0 5.0 1.5 4.5
8.0 2.0 7.0 6.0 9.0
julia> BigRiverJunbi.quantilenorm(mat)
3×5 Matrix{Float64}:
1.7 1.7 1.7 4.3 1.7
4.3 6.6 4.3 1.7 4.3
6.6 4.3 6.6 6.6 6.6
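The output above matches textbook quantile normalization applied column by column: sort each column, average the values at each rank across columns, then write those rank means back in each column's original order. The sketch below illustrates that reading; `quantilenorm_sketch` is a hypothetical name, and ties are broken by position, not averaged, which may differ from the package.

```julia
using Statistics

function quantilenorm_sketch(mat::AbstractMatrix{<:Real})
    sorted = sort(mat, dims = 1)               # sort each column independently
    rankmeans = vec(mean(sorted, dims = 2))    # mean of the k-th smallest values
    out = similar(mat, Float64)
    for j in axes(mat, 2)
        order = sortperm(view(mat, :, j))      # row indices in ascending order of value
        out[order, j] = rankmeans              # k-th smallest entry gets the k-th rank mean
    end
    return out
end

mat = [0.5 1 2 3 3.5;
       7 3 5 1.5 4.5;
       8 2 7 6 9]

quantilenorm_sketch(mat)
```

Here the smallest values of the five columns are 0.5, 1, 2, 1.5, and 3.5, whose mean is 1.7, so every column's smallest entry becomes 1.7.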
BigRiverJunbi.substitute! — Method
substitute!(
    data::AbstractArray{Union{Missing, Float64}},
    statistic::Function;
    dims::Union{Nothing, Int} = nothing
)
Substitutes missing values with the value calculated by the statistic function along the specified dimension and modifies the original array in place.
Arguments
data: array of values. One example: matrix of metabolomics data, where the rows are the features and the columns are the samples.
statistic: function that calculates the value to substitute the missing values.
dims: dimension along which the statistic is calculated.
BigRiverJunbi.substitute — Method
substitute(
    data::AbstractArray{Union{Missing, Float64}},
    statistic::Function;
    dims::Union{Nothing, Int} = nothing
)
Substitutes missing values with the value calculated by the statistic function along the specified dimension and returns a new array without modifying the original array.
Arguments
data: array of values. One example: matrix of metabolomics data, where the rows are the features and the columns are the samples.
statistic: function that calculates the value to substitute the missing values.
dims: dimension along which the statistic is calculated.
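The idea of statistic-based substitution can be sketched for a single vector. `substitute_sketch` is a hypothetical helper, not the package function: it ignores the dims keyword and assumes the statistic is computed from the observed (non-missing) values only.

```julia
using Statistics

# Hypothetical helper: fill the missing entries of one vector with a
# statistic computed from its observed values.
function substitute_sketch(x::AbstractVector{Union{Missing, Float64}},
                           statistic::Function)
    fill_value = statistic(collect(skipmissing(x)))
    return coalesce.(x, fill_value)   # new vector; `x` is left unchanged
end

x = Union{Missing, Float64}[1.0, missing, 3.0]

substitute_sketch(x, mean)      # fills the gap with the mean of 1.0 and 3.0
substitute_sketch(x, minimum)   # fills the gap with the smallest observed value
```

Passing `minimum` or `v -> minimum(v) / 2` as the statistic gives minimum and half-minimum imputation as special cases of the same pattern.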