Functions for dataframes
These functions are designed to be used with the DataFrames.jl package. They are part of a package extension that is loaded by running "using DataFrames" in the REPL or in your code.
BigRiverJunbi.huberize
BigRiverJunbi.imputeKNN
BigRiverJunbi.impute_half_min
BigRiverJunbi.impute_median_cat
BigRiverJunbi.impute_min
BigRiverJunbi.impute_zero
BigRiverJunbi.intnorm
BigRiverJunbi.log_tx
BigRiverJunbi.meancenter_tx
BigRiverJunbi.pqnorm
BigRiverJunbi.quantilenorm
BigRiverJunbi.standardize
DataFramesExt.missing_percentages
DataFramesExt.missing_summary
BigRiverJunbi.huberize Method
huberize(df::DataFrame; alpha::Real = 1,
error_on_zero_mad::Bool = true,
start_col::Int64 = 1, end_col::Int64 = size(df, 2))
Performs Huberization for sample intensities.
Arguments
df: The dataframe to normalize.
alpha: The alpha parameter for Huberization. Default is 1.
error_on_zero_mad: Whether to throw an error if the MAD is zero. Default is true.
start_col: The column to start normalizing from. Default is 1.
end_col: The column to end normalizing at. Default is the last column.
Warning
If you set error_on_zero_mad to false, this function will return a result with NaN values if the MAD is zero. This can be useful if you are expecting this behavior and want to handle it yourself, but should be used with caution.
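To make the role of alpha and the zero-MAD warning concrete, here is a minimal sketch of Huberization on a single vector. It assumes one common formulation (robust z-scores from the median and the MAD scaled by 1.4826, passed through the Huber loss and mapped back); huberize_sketch and huber_loss are hypothetical names, and the package's actual implementation may differ.

```julia
using Statistics

# Huber loss: quadratic near zero, linear in the tails.
huber_loss(z, alpha) = abs(z) <= alpha ? z^2 / 2 : alpha * (abs(z) - alpha / 2)

# Hypothetical helper for illustration; not the package's implementation.
function huberize_sketch(x::AbstractVector{<:Real};
                         alpha::Real = 1, error_on_zero_mad::Bool = true)
    m = median(x)
    # MAD scaled by 1.4826 (assumption: the usual consistency constant).
    s = 1.4826 * median(abs.(x .- m))
    if iszero(s) && error_on_zero_mad
        error("MAD is zero; cannot huberize")
    end
    z = (x .- m) ./ s                 # robust z-scores (NaN or Inf when s == 0)
    l = huber_loss.(z, alpha)
    # Values with |z| <= alpha are unchanged; larger ones are pulled toward the median.
    return m .+ s .* sign.(z) .* sqrt.(2 .* l)
end

x = [1.0, 2.0, 3.0, 4.0, 100.0]
y = huberize_sketch(x)
```

In this sketch the outlier 100.0 is shrunk toward the median while 2.0, 3.0 and 4.0 are left as they are. With a zero MAD and error_on_zero_mad = false, every entry becomes NaN, which matches the behavior the warning describes.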
BigRiverJunbi.imputeKNN Method
imputeKNN(df::DataFrame; k = 5, threshold = 0.2, start_col = 1, end_col = size(df, 2))
Replaces missing elements based on k-nearest neighbors (KNN) imputation.
Arguments
df: dataframe with missing values.
k: number of nearest neighbors to use for imputation.
threshold: threshold for the number of missing neighbors above which the imputation is skipped.
start_col: column index to start imputing from.
end_col: column index to end imputing at.
BigRiverJunbi.impute_half_min Method
impute_half_min(df::DataFrame; start_col::Int64 = 1, end_col::Int64 = size(df, 2))
Replaces missing elements in the specified columns with half of the minimum of non-missing elements in the corresponding variable.
Arguments
df: dataframe with missing values.
start_col: column index to start imputing from.
end_col: column index to end imputing at.
Examples
julia> df = DataFrame(A = [1, 2, 3],
                      B = [missing, missing, missing],
                      C = [missing, 4, 5],
                      D = [6, missing, 7],
                      E = [missing, missing, 10])
3×5 DataFrame
 Row │ A      B        C        D        E
     │ Int64  Missing  Int64?   Int64?   Int64?
─────┼───────────────────────────────────────────
   1 │     1  missing  missing        6  missing
   2 │     2  missing        4  missing  missing
   3 │     3  missing        5        7       10

julia> BigRiverJunbi.impute_half_min(df)
3×5 DataFrame
 Row │ A       B       C       D       E
     │ Int64?  Int64?  Int64?  Int64?  Int64?
─────┼────────────────────────────────────────
   1 │      1       0       0       6       0
   2 │      2       1       4       1       1
   3 │      3       1       5       7      10
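The half-minimum rule itself can be sketched for a single variable stored as a vector. This is only an illustration under stated assumptions: impute_half_min_sketch is a hypothetical helper, it returns floating-point values rather than the integer columns shown above, and it does not settle whether the package treats rows or columns as variables.

```julia
# Hypothetical helper: replace missings in one variable with half of its
# smallest observed value. Returns Float64 values.
function impute_half_min_sketch(v::AbstractVector)
    observed = collect(skipmissing(v))
    # With no observed values there is nothing to base the imputation on.
    isempty(observed) && return v
    fill_value = minimum(observed) / 2
    return [ismissing(x) ? fill_value : float(x) for x in v]
end

impute_half_min_sketch([missing, 4, 5])   # → [2.0, 4.0, 5.0]
```

Replacing minimum(observed) / 2 with minimum(observed) or with zero gives the corresponding minimum and zero imputation rules.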
BigRiverJunbi.impute_median_cat Method
impute_median_cat(df_missing::DataFrame; start_col::Int64 = 1, end_col::Int64 = size(df_missing, 2))
Returns an imputed dataframe based on a categorical imputation:
- 0: Missing values
- 1: Values below the median
- 2: Values equal to or above the median
Arguments
df_missing: dataframe with missing values.
start_col: column index to start imputing from.
end_col: column index to end imputing at.
Examples
julia> df = DataFrame(A = [1, 2, 3],
                      B = [missing, missing, missing],
                      C = [missing, 4, 5],
                      D = [6, missing, 7],
                      E = [missing, missing, 10])
3×5 DataFrame
 Row │ A      B        C        D        E
     │ Int64  Missing  Int64?   Int64?   Int64?
─────┼───────────────────────────────────────────
   1 │     1  missing  missing        6  missing
   2 │     2  missing        4  missing  missing
   3 │     3  missing        5        7       10

julia> BigRiverJunbi.impute_median_cat(df)
3×5 DataFrame
 Row │ A       B       C       D       E
     │ Int64?  Int64?  Int64?  Int64?  Int64?
─────┼────────────────────────────────────────
   1 │      1       0       0       1       0
   2 │      2       0       1       0       0
   3 │      2       0       2       2       2
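The three categories can be reproduced for one column at a time with a few lines of plain Julia. The helper name median_cat_sketch is hypothetical, and the sketch assumes the median is taken over the non-missing values of each column, which is consistent with the output shown above.

```julia
using Statistics

# Hypothetical helper: encode one column as 0 (missing), 1 (below the
# median of the observed values) or 2 (equal to or above it).
function median_cat_sketch(v::AbstractVector)
    observed = collect(skipmissing(v))
    # A column with no observed values is encoded as all zeros.
    isempty(observed) && return zeros(Int, length(v))
    med = median(observed)
    return [ismissing(x) ? 0 : (x < med ? 1 : 2) for x in v]
end

median_cat_sketch([missing, 4, 5])   # → [0, 1, 2]
```

Applied to columns A, D and E of the example dataframe, the sketch gives [1, 2, 2], [1, 0, 2] and [0, 0, 2], the same codes as in the table above.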
BigRiverJunbi.impute_min Method
impute_min(df::DataFrame; start_col::Int64 = 1, end_col::Int64 = size(df, 2))
Replaces missing elements in the specified columns with the minimum of non-missing elements in the corresponding variable.
Arguments
df: dataframe with missing values.
start_col: column index to start imputing from.
end_col: column index to end imputing at.
Examples
julia> df = DataFrame(A = [1, 2, 3],
                      B = [missing, missing, missing],
                      C = [missing, 4, 5],
                      D = [6, missing, 7],
                      E = [missing, missing, 10])
3×5 DataFrame
 Row │ A      B        C        D        E
     │ Int64  Missing  Int64?   Int64?   Int64?
─────┼───────────────────────────────────────────
   1 │     1  missing  missing        6  missing
   2 │     2  missing        4  missing  missing
   3 │     3  missing        5        7       10

julia> BigRiverJunbi.impute_min(df)
3×5 DataFrame
 Row │ A       B       C       D       E
     │ Int64?  Int64?  Int64?  Int64?  Int64?
─────┼────────────────────────────────────────
   1 │      1       1       1       6       1
   2 │      2       2       4       2       2
   3 │      3       3       5       7      10
BigRiverJunbi.impute_zero Method
impute_zero(df::DataFrame; start_col::Int64 = 1, end_col::Int64 = size(df, 2))
Replaces missing elements in the specified columns with zero.
Arguments
df: dataframe with missing values.
start_col: column index to start imputing from.
end_col: column index to end imputing at.
Examples
julia> df = DataFrame(A = [1, 2, 3],
                      B = [missing, missing, missing],
                      C = [missing, 4, 5],
                      D = [6, missing, 7],
                      E = [missing, missing, 10])
3×5 DataFrame
 Row │ A      B        C        D        E
     │ Int64  Missing  Int64?   Int64?   Int64?
─────┼───────────────────────────────────────────
   1 │     1  missing  missing        6  missing
   2 │     2  missing        4  missing  missing
   3 │     3  missing        5        7       10

julia> BigRiverJunbi.impute_zero(df)
3×5 DataFrame
 Row │ A       B       C       D       E
     │ Int64?  Int64?  Int64?  Int64?  Int64?
─────┼────────────────────────────────────────
   1 │      1       0       0       6       0
   2 │      2       0       4       0       0
   3 │      3       0       5       7      10
BigRiverJunbi.intnorm Method
intnorm(df::DataFrame; lambda::Float64 = 1.0,
start_col::Int64 = 1, end_col::Int64 = size(df, 2))
Total Area Normalization for each row or column. By default, it normalizes each row. This requires that the matrix has all positive values.
Arguments
df: The dataframe to normalize.
lambda: The lambda parameter for the normalization. Default is 1.
start_col: The column to start normalizing from. Default is 1.
end_col: The column to end normalizing at. Default is the last column.
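As a rough illustration of row-wise total area normalization, the sketch below divides each row of a plain matrix by its sum and scales by lambda, so every row sums to lambda afterwards. The name intnorm_sketch is hypothetical, and treating lambda as the target row total is an assumption about the parameter, not something stated above.

```julia
# Hypothetical helper: scale each row so that its entries sum to lambda.
# Assumes all entries are positive, as the description requires.
intnorm_sketch(mat::AbstractMatrix{<:Real}; lambda::Real = 1.0) =
    lambda .* mat ./ sum(mat, dims = 2)

mat = [1.0 3.0; 2.0 2.0; 5.0 15.0]
normed = intnorm_sketch(mat)   # rows become [0.25 0.75], [0.5 0.5], [0.25 0.75]
```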
BigRiverJunbi.log_tx Method
log_tx(df::DataFrame; base::Real = 2, constant::Real = 0,
start_col::Int64 = 1, end_col::Int64 = size(df, 2))
Computes logarithm on a dataframe, adding a constant to all values (for instance, to avoid log(0)). Default base is 2, default constant is 0.
Arguments
df: The dataframe to transform.
base: The base of the logarithm. Default is 2.
constant: The constant to add to all values. Default is 0.
start_col: The column to start transforming from. Default is 1.
end_col: The column to end transforming at. Default is the last column.
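The effect of the constant is easiest to see on a small matrix. The sketch below is a minimal stand-in that applies Julia's two-argument log elementwise after adding the constant; log_tx_sketch is a hypothetical name.

```julia
# Hypothetical helper: elementwise log of (value + constant) in the given base.
log_tx_sketch(mat::AbstractMatrix{<:Real}; base::Real = 2, constant::Real = 0) =
    log.(base, mat .+ constant)

shifted = log_tx_sketch([1.0 3.0; 7.0 15.0]; constant = 1)   # approximately [1 2; 3 4]
unshifted = log_tx_sketch(fill(0.0, 1, 1))                   # log2(0) is -Inf
```

With the default constant of 0, any zero entry maps to -Inf, which is why a small positive constant is often added when the data contain zeros.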
BigRiverJunbi.meancenter_tx Method
meancenter_tx(df::DataFrame; start_col::Int64 = 1, end_col::Int64 = size(df, 2))
Mean centers a dataframe across the specified dimension. This requires that the matrix has all positive values.
Arguments
df: The dataframe to transform.
dims: The dimension to mean center across. Default is 1.
start_col: The column to start transforming from. Default is 1.
end_col: The column to end transforming at. Default is the last column.
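For reference, mean centering across dimension 1 amounts to subtracting each column's mean from that column. The following is a minimal sketch on a plain matrix; meancenter_sketch is a hypothetical name and the dims keyword simply mirrors the argument described above.

```julia
using Statistics

# Hypothetical helper: subtract the mean taken along `dims`
# (dims = 1 centers each column, dims = 2 centers each row).
meancenter_sketch(mat::AbstractMatrix{<:Real}; dims::Int = 1) =
    mat .- mean(mat, dims = dims)

mat = [1.0 10.0; 2.0 20.0; 3.0 30.0]
centered = meancenter_sketch(mat)   # → [-1.0 -10.0; 0.0 0.0; 1.0 10.0]
```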
BigRiverJunbi.pqnorm Method
pqnorm(df::DataFrame; lambda::Real = 1,
start_col::Int64 = 1, end_col::Int64 = size(df, 2))
Performs a probabilistic quotient normalization (PQN) for sample intensities. This assumes that the matrix is organized as samples x features and requires that the matrix have all positive values.
Arguments
df: The dataframe to normalize.
lambda: The lambda parameter for the normalization. Default is 1.
start_col: The column to start normalizing from. Default is 1.
end_col: The column to end normalizing at. Default is the last column.
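The usual PQN recipe can be sketched on a samples x features matrix in four steps: total area normalization, a median reference profile, per-sample quotients, and division by each sample's median quotient. This is a sketch of the textbook procedure, not necessarily what the package does; pqnorm_sketch is a hypothetical name, and using lambda as the total-area scale is an assumption.

```julia
using Statistics

# Hypothetical helper implementing the textbook PQN steps on a
# samples x features matrix with all positive entries.
function pqnorm_sketch(mat::AbstractMatrix{<:Real}; lambda::Real = 1)
    # Step 1: total area normalization of each sample (row).
    normed = lambda .* mat ./ sum(mat, dims = 2)
    # Step 2: reference profile, the median of each feature across samples.
    reference = median(normed, dims = 1)
    # Step 3: quotient of every entry relative to the reference profile.
    quotients = normed ./ reference
    # Step 4: divide each sample by the median of its quotients.
    return normed ./ median(quotients, dims = 2)
end

# Three samples that differ only by a dilution factor.
mat = [1.0 2.0 3.0; 2.0 4.0 6.0; 10.0 20.0 30.0]
out = pqnorm_sketch(mat)
```

Because the three samples differ only by a constant factor, they end up identical after normalization, which is the dilution effect PQN is meant to remove.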
BigRiverJunbi.quantilenorm Method
quantilenorm(df::DataFrame; start_col::Int64 = 1, end_col::Int64 = size(df, 2))
Performs quantile normalization for sample intensities. This assumes that the matrix is organized as samples x features.
Arguments
df: The dataframe to normalize.
start_col: The column to start normalizing from. Default is 1.
end_col: The column to end normalizing at. Default is the last column.
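A minimal sketch of standard quantile normalization on a samples x features matrix: sort each sample, average the sorted values rank by rank, and write those averages back in each sample's original order. The name quantilenorm_sketch is hypothetical, ties are not given any special treatment, and the package's handling of ties or missing values may differ.

```julia
using Statistics

# Hypothetical helper: give every sample (row) the same set of values,
# namely the rank-wise means across samples. Ties are not averaged.
function quantilenorm_sketch(mat::AbstractMatrix{<:Real})
    nsamples = size(mat, 1)
    # Sort within each sample, then average across samples at each rank.
    rank_means = vec(mean(sort(mat, dims = 2), dims = 1))
    out = similar(mat, Float64)
    for i in 1:nsamples
        # Feature indices of sample i ordered from smallest to largest value.
        order = sortperm(mat[i, :])
        out[i, order] .= rank_means
    end
    return out
end

mat = [5.0 2.0 3.0; 4.0 1.0 6.0]
out = quantilenorm_sketch(mat)   # → [5.5 1.5 3.5; 3.5 1.5 5.5]
```

After normalization both samples contain the same values (1.5, 3.5, 5.5), arranged according to each sample's original ranking.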
BigRiverJunbi.standardize Method
standardize(df::DataFrame; center::Bool = true,
start_col::Int64 = 1, end_col::Int64 = size(df, 2))
Standardizes a dataframe, i.e. scales it to unit variance, with optional centering.
Arguments
df: The dataframe to standardize.
center: Whether to center the data. Default is true.
start_col: The column to start standardizing from. Default is 1.
end_col: The column to end standardizing at. Default is the last column.
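The difference between the centered and uncentered variants can be sketched on a plain matrix. The sketch assumes standardization is done per column using the sample standard deviation; that choice of dimension and the name standardize_sketch are assumptions for illustration.

```julia
using Statistics

# Hypothetical helper: scale each column to unit variance, optionally
# subtracting the column mean first.
function standardize_sketch(mat::AbstractMatrix{<:Real}; center::Bool = true)
    sd = std(mat, dims = 1)
    return center ? (mat .- mean(mat, dims = 1)) ./ sd : mat ./ sd
end

mat = [1.0 10.0; 2.0 20.0; 3.0 30.0]
z = standardize_sketch(mat)                    # columns have mean 0 and std 1
u = standardize_sketch(mat; center = false)    # columns have std 1, means kept
```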
DataFramesExt.missing_percentages Method
missing_percentages(df::DataFrame)
Returns the percentage of missing values in each column and row, as well as the total percentage of missing values in the dataframe. Values are expressed as fractions between 0 and 1 (for example, 0.6 means 60% missing).
Arguments
df: The dataframe to calculate the missing percentages for.
Returns
pmissing_cols: A Vector of the percentage of missing values in each column.
pmissing_rows: A Vector of the percentage of missing values in each row.
total_missing: The total percentage of missing values in the dataframe.
Examples
julia> df = DataFrame(A = [1, 2, 3],
                      B = [missing, missing, missing],
                      C = [missing, 4, 5],
                      D = [6, missing, 7],
                      E = [missing, missing, 10])
3×5 DataFrame
 Row │ A      B        C        D        E
     │ Int64  Missing  Int64?   Int64?   Int64?
─────┼───────────────────────────────────────────
   1 │     1  missing  missing        6  missing
   2 │     2  missing        4  missing  missing
   3 │     3  missing        5        7       10

julia> BigRiverJunbi.missing_percentages(df)
([0.0, 1.0, 0.3333333333333333, 0.3333333333333333, 0.6666666666666666], [0.6, 0.6, 0.2], 0.4666666666666667)
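The three returned quantities can be reproduced with a boolean mask on a plain matrix holding the same values as the example dataframe. This is a sketch using only Base and Statistics; missing_fractions_sketch is a hypothetical name.

```julia
using Statistics

# Hypothetical helper: fraction of missing entries per column, per row,
# and over the whole matrix.
function missing_fractions_sketch(mat::AbstractMatrix)
    mask = ismissing.(mat)                     # true where a value is missing
    pmissing_cols = vec(mean(mask, dims = 1))
    pmissing_rows = vec(mean(mask, dims = 2))
    total_missing = mean(mask)
    return pmissing_cols, pmissing_rows, total_missing
end

mat = [1 missing missing 6 missing;
       2 missing 4 missing missing;
       3 missing 5 7 10]
cols, rows, total = missing_fractions_sketch(mat)
```

Here 7 of the 15 entries are missing, so the total is 7/15, in line with the value shown above.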
DataFramesExt.missing_summary Method
missing_summary(df::DataFrame)
Adds a row and column to the dataframe that contains the percentage of missing values in each column and row. Returns a pretty table with the percentage of missing values in the last row and column highlighted.
Warning
This function will not preserve the type of the dataframe, as it converts everything to a string for the pretty table. It is primarily used for quick visualizations. For getting the actual missing percentages, use the missing_percentages function instead.
Arguments
df: The dataframe to add the missing summary to.
Examples
julia> df = DataFrame(A = [1, 2, 3],
                      B = [missing, missing, missing],
                      C = [missing, 4, 5],
                      D = [6, missing, 7],
                      E = [missing, missing, 10])
3×5 DataFrame
 Row │ A      B        C        D        E
     │ Int64  Missing  Int64?   Int64?   Int64?
─────┼───────────────────────────────────────────
   1 │     1  missing  missing        6  missing
   2 │     2  missing        4  missing  missing
   3 │     3  missing        5        7       10

julia> BigRiverJunbi.missing_summary(df)
┌───────────────┬────────┬─────────┬─────────┬─────────┬─────────┬───────────────┐
│               │      A │       B │       C │       D │       E │ pmissing_rows │
│               │ String │  String │  String │  String │  String │        String │
├───────────────┼────────┼─────────┼─────────┼─────────┼─────────┼───────────────┤
│             1 │      1 │ missing │ missing │       6 │ missing │           0.6 │
│             2 │      2 │ missing │       4 │ missing │ missing │           0.6 │
│             3 │      3 │ missing │       5 │       7 │      10 │           0.2 │
├───────────────┼────────┼─────────┼─────────┼─────────┼─────────┼───────────────┤
│ pmissing_cols │    0.0 │     1.0 │    0.33 │    0.33 │    0.67 │          0.47 │
└───────────────┴────────┴─────────┴─────────┴─────────┴─────────┴───────────────┘