Title: | Enabling Cell-Type Deconvolution of Proteomics Data |
---|---|
Description: | Tools for deconvoluting proteomics data to identify and quantify cell types (e.g., immune cell types) in complex biological samples. |
Authors: | Måns Zamore [aut, cre] |
Maintainer: | Måns Zamore <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.1.0.9000 |
Built: | 2025-03-14 15:24:03 UTC |
Source: | https://github.com/ComputationalProteomics/proteoDeconv |
Normalizes protein abundance data using a TPM-like approach (Transcripts Per Million), adapting this RNA-seq normalization method for use with proteomics data.
convert_to_tpm(data)
convert_to_tpm(data)
data |
A numeric matrix containing protein abundance data with identifiers as row names and samples as columns. Values should be in linear scale, not log-transformed. |
This function applies a TPM-like normalization to proteomics data, where each protein abundance value is scaled by the total abundance in the sample and multiplied by 1 million.
If your input data is log-transformed, use unlog2_data
first to convert it to
linear scale before applying this normalization.
A numeric matrix with the same dimensions as the input, with values normalized to a TPM-like scale (sum of each column equals 1 million).
# Create example protein abundance data matrix prot_mat <- matrix(abs(rnorm(12, mean = 500, sd = 200)), nrow = 4, ncol = 3) rownames(prot_mat) <- paste0("Protein", 1:4) colnames(prot_mat) <- paste0("Sample", 1:3) # View original values and column sums print(prot_mat) print(colSums(prot_mat)) # Convert to TPM-like normalization tpm_mat <- convert_to_tpm(prot_mat) # Verify that column sums equal 1 million print(tpm_mat) print(colSums(tpm_mat))
# Create example protein abundance data matrix prot_mat <- matrix(abs(rnorm(12, mean = 500, sd = 200)), nrow = 4, ncol = 3) rownames(prot_mat) <- paste0("Protein", 1:4) colnames(prot_mat) <- paste0("Sample", 1:3) # View original values and column sums print(prot_mat) print(colSums(prot_mat)) # Convert to TPM-like normalization tpm_mat <- convert_to_tpm(prot_mat) # Verify that column sums equal 1 million print(tpm_mat) print(colSums(tpm_mat))
Generates a phenotype classification matrix from sample data for use in cell type deconvolution workflows. This matrix maps samples to their respective cell types.
create_phenoclasses( data, mapping_rules, verbose = FALSE, return_format = c("tibble", "matrix") )
create_phenoclasses( data, mapping_rules, verbose = FALSE, return_format = c("tibble", "matrix") )
data |
A numeric matrix of pure cell type profiles with genes as row names and samples as columns. Column names should contain identifiers that can be mapped to cell types. |
mapping_rules |
Either: 1. A named list where names are cell type labels and values are regular expression patterns used to match sample column names (e.g., list("CD8+ T cells" = "CD8", "Monocytes" = "Mono")), OR 2. A character vector with the same length and order as colnames(data), directly specifying the cell type for each sample. |
verbose |
Logical. If TRUE, displays additional messages during processing. |
return_format |
A string specifying the return format: "matrix" or "tibble" (default). |
The function is particularly useful for preparing reference data for signature matrix generation using CIBERSORTx. Each cell in the output is encoded as:
0 for 'Unknown' mappings
1 if the sample belongs to the cell type in that row
2 if the sample belongs to a different cell type
If return_format is "matrix", a numeric matrix with cell type groups as rows and samples as columns. If return_format is "tibble", a tibble with a "cell_type" column and columns for each sample.
create_signature_matrix
which uses the phenoclasses
matrix to generate a cell type signature matrix.
## Not run: # Example using a named list of regex patterns pure_samples <- readRDS(system.file("extdata", "pure_samples_matrix.rds", package = "proteoDeconv")) mapping_rules <- list( "CD8+ T cells" = "CD8", "Monocytes" = "Mono" ) phenoclasses1 <- create_phenoclasses( data = pure_samples, mapping_rules = mapping_rules, verbose = TRUE ) # Example using a character vector of direct cell type assignments cell_types <- c("CD8+ T cells", "CD8+ T cells", "Monocytes", "Monocytes", "Unknown") phenoclasses2 <- create_phenoclasses( data = pure_samples, mapping_rules = cell_types, verbose = TRUE ) ## End(Not run)
## Not run: # Example using a named list of regex patterns pure_samples <- readRDS(system.file("extdata", "pure_samples_matrix.rds", package = "proteoDeconv")) mapping_rules <- list( "CD8+ T cells" = "CD8", "Monocytes" = "Mono" ) phenoclasses1 <- create_phenoclasses( data = pure_samples, mapping_rules = mapping_rules, verbose = TRUE ) # Example using a character vector of direct cell type assignments cell_types <- c("CD8+ T cells", "CD8+ T cells", "Monocytes", "Monocytes", "Unknown") phenoclasses2 <- create_phenoclasses( data = pure_samples, mapping_rules = cell_types, verbose = TRUE ) ## End(Not run)
Generates a signature matrix from reference profiles and cell type classes using the CIBERSORTx Docker image. This signature matrix contains cell type-specific marker proteins that will be used for deconvolution.
create_signature_matrix( refsample = NULL, phenoclasses = NULL, g_min = 200, g_max = 400, q_value = 0.01, replicates = 5, sampling = 0.5, fraction = 0.75, filter = FALSE, verbose = FALSE, QN = FALSE, single_cell = FALSE, use_sudo = FALSE, ... )
create_signature_matrix( refsample = NULL, phenoclasses = NULL, g_min = 200, g_max = 400, q_value = 0.01, replicates = 5, sampling = 0.5, fraction = 0.75, filter = FALSE, verbose = FALSE, QN = FALSE, single_cell = FALSE, use_sudo = FALSE, ... )
refsample |
A numeric matrix containing reference profiles with genes as row names and samples as columns. This should be preprocessed data from pure cell populations. |
phenoclasses |
A numeric matrix, data frame, or tibble containing cell type classification (0/1/2). If provided as a matrix or base data frame, the cell types are assumed to be in the row names. If provided as a tibble, the cell type identifiers should be included as a column. This is typically created using the create_phenoclasses() function. |
g_min |
Minimum number of genes per cell type in the signature matrix. Default is 200. |
g_max |
Maximum number of genes per cell type in the signature matrix. Default is 400. |
q_value |
Q-value threshold for differential expression. Default is 0.01. |
replicates |
Number of replicates to use for building the reference file (only relevant when single_cell=TRUE). Default is 5. |
sampling |
Fraction of available gene expression profiles selected by random sampling (only relevant when single_cell=TRUE). Default is 0.5. |
fraction |
Fraction of cells of the same identity showing evidence of expression (only relevant when single_cell=TRUE). Default is 0.75. |
filter |
Logical indicating whether to remove non-hematopoietic genes. Default is FALSE. |
verbose |
Logical indicating whether to print detailed output. Default is FALSE. |
QN |
Logical indicating whether to run quantile normalization. Default is FALSE. |
single_cell |
Logical indicating whether to create signature from scRNA-Seq data. Default is FALSE. |
use_sudo |
Logical indicating whether to use sudo for Docker commands. Default is FALSE. |
... |
Additional arguments passed to the function. |
This function uses the CIBERSORTx Docker image to construct a signature matrix based on the input reference profiles and cell type classifications. The CIBERSORTx token and email must be set as environment variables either in the project directory's .Renviron file or in the user's home directory .Renviron file.
A numeric matrix with genes as rows and cell types as columns, representing the expression profile of each cell type. This signature matrix can be used as input for deconvolution algorithms to estimate cell type proportions in mixed samples.
create_phenoclasses
for creating the phenoclasses
input and deconvolute
for using the signature matrix in
deconvolution.
## Not run: # Load preprocessed pure samples data pure_samples <- readRDS(system.file("extdata", "pure_samples_matrix.rds", package = "proteoDeconv")) # Create phenoclasses for the samples mapping_rules <- list( "CD8+ T cells" = "CD8", "Monocytes" = "Mono" ) phenoclasses <- create_phenoclasses( data = pure_samples, mapping_rules = mapping_rules ) # Create signature matrix signature_matrix <- create_signature_matrix( refsample = pure_samples, phenoclasses = phenoclasses, g_min = 200, g_max = 400, q_value = 0.01 ) ## End(Not run)
## Not run: # Load preprocessed pure samples data pure_samples <- readRDS(system.file("extdata", "pure_samples_matrix.rds", package = "proteoDeconv")) # Create phenoclasses for the samples mapping_rules <- list( "CD8+ T cells" = "CD8", "Monocytes" = "Mono" ) phenoclasses <- create_phenoclasses( data = pure_samples, mapping_rules = mapping_rules ) # Create signature matrix signature_matrix <- create_signature_matrix( refsample = pure_samples, phenoclasses = phenoclasses, g_min = 200, g_max = 400, q_value = 0.01 ) ## End(Not run)
This function provides a unified interface for deconvoluting protein or gene expression data using various algorithms, making it easy to switch between different deconvolution methods.
deconvolute(algorithm, data, signature = NULL, ...)
deconvolute(algorithm, data, signature = NULL, ...)
algorithm |
A string specifying the deconvolution algorithm to use. Options are "cibersortx", "cibersort", "epic", and "bayesdebulk". |
data |
A numeric matrix containing protein or gene expression data with genes/proteins as row names and samples as columns. |
signature |
A numeric matrix containing the reference signature with genes/proteins as row names and cell types as columns. Required by all supported algorithms. |
... |
Additional arguments passed to the specific deconvolution functions. |
The function supports multiple deconvolution algorithms:
cibersortx: Uses Docker-based CIBERSORTx, which requires Docker installation
and credentials (see deconvolute_cibersortx
).
cibersort: Uses the R implementation of CIBERSORT, which requires sourcing
the original CIBERSORT.R script (see deconvolute_cibersort
).
epic: Uses the EPIC algorithm which performs constrained least squares regression
(see deconvolute_epic
).
bayesdebulk: Uses the BayesDeBulk Bayesian deconvolution algorithm
(see deconvolute_bayesdebulk
).
A numeric matrix with samples as rows and cell types as columns, representing the estimated proportion of each cell type in each sample.
deconvolute_cibersortx
for the CIBERSORTx Docker
implementation deconvolute_cibersort
for the CIBERSORT R
implementation deconvolute_epic
for the EPIC algorithm
deconvolute_bayesdebulk
for the BayesDeBulk algorithm
## Not run: # Load example data data_file <- system.file("extdata", "mixed_samples_matrix.rds", package = "proteoDeconv") mixed_samples <- readRDS(data_file) signature_file <- system.file("extdata", "cd8t_mono_signature_matrix.rds", package = "proteoDeconv") signature_matrix <- readRDS(signature_file) # Using EPIC algorithm epic_results <- deconvolute( algorithm = "epic", data = mixed_samples, signature = signature_matrix, with_other_cells = TRUE ) # Using BayesDeBulk algorithm bayes_results <- deconvolute( algorithm = "bayesdebulk", data = mixed_samples, signature = signature_matrix, n_iter = 1000, burn_in = 100 ) ## End(Not run)
## Not run: # Load example data data_file <- system.file("extdata", "mixed_samples_matrix.rds", package = "proteoDeconv") mixed_samples <- readRDS(data_file) signature_file <- system.file("extdata", "cd8t_mono_signature_matrix.rds", package = "proteoDeconv") signature_matrix <- readRDS(signature_file) # Using EPIC algorithm epic_results <- deconvolute( algorithm = "epic", data = mixed_samples, signature = signature_matrix, with_other_cells = TRUE ) # Using BayesDeBulk algorithm bayes_results <- deconvolute( algorithm = "bayesdebulk", data = mixed_samples, signature = signature_matrix, n_iter = 1000, burn_in = 100 ) ## End(Not run)
Deconvolutes bulk proteome data using the BayesDeBulk algorithm to estimate cell type proportions in mixed samples based on a signature matrix.
deconvolute_bayesdebulk( data, signature, n_iter = 1000, burn_in = 100, marker_selection = "limma", ... )
deconvolute_bayesdebulk( data, signature, n_iter = 1000, burn_in = 100, marker_selection = "limma", ... )
data |
A numeric matrix of bulk proteome data with gene identifiers as row names and samples as columns. |
signature |
A numeric matrix containing signature marker values with gene identifiers as row names and cell types as columns. |
n_iter |
Number of iterations for the MCMC sampling; default is 1000. |
burn_in |
Number of burn-in iterations to discard; default is 100. |
marker_selection |
The method to use for marker selection: "limma" (default), "simple", or a pre-computed marker matrix. |
... |
Additional arguments passed to marker selection functions. |
This function calculates signature markers using the specified marker selection method and runs the BayesDeBulk deconvolution algorithm. The marker selection process identifies genes that are uniquely expressed in specific cell types, which are then used for the deconvolution.
A matrix containing cell type proportions with samples as rows and cell types as columns.
## Not run: # Load example data and signature matrix data_file <- system.file("extdata", "mixed_samples_matrix.rds", package = "proteoDeconv") mixed_samples <- readRDS(data_file) signature_file <- system.file("extdata", "cd8t_mono_signature_matrix.rds", package = "proteoDeconv") signature_matrix <- readRDS(signature_file) # Run deconvolution result <- deconvolute_bayesdebulk(mixed_samples, signature_matrix) ## End(Not run)
## Not run: # Load example data and signature matrix data_file <- system.file("extdata", "mixed_samples_matrix.rds", package = "proteoDeconv") mixed_samples <- readRDS(data_file) signature_file <- system.file("extdata", "cd8t_mono_signature_matrix.rds", package = "proteoDeconv") signature_matrix <- readRDS(signature_file) # Run deconvolution result <- deconvolute_bayesdebulk(mixed_samples, signature_matrix) ## End(Not run)
Applies the CIBERSORT algorithm to deconvolute bulk proteome data into constituent cell types using a signature matrix. The function requires the original CIBERSORT.R script to be sourced.
deconvolute_cibersort( data, signature, QN = FALSE, absolute = FALSE, abs_method = "sig.score", ... )
deconvolute_cibersort( data, signature, QN = FALSE, absolute = FALSE, abs_method = "sig.score", ... )
data |
A numeric matrix containing mixture data with genes as row names and samples as columns. |
signature |
A numeric matrix containing signature data with genes as row names and cell types as columns. |
QN |
Logical indicating whether quantile normalization is performed (default FALSE). |
absolute |
Logical indicating whether an absolute score is computed (default FALSE). |
abs_method |
Method for absolute scoring if absolute is TRUE (default "sig.score"). |
... |
Additional arguments passed to the CIBERSORT function. |
This function requires the original CIBERSORT.R script from the CIBERSORT website (https://cibersortx.stanford.edu/) to be sourced before use. It writes temporary files for the mixture data and signature matrix, calls the CIBERSORT function, and processes the results.
A numeric matrix with samples as rows and cell types as columns, representing the estimated proportion of each cell type in each sample. The returned matrix excludes CIBERSORT's diagnostic columns (RMSE, P-value, Correlation).
deconvolute
for a unified interface to multiple
deconvolution methods.
## Not run: # First source the CIBERSORT.R script source("/path/to/CIBERSORT.R") # Load example data and signature matrix data_file <- system.file("extdata", "mixed_samples_matrix.rds", package = "proteoDeconv") mixed_samples <- readRDS(data_file) signature_file <- system.file("extdata", "cd8t_mono_signature_matrix.rds", package = "proteoDeconv") signature_matrix <- readRDS(signature_file) # Run deconvolution result <- deconvolute_cibersort( data = mixed_samples, signature = signature_matrix, QN = FALSE, absolute = FALSE ) ## End(Not run)
## Not run: # First source the CIBERSORT.R script source("/path/to/CIBERSORT.R") # Load example data and signature matrix data_file <- system.file("extdata", "mixed_samples_matrix.rds", package = "proteoDeconv") mixed_samples <- readRDS(data_file) signature_file <- system.file("extdata", "cd8t_mono_signature_matrix.rds", package = "proteoDeconv") signature_matrix <- readRDS(signature_file) # Run deconvolution result <- deconvolute_cibersort( data = mixed_samples, signature = signature_matrix, QN = FALSE, absolute = FALSE ) ## End(Not run)
Performs deconvolution of bulk proteome data into constituent cell types using the CIBERSORTx Docker image. This function handles the interaction with the Docker container and processes the results.
deconvolute_cibersortx( data, signature, perm = 1, rmbatch_S_mode = FALSE, source_GEPs = NULL, rmbatch_B_mode = FALSE, QN = FALSE, absolute = FALSE, abs_method = "sig.score", use_sudo = FALSE )
deconvolute_cibersortx( data, signature, perm = 1, rmbatch_S_mode = FALSE, source_GEPs = NULL, rmbatch_B_mode = FALSE, QN = FALSE, absolute = FALSE, abs_method = "sig.score", use_sudo = FALSE )
data |
A numeric matrix containing mixture data with genes as row names and samples as columns. |
signature |
A numeric matrix containing the signature matrix with genes as row names and cell types as columns. |
perm |
An integer specifying the number of permutations to be performed. Default is 1. |
rmbatch_S_mode |
A logical value indicating whether to remove batch effects in source GEPs mode. Default is FALSE. |
source_GEPs |
A matrix containing the source gene expression profiles.
Required if |
rmbatch_B_mode |
A logical value indicating whether to remove batch effects in bulk mode. Default is FALSE. |
QN |
A logical value indicating whether to perform quantile normalization. Default is FALSE. |
absolute |
A logical value indicating whether to use absolute mode. Default is FALSE. |
abs_method |
A character string specifying the method to use for absolute mode. Default is "sig.score". |
use_sudo |
A logical value indicating whether to use sudo for Docker commands. Default is FALSE. |
This function requires the CIBERSORTx Docker image to be installed
and the CIBERSORTX_EMAIL
and CIBERSORTX_TOKEN
environment
variables to be set. You can get these credentials by registering at the
CIBERSORTx website (https://cibersortx.stanford.edu/).
The function creates temporary files for the mixture data and signature matrix, runs the CIBERSORTx Docker container, and processes the results. Note that absolute mode is not currently supported in the Docker version.
A numeric matrix with samples as rows and cell types as columns, representing the estimated proportion of each cell type in each sample.
deconvolute_cibersort
for using the R implementation
of CIBERSORT, deconvolute
for a unified interface to multiple
deconvolution methods.
## Not run: # Set required environment variables (ideally in .Renviron) Sys.setenv(CIBERSORTX_EMAIL = "[email protected]") Sys.setenv(CIBERSORTX_TOKEN = "your-token-here") # Load example data and signature matrix data_file <- system.file("extdata", "mixed_samples_matrix.rds", package = "proteoDeconv") mixed_samples <- readRDS(data_file) signature_file <- system.file("extdata", "cd8t_mono_signature_matrix.rds", package = "proteoDeconv") signature_matrix <- readRDS(signature_file) # Run deconvolution with CIBERSORTx Docker result <- deconvolute_cibersortx( data = mixed_samples, signature = signature_matrix, perm = 100, QN = TRUE ) ## End(Not run)
## Not run: # Set required environment variables (ideally in .Renviron) Sys.setenv(CIBERSORTX_EMAIL = "[email protected]") Sys.setenv(CIBERSORTX_TOKEN = "your-token-here") # Load example data and signature matrix data_file <- system.file("extdata", "mixed_samples_matrix.rds", package = "proteoDeconv") mixed_samples <- readRDS(data_file) signature_file <- system.file("extdata", "cd8t_mono_signature_matrix.rds", package = "proteoDeconv") signature_matrix <- readRDS(signature_file) # Run deconvolution with CIBERSORTx Docker result <- deconvolute_cibersortx( data = mixed_samples, signature = signature_matrix, perm = 100, QN = TRUE ) ## End(Not run)
Runs the EPIC algorithm on preprocessed proteomic data to estimate cell type proportions in mixed samples using a reference signature matrix.
deconvolute_epic(data, signature, with_other_cells = TRUE)
deconvolute_epic(data, signature, with_other_cells = TRUE)
data |
A numeric matrix of the bulk proteome data with gene identifiers as row names and samples as columns. |
signature |
A numeric matrix of reference signature profiles with gene identifiers as row names and cell types as columns. |
with_other_cells |
Logical; if TRUE, EPIC will include an "other cells" component in the output to account for cell types not present in the reference. Default is TRUE. |
The function normalizes both the input data matrix and signature
matrix using the handle_scaling
function before running the EPIC
deconvolution. The EPIC algorithm uses constrained least squares regression
to estimate cell type proportions.
A numeric matrix with samples as rows and cell types as columns, representing the estimated proportion of each cell type in each sample.
Racle, J. et al. (2017). Simultaneous enumeration of cancer and immune cell types from bulk tumor gene expression data. eLife, 6:e26476.
handle_scaling
for preprocessing inputs before
deconvolution, deconvolute
for a unified interface to
multiple deconvolution methods.
## Not run: # Load example data and signature matrix data_file <- system.file("extdata", "mixed_samples_matrix.rds", package = "proteoDeconv") mixed_samples <- readRDS(data_file) signature_file <- system.file("extdata", "cd8t_mono_signature_matrix.rds", package = "proteoDeconv") signature_matrix <- readRDS(signature_file) # Run EPIC deconvolution result <- deconvolute_epic( data = mixed_samples, signature = signature_matrix, with_other_cells = TRUE ) # View first few rows of the result head(result) ## End(Not run)
## Not run: # Load example data and signature matrix data_file <- system.file("extdata", "mixed_samples_matrix.rds", package = "proteoDeconv") mixed_samples <- readRDS(data_file) signature_file <- system.file("extdata", "cd8t_mono_signature_matrix.rds", package = "proteoDeconv") signature_matrix <- readRDS(signature_file) # Run EPIC deconvolution result <- deconvolute_epic( data = mixed_samples, signature = signature_matrix, with_other_cells = TRUE ) # View first few rows of the result head(result) ## End(Not run)
Extracts the primary identifier from compound row identifiers in a matrix by splitting on a separator and keeping only the first entry.
extract_identifiers(data, separator = ";")
extract_identifiers(data, separator = ";")
data |
A numeric matrix with compound identifiers (e.g. protein/gene groups) as row names. |
separator |
A character string used to separate multiple identifiers. Default is ";". |
A matrix with simplified identifiers as row names, where each identifier is the first element from the original compound identifier.
# Create matrix with compound identifiers (like protein groups) mat <- matrix(1:9, nrow = 3, ncol = 3) rownames(mat) <- c("P04637;P02340", "Q15796;O35182", "P01308;P01315;P01317") colnames(mat) <- c("Sample1", "Sample2", "Sample3") # View original matrix print(mat) # Extract primary identifiers result <- extract_identifiers(mat) print(result)
# Create matrix with compound identifiers (like protein groups) mat <- matrix(1:9, nrow = 3, ncol = 3) rownames(mat) <- c("P04637;P02340", "Q15796;O35182", "P01308;P01315;P01317") colnames(mat) <- c("Sample1", "Sample2", "Sample3") # View original matrix print(mat) # Extract primary identifiers result <- extract_identifiers(mat) print(result)
Resolves duplicate row identifiers in an expression matrix using the specified method.
handle_duplicates(data, duplicate_mode = "slice")
handle_duplicates(data, duplicate_mode = "slice")
data |
A numeric matrix containing expression data with identifiers as row names and samples as columns. |
duplicate_mode |
A string specifying the approach to handle duplicates:
|
A numeric matrix with unique identifiers as row names. The number of rows will be equal to the number of unique identifiers in the input matrix.
# Create example matrix with duplicate identifiers mat <- matrix(1:12, nrow = 4, ncol = 3) rownames(mat) <- c("ID1", "ID2", "ID1", "ID3") colnames(mat) <- c("Sample1", "Sample2", "Sample3") # View original matrix print(mat) # Handle duplicates by keeping rows with maximum median (default) result1 <- handle_duplicates(mat) print(result1) # Handle duplicates by merging rows result2 <- handle_duplicates(mat, duplicate_mode = "merge") print(result2)
# Create example matrix with duplicate identifiers mat <- matrix(1:12, nrow = 4, ncol = 3) rownames(mat) <- c("ID1", "ID2", "ID1", "ID3") colnames(mat) <- c("Sample1", "Sample2", "Sample3") # View original matrix print(mat) # Handle duplicates by keeping rows with maximum median (default) result1 <- handle_duplicates(mat) print(result1) # Handle duplicates by merging rows result2 <- handle_duplicates(mat, duplicate_mode = "merge") print(result2)
Imputes missing values in an expression matrix using various methods, including simple replacement with the lowest non-zero value or more sophisticated imputation techniques via the MsCoreUtils package.
handle_missing_values(data, imputation_mode = "lowest_value", ...)
handle_missing_values(data, imputation_mode = "lowest_value", ...)
data |
A numeric matrix containing the data to be processed, with identifiers as row names and samples as columns. |
imputation_mode |
A string specifying the imputation method to use:
|
... |
Additional arguments passed to |
A matrix with the same dimensions as the input, but with missing values imputed according to the specified method.
impute_matrix
for detailed description of
the imputation methods
# Create example matrix with missing values mat <- matrix(c(1.2, 3.4, NA, 5.6, NA, 7.8, 9.0, 2.1, 4.3), nrow = 3, ncol = 3) rownames(mat) <- c("Protein1", "Protein2", "Protein3") colnames(mat) <- c("Sample1", "Sample2", "Sample3") # View original matrix print(mat) # Impute missing values with lowest non-zero value (default) result1 <- handle_missing_values(mat) print(result1) # Impute missing values using k-nearest neighbors ## Not run: # Requires the 'impute' package result2 <- handle_missing_values(mat, imputation_mode = "knn") print(result2) ## End(Not run)
# Create example matrix with missing values mat <- matrix(c(1.2, 3.4, NA, 5.6, NA, 7.8, 9.0, 2.1, 4.3), nrow = 3, ncol = 3) rownames(mat) <- c("Protein1", "Protein2", "Protein3") colnames(mat) <- c("Sample1", "Sample2", "Sample3") # View original matrix print(mat) # Impute missing values with lowest non-zero value (default) result1 <- handle_missing_values(mat) print(result1) # Impute missing values using k-nearest neighbors ## Not run: # Requires the 'impute' package result2 <- handle_missing_values(mat, imputation_mode = "knn") print(result2) ## End(Not run)
Scales protein abundance data by optionally converting from log2 scale to linear scale and/or applying TPM-like normalization for proteomics data comparison.
handle_scaling(data, unlog = FALSE, tpm = FALSE)
handle_scaling(data, unlog = FALSE, tpm = FALSE)
data |
A numeric matrix containing protein abundance data with identifiers as row names and samples as columns. |
unlog |
A logical value indicating whether to unlog the data (convert from log2 scale). Default is FALSE. |
tpm |
A logical value indicating whether to apply TPM-like normalization (adapting Transcripts Per Million for proteomics). Default is FALSE. |
This function combines the functionality of
unlog2_data
and convert_to_tpm
to provide a
flexible way to handle common scaling operations for proteomics data.
A numeric matrix with the scaled protein abundance data according to the specified options.
unlog2_data
for just converting from log2 scale,
convert_to_tpm
for just applying TPM-like normalization
# Create log2-transformed protein abundance matrix log2_mat <- matrix(rnorm(12, mean = 10, sd = 2), nrow = 4, ncol = 3) rownames(log2_mat) <- paste0("Protein", 1:4) colnames(log2_mat) <- paste0("Sample", 1:3) # Convert from log2 to linear scale only linear_mat <- handle_scaling(log2_mat, unlog = TRUE, tpm = FALSE) # Convert from log2 to linear scale and then apply TPM-like normalization tpm_mat <- handle_scaling(log2_mat, unlog = TRUE, tpm = TRUE) # Apply TPM-like normalization to already linear protein abundance data linear_data <- matrix(abs(rnorm(12, mean = 500, sd = 200)), nrow = 4, ncol = 3) tpm_only <- handle_scaling(linear_data, unlog = FALSE, tpm = TRUE)
# Create log2-transformed protein abundance matrix log2_mat <- matrix(rnorm(12, mean = 10, sd = 2), nrow = 4, ncol = 3) rownames(log2_mat) <- paste0("Protein", 1:4) colnames(log2_mat) <- paste0("Sample", 1:3) # Convert from log2 to linear scale only linear_mat <- handle_scaling(log2_mat, unlog = TRUE, tpm = FALSE) # Convert from log2 to linear scale and then apply TPM-like normalization tpm_mat <- handle_scaling(log2_mat, unlog = TRUE, tpm = TRUE) # Apply TPM-like normalization to already linear protein abundance data linear_data <- matrix(abs(rnorm(12, mean = 500, sd = 200)), nrow = 4, ncol = 3) tpm_only <- handle_scaling(linear_data, unlog = FALSE, tpm = TRUE)
Maps a character vector of column names to predefined cell type categories by matching against patterns, with optional regex support.
map_cell_groups( column_names, mapping_rules, default_group = "Unknown", verbose = FALSE, use_regex = TRUE )
map_cell_groups( column_names, mapping_rules, default_group = "Unknown", verbose = FALSE, use_regex = TRUE )
column_names |
A character vector of column names to be mapped to cell type groups. |
mapping_rules |
A named list where names are the target cell type groups and values are character vectors of patterns that identify columns belonging to each group. |
default_group |
A string that represents the default group name for columns not matching any pattern. Default is "Unknown". |
verbose |
Logical. If TRUE, prints mapping process messages showing which columns were mapped to which groups and which columns remained unmapped. Default is FALSE. |
use_regex |
Logical. If TRUE (default), treats patterns in mapping_rules as regular expressions. If FALSE, performs exact string matching. |
This function iterates through the mapping_rules
list and
attempts to match each column name against the patterns for each cell type
group. The first matching group in the order of the list will be assigned.
When use_regex = TRUE
, patterns are treated as regular expressions
and matching is case-insensitive. When use_regex = FALSE
, exact
string matching is performed, which is case-sensitive.
A character vector with the same length as column_names
,
containing the mapped cell type group for each column name. Columns that
don't match any pattern will be assigned the default_group
.
# Example column names from a dataset cols <- c("CD8_T_cell", "T_cell", "B_cell", "NK_cell", "Monocyte", "Unknown_cell") # Define simple mapping rules for cell types mapping <- list( "T_cell" = c("CD8_T_cell", "T_cell"), "B_cell" = c("B_cell"), "NK_cell" = c("NK_cell"), "Monocyte" = c("Monocyte") ) # Map column names to cell types using exact matching cell_types <- map_cell_groups(cols, mapping, use_regex = FALSE) print(data.frame(column = cols, cell_type = cell_types)) # Define mapping rules using regex patterns mapping_regex <- list( "T_cell" = c("CD8.*", "T_cell"), "B_cell" = c("B[-_]?cell"), "NK_cell" = c("NK[-_]?cell"), "Monocyte" = c("Mono.*") ) # Map using regex patterns (default) cell_types_regex <- map_cell_groups(cols, mapping_regex) print(data.frame(column = cols, cell_type = cell_types_regex))
# Example column names from a dataset cols <- c("CD8_T_cell", "T_cell", "B_cell", "NK_cell", "Monocyte", "Unknown_cell") # Define simple mapping rules for cell types mapping <- list( "T_cell" = c("CD8_T_cell", "T_cell"), "B_cell" = c("B_cell"), "NK_cell" = c("NK_cell"), "Monocyte" = c("Monocyte") ) # Map column names to cell types using exact matching cell_types <- map_cell_groups(cols, mapping, use_regex = FALSE) print(data.frame(column = cols, cell_type = cell_types)) # Define mapping rules using regex patterns mapping_regex <- list( "T_cell" = c("CD8.*", "T_cell"), "B_cell" = c("B[-_]?cell"), "NK_cell" = c("NK[-_]?cell"), "Monocyte" = c("Mono.*") ) # Map using regex patterns (default) cell_types_regex <- map_cell_groups(cols, mapping_regex) print(data.frame(column = cols, cell_type = cell_types_regex))
This function simulates bulk proteome data by mixing bulk sample measurements.
simulate_data( data, cell_types, seed = NULL, ncells = 100, nsamples = 100, filter_genes = TRUE, scenario = "random", whitelist = NULL, blacklist = NULL )
simulate_data( data, cell_types, seed = NULL, ncells = 100, nsamples = 100, filter_genes = TRUE, scenario = "random", whitelist = NULL, blacklist = NULL )
data |
A numeric matrix containing protein abundance data with protein identifiers as row names and samples as columns. |
cell_types |
A character vector indicating the cell type associated with
each column in the input matrix. Must have the same length as the number of
columns in |
seed |
An integer used as random seed for reproducibility. Default is NULL (no seed). |
ncells |
Integer specifying the number of cells to use for each simulated bulk sample. |
nsamples |
Integer specifying the number of bulk samples to simulate. Default is 100. |
filter_genes |
Logical; whether to filter out proteins/genes with low expression before simulation. Default is TRUE. |
scenario |
String specifying the simulation scenario:
|
whitelist |
Optional character vector of cell types to include in the simulation. If provided, only these cell types will be used. Default is NULL (use all). |
blacklist |
Optional character vector of cell types to exclude from the simulation. Default is NULL (exclude none). |
This function uses the SimBu package to generate synthetic bulk samples by artifically mixing samples.
A list containing two matrices:
simulated_data
: Matrix of simulated bulk protein abundance data
(proteins as rows, simulated samples as columns)
cell_fractions
: Matrix of cell type fractions used for each simulation
(cell types as rows, simulated samples as columns)
# Create example data cell_data <- matrix(abs(rnorm(1500, mean = 500, sd = 200)), nrow = 100, ncol = 15) rownames(cell_data) <- paste0("Protein", 1:100) colnames(cell_data) <- paste0("Cell", 1:15) # Define cell types cell_types <- rep(c("T_cell", "B_cell", "Monocyte"), each = 5) # Run simulation ## Not run: sim_results <- simulate_data( data = cell_data, cell_types = cell_types, seed = 42, nsamples = 20, scenario = "random" ) dim(sim_results$simulated_data) dim(sim_results$cell_fractions) ## End(Not run)
# Create example data cell_data <- matrix(abs(rnorm(1500, mean = 500, sd = 200)), nrow = 100, ncol = 15) rownames(cell_data) <- paste0("Protein", 1:100) colnames(cell_data) <- paste0("Cell", 1:15) # Define cell types cell_types <- rep(c("T_cell", "B_cell", "Monocyte"), each = 5) # Run simulation ## Not run: sim_results <- simulate_data( data = cell_data, cell_types = cell_types, seed = 42, nsamples = 20, scenario = "random" ) dim(sim_results$simulated_data) dim(sim_results$cell_fractions) ## End(Not run)
Transforms log2-transformed expression data back to linear scale by calculating 2^x for each value in the input matrix.
unlog2_data(data)
unlog2_data(data)
data |
A numeric matrix containing log2-transformed data with identifiers as row names and samples as columns. |
This function can be used to convert log2-transformed proteomics or gene expression data back to their original scale for downstream analyses that require linear values. It preserves the row and column names of the input matrix.
A numeric matrix with the same dimensions as the input, with values converted to linear scale (2^x).
# Create log2-transformed data matrix log2_mat <- matrix(rnorm(12, mean = 10, sd = 2), nrow = 4, ncol = 3) rownames(log2_mat) <- paste0("Protein", 1:4) colnames(log2_mat) <- paste0("Sample", 1:3) # View original log2 values print(log2_mat) # Convert to linear scale linear_mat <- unlog2_data(log2_mat) print(linear_mat)
# Create log2-transformed data matrix log2_mat <- matrix(rnorm(12, mean = 10, sd = 2), nrow = 4, ncol = 3) rownames(log2_mat) <- paste0("Protein", 1:4) colnames(log2_mat) <- paste0("Sample", 1:3) # View original log2 values print(log2_mat) # Convert to linear scale linear_mat <- unlog2_data(log2_mat) print(linear_mat)
This function checks and updates gene symbols in a matrix's row names to ensure they conform to the current approved HGNC (HUGO Gene Nomenclature Committee) standards. Non-standard or outdated gene symbols are replaced with their current official symbols.
update_gene_symbols(data, verbose = FALSE)
update_gene_symbols(data, verbose = FALSE)
data |
A numeric matrix containing gene expression or protein abundance data with gene identifiers as row names. |
verbose |
A logical value indicating whether to print detailed messages about the number of approved, non-approved, and unmappable gene symbols. Default is FALSE. |
The function uses the HGNChelper
package to check and standardize
gene symbols based on a predefined gene symbols map. This is important for
ensuring consistency in gene naming across datasets and avoiding issues
with outdated or non-standard gene symbols.
A matrix with updated gene symbols as row names. Rows with unmappable symbols (where no suggested symbol could be found) are removed from the output.