Signature Matrix Creation

batch_correct_datasets(data_loc, cell_file_dict, outfile="batch_corrected_datasets.txt")

Performs batch correction on raw data files and saves the combined output to a single text file. All raw data files must be stored in the same folder.

Parameters:
  • data_loc – String. Path to folder that holds all data files
  • cell_file_dict – Dictionary. Keys = Cell Types, Values = lists of all raw files containing data for the given cell type
  • outfile – String. Name to save the batch corrected datasets file under. Default is “batch_corrected_datasets.txt”
Returns:

None. Batch corrected datasets are written to a text file
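Because every file must sit in the same folder, it can help to check for missing files before starting a long batch-correction run. The helper below is a minimal sketch and is not part of TumorDecon: find_missing_files is a hypothetical name, and it assumes the values in cell_file_dict are file names relative to data_loc, as in the example further down.

```python
import os


def find_missing_files(data_loc, cell_file_dict):
    """Return {cell_type: [file names not found in data_loc]}.

    Hypothetical pre-flight helper; assumes each entry in cell_file_dict
    is a file name relative to the data_loc folder.
    """
    missing = {}
    for cell_type, file_names in cell_file_dict.items():
        absent = [name for name in file_names
                  if not os.path.isfile(os.path.join(data_loc, name))]
        if absent:
            missing[cell_type] = absent
    return missing


# Usage: report anything that still needs to be downloaded.
problems = find_missing_files("datafolder", {"NK": ["GSE107011_Processed_data_TPM2.txt"]})
for cell_type, names in problems.items():
    print(f"{cell_type}: missing {', '.join(names)}")
```

An empty dictionary means every listed file was found in the folder.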

create_signature_matrix(infile, cell_types, clustered=False, max_clusters=10, intermfile=None, outfile="kmeans_signature_matrix_qval.txt")

Given a batch corrected dataset as input (such as one created with batch_correct_datasets()), this function clusters each cell type with k-means (choosing the number of clusters with the silhouette method), runs differential expression analysis to find the genes that are significantly expressed in each cluster (using adjusted p-values), and writes the resulting signature matrix to a file.

Parameters:
  • infile – String. Path to text file containing the batch corrected datasets
  • cell_types – List of cell types (strings) to include in signature matrix
  • clustered – Boolean. Whether or not the infile is pre-clustered
  • max_clusters – Integer. Maximum value of K to try for k-means clustering. Ignored if clustered=True
  • intermfile – String. Where to save the intermediate clustered batch corrected datasets. Ignored if clustered=True. By default, the name is derived from infile with the suffix "_clustered.txt" (e.g. "batch_corrected_sample_data_clustered.txt")
  • outfile – String. Name to save the finished signature matrix file under. Default is “kmeans_signature_matrix_qval.txt”
Returns:

None. The signature matrix and (if clustered=False and intermfile is not None) the clustered batch corrected datasets are written to files
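The description above mentions selecting genes by adjusted p-value. The source does not state which adjustment create_signature_matrix() applies, so the sketch below uses the Benjamini-Hochberg procedure purely as an illustrative assumption. The function name, the gene labels, and the 0.03 threshold are all hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Usage with made-up values: keep genes whose adjusted p-value is below alpha.
genes = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
pvals = [0.01, 0.04, 0.03, 0.005]
alpha = 0.03
qvals = benjamini_hochberg(pvals)
significant = [g for g, q in zip(genes, qvals) if q < alpha]
print(significant)
```

Filtering on the adjusted values rather than the raw p-values limits the expected share of false positives among the selected genes when many genes are tested at once.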

Example: Create a custom signature matrix from raw data files. The data used in this example are single-cell profiles from the Gene Expression Omnibus (GEO) repository. They can be downloaded from the NCBI website or from our GitHub page at https://github.com/ShahriyariLab/TumorDecon/tree/master/TumorDecon/data/sig_matrix_tutorial

To begin, we define a dictionary where the keys are the cell types we wish to consider, and the values are all the raw files that contain data for that cell type:

>>> cell_file_dict = {
...     'CD4': ['GSE107011_Processed_data_TPM2.txt', 'ensembl_version_GSE97861.txt', 'ensembl_version_GSE97862.txt',
...             'ensembl_version_GSE113891.txt', 'ensembl_version_GSE114407.txt', 'ensembl_version_GSE115978.txt'],
...     'CD8': ['ensembl_version_GSE98638.txt', 'ensembl_version_GSE114407.txt', 'GSE107011_Processed_data_TPM2.txt'],
...     'B_': ['GSE107011_Processed_data_TPM2.txt', 'ensembl_version_GSE114407.txt', 'ensembl_version_GSE115978.txt'],
...     'mono': ['ensembl_version_GSE114407.txt', 'GSE107011_Processed_data_TPM2.txt'],
...     'NK': ['GSE107011_Processed_data_TPM2.txt', 'ensembl_version_GSE115978.txt'],
...     'Endo': ['ensembl_version_GSE102767.txt', 'ensembl_version_GSE113839.txt', 'ensembl_version_GSE115978.txt'],
...     'Fibro': ['ensembl_version_GSE113839.txt', 'GSE109448_rnaseq_gene_tpm.tsv', 'GSE109449_singlecell_rnaseq_gene_tpm.txt'],
...     'Neutro': ['GSE107011_Processed_data_TPM2.txt'],
... }

Next, remove batch effects with the batch_correct_datasets() function. In the following example, we have saved all the GEO files to “datafolder” and we wish to save the batch corrected data under the name “batch_corrected_sample_data.txt”:

>>> import TumorDecon as td
>>> td.batch_correct_datasets("datafolder", cell_file_dict, outfile="batch_corrected_sample_data.txt")
Performing batch correction...
Found 6 batches.
Adjusting for 0 covariate(s) or covariate level(s).
Standardizing Data across genes.
Fitting L/S model and finding priors.
Finding parametric adjustments.
Adjusting the Data
Found 3 batches.
Adjusting for 0 covariate(s) or covariate level(s).
Standardizing Data across genes.
Fitting L/S model and finding priors.
Finding parametric adjustments.
Adjusting the Data

[ ... output truncated for space ... ]

Found 3 batches.
Adjusting for 0 covariate(s) or covariate level(s).
Standardizing Data across genes.
Fitting L/S model and finding priors.
Finding parametric adjustments.
Adjusting the Data
Saving batch corrected datasets to batch_corrected_sample_data.txt...

Once batch effects are removed, use the data to create a signature matrix:

>>> cell_types_to_include = ["CD8", "CD4", "B_cell", "NK", "mono", "Endothelial", "Fibroblast", "Neutrophils"]
>>> td.create_signature_matrix("batch_corrected_sample_data.txt", cell_types_to_include, outfile="kmeans_signature_matrix_qval.txt")
Reading batch-corrected dataset file batch_corrected_sample_data.txt...
Clustering batch-corrected datasets...
CD8: Finding optimal K for K-means (max # of clusters = 10)
Optimal K = 2
CD4: Finding optimal K for K-means (max # of clusters = 10)
Optimal K = 2
B_cell: Finding optimal K for K-means (max # of clusters = 10)
Optimal K = 2
NK: Finding optimal K for K-means (max # of clusters = 10)
Optimal K = 2
mono: Finding optimal K for K-means (max # of clusters = 10)
Optimal K = 2
Endothelial: Finding optimal K for K-means (max # of clusters = 10)
Optimal K = 2
Fibroblast: Finding optimal K for K-means (max # of clusters = 10)
Optimal K = 2
Neutrophils: Finding optimal K for K-means (max # of clusters = 10)
Optimal K = 2
Saving clustered data sets to batch_corrected_sample_data_clustered.txt...
Generating signature matrix from clustered batch-corrected datasets...
Saving signature matrix to kmeans_signature_matrix_qval.txt...
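The "Finding optimal K" lines in the log above come from the silhouette-based search that max_clusters bounds. The sketch below shows the general idea in plain Python; it is not TumorDecon's implementation. The toy k-means, its farthest-point seeding, and all function names are assumptions chosen to keep the example deterministic and self-contained.

```python
import math


def farthest_point_init(points, k):
    """Deterministic seeding: first point, then the point farthest from all chosen centers."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    return centers


def kmeans(points, k, n_iter=100):
    """Very small Lloyd-style k-means; returns one cluster label per point."""
    centers = farthest_point_init(points, k)
    labels = []
    for _ in range(n_iter):
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # Keep the old center if a cluster ends up empty.
            new_centers.append(
                tuple(sum(x) / len(members) for x in zip(*members)) if members else centers[c]
            )
        if new_centers == centers:
            break
        centers = new_centers
    return labels


def silhouette_score(points, labels):
    """Mean silhouette coefficient; singleton clusters contribute 0."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        return -1.0  # a single cluster has no meaningful silhouette
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)  # mean intra-cluster distance
        b = min(sum(math.dist(p, q) for q in other) / len(other)  # nearest other cluster
                for other_lab, other in clusters.items() if other_lab != lab)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


def choose_k(points, max_clusters=10):
    """Try K = 2..max_clusters and return (best K, {K: silhouette score})."""
    if len(points) < 3:
        raise ValueError("need at least 3 points to compare cluster counts")
    scores = {}
    for k in range(2, min(max_clusters, len(points) - 1) + 1):
        scores[k] = silhouette_score(points, kmeans(points, k))
    return max(scores, key=scores.get), scores


# Toy data: three tight, well-separated groups of four 2-D points each.
offsets = [(0, 0), (1, 0), (0, 1), (1, 1)]
toy_points = [(cx + dx, cy + dy) for cx, cy in [(0, 0), (10, 0), (0, 10)] for dx, dy in offsets]
best_k, all_scores = choose_k(toy_points, max_clusters=6)
print("Optimal K =", best_k)
```

A larger max_clusters widens the search at the cost of one extra clustering run per candidate K, which is part of why the clustering step can dominate the runtime on large inputs.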

By default, create_signature_matrix() expects a non-clustered data file as input. However, if your data file is large, running the two steps (clustering and signature matrix generation) in sequence can take a very long time or exhaust memory. To avoid this, you can instead pass an already-clustered data file (saved as an intermediate step in a previous run) together with the argument clustered=True, which skips the clustering step:

>>> td.create_signature_matrix("batch_corrected_sample_data_clustered.txt", cell_types_to_include, clustered=True, outfile="kmeans_signature_matrix_qval_2.txt")
Reading batch-corrected dataset file batch_corrected_sample_data_clustered.txt...
Generating signature matrix from clustered batch-corrected datasets...
Saving signature matrix to kmeans_signature_matrix_qval_2.txt...
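To reuse a previous clustering run automatically, a script can check whether the intermediate file already exists and choose the arguments accordingly. The helper below is a sketch only: plan_signature_matrix_call is a hypothetical name, and it assumes the intermediate file follows the naming seen in the log above (the input file's base name plus "_clustered.txt").

```python
import os


def plan_signature_matrix_call(batch_corrected_file):
    """Pick the input file and `clustered` flag for create_signature_matrix().

    Hypothetical helper; assumes the intermediate file is named
    <input base name>_clustered.txt, as in the example log.
    """
    stem, _ext = os.path.splitext(batch_corrected_file)
    clustered_file = stem + "_clustered.txt"
    if os.path.isfile(clustered_file):
        # A previous run already saved the clustered data, so skip clustering.
        return {"infile": clustered_file, "clustered": True}
    return {"infile": batch_corrected_file, "clustered": False}


plan = plan_signature_matrix_call("batch_corrected_sample_data.txt")
print(plan)
# The result could then be passed on, for example:
# td.create_signature_matrix(plan["infile"], cell_types_to_include,
#                            clustered=plan["clustered"])
```

If the batch corrected data or the list of cell types has changed since the intermediate file was written, delete the stale clustered file first so the clustering step runs again.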