Running Digital Cytometry¶
All digital cytometry in TumorDecon is done with the tumor_deconvolve()
function:
-
tumor_deconvolve
(mixture_data, method, patient_IDs='ALL', sig_matrix=None, up_genes=None, down_genes=None, args={})¶ Given a digital cytometry method, cell signature, and a mixture (RNA Sequence) dataset, return the estimated frequencies/ranks of each cell type in each patient.
Parameters: - mixture_data – Pandas DataFrame of rna gene expression data. Rows are genes (indexed by ‘Hugo_Symbol’) and columns are patients
- method – Method for tumor deconvolution. Must be either ‘CIBERSORT’, ‘DeconRNASeq’, ‘ssGSEA’, or ‘SingScore’
- patient_IDs – Either a list of patient IDs to run digital cytometry on, or the string ‘ALL’ (to run for all patients). The default is ‘ALL’
- sig_matrix – Pandas DataFrame of signature gene expression values for given cell types. Rows are genes (indexed by ‘Hugo_Symbol’) and columns are cell types * Required for ‘CIBERSORT’ and ‘DeconRNASeq’ methods * (ignored for ‘ssGSEA’ and ‘SingScore’ methods)
- up_genes – Dictionary. Keys are cell types, vaues are a list of up-regulated genes (given by Hugo Symbol) for that cell type * Required for ‘ssGSEA’ and ‘SingScore’ methods * (ignored for ‘CIBERSORT’ and ‘DeconRNASeq’ methods)
- down_genes – Dictionary. Keys are cell types, vaues are a list of down-regulated genes (given by Hugo Symbol) for that cell type * Optional for ‘SingScore’ method * (ignored for ‘CIBERSORT’, ‘DeconRNASeq’, and ‘ssGSEA’ methods)
- args – Dictionary. Additional arguments to pass on to the given method. For a list of valid arguments for each method, see the subsections below.
Returns: Pandas DataFrame. Frequencies/scores for each specified patient and cell type in the signature matrix, as solved for using the specified method
Example 1: Run digital cytometry on all patients in the RNA mixture data, using the cibersort method, and save the output to a csv file.
Since CiberSort is a linear method, we need to pass in a signature matrix for the cell_signature
argument, in the form of a Pandas DataFrame. Using the read_sig_file()
function with no argument will read in LM22 by default.
>>> rna = td.read_rna_file('uvm_tcga_pan_can_atlas_2018/data_RNA_Seq_v2_expression_median.txt')
>>> sig = td.read_sig_file()
>>> freqs = td.tumor_deconvolve(rna, 'cibersort', sig_matrix=sig)
Running CiberSort...
CiberSort has completed!
>>> freqs.to_csv("cibersort_results.csv", index_label='Patient_ID')
Example 2: Run digital cytometry on just a subset of patients in the RNA mixture data, using the ssGSEA method. Then save output to a cvs file:
Since ssGSEA is a rank-based method, we need to pass in a list for the up_genes
argument. Using the read_geneset()
function with no argument will read in the default up-regulated gene set (derived from single-cell reference profiles of LM22, in Le et al. (2020)). We also define the list of patients we want to calculate scores for in a variable named patient_subset
:
>>> rna = td.read_rna_file('uvm_tcga_pan_can_atlas_2018/data_RNA_Seq_v2_expression_median.txt')
>>> up_reg_genes = td.read_geneset()
>>> patient_subset = ['TCGA-RZ-AB0B-01', 'TCGA-V3-A9ZX-01', 'TCGA-V3-A9ZY-01', 'TCGA-V4-A9E5-01', \
'TCGA-V4-A9E7-01', 'TCGA-V4-A9E8-01', 'TCGA-V4-A9E9-01', 'TCGA-V4-A9EA-01', \
'TCGA-V4-A9EC-01', 'TCGA-V4-A9ED-01', 'TCGA-V4-A9EE-01', 'TCGA-V4-A9EF-01']
>>> ssgsea_scores = td.tumor_deconvolve(rna, 'ssGSEA', patient_IDs=patient_subset, up_genes=up_reg_genes, args={'print_progress':True})
Running ssGSEA...
TCGA-RZ-AB0B-01
TCGA-V3-A9ZX-01
TCGA-V3-A9ZY-01
TCGA-V4-A9E5-01
TCGA-V4-A9E7-01
TCGA-V4-A9E8-01
TCGA-V4-A9E9-01
TCGA-V4-A9EA-01
TCGA-V4-A9EC-01
TCGA-V4-A9ED-01
TCGA-V4-A9EE-01
TCGA-V4-A9EF-01
ssGSEA has completed!
>>> ssgsea_scores.to_csv("ssgsea_results.csv", index_label='Patient_ID')
Cibersort¶
The Cibersort method uses NuSVR from sklearn to solve for the cell type frequencies. Since Cibersort is a linear method, the sig_matrix
argument is required. If a value for up_genes
is also declared, it will be ignored. When using this method, the following additional arguments may be passed to the function via the args
Dictionary:
- print_progress: Boolean, whether to print patient ID as cibersort iterates through. Default is False
- scaling: String, must be either ‘None’, ‘zscore’, or ‘minmax’. Determines how to scale the mixture data and signature matrix before applying CIBERSORT. Default is MinMax scaling
- scaling_axis: 0 or 1. Whether to scale mixture data and signature matrix by normalizing each column (patient/celltype) individually (scaling_axis=0) or each row (gene) individually (scaling_axis=1). Default is 0.
- nu: see sklearn’s documentation for NuSVR. If nu=’best’ (default), cibersort will use the nu in the set {0.25, 0.5, 0.75} that minimizes root mean square b/w data and prediction. If nu a float, will compute cibersort for that specific nu
- C: see sklearn’s documentation for NuSVR
- kernel:: see sklearn’s documentation for NuSVR
- shrinking: see sklearn’s documentation for NuSVR
The output for this method is a Pandas DataFrame (num_celltypes x num_patients) that contains cell type frequencies for each patient in the given ‘patient_IDs’ list. Rows are indexed by cell type, column headings are patient IDs
Example 3: Run CiberSort on all patients in the RNA mixture data, using the custom signature matrix “kmeans_signature_matrix_qval.txt”, and specifying an explicit scaling and value of nu
.
>>> rna = td.read_rna_file('uvm_tcga_pan_can_atlas_2018/data_mrna_seq_v2_rsem.txt')
>>> sig = td.read_sig_file("kmeans_signature_matrix_qval.txt", geneID='Ensembl_Gene_ID')
>>> freqs = td.tumor_deconvolve(rna, 'cibersort', sig_matrix=sig, args={'nu':0.5, 'scaling':'minmax'})
Running CiberSort...
CiberSort has completed!
>>>print(freqs)
Patient_ID CD8_subtype_1 CD8_subtype_2 CD4_subtype_1 ... Fibroblast_subtype_2 Neutrophils_subtype_1 Neutrophils_subtype_2
TCGA-RZ-AB0B-01 0.000000 0.340832 0.179790 ... 0.012600 0.000000 0.000436
TCGA-V3-A9ZX-01 0.028684 0.146691 0.152585 ... 0.010037 0.000000 0.006936
TCGA-V3-A9ZY-01 0.000000 0.488760 0.108457 ... 0.001606 0.000000 0.000814
TCGA-V4-A9E5-01 0.000000 0.399474 0.190793 ... 0.003778 0.000000 0.000999
TCGA-V4-A9E7-01 0.000000 0.455365 0.217109 ... 0.000000 0.000228 0.000000
... ... ... ... ... ... ... ... ...
TCGA-YZ-A980-01 0.009343 0.284902 0.178078 ... 0.015250 0.000000 0.000000
TCGA-YZ-A982-01 0.000000 0.269736 0.150146 ... 0.000000 0.000000 0.006947
TCGA-YZ-A983-01 0.000000 0.592702 0.109028 ... 0.000000 0.000000 0.002387
TCGA-YZ-A984-01 0.000000 0.353790 0.140766 ... 0.001984 0.011312 0.000000
TCGA-YZ-A985-01 0.000000 0.408330 0.153663 ... 0.000720 0.000000 0.003510
[80 rows x 16 columns]
DeconRNASeq¶
The DeconRNASeq method uses the Optimize method from SciPy to solve for the cell type frequencies. Since DeconRNASeq is a linear method, the sig_matrix
argument is required. If a value for up_genes
is also declared, it will be ignored. When using this method, the following additional arguments may be passed to the function via the args
Dictionary:
- check_sig: Boolean. Whether or not to check the condition number of the signature matrix before solving
- scaling: String. Must be either ‘None’, ‘zscore’, or ‘minmax’. Determines how to scale the signature matrix and mixture data before solving. Default is MinMax scaling.
- scaling_axis: 0 or 1. Whether to scale mixture data and signature matrix by normalizing each column (celltype/patient) separately (scaling_axis=0) or each row (gene) separately (scaling_axis=1). Default is 0.
- formulation: String. Must be either ‘qp’, ‘ridge’, or ‘lasso’. Determines how to formulate the optimization problem:
- qp: solve a QP (quadratic programming) problem min_x||Ax-b||^2 with strict condition x.sum()=1, xi>=0
- ridge: solve a ridge regression problem min_x{||Ax-b||^2 + ||x||^2} with condition xi>=0
- lasso: solve a lasso regression problem min_x{||Ax-b||^2 + Sum_i{|x_i|} with condition xi>=0
- reg_constant: Float. Regularization constant for lasso/ridge regression. Ignored if formulation set to ‘qp’
- print_progress: Boolean. Whether or not to print the solver (scipy.optimize) results
The output for this method is a Pandas DataFrame (num_celltypes x num_patients) that contains cell type frequencies for each patient in the given ‘patient_IDs’ list. Rows are indexed by cell type, column headings are patient IDs
Example 4: Run DeconRNASeq on all patients in the RNA mixture data, using the default LM22 signature matrix. To evaluate our confidence in our solution, we explicitly check the condition number of the signature matrix, and print the solver results:
>>> rna = td.read_rna_file('uvm_tcga_pan_can_atlas_2018/data_mrna_seq_v2_rsem.txt')
>>> sig = td.read_sig_file()
>>> freqs = td.tumor_deconvolve(rna, 'deconRNAseq', sig_matrix=sig, args={'check_sig':True, 'print_progress':True})
Condition number of signature matrix = 12.43933600367927
Running DeconRNASeq...
TCGA-RZ-AB0B-01
fun: 1.768742658372559
jac: array([0.22816905, 0.2551588 , 0.20533609, 0.34062135, 0.22060739,
0.20532408, 0.20530543, 0.20500609, 0.22956975, 0.32579079,
0.21855029, 0.20513448, 0.20529102, 0.20514449, 0.2050613 ,
0.32388473, 0.20581529, 0.20504086, 0.20532118, 0.2950649 ,
0.20484114, 0.30095159])
message: 'Optimization terminated successfully.'
nfev: 243
nit: 10
njev: 10
status: 0
success: True
x: array([4.79865365e-18, 0.00000000e+00, 3.24670795e-01, 4.81395017e-17,
3.27398518e-17, 1.61659557e-02, 5.85440056e-02, 5.81593582e-02,
9.57879942e-18, 4.25466532e-17, 4.63744135e-18, 3.15186488e-02,
7.94907975e-02, 2.76273695e-01, 7.08312745e-04, 7.96339326e-17,
0.00000000e+00, 6.24736273e-02, 8.95223995e-02, 1.22308564e-17,
2.47240499e-03, 1.70412859e-17])
TCGA-V3-A9ZX-01
fun: 2.3380210562625794
jac: array([0.08302617, 0.07165793, 0.07166326, 0.09838021, 0.11366048,
0.07198009, 0.07167557, 0.07217818, 0.07183281, 0.09873474,
0.07061729, 0.08603653, 0.08129218, 0.07147855, 0.07210144,
0.07260156, 0.07217726, 0.07224679, 0.07091093, 0.16353929,
0.12585488, 0.20857742])
message: 'Optimization terminated successfully.'
nfev: 242
nit: 10
njev: 10
status: 0
success: True
x: array([0.00000000e+00, 5.13881657e-03, 2.38885934e-01, 0.00000000e+00,
0.00000000e+00, 2.61113903e-02, 1.20781286e-02, 1.09609396e-02,
6.60321613e-02, 1.75537745e-17, 2.11868602e-02, 6.54156460e-18,
4.40696071e-18, 3.96846445e-01, 5.41708212e-02, 0.00000000e+00,
1.21155050e-01, 6.53890726e-03, 4.08945470e-02, 1.69936550e-17,
1.54190292e-18, 2.90338837e-17])
[ ... output truncated for space ... ]
DeconRNASeq has completed!
>>> print(freqs)
Patient_ID B cells naive B cells memory Plasma cells T cells CD8 ... Mast cells resting Mast cells activated Eosinophils Neutrophils
TCGA-RZ-AB0B-01 4.798654e-18 0.000000e+00 0.324671 4.813950e-17 ... 8.952240e-02 1.223086e-17 2.472405e-03 1.704129e-17
TCGA-V3-A9ZX-01 0.000000e+00 5.138817e-03 0.238886 0.000000e+00 ... 4.089455e-02 1.699366e-17 1.541903e-18 2.903388e-17
TCGA-V3-A9ZY-01 0.000000e+00 3.518049e-17 0.283670 0.000000e+00 ... 6.147303e-02 0.000000e+00 2.351062e-18 0.000000e+00
TCGA-V4-A9E5-01 0.000000e+00 0.000000e+00 0.258718 2.794810e-17 ... 7.591108e-02 0.000000e+00 3.969250e-18 1.350584e-17
TCGA-V4-A9E7-01 1.937510e-17 1.667966e-03 0.258230 0.000000e+00 ... 6.860309e-02 7.706740e-18 4.057952e-18 2.948868e-17
... ... ... ... ... ... ... ... ... ...
TCGA-YZ-A980-01 2.581715e-17 0.000000e+00 0.272212 5.280067e-18 ... 6.036439e-02 0.000000e+00 0.000000e+00 0.000000e+00
TCGA-YZ-A982-01 1.078221e-17 3.272811e-18 0.283501 6.271637e-17 ... 8.536993e-02 0.000000e+00 0.000000e+00 4.242467e-17
TCGA-YZ-A983-01 4.291653e-03 0.000000e+00 0.320183 0.000000e+00 ... 8.946226e-02 5.102526e-18 9.125998e-03 0.000000e+00
TCGA-YZ-A984-01 1.928266e-17 2.356858e-03 0.258814 5.212959e-18 ... 6.631310e-02 3.597047e-17 7.543348e-18 0.000000e+00
TCGA-YZ-A985-01 0.000000e+00 1.442125e-02 0.309380 9.133074e-17 ... 5.799046e-02 2.838456e-17 0.000000e+00 0.000000e+00
[80 rows x 22 columns]
ssGSEA¶
Since ssGSEA is a rank-based method, the up_genes
argument is required. If a value for sig_matrix
is also declared, it will be ignored. When using this method, the following additional arguments may be passed to the function via the args
Dictionary:
- print_progress: Boolean. Whether to print patient ID as ssGSEA iterates through. Default is False
- alpha: Float. Weight used in ssGSEA method. Default is 1.0
- ties_method: Describes how to treat ties in rankings. See pandas.DataFrame.rank() for list of valid options.
- norm: Boolean. Whether to normalize enrichment scores by using the entire data set, as indicated by Barbie et al., 2009, online methods, pg. 2. Default is False.
The output for this method is a Pandas DataFrame (num_celltypes x num_patients) that contains ssGSEA enrichment scores for each cell type declared in ‘up_genes’ dictionary, for each patient in the given ‘patient_IDs’ list. Rows are indexed by cell type, column headings are patient IDs
Example 5: Run ssGSEA on 5 individual patients in the RNA mixture data, using the default gene set derived from LM22 and alpha = 0.5, and then normalize the results
>>> rna = td.read_rna_file('uvm_tcga_pan_can_atlas_2018/data_mrna_seq_v2_rsem.txt')
>>> up_reg_genes = td.read_geneset()
>>> patient_subset = ['TCGA-RZ-AB0B-01', 'TCGA-V3-A9ZX-01', 'TCGA-V3-A9ZY-01', 'TCGA-V4-A9E5-01', 'TCGA-V4-A9E7-01', 'TCGA-V4-A9E8-01', 'TCGA-V4-A9E9-01', 'TCGA-V4-A9EA-01']
>>> ssgsea_scores = td.tumor_deconvolve(rna, 'ssGSEA', patient_IDs=patient_subset, up_genes=up_reg_genes, args={'alpha':0.5, 'norm':True})
Running ssGSEA...
ssGSEA has completed!
>>>> print(ssgsea_scores)
Patient_ID B cells naive B cells memory Plasma cells T cells CD8 ... Mast cells resting Mast cells activated Eosinophils Neutrophils
TCGA-RZ-AB0B-01 0.327240 0.249076 0.986881 0.326747 ... 0.735592 0.490964 0.558033 0.435243
TCGA-V3-A9ZX-01 0.276838 0.220086 0.982511 0.405832 ... 0.681473 0.527347 0.565607 0.481490
TCGA-V3-A9ZY-01 0.291156 0.239979 0.898351 0.489769 ... 0.646602 0.451504 0.532254 0.477449
TCGA-V4-A9E5-01 0.279829 0.238470 0.888191 0.521628 ... 0.619921 0.447962 0.506716 0.467605
TCGA-V4-A9E7-01 0.318732 0.155308 0.946145 0.452830 ... 0.671831 0.468233 0.485592 0.453880
TCGA-V4-A9E8-01 0.263715 0.196422 0.955722 0.547990 ... 0.608545 0.456964 0.473649 0.514065
TCGA-V4-A9E9-01 0.316531 0.217946 0.912949 0.542386 ... 0.585311 0.455807 0.461873 0.447807
TCGA-V4-A9EA-01 0.382925 0.229779 0.855658 0.572038 ... 0.583329 0.438876 0.458896 0.458504
[8 rows x 22 columns]
SingScore¶
Since SingScore is a rank-based method, the up_genes
argument is required. If a value for sig_matrix
is also declared, it will be ignored.
If both the down_genes
and up_genes
arguments are defined, the tumor_deconvolve()
function will run PySingScore’s implementation of bidirectional singscore. If only up_genes
is defined, it will run PySingScore’s implementation of unidirectional singscore.
The output for this method is a Pandas DataFrame (num_celltypes x num_patients) that contains SingScore scores for each cell type declared in ‘up_genes’ dictionary, for each patient in the given ‘patient_IDs’ list. Rows are indexed by cell type, column headings are patient IDs
Example 6: Run both unidirectional and bidirectional SingScore using the default Gene Sets. To derive a down-regulated gene set, we use the td.find_up_down_genes_from_sig()
function (see Creating Gene Sets). We will create this gene set from the LM6 signature matrix, which is included in the package:
>>> rna = td.read_rna_file('uvm_tcga_pan_can_atlas_2018/data_mrna_seq_v2_rsem.txt')
>>> LM6 = td.read_sig_file('LM6.txt')
>>> up_genes_LM6, down_genes_LM6 = td.find_up_down_genes_from_sig(LM6, down_cutoff=0.4, up_cutoff=4.0)
>>> singscore_unidirectional = td.tumor_deconvolve(rna, 'singscore', up_genes=up_genes_LM6)
Running SingScore...
SingScore has completed!
>>> print(singscore_unidirectional)
B cells CD8 T cells CD4 T cells NK cells Monocytes Neutrophils
Patient_ID
TCGA-RZ-AB0B-01 -0.040879 -0.042245 -0.015696 -0.068842 -0.004216 -0.067957
TCGA-V3-A9ZX-01 -0.016157 0.006778 0.025657 -0.024595 0.040129 -0.035982
TCGA-V3-A9ZY-01 -0.047875 -0.054082 -0.005818 -0.085125 0.007932 -0.061282
TCGA-V4-A9E5-01 -0.053520 -0.070746 -0.033882 -0.102350 0.001286 -0.067848
TCGA-V4-A9E7-01 -0.077335 -0.070958 -0.043286 -0.098458 -0.035024 -0.090904
... ... ... ... ... ... ...
TCGA-YZ-A980-01 -0.027205 -0.025515 0.020872 -0.041668 0.000913 -0.065293
TCGA-YZ-A982-01 -0.061571 -0.051486 -0.021256 -0.073456 0.018704 -0.058435
TCGA-YZ-A983-01 -0.058798 -0.065018 -0.035282 -0.082523 -0.006307 -0.065574
TCGA-YZ-A984-01 -0.068884 -0.046064 -0.007893 -0.079870 -0.008476 -0.075183
TCGA-YZ-A985-01 -0.072839 -0.054141 -0.018756 -0.089481 -0.011167 -0.081205
[80 rows x 6 columns]
>>> singscore_bidirectional = td.tumor_deconvolve(rna, 'singscore', up_genes=up_genes_LM6, down_genes=down_genes_LM6)
Running SingScore...
SingScore has completed!
>>> print(singscore_bidirectional)
B cells CD8 T cells CD4 T cells NK cells Monocytes Neutrophils
Patient_ID
TCGA-RZ-AB0B-01 0.478281 0.489102 0.489548 0.485272 0.489559 0.484873
TCGA-V3-A9ZX-01 0.488396 0.496139 0.495398 0.494335 0.491098 0.485794
TCGA-V3-A9ZY-01 0.463718 0.468867 0.474098 0.466825 0.482648 0.472018
TCGA-V4-A9E5-01 0.459967 0.460788 0.466172 0.457230 0.478536 0.463606
TCGA-V4-A9E7-01 0.457835 0.465507 0.465669 0.463028 0.479901 0.471592
... ... ... ... ... ... ...
TCGA-YZ-A980-01 0.485412 0.491924 0.493563 0.491461 0.495019 0.491583
TCGA-YZ-A982-01 0.455890 0.470598 0.471025 0.475132 0.484155 0.469897
TCGA-YZ-A983-01 0.470838 0.479412 0.478998 0.482687 0.488062 0.481390
TCGA-YZ-A984-01 0.460155 0.471427 0.473025 0.463111 0.480844 0.466355
TCGA-YZ-A985-01 0.443784 0.463111 0.467764 0.459697 0.470359 0.456920
[80 rows x 6 columns]