.. highlight:: rst Running Digital Cytometry +++++++++++++++++++++++++ All digital cytometry in TumorDecon is done with the ``tumor_deconvolve()`` function: .. function:: tumor_deconvolve (mixture_data, method, patient_IDs='ALL', sig_matrix=None, up_genes=None, down_genes=None, args={}) Given a digital cytometry method, cell signature, and a mixture (RNA Sequence) dataset, return the estimated frequencies/ranks of each cell type in each patient. :param mixture_data: Pandas DataFrame of rna gene expression data. Rows are genes (indexed by 'Hugo_Symbol') and columns are patients :param method: Method for tumor deconvolution. Must be either 'CIBERSORT', 'DeconRNASeq', 'ssGSEA', or 'SingScore' :param patient_IDs: Either a list of patient IDs to run digital cytometry on, or the string 'ALL' (to run for all patients). The default is 'ALL' :param sig_matrix: Pandas DataFrame of signature gene expression values for given cell types. Rows are genes (indexed by 'Hugo_Symbol') and columns are cell types *** Required for 'CIBERSORT' and 'DeconRNASeq' methods *** (ignored for 'ssGSEA' and 'SingScore' methods) :param up_genes: Dictionary. Keys are cell types, vaues are a list of up-regulated genes (given by Hugo Symbol) for that cell type *** Required for 'ssGSEA' and 'SingScore' methods *** (ignored for 'CIBERSORT' and 'DeconRNASeq' methods) :param down_genes: Dictionary. Keys are cell types, vaues are a list of down-regulated genes (given by Hugo Symbol) for that cell type *** Optional for 'SingScore' method *** (ignored for 'CIBERSORT', 'DeconRNASeq', and 'ssGSEA' methods) :param args: Dictionary. Additional arguments to pass on to the given method. For a list of valid arguments for each method, see the subsections below. :return: Pandas DataFrame. Frequencies/scores for each specified patient and cell type in the signature matrix, as solved for using the specified method **Example 1:** Run digital cytometry on all patients in the RNA mixture data, using the cibersort method, and save the output to a csv file. Since CiberSort is a linear method, we need to pass in a signature matrix for the ``cell_signature`` argument, in the form of a Pandas DataFrame. Using the ``read_sig_file()`` function with no argument will read in LM22 by default. :: >>> rna = td.read_rna_file('uvm_tcga_pan_can_atlas_2018/data_RNA_Seq_v2_expression_median.txt') >>> sig = td.read_sig_file() >>> freqs = td.tumor_deconvolve(rna, 'cibersort', sig_matrix=sig) Running CiberSort... CiberSort has completed! >>> freqs.to_csv("cibersort_results.csv", index_label='Patient_ID') **Example 2:** Run digital cytometry on just a subset of patients in the RNA mixture data, using the ssGSEA method. Then save output to a cvs file: Since ssGSEA is a rank-based method, we need to pass in a list for the ``up_genes`` argument. Using the ``read_geneset()`` function with no argument will read in the default up-regulated gene set (derived from single-cell reference profiles of LM22, in Le et al. (2020)). We also define the list of patients we want to calculate scores for in a variable named ``patient_subset``:: >>> rna = td.read_rna_file('uvm_tcga_pan_can_atlas_2018/data_RNA_Seq_v2_expression_median.txt') >>> up_reg_genes = td.read_geneset() >>> patient_subset = ['TCGA-RZ-AB0B-01', 'TCGA-V3-A9ZX-01', 'TCGA-V3-A9ZY-01', 'TCGA-V4-A9E5-01', \ 'TCGA-V4-A9E7-01', 'TCGA-V4-A9E8-01', 'TCGA-V4-A9E9-01', 'TCGA-V4-A9EA-01', \ 'TCGA-V4-A9EC-01', 'TCGA-V4-A9ED-01', 'TCGA-V4-A9EE-01', 'TCGA-V4-A9EF-01'] >>> ssgsea_scores = td.tumor_deconvolve(rna, 'ssGSEA', patient_IDs=patient_subset, up_genes=up_reg_genes, args={'print_progress':True}) Running ssGSEA... TCGA-RZ-AB0B-01 TCGA-V3-A9ZX-01 TCGA-V3-A9ZY-01 TCGA-V4-A9E5-01 TCGA-V4-A9E7-01 TCGA-V4-A9E8-01 TCGA-V4-A9E9-01 TCGA-V4-A9EA-01 TCGA-V4-A9EC-01 TCGA-V4-A9ED-01 TCGA-V4-A9EE-01 TCGA-V4-A9EF-01 ssGSEA has completed! >>> ssgsea_scores.to_csv("ssgsea_results.csv", index_label='Patient_ID') Cibersort ********* The Cibersort method uses NuSVR from sklearn to solve for the cell type frequencies. Since Cibersort is a linear method, the ``sig_matrix`` argument is required. If a value for ``up_genes`` is also declared, it will be ignored. When using this method, the following additional arguments may be passed to the function via the ``args`` Dictionary: - **print_progress**: Boolean, whether to print patient ID as cibersort iterates through. Default is False - **scaling**: String, must be either 'None', 'zscore', or 'minmax'. Determines how to scale the mixture data and signature matrix before applying CIBERSORT. Default is MinMax scaling - **scaling_axis**: 0 or 1. Whether to scale mixture data and signature matrix by normalizing each column (patient/celltype) individually (scaling_axis=0) or each row (gene) individually (scaling_axis=1). Default is 0. - **nu**: see `sklearn's documentation for NuSVR `_. If nu='best' (default), cibersort will use the nu in the set {0.25, 0.5, 0.75} that minimizes root mean square b/w data and prediction. If nu a float, will compute cibersort for that specific nu - C: see `sklearn's documentation for NuSVR `_ - kernel:: see `sklearn's documentation for NuSVR `_ - shrinking: see `sklearn's documentation for NuSVR `_ The output for this method is a Pandas DataFrame (num_celltypes x num_patients) that contains cell type frequencies for each patient in the given 'patient_IDs' list. Rows are indexed by cell type, column headings are patient IDs **Example 3:** Run CiberSort on all patients in the RNA mixture data, using the :ref:`custom signature matrix ` "kmeans_signature_matrix_qval.txt", and specifying an explicit scaling and value of ``nu``. :: >>> rna = td.read_rna_file('uvm_tcga_pan_can_atlas_2018/data_mrna_seq_v2_rsem.txt') >>> sig = td.read_sig_file("kmeans_signature_matrix_qval.txt", geneID='Ensembl_Gene_ID') >>> freqs = td.tumor_deconvolve(rna, 'cibersort', sig_matrix=sig, args={'nu':0.5, 'scaling':'minmax'}) Running CiberSort... CiberSort has completed! >>>print(freqs) Patient_ID CD8_subtype_1 CD8_subtype_2 CD4_subtype_1 ... Fibroblast_subtype_2 Neutrophils_subtype_1 Neutrophils_subtype_2 TCGA-RZ-AB0B-01 0.000000 0.340832 0.179790 ... 0.012600 0.000000 0.000436 TCGA-V3-A9ZX-01 0.028684 0.146691 0.152585 ... 0.010037 0.000000 0.006936 TCGA-V3-A9ZY-01 0.000000 0.488760 0.108457 ... 0.001606 0.000000 0.000814 TCGA-V4-A9E5-01 0.000000 0.399474 0.190793 ... 0.003778 0.000000 0.000999 TCGA-V4-A9E7-01 0.000000 0.455365 0.217109 ... 0.000000 0.000228 0.000000 ... ... ... ... ... ... ... ... ... TCGA-YZ-A980-01 0.009343 0.284902 0.178078 ... 0.015250 0.000000 0.000000 TCGA-YZ-A982-01 0.000000 0.269736 0.150146 ... 0.000000 0.000000 0.006947 TCGA-YZ-A983-01 0.000000 0.592702 0.109028 ... 0.000000 0.000000 0.002387 TCGA-YZ-A984-01 0.000000 0.353790 0.140766 ... 0.001984 0.011312 0.000000 TCGA-YZ-A985-01 0.000000 0.408330 0.153663 ... 0.000720 0.000000 0.003510 [80 rows x 16 columns] DeconRNASeq *********** The DeconRNASeq method uses the Optimize method from SciPy to solve for the cell type frequencies. Since DeconRNASeq is a linear method, the ``sig_matrix`` argument is required. If a value for ``up_genes`` is also declared, it will be ignored. When using this method, the following additional arguments may be passed to the function via the ``args`` Dictionary: - **check_sig**: Boolean. Whether or not to check the condition number of the signature matrix before solving - **scaling**: String. Must be either 'None', 'zscore', or 'minmax'. Determines how to scale the signature matrix and mixture data before solving. Default is MinMax scaling. - **scaling_axis**: 0 or 1. Whether to scale mixture data and signature matrix by normalizing each column (celltype/patient) separately (scaling_axis=0) or each row (gene) separately (scaling_axis=1). Default is 0. - **formulation**: String. Must be either 'qp', 'ridge', or 'lasso'. Determines how to formulate the optimization problem: - **qp**: solve a QP (quadratic programming) problem min_x||Ax-b||^2 with strict condition x.sum()=1, xi>=0 - **ridge**: solve a ridge regression problem min_x{||Ax-b||^2 + ||x||^2} with condition xi>=0 - **lasso**: solve a lasso regression problem min_x{||Ax-b||^2 + Sum_i{|x_i|} with condition xi>=0 - **reg_constant**: Float. Regularization constant for lasso/ridge regression. Ignored if formulation set to 'qp' - **print_progress**: Boolean. Whether or not to print the solver (scipy.optimize) results The output for this method is a Pandas DataFrame (num_celltypes x num_patients) that contains cell type frequencies for each patient in the given 'patient_IDs' list. Rows are indexed by cell type, column headings are patient IDs **Example 4:** Run DeconRNASeq on all patients in the RNA mixture data, using the default LM22 signature matrix. To evaluate our confidence in our solution, we explicitly check the condition number of the signature matrix, and print the solver results: :: >>> rna = td.read_rna_file('uvm_tcga_pan_can_atlas_2018/data_mrna_seq_v2_rsem.txt') >>> sig = td.read_sig_file() >>> freqs = td.tumor_deconvolve(rna, 'deconRNAseq', sig_matrix=sig, args={'check_sig':True, 'print_progress':True}) Condition number of signature matrix = 12.43933600367927 Running DeconRNASeq... TCGA-RZ-AB0B-01 fun: 1.768742658372559 jac: array([0.22816905, 0.2551588 , 0.20533609, 0.34062135, 0.22060739, 0.20532408, 0.20530543, 0.20500609, 0.22956975, 0.32579079, 0.21855029, 0.20513448, 0.20529102, 0.20514449, 0.2050613 , 0.32388473, 0.20581529, 0.20504086, 0.20532118, 0.2950649 , 0.20484114, 0.30095159]) message: 'Optimization terminated successfully.' nfev: 243 nit: 10 njev: 10 status: 0 success: True x: array([4.79865365e-18, 0.00000000e+00, 3.24670795e-01, 4.81395017e-17, 3.27398518e-17, 1.61659557e-02, 5.85440056e-02, 5.81593582e-02, 9.57879942e-18, 4.25466532e-17, 4.63744135e-18, 3.15186488e-02, 7.94907975e-02, 2.76273695e-01, 7.08312745e-04, 7.96339326e-17, 0.00000000e+00, 6.24736273e-02, 8.95223995e-02, 1.22308564e-17, 2.47240499e-03, 1.70412859e-17]) TCGA-V3-A9ZX-01 fun: 2.3380210562625794 jac: array([0.08302617, 0.07165793, 0.07166326, 0.09838021, 0.11366048, 0.07198009, 0.07167557, 0.07217818, 0.07183281, 0.09873474, 0.07061729, 0.08603653, 0.08129218, 0.07147855, 0.07210144, 0.07260156, 0.07217726, 0.07224679, 0.07091093, 0.16353929, 0.12585488, 0.20857742]) message: 'Optimization terminated successfully.' nfev: 242 nit: 10 njev: 10 status: 0 success: True x: array([0.00000000e+00, 5.13881657e-03, 2.38885934e-01, 0.00000000e+00, 0.00000000e+00, 2.61113903e-02, 1.20781286e-02, 1.09609396e-02, 6.60321613e-02, 1.75537745e-17, 2.11868602e-02, 6.54156460e-18, 4.40696071e-18, 3.96846445e-01, 5.41708212e-02, 0.00000000e+00, 1.21155050e-01, 6.53890726e-03, 4.08945470e-02, 1.69936550e-17, 1.54190292e-18, 2.90338837e-17]) [ ... output truncated for space ... ] DeconRNASeq has completed! >>> print(freqs) Patient_ID B cells naive B cells memory Plasma cells T cells CD8 ... Mast cells resting Mast cells activated Eosinophils Neutrophils TCGA-RZ-AB0B-01 4.798654e-18 0.000000e+00 0.324671 4.813950e-17 ... 8.952240e-02 1.223086e-17 2.472405e-03 1.704129e-17 TCGA-V3-A9ZX-01 0.000000e+00 5.138817e-03 0.238886 0.000000e+00 ... 4.089455e-02 1.699366e-17 1.541903e-18 2.903388e-17 TCGA-V3-A9ZY-01 0.000000e+00 3.518049e-17 0.283670 0.000000e+00 ... 6.147303e-02 0.000000e+00 2.351062e-18 0.000000e+00 TCGA-V4-A9E5-01 0.000000e+00 0.000000e+00 0.258718 2.794810e-17 ... 7.591108e-02 0.000000e+00 3.969250e-18 1.350584e-17 TCGA-V4-A9E7-01 1.937510e-17 1.667966e-03 0.258230 0.000000e+00 ... 6.860309e-02 7.706740e-18 4.057952e-18 2.948868e-17 ... ... ... ... ... ... ... ... ... ... TCGA-YZ-A980-01 2.581715e-17 0.000000e+00 0.272212 5.280067e-18 ... 6.036439e-02 0.000000e+00 0.000000e+00 0.000000e+00 TCGA-YZ-A982-01 1.078221e-17 3.272811e-18 0.283501 6.271637e-17 ... 8.536993e-02 0.000000e+00 0.000000e+00 4.242467e-17 TCGA-YZ-A983-01 4.291653e-03 0.000000e+00 0.320183 0.000000e+00 ... 8.946226e-02 5.102526e-18 9.125998e-03 0.000000e+00 TCGA-YZ-A984-01 1.928266e-17 2.356858e-03 0.258814 5.212959e-18 ... 6.631310e-02 3.597047e-17 7.543348e-18 0.000000e+00 TCGA-YZ-A985-01 0.000000e+00 1.442125e-02 0.309380 9.133074e-17 ... 5.799046e-02 2.838456e-17 0.000000e+00 0.000000e+00 [80 rows x 22 columns] ssGSEA ****** Since ssGSEA is a rank-based method, the ``up_genes`` argument is required. If a value for ``sig_matrix`` is also declared, it will be ignored. When using this method, the following additional arguments may be passed to the function via the ``args`` Dictionary: - **print_progress**: Boolean. Whether to print patient ID as ssGSEA iterates through. Default is False - **alpha**: Float. Weight used in ssGSEA method. Default is 1.0 - **ties_method**: Describes how to treat ties in rankings. See pandas.DataFrame.rank() for list of valid options. - **norm**: Boolean. Whether to normalize enrichment scores by using the entire data set, as indicated by Barbie et al., 2009, online methods, pg. 2. Default is False. The output for this method is a Pandas DataFrame (num_celltypes x num_patients) that contains ssGSEA enrichment scores for each cell type declared in 'up_genes' dictionary, for each patient in the given 'patient_IDs' list. Rows are indexed by cell type, column headings are patient IDs **Example 5:** Run ssGSEA on 5 individual patients in the RNA mixture data, using the default gene set derived from LM22 and alpha = 0.5, and then normalize the results :: >>> rna = td.read_rna_file('uvm_tcga_pan_can_atlas_2018/data_mrna_seq_v2_rsem.txt') >>> up_reg_genes = td.read_geneset() >>> patient_subset = ['TCGA-RZ-AB0B-01', 'TCGA-V3-A9ZX-01', 'TCGA-V3-A9ZY-01', 'TCGA-V4-A9E5-01', 'TCGA-V4-A9E7-01', 'TCGA-V4-A9E8-01', 'TCGA-V4-A9E9-01', 'TCGA-V4-A9EA-01'] >>> ssgsea_scores = td.tumor_deconvolve(rna, 'ssGSEA', patient_IDs=patient_subset, up_genes=up_reg_genes, args={'alpha':0.5, 'norm':True}) Running ssGSEA... ssGSEA has completed! >>>> print(ssgsea_scores) Patient_ID B cells naive B cells memory Plasma cells T cells CD8 ... Mast cells resting Mast cells activated Eosinophils Neutrophils TCGA-RZ-AB0B-01 0.327240 0.249076 0.986881 0.326747 ... 0.735592 0.490964 0.558033 0.435243 TCGA-V3-A9ZX-01 0.276838 0.220086 0.982511 0.405832 ... 0.681473 0.527347 0.565607 0.481490 TCGA-V3-A9ZY-01 0.291156 0.239979 0.898351 0.489769 ... 0.646602 0.451504 0.532254 0.477449 TCGA-V4-A9E5-01 0.279829 0.238470 0.888191 0.521628 ... 0.619921 0.447962 0.506716 0.467605 TCGA-V4-A9E7-01 0.318732 0.155308 0.946145 0.452830 ... 0.671831 0.468233 0.485592 0.453880 TCGA-V4-A9E8-01 0.263715 0.196422 0.955722 0.547990 ... 0.608545 0.456964 0.473649 0.514065 TCGA-V4-A9E9-01 0.316531 0.217946 0.912949 0.542386 ... 0.585311 0.455807 0.461873 0.447807 TCGA-V4-A9EA-01 0.382925 0.229779 0.855658 0.572038 ... 0.583329 0.438876 0.458896 0.458504 [8 rows x 22 columns] SingScore ********* Since SingScore is a rank-based method, the ``up_genes`` argument is required. If a value for ``sig_matrix`` is also declared, it will be ignored. If both the ``down_genes`` and ``up_genes`` arguments are defined, the ``tumor_deconvolve()`` function will run PySingScore's implementation of bidirectional singscore. If only ``up_genes`` is defined, it will run PySingScore's implementation of unidirectional singscore. The output for this method is a Pandas DataFrame (num_celltypes x num_patients) that contains SingScore scores for each cell type declared in 'up_genes' dictionary, for each patient in the given 'patient_IDs' list. Rows are indexed by cell type, column headings are patient IDs **Example 6:** Run both unidirectional and bidirectional SingScore using the default Gene Sets. To derive a down-regulated gene set, we use the ``td.find_up_down_genes_from_sig()`` function (see :ref:`Creating Gene Sets `). We will create this gene set from the LM6 signature matrix, which is included in the package: :: >>> rna = td.read_rna_file('uvm_tcga_pan_can_atlas_2018/data_mrna_seq_v2_rsem.txt') >>> LM6 = td.read_sig_file('LM6.txt') >>> up_genes_LM6, down_genes_LM6 = td.find_up_down_genes_from_sig(LM6, down_cutoff=0.4, up_cutoff=4.0) >>> singscore_unidirectional = td.tumor_deconvolve(rna, 'singscore', up_genes=up_genes_LM6) Running SingScore... SingScore has completed! >>> print(singscore_unidirectional) B cells CD8 T cells CD4 T cells NK cells Monocytes Neutrophils Patient_ID TCGA-RZ-AB0B-01 -0.040879 -0.042245 -0.015696 -0.068842 -0.004216 -0.067957 TCGA-V3-A9ZX-01 -0.016157 0.006778 0.025657 -0.024595 0.040129 -0.035982 TCGA-V3-A9ZY-01 -0.047875 -0.054082 -0.005818 -0.085125 0.007932 -0.061282 TCGA-V4-A9E5-01 -0.053520 -0.070746 -0.033882 -0.102350 0.001286 -0.067848 TCGA-V4-A9E7-01 -0.077335 -0.070958 -0.043286 -0.098458 -0.035024 -0.090904 ... ... ... ... ... ... ... TCGA-YZ-A980-01 -0.027205 -0.025515 0.020872 -0.041668 0.000913 -0.065293 TCGA-YZ-A982-01 -0.061571 -0.051486 -0.021256 -0.073456 0.018704 -0.058435 TCGA-YZ-A983-01 -0.058798 -0.065018 -0.035282 -0.082523 -0.006307 -0.065574 TCGA-YZ-A984-01 -0.068884 -0.046064 -0.007893 -0.079870 -0.008476 -0.075183 TCGA-YZ-A985-01 -0.072839 -0.054141 -0.018756 -0.089481 -0.011167 -0.081205 [80 rows x 6 columns] >>> singscore_bidirectional = td.tumor_deconvolve(rna, 'singscore', up_genes=up_genes_LM6, down_genes=down_genes_LM6) Running SingScore... SingScore has completed! >>> print(singscore_bidirectional) B cells CD8 T cells CD4 T cells NK cells Monocytes Neutrophils Patient_ID TCGA-RZ-AB0B-01 0.478281 0.489102 0.489548 0.485272 0.489559 0.484873 TCGA-V3-A9ZX-01 0.488396 0.496139 0.495398 0.494335 0.491098 0.485794 TCGA-V3-A9ZY-01 0.463718 0.468867 0.474098 0.466825 0.482648 0.472018 TCGA-V4-A9E5-01 0.459967 0.460788 0.466172 0.457230 0.478536 0.463606 TCGA-V4-A9E7-01 0.457835 0.465507 0.465669 0.463028 0.479901 0.471592 ... ... ... ... ... ... ... TCGA-YZ-A980-01 0.485412 0.491924 0.493563 0.491461 0.495019 0.491583 TCGA-YZ-A982-01 0.455890 0.470598 0.471025 0.475132 0.484155 0.469897 TCGA-YZ-A983-01 0.470838 0.479412 0.478998 0.482687 0.488062 0.481390 TCGA-YZ-A984-01 0.460155 0.471427 0.473025 0.463111 0.480844 0.466355 TCGA-YZ-A985-01 0.443784 0.463111 0.467764 0.459697 0.470359 0.456920 [80 rows x 6 columns]