.. highlight:: rst Quickstart ++++++++++ Installation ************* To install the software with pip, use the following two commands: :: pip install TumorDecon pip install git+https://github.com/kristyhoran/singscore To load the module into python, use: .. code-block:: python import TumorDecon as td Tutorial ********* Here we provide a basic tutorial for importing a dataset and running each of the 4 methods provided in TumorDecon: **Step 1:** Read in gene expression profiles. Profiles can either be downloaded directly from cBioPortal or UCSC Xena, or loaded in from a local CSV file: .. code-block:: python # Location of sample data (included with the TumorDecon package): >>> data_loc = td.get_td_Home()+"data/" # Read in sample data (original source - Colorectal Adenocarcinoma RNA Seq v2 from cBioPortal.org): >>> rna = td.read_rna_file(data_loc+'coadred_data_RNA_Seq_v2_expression_median.txt') >>> print(rna) TCGA-3L-AA1B-01 TCGA-4N-A93T-01 TCGA-4T-AA8H-01 ... TCGA-AG-A02N-01 TCGA-AG-A02X-01 TCGA-AG-A032-01 Hugo_Symbol ... A1BG 22.1470 171.2680 20.9980 ... 6.936515 16.165630 22.836611 A1CF 220.9870 100.6290 174.0080 ... 174.525498 191.864642 159.905526 A2BP1 2.4178 10.1597 0.7311 ... 1.183092 75.017749 23.387740 A2LD1 177.4080 371.3640 295.6750 ... 111.407100 70.135640 141.760142 A2M 15911.5000 1494.3300 1333.5700 ... 2198.884111 1163.266249 2785.679374 ... ... ... ... ... ... ... ... ZYG11A 3.3849 0.4838 0.0000 ... 0.265865 -0.402069 0.898260 ZYG11B 543.0370 290.7600 669.7130 ... 914.956269 544.076931 685.383167 ZYX 6259.1900 4653.1200 4460.6100 ... 2768.765832 5001.453757 4643.428584 ZZEF1 1358.3200 1220.1300 3002.0100 ... 1151.317795 1822.211921 2125.966458 ZZZ3 798.3560 333.8170 530.0680 ... 996.284836 735.466401 343.951649 [17493 rows x 592 columns] # Can alternatively download additional data sets directly from cBioPortal, and fetch any missing Hugo Symbols from NCBI website: >>> rna_cbio = td.download_by_name('cbio', 'Colorectal Adenocarcinoma', fetch_missing_hugo=True) Downloading data from cbioportal to /Users/RachelAronow/testenv/lib/python3.9/site-packages/TumorDecon/data/downloaded//coadread_tcga_pan_can_atlas_2018.tar.gz... 100% [......................................................................] 221792847 / 221792847 Fetching missing Hugo Symbols for genes by Entrez ID... Found 1 / 13 of the missing values >>> print(rna_cbio) TCGA-3L-AA1B-01 TCGA-4N-A93T-01 TCGA-4T-AA8H-01 ... TCGA-AG-A02N-01 TCGA-AG-A02X-01 TCGA-AG-A032-01 Hugo_Symbol ... UBE2Q2P2 15.7640 4.2767 11.3032 ... 2.815511 5.598328 13.724609 HMGB1P1 144.4000 142.6610 143.1990 ... 333.491605 373.021521 409.240194 LOC155060 441.9730 522.0130 288.0640 ... 138.619145 170.620655 301.792311 EFCAB8 2.4178 2.4190 2.9245 ... 0.866737 2.568721 1.408164 SRP14P1 2.9014 2.9028 3.6556 ... 9.461217 3.138180 6.967566 ... ... ... ... ... ... ... ... ZYG11A 3.3849 0.4838 0.0000 ... 0.265865 -0.402069 0.898260 ZYG11B 543.0370 290.7600 669.7130 ... 914.956269 544.076931 685.383167 ZYX 6259.1900 4653.1200 4460.6100 ... 2768.765832 5001.453757 4643.428584 ZZEF1 1358.3200 1220.1300 3002.0100 ... 1151.317795 1822.211921 2125.966458 ZZZ3 798.3560 333.8170 530.0680 ... 996.284836 735.466401 343.951649 [17498 rows x 592 columns] # Or from UCSC Xena: >>> rna_xena = td.download_by_name('xena', 'Colon and Rectal Cancer') Downloading data from Xena Hub to /Users/RachelAronow/testenv/lib/python3.9/site-packages/TumorDecon/data/downloaded//HiSeqV2.gz... 100% [........................................................................] 23048575 / 23048575 >>> print(rna_xena) TCGA-CA-5256-01 TCGA-AZ-6599-01 TCGA-AA-3655-01 ... TCGA-AF-A56K-01 TCGA-DC-6154-01 TCGA-AG-3592-01 Hugo_Symbol ... ARHGEF10L 1144.371583 2226.484998 2082.076345 ... 1685.246812 2598.827162 1751.623022 HIF3A 12.151906 4.076028 3.245104 ... 104.595167 67.693420 33.003233 RNF17 0.000000 0.000000 0.463578 ... 0.000000 0.000000 0.000000 RNF10 1432.871476 3297.970130 2907.783085 ... 2209.719087 3005.355704 2588.575669 RNF11 1463.000164 1334.347953 976.802985 ... 778.173067 856.623127 619.978711 ... ... ... ... ... ... ... ... PTRF 712.712881 486.041197 1630.066627 ... 5007.628873 2691.256217 1859.016882 BCL6B 53.360997 28.531853 129.345774 ... 393.587105 237.575155 132.333917 GSTK1 6002.965618 3106.847504 5442.428906 ... 2451.776390 3604.041120 4599.898773 SELP 14.265469 16.813511 64.439962 ... 108.205371 36.614057 74.656913 SELS 1901.525401 1535.572368 1308.772019 ... 1163.181273 1722.588903 1359.197095 [20530 rows x 434 columns] **Step 2-A:** Running Linear Methods (cibersort and DeconRNASeq) Linear methods require a signature matrix, and output cell fractions: .. code-block:: python # read in LM22 file (the default) as the signature matrix >>> sig = td.read_sig_file() >>> print(sig) B cells naive B cells memory Plasma cells ... Mast cells activated Eosinophils Neutrophils Hugo_Symbol ... ABCB4 555.713449 10.744235 7.225819 ... 34.113659 15.030530 14.906888 ABCB9 15.603544 22.094787 653.392328 ... 23.615746 29.786442 33.679147 ACAP1 215.305951 321.621021 38.616872 ... 73.607932 696.442802 596.025961 ACHE 15.117949 16.648847 22.123737 ... 20.501249 40.414927 22.766494 ACP5 605.897384 1935.201479 1120.104684 ... 323.381277 860.563374 307.142798 ... ... ... ... ... ... ... ... ZNF204P 77.876897 42.946213 41.075062 ... 25.206700 65.487837 27.485546 ZNF222 9.800664 7.203218 11.691368 ... 211.493742 346.556451 26.825597 ZNF286A 737.859904 331.908761 73.882022 ... 25.833293 7.473171 441.313469 ZNF324 28.211994 23.464424 21.173326 ... 45.775175 35.293284 35.424698 ZNF442 19.713197 22.447533 19.391583 ... 3.646156 40.296509 20.157522 [547 rows x 22 columns] # Can also use a custom signature matrix, such as the example "kmeans_signature_matrix_qval.txt" from the Signature Matrix tutorial. # Note this signature matrix uses Ensembl Gene IDs instead of Hugo Symbols - specify this in order to have them converted to Hugo Symbols >>> custom_sig = td.read_sig_file(td.get_td_Home()+'data/kmeans_signature_matrix_qval.txt', geneID='Ensembl_Gene_ID') >>> print(custom_sig) CD8_subtype_1 CD8_subtype_2 CD4_subtype_1 ... Fibroblast_subtype_2 Neutrophils_subtype_1 Neutrophils_subtype_2 Hugo_Symbol ... FGR 116.000139 24.380169 8.453629 ... 5.412057 558.770835 739.358363 NFYA 22.503403 0.752393 15.056201 ... 2.073508 46.759730 37.213350 LASP1 154.161176 36.456214 110.461637 ... 10.793872 259.252004 250.373201 TSR3 39.389183 48.998150 16.131921 ... 0.371443 2.038300 3.715013 NADK 16.429699 12.887361 9.760582 ... 1.503814 871.354356 1006.496017 ... ... ... ... ... ... ... ... TRBJ1-3 239.011558 239.012712 281.496628 ... 0.898271 0.000000 0.000000 TRBJ1-5 607.825543 607.828500 734.535240 ... 1.117383 0.000000 0.000000 TRBJ1-1 379.176464 379.177563 378.034351 ... 2.021114 0.000000 0.000000 LINC01002 0.673294 0.928260 12.979473 ... 5.542829 1121.606395 1058.730521 MIR1248 925.340683 925.015091 49.996583 ... 1.854341 0.000000 0.000000 [972 rows x 16 columns] # Run cibersort method on ALL patients: # optional argments include: # 'scaling': how to scale the data (rna mixture & signature matrix) before applying Cibersort (must be either 'None', 'zscore', or 'minmax') # 'scaling_axis': whether to scale data by normalizing each cell types / patient individually (0) or each gene individually (1). default is 0 # 'nu','C','kernel': all arguments to be passed to sklearn's implementation of NuSVR. See sklearn documentation for details # Additionally, can pass in 'nu':'best' to choose the nu from [0.25, 0.5, 0.75] that minimizes the root mean square error # Defaults are nu='best', C=1.0, kernel=linear # 'print_progress': whether to print patient ID as cibersort works through each patient (default: False) .. code-block:: python >>> ciber_freqs = td.tumor_deconvolve(rna, 'cibersort', patient_IDs='ALL', sig_matrix=sig, args={'nu':'best', 'scaling':'minmax'}) Running CiberSort... CiberSort has completed! >>> print(ciber_freqs) Patient_ID B cells naive B cells memory Plasma cells ... Mast cells activated Eosinophils Neutrophils TCGA-3L-AA1B-01 0.076577 0.000000 0.019106 ... 0.066274 0.000000 0.0 TCGA-4N-A93T-01 0.004291 0.054434 0.045281 ... 0.185916 0.007008 0.0 TCGA-4T-AA8H-01 0.013573 0.000000 0.118549 ... 0.160762 0.033994 0.0 TCGA-5M-AAT4-01 0.000000 0.066355 0.000000 ... 0.266957 0.000000 0.0 TCGA-5M-AAT5-01 0.022618 0.000000 0.024259 ... 0.359011 0.021282 0.0 ... ... ... ... ... ... ... ... TCGA-AG-A026-01 0.000000 0.022986 0.003191 ... 0.253747 0.006140 0.0 TCGA-AG-A02G-01 0.021680 0.000175 0.000000 ... 0.169971 0.000374 0.0 TCGA-AG-A02N-01 0.016291 0.007369 0.004204 ... 0.179287 0.009278 0.0 TCGA-AG-A02X-01 0.005250 0.000000 0.003403 ... 0.181726 0.004704 0.0 TCGA-AG-A032-01 0.072213 0.000000 0.034568 ... 0.216468 0.009628 0.0 [592 rows x 22 columns] # Save results to a file: >>> ciber_freqs.to_csv("cibersort_results.csv", index_label='Patient_ID') # Run DeconRNASeq: # optional argments include: # 'scaling': Same as in cibersort # 'scaling_axis': Same as in cibersort # 'check_sig': whether to check the condition number of the signature matrix before solving (default: False) # 'print_progress': whether to print the results of each fit while function is running (default: False) >>> decon_freqs = td.tumor_deconvolve(rna, 'DeconRNASeq', patient_IDs='ALL', sig_matrix=sig, args={'scaling':'minmax', 'print_progress':False}) Running DeconRNASeq... DeconRNASeq has completed! >>> print(decon_freqs) Patient_ID B cells naive B cells memory Plasma cells ... Mast cells activated Eosinophils Neutrophils TCGA-3L-AA1B-01 2.582540e-02 0.000000e+00 0.165577 ... 3.088621e-17 0.000000e+00 3.234649e-17 TCGA-4N-A93T-01 2.305311e-02 1.298978e-16 0.141491 ... 0.000000e+00 5.750904e-18 1.934592e-17 TCGA-4T-AA8H-01 1.053894e-02 5.177404e-17 0.159989 ... 7.645319e-18 1.235998e-02 5.827587e-19 TCGA-5M-AAT4-01 5.394090e-17 1.991418e-02 0.109398 ... 3.574299e-17 4.039499e-17 2.120351e-17 TCGA-5M-AAT5-01 1.089422e-02 4.913046e-17 0.144355 ... 2.883490e-17 6.280517e-18 9.021213e-18 ... ... ... ... ... ... ... ... TCGA-AG-A026-01 2.113573e-02 1.137586e-17 0.141309 ... 1.324951e-17 1.269026e-18 5.087319e-17 TCGA-AG-A02G-01 1.517414e-02 0.000000e+00 0.131320 ... 0.000000e+00 2.747775e-18 1.425387e-17 TCGA-AG-A02N-01 2.555726e-02 6.160640e-17 0.130763 ... 0.000000e+00 0.000000e+00 3.678833e-17 TCGA-AG-A02X-01 2.959905e-02 5.615945e-18 0.182780 ... 1.647317e-17 4.197899e-18 1.486629e-17 TCGA-AG-A032-01 3.053816e-02 4.980554e-17 0.140349 ... 1.588356e-17 4.061692e-17 0.000000e+00 [592 rows x 22 columns] **Step 2-B:** Running Rank-Based Methods (ssGSEA and SingScore) Rank-based methods require a list of up-regulated genes, and output scores/ranks for each cell type: .. code-block:: python # Read in up-regulated gene set derived from LM22 (default): >>> up_geneset = td.read_geneset() # When running tumor_deconvolve(), the default is to apply the method on all patients in the rna data frame (patient_IDs='ALL'). # To instead run the method on only a select list of patients, pass in a list of patient identifiers (column names) >>> patient_subset = ['TCGA-3L-AA1B-01', 'TCGA-4N-A93T-01', 'TCGA-4T-AA8H-01', 'TCGA-5M-AAT4-01', \ 'TCGA-5M-AAT5-01', 'TCGA-5M-AAT6-01', 'TCGA-5M-AATA-01', 'TCGA-5M-AATE-01', \ 'TCGA-A6-2675-01', 'TCGA-A6-2682-01', 'TCGA-A6-2684-01', 'TCGA-A6-2685-01'] # Run ssGSEA on only a few patients: # optional arguments include: # 'alpha': Weight used in ssGSEA method (default: 1.0) # 'norm': whether to normalize enrichment scores, as done by Barbie et al., 2009, online methods, pg. 2 # 'print_progress': whether to print patient ID as ssGSEA iterates through (default: False) >>> ssgsea_scores = td.tumor_deconvolve(rna, 'ssGSEA', patient_IDs=patient_subset, up_genes=up_geneset, args={'alpha':0.5, 'norm':True, 'print_progress':False}) .. code-block:: python Running ssGSEA... ssGSEA has completed! >>> print(ssgsea_scores) Patient_ID B cells naive B cells memory Plasma cells ... Mast cells activated Eosinophils Neutrophils TCGA-3L-AA1B-01 0.259867 0.194624 0.885539 ... 0.476357 0.487919 0.347988 TCGA-4N-A93T-01 0.279005 0.085207 0.876560 ... 0.487487 0.507909 0.403473 TCGA-4T-AA8H-01 0.229115 0.094173 0.872763 ... 0.440532 0.515211 0.398734 TCGA-5M-AAT4-01 0.213934 0.146207 0.838287 ... 0.484334 0.478675 0.329150 TCGA-5M-AAT5-01 0.226495 0.083503 0.887130 ... 0.570072 0.490234 0.360500 TCGA-5M-AAT6-01 0.220341 0.196759 0.928414 ... 0.495182 0.551787 0.444604 TCGA-5M-AATA-01 0.264451 0.242712 0.843088 ... 0.562232 0.517319 0.385985 TCGA-5M-AATE-01 0.255101 0.159893 0.860415 ... 0.487739 0.502279 0.389226 TCGA-A6-2675-01 0.239517 0.216401 0.880584 ... 0.442438 0.552197 0.480925 TCGA-A6-2682-01 0.215060 0.159228 0.934567 ... 0.572144 0.557205 0.490885 TCGA-A6-2684-01 0.200944 0.187249 0.888014 ... 0.466842 0.530049 0.442364 TCGA-A6-2685-01 0.174148 0.183445 0.901240 ... 0.512232 0.559692 0.475958 [12 rows x 22 columns] # SingScore can be run with just an up-regulated gene set (unidirectional): >>> singscore_unidirectional = td.tumor_deconvolve(rna, 'singscore', patient_IDs=patient_subset, up_genes=up_geneset) Running SingScore... SingScore has completed! >>> print(singscore_unidirectional) B cells naive B cells memory Plasma cells ... Mast cells activated Eosinophils Neutrophils Patient_ID ... TCGA-3L-AA1B-01 0.043939 0.019919 0.390517 ... 0.143151 0.170208 0.057520 TCGA-4N-A93T-01 0.042402 -0.018819 0.388269 ... 0.130023 0.159335 0.047293 TCGA-4T-AA8H-01 -0.001167 -0.034557 0.383143 ... 0.121452 0.173807 0.054449 TCGA-5M-AAT4-01 -0.000663 -0.034056 0.352491 ... 0.143960 0.147913 0.072633 TCGA-5M-AAT5-01 0.001516 -0.026920 0.387989 ... 0.176622 0.160018 0.047570 TCGA-5M-AAT6-01 0.033225 0.028672 0.410925 ... 0.153879 0.205684 0.129255 TCGA-5M-AATA-01 0.023101 0.036137 0.361934 ... 0.177856 0.182411 0.094270 TCGA-5M-AATE-01 0.028539 -0.005005 0.373973 ... 0.135569 0.170774 0.075416 TCGA-A6-2675-01 0.024810 0.021903 0.382503 ... 0.141583 0.213163 0.167502 TCGA-A6-2682-01 -0.001531 -0.011193 0.411903 ... 0.194219 0.203540 0.185292 TCGA-A6-2684-01 0.011188 0.009044 0.387331 ... 0.149757 0.202112 0.137950 TCGA-A6-2685-01 0.017267 0.014811 0.396467 ... 0.170028 0.221638 0.165198 [12 rows x 22 columns] # or with both an up-regulated and down-regulated gene set (bidirectional): # Note we can derive up-regulated and down-regulated genes from an existing signature matrix, such as LM6: >>> LM6 = td.read_sig_file(data_loc+'LM6.txt') >>> up_geneset_LM6, down_geneset_LM6 = td.find_up_down_genes_from_sig(LM6, down_cutoff=0.4, up_cutoff=4.0) >>> singscore_bidirectional = td.tumor_deconvolve(rna, 'singscore', patient_IDs=patient_subset, up_genes=up_geneset_LM6, down_genes=down_geneset_LM6) Running SingScore... SingScore has completed! >>> print(singscore_bidirectional) B cells CD8 T cells CD4 T cells NK cells Monocytes Neutrophils Patient_ID TCGA-3L-AA1B-01 0.499283 0.498565 0.499251 0.497644 0.498842 0.498363 TCGA-4N-A93T-01 0.497755 0.496919 0.497468 0.494885 0.495496 0.494552 TCGA-4T-AA8H-01 0.491771 0.493391 0.495615 0.491519 0.493491 0.491843 TCGA-5M-AAT4-01 0.495321 0.496566 0.497346 0.495163 0.498167 0.497388 TCGA-5M-AAT5-01 0.495463 0.495565 0.497145 0.494091 0.495991 0.494845 TCGA-5M-AAT6-01 0.498709 0.498768 0.498831 0.498318 0.499212 0.498525 TCGA-5M-AATA-01 0.497041 0.498285 0.498800 0.497634 0.498982 0.498472 TCGA-5M-AATE-01 0.495452 0.495420 0.497194 0.493926 0.496085 0.494777 TCGA-A6-2675-01 0.498108 0.498180 0.498828 0.497645 0.499401 0.498931 TCGA-A6-2682-01 0.497307 0.497994 0.498878 0.497267 0.499530 0.499082 TCGA-A6-2684-01 0.498561 0.498945 0.499311 0.498411 0.499402 0.498990 TCGA-A6-2685-01 0.499290 0.499078 0.499545 0.498534 0.499574 0.499424 **Step 3:** Visualize Results (for LM22 cell types only). .. code-block:: python # Select a subset of the patients to visualize: >>>results = ciber_freqs.loc[patient_subset] # Create and save plots: >>>td.cell_frequency_boxplot(results, save_as="boxplots.png") >>>td.cell_frequency_barchart(results, save_as="barcharts.png") >>>td.pair_plot(results, save_as="pairplots.png") >>>td.hierarchical_clustering(results, save_as="clustermaps.png") # Each of the plots above use matplotlib and seaborn, and users can customize plot parameters as in these packages. For example: >>>td.cell_frequency_boxplot(results, font_scale=0.75, rcParams={'figure.figsize':(5,3)}, save_as="customized-boxplot.png")