Quickstart¶
Installation¶
To install the software with pip, use the following two commands:
pip install TumorDecon
pip install git+https://github.com/kristyhoran/singscore
To load the module into python, use:
import TumorDecon as td
Tutorial¶
Here we provide a basic tutorial for importing a dataset and running each of the 4 methods provided in TumorDecon:
Step 1: Read in gene expression profiles. Profiles can either be downloaded directly from cBioPortal or UCSC Xena, or loaded in from a local CSV file:
# Location of sample data (included with the TumorDecon package):
>>> data_loc = td.get_td_Home()+"data/"
# Read in sample data (original source - Colorectal Adenocarcinoma RNA Seq v2 from cBioPortal.org):
>>> rna = td.read_rna_file(data_loc+'coadred_data_RNA_Seq_v2_expression_median.txt')
>>> print(rna)
TCGA-3L-AA1B-01 TCGA-4N-A93T-01 TCGA-4T-AA8H-01 ... TCGA-AG-A02N-01 TCGA-AG-A02X-01 TCGA-AG-A032-01
Hugo_Symbol ...
A1BG 22.1470 171.2680 20.9980 ... 6.936515 16.165630 22.836611
A1CF 220.9870 100.6290 174.0080 ... 174.525498 191.864642 159.905526
A2BP1 2.4178 10.1597 0.7311 ... 1.183092 75.017749 23.387740
A2LD1 177.4080 371.3640 295.6750 ... 111.407100 70.135640 141.760142
A2M 15911.5000 1494.3300 1333.5700 ... 2198.884111 1163.266249 2785.679374
... ... ... ... ... ... ... ...
ZYG11A 3.3849 0.4838 0.0000 ... 0.265865 -0.402069 0.898260
ZYG11B 543.0370 290.7600 669.7130 ... 914.956269 544.076931 685.383167
ZYX 6259.1900 4653.1200 4460.6100 ... 2768.765832 5001.453757 4643.428584
ZZEF1 1358.3200 1220.1300 3002.0100 ... 1151.317795 1822.211921 2125.966458
ZZZ3 798.3560 333.8170 530.0680 ... 996.284836 735.466401 343.951649
[17493 rows x 592 columns]
# Can alternatively download additional data sets directly from cBioPortal, and fetch any missing Hugo Symbols from NCBI website:
>>> rna_cbio = td.download_by_name('cbio', 'Colorectal Adenocarcinoma', fetch_missing_hugo=True)
Downloading data from cbioportal to /Users/RachelAronow/testenv/lib/python3.9/site-packages/TumorDecon/data/downloaded//coadread_tcga_pan_can_atlas_2018.tar.gz...
100% [......................................................................] 221792847 / 221792847
Fetching missing Hugo Symbols for genes by Entrez ID...
Found 1 / 13 of the missing values
>>> print(rna_cbio)
TCGA-3L-AA1B-01 TCGA-4N-A93T-01 TCGA-4T-AA8H-01 ... TCGA-AG-A02N-01 TCGA-AG-A02X-01 TCGA-AG-A032-01
Hugo_Symbol ...
UBE2Q2P2 15.7640 4.2767 11.3032 ... 2.815511 5.598328 13.724609
HMGB1P1 144.4000 142.6610 143.1990 ... 333.491605 373.021521 409.240194
LOC155060 441.9730 522.0130 288.0640 ... 138.619145 170.620655 301.792311
EFCAB8 2.4178 2.4190 2.9245 ... 0.866737 2.568721 1.408164
SRP14P1 2.9014 2.9028 3.6556 ... 9.461217 3.138180 6.967566
... ... ... ... ... ... ... ...
ZYG11A 3.3849 0.4838 0.0000 ... 0.265865 -0.402069 0.898260
ZYG11B 543.0370 290.7600 669.7130 ... 914.956269 544.076931 685.383167
ZYX 6259.1900 4653.1200 4460.6100 ... 2768.765832 5001.453757 4643.428584
ZZEF1 1358.3200 1220.1300 3002.0100 ... 1151.317795 1822.211921 2125.966458
ZZZ3 798.3560 333.8170 530.0680 ... 996.284836 735.466401 343.951649
[17498 rows x 592 columns]
# Or from UCSC Xena:
>>> rna_xena = td.download_by_name('xena', 'Colon and Rectal Cancer')
Downloading data from Xena Hub to /Users/RachelAronow/testenv/lib/python3.9/site-packages/TumorDecon/data/downloaded//HiSeqV2.gz...
100% [........................................................................] 23048575 / 23048575
>>> print(rna_xena)
TCGA-CA-5256-01 TCGA-AZ-6599-01 TCGA-AA-3655-01 ... TCGA-AF-A56K-01 TCGA-DC-6154-01 TCGA-AG-3592-01
Hugo_Symbol ...
ARHGEF10L 1144.371583 2226.484998 2082.076345 ... 1685.246812 2598.827162 1751.623022
HIF3A 12.151906 4.076028 3.245104 ... 104.595167 67.693420 33.003233
RNF17 0.000000 0.000000 0.463578 ... 0.000000 0.000000 0.000000
RNF10 1432.871476 3297.970130 2907.783085 ... 2209.719087 3005.355704 2588.575669
RNF11 1463.000164 1334.347953 976.802985 ... 778.173067 856.623127 619.978711
... ... ... ... ... ... ... ...
PTRF 712.712881 486.041197 1630.066627 ... 5007.628873 2691.256217 1859.016882
BCL6B 53.360997 28.531853 129.345774 ... 393.587105 237.575155 132.333917
GSTK1 6002.965618 3106.847504 5442.428906 ... 2451.776390 3604.041120 4599.898773
SELP 14.265469 16.813511 64.439962 ... 108.205371 36.614057 74.656913
SELS 1901.525401 1535.572368 1308.772019 ... 1163.181273 1722.588903 1359.197095
[20530 rows x 434 columns]
Step 2-A: Running Linear Methods (cibersort and DeconRNASeq)
Linear methods require a signature matrix, and output cell fractions:
# read in LM22 file (the default) as the signature matrix
>>> sig = td.read_sig_file()
>>> print(sig)
B cells naive B cells memory Plasma cells ... Mast cells activated Eosinophils Neutrophils
Hugo_Symbol ...
ABCB4 555.713449 10.744235 7.225819 ... 34.113659 15.030530 14.906888
ABCB9 15.603544 22.094787 653.392328 ... 23.615746 29.786442 33.679147
ACAP1 215.305951 321.621021 38.616872 ... 73.607932 696.442802 596.025961
ACHE 15.117949 16.648847 22.123737 ... 20.501249 40.414927 22.766494
ACP5 605.897384 1935.201479 1120.104684 ... 323.381277 860.563374 307.142798
... ... ... ... ... ... ... ...
ZNF204P 77.876897 42.946213 41.075062 ... 25.206700 65.487837 27.485546
ZNF222 9.800664 7.203218 11.691368 ... 211.493742 346.556451 26.825597
ZNF286A 737.859904 331.908761 73.882022 ... 25.833293 7.473171 441.313469
ZNF324 28.211994 23.464424 21.173326 ... 45.775175 35.293284 35.424698
ZNF442 19.713197 22.447533 19.391583 ... 3.646156 40.296509 20.157522
[547 rows x 22 columns]
# Can also use a custom signature matrix, such as the example "kmeans_signature_matrix_qval.txt" from the Signature Matrix tutorial.
# Note this signature matrix uses Ensembl Gene IDs instead of Hugo Symbols - specify this in order to have them converted to Hugo Symbols
>>> custom_sig = td.read_sig_file(td.get_td_Home()+'data/kmeans_signature_matrix_qval.txt', geneID='Ensembl_Gene_ID')
>>> print(custom_sig)
CD8_subtype_1 CD8_subtype_2 CD4_subtype_1 ... Fibroblast_subtype_2 Neutrophils_subtype_1 Neutrophils_subtype_2
Hugo_Symbol ...
FGR 116.000139 24.380169 8.453629 ... 5.412057 558.770835 739.358363
NFYA 22.503403 0.752393 15.056201 ... 2.073508 46.759730 37.213350
LASP1 154.161176 36.456214 110.461637 ... 10.793872 259.252004 250.373201
TSR3 39.389183 48.998150 16.131921 ... 0.371443 2.038300 3.715013
NADK 16.429699 12.887361 9.760582 ... 1.503814 871.354356 1006.496017
... ... ... ... ... ... ... ...
TRBJ1-3 239.011558 239.012712 281.496628 ... 0.898271 0.000000 0.000000
TRBJ1-5 607.825543 607.828500 734.535240 ... 1.117383 0.000000 0.000000
TRBJ1-1 379.176464 379.177563 378.034351 ... 2.021114 0.000000 0.000000
LINC01002 0.673294 0.928260 12.979473 ... 5.542829 1121.606395 1058.730521
MIR1248 925.340683 925.015091 49.996583 ... 1.854341 0.000000 0.000000
[972 rows x 16 columns]
# Run cibersort method on ALL patients:
# optional argments include:
# 'scaling': how to scale the data (rna mixture & signature matrix) before applying Cibersort (must be either 'None', 'zscore', or 'minmax')
# 'scaling_axis': whether to scale data by normalizing each cell types / patient individually (0) or each gene individually (1). default is 0
# 'nu','C','kernel': all arguments to be passed to sklearn's implementation of NuSVR. See sklearn documentation for details
# Additionally, can pass in 'nu':'best' to choose the nu from [0.25, 0.5, 0.75] that minimizes the root mean square error
# Defaults are nu='best', C=1.0, kernel=linear
# 'print_progress': whether to print patient ID as cibersort works through each patient (default: False)
>>> ciber_freqs = td.tumor_deconvolve(rna, 'cibersort', patient_IDs='ALL', sig_matrix=sig, args={'nu':'best', 'scaling':'minmax'})
Running CiberSort...
CiberSort has completed!
>>> print(ciber_freqs)
Patient_ID B cells naive B cells memory Plasma cells ... Mast cells activated Eosinophils Neutrophils
TCGA-3L-AA1B-01 0.076577 0.000000 0.019106 ... 0.066274 0.000000 0.0
TCGA-4N-A93T-01 0.004291 0.054434 0.045281 ... 0.185916 0.007008 0.0
TCGA-4T-AA8H-01 0.013573 0.000000 0.118549 ... 0.160762 0.033994 0.0
TCGA-5M-AAT4-01 0.000000 0.066355 0.000000 ... 0.266957 0.000000 0.0
TCGA-5M-AAT5-01 0.022618 0.000000 0.024259 ... 0.359011 0.021282 0.0
... ... ... ... ... ... ... ...
TCGA-AG-A026-01 0.000000 0.022986 0.003191 ... 0.253747 0.006140 0.0
TCGA-AG-A02G-01 0.021680 0.000175 0.000000 ... 0.169971 0.000374 0.0
TCGA-AG-A02N-01 0.016291 0.007369 0.004204 ... 0.179287 0.009278 0.0
TCGA-AG-A02X-01 0.005250 0.000000 0.003403 ... 0.181726 0.004704 0.0
TCGA-AG-A032-01 0.072213 0.000000 0.034568 ... 0.216468 0.009628 0.0
[592 rows x 22 columns]
# Save results to a file:
>>> ciber_freqs.to_csv("cibersort_results.csv", index_label='Patient_ID')
# Run DeconRNASeq:
# optional argments include:
# 'scaling': Same as in cibersort
# 'scaling_axis': Same as in cibersort
# 'check_sig': whether to check the condition number of the signature matrix before solving (default: False)
# 'print_progress': whether to print the results of each fit while function is running (default: False)
>>> decon_freqs = td.tumor_deconvolve(rna, 'DeconRNASeq', patient_IDs='ALL', sig_matrix=sig, args={'scaling':'minmax', 'print_progress':False})
Running DeconRNASeq...
DeconRNASeq has completed!
>>> print(decon_freqs)
Patient_ID B cells naive B cells memory Plasma cells ... Mast cells activated Eosinophils Neutrophils
TCGA-3L-AA1B-01 2.582540e-02 0.000000e+00 0.165577 ... 3.088621e-17 0.000000e+00 3.234649e-17
TCGA-4N-A93T-01 2.305311e-02 1.298978e-16 0.141491 ... 0.000000e+00 5.750904e-18 1.934592e-17
TCGA-4T-AA8H-01 1.053894e-02 5.177404e-17 0.159989 ... 7.645319e-18 1.235998e-02 5.827587e-19
TCGA-5M-AAT4-01 5.394090e-17 1.991418e-02 0.109398 ... 3.574299e-17 4.039499e-17 2.120351e-17
TCGA-5M-AAT5-01 1.089422e-02 4.913046e-17 0.144355 ... 2.883490e-17 6.280517e-18 9.021213e-18
... ... ... ... ... ... ... ...
TCGA-AG-A026-01 2.113573e-02 1.137586e-17 0.141309 ... 1.324951e-17 1.269026e-18 5.087319e-17
TCGA-AG-A02G-01 1.517414e-02 0.000000e+00 0.131320 ... 0.000000e+00 2.747775e-18 1.425387e-17
TCGA-AG-A02N-01 2.555726e-02 6.160640e-17 0.130763 ... 0.000000e+00 0.000000e+00 3.678833e-17
TCGA-AG-A02X-01 2.959905e-02 5.615945e-18 0.182780 ... 1.647317e-17 4.197899e-18 1.486629e-17
TCGA-AG-A032-01 3.053816e-02 4.980554e-17 0.140349 ... 1.588356e-17 4.061692e-17 0.000000e+00
[592 rows x 22 columns]
Step 2-B: Running Rank-Based Methods (ssGSEA and SingScore)
Rank-based methods require a list of up-regulated genes, and output scores/ranks for each cell type:
# Read in up-regulated gene set derived from LM22 (default):
>>> up_geneset = td.read_geneset()
# When running tumor_deconvolve(), the default is to apply the method on all patients in the rna data frame (patient_IDs='ALL').
# To instead run the method on only a select list of patients, pass in a list of patient identifiers (column names)
>>> patient_subset = ['TCGA-3L-AA1B-01', 'TCGA-4N-A93T-01', 'TCGA-4T-AA8H-01', 'TCGA-5M-AAT4-01', \
'TCGA-5M-AAT5-01', 'TCGA-5M-AAT6-01', 'TCGA-5M-AATA-01', 'TCGA-5M-AATE-01', \
'TCGA-A6-2675-01', 'TCGA-A6-2682-01', 'TCGA-A6-2684-01', 'TCGA-A6-2685-01']
# Run ssGSEA on only a few patients:
# optional arguments include:
# 'alpha': Weight used in ssGSEA method (default: 1.0)
# 'norm': whether to normalize enrichment scores, as done by Barbie et al., 2009, online methods, pg. 2
# 'print_progress': whether to print patient ID as ssGSEA iterates through (default: False)
>>> ssgsea_scores = td.tumor_deconvolve(rna, 'ssGSEA', patient_IDs=patient_subset, up_genes=up_geneset, args={'alpha':0.5, 'norm':True, 'print_progress':False})
Running ssGSEA...
ssGSEA has completed!
>>> print(ssgsea_scores)
Patient_ID B cells naive B cells memory Plasma cells ... Mast cells activated Eosinophils Neutrophils
TCGA-3L-AA1B-01 0.259867 0.194624 0.885539 ... 0.476357 0.487919 0.347988
TCGA-4N-A93T-01 0.279005 0.085207 0.876560 ... 0.487487 0.507909 0.403473
TCGA-4T-AA8H-01 0.229115 0.094173 0.872763 ... 0.440532 0.515211 0.398734
TCGA-5M-AAT4-01 0.213934 0.146207 0.838287 ... 0.484334 0.478675 0.329150
TCGA-5M-AAT5-01 0.226495 0.083503 0.887130 ... 0.570072 0.490234 0.360500
TCGA-5M-AAT6-01 0.220341 0.196759 0.928414 ... 0.495182 0.551787 0.444604
TCGA-5M-AATA-01 0.264451 0.242712 0.843088 ... 0.562232 0.517319 0.385985
TCGA-5M-AATE-01 0.255101 0.159893 0.860415 ... 0.487739 0.502279 0.389226
TCGA-A6-2675-01 0.239517 0.216401 0.880584 ... 0.442438 0.552197 0.480925
TCGA-A6-2682-01 0.215060 0.159228 0.934567 ... 0.572144 0.557205 0.490885
TCGA-A6-2684-01 0.200944 0.187249 0.888014 ... 0.466842 0.530049 0.442364
TCGA-A6-2685-01 0.174148 0.183445 0.901240 ... 0.512232 0.559692 0.475958
[12 rows x 22 columns]
# SingScore can be run with just an up-regulated gene set (unidirectional):
>>> singscore_unidirectional = td.tumor_deconvolve(rna, 'singscore', patient_IDs=patient_subset, up_genes=up_geneset)
Running SingScore...
SingScore has completed!
>>> print(singscore_unidirectional)
B cells naive B cells memory Plasma cells ... Mast cells activated Eosinophils Neutrophils
Patient_ID ...
TCGA-3L-AA1B-01 0.043939 0.019919 0.390517 ... 0.143151 0.170208 0.057520
TCGA-4N-A93T-01 0.042402 -0.018819 0.388269 ... 0.130023 0.159335 0.047293
TCGA-4T-AA8H-01 -0.001167 -0.034557 0.383143 ... 0.121452 0.173807 0.054449
TCGA-5M-AAT4-01 -0.000663 -0.034056 0.352491 ... 0.143960 0.147913 0.072633
TCGA-5M-AAT5-01 0.001516 -0.026920 0.387989 ... 0.176622 0.160018 0.047570
TCGA-5M-AAT6-01 0.033225 0.028672 0.410925 ... 0.153879 0.205684 0.129255
TCGA-5M-AATA-01 0.023101 0.036137 0.361934 ... 0.177856 0.182411 0.094270
TCGA-5M-AATE-01 0.028539 -0.005005 0.373973 ... 0.135569 0.170774 0.075416
TCGA-A6-2675-01 0.024810 0.021903 0.382503 ... 0.141583 0.213163 0.167502
TCGA-A6-2682-01 -0.001531 -0.011193 0.411903 ... 0.194219 0.203540 0.185292
TCGA-A6-2684-01 0.011188 0.009044 0.387331 ... 0.149757 0.202112 0.137950
TCGA-A6-2685-01 0.017267 0.014811 0.396467 ... 0.170028 0.221638 0.165198
[12 rows x 22 columns]
# or with both an up-regulated and down-regulated gene set (bidirectional):
# Note we can derive up-regulated and down-regulated genes from an existing signature matrix, such as LM6:
>>> LM6 = td.read_sig_file(data_loc+'LM6.txt')
>>> up_geneset_LM6, down_geneset_LM6 = td.find_up_down_genes_from_sig(LM6, down_cutoff=0.4, up_cutoff=4.0)
>>> singscore_bidirectional = td.tumor_deconvolve(rna, 'singscore', patient_IDs=patient_subset, up_genes=up_geneset_LM6, down_genes=down_geneset_LM6)
Running SingScore...
SingScore has completed!
>>> print(singscore_bidirectional)
B cells CD8 T cells CD4 T cells NK cells Monocytes Neutrophils
Patient_ID
TCGA-3L-AA1B-01 0.499283 0.498565 0.499251 0.497644 0.498842 0.498363
TCGA-4N-A93T-01 0.497755 0.496919 0.497468 0.494885 0.495496 0.494552
TCGA-4T-AA8H-01 0.491771 0.493391 0.495615 0.491519 0.493491 0.491843
TCGA-5M-AAT4-01 0.495321 0.496566 0.497346 0.495163 0.498167 0.497388
TCGA-5M-AAT5-01 0.495463 0.495565 0.497145 0.494091 0.495991 0.494845
TCGA-5M-AAT6-01 0.498709 0.498768 0.498831 0.498318 0.499212 0.498525
TCGA-5M-AATA-01 0.497041 0.498285 0.498800 0.497634 0.498982 0.498472
TCGA-5M-AATE-01 0.495452 0.495420 0.497194 0.493926 0.496085 0.494777
TCGA-A6-2675-01 0.498108 0.498180 0.498828 0.497645 0.499401 0.498931
TCGA-A6-2682-01 0.497307 0.497994 0.498878 0.497267 0.499530 0.499082
TCGA-A6-2684-01 0.498561 0.498945 0.499311 0.498411 0.499402 0.498990
TCGA-A6-2685-01 0.499290 0.499078 0.499545 0.498534 0.499574 0.499424
Step 3: Visualize Results (for LM22 cell types only).
# Select a subset of the patients to visualize:
>>>results = ciber_freqs.loc[patient_subset]
# Create and save plots:
>>>td.cell_frequency_boxplot(results, save_as="boxplots.png")
>>>td.cell_frequency_barchart(results, save_as="barcharts.png")
>>>td.pair_plot(results, save_as="pairplots.png")
>>>td.hierarchical_clustering(results, save_as="clustermaps.png")
# Each of the plots above use matplotlib and seaborn, and users can customize plot parameters as in these packages. For example:
>>>td.cell_frequency_boxplot(results, font_scale=0.75, rcParams={'figure.figsize':(5,3)}, save_as="customized-boxplot.png")