Analysis of TCGA breast cancer patients

Step 0: Installing and attaching the package

As the nof1 package is not available on a public repository, the package must first be downloaded to a local destination. Then install using the following command (after updating with the system-specific path and version):

install.packages("/path/to/source/nof1_X.X.X.tar.gz", repos = NULL, type="source")

Be sure to employ the usual library() call in any data analysis scripts after installation.

## Load and attach the nof1 package after installation.
library(nof1)
## Retrieve the library location to access files needed for this tutorial
my_path <- find.package("nof1")

Step 1: Read gene expression data and define sample pairings

Utility functions are available to read/write gene expression, gene set (pathway) definitions, gene set descriptions, and sample data pairing specifications.

1.1: Reading in the gene expression

The first step is to read gene expression data into the workspace. In this tutorial, we will employ matched normal/tumor pairs of RNA-seq data from TCGA⁴ breast cancer patients as example data. We have preprocessed four matched pairs (8 transcriptomes) into an R data file to maintain a reasonable package size. Samples are labeled as TCGA.XX.XXXX.N or TCGA.XX.XXXX.T, where XX.XXXX identifies the patient and N or T indicates normal or tumor, respectively.

Notably, gene expression data can also be read into R using utilities provided by the package. The format for an input text file is specified in the read_gene_set function documentation. In short, the rows are the genes and the columns are the samples in a tab-delimited file.

## Example syntax for a text file:
## my_data <- read_gene_data("/path/to/file/gene_expr.txt")
## Load compressed gene expression data for the tutorial 
data(BRCA_TCGA_4_patients)
## Display a few genes for the first two pairs of transcriptomes.
BRCA_TCGA_4_patients[1:5, 1:4]

##       TCGA.A7.A0DB.N TCGA.A7.A0DB.T TCGA.AC.A2FM.N TCGA.AC.A2FM.T
## A1BG         19.0722       302.9294       355.1020      1299.5289
## A1CF          0.0000         0.7664         0.0000         0.3341
## A2BP1         0.0000         2.2991         2.8548         0.0000
## A2LD1       153.5670       108.7537        81.4478       108.5667
## A2ML1         9.7938         1.9159         2.4470         0.0000

(Recommended) Normalize the gene expression data

It is crucial for the downstream analysis that the pair of transcriptomes have been normalized to make valid differential expression calculations. There are not many options for single-sample normalization: so we use the simple transcripts per million approach for RNA-seq data.

## Read four matched pairs of RNA-seq data sets.
gene_data <- normalize_multi_gene_data(BRCA_TCGA_4_patients, method = "tpm")
## Check the result.
apply(gene_data, 2, sum)

## TCGA.A7.A0DB.N TCGA.A7.A0DB.T TCGA.AC.A2FM.N TCGA.AC.A2FM.T TCGA.BH.A1EU.N 
##          1e+06          1e+06          1e+06          1e+06          1e+06 
## TCGA.BH.A1EU.T TCGA.E2.A153.N TCGA.E2.A153.T 
##          1e+06          1e+06          1e+06

1.2: Define sample pairings

N-of-1-pathways is a paired analysis framework. Two transcriptomes (usually from the same patient) are analyzed for differential expression at the level gene sets (pathways). As such, one must provide the software guidance as to which samples are to be compared and, in particular, which sample is considered the “baseline” and “case” samples.

It is important to note that in gene_data above that the sample names are in a highly specific order. compute_multi_nof1_pathways assumes that the column (sample) order provided is of the form BaselinePair1, CasePair1, BaselinePair2, CasePair2, etc. However, one can specify the pairings via use of the read_sample_data and write_sample_data functions for complete control of the pairing specifications.

Also, it is important to note that patient/sample IDs (columns) cannot include the ‘-’ character as pairs are automatically named using this to show the direction of subtraction.

## Define pairings using a text file (included in the package as an example).
sample_data <- read_sample_data(file.path( my_path, "extdata/example_sample_data_tcga_brca_4_patients.txt"))
## Display the how these pairs are defined using this format.
sample_data

##   PairNum     SampleName BaselineOrCase
## 1       1 TCGA.A7.A0DB.N       BASELINE
## 2       1 TCGA.A7.A0DB.T           CASE
## 3       2 TCGA.AC.A2FM.N       BASELINE
## 4       2 TCGA.AC.A2FM.T           CASE
## 5       3 TCGA.BH.A1EU.N       BASELINE
## 6       3 TCGA.BH.A1EU.T           CASE
## 7       4 TCGA.E2.A153.N       BASELINE
## 8       4 TCGA.E2.A153.T           CASE

## Provide the total num of pairs.
max(as.numeric(sample_data$PairNum))

## [1] 4

Step 2: Define pathways via a knowledgebase

In gene set (i.e, pathway) analysis genes are annotated into functional groups through a variety of approaches. In the nof1 package, we include pathways defined by the Pathway Interaction Database (PID)⁵. These are signaling pathways that are often indicated in cancer. The choice of ontology (knowledgebase) is an important consideration and will vary depending on the context and research objectives. However, that discussion is outside the scope of this tutorial.

From a practical standpoint, the format required is a ‘long’ structure. A tab-delimited text file is expected with pathway ID as the first variable and the HUGO gene symbol as the second.

## An example of how to read in a pathway definitions.
pid_filtered <- read_gene_set(file.path(my_path,"extdata/PID_filtered15-X.txt"))
head(pid_filtered)

##   path_id symbol
## 1  200002   SSPO
## 2  200002  CHEK1
## 3  200002  FANCG
## 4  200002  FANCA
## 5  200002  XRCC3
## 6  200002   RAD1

It is often useful to include pathway descriptions in the output of the N-of-1-pathways framework. The format is very similar to above, but with the second variable indicated the pathway description - not pathway membership.

## note that read_gene_set reads both gene set definitions and descriptions
pid_desc <- read_gene_set(file.path( my_path, "extdata/PID_description.txt"))
head(pid_desc)

##   path_id                                   description
## 1  200002                        Fanconi anemia pathway
## 2  200003       Regulation of nuclear SMAD2/3 signaling
## 3  200004 Fc-epsilon receptor I signaling in mast cells
## 4  200005                                   Endothelins
## 5  200006                         BCR signaling pathway
## 6  200007              Signaling events mediated by PRL

Step 3: Compute pathway dysregulation profiles

Now that the gene expression data and pathway annotation steps are complete, the actual calculation of pathway differential expression and statistical significance can be performed. Again, the details of the methods are provided in the references below and outside of the scope of this document.

In the example below, N-of-1-pathways Mahalanobis distance is applied to the four breast cancer patients in one convenient function call. The result is a list of data frames containing pathway scores for each of the 187 PID signaling pathways.

Often, a clinician or researcher a priori selects pathways of interest to study for treatment decisions or hypothesis testing. Here we arbitrary select the top five dysregulated pathways from the first patient to compare to the other three patients. In this context, the effect size and p-value for each pathway represent dysregulation as tumor tissue is compared to similar, non-tumor tissue from the same patient.

## Run and store Nof1-pathways Mahalanobis distance for the 4 pairs of data.
pathway_scores <- compute_multi_nof1_pathways(gene_data, sample_data = sample_data,
                                              gene_set = pid_filtered,
                                              model = "md", gene_set_desc = pid_desc)
## Explore the top five dysregulated pathways for the first patient.
pathway_scores[[1]][1:5,]

##        pathway_score direction      p_value num_genes_annot
## 200038     1.2262335         1 5.226671e-09              39
## 200002     1.0583383         1 2.617160e-08              48
## 200004    -0.6767436        -1 1.399637e-05              60
## 200079    -0.6144119        -1 1.263655e-04              48
## 200147    -0.6270801        -1 1.388339e-04              45
##        num_genes_measured model    zscore
## 200038                 39    md  5.839790
## 200002                 44    md  5.565291
## 200004                 57    md -4.343918
## 200079                 48    md -3.833436
## 200147                 45    md -3.810237
##                                         pathway_desc  adj_p_value
## 200038                         ATR signaling pathway 9.773874e-07
## 200002                        Fanconi anemia pathway 2.447044e-06
## 200004 Fc-epsilon receptor I signaling in mast cells 8.724403e-04
## 200079 Angiopoietin receptor Tie2-mediated signaling 5.192388e-03
## 200147                 IL6-mediated signaling events 5.192388e-03

## Store the top five pathways ID ("path_id") to compare to other patients.
top_ids <- row.names(pathway_scores[[1]])[1:5]

Note for larger data sets and ontologies, memory issues/speed could preclude the computation. In that case, it is advised to break up the analysis into smaller pieces and/or use the parallelization parameter in the function call. Also, the use of high-performance computing clusters is applicable.

Step 4: Visualize top pathway scores across patients

Finally, the nof1 package includes novel visualization tools to study and compare pathway dysregulation profiles. One such plot a rose plot.

## Shorten the pair names for display.
names(pathway_scores) <- gsub("TCGA\\.", "", names(pathway_scores))
## Create multiple rose plots, paneled by patient.
create_multi_rose_plot(pathway_scores, which_pathways = top_ids)

Here the negative natural logarithm of the p-values corresponding the each of the top five dysregulated pathways for patient A7.A0DB determine the radii for the ‘petals’. The petal areas are properly scaled so that the areas are proportional to the statistical significance of dysregulation. Note that there are a total of five petals for each rose plot. Petals above the horizontal axis (in blue) represent pathways that are higher expressed in the tumor tissue and conversely for the pathways that are down-regulated (in brown).

Note striking heterogeneity for these four patients. Patient E2.A153 has all five pathways that are significantly down-regulated. Whereas, both BH.A1EU and AC.A2FM are only significantly up-regulated in ATR signaling and Fanconi anemia in these five pathways. Comparisons of patient characteristics and/or determination of treatment plans could lead to better patient outcomes using these visualizations.

Gardeux, V. et al. N-of-1-pathways unveils personal deregulated mechanisms from a single pair of RNA-Seq samples: towards precision medicine. J. Am. Med. Inform. Assoc. 21, 1015–25 (2014).↩
Schissler, A. G. et al. Dynamic changes of RNA-sequencing expression for precision medicine: N-of-1-pathways Mahalanobis distance within pathways of single subjects predicts breast cancer survival. Bioinformatics 31, i293–i302 (2015).↩
Schissler, A. Grant, et al. “Analysis of aggregated cell–cell statistical distances within pathways unveils therapeutic-resistance mechanisms in circulating tumor cells.” Bioinformatics 32.12 (2016): i80-i89.↩
Chang K, et al. The Cancer Genome Atlas Pan-Cancer analysis project. Nature Genetics. 2013 Sep 26; 45(10):1113–20. doi: 10.1038/ng. 2764 PMID: 24071849↩
Schaefer, Carl F., et al. “PID: the pathway interaction database.” Nucleic acids research 37.suppl 1 (2009): D674-D679.↩

N-of-1-pathways via the `nof1` R package

A. Grant Schissler

2017-02-07

Introductory remarks

Brief overview of N-of-1-pathways

Scope of the tutorial

Analysis of TCGA breast cancer patients

Step 0: Installing and attaching the package

Step 1: Read gene expression data and define sample pairings

1.1: Reading in the gene expression

(Recommended) Normalize the gene expression data

1.2: Define sample pairings

Step 2: Define pathways via a knowledgebase

Step 3: Compute pathway dysregulation profiles

Step 4: Visualize top pathway scores across patients

Concluding remarks

N-of-1-pathways via the nof1 R package

A. Grant Schissler

2017-02-07

Introductory remarks

Brief overview of N-of-1-pathways

Scope of the tutorial

Analysis of TCGA breast cancer patients

Step 0: Installing and attaching the package

Step 1: Read gene expression data and define sample pairings

1.1: Reading in the gene expression

(Recommended) Normalize the gene expression data

1.2: Define sample pairings

Step 2: Define pathways via a knowledgebase

Step 3: Compute pathway dysregulation profiles

Step 4: Visualize top pathway scores across patients

Concluding remarks

N-of-1-pathways via the `nof1` R package