nof1
R packageWelcome to the N-of-1-pathways analysis tutorial! The intent of this document is to familiarize users with the basic functions included in the nof1
R
package through a demonstration of a typical analytic workflow.
The software transforms a pair of transcriptomes from a single subject to a personal profile of pathway dysregulation1. The algorithm quantifies dysregulation2 (effect size) within a gene set along with statistical significance (p-value) in high throughput. Lastly, personal dysregulation profiles are visualized for pattern recognition through novel ’omic rose plots3. Moreover, these paired calculations can be applied in related problems, such as single-cell RNA-seq^3. Please refer to the references provided in the footnotes for details on these methods. The schematic below illustrates the N-of-1-pathways workflow:
In this demonstration, we will apply N-of-1-pathways to study the transcriptomes from four breast cancer patients. As we are applying the framework to multiple patients in one workflow, the majority of nof1
function used will have a multi
designation. For example, the function compute_multi_nof1_pathways
will be employed vs. compute_nof1_pathways
. A truly one subject analysis is rare in practice, but rest assured that the analogous functions behave in a similar manner. Moreover, compute_nof1_pathways
(and other non-multi
functions) allow for more advanced users to flexibly calculate in non-standard scenarios.
As the nof1
package is not available on a public repository, the package must first be downloaded to a local destination. Then install using the following command (after updating with the system-specific path and version):
install.packages("/path/to/source/nof1_X.X.X.tar.gz", repos = NULL, type="source")
Be sure to employ the usual library()
call in any data analysis scripts after installation.
## Load and attach the nof1 package after installation.
library(nof1)
## Retrieve the library location to access files needed for this tutorial
my_path <- find.package("nof1")
Utility functions are available to read/write gene expression, gene set (pathway) definitions, gene set descriptions, and sample data pairing specifications.
The first step is to read gene expression data into the workspace. In this tutorial, we will employ matched normal/tumor pairs of RNA-seq data from TCGA4 breast cancer patients as example data. We have preprocessed four matched pairs (8 transcriptomes) into an R data file to maintain a reasonable package size. Samples are labeled as TCGA.XX.XXXX.N
or TCGA.XX.XXXX.T
, where XX.XXXX
identifies the patient and N
or T
indicates normal or tumor, respectively.
Notably, gene expression data can also be read into R using utilities provided by the package. The format for an input text file is specified in the read_gene_set
function documentation. In short, the rows are the genes and the columns are the samples in a tab-delimited file.
## Example syntax for a text file:
## my_data <- read_gene_data("/path/to/file/gene_expr.txt")
## Load compressed gene expression data for the tutorial
data(BRCA_TCGA_4_patients)
## Display a few genes for the first two pairs of transcriptomes.
BRCA_TCGA_4_patients[1:5, 1:4]
## TCGA.A7.A0DB.N TCGA.A7.A0DB.T TCGA.AC.A2FM.N TCGA.AC.A2FM.T
## A1BG 19.0722 302.9294 355.1020 1299.5289
## A1CF 0.0000 0.7664 0.0000 0.3341
## A2BP1 0.0000 2.2991 2.8548 0.0000
## A2LD1 153.5670 108.7537 81.4478 108.5667
## A2ML1 9.7938 1.9159 2.4470 0.0000
It is crucial for the downstream analysis that the pair of transcriptomes have been normalized to make valid differential expression calculations. There are not many options for single-sample normalization: so we use the simple transcripts per million approach for RNA-seq data.
## Read four matched pairs of RNA-seq data sets.
gene_data <- normalize_multi_gene_data(BRCA_TCGA_4_patients, method = "tpm")
## Check the result.
apply(gene_data, 2, sum)
## TCGA.A7.A0DB.N TCGA.A7.A0DB.T TCGA.AC.A2FM.N TCGA.AC.A2FM.T TCGA.BH.A1EU.N
## 1e+06 1e+06 1e+06 1e+06 1e+06
## TCGA.BH.A1EU.T TCGA.E2.A153.N TCGA.E2.A153.T
## 1e+06 1e+06 1e+06
N-of-1-pathways is a paired analysis framework. Two transcriptomes (usually from the same patient) are analyzed for differential expression at the level gene sets (pathways). As such, one must provide the software guidance as to which samples are to be compared and, in particular, which sample is considered the “baseline” and “case” samples.
It is important to note that in gene_data
above that the sample names are in a highly specific order. compute_multi_nof1_pathways
assumes that the column (sample) order provided is of the form BaselinePair1, CasePair1, BaselinePair2, CasePair2, etc. However, one can specify the pairings via use of the read_sample_data
and write_sample_data
functions for complete control of the pairing specifications.
Also, it is important to note that patient/sample IDs (columns) cannot include the ‘-’ character as pairs are automatically named using this to show the direction of subtraction.
## Define pairings using a text file (included in the package as an example).
sample_data <- read_sample_data(file.path( my_path, "extdata/example_sample_data_tcga_brca_4_patients.txt"))
## Display the how these pairs are defined using this format.
sample_data
## PairNum SampleName BaselineOrCase
## 1 1 TCGA.A7.A0DB.N BASELINE
## 2 1 TCGA.A7.A0DB.T CASE
## 3 2 TCGA.AC.A2FM.N BASELINE
## 4 2 TCGA.AC.A2FM.T CASE
## 5 3 TCGA.BH.A1EU.N BASELINE
## 6 3 TCGA.BH.A1EU.T CASE
## 7 4 TCGA.E2.A153.N BASELINE
## 8 4 TCGA.E2.A153.T CASE
## Provide the total num of pairs.
max(as.numeric(sample_data$PairNum))
## [1] 4
In gene set (i.e, pathway) analysis genes are annotated into functional groups through a variety of approaches. In the nof1
package, we include pathways defined by the Pathway Interaction Database (PID)5. These are signaling pathways that are often indicated in cancer. The choice of ontology (knowledgebase) is an important consideration and will vary depending on the context and research objectives. However, that discussion is outside the scope of this tutorial.
From a practical standpoint, the format required is a ‘long’ structure. A tab-delimited text file is expected with pathway ID as the first variable and the HUGO gene symbol as the second.
## An example of how to read in a pathway definitions.
pid_filtered <- read_gene_set(file.path(my_path,"extdata/PID_filtered15-X.txt"))
head(pid_filtered)
## path_id symbol
## 1 200002 SSPO
## 2 200002 CHEK1
## 3 200002 FANCG
## 4 200002 FANCA
## 5 200002 XRCC3
## 6 200002 RAD1
It is often useful to include pathway descriptions in the output of the N-of-1-pathways framework. The format is very similar to above, but with the second variable indicated the pathway description - not pathway membership.
## note that read_gene_set reads both gene set definitions and descriptions
pid_desc <- read_gene_set(file.path( my_path, "extdata/PID_description.txt"))
head(pid_desc)
## path_id description
## 1 200002 Fanconi anemia pathway
## 2 200003 Regulation of nuclear SMAD2/3 signaling
## 3 200004 Fc-epsilon receptor I signaling in mast cells
## 4 200005 Endothelins
## 5 200006 BCR signaling pathway
## 6 200007 Signaling events mediated by PRL
Now that the gene expression data and pathway annotation steps are complete, the actual calculation of pathway differential expression and statistical significance can be performed. Again, the details of the methods are provided in the references below and outside of the scope of this document.
In the example below, N-of-1-pathways Mahalanobis distance is applied to the four breast cancer patients in one convenient function call. The result is a list of data frames containing pathway scores for each of the 187 PID signaling pathways.
Often, a clinician or researcher a priori selects pathways of interest to study for treatment decisions or hypothesis testing. Here we arbitrary select the top five dysregulated pathways from the first patient to compare to the other three patients. In this context, the effect size and p-value for each pathway represent dysregulation as tumor tissue is compared to similar, non-tumor tissue from the same patient.
## Run and store Nof1-pathways Mahalanobis distance for the 4 pairs of data.
pathway_scores <- compute_multi_nof1_pathways(gene_data, sample_data = sample_data,
gene_set = pid_filtered,
model = "md", gene_set_desc = pid_desc)
## Explore the top five dysregulated pathways for the first patient.
pathway_scores[[1]][1:5,]
## pathway_score direction p_value num_genes_annot
## 200038 1.2262335 1 5.226671e-09 39
## 200002 1.0583383 1 2.617160e-08 48
## 200004 -0.6767436 -1 1.399637e-05 60
## 200079 -0.6144119 -1 1.263655e-04 48
## 200147 -0.6270801 -1 1.388339e-04 45
## num_genes_measured model zscore
## 200038 39 md 5.839790
## 200002 44 md 5.565291
## 200004 57 md -4.343918
## 200079 48 md -3.833436
## 200147 45 md -3.810237
## pathway_desc adj_p_value
## 200038 ATR signaling pathway 9.773874e-07
## 200002 Fanconi anemia pathway 2.447044e-06
## 200004 Fc-epsilon receptor I signaling in mast cells 8.724403e-04
## 200079 Angiopoietin receptor Tie2-mediated signaling 5.192388e-03
## 200147 IL6-mediated signaling events 5.192388e-03
## Store the top five pathways ID ("path_id") to compare to other patients.
top_ids <- row.names(pathway_scores[[1]])[1:5]
Note for larger data sets and ontologies, memory issues/speed could preclude the computation. In that case, it is advised to break up the analysis into smaller pieces and/or use the parallelization parameter in the function call. Also, the use of high-performance computing clusters is applicable.
Finally, the nof1
package includes novel visualization tools to study and compare pathway dysregulation profiles. One such plot a rose plot.
## Shorten the pair names for display.
names(pathway_scores) <- gsub("TCGA\\.", "", names(pathway_scores))
## Create multiple rose plots, paneled by patient.
create_multi_rose_plot(pathway_scores, which_pathways = top_ids)
Here the negative natural logarithm of the p-values corresponding the each of the top five dysregulated pathways for patient A7.A0DB determine the radii for the ‘petals’. The petal areas are properly scaled so that the areas are proportional to the statistical significance of dysregulation. Note that there are a total of five petals for each rose plot. Petals above the horizontal axis (in blue) represent pathways that are higher expressed in the tumor tissue and conversely for the pathways that are down-regulated (in brown).
Note striking heterogeneity for these four patients. Patient E2.A153 has all five pathways that are significantly down-regulated. Whereas, both BH.A1EU and AC.A2FM are only significantly up-regulated in ATR signaling and Fanconi anemia in these five pathways. Comparisons of patient characteristics and/or determination of treatment plans could lead to better patient outcomes using these visualizations.
That concludes this tutorial to the nof1
package to implement the N-of-1-pathways framework. The package is in heavy development with more features, methods, and tutorials forthcoming. Please refer to the package documentation for detailed explanations and examples on all the functions and data sets.
Also, we greatly appreciate your support by citing the N-of-1-pathways framework:
citation("nof1")
Please also cite any particular method within the framework such as the Mahlanobis distance.
Thanks for reading!
Gardeux, V. et al. N-of-1-pathways unveils personal deregulated mechanisms from a single pair of RNA-Seq samples: towards precision medicine. J. Am. Med. Inform. Assoc. 21, 1015–25 (2014).↩
Schissler, A. G. et al. Dynamic changes of RNA-sequencing expression for precision medicine: N-of-1-pathways Mahalanobis distance within pathways of single subjects predicts breast cancer survival. Bioinformatics 31, i293–i302 (2015).↩
Schissler, A. Grant, et al. “Analysis of aggregated cell–cell statistical distances within pathways unveils therapeutic-resistance mechanisms in circulating tumor cells.” Bioinformatics 32.12 (2016): i80-i89.↩
Chang K, et al. The Cancer Genome Atlas Pan-Cancer analysis project. Nature Genetics. 2013 Sep 26; 45(10):1113–20. doi: 10.1038/ng. 2764 PMID: 24071849↩
Schaefer, Carl F., et al. “PID: the pathway interaction database.” Nucleic acids research 37.suppl 1 (2009): D674-D679.↩