Abstract |
INTEGRATIVE GENE REGULATORY NETWORK INFERENCE USING MULTI-OMICS DATA A Thesis by NEDA ZARAYENEH Submitted to the Office of Graduate Studies of Texas A&M University-Commerce in partial fulfillment of the requirements for the degree of MASTER OF SCIENCE May 2017 INTEGRATIVE GENE REGULATORY NETWORK INFERENCE USING MULTI-OMICS DATA A Thesis by NEDA ZARAYENEH Approved by: Advisor: Sang C. Suh Committee: Mingon Kang Dongeun Lee Head of Department: Sang C. Suh Dean of the College: Brent Don Ham Dean of Graduate Studies: Mary Beth Sampson iii Copyright © 2017 Neda Zarayeneh iv ABSTRACT INTEGRATIVE GENE REGULATORY NETWORK INFERENCE USING MULTI-OMICS DATA Neda Zarayeneh, MS Texas A&M University-Commerce, 2017 Advisor: Sang C. Suh, PhD Biological network inference is of importance inunderstanding the underlying biological mechanisms. Gene regulatory network describes molecular interactions of complex biological processes by using a graph model, where nodes and edges represent genes and their regulations respectively. In most research studies, the molecular interactions (edges) of the gene regulatory networks are inferred from a single type of genomic data, for example, gene expression data. However, gene expression is a product of sequential interactions of DNA sequence variations, single nucleotide polymorphism, copy number variation, histone modifications, transcription factor, DNA methylation, and many other factors. The multiple types of genomics data are called multi-omics data. In this research, I proposed an Integrative Gene Regulatory Network inference method (iGRN) that can incorporate multi-omics data and their interactions in the graph model of gene regulatory network, and subsequently compared this proposed method with some state-of-the-art methods. Copy number variations and DNA methylations were considered for multi-omics data in this paper. The extensiveexperiments were carried out with simulation data, where its capability that infers the integrative gene regulatory network is assessed. Through the experiments, iGRN showed it performs better on model representation and v interpretation than other integrative methods in gene regulatory network inference. iGRN was also applied to human brain data of patients with psychiatric disorders, and the biological networks of the psychiatric disorderswereanalyzed. vi ACKNOWLEDGEMENTS I would first like to thank my thesis advisor, Dr. Sang C. Suh, for all his help and guidance that he has given me over the past two years. You have set an example of excellence as a researcher, mentor, and instructor. I would especially like to thank my co-advisor, Dr. Mingon Kang. He was always available whenever I ran into a trouble spot or had a question about my research. I undoubtedly could not have done this without your guidance. I would also like to express my gratitude to my thesis committee member Dr. Dongeun Lee for his time. vii TABLE OF CONTENTS LIST OF TABLES ....................................................................................................................... ix LIST OF FIGURES .......................................................................................................................x CHAPTER 1. INTRODUCTION ........................................................................................................1 Causal Relationships ................................................................................................4 Data Integration .......................................................................................................6 Definition of Terms..................................................................................................7 Statement of the Problem .........................................................................................9 Purpose of the Study ..............................................................................................9 Significance of the Study ......................................................................................12 Limitations ...........................................................................................................12 Assumptions .........................................................................................................12 Organization of the Remaining Thesis Chapters .................................................13 2. REVIEW OF THE LITERATURE ............................................................................14 3. METHOD OF PROCEDURE .....................................................................................18 Problem Definitions ...............................................................................................18 Inference of Integrative Gene Regulatory Network ...............................................18 4. SIMULATION STUDY ...............................................................................................22 Simulation Settings ................................................................................................22 Simulation Study 1 ................................................................................................25 Simulation Study 2 ................................................................................................29 Simulation Study 3 ................................................................................................32 viii 5. EXPERIMENT RESULTS ...........................................................................................34 Human Brain Data for Psychiatric Diseases .........................................................34 Preprocessing .........................................................................................................34 Comparison ...........................................................................................................35 6. CONCULSIONS AND FUTURE WORK ...................................................................38 REFRENCES ................................................................................................................................40 APPENDICES ...............................................................................................................................45 Appendix A. LIBRARIES ...........................................................................................................45 B. PREPROCESSING ................................................................................................47 C. METHODS ............................................................................................................51 D. EXPERIMENT ......................................................................................................60 E. SIMULATION STUDY 1 .....................................................................................66 F. SIMULATION STUDY 2 .....................................................................................74 G. SIMULATION STUDY 3 .....................................................................................84 VITA ...........................................................................................................................................93 ix LIST OF TABLES TABLE 1. AUROC with simulation data .............................................................................................. 27 2. Comparison with biological replicates ................................................................................. 38 x LIST OF FIGURES FIGURE 1. Structure of a gene regulatory network ................................................................................. 2 2. Control process of a gene regulatory network ...................................................................... 2 3. Central Dogma of Biology .................................................................................................. 3 4. Different causal relationships between genes ..................................................................... 5 5. Creating a Copy number of variation by gene duplication ............................................. 111 6. Inactivating a gene by DNA methylation ....................................................................... 111 7. Example of iGRN. The interactive effects of Copy number variation (CNV) and DNA methylation (DM) are incorporated in the gene regulatory network model ....................... 21 8. ROC plot for GRN, DCGRN, and iGRN ............................................................................ 27 9. TPRs for interaction effects of CNVs and DNA methylation .......................................... 28 10. Sensitivity ....................................................................................................................... 30 11. Sensitivity on DNA methylation and copy number variations ....................................... 31 12. FDR rate for GRN, DCGRN, and iGRN ........................................................................ 34 13. Volcano plot for preprocessing ....................................................................................... 36 1 Chapter 1 INTRODUCTION Inside the cells, each copy of DNA has geneson it and these genes encode proteins; they do this in a two-step process: 1. Transcription: the process in which the gene is copied into mRNA (messenger RNA) 2. Translation: the process in which the nucleotide sequences in mRNA are translated into an amino acid sequence and this makes the protein. The reason why these processes are important is because any function that might happen in a cell is being carried out by a protein, so in order to know what the cell’s capabilities are, it is necessary to know what proteins are expressed in the cell and what genes are being turned on. Therefore, proteins perform cellular functions, including regulatory gene expression. A gene on DNA that regulates another gene somewhere farther down is called a regulatory gene. There isa set of instructions in each gene that helps organisms to survive by making required molecules. Genes cannot perform any function by themselves; they must be converted into gene products. Gene expression is a product of a gene that is produced by information included within a gene. Figure 1 illustrates how genes are made or expressed. In sequence, DNA makes mRNA, which eventually makes proteins that eventually make humans, so in any step along the way, genes can be regulated; they can be regulated post-translationary or post-transcriptionary, but in general most of the gene regulations take place from DNA to mRNA to either express a gene or not. In Figure 2, the feedback circuitry for gene regulation isshown. 2 Figure 1. Structure of a gene regulatory network Figure 2. Control process of a gene regulatory network 3 Hence, the expression of genes exerts their effects on cellular metabolism and function by synthesizing new protein products. The journey from the genes to mRNA to proteins is called central dogma of biology (Figure 3). Figure 3. Central Dogma of Biology. A gene is a piece of DNA that is identical in the entire organism. In some tissues, a gene is expressed and produces mRNA. On the contrary, when it is not expressed, it remains inactive. When the gene is expressed, a gene product (a protein or RNA) with a role is made to implement vital functions in the human body. Actually, gene expression is the production of mRNA. 4 Gene regulatory networks (GRNs) are sets of genes that interact with each other, and complex biological processes can be completely understood using network inferences(Chai et al., 2014; Hill et al., 2016). GRNs conventionally use a graph model, where genes are nodes and interaction between genes are the edges. However, the inference of GRN is a complicated issue because gene expression data are often large-scale data and noisy. Causal Relationships Genes are not independent; they regulate each other and act collectively. This collective behavior can be observed using microarray. A number of genes control the reaction of the cell to alter the environs by regulating other genes. Causal relationships in GRN infer how genes interact with each other. There are four different types of causal relationships in GRN (Figure 4): one gene can regulate more than one gene (gene A activates genes B,C, and D) one gene can be activated by more than one gene (gene F can be activated by gene B and D) a gene can regulate itself (gene B regulates itself) a gene can inhibit another gene (gene D inhibits gene E); all three of the above conditions can be true for inhibition as well. 5 Figure 4. Different causal relationships between genes. 6 Data Integration There are different types of molecular networks such as protein-protein interaction (PPI) networks, metabolic interaction (MI) networks, gene interaction (GI) networks, and gene co-expression (co-ex) networks.Therefore, creating an integrated network that uses all or some of these biological networks can provide an in-depth understanding of biological systems. It is possible to divide integration methods into two groups based upon the integrated type of data: homogeneous and heterogeneous integration methods. An integrated network with the same type of nodes and various types of links among nodes is called homogeneous, but if both types of nodes and links between nodes are different, it is called heterogeneous. For instance, if only GI networks are integrated, then the resulting integrated network is a homogeneous integration, while integrating GI and MI or more molecular networks will result in a heterogeneous interaction. The techniques for integrating data can be subdivided into the following three groups (Gligorijević & Pržulj, 2015): 1. Decision integration: This method first considers each dataset in isolation and constructs models for each of them, then combines all of the models into a single model. Constructing models for each isolated dataset without considering their reciprocal relationships with other datasets might end up abating the functioning of the final model (Gevaert, Smet, Timmerman, Moreau, & Moor, 2006) 2. Partial integration: This method combines data through inference of a joint model. Many researchers have used this technique due to its high predictive accuracy. However, it should be mentioned that there are some studies that provide better accuracy of performance for the other two techniques than forPartial integration 7 (Özen, Gönen, Alpaydın, & Haliloğlu, 2009). Thismethod does not requireany transformation; therefore,no information is lost. 3. Full integration: This method first combines the datasets into a simple dataset using an integration approach. This usually is implemented through a computational framework, which in some instances may cause information loss. Then, it builds the model based on the integrated dataset (Lanckriet, De Bie, Cristianini, Jordan, & Noble, 2004). Definition of Terms Multi-omics data.The term multi-omics data describes multiple biological data that are required to draw a complete picture of biological processes. RSS.Residual sum of squares (RSS) is a criterion used to see how well a statistical model fits to the data. It estimates the total difference between real data and the predicted ones by using anestimation model. In other words, RSS is a distance from a data point to a regression line. The smaller the RSS result, the better the model fits. LASSO. Least Absolute Shrinkage and Selection Operator (LASSO) is used for variable selection when dealing with high dimensional data, when the number of variables is much greater than the sample size. To a certain extent, by adding an L1 penalty to the usual regression objective function, the important variables can be identified among the many available. This is on the condition that the number of important variables is a small portion of the overall, which is referred to as the sparse condition. It is important to note that LASSO is only useful for variable selection and not for estimation. The coefficients or lambdas in a regression setting given by LASSO are most often biased. Regularization penalizes parameters for being too large and reduces the weight. Typically, the penalty grows exponentially, so the larger a coefficient gets, the heavier the penalty. 8 In genome data, LASSO is used to sparsify a model. The network may have thousands of genes, and the model can be naively built with almost every gene connected to each other. By using LASSO, a lot of these interactions will be a true zero, which reduces the number of connections and vastly simplifies further analyses. LASSO regression is a powerful technique to create sparing models that have a large number of features. The objective function for LASSO that should be minimized is the RSS plus the sum of the absolute value of the weights. This can be represented through a mathematical formula as: Cost(W) RSS(W) (Sum of absolute value of weights) M j j N i i i ij y w x w 0 2 1 WhereX is a matrix of features having N rows and M + 1 columns, Y is the real outcome variables of length N, and W is the vector of weights or coefficients of length M +1. Also, N is the entire number of data points and M is the entire number of features. The reason why X has the M + 1 columns is because it has M features and one intercept. In the above formula, if = 0 the outcome will result in the same coefficients like ordinary linear regression; if = ∞, the coefficients of LASSO will be all zero, and if 0 < < ∞, the LASSO coefficients will be between 0 and that of ordinary linear regression. HPN-DREAMchallenge. DREAM challenges are designed to accelerate the creation of new predictive models in biomedicine—they can, for instance, catalyze the development of innovative methods for inferring cellular networks, thereby advancing network and systems biology. 9 Statement of the Problem In this research, a new method called Integrative Gene Regulatory Network inference (iGRN) was proposed in order to integrate multiple genomic data and their reciprocal actions into gene regulatory networks. This study can be described as follows: Development of a new method for gene regulatory network inference through the integration of CNV, DNA methylation, and gene expression (multi-omics data), while most studies that infer gene regulatory network have used only gene expression; Biological interpretation of the integrated network of the multi-omics data by considering the effects of reciprocal actions between these multi-omics data. Firstly, annotations and iGRN method are described in Chapter 3. In Chapter 4, the extensive experimental results with simulation data are shown and the performance with current state-of-the-art methods is compared. Then, in Chapter 5 of the thesis, experimental results are presented in which iGRN is implemented withpsychiatric disorders data. Finally, conclusions and future works are discussed in Chapter 6. Purpose of the Study Studying different aspects of integrative genomic methods is becoming more important as different kinds of high-performance biological data are rapidly developing. Moreover, human diseases affect a number of complicated gene-environment relations in various biological procedures like genetic, epigenetic, and transcriptional regulation (Cisek, Krochmal, Klein, & Mischak, 2015; Higdon et al., 2015; Kristensen et al., 2014; W. Zhang, Li, & Nie, 2010), which proves the necessityfor integrative genomic research. For instance, gene expression is the procedure in which instructions from a gene are used in the combination of a functional gene product. These products are DNA methylation, single nucleotide polymorphism (SNP), histone 10 modifications, transcription factor, copy number variation (CNV), sequential interactions of DNA sequence variations, and several more aspects (Aure et al., 2013; Kang, Kim, Liu, & Gao, 2015; Kang, Tang, & Gao, 2016; Wagner et al., 2014). Thus, other factors than SNP can explain biological processes. Copy number of variation is a type of structural variation that occurs when a portion of the DNA is lost. These CNVs can affect gene expression and can be correlated to particular phenotypes and diseases. This variation covers approximately 12% of the human genome and includes deletions and duplications (Figure 5). Similarly, DNA methylation that is an epigenetic mechanism that occurs by the addition of a methyl ( 3 CH ) group to DNA and often modifies the function of the genes. DNA methylation often occurs in promoter regions of a gene and normally suppresses a gene expression (Figure 6). For instance, monozygotic twins have the same DNA sequence but different CNVs, which often result in developing congenital heart disease (Breckpot et al., 2012; Henrichsen, Chaignat, & Reymond, 2009). 11 Figure 5. Creating a Copy number of variation by gene duplication. Figure 6. Inactivating a gene by DNA methylation. 12 Thus, variant genomic aspects like multi-omics data should be considered to integrate GRNs to have a comprehensive understandingof the mechanism in genetic processes. In order to obtain such understanding, the integration of multi-omics into GRN was considered in this study. Significance of the Study Implementing gene regulatory network inference is a complicated problem because the number of genes is relatively high in comparison to the number of data points. Integrating multi-omics data (DNA methylation and copy number of variation) with gene expression data can make the process of inferring gene network even more complicated. There are a few studies that investigate this subject; most of them have considered gene expression data as the only feature in the network (Chai et al., 2014; Hill et al., 2016). In this thesis, my goal was to find an integrated gene regulatory network in order to consider features of genes other than gene expression. DNA methylation and copy number of variation were considered in Kim et al.(2014) as independent features in the network. Limitations In this study, the relationships among the genes are considered to be linear. This can prove to be a limitation, because any nonlinear interactions among the genes may cause the predictive model to be inaccurate. Assumptions Inferring gene regulatory networks with non-linear regulatory relationships is a complex task. This study assumes all the relationships between variables are linear. 13 Organization of Remaining Thesis Chapters The thesis chapters are organized as follows: Chapter 2 includes the literature review, which is a description of the related works that are dealt with in this study. Chapter 3 covers the methods that were used to infer the GRN. Chapter 4 includes the simulation study where data was generated and the performance of the proposed method with current state-of-the-art methods was evaluated. Chapter 5 presents the following tasks: 1. Performance of the proposed method (iGRN) with human brain data including psychiatric diseases and control on 131 samples. 2. Preprocessing of the dataset because it contains 25,833 gene expression data points, but not all of them are of high importance. Therefore, a volcano plot was used in order to obtain significant genes. 3. Evaluation of the proposed iGRN by implementing experiments using two biological data samples that are related to psychiatric disorders of the human brain. The output of iGRN was compared with simple regression-based gene regulatory network inference (GRN) and DCGRN(Kim et al., 2014). Finally, in Chapter 6, the summary of the proposed method in this study, as well as its future implications, are provided. 14 Chapter 2 LITERATURE REVIEW There are three GRN inference approaches that are primarily used: (1) correlation-based techniques;(2) Bayesian methods, which are based on graphical modeling;and (3) regression-based approaches. Correlation-based approaches define association measures among genes based on their linear correlation to each other and the Pearson correlation coefficient is applied to Boolean networks. This model can be considered the preliminary model for GRNs. Boolean networks usually have few connections and represent a graph with a set of states together with a set of Boolean functions. In Boolean networks, gene expression has two states, active (1) and inactive (0), and has a certain threshold; active interactions of genes can be identified based on the correlation coefficient if it is higher than the threshold. Since genomic data such as gene expression data and variant data have very high dimensionality, when there is a gene expression dataset, it is important to identify groups of genes thatshow similar expression patterns. One of the ways to do this is WGCNA or weighted gene coexpression network analysis. In simple terms, the WGCNA identifies genes thatshow similar expression patterns across samples or conditions. These gene groups are called modules. WGCNA identifies modules by using a type of principle component analysis (PCA). Here, each module is represented by an expression value thatbelongs to the module eigengene. This value is identified from the PCA. However, none of the actual genes in the module need to have the same expression value as the module eigengene. Since each eigengene represents a module, the distance of a gene from the eigengene, which is the centre of the module, can be calculated. This shows the module that each gene lies in. WGCNA can be considered as an illustrative algorithm of the correlation-based methods (Langfelder & Horvath, 2008). Besides the correlation 15 coefficients, other methods have also been used on Boolean networks to infer GRNs, such as mutual information (MI; Altay & Emmert-Streib, 2010), maximum information correlation (MIC; Reshef et al., 2011), and conditional mutual information(CMI; Liang & Wang, 2008; X. Zhang et al., 2012). The main drawback of the correlation-based approaches is the deficiency of elucidation of causal influence among genes due to the use of an undirected graph. Bayesian-based methods are graph-based models used to infer causal relationships between genes. The network consists of two main parts: (1) a directed acyclic graph (DAG) comprised of nodes thatelucidate arbitrary variables and directed arcs, which represent probabilistic dependencies between them (Husmeier & Werhli, 2007; Njah & Jamoussi, 2015; Werhli & Husmeier, 2007; Young, Raftery, & Yeung, 2014); (2) the detail of a conditional probability distribution for each of the arbitrary variables, which on account of discrete-valued variables, appears as a conditional probability table (CPT). This indicates the probability of a child node going up against a specific value, given the estimations of its parent nodes. In addition, Bayesian network inference approaches take advantage of the random nature of gene expression data and the missing value problem. A Bayesian network is based on both prior knowledge and observed data to create a model thatcan be clearly trusted. Prior knowledge, such as a pathway database, could be informative, and it can be incorporated to increase the accuracy and efficiency of gene regulatory network inference (Husmeier & Werhli, 2007; Werhli & Husmeier, 2007). However, the main problem with Bayesian network learning is that it restricts the number of genes to infer the GRN, because, as the number of genes increase, the networks expand in an exponential way. Regression-based approaches decompose the inference of gene regulatory networks into P regression problems (Gustafsson, Hörnquist, & Lombardi, 2005; Kim et al., 2014; Oh &Deasy, 16 2014; Omranian, Eloundou-Mbebi, Mueller-Roeber, & Nikoloski, 2016), in which 𝑃 is the number of genes. As a regression-based approach, LASSO has often been applied to decomposed regression models to find the biologically decipherable solutions. Such approaches showed ahigh robustness inthe performance ofconstructing large scale GRNs. For example, regression-based approaches were rated as the best methods in the HPN-DREAM breast cancer network inference challenge (Hill et al., 2016). However, there are only a few studies that have addressed integrating multi-omics data into GRNs, since finding an efficient algorithm that incorporates these heterogeneous data into a regulated form is challenging. The method DCGRN (short for DNA methylation, copy number variation, and gene regulatory network)developed byKim et al. (2014) proposed an approach that integrates gene expression, CNV, and DNA methylation for gene regulatory network inference. The drawback for DCGRN is, in this model CNV and DNA methylation aretreated as independent factors along with genes. Hence, there may be a lack of interpretation between genes and multi-omics data in the model. SGRN is another integrated network introduced by Kim, Liu, Wu, Zhang, andGao (2013) in which single nucleotide polymorphism (SNP) is integrated into the gene regulatory network so that SNP is part of the gene regulator. This method is used to illustrate the relationships between gene expressions and genetic perturbations.Aregression-based approach is applied in SGRN that is consideredadirect network model. Petralia, Wang, Yang, and Tu (2015) used an algorithm to integrate heterogeneous data types; that is, the information from other sources of data or prior belief of interactions between geneswas integrated into GRN by using random forest. According to Cai, Bazerque, and Giannakis (2013), biological system gene networks are sparse; therefore, a sparse Structural Equation Models (SEM) is used to infer gene regulatory 17 networks by integrating gene expression data and cis-expression quantitative trait loci (cis-eQTL). 18 Chapter 3 METHOD OF PROCEDURE Problem Definitions The symbols that were dealt with throughout this research are described in this section. Let G describe a matrix containing gene expression, N P G R , where N is the number of the observations and P is the number of genes in microarray data. For the th i gene, the vector of gene expressions is presented as N i g R , and i G is a matrix that consists of gene expressions other than gene i , ( 1) N P i G R . The matrices of CNV and DNA methylation are defined as C and D , respectively. CNV and DNA methylation are supposed to be related to a gene if they are placed around (upstream or downstream) the gene on the same chromosome. The CNV and DNA methylation corresponding to gene i are denoted as N Vi i C R and N Mi i D R respectively, where i V and i M are the numbers of CNV and DNA methylation that are related to gene i . The matrix ( ) P P B R is the adjacency matrix of gene expression in a gene regulatory network. It is supposed that there is no self-regulation, that is, 0 ii B where i {1,..., P} . Inference of Integrative Gene Regulatory Network In this research, construction of an adjacency matrix to infer an integrative gene regulatory network of multi-omics data was considered. Moreover, the interaction results of CNVs and DNA methylations with the correlating genes were investigated. When a gene (G1 in Figure 7) regulates another gene (G4), iGRN represents the gene (G1) with a CNV (CNV2) and a DNA methylation (DM2) as having an integrative interaction with the target gene (G4).The integrative relationships between a gene i and a set of SNPs corresponding to the gene can be defined by Fisher's reciprocal action pattern: 19 , i i g C i i g D (1) Where is an element-by-element multiplication. The equation demonstrates differentially activated gene expression on alterations of CNVs or DNA methylations. For example, CNVs that are almost replicated may influence various expressions of the gene. Therefore, gene expression cannot completely show the biological processes without considering the effects of the interactions. The proposed iGRN was demonstrated by integrating theinteraction effects among multi-omics data. Genes interact with a selection of genetic regulators expressed by the multi-omics data. Thus, a reasonable hypothesis is that a sparse linear model of genes as well as CNVs and DNA methylations, which are related to the gene i , can demonstrate the expression of gene i . Hence, the gene expression i g for gene i can be shownby: i d i i i c i i i g i i i g G b C g b D g b ( ) ( ) (2) Where g i b , c i b and d i b are the coefficients of other gene expressions, CNVs and DNA methylations of gene i , respectively. The residual is represented as i . The gene regulatory network has the adjacency matrix B that is composed of b i (1 i P) g in Equation 2, that is, g T P g B {b ,..., b } 1 . An inference for gene regulatory network can be obtained by optimizing the parameters. Least squares with a sparse setting are used to attain a learning function for the optimal parameters. argmax ( , , ) g ci di F b g b 2 ( ) ( ) d i i ci i i i g i i i g G b C g b D g b d d i c c g g i b b b i (3) 20 Where g , c and d are hyper-parameters for scarcity regularization, and . 2 and . show L- 2 and L-1 norms, respectively. The optimization function can be investigated as the following LASSO problem: argmax 2 g Xb b i (4) Where X is the augmented matrix, { , , } i i i i i X G C g D g . However, the number of genes ( ) i G is usually much greater than the number of CNVs and DNA methylations correlated to gene i . For instance, there are just a few CNVs ( ) i C or DNA methylations ( ) i D for a gene in the psychiatric disorder data that was used for the experiment in the research, whereas the number of gene expression ( ) i G was hundreds after pre-processing. Therefore, the solution of LASSO may tend to neglect most CNVs and DNA methylations even though they are significant. Thus, the optimization problem was solved step by step. Firstly, significant genes (among i G ) that interact with gene i were denoted by LASSO: argmax 2 g i g i i i g G b b (5) Secondly, a matrix of i G was constituted with only the genes of non-zero coefficients and p-values were computed by the following linear regression: i d i i i c i i i g i i i g G b C g b D g b ( ) ( ) (6) Then, the coefficients of the genes, CNVs and DNA methylations, where p-values 0.05, were designated to the adjacency matrix, B .The procedure is demonstrated in Algorithm 1. 21 Algorithm 1. iGRN Step Action(s) 1 For i{1,..., P} do 2 g i b = LASSO ( , ) i i G g in (5) 3 Compute the linear regression of (6) 4 b otherwise if b or p value b b g ij g ij g g ij ij 0 0 ( ) 0.05 5 b otherwise if b or p value b b c ij c ij c c ij ij 0 0 ( ) 0.05 6 b otherwise if b or p value b b d ij d ij d d ij ij 0 0 ( ) 0.05 7 g ij i B b 8 end for Figure 7. Example of iGRN. The interactive effects of Copy number variation (CNV) and DNA methylation (DM) are incorporated in the gene regulatory network model. 22 Chapter 4 SIMULATION STUDY In order to evaluate the proposed method and compare the performance with current methods, extensivesimulation experiments were conducted. The assessment of gene regulatory network inference in complex organisms such as humans is challenging due to there only being a few true models of biological networks available. Thus, the performance was indirectly evaluated with simulation data that implemented integrative biological networks where the true model was given. The simulation data were generated as a result of the hypothesis for the integrative gene regulatory networks. In the simulation studies, the aim wasto (1) verify that the proposed method produces robust performance to identify the true models of gene regulatory networks from multi-omics data, and (2) to compare the performance with current state-of-the-art methods on the given hypothesis. The following three experiments were carried out with the simulation data: (1) Receiver Operating Characteristic (ROC) curve, (2) sensitivity, and (3) false discovery rate. Simulation Settings In the integrative gene regulatory network model, gene expression can be represented by two components: (1) gene expression regulated by other genes ( ) g G and, (2) interactions of CNVs and DNA methylations ( ) i G , as shown in Equation 2: g i G G G (7) where g i i g G G b ( ) ( ) . { } { } d i i i c i i i i G C g b D g b First, g G was generated by the given adjacency matrix, Z : 23 1 ( ) G E I Z g (8) Where P P I R is an identity matrix, and N P E R is normally distributed random values for noise, E ~ N(0,0.01) . The adjacency matrix Z is a sparse acyclic graph without self-loop. The CNV data ( ) N P C R were implemented by taking the values {0, 1, 2, 3, 4}with the corresponding probabilities{0.01, 0.02, 0.4, 0.2 , 0.1}. The given probabilities were directly acquired from CNVs of the human brain data that are used in Chapter 5. The DNA methylation ( ) N M D R was randomly obtained by the uniform distribution on the interval [0,1]. In practice, CNVs and DNA methylations are annotated to nearby genes by using their loci and gene regions. The associations were designated by sparse Boolean mapping matrices V P W R and M P F R for CNVs and DNA methylations, where only a couple of CNVs and DNA methylations can be annotated to a gene. In this simulation data, it is assumed that all of the CNVs and DNA methylations near a gene are significantly regulating the gene expression. The gene expression regulated by the interactions of CNVs and DNA methylations was generated by: G CW G DF G i (9) The gene expression controls the gene expression levels of other genes with the interaction effects of multi-omics data in gene regulatory networks. Therefore, Equations 8and 9 are repeated until G converges. It should be noted that Z ,W and F are the biadjacency matrices of ground truth in the simulation studies. The algorithm is described in Algorithm 2. 24 1 1 ( ) G E I Z g 2 do 3 1 ( )( ) G E CW G DF G I Z 4 while G converges The LASSO-based GRN method (GRN) was considered as the baseline. Moreover, DCGRN, which is an integrative gene regulatory network inference method, uses multi-omics data. The GRN infers the gene regulatory relationship on gene i with the LASSO regularization: , i g i i i g G b Subject to b C. g i (10) The GRN identifies significant gene regulations by the LASSO solution, but it considers only gene expression data for the network inference. On the other hand, DCGRN incorporates multi-omics data of CNVs and DNA methylations in the model: Subject to d d g i g c i c g i g i b C , b C , b C ,and b C (11) Algorithm 2. Generate data Step Action , { } { } i d i i c i i g i i i g G b C b D b 25 Simulation Study 1 Forthe first set of simulations, the performance of the proposed method was evaluated by computing the Area Under the ROC curve (AUROC). The confusion matrix of true positive (TP), false positive (FP), true negative (TN), and false negative (FN) is defined as: TP: correctly identified positive gene regulations of non-zero coefficients FP: incorrectly identified positive gene regulations as zero coefficients TN: correctly identified negative gene regulations as zero coefficients FN: incorrectly identified negative gene regulations as non-zero coefficients The non-zero coefficients of d i g i b ,b , and c i b were considered as positives, while the coefficients of zero were negatives. The confusion matrices for gene regulations and integrative interactions of CNVs and DNA methylations were separately computed. The Receiver Operating Characteristic (ROC) curves were traced over different thresholds to examine the trade-off between True Positive Rate (TPR = TP/[TP + FN]) and False Positive Rate (FPR = FP/[FP + TN]). The hyper-parameters ( g , c , and d ) in Equation 3determine the sparsity of significant components of non-zero coefficients in the multi-omics data. It should be noted that all of the coefficients are non-zero when the parameter is zero, while all coefficient values become zero when the infinite value is given for the parameter.The sparsity step (1 P V M) determines the hyper-parameters in the LASSO solution. In this simulation study for the ROC curves, only the coefficient values were considered to determine the positive interactions, where p-values were not computed. The GRN computes only the confusion matrix for gene regulations, while the DCGRN and iGRN have confusion matrices for CNVs and DNA methylations as well as gene expression. 26 Therefore, overall ROC curves were considered where only the confusion matrix of gene regulation was reflected on GRN, while the three confusion matrices were combined to compute ROC curves in DCGRN and iGRN. The overall ROC curves are illustrated in Figure 8, and the Area Under the ROC curves (AUROC) is shown in Table 1. The experimental result of the overall AUROC supports that iGRN (0.938) provides better performance than GRN (0.895) and DCGRN (0.843). TPRs on interaction of the CNVs and DNA methylations were measured for DCGRN and iGRN. Since the simulation data did not include negatives on CNVs and DNA methylations, only the efficacy of the methods used to identify the true positives was examined. The TPRs are shown in Figure 9, where iGRN outperforms DCGRN in identifying true integrative interaction of DNA methylations and CNVs. 27 Figure 8. ROC plot for GRN, DCGRN, and iGRN Table 1. AUROC with simulation data. 28 Figure 9. TPRs for interaction effects of CNVs and DNA methylation. 29 Simulation Study 2 Forthe second set of simulations, the overall sensitivity, which is the probability of identifying the true positives, was measured. In this simulation study, the hyper-parameters were optimized by 10-fold cross-validation. The multi-omics elements of non-zero coefficients with p-values less than 0.05 were considered positives. The overall sensitivity is depicted inFigure 10. iGRN produced the best sensitivity (0.300 0.034), while GRN and DCGRN showed (0.199 0.030) and (0.269 0.035) , respectively. The sensitivities of CNVs and DNA methylations on iGRN and DCGRN are shown in Figure 11. The sensitivities for iGRN and DCGRN were (0.054 0.030) and (0.102 0.035), respectively. 30 Figure 10. Sensitivity. 31 Figure 11. Sensitivity on DNA methylation and copy number variations. 32 Simulation Study 3 Forthe third set of simulations, the False Discovery Rate (FDR) was conducted. In this study, simulation data that had no gene-gene regulation in the biological network were generated. All positive predictions that the methods infer were false positives, since the true adjacency matrix was all zero. FDR was computed as FP/(TP + FP). The FDRs of GRN, DCGRN, and iGRN less than 0.02 are indicated in Figure 12, where the FDRs were (0.019 0.003), (0.019 0.003), and (0.019 0.003), respectively. This showed that iGRN had less than 2% chance of identifying insignificant interactions. 33 34 Figure 12. FDR rate for GRN, DCGRN, and iGRN. 35 Chapter 5 EXPERIMENT RESULTS Human Brain Data for Psychiatric Diseases The proposed method (iGRN) was implemented on 131 samples of human brain data from individuals with psychiatric diseases. In this research, multi-omics data of gene expression, CNV and DNA methylation, were acquired from the prefrontal cortex of psychiatric patients(Liu et al., 2010). Among the 131 samples, 39 of them were from individuals diagnosed with schizophrenia, 35 were from individuals diagnosed with bipolar disorder, 12 were from individuals diagnosed with major depression, and 44 were healthy and served as the control group, ofwhich each specimen had 25,833 gene expressions, 1,028 of CNV, and 24,399 of DNA methylation. The psychiatric disorder data were considered as a group incorporating the three samples. Two biological replicas of gene expression were used to test the consistency of the proposed method and evaluate its performance. Preprocessing There were 25,833 gene expression data, of which only 495 genes were selected. The samples were classified into two groups, psychiatric diseases and healthy control. Then, the fold changes (FC) and p-values of the two groups were determined. In addition, 495 genes were preferred where their p-values 0.05and FC < 0.09orFC > 1.1(see Figure 13). CNVs and DNA methylations were elucidated according to their locations. If a CNV’s region has some overlaps with the th i gene in the same chromosome, it is elucidated as correlated to gene i . DNA methylation is correlated to gene i if its locus is positioned within 2Kbp around the Transcription Start Site (TSS) of gene i . 36 Figure 13. Volcano plot for preprocessing. Comparison Experiments were conducted on two biological replicas of human brain data to evaluate the proposed iGRN of psychiatric disorders. The performance of iGRN was contrasted with ordinary regression-based gene regulatory network inference (GRN) and DCGRN (Kim et al., 2014). To be concise, GRN infers the interaction with other genes on gene i by: i g i i i g G b (2) and, DCGRN does: i d i i c i i g i i i g G b C b Db (3) The following two experiments were performed: (1) root-mean-square error (RMSE) and, (2) enumerating the numbers of statistically substantial CNVs and DNA methylations. Firstly, RMSE between i g and their functions, ( ) i g , were calculated where, 37 G b C g b D g b for iGRN G b C b D b forDCGRN G b for GRN g d i i i c i i i g i i d i i c i i g i i g i i i ( ) ( ) ( ) The RMSEs of GRN, DCGRN, and iGRN were (6.82 2.22), (6.23 2.49) , and (4.54 3.02) , respectively forSample 1, and (7.23 2.25) , (6.17 2.66), and (4.11 2.90) forSample 2 (see Table 2). The best illustration of gene expressions was demonstrated by the RMSE of the proposed method iGRN. Moreover, the Wilcoxon signed-rank test was implemented, which is a non-parametric paired two-sided test to statistically assess the RMSE performance. In comparison with experiments conducted on DCGRN including both biological replicas, iGRN showed a statistically significant enhancement of RMSE. However, DCGRN produced a better result than GRN. The p-values of the Wilcoxon signed-rank test between iGRN and DCGRN represented in Table 2 are much less than 0.05, which indicatesthat the calculated difference of RMSE between iGRN and DCGRN is statistically significant. Next, the numbers of the CNVs and DNA methylations that were distinguished as statistically significant in the model were compared. The comparison was only performed for DCGRN as there was no CNV or DNA methylation in GRN. There were only one CNV and 12 DNA methylations distinguished as significant by the DCGRN method. In contrast, 10 CNVs and 188 DNA methylations of significance in iGRN were identified with Replicate 1. This may imply that iGRN can result in a better inference of the integrative structure of the gene regulatory network than DCGRN. The results were identical toReplicate 2. Two discoveries were represented in the experiments with iGRN. This illustrated that FKBP5’s gene expression was regulated by ZBTB16, MT2A, FYB, and PDK4, and CNV of CNP11052, and DNA methylation of cg00862770 were affected by this interaction. CNV or DNA methylation was not affected in the integrative network in DCGRN. Inanother instance, 38 SNRPN’s gene expression was regulated by SNORD115-3, CNV of CNP12294, and DNA methylation of cg02171545. In a similar way, there was no CNV or DNA methylation in DCGRN's model. Table 2. Comparison with biological replicates. Replicate #1 GRN DCGRN iGRN RMSE 6.82 2.22 6.23 2.49 4.54 3.02 P-value(Wilcoxon Signed-Rank Test with) - 1.80 e-50 (GRN) 2.30e-32 (DCGRN) # of significant CNV - 1 10 # of significant DM - 12 188 Replicate #2 GRN DCGRN iGRN RMSE 7.23 2.25 6.17 2.66 4.11 2.90 P-value(Wilcoxon Signed-Rank Test with) - 2.61e-47 (GRN) 3.50e-27 (DCGRN) # of significant CNV - 1 3 # of significant DM - 7 153 39 Chapter 6 CONCLUSIONS AND FUTURE WORK Multi-omics data play an important role in modeling gene regulatory networks. The recent rapid advances in high-throughput omics technologies have triggered the integrative multi-omics study for the in-depth understanding of complex biological processes. However, only a few studies have considered the multi-omics data in gene regulatory network inference. In this research, an integrative gene regulatory network inference method was proposed, where multi-omics data and their interaction effects were integrated in the mathematical graph model. The proposed method, iGRN, can infer gene regulatory networks from multi-omics data of CNVs and DNA methylations as well as gene expression data, and produce the homogeneous network where nodes are only genes. The inference capability of iGRN was assessed by the extensiveexperiments with simulation data. iGRN was applied to human brain data from individuals withpsychiatric disorders, and the biological network of the psychiatric disorders was analyzed. The future trend of this research will include: 1. Incorporating SNP and miRNA in the model 2. Applying the proposed method to multiple types of multi-omics data such as The Cancer Genome Atlas (TCGA) 3. More elaborate experiments with explicit genomic data 4. Evaluation with well-known gene regulatory network databases 5. In-depth biological interpretation 40 REFRENCES Altay, G., & Emmert-Streib, F. (2010). Inferring the conservative causal core of gene regulatory networks. BMC Systems Biology, 4, 132. Aure, M. R., Leivonen, S. K., Fleischer, T., Zhu, Q., Overgaard, J., Alsner, J., … & Kristensen, V. N. (2013). Individual and combined effects of DNA methylation and copy number alterations on miRNA expression in breast tumors. Genome Biology, 14(11), R126. https://doi.org/10.1186/gb-2013-14-11-r126 Breckpot, J., Thienpont, B., Gewillig, M., Allegaert, K., Vermeesch, J. R., & Devriendt, K. (2012). Differences in copy number variation between discordant monozygotic twins as a model for exploring chromosomal mosaicism in congenital heart defects. Molecular Syndromology, 2(2), 81–87. https://doi.org/10.1159/000335284 Cai, X., Bazerque, J.A., & Giannakis, G.B. (2013). Inference of gene regulatory networks with sparse structural equation models exploiting genetic perturbations. PLoS Computational Biology, 9(5), e1003068. doi:10.1371/journal.pcbi.1003068 Chai, L. E., Loh, S. K., Low, S. T., Mohamad, M. S., Deris, S., & Zakaria, Z. (2014). A review on the computational approaches for gene regulatory network construction. Computers in Biology and Medicine, 48(1), 55-65. https://doi.org/10.1016/j.compbiomed.2014.02.011 Cisek, K., Krochmal, M., Klein, J., & Mischak, H. (2015). The application of multi-omics and systems biology to identify therapeutic targets in chronic kidney disease. Nephrology Dialysis Transplantation, 31(12), 2003-2011. https://doi.org/10.1093/ndt/gfv364 Gevaert, O., Smet, F.D., Timmerman, D., Moreau, Y., & Moor, B.D. (2006). Predicting the prognosis of breast cancer by integrating clinical and microarray data with Bayesian networks. Bioinformatics, 22, e184–e190. Gligorijević, V., & Pržulj, N. (2015). Methods for biological data integration: Perspectives and 41 challenges. Journal of the Royal Society Interface,12(112), 20150571.doi: 10.1098/rsif.2015.0571 Gustafsson, M., Hörnquist, M., & Lombardi, A. (2005). Constructing and analyzing a large-scale gene-to-gene regulatory network-lasso-constrained inference and biological validation. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2(3), 254–261. Henrichsen, C. N., Chaignat, E., & Reymond, A. (2009). Copy number variants, diseases and gene expression. Human Molecular Genetics, 18(R1), R 1-8.https://doi.org/10.1093/hmg/ddp011 Higdon, R., Earl, R. K., Stanberry, L., Hudac, C. M., Montague, E., Stewart, E., … & Kolker, E. (2015). The promise of multi-omics and clinical data integration to identify and target personalized healthcare approaches in autism spectrum disorders. Omics: A Journal of Integrative Biology, 19(4), 197–208. https://doi.org/10.1089/omi.2015.0020 Hill, S. M., Heiser, L. M., Cokelaer, T., Unger, M., Nesser, N. K., Carlin, D. E., … & Mukherjee, S. (2016). Inferring causal molecular networks: Empirical assessment through a community-based effort. Nature Methods, 13(4), 310–318. https://doi.org/10.1038/nmeth.3773 Husmeier, D., & Werhli, A. V. (2007). Bayesian integration of biological prior knowledge into the reconstruction of gene regulatory networks with Bayesian networks. Computational Systems Bioinformatics / Life Sciences Society. Computational Systems Bioinformatics Conference, 6, 85–95. Kang, M., Kim, D. C., Liu, C., & Gao, J. (2015). Multiblock discriminant analysis for integrative genomic study. BioMed Research International, 2015(2015),Article ID 783592, 1-10. https://doi.org/10.1155/2015/783592 42 Kang, M., Tang, L., & Gao, J. (2016). Computational modeling of phagocyte transmigration for foreign body responses to subcutaneous biomaterial implants in mice. BMC Bioinformatics, 17(1), 111. https://doi.org/10.1186/s12859-016-0947-3 Kim, D.C., Kang, M., Zhang, B., Wu, X., Liu, C., & Gao, J. (2014). Integration of DNA methylation, copy Number variation, and gene expression for gene regulatory network inference and application to psychiatric disorders. 2014 IEEE International Conference on Bioinformatics and Bioengineering, 238–242. https://doi.org/10.1109/BIBE.2014.71 Kim, D.C., Liu, C., Wu, X., Zhang, B., & Gao, J. (2013). Inference of SNP-gene regulatory networks by integrating gene expressions and genetic perturbations. IEEE, 2014(2014). doi: 10.1109/BIBM.2013.6732484 Kristensen, V. N., Lingjaerde, O. C., Russnes, H. G., Vollan, H. K., Frigessi, A., & Borresen-Dale, A. L. (2014). Principles and methods of integrative genomic analyses in cancer. Nature Reviews Cancer, 14(5), 299–313. https://doi.org/10.1038/nrc3721 Lanckriet, G.R.G., Bie, T.D., Cristianini, N., Jordan, M.I., & Noble, W.S. (2004). A statistical framework for genomic data fusion. Bioinformatics, 20, 2626–2635.doi:10.1093/bioinformatics/bth294 Langfelder, P., & Horvath, S. (2008). WGCNA: An R package for weighted correlation network analysis. BMC Bioinformatics, 9, 559. Liang, K.C., & Wang, X. (2008). Gene regulatory network reconstruction using conditional mutual information. EURASIP Journal on Bioinformatics & Systems Biology, 2008(1), 253894. https://doi.org/10.1155/2008/253894 Liu, C., Cheng, L., Badner J.A., Zhang, D., Craig, D.W., Redman, M., & Gershon, E.S. (2010). Whole-genome association mapping of gene expression in the human prefrontal cortex. 43 Molecular Psychiatry, 15(8), 779-784. doi: 10.1038/mp.2009.128. Njah, H., & Jamoussi, S. (2015). Weighted ensemble learning of Bayesian network for gene regulatory networks. Neurocomputing, 150(PB), 404–416. Oh, J. H., & Deasy, J. O. (2014). Inference of radio-responsive gene regulatory networks using the graphical lasso algorithm. BMC Bioinformatics, 15(7), S5. https://doi.org/10.1186/1471-2105-15-S7-S5 Omranian, N., Eloundou-Mbebi, J. M. O., Mueller-Roeber, B., & Nikoloski, Z. (2016). Gene regulatory network inference using fused LASSO on multiple data sets. Scientific Reports, 6, 20533. Ozen, A., Gonen, M., Alpaydin, E., & Haliloglu, T. (2009). Machine learning integration for predicting the effect of single amino acid substitutions on protein stability. BMC Structural Biology, 9(66). doi: 10.1186/1472-6807-9-66 Petralia, F., Wang, P., Yang. J., & Tu, Z. (2015). Integrative random forest for gene regulatory network inference. Bioinformatics, 31(12), i197-i205. doi: https://doi.org/10.1093/bioinformatics/btv268 Reshef, D. N., Reshef, Y. A., Finucane, H. K., Grossman, S. R., McVean, G., Turnbaugh, P. J., … & Sabeti, P. C. (2011). Detecting novel associations in large data sets. Science, 334(6062), 1518–1524. https://doi.org/10.1126/science.1205438 Wagner, J. R., Busche, S., Ge, B., Kwan, T., Pastinen, T., & Blanchette, M. (2014). The relationship between DNA methylation, genetic and expression inter-individual variation in untransformed human fibroblasts. Genome Biology, 15(2), R37. https://doi.org/10.1186/gb-2014-15-2-r37 Werhli, A. V., & Husmeier, D. (2007). Reconstructing gene regulatory networks with bayesian 44 networks by combining expression data with multiple sources of prior knowledge. Statistical Applications in Genetics and Molecular Biology, 6(1), Article15. Young, W. C., Raftery, A. E., & Yeung, K. Y. (2014). Fast Bayesian inference for gene regulatory networks using ScanBMA. BMC Systems Biology, 8(1), 47. Zhang, W., Li, F., & Nie, L. (2010). Integrating multiple ―omics‖ analysis for microbial biology: Application and methodologies. Microbiology, 156(2), 287-301. https://doi.org/10.1099/mic.0.034793-0 Zhang, X., Zhao, X. M., He, K., Lu, L., Cao, Y., Liu, J., …& Chen, L. (2012). Inferring gene regulatory networks from gene expression data by path consistency algorithm based on conditional mutual information. Bioinformatics, 28(1), 98–104. https://doi.org/10.1093/bioinformatics/btr626 45 APPENDICES 46 APPENDIX A LIBRARIES 47 LIBRARIES #check if a package is installed or not. is.installed <- function(mypkg) is.element(mypkg, installed.packages()[,1]) if(!is.installed("ggplot2")) install.packages("ggplot2") library(ggplot2) if(!is.installed("Matrix")) install.packages("Matrix") library(Matrix) if(!is.installed("lars")) install.packages("lars") library(lars) if(!is.installed("igraph")) install.packages("igraph") library(igraph) if(!is.installed("ROCR")) install.packages("ROCR") library(ROCR) if(!is.installed("Bolstad2")) install.packages("Bolstad2") require(Bolstad2) if(!is.installed("MESS")) install.packages("MESS") require(MESS) if(!is.installed("sfsmisc")) install.packages("sfsmisc") require(sfsmisc) if(!is.installed("Bolstad2")) install.packages("Bolstad2") require(Bolstad2) 48 APPENDIX B PREPROCESSING 49 PREPROCESSING #set path of project directory #setwd("C:/Users/neda/Dropbox/Neda/BIBM2016_Workshop") source("Src/FileUtil.R") allowWGCNAThreads() options(stringsAsFactors = FALSE) #Read in the expression gene data set expData <- read.csv("../Data/Gene expression/gene expression.csv") dim(expData) names(expData) #Remove gene information and transpose the expression data datExpr0 <- as.data.frame(t(expData[, -1])) names(datExpr0) <- expData$ProbeID rownames(datExpr0) <- names(expData)[-1] # Now we read in the physiological trait data traitData <- read.csv("../Data/Gene expression/Phenotype.csv") dim(traitData) names(traitData) traitData$disease <- as.character(traitData$disease) traitData$disease[ traitData$disease == "Schiz." ] <- 1 traitData$disease[traitData$disease == "BP" ] <- 1 traitData$disease[traitData$disease == "Dep." ] <- 1 traitData$disease[traitData$disease == "BP (not BP)" ] <- 1 traitData$disease[traitData$disease == "Unaffected control"] <- 0 #================================================================= # Sort the rows of traitData so that # they match those of datExpr0. patients <- rownames(datExpr0) traitRows <- match(patients, traitData$id) t <- traitData[traitRows, 1] datTraits <- traitData[traitRows, -1] datTraits <- as.matrix(datTraits) rownames(datTraits) <- rownames(datExpr0) # show that row names agree table(rownames(datTraits) == rownames(datExpr0)) head(datTraits) #================================================================= #divide samples into two groups:healthy and unhealthy 50 traitData$disease <- as.character(traitData$disease) traitData$disease[ traitData$disease == "Schiz." ] <- 1 traitData$disease[traitData$disease == "BP" ] <- 1 traitData$disease[traitData$disease == "Dep." ] <- 1 traitData$disease[traitData$disease == "BP (not BP)" ] <- 1 traitData$disease[traitData$disease == "Unaffected control"] <- 0 traitData$disease = as.factor(traitData$disease) d <- traitData$disease healthy <- subset( traitData, d == 0) unhealthy <- subset( traitData, d == 1) #================================================================= #find healthy and unhealthy samples in expression data healthy.gen <- datExpr0[healthy$id,] unhealthy.gen <- datExpr0[unhealthy$id,] #delete samples which we don't have in the expression data row.has.na <- apply(unhealthy.gen, 1, function(x){any(is.na(x))}) sum(row.has.na) unhealthy.gen.filtered <- unhealthy.gen[!row.has.na,] row.has.na2 <- apply(healthy.gen, 1, function(x){any(is.na(x))}) sum(row.has.na2) healthy.gen.filtered <- healthy.gen[!row.has.na2,] fc <- apply(unhealthy.gen.filtered,2,mean)/apply(healthy.gen.filtered,2,mean) log2FoldChange <- log2(fc) #================================================================= # Define a function to return the p-value of a Welch test return.t <- function(x,y) { t.test(x, y, alternative = "two.sided", var.equal = FALSE, paired = FALSE)$p.value } #find p-value for each gene using above function pv <- vector() n <- ncol(datExpr0) for (i in 1:n) { pv[i] <- return.t(healthy.gen.filtered[, i], unhealthy.gen.filtered[, i]) } #================================================================= #drawing volcanoplot pvalue <- as.vector(pv) 51 plot(log2FoldChange, -1 * log10(pvalue), pch = 1, cex = 0.7, xlab = "Log2(Fold Change)", ylab = "log10(1/p-value)", col = "darkgray") grid() #================================================================= #threshold t <- cbind(fc, pvalue) volcano <- as.data.frame(t) #define a threshold with pvalue less than 0.05 and fold change,increase/decrease 1.1 times volcano$threshold <- factor((volcano$fc >1.1 volcano$fc <0.9) & volcano$pvalue <0.05) ggplot(data = volcano, aes(x=log2FoldChange, y=-log10(pvalue), colour = threshold)) + geom_point(alpha = 0.1, size = 3) + scale_y_log10() #determine indexes which are satisfy the threshold and consider genes correspond to these indexes index <- which((fc >1.1 fc <0.9) & pvalue <0.05) #this number is the number of genes that we would consider since now length(index) expData <- expData[index,] datExpr0 <- datExpr0[,index] 52 APPENDIX C METHODS 53 METHODS #normalize the data expression X <- scale(datExpr0) dim(datExpr0) x <- list() y <- list() las <- list() cvlas<- list() frac <- list() las.coef <- list() b <- list() nc <- ncol(datExpr0) nr <- nrow(datExpr0) #each time x[[i]] is all expression data with column i substituited with 1, and y[[i]] is column i for (i in 1:nc) { x[[i]] <- as.matrix(X) x[[i]][,i] = 1 y[[i]] <- as.matrix(X[,i]) } #================================================================= #fit LASSO model #find the coefficients using LASSO for each pair of x[[i]] and y[[i]] for gene i for (i in 1:nc) { las[[i]] <- lars(x[[i]], y[[i]], type = "lasso") cvlas[[i]] <- cv.lars(x[[i]], y[[i]], type = "lasso", plot.it = FALSE) frac[[i]] <- cvlas[[i]]$index[which.min(cvlas[[i]]$cv)] las.coef[[i]] <- predict.lars(las[[i]], type = "coefficients", mode = "fraction", s = frac[[i]]) b[[i]] <- las.coef[[i]]$coefficients } #find lm for nonzero columns of gene expression affter applying lasso for #each gene i and consider tose columns that corresponding p-value #is less than 0.05 A <- list() idx <- list() pval <- list() dat <- list() B <- list() fit.lm <- list() 54 for(i in 1:nc) { idx[[i]] <- which(b[[i]]!= 0) A[[i]] <- X[, idx[[i]]] dat[[i]] <- cbind(y[[i]], A[[i]]) fit.lm[[i]] <- lm(dat[[i]][, 1] ~. , data = as.data.frame(dat[[i]])) pval[[i]] <- summary(fit.lm[[i]])$coefficients[, 4] pval[[i]]=pval[[i]][-c(1, 2)] d <- which(pval[[i]] <0.05) if (length(d) == 0) B[[i]] <- 0 else { B[[i]] <- as.matrix(A[[i]][, d]) colnames(B[[i]]) <- colnames(A[[i]])[d] } } #================================================================= #match the CNV with gene expreesion cnv <- read.csv("../Data/CNV/SMRI.csv") d <- match(colnames(cnv), colnames(expData)) cnp_id <- cnv[, 1] m <- which(!is.na(d)) cnv <- cnv[, m] cnv <- cbind(cnp_id, cnv) #match the Methylation with gene expreesion dm <- read.csv("../Data/DNA methylation/Methylation.csv") d <- match(colnames(dm), colnames(expData)) ID <- dm[, 1] m <- which(!is.na(d)) dm <- dm[, m] dm <- cbind(ID, dm) #================================================================= #match the column order of expression, CNV, and DNA methylation data cnp_id <- cnv[, 1] cnv <- cnv[, -1] ID <- dm[, 1] dm <- dm[, -1] ProbeID <- expData[, 1] expData <- expData[, -1] #make CNV and gene expression of the same order 55 # "colnames(cnv[,z])" is column names of cnv which are match with those columns in dm #show column number of cnv match with colnames(cnv[,z]) g <- sort(match(colnames(expData),colnames(cnv))) #map the common columns in cnv into dm cnv[,g] <- cnv[,c(colnames(expData))] #rename the mapped columns in dm colnames(cnv)[g] <- colnames(expData) #================================================================= #make DNA methylation and expression data to the of the same order #show the columns in expData that match with columns in dm match_dm <- match(colnames(expData),colnames(dm)) #show the column number in dm that are match with those in expData z <- which(!is.na(match_dm)) # "colnames(dm[,z])" is column names of dm which are match with those columns in expData #show column numbers of expData match with "colnames(dm[,z])" g <- sort(match(colnames(expData[, z]),colnames(dm))) #map the common columns in dm into expData dm[, g] <- dm[, c(colnames(expData[, z]))] #rename the mapped columns in expData colnames(dm)[g] <- colnames(expData[, z]) #new ordered cnv,dm, and expData cnv <- cbind(cnp_id, cnv) expData <- cbind(ProbeID, expData) dm <- cbind(ID, dm) datExpr0 <- as.data.frame(t(expData[, -1])) names(datExpr0) <- expData$ProbeID rownames(datExpr0) <- names(expData)[-1] #================================================================= #fing DNA methylation for each gene. HuGene_csv <- read.csv("../Data/Gene expression/HuGene.csv") HM <- read.csv("../Data/DNA methylation/HumanMethylation27-annotation.csv") idx_HM <-list() idx_dm <- list() D <-list() for(i in 1:nc) { idx_HuGene <- match(colnames(datExpr0)[i], HuGene_csv[, 1]) #find ith gene symbol in HuGene and match symbols in HM idx_HM[[i]] <- match(HuGene_csv[idx_HuGene, 2], HM[, 2]) #find corresponding dm id for matched symbols in HM idx_dm[[i]] <- HM[idx_HM[[i]], 1] 56 #find dm id in dm m <- match(idx_dm[[i]], dm[, 1]) if(is.na(m)) D[[i]] <- 0 else { #find the coefficients as D D[[i]] <- as.numeric(dm[m,]) D[[i]][1] <- dm[m,1] } } #================================================================= #find the corresponding gene for CNV #the chromosome names in "GenomeWideSNP_6" file chSNP <- c("chr1""chr2""chr3""chr4""chr5""chr6""chr7""chr8""chr9""chr10""chr11""chr12", "chr13""chr14""chr15""chr16""chr17""chr18""chr19""chr20""chr21""chr22") #read the "HuGene" file and consider just the rows that their choromosoms #are in the set chSNP and asign this subset to "HuGene_subset" HuGene_subset <- subset(HuGene_csv,HuGene_csv[,3] %in% chSNP ) write.csv(HuGene_subset, "../Data/Gene expression/HuGene_subset.csv") #read the "GenomeWideSNP_6" and "HuGene_subset" as data tab GSNP <- fread("../Data/CNV/GenomeWideSNP_6.hg18.cnv_defs.csv") HuGene <- fread("../Data/Gene expression/HuGene_subset.csv") gr1 <- with(GSNP, GRanges(Rle(as.character(chr,levels = chSNP)), IRanges(start, end))) gr2 <- with(HuGene, GRanges(Rle(as.character(seqname,levels = chSNP)), IRanges(start, stop))) olaps <- findOverlaps(gr1, gr2) GSNP[, grp := seq_len(nrow(GSNP))] HuGene[subjectHits(olaps), grp := queryHits(olaps)] setkey(GSNP, grp) setkey(HuGene, grp) #find the table that shows matched genes and CNV's as well as #gene symbol and chromosome position HuGene[, list(grp, transcript_cluster_id, Gene_Symbol)][GSNP] #convert above table to data frame for more convenience olaps_data_farame <- as.data.frame(HuGene[, list(grp, transcript_cluster_id, Gene_Symbol)][GSNP]) write.csv(olaps_data_farame[, cbind(2:5)]"../Data/CNV/overlap.csv") #corresponding gene without repitition corr_gene_cnv <- unique(olaps_data_farame[, 2]) SNP <- read.csv("../Data/CNV/GenomeWideSNP_6.hg18.cnv_defs.csv") 57 #read the overlap file over_lap <- read.csv("../Data/CNV/overlap.csv") over_lap <- over_lap[, -1] #================================================================= #find the coefficient matrix of CNV for each gene i C <- list() inx <-list() #find the corresponding genes to CNV's in data expression v <- match(colnames(datExpr0),over_lap[,1]) #the column indexes of data expression(gene) which is correspond to CNV id <- which(!is.na(v)) m <- match(over_lap[v[id], 3],cnv[, 1]) #col=which(!is.na(match(colnames(cnv),colnames(expData)))) for (i in 1:length(id)) { C[[id[i]]] <- as.numeric(cnv[m[i], -1]) C[[id[i]]] <- c(cnv[m[i], 1],C[[id[i]]]) } j <- c(1:ncol(datExpr0)) j <- j[!j %in% id] for (i in j) # C[[i]] <- rep(0,nrow(datExpr0)-1) C[[i]] <- 0 #=================================================================#Find the linear regression for nonzero gene expression, DNA metylation dat #, and CNV data multimat <- list() fit <- list() D_1<- list() C_1<-list() pvalue <- list() for (i in 1:ncol(datExpr0)) { #append a zero to D[[i]](dna methylation data)because it has one sample less if(length(D[[i]]) >1) { D[[i]] <- append(D[[i]], 0) D_1[[i]] <- rep(0, nr + 1) } if(length(C[[i]]) >1) 58 C_1[[i]] <- rep(0, nr + 1) } #for each DNA mrthylation and CNV data find their multiplicaton #with gene expression of gene ith for (i in 1:nc) { for (j in 2:(nr + 1)) { if(length(D[[i]]) <= 1) D_1[[i]] <- 0 if(length(D[[i]]) >1) { D_1[[i]][j] <- as.numeric(D[[i]])[j] * as.numeric(y[[i]])[j] D_1[[i]][1] <- D[[i]][1] } if(length(C[[i]]) <= 1) C_1[[i]] <- 0 if(length(C[[i]]) >1) { C_1[[i]][j] <- as.numeric(C[[i]])[j] * as.numeric(y[[i]])[j] C_1[[i]][1] <- C[[i]][1] } } } #find the linear regression for (i in 1:nc) { multimat[[i]] <- as.numeric(y[[i]]) #add nonzero expression, dna methylation, and cnv data if(length(B[[i]]) >1) multimat[[i]] <- as.matrix(cbind(multimat[[i]], B[[i]])) if(length(D_1[[i]]) >1) { multimat[[i]] <- as.matrix(cbind(multimat[[i]], as.numeric(D_1[[i]][-1]))) colnames(multimat[[i]])[ncol(multimat[[i]])] = D_1[[i]][1] } if(length(C_1[[i]]) >1) { multimat[[i]] <- as.matrix(cbind(multimat[[i]], as.numeric(C_1[[i]][-1]))) colnames(multimat[[i]])[ncol(multimat[[i]])] = C_1[[i]][1] } } 59 k <- c(1:nc) for (i in 1:nc) { if(length(multimat[[i]]) == nr) k[i] <- 0 } zero <- which(k == 0) k <- k[!k %in% 0] for (i in k) { fit[[i]] <- lm(multimat[[i]][,1]~. , data = as.data.frame(multimat[[i]])) #find p-values pvalue[[i]] <- summary(fit[[i]])$coefficients[, 4] #we delete the first and second element of p-value since they are for intercept and pvalue[[i]] <- pvalue[[i]][-cbind(1, 2)] } G3 <- list() name <- list() for (i in k) { count <- which(pvalue[[i]] <0.05) if(length(count) == 0) G3[[i]] <- 0 if(length(count) == 1) { name[[i]] <- colnames(multimat[[i]])[as.numeric(count) + 1] G3[[i]]<- multimat[[i]][,as.numeric(count) + 1] } if(length(count) >1) { multimat[[i]] <- multimat[[i]][, -1] G3[[i]] <- multimat[[i]][, as.numeric(count)] } } for(i in zero) G3[[i]] <- 0 #R[[i]] has the coefficients and corresponding gene/dna methylation/cnv ID for each gene i R <- list() for(i in 1:nc) R[[i]] <- fit[[i]]$coefficients[-cbind(1, 2)] 60 save.image("result.RData") 61 APPENDIX D EXPERIMENT 62 EXPERIMENT #find the RMSE for GRN network for each gene i RMSE_GRN <- rep(0,nc) interc <- list() G <- list() b_g <- list() for (i in 1:nc) { #intercept for each gene i after LASSO and p-value interc[[i]] <- as.numeric(fit.lm[[i]]$coefficients[1]) s <- names(fit.lm[[i]]$coefficients) s <- gsub("[`]" "",s) idx <- match(colnames(B[[i]]), s) #coefficients of nonzero columns in gene expression after LASSO and p-value b_g[[i]] <- fit.lm[[i]]$coefficients[idx] #Multiplication of nonzero columns by coefficients G[[i]] <- as.numeric(B[[i]] %*% b_g[[i]]) } for (i in 1:nc) { Sum <- 0 for (j in 1:nr) #difference between G[[i]] and b_g[[i]] Sum <- Sum +(y[[i]][j] - (G[[i]][j]+as.numeric(interc[i])))^2 #RMSE for gene i in GRN RMSE_GRN[i] <- sqrt(Sum) } dim(G[[6]]) RMSE_GRN mean(RMSE_GRN[-which(is.na(RMSE_GRN))]) sd(RMSE_GRN[-which(is.na(RMSE_GRN))]) #=================================================================#find the RMSE for DCGRN network for each gene i RMSE_DCGRN <- rep(0, nc) G2 <- list() interc2 <-list() b_g2 <- list() name <-list() multimat2 <-list() fit_DCGRN <- list() for (i in 1:nc) 63 { multimat2[[i]] <- as.numeric(y[[i]]) #add nonzero expression, dna methylation, and cnv data if(length(B[[i]]) >1) multimat2[[i]] <- as.matrix(cbind(multimat2[[i]],B[[i]])) if(length(D[[i]]) >1) { multimat2[[i]] <- as.matrix(cbind(multimat2[[i]],as.numeric(D[[i]][-1]))) colnames(multimat2[[i]])[ncol(multimat2[[i]])]= D[[i]][1] } if(length(C[[i]]) >1) { multimat2[[i]] <- as.matrix(cbind(multimat2[[i]],as.numeric(C[[i]][-1]))) colnames(multimat2[[i]])[ncol(multimat2[[i]])]= C[[i]][1] } } pvalue_DCGRN <- list() l <- c(1:nc) for (i in 1:nc) { if(length(multimat2[[i]]) == nr) l[i] <- 0 } zero2 <- which(l == 0) l <- l[!l %in% 0] for (i in 1:nc) { if(length(multimat2[[i]]) > nr) { colnames(multimat2[[i]])[1] <- "g_i"#make a name for column y[[i]] fit_DCGRN[[i]] <- lm(g_i ~. , data = as.data.frame(multimat2[[i]])) #find p-values pvalue_DCGRN[[i]] <- summary(fit_DCGRN[[i]])$coefficients[,4] #we delete the first and second element of p-value since they are for intercept and pvalue_DCGRN[[i]]=pvalue_DCGRN[[i]][-1]#remove the p-value for intercept } } name_DCGRN <- rep(0,nc) for (i in l) { count= which(pvalue_DCGRN[[i]] <0.05) if (length(count) == 0) 64 { G2[[i]] <- 0 } if(length(count) == 1) { name_DCGRN[i] <- colnames(multimat2[[i]])[as.numeric(count) + 1] G2[[i]] <- multimat2[[i]][,as.numeric(count) + 1] } if(length(count) >1) { multimat2[[i]] <- multimat2[[i]][,-1] G2[[i]] <- multimat2[[i]][,as.numeric(count)] } } for(i in zero2) G2[[i]] <- 0 interc2 <- list() b_g2 <- list() countD_DCGRN <- 0 countC_DCGRN <- 0 for (i in 1:nc) { #intercept for each gene i after LASSO and p-value interc2[[i]] <- as.numeric(fit_DCGRN[[i]]$coefficients[1]) s <- names(fit_DCGRN[[i]]$coefficients) s <- gsub("[`]" "", s) #names(fit_DCGRN[[i]]$coefficients)<- s #coefficients of nonzero columns in gene expression after LASSO and p-value if(length(G2[[i]]) == nr) { idx <- match(name_DCGRN[i], s) b_g2[[i]] <- as.numeric(fit_DCGRN[[i]]$coefficients[idx]) G2[[i]] <- as.numeric(G2[[i]] * b_g2[[i]]) if(!is.na(match(name_DCGRN[[i]], dm[, 1]))) countD_DCGRN=countD_DCGRN + 1 if(!is.na(match(name_DCGRN[[i]], cnv[, 1]))) countC_DCGRN=countC_DCGRN + 1 } if(length(G2[[i]]) > nr) { idx<- match(colnames(G2[[i]]), s) b_g2[[i]] <- as.numeric(fit_DCGRN[[i]]$coefficients[idx]) countD_DCGRN <- countD_DCGRN+sum(!is.na(match(colnames(G2[[i]]),dm[,1]))) 65 countC_DCGRN <- countC_DCGRN+sum(!is.na(match(colnames(G2[[i]]),cnv[,1]))) #Multiplication of nonzero columns by coefficients G2[[i]] <- as.numeric(G2[[i]] %*% b_g2[[i]]) } } for (i in 1:nc) { Sum <- 0 for (j in 1:nr) #difference between G3[[i]] and b_g3[[i]] Sum <- Sum + (y[[i]][j] - (G2[[i]][j]+as.numeric(interc2[i])))^2 #RMSE for gene i in our method, its NA if the G3[[i]] is zero RMSE_DCGRN[i] <- sqrt(Sum) } RMSE_DCGRN mean(RMSE_DCGRN[which(!is.na(RMSE_DCGRN))]) sd(RMSE_DCGRN[which(!is.na(RMSE_DCGRN))]) countD_DCGRN countC_DCGRN #=================================================================#find the RMSE for our model for each gene i RMSE <- rep(0,nc) interc3 <-list() b_g3 <- list() countD <- 0 countC <- 0 for (i in 1:nc) { #intercept for each gene i after LASSO and p-value interc3[[i]] <- as.numeric(fit[[i]]$coefficients[1]) #s is the gene name of coefficient(with any p-value) s <- names(fit[[i]]$coefficients) s <- gsub("[`]" "",s) #names(fit[[i]]$coefficients)<- s #coefficients of nonzero columns in gene expression after LASSO and p-value if(length(G3[[i]]) == nr) { idx <- match(name[i], s) b_g3[[i]] <- as.numeric(fit[[i]]$coefficients[idx]) G3[[i]] <- as.numeric(G3[[i]] * b_g3[[i]]) if(!is.na(match(name[i], HM[,1]))) 66 countD=countD + 1 if(!is.na(match(name[i], cnv[,1]))) countC <- countC + 1 } if(length(G3[[i]]) > nr) { idx <- match(colnames(G3[[i]]), s) b_g3[[i]] <- as.numeric(fit[[i]]$coefficients[idx]) countD <- countD+sum(!is.na(match(colnames(G3[[i]]), dm[,1]))) countC <- countC+sum(!is.na(match(colnames(G3[[i]]), cnv[,1]))) #Multiplication of nonzero columns by coefficients G3[[i]] <- as.numeric(G3[[i]] %*% b_g3[[i]]) } } for (i in 1:nc) { s <- 0 for (j in 1:nr) #difference between G3[[i]] and b_g3[[i]] if(length(G3[[i]]) >1) s <- s + (y[[i]][j] - (G3[[i]][j] + as.numeric(interc3[i])))^2 #RMSE for gene i in our method, its NA if the G3[[i]] is zero RMSE[[i]] <- sqrt(s) } RMSE mean(RMSE[which(!is.na(RMSE))]) sd(RMSE[which(!is.na(RMSE))]) countD countC #=================================================================#boxPlot of 4 RMSEs data <- cbind(RMSE_GRN,RMSE_DCGRN,RMSE,RMSE_n) boxplot(data, ylab= "RMSE", xlab = "Method", col = c("red""sienna""palevioletred1""royalblue2"), names = c("GRN""DCGRN""RDCGRN1""RDCGRN2")) mean(RMSE_n[which(!is.na(RMSE_n))]) sd(RMSE_n[which(!is.na(RMSE_n))]) 67 APPENDIX E SIMULATION STUDY 1 68 SIMULATION STUDY 1 # LASSO function if(!is.installed("lars")) install.packages("lars") library(lars) Lasso <- function(z,lambda) { nc <- ncol(z) B = matrix(0, nrow = nc, ncol = nc) for (i in 1 : nc) { X <- z X[,i] <- 1 y <- z[, i] res <- lars(X, y, type="lasso", normalize=TRUE, intercept=TRUE) B[i, ] <- coef.lars(res, s = lambda, mode="step") } return(B) } #================================================================= # DCGRN # x should have the same value used as lambda in the first Lasso to find B DCGRN <- function(Simulated_GRN, F, W, D, C, lambda) { colnames(Simulated_GRN) <- sprintf("G%d",seq(1:p)) colnames(D) <- sprintf("DM%d",seq(1:d)) colnames(C) <- sprintf("CNV%d",seq(1:c)) Adj_DCGRN <- matrix(0,nrow = p,ncol = p) Adj_DCGRN_DM <- matrix(0,nrow = d,ncol = p) Adj_DCGRN_CNV <- matrix(0,nrow = c,ncol = p) for (i in 1:p) { dm_idx <- which(F[,i]==1) cnv_idx <- which(W[,i]==1) matrix_integrated <- Simulated_GRN matrix_integrated[, i] <- 1 DM <- as.matrix(D[, dm_idx]) n_col <- dim(as.matrix(DM))[2] matrix_integrated <- cbind(matrix_integrated, scale(DM)) d0 <- seq(p+1, p + n_col) 69 colnames(matrix_integrated)[d0] <- colnames(D)[dm_idx] CNV <- as.matrix(C[, cnv_idx]) matrix_integrated <- cbind(matrix_integrated, scale(CNV)) colnames(matrix_integrated)[ncol(matrix_integrated)] <- colnames(C)[cnv_idx] res <- lars(matrix_integrated, Simulated_GRN[,i], type="lasso",normalize=TRUE, intercept=TRUE) coeff <- coef.lars(res, s = lambda, mode="step") Adj_DCGRN[i,] <- coeff[cbind(1:p)] Adj_DCGRN_DM[dm_idx,i] <- coeff[d0] Adj_DCGRN_CNV[cnv_idx,i] <- coeff[ncol(matrix_integrated)] } diag(Adj_DCGRN) <- 0 return(list(Adj_DCGRN, Adj_DCGRN_DM, Adj_DCGRN_CNV)) } #=================================================================# iGRN #x should have the same value used as lambda in the first Lasso to find B iGRN <- function(Simulated_GRN, F, W, D, C, lambda) { colnames(D) <- sprintf("DM%d",seq(1:d)) colnames(C) <- sprintf("CNV%d",seq(1:c)) Adj_iGRN <- matrix(0,nrow = p,ncol = p) Adj_iGRN_DM <- matrix(0,nrow = d,ncol = p) Adj_iGRN_CNV <- matrix(0,nrow = c,ncol = p) for (i in 1:p) { dm_idx <- which(F[,i]==1) cnv_idx <- which(W[,i]==1) matrix_integrated <- Simulated_GRN matrix_integrated[, i] <- 1 DM <- as.matrix(D[, dm_idx]) DM_g <- matrix(0,nrow = nrow(DM),ncol = ncol(DM)) for (k in 1:ncol(DM)) DM_g[,k] <- DM[,k] * Simulated_GRN[,i] matrix_integrated <- cbind(matrix_integrated, scale(DM_g)) d0 <- seq(p+1, p + ncol(DM)) colnames(matrix_integrated)[d0] <- colnames(D)[dm_idx] 70 CNV <- as.matrix(C[, cnv_idx]) CNV_g <- matrix(0,nrow = nrow(CNV),ncol = ncol(CNV)) for (k in 1:ncol(CNV)) CNV_g[,k] <- CNV[,k] * Simulated_GRN[,i] matrix_integrated <- cbind(matrix_integrated, scale(CNV_g)) colnames(matrix_integrated)[ncol(matrix_integrated)] <- colnames(C)[cnv_idx] res <- lars(matrix_integrated, Simulated_GRN[,i], type="lasso", normalize=TRUE, intercept=TRUE) coeff <- coef.lars(res, s = lambda, mode="step") Adj_iGRN[i,] <- coeff[cbind(1:p)] Adj_iGRN_DM[dm_idx, i] <- coeff[d0] Adj_iGRN_CNV[cnv_idx, i] <- coeff[ncol(matrix_integrated)] } diag(Adj_iGRN) <- 0 return(list(Adj_iGRN, Adj_iGRN_DM, Adj_iGRN_CNV)) } source("loadLibrary.R") source("SimulationStudy_CommonFunc.R") source("SimulationStudy1_func.R") #================================================================= # Simulation study for only gene expression data without CNV and DM #================================================================= K <- c(seq(1, 102, by = 4)) ROC_GRN <- data.frame() ROC_DCGRN <- data.frame() ROC_DCGRN_DM <- data.frame() ROC_DCGRN_CNV <- data.frame() ROC_iGRN <- data.frame() ROC_iGRN_DM <- data.frame() ROC_iGRN_CNV <- data.frame() n_iter <- 50# iteration n_sample <- 300# number of samples p <- 100# number of gene exprssion c <- 100# number of CNV 71 d <- 150# number of DM x <- 0.01# variance for error data for (i in 1 : n_iter) { Z <- Rnetwork_GRN(n_sample, p, d, c, x) Adj_Rnetwork <- as.matrix(Z[[1]]) Simulated_GRN <- as.matrix(Z[[2]]) F <- as.matrix(Z[[3]]) W <- as.matrix(Z[[4]]) D <- as.matrix(Z[[5]]) C <- as.matrix(Z[[6]]) for(j in 1 : length(K)) { #ROC for GRN Adj_GRN <- Lasso(Simulated_GRN, K[j]) ROC_GRN <- rbind(ROC_GRN, c(K[j], Confusion2ROC(Confusion(Adj_GRN, Adj_Rnetwork)))) print(paste(i, "th iteration: ", j)) print("GRN") print(Confusion(Adj_GRN, Adj_Rnetwork)) #DCGRN G0 <- DCGRN(Simulated_GRN, F, W, D, C, K[j]) Adj_DCGRN <- G0[[1]] Adj_DCGRN_DM <- G0[[2]] Adj_DCGRN_CNV <- G0[[3]] ROC_DCGRN <- rbind(ROC_DCGRN, c(K[j], Confusion2ROC(Confusion(Adj_DCGRN, Adj_Rnetwork) + Confusion2(Adj_DCGRN_DM, F) + Confusion2(Adj_DCGRN_CNV, W)))) #ROC for dm in DCGRN ROC_DCGRN_DM <- rbind(ROC_DCGRN_DM, c(K[j], Confusion2(Adj_DCGRN_DM, F) + Confusion2(Adj_DCGRN_CNV, W)))) Confusion2ROC(Confusion2(Adj_DCGRN_DM,F)))) #ROC for cnv in DCGRN ROC_DCGRN_CNV <- rbind(ROC_DCGRN_CNV, c(K[j], Confusion2ROC(Confusion2(Adj_DCGRN_CNV, W)))) print("DCGRN") print(Confusion(Adj_DCGRN, Adj_Rnetwork)) print(Confusion2(Adj_DCGRN_DM, F)) print(Confusion2(Adj_DCGRN_CNV, W)) #iGRN G <- iGRN(Simulated_GRN, F, W, D, C, K[j]) 72 Adj_iGRN <- G[[1]] Adj_iGRN_DM <- G[[2]] Adj_iGRN_CNV <- G[[3]] #ROC for gene in iGRN ROC_iGRN <- rbind(ROC_iGRN, c(K[j], Confusion2ROC(Confusion(Adj_iGRN, Adj_Rnetwork) + Confusion2(Adj_iGRN_DM, F) + Confusion2(Adj_iGRN_CNV, W)))) #ROC for dm in iGRN ROC_iGRN_DM <- rbind(ROC_iGRN_DM, c(K[j], Confusion2ROC(Confusion2(Adj_iGRN_DM, F)))) #ROC for cnv in iGRN ROC_iGRN_CNV <- rbind(ROC_iGRN_CNV, c(K[j], Confusion2ROC(Confusion2(Adj_iGRN_CNV, W)))) print("iGRN") print(Confusion(Adj_iGRN, Adj_Rnetwork)) print(Confusion2(Adj_iGRN_DM, F)) print(Confusion2(Adj_iGRN_CNV, W)) } } colnames(ROC_GRN) <- c("Lambda", "TPR", "FPR") colnames(ROC_DCGRN) <- c("Lambda", "TPR", "FPR") colnames(ROC_iGRN) <- c("Lambda", "TPR", "FPR") colnames(ROC_DCGRN_DM) <- c("Lambda", "TPR", "FPR") colnames(ROC_DCGRN_CNV) <- c("Lambda", "TPR", "FPR") colnames(ROC_iGRN_DM) <- c("Lambda", "TPR", "FPR") colnames(ROC_iGRN_CNV) <- c("Lambda", "TPR", "FPR") ROC_Data <- cbind(ROC_GRN, ROC_DCGRN, ROC_iGRN) write.csv(ROC_Data"result/result/SimulationStudy1_ROC_Data.csv") TPR_GRN <- vector(length = length(K)) FPR_GRN <- vector(length = length(K)) TPR_iGRN <- vector(length = length(K)) FPR_iGRN <- vector(length = length(K)) TPR_DCGRN <- vector(length = length(K)) FPR_DCGRN <- vector(length = length(K)) TPR_DCGRN_DM <- vector(length = length(K)) FPR_DCGRN_DM <- vector(length = length(K)) TPR_DCGRN_CNV <- vector(length = length(K)) FPR_DCGRN_CNV <- vector(length = length(K)) TPR_iGRN_DM <- vector(length = length(K)) FPR_iGRN_DM <- vector(length = length(K)) TPR_iGRN_CNV <- vector(length = length(K)) FPR_iGRN_CNV <- vector(length = length(K)) 73 for (i in 1:length(K)) { col_means <- colMeans(ROC_GRN[ROC_GRN[, 1] == K[i],]) TPR_GRN[i] <- col_means[2] FPR_GRN[i] <- col_means[3] col_means <- colMeans(ROC_DCGRN[ROC_DCGRN[, 1] == K[i],]) TPR_DCGRN[i] <- col_means[2] FPR_DCGRN[i] <- col_means[3] col_means <- colMeans(ROC_DCGRN_DM[ROC_DCGRN_DM[, 1] == K[i],]) TPR_DCGRN_DM[i] <- col_means[2] FPR_DCGRN_DM[i] <- col_means[3] col_means <- colMeans(ROC_DCGRN_CNV[ROC_DCGRN_CNV[, 1] == K[i],]) TPR_DCGRN_CNV[i] <- col_means[2] FPR_DCGRN_CNV[i] <- col_means[3] col_means <- colMeans(ROC_iGRN[ROC_iGRN[, 1] == K[i],]) TPR_iGRN[i] <- col_means[2] FPR_iGRN[i] <- col_means[3] col_means <- colMeans(ROC_iGRN_DM[ROC_iGRN_DM[, 1] == K[i],]) TPR_iGRN_DM[i] <- col_means[2] FPR_iGRN_DM[i] <- col_means[3] col_means <- colMeans(ROC_iGRN_CNV[ROC_iGRN_CNV[, 1] == K[i],]) TPR_iGRN_CNV[i] <- col_means[2] FPR_iGRN_CNV[i] <- col_means[3] } FPR_GRN <- c(FPR_GRN, 1) TPR_GRN <- c(TPR_GRN, 1) FPR_DCGRN <- c(FPR_DCGRN, 1) TPR_DCGRN <- c(TPR_DCGRN, 1) FPR_iGRN <- c(FPR_iGRN, 1) TPR_iGRN <- c(TPR_iGRN, 1) # ROC for Gene pdf("Figures/SimulationStudy1_ROC.pdf") plot(FPR_GRN, TPR_GRN, xlab='FPR', ylab='TPR', type="o", asp = 1, pch = 0, col = 'black', xlim = c(0, 1), ylim = c(0, 1), xaxs = "i", yaxs = "i", main = "ROC") par(new = TRUE) plot(FPR_DCGRN, TPR_DCGRN, type="o", asp = 1, xlab='', ylab='', axes=FALSE, pch = 1, 74 col = 'blue', xlim = c(0, 1), ylim = c(0, 1), xaxs = "i", yaxs = "i") par(new = TRUE) plot(FPR_iGRN, TPR_iGRN, type="o", asp = 1, xlab='', ylab='', axes=FALSE, pch = 19, col = 'red', xlim = c(0, 1), ylim = c(0, 1), xaxs = "i", yaxs = "i") legend(0.6, 0.4, c("GRN", "DCGRN", "iGRN"), lty = c(1, 1, 1), pch = c(0, 1, 19), col = c("black", "blue", "red")) abline(a = 0, b = 1, col = "black", lty = 3) AUC(TPR_GRN, FPR_GRN) AUC(TPR_DCGRN, FPR_DCGRN) AUC(TPR_iGRN, FPR_iGRN) # ROC for DM plot(K, TPR_DCGRN_DM, xlab='Step on LASSO', ylab='TPR', type="o", pch = 1, col = 'blue', xlim = c(1, 100), ylim = c(0, 1), xaxs = "i", yaxs = "i", main = "TPR on DNA methylatin") par(new = TRUE) plot(K, TPR_iGRN_DM, xlab='', ylab='', type="o", axes=FALSE, pch = 19, col = 'red', xlim = c(1, 104), ylim = c(0, 1)) abline(a = 0, b = 0.01, col = "black", lty = 3) legend(60, 0.4, c("DCGRN", "iGRN"), lty = c(1, 1), pch = c(1, 19), col = c("blue", "red")) # ROC for CNV plot(K, TPR_DCGRN_CNV, xlab='Step on LASSO', ylab='TPR', type="o", pch = 1, col = 'blue', xlim = c(1, 100), ylim = c(0, 1), xaxs = "i", yaxs = "i", main = "TPR on CNV") par(new = TRUE) plot(K, TPR_iGRN_CNV, xlab='', ylab='', type="o", axes=FALSE, pch = 19, col = 'red', xlim = c(1, 104), ylim = c(0, 1)) abline(a = 0, b = 0.01, col = "black", lty = 3) legend(60, 0.4, c("DCGRN", "iGRN"), lty = c(1, 1), pch = c(1, 19), col = c("blue", "red")) dev.off() save.image("result/SimulationStudy1_ROC.RData") 75 APPENDIX F SIMULATION STUDY 2 76 SIMULATION STUDY 2 # LASSO function if(!is.installed("lars")) install.packages("lars") library(lars) Lasso <- function(z) { nc <- ncol(z) B = matrix(0, nrow = nc, ncol = nc) for (i in 1 : nc) { X <- z X[,i] <- 1 y <- z[, i] res <- lars(X, y, type = "lasso", normalize = TRUE, intercept = TRUE) cvlas <- cv.lars(X, y, type = "lasso", plot.it = FALSE, normalize = TRUE, intercept = TRUE) frac <- cvlas$index[which.min(cvlas$cv)] las.coef <- predict.lars(res, type = "coefficients", mode = "fraction", s = frac) coeff <- las.coef$coefficients B[i, ] <- coeff } return(B) } #================================================================= #GRN GRN <- function(Simulated_GRN) { B <- Lasso(Simulated_GRN) colnames(B) <- sprintf("G%d",seq(1:p)) Adj_GRN <- matrix(0, nrow = p,ncol = p) colnames(Adj_GRN) <- sprintf("G%d", seq(1:p)) Prediction <- matrix(0, nrow = n_sample, ncol = p) for (i in 1:p) { gene_idx <- which(B[i,] != 0) matrix_integrated <- as.matrix(Simulated_GRN[, gene_idx]) colnames(matrix_integrated) <- colnames(B)[gene_idx] if(ncol(matrix_integrated) >0) { 77 fit <- lm(Simulated_GRN[, i]~., data = as.data.frame(matrix_integrated)) pvalue <- summary(fit)$coefficients[,4] #we delete the first and second element of p-value since they are for intercept # and y[[i]] pvalue <- pvalue[-1] #find those columns with p-value less than 0.05 count <- which(pvalue <0.05) matrix_result <- matrix_integrated[,as.numeric(count)] Adj_row <- match(colnames(B), colnames(matrix_result), nomatch = 0) Adj_GRN[i, Adj_row >0] <- as.numeric(fit$coefficients[names(count)]) #Adj_row #prediction amount for gene i intercept <- as.numeric(fit$coefficients[1])#intercept #coefficient for none zero columns coeff <- as.numeric(fit$coefficients[names(count)]) Prediction[, i] <- as.matrix(matrix_result) %*% coeff + intercept } } diag(Adj_GRN) <- 0 return(list(Adj_GRN, Prediction)) } #================================================================= # DCGRN DCGRN <- function(Simulated_GRN, F, W, D, C) { colnames(D) <- sprintf("DM%d",seq(1:d)) colnames(C) <- sprintf("CNV%d",seq(1:c)) Adj_DCGRN <- matrix(0,nrow = p,ncol = p) Adj_DCGRN_DM <- matrix(0,nrow = d,ncol = p) Adj_DCGRN_CNV <- matrix(0,nrow = c,ncol = p) colnames(Adj_DCGRN) <- sprintf("G%d",seq(1:p)) colnames(Simulated_GRN) <- sprintf("G%d",seq(1:p)) Prediction <- matrix(0, nrow = n_sample, ncol = p) for (i in 1:p) { dm_idx <- which(F[, i] >0) cnv_idx <- which(W[, i] >0) matrix_integrated <- Simulated_GRN matrix_integrated[, i] <- 1 78 DM <- as.matrix(D[, dm_idx]) n_col <- dim(as.matrix(DM))[2] matrix_integrated <- cbind(matrix_integrated, DM) d0 <- seq(p + 1, p + n_col) colnames(matrix_integrated)[d0] <- colnames(D)[dm_idx] CNV <- as.matrix(C[, cnv_idx]) matrix_integrated <- cbind(matrix_integrated,CNV) colnames(matrix_integrated)[ncol(matrix_integrated)] <- colnames(C)[cnv_idx] res <- lars(matrix_integrated, Simulated_GRN[, i], type="lasso", normalize = TRUE, intercept = TRUE) cvlas <- cv.lars(matrix_integrated, Simulated_GRN[, i], type="lasso", plot.it = FALSE, normalize = TRUE, intercept = TRUE) frac <- cvlas$index[which.min(cvlas$cv)] las.coef <- predict.lars(res, type = "coefficients", mode = "fraction", s = frac) coeff <- las.coef$coefficients idx_nzero <- which(coeff != 0) matrix_lasso <- as.matrix(matrix_integrated[, idx_nzero]) colnames(matrix_lasso) <- names(idx_nzero) if(ncol(matrix_lasso) >0) { # Compute p-value fit <- lm(Simulated_GRN[, i] ~ ., data = as.data.frame(matrix_lasso)) pvalue <- summary(fit)$coefficients[, 4] #we delete the first and second element of p-value since they are for intercept and y[[i]] pvalue <- pvalue[-1] #find those columns with p-value less than 0.05 count <- which(pvalue <0.05) matrix_result <- as.matrix(matrix_lasso[, names(count)]) colnames(matrix_result) <- names(count) Adj_row <- match(colnames(Simulated_GRN), colnames(matrix_result), nomatch = 0) Adj_DCGRN[i, Adj_row >0] <- as.numeric(fit$coefficients[names (count)[Adj_row[Adj_row != 0]]]) Adj_DM_col <- match(colnames(D),colnames(matrix_result), nomatch = 0) Adj_DCGRN_DM[which(Adj_DM_col != 0) ,i] <- as.numeric(fit$coefficients[names(count)[Adj_DM_col[Adj_DM_col != 0]]]) #Adj_DM_col 79 Adj_CNV_col <- match(colnames(C),colnames(matrix_result), nomatch = 0) Adj_DCGRN_CNV[which(Adj_CNV_col != 0),i] <- as.numeric(fit$coefficients [names(count)[Adj_CNV_col[Adj_CNV_col != 0]]]) #Adj_CNV_col #prediction amount for gene i intercept <- as.numeric(fit$coefficients[1])#intercept #coefficient for none zero columns coeff <- as.numeric(fit$coefficients[names(count)]) Prediction[, i] <- as.matrix(matrix_result) %*% coeff + intercept } } diag(Adj_DCGRN) <- 0 return(list(Adj_DCGRN, Adj_DCGRN_DM, Adj_DCGRN_CNV, Prediction)) } #================================================================= # iGRN iGRN <- function(Simulated_GRN, F, W, D, C) { colnames(D) <- sprintf("DM%d", seq(1 : d)) colnames(C) <- sprintf("CNV%d", seq(1 : c)) Adj_iGRN <- matrix(0, nrow = p, ncol = p) Adj_iGRN_DM <- matrix(0, nrow = d, ncol = p) Adj_iGRN_CNV <- matrix(0, nrow = c, ncol = p) B <- Lasso(Simulated_GRN) colnames(B) <- sprintf("G%d", seq(1:p)) colnames(Adj_iGRN) <- sprintf("G%d", seq(1:p)) Prediction <- matrix(0, nrow = n_sample, ncol = p) for (i in 1:p) { dm_idx <- which(F[,i] != 0) cnv_idx <- which(W[,i] != 0) gene_idx <- which(B[i,] != 0) matrix_integrated <- as.matrix(Simulated_GRN[, gene_idx]) colnames(matrix_integrated) <- colnames(B)[gene_idx] ncol_gene <- ncol(matrix_integrated) DM <- as.matrix(D[, dm_idx]) ncol_dm <- dim(as.matrix(DM))[2] matrix_integrated <- cbind(matrix_integrated, DM) d0 <- seq(ncol_gene + 1, ncol_gene + ncol_dm) colnames(matrix_integrated)[d0] <- colnames(D)[dm_idx] 80 CNV <- as.matrix(C[, cnv_idx]) matrix_integrated <- cbind(matrix_integrated,CNV) colnames(matrix_integrated)[ncol(matrix_integrated)] <- colnames(C)[cnv_idx] fit <- lm(Simulated_GRN[,i]~., data = as.data.frame(matrix_integrated)) pvalue <- summary(fit)$coefficients[, 4] #we delete the first and second element of p-value since they are for intercept and y[[i]] pvalue <- pvalue[-1] #find those columns with p-value less than 0.05 count <- which(pvalue <0.05) matrix_result <- matrix_integrated[, as.numeric(count)] Adj_row <- match(colnames(B), colnames(matrix_result), nomatch = 0) Adj_iGRN[i, Adj_row >0] <- as.numeric(fit$coefficients [names(count)[Adj_row[Adj_row != 0]]]) #Adj_row Adj_DM_col <- match(colnames(D), colnames(matrix_result), nomatch = 0) Adj_iGRN_DM[which(Adj_DM_col != 0), i] <- as.numeric(fit$coefficients [names(count)[Adj_DM_col[Adj_DM_col != 0]]]) #Adj_DM_col Adj_CNV_col <- match(colnames(C),colnames(matrix_result), nomatch = 0) Adj_iGRN_CNV[which(Adj_CNV_col != 0), i] <- as.numeric(fit$coefficients [names(count)[Adj_CNV_col[Adj_CNV_col != 0]]]) #Adj_CNV_col #prediction amount for gene i intercept <- as.numeric(fit$coefficients[1])#intercept #coefficient for none zero columns coeff <- as.numeric(fit$coefficients[names(count)]) Prediction[, i] <- as.matrix(matrix_result) %*% coeff + intercept #rep(intercept, nrow(Simulated_GRN)) } diag(Adj_iGRN) <- 0 return(list(Adj_iGRN, Adj_iGRN_DM, Adj_iGRN_CNV, Prediction)) } #setwd("C:/Users/neda/Dropbox/Neda/Publications/IJDMB2016") source("loadLibrary.R") source("SimulationStudy_CommonFunc.R") source("SimulationStudy2_func.R") #================================================================= 81 # Simulation study for only gene expression data without CNV and DM #================================================================= ROC_GRN <- data.frame() ROC_DCGRN <- data.frame() ROC_DCGRN_DM <- data.frame() ROC_DCGRN_CNV <- data.frame() ROC_iGRN <- data.frame() ROC_iGRN_DM <- data.frame() ROC_iGRN_CNV <- data.frame() conMat_GRN <- data.frame() conMat_DCGRN <- data.frame() conMat_iGRN <- data.frame() RMSE_GRN <- data.frame() RMSE_DCGRN <- data.frame() RMSE_iGRN <- data.frame() n_iter <- 50# iteration n_sample <- 300# number of samples p <- 100# number of gene exprssion c <- 100# number of CNV d <- 150# number of DM x <- 0.01# variance for error data for (i in 1:n_iter) { Z <- Rnetwork_GRN(n_sample, p, d, c, x) Adj_Rnetwork <- as.matrix(Z[[1]]) Simulated_GRN <- as.matrix(Z[[2]]) F <- as.matrix(Z[[3]]) W <- as.matrix(Z[[4]]) D <- as.matrix(Z[[5]]) C <- as.matrix(Z[[6]]) #ROC for GRN grn <- GRN(Simulated_GRN) Adj_GRN <- grn[[1]] Pred_GRN <- grn[[2]] # sum((Pred_GRN - Simulated_GRN)^2) ROC_GRN <- rbind(ROC_GRN, Confusion2ROC(Confusion(Adj_GRN, Adj_Rnetwork) + c(0, 0, d, 0) + c(0, 0, c, 0))) 82 print(paste( "th iteration: ",i)) print("GRN") print(Confusion(Adj_GRN, Adj_Rnetwork)) #DCGRN dcgrn <- DCGRN(Simulated_GRN, F, W, D, C) Adj_DCGRN <- dcgrn[[1]] Adj_DCGRN_DM <- dcgrn[[2]] Adj_DCGRN_CNV <- dcgrn[[3]] ROC_DCGRN <- rbind(ROC_DCGRN, Confusion2ROC(Confusion(Adj_DCGRN, Adj_Rnetwork) + Confusion2(Adj_DCGRN_DM, F) + Confusion2(Adj_DCGRN_CNV, W))) print("DCGRN") print(Confusion(Adj_DCGRN, Adj_Rnetwork)) print(Confusion2(Adj_DCGRN_DM, F)) print(Confusion2(Adj_DCGRN_CNV, W)) #ROC for dm in DCGRN ROC_DCGRN_DM <- rbind(ROC_DCGRN_DM, Confusion2ROC(Confusion2(Adj_DCGRN_DM,F))) #ROC for cnv in DCGRN ROC_DCGRN_CNV <- rbind(ROC_DCGRN_CNV, Confusion2ROC(Confusion2(Adj_DCGRN_CNV,W))) #iGRN igrn <- iGRN(Simulated_GRN, F, W, D, C) Adj_iGRN <- igrn[[1]] Adj_iGRN_DM <- igrn[[2]] Adj_iGRN_CNV <- igrn[[3]] #ROC for gene in iGRN ROC_iGRN <- rbind(ROC_iGRN, Confusion2ROC(Confusion(Adj_iGRN, Adj_Rnetwork) + Confusion2(Adj_iGRN_DM, F) + Confusion2(Adj_iGRN_CNV, W))) print("iGRN") print(Confusion(Adj_iGRN, Adj_Rnetwork)) print(Confusion2(Adj_iGRN_DM, F)) print(Confusion2(Adj_iGRN_CNV, W)) #ROC for dm in iGRN ROC_iGRN_DM <- rbind(ROC_iGRN_DM, Confusion2ROC(Confusion2(Adj_iGRN_DM,F))) 83 #ROC for cnv in iGRN ROC_iGRN_CNV <- rbind(ROC_iGRN_CNV, Confusion2ROC(Confusion2(Adj_iGRN_CNV, W))) } colnames(ROC_GRN) <- c( "TPR", "FPR""PPV") colnames(ROC_DCGRN) <- c( "TPR", "FPR", "PPV") colnames(ROC_iGRN) <- c( "TPR", "FPR", "PPV") colnames(ROC_DCGRN_DM) <- c("TPR", "FPR", "PPV") colnames(ROC_DCGRN_CNV) <- c("TPR", "FPR", "PPV") colnames(ROC_iGRN_DM) <- c( "TPR", "FPR", "PPV") colnames(ROC_iGRN_CNV) <- c("TPR", "FPR", "PPV") ROC_Data <- cbind(ROC_GRN, ROC_DCGRN, ROC_iGRN) write.csv(ROC_Data"result/SimulationStudy2_ROC_Data.csv") #================================================================= pdf("Figures/SimulationStudy2_pvalue.pdf") Data_TPR <- cbind(ROC_GRN[, 1], ROC_DCGRN[, 1], ROC_iGRN[, 1]) boxplot(Data_TPR , ylab= "Overall Sensitivity", xlab = "Methods", border = c("black""blue""red"), names = c("GRN""DCGRN""iGRN")) Data_FPR <- cbind(ROC_GRN[, 2], ROC_DCGRN[, 2], ROC_iGRN[, 2]) boxplot(Data_FPR , ylab= "Overall False Positive Rate", xlab = "Methods", border = c("black""blue""red"), names = c("GRN""DCGRN""iGRN")) Data_PPV <- cbind(ROC_GRN[, 3], ROC_DCGRN[, 3], ROC_iGRN[, 3]) boxplot(Data_FPR , ylab= "Overall Precision", xlab = "Methods", border = c("black""blue""red"), names = c("GRN""DCGRN""iGRN")) Data_TPR_DM <- cbind(ROC_DCGRN_DM[, 1], ROC_iGRN_DM[, 1]) boxplot(Data_TPR_DM , ylab= "Sensitivity on DM", xlab = "Methods", border = c("black""blue""red"), names = c("DCGRN""iGRN")) Data_FPR_DM <- cbind(ROC_DCGRN_DM[, 3], ROC_iGRN_DM[, 3]) boxplot(Data_FPR_DM , ylab= "False Positive Rate on DM", xlab = "Methods", border = c("black""blue""red"), names = c("DCGRN""iGRN")) Data_TPR_CNV <- cbind(ROC_DCGRN_CNV[, 1], ROC_iGRN_CNV[, 1]) boxplot(Data_TPR_DM , ylab= "Sensitivity on CNV", xlab = "Methods", border = c("black""blue""red"), names = c("DCGRN""iGRN")) Data_FPR_CNV <- cbind(ROC_DCGRN_CNV[, 3], ROC_iGRN_CNV[, 3]) boxplot(Data_FPR_DM , ylab= "False Positive Rate on CNV", xlab = "Methods", border = c("black""blue""red"), names = c("DCGRN""iGRN")) dev.off() 84 save.image("result/SimulationStudy2_pvalue.RData") 85 APPENDIX G SIMULATION STUDY 3 86 SIMULATION STUDY 3 #Generate random graph for gene expression according to the Erdos-Renyi model. #the arguments are, numbr of samples, number of genes, and the Sparce rate. Rnetwork_GRN <- function(n, p, d, c, x) { graph <- erdos.renyi.game(p, 0, directed = FALSE, loops = FALSE) degree.distribution(graph) # Ground truth of gene expression Adj_Rnetwork <- as_adjacency_matrix(graph) D <- matrix(runif(n * d, min = 0, max = 1),n,d) #C has random number 1,2,3, or 4 with the probabilities C <- matrix(sample(0:4, n*c, prob = c(0.1, 0.2, 0.4, 0.2, 0.1), replace = TRUE),n,c) # Random diagonal matrix for W W <- diag(nrow = c, ncol = p) #Random matrix for F with at most one none zero element per row F <- rbind(diag(p),diag(nrow = d - p, ncol = p)) CW <- (C) %*% W DF <- D %*% F Z <- scale(matrix(rnorm(n*p),n)) return(list(Adj_Rnetwork, Z, F, W, D, C)) } #================================================================= # LASSO function if(!is.installed("lars")) install.packages("lars") library(lars) Lasso <- function(z) { nc <- ncol(z) B = matrix(0, nrow = nc, ncol = nc) for (i in 1 : nc) { X <- z X[,i] <- 1 y <- z[, i] res <- lars(X, y, type="lasso") cvlas <- cv.lars(X, y , type="lasso", plot.it = FALSE) frac <- cvlas$index[which.min(cvlas$cv)] las.coef <- predict.lars(res, type="coefficients", mode="fraction", s=frac) coeff <- las.coef$coefficients 87 B[i, ] <- coeff } return(B) } #================================================================= #function to find confusion matrix Confusion <- function(Adj,Adj_Rnetwork) { TP_diag <- sum(diag(Adj) != 0& diag(Adj_Rnetwork) != 0) FP_diag <- sum(diag(Adj) != 0& diag(Adj_Rnetwork) == 0) FN_diag <- sum(diag(Adj) == 0& diag(Adj_Rnetwork) != 0) TN_diag <- sum(diag(Adj) == 0& diag(Adj_Rnetwork) == 0) TP <- sum(Adj != 0& Adj_Rnetwork != 0) - TP_diag FP <- sum(Adj != 0& Adj_Rnetwork == 0) - FP_diag FN <- sum(Adj == 0& Adj_Rnetwork != 0) - FN_diag TN <- sum(Adj == 0& Adj_Rnetwork == 0) - TN_diag return(c(TP, FP, FN, TN)) } #================================================================= #function to find confusion matrix for DM and CNV Confusion2 <- function(Adj, Adj_Rnetwork) { TP <- sum(Adj[Adj_Rnetwork >0] != 0) FP <- 0 FN <- sum(Adj[Adj_Rnetwork >0] == 0) TN <- 0 return(c(TP, FP, FN, TN)) } Confusion2ROC <- function(ROC) { TPR <- ROC[1] / (ROC[1] + ROC[3]) FPR <- ROC[2] / (ROC[4] + ROC[2]) FDR <- ROC[2] / (ROC[4] + ROC[2]) roc_data <- c(TPR, FPR, FDR) return (roc_data) } #================================================================= #GRN GRN <- function(Simulated_GRN) { 88 B <- Lasso(Simulated_GRN) colnames(B) <- sprintf("G%d",seq(1:p)) Adj_GRN <- matrix(0,nrow = p,ncol = p) colnames(Adj_GRN) <- sprintf("G%d",seq(1:p)) for (i in 1:p) { gene_idx <- which(B[i,] != 0) matrix_integrated <- as.matrix(Simulated_GRN[,gene_idx]) colnames(matrix_integrated) <- colnames(B)[gene_idx] if(length(matrix_integrated) >0) { fit <- lm(Simulated_GRN[,i]~., data = as.data.frame(matrix_integrated)) pvalue <- summary(fit)$coefficients[,4] pvalue <- pvalue[-1] #find those columns with p-value less than 0.05 count <- which(pvalue <0.05) matrix_result <- matrix_integrated[,as.numeric(count)] Ajd_row <- match(colnames(B),colnames(matrix_result)) Ajd_row[is.na(Ajd_row)] <- 0 Adj_GRN[i,] <- Ajd_row } } diag(Adj_GRN) <- 0 return(Adj_GRN) } #=============================================================== # DCGRN DCGRN <- function(Simulated_GRN, F, W, D, C) { colnames(D) <- sprintf("DM%d",seq(1:d)) colnames(C) <- sprintf("CNV%d",seq(1:c)) Adj_DCGRN <- matrix(0,nrow = p,ncol = p) Adj_DCGRN_DM <- matrix(0,nrow = d,ncol = p) Adj_DCGRN_CNV <- matrix(0,nrow = c,ncol = p) colnames(Adj_DCGRN) <- sprintf("G%d",seq(1:p)) colnames(Simulated_GRN) <- sprintf("G%d",seq(1:p)) for (i in 1:p) { dm_idx <- which(F[,i]==1) 89 cnv_idx <- which(W[,i]==1) matrix_integrated <- Simulated_GRN matrix_integrated[, i] <- 1 DM <- as.matrix(D[, dm_idx]) n_col <- dim(as.matrix(DM))[2] matrix_integrated <- cbind(matrix_integrated, DM) d0 <- seq(p+1, p + n_col) colnames(matrix_integrated)[d0] <- colnames(D)[dm_idx] CNV <- as.matrix(C[, cnv_idx]) matrix_integrated <- cbind(matrix_integrated,CNV) colnames(matrix_integrated)[ncol(matrix_integrated)] <- colnames(C)[cnv_idx] res <- lars(matrix_integrated, Simulated_GRN[,i], type="lasso") cvlas <- cv.lars(matrix_integrated, Simulated_GRN[,i], type="lasso", plot.it = FALSE) frac <- cvlas$index[which.min(cvlas$cv)] las.coef <- predict.lars(res, type="coefficients", mode="fraction", s=frac) coeff <- las.coef$coefficients idx_nzero <- which(coeff != 0) matrix_lasso <- as.matrix(matrix_integrated[,idx_nzero]) colnames(matrix_lasso) <- names(idx_nzero) if(length(matrix_lasso) >0) { fit <- lm(Simulated_GRN[,i]~., data = as.data.frame(matrix_lasso)) pvalue <- summary(fit)$coefficients[,4] pvalue <- pvalue[-1] #find those columns with p-value less than 0.05 count <- which(pvalue <0.05) matrix_result <- as.matrix(matrix_lasso[,names(count)]) colnames(matrix_result) <- names(count) Ajd_row <- match(colnames(Simulated_GRN),colnames(matrix_result)) Ajd_row[is.na(Ajd_row)] <- 0 Adj_DCGRN[i,] <- Ajd_row Ajd_DM_col <- match(colnames(D),colnames(matrix_result)) Ajd_DM_col[is.na(Ajd_DM_col)] <- 0 Adj_DCGRN_DM[,i] <- Ajd_DM_col Ajd_CNV_col <- match(colnames(C),colnames(matrix_result)) 90 Ajd_CNV_col[is.na(Ajd_CNV_col)] <- 0 Adj_DCGRN_CNV[,i] <- Ajd_CNV_col } } diag(Adj_DCGRN) <- 0 return(list(Adj_DCGRN,Adj_DCGRN_DM,Adj_DCGRN_CNV)) } #================================================================ # iGRN iGRN <- function(Simulated_GRN, F, W, D, C) { colnames(D) <- sprintf("DM%d",seq(1:d)) colnames(C) <- sprintf("CNV%d",seq(1:c)) Adj_iGRN <- matrix(0,nrow = p,ncol = p) Adj_iGRN_DM <- matrix(0,nrow = d,ncol = p) Adj_iGRN_CNV <- matrix(0,nrow = c,ncol = p) B <- Lasso(Simulated_GRN) colnames(B) <- sprintf("G%d",seq(1:p)) colnames(Adj_iGRN) <- sprintf("G%d",seq(1:p)) for (i in 1:p) { dm_idx <- which(F[,i] == 1) cnv_idx <- which(W[,i] == 1) gene_idx <- which(B[i,] != 0) matrix_integrated <- as.matrix(Simulated_GRN[,gene_idx]) colnames(matrix_integrated) <- colnames(B)[gene_idx] ncol_gene <- ncol(matrix_integrated) DM <- as.matrix(D[, dm_idx]) ncol_dm <- dim(as.matrix(DM))[2] matrix_integrated <- cbind(matrix_integrated, DM) d0 <- seq(ncol_gene+1, ncol_gene + ncol_dm) colnames(matrix_integrated)[d0] <- colnames(D)[dm_idx] CNV <- as.matrix(C[, cnv_idx]) matrix_integrated <- cbind(matrix_integrated,CNV) colnames(matrix_integrated)[ncol(matrix_integrated)] <- colnames(C)[cnv_idx] fit <- lm(Simulated_GRN[,i]~., data = as.data.frame(matrix_integrated)) pvalue <- summary(fit)$coefficients[,4] #we delete the first and second element of p-value since they are for intercept and 91 y[[i]] pvalue <- pvalue[-1] #find those columns with p-value less than 0.05 count <- which(pvalue <0.05) matrix_result <- matrix_integrated[,as.numeric(count)] Ajd_row <- match(colnames(B),colnames(matrix_result)) Ajd_row[is.na(Ajd_row)] <- 0 Adj_iGRN[i,] <- Ajd_row Ajd_DM_col <- match(colnames(D),colnames(matrix_result)) Ajd_DM_col[is.na(Ajd_DM_col)] <- 0 Adj_iGRN_DM[,i] <- Ajd_DM_col Ajd_CNV_col <- match(colnames(C),colnames(matrix_result)) Ajd_CNV_col[is.na(Ajd_CNV_col)] <- 0 Adj_iGRN_CNV[,i] <- Ajd_CNV_col } diag(Adj_iGRN) <- 0 return(list(Adj_iGRN,Adj_iGRN_DM,Adj_iGRN_CNV)) } source("loadLibrary.R") source("SimulationStudy_CommonFunc.R") source("SimulationStudy3_func.R") #================================================================= # Simulation study for only gene expression data without CNV and DM #================================================================= ROC_GRN <- data.frame() ROC_DCGRN <- data.frame() ROC_DCGRN_DM <- data.frame() ROC_DCGRN_CNV <- data.frame() ROC_iGRN <- data.frame() ROC_iGRN_DM <- data.frame() ROC_iGRN_CNV <- data.frame() n_iter <- 50# iteration n_sample <- 300# number of samples p <- 100# number of gene exprssion c <- 100# number of CNV d <- 150# number of DM 92 for (i in 1 : n_iter) { Z <- GenerateEmptyGRN(n_sample, p, d, c) Adj_Rnetwork <- as.matrix(Z[[1]]) Simulated_GRN <- as.matrix(Z[[2]]) F <- as.matrix(Z[[3]]) W <- as.matrix(Z[[4]]) D <- as.matrix(Z[[5]]) C <- as.matrix(Z[[6]]) #ROC for GRN Adj_GRN <- GRN(Simulated_GRN) ROC_GRN <- rbind(ROC_GRN, Confusion2ROC(Confusion(Adj_GRN, Adj_Rnetwork) + c(0, 0, d, 0) + c(0, 0, c, 0))) print(paste( "th iteration: ",i)) #DCGRN dcgrn <- DCGRN(Simulated_GRN, F, W, D, C) Adj_DCGRN <- dcgrn[[1]] Adj_DCGRN_DM <- dcgrn[[2]] Adj_DCGRN_CNV <- dcgrn[[3]] ROC_DCGRN <- rbind(ROC_DCGRN, Confusion2ROC(Confusion(Adj_DCGRN, Adj_Rnetwork) + Confusion2(Adj_DCGRN_DM, F) + Confusion2(Adj_DCGRN_CNV, W))) #ROC for dm in DCGRN ROC_DCGRN_DM <- rbind(ROC_DCGRN_DM,Confusion2(Adj_DCGRN_DM, F)) #ROC for cnv in DCGRN ROC_DCGRN_CNV <- rbind(ROC_DCGRN_CNV,Confusion2(Adj_DCGRN_CNV, W)) #iGRN igrn <- iGRN(Simulated_GRN, F, W, D, C) Adj_iGRN <- igrn[[1]] Adj_iGRN_DM <- igrn[[2]] Adj_iGRN_CNV <- igrn[[3]] #ROC for gene in iGRN ROC_iGRN <- rbind(ROC_iGRN, Confusion2ROC(Confusion(Adj_iGRN, Adj_Rnetwork) + Confusion2(Adj_iGRN_DM, F) + Confusion2(Adj_iGRN_CNV, W))) #ROC for dm in iGRN 93 ROC_iGRN_DM <- rbind(ROC_iGRN_DM,Confusion2(Adj_iGRN_DM, F)) #ROC for cnv in iGRN ROC_iGRN_CNV <- rbind(ROC_iGRN_CNV, Confusion2(Adj_iGRN_CNV, W)) } colnames(ROC_GRN) <- c("TPR", "FPR", "FDR") colnames(ROC_DCGRN) <- c("TPR", "FPR", "FDR") colnames(ROC_iGRN) <- c("TPR", "FPR", "FDR") colnames(ROC_DCGRN_DM) <- c("TPR", "FPR") colnames(ROC_DCGRN_CNV) <- c("TPR", "FPR") colnames(ROC_iGRN_DM) <- c( "TPR", "FPR", "FDR") colnames(ROC_iGRN_CNV) <- c("TPR", "FPR") ROC_Data <- cbind(ROC_GRN, ROC_DCGRN, ROC_iGRN) #================================================================= #False discovery rate FDR_GRN <- ROC_GRN[,3] FDR_DCGRN <- ROC_DCGRN[,3] FDR_iGRN <- ROC_iGRN[,3] #boxPlot of 3 False discovery rate data <- cbind(FDR_GRN, FDR_DCGRN, FDR_iGRN) pdf("Figures/FDR.pdf") boxplot(data, ylab= "FDR", xlab = "", border = c("black""blue""red"), names = c("GRN""DCGRN""iGRN"), main = "FDR") dev.off() save.image("result/SimulationStudy3_FDR.RData") 94 VITA Neda graduated with a bachelor’s degree in Applied Mathematics from Razi University in Kermanshah, Iran. Her degree was awarded in July 2007. From there, she received a graduate degree in Applied Mathematics in January 2010 from the University of Tehran in Iran. Most recently, she studied at Texas A&M University-Commerce as a graduate student in Computational Science. She was awarded her degree in May 2017. Her research interests include bioinformatics, machine learning, data mining, and big data. During her time as a graduate student, she gained four years of experience teaching mathematics and statistics at the university level. Though she has not been formally involved in any professional software development projects within the software industry, she has programming experience in C++, Python, and R as a result of working on numerous projects of her own. Address: Computer Science Department Texas A&M University PO Box 3011 Commerce, TX 75428 Email: nzarayeneh@leomail.tamuc.edu |