CONTENTS

What the pipeline does?

What you need?

Getting Started

Function Sourcecode


What the pipeline does?

The pipeline will help you find and visualize significant Methylation Susceptible Loci (MSL) between the groups using 3 SIMPLE STEPS:

What do you need?

You will need to install the package msap and prepare the data format that it needs. That is all you need for this pipeline.

Getting Started

Lets run the three steps for the pipeline. Before starting with the following command, copy and run the function sourcecode provided in the last section.

STEP 1

Use the msap package and the input data to find the MSL.

library(msap)
msap.out=msap("your_data.csv",name = "any_name")

STEP 2

pass the output to the custom function find.sig.loci( ) (more info about the function in subsequent sections)

my.out=find.sig.loci(msap.out)
## analysing.... 
## Total loci: 153 
## Total samples: 41 
## Total number of groups: 2
## replacing types with numbers suitable for fisher test and heatmap..
## Performing multiple fisher tests
## Calculating FDR using benjamini and hochberg
## Warning in p.adjust(pval, method = "hochberg"): NAs introduced by coercion
## filtering significant loci
## 25  loci are found significant.

Note: The warning during coercion to NA is produced when the input data has only one variable and canโ€™t be used for fisher test. The cistome function returns a na for such cases.

and now the output my.out is a list of five elements:

names(my.out)
## [1] "heatmap.matrix"   "Significant.Loci" "All.loci.results"
## [4] "complete.data"    "group.cont"

where, *heatmap.matrix is a output matrix of numbers significant loci suitable for creating the heatmap.

*significant.loci is character vector containing the significant loci selected.

*All.loci.results is the overall result showing the p-values and adjusted p-values for all the loci(not filtered). The pipeline uses a filter of FDR<0.05. While using this data you can use your own filter.

*complete.data contains the matrix of all loci and samples together.

*group.count can be used to know the sample count for each group for using it in heatmap cluster coloring.

#significant loci
my.out$Significant.Loci
##  [1] "X104.372303"  "X166.2043885" "X177.2979221" "X178.4825882"
##  [5] "X189.2535821" "X247.2600556" "X264.6262937" "X306.2137821"
##  [9] "X307.6917062" "X312.9926866" "X329.91875"   "X347.8885"   
## [13] "X365.7359649" "X373.8959322" "X375.1234703" "X376.7047115"
## [17] "X380.05265"   "X384.7438168" "X394.4164319" "X439.0106726"
## [21] "X447.656391"  "X450.4504651" "X456.8275"    "X526.8863354"
## [25] "X547.6297468"
## attr(,"na.action")
## [1] 20
## attr(,"class")
## [1] "omit"
head(my.out$All.loci.results)
##           loci                 pval p.adjust.pval..method....hochberg..
## 1 X98.61148718   0.0697636307412465                        1.0000000000
## 2  X104.372303 1.09988172605173e-06                        0.0001649823
## 3 X104.6896429 0.000479415113561449                        0.0589680590
## 4 X106.0720225   0.0612799103669727                        1.0000000000
## 5 X108.8681673  0.00862025252269153                        0.8620252523
## 6 X110.0554777   0.0387238934603335                        1.0000000000
#number of samples
my.out$group.cont
##   grpNAME grpCOUNT
## 1      G1       21
## 2      G2       20

STEP3

Now get the heatmap matrix data and plot the heatmap. I have used the heatmap.2( ) function from gplots package. You can use any package or settings.

library(gplots)
## 
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
## 
##     lowess
heat.data=my.out$heatmap.matrix

#for getting the group colors for the samples marked as G1,G2.. size of which obtained from group,cont dataframe
row.cluster.colors= c(rep("green",21),rep("red",20))
heatmap.2(as.matrix(heat.data),
                        Rowv = F,
                        trace = "none",
                        col = c('#eff3ff','#bdd7e7','#6baed6','#2171b5') ,
                        RowSideColors = ,
                        dendrogram="column")

NOTE: The four shades of blue corresponds to the four levels of loci TypeI-IV(Type I, unmethylated; Type II, inner cytosine methylation; Type III, hemi-methylation of the outer cytosine; Type IV, fully methylation.) in the order 1-4.

and its done!

Function Sourcecode

Copy the following function code and run it in your R environment. This will create two functions, find.sig.loci( ) and tryCatchFisher( ).

###########################################################################
########### STATISTICAL ANALYSIS FOR FINDING SIGNIFICANT LOCI BW GROUPS####
############## @author: Lakshay Anand #####################################


find.sig.loci<- function(msap.input){
  
  out=list()
  
  ptrn=msap.input[[2]]
  grpNAME=c()
  grpCOUNT=c()
  grpALL=data.frame()
  for(d in 1:length(ptrn)){
    
    Grp=ptrn[[d]]
    rownames(Grp)= c(paste("G",d,"-",c(1:nrow(Grp)),sep = ""))
    grpNAME[d]=paste("G",d,sep = "")
    grpCOUNT[d]=nrow(Grp)
    grpALL=rbind.data.frame(grpALL,Grp)
    
  }
  
  
  cat("analysing.... \n")
  cat("Total loci:",ncol(grpALL),"\n")
  cat("Total samples:",nrow(grpALL),"\n")
  cat("Total number of groups:",length(ptrn),"\n")
  
  
  ############## replace types with numbers
  message("replacing types with numbers suitable for fisher test and heatmap.. \n")
  for(i in 1:ncol(grpALL)){
    
    
    for(j in 1:nrow(grpALL)){
      
      if(grpALL[j,i]=="u"){
        
        grpALL[j,i]=1
      }
      else if(grpALL[j,i]=="h"){
        
        grpALL[j,i]=2
      } else if(grpALL[j,i]=="i"){
        
        grpALL[j,i]=3
      } else if(grpALL[j,i]=="f"){
        
        grpALL[j,i]=4
      }
      
      
      
    }
    
    
    
  }
  
  for(i in 1:ncol(grpALL)){
    
    grpALL[,i]=as.integer(grpALL[,i])
    
  }
  
  
  
  #####################################################
  #creating groups vector for fisher test
  groups.vector=c()
  
  for(i in 1:length(rownames(grpALL))){
    
    groups.vector[i]=strsplit(rownames(grpALL)[i],"-")[[1]][1]
  }
  
  
  ############### multiple fisher test ##########
  message("Performing multiple fisher tests \n")
  pval=c()
  loci=c()
  for(i in 1:ncol(grpALL)){
    
    loci[i]=colnames(grpALL)[i]
    
    pval[i]=tryCatchFisher(table(groups.vector,grpALL[,i]))
    
    
  }
  
  
  message("Calculating FDR using benjamini and hochberg \n")
  #result of the fisher test wit p-vale adjustment using benjamini
  
  grpALL_LOCI=data.frame(loci,pval,p.adjust(pval,method = "BH"))
  
  message("filtering significant loci \n")
  ### filter the loci with FDR <0.05
  grp_sigloci=grpALL_LOCI[grpALL_LOCI[,3]<0.05,]
  
  
  
  ###collect the significant loci
  sig.loci=as.character(grp_sigloci[,1])
  sig.loci=na.omit(sig.loci)
  cat(length(sig.loci)," loci are found significant. \n")
  ##### filter the loci from the matrix
  
  grpALL_with_sig_loci=grpALL[,sig.loci]
  
  
  out[[1]]=as.matrix(grpALL_with_sig_loci)
  out[[2]]=sig.loci
  out[[3]]=grpALL_LOCI
  out[[4]]=grpALL
  out[[5]]=data.frame(grpNAME,grpCOUNT)
  names(out)=c("heatmap.matrix","Significant.Loci","All.loci.results","complete.data","group.cont")
  
  return(out)
  }




#### the fisher test in try catch
tryCatchFisher<-function(v){
  tryCatch(fisher.test(v)[[1]],
           error = function(e) {"na"})
}