8. Multivariate d-profiling

In the previous sections, you have learnt how to segment and extract features for the cells. Now it is time to use these features to study the effect of drug, siRNA and other treatments on the cells. In particular, we will be interested in those features that has changed the most upon treatment with drug, siRNA, etc. These features that are most different between the control and the treatment will provide greatest discriminative power in separating the control and treatment. Hence we will use support vector machine (SVM) algorithm which find optimal hyperplanes separating treated and control cells to identify these features.

The contribution of the features is summarized in the d-profile, which is a vector indicating the direction of greatest separation between the two populations. A feature that differs more between control and treated will have a higher absolute value in the d-profile.

8.1. Loading data into R

To load data into R, the user first create a list containing information about your experiment. The list needs to contain the following information:

  • plate.type.list = 96/384
  • project.path
  • markerset.id = 0,1,2 (for markerset 1,2,3 respectively)
  • plate.id = 0,1,2 (for plate 1,2,3 in the project respectively)
  • wells = list of well positions
  • feature.list = array of features that we want to load
  • well.names

For the RNAi data set, the following list has been created

  • wells is a list of all the genes that we are going to examine. For each gene, we have included two well positions. The first and second entries corresponde to the well position of the control well and that of the gene respectively.
  • The features have been saved and is loaded into R using read.table.
setwd("../2012_12_cellXpress_demo/")
source('./lib/cXlibrary.R')

core.number = 4
features <- read.table('./analysis/d_profiles/features')

svm_list <- list(
  plate.type.list = 96,
  project.path = "./plates/121126_HeLa-d_profile-RZ12001",
  markerset.id = 0,
  plate.id = 0,

  wells = list( POLR2A = c("A1","A3"),
              POLR2B = c("A1","A4"),
              POLR2G = c("A1","A5"),
              POLR2F = c("B1","B3"),
              POLR2H = c("B1","B4"),
              POLR2I = c("B1","B5"),
              POLR2J = c("B1","B6"),
              POLR2K = c("B1","B7"),
              RPS10 = c("C1",'C3'),
              RPS18 = c("C1","C4"),
              RPS20 = c("C1","C5"),
              RSP24 = c("C1","C6"),
              RPS12 = c("C1",'C7'),
              RPS15A = c("C1","C8"),
              RPS25 = c("C1","C9"),
              RSP26 = c("C1","C10"),
              ACTA1 = c("D1","D3"),
              ACTA2 = c("D1","D4"),
              ACTB = c("D1","D5"),
              ACTG1 = c("D1","D6"),
              ACTG2 = c("D1","D7"),
              SPTBN5 = c("E1","E3"),
              SPTBN1 = c("C1","C11"),
              SPTBN2 = c("C1","C12"),
              TUBB6= c("A1","A6"),
              TUBA1 = c("A1","A7"),
              TUBA2 = c("A1","A8"),
              TUBB1 = c("A1","A9"),
              TUBB4 = c("A1","A10"),
              TUBB4Q = c("A1","A11"),
              TUBG1 = c("A1","A12"),
              TUBA3 = c("F1","F3")),

  feature.list = as.character(features[[1]]),

  well.names = c("POLR2A","POLR2B","POLR2G","POLR2F","POLR2H","POLR2I","POLR2J",
                 "POLR2K","RPS10","RPS18","RPS20","RPS24","RPS12","RPS15A","RPS25",
                 "RPS26","ACTA1","ACTA2","ACTB","ACTG1","ACTG2","SPTBN5","SPTBN1",
                 "SPTBN2","TUBB6","TUBA1","TUBA2","TUBB1","TUBB4","TUBB4Q","TUBG1","TUBA3")
)

After creating the data list, we can now load the data into R. First we will load the corresponding to the control wells. Some of the genes share similar control wells as they are located on the same plate. Hence we will first determine the names of the unique control wells.

m = length(svm_list[[5]])

#determining the number of unique control wells
control.names <- NULL
for (i in c(1:m))
{
  control.names <- c(control.names,svm_list$svm[[i]][1])
}
control.names = unique(control.names)

Next we will load the control wells. To do so, we have to load the R library with the functions written for data loading. Notice that there are some code to identify and discard cells with features values yielding Inf or NaN.

Control is a list containing the data for the control wells. Control[[1]] is a matrix of size 178 x 290. It contains the values of the 290 features obtained for each of the 178 cells in well A1.

# loading the control wells
Control <- list()
for (i in c(1:length(control.names)))
{
  control <- load_wells(control.names[i],svm_list)
  control<- control[[1]][[1]]
  a <- which(control == 'Inf'|control== 'NaN')
  if (length(a) > 0)
  {x <- a%%(dim(control)[1])
   control <- control[-x,]
  }
  Control[[i]] <- control
}

We can now load the treated wells. Treatment is the list containing the data for the treated wells. Treatment[[i]] is a matrix containing the values of the 290 features obtained for each of the cells in for treated well i.

# loading the treated wells
Treatment <- list()
for (i in c(1:m))
{
  data <- load_wells(svm_list[[5]][[[i]][2], svm_list)
  treatment <- data[[1]][[1]]

  a <- which(treatment == 'Inf'|treatment== 'NaN')
  if (length(a) > 0)
  {x <- a%%(dim(treatment)[1])
   treatment <- treatment[-x,]
  }
  Treatment[[i]] <- treatment
}

8.2. Computing d-profiles

The library implements a support vector machine (SVM) algorithm to find optimal hyperplanes separating treated and untreated populations of cells in multi-dimensional feature space.

The d-profile, which is the normal vector of the hyperplane, indicates the phenotypic direction of greatest separation between the treated and untreated populations.

To obtain robust estimation of the d-profile, we need

  • Parameter tuning: Determine the best parameter for the analysis.
  • Multiple trials and cross-validation approach in each trial.

    • In each trial, k-fold cross-validation is performed. The original sample is randomly partitioned into k equal size subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k − 1 subsamples are used as training data. The cross-validation process is then repeated k times, with each of the k subsamples used exactly once as the validation data. The train accuracy is obtained as the average degree of separation between the treated and untreated set obtained from the training data. The test accuracy is that obtained from the validation data. Multiple trials is run to average out effects of different partitioning of the data.
  • Features selection: an algorithm to iteratively remove sets of features without a loss of classification accuracy and determine the most informative features that differentiates the treated from the untreated.

    • In the first run, all the features are used and the best separating hyperplane and classification accuracy is calculated. All the features are ranked according to their contribution to the hyperplane. The set of features that contributes least to the classification will be removed. The number of features to remove at each iteration (rfe.step.size) is set by the user. In the next run, the algorithm repeats on the remaining set of features. This process runs until a certain number of features (rfe.end.size) set by the user is left.

To compute the d-profiles, we run the code below:

#start the svm analysis
source('./lib/lib_classify.R')
control.index <- list()
treatment.index <- list()
svm_results <- list()
for (i in c(1:m)){
  control.name <- svm_list[[5]][[i]][[1]]
  k <- which(control.names == control.name)
  control <-  Control[[k]]
  treatment <- Treatment[[i]][
  row.min <- min(dim(control)[1],dim(treatment)[1])
  control.index[[i]][ <- sample(dim(control)[1],row.min,replace=FALSE)
  control <- control[control.index[[i]],]
  treatment.index[[i]] <- sample(dim(treatment)[1],row.min,replace=FALSE)
  treatment <- treatment[treatment.index[[i]],]
  all.data <- rbind(control,treatment)

  label1= rep(1, row.min)
  label2= rep(-1, row.min)
  all.label = as.factor(c(label1,label2))
  a <- dim(all.data)[1]

  tune.para <- list(cost.list = c(0.01,0.1,1,10,100), cross = 0,gamma.list = 1, is.verbose = TRUE,
               select = "max", type = "bar.mean",type = "bar.mean", rfe.end.size = 1, rfe.step.size = 1)
  #perform classification
  test <- perform_cv(all.data = all.data, all.label = all.label, class.weights = NULL, kernel.type = "linearR",
          fs.method = "svmrfe", trial.number = 3, fold.number = 3, tune.para = tune.para)
  svm_results[[i]]<- test
}

In the code,

  • We load the cXprofiling library.
  • We compare the control and treated cells to determine which set has a lower number of cells. This number is the row.min.
  • We will sample row.min number of cells from both the control and treatment to set up the data for svm analysis.
  • Next we set up the parameters for SVM. The parameters are in tune.para.

    • cost.list: parameter for svm tuning
    • cross: set cross = 0
    • gamma.list: parameter for svm tuning. gamma.list = 1 if using linear kernel.
    • is.verbose: TRUE
    • select: performance measurement tie breaker "min" or "max"
    • type: performance measure type ("acc.mean", "bar.mean", "acc.quartile", "bar.quartile")
    • rfe.end.size: The feature selection algorithm terminates when the number of remaining features = rfe.end.size .
    • rfe.step.size: The number of features to remove each time during feature selection. If the rfe.step.size = 2, the 2 feature that contributes the least to the hyperplane will be removed.
  • perform_cv is used for the svm analysis.

    • all.data: data of the treated and untreated cells
    • all.label: label (1 and -1 for treated and untreated cells)
    • class.weights: NULL
    • kernel.type: "linearR": for linear SVM.
    • fs.method: "svmrfe": perform SVM feature selection. "none" : no feature selection. Only perform parameter tuning. "apply": no feature selection and parameter tuning.
    • trial.number: number of trials to run
    • fold.number: The number of folds to use for cross-validation
    • tune.para: the svm parameter specified earlier.
  • The output of perform_cv is a list containing trial.number sublist. Each sublist contains the result for one trial. For example test[[1]] is the results for the 1st trial. There are three entries (train, test and weight)

    • results[[n]]$train is a 3 x 2 matrix. In each row, the first and the second entries are the classification accuracy for the untreated and treated cells used in training respectively for the kth fold.
    • results[[n]]$test is a 3 x 2 matrix. In each row, the first and the second entries are the classification accuracy for the untreated and treated cells used in validation respectively for the kth fold.
    • results[[n]]$weights is a 3 x 290 matrix, that contains the weights for each feature in each fold.

Combining the results from 3 trials and 3 folds.

# combining the results from the different folds and trials.
n =length(features[[1]])
weights.abs = mat.or.vec(m,n)
train.accuracy = test.accuracy = train.accuracy.sd = test.accuracy.sd = mat.or.vec(1,m)

for (i in c(1:m)){
  weight.tmp<- train.acc.tmp <- test.acc.tmp<- NULL
  for (ii in 1:3){
    a <- colMeans(svm_results[[i]][[ii]]$weight)
    a <- a/sum(abs(a))
    weight.tmp <- rbind(weight.tmp,a)
    train.acc.tmp <- c(train.acc.tmp,mean(svm_results[[i]][[ii]]$train))
    test.acc.tmp <- c(test.acc.tmp,mean(svm_results[[i]][[ii]]$test))
  }
  a <- colMeans(weight.tmp)
  a <- a/sum(abs(a))
  weights.abs[i,] = a
  train.accuracy[i] = mean(train.acc.tmp)
  test.accuracy[i] = mean(test.acc.tmp)
  train.accuracy.sd[i] = sd(train.acc.tmp)
  test.accuracy.sd[i] = sd(test.acc.tmp)
}
weights.abs.zscore <- weights_norm(weights.abs,"Z-score")
rownames(weights.abs.zscore) <- svm_list$well.names
colnames(weights.abs.zscore) <- svm_list$feature.list

8.3. Visualization

To visualize the profiles dissimilarity among the 4 group of genes, we generate an interactive 3D MDS plot using libraries, MASS8 (Support Functions and Datasets for Venables and Ripley’s MASS) and rgl (3D visualization device system (OpenGL)), in R. MASS and rgl can be downloaded from http://cran.r-project.org/web/packages/MASS/index.html and http://cran.r-project.org/web/packages/rgl/index.html respectively.

First, we use the MASS library to determine distance among the points when projected into 3d space.

library(MASS)
Color = c("blue","orange","red","green")
Class = c("RNA synthesis","Ribosomal","Actin","Tubulin")
Num = c(8,8,8,8)
color = rep(Color, Num)
EMBL_dist <- cos_dissim(weights.abs.zscore)
EMBL_mds <- isoMDS(EMBL_dist, k = 3)

Next, we use the rgl package to visualize the points in an interactive window and also take a snapshot of the plot.

#3d visulization
library(rgl)
## Some configuration parameters:
fig.width       <- 1000
fig.height      <- 1000
def.font.size   <- 1.5
label.font.size <- 2
tick.mark       <- c(-0.6, 0, 0.6)
grid.mark       <- c(-0.6, -0.3, 0, 0.3, 0.6)
grid.lwd        <- 3

## Create the window
open3d()
par3d(windowRect=c(0,0, fig.width, fig.height),
      cex=def.font.size)

## Clear the lights
rgl.clear(type="light")

## Plot the points
plot3d(EMBL_mds$points[,1],EMBL_mds$points[,2],EMBL_mds$point[,3],type="p", lwd = 2, col = color, size = 28,
       xlim = c(-0.7, 0.7),ylim = c(-0.7, 0.7),zlim = c(-0.7, 0.7), xlab="", ylab="", zlab="", box=FALSE,
       axes=FALSE)

## Plot outlines
points3d(EMBL_mds$points[,1],EMBL_mds$points[,2],EMBL_mds$points[,3],col= 'black', size = 36)

## Set the viewpoint
par3d(userMatrix=matrix(c(0.97795659,  0.207743421, 0.02102859,    0,
                          -0.02294367,  0.006813017, 0.99971282,    0,
                          0.20754056, -0.978158832, 0.01142918,    0,
                          0,            0,          0,    1),
                        nrow=4, ncol=4, byrow=TRUE))
par3d(FOV=72)
## Add the grid
grid3d(side = c('x','y+','z'), at=grid.mark, lwd=grid.lwd)

## Add the axes
axes3d("bbox",
       marklen=0.075,
       marklen.rel=FALSE,
       xat=tick.mark,
       yat=tick.mark,
       zat=tick.mark,
       xunit=0, yunit=0, zunit=0, lwd=grid.lwd, col="black")


## Add the labels
mtext3d('Dim 1', edge = 'x', cex=label.font.size, line=0.5)
mtext3d('Dim 2', edge = 'y-+', cex=label.font.size, pos=c(-0.7,0.3,0.9))
mtext3d('Dim 3', edge = 'z', cex=label.font.size, line=1)

## Create a snapshot
rgl.snapshot(filename="./analysis/d_profiles/dprofile_mds_3d.png")

As observed from the MDS plot, genes from the same group lie closer together in the MDS plot.

images/Ch8_dprofile_mds_3d.png

8.4. Comparison with raw features and PCA

Next, we determine the profiles using raw features and PCA.

# using arithmetic mean of raw features as profiles
raw <- mat.or.vec(m,n)
for (i in c(1:m)){
  raw[i,] <- apply(Treatment[[i]],2,mean)}
raw_norm <- weights_norm(raw,"Z-score")
rownames(raw_norm) <- svm_list$well.names
colnames(raw_norm) <- svm_list$feature.list
raw_norm <- data.frame(raw_norm)

# using PCA as profiles
EMBL_PCA <- prcomp(raw_norm)
sum_eig <- sum(EMBL_PCA[[1]]^2)
eig <- EMBL_PCA[[1]]^2/sum_eig

# using the number of principle components needed to explain 95% of the variation
eig_cum = c()
eig_cum[1] = eig[1]
for (i in c(2:length(eig))){
  eig_cum[i] = eig_cum[i-1]+eig[i]}
k = min(which(eig_cum > 0.95))
PCA <- EMBL_PCA$x[,c(1:k)]
PCA.abs <- weights_norm(PCA,'Z-score')

To compare the performance of d-profile, raw features and PCA, we determine the intra-group and inter-group distances for the 4 group of genes.

To calculate intra-group distance, we determined the pair-wise dissimilarity for all the genes in each of the four group. We measured the intra-cluster distance as the mean of the maximum pair-wise distances for each of the four groups.

Weights <- list()
Weights[[1]] <- weights.abs.zscore
Weights[[2]] <- PCA.abs
Weights[[3]] <- raw_norm

# determine intra-group distance
intra.dist <- list()
intra.mean <- mat.or.vec(1,length(Weights))
intra.mean.sd <- mat.or.vec(1,length(Weights))
for (i in 1:length(Weights)){
  intra.dist[[i]] <- intra_dist(Weights[[i]],Num, method_intra = 'max')
  intra.mean[i] <- mean(intra.dist[[i]])
  intra.mean.sd[i]<- apply(intra.dist[[i]],1,sd)
}
colnames(intra.mean)= c('d-profile','PCA','raw')
colnames(intra.sd)= c('d-profile','PCA','raw')

We obtained the following results which shows that d-profiles yields the smallest intra-group distance.

> intra.mean
     d-profile      PCA      raw
[1,]  1.100821 1.428272 1.329314

To calculate inter-group distance, we determined all pair-wise dissimilarity between genes from two different groups. These distances were then sorted from the lowest to the highest. For the n nearest neighbour analysis, the mean of the n lowest dissimilarities is the inter-group distance for the two group of genes. This analysis is repeated for all the different possible pair of groups and the mean is the inter-group distance.

N = c(5,10,30)
inter.dist <- list()
inter.mean <- mat.or.vec(3,length(Weights))
inter.mean.sd <- mat.or.vec(3,length(Weights))
for (j in c(1:3)){
   inter.dist[[j]]<- list()
for (i in 1:length(Weights)){
   inter.dist[[j]][[i]] <- as.vector(inter_dist(Weights[[i]],Num,N = N[j]))
   inter.mean[j,i] <- mean(inter.dist[[j]][[i]])
   inter.mean.sd[j,i] <- sd(inter.dist[[j]][[i]])
}}
colnames(inter.mean)= c('d-profile','PCA','raw')
rownames(inter.mean)= N

We obtained the following results which shows that d-profiles yields the largest inter-group distance for the different values of N.

> inter.mean
   d-profile       PCA       raw
5  0.9310926 0.6310284 0.5979230
10 0.9547827 0.7110320 0.6703705
30 0.9972681 0.8781165 0.8718356

To determine if the intra-group and inter-group distance obtained in d-profile is statistically smaller and larger respectively than that of raw features and PCA, we performed a two-sided paired t-test.

For the intra-group distance, we performed the t-test as follows:

# determine the p-value for intra-group distance
x = c(1,1,2)
y = c(2,3,3)
pval_intra = mat.or.vec(1,3)
for (i in c(1:3)){
    pval_intra[i] = t.test(x = intra.dist[[x[i]]], y = intra.dist[[y[i]]], alternative = 'two.sided', paired = TRUE)$p.value
}
colnames(pval_intra) = c('d-profile vs PCA','d-profile vs raw','PCA vs raw')

We obtained the following results showing that the intra-group distance of d-profile is significantly smaller than that of PCA and raw features at 95% confidence level.

> pval_intra
     d-profile vs PCA d-profile vs raw PCA vs raw
[1,]      0.006486328       0.03190998  0.1606494

For the inter-group distance, we performed the t-test as follows:

pval_inter = mat.or.vec(1,3)
for (i in c(1:3)){
  pval_inter[i] = t.test(x = inter.dist[[1]][[x[i]]], y = inter.dist[[1]][[y[i]]], alternative = 'two.sided', paired = TRUE)$p.value
}
colnames(pval_inter) = c('d-profile vs PCA','d-profile vs raw','PCA vs raw')
rownames(pval_inter) = 5

We obtained the following results showing that the inter-group distance of d-profile is significantly larger than that of PCA and raw features at 95% confidence level.

> pval_inter
  d-profile vs PCA d-profile vs raw PCA vs raw
5      0.001219603      0.002793763  0.1820621

8.5. Generating d-profiles for a large dataset

In the above example, we are calculating the d-profile for a couple of genes. To compute d-profiles for hundreds or thousands ofgenes, it would be better to save the d-profiles in individual files and load and combine the results afterwards. To do so, we could use the get_dprofile function in the cXprofiling.R library.

To compute the d-profiles for a large set of untreated and treated cells

  • load data
  • Arrange data and label
  • Set parameters
  • Run the svm analysis using multiple cores. The results for each svm analysis will be same in a file.
  • Loading all the files and retrieve the results.
All.data <- NULL
cell.length <- NULL
for (i in 1:length(data)){
  All.data <- rbind(All.data,data[[i]][[[1]]])
  cell.length <- c(cell.length,dim(data[[i]][[[1]]])[1])
}
All.label = rep(query.label, cell.length)

#Set parameters
dprofile.options <- list( temp.dir    = temp.dir,
                          is.save     = TRUE,
                          is.cv       = TRUE,
                          kernel.type = "linearR",
                          fs.method   = "svmrfe",
                          tune.para   = NULL,
                          cv.trial.number = 3,
                          cv.fold.number = 3)
dprofile.options$tune.para <- list(cross = 0,
                                   type = "bar.mean",
                                   select = "max",
                                   cost.list = c(0.1,1,10,100),
                                   gamma.list = 1,
                                   is.verbose = TRUE,
                                   rfe.step.size = 1,
                                   rfe.end.size = 1)
# Run the svm analysis using multiple cores. The results will be save as individual files
# in the temporary folder.
core.number = 4
X = All.data
X.label = All.label
job.id = 1
dprofile.res <- mclapply(1:length(query.label),
                         get_dprofile,
                         job.id,
                         X, X.label,
                         query.label, control.cell.idx= c(1:178),
                         dprofile.options,
                         mc.cores=core.number)
#To retrieve the files and combile the data.
combine_dprofile_result(job.id = 1, query.label = query.label, temp.dir = temp.dir,
is.cv = TRUE, is.remove.file = TRUE)