In the previous sections, you have learnt how to segment and extract features for
the cells. Now it is time to use these features to study the effect of drug, siRNA
and other treatments on the cells. In particular, we will be interested in those
features that has changed the most upon treatment with drug, siRNA, etc. These
features that are most different between the control and the treatment will provide
greatest discriminative power in separating the control and treatment. Hence we will
use support vector machine (SVM) algorithm which find *optimal* hyperplanes
separating treated and control cells to identify these features.

The contribution of the features is summarized in the d-profile, which is a vector indicating the direction of greatest separation between the two populations. A feature that differs more between control and treated will have a higher absolute value in the d-profile.

To load data into R, the user first create a list containing information about your experiment. The list needs to contain the following information:

- plate.type.list = 96/384
- project.path
- markerset.id = 0,1,2 (for markerset 1,2,3 respectively)
- plate.id = 0,1,2 (for plate 1,2,3 in the project respectively)
- wells = list of well positions
- feature.list = array of features that we want to load
- well.names

For the RNAi data set, the following list has been created

- wells is a list of all the genes that we are going to examine. For each gene, we have included two well positions. The first and second entries corresponde to the well position of the control well and that of the gene respectively.
- The features have been saved and is loaded into R using read.table.

setwd("../2012_12_cellXpress_demo/") source('./lib/cXlibrary.R') core.number = 4 features <- read.table('./analysis/d_profiles/features') svm_list <- list( plate.type.list = 96, project.path = "./plates/121126_HeLa-d_profile-RZ12001", markerset.id = 0, plate.id = 0, wells = list( POLR2A = c("A1","A3"), POLR2B = c("A1","A4"), POLR2G = c("A1","A5"), POLR2F = c("B1","B3"), POLR2H = c("B1","B4"), POLR2I = c("B1","B5"), POLR2J = c("B1","B6"), POLR2K = c("B1","B7"), RPS10 = c("C1",'C3'), RPS18 = c("C1","C4"), RPS20 = c("C1","C5"), RSP24 = c("C1","C6"), RPS12 = c("C1",'C7'), RPS15A = c("C1","C8"), RPS25 = c("C1","C9"), RSP26 = c("C1","C10"), ACTA1 = c("D1","D3"), ACTA2 = c("D1","D4"), ACTB = c("D1","D5"), ACTG1 = c("D1","D6"), ACTG2 = c("D1","D7"), SPTBN5 = c("E1","E3"), SPTBN1 = c("C1","C11"), SPTBN2 = c("C1","C12"), TUBB6= c("A1","A6"), TUBA1 = c("A1","A7"), TUBA2 = c("A1","A8"), TUBB1 = c("A1","A9"), TUBB4 = c("A1","A10"), TUBB4Q = c("A1","A11"), TUBG1 = c("A1","A12"), TUBA3 = c("F1","F3")), feature.list = as.character(features[[1]]), well.names = c("POLR2A","POLR2B","POLR2G","POLR2F","POLR2H","POLR2I","POLR2J", "POLR2K","RPS10","RPS18","RPS20","RPS24","RPS12","RPS15A","RPS25", "RPS26","ACTA1","ACTA2","ACTB","ACTG1","ACTG2","SPTBN5","SPTBN1", "SPTBN2","TUBB6","TUBA1","TUBA2","TUBB1","TUBB4","TUBB4Q","TUBG1","TUBA3") )

After creating the data list, we can now load the data into R. First we will load the corresponding to the control wells. Some of the genes share similar control wells as they are located on the same plate. Hence we will first determine the names of the unique control wells.

m = length(svm_list[[5]]) #determining the number of unique control wells control.names <- NULL for (i in c(1:m)) { control.names <- c(control.names,svm_list$svm[[i]][1]) } control.names = unique(control.names)

Next we will load the control wells. To do so, we have to load the R library with the functions written for data loading. Notice that there are some code to identify and discard cells with features values yielding Inf or NaN.

Control is a list containing the data for the control wells. Control[[1]] is a matrix of size 178 x 290. It contains the values of the 290 features obtained for each of the 178 cells in well A1.

# loading the control wells Control <- list() for (i in c(1:length(control.names))) { control <- load_wells(control.names[i],svm_list) control<- control[[1]][[1]] a <- which(control == 'Inf'|control== 'NaN') if (length(a) > 0) {x <- a%%(dim(control)[1]) control <- control[-x,] } Control[[i]] <- control }

We can now load the treated wells. Treatment is the list containing the data for the treated wells. Treatment[[i]] is a matrix containing the values of the 290 features obtained for each of the cells in for treated well i.

# loading the treated wells Treatment <- list() for (i in c(1:m)) { data <- load_wells(svm_list[[5]][[[i]][2], svm_list) treatment <- data[[1]][[1]] a <- which(treatment == 'Inf'|treatment== 'NaN') if (length(a) > 0) {x <- a%%(dim(treatment)[1]) treatment <- treatment[-x,] } Treatment[[i]] <- treatment }

The library implements a support vector machine (SVM) algorithm to find
*optimal* hyperplanes separating treated and untreated populations of
cells in multi-dimensional feature space.

The d-profile, which is the normal vector of the hyperplane, indicates
the *phenotypic direction* of greatest separation between the treated
and untreated populations.

To obtain robust estimation of the d-profile, we need

- Parameter tuning: Determine the best parameter for the analysis.
Multiple trials and cross-validation approach in each trial.

- In each trial, k-fold cross-validation is performed. The original sample is randomly partitioned into k equal size subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k − 1 subsamples are used as training data. The cross-validation process is then repeated k times, with each of the k subsamples used exactly once as the validation data. The train accuracy is obtained as the average degree of separation between the treated and untreated set obtained from the training data. The test accuracy is that obtained from the validation data. Multiple trials is run to average out effects of different partitioning of the data.

Features selection: an algorithm to iteratively remove sets of features without a loss of classification accuracy and determine the most informative features that differentiates the treated from the untreated.

- In the first run, all the features are used and the best separating hyperplane and classification accuracy is calculated. All the features are ranked according to their contribution to the hyperplane. The set of features that contributes least to the classification will be removed. The number of features to remove at each iteration (rfe.step.size) is set by the user. In the next run, the algorithm repeats on the remaining set of features. This process runs until a certain number of features (rfe.end.size) set by the user is left.

To compute the d-profiles, we run the code below:

#start the svm analysis source('./lib/lib_classify.R') control.index <- list() treatment.index <- list() svm_results <- list() for (i in c(1:m)){ control.name <- svm_list[[5]][[i]][[1]] k <- which(control.names == control.name) control <- Control[[k]] treatment <- Treatment[[i]][ row.min <- min(dim(control)[1],dim(treatment)[1]) control.index[[i]][ <- sample(dim(control)[1],row.min,replace=FALSE) control <- control[control.index[[i]],] treatment.index[[i]] <- sample(dim(treatment)[1],row.min,replace=FALSE) treatment <- treatment[treatment.index[[i]],] all.data <- rbind(control,treatment) label1= rep(1, row.min) label2= rep(-1, row.min) all.label = as.factor(c(label1,label2)) a <- dim(all.data)[1] tune.para <- list(cost.list = c(0.01,0.1,1,10,100), cross = 0,gamma.list = 1, is.verbose = TRUE, select = "max", type = "bar.mean",type = "bar.mean", rfe.end.size = 1, rfe.step.size = 1) #perform classification test <- perform_cv(all.data = all.data, all.label = all.label, class.weights = NULL, kernel.type = "linearR", fs.method = "svmrfe", trial.number = 3, fold.number = 3, tune.para = tune.para) svm_results[[i]]<- test }

In the code,

- We load the cXprofiling library.
- We compare the control and treated cells to determine which set has a lower number of cells. This number is the row.min.
- We will sample row.min number of cells from both the control and treatment to set up the data for svm analysis.
Next we set up the parameters for SVM. The parameters are in tune.para.

- cost.list: parameter for svm tuning
- cross: set cross = 0
- gamma.list: parameter for svm tuning. gamma.list = 1 if using linear kernel.
- is.verbose: TRUE
- select: performance measurement tie breaker "min" or "max"
- type: performance measure type ("acc.mean", "bar.mean", "acc.quartile", "bar.quartile")
- rfe.end.size: The feature selection algorithm terminates when the number of remaining features = rfe.end.size .
- rfe.step.size: The number of features to remove each time during feature selection. If the rfe.step.size = 2, the 2 feature that contributes the least to the hyperplane will be removed.

perform_cv is used for the svm analysis.

- all.data: data of the treated and untreated cells
- all.label: label (1 and -1 for treated and untreated cells)
- class.weights: NULL
- kernel.type: "linearR": for linear SVM.
- fs.method: "svmrfe": perform SVM feature selection. "none" : no feature selection. Only perform parameter tuning. "apply": no feature selection and parameter tuning.
- trial.number: number of trials to run
- fold.number: The number of folds to use for cross-validation
- tune.para: the svm parameter specified earlier.

The output of perform_cv is a list containing trial.number sublist. Each sublist contains the result for one trial. For example test[[1]] is the results for the 1st trial. There are three entries (train, test and weight)

- results[[n]]$train is a 3 x 2 matrix. In each row, the first and the second entries are the classification accuracy for the untreated and treated cells used in training respectively for the kth fold.
- results[[n]]$test is a 3 x 2 matrix. In each row, the first and the second entries are the classification accuracy for the untreated and treated cells used in validation respectively for the kth fold.
- results[[n]]$weights is a 3 x 290 matrix, that contains the weights for each feature in each fold.

Combining the results from 3 trials and 3 folds.

# combining the results from the different folds and trials. n =length(features[[1]]) weights.abs = mat.or.vec(m,n) train.accuracy = test.accuracy = train.accuracy.sd = test.accuracy.sd = mat.or.vec(1,m) for (i in c(1:m)){ weight.tmp<- train.acc.tmp <- test.acc.tmp<- NULL for (ii in 1:3){ a <- colMeans(svm_results[[i]][[ii]]$weight) a <- a/sum(abs(a)) weight.tmp <- rbind(weight.tmp,a) train.acc.tmp <- c(train.acc.tmp,mean(svm_results[[i]][[ii]]$train)) test.acc.tmp <- c(test.acc.tmp,mean(svm_results[[i]][[ii]]$test)) } a <- colMeans(weight.tmp) a <- a/sum(abs(a)) weights.abs[i,] = a train.accuracy[i] = mean(train.acc.tmp) test.accuracy[i] = mean(test.acc.tmp) train.accuracy.sd[i] = sd(train.acc.tmp) test.accuracy.sd[i] = sd(test.acc.tmp) } weights.abs.zscore <- weights_norm(weights.abs,"Z-score") rownames(weights.abs.zscore) <- svm_list$well.names colnames(weights.abs.zscore) <- svm_list$feature.list

To visualize the profiles dissimilarity among the 4 group of genes, we generate an interactive 3D MDS plot using libraries, MASS8 (Support Functions and Datasets for Venables and Ripley’s MASS) and rgl (3D visualization device system (OpenGL)), in R. MASS and rgl can be downloaded from http://cran.r-project.org/web/packages/MASS/index.html and http://cran.r-project.org/web/packages/rgl/index.html respectively.

First, we use the MASS library to determine distance among the points when projected into 3d space.

library(MASS) Color = c("blue","orange","red","green") Class = c("RNA synthesis","Ribosomal","Actin","Tubulin") Num = c(8,8,8,8) color = rep(Color, Num) EMBL_dist <- cos_dissim(weights.abs.zscore) EMBL_mds <- isoMDS(EMBL_dist, k = 3)

Next, we use the rgl package to visualize the points in an interactive window and also take a snapshot of the plot.

#3d visulization library(rgl) ## Some configuration parameters: fig.width <- 1000 fig.height <- 1000 def.font.size <- 1.5 label.font.size <- 2 tick.mark <- c(-0.6, 0, 0.6) grid.mark <- c(-0.6, -0.3, 0, 0.3, 0.6) grid.lwd <- 3 ## Create the window open3d() par3d(windowRect=c(0,0, fig.width, fig.height), cex=def.font.size) ## Clear the lights rgl.clear(type="light") ## Plot the points plot3d(EMBL_mds$points[,1],EMBL_mds$points[,2],EMBL_mds$point[,3],type="p", lwd = 2, col = color, size = 28, xlim = c(-0.7, 0.7),ylim = c(-0.7, 0.7),zlim = c(-0.7, 0.7), xlab="", ylab="", zlab="", box=FALSE, axes=FALSE) ## Plot outlines points3d(EMBL_mds$points[,1],EMBL_mds$points[,2],EMBL_mds$points[,3],col= 'black', size = 36) ## Set the viewpoint par3d(userMatrix=matrix(c(0.97795659, 0.207743421, 0.02102859, 0, -0.02294367, 0.006813017, 0.99971282, 0, 0.20754056, -0.978158832, 0.01142918, 0, 0, 0, 0, 1), nrow=4, ncol=4, byrow=TRUE)) par3d(FOV=72) ## Add the grid grid3d(side = c('x','y+','z'), at=grid.mark, lwd=grid.lwd) ## Add the axes axes3d("bbox", marklen=0.075, marklen.rel=FALSE, xat=tick.mark, yat=tick.mark, zat=tick.mark, xunit=0, yunit=0, zunit=0, lwd=grid.lwd, col="black") ## Add the labels mtext3d('Dim 1', edge = 'x', cex=label.font.size, line=0.5) mtext3d('Dim 2', edge = 'y-+', cex=label.font.size, pos=c(-0.7,0.3,0.9)) mtext3d('Dim 3', edge = 'z', cex=label.font.size, line=1) ## Create a snapshot rgl.snapshot(filename="./analysis/d_profiles/dprofile_mds_3d.png")

As observed from the MDS plot, genes from the same group lie closer together in the MDS plot.

Next, we determine the profiles using raw features and PCA.

# using arithmetic mean of raw features as profiles raw <- mat.or.vec(m,n) for (i in c(1:m)){ raw[i,] <- apply(Treatment[[i]],2,mean)} raw_norm <- weights_norm(raw,"Z-score") rownames(raw_norm) <- svm_list$well.names colnames(raw_norm) <- svm_list$feature.list raw_norm <- data.frame(raw_norm) # using PCA as profiles EMBL_PCA <- prcomp(raw_norm) sum_eig <- sum(EMBL_PCA[[1]]^2) eig <- EMBL_PCA[[1]]^2/sum_eig # using the number of principle components needed to explain 95% of the variation eig_cum = c() eig_cum[1] = eig[1] for (i in c(2:length(eig))){ eig_cum[i] = eig_cum[i-1]+eig[i]} k = min(which(eig_cum > 0.95)) PCA <- EMBL_PCA$x[,c(1:k)] PCA.abs <- weights_norm(PCA,'Z-score')

To compare the performance of d-profile, raw features and PCA, we determine the intra-group and inter-group distances for the 4 group of genes.

To calculate intra-group distance, we determined the pair-wise dissimilarity for all the genes in each of the four group. We measured the intra-cluster distance as the mean of the maximum pair-wise distances for each of the four groups.

Weights <- list() Weights[[1]] <- weights.abs.zscore Weights[[2]] <- PCA.abs Weights[[3]] <- raw_norm # determine intra-group distance intra.dist <- list() intra.mean <- mat.or.vec(1,length(Weights)) intra.mean.sd <- mat.or.vec(1,length(Weights)) for (i in 1:length(Weights)){ intra.dist[[i]] <- intra_dist(Weights[[i]],Num, method_intra = 'max') intra.mean[i] <- mean(intra.dist[[i]]) intra.mean.sd[i]<- apply(intra.dist[[i]],1,sd) } colnames(intra.mean)= c('d-profile','PCA','raw') colnames(intra.sd)= c('d-profile','PCA','raw')

We obtained the following results which shows that d-profiles yields the smallest intra-group distance.

> intra.mean d-profile PCA raw [1,] 1.100821 1.428272 1.329314

To calculate inter-group distance, we determined all pair-wise dissimilarity between genes from two different groups. These distances were then sorted from the lowest to the highest. For the n nearest neighbour analysis, the mean of the n lowest dissimilarities is the inter-group distance for the two group of genes. This analysis is repeated for all the different possible pair of groups and the mean is the inter-group distance.

N = c(5,10,30) inter.dist <- list() inter.mean <- mat.or.vec(3,length(Weights)) inter.mean.sd <- mat.or.vec(3,length(Weights)) for (j in c(1:3)){ inter.dist[[j]]<- list() for (i in 1:length(Weights)){ inter.dist[[j]][[i]] <- as.vector(inter_dist(Weights[[i]],Num,N = N[j])) inter.mean[j,i] <- mean(inter.dist[[j]][[i]]) inter.mean.sd[j,i] <- sd(inter.dist[[j]][[i]]) }} colnames(inter.mean)= c('d-profile','PCA','raw') rownames(inter.mean)= N

We obtained the following results which shows that d-profiles yields the largest inter-group distance for the different values of N.

> inter.mean d-profile PCA raw 5 0.9310926 0.6310284 0.5979230 10 0.9547827 0.7110320 0.6703705 30 0.9972681 0.8781165 0.8718356

To determine if the intra-group and inter-group distance obtained in d-profile is statistically smaller and larger respectively than that of raw features and PCA, we performed a two-sided paired t-test.

For the intra-group distance, we performed the t-test as follows:

# determine the p-value for intra-group distance x = c(1,1,2) y = c(2,3,3) pval_intra = mat.or.vec(1,3) for (i in c(1:3)){ pval_intra[i] = t.test(x = intra.dist[[x[i]]], y = intra.dist[[y[i]]], alternative = 'two.sided', paired = TRUE)$p.value } colnames(pval_intra) = c('d-profile vs PCA','d-profile vs raw','PCA vs raw')

We obtained the following results showing that the intra-group distance of d-profile is significantly smaller than that of PCA and raw features at 95% confidence level.

> pval_intra d-profile vs PCA d-profile vs raw PCA vs raw [1,] 0.006486328 0.03190998 0.1606494

For the inter-group distance, we performed the t-test as follows:

pval_inter = mat.or.vec(1,3) for (i in c(1:3)){ pval_inter[i] = t.test(x = inter.dist[[1]][[x[i]]], y = inter.dist[[1]][[y[i]]], alternative = 'two.sided', paired = TRUE)$p.value } colnames(pval_inter) = c('d-profile vs PCA','d-profile vs raw','PCA vs raw') rownames(pval_inter) = 5

We obtained the following results showing that the inter-group distance of d-profile is significantly larger than that of PCA and raw features at 95% confidence level.

> pval_inter d-profile vs PCA d-profile vs raw PCA vs raw 5 0.001219603 0.002793763 0.1820621

In the above example, we are calculating the d-profile for a couple of genes. To compute d-profiles for hundreds or thousands ofgenes, it would be better to save the d-profiles in individual files and load and combine the results afterwards. To do so, we could use the get_dprofile function in the cXprofiling.R library.

To compute the d-profiles for a large set of untreated and treated cells

- load data
- Arrange data and label
- Set parameters
- Run the svm analysis using multiple cores. The results for each svm analysis will be same in a file.
- Loading all the files and retrieve the results.

All.data <- NULL cell.length <- NULL for (i in 1:length(data)){ All.data <- rbind(All.data,data[[i]][[[1]]]) cell.length <- c(cell.length,dim(data[[i]][[[1]]])[1]) } All.label = rep(query.label, cell.length) #Set parameters dprofile.options <- list( temp.dir = temp.dir, is.save = TRUE, is.cv = TRUE, kernel.type = "linearR", fs.method = "svmrfe", tune.para = NULL, cv.trial.number = 3, cv.fold.number = 3) dprofile.options$tune.para <- list(cross = 0, type = "bar.mean", select = "max", cost.list = c(0.1,1,10,100), gamma.list = 1, is.verbose = TRUE, rfe.step.size = 1, rfe.end.size = 1) # Run the svm analysis using multiple cores. The results will be save as individual files # in the temporary folder. core.number = 4 X = All.data X.label = All.label job.id = 1 dprofile.res <- mclapply(1:length(query.label), get_dprofile, job.id, X, X.label, query.label, control.cell.idx= c(1:178), dprofile.options, mc.cores=core.number) #To retrieve the files and combile the data. combine_dprofile_result(job.id = 1, query.label = query.label, temp.dir = temp.dir, is.cv = TRUE, is.remove.file = TRUE)