Cancer is an evolutionary process characterized by continuous acquisition of somatic mutations, leading to intra-tumor heterogeneity where multiple genetically distinct subpopulations of cancer cells (clones and subclones) coexist within a tumor. Understanding the clonal architecture and evolutionary dynamics of tumors is crucial for predicting treatment response, disease progression, and metastatic potential.
In this analysis, we employ mClust, a model-based clustering approach using Gaussian finite mixture modeling, to investigate tumor clonality patterns based on variant allele frequency (VAF) data. VAF represents the proportion of sequencing reads supporting a variant allele at a given genomic position and serves as a proxy for the prevalence of mutations within the tumor cell population.
This assignment consists of two main analyses:
By clustering mutations based on their VAF values, we can infer which mutations arose early (present in most/all cancer cells - major/clonal) versus later (present in subsets of cells - subclonal) during tumor evolution, ultimately revealing the clonal composition and evolutionary history of these tumor samples.
# Loading mclust library
library(mclust)
## Package 'mclust' version 6.1.2
## Type 'citation("mclust")' for citing this R package in publications.
# Reading in the LU-4.snv.txt file from LAQ4_data folder
data1 <- read.delim("/Users/dr.parul/Downloads/data_required_for_CAN7032-7132/LAQ4_data/LU-4.snv.txt", header=T)
# Checking dimensions and preview the data
dim(data1)
## [1] 4207 12
head(data1)
# Calculating VAF (variant allele frequency)
# VAF = alternative allele count / (alternative allele count + reference allele count)
vaf1 <- data1$t_alt_count / (data1$t_alt_count + data1$t_ref_count)
# Adding VAF column to the dataframe
data1$vaf <- vaf1
# Previewing the data with VAF column
head(data1)
4207 mutations (rows) 13 columns (12 original + 1 VAF column added) VAF values are calculated and range from ~0.12 to ~0.30
#confirming the presence of NaN or 0 values in the data
sum(is.na(data1$vaf) | is.nan(data1$vaf) | is.infinite(data1$vaf))
## [1] 0
This confirms that neither there is 0 or NaN in this table and we are good to proceed.
cat("Length:", length(data1$vaf), "\n")
## Length: 4207
cat("Range:", range(data1$vaf), "\n")
## Range: 0.01686747 0.6896552
4207 mutations → good sample size for mixture modeling. Minimum VAF ≈ 0.017 : low-frequency variants present. Maximum VAF ≈ 0.69 : biologically plausible. Under simple diploid heterozygosity, VAF should not exceed ~0.5. VAF values substantially above 0.5 indicate either loss of the wild-type allele or amplification of the mutant allele, which are common genomic events in tumour evolution.
# Question 1: VAF Distribution Histogram (2 marks)
# Generate histogram of VAF distribution
hist(data1$vaf,
main = "Distribution of Variant Allele Frequencies in LU-4 Tumor Sample",
xlab = "Variant Allele Frequency (VAF)",
ylab = "Number of Mutations",
col = "lightblue",
breaks = 50)
The VAF distribution is multimodal, with: a low-VAF peak around ~0.05–0.1 a broader peak around ~0.20–0.30 a right-hand tail with local density increases around ~0.35–0.4, extending up to ~0.6
### Question 2: Cluster Identification and Characterization (3 marks)
#### Step 1: Density Estimation via Gaussian Finite Mixture Modeling
# Run density mclust analysis on VAF values
# Density estimation via Gaussian finite mixture modeling
library(mclust)
mod1 <- densityMclust(data1$vaf)
summary(mod1)
## -------------------------------------------------------
## Density estimation via Gaussian finite mixture modeling
## -------------------------------------------------------
##
## Mclust V (univariate, unequal variance) model with 4 components:
##
## log-likelihood n df BIC ICL
## 3816.08 4207 11 7540.37 6366.892
cat("Length:", length(data1$vaf), "\n")
## Length: 4207
cat("Range:", range(data1$vaf), "\n")
## Range: 0.01686747 0.6896552
## 2. Density estimation via Gaussian finite mixture modeling
mod1 <- densityMclust(data1$vaf)
summary(mod1)
## -------------------------------------------------------
## Density estimation via Gaussian finite mixture modeling
## -------------------------------------------------------
##
## Mclust V (univariate, unequal variance) model with 4 components:
##
## log-likelihood n df BIC ICL
## 3816.097 4207 11 7540.405 6367.385
## 3. Model selection – BIC plot
plot(mod1, what = "BIC", main = "BIC for Model Selection")
## 4. Density estimation plot
plot(mod1, what = "density", data = data1$vaf,
main = "Density Estimation of VAF Distribution")
## 5. Diagnostic plots – CDF
plot(mod1, what = "diagnostic", type = "cdf")
title("Cumulative Distribution Function")
## 6. Diagnostic plots – Q-Q plot
plot(mod1, what = "diagnostic", type = "qq")
title("Q-Q Plot")
## 7. Dimension reduction for clustering and classification
mod1dr <- MclustDR(mod1)
summary(mod1dr)
## -----------------------------------------------------------------
## Dimension reduction for model-based clustering and classification
## -----------------------------------------------------------------
##
## Mixture model type: Mclust (V, 4)
##
## Clusters n
## 1 619
## 2 600
## 3 2823
## 4 165
##
## Estimated basis vectors:
## Dir1
## data1$vaf 1
##
## Dir1
## Eigenvalues 0.88279
## Cum. % 100.00000
## 8. Density plot with finer breaks
plot(mod1, what = "density", data = data1$vaf, breaks = 50,
main = "VAF Density with Identified Clusters")
## 9. Cluster classification plot
plot(mod1dr, what = "classification",
main = "Cluster Classification")
## 10. Dimension reduction density
plot(mod1dr, what = "density",
main = "Dimension Reduction Density")
## 11. Eigenvalues plot
plot(mod1dr, what = "evalues",
main = "Eigenvalues")
identical(mod1$data, data1$vaf)
## [1] FALSE
isTRUE(all.equal(as.numeric(mod1$data), as.numeric(data1$vaf)))
## [1] TRUE
length(mod1$data)
## [1] 4207
length(data1$vaf)
## [1] 4207
range(as.numeric(mod1$data))
## [1] 0.01686747 0.68965517
range(as.numeric(data1$vaf))
## [1] 0.01686747 0.68965517
str(mod1$data)
## num [1:4207, 1] 0.238 0.298 0.119 0.221 0.207 ...
## - attr(*, "dimnames")=List of 2
## ..$ : NULL
## ..$ : chr "data1$vaf"
length(mod1$data); length(data1$vaf)
## [1] 4207
## [1] 4207
range(as.numeric(mod1$data)); range(data1$vaf)
## [1] 0.01686747 0.68965517
## [1] 0.01686747 0.68965517
isTRUE(all.equal(as.numeric(mod1$data), as.numeric(data1$vaf)))
## [1] TRUE
identical(as.numeric(mod1$data), data1$vaf)
## [1] TRUE
mod1dr <- MclustDR(mod1)
summary(mod1dr)
## -----------------------------------------------------------------
## Dimension reduction for model-based clustering and classification
## -----------------------------------------------------------------
##
## Mixture model type: Mclust (V, 4)
##
## Clusters n
## 1 619
## 2 600
## 3 2823
## 4 165
##
## Estimated basis vectors:
## Dir1
## data1$vaf 1
##
## Dir1
## Eigenvalues 0.88279
## Cum. % 100.00000
all(mod1dr$classification == mod1$classification)
## [1] TRUE
# Mean VAF per cluster
mean_vaf <- aggregate(data1$vaf,
by = list(Cluster = mod1$classification),
FUN = mean)
# Rename the column
colnames(mean_vaf)[2] <- "Mean VAF"
mean_vaf
mod1 <- densityMclust(data1$vaf)
# Check model used correct VAF
isTRUE(all.equal(as.numeric(mod1$data), as.numeric(data1$vaf)))
## [1] TRUE
# Check lengths match
length(mod1$classification) == nrow(data1)
## [1] TRUE
# Check DR classification matches model classification
mod1dr <- MclustDR(mod1)
all(mod1dr$classification == mod1$classification)
## [1] TRUE
# Add classification
data1$classification <- mod1dr$classification
# Empirical mean VAF per cluster
mu_emp <- tapply(data1$vaf, data1$classification, mean)
cat("\n=== Mean VAF for Each Cluster (Empirical) ===\n")
##
## === Mean VAF for Each Cluster (Empirical) ===
print(mu_emp)
## 1 2 3 4
## 0.05656809 0.09329739 0.24892659 0.50731950
cat("\n=== Number of Mutations in Each Cluster ===\n")
##
## === Number of Mutations in Each Cluster ===
print(table(data1$classification))
##
## 1 2 3 4
## 624 595 2823 165
colnames(data1)
## [1] "X.CHROM" "POS" "ID" "REF"
## [5] "ALT" "QUAL" "FILTER" "INFO"
## [9] "t_alt_count" "t_ref_count" "X" "X.1"
## [13] "vaf" "classification"
There is no column for tumour purity, copy number, ploidy or adjusted VAF, so we cannot compute adjusted VAF we have to use raw VAF for computing CCF.
z <- mod1$z # posterior probabilities (n x G)
x <- as.numeric(data1$vaf)
mu_soft <- colSums(z * x) / colSums(z)
mu_soft
## [1] 0.05826626 0.08933283 0.24210885 0.44723138
mod1$parameters$mean
## 1 2 3 4
## 0.05826626 0.08933283 0.24210885 0.44723138
isTRUE(all.equal(as.numeric(mu_soft), as.numeric(mod1$parameters$mean)))
## [1] TRUE
#### Step 6: Validate and Characterize Each Cluster
# Get the number of clusters identified
n_clusters <- length(unique(data1$classification))
cat("Number of clusters identified:", n_clusters, "\n\n")
## Number of clusters identified: 4
# Create a list to store cluster data
cluster_list <- list()
# Analyze each cluster
for (i in 1:n_clusters) {
cluster_data <- data1[which(data1$classification == i),]
cluster_list[[i]] <- cluster_data
cat("=== Cluster", i, "===\n")
cat("Number of mutations:", nrow(cluster_data), "\n")
cat("Mean VAF:", round(mean(cluster_data$vaf), 4), "\n")
cat("Median VAF:", round(median(cluster_data$vaf), 4), "\n")
cat("SD VAF:", round(sd(cluster_data$vaf), 4), "\n")
cat("VAF Range:", round(min(cluster_data$vaf), 4), "-",
round(max(cluster_data$vaf), 4), "\n\n")
}
## === Cluster 1 ===
## Number of mutations: 624
## Mean VAF: 0.0566
## Median VAF: 0.0565
## SD VAF: 0.0101
## VAF Range: 0.0252 - 0.0732
##
## === Cluster 2 ===
## Number of mutations: 595
## Mean VAF: 0.0933
## Median VAF: 0.0909
## SD VAF: 0.0134
## VAF Range: 0.0734 - 0.12
##
## === Cluster 3 ===
## Number of mutations: 2823
## Mean VAF: 0.2489
## Median VAF: 0.2482
## SD VAF: 0.0684
## VAF Range: 0.0169 - 0.4206
##
## === Cluster 4 ===
## Number of mutations: 165
## Mean VAF: 0.5073
## Median VAF: 0.4876
## SD VAF: 0.0721
## VAF Range: 0.4211 - 0.6897
# Plot histograms for each cluster
par(mfrow=c(ceiling(n_clusters/2), 2)) # Arrange plots in grid
for (i in 1:n_clusters) {
hist(cluster_list[[i]]$vaf,
main=paste("Cluster", i, "VAF Distribution"),
xlab="VAF",
col=rainbow(n_clusters)[i],
breaks=20)
}
par(mfrow=c(1,1)) # Reset plot layout
Phylogeentic tree created manually on PPT and pasted in the report.
Paired analysis, Part 2
# Load paired lymph node metastases data
library(mclust)
# Read the shared mutations file
data2 <- read.table("/Users/dr.parul/Downloads/data_required_for_CAN7032-7132/LAQ4_data/mutations_SP4_LN1_LN2_together.txt",
header = TRUE, sep = "\t")
# Check dimensions and structure
dim(data2)
## [1] 112 8
head(data2)
colnames(data2)
## [1] "chr" "pos" "refCount_LN1" "varCount_LN1" "VAF_LN1"
## [6] "refCount_LN2" "varCount_LN2" "VAF_LN2"
library(mclust)
# 1) Load data
data2 <- read.table(
"/Users/dr.parul/Downloads/data_required_for_CAN7032-7132/LAQ4_data/mutations_SP4_LN1_LN2_together.txt",
header = TRUE, sep = "\t"
)
# 2) Build the 2D VAF matrix for paired-sample clustering (LN1 vs LN2)
data2_vaf <- data.frame(data2$VAF_LN1, data2$VAF_LN2)
colnames(data2_vaf) <- c("VAF_LN1", "VAF_LN2") # optional, but cleaner
# 3) Scatter plot of shared mutations (LN1 vs LN2)
plot(data2$VAF_LN1, data2$VAF_LN2,
xlab = "VAF LN1 (%)", ylab = "VAF LN2 (%)",
main = "Shared mutations: LN1 vs LN2 VAF")
abline(0, 1, lty = 2) # reference line
# 4) BIC for model selection
BIC <- mclustBIC(data2_vaf)
plot(BIC)
summary(BIC)
## Best BIC values:
## VVV,4 VVV,3 VVV,5
## BIC -1743.766 -1746.335837 -1753.259975
## BIC diff 0.000 -2.570214 -9.494352
# 5) Fit best model using BIC
mod2 <- Mclust(data2_vaf, x = BIC)
summary(mod2, parameters = TRUE)
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust VVV (ellipsoidal, varying volume, shape, and orientation) model with 4
## components:
##
## log-likelihood n df BIC ICL
## -817.6201 112 23 -1743.766 -1768.431
##
## Clustering table:
## 1 2 3 4
## 44 37 24 7
##
## Mixing probabilities:
## 1 2 3 4
## 0.3368298 0.3928912 0.2082271 0.0620519
##
## Means:
## [,1] [,2] [,3] [,4]
## VAF_LN1 40.42042 23.17434 8.3740644 80.87664
## VAF_LN2 30.18589 25.14641 0.1611822 45.91522
##
## Variances:
## [,,1]
## VAF_LN1 VAF_LN2
## VAF_LN1 38.93879 9.34715
## VAF_LN2 9.34715 16.79622
## [,,2]
## VAF_LN1 VAF_LN2
## VAF_LN1 352.9449 203.9659
## VAF_LN2 203.9659 225.3257
## [,,3]
## VAF_LN1 VAF_LN2
## VAF_LN1 22.5324718 -0.6170872
## VAF_LN2 -0.6170872 0.1516705
## [,,4]
## VAF_LN1 VAF_LN2
## VAF_LN1 24.14895 94.85224
## VAF_LN2 94.85224 404.88276
# 6) Dimension reduction clustering & classification
mod2dr <- MclustDR(mod2)
summary(mod2dr)
## -----------------------------------------------------------------
## Dimension reduction for model-based clustering and classification
## -----------------------------------------------------------------
##
## Mixture model type: Mclust (VVV, 4)
##
## Clusters n
## 1 44
## 2 37
## 3 24
## 4 7
##
## Estimated basis vectors:
## Dir1 Dir2
## VAF_LN1 -0.97864 -0.54255
## VAF_LN2 -0.20559 0.84002
##
## Dir1 Dir2
## Eigenvalues 0.88912 0.25841
## Cum. % 77.48129 100.00000
# 7) Classification scatter plot (colored clusters)
plot(mod2, what = "classification",
xlab = "VAF LN1 (%)", ylab = "VAF LN2 (%)",
main = "mclust classification: LN1 vs LN2")
# 8) Cluster membership and mapping
head(mod2dr$classification)
## [1] 1 1 1 1 2 1
## Levels: 1 2 3 4
mod2dr$class2mixcomp
## [1] 1 2 3 4
# 9) Cluster means (model-based) for LN1 and LN2
mod2dr$mu
## [,1] [,2] [,3] [,4]
## [1,] 40.42042 23.17434 8.3740644 80.87664
## [2,] 30.18589 25.14641 0.1611822 45.91522
# 10) Save classification back to dataframe
data2$classification <- mod2dr$classification
summary_table <- data.frame(
N_shared = nrow(data2),
Mean_VAF_LN1 = mean(data2$VAF_LN1),
Mean_VAF_LN2 = mean(data2$VAF_LN2),
Median_VAF_LN1 = median(data2$VAF_LN1),
Median_VAF_LN2 = median(data2$VAF_LN2),
Correlation = cor(data2$VAF_LN1, data2$VAF_LN2)
)
summary_table
colnames(data2)
## [1] "chr" "pos" "refCount_LN1" "varCount_LN1"
## [5] "VAF_LN1" "refCount_LN2" "varCount_LN2" "VAF_LN2"
## [9] "classification"
G <- mod2$G
G
## [1] 4
cluster_sizes <- table(data2$classification)
cluster_sizes
##
## 1 2 3 4
## 44 37 24 7
mod2dr$mu
## [,1] [,2] [,3] [,4]
## [1,] 40.42042 23.17434 8.3740644 80.87664
## [2,] 30.18589 25.14641 0.1611822 45.91522
This information is utilized to calculate CCF and creating the phylogenetic