MClust Tumor Clonality Analysis

Introduction

Cancer is an evolutionary process characterized by continuous acquisition of somatic mutations, leading to intra-tumor heterogeneity where multiple genetically distinct subpopulations of cancer cells (clones and subclones) coexist within a tumor. Understanding the clonal architecture and evolutionary dynamics of tumors is crucial for predicting treatment response, disease progression, and metastatic potential.

In this analysis, we employ mClust, a model-based clustering approach using Gaussian finite mixture modeling, to investigate tumor clonality patterns based on variant allele frequency (VAF) data. VAF represents the proportion of sequencing reads supporting a variant allele at a given genomic position and serves as a proxy for the prevalence of mutations within the tumor cell population.

Objectives

This assignment consists of two main analyses:

  1. Single Sample Analysis (LU-4 tumor sample):
    • Examine the VAF distribution of somatic mutations
    • Identify distinct clonal clusters using mClust
    • Determine cancer cell fractions (CCF) for major clones and subclones
    • Reconstruct the tumor evolutionary history
  2. Paired Sample Analysis (SP4 patient - LN1 vs LN2 lymph nodes):
    • Compare VAF patterns of shared mutations between two metastatic sites
    • Identify common and site-specific clonal populations
    • Determine clonal relationships and evolutionary trajectories across tumor sites
    • Infer the evolutionary relationship between lymph node metastases

By clustering mutations based on their VAF values, we can infer which mutations arose early (present in most/all cancer cells - major/clonal) versus later (present in subsets of cells - subclonal) during tumor evolution, ultimately revealing the clonal composition and evolutionary history of these tumor samples.

# Loading mclust library
library(mclust)
## Package 'mclust' version 6.1.2
## Type 'citation("mclust")' for citing this R package in publications.
# Reading in the LU-4.snv.txt file from LAQ4_data folder
data1 <- read.delim("/Users/dr.parul/Downloads/data_required_for_CAN7032-7132/LAQ4_data/LU-4.snv.txt", header=T)

# Checking dimensions and preview the data
dim(data1)
## [1] 4207   12
head(data1)
# Calculating VAF (variant allele frequency)
# VAF = alternative allele count / (alternative allele count + reference allele count)
vaf1 <- data1$t_alt_count / (data1$t_alt_count + data1$t_ref_count)

# Adding VAF column to the dataframe
data1$vaf <- vaf1

# Previewing the data with VAF column
head(data1)

4207 mutations (rows) 13 columns (12 original + 1 VAF column added) VAF values are calculated and range from ~0.12 to ~0.30

#confirming the presence of NaN or 0 values in the data
sum(is.na(data1$vaf) | is.nan(data1$vaf) | is.infinite(data1$vaf))
## [1] 0

This confirms that neither there is 0 or NaN in this table and we are good to proceed.

cat("Length:", length(data1$vaf), "\n")
## Length: 4207
cat("Range:", range(data1$vaf), "\n")
## Range: 0.01686747 0.6896552

4207 mutations → good sample size for mixture modeling. Minimum VAF ≈ 0.017 : low-frequency variants present. Maximum VAF ≈ 0.69 : biologically plausible. Under simple diploid heterozygosity, VAF should not exceed ~0.5. VAF values substantially above 0.5 indicate either loss of the wild-type allele or amplification of the mutant allele, which are common genomic events in tumour evolution.

# Question 1: VAF Distribution Histogram (2 marks)
# Generate histogram of VAF distribution
hist(data1$vaf, 
     main = "Distribution of Variant Allele Frequencies in LU-4 Tumor Sample",
     xlab = "Variant Allele Frequency (VAF)",
     ylab = "Number of Mutations",
     col = "lightblue",
     breaks = 50)

The VAF distribution is multimodal, with: a low-VAF peak around ~0.05–0.1 a broader peak around ~0.20–0.30 a right-hand tail with local density increases around ~0.35–0.4, extending up to ~0.6

### Question 2: Cluster Identification and Characterization (3 marks)

#### Step 1: Density Estimation via Gaussian Finite Mixture Modeling

# Run density mclust analysis on VAF values
# Density estimation via Gaussian finite mixture modeling
library(mclust)
mod1 <- densityMclust(data1$vaf)

summary(mod1)
## ------------------------------------------------------- 
## Density estimation via Gaussian finite mixture modeling 
## ------------------------------------------------------- 
## 
## Mclust V (univariate, unequal variance) model with 4 components: 
## 
##  log-likelihood    n df     BIC      ICL
##         3816.08 4207 11 7540.37 6366.892
cat("Length:", length(data1$vaf), "\n")
## Length: 4207
cat("Range:", range(data1$vaf), "\n")
## Range: 0.01686747 0.6896552
## 2. Density estimation via Gaussian finite mixture modeling
mod1 <- densityMclust(data1$vaf)

summary(mod1)
## ------------------------------------------------------- 
## Density estimation via Gaussian finite mixture modeling 
## ------------------------------------------------------- 
## 
## Mclust V (univariate, unequal variance) model with 4 components: 
## 
##  log-likelihood    n df      BIC      ICL
##        3816.097 4207 11 7540.405 6367.385
## 3. Model selection – BIC plot
plot(mod1, what = "BIC", main = "BIC for Model Selection")

## 4. Density estimation plot
plot(mod1, what = "density", data = data1$vaf,
     main = "Density Estimation of VAF Distribution")

## 5. Diagnostic plots – CDF
plot(mod1, what = "diagnostic", type = "cdf")
title("Cumulative Distribution Function")

## 6. Diagnostic plots – Q-Q plot
plot(mod1, what = "diagnostic", type = "qq")
title("Q-Q Plot")

## 7. Dimension reduction for clustering and classification
mod1dr <- MclustDR(mod1)
summary(mod1dr)
## ----------------------------------------------------------------- 
## Dimension reduction for model-based clustering and classification 
## ----------------------------------------------------------------- 
## 
## Mixture model type: Mclust (V, 4) 
##         
## Clusters    n
##        1  619
##        2  600
##        3 2823
##        4  165
## 
## Estimated basis vectors: 
##           Dir1
## data1$vaf    1
## 
##                  Dir1
## Eigenvalues   0.88279
## Cum. %      100.00000
## 8. Density plot with finer breaks
plot(mod1, what = "density", data = data1$vaf, breaks = 50,
     main = "VAF Density with Identified Clusters")

## 9. Cluster classification plot
plot(mod1dr, what = "classification",
     main = "Cluster Classification")

## 10. Dimension reduction density
plot(mod1dr, what = "density",
     main = "Dimension Reduction Density")

## 11. Eigenvalues plot
plot(mod1dr, what = "evalues",
     main = "Eigenvalues")

identical(mod1$data, data1$vaf)
## [1] FALSE
isTRUE(all.equal(as.numeric(mod1$data), as.numeric(data1$vaf)))
## [1] TRUE
length(mod1$data)
## [1] 4207
length(data1$vaf)
## [1] 4207
range(as.numeric(mod1$data))
## [1] 0.01686747 0.68965517
range(as.numeric(data1$vaf))
## [1] 0.01686747 0.68965517
str(mod1$data)
##  num [1:4207, 1] 0.238 0.298 0.119 0.221 0.207 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : NULL
##   ..$ : chr "data1$vaf"
length(mod1$data); length(data1$vaf)
## [1] 4207
## [1] 4207
range(as.numeric(mod1$data)); range(data1$vaf)
## [1] 0.01686747 0.68965517
## [1] 0.01686747 0.68965517
isTRUE(all.equal(as.numeric(mod1$data), as.numeric(data1$vaf)))
## [1] TRUE
identical(as.numeric(mod1$data), data1$vaf)
## [1] TRUE
mod1dr <- MclustDR(mod1)
summary(mod1dr)
## ----------------------------------------------------------------- 
## Dimension reduction for model-based clustering and classification 
## ----------------------------------------------------------------- 
## 
## Mixture model type: Mclust (V, 4) 
##         
## Clusters    n
##        1  619
##        2  600
##        3 2823
##        4  165
## 
## Estimated basis vectors: 
##           Dir1
## data1$vaf    1
## 
##                  Dir1
## Eigenvalues   0.88279
## Cum. %      100.00000
all(mod1dr$classification == mod1$classification)
## [1] TRUE
# Mean VAF per cluster
mean_vaf <- aggregate(data1$vaf, 
                      by = list(Cluster = mod1$classification), 
                      FUN = mean)

# Rename the column
colnames(mean_vaf)[2] <- "Mean VAF"

mean_vaf
mod1 <- densityMclust(data1$vaf)

# Check model used correct VAF
isTRUE(all.equal(as.numeric(mod1$data), as.numeric(data1$vaf)))
## [1] TRUE
# Check lengths match
length(mod1$classification) == nrow(data1)
## [1] TRUE
# Check DR classification matches model classification
mod1dr <- MclustDR(mod1)
all(mod1dr$classification == mod1$classification)
## [1] TRUE
# Add classification
data1$classification <- mod1dr$classification

# Empirical mean VAF per cluster
mu_emp <- tapply(data1$vaf, data1$classification, mean)

cat("\n=== Mean VAF for Each Cluster (Empirical) ===\n")
## 
## === Mean VAF for Each Cluster (Empirical) ===
print(mu_emp)
##          1          2          3          4 
## 0.05656809 0.09329739 0.24892659 0.50731950
cat("\n=== Number of Mutations in Each Cluster ===\n")
## 
## === Number of Mutations in Each Cluster ===
print(table(data1$classification))
## 
##    1    2    3    4 
##  624  595 2823  165
colnames(data1)
##  [1] "X.CHROM"        "POS"            "ID"             "REF"           
##  [5] "ALT"            "QUAL"           "FILTER"         "INFO"          
##  [9] "t_alt_count"    "t_ref_count"    "X"              "X.1"           
## [13] "vaf"            "classification"

There is no column for tumour purity, copy number, ploidy or adjusted VAF, so we cannot compute adjusted VAF we have to use raw VAF for computing CCF.

z <- mod1$z                       # posterior probabilities (n x G)
x <- as.numeric(data1$vaf)

mu_soft <- colSums(z * x) / colSums(z)
mu_soft
## [1] 0.05826626 0.08933283 0.24210885 0.44723138
mod1$parameters$mean
##          1          2          3          4 
## 0.05826626 0.08933283 0.24210885 0.44723138
isTRUE(all.equal(as.numeric(mu_soft), as.numeric(mod1$parameters$mean)))
## [1] TRUE
#### Step 6: Validate and Characterize Each Cluster

# Get the number of clusters identified
n_clusters <- length(unique(data1$classification))
cat("Number of clusters identified:", n_clusters, "\n\n")
## Number of clusters identified: 4
# Create a list to store cluster data
cluster_list <- list()

# Analyze each cluster
for (i in 1:n_clusters) {
  cluster_data <- data1[which(data1$classification == i),]
  cluster_list[[i]] <- cluster_data
  
  cat("=== Cluster", i, "===\n")
  cat("Number of mutations:", nrow(cluster_data), "\n")
  cat("Mean VAF:", round(mean(cluster_data$vaf), 4), "\n")
  cat("Median VAF:", round(median(cluster_data$vaf), 4), "\n")
  cat("SD VAF:", round(sd(cluster_data$vaf), 4), "\n")
  cat("VAF Range:", round(min(cluster_data$vaf), 4), "-", 
      round(max(cluster_data$vaf), 4), "\n\n")
}
## === Cluster 1 ===
## Number of mutations: 624 
## Mean VAF: 0.0566 
## Median VAF: 0.0565 
## SD VAF: 0.0101 
## VAF Range: 0.0252 - 0.0732 
## 
## === Cluster 2 ===
## Number of mutations: 595 
## Mean VAF: 0.0933 
## Median VAF: 0.0909 
## SD VAF: 0.0134 
## VAF Range: 0.0734 - 0.12 
## 
## === Cluster 3 ===
## Number of mutations: 2823 
## Mean VAF: 0.2489 
## Median VAF: 0.2482 
## SD VAF: 0.0684 
## VAF Range: 0.0169 - 0.4206 
## 
## === Cluster 4 ===
## Number of mutations: 165 
## Mean VAF: 0.5073 
## Median VAF: 0.4876 
## SD VAF: 0.0721 
## VAF Range: 0.4211 - 0.6897
# Plot histograms for each cluster
par(mfrow=c(ceiling(n_clusters/2), 2))  # Arrange plots in grid
for (i in 1:n_clusters) {
  hist(cluster_list[[i]]$vaf, 
       main=paste("Cluster", i, "VAF Distribution"),
       xlab="VAF", 
       col=rainbow(n_clusters)[i],
       breaks=20)
}

par(mfrow=c(1,1))  # Reset plot layout

Phylogeentic tree created manually on PPT and pasted in the report.

Paired analysis, Part 2

# Load paired lymph node metastases data

library(mclust)

# Read the shared mutations file
data2 <- read.table("/Users/dr.parul/Downloads/data_required_for_CAN7032-7132/LAQ4_data/mutations_SP4_LN1_LN2_together.txt", 
                    header = TRUE, sep = "\t")

# Check dimensions and structure
dim(data2)
## [1] 112   8
head(data2)
colnames(data2)
## [1] "chr"          "pos"          "refCount_LN1" "varCount_LN1" "VAF_LN1"     
## [6] "refCount_LN2" "varCount_LN2" "VAF_LN2"
library(mclust)

# 1) Load data
data2 <- read.table(
  "/Users/dr.parul/Downloads/data_required_for_CAN7032-7132/LAQ4_data/mutations_SP4_LN1_LN2_together.txt",
  header = TRUE, sep = "\t"
)

# 2) Build the 2D VAF matrix for paired-sample clustering (LN1 vs LN2)
data2_vaf <- data.frame(data2$VAF_LN1, data2$VAF_LN2)
colnames(data2_vaf) <- c("VAF_LN1", "VAF_LN2")  # optional, but cleaner

# 3) Scatter plot of shared mutations (LN1 vs LN2)
plot(data2$VAF_LN1, data2$VAF_LN2,
     xlab = "VAF LN1 (%)", ylab = "VAF LN2 (%)",
     main = "Shared mutations: LN1 vs LN2 VAF")
abline(0, 1, lty = 2)  # reference line

# 4) BIC for model selection
BIC <- mclustBIC(data2_vaf)
plot(BIC)

summary(BIC)
## Best BIC values:
##              VVV,4        VVV,3        VVV,5
## BIC      -1743.766 -1746.335837 -1753.259975
## BIC diff     0.000    -2.570214    -9.494352
# 5) Fit best model using BIC
mod2 <- Mclust(data2_vaf, x = BIC)
summary(mod2, parameters = TRUE)
## ---------------------------------------------------- 
## Gaussian finite mixture model fitted by EM algorithm 
## ---------------------------------------------------- 
## 
## Mclust VVV (ellipsoidal, varying volume, shape, and orientation) model with 4
## components: 
## 
##  log-likelihood   n df       BIC       ICL
##       -817.6201 112 23 -1743.766 -1768.431
## 
## Clustering table:
##  1  2  3  4 
## 44 37 24  7 
## 
## Mixing probabilities:
##         1         2         3         4 
## 0.3368298 0.3928912 0.2082271 0.0620519 
## 
## Means:
##             [,1]     [,2]      [,3]     [,4]
## VAF_LN1 40.42042 23.17434 8.3740644 80.87664
## VAF_LN2 30.18589 25.14641 0.1611822 45.91522
## 
## Variances:
## [,,1]
##          VAF_LN1  VAF_LN2
## VAF_LN1 38.93879  9.34715
## VAF_LN2  9.34715 16.79622
## [,,2]
##          VAF_LN1  VAF_LN2
## VAF_LN1 352.9449 203.9659
## VAF_LN2 203.9659 225.3257
## [,,3]
##            VAF_LN1    VAF_LN2
## VAF_LN1 22.5324718 -0.6170872
## VAF_LN2 -0.6170872  0.1516705
## [,,4]
##          VAF_LN1   VAF_LN2
## VAF_LN1 24.14895  94.85224
## VAF_LN2 94.85224 404.88276
# 6) Dimension reduction clustering & classification
mod2dr <- MclustDR(mod2)
summary(mod2dr)
## ----------------------------------------------------------------- 
## Dimension reduction for model-based clustering and classification 
## ----------------------------------------------------------------- 
## 
## Mixture model type: Mclust (VVV, 4) 
##         
## Clusters  n
##        1 44
##        2 37
##        3 24
##        4  7
## 
## Estimated basis vectors: 
##             Dir1     Dir2
## VAF_LN1 -0.97864 -0.54255
## VAF_LN2 -0.20559  0.84002
## 
##                 Dir1      Dir2
## Eigenvalues  0.88912   0.25841
## Cum. %      77.48129 100.00000
# 7) Classification scatter plot (colored clusters)
plot(mod2, what = "classification",
     xlab = "VAF LN1 (%)", ylab = "VAF LN2 (%)",
     main = "mclust classification: LN1 vs LN2")

# 8) Cluster membership and mapping
head(mod2dr$classification)
## [1] 1 1 1 1 2 1
## Levels: 1 2 3 4
mod2dr$class2mixcomp
## [1] 1 2 3 4
# 9) Cluster means (model-based) for LN1 and LN2
mod2dr$mu
##          [,1]     [,2]      [,3]     [,4]
## [1,] 40.42042 23.17434 8.3740644 80.87664
## [2,] 30.18589 25.14641 0.1611822 45.91522
# 10) Save classification back to dataframe
data2$classification <- mod2dr$classification
summary_table <- data.frame(
  N_shared = nrow(data2),
  Mean_VAF_LN1 = mean(data2$VAF_LN1),
  Mean_VAF_LN2 = mean(data2$VAF_LN2),
  Median_VAF_LN1 = median(data2$VAF_LN1),
  Median_VAF_LN2 = median(data2$VAF_LN2),
  Correlation = cor(data2$VAF_LN1, data2$VAF_LN2)
)

summary_table
colnames(data2)
## [1] "chr"            "pos"            "refCount_LN1"   "varCount_LN1"  
## [5] "VAF_LN1"        "refCount_LN2"   "varCount_LN2"   "VAF_LN2"       
## [9] "classification"
G <- mod2$G
G
## [1] 4
cluster_sizes <- table(data2$classification)
cluster_sizes
## 
##  1  2  3  4 
## 44 37 24  7
mod2dr$mu
##          [,1]     [,2]      [,3]     [,4]
## [1,] 40.42042 23.17434 8.3740644 80.87664
## [2,] 30.18589 25.14641 0.1611822 45.91522

This information is utilized to calculate CCF and creating the phylogenetic