--- title: "TDAvec vignette" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{TDAvec vignette} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` Persistence diagrams are a fundamental tool in topological data analysis (TDA) that summarize the multi-scale topological features (such as connected components, loops, and voids) of a dataset. They represent these features as points in a 2D plane, where each point corresponds to the birth and death of a feature across different scales. Since persistence diagrams are multisets of points in a 2D plane, their unordered structure makes them challenging to use directly in machine learning models. By vectorizing them (i.e., transforming into fixed-length vectors), conventional statistical and machine learning techniques can be applied while preserving topological information. The `TDAvec` R package is specifically designed for this task, offering 13 vectorization methods commonly used in TDA. These methods are divided into three broad categories: - Functional vector summaries - based on summary functions: - Betti curve - Euler characteristic curve - Normalized life curve - Persistence block - Persistence surface - Persistence landscape function - Persistence silhouette function - Persistent entropy summary function - Template function - Algebraic vector summaries - based on polynomial maps: - Algebraic functions - Complex polynomial coefficients - Tropical coordinate function - Statistical vector summaries - based on descriptive statistics: - Basic descriptive statistics For improved computational efficiency, all code behind the vector summaries of `TDAvec` is written in `C++` using the `Rcpp` and `RcppArmadillo` packages. In this vignette, we illustrate the basic usage of the package using simple examples. Let's first load the required libraries. ```{r setup} library(TDAvec) stats_pkg_flag <- requireNamespace("TDAstats", quietly = TRUE) # to compute persistence diagrams if (stats_pkg_flag) { library(TDAstats) # load it only if present } else { message("NOTE: TDAstats not installed; code chunks that rely on it will be shown but not run.") } ``` Now, we generate a 2D point cloud of size 100 sampled uniformly from a unit circle with added Gaussian noise: ```{r} # the number of points to sample N <- 100 # set a random seed for reproducibility set.seed(123) # sample N points uniformly from the unit circle and add Gaussian noise theta <- runif(N, min = 0, max = 2 * pi) X <- cbind(cos(theta), sin(theta)) + rnorm(2 * N, mean = 0, sd = 0.2) # plot the point cloud plot(X,pch = 20,asp = 1,xlab = 'x',ylab = 'y') ``` Next, we use the `TDAstats` package to compute the persistence diagram (PD) of a Vietoris-Rips filtration built on top of the point cloud $X$. ```{r} D <- calculate_homology(X,dim=1,threshold=2) sum(D[,1]==0) # number of connected components sum(D[,1]==1) # number of loops sum(D[,1]==2) # number of voids ``` In the `calculate_homology()` function, `dim` is the maximum homological dimension of the topological features to be computed (connected components if `dim=0`; connected components and loops if `dim=1`; connected components, loops and voids if `dim=2`, etc.). `threshold` is the maximum value of the scale parameter of the filtration (which we set equal to 2 since the points are sampled from a circle with diameter 2). The persistence diagram $D$ has `r sum(D[,1]==0)` connected components (the point cloud size - 1; `TDAstats` drops the connected component with infinite death value), `r sum(D[,1]==1)` loops (one with long life-span, the rest are short-lived) and `r sum(D[,1]==2)` voids along with their `birth` and `death` values. To plot the diagram, we can use the `plot_persisit()` function. ```{r} plot_persist(D) ``` In the plot, the solid dots and triangles represent connected components and loops respectively. Let's compute a vector summary of one of the simplest summary functions, the Betti curve, for homological dimension $H_1$. ```{r} # sequence of scale values to be used for vectorization scaleSeq = seq(0,1.5,length.out=16) # vectorize the Betti curve for homological dimension H_1 computeBettiCurve(D,homDim=1,scaleSeq) ``` By default, vectorization in this case is performed by computing the average values of the Betti curve over consecutive intervals defined by an increasing sequence of scale points (`evaluate`="intervals"). This vectorization method is adopted as the default for all univariate summary functions that are easy to integrate. More specifically, if $f$ is a (univariate) summary function and $t_1,t_2,\ldots,t_n$ are increasing scale values, we discretize $f$ as: \begin{equation} \Big(\frac{1}{\Delta t_1}\int_{t_1}^{t_2}f(t)dt,\frac{1}{\Delta t_2}\int_{t_2}^{t_3}f(t)dt,\ldots,\frac{1}{\Delta t_{n-1}}\int_{t_{n-1}}^{t_n}f(t)dt\Big)\in\mathbb{R}^{n-1}, \end{equation} where $\Delta t_k=t_{k+1}-t_k$, $k=1,\ldots,n-1$. Alternatively, by setting the `evaluate` argument to "points", one can vectorize the Betti curve and all the other univariate summary functions by evaluating them at each scale point and arranging the values into a vector: \begin{equation} (f(t_1),f(t_2),\ldots,f(t_n))\in\mathbb{R}^{n}, \end{equation} As a note, the `computePersistenceLandscape()` function supports both classic and generalized persistence landscape functions. For example: ```{r} # classic computePersistenceLandscape(D,homDim=1,scaleSeq,k=3) # k = the number of persistence landscape functions to consider (default is 1) # generalized computePersistenceLandscape(D,homDim=1,scaleSeq,k=3,generalized=TRUE,kernel = "epanechnikov",h=0.2) # h = bandwidth for the kernel function ``` Computing algebraic and statistical vector summaries follows a similar approach. For example: ```{r} computeAlgebraicFunctions(D,homDim=0) ``` returns four algebraic functions based on the birth and death values (refer to the help documentation for more details). _Persistence surface_ and _persistence block_ are bivariate summary functions and to vectorize them, we first need to switch from the birth-death to the birth-persistence coordinates: ```{r} D[,3] <- D[,3] - D[,2] colnames(D)[3] <- "Persistence" ``` Below, we compute the _persistence image_, which is a vector summary of the _persistence surface_: ```{r} # Persistence Image (PI) resB <- 5 # resolution (or grid size) along the birth axis resP <- 5 # resolution (or grid size) along the persistence axis # find min and max persistence values for dimension H_0 minPH0 <- min(D[D[,1]==0,3]); maxPH0 <- max(D[D[,1]==0,3]) # construct one-dimensional grid of scale values ySeqH0 <- seq(minPH0,maxPH0,length.out=resP+1) # default way of selecting the standard deviation sigma of the Gaussians on top of each point of the diagram sigma <- 0.5*(maxPH0-minPH0)/resP # compute PI for homological dimension H_0 computePersistenceImage(D,homDim=0,xSeq=NA,ySeqH0,sigma) ``` Since the $H_0$ features all have the birth value of zero in this case, a one-dimensional grid of scale values is used for vectorization. For homological dimension $H_1$, the birth values are not all the same and therefore the vectorization is performed over a two-dimensional grid. ```{r} # Persistence Image (PI) # find min & max birth and persistence values for dimension H_1 minBH1 <- min(D[D[,1]==1,2]); maxBH1 <- max(D[D[,1]==1,2]) minPH1 <- min(D[D[,1]==1,3]); maxPH1 <- max(D[D[,1]==1,3]) xSeqH1 <- seq(minBH1,maxBH1,length.out=resB+1) ySeqH1 <- seq(minPH1,maxPH1,length.out=resP+1) sigma <- 0.5*(maxPH1-minPH1)/resP # compute PI for homological dimension H_1 computePersistenceImage(D,homDim=1,xSeqH1,ySeqH1,sigma) ``` The code for `computePersistenceImage()` is adopted from the `pers.image()` function (available in the `kernelTDA` package) with some modifications. For example, `pers.image()` uses a one-dimensional grid for homological dimension $H_0$ regardless of the filtration type. In contrast, `computePersistenceImage()` uses a one-dimensional grid only if additionally the birth values are the same (which may not be true for some filtration types). Moreover, `pers.image()` uses a square grid (e.g., 10 by 10) for vectorization, whereas `computePersistenceImage()` is not limited to such a grid and can compute vector summaries using a rectangular grid (e.g., 10 by 20). Finally, we compare the computational costs of two vector-based vectorization methods—`persistence landscape` and `persistence silhouette`—across three `R` packages: `TDA`, `TDAkit`, and `TDAvec`. These are the only vector summaries consistently implemented in these three packages. For persistence diagrams with sizes ranging from 100 to 1000 points, we generate random birth and death values, construct the diagrams, and apply each method using the appropriate package functions. To assess performance, we use the `microbenchmark` package to record the median execution time (in milliseconds) for each method, averaged over 10 repetitions. ```{r message=FALSE, warning=FALSE} library(TDA) options(rgl.useNULL=TRUE) library(TDAkit) library(microbenchmark) ``` ```{r message=FALSE, warning=FALSE} N <- seq(100,1000,by=100) # sequence of diagram sizes method <- c('landscape-TDA','landscape-TDAkit','landscape-TDAvec', 'silhouette-TDA','silhouette-TDAkit','silhouette-TDAvec') cost <- matrix(ncol = length(method),nrow = length(N)) colnames(cost) <- method; rownames(cost) <- N for (i in seq_along(N)){ n <- N[i] b <- runif(n) # birth values d <- b + runif(n) # death values pd <- cbind(0,b,d) # persistence diagram pdList <- list(Dimension=rep(0,n),Birth=b,Death=d) # PD as a list for TDAkit attr(pdList,"class") <- "homology" scaleSeq = seq(0,2,length.out=101) # sequence of scale values mb <- summary(microbenchmark( TDA::landscape(pd,dimension = 0,KK=1:5,tseq = scaleSeq), TDAkit::diag2landscape(pdList,dimension = 0,k=5,nseq = 101), # nseq: grid size TDAvec::computePersistenceLandscape(pd,homDim = 0,scaleSeq,k=5), TDA::silhouette(pd,p=1,dimension = 0,tseq = scaleSeq), TDAkit::diag2silhouette(pdList,dimension = 0,p=1,nseq = 101), TDAvec::computePersistenceSilhouette(pd,homDim = 0,scaleSeq,p=1,evaluate = "points"), unit = 'ms', # unit: milliseconds times = 10 # number of times to call each function )) cost[i,] <- mb[['median']] } ``` Let us also download the results for python simulation, created using `./python/tdavec/time_tests.py` script ```{r} py_data <- read.csv(system.file("extdata", "benchmark_ps.csv", package = "TDAvec")) py_cost <- rbind(t(cost), 1e3*py_data$mean) rownames(py_cost)[nrow(py_cost)] <- "silhouette-python" print(py_cost, digits = 3) ``` From the results, we observe significant differences in computational efficiency: `TDAvec` consistently outperforms both `TDAkit` and `TDA`, with its silhouette and landscape methods being substantially faster—e.g., for 1000 points, we compute silhouettes in ~`r round(cost[10,'silhouette-TDAvec'],digits=2)`ms using `TDAvec` versus ~`r round(cost[10,'silhouette-TDAkit'],digits=2)`ms with `TDAkit`, and landscapes in ~`r round(cost[10,'landscape-TDAvec'],digits=2)`ms versus ~`r round(cost[10,'landscape-TDAkit'],digits=2)`ms. The following line plot illustrates how computation time scales with diagram size for the persistence silhouette method, highlighting performance differences between the three packages. ```{r message=FALSE, warning=FALSE, fig.width=7, fig.height=4} library(ggplot2) library(reshape2) silhouette_data <- data.frame( N = N, `silhouette-TDA` = py_cost['silhouette-TDA',], `silhouette-TDAkit` = py_cost['silhouette-TDAkit',], `silhouette-TDAvec` = py_cost['silhouette-TDAvec',], `silhouette-python` = py_cost['silhouette-python',]) # Reshape data silhouette_data_long <- melt(silhouette_data, id.vars = "N", variable.name = "Method", value.name = "Time") # Plot ggplot(silhouette_data_long, aes(x = N, y = Time, color = Method)) + geom_line(size = 0.5) + geom_point(size = 1) + labs( title = "Median Runtime of Silhouette Methods vs Diagram Size", x = "Diagram Size (N)", y = "Median Time (ms)", color = "Method" ) + theme_minimal(base_size = 10) ``` #### References 1. Chazal, F., & Michel, B. (2021). An Introduction to Topological Data Analysis: Fundamental and Practical Aspects for Data Scientists. _Frontiers in Artificial Intelligence_, 108. 2. Chung, Y. M., & Lawson, A. (2022). Persistence curves: A canonical framework for summarizing persistence diagrams. _Advances in Computational Mathematics_, 48(1), 1-42. 3. Bubenik, P. (2015). Statistical topological data analysis using persistence landscapes. _Journal of Machine Learning Research_, 16(1), 77-102. 4. Berry, E., Chen, Y. C., Cisewski-Kehe, J., & Fasy, B. T. (2020). Functional summaries of persistence diagrams. _Journal of Applied and Computational Topology_, 4(2):211–262. 5. Adcock, A., Carlsson, E. and Carlsson, G., 2013. The ring of algebraic functions on persistence bar codes. Homology, Homotopy Appl., 18:381–402, 2016. 6. Ali, D., Asaad, A., Jimenez, M.J., Nanda, V., Paluzo-Hidalgo, E. and Soriano-Trigueros, M., (2023). A survey of vectorization methods in topological data analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence. 7. Adams, H., Emerson, T., Kirby, M., Neville, R., Peterson, C., Shipman, P., ... & Ziegelmeier, L. (2017). Persistence images: A stable vector representation of persistent homology. _Journal of Machine Learning Research_, 18.