Subsampling with LANL

This page analyzes a new algorithm for choosing the most appropriate distance threshold when clustering HIV sequences to reconstruct potential transmission networks. The dataset comes from the Los Alamos National Laboratory (LANL) and was partitioned by year. To assess how the algorithm performs under different sampling conditions, each year's dataset was randomly subsampled at 75%, 50%, and 25%. The algorithm was then run on each subsample to determine how the inferred best threshold varies with the sparsity of the resulting network. The results give insight into how effective the algorithm is under different sampling conditions.
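The subsampling step described above can be sketched roughly as follows. This is a minimal illustration, not the code used for the analysis: the function name `subsample_by_year`, the `(sequence_id, year)` record layout, the rounding rule, and the fixed seed are all assumptions made for the example.

```python
import random
from collections import defaultdict


def subsample_by_year(records, proportions=(0.75, 0.50, 0.25), seed=0):
    """Randomly subsample sequence IDs within each year.

    records: iterable of (sequence_id, year) pairs (hypothetical layout).
    Returns a dict mapping (year, proportion) to a list of sampled IDs.
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible

    # Partition the sequence IDs by year.
    by_year = defaultdict(list)
    for seq_id, year in records:
        by_year[year].append(seq_id)

    subsamples = {}
    for year, ids in sorted(by_year.items()):
        for p in proportions:
            # Assumption: sample size is the proportion rounded to an integer.
            k = int(round(len(ids) * p))
            # Sampling without replacement within the year.
            subsamples[(year, p)] = rng.sample(ids, k)
    return subsamples


# Toy input: 20 sequences in each of two years.
records = [(f"seq{i}", 2000 + i % 2) for i in range(40)]
subsamples = subsample_by_year(records)
```

Each `(year, proportion)` subsample would then be passed to the threshold-selection algorithm independently, so that the selected thresholds can be compared across sampling proportions.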

Threshold

Plot

Figure 1. Optimal distance threshold by year. The x-axis shows the year and the y-axis shows the optimal distance threshold used to link pairs of sequences into a cluster. The optimal threshold is the one selected by a heuristic score, which is described later on this page. Each dot is colored by the proportion of that year's original dataset that was sampled.
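To make concrete what "linking pairs of sequences into a cluster" means at a given threshold, here is a minimal sketch that treats clusters as connected components of the graph whose edges are pairs with distance at or below the threshold. The function name `cluster_sizes`, the `(i, j, distance)` edge layout, and the choice to report unlinked sequences as clusters of size 1 are assumptions for illustration, not details taken from the algorithm itself.

```python
def cluster_sizes(n_nodes, edges, threshold):
    """Return component sizes (largest first) after linking pairs whose
    distance is at or below the threshold.

    edges: iterable of (i, j, distance) with integer node indices
    in range(n_nodes) (hypothetical layout).
    """
    parent = list(range(n_nodes))

    def find(x):
        # Follow parent pointers to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j, dist in edges:
        if dist <= threshold:
            root_i, root_j = find(i), find(j)
            if root_i != root_j:
                parent[root_i] = root_j  # merge the two components

    # Count how many nodes end up under each root.
    sizes = {}
    for node in range(n_nodes):
        root = find(node)
        sizes[root] = sizes.get(root, 0) + 1
    return sorted(sizes.values(), reverse=True)


# Toy pairwise distances between six sequences.
edges = [(0, 1, 0.010), (1, 2, 0.012), (3, 4, 0.020), (4, 5, 0.005)]
tight = cluster_sizes(6, edges, threshold=0.006)
medium = cluster_sizes(6, edges, threshold=0.015)
loose = cluster_sizes(6, edges, threshold=0.030)
```

In this toy example, raising the threshold merges more sequences into fewer, larger clusters, which is why the choice of threshold matters for the inferred network.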

Scores

Plot

Figure 2. Heuristic score by year. The x-axis shows the year and the y-axis shows the heuristic score, which is based on the number of clusters and the ratio of the size of the largest cluster to that of the second-largest cluster. Each point represents one subsample for a given year and is colored by the proportion of the original dataset that was randomly sampled.
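The caption names two quantities that feed the heuristic score but not how they are combined, so the sketch below only computes those two ingredients from a list of cluster sizes. The function name `score_ingredients` and the `min_size` cutoff (ignoring clusters smaller than two sequences) are assumptions for illustration. The actual scoring formula is not reproduced here.

```python
def score_ingredients(sizes, min_size=2):
    """Compute the two quantities the heuristic score is based on.

    sizes: cluster sizes at one candidate threshold.
    min_size: assumed cutoff; smaller clusters are ignored.
    Returns (number_of_clusters, largest_to_second_largest_ratio).
    The ratio is None when fewer than two clusters remain.
    """
    kept = sorted((s for s in sizes if s >= min_size), reverse=True)
    n_clusters = len(kept)
    if n_clusters < 2:
        return n_clusters, None
    return n_clusters, kept[0] / kept[1]


# Toy cluster sizes at one threshold: two real clusters and one singleton.
n_clusters, ratio = score_ingredients([3, 2, 1])
```

A ratio far above 1 would indicate that a single cluster dominates the network, while a value near 1 would indicate that the two largest clusters are of similar size.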

Table