Subsampling with AUTO-TUNE

These data come from Rhee and Shafer (1). The dataset consists of 6,034 complete HIV-1 pol gene sequences obtained from publicly available databases such as GenBank, the Los Alamos National Laboratory (LANL) HIV Sequence Database, and the HIV Drug Resistance Database. The sequences were annotated by country and year and classified into 11 pure subtypes and 70 circulating recombinant forms (CRFs) using established taxonomic criteria. To test the robustness of AUTO-TUNE, the researchers generated 10 random subsamples at each of three sampling levels: 25%, 50%, and 75% of the dataset.
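The subsampling design described above can be sketched as follows. This is a minimal illustration, not the researchers' actual script: the function name `make_subsamples`, the fixed seed, sampling without replacement, rounding the subsample size down, and the placeholder sequence IDs are all assumptions.

```python
import random

def make_subsamples(sequence_ids, fractions=(0.25, 0.50, 0.75), replicates=10, seed=0):
    """Return {fraction: [sampled ID list, ...]} with `replicates` draws per fraction."""
    rng = random.Random(seed)  # assumed fixed seed so the draws are reproducible
    subsamples = {}
    for fraction in fractions:
        size = int(len(sequence_ids) * fraction)  # assumed: round the size down
        # random.sample draws without replacement, so no ID repeats within a draw
        subsamples[fraction] = [rng.sample(sequence_ids, size) for _ in range(replicates)]
    return subsamples

# Placeholder IDs standing in for the 6,034 sequences (not real accession numbers)
ids = [f"seq{i}" for i in range(6034)]
subsamples = make_subsamples(ids)
```

Each of the three sampling levels then maps to a list of 10 independent draws, which can be passed one at a time to the network analysis.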

Network visualizations of each subsample run can be found here.

Threshold

Range

Scores

Range

This figure plots years on the x-axis against a heuristic score on the y-axis. The heuristic score is based on the number of clusters and the ratio of the size of the largest cluster to that of the second-largest cluster. Points are colored by the proportion of the original dataset that was randomly sampled. Each point shows the year and the corresponding heuristic score for one sample.
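The two quantities that feed the heuristic score can be computed as in the sketch below. It assumes a network in which two sequences are linked when their pairwise distance is at or below a threshold, and it counts only connected components with at least two nodes as clusters. The function names and the toy distances are hypothetical, and the way AUTO-TUNE combines the two quantities into a single score is not shown because the text above does not specify it.

```python
from collections import Counter

def cluster_sizes(edges, threshold):
    """Sizes (largest first) of components formed by edges with distance <= threshold."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving keeps trees shallow
            node = parent[node]
        return node

    for a, b, distance in edges:
        if distance <= threshold:
            parent[find(a)] = find(b)  # merge the two components

    # Only nodes with at least one qualifying edge are in `parent`,
    # so singletons never count as clusters.
    sizes = Counter(find(node) for node in list(parent))
    return sorted(sizes.values(), reverse=True)

def score_ingredients(edges, threshold):
    """Return (number of clusters, largest / second-largest size or None)."""
    sizes = cluster_sizes(edges, threshold)
    ratio = sizes[0] / sizes[1] if len(sizes) > 1 else None
    return len(sizes), ratio

# Toy pairwise distances (hypothetical values, not from the dataset)
edges = [("A", "B", 0.010), ("B", "C", 0.012), ("D", "E", 0.014),
         ("E", "F", 0.020), ("G", "H", 0.030)]
n_clusters, ratio = score_ingredients(edges, 0.015)
```

At a threshold of 0.015 the toy network has two clusters, {A, B, C} and {D, E}, so the ratio of largest to second-largest is 3 / 2.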

Summary Statistics

Detailed Scores

Plot

This figure plots years on the x-axis against a heuristic score on the y-axis. The heuristic score is based on the number of clusters and the ratio of the size of the largest cluster to that of the second-largest cluster. Points are colored by the proportion of the original dataset that was randomly sampled. Each point shows the year and the corresponding heuristic score for one sample.

Full Sampling Scores

Plot

This figure plots years on the x-axis against a heuristic score on the y-axis. The heuristic score is based on the number of clusters and the ratio of the size of the largest cluster to that of the second-largest cluster. Points are colored by the proportion of the original dataset that was randomly sampled. Each point shows the year and the corresponding heuristic score for one sample.

Table

Missing Nodes Summary

Plot

This plot shows the proportion of subsampled nodes that were clustered in both the original network and the subsampled network. The box plots indicate that, at each subsampling level, this proportion is higher at the optimized AUTO-TUNE threshold than at the 1.5% threshold. This trend suggests that the AUTO-TUNE scoring method may be more effective and reliable for maintaining network structure in subsampled datasets.
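One way to compute the proportion described above is sketched here. It takes the caption literally and uses all subsampled nodes as the denominator; restricting the denominator to subsampled nodes that were clustered in the original network is an equally plausible reading. The function name and the toy node sets are hypothetical.

```python
def proportion_clustered_in_both(subsample_ids, original_clustered, subsample_clustered):
    """Share of subsampled nodes that are clustered in both networks."""
    sampled = set(subsample_ids)
    if not sampled:
        return 0.0  # avoid dividing by zero for an empty subsample
    in_both = sampled & set(original_clustered) & set(subsample_clustered)
    return len(in_both) / len(sampled)

# Toy example (hypothetical node labels)
sampled_nodes = ["A", "B", "C", "D"]
clustered_in_original = {"A", "B", "C", "E"}   # clustered in the full network
clustered_in_subsample = {"A", "B"}            # clustered in the subsampled network
retained = proportion_clustered_in_both(
    sampled_nodes, clustered_in_original, clustered_in_subsample
)
```

Computing this value once per subsample at each threshold gives the points summarized by each box plot.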

Plot

This plot shows the proportion of nodes that were singletons in the original network but became clustered in the subsampled networks. Each box plot summarizes this proportion across 10 random iterations at the 25%, 50%, and 75% subsampling rates. AUTO-TUNE's adaptive thresholding appears to have minimal effect on the underlying network structure.
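The singleton-to-clustered proportion can be sketched in the same set-based style. The choice of denominator (subsampled nodes that were singletons in the original network) is an assumption, as are the function name and the toy node sets.

```python
def proportion_newly_clustered(subsample_ids, original_clustered, subsample_clustered):
    """Share of subsampled original singletons that are clustered in the subsample."""
    # A subsampled node counts as an original singleton if it was not clustered
    # in the full network.
    original_singletons = set(subsample_ids) - set(original_clustered)
    if not original_singletons:
        return 0.0  # nothing to measure if every sampled node was already clustered
    gained = original_singletons & set(subsample_clustered)
    return len(gained) / len(original_singletons)

# Toy example (hypothetical node labels)
sampled_ids = ["A", "B", "C", "D", "E"]
full_network_clustered = {"A", "B"}
subsample_network_clustered = {"A", "B", "C"}
newly_clustered = proportion_newly_clustered(
    sampled_ids, full_network_clustered, subsample_network_clustered
)
```

In the toy example, C, D, and E are original singletons and only C becomes clustered, so the proportion is one third.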

References

  1. Rhee SY, Shafer RW. Geographically-stratified HIV-1 group M pol subtype and circulating recombinant form sequences. Sci Data. 2018 Jul 31;5:180148. doi: 10.1038/sdata.2018.148. PMID: 30063225; PMCID: PMC6067049.