Choosing an appropriate distance threshold is an important part of inferring a transmission network to determine the relative growth of clusters within a localized epidemic. This distance threshold determines how close two consensus sequences must be in order for a link to be created between them in the network. Using a distance threshold that is too high can result in a network with many unnecessary links, making it difficult to interpret and analyze. On the other hand, using a distance threshold that is too low can result in a network with too few links, which may not capture key insights into rapidly growing clusters among patients with shared attributes that could benefit from public health intervention measures.
Here, we present a heuristic scoring approach for tuning a distance threshold. Each tested threshold is scored on two components: how its number of clusters compares with the maximal number of clusters created across all thresholds, and the change in the ratio of the largest cluster in the network to the second largest cluster at each iteration. The number of clusters is normalized and then gated via a Gompertz function transform. Meanwhile, the distribution of all ratios is converted to scores, which are normalized relative to the largest positive score across all candidate distances. The priority score is the sum of these two components.
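As a rough illustration of how the two components could be combined, here is a minimal Python sketch. It is not the HIV-TRACE implementation: the Gompertz parameters `b` and `c` are hypothetical values chosen for the example, the function names are invented, and the per-threshold ratio scores are taken as given inputs because their derivation is not spelled out here.

```python
import math

def gompertz(x, a=1.0, b=5.0, c=10.0):
    """Generic Gompertz curve a * exp(-b * exp(-c * x)).

    The values of b and c are illustrative assumptions only.
    """
    return a * math.exp(-b * math.exp(-c * x))

def priority_scores(cluster_counts, ratio_scores):
    """Sum the two components for each candidate threshold.

    cluster_counts: number of clusters at each candidate threshold.
    ratio_scores:   one score per candidate describing the change in the
                    largest-to-second-largest cluster ratio (assumed given).
    """
    max_count = max(cluster_counts)
    max_positive = max((s for s in ratio_scores if s > 0), default=0.0)
    combined = []
    for count, score in zip(cluster_counts, ratio_scores):
        # Component 1: cluster count relative to the maximum, Gompertz-gated.
        gated = gompertz(count / max_count)
        # Component 2: ratio score relative to the largest positive score.
        scaled = score / max_positive if max_positive > 0 else 0.0
        combined.append(gated + scaled)
    return combined

scores = priority_scores([2, 10, 4], [0.5, 2.0, -1.0])
best_index = scores.index(max(scores))
```

In this toy input the second candidate has both the most clusters and the largest ratio score, so it receives the highest combined score.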
Published research using the HIV-TRACE software package frequently uses the default threshold of 1.5% for HIV pol gene sequences. We apply our scoring heuristic to outbreaks with different characteristics, such as regional or temporal variability, and demonstrate the utility of using the scoring mechanism's suggested distance threshold to identify clusters exhibiting risk factors that would otherwise have been more difficult to identify, such as a transmission network transitioning from primarily injection drug use (IDU) transmission to transmission among men who have sex with men (MSM). Such identification may help public health officials choose the most appropriate interventions.
To use the algorithm, you must first run the tn93 fast pairwise distance calculator on a multiple sequence alignment. Once a pairwise distance file is created, use the hivnetworkcsv script with the -A argument to generate the tab-separated output compatible with this page.
An example workflow is as follows:
./tn93 -t 0.015 pol.fasta > pairwise_distances.15.tn93.csv
hivnetworkcsv -i pairwise_distances.15.tn93.csv -f plain -A 0 > autotune_report.tsv
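Before running the scoring step, it can be useful to check the pairwise distance file. The Python sketch below assumes the tn93 CSV has a header row with the columns ID1, ID2, and Distance. That column layout, the function name, and the sample rows are assumptions for illustration, so adjust them if your file differs.

```python
import csv
import io

def read_pairwise_distances(handle):
    """Yield (id1, id2, distance) tuples from a tn93-style CSV with a header."""
    reader = csv.DictReader(handle)
    for row in reader:
        yield row["ID1"], row["ID2"], float(row["Distance"])

# Made-up sample data standing in for pairwise_distances.15.tn93.csv.
sample = io.StringIO(
    "ID1,ID2,Distance\n"
    "seqA,seqB,0.0040\n"
    "seqA,seqC,0.0120\n"
    "seqB,seqC,0.0149\n"
)
pairs = list(read_pairwise_distances(sample))

# Links that would survive a stricter 1% threshold.
links_at_1pct = [p for p in pairs if p[2] <= 0.01]
```

To read a real file, pass `open("pairwise_distances.15.tn93.csv", newline="")` in place of the in-memory sample.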
The network threshold selection procedure proceeds as follows:
1. For each candidate threshold, in increasing order, ranging from the smallest genetic distance in the dataset up to either the largest distance or a predetermined maximal threshold, we compute two network statistics: the ratio of the largest cluster to the second largest cluster, and the number of clusters in the network.
2. A priority score is assigned to each candidate threshold. This score measures two properties of the threshold: Does the ratio jump at this threshold? How far is the number of clusters at this threshold from the maximal number of clusters over all threshold values? Let there be overall candidate values, and assume we are examining the ith candidate, with ( is a positive integer defined below).
3. The threshold with the highest priority score will be selected as the suggested automatic distance threshold, if the score is high enough ( or more), and either of the two conditions hold.
4. If no single threshold can be selected in step 3, then the one with the highest priority score is suggested, and inspecting a plot like the one on the analyze page is recommended to ensure that the threshold is sensible.
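The first part of the procedure, computing the two network statistics for every candidate threshold, can be sketched in Python as follows. This is an illustrative approximation rather than the hivnetworkcsv code. The function names are hypothetical, clusters are taken to be connected components of linked sequences, and the ratio is reported as `None` when fewer than two clusters exist (a choice made only for this sketch).

```python
from collections import Counter

def cluster_sizes(pairs, threshold):
    """Sizes of connected components using only links with distance <= threshold."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, dist in pairs:
        if dist <= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_a] = root_b
    sizes = Counter(find(node) for node in list(parent))
    return sorted(sizes.values(), reverse=True)

def sweep(pairs):
    """Return (threshold, ratio, n_clusters) for each candidate, ascending."""
    rows = []
    for d in sorted({dist for _, _, dist in pairs}):
        sizes = cluster_sizes(pairs, d)
        # The ratio is undefined with a single cluster; use None in that case.
        ratio = sizes[0] / sizes[1] if len(sizes) > 1 else None
        rows.append((d, ratio, len(sizes)))
    return rows

# Made-up distances between six sequences.
pairs = [("a", "b", 0.002), ("c", "d", 0.004), ("e", "f", 0.006),
         ("b", "c", 0.010), ("a", "e", 0.014)]
rows = sweep(pairs)
```

In this toy data the number of clusters peaks at 0.006, the ratio jumps from 1.0 to 2.0 at 0.010 when two pairs merge, and everything collapses into a single cluster at 0.014.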
Degree-weighted homophily (DWH) is a measure of similarity between nodes in a network based on their attributes (such as demographic characteristics or behaviors) and their degree (i.e., the number of connections they have to other nodes in the network). It is used to quantify the extent to which nodes with similar attributes tend to be connected to each other more frequently than would be expected by chance.
DWH is calculated as the ratio of the observed number of connections between nodes with similar attributes to the expected number of connections between such nodes, based on their degree.
In mathematical terms, it is defined as:
Where:
DWH ranges from -1 to 1. A DWH value of 0 indicates that there is no more homophily than expected by chance, while a value of 1 indicates perfect homophily (e.g. birds always link to birds, and only birds). A value of -1 is achieved for perfectly disassortative networks (e.g. a bird never links with another bird).
DWH is used in social network analysis to study how different attributes are related to the formation of connections between individuals, and to measure how similar the attributes of connected individuals in a network are.
Randomization is performed by shuffling attribute labels among nodes and then recomputing DWH. This creates a null distribution of DWH scores under random mixing. A panmictic range is reported by shuffling attributes multiple times and reporting the minimum and maximum scores.
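The shuffling procedure can be sketched in Python as below. To keep the example self-contained, it uses a simple stand-in statistic (the fraction of edges joining nodes with the same label) rather than DWH itself. The function names, the number of shuffles, and the seed are all assumptions made for illustration.

```python
import random

def same_label_fraction(edges, labels):
    """Stand-in statistic (not DWH): share of edges joining equal labels."""
    same = sum(1 for u, v in edges if labels[u] == labels[v])
    return same / len(edges)

def panmictic_range(edges, labels, statistic, n_shuffles=200, seed=0):
    """Minimum and maximum of `statistic` after shuffling labels among nodes."""
    rng = random.Random(seed)
    nodes = list(labels)
    values = [labels[n] for n in nodes]
    scores = []
    for _ in range(n_shuffles):
        rng.shuffle(values)                  # reassign attributes at random
        shuffled = dict(zip(nodes, values))  # the network itself is unchanged
        scores.append(statistic(edges, shuffled))
    return min(scores), max(scores)

# Made-up network: two chains, each with a single attribute value.
edges = [("a", "b"), ("b", "c"), ("d", "e"), ("e", "f")]
labels = {"a": "x", "b": "x", "c": "x", "d": "y", "e": "y", "f": "y"}

observed = same_label_fraction(edges, labels)
low, high = panmictic_range(edges, labels, same_label_fraction)
```

Comparing the observed value with the reported range gives a rough sense of whether the attribute mixing in the network differs from what random assignment would produce.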
Please see Benjamin Golub, Matthew O. Jackson, How Homophily Affects the Speed of Learning and Best-Response Dynamics, The Quarterly Journal of Economics, Volume 127, Issue 3, August 2012, Pages 1287–1338 for more information.
With your FASTA file, execute a command similar to the following. A full list of the arguments available in HIV-TRACE is provided in the HIV-TRACE documentation.
hivtrace -i ./INPUT.FASTA -a resolve -r HXB2_prrt -t .015 -m 500 -g .05 > hivtrace.results.json
Use the hivnetworkannotate script to annotate your results from HIV-TRACE with attributes. The script should already be installed on your machine if you have installed hivtrace.
An example command is:
hivnetworkannotate -n hivtrace.results.json -a node_attributes.json -g schema.json -r
Please see the hivnetworkannotate documentation for more information.
Once the results file has been annotated, please use the assortativity page for analysis.