Input Format and Interpretation
- Should I adjust the allele frequencies based on tumor cell content before I run EXPANDS?
Adjusting the allele frequencies to account for tumor cell content is not necessary. Tumor cell content is part of the output of EXPANDS: the size of the largest subpopulation is a measure of tumor purity. The size of every subpopulation is calculated relative to the entire sequenced sample, not relative to the tumor cell population. To obtain subpopulation sizes relative to the tumor cell population, divide the size of each subpopulation by the size of the largest subpopulation.
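The rescaling described above is a single division per subpopulation. The following is a minimal sketch in Python, not part of EXPANDS itself; the function name `relative_to_tumor` and the example sizes are hypothetical and serve only as illustration.

```python
def relative_to_tumor(sizes):
    """Rescale subpopulation sizes from 'fraction of the sequenced sample'
    to 'fraction of the tumor cell population'.

    `sizes` is a list of subpopulation sizes as reported relative to the
    entire sample. The largest one is taken as the tumor purity estimate.
    """
    if not sizes:
        raise ValueError("need at least one subpopulation size")
    purity = max(sizes)  # size of the largest subpopulation
    if purity <= 0:
        raise ValueError("subpopulation sizes must be positive")
    return purity, [s / purity for s in sizes]


# Hypothetical example: three subpopulations in a sample.
purity, relative_sizes = relative_to_tumor([0.8, 0.4, 0.2])
```

In this made-up example the largest subpopulation (0.8) is the purity estimate, and after rescaling it has a relative size of 1 by construction.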
- Do copy numbers used as EXPANDS input have to be natural numbers or can they be rational positive numbers?
The more precise the estimates of copy numbers, the better. Using rational positive estimates of copy numbers is recommended as this allows for a better resolution on subclones that carry copy number variations.
- Does it make sense to include mutations within non-coding regions as input to EXPANDS?
Yes, in general it does make sense to include non-coding mutations. Even though they are less likely to provide selective advantages and cause clonal expansions, they can act as markers (“footprints”) of clonal expansions and can be very useful for identifying subpopulations. However, in very highly mutated cancers, non-coding regions tend to be noisier. That’s because in these tumors SNVs are more likely to accumulate during the early phases of a given clonal expansion into a subpopulation. In this case, when working with whole genome sequencing data, it may be better to trim the SNV input to SNVs within or near coding regions. Parameters ‘region’ and ‘maxN’ of the runExPANdS function provide this functionality. Further information on input data format and requirements is available here.
Errors/Warnings Encountered During Run
- What should I do if EXPANDS fails to find cell-frequency distributions for a high % of my SNVs?
Typically, about 1-10% of SNVs fail during this step. A common reason for a failure rate greater than 10% is a wrong input format for copy numbers. Absolute, positive rational numbers are expected as copy number input rather than log-ratios.
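If the copy number input is only available as log-ratios, it has to be converted to absolute copy numbers first. The sketch below shows one simple conversion in Python. It assumes that the values are log2 ratios of tumor against a diploid reference, and it ignores any correction for tumor purity or overall tumor ploidy. The function name and the `baseline_ploidy` parameter are hypothetical and not part of EXPANDS.

```python
def log2_ratio_to_copy_number(log2_ratio, baseline_ploidy=2.0):
    """Convert a log2 copy-number ratio to an absolute copy number.

    Assumption: the ratio is tumor/reference and the reference carries
    `baseline_ploidy` copies (2 for autosomes in a diploid genome).
    No purity or ploidy correction is applied.
    """
    return baseline_ploidy * 2.0 ** log2_ratio


# A log2 ratio of 0 means no change relative to the reference,
# +1 a doubling, and -1 a halving of the copy number.
neutral = log2_ratio_to_copy_number(0.0)
gain = log2_ratio_to_copy_number(1.0)
loss = log2_ratio_to_copy_number(-1.0)
```

The result is a positive rational number, which is the kind of value expected as copy number input.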
- Sometimes despite a successful run, I observe a series of error messages displayed after the cell-frequency distribution calculation step. Errors are displayed following the message: “Failed to find cell-frequency distribution for 21 SNVs. Causes:…”. Are these of concern for the reliability of the end result?
Typically, most of these error messages can be traced back to unsuccessful Gaussian mixture modeling of cellular frequencies for individual mutations (that is, no cellular frequency could be found that explains the observed copy number and allele frequency under the given constraints). Those mutations are simply excluded from further processing. So the errors are no reason for concern, unless they apply to a high fraction of SNVs (see the previous question on high failure rates).
- How long does EXPANDS take to run?
The runtime depends mainly on the number of SNVs used as input. The table below lists runtimes recorded for various input complexities and server/laptop configurations. The most time-consuming step when working with many SNVs (>1,000) is the clustering of cell-frequency distributions. This step may take several hours without any user feedback. For all other steps, feedback on the progress of the run is frequent, and its absence indicates that the program hangs (see the next question).
- What should I do if the program hangs?
If the program hangs during clustering:
Decrease the number of SNVs used as input, for example by using only SNVs within or in the vicinity of coding regions (parameters ‘maxN’, ‘region’ of function runExPANdS) or by excluding germline SNVs within LOH regions (rows for which column ‘PN_B’>0).
If the program hangs during the computation of cell-frequency distributions:
Two alternative options fix this problem. Option 1: exclude SNVs with low allele frequencies (<=0.05 or <=0.1). Option 2: decrease parameter ‘min_CellFreq’ of the runExPANdS function (e.g. to 0.01 or 0.05). The problem occurs because the algorithm tries to find solutions for these low-abundance SNVs, where no good solution exists within the accepted cell-frequency range.
Originally, the method was designed to work with low to moderate sequencing depth (30 to 100X). SNVs detected at low allele frequencies have a high false-positive rate in this context, so when working with such data it is advisable to choose option 1. For DNA isolated from samples of low tumor purity, a high % of somatic SNVs are expected to be present at low allele frequencies. Option 2, along with a higher sequencing depth, is the more appropriate solution for such low tumor purity cases.
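Both input-trimming remedies (excluding germline SNVs within LOH regions, and option 1) are row filters applied before the run. The following Python sketch illustrates them on a list of records. It is not EXPANDS code: the function `filter_snvs` is hypothetical, the allele-frequency column name ‘AF_Tumor’ is an assumption about the input matrix, and the example rows are made up.

```python
def filter_snvs(snvs, min_af=0.1, drop_germline_loh=True):
    """Return the SNV rows that pass both filters.

    snvs: list of dicts with keys 'AF_Tumor' (allele frequency, assumed
          column name) and 'PN_B' (>0 marks a germline SNV within an
          LOH region).
    min_af: SNVs with allele frequency <= min_af are excluded (option 1).
    drop_germline_loh: if True, rows with PN_B > 0 are excluded.
    """
    kept = []
    for snv in snvs:
        if snv["AF_Tumor"] <= min_af:
            continue  # low allele frequency, likely false positive
        if drop_germline_loh and snv["PN_B"] > 0:
            continue  # germline SNV within an LOH region
        kept.append(snv)
    return kept


# Made-up rows for illustration only.
example_snvs = [
    {"AF_Tumor": 0.45, "PN_B": 0},
    {"AF_Tumor": 0.04, "PN_B": 0},
    {"AF_Tumor": 0.50, "PN_B": 1},
    {"AF_Tumor": 0.10, "PN_B": 0},
]
strict = filter_snvs(example_snvs, min_af=0.1)
lenient = filter_snvs(example_snvs, min_af=0.05, drop_germline_loh=False)
```

Keeping the two filters as separate switches makes it possible to apply only the one that matches the step at which the program hangs.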
Output Format and Interpretation
- How can I determine tumor purity from EXPANDS output?
The tumor purity estimate for a given sample is the size of the largest subpopulation identified in that sample. It can be found in column ’SP’ of the .sps output. The tumor purity estimated by EXPANDS is calculated under the assumption that the analyzed tumor has a monoclonal origin. Otherwise, if the tumor has a polyclonal origin, its purity will be underestimated by EXPANDS.
- Is each SNV unique to the subpopulation to which it has been assigned, or can an SNV in a more dominant subpopulation also be present in one or more additional, less dominant subpopulations?
SNVs assigned to a particular subpopulation are not necessarily exclusive to that subpopulation. The assignment between subpopulation and SNV only means that the SNV was first propagated during the clonal expansion that gave rise to the subpopulation. So yes, SNVs present in dominant subpopulations may also be present in less abundant subpopulations. In particular, assuming a monoclonal tumor origin, the SNVs present in the largest subpopulation should also be present in all other subpopulations. EXPANDS versions >= 1.6 also predict the phylogeny of the tumor based on subpopulation-specific copy numbers and assign SNVs to multiple subpopulations according to the predicted phylogeny. This functionality hasn’t been validated yet, but it’s available.
- After cell-frequency distribution clustering and SNV assignment to clusters, the best cell-frequency solution of some SNVs (column ‘f’) changes from NA to a specific value between 0 and 1. Is this expected?
Yes. With default settings, only SNVs that fulfill certain constraints are used during clustering, that is, when the number and size of co-existing subpopulations are estimated. For example, max_PM is set to 6, i.e. SNVs are excluded from the clustering step if they cannot be explained by a total ploidy of mutated cells below or equal to 6. After subpopulations are identified, these SNVs are included back into further processing; this is why the values in column ‘f’ may change for some SNVs. The clustering step is critical for the accuracy of the algorithm: if the predicted number and size of subpopulations are not correct, then everything else will be wrong as well. In contrast, a wrong assignment of certain SNVs to subpopulations does not affect accuracy upstream. If your SNVs are called from deep sequencing data (i.e. allele frequencies and copy numbers are precise) and you want more SNVs to be included during clustering, change ‘max_PM’ to a higher value.