Signatures of copy number alterations in human cancer

Utilized datasets

Using SNP6 microarray data, copy number profiles were generated for 9,873 cancers and matching germline DNA of 33 different types from TCGA6 using allele-specific copy number analysis of tumours (ASCAT)56 with a segmentation penalty of 70 (Supplementary Table 1). In addition, a set of whole-genome sequences from 512 cancers of the International Cancer Genome Consortium that overlapped with tumour profiles in TCGA were analysed33 to generate WGS-derived copy number profiles (see below). Last, a set of whole-exome sequences from 282 cancers from TCGA was analysed to generate exome-derived copy number profiles (see below).

Copy number profile summarization

Copy number segments were classified into three heterozygosity states: heterozygous segments with copy number of (A > 0, B > 0) (numbers reflect the counts for major allele A and minor allele B); segments with LOH with copy number of (A > 0, B = 0); and segments with homozygous deletions (A = 0, B = 0). Segments were further subclassified into five classes on the basis of the sum of major and minor alleles (TCN; Extended Data Fig. 1e) and were chosen for biological relevance as follows: TCN = 0 (homozygous deletion); TCN = 1 (deletion leading to LOH); TCN = 2 (wild type, including copy-neutral LOH); TCN = 3 or 4 (minor gain); TCN = 5–8 (moderate gain); and TCN ≥ 9 (high-level amplification). Each of the heterozygous and LOH TCN states were then subclassified into five classes on basis of the size of their segments: 0–100 kb, 100 kb–1 Mb, 1 Mb–10 Mb, 10 Mb–40 Mb and >40 Mb (the largest category for homozygous deletions was restricted to >1 Mb). This subclassification was used to capture focal, large-scale and chromosomal-scale copy number changes. In this way, copy number profiles were summarized as counts of 48 combined copy number categories defined by heterozygosity, copy number and size, which we defined as N = (n1,n2,…,n48). For a given dataset, the copy number profiles of a set with S samples were then summarized as a nonnegative matrix with S × 48 dimensions. The segment sizes were selected to ensure that a sufficient proportion of segments were classified in each category, which resulted in a reasonable representation across the pan-cancer TCGA dataset (Extended Data Fig. 1f–h). Two examples, representing a mostly diploid adrenocortical carcinoma (Extended Data Fig. 1i, j) and a copy number aberrant bladder cancer (Extended Data Fig. 1k–l), are provided to illustrate how the segments from a copy number profile are summarized by our framework into a vector of mutually exclusive and exhaustive quantitative features.

Deciphering signatures of copy number alterations

Copy number signatures were extracted by applying our previously developed approach for creating a reference set of signatures10. Specifically, SigProfilerExtractor (v.1.0.17)21 was applied to the matrix encompassing all TCGA samples, and separately to each matrix corresponding to an individual tumour type. In brief, SigProfilerExtractor utilizes nonnegative matrix factorization (NMF) to find a set of copy number signatures ranging from 1 to 25 components for each examined matrix. For each number of components, 250 NMF replicates with distinct initializations of the lower dimension matrices were performed on the Poisson resampled data. SigProfilerExtractor was used with default parameters, except for the initializations of the lower dimension matrices, for which random initialization was utilized consistent with our prior analyses of mutational signatures10,11. After performing 250 NMFs, SigProfilerExtractor clusters the factorization within each decomposition to automatically identify the optimum number of operative signatures that best explain the data without overfitting these data21.

As previously done10, the sets of all identified copy number signatures were combined into a reference set of pan-cancer copy number signatures by leveraging hierarchical clustering based on the cosine dissimilarities between each signature. The number of combined signatures is chosen to maximize the minimum average cosine similarity between each signature in a cluster and the mean of all samples in that cluster to ensure that each copy number signature in a cluster has a high similarity to the combined copy number signature for that cluster. Simultaneously, the maximum cosine similarity between mean copy number signatures for each cluster is minimized to ensure that each combined signature is distinct from all others. To avoid reference signatures being linear combinations of two or more other signatures, for each identified signature, a synthetic sample was created with the pattern of the signature multiplied by 1,000 copy number segments. Furthermore, the synthetic sample was resampled with probabilities proportional to the strength of each copy number category in each identified signature. Each resampling was then scanned for activity of all other signatures from the reference set. If a resampled sample can be reconstituted with a cosine similarity >0.95 by 3 or fewer other signatures, the signature used to create the synthetic sample was deemed to be a linear combination of those signatures, and the signature was removed from the global reference set of signatures.

Reference set of copy number signatures

Initially, 28 pan-cancer copy number signatures were derived from the different SigProfilerExtractor analyses of the 9,873 copy number profiles from SNP microarrays. In silico evaluation and manual curation showed that ten copy number signatures were linear combinations of two or more other signatures. Additionally, three signatures were deemed to be artefactual owing to oversegmentation of copy number profiles. These artefactual signatures were removed from further analyses, as were samples with any attribution of any of these artefactual signatures (116 samples; 1.2% of all TCGA samples). Moreover, samples with >25 Mb of homozygous deletions across the genome were removed from downstream analyses (58 samples), leaving 9,699 samples for full analysis. Following signature assignment (see below), three of the signatures that were removed owing to linear combination were re-extracted within tumour-type-specific assignment (cosine similarity = 1), which indicates that some copy number profiles could not be explained well without these three signatures. As a result, these 3 signatures were reintroduced into the compendium of signatures, leaving a total of 19 signatures. Last, it was observed that a number of samples with high amounts of LOH were poorly explained by the 19 signatures. To remedy this, signatures were extracted from all samples with a proportion of the genome LOH > 0.7. This extraction identified 3 new signatures that were incorporated into the reference set of signatures, giving 22 signatures. One of the newly identified LOH signatures was able to reconstitute 1 of the previous 19 signatures as a linear combination with another signature; therefore the linear combination LOH signature was removed from the reference set, leaving 21 non-artefactual pan-cancer signatures of copy number alteration.

CN1–CN3 form a group of ploidy-associated signatures. CN1 and CN2 display TCNs between 2 and 3–4 respectively, with predominantly >40 Mb heterozygous segments. CN3 consists of predominantly heterozygous segments of TCNs 5–8 with sizes >1 Mb.

CN4–CN8 form a group of amplicon-associated signatures that all have segment sizes predominantly between 100 kb and 10 Mb but with differing TCN or LOH states. CN4 consists of a mixture of LOH segments with a TCN of 1 and heterozygous segments with TCNs 3–4. CN5 consists almost entirely of LOH segments with a TCN of 2. CN6 consists of a mixture of LOH segments with a TCN of 2 and heterozygous segments with TCNs 3–4. CN7 consists of a mixture of heterozygous segments with TCNs of 3–4, 5–8 and 9+. CN8 consists of predominantly heterozygous segments with TCNs of 9+.

CN9–CN12 form a group of signatures with considerable LOH components. CN9 consists of a mixture of LOH segments with a TCN of 2 and heterozygous segments with a TCN of 2, each ranging from 100 kb to 40 Mb, which is suggestive of structural CIN. CN10 consists of a mixture of LOH segments with TCNs 2 and 3–4 and heterozygous segments with TCNs 3–4 between 100 kb and 40 Mb. CN11 consists of a mixture of LOH segments with TCNs 3–4 and heterozygous segments with TCNs 5–8, each at predominantly 1–10 Mb. CN12 consists of mostly LOH segments of a TCN of 2 with sizes >100 kb and additional heterozygous segments of TCNs 3–4 with sizes between 10 and 40 Mb.

CN13–CN16 form a group of signatures with whole-arm-scale or whole-chromosome-scale LOH events, a form of numerical CIN. CN13 is predominantly LOH TCN 1 segments, CN14 is LOH TCN 2 and CN15 is LOH TCN 3–4. CN16 consists of LOH segments with TCNs of 3–4 and heterozygous segments with TCNs of 5–8, each at >40 Mb.

CN17 has been associated with the tandem duplicator phenotype (Fig. 4). This signature consists of LOH segments of TCNs 2 and 3–4 and heterozygous segments of TCNs 3–4 and 5–8, each with segment sizes of 1–40 Mb.

CN18–CN21 originate from unknown processes and are diverse in their copy number patterns. CN18 consists of predominantly heterozygous segments of TCNs 4–8 at >1 Mb, but with appreciable contributions of LOH segments with TCNs 3–4 at >1 Mb and heterozygous segments with TCNs 9+ at >100 kb. CN19 consists of segments between 100 kb and 40 Mb that are heterozygous with TCNs 3–4 or less commonly LOH with a TCN of 1 or 2. CN20 consists of predominantly heterozygous segments with TCNs 3–4 at 100 kb–40 Mb with some heterozygous segments of TCNs 3–4 at 100 kb–10 Mb. CN21 consists of heterozygous segments with a TCN of 2 at >1 Mb and many heterozygous segments with TCNs 3–4 at 100 kb–1 Mb.

Assignment of copy number signatures to individual cancer samples

The global reference set of copy number signatures was used to assign an activity for each signature to each of the 9,873 examined samples using the decomposition module of SigProfilerExtractor21. For the assignment, the information of the de novo signature and their activities assigned to each sample were used to implement the decomposition module with default parameters, except for the NNLS addition penalty (nnls_add_penalty), which was set to 0.1, the NNLS removal penalty (nnls_remove_penalty), which was set to 0.01, and the initial removal penalty (initial_remove_penalty), which was set to 0.05. Signatures were assigned to samples in both tumour-specific evaluations and in a pan-cancer evaluation. As previously done10, the signature attributions from either tumour-specific or pan-cancer evaluations that gave the best cosine similarity between the input sample vector and the reconstructed sample vector were used as the attributions for that sample in all subsequent analyses.

Copy number signatures derived from WGS and WES data

A set of samples from TCGA with both SNP array and exome sequencing data were selected (n = 282). Copy number profiles were generated from the exome sequencing data using ASCAT across all of the dbSNP common SNP positions with a segmentation penalty ranging from 20 to 140. Signatures were re-extracted for these 282 samples from both the SNP-array-derived copy number profiles and the exome-derived copy number profiles, and the resulting signatures were compared.

For WGS data, we examined 512 whole-genome sequenced samples from the PCAWG project overlapping with TCGA samples with microarray data. Copy number profiles from WGS data were generated using ASCAT across the SNP6 positions, with a segmentation penalty ranging from 20 to 120. Signatures were extracted for samples with both SNP6-microarray-derived copy number profiles and the WGS-derived copy number profiles, and the extracted signatures were compared. In all cases, a segmentation penalty of 70 gave the best concordance for both copy number profiles and extracted copy number signatures based on SNP6 microarray, WGS and WES data.

Copy number signatures derived from different copy number callers

A set of 3,175 allele-specific copy number profiles called using the ABSOLUTE57 algorithm were obtained. Copy number signatures were extracted from the 3,175 ABSOLUTE profiles, as well as re-extracted for the 3,175 corresponding ASCAT profiles. Signatures were compared using cosine similarity with between 2 and 12 signatures extracted, and with the sigProfiler suggested solution of 4 signatures extracted.

Mapping copy number signatures to the landscapes of cancer genomes

See Supplementary Methods for details of mapping copy number signatures back onto the reference genome.

For all mapping analyses, P values were adjusted for multiple testing as appropriate for Monte Carlo testing58.

Associations between copy number signatures and events defined by genomic region

Localized events (chromothripsis33 and amplicon structure30) identified using WGS data were associated with mapped copy number signatures from TCGA for all available matching samples (chromothripsis n = 657; amplicon n = 1,703). Each segment in every sample was categorized as overlapping or non-overlapping of a localized event. For each copy number signature, the association was then tested using two-sided Fisher’s exact test on a contingency table of segments categorized as overlapping or non-overlapping of a localized event and assigned to or not assigned to the given copy number signature across all samples. Multiple-testing correction was performed using the Benjamini–Hochberg method.

Genome-doubled copy number signatures

With the copy number categories being defined as 0, 1, 2, 3–4, 5–8 and 9+, it is possible to artificially ‘genome double’ any copy number category, other than 0, by assigning it to the next highest copy number category. In this way, we artificially ‘genome doubled’ each signature by assigning the count for each copy number class to its next highest copy number class. First, the copy number 1 class is assigned a count of 0, then each copy number class is assigned the count of the preceding copy number class. For example, copy number class of 2 is assigned to the previous copy number class of 1, 3–4 assigned previous 2, and so on, until finally the copy number 9+ class is assigned a count that is the sum of the previous copy number 5–8 class and 9+ class. During this conversion, LOH and size categories were retained so that the only shift is in copy number. Having performed this conversion, cosine similarities between the artificially genome-doubled signatures and the original signatures were calculated. Any genome-doubled and original signature pair that had a cosine similarity of >0.85 was considered to contain a pair of signatures with analogous copy number patterns distinguished only by their genome-doubling status.

Associations between copy number signatures and ploidy

Ploidy for each copy number profile was calculated as the relative length weighted sum of TCN across a sample. The proportions of the genome that displayed LOH (pLOH) were also calculated. Samples with a ploidy above −3/2 × pLOH + 3, meaning an LOH-adjusted ploidy of 3 or greater, were deemed to be genome-doubled samples. By contrast, samples with a ploidy above −5/2 × pLOH + 5, meaning an LOH-adjusted ploidy of 5 or greater, were deemed to be twice genome-doubled samples. All other samples were considered as non-genome-doubled samples. Each signature (CN1–CN21) was associated with each genome doubling category (GD×0, GD×1 and GD×2) using one-sided Fisher’s exact test on a contingency table with samples categorized by whether the samples have >0.05 attribution to the given copy number signature or not, and whether the sample has the given genome doubled category or not. All P values were corrected for multiple hypothesis testing using the Benjamini–Hochberg method.

Associations between copy number signatures and known cancer risk factors

Associations between attributions of copy number signatures and attributions of SBSs, IDs and doublet-base signature exposures10 were performed using Kendall’s rank correlation. Only the significant associations found in both cancer-type-specific and pan-cancer analysis are reported. For the cancer risk association analyses, copy number signatures were associated with sex59, tobacco smoking60 and alcohol drinking status61. For each copy number signature, the association was conducted using two-sided Fisher’s exact test on a contingency table of a clinical feature categorized as present or absent and assigned to or not assigned to the given copy number signature across all samples. All P values were corrected for multiple hypothesis testing using the Benjamini–Hochberg method.

Associations between copy number signature attribution (binarized to present or absent) and the TDP (also binarized to present or absent)29 were performed using two-sided Fisher’s exact test (n = 882). This was performed for each copy number signature separately. All P values were corrected for multiple hypothesis testing using the Benjamini–Hochberg method, and only associations with q < 0.05 are reported.

Associations between copy number signature attribution (binarized to present or absent) and driver-gene single nucleotide variant (SNV) and ID mutation status40 were performed within tumour types using two-sided Fisher’s exact test (n = 6,543 across all cancer types). This was performed for all copy number signature/gene combinations for which the gene was mutated in the given cancer type and the copy number signature was observed in the given cancer type. All P values were corrected for multiple hypothesis testing using the Benjamini–Hochberg method, and only associations with both q < 0.05 and |log2(OR)|>1 are reported.

Driver copy number alterations of COSMIC cancer gene census genes62 were defined as follows: (1) homozygous deletion (CN = (0, 0)) of genes listed as deleted (D) in COSMIC mutation types; or (2) amplification (CN > 2 × ploidy + 1) of genes listed as amplified (A) in COSMIC mutation types. Associations were then performed on copy number driver alterations for SNV and ID driver gene alterations as outlined above (n = 9,699 across all cancer types).

The diversity of copy number signatures, as defined by Shannon’s diversity index, was associated with both SNV and ID and copy number driver gene mutations using a logistic regression model with binary diversity (>0, =0) as the dependent variable, and tumour type and gene mutation status as independent variables. LGG was taken as the reference tumour type. Only driver genes with >250 mutant samples in the dataset were included in the model.

Associations between copy number signature attribution (binarized to present or absent) and age at diagnosis (binarized to above or below median separately for each cancer type) were performed within cancer types using two-sided Fisher’s exact test (n = 8,841 across all cancer types). All P values were corrected for multiple hypothesis testing using the Benjamini–Hochberg method, and only associations with both q < 0.05 and |log2(OR)|>1 are reported.

Leukocyte counts were obtained from TCGA50. The leukocyte fraction was associated with copy number signatures using a logistic regression model with binarized leukocyte fraction (fraction > or ≤ median fraction) as the dependent variable, and binarized copy number signature attribution (0, >0 attribution) and ASCAT estimated tumour purity as independent variables. All P values were corrected for multiple hypothesis testing using the Benjamini–Hochberg method.

Copy number signatures and defective HR

Signatures were tested for enrichment in tumour types using one-sided Mann–Whitney tests of signature attribution in a given tumour type versus all other tumour types. This was performed for all signature and tumour combinations. All P values were corrected for multiple hypothesis testing using the Benjamini–Hochberg method.

The following core HR repair pathway member genes were chosen for interrogation: BRCA1, BRCA2, RAD51C and PALB2 (refs. 63,64). Copy number alterations across these genes were identified based on ASCAT copy number profiles for homozygous deletions (that is, CN = (0, 0)) and LOH (that is, CN = (>0, 0)). Somatic SNVs and IDs were taken from ref. 40. Pathogenic germline variants in BRCA1 and BRCA2 were taken from ref. 65. Samples were deemed as bi-allelically mutated for the HR pathway if homozygously deleted or if more than one of any of the other classes of alteration were present within any of the HR pathway genes. Mono-allelic loss was defined as one of any of the non-homozygously deleted alterations within any of the HR pathway genes. Wild type was defined as no alterations in any HR pathway genes. The associations between HR pathway status and CN17 were then restricted to only breast (n = 589), ovarian (n = 309) and pan-cancer (n = 4,919). Two-sided Fisher’s exact tests were performed between wild-type and mono-allelic samples, between wild-type and bi-allelic samples, and between mono-allelic and bi-allelic HR pathway status samples. All P values were corrected for multiple hypothesis testing using the Benjamini–Hochberg method.

A further multivariate logistic regression model was utilized with CN17 attribution (>0 or 0) as the dependent variable, and BRCA1, BRCA2, RAD51C, PALB2, FBXW7, CDK12 mutational status, categorized as wild type, mono-allelic or bi-allelic as previously described, as independent variables, to test associations between the mutation status of individual HR pathway genes and CN17.

Orthologous scores of HRD were calculated using scarHRD61. Associations between scarHRD scores and CN17 were tested using two-sided Fisher’s exact tests, with CN17 categorized as present or absent, and scarHRD scores categorized as positive or negative around thresholds of both 42 (which has been described as an adequate threshold in breast cancer61) and 63 (which has been described as an adequate threshold in ovarian cancer66). Furthermore, we associated the presence or absence of CN17 with continuous scarHRD scores using two-sided Mann–Whitney test.

To test associations between promoter hypermethylation of the HR machinery and CN17, TCGA methylation β values were downloaded from https://portal.gdc.cancer.gov/ and TCGA-normalized gene expression RSEM values were downloaded from https://gdac.broadinstitute.org/

Relationships between log10(RSEM) values and mean TSS200 and TSS1500 associated methylation probe β values were initially inspected in breast cancer to determine a threshold mean β value for determining promoter hypermethylation and subsequent epigenetic silencing of BRCA1. This threshold was set at mean β > 0.7.

CN17 attribution was associated between BRCA1 promoter hypermethylated breast cancer samples and both genomic BRCA1 wild-type and bi-allelically mutated breast cancer samples using two-sided Mann–Whitney test. This analysis was extended to a pan-cancer association, performing two-sided Fisher’s exact tests between signature attribution or not, and promoter hypermethylation (mean TSS200 and TSS1500 β > 0.7) or hypomethylation (mean TSS200 and TSS1500 β ≤ 0.7). P values were corrected for multiple testing using the Benjamini–Hochberg method.

Copy number signatures associated with hypoxia

Gene-expression-derived scores of hypoxia from 8,006 TCGA tumours were used49,67. A linear regression with hypoxia score as the dependent variable, and binarized copy number signature attributions (>0, =0) as well as tumour type as independent variables.

Copy number signatures associated with complex rearrangements

Assignment of rearrangement phenomena to PCAWG samples were used31. Associations of each re-arrangement phenomenon with each copy number signature were evaluated using two-sided Fisher’s exact tests of copy number signature non-attributed or attributed (=0, >0) against rearrangement phenomenon presence or absence. P values were corrected for multiple testing using the Benjamini–Hochberg method.

Copy number signatures associated with HPV in HNSC

We used HPV testing status from TCGA HNSCs obtained from ref. 68. HPV status was associated with copy number signature attribution using two-sided Fisher’s test. P values were corrected for multiple testing using the Benjamini–Hochberg method. Furthermore, hypoxia scores (see above) were associated with HPV status using two-sided Mann–Whitney test.

Copy number signature associated with ethnicity

Ethnicity information for 11,160 individuals from TCGA was taken from the TCGA Clinical Data Resource59. Copy number signatures (binarized to present/absent) were associated between Black/White ethnicity and between Asian/White ethnicity separately using two-sided Fisher’s exact tests. P values were corrected for multiple testing using the Benjamini–Hochberg method.

Copy number signatures associated with changes of overall survival

Survival data for 11,160 individuals from TCGA were obtained from the TCGA Clinical Data Resource59. Univariate disease-specific survival analysis for signatures was performed using a log-rank test and Kaplan–Meier curves in R, with groups being unattributed (attribution = 0) and attributed (attribution > 0) for each signature separately, or for summed attributions of a set of signatures (for example, amplicon signatures).

Multivariate disease-specific survival analysis was performed using the Cox’s proportional hazards model in R with Boolean attributed/non-attributed variables for each copy number signature and tumour type as covariates. To account for potential violations of Cox’s model’s proportional hazards assumption, we also conducted the same analysis using the accelerated failure time model with the Weibull distribution using the flexsurvreg function in R. All P values were corrected for multiple hypothesis testing using the Benjamini–Hochberg method.

Simulating copy number profiles

See Supplementary Methods for details of the methods used to simulate copy number profiles from various processes.

Single-cell isolation, FACS analysis and DNA library generation for USARC ploidy estimation

Fresh frozen tumour tissue was thawed on ice, dissected and homogenized with 500 µl of lysis buffer (NUC201-1KT, Sigma). Following the release of single nuclei, samples were centrifuged, and the resulting precipitate removed. A 10 µl sample was taken to count and evaluate the extracted nuclei. The lysate was cleaned using a sucrose gradient following the manufacturer’s instructions (NUC201-1KT, Sigma). After cleaning, the nuclei were centrifuged at 800g for 5–10 min at 4 °C and resuspended in PBS, supplemented with 140 µg ml–1 RNase (19101 Qiagen) and stained with 1 µg ml–1 DAPI (Sigma-Aldrich), and 2.5 µg ml–1 Ki-67 antibody (BioLegend) per 1 million cells in 100 µl. Stained nuclei were analysed using a FACS Aria Fusion cell sorter (BD bioscience) and FACS DIVA software (v.8.0.1). Cells were sorted using a 130-μm nozzle with 12 psi set for sheath pressure. Each gated population of interest was collected into a separate 1.5-ml tube, and a custom sort precision of 0-16-0 (Yield-Purity-Phase) was used. For cells collected into plates, the sort precision used was Purity, defined as 32-32-0 (Yield-Purity-Phase). DAPI was measured using a 355-nm UV laser with a 450/50 bandpass filter. Ki-67 was measured using a 635-nm red laser with a 670/30 bandpass filter. Forward scatter and side scatter were both measured from a 488-nm blue laser on a linear scale. DAPI was also measured on a linear scale and was used to estimate the DNA content per single cell. A control diploid cell line was used to establish accurate ploidy measurements before sorting. Forward versus side scatter area was used to exclude debris, whereas the height versus area of the DAPI fluorescence was used to exclude doublets. FACS analysis revealed the presence of three major aberrant cell populations (Supplementary Methods), including a haploid population (1n), a nearly diploid population (2n, Ki-67 positive) and a WGD population (3n+). A non-proliferating, non-aberrant, normal cell population was also identified (2n, Ki-67 negative).

Once sorted, single nuclei suspensions were processed using a Chromium Single Cell DNA Library & Gel Bead kit (10X Genomics, PN-1000040) according to the manufacturer’s instructions, with a target capture of 1,000–2,000 cells. The resulting barcoded single-cell DNA libraries were sequenced with an Illumina HiSeq 4000 system using 150 bp paired-end sequencing with a coverage ranging from 0.01 to 0.08 X per cell. Germline bulk WGS was also performed on a XTen instrument (Illumina) as previously described16. Copy number signatures were also evaluated in single cells harbouring chromothripsis, as well as WGD events using sequencing data that had already been generated from a cell-based model system linking chromothripsis and hyperploidy69.

Single-cell allele-specific copy number alteration calling using ASCAT.sc

USARC single-cell paired-end reads generated using the chromium single cell CNV platform were processed using the 10X Genomics Cell Ranger DNA Pipelines (https://support.10xgenomics.com/single-cell-dna/software/pipelines/latest/what-is-cell-ranger-dna). Following sample demultiplexing, data were aligned to the GRCh38 reference genome and a barcoded BAM file was obtained for every considered single cell per individual USARC ploidy population. To analyse each barcoded BAM file and derive total copy number alterations for each single cell, we then applied ASCAT.sc v.1.0 (https://github.com/VanLoo-lab/ascat), our in-house pipeline, to analyse single-cell and shallow coverage WGS data. Similar to its predecessor ASCAT, which measures allele-specific copy number alterations in bulk tumour data56, ASCAT.sc infers single-cell TCN states from changes in the relative read depth (logR). Importantly, ASCAT.sc derives the logR from the number of reads aligning in different genomic bins, unlike ASCAT, which relies on both the logR and the allelic imbalance (otherwise known as the B-allele frequency) at SNP loci identified as heterozygous in the germline. Thus, ASCAT.sc utilizes logR shifts to segment the genome into regions with constant TCN states, thereby assigning integer copy number profiles to single cells. For single-cell allele-specific copy number alterations, we first performed single-cell segmentation using multiple piecewise constant fitting70 using the R package copynumber v.1.26.0 (https://bioconductor.org/packages/release/bioc/html/copynumber.html). We then provide ASCAT.sc with the available matched-normal germline sample and generate phased germline SNPs using Beagle (v.5.1)71 as part of the subclonal copy number calling pipeline, Battenberg72. ASCAT.sc then uses single cell logR values alongside phased SNP data, as well as allele counts for heterozygous SNPs (generated using alleleCount; https://github.com/cancerit/alleleCount) to calculate allele-specific copy number alterations in single cells. These results can be used to group cells into distinct tumour subclones while also excluding noisy single cells.

Copy number signatures on single-cell copy number profiles

For all single-cell datasets, adjacent genomic bins within a chromosome with the same major and minor copy number were combined into a single segment. Genomic bins for which no copy number state was assigned were removed from the profiles. Copy number summaries were then generated, and TCGA copy number signatures were scanned using sigProfilerSingleSample on all cells.

Because of the nature of the undifferentiated sarcoma for which single-cell sequencing was performed (near-genome-wide LOH), the majority of the genome should be LOH for tumour cells, and a minority of the genome should be LOH for normal cells. However, we observed a number of cells for which the majority of the genome had a copy number of (1, 4). This is an erroneous copy number pattern, which occurred owing to the difficulty of calling LOH from single-cell data in the context of multiple genome-doubling events. Cells with a proportion of the genome LOH < 0.4 and a proportion of the genome with imbalanced copy number (major CN!=minorCN) > 0.6 were excluded from further analysis to remove erroneous profiles.

For an assessment of copy number signatures in genomically unstable single cells, BAM files from TP53 mutant RPE1 cells were downloaded69. Copy number profiles were generated as for the USARC single cell data, and scanned for signatures using sigProfilerSingleSample.

FACS and copy number profiling of ploidy populations for RRBS

The sorting strategy for RRBS workflows was modified to collect groups of cells belonging to different ploidy populations based on DAPI staining (Supplementary Methods). Five tumour samples were processed in this manner, DNA was extracted using a Quick-DNA Miniprep Plus kit (Zymo, D4068) and library preparation and quality control was performed using an Ovation RRBS Methyl-Seq system (Nugen, 0353, 0553) according to the manufacturer’s instructions. Paired-end sequencing was performed on an Illumina NovaSeq instrument using an S1 flowcell 100 cycles (single end). Allele-specific copy number calling was performed using CAMDAC (https://github.com/VanLoo-lab/CAMDAC).

Copy number signatures for the 4 ploidy-sorted populations and the bulk population were extracted using sigProfilerExtractor, setting the number of signatures to extract at 4. Artificial genome-doubling of the identified signatures was performed as described above. The 5 samples were also scanned for the 21 TCGA signatures using sigProfilerSingleSample; identified copy number signatures were categorized by their predominant genome-doubling association (see above), and the prevalence of individual genome doubling category (WGD×0, WGD×1, WGD×2) signatures was evaluated.

Copy number signatures in germline TP53 mutant cancers

We used Battenberg-derived72 copy number profiles of WGS data from cancer samples of patients with Li–Fraumeni disease73,74. Additional clinical metadata and highly curated sequencing data for additional cases were obtained from D.M., A.S. and N.L.

Data analysis

All signatures decompositions, assignments and matrix generations were performed using the sigProfiler suite (see above) of Python packages using Python v.3.7.1.

All statistical analyses were performed in R v.4.0.2. Plotting was performed with base R or with packages ggplot2, ggrepel, RColorBrewer, circlize, ComplexHeatmap, colorspace, seriation, dendextend, beanplot and corrplot. Survival analysis was performed with the R packages survival and survminer. Multiple testing correction was performed using qvalue. Cosine similarities were calculated using the cosine function from lsa. TSNE analysis was performed using Rtsne. Data handling was performed with GenomicRanges, tidyr, stringr, parallel and gtools.

Ethics

Informed consent from patients and ethical approval for tissue biobanking was obtained through the UCL/UCLH Biobank for Studying Health and Disease (REC reference: 20/YH/0088; NHS Health Research Authority). Approval for the study and ethics oversight was granted by the NHS Health Research Authority (REC reference: 16/NW/0769).

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this paper.

Leave a Comment