METASTASIS (SUPPLEMENTARY INFORMATION)

To identify genes that could discriminate between primary tumors and their synchronous lymph node metastases for the 26 matched pairs of cohort 1, different statistical approaches were undertaken:

1. Unsupervised hierarchical clustering.

A prefiltering of the expression data was performed by taking into account both MAS5 “Absolute Call” flags and average expression measurements within each group. Briefly, from a total of 44,760 probesets, we selected only those called present or marginal (P or M) at least once across all samples, and whose mean raw expression level was ≥ 200 within at least one class of samples (T, primary tumors, or M, lymph node metastases). This prefiltering removed probesets whose expression signal remained close to the background signal throughout the entire set of samples (likely to be called “Absent” by the MAS5 algorithm and consequently considered not expressed). The procedure selected 23,281 probesets. Unsupervised hierarchical clustering analysis performed on the two tumor classes, T (primary tumors) and M (lymph node metastases), indicated that 21 of the 26 lymph node metastases clustered together with the primary tumor from which they arose. This further confirms that, despite the different tissue of origin of the tumor pairs (breast tissue for the primary tumor and lymphatic tissue for the metastatic tumor), matched samples retained characteristic expression fingerprints (Supplementary Figure 2a). This finding is in line with previous reports (Chen et al., 2003; Feng et al., 2007; Hao et al., 2004; Perou et al., 2000; Weigelt et al., 2003; Weigelt et al., 2005).
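The two prefiltering criteria described above can be illustrated with a minimal Python sketch. This is not the code used in the study; the function name, the data layout (dictionaries keyed by probeset) and the toy values are assumptions made purely for illustration.

```python
from statistics import mean

def prefilter(calls, raw, classes, min_mean=200.0):
    """Return probeset IDs passing both prefiltering criteria.

    calls:   {probeset: [MAS5 detection call per sample: 'P', 'M' or 'A']}
    raw:     {probeset: [raw expression signal per sample]}
    classes: [class label per sample, 'T' or 'M'], in the same sample order
    (hypothetical layout, chosen only for this sketch)
    """
    kept = []
    for probeset, flags in calls.items():
        # Criterion 1: called Present or Marginal in at least one sample.
        if not any(flag in ("P", "M") for flag in flags):
            continue
        # Criterion 2: mean raw signal >= min_mean in at least one class.
        class_means = []
        for label in set(classes):
            values = [v for v, c in zip(raw[probeset], classes) if c == label]
            class_means.append(mean(values))
        if max(class_means) >= min_mean:
            kept.append(probeset)
    return kept

# Toy data: two primary tumors (T) and two lymph node metastases (M).
classes = ["T", "T", "M", "M"]
calls = {
    "ps1": ["P", "A", "A", "A"],   # detected once, high mean in class T
    "ps2": ["A", "A", "A", "A"],   # never detected
    "ps3": ["M", "P", "P", "P"],   # detected, but low signal in both classes
}
raw = {
    "ps1": [350, 250, 90, 110],
    "ps2": [500, 600, 700, 800],
    "ps3": [100, 120, 150, 180],
}
print(prefilter(calls, raw, classes))
```

In this toy example only ps1 satisfies both criteria: ps2 fails the detection-call criterion and ps3 fails the expression-level criterion.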

2. MAS5 comparison analysis.

A first level of analysis was performed using the Affymetrix MAS5 “Comparison Analysis” algorithm (Affymetrix Statistical Algorithm Reference Guide, Affymetrix, Santa Clara, CA, version 5 edition) by directly comparing each tumor sample (baseline array) against its corresponding lymph node metastasis (experimental array), in order to detect and quantify changes in gene expression within individual matched pairs. Like the “absolute analysis”, the “comparison analysis” relies on Wilcoxon’s signed-rank test to generate a qualitative output with an associated P-value and a quantitative metric with an associated confidence interval. The qualitative output indicates whether a transcript in the experimental array is increased (I), decreased (D), or not changed (NC) relative to its baseline counterpart. The quantitative metric provides an estimate of the relative difference in transcript abundance between the two arrays. Twenty-six comparison lists were generated, and the percentage of patients in which each probeset was called up-regulated, down-regulated, or not changed across the whole data set was calculated. Subsequently, the comparison table was ordered to obtain a ranking of up- and down-regulated genes across all 26 paired samples. Of note, no probeset was found to be either up-regulated or down-regulated in 100% of the matched pairs analyzed. Nonetheless, we identified several hundred genes whose expression levels were commonly decreased or increased in the metastatic specimens with respect to primary tumors, i.e., regulated in the same direction in a high percentage of patients (Supplementary Table 2).

Probesets up-regulated in lymph node metastases, with respect to primary tumors, were selected when called increased in more than 40% of the comparisons and with an opposite regulation of less than 30% (375 probesets). Down-regulated probesets were selected with the opposite criterion (417 probesets). The fold change (FC) was calculated on the median normalized average expression value for each probeset in the two classes of samples (Table 1 and Supplementary Table 2).
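The selection rule above (called increased in more than 40% of comparisons with fewer than 30% opposite calls, and vice versa) can be sketched as follows. The function names and the data layout are hypothetical, and only the three calls named in the text (I, D, NC) are considered; this is an illustration, not the procedure as actually implemented.

```python
def call_percentages(calls):
    """Percentage of matched pairs with each comparison call (I, D, NC)."""
    n = len(calls)
    return {c: 100.0 * calls.count(c) / n for c in ("I", "D", "NC")}

def select_regulated(comparison, min_same=40.0, max_opposite=30.0):
    """Split probesets into up- and down-regulated lists.

    comparison: {probeset: [call per matched pair]} (hypothetical layout)
    """
    up, down = [], []
    for probeset, calls in comparison.items():
        pct = call_percentages(calls)
        # Up-regulated: increased in > 40% of pairs, decreased in < 30%.
        if pct["I"] > min_same and pct["D"] < max_opposite:
            up.append(probeset)
        # Down-regulated: the mirror-image criterion.
        elif pct["D"] > min_same and pct["I"] < max_opposite:
            down.append(probeset)
    return up, down

# Toy data with 10 matched pairs per probeset (the study used 26).
comparison = {
    "ps_up":    ["I"] * 5 + ["NC"] * 4 + ["D"] * 1,
    "ps_down":  ["D"] * 6 + ["NC"] * 3 + ["I"] * 1,
    "ps_mixed": ["I"] * 5 + ["D"] * 4 + ["NC"] * 1,  # too many opposite calls
    "ps_flat":  ["NC"] * 9 + ["I"] * 1,
}
up, down = select_regulated(comparison)
print(up, down)
```

A probeset such as ps_mixed, which is frequently called in both directions, is excluded by the opposite-regulation threshold even though it exceeds 40% in one direction.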

3. GeneSpring ANOVA analysis.

The expression profile data pre-processed with MAS5 were exported to GeneSpring version 7.0 (Silicon Genetics, Redwood City, CA) for further elaboration. According to the GeneSpring normalization procedure, the 50th percentile of all measurements was used as a positive control within each hybridization array, and each measurement for each gene was divided by this control. The bottom 10th percentile was used for background subtraction. Among different hybridization arrays, each gene was divided by the median of its measurements in all samples. Data were then log transformed for subsequent analysis. A prefiltering of the expression data was performed by taking into account both MAS5 “Absolute Call” flags and average expression measurements within each group, as described above. In order to find genes whose expression levels significantly differed between primary tumors and lymph node metastases, we adopted a supervised method of analysis, using the GeneSpring software. Mean values were calculated within each experimental group for each probeset, and fold-change ratios between the primary tumor class (T) and lymph node metastasis class (M) were derived. A 1.5-fold cut-off was applied to select up-regulated and down-regulated genes. A further statistical analysis was performed using a 1-Way ANOVA to filter out genes that did not significantly vary across different groups with multiple samples. This comparison was performed for each gene, and the genes with sufficiently small P-values (P < 0.05) were returned. A multiple testing correction was added according to the Benjamini and Hochberg False Discovery Rate (FDR) procedure (Benjamini and Hochberg, 1995). By this analysis, 588 probesets were found to be significantly regulated between the two classes of samples.
Many of the genes contained in this list were also scored as consistently up- and down-regulated in a high percentage of patients in the ranked MAS5 comparison table (see Supplementary Table 2). A hierarchical clustering of the T and M classes using the 588-probeset list divided the data set into two major groups, one mainly composed of primary tumors and the other of lymph node metastases, as expected (see Supplementary Table 2 and Supplementary Figure 2b).
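As an illustration of the normalization and multiple testing steps described above, the following Python sketch shows per-array and per-gene median normalization and the Benjamini and Hochberg adjustment. It does not reproduce GeneSpring internals: the function names are hypothetical, the background subtraction and the ANOVA itself are omitted, and the P-values are assumed to be supplied by a separate test.

```python
from statistics import median

def per_chip_normalize(chip):
    """Divide every measurement on one array by that array's 50th percentile."""
    m = median(chip.values())
    return {gene: value / m for gene, value in chip.items()}

def per_gene_normalize(values):
    """Divide one gene's measurements by their median across all samples."""
    m = median(values)
    return [v / m for v in values]

def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P-values in the original order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest P-value down so adjusted values stay monotone.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

# Toy usage: one array with three genes, then four raw P-values.
print(per_chip_normalize({"a": 100.0, "b": 200.0, "c": 400.0}))
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
significant = [q < 0.05 for q in adjusted]
print(adjusted, significant)
```

Note that a raw P-value below 0.05 does not necessarily remain below 0.05 after adjustment, which is the purpose of the correction.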

4. Support Vector Machine with Recursive Feature Elimination (SVM-RFE) analysis.

We optimized an SVM with linear kernel and RFE, trained by a proper M-Fold Cross Validation (MFCV) procedure (Kohavi, 1995). A prefiltering step was performed taking into account MAS5 “Absolute Call” flags. Briefly, from a total of 44,760 probesets, we selected only those called Present in at least 50% of the samples in at least one of the two classes, ending up with 15,369 probesets. Each probeset was subsequently normalized across the two classes by subtracting its mean value and dividing by 3 standard deviations, a common procedure when training SVM algorithms (Campanini et al., 2004; Vapnik and Chapelle, 2000). To allow a stratified procedure, the value of M, the number of subsets used in Cross-Validation, was set to 9; furthermore, the procedure was reinforced by many random subdivisions of the data into these M subsets, to reach more stable results (Kohavi, 1995). For each random subdivision we performed MFCV while decreasing the number of genes by means of the RFE algorithm. Correct training for RFE implies that a different gene ranking comes from each of the folds within the CV procedure (Ambroise and McLachlan, 2002); it is therefore possible to define an average gene ranking list, for each of the random subdivisions, by simply averaging the positions of each gene in the lists derived from each fold. In order to optimize the two free parameters of SVM-RFE (error cost C and optimal gene number), we averaged all MFCV error curves with the same parameters over the 300 random subdivisions of the data that were used. The optimal number of genes was found to be 5 (EPHA3, KRT14, ODZ2, F2RL2 and FST). The corresponding Cross-Validation performances were: number of misclassified Tumors misT = 3.87 ± 0.10 (about 17% of Tumor set), number of misclassified Metastases misM = 4.41 ± 0.10 (about 15% of Metastasis set), with a 95% confidence interval.
In order to verify the accuracy of the training process, we performed a Monte Carlo simulation on random datasets obtained by random permutation of each gene value, across both classes. We applied our method on 100 random data sets to evaluate averages of MFCV performances. The algorithm reached an error rate close to 50%, corresponding to chance prediction, and showing that the training procedure was correctly performed (data not shown).
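Three of the steps described above (scaling of each probeset, averaging of gene positions over per-fold rankings, and random permutation of gene values for the Monte Carlo control) can be sketched in Python as follows. The SVM training and the RFE loop themselves are not shown. All function names are hypothetical, and the use of the population standard deviation is an assumption, since the text does not specify the estimator.

```python
import random
from statistics import mean, pstdev

def scale_probeset(values):
    """Subtract the mean and divide by 3 standard deviations."""
    mu = mean(values)
    sd = pstdev(values)  # assumption: population SD; constant probesets would divide by zero
    return [(v - mu) / (3 * sd) for v in values]

def average_ranking(fold_rankings):
    """Average each gene's position over per-fold ranked lists (best first)."""
    genes = fold_rankings[0]
    avg_pos = {g: mean(r.index(g) for r in fold_rankings) for g in genes}
    return sorted(genes, key=lambda g: avg_pos[g])

def permute_gene(values, rng):
    """Shuffle one gene's values across all samples, breaking class structure."""
    shuffled = list(values)
    rng.shuffle(shuffled)
    return shuffled

# Toy usage: three folds, each producing its own ranking of three genes.
fold_rankings = [
    ["g1", "g2", "g3"],
    ["g2", "g1", "g3"],
    ["g1", "g3", "g2"],
]
print(average_ranking(fold_rankings))

rng = random.Random(0)  # fixed seed only to make the toy example repeatable
print(permute_gene([1.0, 2.0, 3.0, 4.0], rng))
```

Averaging positions, rather than taking the ranking from a single fold, is what allows a single list to be reported while each fold still performs its own feature elimination.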

We derived our final gene-ranking list by averaging all MFCV lists. Because of noise and uncertainty due to scarcity of data, we could not restrict the final list, to be analyzed for its biological information, to the genes corresponding to the optimal gene number. Thus, we decided to take into account the 100 genes with the highest ranking in the globally averaged list. A Wilcoxon paired test on the 26 Tumor-Metastasis pairs was performed to estimate a P-value for each probeset of the 100-gene list (see Supplementary Table 2 and Supplementary Figure 2c), in order to highlight those showing a large expression difference.
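For illustration, a paired Wilcoxon signed-rank test of the kind mentioned above can be sketched in pure Python. This version uses the normal approximation without tie or continuity correction, which is a simplification; the text does not state which variant or software was used, and the function name is hypothetical.

```python
import math

def wilcoxon_paired(tumor, metastasis):
    """Two-sided Wilcoxon signed-rank test (normal approximation).

    Returns (W, p), where W is the smaller of the two signed-rank sums.
    Zero differences are dropped; tied |differences| get average ranks.
    """
    diffs = [m - t for t, m in zip(tumor, metastasis) if m != t]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    start = 0
    while start < n:
        end = start
        # Extend the block while the next absolute difference is tied.
        while end + 1 < n and abs(diffs[order[end + 1]]) == abs(diffs[order[start]]):
            end += 1
        avg_rank = (start + end) / 2 + 1  # mean of 1-based ranks in the block
        for k in range(start, end + 1):
            ranks[order[k]] = avg_rank
        start = end + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)
    mean_w = n * (n + 1) / 4
    sd_w = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w - mean_w) / sd_w
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided standard normal tail
    return w, p

# Toy usage: ten pairs in which the metastasis value is always higher.
tumor = [float(v) for v in range(1, 11)]
metastasis = [t + d for t, d in zip(tumor, range(1, 11))]
print(wilcoxon_paired(tumor, metastasis))
```

With 26 pairs, as in cohort 1, the normal approximation is generally considered adequate, although an exact test may give slightly different P-values.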

Independent validation and distant metastases analysis.

In order to validate the 270-probeset list (probesets common to at least two of the three lists described above; see Supplementary Table 2), we built an independent dataset (cohort 2) composed of 81 lymph node-positive primary breast tumors, retrieved from public databases (GEO, http://www.ncbi.nlm.nih.gov/geo/, accession number GSE4922; Ivshina et al., 2006). Selected samples were: X103B41, X112B55, X114B68, X130B92, X131B79, X138B34, X146B39, X147B19, X14B98, X150B81, X154B42, X159B47, X162B98, X165B72, X166B79, X170B15, X173B43, X175B72, X181B70, X182B43, X183B75, X186B22, X187B36, X193B72, X194B60, X19C33, X200B47, X201B68, X203B49, X207C08, X216C61, X222C26, X223C51, X225C52, X226C06, X230C47, X232C58, X234C15, X237C56, X240C54, X244C89, X245C22, X256C45, X257C87, X269C68, X26C23, X270C93, X271C71, X27C82, X287C67, X288C57, X289C75, X297C26, X311A27, X313A87, X33C30, X39C24, X40C57, X41C65, X44A53, X46A25, X47A87, X50A91, X51A98, X53A06, X55A79, X56A94, X58A50, X63A62, X66A84, X67A43, X69A93, X6B85, X73A01, X76A44, X77A50, X79A35, X7B96, X82A83, X85A03, X96A21, which were hybridized on the same Affymetrix GeneChip platform (HG-U133). Raw data files (CEL files) were reprocessed in house (http://services.ifom-ieo-campus.it/) with the Affymetrix proprietary MAS5 preprocessing algorithm, in order to make all samples comparable with those used in this study. In addition, we profiled 32 unmatched distant breast tumor metastases, from 32 different patients and from different organs. The expression profile data pre-processed with MAS5 were exported to GeneSpring for further elaboration, as described above. Among different hybridization arrays, each gene was divided by the median of its measurements in all samples. Data were then log transformed for subsequent analysis. The hierarchical clustering analyses presented in Figures 2 and 3 of the main text and the additional statistical analyses were performed with the GeneSpring software with the same criteria as described above.
Starting from the initial list of 270 probesets, 126 probesets were selected, with a corrected (FDR) P-value of less than 0.05 (Supplementary Table 2), which were able to discriminate between primary breast tumors and breast tumor-originated distant metastases with an accuracy of 96%.

Comparison of our gene list to other published gene lists.

In addition to the expression profile studies mentioned and discussed in the main text, we would also like to comment on the comparison between our gene list and that reported by Ramaswamy et al. (Ramaswamy et al., 2003). These authors identified an expression pattern of 128 genes that distinguished unmatched primary and metastatic adenocarcinomas. The signature was derived by comparing the gene-expression profiles of 12 metastatic adenocarcinoma nodules of diverse origin (lung, breast, prostate, colorectal, uterus, ovary; only two were of breast tumor origin) and 64 primary adenocarcinomas, representing the same spectrum of tumor types. Our 126-probeset list and the signature of 128 genes identified by Ramaswamy et al. display almost no overlap (only 2 genes are in common, PTN and COL1A1). This is, however, not surprising, since the study design of Ramaswamy et al. was completely different from ours, and the two studies have most likely identified different signatures. In particular, the signature of Ramaswamy et al. possibly represents a tumor type-independent signature associated with metastasis, whereas our study probably selected a breast-specific metastasis signature.

In situ hybridization analysis of mRNA expression levels.

The ISH was performed as previously described (Rugarli et al., 1993). Briefly, we used 35S-UTP-labeled sense and antisense riboprobes generated from the most specific region (300 bp) of each gene, as identified by BLAST searches. The identified cDNA regions were then isolated by PCR using oligos flanked by T3 and T7 RNA polymerase promoters, respectively, and transcribed in vitro directly. The sequences of the probes are available upon request. TMA sections were dewaxed, digested with Proteinase K (20 mg/ml), post-fixed, acetylated and dried. After overnight hybridization at 50 °C, sections were washed in 50% formamide, 2X SSC, 20 mM 2-mercaptoethanol at 60 °C, coated with Kodak NTB-2 photographic emulsion, and exposed for three weeks. The slides were lightly H&E counterstained for morphologic evaluation. All TMAs were first analyzed for the expression of the housekeeping gene β-actin, to check the mRNA quality of the samples. Cases showing absent or low β-actin signal were excluded from the analysis.

Clones

SerB5 cDNA was cloned by RT-PCR from MCF10A cells using the following primers: forward: 5’-GAATTCGATGCCCTGCAACTA-3’; reverse: 5’-GAATTCGATGCCCTGCAACTA-3’. APOD cDNA (Accession AA456975, I.M.A.G.E:838611) was obtained from the "Sequence Verified Human cDNA Clones" library (Cat.n. 97001.V) issued by ResGen (Invitrogen Corporation, http://www.resgen.com/products/SVHcDNA.php3#info). LTF (BC015823, I.M.A.G.E:4294752) and MMP7 (BC003635, I.M.A.G.E:4294752) full-length clones were purchased from UK HGMP-RC (Human Genome Mapping Project Resource Centre) (MRC Geneservice Ltd, http://www.geneservice.co.uk/). All cDNAs were sequence verified.

Quantitative real-time PCR

Quantitative real-time PCR (Q-RT-PCR) was performed using TaqMan methodology (ABI Prism 7900HT, Applied Biosystems, Foster City, CA, USA). The following Assays-on-Demand (Applied Biosystems) were employed: Hs00184728_m1 (SERPINB5), Hs00158924_m1 (LTF), Hs00155794_m1 (APOD), Hs00159163_m1 (MMP7). Glyceraldehyde-3-phosphate dehydrogenase (GAPDH, Hs99999905_m1) was used as a housekeeping gene.

LEGENDS TO SUPPLEMENTARY TABLES

Legend to Supplementary Table 1: Clinical information associated with the patients of cohort 1 (26 breast paired primary tumors and lymph node metastases) and with the 32 patients of cohort 2 from whom distant metastatic samples were obtained. Tumor Histotype: IDC, Infiltrating Ductal Carcinoma; ILC, Infiltrating Lobular Carcinoma; NA, not assessed. pT, primary tumor size; pN, regional lymph nodes (N0, lymph node negative; N+, lymph node positive; Nx, not assessed); M, distant metastases at diagnosis; ER, estrogen receptor status; PgR, progesterone receptor status. Treatment: RT, Radiotherapy; CT ns, Chemotherapy, not specified; CMF, Cyclophosphamide, Methotrexate, 5-Fluorouracil; EpiDX; DX, doxorubicin; TAM, tamoxifen; ADM, Adriamycin; 5FU, 5-Fluorouracil; FEC, Fluorouracil, Epirubicin, Cyclophosphamide; MMT, Mitoxantrone, Methotrexate, Tamoxifen; HPR, N-(4-hydroxyphenyl) retinamide; ZOLADEX, Goserelin; Enantone.

Legend to Supplementary Table 2: List of probesets obtained from the different analyses applied. Average expression and median normalized values for each gene in the two classes of samples are reported, together with the calculated fold changes. The contents of the individual lists are summarized as follows: 1: list of 792 probesets obtained by the MAS5 comparison analysis of cohort 1; the percentage of patients in which each gene was found to be Increased (I), Decreased (D) or Not Changed (NC) is also reported. 2: list of 588 probesets obtained by the GeneSpring ANOVA analysis of cohort 1. 3: list of the best 100 probesets obtained by the SVM-RFE analysis of cohort 1. 4: list of 270 probesets in common among at least two of the previous lists. 5: list of 126 probesets, derived from the initial list of 270, obtained by statistical analysis on cohort 2. Median normalized average expression values are reported, with fold changes and relative P-values for each gene: Tumor (T) and Distant Metastasis (DM).

LEGENDS TO SUPPLEMENTARY FIGURES

Supplementary Figure 1. Evaluation of tumor cellularity using hematoxylin and eosin (HE) versus cytokeratin (CK). (a) Cellularity of samples processed for Affymetrix GeneChip analysis was evaluated by two expert pathologists by morphological analysis of frozen sections stained with HE. Three representative examples of nodal specimens used in the analysis are shown. As illustrated, at different magnifications, the nodal tissues were massively replaced by neoplastic cells (more than 80% of tumor cells clearly identifiable with HE staining). (b) Evaluation of tumor cellularity performed by HE and CK staining. A pan-cytokeratin (CK) antibody that also includes CK19 (clone MNF166, Dako, Carpinteria, CA; 1:400) was used. As shown in these representative images of four diffusely metastatic lymph nodes, tumor areas outlined by CK are similar to those identified by HE (delineated by dashed lines). Of note, the use of HE might in some cases be more advantageous than CK staining. This is exemplified by case NM4, in which the use of CK would result in an incorrect cellularity estimate compared to HE, due to the loss of CK staining of tumor cells (possibly as a consequence of biological or technical factors).

Supplementary Figure 2. Hierarchical clustering of 26 paired primary breast tumors and lymph node metastases (cohort 1) using different gene lists. (a) Unsupervised hierarchical clustering of 23,281 probesets. (b) Hierarchical clustering of 588 probesets obtained by the GeneSpring ANOVA analysis. (c) Hierarchical clustering of 100 probesets obtained by SVM-RFE analysis. Rows represent probesets and columns represent samples (sample color code: red, primary tumors; blue, metastases).

Supplementary Figure 3. The 17 genes selected for in situ analysis separate primary breast tumors from their metastases. Hierarchical clustering of the 17 genes on samples of cohort 1. Color codes are as in Supplementary Figure 2.

Supplementary Figure 4. ISH-TMA analyses on epithelial- and fibroblast-expressed genes. Data in this figure are supplementary to those displayed in Figure 4 of the main text. The figure contains, in addition to the ISH data on primary tumors and metastases reported in Figure 4, also ISH data on normal breast parenchyma from the same patients.

Representative ISH images (from samples of cohort 3) are shown of epithelial-associated (LTF, MMP7, SERPINB5 and APOD) and stromal (SFRP2, POSTN, FN1) genes on paired normal, primary tumor (PT), and lymph node metastatic (NM) tissue samples. The bright field panels (left of each pair) are stained with hematoxylin/eosin; the dark field panels (right of each pair) show the ISH signals (bright areas). As shown, APOD expression was detected in normal epithelium with apocrine metaplasia and in stromal cells surrounding normal ducts, while SERPINB5 expression was restricted to myoepithelial cells in normal glands as confirmed by immunohistochemical analysis. SFRP2 and FN1 did not give any signal in normal breast epithelium, stroma, and, notably, nodal parenchyma, while POSTN was expressed in normal breast epithelium but not in its surrounding stroma. Additionally, FN1 and POSTN were also expressed in a small percentage of cases (10% and 24%, respectively) by epithelial cells in both primary tumors and their paired metastases, although the level of intensity of the signal was always below that of the surrounding stromal cells (data not shown). Original magnification 100X.

Supplementary Figure 5. SERPINB5 and LTF do not affect the cell proliferation rate. 4175 cells infected as shown (EV, empty vector control) were cultivated in standard conditions and counted at the indicated time points. Results are typical and representative of three independent experiments.

REFERENCES TO SUPPLEMENTARY INFORMATION

Ambroise C, McLachlan GJ. (2002). Selection bias in gene extraction on the basis of microarray gene-expression data. Proc Natl Acad Sci U S A 99: 6562-6566.

Benjamini Y, Hochberg Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc B 57: 289-300.

Campanini R, Dongiovanni D, Iampieri E, Lanconelli N, Masotti M, Palermo G et al. (2004). A novel featureless approach to mass detection in digital mammograms based on support vector machines. Phys Med Biol 49: 961-975.

Chen X, Leung SY, Yuen ST, Chu KM, Ji J, Li R et al. (2003). Variation in gene expression patterns in human gastric cancers. Mol Biol Cell 14: 3208-3215.

Feng Y, Sun B, Li X, Zhang L, Niu Y, Xiao C et al. (2007). Differentially expressed genes between primary cancer and paired lymph node metastases predict clinical outcome of node-positive breast cancer patients. Breast Cancer Res Treat 103: 319-329.

Hao X, Sun B, Hu L, Lahdesmaki H, Dunmire V, Feng Y et al. (2004). Differential gene and protein expression in primary breast malignancies and their lymph node metastases as revealed by combined cDNA microarray and tissue microarray analysis. Cancer 100: 1110-1122.

Ivshina AV, George J, Senko O, Mow B, Putti TC, Smeds J et al. (2006). Genetic reclassification of histologic grade delineates new clinical subtypes of breast cancer. Cancer Res 66: 10292-10301.

Kohavi R. (1995). A study of cross-validation and bootstrap for accuracy estimation and model selection. In: Proceedings of the 14th International Joint Conference on Artificial Intelligence. Morgan Kaufmann: San Francisco, CA, pp 1137-1143.

Perou CM, Sorlie T, Eisen MB, van de Rijn M, Jeffrey SS, Rees CA et al. (2000). Molecular portraits of human breast tumours. Nature 406: 747-752.

Ramaswamy S, Ross KN, Lander ES, Golub TR. (2003). A molecular signature of metastasis in primary solid tumors. Nat Genet 33: 49-54.

Rugarli EI, Lutz B, Kuratani SC, Wawersik S, Borsani G, Ballabio A et al. (1993). Expression pattern of the Kallmann syndrome gene in the olfactory system suggests a role in neuronal targeting. Nat Genet 4: 19-26.

Vapnik V, Chapelle O. (2000). Bounds on error expectation for support vector machines. Neural Comput 12: 2013-2036.

Weigelt B, Glas AM, Wessels LF, Witteveen AT, Peterse JL, van't Veer LJ. (2003). Gene expression profiles of primary breast tumors maintained in distant metastases. Proc Natl Acad Sci U S A 100: 15901-15905.

Weigelt B, Wessels LF, Bosma AJ, Glas AM, Nuyten DS, He YD et al. (2005). No common denominator for breast cancer lymph node metastasis. Br J Cancer 93: 924-932.
