METASTASIS (SUPPLEMENTARY INFORMATION)
To identify genes that could
discriminate between primary tumors and their synchronous lymph node metastases
for the 26 matched pairs of cohort 1, different statistical approaches were
undertaken:
1. Unsupervised hierarchical clustering.
A prefiltering of the expression data
was performed by taking into account both MAS5 “Absolute Call” flags and
average expression measurements within each group. Briefly, from a total number
of 44,760 probesets, we selected only the ones called present or marginal (P or
M) at least once across all samples, and whose mean raw expression levels were
≥ 200 within at least one class of samples (T, primary tumors, or M, lymph node
metastases). The prefiltering method removed those probesets whose expression
signal was constantly too close to the background signal throughout the entire
set of samples (likely defined as “Absent” by the MAS5 algorithm and
consequently not expressed). This procedure selected 23,281 probesets.
Unsupervised hierarchical clustering of the two tumor classes, T (primary
tumors) and M (lymph node metastases), showed that 21 of the 26 lymph node
metastases clustered together with the primary tumor from which they arose,
further confirming that, despite the different tissue of
origin of tumor pairs (breast tissue for the primary tumor and lymphatic tissue
for the metastatic tumor), matched samples retained characteristic expression
fingerprints (Supplementary Figure 2a). This finding is in line with previous
reports (Chen et al., 2003; Feng et al., 2007; Hao et al.,
2004; Perou et al., 2000; Weigelt et al., 2003; Weigelt et al., 2005).
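The prefiltering rule described above (at least one P or M call across all samples, and a mean raw signal of at least 200 within at least one class) can be sketched as follows. This is only an illustration under stated assumptions, not the pipeline actually used: the per-sample list layout, the function name passes_prefilter, and the toy values are hypothetical.

```python
from statistics import mean

def passes_prefilter(calls, signals, classes, min_mean=200.0):
    """Return True if a probeset survives the prefilter.

    calls   -- MAS5 detection call per sample: 'P', 'M' (marginal) or 'A'
    signals -- raw expression value per sample
    classes -- class label per sample: 'T' (tumor) or 'M' (metastasis);
               this layout is an assumption made for the example
    """
    # Condition 1: called Present or Marginal at least once across all samples.
    if not any(call in ("P", "M") for call in calls):
        return False
    # Condition 2: mean raw signal >= min_mean within at least one class.
    for label in set(classes):
        values = [s for s, k in zip(signals, classes) if k == label]
        if values and mean(values) >= min_mean:
            return True
    return False

# Toy data (invented): two tumors followed by two metastases.
classes = ["T", "T", "M", "M"]
kept = passes_prefilter(["A", "P", "A", "A"], [150, 310, 90, 120], classes)
never_called = passes_prefilter(["A", "A", "A", "A"], [400, 500, 450, 480], classes)
too_low = passes_prefilter(["P", "P", "P", "P"], [150, 180, 90, 120], classes)
```

Both conditions must hold: a probeset that is never called P or M is dropped even when its signal is high, and a probeset that is always called P is dropped when neither class reaches the mean threshold.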
2. MAS5 comparison analysis.
A first level of
analysis was performed using the Affymetrix MAS5 “Comparison Analysis”
algorithm (Affymetrix Statistical Algorithm Reference Guide, Affymetrix, Santa
Clara, CA, version 5 edition) by directly comparing each tumor sample (baseline
array) against its corresponding lymph node metastasis (experimental array), in
order to detect and quantify changes in gene expression within individual
matched pairs. Like
the “absolute analysis”, the “comparison analysis” relies on Wilcoxon’s
signed-rank test to generate a qualitative output with an associated P-value
and a quantitative metric associated with a confidence interval. The
qualitative output indicates if a transcript in the experimental array is
increased (I), decreased (D), or equivalent (NC) to its baseline counterpart.
The quantitative metric provides an estimate of the relative difference in
transcript abundance between the two arrays. Twenty-six comparison lists were
generated and the percentage of patients where each single probeset was called
up-regulated, down-regulated, or not-changed, across the whole data set, was
calculated. Subsequently, the comparison table was ordered to obtain a ranking
of up- and down-regulated genes across all 26 paired samples. Of note, no
probeset was called up-regulated or down-regulated in 100% of the matched
pairs analyzed. Nonetheless, we identified several hundred genes whose
expression levels were decreased or increased in the metastatic specimens,
with respect to primary tumors, in a large fraction of the matched pairs
(Supplementary Table 2).
Probesets up-regulated in lymph node
metastases, with respect to primary tumors, were selected when called increased
in more than 40% of the comparisons and decreased in less than 30% (375
probesets). Down-regulated probesets were selected with the opposite criterion
(417 probesets). The fold change (FC) was calculated on the median normalized
average expression value for each probeset in the two classes of samples
(Table 1 and Supplementary Table 2).
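The selection criterion above can be expressed compactly. The sketch below assumes that each probeset comes with one MAS5 comparison call per matched pair ("I", "D" or "NC"); the function name classify_probeset and the toy call vectors are hypothetical and serve only to illustrate the thresholds.

```python
def classify_probeset(calls, min_same=0.40, max_opposite=0.30):
    """Label a probeset 'up', 'down' or None from its per-pair comparison calls.

    calls -- list of 'I' (increased), 'D' (decreased) or 'NC' (no change),
             one entry per tumor/metastasis pair (assumed layout)
    """
    n = len(calls)
    frac_increased = calls.count("I") / n
    frac_decreased = calls.count("D") / n
    # Up-regulated: increased in > 40% of pairs and decreased in < 30%.
    if frac_increased > min_same and frac_decreased < max_opposite:
        return "up"
    # Down-regulated: the opposite criterion.
    if frac_decreased > min_same and frac_increased < max_opposite:
        return "down"
    return None

# Toy call vectors for 26 pairs (invented values).
up_example = classify_probeset(["I"] * 12 + ["NC"] * 10 + ["D"] * 4)
down_example = classify_probeset(["D"] * 11 + ["NC"] * 15)
unselected = classify_probeset(["I"] * 10 + ["NC"] * 16)
```

With 26 pairs, 10 "I" calls (38%) fall short of the 40% threshold, whereas 11 calls (42%) pass it.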
3. GeneSpring ANOVA analysis.
The expression profile data
pre-processed with MAS5 were exported to GeneSpring version 7.0 (Silicon
Genetics, Redwood City, CA) for further elaboration. According to
the GeneSpring normalization procedure, the 50th percentile of all measurements
was used as a positive control, within each hybridization array, and each
measurement for each gene was divided by this control. The bottom 10th
percentile was used for background subtraction. Among different hybridization
arrays, each gene was divided by the median of its measurements in all samples.
Data were then log transformed for subsequent analysis. A prefiltering of the
expression data was performed by taking into account both MAS5 “Absolute Call”
flags and average expression measurements within each group, as described
above. In order to find genes whose expression levels significantly differed
between primary tumors and lymph node metastases, we adopted a supervised
method of analysis, using the GeneSpring software. Mean values were calculated
among individual experimental groups for each probeset, and fold-change ratios
between the primary tumor class (T) and lymph node metastasis class (M) were
derived. A 1.5-fold cut-off was applied to select up-regulated
and down-regulated genes. A further statistical analysis was performed using a
1-Way ANOVA to filter out genes that did not significantly vary across
different groups with multiple samples. This comparison was performed for each
gene, and the genes with sufficiently small P-values (P < 0.05) were
returned. A multiple testing correction analysis was added according to the
Benjamini and Hochberg False Discovery Rate (FDR) procedure (Benjamini and Hochberg, 1995). By this analysis, 588 probesets
were found to be significantly regulated between the two classes of samples.
Many of the genes contained in this list were also scored as consistently up-
and down-regulated in a high percentage of patients in the ranked MAS5
comparison table (see Supplementary Table 2). A hierarchical clustering of the
T and M classes using the 588-probeset list divided the data set into two major
groups, one mainly composed of primary tumors and the second one of lymph node
metastases, as expected (see Supplementary Table 2 and Supplementary Figure 2b).
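For reference, the Benjamini and Hochberg correction mentioned above can be sketched as the standard step-up adjustment of raw P-values. This is a generic illustration; how GeneSpring implements the procedure internally is not described here, and the function name and toy P-values are assumptions.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values, in the input order."""
    m = len(p_values)
    # Indices of the P-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Toy P-values (invented) for four probesets.
raw_p = [0.01, 0.04, 0.03, 0.20]
adjusted_p = benjamini_hochberg(raw_p)
significant = [p < 0.05 for p in adjusted_p]
```

In this toy example three raw P-values are below 0.05, but only the smallest one remains below 0.05 after correction.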
4. SVM-RFE analysis.
We optimized a support vector machine (SVM) with a linear kernel and recursive
feature elimination (RFE), trained by an M-fold cross-validation (MFCV)
procedure (Kohavi, 1995). A prefiltering step of data was
performed taking into account MAS5 “Absolute Call” flags. Briefly, from a total
number of 44,760 probesets, we selected only the ones called Present in at
least 50% of the samples in at least one of the two classes, ending up with
15,369 probesets. Each probeset was subsequently normalized across the two
classes by subtracting its mean value and dividing by 3 standard deviations, a
common procedure when training SVM algorithms (Campanini et al., 2004; Vapnik and Chapelle, 2000). To allow easy stratification,
the value of M, the number of subsets used in cross-validation, was set to 9;
furthermore, the procedure was repeated over many random subdivisions of the
data into these M subsets, to reach more stable results (Kohavi, 1995). For each random subdivision we
performed MFCV while decreasing the number of genes by means of the RFE
algorithm. Correct training for RFE implies that a different gene ranking
comes from each of the folds within the CV procedure (Ambroise and McLachlan, 2002); it is therefore possible to define an
average gene ranking list, for each of the random subdivisions, by simply
averaging the positions of each gene in the lists derived from each fold. In
order to optimize the two free parameters of SVM-RFE (error cost C and optimal
gene number), we averaged all MFCV error curves with the same parameters on the
300 random subdivisions of data that were exploited. The optimal number of
genes was found to be five (EPHA3, KRT14, ODZ2, F2RL2 and FST). The corresponding
cross-validation performances were: number of misclassified tumors, misT = 3.87 ± 0.10 (about 17% of the tumor set); number
of misclassified metastases, misM = 4.41 ± 0.10 (about 15% of the metastasis set),
with a 95% confidence interval. In order to verify the accuracy of the training
process, we performed a Monte Carlo simulation
on random datasets obtained by random permutation of each gene value, across
both classes. We applied our method on 100 random data sets to evaluate
averages of MFCV performances. The algorithm reached an error rate close to
50%, corresponding to chance prediction, and showing that the training
procedure was correctly performed (data not shown).
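The averaging of per-fold gene rankings described above can be sketched as follows. The example assumes that every fold returns a complete ranking of the same genes (best first); the function name average_ranking and the toy rankings are hypothetical, and the SVM-RFE step that produces each ranking is not reproduced here.

```python
from collections import defaultdict

def average_ranking(fold_rankings):
    """Merge per-fold rankings into one list ordered by mean position.

    fold_rankings -- list of rankings, each a list of gene names ordered
                     from best to worst (assumed to cover the same genes)
    """
    positions = defaultdict(list)
    for ranking in fold_rankings:
        for position, gene in enumerate(ranking, start=1):
            positions[gene].append(position)
    mean_position = {g: sum(p) / len(p) for g, p in positions.items()}
    # Sort by mean position; break ties alphabetically for reproducibility.
    return sorted(mean_position, key=lambda g: (mean_position[g], g))

# Toy rankings from three folds (the orderings are invented).
folds = [
    ["EPHA3", "KRT14", "FST"],
    ["KRT14", "EPHA3", "FST"],
    ["EPHA3", "FST", "KRT14"],
]
consensus = average_ranking(folds)
```

The same function can be applied a second time to the per-subdivision lists to obtain a globally averaged ranking.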
We derived our final gene-ranking list by averaging all MFCV lists. Because of
noise and uncertainty due to the scarcity of data, we could not restrict the
final list, to be analyzed for its biological information, to the genes
corresponding to the optimal gene number. We therefore took into account the
100 genes with the highest ranking in the globally averaged list. A Wilcoxon
paired test on the 26 Tumor-Metastasis pairs was performed to estimate a
P-value for each probeset of the 100-gene list (see Supplementary Table 2 and
Supplementary Figure 2c), in order to highlight those showing a large
expression difference.
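A paired Wilcoxon (signed-rank) test of this kind can be sketched with the standard library alone. The version below computes an exact two-sided P-value from the null distribution of the positive rank sum, conditional on the observed ranks; the software and the exact variant used for the analysis are not specified in the text, so the function name, the handling of ties (average ranks) and of zero differences (discarded), and the toy data are assumptions.

```python
def signed_rank_test(tumor, metastasis):
    """Exact two-sided Wilcoxon signed-rank test for paired samples.

    Returns (W_plus, p_value), where W_plus is the sum of the ranks of
    the positive differences (metastasis - tumor).
    """
    diffs = [m - t for t, m in zip(tumor, metastasis) if m != t]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0
    # Rank |differences|; ranks are stored doubled so that average ranks
    # of tied values remain integers.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    doubled_ranks = [0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            doubled_ranks[order[k]] = i + j + 2  # 2 * mean of ranks i+1..j+1
        i = j + 1
    w_plus = sum(r for r, d in zip(doubled_ranks, diffs) if d > 0)
    total = sum(doubled_ranks)
    # counts[s] = number of sign assignments whose positive rank sum is s.
    counts = [0] * (total + 1)
    counts[0] = 1
    for r in doubled_ranks:
        for s in range(total, r - 1, -1):
            counts[s] += counts[s - r]
    lower = sum(counts[: w_plus + 1])
    upper = sum(counts[w_plus:])
    p_value = min(1.0, 2 * min(lower, upper) / 2 ** n)
    return w_plus / 2, p_value

# Toy paired values (invented): every metastasis value exceeds its tumor value.
w, p = signed_rank_test([1, 1, 1, 1, 1], [2, 3, 4, 5, 6])
```

With five pairs that all change in the same direction, only 1 of the 32 possible sign assignments is as extreme in each tail, which gives the smallest two-sided P-value attainable at that sample size.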
Independent validation and distant metastases analysis.
In order to validate the 270-probeset
list (the probesets in common among at least two of the three lists described
above; see Supplementary Table 2), we built an independent dataset (cohort 2)
composed of 81 lymph node-positive primary breast tumors, retrieved from public
databases (GEO, http://www.ncbi.nlm.nih.gov/geo/, accession number GSE4922; Ivshina et al., 2006). Selected samples were: X103B41,
X112B55, X114B68, X130B92, X131B79, X138B34, X146B39, X147B19, X14B98, X150B81,
X154B42, X159B47, X162B98, X165B72, X166B79, X170B15, X173B43, X175B72,
X181B70, X182B43, X183B75, X186B22, X187B36, X193B72, X194B60, X19C33, X200B47,
X201B68, X203B49, X207C08, X216C61, X222C26, X223C51, X225C52, X226C06,
X230C47, X232C58, X234C15, X237C56, X240C54, X244C89, X245C22, X256C45,
X257C87, X269C68, X26C23, X270C93, X271C71, X27C82, X287C67, X288C57, X289C75,
X297C26, X311A27, X313A87, X33C30, X39C24, X40C57, X41C65, X44A53, X46A25,
X47A87, X50A91, X51A98, X53A06, X55A79, X56A94, X58A50, X63A62, X66A84, X67A43,
X69A93, X6B85, X73A01, X76A44, X77A50, X79A35, X7B96, X82A83, X85A03, X96A21,
which were hybridized on the same Affymetrix GeneChip platform (HG-U133). Raw
data files (CEL files) were reprocessed in house (http://services.ifom-ieo-campus.it/)
with Affymetrix's proprietary MAS5 preprocessing algorithm, in order to make
all samples comparable with those used in this study. In addition, we profiled
32 unmatched distant
breast tumor metastases, from 32 different patients and from different
organs. The expression profile data pre-processed with MAS5 were exported to
GeneSpring for further elaboration, as described above. Among different
hybridization arrays, each gene was divided by the median of its measurements
in all samples. Data were then log transformed for subsequent analysis.
Hierarchical clustering analysis presented in Figures 2 and 3 of the main text
and the additional statistical analysis were performed with the GeneSpring software
with the same criteria as described above. Starting from the initial list of
270 probesets, 126 probesets were selected, with a corrected (FDR) P-value of
less than 0.05 (Supplementary Table 2), which were able to discriminate between
primary breast tumors and breast tumor-originated distant metastases with an
accuracy of 96%.
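The per-chip and per-gene normalization steps mentioned above (division by the 50th percentile of each array, division of each gene by its median across samples, then log transformation) can be sketched as follows. This is a simplified illustration and not the GeneSpring implementation: background subtraction is omitted, a base-2 logarithm is assumed because the base is not stated, and the function name and toy matrices are hypothetical.

```python
from math import log2
from statistics import median

def normalize(matrix):
    """Per-chip and per-gene median normalization followed by log2.

    matrix[g][s] -- raw signal of gene g on array s (all values > 0)
    """
    n_genes, n_arrays = len(matrix), len(matrix[0])
    # Per-chip step: divide each measurement by the median of its array.
    chip_medians = [
        median(matrix[g][s] for g in range(n_genes)) for s in range(n_arrays)
    ]
    per_chip = [
        [matrix[g][s] / chip_medians[s] for s in range(n_arrays)]
        for g in range(n_genes)
    ]
    # Per-gene step: divide each gene by its median across all arrays,
    # then log-transform (base 2 is an assumption).
    normalized = []
    for row in per_chip:
        gene_median = median(row)
        normalized.append([log2(value / gene_median) for value in row])
    return normalized

# Toy data (invented): the second array is the first one scaled twofold,
# so every gene should come out unchanged after normalization.
scaled = normalize([[100, 200], [200, 400], [400, 800]])
# Same data, except that the third gene is further doubled on array 2.
changed = normalize([[100, 200], [200, 400], [400, 1600]])
```

The first toy matrix shows why the per-chip step is needed: a uniform difference in overall array intensity disappears, while a gene-specific change, as in the second matrix, is retained.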
Comparison of our gene list to other published gene lists.
In addition to the expression profile studies mentioned and
discussed in the main text, we also compared our gene list with that reported
by Ramaswamy et al. (2003). These authors identified an
expression pattern of 128 genes that distinguished unmatched primary and
metastatic adenocarcinomas. The signature was derived by comparing the gene-expression
profiles of 12 metastatic adenocarcinoma nodules of diverse origin (lung,
breast, prostate, colorectal, uterus, ovary; only two were of breast tumor
origin) and 64 primary adenocarcinomas, representing the same spectrum of tumor
types. Our 126-probeset list and the signature of 128
genes identified by Ramaswamy et al.
display almost no overlap (only 2 genes are in common, PTN and COL1A1). This
is, however, not surprising, since the study design of Ramaswamy et al. was completely different from
ours, and the two studies have most likely identified different signatures. In
particular, the signature of Ramaswamy et
al. possibly represents a tumor type-independent signature associated with
metastasis, whereas our study probably selected a breast-specific
metastasis signature.
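List comparisons of this kind reduce to simple set operations, as does the rule used to build the 270-probeset list, which Supplementary Table 2 describes as the probesets in common among at least two of the previous lists. The sketch below illustrates both; the function names are hypothetical, and all identifiers other than PTN and COL1A1 are invented placeholders.

```python
from collections import Counter

def shared_genes(list_a, list_b):
    """Return the sorted genes present in both lists."""
    return sorted(set(list_a) & set(list_b))

def in_at_least(lists, k=2):
    """Return the sorted items present in at least k of the given lists."""
    # Deduplicate within each list so an item is counted once per list.
    counts = Counter(item for lst in lists for item in set(lst))
    return sorted(item for item, c in counts.items() if c >= k)

# Pairwise overlap between two signatures (placeholder gene names except
# PTN and COL1A1, which are the two shared genes named in the text).
ours = ["PTN", "COL1A1", "GENE_A", "GENE_B"]
theirs = ["COL1A1", "GENE_X", "PTN", "GENE_Y"]
overlap = shared_genes(ours, theirs)

# "At least two of three lists" rule with placeholder probeset IDs.
mas5_list = ["probe_1", "probe_2", "probe_3"]
anova_list = ["probe_2", "probe_3", "probe_4"]
svm_list = ["probe_3", "probe_5"]
consensus_probes = in_at_least([mas5_list, anova_list, svm_list], k=2)
```

Matching on gene symbols rather than probeset identifiers, as is needed when comparing signatures from different platforms, would require a mapping step that is not shown here.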
In situ hybridization analysis of mRNA expression levels.
The ISH was performed as
previously described (Rugarli et al., 1993). Briefly, we used 35S-UTP-labeled sense and antisense
riboprobes generated from the most specific region (300 bp) of each gene, as
identified by BLAST searches. The identified cDNA regions were then isolated by
PCR using oligos flanked by T3 and T7 RNA polymerase promoters, respectively,
and transcribed in vitro directly.
The sequences of the probes are available upon request. TMA sections were
dewaxed, digested with Proteinase K (20 µg/ml), post-fixed,
acetylated and dried. After overnight hybridization at 50 °C,
sections were washed in 50% formamide, 2X SSC, 20 mM 2-mercaptoethanol at 60 °C,
coated with Kodak NTB-2 photographic emulsion, and exposed for three weeks.
The slides were lightly H&E counterstained for morphologic evaluation. All
TMAs were first analyzed for the expression of the housekeeping gene β-actin, to check for the
mRNA quality of the samples. Cases showing absent or low β-actin signal were
excluded from the analysis.
Clones
SerB5 cDNA was cloned by RT-PCR from
MCF10A cells using the following primers: forward: 5’-GAATTCGATGCCCTGCAACTA-3’;
reverse: 5’-GAATTCGATGCCCTGCAACTA-3’. APOD cDNA
(Accession AA456975, I.M.A.G.E:838611) was obtained from the "Sequence
Verified Human cDNA Clones" library (Cat.n. 97001.V) issued by ResGen,
(Invitrogen Corporation, http://www.resgen.com/products/SVHcDNA.php3#info).
LTF (BC015823, I.M.A.G.E:4294752) and MMP7 (BC003635, I.M.A.G.E:4294752) full
length clones were purchased from UK HGMP-RC (Human Genome Mapping Project
Resource Centre) (MRC Geneservice Ltd, http://www.geneservice.co.uk/).
All cDNAs were sequence verified.
Quantitative real-time PCR
Quantitative real-time PCR (Q-RT-PCR)
was performed using TaqMan methodology (ABI Prism 7900HT, Applied Biosystems, Foster City, CA,
USA). The
following Assays-on-Demand (Applied Biosystems) were employed: Hs00184728_m1
(SERPINB5), Hs00158924_m1 (LTF), Hs00155794_m1 (APOD), Hs00159163_m1 (MMP7). Glyceraldehyde-3-phosphate dehydrogenase (GAPDH,
Hs99999905_m1) was used as a housekeeping gene.
LEGENDS TO SUPPLEMENTARY TABLES
Legend to Supplementary Table 1: Clinical information associated with the patients of cohort 1
(26 breast paired primary tumors and lymph node metastases) and with the 32
patients of cohort 2 from whom distant metastatic samples were obtained. Tumor
Histotype: IDC, Infiltrating Ductal Carcinoma; ILC, Infiltrating Lobular
Carcinoma; NA: not assessed. pT, primary tumor size; pN, regional lymph nodes
(N0, lymph node negative; N+, lymph node positive; Nx, not assessed); M,
distant metastases at diagnosis; ER, estrogen receptor status; PgR,
progesterone receptor status. Treatment: RT, Radiotherapy; CT ns, Chemotherapy, not specified; CMF, Cyclophosphamide,
Methotrexate, 5-Fluorouracil; EpiDX; DX, doxorubicin; TAM, tamoxifen; ADM, Adriamycin; 5FU, 5-Fluorouracil;
FEC, Fluorouracil, Epirubicin, Cyclophosphamide; MMT, Mitoxantrone, Methotrexate, Tamoxifen;
HPR, N-(4-hydroxyphenyl) retinamide; ZOLADEX, Goserelin;
Enantone.
Legend to Supplementary Table 2: List of probesets obtained from the different analyses
applied. Average expression and median normalized values for each gene in the
two classes of samples are reported, and fold changes calculated. Summaries
of the contents of the individual lists are as follows: 1: list of 792
probesets obtained by the MAS5 comparison analysis of cohort 1; the percentage
of patients in which each gene was found to be Increased (I), Decreased (D) or
Not Changed (NC) is also reported. 2: list of 588 probesets obtained by the
GeneSpring ANOVA analysis of cohort 1. 3: list of the best 100 probesets obtained
by the SVM-RFE analysis of cohort 1. 4: list of 270 probesets in common among
at least two of the previous lists. 5: list of 126 probesets,
derived from the initial list of 270, obtained by statistical analysis on
cohort 2. Median normalized average expression values are reported, with fold
changes and relative P-values for each gene: Tumor (T) and Distant Metastasis
(DM).
LEGENDS TO SUPPLEMENTARY FIGURES
Supplementary Figure 1. Evaluation of tumor cellularity using hematoxylin
and eosin (HE) versus cytokeratin (CK). (a) Cellularity
of samples processed for Affymetrix GeneChip analysis was evaluated by two
expert pathologists by morphological analysis of frozen sections stained with
HE. Three representative examples of nodal specimens used in the analysis are
shown. As illustrated, at different magnifications, the nodal tissues were
massively replaced by neoplastic cells (more than 80% of tumor cells clearly
identifiable with HE staining). (b)
Evaluation of tumor cellularity performed by HE and CK staining. A pan
cytokeratin (CK) antibody that also includes CK19 (clone MNF166, Dako, Carpinteria, CA;
1:400) was used. As shown in these representative images of four diffusely
metastatic lymph nodes, tumor areas outlined by CK are similar to those
identified by HE (delineated by dashed lines). Of note, the use of HE might in
some cases be more advantageous than CK staining. This is exemplified by case
NM4, in which the use of CK would result in a wrong cellularity estimation
compared to HE, due to the loss of CK staining of tumor cells (possibly as a consequence
of biological or technical factors).
Supplementary Figure 2. Hierarchical clustering of 26 paired primary
breast tumors and lymph node metastases (cohort 1) using different gene lists. (a)
Unsupervised hierarchical clustering of 23,281
probesets. (b) Hierarchical
clustering of 588 probesets obtained by the GeneSpring ANOVA analysis. (c) Hierarchical clustering of 100
probesets obtained by SVM-RFE analysis. Rows represent probesets and columns
represent samples (sample color code: red, primary tumors; blue, metastases).
Supplementary Figure 3. The 17 genes selected for in situ analysis separate primary breast tumors from their
metastases. Hierarchical
clustering of the 17 genes on samples of cohort 1. Color codes are as in
Supplementary Figure 2.
Supplementary Figure 4. ISH-TMA analyses on epithelial- and
fibroblast-expressed genes. Data in this figure are supplementary to those displayed in Figure
4 of the main text. In addition to the ISH data on primary tumors and
metastases reported in Figure 4, the figure also contains ISH data on normal
breast parenchyma from the same patients.
Representative
ISH images (from samples of cohort 3) are shown of epithelial-associated (LTF, MMP7,
SERPINB5 and APOD) and stromal (SFRP2,
POSTN, FN1) genes on paired normal, primary tumor (PT), and lymph node
metastatic (NM) tissue samples. The bright field panels (left of each pair) are
stained with hematoxylin/eosin; the dark field panels (right of each pair) show
the ISH signals (bright areas). As shown, APOD
expression was detected in normal epithelium with apocrine metaplasia and in
stromal cells surrounding normal ducts, while SERPINB5 expression was restricted to myoepithelial cells in normal
glands as confirmed by immunohistochemical analysis. SFRP2 and FN1 did not
give any signal in normal breast epithelium, stroma, and, notably, nodal
parenchyma, while POSTN was expressed
in normal breast epithelium but not in its surrounding stroma. Additionally, FN1 and POSTN were also expressed in a small percentage of cases (10% and
24%, respectively) by epithelial cells in both primary tumors and their paired
metastases, although the level of intensity of the signal was always below that
of the surrounding stromal cells (data not shown). Original magnification 100X.
Supplementary Figure 5. SERPINB5 and LTF do not affect the cell
proliferation rate. 4175
cells infected as shown (EV, empty vector control) were cultivated in standard
conditions and counted at the indicated time points. Results are typical and
representative of three independent experiments.
REFERENCES TO SUPPLEMENTARY INFORMATION
Ambroise C, McLachlan GJ. (2002). Selection bias in gene extraction on the
basis of microarray gene-expression data. Proc Natl Acad Sci U S A 99:
6562-6566.
Benjamini Y, Hochberg Y. (1995). Controlling the false discovery rate: a
practical and powerful approach to multiple testing. J Roy Stat Soc B 57:
289-300.
Campanini R, Dongiovanni D, Iampieri E, Lanconelli N, Masotti M, Palermo G et
al. (2004). A novel featureless approach to mass detection in digital
mammograms based on support vector machines. Phys Med Biol 49: 961-975.
Chen X, Leung SY, Yuen ST, Chu KM, Ji J, Li R et al. (2003). Variation in gene
expression patterns in human gastric cancers. Mol Biol Cell 14: 3208-3215.
Feng Y, Sun B, Li X, Zhang L, Niu Y, Xiao C et al. (2007). Differentially
expressed genes between primary cancer and paired lymph node metastases predict
clinical outcome of node-positive breast cancer patients. Breast Cancer Res
Treat 103: 319-329.
Hao X, Sun B, Hu L, Lahdesmaki H, Dunmire V, Feng Y et al. (2004). Differential
gene and protein expression in primary breast malignancies and their lymph node
metastases as revealed by combined cDNA microarray and tissue microarray
analysis. Cancer 100: 1110-1122.
Ivshina AV, George J, Senko O, Mow B, Putti TC, Smeds J et al. (2006). Genetic
reclassification of histologic grade delineates new clinical subtypes of breast
cancer. Cancer Res 66: 10292-10301.
Kohavi R. (1995). A study of cross-validation and bootstrap for accuracy
estimation and model selection. In: Proceedings of the 14th International Joint
Conference on Artificial Intelligence. Morgan Kaufmann: San Francisco, CA, pp
1137-1143.
Perou CM, Sorlie T, Eisen MB, van de Rijn M, Jeffrey SS, Rees CA et al. (2000).
Molecular portraits of human breast tumours. Nature 406: 747-752.
Ramaswamy S, Ross KN, Lander ES, Golub TR. (2003). A molecular signature of
metastasis in primary solid tumors. Nat Genet 33: 49-54.
Rugarli EI, Lutz B, Kuratani SC, Wawersik S, Borsani G, Ballabio A et al.
(1993). Expression pattern of the Kallmann syndrome gene in the olfactory
system suggests a role in neuronal targeting. Nat Genet 4: 19-26.
Vapnik V, Chapelle O. (2000). Bounds on error expectation for support vector
machines. Neural Comput 12: 2013-2036.
Weigelt B, Glas AM, Wessels LF, Witteveen AT, Peterse JL, van't Veer LJ.
(2003). Gene expression profiles of primary breast tumors maintained in distant
metastases. Proc Natl Acad Sci U S A 100: 15901-15905.
Weigelt B, Wessels LF, Bosma AJ, Glas AM, Nuyten DS, He YD et al. (2005). No
common denominator for breast cancer lymph node metastasis. Br J Cancer 93:
924-932.