Deep sequencing-based expression analysis shows major advanc(2)

2018-12-17 14:28

Microarray analysis

The microarray analysis of the same samples as used for DGE is described in our previous paper (9). Microarray data are available through Gene Expression Omnibus (NCBI GEO) under series GSE8349.

Alignment to Ensembl transcripts

To enable comparison with microarray probes, all canonical sequence tags and microarray probe sequences were put in FASTA format and then aligned to the ENSEMBL mus_musculus_core_46_36g cDNA (transcript) database using the PERL API. The probe sequences on the Agilent (AGL: WMG G4122A), Illumina (ILL: Sentrix Mouse-6 Expression BeadChip) and home-spotted long oligonucleotide arrays (LGTC: 65-mer Sigma-Compugen mouse library, version 1) were provided by the manufacturer. For the Affymetrix chips (AFF: Mouse Genome 430 v2.0 Array), the sequence from the first probe in the probeset to the last probe in the probeset was taken. For the Applied Biosystems arrays (ABI: AB1700), only the surrounding 180 nt of the probe were given and these were used in the alignment. Microarray and Illumina DGE tag-profiling results were compared in pairs. Only ENSEMBL transcripts that were shared between the Illumina Genome Analyzer platform and a given microarray platform were considered.

Statistical analysis of differential gene expression

Initially, a Student's t-test was performed to determine significant differences in gene expression between the group of wild-type and transgenic samples. Before performing the t-test, we corrected for differences in the total number of counts by multiplying with a linear scaling factor that is defined as the total number of tags obtained for a certain sample divided by the average number of obtained tags in all samples. In addition, we stabilized the variance by applying a square root transformation on the linearly scaled data. This square root transformation gives a better stabilization of the variance in the region of low abundance than a logarithmic transformation. In addition, the square root transformation can handle observations with zero counts.

As a better suited alternative for the t-test, we applied a Bayesian model developed by Vencio et al. (10). We considered only canonical tags which had at least one count in each group. The model fits a probability density function per gene and per group, employing the Beta-Binomial distribution, and taking into account the number of observed tags in each sample and the library size (= total number of tags) for each sample. A Bayesian error rate is calculated that reflects the posterior probability that the probability density function of the group of wild types is in reality not different from the one of the transgenic mice. To estimate the number of false positives in the list of differentially expressed genes obtained by setting a cutoff on the maximum Bayesian error rate, we calculated the number of genes below the same Bayesian error rate in all unique permutations for the comparison of two groups, where the first group contained two wild-type and two transgenic mice, and the second group contained the other two wild-type and transgenic mice.
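The normalization and the permutation scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: the function names and the input layout are hypothetical, and the sketch divides each count by the scaling factor (library total divided by mean library total) so that all libraries are brought to a common size, which is an assumption about the intent of the scaling step. The Beta-Binomial model itself is not reproduced here.

```python
from itertools import combinations
from math import sqrt


def normalize(counts_by_sample):
    """Scale each library to the mean library size, then take square roots.

    counts_by_sample: dict of sample name -> dict of tag -> raw count
    (hypothetical layout). ASSUMPTION: counts are divided by the factor
    (library total / mean library total).
    """
    totals = {s: sum(tags.values()) for s, tags in counts_by_sample.items()}
    mean_total = sum(totals.values()) / len(totals)
    normalized = {}
    for sample, tags in counts_by_sample.items():
        factor = totals[sample] / mean_total  # scaling factor as defined in the text
        # The square root is defined for zero counts, unlike a logarithm.
        normalized[sample] = {tag: sqrt(n / factor) for tag, n in tags.items()}
    return normalized


def balanced_splits(wild_type, transgenic):
    """Enumerate unique two-group splits with 2 wild-type + 2 transgenic per group."""
    everyone = frozenset(wild_type) | frozenset(transgenic)
    splits = set()
    for wt_pair in combinations(wild_type, 2):
        for tg_pair in combinations(transgenic, 2):
            group1 = frozenset(wt_pair + tg_pair)
            group2 = everyone - group1
            # A split is unordered: swapping group1 and group2 gives the same split.
            splits.add(frozenset((group1, group2)))
    return splits


if __name__ == "__main__":
    splits = balanced_splits(["wt1", "wt2", "wt3", "wt4"], ["tg1", "tg2", "tg3", "tg4"])
    print(len(splits))
```

Under this reading of "unique permutations", four wild-type and four transgenic animals give 18 balanced splits. Rerunning the differential expression test on each split and counting the genes below the chosen error-rate cutoff would give an estimate of the number of false positives.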

Quantitative PCR analysis

The RNA samples used for the qPCR assays were the same as for the DGE experiments. cDNA was synthesized using the Transcriptor First Strand cDNA Synthesis Kit (Roche). Quantitative RT-PCRs (qPCRs) were done on the Lightcycler480 (Roche), with SYBR-Green detection or (when amplification efficiencies with SYBR-Green were below 90%) using the universal probe library (UPL, Roche). Each cDNA was analyzed in quadruplicate, after which the average threshold cycle (Ct) was calculated per sample. The relative expression levels were calculated with the 2^-ΔΔCt method, while using the average threshold cycles for all genes analyzed to correct for differences in cDNA input.
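As a rough illustration of this calculation, the sketch below averages replicate Ct values, uses the mean Ct over all genes in a sample as the input reference, and expresses each gene relative to a calibrator sample. It assumes a 2^-ΔΔCt style calculation with 100% amplification efficiency; the function name, the data layout and the choice of calibrator are hypothetical and not taken from the paper.

```python
from statistics import mean


def relative_expression(ct_replicates, calibrator):
    """Relative expression per sample and gene, 2^-ddCt style.

    ct_replicates: dict of sample -> dict of gene -> list of replicate Ct values.
    calibrator: name of the sample that every other sample is compared with
    (hypothetical choice, e.g. one wild-type sample).
    """
    # Average the replicate threshold cycles per sample and gene.
    avg_ct = {
        sample: {gene: mean(values) for gene, values in genes.items()}
        for sample, genes in ct_replicates.items()
    }
    # dCt: Ct of each gene minus the average Ct of all genes in that sample,
    # which corrects for differences in cDNA input.
    delta = {}
    for sample, genes in avg_ct.items():
        reference = mean(genes.values())
        delta[sample] = {gene: ct - reference for gene, ct in genes.items()}
    # ddCt: dCt of the sample minus dCt of the calibrator; expression = 2^-ddCt.
    return {
        sample: {gene: 2 ** -(d - delta[calibrator][gene]) for gene, d in genes.items()}
        for sample, genes in delta.items()
    }
```

Because the reference is computed within each sample, adding a constant to all Ct values of a sample (for instance from a lower cDNA input) leaves the result unchanged.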

Biological pathway analysis

The global test (11) (available from Bioconductor: www.bioconductor.org) was used to test which Gene Ontology (GO)-defined pathways were significantly deregulated in DCLK compared to wild-type mice. After summarization of the tags for each Entrez Gene entry, the global test was run on the scaled and square root transformed data. The asymptotic method was used to calculate the P-values. Additional filtering of pathways was done on the median of the z-scores for each gene in the pathway (median should be >1.5), to retrieve only those pathways for which the majority of genes contribute to the significance of the pathways.
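The median filter is simple enough to sketch. The example below assumes the per-gene z-scores have already been obtained from the global test output; the function name, the input layout and the example values are hypothetical.

```python
from statistics import median


def filter_pathways(pathway_z_scores, min_median=1.5):
    """Keep pathways whose median per-gene z-score exceeds min_median.

    pathway_z_scores: dict of pathway name -> list of per-gene z-scores
    (hypothetical layout). Pathways without genes are dropped.
    """
    return {
        name: scores
        for name, scores in pathway_z_scores.items()
        if scores and median(scores) > min_median
    }


if __name__ == "__main__":
    example = {
        "pathway_A": [2.0, 1.8, 0.1],  # most genes contribute
        "pathway_B": [5.0, 0.2, 0.3],  # driven by a single gene
    }
    print(sorted(filter_pathways(example)))
```

Using the median rather than the mean means that a pathway whose significance rests on one strongly deregulated gene, like pathway_B in the example, does not pass the filter.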

RESULTS

Sequencing statistics

We sequenced hippocampal DGE libraries from four individual wild-type and four individual DCLK transgenic mice. We obtained 2.4 ± 1.2 × 10^6 sequence reads per sample with 2.0 × 10^5 unique tag sequences. Figure 1 shows the distribution of the tags over the different classes that we discriminate (see ‘Materials and methods’ section and Supplementary Table 1). Canonical tags, i.e. those which map to the most 3' CATG site in high-confidence transcripts, account for 70% of the total number of reads. Since they account for only 20% of all unique tags, these appear to have an overall much higher abundance than tags corresponding to low-confidence transcripts (see also Supplementary Figure 2). Around 8% of the reads mapped to mitochondrial RNAs. The collective percentage of reads in repeat regions, regions with no evidence for transcription, and tags that could not be mapped to the genome was around 12%.

Figure 1. Categorization and abundance of tags. Distribution (in percentage of total) of unique tags (black bars) and individual reads (counts; open bars) over different categories (average from eight samples): high-confidence transcripts (canonical), low-confidence transcripts (noncanonical), mitochondrial RNA (mito), ribosomal RNA (ribo), genomic region with no evidence for transcription (just genome), repetitive genomic region (repeats) and tags with no hits in the genome.

Reproducibility

To evaluate the reproducibility of DGE across different laboratories, the same RNAs were pooled and a wild-type and a transgenic pool were analyzed in triplicate at a different site (Illumina Inc., Hayward, CA) using the same protocol. The Pearson correlation coefficients for the number of counts and the normalized (scaled and square root-transformed) number of counts across technical replicates in the same laboratory were >0.99. The correlations between the normalized number of counts from the summed individual samples in our laboratory and the pool analyzed in the other laboratory were 0.98 and 0.96 for wild-type and transgenic samples, respectively (plots in Supplementary Figure 3). This is indicative of low technical variability, even across different laboratories.
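For readers who want to repeat this kind of comparison on their own count tables, a Pearson correlation between two replicates can be computed as sketched below. The function name and the example counts are hypothetical; the square root step assumes the counts have already been scaled to a common library size.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


if __name__ == "__main__":
    # Hypothetical scaled counts for the same five tags in two replicates.
    replicate_1 = [5500, 120, 30, 2, 0]
    replicate_2 = [5300, 140, 25, 3, 1]
    print(pearson(replicate_1, replicate_2))
    # Correlation after the variance-stabilizing square root transformation.
    print(pearson([sqrt(c) for c in replicate_1], [sqrt(c) for c in replicate_2]))
```

On raw counts the coefficient is dominated by the few most abundant tags, which is one reason to also report it on the square root-transformed values.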

Dynamic range

The dynamic range of DGE is three to four orders of magnitude. The most abundant transcript, arising from the Ckb gene (brain isoform of creatine kinase), comprises 0.55% of all canonical tags [5.5 × 10^3 transcripts per million (t.p.m.)]. The lowest expressed transcripts which were still consistently detected in all samples had an abundance of 2 t.p.m., which corresponds to an average of 0.3 copies per cell (12). The hippocampus is a rich source of unique transcripts: 28 341 different canonical tags were detected in both wild-type and transgenic groups; including noncanonical mappings increases the number even further. Within the noncanonical group alone, 45 550 tags were identified in both groups.

Alternative polyadenylation

DGE is able to discriminate between transcripts with different 3'-ends when they are separated by at least one restriction site. A remarkable 47% of the detected ENSEMBL transcripts were detected by more than one tag. This is unlikely to be caused by partial digestion by the NlaIII enzyme, in which case the more abundant and the less abundant tag for the same transcript would be found at an approximately fixed ratio. In addition, the majority of tags had been identified before in LONG-SAGE libraries. Most likely, it is due to the use of alternative polyadenylation signals in the 3'-UTR. In addition, a small fraction may be explained by alternative cleavage site selection from the same polyadenylation site (13). The observed 47% alternative polyadenylation is much higher than the 29% estimated previously based on EST sequences (14). We note that the actual incidence may yet be higher, because 3'-ends downstream of the annotated ENSEMBL transcripts are not mapped to the transcript, and alternative polyadenylation sites with no CATG sites in-between are missed. On the other hand, we have only investigated the hippocampus, while this incidence may well vary between tissues.

Antisense transcription

By considering canonical and noncanonical tags with an abundance of >2 t.p.m., and employing the strand-specific nature of the sequencing reads obtained, we find evidence for bidirectional transcription in 51% of all detectable Unigene clusters. While confirming earlier observations of bidirectional transcription in the majority of genes (15–19), our results show that the antisense transcripts are also expressed at substantial levels. Although in most cases the sense transcripts have higher abundance than the antisense transcripts, the opposite is true in 11% of the cases (Supplementary Figure 4). The on-the-bead cDNA synthesis, together with the absence of a correlation between the abundance of sense and antisense transcripts (i.e. antisense tags are generally not more prominent in highly abundant transcripts), almost excludes the possibility that the antisense tags are found due to reverse transcriptase artifacts, as suggested previously (20).
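The counting behind such a percentage can be sketched as follows. This is a hypothetical illustration: it assumes each tag has already been assigned to a cluster and labelled as sense or antisense relative to the annotated gene, and it treats a cluster as detectable when at least one of its tags exceeds the abundance threshold. The function name and the tuple layout are not from the paper.

```python
def bidirectional_fraction(tags, min_tpm=2.0):
    """Fraction of detectable clusters with both sense and antisense tags.

    tags: iterable of (cluster_id, orientation, tpm) tuples, where orientation
    is "sense" or "antisense" (hypothetical layout).
    """
    orientations_seen = {}
    for cluster, orientation, tpm in tags:
        if tpm > min_tpm:  # only tags above the abundance threshold count
            orientations_seen.setdefault(cluster, set()).add(orientation)
    if not orientations_seen:
        return 0.0
    both = sum(
        1 for seen in orientations_seen.values() if {"sense", "antisense"} <= seen
    )
    return both / len(orientations_seen)


if __name__ == "__main__":
    example = [
        ("cluster1", "sense", 10.0),
        ("cluster1", "antisense", 3.0),
        ("cluster2", "sense", 50.0),
        ("cluster2", "antisense", 1.0),  # below threshold, ignored
        ("cluster3", "antisense", 5.0),
        ("cluster4", "sense", 0.5),      # cluster not detectable
    ]
    print(bidirectional_fraction(example))
```

In the example, three clusters are detectable and one of them has tags on both strands above the threshold, so the fraction is one third.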

Differentially expressed genes

As a first indication for subtle, yet significant differential gene expression between the two groups of mice, the intragroup Pearson correlations (among wild-type or transgenic samples) were higher (0.96) than the correlations between samples from the different experimental groups (0.93) (P-value: 0.056, permutation test, Supplementary Table 2).

A Fisher or similar 2 x 2 contingency table statistical test has previously been used to identify tags with significantly different abundance in two pooled SAGE libraries (21). In such experiments, biological variation between samples is not addressed. Our sequencing of the pooled samples clearly demonstrates the hazards of pooling. Table 1 shows tags that were highly significant in the pooled experiment (based on Fisher's test), and not significant when analyzing the individual samples (Student's t-test). Clearly, these tags originate from wild-type sample 1 only. Significant expression of the Mup1 transcript in wild-type sample 1 only was confirmed by qRT-PCR (Supplementary Table 3). Detailed study shows that all these transcripts are highly expressed in blood. Blood contamination of one of the samples, not noted during the tissue dissection procedure, thus leads to the false-positive identification of several differential transcripts in the pooled experiment. While sequencing of pooled SAGE libraries was previously the only option, it is now both advisable and affordable to sequence individual samples.
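The effect of a single contaminated sample on the two tests can be reproduced with made-up numbers. The sketch below implements a two-sided Fisher exact test on pooled counts and a Student's t statistic on individual samples using only the Python standard library. All counts and library sizes are hypothetical, and the t statistic is computed on raw counts for brevity rather than on scaled, square root-transformed data.

```python
from math import exp, lgamma, sqrt
from statistics import mean, variance


def log_comb(n, k):
    """Natural logarithm of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact P-value for the 2 x 2 table [[a, b], [c, d]]."""
    row1, row2, col1, total = a + b, c + d, a + c, a + b + c + d

    def log_p(x):
        # Hypergeometric log-probability of observing x in the top-left cell.
        return log_comb(row1, x) + log_comb(row2, col1 - x) - log_comb(total, col1)

    observed = log_p(a)
    low, high = max(0, col1 - row2), min(row1, col1)
    # Sum all tables that are at most as probable as the observed one.
    p_value = sum(
        exp(log_p(x)) for x in range(low, high + 1) if log_p(x) <= observed + 1e-7
    )
    return min(1.0, p_value)


def student_t(x, y):
    """Student's t statistic for two groups, assuming equal variances."""
    nx, ny = len(x), len(y)
    pooled = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(x) - mean(y)) / sqrt(pooled * (1 / nx + 1 / ny))


if __name__ == "__main__":
    # Hypothetical counts for one tag; the first wild-type sample is contaminated.
    wild_type = [400, 2, 3, 1]
    transgenic = [2, 3, 1, 2]
    pool_size = 4_000_000  # hypothetical total tags per pool
    p_pooled = fisher_exact_two_sided(
        sum(wild_type), pool_size - sum(wild_type),
        sum(transgenic), pool_size - sum(transgenic),
    )
    print(p_pooled)
    print(student_t(wild_type, transgenic))
```

With these numbers the pooled Fisher test returns a vanishingly small P-value, while the t statistic on the individual samples stays well below 2.447, the two-sided 5% critical value for 6 degrees of freedom, because the within-group variance reflects the outlying sample.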

Table 1. Counts for blood-derived transcripts including P-values from Fisher test and Student's t-test

Since we sequenced multiple libraries from individual samples, we can estimate within- and between-group variation. Initially, we used a
