Introduction

Here we describe the development of an oligo DNA microarray for gene expression analysis of Lotus japonicus. To obtain as many gene-specific probes as possible, the nucleotide sequences of transcripts (ESTs, full-length cDNAs, and tentative contigs) were aligned with the genome sequence, and the resulting sequences (source sequences) were used to design 60-mer probes.

Two microarray designs sharing the same probe set were created with the eArray system (Agilent Technologies).

Microarray Name    Design ID   Design Format   Probe Group Name       GEO Platform ID
Kazusa-001*        013809      1 X 22K         013809_Kazusa-001_1    GPL14827
4x44K_Kazusa-001   015205      4 X 44K         -                      GPL14826

* The Kazusa-001 microarray is no longer available from eArray.

Download

Probe Annotation Table (zip-compressed Microsoft Excel file, 2.33 MB)

Source Sequences for Probe Design (Multiple FASTA format, 7.60 MB).

Microarray Construction

The microarray was constructed using the eArray custom microarray design service (Agilent Technologies, Inc., Santa Clara, CA). The 3'-terminal reads with poly(A) tails (67887 sequences) were selected from the EST collections of the Miyakojima MG-20 strain (Endo et al., 2000; Asamizu et al., 2004). Full-length cDNA sequences were prepared from the Gifu B-129 strain (2409 sequences; GenBank accession numbers AK336297-AK338706). The tentative contigs of Lotus japonicus with "CDS: header" marked as "complete" (401 sequences) were selected from LjGI Release 3.0, which was constructed by The Institute for Genomic Research (TIGR) and is currently available from DFCI. These transcript sequences potentially containing 3'-terminal regions (70697 sequences in total) were then subjected to a BLAT search (Kent, 2002) against the pre-release version of the Lotus japonicus genome (Sato et al., 2008), by which clusters of the transcript sequences (hereafter referred to as EST clusters) were formed on the genome sequence. According to the number and location of the EST cluster hits, the predicted genes on the genome were categorized as follows (Fig. S1).

Fig. S1 Schematic representation of the types of the genome regions or sequences (enclosed in red) used for designing probe sequences.

1) A gene with a single EST cluster hit (GA, 6078 genes).

2) A gene with several EST cluster hits. The gene sequences were split into several regions manually (GB, 817 regions).

3) A set of genes with a single EST cluster hit. The genome regions were split or joined manually (GC, 16 regions).

4) A gene with no EST cluster hits (GD, 7025 genes).

The sequences of the predicted genes in GA to GD, together with their 3'-flanking sequences covered by EST clusters, were subjected to further processing.

We found some EST clusters on the genome that did not hit any predicted gene. The direction of the hypothetical transcript from each of these genome regions was determined by the majority direction of the transcripts in the EST cluster. For regions with ambiguous directions, both the forward and reverse sequences were adopted. The sequences of these regions were subjected to further processing (GE, 6750 sequences).
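The majority rule for EST cluster direction can be sketched as follows. This is a minimal illustration, not the procedure actually used: the function name and the '+' / '-' strand encoding are hypothetical, and "ambiguous" is assumed here to mean an exact tie, which the text does not specify.

```python
from collections import Counter

def cluster_orientations(member_strands):
    """Return the orientation(s) to adopt for one EST cluster.

    member_strands: iterable of '+' / '-' strings, one per transcript
    in the cluster (hypothetical encoding).
    """
    counts = Counter(member_strands)
    plus, minus = counts["+"], counts["-"]
    if plus > minus:
        return ["+"]
    if minus > plus:
        return ["-"]
    # Ambiguous direction (assumed to mean a tie): adopt both the
    # forward and the reverse sequence, as described for GE.
    return ["+", "-"]
```

For example, a cluster of two forward and one reverse transcript would be treated as forward, while a cluster with one of each would contribute both strands.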

The remaining transcript sequences, which did not produce any hits on the genome sequence, were assembled with Phrap (available at http://bozeman.mbt.washington.edu/), and the resulting contigs (EC) and singlets (ES) were used. The direction of each contig was judged from the directions of its member sequences, and contigs with ambiguous directions were discarded.

The sequences (GA to ES) were then sorted by length, and the 23109 sequences longer than 200 nucleotides were selected as the source sequences for probe design. For source sequences longer than 10000 nucleotides, the 5'-region was trimmed. The probe sequences were designed by Agilent, and probes were generated for 23052 sequences. Taking sequence redundancy (NSPM) and specificity (X-hyb) into consideration, we finally selected 21495 probes corresponding to the following source sequences: 4801 in GA, 817 in GB, 16 in GC, 6572 in GD, 5000 in GE, 1075 in EC, and 3214 in ES.
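The length filter and 5'-trimming described above can be sketched as follows, together with a check that the per-category probe counts add up to the stated total. The function name and input format are hypothetical. Two details are assumptions, since the text does not state them: sequences are sorted longest first, and trimming keeps the 3'-most 10000 nucleotides.

```python
MIN_LEN = 200     # source sequences must be longer than this
MAX_LEN = 10000   # longer sequences have their 5'-region trimmed

def select_source_sequences(seqs):
    """seqs: dict mapping a sequence name to its nucleotide string.

    Returns a dict of the selected (and, if needed, trimmed)
    sequences, ordered by original length, longest first (assumed).
    """
    selected = {}
    by_length = sorted(seqs.items(), key=lambda kv: len(kv[1]), reverse=True)
    for name, seq in by_length:
        if len(seq) <= MIN_LEN:
            continue  # not longer than 200 nt: dropped
        if len(seq) > MAX_LEN:
            # Assumption: keep only the 3'-most MAX_LEN nucleotides.
            seq = seq[-MAX_LEN:]
        selected[name] = seq
    return selected

# Per-category counts of the finally selected probes, as quoted above.
probe_counts = {"GA": 4801, "GB": 817, "GC": 16, "GD": 6572,
                "GE": 5000, "EC": 1075, "ES": 3214}
total = sum(probe_counts.values())
print(total)  # 21495
```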

Probe Annotations

The predicted genes detectable by the probes were re-calculated based on the published genome sequence (Sato et al., 2008) as follows.

1. The source sequences of the probes were subjected to a BLASTn search (Altschul et al., 1990) against the genome (build 1.0, available at http://www.kazusa.or.jp/lotus/release1/).

2. The top three matching sequences of the predicted genes with the highest bit score, sequence similarity > 95%, identical bases > 100, and e-value < 1e-10 were selected as the candidate target genes for each probe.

3. The candidate genes were evaluated as follows.

3-1. The hit region of the BLASTn search was extracted. If part of the hit region was placed outside of the 60-mer probe sequence, the whole region between the probe and the hit region was recognized as the extracted hit region, as shown below.

60-mer probe sequence              oooooo
probe design source sequence   ------------------------------------
                                       ||||||||||   <- BLASTn hit region
subject (genome sequence)      ------------------------------------
extracted hit region               oooooooooooooo

3-2. The predicted genes for which the sequence overlapped with the extracted hit region (3-1) were selected.

3-3. Among the predicted genes selected in 3-2, genes satisfying the following conditions were further selected as the target genes.

- The direction of the predicted gene was the same as that of the probe.

- The probe sequence overlapped with the coding region or the adjoining 500 bp of the 3'-flanking region of the predicted gene.

- The probe sequence overlapped with the BLASTn hit region.

- There was no description of "retro" or "pseudo" for the predicted gene.

* Probe sequences were not checked to ensure that they were derived only from exons.

4. A BLASTp search was conducted for the target genes selected in 3-3 against the protein sequences of Arabidopsis (TAIR8_pep, http://www.arabidopsis.org/), NCBI nr, and UniRef100 (EMBL). The three sequences showing the highest bit scores and having an e-value < 1e-30 were selected and recorded in the annotation table.
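Steps 2 and 3 can be sketched as follows. This is an illustrative reading of the criteria, not the code used to build the annotation table. All names (Hit, candidate_hits, extracted_hit_region, is_target) are hypothetical. The sketch assumes that the thresholds are applied before the top three hits are taken, that all intervals are inclusive (start, end) pairs in one shared coordinate system, and that "retro" and "pseudo" are matched case-insensitively.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    gene: str            # predicted gene hit by the source sequence
    bit_score: float
    pct_identity: float  # sequence similarity in percent
    identical: int       # number of identical bases
    evalue: float

def candidate_hits(hits, top_n=3):
    """Step 2: apply the thresholds, then keep the best hits by bit score."""
    passing = [h for h in hits
               if h.pct_identity > 95 and h.identical > 100 and h.evalue < 1e-10]
    return sorted(passing, key=lambda h: h.bit_score, reverse=True)[:top_n]

def extracted_hit_region(probe, hit):
    """Step 3-1: span covering both the probe and the BLASTn hit region."""
    return (min(probe[0], hit[0]), max(probe[1], hit[1]))

def overlaps(a, b):
    """True if two inclusive intervals share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]

def is_target(probe, probe_strand, hit, gene_span, gene_strand,
              description, flank=500):
    """Step 3-3: apply the four target-gene conditions."""
    if gene_strand != probe_strand:
        return False
    # Extend the coding region by 500 bp on its 3' side.
    if gene_strand == "+":
        extended = (gene_span[0], gene_span[1] + flank)
    else:
        extended = (gene_span[0] - flank, gene_span[1])
    if not overlaps(probe, extended):
        return False
    if not overlaps(probe, hit):
        return False
    text = description.lower()
    return "retro" not in text and "pseudo" not in text
```

In this sketch, step 3-2 would amount to calling overlaps() on the extracted hit region and each predicted gene span before is_target() is applied.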

 

We found that 386 of the 21495 probes were mistakenly designed from the reverse strand of the source sequence. These probes are marked as "reverse" in the "Reverse Check" column of the annotation table and should be excluded from data analyses.
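One way to exclude the flagged probes before analysis is sketched below. It assumes the annotation table has been exported from Excel to CSV. The "Reverse Check" column and its "reverse" value are taken from the description above, while the "Probe ID" column name and the example rows are hypothetical.

```python
import csv
import io

def drop_reverse_probes(rows):
    """Keep only rows not marked "reverse" in the "Reverse Check" column."""
    return [r for r in rows
            if r.get("Reverse Check", "").strip().lower() != "reverse"]

# Hypothetical three-row excerpt standing in for the exported table.
example = io.StringIO("Probe ID,Reverse Check\nP1,\nP2,reverse\nP3,\n")
kept = drop_reverse_probes(csv.DictReader(example))
print([r["Probe ID"] for r in kept])  # ['P1', 'P3']
```

Applied to the full table, this filter would leave 21495 - 386 = 21109 probes.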

References:

Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. 1990. Basic local alignment search tool. J Mol Biol 215: 403-410.

Asamizu, E., Nakamura, Y., Sato, S., and Tabata, S. 2004. Characteristics of the Lotus japonicus gene repertoire deduced from large-scale expressed sequence tag (EST) analysis. Plant Mol Biol 54: 405-414.

Barrett, T., Troup, D.B., Wilhite, S.E., Ledoux, P., Evangelista, C., Kim, I.F., Tomashevsky, M., Marshall, K.A., Phillippy, K.H., Sherman, P.M., Muertter, R.N., Holko, M., Ayanbule, O., Yefanov, A., and Soboleva, A. 2011. NCBI GEO: archive for functional genomics data sets--10 years on. Nucleic Acids Res 39: D1005-1010.

Endo, M., Kokubun, T., Takahata, Y., Higashitani, A., Tabata, S., and Watanabe, M. 2000. Analysis of expressed sequence tags of flower buds in Lotus japonicus. DNA Res 7: 213-216.

Kent, W.J. 2002. BLAT--the BLAST-like alignment tool. Genome Res 12: 656-664.

Sato, S., Nakamura, Y., Kaneko, T., Asamizu, E., Kato, T., Nakao, M., Sasamoto, S., Watanabe, A., Ono, A., Kawashima, K., Fujishiro, T., Katoh, M., Kohara, M., Kishida, Y., Minami, C., Nakayama, S., Nakazaki, N., Shimizu, Y., Shinpo, S., Takahashi, C., Wada, T., Yamada, M., Ohmido, N., Hayashi, M., Fukui, K., Baba, T., Nakamichi, T., Mori, H., and Tabata, S. 2008. Genome structure of the legume, Lotus japonicus. DNA Res 15: 227-239.

Contact Information

Nozomu Sakurai, e-mail: sakurai AT kazusa.or.jp