Statistical analysis of microarray data - Data formats
A very tricky issue with microarrays is the multiplicity of data formats and the weakness of their definitions. In this new field, there is a tendency to define a new ad hoc data format for each new piece of software. Some programs, like TIGR MeV, accept multiple input formats. Others only accept their own custom format.
The alternative formats are usually not very different from each other: they mainly differ in the order of the columns, the convention for adding comments, and the auxiliary information attached to each spot (gene).
There are basically two families of files: single-chip files and multiple-chip files.
Single chip files
Each file contains detailed information about a single chip (experiment), with one row per spot (gene) and one column per criterion. The different columns typically contain information about:
- Green channel intensity
- Red channel intensity
- Green channel background
- Red channel background
- Spot number
- Position of the spot on the slide (column, row, block, ...)
Additional columns are found in the more elaborate formats (GenePix, spot, TIGR MeV).
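As an illustration of the layout described above, here is a minimal Python sketch that reads a tab-delimited single-chip file and derives a background-corrected log-ratio for each spot. The column names (spot, green, red, green_bg, red_bg, ...) and the sample values are hypothetical: each real format (GenePix, spot, mev) uses its own headers and column order.

```python
import csv
import io
import math

# Hypothetical single-chip content: one row per spot, one column per criterion.
# The header names are placeholders, not those of any real format.
SAMPLE = (
    "spot\tblock\trow\tcolumn\tgreen\tred\tgreen_bg\tred_bg\n"
    "1\t1\t1\t1\t1200\t2500\t200\t100\n"
    "2\t1\t1\t2\t800\t700\t200\t100\n"
)


def read_single_chip(handle):
    """Return one dict per spot with background-corrected intensities."""
    spots = []
    for row in csv.DictReader(handle, delimiter="\t"):
        green = float(row["green"]) - float(row["green_bg"])
        red = float(row["red"]) - float(row["red_bg"])
        # The log-ratio is undefined when a corrected intensity is not positive.
        log_ratio = math.log2(red / green) if green > 0 and red > 0 else None
        spots.append(
            {"spot": row["spot"], "green": green, "red": red, "log_ratio": log_ratio}
        )
    return spots


spots = read_single_chip(io.StringIO(SAMPLE))
```

In practice, `io.StringIO(SAMPLE)` would be replaced by an open file handle, and the column names adapted to the format at hand.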
Multiple chip files
These files combine information from multiple chips (each chip can represent a given experiment, sample, tissue type, patient type, time point, ...). Typically, a multiple chip file contains one row per spot (gene) and one column per chip (plus a few columns with a description of the gene and some additional parameters). The information in the experiment columns is usually restricted to Red/Green ratios or log(ratios).
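The following Python sketch shows one way such a file could be assembled from per-chip log-ratios. The chip names, spot IDs and values are invented for illustration, and the choice of an empty cell for a spot missing from a chip is an assumption, not a rule of any particular format.

```python
import csv
import io

# Hypothetical input: for each chip, a mapping from spot ID to log2(red/green).
chips = {
    "chip_A": {"g1": 1.2, "g2": -0.4, "g3": 0.0},
    "chip_B": {"g1": 0.9, "g2": -0.1},
}


def write_multi_chip(chips, handle, missing=""):
    """Write one row per spot and one column per chip, tab-delimited."""
    chip_names = list(chips)
    # Union of all spot IDs, so that a spot absent from one chip is still listed.
    spot_ids = sorted({spot for values in chips.values() for spot in values})
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["ID"] + chip_names)
    for spot in spot_ids:
        writer.writerow([spot] + [chips[name].get(spot, missing) for name in chip_names])


buffer = io.StringIO()
write_multi_chip(chips, buffer)
table = buffer.getvalue()
```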
Multiple chip files are convenient for managing time series or large mutant collections. However, since the information is restricted to one value per spot, normalization procedures that rely on the separate channel intensities (e.g. Lowess) can no longer be applied.
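To make this limitation concrete: intensity-dependent normalization, as it is commonly applied to two-colour arrays, fits the log-ratio M as a function of the mean log-intensity A. The Python sketch below only computes M and A (it does not perform the Lowess fit itself), and the intensities are hypothetical. It shows that two spots can share the same ratio while having very different intensities, a distinction that is lost once only the ratio is stored.

```python
import math


def ma_values(red, green):
    """Return (M, A) for one spot.

    M = log2(red / green) is the log-ratio.
    A = mean of log2(red) and log2(green) is the average log-intensity.
    """
    m = math.log2(red) - math.log2(green)
    a = 0.5 * (math.log2(red) + math.log2(green))
    return m, a


# Two hypothetical spots with the same ratio but very different intensities.
weak = ma_values(red=200.0, green=100.0)
strong = ma_values(red=20000.0, green=10000.0)
```

Both spots have the same M, but A can only be computed from a single-chip file, where the two channel intensities are still available.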
Some popular data formats
The table below provides links to formal descriptions of some widely used data formats.
Program | Format | File extension | Exp. | URL and comments
TIGR MIDAS | tav | tav | single | http://www.tigr.org/software/tm4/menu/midas.pdf This format is poorly structured: the first 6 columns are defined, but the next columns can be filled in various ways. In addition, there is no header to indicate which column contains what; this has to be specified in the Preferences of the programs MIDAS and MeV. Tricky.
TIGR MeV | mev | mev | single | http://www.tigr.org/software/tm4/menu/mev.pdf This is an update to the .tav format. Comments are supported (rows starting with #). A header row indicates the column content.
TIGR MeV | Eisen's Cluster, Stanford | txt, tab | multiple | Tab-delimited text file. One row per spot (gene), one column per chip (experiment). The first column contains the ID of each spot. Additional columns can optionally be used to give additional information (gene name, description, spot quality, ...). This format is convenient for treating a large number of experiments together, but it contains much less information than the single-chip formats, since each measurement is restricted to a single value (log-ratio for cDNA arrays, intensity ratio for Affymetrix chips).
GenePix | GenePix result | gpr | single | http://www.axon.com/gn_GenePix_File_Formats.html#gpr
spot | spot | spot | single | http://www.cmis.csiro.au/iap/Spot/spotmanual.htm This format is self-contained: a header defines the content of each column. This is also the format used by the R package SMA.
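Two conventions mentioned in the table, comment rows starting with # and a header row naming the columns, are easy to handle generically. The Python sketch below reads such a tab-delimited file. The sample content and its column names are hypothetical placeholders and do not reproduce the official mev or spot headers.

```python
import csv
import io

# Hypothetical file content: comment rows, then a header row, then one row per spot.
SAMPLE = (
    "# chip: example slide\n"
    "# created for illustration\n"
    "id\trow\tcolumn\tintensity_a\tintensity_b\n"
    "s1\t1\t1\t1500\t900\n"
    "s2\t1\t2\t400\t650\n"
)


def read_commented_table(handle, comment="#"):
    """Skip comment rows, then use the first remaining row as the header."""
    data_lines = (line for line in handle if not line.startswith(comment))
    return list(csv.DictReader(data_lines, delimiter="\t"))


rows = read_commented_table(io.StringIO(SAMPLE))
```

A headerless format such as tav cannot be read this way: the column meaning would have to be supplied separately, which is exactly what the Preferences of MIDAS and MeV are used for.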
Interconversions between expression files
TIGR developed ExpressConverter, a program for converting expression data from different file formats (GenePix, ImaGene, ScanArray) to the two formats supported by the TIGR TM4 suite (tav and mev).
Unfortunately, this program does not support conversions in the opposite direction. We did not find any program to convert spot (the R format) to other formats, or to import tav/mev files into R. In summary, developers seem more motivated to write import routines than export routines. Any additional information on converters is welcome, and can be sent to Jacques van Helden (jvanheld@bigre.ulb.ac.be).
We summarize below the types of conversions that are possible with ExpressConverter and R.
From | To | Program | Method
GenePix, ImaGene, ScanArray | tav, mev | ExpressConverter | Graphical user interface
GenePix | spot | R | read.genepix(), write.spot()