Quality Control for spotted arrays

April 19, 2007
Agnes Paquet1, Andrea Barczak1, (Jean) Yee Hwa Yang2

1. Department of Medicine, Functional Genomics Core Facility, University of California, San Francisco
paquetagnes@yahoo.com
2. School of Mathematics and Statistics, University of Sydney, Australia
http://arrays.ucsf.edu/analysis/arrayquality.html

1. Introduction to arrayQuality

ArrayQuality is an R package, available as part of Bioconductor, designed to help assess the quality of spotted array experiments at several stages of the microarray lifecycle. It provides reports containing several plots and statistical measures that can help you determine whether your hybridizations and slides are of good quality. More information about Bioconductor is available at http://www.bioconductor.org.

This guide provides an introduction to microarray quality and a description of the main functionalities of the package. A full description of the package is given by the individual function help documents available from the R online help system. To access the online help, type help(package=arrayQuality) at the R prompt or else start the html help system using help.start() or the Windows drop-down help menu.

2. Installing arrayQuality

Requirements

ArrayQuality is a library for the R project, part of Bioconductor. You will need to have R installed on your computer before installing arrayQuality. For more information about R, see the R project at http://www.r-project.org. ArrayQuality can work on different files at the same time, ONLY if they are from the SAME print-run (same GAL file). If you want to generate quality reports of slides from different print-runs, you need to place them in different folders, one for each print-run.

Installing arrayQuality

ArrayQuality can be installed either from Bioconductor or from the Functional Genomics Core Facility web site at http://arrays.ucsf.edu/software. The version from Bioconductor is updated every 6 months. If you would like to use a more recent version, you can obtain the latest one from http://arrays.ucsf.edu/software or from the developmental version of Bioconductor.



3. Quick starting guide to arrayQuality

General hybridization quality

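This component of the package is aimed at verifying the performance of your hybridization, given the good quality of the slide, before any preprocessing steps or further quality assessment on individual spots. Our package provides two kinds of quality control plots: a qualitative diagnostic plot, and a quantitative comparative boxplot that compares your slide against a collection of slides of good quality.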

Diagnostic plots can be generated directly from the output of the GenePix (.gpr files), Spot (.spot files) and Agilent image processing software packages. Most arguments can be customized to match your own data, for example which probes are used as controls, or which columns of the image processing output file are used to define your spot types. You can also specify your own collection of good quality slides. For more details, please refer to other sections of this manual.

To generate quality plots from image processing output files

We provide 3 main functions to generate quality plots: gpQuality(), spotQuality() and agQuality(). We will use gpQuality as an example, but the following can be directly applied to spotQuality or agQuality.

  1. Create a directory and move the image processing output files (e.g. .gpr files) of the slides of interest to this directory. Make sure that all files in the directory come from the SAME print-run (same GAL file).


  2. Start R, and change R working directory to the one you have just created. In the R menu, select File, then click on “Change dir…”. Browse to your directory from the pop-up window, or enter it manually, and click OK. To double check that you are in the correct directory: in the File menu, click on “Display file(s)…”.


  3. To load the package in your R session: type
    library(arrayQuality)

    If needed, you may have to install other required packages like marray, limma, convert and hexbin.


  4. To generate both diagnostic plots and comparative boxplots on all files in the directory, type:
    result <- gpQuality(organism="Mm")


  5. To generate diagnostic plots only, run:
    result <- gpQuality(organism="Mm", compBoxplot=FALSE)
    In this case, quantitative quality measures will not be calculated and the HTML report will not be generated.


  6. To write down your quantitative quality measures and your normalized data to a file: set output = TRUE when calling gpQuality:
    result <- gpQuality(organism="Mm", output=TRUE)


  7. By default, arrayQuality uses print-tip loess normalization. If you prefer to use another method, you can specify it in the norm argument:
    result <- gpQuality(norm="none")
    For more details about normalization methods, please refer to marray package help.
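
Steps 1 and 2 above can also be carried out from the R prompt instead of the menus. The following is a minimal sketch using base R only. The directory and file names are hypothetical, and a temporary directory stands in for your own folder so that the example runs anywhere.

```r
# Hypothetical directory holding the .gpr files of ONE print-run;
# a temporary directory stands in for it here.
gprdir <- file.path(tempdir(), "printrun_12Mm")
dir.create(gprdir, showWarnings = FALSE)

# Placeholder files standing in for real image processing output.
file.create(file.path(gprdir, c("slide1.gpr", "slide2.gpr", "notes.txt")))

# Equivalent of File > "Change dir..." in the R menu.
oldwd <- setwd(gprdir)

# Equivalent of File > "Display file(s)...": list only the .gpr files.
gprfiles <- list.files(pattern = "\\.gpr$")
print(gprfiles)

# Restore the previous working directory.
setwd(oldwd)
```

Checking the file list before calling gpQuality() is a simple way to confirm that only files from the intended print-run are in the directory.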


To generate quality plots from marrayRaw or RGList objects

You can use the function maQualityPlots() to generate diagnostic plots directly from your R object.

Results

gpQuality, spotQuality and agQuality output:

Print-run quality control

This component of arrayQuality provides diagnostic plots for 9-mers hybridization and Quality Control hybridization.

9-mers analysis: PRv9mers()

This type of hybridization, which we term “9mers hyb”, uses small oligonucleotides (random 9-mers) which will hybridize to each probe on the arrays. This will help to determine the quality of spot morphology as well as the presence or absence of spotted oligonucleotides. The resulting data will be used to create a list of all missing spots.
In the package, the graphical function to assess 9mers hybridization quality is PRv9mers(). It runs using one single command line script.

To generate diagnostic plots

  1. Copy all 9-mers hybridizations gpr files from the SAME print-run (same GAL file) to a directory.


  2. Start R and change R working directory to the one containing your gpr files (see above for more details).


  3. Load the package in your R session:
    library(arrayQuality)

    If needed, you may have to install other required packages like marray, limma, convert and hexbin.


  4. To generate diagnostic plots, type:
    PRv9mers(prname="12Mm")
    The prname argument represents the name of your print-run. For more details about other arguments, please refer to the online manual.


Results

PRv9mers() provides the following results:
  1. Diagnostic plots as image in .png format for each tested slide
  2. An Excel file (typically named 9Mm9mer.xls, where 9Mm is the name of your print-run, as passed to prname) containing for each spot on the slide:
  3. An Excel file (typically named 9MmMissing.xls, where 9Mm is the name of your print-run, as passed to prname) containing information on missing probes only:
  4. A text file (typically named 9MmQuickList.txt, where 9Mm is the name of your print-run, as passed to prname) containing the missing probe ids, each on a separate line. This file can be opened in any word processing program, and can also be used as a "quick list" in Acuity software (http://www.axon.com/gn_Acuity.html).


Quality Control hybridization: PRvQCHyb()

9-mers hybridizations help verify that oligonucleotides have been spotted properly on the slides. The next print-run quality control step will be:
  1. Detect any difference in overall signal intensity compared to other print-runs
  2. Check if the GAL file was generated properly, i.e. check that no error was made with ordering or orientation of the plates during the print.
  3. Reproducibility:
    A good way to verify the quality of a new print is to hybridize known samples to new slides. Then, we can compare signal intensity from the new slides to existing data, and check that there is no loss in signal. Log ratios (M) for known samples should be similar across print-runs. Examples of samples used for QCHybs include Human Reference pool, Mouse liver and Mouse lung, with dye swaps.

To generate diagnostic plots

  1. Create a directory and move the image processing output files (.gpr files only) of the slides of interest to this directory. Make sure that all files in the directory come from the SAME print-run (same GAL file).


  2. Start R, and change R working directory to the one you have created (see general hybridization paragraph above for more details)


  3. Load the package in your R session:
    library(arrayQuality)

    If needed, you may have to install other required packages like marray, limma, convert and hexbin.


  4. For QCHyb analysis, run the following command:

    PRvQCHyb()

Results

PRvQCHyb() returns a diagnostic plot as an image in .png format for each tested slide.

Introduction to microarray quality

A microarray experiment is composed of several steps, including experimental design, sample preparation, and various statistical analyses (figure 1). They are represented in the microarray lifecycle below. As microarray technology is complex and sensitive, it is important to assess the performance of each step before going to the next one. In addition, this is also a good way to trace back the cycle to understand potential causes for upstream problems.

Figure 1: Microarray experiment lifecycle

For spotted array experiments, quality controls can be summarized into 4 steps:
  1. Print quality
  2. mRNA quality
  3. Array hybridization quality
  4. Spot quality
Each step must be performed in a sequential order, as represented in Figure 2.


Figure 2: Quality Control for spotted arrays experiment

Our package provides graphical tools to look at two of these components: print-run quality and array hybridization quality.
  1. Print quality:

    This component is highly tailored to the Shared Genomics Core Facility at UCSF, but the framework can be adapted to other Core facilities or laboratories printing their arrays. It is an essential component of a printed array experiment, as any print pin, probe or slide surface defect will affect the quality of hybridization to the slide, and this can’t be fixed by statistics. Only prints that did pass the quality control check will be used for actual hybridization.


  2. Hybridization quality:

    This is a global assessment of the hybridization performance. It helps determine for example any problem with the dyes, or uneven hybridization. Then, once you have determined that your hybridization is good, you can look at each individual spot quality, remove bad spots, and perform statistical analysis.


3. Print-run quality control

When a print-run is completed, it is necessary to verify the quality of the resulting arrays. This can be done by using two kinds of hybridization to the new slides. The first type of hybridization, which we term “9mers hyb”, uses small oligonucleotides (random 9-mers), which will hybridize to each probe. This hybridization will help to determine the quality of spot morphology as well as the presence or absence of spotted oligonucleotides. The resulting data will be used to create a list of all missing spots.

The second type of hybridization, which we will term Quality Control Hybridization (QCHyb), uses mRNA from predefined cell lines (e.g. liver vs. pool, K562 vs. Human Universal Reference pool from Stratagene). These hybridizations can be used as a more quantitative description of the slides. The same comparison hybridizations are done for different print-runs, to assess their reproducibility. QCHybs are also used to verify the accuracy of GAL files, the number of missing spots, binding capacity, background signal intensity, etc.

The arrayQuality package provides specific tools to help assess quality of slides for both 9-mers and QC hybridization.

3.1 9-mers hybridizations

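In the package, the graphical function to assess 9mers hybridization quality is PRv9mers(). It runs using one single command line script. To use it, follow the steps given in the quick starting guide and call PRv9mers() with the name of your print-run, for example PRv9mers(prname="12Mm").
The prname argument represents the name of your print-run. For more details about other arguments, please refer to the online manual.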

Results

PRv9mers() provides the following results:
  1. Diagnostic plots as image in .png format for each tested slide


  2. An Excel file (typically named 9Mm9mer.xls, where 9Mm is the name of your print-run, as passed to prname) containing for each spot on the slide:
  3. An Excel file (typically named 9MmMissing.xls, where 9Mm is the name of your print-run, as passed to prname) containing information on missing probes only:



  4. A text file (typically named 9MmQuickList.txt, where 9Mm is the name of your print-run, as passed to prname) containing the missing probes ids, each on a separate line. This file can be opened in any word processing program.
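
Because the quick list holds one probe id per line, it can be read back into R with readLines(). The sketch below is an illustration in base R only: the file is created on the fly with made-up ids, and the probe ids of the slide are hypothetical. It shows one possible use, building a vector of spot weights that is 0 for missing spots.

```r
# Stand-in for a quick list file written by PRv9mers(); the ids are made up.
quicklist <- file.path(tempdir(), "12MmQuickList.txt")
writeLines(c("M200001", "M200007"), quicklist)

# One missing probe id per line.
missing_ids <- readLines(quicklist)

# Hypothetical probe ids for the spots of a slide.
probe_ids <- c("M200001", "M200003", "M200007", "M200009")

# Weight 0 for missing spots, 1 for all other spots.
w <- ifelse(probe_ids %in% missing_ids, 0, 1)
names(w) <- probe_ids
print(w)
```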

Description of the diagnostic plots

Figure 3 shows an example from a typical 9-mers hybridization. This image is divided in 5 plots.

  1. The first column (left) represents boxplots of log intensity, by plates (top) and by print-tip group (bottom). In this example, you will notice on the boxplot by plates (top left corner) that plates 44 and 48 have lower intensity and wider range than the others. Both plates contain mostly empty controls, as designed by Operon.
  2. Central plot: spatial plot of intensity. This helps to locate missing spots. The color scale reflects the signal intensity: the darker the color, the stronger the signal. Missing spots are represented in white. In the spatial plot of Figure 3, the white spots in the top right corner come from the empty spots.
  3. Right column: Density plot of the foreground and background log intensity.
    1. Foreground density plot: it should be composed of 2 peaks. A smaller peak in the low intensity region containing missing spots and negative control spots, and a higher one representing the rest of the spots (probes). The number of present and absent spots, excluding empty controls, estimated by EM algorithm is indicated on the graph.
    2. Background density plot: one peak in the low intensity region. If a slide is of good quality, the background peak should not overlap too much with the foreground peak corresponding to the bulk of the data.
Density plots are used to compare foreground and background peaks, using the X-axis scale. They should be clearly separated. The number of missing spots should be low. Missing spots ids may be incorporated in the analysis later, e.g. by down weighting them in linear models.
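
To make the idea of separating the two foreground peaks concrete, here is a minimal sketch of a two-component Gaussian mixture fitted by EM on simulated log intensities. It is an illustration only and is not the code used by PRv9mers(): the function name em_two_gaussians, the starting values and the simulated data are all assumptions made for this example.

```r
# Two-component Gaussian mixture fitted by EM (illustrative sketch).
em_two_gaussians <- function(x, n_iter = 100) {
  # Start one mean at the minimum (low peak) and one at the median (bulk).
  mu <- c(min(x), median(x))
  sigma <- rep(sd(x), 2)
  p <- c(0.5, 0.5)
  for (i in seq_len(n_iter)) {
    # E-step: posterior probability that each spot belongs to each peak.
    d1 <- p[1] * dnorm(x, mu[1], sigma[1])
    d2 <- p[2] * dnorm(x, mu[2], sigma[2])
    r1 <- d1 / (d1 + d2)
    r2 <- 1 - r1
    # M-step: update mixing proportions, means and standard deviations.
    p <- c(mean(r1), mean(r2))
    mu <- c(sum(r1 * x) / sum(r1), sum(r2 * x) / sum(r2))
    sigma <- c(sqrt(sum(r1 * (x - mu[1])^2) / sum(r1)),
               sqrt(sum(r2 * (x - mu[2])^2) / sum(r2)))
  }
  # A spot is called absent when the low-intensity peak is more likely.
  list(p = p, mu = mu, sigma = sigma, absent = r1 > 0.5)
}

set.seed(1)
# Simulated log2 foreground: 50 absent spots near 7, 950 present spots near 12.
logF <- c(rnorm(50, mean = 7, sd = 0.3), rnorm(950, mean = 12, sd = 1))

fit <- em_two_gaussians(logF)
n_absent <- sum(fit$absent)
n_present <- sum(!fit$absent)
cat("absent:", n_absent, " present:", n_present, "\n")
```

When the two peaks are well separated, as in this simulation, the estimated number of absent spots is close to the number of spots simulated in the low peak. Overlapping peaks make the split less reliable, which is why the density plots should show clearly separated peaks.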

Examples

This example uses 9-mer hybridization data performed in the Functional Genomics Core Facility in UCSF. This print-run was created using Operon Version 2 Mouse oligonucleotides.

> library(arrayQuality)
> datadir <- system.file("gprQCData", package="arrayQuality")
> PRv9mers(fnames="12Mm250.gpr",path=datadir, prname="12Mm")



Figure 3: Example of diagnostic plot for 9-mers hybridization

3.2 Quality Control hybridizations 


9-mers hybridizations help verify that oligonucleotides have been spotted properly on the slides. The next print-run quality control step will be:

 

  1. Detect any difference in overall signal intensity compared to other print-runs
    a. 70-mer oligonucleotide hybridizations
    b. Selection of several test slides to ensure that the same quantity of material was spotted across the platter, as a print-run will generate 255 slides using the same well for one probe. QCHybs are performed using one slide from the beginning of the print, one from the middle and one from the end (e.g. numbers 20, 100 and 255 in the Functional Genomics Core Facility).

  2. Check if the GAL file was generated properly, i.e. check that no error was made with ordering or orientation of the plates during the print.

  3. Reproducibility:
    A good way to verify the quality of a new print is to hybridize known samples to new slides. Then, we can compare signal intensity from the new slides to existing data, and check that there is no loss in signal. Log ratios (M) for known samples should be similar across print-runs. Examples of samples used for QCHybs include Human Reference pool, Mouse liver and Mouse lung, with dye swaps.

The function in the package which performs the quality assessment for QCHybs is PRvQCHyb().

- Copy the QCHybs gpr files from the SAME print-run (same GAL file) to a directory.

- Change the R working directory to the one containing your gpr files, as described in the quick starting guide.

- Type:

> PRvQCHyb(prname="9Mm")

where prname is the name of the print-run. For more details about its arguments, please refer to the online manual.
 

Results 

PRvQCHyb() returns a diagnostic plot as an image in .png format for each tested slide.

Throughout our document, we will be using the color code described in Table 1 to highlight control spots.

Spot type            Color
Positive controls    Red
Empty controls       Blue
Negative controls    Navy Blue
Probes               Green
Missing spots        White

Table 1: Color code used in arrayQuality


Restrictions:

Currently, PRvQCHyb() supports Mouse genome (Mm) only. We will add Human data as soon as it becomes available.

 

Description of the diagnostic plots

Figure 4 shows an example of a nice print-run QCHyb.

  1. MA-plot of raw M values. No background subtraction is performed. The colored lines represent the loess curves for each print-tip group. The red dots highlight any spot with a corresponding weighted value less than 0. Users can create their own weighting scheme or function. Things to look for in an MA-plot are saturation of spots and the trend of the loess curves, which is an indicator of the amount of normalization to be performed.

  2. Boxplot of raw M values by print-tip group, without background subtraction.

  3. Spatial plot of rank of raw M values (no background subtraction): Each spot is ranked according to its M value. We use a blue to yellow color scale, where blue represents the highest rank (1), and yellow represents the lowest one. Missing spots are represented as white squares.

  4. Spatial plot of A values. The color indicates the strength of the signal intensity, i.e. the darker the color, the stronger the signal. Missing spots are represented in white.

  5. Histogram of the signal-to-noise log-ratio (SNR) for Cy5 and Cy3 channels. The mean and the variance of the signal are printed on top of the histogram. In addition, overlay densities of SNR stratified by control type (status) are highlighted. Their color schemes are provided in Table 1. The SNR is a good indicator for dye problems. The density lines of the negative controls and of the empty controls should be close to each other, almost superimposed.

  6. Comparison of M values of probes known to be differentially expressed from the tested array to average M values obtained during previous hybridizations. This plot is aimed at verifying the reproducibility of print-runs. The dotted lines are the diagonal (no change) and the +2/-2 fold change lines. Each probe is represented by a number, and described in the file MmDEGenes.xls. Most of the spots should lie between the +2/-2 fold-change lines. If the technique was perfect, you would see a straight line on the diagonal. If any probe falls outside this region (number 29 here), you can look up its number in our probe list in MmDEGenes.xls and get more information about it.

  7. Dot plot of controls A values, without background subtraction. Controls with more than 3 replicates are represented on the Y-axis; the color scheme is given in Table 1. Intensity of positive controls should be in the high-intensity region, and negative and empty controls should be in the lower intensity region. The range of positive controls and the range of negative/empty controls should be separated. Signal from replicate spots should be tight.
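
The comparison of M values against previous print-runs can be sketched in a few lines of base R. This is only an illustration with made-up numbers. It assumes that M is a log2 ratio, so that the +2/-2 fold-change lines lie 1 unit above and below the diagonal; the variable names are hypothetical and do not come from the package.

```r
# Hypothetical average M values from previous print-runs, and M values
# measured on the new slide, for five known differentially expressed probes.
M_ref <- c(2.1, -1.8, 3.0, -2.5, 1.2)
M_new <- c(2.3, -1.5, 1.6, -2.4, 1.0)

# Assumption: M is a log2 ratio, so a 2-fold change is a difference of 1.
outside <- abs(M_new - M_ref) > 1

# Probe numbers (as printed on the plot) falling outside the band.
which(outside)
```

Probes reported by which() are the ones to look up in the probe list.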

 

Example

Data for this example was provided by the Functional Genomics Core Facility in UCSF. We have tested slide number 137 from print-run 9Mm. This print-run uses Operon Version 2 Mouse oligos. Results are shown in Figure 4.
 

> library(arrayQuality)

> datadir <- system.file("gprQCData", package="arrayQuality")

> PRvQCHyb(fnames="9Mm137.gpr", path=datadir, prname="9Mm")


Figure 4: Diagnostic plot for print-run Quality Control hybridization



4. General hybridization quality


 

This component is aimed at verifying the performance of your hybridization, given the good quality of the slide, before any preprocessing steps or further quality assessment on individual spots. This is where you determine whether the quality of a hybridization is good enough for it to be included in your dataset. For example, you will need to remove any hybridization with very low SNR, or large spatial artifacts.

Our package provides two kinds of quality control plots. The first one is a qualitative quality control measurement as a diagnostic plot. It is a quick visual way to determine hybridization quality gathering information from several statistical tools. More details on individual diagnostic plots can be found in the vignette “marrayPlots” in the package marray. The second one is a more quantitative comparison of slide quality. We extract some statistical measures from the test slide and we compare them against results obtained for a collection of slides of “good quality” to assess the quality of the hybridization. This comparison is visualized through a comparative boxplot. Results are displayed in a HTML report. Figure 5 shows a screen shot of a typical HTML report. Users can click on each image to obtain a higher resolution plot.

Diagnostic plots can be generated  for different image processing software format: GenePix format files (.gpr files), Spot format files (.spot) and Agilent format files, or from marrayRaw or RGList objects. Most arguments can also be customized to match your own data: which probes are used as controls, which column of the image processing output file is used to define your spot types... You can also specify your own collection of good quality slides using the functions globalQuality and qualRefTable. For more details about these functions, please refer to the online help and the example at the end of this Section.


To generate quality plots from gpr files: gpQuality()

We provide 3 main functions to generate quality plots: gpQuality(), spotQuality() and agQuality(). We will use gpQuality as an example, but the following can be directly applied to spotQuality or agQuality. gpQuality() will generate both diagnostic plots and comparative boxplots. By default, it uses spot types from the Functional Genomics Core Facility in UCSF. To use your own spot types, please refer to the end of this Section.

- Copy the gpr files from the SAME print-run (same GAL file) to a directory.

- Change the R working directory to the one containing your gpr files, as described in the quick starting guide.

- To generate both diagnostic plots and comparative boxplots on all files in the directory, run:

> result <- gpQuality(organism="Mm")

- To generate diagnostic plots only, run:

> result <- gpQuality(organism="Mm", compBoxplot=FALSE)

In this case, quantitative quality measures will not be calculated and the HTML report will not be generated.

- To write your quantitative quality measures and your normalized data to a file, set output = TRUE when calling gpQuality:

> result <- gpQuality(organism="Mm", output=TRUE)

This command will create two files: quality.txt, which contains your quality measures, and NormalizedData.xls, which contains your normalized M values. If you have set compBoxplot = FALSE, quantitative quality measures are not calculated, so the quality.txt file will not be generated.
 


To generate quality plots from marrayRaw/RGList objects: maQualityPlots

This function can be used to obtain quality plots for data generated with other image processing software, like Spot for example. maQualityPlots() will generate diagnostic plots only. It uses the spot types defined when creating the R object. To learn more about how to read data into a marrayRaw or an RGList object, please refer to the vignettes of the marray or limma packages.

-          To generate diagnostic plots: if rawdata is your marrayRaw/RGList object, type:

> maQualityPlots(rawdata)


Results

gpQuality() outputs:

- two plots for each test slide (a diagnostic plot and a comparative boxplot)

- an HTML quality report

- a marrayRaw object describing all tested slides

- a quality measures matrix: this matrix contains all comparison measure values extracted for each test slide. Each column of the matrix represents a different slide.

For each slide, you will find on the report how many of your slide’s results are below the recommended range. If you want to specify a directory to store the results, you can do it by modifying the argument resdir accordingly. For more details about gpQuality arguments, please refer to the online manual.


Figure 5: Example of HTML report generated by gpQuality

maQualityPlots() output:

- one diagnostic plot for each test slide


Restrictions

gpQuality calls two key functions, maQualityPlots and qualBoxplot. qualBoxplot supports Mouse (Mm) and Human (Hs) genomes only. To generate quality plots for other genomes, you need to set gpQuality argument compBoxplot = FALSE. In this case, only the diagnostic plots will be generated.

 

Description of the diagnostic plots:

Figure 6 represents an example of a good hybridization diagnostic plot.

  1. MA-plot of raw M. No background subtraction is performed. The colored lines represent the loess curves for each print-tip group. The red dots highlight any spot with a corresponding weighted value less than 0. Users can create their own weighting scheme or function. Things to look for in an MA-plot are saturation of spots and the trend of the loess curves, which is an indicator of the amount of normalization to be performed.

  2. MA-plot of normalized data density. By default, print-tip loess normalization is used. Instead of the typical MA-plot, we have used the package "hexbin" to highlight the density of dots on the MA-plot. A light yellow color indicates a high density of dots, whereas blue color represents a lower density. This plot gives you information on the bulk of your data intensity (low/high signal).

  3. Spatial plot of rank of raw M values (no background subtraction): Each spot is ranked according to its M value. We use a blue to yellow color scale, where blue represents the highest rank (1), and yellow represents the lowest one. Missing spots are represented as white squares. This is a quick way to visually detect uneven hybridization and missing spots.

  4. Spatial plot of normalized M values ranks. By default, print-tip loess normalization is used. Each spot is ranked according to its M value. We use a blue to yellow color scale, where blue represents the highest rank (1), and yellow represents the lowest one. Missing spots are represented as white squares. In addition, flagged spots are highlighted by a black square. This type of graphical representation helps verify that normalization removed any spatial effects.

  5. Spatial plot of raw A values. The color indicates the strength of the signal intensity, i.e. the darker the color, the stronger the signal. Missing spots are represented in white.

  6. Histogram of the signal-to-noise log-ratio (SNR) for Cy5 and Cy3 channels. The mean and the variance of the signal are printed on top of the histogram. In addition, overlay densities of SNR stratified by control type (status) are highlighted. Their color schemes are provided in Table 1. The SNR is a good indicator for dye problems. The density lines of the negative and empty controls should be close to each other, almost superimposed.

  7. Dot plot of controls normalized M values. Controls with more than 3 replicates are represented on the Y-axis; the color scheme is given in Table 1. Controls M values should be tight and close to 0.

  8. Dot plot of controls A values, without background subtraction. Controls with more than 3 replicates are represented on the Y-axis; the color scheme is given in Table 1. Intensity of positive controls should be in the high-intensity region, and negative and empty controls should be in the lower intensity region. The range of positive controls and the range of negative/empty controls should be separated.
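
Most of the plots above are built from M values, A values and ranks of M. As a rough illustration, the sketch below computes them in base R from made-up foreground intensities, using M = log2(Cy5) - log2(Cy3) and A = [log2(Cy5) + log2(Cy3)] / 2. It assumes that rank 1 is given to the largest M value; this is an assumption for the example, not a statement about the package internals.

```r
# Hypothetical median foreground intensities for six spots.
R <- c(1024, 512, 4096, 64, 256, 2048)   # Cy5
G <- c(512, 512, 1024, 256, 512, 128)    # Cy3

M <- log2(R) - log2(G)          # log ratio
A <- (log2(R) + log2(G)) / 2    # average log intensity

# Assumption: rank 1 is assigned to the largest M value.
M_rank <- rank(-M, ties.method = "first")

print(data.frame(R, G, M, A, M_rank))
```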

Description of the comparative Boxplot:

Figure 7 shows an example of a comparative boxplot.

We have chosen a wide range of measures to quantify the quality of a typical hybridization: single channel measures (range of foreground signal, MAD of background, signal to noise ratio…), two channel measures (median A values for each type of controls, amount of normalization needed…), percentage of flagged spots... Some measures have been negated such that the quality scale had an increasing trend from problematic to good quality.

For each measure, we have represented the following on the graph:

- Boxplot of the reference slides values.

- 1st and 3rd quartiles before scaling for each boxplot.

- Y-axis on the right: for each measure, we have printed 2 values. The first one is the percentage of reference slides measures under your slide's result. The second one is your slide value for this measure before scaling.

- We have scaled all the results to be able to compare them on the same graph.

- The red dots are the test slide scaled values.

 
The 16 measures we have selected are listed below.

1. rangeRf: Range of Cy5 foreground, where the range is defined by:

rangeRf = max(log2 (median Cy5 foreground)) - min(log2(median Cy5 foreground))
where median Cy5 foreground corresponds to the "F635 Median" column of the gpr file.


2. rangeGf: Range of Cy3 foreground, where the range is defined by:

rangeGf = max(log2 (median Cy3 foreground)) - min(log2(median Cy3 foreground))
where median Cy3 foreground corresponds to the "F532 Median" column of the gpr file.
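
As a small worked example of the range measures, the sketch below applies the rangeRf formula to five made-up "F635 Median" values. The same code applies to rangeGf with the "F532 Median" column. The variable names are chosen for this example only.

```r
# Hypothetical "F635 Median" values (Cy5 median foreground) for five spots.
F635_median <- c(64, 256, 1024, 4096, 32768)

# Range of the log2 foreground, as in the rangeRf formula.
logRf <- log2(F635_median)
rangeRf <- max(logRf) - min(logRf)
print(rangeRf)
```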


3. -RbMad: Cy5 background MAD

RbMad = mad[log2(Cy5 background)]

where:
- Cy5 background corresponds to the "B635 Median" column of the gpr file
- MAD = median{ |Y - mu| }, when Y is normal

4.  -GbMad: Cy3 background MAD

GbMad = mad[log2(Cy3 background)]

where:
- Cy3 background corresponds to the "B532 Median" column of the gpr file
- MAD = median{ |Y - mu| }, when Y is normal
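
The background MAD can be illustrated with base R on made-up "B635 Median" values. Note that R's mad() function multiplies the raw median absolute deviation by 1.4826 by default, so that it estimates the standard deviation for normal data. Whether the package relies on this default is an assumption here; both versions are shown.

```r
# Hypothetical "B635 Median" values (Cy5 median background) for five spots.
B635_median <- c(32, 64, 128, 256, 512)
logRb <- log2(B635_median)

# Raw median absolute deviation from the median (no scaling constant).
raw_mad <- median(abs(logRb - median(logRb)))

# R's mad(): the raw value times 1.4826 by default.
RbMad <- mad(logRb)

print(c(raw = raw_mad, scaled = RbMad))
```

The same code applies to GbMad with the "B532 Median" column. On the comparative boxplot the measure is negated, so that a noisier background gives a lower score.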


5.  Median RS2N: Median Signal To Noise log-ratio for Cy5

RS2N = log2( mean Cy5 foreground / Median Cy5 background )

RS2Nmedian = median(RS2N) 

where:
- mean Cy5 foreground is the "F635 Mean" column of the gpr file
- median Cy5 background is the "B635 Median" column of the gpr file
 

6. Median GS2N: Median Signal To Noise log-ratio for Cy3

GS2N = log2( mean Cy3 foreground / Median Cy3 background )

GS2Nmedian = median(GS2N)

where:
- mean Cy3 foreground is the "F532 Mean" column of the gpr file
- median Cy3 background is the "B532 Median" column of the gpr file
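
The two signal-to-noise measures follow directly from the formulas above. The sketch below works through RS2N on made-up values for four spots; GS2N is obtained in the same way from the "F532 Mean" and "B532 Median" columns.

```r
# Hypothetical gpr columns for four spots.
F635_mean   <- c(2048, 512, 8192, 128)   # "F635 Mean"
B635_median <- c(64, 64, 128, 64)        # "B635 Median"

# Signal-to-noise log-ratio per spot, then its median over the slide.
RS2N <- log2(F635_mean / B635_median)
RS2Nmedian <- median(RS2N)
print(RS2Nmedian)
```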


7. -Median A for empty control:

A = [ log2(median Cy5 foreground) + log2(median Cy3 foreground) ] / 2

Median A for empty control = median( A(Empty controls) )

where:
- median Cy5 foreground corresponds to the "F635 Median" column of the gpr file
- median Cy3 foreground corresponds to the "F532 Median" column of the gpr file
- Empty controls are the probes labelled "Empty"

 

8. -Median A for negative control:

A = [ log2(median Cy5 foreground) + log2(median Cy3 foreground) ] / 2

Median A for negative control = median( A(Negative controls) )

where:
- median Cy5 foreground corresponds to the "F635 Median" column of the gpr file
- median Cy3 foreground corresponds to the "F532 Median" column of the gpr file
- Negative controls are the probes labelled "Negative"

9. Median A values for Positive controls:

A = [ log2(median Cy5 foreground) + log2(median Cy3 foreground) ] / 2

Median A for positive control = median( A(Positive controls) )

where:
- median Cy5 foreground corresponds to the "F635 Median" column of the gpr file
- median Cy3 foreground corresponds to the "F532 Median" column of the gpr file
- Positive controls are the probes labelled "Positive"


10. Difference between A values for Positive controls and  A values for Negative controls

difference = median( A(Positive controls)) - median( A(Negative controls))
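
The control measures above (median A per control type and the Positive minus Negative difference) can be sketched in base R. The data frame, its Type column and the intensities are hypothetical; in arrayQuality the control type of each spot is derived from the controlCode matrix rather than from a ready-made column.

```r
# Hypothetical spots, already labelled with their control type.
spots <- data.frame(
  Type        = c("Positive", "Positive", "Negative", "Negative", "Empty", "Empty"),
  F635.Median = c(4096, 1024, 64, 256, 64, 64),
  F532.Median = c(1024, 1024, 64, 64, 16, 64)
)

# Average log-intensity of each spot.
A <- (log2(spots$F635.Median) + log2(spots$F532.Median)) / 2

# Median A for each control type.
medianA <- tapply(A, spots$Type, median)

# Separation between positive and negative controls.
difference <- medianA[["Positive"]] - medianA[["Negative"]]
```

A large positive difference suggests that positive controls are clearly brighter than negative controls.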

11. -varRepA: variance of the A values of replicate spots

varRepA = var[ A(replicates) ]

where:
- A = [ log2(median Cy5 foreground) + log2(median Cy3 foreground) ] / 2
- median Cy5 foreground corresponds to the "F635 Median" column of the gpr file
- median Cy3 foreground corresponds to the "F532 Median" column of the gpr file


12. -msePtip: MSE of M values by print-tip group, no background subtraction

M = log2(Median Cy5 foreground) - log2(Median Cy3 foreground)
msePtip = MSE( mean M by print-tip)

where:
- median Cy5 foreground corresponds to the "F635 Median" column of the gpr file
- median Cy3 foreground corresponds to the "F532 Median" column of the gpr file
- MSE(X) = E( (X-t)^2 ), with t a parameter and X an estimator of t.
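
A minimal base R sketch of the msePtip calculation on made-up values. The text does not state the target value t; this sketch assumes t = 0 (no print-tip bias), which is an assumption of the example and not a documented detail of arrayQuality. The print-tip labels and M values are hypothetical.

```r
# Hypothetical M values for eight spots printed by four print-tips.
ptip <- rep(1:4, each = 2)
M    <- c(0.5, 1.5, -1, -1, 0, 0, 2, 0)

# Mean M within each print-tip group.
meanMByTip <- tapply(M, ptip, mean)

# Mean squared error of the group means around the assumed target t = 0.
msePtip <- mean((meanMByTip - 0)^2)
```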
 

13. -mseFit: MSE of lowess curve

fit = lowess(A, M) 
mseFit = MSE(fit$y)

where:
- A = [ log2(median Cy5 foreground) + log2(median Cy3 foreground) ] / 2
- M = log2(Median Cy5 foreground) - log2(Median Cy3 foreground)
- median Cy5 foreground corresponds to the "F635 Median" column of the gpr file
- median Cy3 foreground corresponds to the "F532 Median" column of the gpr file
- MSE(X) = E( (X-t)^2 ), with t a parameter and X an estimator of t.
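
The lowess-based measure can be illustrated on simulated data. As the text does not state the target value t, the sketch assumes t = 0 (a flat fit, meaning no intensity-dependent bias); this is an assumption of the example, and the simulated A and M values are invented.

```r
set.seed(1)

# Simulated A and M values with a mild intensity-dependent trend.
A <- runif(200, min = 6, max = 14)
M <- 0.1 * (A - 10) + rnorm(200, sd = 0.2)

# Lowess fit of M on A; fit$y holds the fitted M values.
fit <- lowess(A, M)

# Mean squared error of the fitted curve around the assumed target t = 0.
mseFit <- mean((fit$y - 0)^2)
```

A fit that stays close to zero over the whole intensity range gives a small mseFit.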


14. -Percentage of flagged spots

[number of spots with flag < 0 / number of spots] * 100

where flag is the information from the "Flags" column of the gpr file. Only spots with flag less than 0 are taken into account.
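
For illustration, the percentage can be computed in base R from a vector of flag values. The vector below is made up and stands in for the "Flags" column of a gpr file.

```r
# Hypothetical "Flags" column for ten spots; negative values mark flagged spots.
flags <- c(0, -50, 0, -100, 0, 0, -75, 0, 0, 0)

# Share of spots with a negative flag, as a percentage.
pctFlagged <- 100 * sum(flags < 0) / length(flags)
```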

 
15. -M values MMRmad

MMR = Mmean – Mmedian

MAD(MMR)

where:
- Mmean =  log2(Mean Cy5 foreground) - log2(Mean Cy3 foreground)
M values calculated using mean signal

- Mmedian = log2(Median Cy5 foreground) - log2(Median Cy3 foreground)
M values calculated using median signal

- mean Cy5 (Cy3) foreground is the "F635 Mean" ("F532 Mean") column of the gpr file
- median Cy5 (Cy3) foreground is the "F635 Median" ("F532 Median")  column of the gpr file
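- MAD = median{ | Y - median(Y) | }, the median absolute deviation; R's mad() multiplies it by 1.4826 so that it estimates the standard deviation when Y is normal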
 

16. -Percentage of spots with abs[MMR] > 0.5

where:
- MMR = Mmean – Mmedian

- Mmean =  log2(Mean Cy5 foreground) - log2(Mean Cy3 foreground)
M values calculated using mean signal

- Mmedian = log2(Median Cy5 foreground) - log2(Median Cy3 foreground)
M values calculated using median signal

- mean Cy5 (Cy3) foreground is the "F635 Mean" ("F532 Mean") column of the gpr file
- median Cy5 (Cy3) foreground is the "F635 Median" ("F532 Median")  column of the gpr file
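
The two MMR-based measures (the MAD of MMR and the percentage of spots with abs[MMR] > 0.5) can be sketched together in base R. The data frame is hypothetical, with column names written as syntactic versions of the gpr column names.

```r
# Hypothetical mean and median foreground intensities for four spots.
gpr <- data.frame(
  F635.Mean   = c(1024, 2048, 512, 4096),
  F532.Mean   = c(512, 512, 512, 1024),
  F635.Median = c(1024, 1024, 512, 1024),
  F532.Median = c(512, 512, 512, 1024)
)

# M values computed from mean and from median signal.
Mmean   <- log2(gpr$F635.Mean)   - log2(gpr$F532.Mean)
Mmedian <- log2(gpr$F635.Median) - log2(gpr$F532.Median)

# Per-spot disagreement between the two M values.
MMR <- Mmean - Mmedian

# Spread of MMR and percentage of spots with a large disagreement.
MMRmad      <- mad(MMR)
pctLargeMMR <- 100 * mean(abs(MMR) > 0.5)
```

Spots whose mean and median signals disagree strongly often have uneven pixel intensities, so both measures should be small on a good slide.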

Example

Data for this example were provided by the Functional Genomics Core Facility at UCSF. We tested slide number "137" from print-run "9Mm". This array was fabricated using Operon Version 2 Mouse oligos, and the hybridization measures differential gene expression between two RNA samples, Mouse Liver and Mouse Reference Pool. Results are shown in Figure 6 and Figure 7.

To generate the diagnostic plots, comparative boxplots and HTML report, and to write your quality measures and normalized data to files in a directory named "Results":

> library(arrayQuality)
> datadir <- system.file("gprQCData", package="arrayQuality")
> result <- gpQuality(fnames = "9Mm137.gpr", path = datadir, organism = "Mm", output = TRUE, resdir = "Results")


Example of general hybridization diagnostic plot
 
Figure 6: General hybridization quality diagnostic plot


General hybridization quality: comparative boxplot

Figure 7: Comparative boxplot





Customizing gpQuality


arrayQuality currently uses look-up tables adapted to hybridizations performed in the Functional Genomics Core Facility at UCSF. Depending on your data, you may find that the probes defined as controls in arrayQuality are not present on your array, leading to NAs in the comparative boxplot, or you may be working with a genome for which we do not provide references. gpQuality has several arguments that you can modify in order to use your own spot types or your own collection of good slides. gpQuality arguments are listed below:

gpQuality(fnames = NULL, path = ".", organism = c("Mm", "Hs"),
          compBoxplot = TRUE, reference = NULL,
          controlMatrix = controlCode, controlId = c("ID", "Name"),
          output = FALSE, resdir =".", dev= "png", DEBUG = FALSE,...)


To use your own set of spot types (i.e. controls...): you will need to change controlMatrix and/or controlId.

To use your own collection of good slides: you will need to modify reference.


To use your own set of spot types:

The spot types used in arrayQuality are defined in a 2 column matrix called controlCode.

Pattern       Name
Buffer        Buffer
Empty         Empty
EMPTY         Empty
AT            Negative
M200009348    Positive
M200003425    Positive
NLG           con

Table 2: Examples of controls used in arrayQuality


To define your own spot types, you will need to replace the default values in controlCode with your own. The easiest way is to create a tab-delimited text file named SpotTypes.txt and read it into arrayQuality using the function readcontrolCode. It is also possible to create a new controlCode matrix directly.

1. If you want to use a Spot Types file:

A spot types file is a tab-delimited text file which allows you to identify different types of spots from the gene list. It should contain at least a column named SpotType, where all the different spot types are listed, and one or more other columns, which should have the same names as columns in the GAL file, containing patterns or regular expressions sufficient to identify the spot type. For more information, refer to the limma package user guide.

Warning: You will need to include a spot type named probes!

Below is an example of a spot types file for the swirl dataset. In this case there are only two types of spots, probes and controls.

Figure 8: Example of spot types file


To read the new spot types in arrayQuality:

- Create your spot types file.
- Find which column of the file contains the probe identifiers for each type. In the example in Figure 8, it is the "ID" column. You will need to pass this column name as an argument in the next step.
- Read the spot types file using the readcontrolCode function.

> controlCode <- readcontrolCode(file="mySpotTypes.txt", controlId="ID")

- Find which column of the gpr file can be used to identify your new spot types. It is typically the "ID" or the "Name" column.
- To generate both types of plots, call gpQuality, specifying your new controlCode matrix in controlMatrix and the column used to define your spot types in controlId.

> result <- gpQuality(controlMatrix = controlCode, controlId="ID")

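The first step of this procedure, creating the spot types file, can itself be done from R. The sketch below writes a two-row file to a temporary directory and reads it back to check its layout. The patterns ("*" and "control*"), the file location and the variable names are hypothetical choices for the example; only the SpotType column and the required "probes" spot type come from the text.

```r
# Hypothetical spot types; the patterns are matched against the "ID" column.
spotTypes <- data.frame(
  SpotType = c("probes", "controls"),
  ID       = c("*", "control*")
)

# Write a tab-delimited spot types file to a temporary location.
spotFile <- file.path(tempdir(), "mySpotTypes.txt")
write.table(spotTypes, file = spotFile, sep = "\t",
            quote = FALSE, row.names = FALSE)

# Read it back to confirm the layout and the presence of "probes".
check     <- read.delim(spotFile, stringsAsFactors = FALSE)
hasProbes <- "probes" %in% check$SpotType
```

The resulting file could then be read with readcontrolCode as described in the steps above.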

2. If you want to create a new controlCode matrix directly

You will need to create another two-column controlCode table and use it in place of the default controlCode loaded with arrayQuality:
    - A column named "Pattern" containing your control IDs
    - A column named "Name" describing what kind of control each probe is (in particular which are Positive, Negative and Empty controls)

You can do this by creating a tab-delimited text file and reading it into R after loading arrayQuality:

> library(arrayQuality)
> mycontrolCode <- as.matrix(read.table("mycontrolCode.txt", sep="\t",
                          header=TRUE, quote="\"", fill=TRUE))

Then pass your new matrix as the controlMatrix argument when calling gpQuality. You can specify which column of the gpr file contains the probe identifiers in the controlId argument (typically "ID" or "Name").

> results <- gpQuality(controlMatrix = mycontrolCode, controlId = "ID")

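To check that a hand-made control file has the expected shape before passing it to gpQuality, it can help to do a round trip in base R. The sketch below is only an illustration: it writes a small table (rows taken from Table 2) to a temporary file, whose location is an assumption of the example, and reads it back with the same read.table call as above.

```r
# Control definitions taken from Table 2.
ctrl <- data.frame(
  Pattern = c("Buffer", "Empty", "EMPTY", "AT", "M200009348"),
  Name    = c("Buffer", "Empty", "Empty", "Negative", "Positive")
)

# Write the table as a tab-delimited text file in a temporary directory.
ctrlFile <- file.path(tempdir(), "mycontrolCode.txt")
write.table(ctrl, file = ctrlFile, sep = "\t",
            quote = FALSE, row.names = FALSE)

# Read it back as a character matrix with "Pattern" and "Name" columns.
mycontrolCode <- as.matrix(read.table(ctrlFile, sep = "\t",
                                      header = TRUE, quote = "\"", fill = TRUE))
```

The result is a character matrix with one row per control pattern, which is the form described for controlCode.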

To use your own reference slides:

If you would like to use your own set of reference slides, you will need to follow a few steps to create the necessary look-up tables. This feature can be used, for example, if you want to study hybridization quality for other genomes, or if you would like to compare slide quality within a large dataset. To generate your own references:

    1. Gather the slides of "good" quality you would like to use as reference in a directory, for example "MyReferences". Slides can be from different print-runs.

    2. Change R working directory to "MyReferences", as described in Section 1.

    3. Load arrayQuality package by typing library(arrayQuality) in R

    4. Create your reference quality measures by typing:
    > myReference <- globalQuality()

    5. Change R working directory to the directory containing the slides you would like to test, as described in Section 1. You can only compare slides from the same print-run here. If you have an experiment using two print-runs, you will need to run gpQuality twice.

    6. Run gpQuality using the reference measures and the scaling table you have generated:
    > results <- gpQuality(reference = myReference)

    Other gpQuality arguments described above can also be applied here.