ChIP-chip - Data Guide

Signal Intensity (Raw) Data

Signal intensity data is extracted from the scanned images of each array using NimbleScan, NimbleGen’s data extraction software. Signal intensities for each probe are saved in pair files (.txt), the raw data format for ChIP-chip experiments.

Scaled Log2-Ratio Data

Each feature on the array has a corresponding scaled log2-ratio. This is the ratio of the input signals for the experimental and test samples that were co-hybridized to the array. The log2-ratio is computed and then scaled to center the ratio data around zero: the bi-weight mean of the log2-ratio values for all features on the array is subtracted from each log2-ratio value. View log2-ratio data files (.gff) using SignalMap.
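
The scaling step can be sketched in Python as follows. This is a minimal illustration, not NimbleScan's implementation: it assumes the bi-weight mean is a one-step Tukey biweight estimate started from the median, and the tuning constants `c` and `eps`, as well as the function names, are assumptions chosen for the example.

```python
import math
from statistics import median


def tukey_biweight_mean(values, c=5.0, eps=1e-4):
    """One-step Tukey biweight location estimate (c and eps are assumed constants)."""
    m = median(values)
    mad = median(abs(v - m) for v in values)  # median absolute deviation
    num = den = 0.0
    for v in values:
        u = (v - m) / (c * mad + eps)
        # Values far from the median get weight 0, so outliers are ignored.
        w = (1 - u * u) ** 2 if abs(u) < 1 else 0.0
        num += w * v
        den += w
    return num / den


def scaled_log2_ratios(experimental, test):
    """Log2-ratio per feature, centered by subtracting the bi-weight mean."""
    raw = [math.log2(e / t) for e, t in zip(experimental, test)]
    center = tukey_biweight_mean(raw)
    return [r - center for r in raw]


# Hypothetical signal intensities for five features on one array.
scaled = scaled_log2_ratios(
    [400.0, 100.0, 220.0, 90.0, 3000.0],
    [100.0, 100.0, 200.0, 100.0, 100.0],
)
```

Because a robust estimate is subtracted rather than the plain mean, a few strongly enriched features do not pull the center of the distribution away from zero.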

Peak Data

Using NimbleScan, peak data files (.gff) are generated from the scaled log2-ratio data. NimbleScan detects peaks by searching, within a 500 bp sliding window, for 4 or more probes whose signals are above the specified cutoff values, which range from 90% to 15%. The cutoff values are a percentage of a hypothetical maximum, defined as the mean plus 6 standard deviations. The ratio data is then randomized 20 times to evaluate the probability of “false positives.” Each peak is then assigned a false discovery rate (FDR) score based on the randomization. In general, use these guidelines when reviewing FDR scores:

  • The lower the FDR score, the more likely the peak corresponds to a protein binding site.
  • For most data sets, peaks with FDR score ≤ 0.05 very often represent the highest-confidence protein binding site(s).
  • Peaks with FDR score between 0.05 and 0.2 are also indicative of a binding site.
  • Peaks with FDR score > 0.2 are generally not considered high-confidence binding sites.
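
The window search described above can be illustrated with a simplified sketch for a single cutoff value. It is not NimbleScan's algorithm: the function name, the use of the population standard deviation, the merging of overlapping windows, and the input layout (a position-sorted list of `(position, scaled_log2_ratio)` pairs for one chromosome) are all assumptions, and the randomization step that produces FDR scores is omitted.

```python
from statistics import mean, pstdev


def find_candidate_regions(probes, cutoff_fraction, window_bp=500, min_probes=4):
    """Return (start, end) regions where at least min_probes probes above the
    cutoff fall within one window of window_bp bases."""
    ratios = [ratio for _, ratio in probes]
    hypothetical_max = mean(ratios) + 6 * pstdev(ratios)
    cutoff = cutoff_fraction * hypothetical_max

    above = [pos for pos, ratio in probes if ratio > cutoff]
    regions = []
    start = 0
    for end in range(len(above)):
        # Shrink the window from the left until it spans less than window_bp.
        while above[end] - above[start] >= window_bp:
            start += 1
        if end - start + 1 >= min_probes:
            if regions and above[start] <= regions[-1][1]:
                # Overlaps the previous region: extend it.
                regions[-1] = (regions[-1][0], above[end])
            else:
                regions.append((above[start], above[end]))
    return regions


# Hypothetical data: 100 probes spaced 100 bp apart, four of them enriched.
probes = [(pos, 3.0 if 5000 <= pos <= 5300 else 0.0) for pos in range(0, 10000, 100)]
strict = find_candidate_regions(probes, cutoff_fraction=0.90)
relaxed = find_candidate_regions(probes, cutoff_fraction=0.50)
```

In this toy data set the four enriched probes fall below the 90% cutoff but above the 50% cutoff, which shows why stepping the cutoff down from 90% toward 15% can reveal progressively weaker peaks.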

Viewing Peak Data Graphically
Open peak data files (.gff) in SignalMap. The peaks are color-coded and separated into 4 tiers for quick identification.

  • Red: 1st-tier peaks (highest probability of a peak); FDR score ≤ 0.05
  • Orange: 2nd-tier peaks; 0.05 < FDR score ≤ 0.1
  • Yellow: 3rd-tier peaks; 0.1 < FDR score ≤ 0.2
  • Grey: 4th-tier peaks (lowest probability of a peak); FDR score > 0.2
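
If you process peak files outside SignalMap, the same tiers can be reproduced with a small helper. The sketch below is an illustration only; the function name `peak_tier` is hypothetical, and the boundaries assume that each tier starts just above the upper limit of the previous one.

```python
def peak_tier(fdr):
    """Map an FDR score to the (tier, color) used in the list above."""
    if fdr <= 0.05:
        return (1, "red")
    if fdr <= 0.1:
        return (2, "orange")
    if fdr <= 0.2:
        return (3, "yellow")
    return (4, "grey")


# Hypothetical FDR scores.
tiers = [peak_tier(fdr) for fdr in (0.01, 0.08, 0.15, 0.4)]
```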

Position the mouse pointer over each peak to display additional information (see table below).

Field                          Description
Score                          The peak score, which is the log2-ratio of the 4th-highest probe in the peak.
Pos                            Genomic coordinates of the peak.
Attr                           Attributes specific for the peak.
Color                          The peak color.
Cutoff_p                       The percentage cutoff value (varies from 90% to 15%) when this peak is detected.
Cumul_peaks                    Cumulative number of peaks up to that value of FDR.
Fdr                            The FDR score.
Attr_2, attr_1, attr_0, etc.   Additional information, such as the settings of the peak finding algorithm.

Viewing Peak Data in a Table Format
You can also open peak data files (.gff) in a spreadsheet program, such as Microsoft Excel, to view data in a table format. The peaks are sorted by FDR score with the most significant peaks listed first. See the table above for a detailed description of peak data.
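
Peak files can also be read programmatically. The sketch below assumes the standard nine-column, tab-separated GFF layout with `key=value` pairs separated by semicolons in the last column, and assumes the FDR score is stored under an `fdr` key. The sample lines, including the source and feature names, are made up for illustration, so check them against your own files.

```python
def parse_gff_line(line):
    """Split one GFF record into its fixed columns and an attribute dictionary."""
    cols = line.rstrip("\n").split("\t")
    seqname, _source, _feature, start, end, score, _strand, _frame, attributes = cols[:9]
    attrs = {}
    for item in attributes.split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key.strip().lower()] = value.strip()
    return {
        "chromosome": seqname,
        "start": int(start),
        "end": int(end),
        "score": float(score),
        "attrs": attrs,
    }


def load_peaks(lines):
    """Parse GFF lines, skip comments and blanks, and sort by FDR (lowest first)."""
    peaks = [parse_gff_line(l) for l in lines if l.strip() and not l.startswith("#")]
    return sorted(peaks, key=lambda p: float(p["attrs"].get("fdr", "inf")))


# Hypothetical records; real files may carry more attributes.
sample = [
    "# example peak file",
    "chr1\tNimbleScan\tpeak\t1000\t1500\t2.1\t.\t.\tcolor=FFA500;fdr=0.08",
    "chr2\tNimbleScan\tpeak\t200\t900\t3.4\t.\t.\tcolor=FF0000;fdr=0.01",
]
peaks = load_peaks(sample)
```

With a real file, pass the open file object directly, for example `load_peaks(open(path))`, where `path` is the location of your peak file.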

Promoter Reports

For each annotated gene, NimbleScan searches for peaks that appear in a specified promoter region around the transcription start site (TSS). The region searched is design-specific; for most mammalian designs, the search region spans from 5kb upstream to 1kb downstream of the TSS.
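
Because "upstream" and "downstream" depend on the strand of the transcript, the search region is not symmetric around the TSS. The following sketch shows one way to compute it, using the typical mammalian values quoted above as defaults. The function names and the inclusive-coordinate convention are assumptions for this example.

```python
def promoter_region(tss, strand, upstream=5000, downstream=1000):
    """Return (start, end) of the promoter search region for a TSS."""
    if strand == "+":
        return (tss - upstream, tss + downstream)
    # On the minus strand, upstream lies at higher genomic coordinates.
    return (tss - downstream, tss + upstream)


def peak_in_promoter(peak_start, peak_end, tss, strand):
    """True if the peak overlaps the promoter search region."""
    region_start, region_end = promoter_region(tss, strand)
    return peak_start <= region_end and peak_end >= region_start


# Hypothetical TSS at position 10,000 on each strand.
plus_region = promoter_region(10000, "+")
minus_region = promoter_region(10000, "-")
```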

You can view the promoter reports using spreadsheet software, such as Microsoft Excel:

  • Report_All_Peaks – Lists all peaks with an FDR ≤ 0.2 and maps them to promoter regions. Each row in the report lists a peak-transcript pair. For each transcript, if more than one peak lies within the promoter region, there will be multiple rows for that transcript.
  • Report_Nearest_Peaks – Lists all peaks with an FDR ≤ 0.2 and maps them to promoter regions. Each row in the report lists a peak-transcript pair. For each transcript, if more than one peak lies within the promoter region, only the peak nearest to the TSS is reported.
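
The difference between the two reports can be illustrated with a short sketch. It is not the NimbleScan report generator: the dictionary keys, the function names, and the default promoter region (5 kb upstream to 1 kb downstream) are assumptions, and distance is taken from the peak center to the TSS.

```python
def all_peak_rows(peaks, transcripts, max_fdr=0.2, upstream=5000, downstream=1000):
    """One row per peak-transcript pair, as in Report_All_Peaks."""
    rows = []
    for t in transcripts:
        if t["strand"] == "+":
            lo, hi = t["tss"] - upstream, t["tss"] + downstream
        else:
            lo, hi = t["tss"] - downstream, t["tss"] + upstream
        for p in peaks:
            if p["fdr"] <= max_fdr and p["start"] <= hi and p["end"] >= lo:
                center = (p["start"] + p["end"]) / 2
                rows.append({
                    "peak": p["id"],
                    "transcript": t["name"],
                    "distance": center - t["tss"],
                })
    return rows


def nearest_peak_rows(rows):
    """Keep only the peak nearest to the TSS per transcript, as in Report_Nearest_Peaks."""
    best = {}
    for row in rows:
        current = best.get(row["transcript"])
        if current is None or abs(row["distance"]) < abs(current["distance"]):
            best[row["transcript"]] = row
    return list(best.values())


# Hypothetical peaks and a single hypothetical transcript.
peaks = [
    {"id": "p1", "start": 9000, "end": 9400, "fdr": 0.01},
    {"id": "p2", "start": 9800, "end": 10000, "fdr": 0.10},
    {"id": "p3", "start": 9900, "end": 10100, "fdr": 0.50},  # excluded: FDR > 0.2
]
transcripts = [{"name": "geneA", "tss": 10000, "strand": "+"}]
all_rows = all_peak_rows(peaks, transcripts)
nearest = nearest_peak_rows(all_rows)
```

Here `all_rows` holds two rows for the same transcript, while `nearest` keeps only the row for the peak closest to the TSS.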

To effectively analyze peak data, you should sort the data in promoter reports according to FDR, peak score, gene name, chromosome, distance to TSS, etc. To sort data in Microsoft Excel, highlight row 1 and choose Data -> Filter -> Auto Filter. You can then sort individual columns by ascending/descending values, top 10 values, or individual values.

The table below identifies the fields on the promoter reports (.xls):

Field                       Description
PEAK_ID                     An ID for each peak.
CHROMOSOME                  Chromosome associated with the peak.
PEAK_START                  First base of the peak on the chromosome.
PEAK_END                    Last base of the peak on the chromosome.
PEAK_SCORE                  The log2-ratio of the fourth highest probe in the peak.
PEAK_FDR                    FDR value of the peak.
FEATURE_TRACK               The annotation track against which peaks were mapped; it is the transcription start site for promoter reports.
FEATURE_STRAND              Strand of the transcript.
FEATURE_START               First base of the feature on the chromosome.
FEATURE_END                 Last base of the feature on the chromosome.
                            Note: For the transcription start site, feature size is 1; therefore, FEATURE_START and FEATURE_END are the same.
FEATURE_TO_PEAK_DISTANCE    Center-to-center distance of peak to feature.
Name                        Gene symbol of the transcript.
accession                   GenBank accession number of the transcript.
description                 Full gene name of the transcript.
ncbi_gene_id                NCBI Entrez GeneID of the transcript.
synonyms                    Other alias symbol(s) of the transcript.
Parent                      The internal identification number of the transcript from which this transcription start site is generated.

Custom Designs

If your array design is customized, some of the files described above may not be provided. For instance, annotation files (.gff) may not be readily available for less common genomes, which will result in no promoter reports being generated. In addition, the gene description file (.ngd) is available only for certain designs, since these files were replaced by annotation files (.gff) in newer designs. Also, if a positions file (.pos) is not available (because genomic coordinates were not provided for a custom design), no ratio files (.gff), peak data files (.gff), or promoter reports (.xls) are generated.

3rd Party Software Options

Many third-party packages can import and analyze NimbleGen ChIP-chip data. Two of them are described below:

You can identify motifs and sequences for qPCR design from your ChIP-chip data using the Cis-regulatory Element Annotation System (CEAS). The CEAS website accepts peak GFF files for current builds of human (hg18) and mouse (mm8). A guide to using the CEAS website is available for download.

Working out the function and transcriptional network of a large gene list can be cumbersome. Using the Database for Annotation, Visualization and Integrated Discovery (DAVID), you can functionally annotate your ChIP-chip data from one or more lists of genes whose promoter regions were bound by the factor of interest. A guide to using the DAVID website is available for download.

 
