JGI Joint Genome Institute CIG Center for Integrative Genomics

Metazome Help

Nodes, Clusters and Consensus Sequences

Nodes and Clusters:

Please go to the info page for information on nodes and clustering.

Consensus Sequences

A consensus peptide sequence is constructed for each cluster from the MSA (multiple sequence alignment). The consensus is the sequence that maximizes the sum of the pairwise scores with each cluster member's peptide sequence. For Metazome, BLOSUM62 was used as the scoring matrix, with gap opening and extension costs of 11 and 2, respectively. This relatively simple approach produces a cluster consensus sequence that performs comparably to more sophisticated profile construction algorithms, as judged by how reliably cluster members are assigned back (via BLAST) to their correct clusters.
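As a rough illustration of the idea, the sketch below builds a consensus column by column from an already aligned set of peptides. It is not the Metazome implementation: `score` is a hypothetical toy scoring function standing in for BLOSUM62, gaps are scored per position rather than with affine opening and extension costs (which do not decompose strictly by column), and all function and parameter names are assumptions.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
GAP = "-"


def score(a, b, match=4, mismatch=-1, gap=-2):
    """Toy pairwise score for two aligned symbols (NOT BLOSUM62)."""
    if a == GAP and b == GAP:
        return 0
    if a == GAP or b == GAP:
        return gap
    return match if a == b else mismatch


def consensus(msa):
    """Per column, pick the symbol whose summed score against every
    member's symbol in that column is highest; drop gap columns."""
    if not msa:
        return ""
    length = len(msa[0])
    if any(len(row) != length for row in msa):
        raise ValueError("all aligned sequences must have the same length")
    candidates = ALPHABET + GAP
    out = []
    for col in range(length):
        column = [row[col] for row in msa]
        best = max(candidates, key=lambda c: sum(score(c, x) for x in column))
        if best != GAP:
            out.append(best)
    return "".join(out)


aligned = ["MKT-LV", "MKTALV", "MRT-LV", "MKT-LV", "MKT-LV"]
print(consensus(aligned))  # MKTLV
```

With the toy scores, the single inserted alanine is outvoted by four gaps, so that column is dropped from the consensus.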

Viewing Cluster Details

The Cluster Summary page provides a detailed picture of a given cluster's constituent genes. This page is accessed by clicking the "magnifying glass" icon next to the cluster of interest on the Search Results or BLAST Results pages.

Cluster Naming and Classification
A brief summary and high-level classification of the ancestral gene represented by this cluster. The summary includes the node name, the number of crown (extant organism) genes in the cluster, an automatically generated cluster name, and, where possible, both the KOG letter and the KEGG Brite classification of the cluster (also referred to as "Cluster KEGG Orthology").

Names for non-singleton clusters are either KOG-based or SwissProt-based. If more than 50% of a cluster's member genes are annotated with the same KOG, the cluster name is the KOG description. If a cluster is purely orthologous, meaning all organisms at that node have one and only one member in the cluster, then the cluster is named according to the SwissProt Description Entry of a "prominent" (that is, well-annotated) member. If neither of these cases applies, then the SwissProt or TrEMBL name of the member that is most similar to the cluster consensus sequence is used. If this still does not yield a name, then the cluster is named simply "Hypothetical Gene." Singleton clusters are named with the definition line of their sole member.
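The naming rules can be read as a short decision procedure. The sketch below is only an illustration: the dictionary keys (`kog`, `swissprot`, `trembl`, `defline`, `well_annotated`), the `closest` argument (the member most similar to the consensus), and the order in which the KOG and SwissProt rules are tried are all assumptions, not details confirmed by Metazome.

```python
from collections import Counter


def name_cluster(members, node_organisms, closest=None):
    """Return a cluster name following the rules described above.

    members: list of dicts describing member genes (hypothetical keys).
    node_organisms: all organisms present at this node.
    closest: the member most similar to the consensus, if known.
    """
    # Singleton clusters take the definition line of their sole member.
    if len(members) == 1:
        return members[0].get("defline", "")

    # KOG-based: more than 50% of members share the same KOG.
    kogs = Counter(m["kog"] for m in members if m.get("kog"))
    if kogs:
        description, count = kogs.most_common(1)[0]
        if count > len(members) / 2:
            return description

    # SwissProt-based: every organism at the node has exactly one member.
    per_organism = Counter(m["organism"] for m in members)
    if all(per_organism[org] == 1 for org in node_organisms):
        for m in members:
            if m.get("well_annotated") and m.get("swissprot"):
                return m["swissprot"]

    # Fallback: SwissProt or TrEMBL name of the member closest to consensus.
    if closest is not None:
        name = closest.get("swissprot") or closest.get("trembl")
        if name:
            return name

    return "Hypothetical Gene"
```

For example, a three-gene cluster in which two genes carry the same KOG description would be named after that KOG, because two out of three exceeds 50%.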

The KOG Class assignment follows the same rule as KOG naming, described above. The KEGG Brite classification is done similarly, with the modification that only 50% of the member genes that could possibly have a KO (KEGG Orthology) are required to agree. This modification is due to the fact that not all organisms have been analyzed via KO, which is a prerequisite for KEGG Brite classification. The two shallowest levels of the KEGG Brite classification are hyperlinked. Clicking on the links will find all clusters at the current node assigned this KEGG Brite annotation.
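The difference between the two majority rules is the denominator. The following sketch assumes that "genes that could possibly have a KO" means genes from organisms that have been analyzed via KO, and that the threshold is a strict majority as in the KOG rule; both readings, the function name, and the dictionary keys are assumptions, and the KO identifier is a placeholder.

```python
from collections import Counter


def majority_ko(members, ko_analyzed_organisms):
    """Return the KO shared by more than half of the eligible genes,
    or None if no KO reaches that share."""
    # Only genes from KO-analyzed organisms count toward the denominator.
    eligible = [m for m in members if m["organism"] in ko_analyzed_organisms]
    counts = Counter(m["ko"] for m in eligible if m.get("ko"))
    if not counts:
        return None
    ko, n = counts.most_common(1)[0]
    return ko if n > len(eligible) / 2 else None


cluster = [
    {"organism": "A", "ko": "K00001"},
    {"organism": "A", "ko": "K00001"},
    {"organism": "B", "ko": None},
    {"organism": "C"},
    {"organism": "C"},
    {"organism": "C"},
]
print(majority_ko(cluster, {"A", "B"}))       # K00001
print(majority_ko(cluster, {"A", "B", "C"}))  # None
```

In the first call only three genes are eligible, so two agreeing genes are enough; counting all six genes, as in the second call, the same two genes would not reach a majority.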

Note that for composite clusters, naming and classification information are not provided.

Genes in this Cluster
Information on each member gene of this cluster is available in this section. This information includes: the species code, the genomic location (chromosome/scaffold, with the start and end coordinates available via mouse hovering), reference identifiers for this gene in other datasets (e.g., RefSeq, UniGene, UniProt, Ensembl, JGI), gene symbol(s), defline, a cartoon of any PFAM domains found on this gene's product, and a cartoon of the upstream and downstream neighbors of this gene. Note that each of these columns can be hidden or made visible by selecting it in the Display Options panel (see below). If an expand icon is visible next to a row, the row can be expanded to show more information.

The CHROM entry is always hyperlinked to a GBrowse-based local genomic view of the gene model. For some organisms, this view will also include supporting EST evidence, homologous proteins aligned to the genome, and genome-to-genome synteny information plots (VISTA). Hover the mouse cursor over the CHROM entry to see the gene start and end positions as well.

The first id in the DbXREF column is always from the primary source dataset (the dataset from which this gene was obtained, typically Ensembl or JGI). If available in other datasets, their identifiers are listed as well (you will need to expand the row to see these other identifiers). Where possible, all reference identifiers are hyperlinked to an information page provided by their source's curators. A two-letter code is used to indicate the source database of a given identifier. The codes are

EG    NCBI EntrezGene
HU    HUGO Gene Nomenclature Database

The Domains tab provides a cartoon view of any PFAM domains called on this gene's peptide. The same PFAM domain in different peptides will be rendered in the same color. Mouse over a domain to see the PFAM id, description, and domain coordinates displayed in a pop-up to the left of the domains. The selected domain will also be highlighted in all rows in which it appears. Click on the domain to see it highlighted in all other rows in which it appears. Note that all peptides in the cluster are scaled to the same length for viewing.

The Synteny tab provides a view of the 5 upstream and 5 downstream neighbors ("syntenic block") of this gene (known as the "anchor gene", which is always rendered in black, except in the case of composite clusters, where the anchor genes are not necessarily from the same cluster). The syntenic blocks are oriented so that anchor genes are always on the same strand (consistent with their implied descent from a common ancestral gene). Mousing over any syntenic gene will produce an info box displaying the gene's primary id, and the name and id of the cluster containing that gene. The box also includes a link to that cluster's summary page. To access the link, click on the syntenic gene (which freezes the info box and highlights all other genes that are members of the same cluster), then move the cursor over to the hyperlink and click.
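The way a syntenic block is extracted and oriented can be sketched as follows. This is a minimal illustration under stated assumptions: genes are given as hypothetical `(gene_id, strand)` tuples in genomic order for one chromosome or scaffold, and blocks whose anchor lies on the minus strand are reversed so that every anchor ends up on the plus strand. The actual rendering code may differ.

```python
def syntenic_block(genes, anchor_id, flank=5):
    """Return the anchor gene with up to `flank` neighbors on each side,
    oriented so that the anchor is on the '+' strand.

    genes: list of (gene_id, strand) tuples in genomic order,
           where strand is '+' or '-'.
    """
    ids = [gene_id for gene_id, _ in genes]
    i = ids.index(anchor_id)  # raises ValueError if the anchor is absent
    block = genes[max(0, i - flank): i + flank + 1]
    if genes[i][1] == "-":
        # Reverse the gene order and flip every strand.
        flip = {"+": "-", "-": "+"}
        block = [(gene_id, flip[strand]) for gene_id, strand in reversed(block)]
    return block


scaffold = [("g1", "+"), ("g2", "-"), ("g3", "-"), ("g4", "+")]
print(syntenic_block(scaffold, "g3", flank=1))
# [('g4', '-'), ('g3', '+'), ('g2', '+')]
```

Orienting every block this way means that upstream and downstream neighbors line up across rows, which makes conserved gene order easier to spot.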

Functional Analysis
The functional and domain annotations (e.g., KOG, KEGG, GO, PFAM, PANTHER) that have been assigned to members of this cluster are displayed here. For each annotation type, the identifier and description are provided, as well as this annotation's phylogenetic fingerprint (i.e., how many of the genes in this cluster have been assigned this annotation, broken down by organism).
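A phylogenetic fingerprint is essentially a per-organism tally. The sketch below shows one way to compute it, assuming each member gene is given as a hypothetical `(species_code, annotation_ids)` pair; the species codes and data layout are illustrative only.

```python
from collections import Counter, defaultdict


def fingerprints(members):
    """For each annotation, count how many member genes carry it,
    broken down by organism.

    members: list of (species_code, set_of_annotation_ids) pairs.
    """
    result = defaultdict(Counter)
    for species, annotations in members:
        for annotation in annotations:
            result[annotation][species] += 1
    return {annotation: dict(counts) for annotation, counts in result.items()}


cluster = [
    ("Hsa", {"PF00069", "KOG0192"}),
    ("Hsa", {"PF00069"}),
    ("Mmu", {"PF00069", "KOG0192"}),
]
print(fingerprints(cluster)["PF00069"])  # {'Hsa': 2, 'Mmu': 1}
```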

Multiple Sequence Alignment
A clustalw Multiple Sequence Alignment (MSA) has been precalculated for each gene family. You can view the MSA in this panel, as well as download a conservation-colored HTML file of the alignment (please use the Get Sequences tab if you want the raw clustalw output). Note that any organisms that have been hidden in the Display Options will also be excluded from the MSA, though the MSA will not be recalculated. If you want to recalculate the MSA with certain sequences excluded or modified, you should go to the Align cluster members tab and launch Jalview.

Note that MSAs are not pre-calculated for composite clusters. If a composite cluster has fewer than 75 members, the Multiple Sequence Alignment tab will be visible, and clicking on it will launch a real-time alignment. For composite clusters with more than 75 members, the MSA tab will not be accessible.

Align cluster members
This tab provides access to the Jalview Multiple Sequence Viewer. Click on "Align Member protein sequences" to load this cluster's peptide sequences into Jalview. Click on "Align Member coding sequences" to launch Jalview with the cluster's coding sequences instead. For all "reasonably" sized clusters, the Clustalw Multiple Sequence Alignment has been pre-computed, and will automatically load when Jalview is launched (protein sequences only). Otherwise, you can apply Clustalw (or MUSCLE, a similar Multiple Sequence Aligner) within Jalview. Once you have an alignment, you can build Neighbor-Joining or Maximum Likelihood phylogenetic trees. All alignments, sequences, and trees can be downloaded from Jalview in multiple formats. Please see the Analyzing Cluster Sequences section for more information.

There are several methods available for finding clusters related to the current cluster by descent, sequence similarity, or functional annotation.
Clusters related by descent:
Click on the View Ancestor link to be taken to the cluster summary page of the parent (immediate ancestor) of the current cluster. This cluster is guaranteed to contain all the members of the current cluster. If the current cluster is a root (i.e., most ancient) node cluster, of course, no ancestor exists. Click on the View descendants link to find the children (immediate descendants) of the current cluster. The union of these child clusters exactly reconstructs the current cluster. We currently do not include crown nodes (nodes consisting solely of a single extant organism). If you are already at a terminal (most modern) node, you won't see a link for descendant clusters. If you are interested in tracing the ancestry of only a particular subset of the members of the current cluster, select them (by checkbox) and click "Find all clusters with selected gene(s)". This will return only those clusters containing all of the selected genes.
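The relationships described here are plain set relationships, as the following sketch shows. It assumes clusters are represented as Python sets of gene ids keyed by hypothetical cluster ids; the function names are illustrative and do not correspond to any Metazome interface.

```python
def clusters_with_all(clusters, selected_genes):
    """Return ids of clusters that contain every selected gene.

    clusters: dict mapping cluster id -> set of member gene ids.
    """
    selected = set(selected_genes)
    return sorted(cid for cid, genes in clusters.items() if selected <= genes)


def children_reconstruct_parent(parent, children):
    """Check that the union of the child clusters equals the parent."""
    return set().union(*children) == set(parent)


clusters = {
    "root_1": {"g1", "g2", "g3", "g4"},
    "mid_1": {"g1", "g2"},
    "mid_2": {"g3", "g4"},
}
print(clusters_with_all(clusters, ["g1", "g2"]))  # ['mid_1', 'root_1']
print(children_reconstruct_parent(
    clusters["root_1"], [clusters["mid_1"], clusters["mid_2"]]))  # True
```

Selecting genes from two different child clusters, such as g1 and g3, would return only the ancestral cluster that contains both.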
Clusters related by sequence similarity:
Each (non-composite) cluster is represented by a consensus peptide sequence, which is based on a residue-by-residue consensus constructed from the multiple sequence alignment of the cluster members' peptide sequences. You can search for clusters whose consensus sequence is similar to that of the current cluster by clicking the "BLAST for similar clusters" link. This link will load the BLAST search page with the current node selected. To use consensus sequences from a different node as the target database, choose that node with the node selector on the BLAST page.

For composite clusters of fewer than 75 members, an MSA and consensus sequence are calculated on-the-fly, and the "BLAST for similar clusters" link functions exactly as for non-composite clusters. For composite clusters with more than 75 members, however, this link is not available.

Clusters related by functional annotation:
Use the checkboxes in the Functional Analysis section of the Cluster summary to select one or more functional annotations that have been assigned to the current cluster. Click on "Find node clusters with selected annotation(s)" to find all clusters at the current node which also have been assigned all the selected annotations.
Get Sequences
Use this tab to download sequences associated with a given gene family/cluster. You can download the peptide or nucleotide (CDS) sequence for each cluster member, the consensus sequence for the cluster, or the raw clustalw Multiple Sequence Alignment. Choose "View" to load the FASTA sequence into a browser window, or "Download" to save it to a file. Note that any species hidden via Display Options will not be included in the "Cluster Sequences" download, though they will be included in the "Raw CLUSTALW alignment." Note that for composite clusters with more than 75 members, neither the Raw CLUSTALW alignment nor the consensus sequence is available.
Display options
Click on "columns" to select which columns are displayed in the "Genes in this cluster" section. The "Graphical Analyses" column refers to the Domain and Synteny views. The synteny color control refers to how many of the (displayed and hidden) syntenic blocks must contain members of a cluster for that cluster's members to be rendered in color (all members of the same cluster will be rendered in the same non-white color). By default this number is 2, but it can be increased or decreased by clicking "+" or "-" in the column heading.
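The synteny color threshold can be pictured as a simple count. In the sketch below, each syntenic block is a hypothetical list of cluster ids (one per gene in the block), and a cluster is colored when at least `threshold` blocks contain one of its members. Treating the threshold as inclusive is an assumption.

```python
from collections import Counter


def colored_clusters(blocks, threshold=2):
    """Return the ids of clusters that appear in at least `threshold`
    syntenic blocks (displayed or hidden).

    blocks: list of syntenic blocks, each a list of cluster ids.
    """
    counts = Counter()
    for block in blocks:
        # A cluster counts once per block, however many members it has there.
        counts.update(set(block))
    return {cluster for cluster, n in counts.items() if n >= threshold}


blocks = [
    ["c1", "c2", "c2"],
    ["c1", "c3"],
    ["c4"],
]
print(colored_clusters(blocks))  # {'c1'}
```

Here c2 has two members but both sit in the same block, so it stays uncolored at the default threshold of 2; lowering the threshold to 1 would color every cluster.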

The Species Visible section of the Display Options allows the user to hide results from particular species. Unchecking a species' checkbox will cause information for that organism's genes to be removed from the cluster display. This affects the "Genes in this Cluster", "Functional Analysis", and "Multiple Sequence Alignment" tabs. If you wish to make these filter choices permanent, click on the "Save Species Setting" button.

The Click-Info tab displays additional information about PFAM domains and syntenic genes when they are selected (by mouseclick) in the Domains or Synteny tabs. For a selected PFAM domain, the PFAM identifier and description are displayed. For a selected syntenic gene, the id and name of that gene's cluster are displayed, along with a link to the cluster summary page.

Analyzing Cluster Sequences

Jalview is used for sequence viewing, alignment, and tree-building. When you launch Jalview (having selected one or more clusters), the protein or coding sequences of each cluster member are loaded into an alignment panel. If the set of sequences corresponds to a single cluster, the pre-computed MSA (multiple sequence alignment) is also retrieved and loaded into another alignment panel. Otherwise, you can launch a CLUSTALW or MUSCLE MSA yourself (under the "Align" menu in Jalview). Sequences are grouped by greatest pairwise similarity after an alignment. Once you have an MSA, you can build a neighbor-joining or maximum likelihood tree from the aligned sequences (under the "Tree" menu).

You can always remove a sequence from the set by highlighting the sequence name and choosing "Edit->Delete" from the menu. If you'd like to add one or more sequences to the set, choose "Edit->Add Sequence(s)". It's important to re-align the set after you add or delete sequences.

The Features menu allows you to visualize PFAM domains directly on the sequences in the alignment panels. Simply select "Features->PAC Protein Domains", and a list of PFAM identifiers and descriptions will appear in a panel to the right. Clicking on any one of these entries will highlight (in blue) that particular PFAM domain on all sequences in the panel.

If you would like to save an MSA or tree, choose the "File->Save As" menu item, and specify the desired file format (Fasta, clustal, MSF, etc.). If you'd like an image or HTML page of the alignment, choose the "File->Export" menu item instead.

More help on Jalview is available here.

©2007 University of California Regents. All rights reserved