About ClusterMine360: A Database of Microbial PKS/NRPS Biosynthesis
There is an increasingly large amount of information available on microbial secondary
metabolite biosynthesis. In particular, gene clusters containing polyketide synthases
(PKS) and non-ribosomal peptide synthetases (NRPS) have received significant attention,
which has resulted in the sequencing of many of these clusters during the past decade.
However, to take advantage of this data, it needs to be easily accessible
and discoverable. While the sequences themselves are generally available on NCBI,
they are frequently difficult to locate, partly because of the large amount of information
hosted in its databases. To accelerate research in this area, it would
be beneficial to have this information gathered together so that existing data can
be leveraged to the fullest.
While there are some existing databases in this space, most of them are static and
are not being actively updated. Few institutions or research groups have the resources
to maintain ongoing manual curation. However, new data is being generated, and there
needs to be an easy way to add new information to the database and keep it up to date. To
this end, we have designed this database to employ as much auto-curation as possible;
that is, to use the power of the computer to assist in the curation of gene cluster
data. Additionally, we have tried to adopt a community approach to the design of
our database. Interested individuals are able to sign up for a free account allowing
them to add or update data. Crowd-sourcing allows participation by those
who are interested in contributing while also lessening the need for a dedicated
curator.
Community-based curation, however, also has some unique challenges. In particular,
it can be difficult to ensure high levels of data quality. We have tried to address
this issue by ensuring that users only need to provide a few details while the bulk
is derived from other sources or analysis tools. For example, clusters in the database
can be sorted by phylum and species. The user does not need to provide the species,
genus, phylum and complete lineage of the producer of a given cluster. This data
is retrieved directly from NCBI databases, minimizing the risk of bad data being
input by a user. As another example, we use the antibiotics and Secondary
Metabolite Analysis SHell (antiSMASH) tool to analyze each cluster. The results
are parsed and used to automatically assign characteristics such as pathway type
to each cluster.
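As a rough illustration of how lineage details could be derived rather than typed in by a user, the sketch below parses a taxonomy record in the XML shape returned by the NCBI Taxonomy database. The sample record is hand-written and abbreviated, and the function name `parse_lineage` is hypothetical; this is not the code used by the site.

```python
import xml.etree.ElementTree as ET

# Hand-written, abbreviated sample in the shape of an NCBI Taxonomy XML
# record. The values are illustrative, not a captured response.
SAMPLE_TAXONOMY_XML = """\
<TaxaSet>
  <Taxon>
    <ScientificName>Streptomyces coelicolor</ScientificName>
    <Rank>species</Rank>
    <Lineage>Bacteria; Actinobacteria; Streptomyces</Lineage>
    <LineageEx>
      <Taxon><ScientificName>Bacteria</ScientificName><Rank>superkingdom</Rank></Taxon>
      <Taxon><ScientificName>Actinobacteria</ScientificName><Rank>phylum</Rank></Taxon>
      <Taxon><ScientificName>Streptomyces</ScientificName><Rank>genus</Rank></Taxon>
    </LineageEx>
  </Taxon>
</TaxaSet>
"""


def parse_lineage(xml_text):
    """Return species, genus, phylum and the full lineage from taxonomy XML."""
    taxon = ET.fromstring(xml_text).find("Taxon")
    # Map each rank listed in LineageEx to its scientific name.
    ranks = {
        node.findtext("Rank"): node.findtext("ScientificName")
        for node in taxon.findall("LineageEx/Taxon")
    }
    # The record's own rank (species here) is not repeated inside LineageEx.
    ranks[taxon.findtext("Rank")] = taxon.findtext("ScientificName")
    return {
        "species": ranks.get("species"),
        "genus": ranks.get("genus"),
        "phylum": ranks.get("phylum"),
        "lineage": [part.strip() for part in taxon.findtext("Lineage").split(";")],
    }


info = parse_lineage(SAMPLE_TAXONOMY_XML)
print(info["phylum"], "/", info["species"])
```

Because every field comes from the retrieved record, the only thing a contributor would need to supply in this sketch is the sequence identifier.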
- Cluster: Agglomeration of genes involved in some common function.
For the purposes of this database, the clusters contain genes encoding PKS/NRPS and
other associated enzymes.
- Compound Family: Many gene clusters produce multiple compounds
that share a common core. A compound family is therefore defined as a group of compounds
that are highly related and share a common backbone.
- Synonym: Many natural products have more than one name. Synonyms
can be added to a compound family record to help with searching and to help prevent
the same compound from being added to the database twice.
- Related Families: Compound families can be related to each other
either by similar structure or by similarity in biosynthetic genes.
- antiSMASH: The antibiotics and Secondary Metabolite Analysis SHell (antiSMASH)
is the result of a collaboration between groups at Groningen, Tübingen, and UCSF.
It analyzes sequences for the presence of secondary metabolite clusters. Once a
cluster is detected, it determines the cluster's pathway type. If PKS/NRPS domains
are present, it tries to predict the domain type, specificity, activity and/or
stereochemistry as appropriate. Finally, it performs a BLAST search to find related clusters
and attempts to predict the product of the cluster. Details on antiSMASH
can be found on their website at
antismash.secondarymetabolites.org or by reading the
article they published in 2011.
antiSMASH: Rapid identification, annotation and analysis of secondary metabolite
biosynthesis gene clusters. Marnix H. Medema, Kai Blin, Peter Cimermancic, Victor
de Jager, Piotr Zakrzewski, Michael A. Fischbach, Tilmann Weber, Rainer Breitling
& Eriko Takano. Nucleic Acids Research (2011), doi:
Organization of the Database
The database is organized around compound families. Each compound family is linked
to one or more clusters. It is also linked to synonyms, an image of the compound's
structure, pathway types that are involved in producing the compound, as well as
to related families that are similar in structure or in the genetics of their
gene clusters. Suggestions for synonyms are retrieved from the
ChemSpider database, as are images. If no image is found, one can be generated
from a SMILES string or manually uploaded. While users can indicate the pathway
types, these are cross-referenced with the results of antiSMASH analysis to ensure
validity. If a user is unsure, they can choose 'Unknown' and the pathway types will
be assigned automatically from antiSMASH.
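The relationships described above can be sketched as a small data model. The class names, field names, the example accession and the reconciliation rule in `resolve_pathway_types` are all assumptions made for illustration; the text does not specify how the site actually cross-references user input with antiSMASH results.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    accession: str  # sequence identifier of the cluster (hypothetical field)
    antismash_pathway_types: set = field(default_factory=set)


@dataclass
class CompoundFamily:
    name: str
    synonyms: list = field(default_factory=list)
    clusters: list = field(default_factory=list)
    related_families: list = field(default_factory=list)
    # An empty set stands in for the user choosing 'Unknown'.
    user_pathway_types: set = field(default_factory=set)


def resolve_pathway_types(family):
    """Return (accepted, flagged) pathway types for a compound family.

    Assumed rule: types predicted by antiSMASH for any linked cluster are
    trusted; user-supplied types that antiSMASH did not predict are flagged.
    """
    predicted = set()
    for cluster in family.clusters:
        predicted |= cluster.antismash_pathway_types
    if not family.user_pathway_types:
        # 'Unknown': fall back entirely on the antiSMASH prediction.
        return predicted, set()
    accepted = family.user_pathway_types & predicted
    flagged = family.user_pathway_types - predicted
    return accepted, flagged


family = CompoundFamily(
    name="example family",
    clusters=[Cluster("EXAMPLE_0001", {"PKS", "NRPS"})],
)
print(resolve_pathway_types(family))
```

Keeping the user's choice and the predicted types separate makes it possible to flag disagreements instead of silently overwriting either source.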
Clusters are linked to their host organism and its lineage. They are also linked
to the sequencing references provided in the NCBI GenBank record. Links to these
articles show up on the Cluster Details page. Users can also easily add additional
references to the cluster by submitting PubMed IDs on the Cluster Edit page. Clusters
are also linked to their antiSMASH results. The sequences for the PKS/NRPS domains
that are extracted from the antiSMASH file are also available to download from the
Cluster Details page either individually as a .fasta file or as a zip file containing
fasta files for all of the domains for a given cluster.
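The two download options can be pictured with a short sketch: one helper formats a single domain as FASTA text, and another bundles one FASTA file per domain into a zip archive. The function names, the domain names used as file names and the placeholder sequences are hypothetical, not taken from the site.

```python
import io
import zipfile


def to_fasta(header, sequence, width=60):
    """Format one sequence as FASTA text, wrapping lines at `width` characters."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + header + "\n" + "\n".join(lines) + "\n"


def zip_domains(domains):
    """Build an in-memory zip with one .fasta file per domain.

    `domains` maps a domain name to its sequence string.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, sequence in domains.items():
            archive.writestr(name + ".fasta", to_fasta(name, sequence))
    return buffer.getvalue()


# Placeholder sequences, for illustration only.
archive_bytes = zip_domains({"KS1": "MSEQAAAA", "AT1": "MSEQBBBB"})
print(len(archive_bytes), "bytes")
```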
Large sequences that may contain multiple clusters can also be submitted for analysis
and inclusion in the Sequence Repository. These records are not tied to any compound
family and therefore do not appear in the main database. These large sequences,
including complete genomes, can be added by GenBank or RefSeq ID. The entire sequence
will be processed by antiSMASH and the domains from PKS/NRPS clusters will be extracted
and saved in the Sequence Repository. The antiSMASH files are also saved and available
on the Sequence Repository Details page.
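As a sketch of what submitting a sequence by ID might involve, the snippet below tells RefSeq-style accessions (two letters followed by an underscore) apart from GenBank ones and builds a request URL for the NCBI E-utilities `efetch` endpoint. Which NCBI service the site actually calls is not stated here, so the endpoint choice, the parameters and the function names are assumptions.

```python
import re
from urllib.parse import urlencode

# NCBI E-utilities efetch endpoint (assumed; the site may use another service).
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

# RefSeq accessions start with a two-letter prefix and an underscore,
# for example NC_003888; a version suffix such as ".3" is optional.
REFSEQ_PATTERN = re.compile(r"^[A-Z]{2}_[A-Z0-9]+(\.\d+)?$")


def is_refseq(accession):
    """Return True if the accession follows the RefSeq prefix convention."""
    return bool(REFSEQ_PATTERN.match(accession))


def genbank_record_url(accession):
    """Build an efetch URL requesting the record in GenBank flat-file format."""
    params = {"db": "nuccore", "id": accession, "rettype": "gb", "retmode": "text"}
    return EFETCH_URL + "?" + urlencode(params)


print(is_refseq("NC_003888"), genbank_record_url("NC_003888"))
```

The URL is only constructed, not fetched, so the sketch runs without network access.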
Sequence records in the Sequence Repository are linked to both the cluster they
originated from and to the compound family (if applicable) that the cluster produces.
They are also linked to the producing organism's lineage. Sequence records come
from the PKS/NRPS domains identified by antiSMASH. antiSMASH attempts to discover
the PKS/NRPS domains in a given cluster and then predicts their function
and other properties. For example, for acyltransferase or adenylation domains,
it tries to determine which metabolite each domain is selective for. Ketoreductase domains
are also analysed to determine whether they are active or inactive and, if they are
active, antiSMASH tries to predict the stereochemistry of the product.
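One way to picture a sequence record carrying these predictions is shown below. The field names, the domain abbreviations used as keys (AT, A, KR) and the `summarize` helper are illustrative assumptions, not the repository's actual schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DomainRecord:
    domain_type: str                       # e.g. "AT", "A" or "KR"
    sequence: str
    substrate: Optional[str] = None        # predicted for AT and A domains
    active: Optional[bool] = None          # predicted for KR domains
    stereochemistry: Optional[str] = None  # predicted only for active KR domains


def summarize(domain):
    """Return a one-line description of the predictions stored on a domain."""
    if domain.domain_type in ("AT", "A"):
        return f"{domain.domain_type}: substrate {domain.substrate or 'not predicted'}"
    if domain.domain_type == "KR":
        if domain.active is None:
            return "KR: activity not predicted"
        if not domain.active:
            return "KR: inactive"
        return f"KR: active, stereochemistry {domain.stereochemistry or 'not predicted'}"
    return domain.domain_type


# Placeholder values, for illustration only.
print(summarize(DomainRecord("AT", "MSEQ", substrate="malonyl-CoA")))
print(summarize(DomainRecord("KR", "MSEQ", active=False)))
```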
- antiSMASH
The antiSMASH tool is used as a major source of sequence analysis and also as the
source of domains in the sequence repository. We would like to thank the antiSMASH
team for providing this tool and would like to thank Kai Blin, in particular, for
his assistance on many occasions.
- Indigo Chem API
The Indigo Chem API is an open source library used to generate structure images
from SMILES strings and do substructure searching. The team at GGA Software Services,
which supports the development of Indigo, was very helpful in providing assistance
in setting this up including making changes to allow for our particular configuration.
- ChemSpider
Accessible through a web API, the ChemSpider free chemical database is used as the
source of many structure images. It is also the source of many synonyms.
- NCBI
The NCBI REST services are used to return sequence metadata for a given GenBank
or RefSeq ID.