ClusterMine360: A Database of Microbial PKS/NRPS Biosynthesis

About ClusterMine360: A Database of Microbial PKS/NRPS Biosynthesis

Problem Statement

There is an increasingly large amount of information available on microbial secondary metabolite biosynthesis. In particular, gene clusters containing polyketide synthases (PKS) and non-ribosomal peptide synthetases (NRPS) have received significant attention, which has resulted in the sequencing of many of these clusters during the past decade. However, to take advantage of this data, it needs to be easily accessible and discoverable. While the sequences themselves are generally available from NCBI, they are frequently difficult to locate, partly because of the large amount of information hosted in its databases. To accelerate research in this area, it would be beneficial to gather this information in one place so that existing data can be leveraged to the fullest.

While there are some existing databases in this space, most of them are static and are not being actively updated. Few institutions or research groups have the resources to maintain ongoing manual curation. However, new data is continually being generated, and there needs to be an easy way to add it to the database and to update existing records. To this end, we have designed this database to employ as much auto-curation as possible; that is, to use the power of the computer to assist in the curation of gene cluster data. Additionally, we have adopted a community approach to the design of our database. Interested individuals can sign up for a free account that allows them to add or update data. Crowd-sourcing allows participation by those who are interested in contributing while also lessening the need for a dedicated full-time curator.

Community-based curation, however, also has some unique challenges. In particular, it can be difficult to ensure high levels of data quality. We have tried to address this issue by ensuring that users only need to provide a few details, while the bulk is derived from other sources or analysis tools. For example, clusters in the database can be sorted by phylum and species, yet the user does not need to provide the species, genus, phylum and complete lineage of the producer of a given cluster. This data is retrieved directly from NCBI databases, minimizing the risk of bad data being entered by a user. As another example, we use the antibiotics and Secondary Metabolite Analysis SHell (antiSMASH) tool to analyze each cluster. The results are parsed and used to automatically assign characteristics such as pathway type to each cluster.
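As a rough illustration of how lineage data could be pulled from NCBI instead of being typed in by a user, the sketch below builds a request URL for the NCBI E-utilities efetch service and parses a taxonomy record. This page does not state which NCBI endpoint ClusterMine360 actually calls, so the use of efetch, the function names, and the hand-written sample record are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Assumption: the public E-utilities efetch endpoint is used for lookups.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def taxonomy_url(tax_id):
    """Build an efetch URL for one NCBI Taxonomy record (XML output)."""
    query = urlencode({"db": "taxonomy", "id": tax_id, "retmode": "xml"})
    return EFETCH_URL + "?" + query


def parse_lineage(xml_text):
    """Extract species, genus, phylum and full lineage from taxonomy XML."""
    taxon = ET.fromstring(xml_text).find("Taxon")
    # Map each rank in the extended lineage to its scientific name.
    ranks = {
        node.findtext("Rank"): node.findtext("ScientificName")
        for node in taxon.findall("LineageEx/Taxon")
    }
    return {
        "species": taxon.findtext("ScientificName"),
        "genus": ranks.get("genus"),
        "phylum": ranks.get("phylum"),
        "lineage": taxon.findtext("Lineage"),
    }


# Hand-written, trimmed sample in the shape of an efetch taxonomy response.
# The values are illustrative and were not copied from a live NCBI record.
SAMPLE = """<TaxaSet><Taxon>
  <ScientificName>Streptomyces coelicolor</ScientificName>
  <Rank>species</Rank>
  <Lineage>Bacteria; Actinobacteria; Streptomyces</Lineage>
  <LineageEx>
    <Taxon><ScientificName>Bacteria</ScientificName><Rank>superkingdom</Rank></Taxon>
    <Taxon><ScientificName>Actinobacteria</ScientificName><Rank>phylum</Rank></Taxon>
    <Taxon><ScientificName>Streptomyces</ScientificName><Rank>genus</Rank></Taxon>
  </LineageEx>
</Taxon></TaxaSet>"""

if __name__ == "__main__":
    print(parse_lineage(SAMPLE))
```

Because the lineage comes from a single authoritative record, the user only has to supply an identifier, and sorting by phylum or species works on consistent values.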

Important Definitions

  • Cluster: Agglomeration of genes involved in a common function. For the purposes of this database, the clusters contain genes encoding PKS/NRPS and other associated enzymes.
  • Compound Family: Many gene clusters produce multiple compounds that share a common core. A compound family is therefore defined as a group of compounds that are highly related and share a common backbone.
  • Synonym: Many natural products have more than one name. Synonyms can be added to a compound family record to help with searching and to help prevent the same compound from being added to the database twice.
  • Related Families: Compound families can be related to each other either by similar structure or by similarity in biosynthetic genes.
  • antiSMASH: The antibiotics and Secondary Metabolite Analysis SHell (antiSMASH) is the result of a collaboration between groups at Groningen, Tübingen, and UCSF. It analyzes sequences for the presence of secondary metabolite clusters. Once a cluster is detected, it determines the cluster's pathway type. If PKS/NRPS domains are present, it will try to predict the domain type, specificity, activity and/or stereochemistry as appropriate. Finally, it performs a BLAST search to find related clusters and also attempts to predict the product of the cluster. Details on antiSMASH can be found on their website or in the article they published in 2011.

    antiSMASH: Rapid identification, annotation and analysis of secondary metabolite biosynthesis gene clusters. Marnix H. Medema, Kai Blin, Peter Cimermancic, Victor de Jager, Piotr Zakrzewski, Michael A. Fischbach, Tilmann Weber, Rainer Breitling & Eriko Takano. Nucleic Acids Research (2011), doi: 10.1093/nar/gkr466.

Organization of the Database

DB Organization

The database is organized around compound families. Each compound family is linked to one or more clusters. It is also linked to synonyms, an image of the compound's structure, the pathway types involved in producing the compound, and related families that are similar in terms of structure or in terms of the genetic similarity of their gene clusters. Suggestions for synonyms are retrieved from the ChemSpider database, as are images. If no image is found, one can be generated from a SMILES string or uploaded manually. While users can indicate the pathway types, these are cross-referenced with the results of the antiSMASH analysis to ensure validity. If a user is unsure, they can choose 'Unknown' and the pathway types will be assigned automatically from antiSMASH.
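The relationships described above can be sketched as a small data model. This is only an illustration: the class names, field names, pathway labels, and the reconciliation rule (keep user-supplied types that antiSMASH also predicts, flag the rest for review) are assumptions, since this page does not describe the actual schema or how the cross-referencing is implemented.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple


@dataclass
class Cluster:
    accession: str                                   # GenBank or RefSeq ID
    antismash_types: Set[str] = field(default_factory=set)


@dataclass
class CompoundFamily:
    name: str
    synonyms: Set[str] = field(default_factory=set)
    smiles: Optional[str] = None                     # used if no image is found
    pathway_types: Set[str] = field(default_factory=set)
    clusters: List[Cluster] = field(default_factory=list)
    related_families: Set[str] = field(default_factory=set)


def reconcile_pathway_types(
    family: CompoundFamily, user_types: Set[str]
) -> Tuple[Set[str], Set[str]]:
    """Return (accepted, flagged) pathway types for a compound family.

    Hypothetical rule: 'Unknown' (or no choice) defers to antiSMASH;
    otherwise user types are accepted only if antiSMASH predicts them too.
    """
    predicted: Set[str] = set()
    for cluster in family.clusters:
        predicted |= cluster.antismash_types
    if not user_types or user_types == {"Unknown"}:
        return predicted, set()
    return user_types & predicted, user_types - predicted


if __name__ == "__main__":
    family = CompoundFamily(
        name="example family",                       # placeholder data
        clusters=[Cluster("ACC0001", {"NRPS", "Type I PKS"})],
    )
    print(reconcile_pathway_types(family, {"Unknown"}))
    print(reconcile_pathway_types(family, {"NRPS", "Terpene"}))
```

Returning the unconfirmed types separately, rather than silently dropping them, would leave room for a curator to review disagreements between a user and the automated analysis.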

Clusters are linked to their host organism and its lineage. They are also linked to the sequencing references provided in the NCBI GenBank record. Links to these articles appear on the Cluster Details page. Users can also easily add further references to a cluster by submitting PubMed IDs on the Cluster Edit page. Each cluster is also linked to its antiSMASH results. The sequences of the PKS/NRPS domains extracted from the antiSMASH file are available for download from the Cluster Details page, either individually as .fasta files or as a zip file containing fasta files for all of the domains in a given cluster.
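The two download options described above (a single .fasta file per domain, or one zip file with all domains of a cluster) could be produced along the following lines. This is a minimal sketch using only the Python standard library; the function names, the domain names, and the 60-character line width are assumptions for the example and do not reflect the site's actual implementation.

```python
import io
import zipfile


def to_fasta(header, sequence, width=60):
    """Format one sequence as FASTA text, wrapped at `width` characters."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + header + "\n" + "\n".join(lines) + "\n"


def zip_domains(domains):
    """Bundle domain sequences into an in-memory zip archive.

    `domains` maps a domain name to its sequence; each entry becomes
    one `<name>.fasta` file. Returns the raw bytes of the archive.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, sequence in domains.items():
            archive.writestr(name + ".fasta", to_fasta(name, sequence))
    return buffer.getvalue()


if __name__ == "__main__":
    # Placeholder domain names and sequences, for illustration only.
    payload = zip_domains({"KS_1": "M" * 70, "AT_1": "ACDE"})
    with zipfile.ZipFile(io.BytesIO(payload)) as archive:
        print(archive.namelist())
```

Building the archive in memory avoids writing temporary files on the server, which suits a per-request download of a handful of short sequences.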

Large sequences that may contain multiple clusters can also be submitted for analysis and inclusion in the Sequence Repository. These records are not tied to any compound family and therefore do not appear in the main database. These large sequences, including complete genomes, can be added by GenBank or RefSeq ID. The entire sequence will be processed by antiSMASH and the domains from PKS/NRPS clusters will be extracted and saved in the Sequence Repository. The antiSMASH files are also saved and available on the Sequence Repository Details page.

Sequence records in the Sequence Repository are linked to both the cluster they originated from and the compound family (if applicable) that the cluster produces. They are also linked to the producing organism's lineage. Sequence records come from the PKS/NRPS domains identified by antiSMASH. antiSMASH attempts to discover the PKS/NRPS domains in a given cluster, after which it predicts their function and other properties. For example, for acyltransferase or adenylation domains, it will try to determine which metabolite each domain is selective for. The ketoreductases are also analysed to determine whether they are active or inactive and, if they are active, antiSMASH tries to predict the stereochemistry of the product.

Data Flow

DataFlow Diagram


  • antiSMASH
    The antiSMASH tool is used as a major source of sequence analysis and also as the source of domains in the sequence repository. We would like to thank the antiSMASH team for providing this tool and would like to thank Kai Blin, in particular, for his assistance on many occasions.
  • Indigo Chem API
    The Indigo Chem API is an open-source library used to generate structure images from SMILES strings and to perform substructure searches. The team at GGA Software Services, which supports the development of Indigo, was very helpful in providing assistance with setting this up, including making changes to allow for our particular configuration.
  • ChemSpider
    Accessible through a web API, the free ChemSpider chemical database is used as the source of many structure images. It is also the source of many synonyms.
  • NCBI
    The NCBI REST services are used to return sequence metadata for a given GenBank or RefSeq ID.