What is TransClust? TransClust is a high-throughput clustering software that is based on
Weighted Transitive Graph Projection.
It's main advantage over other approaches is that it's underlying model directly reflects hidden
transitive substructures typical e.g. for biomedical data sets. In comparison to other clustering
methods the desitiy parameter (the threshold) can be choosen as intuitively as for k-means.
What is Clustering? Clustering is a widely established technique for exploratory data analysis with applications in
statistics, computer science, biology, social sciences, or psychology. It is applied to empirical
data in basically any scientific field to gain an initial impression of structural similarities.
For this purpose, it is of great advantage to have an efficient and easy-to-use tool that can be
applied ubiquitously to a large scope of data types.
What is the Weighted Transitive Graph Projection problem?
It's a graph-theoretic problem. The goal is to transform a given intransitive graph into a transitive
one by adding and removing edges. The idea is to find a solution with minimal number of edge
modifications for the unweighted case. In our case, the weighted, a cost function for adding/removind
edges is used. It reflects the distance between a user-given threshold and a pairwise similarity
function, i.e. removing an edge labeld with a very high similarity is expensive, as is adding an edge
with a very high dissimilarity. This WTGP problem can be used as model for clustering (see the picture below).
Imagine that in bigger graphs (>3 nodes) it might be cheaper to delete two edges than to add one or vice versa.
The threshold, given by the user, strongly reflects the cluster density: High (restrictive) values make it more
expensive to add most of the edges (resulting in many small clusters) while lower values make it cheap to
add edges but expensive to remove them (resulting in few big clusters). The advantage of this approach is the
ability to reflect hidden transitive substructures directly within the clustering model. However, this problem
is NP-hard. For TransClust we use graph-layout-based heuristics to reduce the problem first and to find
exact (or at least good) solutions for more handy sub-instances afterwards.
In the picture below, the edge sizes correspond to the similarity function. Edges below a user-given threshold are
not included in the picture. It's obvious that edge 2->3 has to
added to the graph. The more critical edges are 3->5, 4->5, and most of all 4->6. Deleting these edges results in
two clusters, with paying the costs for the deletions. Not removing them means having one big cluster but paying
the
costs for adding all the missing edges to make the graph transitive. Which solution is preferred (cheaper)
strongly depends on the user-defined density parameter.
What user-input is required? TransClust has only two simple requirements: The first is a measure of similarity (or distance) for
each unordered pair in the dataset, saved in a similarity matrix. The second is a densitiy parameter
(similarity threshold) that gives the boundary between similar and dissimilar objects. Note that these
two requirements hold for any clustering method. However, most other tools ask the user to define
further essential parameters. Furthermore, the densitivity normally cannot be set intuitively.
TransCluster needs the following input:
- a pairwise similarity function and a threshold to compute a similarity matrix,
- a similarity matrix and a threshold to compute a cost matrix, or
- a cost matrix.
- (optionally) a gold standard file to determine the optimal threshold (density parameter) for the given problem automatically.
Note that for the first case an implementation for processing all-vs-all BLAST results of protein sequences
exists and is accessible through this web interface.
Where can TransClust be applied? TransClust has been successfully applied to a variety
of data types and is capable of clustering datasets up to a couple of hundred thousand objects
in a matter of hours.
What is this web site good for? This web site provides an online interface to TransClust
to allow for (restricted) execution of clustering jobs without the necessity to install TransClust on
a local PC. Besides, we offer links to start TransClust with Java Web Start on your local PC to run
large clustering jobs. The web site accepts the following input:
- for remote protein homology detection, a BLAST and a FASTA file for the construction of a similarity matrix, or
- a similarity matrix for arbitrary objects and a threshold to compute a cost matrix, or
- a cost matrix for arbitrary objects.
Example files
Download the following example files to 'play' with the application of TransClust to protein sequence clustering.
Proteins
-
Example cost matrix file - It was generated from protein sequences of the amidohydrolases superfamily from the SFLD gold standard; see the original paper (Brown et al.) for more details. Similarities were calculated using -LOG10 of birectional best BLAST hits (BeH), the cost matrix created with a threshold of 100.
-
Example similarity file - It was generated from the above mentioned amidohydrolases protein sequences from the SFLD gold standard; similarities calculated using -LOG10 of birectional best BLAST hits (BeH).
-
Example FASTA file and all-vs-all BLAST file - The FASTA file contains the above mentioned protein sequences; the all-vs-all BLAST file was generated with an E-Value threshold of 100.
-
Gold standard file - The gold standard file for the above introduced amidohydrolases proteins assigns each protein to a protein family. See the original paper (Brown et al.) for more details.
Social network
-
Example cost matrix file - This traditional example for clustering applications was generated from social interactions of 34 karate club members. The club members splitted into two parts (two clubs) and we use the number of common social activities between each pair of club members as similarity function. See the original paper for more background information. This cost matrix was created by using the below provided similarity file (threshold: 0.1).
-
Example similarity file - The file was generated from the above mentioned social network of the Zachary karate club using the social interactions of the club members as similarity values.
-
Gold standard file - This gold standard file shows that the social network (the karate club) was splitted into two subgroups. Using the files provided above, will cluster the club into exactly the same two clusters.
The following two tutorials provide concerte clustering workflows through the above introduced datasets:
Java Web Start tutorial,
Online interface tutorial. Especially the first dataset may also be used with our Cytoscape plug-ins, see our
Cytoscape plug-ins tutorial.
Are there any upload/runtime restrictings?The stand-alone tool has no restrictions in terms of file sizes and running times. You may start it
by using Java Web Start or by download the JAR file (see the above provided links). You currently have the following restrictions when using this
web interface for limited jobs:
- Max. upload size per file: 64M
- Max. script memory: 128M
- Max. running time: 30 sec
Where is TransClust developed?
|
It's mainly developed at the Center for Biotechnology, Bielefeld
University, Germany and the International Computer Science Institute, Berkeley, USA.
|
 |
 |
|
Other popular protein sequence clustering tools
Markov Clustering
Spectral Clustering
Affinity Propagation
GeneRAGE
FORCE