
Data released on October 14, 2016

Supporting data for "Clusterflock: A Flocking Algorithm for Isolating Congruent Phylogenomic Datasets"

Baker, R; DeSalle, R; Kolokotronis, S; Kreiswirth, B; Mathema, B; Narechania, A; Planet, PJ (2016): Supporting data for "Clusterflock: A Flocking Algorithm for Isolating Congruent Phylogenomic Datasets". GigaScience Database. http://dx.doi.org/10.5524/100247

Collective animal behavior, such as the flocking of birds or the shoaling of fish, has inspired a class of algorithms designed to optimize distance-based clusters in various applications, including document analysis and DNA microarrays. In the flocking model, individual agents respond only to their immediate environment and move according to a few simple rules. After several iterations, the agents self-organize and clusters emerge without the need for partitional seeds. In addition to being unsupervised, flocking offers several computational advantages, including the potential to decrease the number of required comparisons.
In Clusterflock, we implement a flocking algorithm designed to find groups (flocks) of orthologous gene families (OGFs) that share a common evolutionary history. Pairwise distances that measure the phylogenetic incongruence between OGFs guide flock formation. We test this approach on several simulated datasets varying the number of underlying topologies, the proportion of missing data, and evolutionary rates, and show that in datasets containing high levels of missing data and rate heterogeneity, Clusterflock outperforms other well-established clustering techniques. We also demonstrate its utility on a known, large-scale recombination event in Staphylococcus aureus. By isolating sets of OGFs with divergent phylogenetic signal, we can pinpoint the recombined region without forcing a pre-determined number of groupings or defining a pre-determined incongruence threshold.
Clusterflock is an open source tool that can be used to discover horizontally transferred genes, recombined areas of chromosomes, and the phylogenetic “core” of a genome. Though we use it in an evolutionary context, it is generalizable to any clustering problem. Users can write extensions to calculate any distance metric on the unit interval and use these distances to “flock” any type of data.
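To make the idea of flocking on unit-interval distances concrete, here is a minimal Python sketch. It is not the Clusterflock implementation, which is available from the GitHub repository listed under Additional information. The function names (`jaccard_distance`, `flock_step`, `flock`), the parameters `radius` and `speed`, and the attract/repel weighting `1 - 2d` are all assumptions made for illustration. The Jaccard distance on sets stands in for a phylogenetic incongruence measure.

```python
import math
import random


def jaccard_distance(a, b):
    """Example metric on [0, 1]: 1 - |a & b| / |a | b| for two sets."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def flock_step(positions, items, distance, radius=0.3, speed=0.05):
    """Move every agent once, using only the neighbours inside `radius`."""
    updated = []
    for i, (x, y) in enumerate(positions):
        dx = dy = 0.0
        for j, (nx, ny) in enumerate(positions):
            gap = math.hypot(nx - x, ny - y)
            if i == j or gap == 0.0 or gap > radius:
                continue
            # Assumed rule: similar items (d near 0) attract,
            # dissimilar items (d near 1) repel.
            pull = 1.0 - 2.0 * distance(items[i], items[j])
            dx += pull * (nx - x) / gap
            dy += pull * (ny - y) / gap
        norm = math.hypot(dx, dy)
        if norm > 0.0:
            # Take a fixed-length step along the summed direction.
            x, y = x + speed * dx / norm, y + speed * dy / norm
        updated.append((x, y))
    return updated


def flock(items, distance, iterations=200, seed=0, **kwargs):
    """Scatter agents at random, then apply `flock_step` repeatedly."""
    rng = random.Random(seed)
    positions = [(rng.random(), rng.random()) for _ in items]
    for _ in range(iterations):
        positions = flock_step(positions, items, distance, **kwargs)
    return positions
```

Any callable that returns a value between 0 and 1 for a pair of items can be passed as `distance`, for example `flock(gene_sets, jaccard_distance)` with a hypothetical list of sets `gene_sets`. Because each agent only evaluates neighbours within `radius`, a step does not need the full pairwise distance matrix, which is one way a flocking scheme can reduce the number of comparisons.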


Related manuscripts:

doi:10.1186/s13742-016-0152-3

Additional information:

https://github.com/narechan/clusterflock

https://hub.docker.com/r/narechan/clusterflock-0.1/

https://youtu.be/ELZTVOiqKn8

Keywords:

swarms, flocking algorithm, unsupervised clustering, data mining, horizontal gene transfer, recombination, Staphylococcus aureus

Dataset type: Software


Files:

File Type        File Format   Size        Release Date
Mixed archive    TAR           235.92 MB   2016-10-10
GitHub archive   archive       17.03 MB    2016-10-10
Video            UNKNOWN       40.33 MB    2016-10-10
Readme           TEXT          1.81 KB     2016-10-10