This repository holds computer code (and, temporarily, data) used as part of [McCormack:20??]_.
In essence, the software included within the repository is meant to:
Because of the complex nature of the above steps, there are a number of independent and interdependent programs within this repository, which we explain in greater detail in the Workflow section. The output of these programs is available from Downloads, as described in the Data section. Since many people are primarily interested in these data, we discuss them first.
For many of the programs within, you should at least have access to a multi-core computer with a sufficient amount of RAM (>8 GB). A number of the programs allow you to parallelize jobs using the multiprocessing module of Python, which provides an almost linear increase in processing speed.
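As a rough illustration of this pattern (using a hypothetical worker function, not one from the repository), independent jobs can be farmed out to a pool of worker processes:

```python
import multiprocessing

def worker(args):
    """Hypothetical per-job task; the repository's actual workers do things
    like parsing alignments or running searches."""
    name, value = args
    return name, value ** 2

if __name__ == "__main__":
    jobs = [("job-{0}".format(i), i) for i in range(10)]
    # Pool() defaults to one worker per CPU core; for CPU-bound,
    # independent jobs this gives a near-linear speedup with core count.
    pool = multiprocessing.Pool()
    results = pool.map(worker, jobs)
    pool.close()
    pool.join()
```

Each job must be independent for this to scale well; shared state (e.g. a database connection) should live in the parent process or be opened per-worker.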
Additionally, several programs (particularly those within phylo/*) require access to a cluster-computing system. We used a large cluster running the LSF queuing system.
Several of these programs cannot, realistically, be run on very small systems unless you are willing to wait a very long time.
There are numerous software dependencies that you must satisfy if you wish to run all of the code contained within this repository. These include:
MP-EST - we use a modified version of the original code
The code is available at http://github.com/baddna/seqcap.
We are in the process of cleaning and standardizing the code, while also increasing the available documentation - both in the source files and here. Additionally, we have made incremental improvements to a number of routines used herein, and we will be merging these changes into this repository after the initial commit and tagging of files (see Tagging).
You will notice, should you scrutinize the code, that some programs write to an sqlite database while others write to a mysql database. The reasons for this additional level of complexity are several-fold. Generally speaking, we started using sqlite as the initial database for holding data generated as part of this project, but we moved to mysql when we decided that we needed a greater level of concurrency (sqlite does not support concurrent writes).
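To sketch the limitation (with a throwaway schema, not the project's actual tables): sqlite handles many concurrent readers, but while one connection holds the write lock, a second writer will block and eventually raise `sqlite3.OperationalError` ("database is locked"), which is what pushed us toward mysql for parallel jobs.

```python
import sqlite3

# Throwaway schema for illustration only; the project's real tables differ.
conn = sqlite3.connect("example.db", timeout=5.0)
conn.execute("CREATE TABLE IF NOT EXISTS hits (locus TEXT, score REAL)")

# Writes from a single connection are fine...
with conn:
    conn.execute("INSERT INTO hits VALUES (?, ?)", ("uce-1", 99.5))

# ...but a second connection attempting to write while another holds the
# write lock waits up to `timeout` seconds, then raises
# sqlite3.OperationalError: database is locked.
count = conn.execute("SELECT COUNT(*) FROM hits").fetchone()[0]
conn.close()
```

With mysql, each worker process can hold its own connection and insert rows concurrently, since the server handles row-level locking for you.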
As we incrementally improve individual files, we will make the switch to mysql-only (or ORM, which should be more generic) support.
In the development version of the code (currently a private repository - see Additional Notes), we have also replaced slower external dependencies (e.g. BLAST) with similar, yet speedier, alternatives (e.g. lastz).
Because code is a moving target, we have tagged particular commits that contain files of a certain vintage and/or purpose. For instance, the 0.1 tag holds the initial version of the code that we used to search MAF files and run parallel blast jobs. Subsequently, we updated these files to work more efficiently (e.g. using the multiprocessing module in place of pp).