1. Introduction

This repository holds computer code (and, temporarily, data) used as part of [McCormack:20??]_.

In essence, the software included within this repository is meant to:

  1. identify ultra-conserved regions of genomic DNA (AKA ultra-conserved elements or UCEs) in [vertebrate] organisms
  2. design in silico (or in vitro) “probes” for the enrichment of UCEs
  3. align in silico probes to genomic DNA of numerous sources and extract those reads + some flanking DNA
  4. cluster in silico probes + flanking sequence from individual species into “loci” representing entire UCEs. This step translates easily to an in vitro approach: after target enrichment, where the enriching agents are the probes designed above, we cluster the resulting probes + reads into the corresponding UCEs from which they were derived.
  5. align, across species, UCE “loci” from above
  6. generate gene trees from these inter-species alignments (using PhyML)
  7. generate the species trees from individual gene trees (using STAR, STEAC, or MP-EST)
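Step 4 above is essentially a grouping operation; here is a minimal sketch of the idea (the hit tuples and locus names are invented for illustration, not taken from the repository's code):

```python
# Sketch of step 4: bin probe + flank sequences from several species into
# per-locus groups. All data here are hypothetical placeholders.
from collections import defaultdict

# (locus, species, probe + flanking sequence) hits from alignment
hits = [
    ("uce-1", "anoCar", "ACGTTTGA"),
    ("uce-1", "galGal", "ACGGTTGA"),
    ("uce-2", "anoCar", "GGCAAACT"),
]

loci = defaultdict(list)
for locus, species, seq in hits:
    loci[locus].append((species, seq))

# Each key now holds all sequences belonging to one UCE locus.
print(sorted(loci))  # ['uce-1', 'uce-2']
```

The same grouping applies whether the sequences come from in silico alignments or from target-enrichment reads.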

Because of the complex nature of the above steps, there are a number of independent and interdependent programs within this repository, which we explain in greater detail in Workflow. The fruits of these programs' labor are found in the Downloads, as described in the Data section. Since many people are mostly interested in these data, we discuss the data first.

1.1. Dependencies

1.1.1. Hardware Dependencies

For many of the programs within, you should at least have access to a multi-core computer with a sufficient amount of RAM (>8 GB). A number of the programs allow you to parallelize jobs using Python's multiprocessing module, which provides an almost linear increase in processing speed.

Additionally, several programs (particularly those within phylo/*) require access to a cluster-computing system. We used a large cluster running the LSF queuing system.

Several of these programs cannot realistically be run on very small systems unless you are willing to wait a very long time.

1.1.2. Software Dependencies

There are numerous software dependencies that you must satisfy if you wish to run all of the code contained within this repository. These include:

1.2. Notes on the Code

The code is available at http://github.com/baddna/seqcap.

We are in the process of cleaning and standardizing the code, while also increasing the available documentation for the code - both in the source files and here, in the documentation. Additionally, we have made incremental improvements to a number of routines used herein, and we will be merging these changes into this repository after the initial commit and tagging of files (see Tagging).

You will notice, should you scrutinize the code, that some programs write to an sqlite database while others write to a mysql database. The reasons for this additional level of complexity are several-fold. Generally speaking, we started with sqlite as the initial database for holding data generated as part of this project, but we moved to mysql when we decided that we needed a greater level of concurrency (sqlite does not support concurrent writes).
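A minimal sketch of the sqlite pattern (the matches table and its columns are hypothetical, not the project's actual schema):

```python
# Sketch of the sqlite usage pattern; the table and column names are
# hypothetical placeholders, not the project's actual schema.
import sqlite3

conn = sqlite3.connect(":memory:")  # a file path in a real pipeline
conn.execute("CREATE TABLE matches (probe TEXT, chromo TEXT, start INTEGER)")
conn.execute("INSERT INTO matches VALUES (?, ?, ?)", ("uce-1", "chr1", 12345))
conn.commit()
# sqlite serializes writes: a second connection writing at the same moment
# blocks (or raises "database is locked"), which is the concurrency limit
# that motivates a move to a server-based database such as mysql.
rows = conn.execute("SELECT * FROM matches").fetchall()
print(rows)  # [('uce-1', 'chr1', 12345)]
```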

As we incrementally improve individual files, we will make the switch to mysql-only (or ORM, which should be more generic) support.

In the development version of the code (currently a private repository - see Additional Notes) we have also replaced slower external dependencies (e.g. BLAST) with similar, yet speedier alternatives (e.g. lastz).

1.3. Additional Notes

  1. we have moved the code within this repository here from a private repository that I (BCF) maintain for development. You should generally be happy about this, because it has allowed us to do a fair amount of housekeeping. It also allows us to work on some things that we may not want everyone to know about (yet!). However, if you believe a program is missing that may be in this private repository, please let me know, and I’ll attempt to move it over. Eventually, we plan to move all code to this repository, and remove the private repository. At that point, we will develop from within this repository. The downside of this approach is that we will lose some history information of particular pieces of code.
  2. we have an updated workflow for a number of the steps detailed below (particularly the initial steps of UCE location and probe design). When the time comes, we will tag pertinent files in the current repo (see Tagging), and then move in the new bits.
  3. some of the methods/code within are likely confusing, particularly if you are trying to piece together what we did without actually reading the code. For the most part, we’ll try to give you some guidance, but you’ll also need to read the code. It may be helpful to enlist someone with knowledge of Python to aid this process.

1.4. Tagging

Because code is a moving target, we have tagged particular commits that contain files of a certain vintage and/or purpose. For instance, the 0.1 tag holds the initial version of the code that we use to search MAF files and run parallel blast jobs. Subsequently, we updated these files to work more efficiently (e.g. the multiprocessing module in place of pp).
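For a reader who wants to retrieve a tagged vintage of the code, the git commands look roughly as follows (only the 0.1 tag name comes from the text above; the example runs against a throwaway repository so it is self-contained, whereas in practice you would run the checkout inside a clone of the seqcap repository):

```shell
# Sketch: working with tagged vintages of a repository. Demonstrated in a
# temporary throwaway repo; in practice, "git checkout 0.1" would be run
# inside a clone of the seqcap repository.
tmp=$(mktemp -d) && cd "$tmp"
git init -q .
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "initial version"
git tag 0.1          # mark this commit as the 0.1 vintage
git tag -l           # list available tags
git checkout -q 0.1  # detached HEAD at the tagged state
```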