This program optimizes a degenerate codon library to cover the desired set of amino acids at several positions while staying within a diversity limit for the library. It is a fast way to generate small libraries.
The typical case for which we imagine SwiftLib to be useful is this: imagine that you have constructed 100 redesigned models of a particular protein where you allowed 15 residues to mutate. For position 82 (one of the residues you allowed to mutate), alanine appeared 10 times, arginine 35 times, lysine 25 times, leucine 20 times, asparagine 5 times, tryptophan 3 times, and valine twice, and similar distributions were seen at the other 14 positions. (You would enter all of these counts in SwiftLib's table below.) Let's say you're aiming to create a library for yeast display and want to ensure that you don't exceed a (DNA) diversity of 1e7. In this case, you would rather have a library that excluded tryptophan at position 82 (losing 3 observations) than one that excluded arginine (losing 35). Tryptophan's exclusion represents error: you wanted something, but you couldn't get it. The goal of this program is to find the assignment of degenerate codons within the given diversity limit that yields the minimal error over all positions being randomized.
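To make the notion of error concrete, here is a minimal sketch that scores a single degenerate codon against the position-82 counts above. It is an illustration, not SwiftLib's own code: the function names (`encodedAminoAcids`, `dnaDiversity`, `positionError`) are hypothetical, only positive counts are handled, and DNA diversity is assumed to be the number of distinct DNA codons the degenerate codon expands to.

```javascript
// Standard genetic code laid out in TCAG order (first, second, third base).
const BASES = "TCAG";
const AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";

// IUPAC nucleotide ambiguity codes: each letter stands for a set of bases.
const IUPAC = {
  A: "A", C: "C", G: "G", T: "T",
  R: "AG", Y: "CT", S: "CG", W: "AT", K: "GT", M: "AC",
  B: "CGT", D: "AGT", H: "ACT", V: "ACG", N: "ACGT",
};

// Set of amino acids ('*' = stop) encoded by a degenerate codon such as "NNK".
function encodedAminoAcids(dc) {
  const [b1, b2, b3] = [...dc].map((ch) => IUPAC[ch]);
  const found = new Set();
  for (const x of b1) for (const y of b2) for (const z of b3) {
    const idx = 16 * BASES.indexOf(x) + 4 * BASES.indexOf(y) + BASES.indexOf(z);
    found.add(AMINO[idx]);
  }
  return found;
}

// Number of distinct DNA codons the degenerate codon expands to (assumed metric).
function dnaDiversity(dc) {
  return [...dc].reduce((product, ch) => product * IUPAC[ch].length, 1);
}

// Error at one position: total count of desired amino acids the codon misses.
// Negative (disfavored) counts are ignored in this sketch.
function positionError(dc, counts) {
  const covered = encodedAminoAcids(dc);
  let error = 0;
  for (const [aa, n] of Object.entries(counts)) {
    if (n > 0 && !covered.has(aa)) error += n;
  }
  return error;
}

// Counts for position 82 from the example above.
const position82 = { A: 10, R: 35, K: 25, L: 20, N: 5, W: 3, V: 2 };
console.log(positionError("ARG", position82), dnaDiversity("ARG")); // 40 2
console.log(positionError("NNK", position82), dnaDiversity("NNK")); // 0 32
```

The codon ARG expands to AAG (lysine) and AGG (arginine), so it covers 60 of the 100 observations at a diversity of only 2, while NNK covers everything at a diversity of 32. This is the trade-off the optimizer balances across all positions.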
Additionally, it is possible to ask the algorithm to allow multiple degenerate codons at one or more positions. You simply indicate the primer boundaries and how many oligos you are willing to buy. The algorithm will choose the positions at which to use multiple degenerate codons to get the best coverage. The number of oligos that must be purchased to cover the randomized residues that lie inside the same primer boundaries is the product of the number of degenerate codons used at each of those positions. E.g., if there are three residues that are part of the same stretch using 2, 3, and 4 degenerate codons, then to cover all combinations, 2*3*4 = 24 oligos would have to be purchased. One of the more expensive parts of considering multiple degenerate codons is enumerating all combinations of degenerate codons; allowing i degenerate codons at a single position requires looking at (15^3)^i combinations. For this reason we do not recommend using more than 4 degenerate codons at any one position.
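The two counts in the paragraph above can be checked with a few lines of arithmetic. This is a minimal sketch with hypothetical function names; the figure 15 is the number of non-empty subsets of the four bases, so there are 15^3 = 3375 degenerate codons.

```javascript
// Oligos needed for one primer stretch: product of the number of
// degenerate codons used at each randomized position in that stretch.
function oligosForStretch(codonsPerPosition) {
  return codonsPerPosition.reduce((product, k) => product * k, 1);
}

// Each of the 3 codon positions takes one of 15 non-empty base subsets.
const DEGENERATE_CODONS = 15 ** 3; // 3375

// Combinations examined when i degenerate codons are allowed at one position.
function combinationsToEnumerate(i) {
  return DEGENERATE_CODONS ** i;
}

console.log(oligosForStretch([2, 3, 4])); // 24
for (let i = 1; i <= 4; i++) {
  console.log(i, combinationsToEnumerate(i));
}
```

Two codons at a position already mean about 1.1e7 combinations, and four mean about 1.3e14, which is why the count grows out of reach so quickly.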
Algorithm
The program works by dynamic programming. If all of the errors are integers, then it is possible to ask, for each position, what degenerate codon has the smallest diversity given that it produces a given error. The smallest library for positions [1..i] given a particular error can be readily computed using a simple recurrence. The best library is the one with the smallest error with a diversity below the desired cap. The running time is O(n^2 m^2) for n positions and m error gradations. In the case above, n=15 and m=100 (the maximum error would be 100, given by a codon that doesn't contain any of the desired amino acids; the larger the maximum error, the longer the running time). Analysis for the multiple-degenerate-codon algorithm is slightly more complicated and will be published shortly.
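The recurrence can be sketched as follows. This is an illustration under stated assumptions rather than SwiftLib's actual implementation: the function `bestLibrary` and its input `minDivPerPosition` are hypothetical, where `minDivPerPosition[p][e]` holds the smallest diversity of any single degenerate codon at position p with error exactly e (or `Infinity` if no codon gives that error). Library diversity is taken to be the product of the per-position diversities, and the traceback that recovers the chosen codons is omitted.

```javascript
// best[e] = smallest diversity of a library over the positions seen so far
// whose total error is exactly e.
function bestLibrary(minDivPerPosition, diversityCap) {
  let best = [1]; // no positions yet: error 0, diversity 1
  for (const minDiv of minDivPerPosition) {
    const next = new Array(best.length + minDiv.length - 1).fill(Infinity);
    for (let e0 = 0; e0 < best.length; e0++) {
      if (best[e0] === Infinity) continue;
      for (let e1 = 0; e1 < minDiv.length; e1++) {
        const diversity = best[e0] * minDiv[e1];
        if (diversity < next[e0 + e1]) next[e0 + e1] = diversity;
      }
    }
    best = next;
  }
  // Smallest total error whose minimal diversity fits under the cap.
  const error = best.findIndex((d) => d <= diversityCap);
  return error === -1 ? null : { error, diversity: best[error] };
}

// Toy input: two positions, errors ranging over 0..2.
const toy = [
  [8, 4, 1],        // position 1: error 0 costs diversity 8, error 2 costs 1
  [6, Infinity, 2], // position 2: no codon produces error exactly 1
];
console.log(bestLibrary(toy, 20)); // { error: 2, diversity: 6 }
console.log(bestLibrary(toy, 30)); // { error: 1, diversity: 24 }
```

With n positions and per-position errors up to m, the `best` array grows to about n*m entries and each of the n updates pairs every entry with up to m per-position errors, which gives the O(n^2 m^2) bound quoted above.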
Privacy? This program is implemented in JavaScript and therefore runs inside your browser. It does not send any data anywhere. You do not need to worry about anyone decoding the library you're creating.
Input
The input for SwiftLib is a table of the positions that you would like to vary in your library, with a numeric preference for each amino acid that you would like to favor or disfavor at each position. So, given the above example, the input would be a table with 15 columns (1 for each residue), and the numeric preference for each amino acid would be the number of occurrences of that amino acid at that position. Aside from using positive integers to favor an amino acid, one can also use negative integers to disfavor an amino acid at a given position. Furthermore, one can use the '*' and '!' wildcards to specify that the amino acid is required ('*') or forbidden ('!') at the given position. Empty fields in the table will be treated as if they contain a '0'. For convenience, SwiftLib allows the creation of this table manually, through a CSV format, or through a collection of FASTA-formatted sequences.
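The cell conventions described above can be summarized in a small parser. This is a sketch only: `parseCell` and the shape of the objects it returns are hypothetical, and SwiftLib's own handling of malformed cells may differ.

```javascript
// Interpret one table cell: integer weight, '*' (required), '!' (forbidden),
// or empty (treated as 0).
function parseCell(text) {
  const t = text.trim();
  if (t === "") return { kind: "weight", value: 0 };
  if (t === "*") return { kind: "required" };
  if (t === "!") return { kind: "forbidden" };
  if (/^-?\d+$/.test(t)) return { kind: "weight", value: parseInt(t, 10) };
  throw new Error(`Unrecognized table entry: "${text}"`);
}

console.log(parseCell("35")); // { kind: 'weight', value: 35 }
console.log(parseCell("-3")); // { kind: 'weight', value: -3 }
console.log(parseCell(""));   // { kind: 'weight', value: 0 }
console.log(parseCell("!"));  // { kind: 'forbidden' }
```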
Source Code
SwiftLib is implemented in JavaScript, so the source code can be downloaded by opening the JavaScript console in your browser (e.g. on a Mac in Chrome, alt-command-j). You can also download the code from GitHub here. If you are interested in the integer-linear programming (ILP) solution described in the paper, a tarball with a set of Python scripts that can be used to generate the ILP inputs and to process the GLPK solver's outputs can be downloaded here.
Citation
If you use this program in your work, please cite:
Jacobs, Yumerefendi, Kuhlman & Leaver-Fay
SwiftLib: rapid degenerate-codon-library optimization through dynamic programming (2014)
Nucleic Acids Research, doi: 10.1093/nar/gku1323
[Link]
[PDF]
Please direct questions to: leaverfa at email dot unc dot edu.
Drag | |
---|---|
Residue Number: | |
Primer Boundary | |
Max DCs | |
A (Alanine) | |
C (Cysteine) | |
D (Aspartic Acid) | |
E (Glutamic Acid) | |
F (Phenylalanine) | |
G (Glycine) | |
H (Histidine) | |
I (Isoleucine) | |
K (Lysine) | |
L (Leucine) | |
M (Methionine) | |
N (Asparagine) | |
P (Proline) | |
Q (Glutamine) | |
R (Arginine) | |
S (Serine) | |
T (Threonine) | |
V (Valine) | |
W (Tryptophan) | |
Y (Tyrosine) | |
STOP |
To save the data you have entered in the table below for future sessions, switch to the "CSV Input" tab. This will display your data in a "comma separated value" (CSV) format. Copy your data out of the text box that appears and put it in a document for later use. You may load data into the table by pasting the data from a previously-saved table and clicking on the "Update Table" button (which is only visible in the "CSV Input" tab). If you get an error message saying that it found 46 columns instead of 23, make sure you clear out the text window before you paste your data in it.
To use the "comma separated value" (CSV) format, simply paste your comma-delimited data in the textbox. The input text must take the same form as the manual input table (22 rows, the first being the residue positions, the next 20 being the amino acids in alphabetical order by one-letter code, the last being the stop penalty for that position). Once completed, click the 'Update Table' button to populate the table with your inputs.
Once populated, one must still set the diversity bounds and the universal stop codon penalty (if desired).
To use the FASTA input, paste FASTA-formatted sequences in the textbox. All sequences must be the same length. Once completed, clicking the 'Update Table' button will populate the input table with the frequency of each amino acid at each position in the sequence. Positions which never vary are excluded from the table.
Once populated, one must still set the diversity bounds and the universal stop codon penalty (if desired).
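The conversion from FASTA sequences to table counts can be sketched as below. The function names (`parseFasta`, `columnCounts`) and the 1-based position numbering are assumptions made for illustration; they are not taken from SwiftLib's source.

```javascript
// Collect the sequences from FASTA text (header lines start with '>').
function parseFasta(text) {
  const seqs = [];
  for (const line of text.split(/\r?\n/)) {
    const t = line.trim();
    if (t === "") continue;
    if (t.startsWith(">")) seqs.push("");
    else if (seqs.length > 0) seqs[seqs.length - 1] += t.toUpperCase();
  }
  return seqs;
}

// Count amino acids per column, keeping only the columns that vary.
function columnCounts(seqs) {
  if (seqs.length === 0) return [];
  const len = seqs[0].length;
  if (seqs.some((s) => s.length !== len)) {
    throw new Error("All sequences must be the same length");
  }
  const columns = [];
  for (let i = 0; i < len; i++) {
    const counts = {};
    for (const s of seqs) counts[s[i]] = (counts[s[i]] || 0) + 1;
    if (Object.keys(counts).length > 1) {
      columns.push({ position: i + 1, counts }); // 1-based numbering (assumed)
    }
  }
  return columns;
}

const fasta = ">a\nMKV\n>b\nMRV\n>c\nMKL\n";
console.log(columnCounts(parseFasta(fasta)));
// Position 1 (always M) is dropped; positions 2 and 3 are kept.
```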
To use the ClustalW input, paste a ClustalW-formatted sequence alignment in the textbox. Please note that any position in the alignment that has only a single amino acid will be removed from the table.
Once populated, one must still set the diversity bounds and the universal stop codon penalty (if desired).
To use the MSF input, paste an MSF-formatted sequence alignment in the textbox. Please note that any position in the alignment that has only a single amino acid will be removed from the table.
Once populated, one must still set the diversity bounds and the universal stop codon penalty (if desired).