Find homopolymeric tracts in a FASTA genome

Assuming standard FASTA format, this BASH one-liner finds homopolymeric tracts (HTs, stretches of the genome where a single nucleotide is repeated many times, e.g. AAAA or TTTTTTT) in a genome and outputs the region.  Such regions are prone to sequencing errors, but are also mutational hotspots as they are susceptible to slippage errors during replication and transcription. Some evidence suggests that HTs may have a regulatory role in prokaryotes.

tail -n+2 GENOME.fa | tr -d '\n' | grep -ob -E "(\w)\1{4,}" | sed 's/:/\t/g' | awk '{print $1+1"\t"$1+length($2)"\t"substr($2,0,1)"\t"length($2); }' | sort -k1n

Explanation

# strip the FASTA header
tail -n+2 GENOME.fa
# remove newlines
tr -d '\n'
# match >4 (i.e. 5 or more) of the same character, output the match and byte offset
grep -ob -E "(\w)\1{4,}"
# replace the ":" added by grep with a tab
sed 's/:/\t/g'
# prints the genomic position (start + end) of the HT, nucleotide (ACGT) and the length of the tract
awk '{print $1+1"\t"$1+length($2)"\t"substr($2,0,1)"\t"length($2); }'
# sorts by natural numeric position in the genome
sort -k1n

Example output (Pseudomonas fluorescens Pf0-1 NC_007492.2):
35 39 C 5
157 162 A 6
374 378 C 5
440 444 T 5
529 533 T 5
1432 1436 T 5
3304 3308 C 5
3310 3315 C 6
3626 3630 G 5
4063 4067 G 5

SNPsvg

I wrote a quick Perl script to visualize SNPs in a gene from experimental evolution sequencing data.  Useful for making figures when one gene is hit by mutations in multiple lineages.  It outputs an SVG file with the reference sequence and the changes.

For the moment, it only visualizes substitutions, not insertions/deletions or anything more exotic.  More to come.

Example: SNPs found in Pseudomonas aeruginosa gene PA2449 (converted to PNG)

figure

(Thanks to Sofia Robb for teaching Perl as part of Programming for Evolutionary Biology!)

Download SNPsvg here.

 

Regex to change bracket citations into bibtex keys

Useful for converting in-text citations (from e.g. Word) into a LaTeX document.  Converts bracket notation into first 3 chars of first author’s last name (or 2 chars, if only 2 chars long), plus two-digit year: e.g. (Bobby 2009) becomes \citep{Bob09}.  See also my post on how to insert bibtex references into Word.

Search for:

\(([A-Za-z]{2,3})[A-Za-z -.]* (18|19|20)([0-9a-z]{2,3})\)

Replace with:

\\citep{\1\3}

Test citations:

(Bobby 2009)
(Bobby and Jonny 1909)
(Bobby et al. 1923)

(Maynard Smith 1989)
(Maynard Smith and Haigh 1974)
(Maynard Smith and Bobby-Jonny 1974)
(Maynard Smith and Maynard Smith 1974)
(Maynard Smith et al. 1993)

(Maisnier-Patin 1900)
(Maisnier-Patin and Bobby-Jonny 1956)
(Maisnier-Patin et al. 2002)

(Aa 1932)
(Aaa 1932)

Fluctuation test calculator: FALCOR copy/paste issue

Update 2019-07-31 FALCOR is offline. I did try contacting the author through Twitter, but have received no response. Since people still seem to look at this post regularly, consider switching to the R package flan for fluctuation tests. There’s even a way to launch a FALCOR-style web app through Shiny (interactive websites powered by R), ShinyFlan. We will be using this package in future.

FALCOR (Fluctuation AnaLysis CalculatOR) is a handy Java applet designed to estimate mutation rates from fluctuation test data. A Java update prevents access to the clipboard to unsigned Java applets. If you are having trouble copy/pasting your data into FALCOR, try modifying your java.policy file to include the line: permission java.awt.AWTPermission “accessClipboard” under where it says //”standard” properties that can be read by anyone.

// "standard" properties that can be read by anyone
permission java.awt.AWTPermission "accessClipboard"

Save the file, exit and restart Java (or any programs using it, like Firefox or LibreOffice), and copy/paste should work again. You may wish to consider removing the clipboard access after using FALCOR for security reasons.

Muzzling Canadian government workers

90% of federal scientists cannot speak freely

Over the past few years, there has been much attention given to the restrictions placed on scientists employed by the Canadian government.  A report recently produced by PIPSC has shown that 90% of government scientists do not feel they can speak freely to the media about their research.  Although this is an important issue, the problem runs much deeper.  What the reports fail to mention is that all federal government employees, scientists or not, must obtain permission to speak to the media in the capacity of their jobs.  The policies aren’t a only a direct affront to science, but also on the free sharing of information.

Fixation and loss of resistance mutations

The top plates contain genamicin, and the bottom two rifampicin. The bottom plates are colonies of Pseudomonas aeruginosa rpoB mutants +P518 (left), and S536F (right). The presence of a colony on the top but not the bottom represents the fixation of a rifamicin-sensitive invader (GFP-PA01) introduced into the rifampicin-resistant population. Note the diversity of colony morphology, suggesting divergence from the ancestral genotype has occurred in all three genetic backgrounds.