Author Archives: dannagifford

BibTeX database from PDFs via DOI

This somewhat-ridiculous BASH one-liner will create a BibTeX database file (.bib) from a bunch of PDFs via the Crossref API for DOIs, provided each PDF has a DOI on the first page, printed as "doi:" or "DOI:".  As DOIs were introduced in 2000, this will probably not work on vintage PDFs.

 for pdfs in *.pdf; do pdftotext -f 1 -l 1 "$pdfs" - | tr -d "\n" | grep -oE "(doi|DOI):\s?[A-Za-z0-9./()-]+[0-9]" | tr '[:upper:]' '[:lower:]' | sed -r 's;doi:\s?;https://api.crossref.org/works/;g' | sed -r 's;$;/transform/application/x-bibtex;g' | xargs curl -fsS 2>/dev/null | sed -e '$a\'; done > allpdf.bib
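
To make the text-to-URL part of the pipeline easier to test without any PDFs or network access, here is a minimal sketch of just that step as a shell function. The function name doi_to_crossref_url is made up for illustration, and the https://api.crossref.org/works/ prefix is my assumption about the Crossref endpoint being queried. Like the one-liner, it relies on GNU grep and GNU sed.

```shell
#!/usr/bin/env bash
# Sketch only: doi_to_crossref_url is a hypothetical helper name.
# Reads text on stdin and prints one Crossref BibTeX URL per "doi:" match.
# The URL prefix below is an assumption about the Crossref REST endpoint.
doi_to_crossref_url() {
  tr -d '\n' |
    grep -oE '(doi|DOI):\s?[A-Za-z0-9./()-]+[0-9]' |   # DOI must end in a digit
    tr '[:upper:]' '[:lower:]' |                       # DOIs are case-insensitive
    sed -r -e 's;^doi:\s?;https://api.crossref.org/works/;' \
           -e 's;$;/transform/application/x-bibtex;'
}

# Example with made-up text standing in for the first page of a PDF:
#   printf 'Journal of Examples\ndoi: 10.1234/ABC.5678\nAbstract\n' | doi_to_crossref_url
# prints:
#   https://api.crossref.org/works/10.1234/abc.5678/transform/application/x-bibtex
```

Because the newlines are removed before matching, a DOI that is immediately followed by a digit on the next line of the page will be over-matched, so it is worth spot-checking the resulting .bib file.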

Experimental evolution with the Singer ROTOR HDA

I’m setting up a large selection experiment on antibiotic fitness landscapes, and I’ve decided to use the Singer ROTOR HDA for consistency and repeatability. The ROTOR HDA uses a pre-sterilized pad system (called RePads) to transfer bacteria between solid or liquid media. RePads are available in 96, 384, 1536 or 6144 formats.

Vented versus non-vented Petri dishes

Vented Petri dishes have a small lip on the top edge of the dish that allows the lid to sit a little up from the bottom, allowing for some air flow.  Non-vented Petri dishes allow the lid to sit more or less flat on the bottom.  I was wondering what the best applications are for triple, single and non-vented Petri dishes, and found this guide from Thomas Scientific (link is dead).

  • Triple vented: aids gaseous exchange. Ideally suited for short term work
  • Single vented: limits gaseous exchange, minimizes evaporation and dehydration. Ideally suited for long term work
  • Non-vented: most suitable for anaerobic and long term work

Edit 28-02-2019: there’s an even better summary from Tritech Research:

Some of our dish models are available in both “vented” and “non-vented” styles. Standard Petri Dishes are always vented, so if they don’t say vented or non-vented, you should assume they are vented. “Vented” means that the lid is slightly elevated above the base. This allows for good, plentiful air exchange. This is useful when you want to encourage evaporation, for example, when you want to use poured plates as soon as possible, and the plates themselves, or a liquid seeding solution, needs to dry beforehand. The basic design of the dish tends to maintain sterility because particles would have to go up and over the dish’s wall to get inside, and this is rare in normal airflow.

With “non-vented” dishes, the lid fits quite flatly on the base. While it is not a hermetic seal, the space between dish and lid is extremely small. This results in even less potential for external contamination and a significantly reduced evaporation rate. For example a 60mm vented Petri Dish containing 10ml of agar medium typically dries out in 2-3 weeks; whereas, a similar 60mm non-vented dish typically lasts 2-3 months. Most C. elegans labs, except those in very humid climates, prefer the non-vented dishes. Non-vented dishes provide sufficient air exchange for the worms to breathe while greatly increasing the life of the dish.


Non-vented Petri dish


Vented Petri dish

JMP report file (.jrp) to CSV

Here’s a quick and dirty Perl script to get a data table out of a .jrp file from JMP.  Not guaranteed to work for all files, as I’ve only tested it on one (so modification may be necessary).

#! /usr/bin/perl -w

use strict;
use Getopt::Long;

my $jmp_file;
my $colhead;
my $values;
my @records;
my $ndata = 0;
GetOptions ('jmp=s' => \$jmp_file) or die("Error in arguments\n");
die "Usage: $0 --jmp FILE.jrp\n" unless defined $jmp_file;

open (JMP,"<$jmp_file") || die "cannot open JMP input file $jmp_file";
while(<JMP>){
    if($_=~/New Column\(\s\"(.+?)\",.+$/){ # Get column names
        $colhead = $1;
    }
    if($_=~/Set Values\(\s[\[\{](.+?)[\]\}] \) \),/){ # Get row values
        $values = $1;
        my @row = split ", ", $values;
        unshift @row, $colhead; # Column name becomes the first cell
        $ndata = scalar @row;
        push @records, \@row;
    }
}
close(JMP);

# Rotate table 90 degrees (rows-to-columns)
my $nrecords=scalar @records;
for(my $i=0;$i<$ndata; $i++){
    for(my $j=0;$j<$nrecords;$j++){
        print "$records[$j][$i]";
        print "," if $j<($nrecords-1);
    }
    print "\n";
}
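
The rotation step at the end is the only non-obvious part, so here is a rough shell sketch of the same rows-to-columns idea on its own. It assumes each input line is one column of the eventual table, written as a name followed by its values and separated by ", " (which is how the script stores each record before printing). The function name transpose_records is made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: transpose_records is a hypothetical helper name.
# Each input line holds one record ("name, v1, v2, ..."); the output is CSV
# with one record per column, i.e. the table rotated rows-to-columns.
transpose_records() {
  awk -F', *' -v OFS=',' '
    {
      for (i = 1; i <= NF; i++) cell[NR, i] = $i   # store every cell
      if (NF > ncol) ncol = NF                     # longest record seen
    }
    END {
      for (i = 1; i <= ncol; i++) {
        line = cell[1, i]
        for (r = 2; r <= NR; r++) line = line OFS cell[r, i]
        print line
      }
    }'
}

# Example:
#   printf 'dose, 1, 2\nod600, 0.3, 0.5\n' | transpose_records
# prints:
#   dose,od600
#   1,0.3
#   2,0.5
```

Records shorter than the longest one simply leave empty cells in the output.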

Find homopolymeric tracts in a FASTA genome

Assuming a single-sequence FASTA file (one header line followed by the sequence), this BASH one-liner finds homopolymeric tracts (HTs, stretches of the genome where a single nucleotide is repeated many times, e.g. AAAAA or TTTTTTT) of five or more bases in a genome and outputs the region.  Such regions are prone to sequencing errors, but are also mutational hotspots as they are susceptible to slippage errors during replication and transcription. Some evidence suggests that HTs may have a regulatory role in prokaryotes.

tail -n+2 GENOME.fa | tr -d '\n' | grep -ob -E "(\w)\1{4,}" | sed 's/:/\t/g' | awk '{print $1+1"\t"$1+length($2)"\t"substr($2,1,1)"\t"length($2); }' | sort -k1n

# strip the FASTA header
tail -n+2 GENOME.fa
# remove newlines
tr -d '\n'
# match >4 (i.e. 5 or more) of the same character, output the match and byte offset
grep -ob -E "(\w)\1{4,}"
# replace the ":" added by grep with a tab
sed 's/:/\t/g'
# prints the genomic position (start + end) of the HT, nucleotide (ACGT) and the length of the tract
# (awk strings are 1-indexed, so the first character is substr($2,1,1))
awk '{print $1+1"\t"$1+length($2)"\t"substr($2,1,1)"\t"length($2); }'
# sorts by natural numeric position in the genome
sort -k1n

Example output (Pseudomonas fluorescens Pf0-1 NC_007492.2):
35 39 C 5
157 162 A 6
374 378 C 5
440 444 T 5
529 533 T 5
1432 1436 T 5
3304 3308 C 5
3310 3315 C 6
3626 3630 G 5
4063 4067 G 5
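
If you want to change the minimum tract length without editing the regular expression by hand, the pipeline can be wrapped in a small function. This is only a sketch: the name find_ht and its optional second argument are made up for illustration, and it makes the same assumptions as the one-liner (a single-sequence FASTA file, and GNU grep for back-references with -E).

```shell
#!/usr/bin/env bash
# Sketch only: find_ht and its arguments are hypothetical names.
# Usage: find_ht GENOME.fa [MIN_LENGTH]
# Output columns (tab-separated): start, end, nucleotide, tract length.
find_ht() {
  local fasta=$1
  local min=${2:-5}   # minimum tract length; 5 matches the one-liner
  tail -n +2 "$fasta" |                        # strip the FASTA header
    tr -d '\n' |                               # join the sequence lines
    grep -ob -E "(\w)\1{$((min - 1)),}" |      # offset:match for each tract
    awk -F: -v OFS='\t' '{ print $1 + 1, $1 + length($2), substr($2, 1, 1), length($2) }' |
    sort -k1,1n
}

# Example on a toy file containing ">test", "ACGTAAAAAC" and "GGGGGGT":
#   find_ht toy.fa      reports 5-9 (A, length 5) and 11-16 (G, length 6)
#   find_ht toy.fa 6    reports only 11-16 (G, length 6)
```

Splitting on ":" inside awk replaces the separate sed step, but the positions are computed the same way: grep reports a 0-based byte offset, so the 1-based start is the offset plus one.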