Find homopolymeric tracts in a FASTA genome

Assuming standard FASTA format, this BASH one-liner finds homopolymeric tracts (HTs, stretches of the genome where a single nucleotide is repeated many times, e.g. AAAA or TTTTTTT) in a genome and outputs the region.  Such regions are prone to sequencing errors, but are also mutational hotspots as they are susceptible to slippage errors during replication and transcription. Some evidence suggests that HTs may have a regulatory role in prokaryotes.

tail -n+2 GENOME.fa | tr -d '\n' | grep -ob -E "(\w)\1{4,}" | sed 's/:/\t/g' | awk '{print $1+1"\t"$1+length($2)"\t"substr($2,0,1)"\t"length($2); }' | sort -k1n

Explanation

# strip the FASTA header
tail -n+2 GENOME.fa
# remove newlines
tr -d '\n'
# match >4 (i.e. 5 or more) of the same character, output the match and byte offset
grep -ob -E "(\w)\1{4,}"
# replace the ":" added by grep with a tab
sed 's/:/\t/g'
# prints the genomic position (start + end) of the HT, nucleotide (ACGT) and the length of the tract
awk '{print $1+1"\t"$1+length($2)"\t"substr($2,0,1)"\t"length($2); }'
# sorts by natural numeric position in the genome
sort -k1n

Example output (Pseudomonas fluorescens Pf0-1 NC_007492.2):
35 39 C 5
157 162 A 6
374 378 C 5
440 444 T 5
529 533 T 5
1432 1436 T 5
3304 3308 C 5
3310 3315 C 6
3626 3630 G 5
4063 4067 G 5

Leave a comment...

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s