Assuming standard FASTA format, this BASH one-liner finds homopolymeric tracts (HTs, stretches of the genome where a single nucleotide is repeated many times, e.g. AAAA or TTTTTTT) in a genome and outputs the region. Such regions are prone to sequencing errors, but are also mutational hotspots as they are susceptible to slippage errors during replication and transcription. Some evidence suggests that HTs may have a regulatory role in prokaryotes.
tail -n+2 GENOME.fa | tr -d '\n' | grep -ob -E "(\w)\1{4,}" | sed 's/:/\t/g' | awk '{print $1+1"\t"$1+length($2)"\t"substr($2,0,1)"\t"length($2); }' | sort -k1n
Explanation
# strip the FASTA header tail -n+2 GENOME.fa # remove newlines tr -d '\n' # match >4 (i.e. 5 or more) of the same character, output the match and byte offset grep -ob -E "(\w)\1{4,}" # replace the ":" added by grep with a tab sed 's/:/\t/g' # prints the genomic position (start + end) of the HT, nucleotide (ACGT) and the length of the tract awk '{print $1+1"\t"$1+length($2)"\t"substr($2,0,1)"\t"length($2); }' # sorts by natural numeric position in the genome sort -k1n
Example output (Pseudomonas fluorescens Pf0-1 NC_007492.2):
35 39 C 5
157 162 A 6
374 378 C 5
440 444 T 5
529 533 T 5
1432 1436 T 5
3304 3308 C 5
3310 3315 C 6
3626 3630 G 5
4063 4067 G 5
…