
22-Jan-2014
13-Apr-2016 updated

fasta36/scripts

Perl scripts for annotating sequences and expanding libraries

-- Sequence alignment scoring/annotation

Two program scripts -- annot_blast_btop2.pl and blastp_cmd.sh -- have
been added to support sub-alignment scoring of BLASTP alignments.

annot_blast_btop2.pl takes three inputs: (1) a query sequence file; (2)
a domain annotation script (see below), and (3) a BLAST tabular format
output with two additional fields, "score" and "btop":

annot_blast_btop2.pl --query query.file --ann_script ann_pfam_www.pl blast_tab_btop_file

The blast_tab_btop_file can be produced using the blastp_cmd.sh shell
script, which uses ASN.1 output and blast_formatter to produce both a
standard alignment file and the modified blast tabular btop file.

-- Implied Multiple sequence alignment --

As part of a strategy to improve PSSM-based similarity searching, two
scripts that use a BTOP encoded alignment string from either -m 8CB or
-m 9B output files to produce a Clustal-like multiple sequence
alignment (MSA) that can be used as input to psiblast to produce an
ASN.1 text file (which can be converted with datatool to ASN.1 binary,
which can be read by ssearch36 -P "file.asn1 2").  We used the BTOP
encoding, rather than the more common CIGAR string (-m 9C), or the
older alignment encoding (-m 9c), because the BTOP encoding only
requires the query sequence to reproduce both the query and subject
aligned residues.  Thus:

m8_btop_msa.pl --query gstt1_drome.aa gstt1_sp.bl_btop > gstt1_sp.ss_msa

where "gstt1_sp.bl_btop" is "-m 8CB" output, produces:
  ====
  SSEARCHm8 multiple sequence alignment


  sp|P20432|GSTT1_DROME     MVDFYYLPGSSPCRSVIMTAKAVGVELNKKLLNLQAGEHLKPEFLKINPQHTIPTLVDNG
  sp|P04907|GSTF3_MAIZE     ---LYGMPLSPNVVRVATVLNEKGLDFEIVPVDLTTGAHKQPDFLALNPFGQIPALVDGD
  sp|P12653|GSTF1_MAIZE     -------------------------------INFATAEHKSPEHLVRNPFGQVPALQDGD
  sp|P0ACA5|SSPA_ECO57      ------------------------------------------DLIDLNPNQSVPTLVDRE
  sp|P00502|GSTA1_RAT       VLHYFNARGRMECIRWLLA--AAGVEFDEKFI--QSPEDL--EKLKKDGNDQVPMVEIDG


  ...
  ====

which can be used with psiblast -in_msa gstt1_sp.ss_msa.
m9B_btop_msa.pl does the same for "-m 9B" output.

-- Domain annotation --

(Nov. 2015) These domain annotation scripts allow overlapping domains,
and must be used with versions of the FASTA programs that support the
current "start - stop domain_description" format (in contrast to the
older format which put domain starts and stops on separate lines with
'[' and ']'). Until this release, the "overlapping" domain scripts had
'_e' in their name, e.g. ann_pfam28_e.pl.  The "_e" scripts have been
renamed, losing the '_e', and the old non-'_e' scripts have been
removed from the distribuition.

All of the "ann_*.pl" scripts are used to annotate query or library
sequences using the -V option.  See ../test/test2V.pl for examples.


ann_feats2ipr.pl -- generate Uniprot sites, Interpro domains, from a mySQL database
ann_feats2l.pl -- generate Uniprot sites, domains from a mySQL database

ann_feats_up_www2.pl -- generate Uniprot sites, domains from an EBI
		     	web server that converts Uniprot DAS to gff3.

ann_feats_up_www.pl -- generate Uniprot sites, domains from a Uniprot
		       gff web server (less information than ann_feats_www2.pl)

ann_ipr_www.pl -- Interpro domains from Interpro WWW site.

ann_pdb_cath.pl -- generate CATH domains using PDB accessions from a mySQL database
ann_pdb_vast.pl -- use VAST domains, but domain names are not informative

ann_pfam27.pl -- generate Pfam domains using local Pfam mySQL database (Pfam27 with auto_pfamA, auto_pfamseq)
ann_pfam28.pl -- generate Pfam domains using local Pfam mySQL database (Pfam28, no auto_pfamA, auto_pfamseq)
ann_pfam_www.pl -- use Pfam Website, and XML::Twig, to get Pfam domain info.

ann_exons_ens.pl -- generate exon boundaries on SwissProt proteins from Ensembl.
ann_exons_up_www.pl -- generate exon boundaries on SwissProt proteins using the EBI/Proteins/API/coordinate service
ann_exons_ncbi.pl -- generate exon boundaries on NCBI refseq proteins.

-- Library expansion

expand_uniref50.pl -- allows search of uniref50 to be expanded
expand_links.pl -- script to take hits from a smaller library and expand to complete library
links2sql.pl -- create links for expand_links.pl

exp_up_ensg.pl -- expand uniprot sequences to include Ensembl splice variants

-- Plot local alignments (.lav files)

lav2plt.pl    -- used to produce postscript or svg plots of "lalign36 -m 11" lav output files
color_defs.pl -- used by lav2plt.pl to produce domain colors
lavplt_ps.pl  -- used by lav2plt.pl --dev ps
lavplt_svg.pl -- used by lav2plt.pl --dev svg

