====== Running BLASTp locally ======
Use BLASTp to search a protein database, for example the proteome of another organism, for sequences similar to your protein of interest. As an alternative to the [[https://blast.ncbi.nlm.nih.gov/Blast.cgi|NCBI BLAST]] server, BLASTp can also be run locally from the command line.
Example query sequence for BLAST (FASTA format):
>HSP70B
MPVQQMTSMRSQSLAGAPVAPVKAGRAGVSRRGLAVSVRAEKVVGIDLGTTNSAVAAMEG
GKPTIITNAEGGRTTPSVVAFTKTGDRLVGQIAKRQAVVNPENTFFSVKRFIGRRMSEVG
SESTQVPYRVIEDGGNVKIKCPNAGKDFAPEEISAQVLRKLTEDAAKFLNDKVEKAVITV
PAYFNDSQRQATKDAGKIAGLEVLRIINEPTAASLAYGFDKKANETILVFDLGGGTFDVS
VLEVGDGVFEVLSTSGDTHLGGDDFDKRIVDFLADDFKKSEGIDLRKDRQALQRLTEAAE
KAKIELSGMAQTSINLPFITATADGPKHIDTQLTRAKFEEMCNDLLERCKVPVQQALRDA
KLSISDIQEVILVGGSTRIPAVQEIVRKLSGGKDPNVTVNPDEVVALGAAVQAGVLAGEV
SDIVLLDVTPLSLGLETLGGVMTKLIPRNTTLPTSKSEVFSTAADGQTSVEINVLQGERE
FARDNKSLGTFRLDGIPPAPRGVPQIEVKFDIDANGILSVTATDKGTSKKQDIRITGAST
LDKGDVERMVKEAEKFAGEDKKRRESVETKNQAETMVYQTEKQLKEFEGKVPADIKAKVE
AKLGELKAALPADDAEATKAAMNALQQEVMAMGQAMYSQAGAAPGGAPGAEPGAGAGAGG
APGGKKDDDVIDAEFTDKK
===== Locally with the command line =====
Create the BLAST database in a folder with the same name as the database. In this example, the database is the proteome of Chlamydomonas.
makeblastdb -in chlamydomonas.fa -dbtype prot -out chlamydomonas
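One possible way to get the "folder with the same name as the database" layout is sketched below. The helper name ''build_protein_db'' and the choice to create the folder from the directory that holds the FASTA file are assumptions made for illustration; only ''makeblastdb'' and its ''-in'', ''-dbtype'' and ''-out'' options come from BLAST+ itself.

```shell
# Hypothetical helper: build a protein BLAST database inside a folder
# that has the same name as the database.
build_protein_db() {
  fasta=$1   # protein FASTA file, e.g. chlamydomonas.fa
  name=$2    # database name, e.g. chlamydomonas
  # Create the folder first; makeblastdb does not create it for you.
  mkdir -p "$name" &&
    makeblastdb -in "$fasta" -dbtype prot -out "$name/$name"
}
```

For example, ''build_protein_db chlamydomonas.fa chlamydomonas'' should leave the database files under ''chlamydomonas/'', so that ''-db chlamydomonas'' resolves when BLASTp is started from inside that folder.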
Run BLASTp from the database folder.\\
(Note: adjust the parameters to your needs, e.g. evalue and max_target_seqs.)\\
(Note II: there are other ways to run BLAST locally, such as BLASTall or BLASTn; their parameters are not the same as in BLASTp.)
blastp -query ../secuencias_query.fasta -db chlamydomonas -out ../resultados_blastp.txt -evalue 0.05 -outfmt "6 std qcovs"
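To make the two parameters from the note easier to adjust, the command above can be wrapped in a small function. This is only a sketch: the function name ''blastp_tabular'' and the default values (e-value 1e-5, 10 target sequences) are hypothetical and should be chosen for your own data.

```shell
# Hypothetical wrapper around the blastp call shown above.
# Arguments: query FASTA, database name, output file,
#            e-value cutoff (optional), max target sequences (optional).
blastp_tabular() {
  query=$1
  db=$2
  out=$3
  evalue=${4:-1e-5}      # assumed default, not a recommendation
  max_targets=${5:-10}   # assumed default, not a recommendation
  blastp -query "$query" -db "$db" -out "$out" \
    -evalue "$evalue" -max_target_seqs "$max_targets" \
    -outfmt "6 std qcovs"
}
```

Called from the database folder, this could look like ''blastp_tabular ../secuencias_query.fasta chlamydomonas ../resultados_blastp.txt 0.05 20''.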
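BLASTp output by column for -outfmt "6 std qcovs" (the BLAST field name is given in parentheses). Definitions of BLAST terms can be found in the [[https://www.ncbi.nlm.nih.gov/books/NBK62051/|glossary]].
- query ID (qseqid)
- hit ID (sseqid)
- percent identity (pident)
- alignment length (length)
- number of mismatches (mismatch)
- number of gap openings (gapopen)
- query start (qstart)
- query end (qend)
- hit start (sstart)
- hit end (send)
- e-value (evalue) → we want this close to zero
- bit score (bitscore)
- query coverage per hit, in percent (qcovs)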
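Format 6 output has no header line, so it can help to prepend one before opening the file in a spreadsheet or in R. The sketch below assumes the file was produced with -outfmt "6 std qcovs"; the function name ''add_blast_header'' is made up for this example.

```shell
# Field names for -outfmt "6 std qcovs", in output order.
BLAST_COLS="qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qcovs"

# Hypothetical helper: print a tab-separated header line,
# followed by the unchanged BLAST tabular file given as $1.
add_blast_header() {
  printf '%s\n' "$BLAST_COLS" | tr ' ' '\t'
  cat "$1"
}
```

Usage could be ''add_blast_header ../resultados_blastp.txt > with_header.tsv'', where ''with_header.tsv'' is an arbitrary output name.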
/*
===== Filter your results =====
Which BLAST hits do you actually keep? \\ → The answer is always "It depends". You need to know your data and there are different methods for filtering.
=== Identity and coverage thresholds ===
* Depending on your dataset, you can keep the hits that meet a minimum percentage of identity and a minimum percentage of the target sequence length covered by the alignment. Note that these thresholds are arbitrary.
* For example, in a database of bacterial proteins, the [[https://p3.theseed.org/p3_docs/user_guide/genome_feature_data_and_tools/specialty_genes.html|suggested thresholds by PATRIC]] are 80% - 80% when using BLAST for bacteria, but 50% - 50% for finding human homologs.
* Another example: Sachli's threshold. She uses sequence identity > 90% and coverage > 95% to keep hits between bacterial strains of the same species.
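A threshold filter of this kind can be sketched with awk. The sketch assumes the 13-column output of -outfmt "6 std qcovs", where column 3 is the percent identity and column 13 is qcovs. Note that qcovs measures coverage of the query, not of the target sequence, so it is only a stand-in for the coverage described above. The function name ''filter_hits'' is hypothetical.

```shell
# Hypothetical helper: keep hits that reach both thresholds.
# $1 = BLAST tabular file, $2 = minimum percent identity,
# $3 = minimum query coverage (percent).
filter_hits() {
  awk -F'\t' -v min_id="$2" -v min_cov="$3" \
    '$3 + 0 >= min_id + 0 && $13 + 0 >= min_cov + 0' "$1"
}
```

For instance, ''filter_hits ../resultados_blastp.txt 80 80'' would apply 80% thresholds to both columns. The comparison is inclusive, so use slightly higher values if you need a strict "greater than".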
=== E-value ===
* The [[https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3820096/pdf/nihms519883.pdf|e-value]] can be tricky to use because it depends on the size of the database: the same alignment gets a larger e-value when the database is larger.
=== Bitscore ===
* The bit score is another indicator. In contrast to the e-value, it is independent of sequence length and database size.
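The difference between the two measures can be seen in the standard Karlin-Altschul relation between them, given here as background rather than as something specific to this protocol. The symbols are not used elsewhere on this page: S is the raw alignment score, S' the bit score, λ and K are constants of the scoring system, m is the query length and n the total length of the database (BLAST uses slightly shortened "effective" lengths in practice).

```latex
S' = \frac{\lambda S - \ln K}{\ln 2},
\qquad
E = m \, n \, 2^{-S'}
```

The bit score S' depends only on the alignment and the scoring system, while the e-value E grows in proportion to m and n. Doubling the database size therefore doubles the e-value of the same hit.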
*/
===== Noted issues =====
* In BLASTn, sequences made of a repeated nucleotide pattern may be missed because of the low-complexity filter (DUST), which is on by default and can be switched off with -dust no.
* Keep in mind the [[https://www.gqlifesciences.com/3-problems-with-using-blast-for-sequence-alignments-in-ip-searching/|limitations of BLAST]].