====== Practical guide to BLASTp ======
Use this guide to find sequences similar to your protein of interest among the proteins of another organism.
===== NCBI server =====
You can use [[https://blast.ncbi.nlm.nih.gov/Blast.cgi|NCBI BLAST]] to check for sequence similarity against NCBI's database of genomes, or against a set of sequences that you upload. However, when you BLAST many sequences the output from the web server is inconvenient to process, and running BLAST locally is usually the better approach.
===== Locally with the command line =====
Create the BLAST database in a folder with the same name as the database. In this example, the database is the proteome of Chlamydomonas.
makeblastdb -in chlamydomonas.fa -dbtype prot -out chlamydomonas
Run BLASTp from the database folder.\\
(Note: adjust the parameters to your needs, e.g. evalue and max_target_seqs.)\\
(Note II: there are other ways to run BLAST locally, such as BLASTall or BLASTn; their parameters are not the same as in BLASTp.)
blastp -query ../secuencias_query.fasta -db chlamydomonas -out ../resultados_blastp.txt -evalue 0.05 -outfmt "6 std qcovs" -max_target_seqs 1
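BLASTp output, column by column (field names as used by the "6 std qcovs" output format in parentheses). Definitions of BLAST terms can be found in the [[https://www.ncbi.nlm.nih.gov/books/NBK62051/|glossary]].
- query ID (qseqid)
- hit ID (sseqid)
- % identity (pident)
- alignment length (length)
- number of mismatches (mismatch)
- number of gap openings (gapopen)
- query start (qstart)
- query end (qend)
- hit start (sstart)
- hit end (send)
- e-value (evalue) → we want this close to zero
- bit score (bitscore)
- query coverage, in % (qcovs)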
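
Tabular output (format 6) has no header line, so it can help to add one before opening the file in a spreadsheet or in R. The following is a minimal sketch, assuming the table was produced with "6 std qcovs" as in the command above. The function name ''add_header'' and the output file name are made up for illustration, and newer BLAST+ versions may label the first two columns qaccver and saccver instead of qseqid and sseqid.

```shell
# Column labels for -outfmt "6 std qcovs": the 12 standard columns plus qcovs.
BLAST_COLS="qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qcovs"

# Print a tab-separated header line, then the BLAST table unchanged.
# add_header is a hypothetical helper, not part of BLAST+.
add_header() {
  printf '%s\n' "$BLAST_COLS" | tr ' ' '\t'
  cat "$1"
}

# Usage (input file name taken from the blastp command above):
# add_header ../resultados_blastp.txt > ../resultados_blastp_header.tsv
```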
===== Filter your results =====
Which BLAST hits do you actually keep? \\ → The answer is always "It depends". You need to know your data and there are different methods for filtering.
=== Identity and coverage thresholds ===
  * Depending on your dataset, you can keep the hits that meet a certain percentage of identity and a certain percentage of length covered in the target sequence.
  * For example, the [[https://p3.theseed.org/p3_docs/user_guide/genome_feature_data_and_tools/specialty_genes.html|suggested thresholds by PATRIC]] are 80% identity and 80% coverage when using BLAST against a database of bacterial proteins, but 50% identity and 50% coverage for finding human homologs.
  * Another example: Sachli's threshold. She uses sequence identity > 90% and coverage > 95% to keep hits between bacterial strains of the same species.
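
Thresholds like these can be applied directly to the tabular output. The sketch below assumes the "6 std qcovs" layout from the blastp command above, where column 3 is the percent identity and column 13 is the query coverage (qcovs). Note that this is coverage of the query, not of the target. The function name ''filter_hits'' and the output file name are hypothetical, and the cutoffs are arguments so you can plug in whichever values fit your data.

```shell
# Keep rows whose percent identity (column 3) and query coverage (column 13)
# are both at or above the given minimums.
# Assumes a table written with -outfmt "6 std qcovs".
filter_hits() {
  min_identity="$1"
  min_coverage="$2"
  table="$3"
  awk -F'\t' -v id="$min_identity" -v cov="$min_coverage" \
    '$3 >= id && $13 >= cov' "$table"
}

# Usage, with the 80% / 80% thresholds mentioned above:
# filter_hits 80 80 ../resultados_blastp.txt > ../resultados_filtered.txt
```

Use ''>='' or ''>'' in the awk condition depending on whether a hit sitting exactly on the threshold should be kept.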
=== E-value ===
  * The [[https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3820096/pdf/nihms519883.pdf|e-value]] can be tricky to use as a threshold, since it changes with the size of the database: the same alignment gets a larger e-value when it is searched against a larger database.
=== Bitscore ===
  * The bit score is another indicator. In contrast to the e-value, it is independent of sequence length and database size, so it can be compared between searches against different databases.
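
The link between the two values is the standard relation E = m × n × 2^(-S'), where S' is the bit score, m is the query length and n is the total length of the database in residues. The sketch below uses this relation to illustrate why the e-value moves with database size while the bit score does not. It is an approximation: BLAST works with corrected "effective" lengths, so the e-values it reports will not match exactly. The function name ''evalue_from_bitscore'' and the numbers in the usage lines are made up for illustration.

```shell
# Approximate e-value from a bit score: E = m * n * 2^(-S')
#   $1 = bit score (S')
#   $2 = query length in residues (m)
#   $3 = total database length in residues (n)
# BLAST itself uses effective lengths, so treat the result as an estimate.
evalue_from_bitscore() {
  awk -v s="$1" -v m="$2" -v n="$3" \
    'BEGIN { printf "%.3g\n", m * n * exp(-s * log(2)) }'
}

# Usage: the same hit (bit score 50, query of 300 residues) against a
# database of 10 million residues and one ten times larger:
# evalue_from_bitscore 50 300 10000000
# evalue_from_bitscore 50 300 100000000
```

With the bit score fixed, a tenfold larger database gives a tenfold larger e-value, which is why an e-value cutoff tuned on one database does not transfer directly to another.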