Predicting Microbial Secretory Proteins with SignalP + TMHMM

Updated: 2025-07-12 Source: 健明迪檢測


PhD in Microbiology, Huazhong Agricultural University

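A secretory protein is a protein that is synthesized inside the cell and then secreted to act outside it. The N-terminus of a secretory protein usually carries a signal peptide of about 15-30 amino acids. A signal peptide is a short peptide chain (5-30 amino acids long) that directs a newly synthesized protein into the secretory pathway. The term usually refers to the N-terminal amino acid sequence of a nascent polypeptide that directs its transmembrane transfer (localization), although the sequence is not always at the N-terminus. SignalP is used to annotate whether a protein sequence contains a signal peptide, and TMHMM is used to annotate whether it contains transmembrane structures. Proteins that contain a signal peptide and no transmembrane structure are finally selected as secretory proteins.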
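The selection rule in the paragraph above can be sketched in a few lines of Python. This is only an illustration: the function name, the protein IDs and the helix counts are made up, and "no transmembrane structure" is assumed here to mean a predicted helix count of zero.

```python
def select_secretory(signal_peptide_ids, tm_helix_counts):
    """Return IDs that carry a signal peptide and have no predicted TM helix."""
    secretory = []
    for protein_id in signal_peptide_ids:
        # IDs absent from the TMHMM table are skipped rather than guessed at.
        if tm_helix_counts.get(protein_id) == 0:
            secretory.append(protein_id)
    return secretory


# Hypothetical IDs and counts, for illustration only.
with_signal_peptide = ["prot_A", "prot_B", "prot_C"]
helix_counts = {"prot_A": 0, "prot_B": 3, "prot_C": 0, "prot_D": 0}
print(select_secretory(with_signal_peptide, helix_counts))  # ['prot_A', 'prot_C']
```

prot_D is dropped because it has no signal peptide, and prot_B because it has predicted helices.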

Software

SignalP V6.0
SignalP 6.0 predicts signal peptides and the location of their cleavage sites in proteins from Archaea, Gram-positive Bacteria, Gram-negative Bacteria and Eukarya. In Bacteria and Archaea, SignalP 6.0 can discriminate between five types of signal peptides:

Sec/SPI: "standard" secretory signal peptides transported by the Sec translocon and cleaved by Signal Peptidase I (Lep).
Sec/SPII: lipoprotein signal peptides transported by the Sec translocon and cleaved by Signal Peptidase II (Lsp).
Tat/SPI: Tat signal peptides transported by the Tat translocon and cleaved by Signal Peptidase I (Lep).
Tat/SPII: Tat lipoprotein signal peptides transported by the Tat translocon and cleaved by Signal Peptidase II (Lsp).
Sec/SPIII: pilin and pilin-like signal peptides transported by the Sec translocon and cleaved by Signal Peptidase III (PilD/PibD).
Additionally, SignalP 6.0 predicts the regions of signal peptides. Depending on the type, the positions of n-, h- and c-regions as well as of other distinctive features are predicted.
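For downstream scripts it can help to keep the five types in a small lookup table. The sketch below simply restates the list above as a Python dictionary. The class name, field names and helper function are made up for illustration and are not part of SignalP.

```python
from typing import NamedTuple


class SignalPeptideType(NamedTuple):
    translocon: str
    peptidase: str
    description: str


# One entry per type named in the list above.
SP_TYPES = {
    "Sec/SPI": SignalPeptideType("Sec", "Signal Peptidase I (Lep)", "standard secretory signal peptide"),
    "Sec/SPII": SignalPeptideType("Sec", "Signal Peptidase II (Lsp)", "lipoprotein signal peptide"),
    "Tat/SPI": SignalPeptideType("Tat", "Signal Peptidase I (Lep)", "Tat signal peptide"),
    "Tat/SPII": SignalPeptideType("Tat", "Signal Peptidase II (Lsp)", "Tat lipoprotein signal peptide"),
    "Sec/SPIII": SignalPeptideType("Sec", "Signal Peptidase III (PilD/PibD)", "pilin or pilin-like signal peptide"),
}


def types_using(translocon):
    """List the signal peptide types transported by the given translocon."""
    return [name for name, info in SP_TYPES.items() if info.translocon == translocon]


print(types_using("Tat"))  # ['Tat/SPI', 'Tat/SPII']
```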

TMHMM V2.0c

Predicts transmembrane helices in proteins.

Python

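SignalP and TMHMM are free for academic users, but you must fill in some information and an email address to receive the download link (valid for 4 hours).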

Installation of Software

Installing SignalP 6.0

Download: visit the SignalP V6.0 website, find "Download", fill in the required information to obtain a download link, and download "signalp-6.0.fast.tar.gz". Two modes are available: "slow_sequential" and "fast". The former runs the full model sequentially, taking the same amount of RAM as fast but being 6 times slower; the latter uses a smaller model that approximates the performance of the full model, requiring a fraction of the resources and being significantly faster. This tutorial uses the fast mode.
Installation

Dependencies

Python
matplotlib>3.3.2
numpy>1.19.2
torch>1.7.0 (install with: pip install torch)
tqdm>4.46.1
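Before installing SignalP it may be worth checking which of these packages are already present. The sketch below uses the standard-library importlib.metadata module. The minimum versions are copied from the list above, while the helper names and the simplified version comparison (first three numeric components only) are assumptions made for illustration.

```python
import re
from importlib import metadata

# Minimum versions copied from the dependency list above.
REQUIRED = {"matplotlib": "3.3.2", "numpy": "1.19.2", "torch": "1.7.0", "tqdm": "4.46.1"}


def version_tuple(version):
    """Turn '1.19.2' or '2.1.0+cu118' into a tuple of up to three integers."""
    public_part = version.split("+")[0]
    return tuple(int(number) for number in re.findall(r"\d+", public_part)[:3])


def check_dependencies(required=REQUIRED):
    """Return a short status string for every required package."""
    report = {}
    for name, minimum in required.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        # The list above uses a strict ">" comparison, so it is mirrored here.
        newer = version_tuple(installed) > version_tuple(minimum)
        report[name] = installed + (" ok" if newer else " too old, need >" + minimum)
    return report


for package, status in check_dependencies().items():
    print(package, status)
```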

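Install SignalP 6.0

# Unpack the installation archive
tar zxvf signalp-6.0.fast.tar.gz
# Enter the unpacked software directory and run the following in a terminal
python setup.py install
# Test the installation
signalp6 --help

Installing TMHMM V2.0c

Download: visit the TMHMM V2.0c website, find "Download", fill in the required information to obtain a download link, and download "tmhmm-2.0c.Linux.tar.gz".
Installation

# Unpack
tar zxvf tmhmm-2.0c.Linux.tar.gz
# Enter the bin directory of the unpacked package
cd tmhmm-2.0c/bin
# Print the current path; mine is "/home/liu/tools/tmhmm-2.0c/bin"
pwd
# Add this path to the PATH environment variable by editing ~/.bashrc; see my earlier post:
# http://liaochenlanruo.github.io/post/f6c9.html#%E6%B7%BB%E5%8A%A0%E7%8E%AF%E5%A2%83%E5%8F%98%E9%87%8F
# Change the first line of tmhmm and tmhmmformat.pl in the bin directory to "#!/usr/bin/perl"

Runtime error: running the software always fails with "Segmentation fault (core dumped)", and I have no fix for it yet. You can use the online version instead.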
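The first-line edit of tmhmm and tmhmmformat.pl mentioned in the installation steps can also be scripted. The sketch below is one possible way to do it in Python; the function name is made up, and the demo runs on a temporary stand-in file whose original first line is invented, not on a real TMHMM installation.

```python
import tempfile
from pathlib import Path


def set_perl_shebang(script_path, interpreter="/usr/bin/perl"):
    """Replace (or add) the first line of a script with '#!<interpreter>'."""
    path = Path(script_path)
    lines = path.read_text().splitlines(keepends=True)
    new_first_line = "#!" + interpreter + "\n"
    if lines and lines[0].startswith("#!"):
        lines[0] = new_first_line
    else:
        lines.insert(0, new_first_line)
    path.write_text("".join(lines))


# Demo on a throwaway file; the original shebang here is a made-up example.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "tmhmm"
    demo.write_text("#!/some/other/perl\nprint 1;\n")
    set_perl_shebang(demo)
    first_line = demo.read_text().splitlines()[0]
    print(first_line)  # #!/usr/bin/perl
```

On a real installation the function would be called once for each of the two scripts in the bin directory.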

Usage

SignalP 6.0

Prediction

A command takes the following form

signalp6 --fastafile /path/to/input.fasta --organism other --output_dir path/to/be/saved --format txt --mode fast

fastafile specifies the FASTA file with the protein sequences to be predicted.
organism is either other or Eukarya. Specifying Eukarya triggers post-processing of the SP predictions to prevent spurious results (only predicts type Sec/SPI).
format can take the values txt, png, eps, all. It defines what output files are created for individual sequences. txt produces a tabular .gff file with the per-position predictions for each sequence. png, eps, all additionally produce probability plots in the requested format. For larger prediction jobs, plotting will slow down the processing speed significantly.
mode is either fast, slow or slow-sequential. Default is fast, which uses a smaller model that approximates the performance of the full model, requiring a fraction of the resources and being significantly faster. slow runs the full model in parallel, which requires more than 14GB of RAM to be available. slow-sequential runs the full model sequentially, taking the same amount of RAM as fast but being 6 times slower. If the specified model is not installed, SignalP will abort with an error.
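When SignalP is called from a script, the command can be assembled as an argument list, which avoids shell quoting problems with unusual file names. The sketch below only builds and prints the command shown above. The function name is hypothetical, and the accepted values for format and mode are taken from the option descriptions in this section.

```python
import shlex

FORMATS = ("txt", "png", "eps", "all")
MODES = ("fast", "slow", "slow-sequential")


def build_signalp6_command(fasta, output_dir, organism="other", fmt="txt", mode="fast"):
    """Assemble the signalp6 call as a list suitable for subprocess.run."""
    if fmt not in FORMATS:
        raise ValueError("format must be one of " + ", ".join(FORMATS))
    if mode not in MODES:
        raise ValueError("mode must be one of " + ", ".join(MODES))
    return [
        "signalp6",
        "--fastafile", str(fasta),
        "--organism", organism,
        "--output_dir", str(output_dir),
        "--format", fmt,
        "--mode", mode,
    ]


command = build_signalp6_command("/path/to/input.fasta", "path/to/be/saved")
print(shlex.join(command))
```

To actually run the prediction, the list can be passed to subprocess.run(command, check=True), assuming signalp6 is installed and on the PATH.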

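Outputs

output_dir/output.gff3: contains only the sequences that have a signal peptide;
output_dir/prediction_results.txt: contains all sequences in the input file (not important);
output_dir/region_output.gff3: contains the region information for all signal peptides.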

n-region: The n-terminal region of the signal peptide. Reported for Sec/SPI, Sec/SPII, Tat/SPI and Tat/SPII. Labeled as N
h-region: The center hydrophobic region of the signal peptide. Reported for Sec/SPI, Sec/SPII, Tat/SPI and Tat/SPII. Labeled as H
c-region: The c-terminal region of the signal peptide, reported for Sec/SPI and Tat/SPI.
Cysteine: The conserved cysteine in +1 of the cleavage site of Lipoproteins that is used for Lipidation. Labeled as c.
Twin-arginine motif: The twin-arginine motif at the end of the n-region that is characteristic for Tat signal peptides. Labeled as R.
Sec/SPIII: These signal peptides have no known region structure.
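The two GFF3 files can be read with a few lines of Python. The sketch below assumes the generic tab-separated GFF3 column order (sequence ID, source, feature type, start, end, score, ...) and skips comment lines. The sample rows are made up for illustration and are not real SignalP output.

```python
import io


def read_gff3(handle):
    """Parse GFF3 rows into dictionaries, ignoring comments and short lines."""
    records = []
    for line in handle:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue
        columns = line.split("\t")
        if len(columns) < 6:
            continue
        seqid, _source, feature, start, end, score = columns[:6]
        records.append({
            "seqid": seqid,
            "feature": feature,
            "start": int(start),
            "end": int(end),
            "score": None if score == "." else float(score),
        })
    return records


# Made-up example content, assumed to resemble output.gff3.
SAMPLE = (
    "## gff-version 3\n"
    "prot_A\tSignalP-6.0\tsignal_peptide\t1\t24\t0.98\t.\t.\t.\n"
    "prot_C\tSignalP-6.0\tsignal_peptide\t1\t31\t0.87\t.\t.\t.\n"
)

for record in read_gff3(io.StringIO(SAMPLE)):
    length = record["end"] - record["start"] + 1
    print(record["seqid"], record["feature"], length)
```

For a real run, io.StringIO(SAMPLE) would be replaced by an open file handle on output_dir/output.gff3 or output_dir/region_output.gff3.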

Batch processing and result refinement

Script name: run_SignalP.pl

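#!/usr/bin/perl
use strict;
use warnings;

# Author: Liu Hualin
# Date: Oct 14, 2021

# IDs reported by SignalP that cannot be found in the input FASTA are logged here
open IDNOSEQ, ">IDNOSEQ.txt" or die "Cannot write IDNOSEQ.txt: $!";
my @faa = glob("*.faa");
foreach my $faa (@faa) {
    $faa =~ /(.+)\.faa$/;
    my $str = $1;
    my $out = $str . ".nodesc";
    my $sigseq = $str . ".sigseq";
    my $outdir = $str . "_signalp";
    # Write a copy of the FASTA file whose headers keep only the sequence ID
    open IN, $faa or die "Cannot open $faa: $!";
    open OUT, ">$out" or die "Cannot write $out: $!";
    while (<IN>) {
        chomp;
        if (/^(>\S+)/) {
            print OUT $1 . "\n";
        }else {
            print OUT $_ . "\n";
        }
    }
    close IN;
    close OUT;
    my %hash = idseq($out);
    system("signalp6 --fastafile $out --organism other --output_dir $outdir --format txt --mode fast");
    my $gff = $outdir . "/output.gff3";
    # Continue only if SignalP produced a non-empty GFF3 file
    if (-s $gff) {
        open IN, $gff or die "Cannot open $gff: $!";
        <IN>;    # skip the header line
        open OUT, ">$sigseq" or die "Cannot write $sigseq: $!";
        while (<IN>) {
            chomp;
            my @lines = split /\t/;
            if (exists $hash{$lines[0]}) {
                print OUT ">$lines[0]\n$hash{$lines[0]}\n";
            }else {
                print IDNOSEQ $str . "\t" . "$lines[0]\n";
            }
        }
        close IN;
        close OUT;
        system("mv $sigseq $outdir");
    }
    system("rm $out");
}
close IDNOSEQ;

# Read a FASTA file into a hash of ID => sequence
sub idseq {
    my ($fasta) = @_;
    my %hash;
    local $/ = ">";
    open IN, $fasta or die "Cannot open $fasta: $!";
    <IN>;    # discard the empty record before the first ">"
    while (<IN>) {
        chomp;
        my ($header, $seq) = split (/\n/, $_, 2);
        $header =~ /(\S+)/;
        my $id = $1;
        $seq =~ s/\s+$//;    # drop the trailing newline so no blank line is printed
        $hash{$id} = $seq;
    }
    close IN;
    return (%hash);
}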

Place run_SignalP.pl in the same directory as the FASTA files with the ".faa" suffix, then run the following in a terminal:

perl run_SignalP.pl
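For readers who prefer Python, the two core steps of the script (reading a FASTA file keyed by the first word of each header, then separating the IDs that have a sequence from those that do not) might look roughly like the sketch below. It is not a replacement for run_SignalP.pl. The function names and the example records are made up.

```python
import io


def read_fasta(handle):
    """Return {sequence ID: sequence}, keeping only the first word of each header."""
    chunks = {}
    current_id = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            current_id = fields[0] if fields else ""
            chunks[current_id] = []
        elif current_id is not None:
            chunks[current_id].append(line)
    return {seq_id: "".join(parts) for seq_id, parts in chunks.items()}


def split_by_presence(sequences, wanted_ids):
    """Separate wanted IDs into those with a sequence and those without."""
    found = {seq_id: sequences[seq_id] for seq_id in wanted_ids if seq_id in sequences}
    missing = [seq_id for seq_id in wanted_ids if seq_id not in sequences]
    return found, missing


# Made-up records for illustration.
FASTA = ">prot_A hypothetical protein\nMKKLLA\nVVAG\n>prot_B another description\nMSTQ\n"
sequences = read_fasta(io.StringIO(FASTA))
found, missing = split_by_presence(sequences, ["prot_A", "prot_Z"])
print(found)    # {'prot_A': 'MKKLLAVVAG'}
print(missing)  # ['prot_Z']
```

Here found plays the role of the *.sigseq content and missing that of the IDNOSEQ.txt entries.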

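Output interpretation

* stands for the name of the input file.

*_signalp/output.gff3: contains only the sequences that have a signal peptide;
*_signalp/prediction_results.txt: contains all sequences in the input file (not important);
*_signalp/region_output.gff3: contains the region information for all signal peptides;
*_signalp/*.sigseq: stores the amino acid sequences of all proteins with a signal peptide and can be used as the input file for TMHMM.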

TMHMM

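Prediction

The offline version always fails and I could not find the cause, so the web server is used instead. The input files are the "*_signalp/*.sigseq" files generated above: upload them to the web version of TMHMM, submit the job, and wait for the results.

Results

TMHMM can output results in several formats; see its official documentation for details.

Submitting a job on the TMHMM website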

Long output format

Length: The length of the protein sequence.
Number of predicted TMHs: The number of predicted transmembrane helices.
Exp number of AAs in TMHs: The expected number of amino acids in transmembrane helices. If this number is larger than 18 it is very likely to be a transmembrane protein (OR have a signal peptide).
Exp number, first 60 AAs: The expected number of amino acids in transmembrane helices in the first 60 amino acids of the protein. If it is more than a few, you are warned that a predicted transmembrane helix in the N-term could be a signal peptide.
Total prob of N-in: The total probability that the N-term is on the cytoplasmic side of the membrane.
POSSIBLE N-term signal sequence: A warning that is produced when "Exp number, first 60 AAs" is larger than 10.

Protein F01_bin.1_00110 has 436 amino acids in total and 5 transmembrane helices.

Protein F01_bin.1_00142 has 557 amino acids in total, and the whole sequence lies outside the membrane, i.e. this sequence encodes a secretory protein.

Short output format

"len=": The length of the protein sequence.
"ExpAA=": The expected number of amino acids in transmembrane helices. If this number is larger than 18 it is very likely to be a transmembrane protein (OR have a signal peptide).
"First60=": The expected number of amino acids in transmembrane helices in the first 60 amino acids of the protein. If it is more than a few, you are warned that a predicted transmembrane helix in the N-term could be a signal peptide.
"PredHel=": The number of predicted transmembrane helices by N-best.
"Topology=": The topology predicted by N-best. The topology is given as the positions of the transmembrane helices, separated by "i" if the loop is on the inside or "o" if it is on the outside. For example, 'i7-29o44-66i87-109o' means that the protein starts on the inside, has a predicted TMH at positions 7 to 29, is outside from 30 to 43, and then has a TMH at positions 44-66.

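The Short output format fields and the topology string described above lend themselves to a small parser. The sketch below assumes each line holds the protein ID followed by key=value fields separated by whitespace. The protein ID and the numeric values in the example line are made up; only the topology string comes from the description above.

```python
import re


def parse_short_line(line):
    """Split one short-format line into a dict of its key=value fields."""
    fields = line.split()
    record = {"id": fields[0]}
    for field in fields[1:]:
        key, _, value = field.partition("=")
        record[key] = value
    return record


def parse_topology(topology):
    """Return the starting side ('i' or 'o') and the list of helix (start, end) pairs."""
    helices = [(int(start), int(end)) for start, end in re.findall(r"(\d+)-(\d+)", topology)]
    return topology[:1], helices


# Hypothetical line; only the Topology value is taken from the text above.
line = "prot_X\tlen=120\tExpAA=68.50\tFirst60=40.20\tPredHel=3\tTopology=i7-29o44-66i87-109o"
record = parse_short_line(line)
side, helices = parse_topology(record["Topology"])
print(side, helices)  # i [(7, 29), (44, 66), (87, 109)]
```

A topology of just "o" yields an empty helix list, which is the case tmhmm_parser.pl treats as a secretory protein.
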
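Summary of results

The web prediction gives us only a list (Short output format), which has to be copied from the web page and pasted into a new file by hand. I named it *_TMHMM_SHORT.txt and stored it in the *_signalp directory generated by run_SignalP.pl. Next, the total number of signal-peptide proteins, the number of secretory proteins and the number of transmembrane proteins in each genome are written to Statistics.txt. The secretory protein sequences of each genome are extracted to *_signalp/*.secretory.faa and the transmembrane protein sequences to *_signalp/*.membrane.faa. This is done by tmhmm_parser.pl.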

#!/usr/bin/perl
use strict;
use warnings;

# Author: Liu Hualin
# Date: Oct 15, 2021

open OUT, ">Statistics.txt" or die "Cannot write Statistics.txt: $!";
print OUT "Strain name\tSignal peptide numbers\tSecretory protein numbers\tMembrane protein numbers\n";
my @sig = glob("*_signalp");
foreach my $sig (@sig) {
    $sig =~ /(.+)_signalp/;
    my $str = $1;
    my $tmhmm = $sig . "/$str" . "_TMHMM_SHORT.txt";
    my $fasta = $sig . "/$str" . ".sigseq";
    my $secretory = $str . ".secretory.faa";
    my $membrane = $str . ".membrane.faa";
    open SEC, ">$secretory" or die "Cannot write $secretory: $!";
    open MEM, ">$membrane" or die "Cannot write $membrane: $!";
    my $out = 0;    # proteins entirely outside the membrane (secretory)
    my $on = 0;     # proteins with at least one transmembrane helix
    my %hash = idseq($fasta);
    open IN, $tmhmm or die "Cannot open $tmhmm: $!";
    while (<IN>) {
        chomp;
        $_ =~ s/[\r\n]+//g;
        my @lines = split /\t/;
        next unless @lines >= 6;    # skip blank or incomplete pasted lines
        if ($lines[5] eq "Topology=o") {
            $out++;
            print SEC ">$lines[0]\n$hash{$lines[0]}\n";
        }else {
            $on++;
            print MEM ">$lines[0]\n$hash{$lines[0]}\n";
        }
    }
    close IN;
    close SEC;
    close MEM;
    system("mv $secretory $membrane $sig");
    my $total = $out + $on;
    print OUT "$str\t$total\t$out\t$on\n";
}
close OUT;

# Read a FASTA file into a hash of ID => sequence
sub idseq {
    my ($fasta) = @_;
    my %hash;
    local $/ = ">";
    open IN, $fasta or die "Cannot open $fasta: $!";
    <IN>;    # discard the empty record before the first ">"
    while (<IN>) {
        chomp;
        my ($header, $seq) = split (/\n/, $_, 2);
        $header =~ /(\S+)/;
        my $id = $1;
        $seq =~ s/\s+$//;    # drop the trailing newline so no blank line is printed
        $hash{$id} = $seq;
    }
    close IN;
    return (%hash);
}