SeqIO

The SeqIO module handles sequence files in many formats. Because it supports both reading and writing, it also makes converting between file formats straightforward.

SeqIO supports many formats, including GenBank and FASTA. The full list is documented on the BioPython wiki.

Reading files

SeqIO provides functions for reading and writing files. When a single file contains multiple entries, each entry is stored in a SeqRecord object, and a for loop retrieves the entries one at a time.

from Bio import SeqIO

with open("sample.gb", "r") as fh:
    for seq_record in SeqIO.parse(fh, "genbank"):
        print(seq_record.id)
        print(seq_record.seq)
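
The same loop works on any file-like handle, so the behaviour can be tried without a file on disk. The following is a minimal sketch that parses a two-entry FASTA text held in memory; the text, the identifiers `seq1` and `seq2`, and the variable names are illustrative assumptions, not data from this page.

```python
from io import StringIO

from Bio import SeqIO

# Hypothetical two-entry FASTA text (made up for illustration)
fasta_text = ">seq1 first example\nACGTACGT\n>seq2 second example\nGGCCTTAA\n"

# SeqIO.parse yields one SeqRecord per entry
records = list(SeqIO.parse(StringIO(fasta_text), "fasta"))

ids = [record.id for record in records]
lengths = [len(record.seq) for record in records]
print(ids, lengths)
```

Wrapping the iterator in `list()` is convenient for small inputs; for large files, iterating directly keeps only one record in memory at a time.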

Writing files

The SeqIO.write function writes records to a file. For example, a GenBank file can be read and then written back out in FASTA format.

from Bio import SeqIO

with open("sample.fa", "w") as outfh, open("sample.gb", "r") as infh:
    seq = SeqIO.parse(infh, "genbank")
    SeqIO.write(seq, outfh, "fasta")
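
For a plain format conversion, `SeqIO.convert` combines the parse and write steps into one call and returns the number of records converted. The sketch below converts an in-memory FASTA text to the tab-separated `tab` format; the input text and variable names are assumptions chosen for illustration.

```python
from io import StringIO

from Bio import SeqIO

# Hypothetical FASTA input (made up for illustration)
fasta_text = ">seq1\nACGTACGT\n>seq2\nGGCCTTAA\n"

in_handle = StringIO(fasta_text)
out_handle = StringIO()

# Parse the input as FASTA and write it out as tab-separated "id<TAB>sequence"
count = SeqIO.convert(in_handle, "fasta", out_handle, "tab")

tab_lines = out_handle.getvalue().splitlines()
print(count)
print(tab_lines)
```

File names can be passed in place of the handles. Note that some target formats need information the source format does not carry, so not every pair of formats converts without extra annotation.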

SeqRecord

When SeqIO reads a GenBank, FASTA, or similar file, the information for each entry is stored in one SeqRecord object. A SeqRecord object has the following attributes.

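Attribute	Value
seq	Sequence data
id	Identifier such as the accession number
name	Sequence name, corresponding to fields such as LOCUS
description	Supplementary description of the sequence
letter_annotations	Per-letter annotations (dictionary); each value is a list, tuple, or string with the same length as the sequence
annotations	Annotations for the whole record (dictionary)
features	SeqFeature objects (annotation such as exons and introns)
dbxrefs	Database cross-references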

SeqRecord objects are created automatically when a file is read. To build one from a string instead, first create the sequence with Seq, then add the other attributes as needed.

from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord


seq = Seq("ACCGGT")
seq_r = SeqRecord(seq)

seq_r.seq
#Seq('ACCGGT', Alphabet())

seq_r.id
#'<unknown id>'

seq_r.description
#'<unknown description>'


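seq_r.id = "NC000001.1"
seq_r.description = "Sample sequence"
# Record-level metadata goes in the annotations dictionary
seq_r.annotations["evidence"] = "Bioinformatics, 64,2 210-233 (2012)"
seq_r.annotations["author"] = "K.Jeans, Z.Kavin"
# Per-letter values must match the sequence length (6 bases here)
seq_r.letter_annotations["phred_quality"] = [30, 42, 33, 60, 40, 40]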

seq_r.id
#'NC000001.1'

seq_r.description
#'Sample sequence'


# The same attributes can also be passed to the constructor
seq = Seq("ACCGGT")
seq_r = SeqRecord(seq, id="AC00002.1", description="Sample sequence")
seq_r.id
#'AC00002.1'

seq_r.description
#'Sample sequence'
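
Because `letter_annotations` is tied to the sequence length, slicing a SeqRecord slices the per-letter values along with the sequence. The following is a small sketch of that behaviour; the identifier `demo1`, the annotation text, and the quality values are made up for illustration.

```python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Hypothetical record; id, description, and qualities are illustrative
record = SeqRecord(Seq("ACCGGT"), id="demo1", description="illustrative record")
record.annotations["source"] = "synthetic example"
# One quality value per base (6 bases, 6 values)
record.letter_annotations["phred_quality"] = [30, 42, 33, 60, 40, 40]

# Slicing keeps the id and cuts the sequence and per-letter values together
sub = record[1:4]
sub_seq = str(sub.seq)
sub_quality = sub.letter_annotations["phred_quality"]
print(sub.id, sub_seq, sub_quality)
```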

SeqFeature

Files such as GenBank describe features, for example which regions of the sequence are exons and which are introns. Each feature is stored in a SeqFeature object.

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqFeature import SeqFeature, FeatureLocation

gb_record = SeqIO.read("ATU21952.gb", "genbank")

len(gb_record.features)
## 11

gb_record.features[0]
## SeqFeature(FeatureLocation(ExactPosition(0), ExactPosition(3009), strand=1), type='source')

gb_record.features[1]
## SeqFeature(FeatureLocation(ExactPosition(105), ExactPosition(563), strand=1), type="5'UTR")

Slicing a SeqRecord also carries over the SeqFeature objects that lie within the sliced region.

sub_gb_record = gb_record[1000:2500]
sub_gb_record
## SeqRecord(seq=Seq('TAGGATGCTTACTCATGGAATTAGAAGAACTCTTGATAGGCATACTATTTTAAG...AAA', IUPACAmbiguousDNA()), id='U21952.1', name='ATU21952', description='Arabidopsis thaliana ethylene response sensor (ERS) gene, complete cds.', dbxrefs=[])

len(sub_gb_record.features)
## 3

sub_gb_record.features
## [SeqFeature(FeatureLocation(ExactPosition(564), ExactPosition(933), strand=1), type='exon'), SeqFeature(FeatureLocation(ExactPosition(1013), ExactPosition(1280), strand=1), type='exon'), SeqFeature(FeatureLocation(ExactPosition(1358), ExactPosition(1486), strand=1), type='exon')]
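
The slicing behaviour can be reproduced without the GenBank file. The sketch below builds a small record by hand with two hypothetical exon features and slices it; the sequence, the identifier `demo2`, and the coordinates are assumptions for illustration. Only features that lie entirely inside the slice are kept, and their coordinates are shifted to be relative to the new record.

```python
from Bio.Seq import Seq
from Bio.SeqFeature import SeqFeature, FeatureLocation
from Bio.SeqRecord import SeqRecord

# Hypothetical 12-base record with two made-up exon features
record = SeqRecord(Seq("AACCGGTTAACC"), id="demo2")
record.features.append(SeqFeature(FeatureLocation(2, 6, strand=1), type="exon"))
record.features.append(SeqFeature(FeatureLocation(7, 10, strand=1), type="exon"))

# Slice 1..8: the first exon (2..6) fits inside, the second (7..10) does not
sub = record[1:8]

kept = [(f.type, int(f.location.start), int(f.location.end)) for f in sub.features]
print(kept)
```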

The location stored in a feature can be used to retrieve the corresponding nucleotide sequence.

ft = gb_record.features[3]
ft
## SeqFeature(FeatureLocation(ExactPosition(538), ExactPosition(1469), strand=1), type='exon')


sub_gb_record = gb_record[ft.location.start:ft.location.end]
sub_gb_record.seq
## Seq('GTCAACACAAGTCAGAGCTCCAAAAATGGAGTCATGCGATTGTTTTGAGACGCA...CAG', IUPACAmbiguousDNA())
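
Slicing by start and end ignores the strand of the feature. As an alternative sketch, the `extract` method of SeqFeature takes the parent sequence and returns the feature's sequence, reverse-complementing it when the feature is on the minus strand. The sequence and coordinates below are made up for illustration.

```python
from Bio.Seq import Seq
from Bio.SeqFeature import SeqFeature, FeatureLocation

# Hypothetical parent sequence and features (illustrative values)
parent = Seq("AAACCCGGGTTT")
forward = SeqFeature(FeatureLocation(1, 5, strand=1), type="exon")
reverse = SeqFeature(FeatureLocation(1, 5, strand=-1), type="exon")

# Plus strand: same bases as parent[1:5]
forward_seq = str(forward.extract(parent))
# Minus strand: reverse complement of parent[1:5]
reverse_seq = str(reverse.extract(parent))
print(forward_seq, reverse_seq)
```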