Protein Structure Analysis Using the NCBI API

library04 2024. 12. 4. 19:50

[First written: 24.12.04]

[Later revisions: ]

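1. Overview

Use a RESTful API service to visualize the 3D protein structure of part of insulin
Tools: Postman, Jupyter Notebook
Required libraries: requests, json, re, py3Dmol
- Install command: pip install requests py3Dmol (json and re are part of the Python standard library)
Database list: search Google for 'ncbi entrez e-utilities' (the einfo, esearch, and efetch endpoints are used here)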
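
As a rough illustration of how the three endpoints named above relate, the sketch below assembles their URLs from a shared base. The helper name `build_eutils_url` is made up for this example and is not part of any NCBI library; the parameter values are the ones that appear in the code later in this post.

```python
from urllib.parse import urlencode

# Base path shared by every E-utilities endpoint used in this post
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_eutils_url(endpoint, **params):
    """Return the URL for an endpoint such as 'einfo', 'esearch' or 'efetch'.

    Hypothetical helper for illustration only.
    """
    return f"{EUTILS_BASE}/{endpoint}.fcgi?{urlencode(params)}"


# einfo: describe a database
einfo_url = build_eutils_url("einfo", db="protein", retmode="json")
# esearch: search a database and get back matching IDs
esearch_url = build_eutils_url("esearch", db="protein", term="P01308", retmode="json")
# efetch: download the record for one ID in GenBank text format
efetch_url = build_eutils_url("efetch", db="protein", id="124617", rettype="gb", retmode="text")

for url in (einfo_url, esearch_url, efetch_url):
    print(url)
```

Building the query string with `urlencode` also takes care of escaping search terms that contain spaces or other special characters.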

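2. Overall flow

Retrieve the NCBI database information and save the details of the protein database to a file
Search for a specific protein (P01308) and save the related IDs to a file
Use the saved IDs to fetch each protein's record in GenBank format
Extract the PDB IDs from the GenBank file
Use a PDB ID to visualize the 3D structure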
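
The third step sends one efetch request per saved ID. NCBI asks clients without an API key to stay at or below three requests per second, so a loop over many IDs may benefit from a short pause between calls. The following is a minimal sketch under that assumption; `fetch_records`, `fetch_one`, and the 0.34-second interval are illustrative choices, not something the original notebook uses.

```python
import time


def fetch_records(ids, fetch_one, min_interval=0.34, sleep=time.sleep):
    """Call fetch_one(uid) for each ID, pausing between consecutive calls.

    fetch_one is any callable that takes an ID and returns the record text,
    for example a small wrapper around requests.get(...).text.
    Hypothetical helper for illustration only.
    """
    records = {}
    for index, uid in enumerate(ids):
        if index > 0:
            sleep(min_interval)  # wait before every call except the first
        records[uid] = fetch_one(uid)
    return records


# Offline demonstration with a stand-in fetcher, so no network access is needed.
def fake_fetch(uid):
    return f"record for {uid}"


pauses = []  # records each requested pause instead of actually sleeping
demo = fetch_records(["124617", "PLACEHOLDER_ID"], fake_fetch, sleep=pauses.append)
print(demo)
print("pauses requested:", pauses)
```

Passing the fetcher and the sleep function in as arguments keeps the pacing logic testable without contacting NCBI; in the notebook, `fetch_one` could wrap the efetch request.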

# 1. Retrieve NCBI database information
import requests
import json

# Query the list of NCBI databases
url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi?retmode=JSON"
response = requests.get(url)

# Find the protein database in the list and fetch its details
python_object = response.json()['einforesult']
db_list = python_object['dblist']
for db in db_list:
    if db == 'protein':
        db_info_url = f'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi?retmode=JSON&db={db}'
        print(db_info_url)
        db_response = requests.get(db_info_url)
        print('db_response status code: ', db_response.status_code)
        result = json.dumps(db_response.json()['einforesult'], indent=4)
        print(result)
        # Save the detailed protein database information to a file
        with open("protein_db_info.txt", "w", encoding="utf-8") as file:
            file.write(result)

# 2. Search for a specific protein (example: P01308, insulin)
url = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=protein&term=P01308&retmode=JSON'
response = requests.get(url)
search_result = response.json()  # parse the JSON body once and reuse it
print('GET: ', response.status_code, search_result)

# Extract the ID list from the search result
esearch = search_result['esearchresult']
print(esearch)
id_list = esearch['idlist']

# Save the ID list to a file ('w' overwrites, so re-running does not duplicate IDs)
with open("id_list.txt", 'w', encoding="utf-8") as file:
    for protein_id in id_list:
        print('id: ', protein_id)
        file.write(protein_id + "\n")

# 3. Fetch detailed protein records using the saved IDs
with open('id_list.txt', 'r', encoding='utf-8') as file:
    for line in file:
        protein_id = line.strip()
        if protein_id:  # skip blank lines
            url = f'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id={protein_id}&rettype=gb&retmode=text'
            print(url)
            response = requests.get(url)
            print('GET STATUS CODE: ', response.status_code)
            # Save each record in GenBank format as <ID>.txt
            with open(f'{protein_id}.txt', 'w', encoding='utf-8') as output_file:
                output_file.write(response.text)

# 4. Extract PDB IDs from the GenBank file
import re

gb_file = '124617.txt'  # GenBank file saved in the previous step
pdb_id_pattern = r'PDB:(\w+)'  # captures the identifier that follows "PDB:"
pdb_id = []

with open(gb_file, 'r', encoding='utf-8') as file:
    for line in file:
        # findall returns every match, in case one line lists several PDB references
        for found_id in re.findall(pdb_id_pattern, line):
            print('pdb_id: ', found_id)
            pdb_id.append(found_id)
print('complete.', len(pdb_id))

# 5. Visualize the 3D structure
# !pip install py3Dmol  # needed only on the first run

import py3Dmol

# Visualize the 3D structure for the first PDB ID
print('pdb_id:', pdb_id[0])
view = py3Dmol.view(query=f'pdb:{pdb_id[0]}')
view.setStyle({"model":-1}, {"cartoon": {"color":"spectrum"}})
view
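
Indexing `pdb_id[0]` raises an IndexError when the GenBank record carries no PDB cross-reference. A small guard such as the one below can make that case explicit. It assumes the classic four-character PDB ID form (a digit followed by three letters or digits); `first_valid_pdb_id` is a hypothetical helper name, not part of py3Dmol.

```python
import re

# Assumption: classic PDB IDs are one digit followed by three alphanumerics, e.g. "1A7F"
PDB_ID_RE = re.compile(r"[0-9][A-Za-z0-9]{3}")


def first_valid_pdb_id(candidates):
    """Return the first candidate that looks like a PDB ID, or None if there is none."""
    for candidate in candidates:
        if PDB_ID_RE.fullmatch(candidate):
            return candidate
    return None


# Offline demonstration with hand-written candidates
chosen = first_valid_pdb_id(["", "sum", "1A7F"])
if chosen is None:
    print("No PDB ID found; nothing to visualize.")
else:
    print("Would visualize:", chosen)
```

In the notebook, the returned value could take the place of `pdb_id[0]` in the `py3Dmol.view(query=...)` call once it is confirmed not to be `None`.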

3. Visualization result image
