Source code for the paper "Protein Structure Accuracy Estimation using Geometry-Complete Perceptron Networks".
- Installation
- GCPNet for protein structure EMA (GCPNet-EMA)
- For developers
- Acknowledgements
- Citations
Install Mamba
wget "https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-$(uname)-$(uname -m).sh" bash Mambaforge-$(uname)-$(uname -m).sh # accept all terms and install to the default location rm Mambaforge-$(uname)-$(uname -m).sh # (optionally) remove installer after using it source ~/.bashrc # alternatively, one can restart their shell session to achieve the same resultInstall dependencies
# clone project git clone https://github.com/BioinfoMachineLearning/GCPNet-EMA cd GCPNet-EMA # create conda environment mamba env create -f environment.yaml # NOTE: use `cpu_environment.yaml` instead for web server deployment conda activate GCPNet-EMA # NOTE: one still needs to use `conda` to (de)activate environments # install local project as package pip3 install -e . # install ProDy separately to avoid a Biopython version conflict with Ankh pip3 install prody==2.4.1 # uninstall protobuf (if installed) due to (potential) local GLIBCXX conflicts pip3 uninstall protobufNote: TM-score is required to score predicted protein structures, where one can install it as follows:
# download and compile TM-score mkdir -p ~/Programs && cd ~/Programs wget https://zhanggroup.org/TM-score/TMscore.cpp g++ -static -O3 -ffast-math -lm -o TMscore TMscore.cpp rm TMscore.cppMake sure to update the tmscore_exec_path value in e.g., configs/paths/default.yaml to reflect where you have placed the TM-score executable on your machine. Also, make sure that lddt_exec_path points to the bin/lddt path within your GCPNet-EMA Conda environment, where lddt is installed automatically as described in environment.yaml.
Download training and evaluation data
cd data/EMA/ wget https://zenodo.org/record/8150859/files/ema_decoy_model.tar.gz wget https://zenodo.org/record/8150859/files/ema_true_model.tar.gz tar -xzf ema_decoy_model.tar.gz tar -xzf ema_true_model.tar.gz cd ../../ # head back to the root project directoryTrain a model for the estimation of protein structure model accuracy (EMA) task
python3 src/train.py experiment=gcpnet_ema.yamlReproduce our results for the EMA task
default_ema_model_ckpt_path="checkpoints/EMA/no_plddt_or_esm_emb_model.ckpt" af2_ema_model_ckpt_path="checkpoints/EMA/no_esm_emb_model.ckpt" python3 src/eval.py data=ema model=gcpnet_ema logger=csv trainer.accelerator=gpu trainer.devices=1 ckpt_path="$default_ema_model_ckpt_path" python3 src/eval.py data=ema model=gcpnet_ema logger=csv trainer.accelerator=gpu trainer.devices=1 ckpt_path="$af2_ema_model_ckpt_path"Default EMA Model - No AlphaFold plDDT or ESM Embeddings as Inputs ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ┃ Test metric ┃ DataLoader 0 ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ │ test/PerModelMAE │ 0.04611478000879288 │ │ test/PerModelMSE │ 0.004228705074638128 │ │ test/PerModelPearsonCorrCoef │ 0.8075723052024841 │ │ test/PerResidueMAE │ 0.07066802680492401 │ │ test/PerResidueMSE │ 0.010494622401893139 │ │ test/PerResiduePearsonCorrCoef │ 0.7123321890830994 │ │ test/loss │ 0.005345446057617664 │ └────────────────────────────────┴────────────────────────────────┘ AlphaFold EMA Model - No ESM Embeddings as Inputs ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ┃ Test metric ┃ DataLoader 0 ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ │ test/PerModelMAE │ 0.042016904801130295 │ │ test/PerModelMSE │ 0.003771992400288582 │ │ test/PerModelPearsonCorrCoef │ 0.8381679654121399 │ │ test/PerResidueMAE │ 0.06481857597827911 │ │ test/PerResidueMSE │ 0.009247069247066975 │ │ test/PerResiduePearsonCorrCoef │ 0.7482331991195679 │ │ test/loss │ 0.004621841479092836 │ └────────────────────────────────┴────────────────────────────────┘Predict per-residue and per-model lDDT scores for 3D protein structures
default_ema_model_ckpt_path="checkpoints/EMA/no_plddt_or_esm_emb_model.ckpt" predict_batch_size=1 # adjust as desired according to available GPU memory num_workers=0 # note: required when initially processing new PDB file inputs, due to ESM's GPU usage python3 src/predict.py model=gcpnet_ema data=ema data.predict_input_dir=$MY_INPUT_PDB_DIR data.predict_true_dir=$MY_OPTIONAL_TRUE_PDB_DIR data.predict_output_dir=$MY_OUTPUTS_DIR data.predict_batch_size=$predict_batch_size data.num_workers=$num_workers logger=csv trainer.accelerator=gpu trainer.devices=1 ckpt_path="$default_ema_model_ckpt_path"For example, one can predict per-residue and per-model lDDT scores for a batch of tertiary protein structure inputs, 6W6VE.pdb and 6W77K.pdb within data/EMA/examples/decoy_model, as follows
python3 src/predict.py model=gcpnet_ema data=ema data.predict_input_dir=data/EMA/examples/decoy_model data.predict_output_dir=data/EMA/examples/outputs data.predict_batch_size=1 data.num_workers=0 data.python_exec_path="$HOME"/mambaforge/envs/gcpnet/bin/python data.lddt_exec_path="$HOME"/mambaforge/envs/gcpnet/bin/lddt data.pdbtools_dir="$HOME"/mambaforge/envs/gcpnet/lib/python3.10/site-packages/pdbtools/ logger=csv trainer.accelerator=gpu trainer.devices=[0] ckpt_path=checkpoints/EMA/model.ckptNote: After running the above command, an output CSV containing metadata for the predictions will be located at logs/predict/runs/YYYY-MM-DD_HH-MM-SS/predict_YYYYMMDD_HHMMSS_rank_0_predictions.csv, with text substitutions for the time at which the above command was completed. This CSV will contain a column called predicted_annotated_pdb_filepath that identifies the temporary location of each input PDB file after annotating it with GCPNet-EMA's predicted lDDT scores for each residue. If a directory containing ground-truth PDB files corresponding one-to-one with the inputs in data.predict_input_dir is provided as data.predict_true_dir, then metrics and PDB annotation filepaths will also be reported in the output CSV to quantitatively and qualitatively describe how well GCPNet-EMA was able to improve upon AlphaFold's initial per-residue plDDT values.
Set up pre-commit (one time only) for automatic code linting and formatting upon each git commit
pre-commit installManually reformat all files in the project, as desired
pre-commit run -aUpdate dependencies in environment.yml
mamba env export > env.yaml # e.g., run this after installing new dependencies locally diff environment.yaml env.yaml # note the differences and copy accepted changes back into `environment.yaml` rm env.yaml # clean up temporary environment fileUse Gunicorn to parallelize responses to web server requests across 4 workers using port 80
gunicorn -w 4 -b gcpnet-ema.missouri.edu:80 --timeout 300 src.wsgi:appTest server locally using curl
curl -X POST -F "title=6KHVA" -F "structure_upload=@data/EMA/test_examples/decoy_model/6KHVA.pdb" -F "results_email=username@email.com" http://gcpnet-ema.missouri.edu:80/server_predictCreate a user cronjob (via crontab -e) that checks every five minutes to make sure the Gunicorn web server is running and, if it is not, starts the server by running the Gunicorn command above
# NOTE: add this to your user cronjobs using `crontab -e` */5 * * * * pgrep -f "gunicorn -w 4 -b gcpnet-ema.missouri.edu:80 --timeout 300 src.wsgi:app" || cd /bml/$USER/Repositories/Lab_Repositories/GCPNet-EMA && ~/mambaforge/condabin/mamba run -n GCPNet-EMA gunicorn -w 4 -b gcpnet-ema.missouri.edu:80 --timeout 300 --chdir /bml/$USER/Repositories/Lab_Repositories/GCPNet-EMA src.wsgi:app >> /bml/$USER/Repositories/Lab_Repositories/GCPNet-EMA/server_crontab_logfile.log 2>&1NOTE: You should substitute the /bml/$USER/Repositories/Lab_Repositories/GCPNet-EMA references above with the absolute path to your personal copy of the repository.
NOTE: Make sure to create in the project's local directory (i.e., ./) a .env file that contains values for four key environment variables for server support: (1) SERVER_EMAIL_ADDRESS (e.g., "bml@missouri.edu"); (2) SERVER_EMAIL_SMTP_SERVER (e.g., "massmail.missouri.edu"); (3) SERVER_EMAIL_PORT (e.g., "587" by default); and (4) SERVER_SECRET_KEY (initially generated by the Python secrets package).
GCPNet-EMA builds upon the source code and data from the following project(s):
We thank all their contributors and maintainers!
If you use the code or data associated with this project, or otherwise find this work useful, please cite:
@article{morehead2023gcpnet_ema, title={Protein Structure Accuracy Estimation using Geometry-Complete Perceptron Networks}, author={Morehead, Alex and Cheng, Jianlin}, year={2023} }