General
Symmetry of proteins, an important source of their elegant structures and unique functions, is not as perfect as it may seem. This program prepares protein files given in PDB format from either X-RAY or NMR measurements for a continuous symmetry measure (CSM) calculation. See http://wwpdb.org for more details on the PDB file format and the protein data bank. See https://csm.ouproj.org.il/ for more information on the CSM methodology.
Many PDB files contain inconsistent errors in the forms of missing residues or missing atoms. In addition, they may report on low resolution or low level Rfree values as well as extra data not required for the symmetry evaluation (e.g., ligands). Several steps are involved in the preparation procedure:
-
The files are split into three categories according to their resolution and R_free grade as defined by FirstGlance in Jmol (https://bioinformatics.org/firstglance/fgij/notes.htm#grading):
a. Reliable – PDB files with a resolution of up to 2.0 and an R_free grade of C (Average at this resolution). The user can change the thresholds.
b. Reliable_r_grade – PDB files with a resolution of up to 2.0 and no R_free data
c. Others – PDB files with bad resolution or R_free grade below the threshold.
Reliable files are further processed according to the following stages:
- Removing non-coordinates lines from the atom section.
- Removing ligands and solvent lines at the end of peptides. HETATOM lines in the middle of a peptide are retained.
- Cleaning gaps in the sequence according to REMARK 470 (missing residues) and REMARK 465 (missing atoms):
a. If a backbone atom is missing - the whole amino acid is deleted.
b. If a side chain atom is missing – the side chain is removed.
c. For homomers – gap on one peptide causes the removal of the related atoms from all other peptides. - Retaining the first location in cases of alternate locations.
- Removing hydrogen atoms (optional).
- Ignoring PDB files for which the asymmetric unit does not represent a biological structure (e.g., when the matrix in REMARK 350 is different from the identity matrix).
- For homomers, checking that all peptides are of the same length.
Usage:
Help:
$ pdb_prep --help
Usage: pdb_prep [OPTIONS] COMMAND [ARGS]...
pdb preprations need help? try : pdb_prep COMMAND --help
Options:
--help Show this message and exit.
Commands:
nmr This procedure prepares protein files in...
xray This procedure prepares protein files in...
NMR help:
$ pdb_prep nmr --help
Usage: pdb_prep.py nmr [OPTIONS]
This procedure prepares protein files in pdb format from NMR measurements for
a CSM calculation according to the following stage:
1. Removing non-coordinates lines from the atom section.
2. Removing ligands and solvent lines at the end of peptides.
HETATOM lines in the middle of a peptide are retained.
3. Cleaning gaps in the sequence according to REMARK 470 (missing residues)
and REMARK 465 (missing atoms):
a. If a backbone atom is missing - the whole amino acid is deleted.
b. If a side chain atom is missing – the side chain is removed.
c. For homomers – gap on one peptide causes the removal of the related
atoms from all other peptides.
4. Retaining the first location in cases of alternate location.
5. Removing hydrogen atoms (optional).
6. Ignoring pdb files for which the asymmetric unit does not represent a
biological structure (e.g., non unit matrix in REMARK 350).
7. For homomers, checking that all peptides are of the same length.
Options:
--pdb-dir TEXT The input pdb directory containing PDB files
[default: .]
--pdb-file TEXT Input pdb file (use this or the --pdb-dir
option!)
--with-hydrogens / --no-hydrogens
Leave hydrogen atoms and hetatms from the
files - default --no-hydrogens
--ptype [homomer|heteromer|monomer]
Protein stoichiometry (defualt: homomer)
--parse-rem350 / --ignore-rem350
Parse or ignore remark 350 - default
--parse-rem350
--bio-molecule-chains INTEGER Number of peptides in remark 350
--output-dir TEXT Output dir [default: output.{time}]
--output-text / --output-json Output report in text or json - default
--output-text
--verbose Verbose mode [default: False]
--help Show this message and exit.
X-ray help
$ pdb_prep xray --help
Usage: pdb_prep.py xray [OPTIONS]
This procedure prepares protein files in pdb format from X-RAY measurements for a
CSM calculation according.
At first, the files are split into three categories according to their resolution
and R_free grade:
a. Reliable – PDB files with a resolution of up to 2.0 and an R_free grade of C
(Average at this resolution). Thresholds can be changed.
b. Reliable_r_grade – PDB files with a resolution of up to 2.0 and no R_free data
c. Others – PDB files with bad resolution or R_free grade
Reliable files are further processed according to the following stages:
1. Removing non-coordinates lines from the atom section.
2. Removing ligands and solvent lines at the end of peptides. HETATOM lines in the
middle of a peptide are retained.
3. Cleaning gaps in the sequence according to REMARK 470 (missing residues) and REMARK
465 (missing atoms):
a. If a backbone atom is missing - the whole amino acid is deleted.
b. If a side chain atom is missing – the side chain is removed.
c. For homomers – gap on one peptide causes the removal of the related atoms from
all other peptides.
4. Retaining the first location in cases of alternate location.
5. Removing hydrogen atoms (optional).
6. Ignoring pdb files for which the asymmetric unit does not represent a biological structure
(e.g., non unit matrix in REMARK 350).
7. For homomers, checking that all peptides are of the same length.
Options:
--pdb-dir TEXT Input pdb directory containing PDB files
[default: .]
--pdb-file TEXT Input pdb file (use this or the --pdb-dir
option!)
--max-resolution FLOAT Maximum allowed resolution [default: 2.0]
--limit-r-free-grade [A|B|C|D|E]
Limit for R_free_grade:
A - MUCH BETTER THAN
AVERAGE at this resolution
B - BETTER THAN
AVERAGE at this resolution
C - AVERAGE at
this resolution
D - WORSE THAN AVERAGE at
this resolution
E - UNRELIABLE [default: C]
--with-hydrogens / --no-hydrogens
Leave hydrogen atoms and hetatms from the
files - default --no-hydrogens
--ptype [homomer|heteromer|monomer]
Protein stoichiometry (defualt: homomer)
--parse-rem350 / --ignore-rem350
Parse or ignore remark 350 - default
--parse-rem350
--bio-molecule-chains INTEGER Number of peptides in remark 350
--output-dir TEXT Output dir [default: output.{time}]
--output-text / --output-json Output report in text or json - default
--output-text
--verbose Verbose mode [default: False]
--help Show this message and exit.
How to install?
Installation instructions can be read in this document
Contributing
If you find a bug or have an idea for a program you’d like in this package, feel free to open an issue. Even better: feel free to make a pull request!
Known Issues
The code fails to process PDB files for which the residue sequence numbers of the different peptides is inconsistent.