Paper Summary: KVFinder Cavity Detection

KVFinder Cavity Detection

KVFinder-web: a web-based application for detecting and characterizing biomolecular cavities. Guerra JVS, Ribeiro-Filho HV, Pereira JGC, Lopes-de-Oliveira PS. Nucleic Acids Res. 2023 May 4:gkad324. [web server]

parKVFinder: A thread-level parallel approach in biomolecular cavity detection. Guerra JVDS et al., SoftwareX. 2020 Jul-Dec;100606.
pyKVFinder: an efficient and integrable Python package for biomolecular cavity detection and characterization in data science. Guerra JVDS et al., BMC Bioinformatics. 2021 Dec 20;22(1):607.

KVFinder page at Brazilian Biosciences National Lab (links to code and docs)


Overview

RESTful web service receives the job request in JSON via HTTP POST, runs parKVFinder; the client sends HTTP GET with job ID to get status/results
web server module developed in Rust using the Actix framework
queue module uses Ocypod
each module is packaged in a Docker container
web portal developed using R Shiny
also provided: HTTP client with example Python 3 script, PyMOL2 plugin that allows (among other things) interactively drawing a rectangular box to define the search space
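The POST-then-GET exchange described above can be sketched with the Python standard library. This is only an illustration of the request pattern: the base URL, the endpoint paths, the payload keys, and the probe values are all hypothetical placeholders, not the documented KVFinder-web API. The requests are built but never sent.

```python
import json
import urllib.request

# Hypothetical values: the base URL, endpoint paths, and payload keys below
# are placeholders, not the documented KVFinder-web API.
BASE_URL = "http://localhost:8081"


def build_submit_request(pdb_text, probe_in=1.4, probe_out=4.0):
    """Build (but do not send) the HTTP POST that submits a job as JSON."""
    payload = {
        "pdb": pdb_text,
        "settings": {"probe_in": probe_in, "probe_out": probe_out},
    }
    return urllib.request.Request(
        f"{BASE_URL}/create",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_status_request(job_id):
    """Build (but do not send) the HTTP GET that polls a job by its ID."""
    return urllib.request.Request(f"{BASE_URL}/{job_id}", method="GET")


submit = build_submit_request("HEADER placeholder structure\n")
poll = build_status_request("example-job-id")
print(submit.get_method(), submit.full_url)
print(poll.get_method(), poll.full_url)
```

Passing either object to `urllib.request.urlopen` would actually send it; the real endpoint names and JSON schema should be taken from the service docs or the example client script.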

Web Service Input

PDB file or entry, click Load to detect nonstandard residues
default or customized parameters
- inner (smaller) "rolling sphere" probe defines molecular surface
- outer (larger) probe defines the exterior
- a pocket is the region the smaller probe can reach but the larger cannot, trimmed back further from the exterior by the removal distance
- volume cutoff is minimum size (smaller pockets ignored)
around target molecule – within specified distance of ligand
around target residues – limit calculation to a box around specified residues + padding distance
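The two-probe definition above can be illustrated with a toy 2D grid. This is a simplified sketch, not the KVFinder implementation: it works in 2D with unit grid spacing, ignores the removal distance and volume cutoff, and the function name, atom layout, radii, and probe sizes are all made up for illustration. A grid point counts as pocket if the inner probe fits there and no allowed placement of the outer probe covers it.

```python
import math


def cavity_points(atoms, probe_in, probe_out, lo, hi):
    """Toy 2D cavity detection on an integer grid (unit spacing).

    atoms: list of (x, y, radius). Returns the set of grid points where the
    inner probe fits but which no allowed outer-probe placement covers.
    """
    grid = [(x, y) for x in range(lo, hi + 1) for y in range(lo, hi + 1)]

    def clears(point, probe):
        # A probe centered here does not overlap any atom.
        return all(math.dist(point, (ax, ay)) >= r + probe for ax, ay, r in atoms)

    # Every grid position where the large probe can be centered.
    outer_centers = [p for p in grid if clears(p, probe_out)]

    def swept_by_outer(point):
        # The point lies inside the large probe at some allowed position.
        return any(math.dist(point, c) <= probe_out for c in outer_centers)

    return {p for p in grid if clears(p, probe_in) and not swept_by_outer(p)}


# Hypothetical U-shaped "protein": two walls and a floor, open at the top.
wall = [(0, y) for y in (0, 2, 4, 6)] + [(6, y) for y in (0, 2, 4, 6)]
floor = [(x, 0) for x in (2, 4)]
atoms = [(x, y, 1.5) for x, y in wall + floor]

pocket = cavity_points(atoms, probe_in=1.0, probe_out=4.0, lo=-8, hi=16)
print((3, 3) in pocket, (3, 12) in pocket)
```

The grid has to extend well past the atoms so that outer-probe positions outside the molecule are represented; otherwise exterior points near the grid edge would be mislabeled as pocket.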

Web Service Output

the viewer uses the NGL engine for R
"Download Results" gives a TOML file containing run parameters, pocket statistics, and lists of the pocket-defining residues
"Download Structures" gives a zipped folder containing the input PDB and a pocket-dots PDB

Web Service Performance

time increases approximately linearly with the number of atoms
larger "Probe Out" increases calculation time
no clear relationship with "Removal Distance" since its effect on time reflects number/shape of cavities rather than the number of atoms
using smaller grid spacing would give higher-quality results but take longer to run (presumably; time vs. grid spacing was not presented)... I'm guessing the benchmarking results are for grid spacing of 0.6 Å
their website doesn't expose the grid spacing, and the TOML file only says resolution_mode = 'Low', but I measured the distance between pocket dots in ChimeraX as 0.6 Å...

KVFinder Results in ChimeraX (PDB 2gbp)

pockets are output as a PDB file of H and HA "atoms" representing grid points within the pockets, with a different residue name for each pocket
B-factor column: depth, the shortest distance from the point to the pocket-bulk boundary
occupancy column: Eisenberg & Weiss amino acid hydropathy
unfortunately all the residue names have the same residue number
I manually added MODEL/ENDMDL records to the PDB file to make each residue (pocket) a separate model so that it would be easier to work with in ChimeraX (rainbow structures, show, hide, label attr name, etc.)
[ChimeraX session]
- size H atomrad 0.2; surf close; surf #2
- color bfactor #2
- color byattr occupancy #2 palette lipophilicity
- show #1 & :kae :<4
- color byattr occupancy #2 palette ^lipophilicity
- mlp #1 surf #2
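The manual MODEL/ENDMDL edit mentioned above could be scripted. A minimal sketch, assuming the pocket file uses standard fixed-column ATOM/HETATM records with the pocket name in the residue-name field (columns 18-20); the function name and the sample records in the usage lines are made up for illustration.

```python
def split_pockets_into_models(pdb_lines):
    """Wrap each pocket (grouped by residue name) in its own MODEL/ENDMDL pair."""
    pockets = {}
    for line in pdb_lines:
        if line.startswith(("ATOM", "HETATM")):
            resname = line[17:20]  # PDB columns 18-20
            pockets.setdefault(resname, []).append(line.rstrip("\n"))

    out = []
    for model_number, atoms in enumerate(pockets.values(), start=1):
        out.append(f"MODEL     {model_number:4d}")  # serial in columns 11-14
        out.extend(atoms)
        out.append("ENDMDL")
    out.append("END")
    return out


# Made-up records: two grid points in pocket KAA, one in pocket KAB.
sample = [
    "ATOM  %5d  H   %3s   259" % (serial, name)
    for serial, name in [(1, "KAA"), (2, "KAA"), (3, "KAB")]
]
models = split_pockets_into_models(sample)
print("\n".join(models))
```

Pockets keep their order of first appearance in the file, and any non-atom records (headers, remarks) are dropped in this sketch.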

Area/Volume Are Relative!

Quantitative comparisons are only valid within-method because different methods define the boundaries differently.

KVFinder results, ChimeraX display, ChimeraX surface area/volume
- SES with H default radius: KAB area 163 vol 161, KAE area 211 vol 266
- SES with radius 0.4 Å: KAB area 106 vol 77, KAE area 150 vol 155
- SES with radius 0.3 Å: KAB area 97 vol 67, KAE area 140 vol 140
- SES with radius 0.2 Å: KAB area 89 vol 56, KAE area 167 vol 106

CASTp uses α-shapes (Voronoi tessellation); Chimera does not show the CASTp boundaries, just the SES of CASTp pocket atoms

Another Example

PDB 121p (H-Ras) KVFinder cavities within 8 Å of ligand GCP
[ChimeraX session]

Seems OK on Nucleic Acids

finds the minor-groove pocket in PDB 6bna similar to the location of netropsin, but only the "floor" (probably need to ↓ removal distance and/or ↑ outer probe size to cover such a shallow pocket) [ChimeraX session]
their viewer prematurely truncates the ribbon (1bna doesn't have this problem)

KVFinder and CASTp Find the Same Pockets Qualitatively

ChimeraX with the 8 pockets found with KVFinder defaults on 2gbp

Chimera with 8 biggest CASTp pockets shown at once

Question for the Team: ChimeraX KVFinder Tool?

Code is open source and as far as I can tell available in useful form(s), i.e., maybe we could even host our own copy of the web service. Not that many user input parameters, and the TOML output should be easy to parse and show as columns in a dialog.
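To give a feel for how little parsing the TOML output would need, here is a rough sketch that turns a results file into table rows (pocket, volume, area, number of residues). The section names `RESULTS.VOLUME`, `RESULTS.AREA`, and `RESULTS.RESIDUES` are my assumption about the file layout and should be checked against a real download; the numbers and residues in the sample text are placeholders, not real output.

```python
try:
    import tomllib  # standard library in Python 3.11+
except ModuleNotFoundError:
    import toml as tomllib  # third-party fallback; loads() is called the same way

# Placeholder text; section names are assumed and values are not real results.
SAMPLE_TOML = """
[RESULTS.VOLUME]
KAA = 100.0
KAB = 50.0

[RESULTS.AREA]
KAA = 80.0
KAB = 40.0

[RESULTS.RESIDUES]
KAA = [["14", "E", "SER"], ["15", "E", "VAL"]]
KAB = [["49", "E", "LEU"]]
"""


def pocket_rows(toml_text):
    """Return (pocket, volume, area, n_residues) rows, largest volume first."""
    results = tomllib.loads(toml_text)["RESULTS"]
    rows = [
        (
            name,
            volume,
            results["AREA"].get(name),
            len(results["RESIDUES"].get(name, [])),
        )
        for name, volume in results["VOLUME"].items()
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


rows = pocket_rows(SAMPLE_TOML)
for row in rows:
    print(row)
```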

Although we have a ChimeraX recipe for reading CASTp output files, it only does the pocket display and does not show the dialog with the numerous pocket statistics like we have in Chimera. Also, I got the impression from our team that the CASTp database and web service weren't that amenable to interfacing with ChimeraX directly.

parKVFinder: A thread-level parallel approach in biomolecular cavity detection. Guerra JVDS et al., SoftwareX. 2020 Jul-Dec;100606.
pyKVFinder: an efficient and integrable Python package for biomolecular cavity detection and characterization in data science. Guerra JVDS et al., BMC Bioinformatics. 2021 Dec 20;22(1):607.

open source GPL v3.0 with access via github and PyPI
"highly optimized, customizable and user-friendly"
"accurate and robust performance with any type of protein cavity"

parKVFinder Architecture

written in C
input via PyMOL GUI or command-line interface
multithreaded parallelization implemented with OpenMP
uses customizable dictionary of atomic VDW radii
grid spacing can be 0.25, 0.5, or 0.6 Å (web service uses the last of these)
[Github Linux/Mac] [Github Windows] [docs]

parKVFinder vs. Other Methods Across MD Trajectory

HIV protease 200-ns MD trajectory
in this comparison, parKVFinder and GHECOM were the most accurate/successful at describing the conformational dynamics of the active site

parKVFinder Performance

Runtimes of the various programs on the 2001 frames from the HIV protease trajectory
- 6-core 3.4 GHz AMD Ryzen 5 2600 processor running Ubuntu 19.04
- parKVFinder ProbeOut 12 Å, search space <5Å from inhibitor in original structure, PDB 1hvr
- grid spacing?? supp. data only says "All parameters that were not cited were kept at their default values"... if I had to guess, 0.6 Å.
- Fpocket is a Voronoi tessellation method; faster than parKVFinder in this test case, but with worse results (e.g., see previous slide)
parKVFinder runtimes vs. protein size (dataset of 1000 unique domains) and # of threads

pyKVFinder Architecture

python-C parallel KVFinder (pyKVFinder) applies SWIG to extend grid operations written in C to Python
stores results in NumPy arrays for better interoperability with other programs
dependencies: python ≥ 3.7 (tested on Linux and macOS), swig ≥ 4.0.1, toml ≥ 0.10.2, numpy ≥ 1.20.3, matplotlib ≥ 3.3.3
has the features of parKVFinder and more: assigns depth and AA hydrophobicity (choice of 6 scales), runs faster, etc.
GPL v3.0, "no restrictions to use by non-academics" [Github] [docs] [PyPI]

pyKVFinder Test Case

PDB 6wen: ADP-ribose phosphatase (ADRP) domain of SARS-CoV-2 NSP3 in the unbound form

in b, "conservation" is not sequence conservation but the degree to which the corresponding void area is present in a series of related structures listed in c
c shows the distribution of the hydropathies of the pocket dots for each of these related structures, as well as a hierarchical clustering of the pockets based on the frequencies of their surrounding residues, done with SciPy and shown with matplotlib

pyKVFinder Benchmarking

Calculations on 600 frames from a 600-ns MD trajectory of the unbound NSP3 ADRP domain:

a shows results for the frame with the smallest RMSD from the original structure with ADPR bound, 6w02 chain B
b gives total pocket volume (there could be >1 pocket) across the trajectory
the volume calculated by pyKVFinder (346.8 ± 78.7 Å³) and parKVFinder (346.5 ± 79.3 Å³) is close to the molecular-surface volume of ADPR (351.1 Å³) in 6w02 chain B from YASARA (ChimeraX surf + measure volume 426.9 Å³, Chimera 409.4 Å³)
pyKVFinder "Standard workflow" calculates pocket area and volume, "Full workflow" also calculates depth and hydropathy
par/pyKVFinder with grid spacing 0.8 Å, outer probe 6 Å, volume cutoff 25 Å³, all pockets ≥50% in the box enclosing ADPR in 6w02 chain B
desktop computer with 6-core 3.4 GHz AMD Ryzen 5 2600 processor, 32GB RAM, Ubuntu 20.04.2, max # threads available
parKVFinder was faster than fpocket in this benchmark (fpocket was faster in the parKVFinder paper's benchmark)
CASTp on 6w02 chain B gives biggest pocket MS vol 844, area 532, SAS volume 238, area 338 [Chimera session] ...2020 structure not in db, must upload to server, wait for results, then download files "for Pymol"
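The benchmark settings listed above might translate into a pyKVFinder call roughly as follows. This is a sketch, not the authors' script: the keyword names (`step`, `probe_out`, `volume_cutoff`, `include_depth`, `include_hydropathy`) are what I recall from the pyKVFinder docs for `run_workflow` and should be verified, the helper names are mine, and the "≥50% in the box" pocket filter is not reproduced.

```python
# Settings quoted in the benchmark notes above (units: Å and Å^3).
BENCHMARK_SETTINGS = {
    "step": 0.8,            # grid spacing
    "probe_out": 6.0,       # outer probe size
    "volume_cutoff": 25.0,  # minimum pocket volume
}


def workflow_kwargs(full=False):
    """Keyword arguments for the "Standard" (default) or "Full" workflow."""
    kwargs = dict(BENCHMARK_SETTINGS)
    # The full workflow additionally computes depth and hydropathy.
    kwargs["include_depth"] = full
    kwargs["include_hydropathy"] = full
    return kwargs


def run_frame(pdb_path, full=False):
    """Run pyKVFinder on one trajectory frame (requires pyKVFinder installed)."""
    import pyKVFinder  # imported lazily so the sketch loads without the package

    return pyKVFinder.run_workflow(pdb_path, **workflow_kwargs(full))


print(workflow_kwargs(full=True))
```

Calling `run_frame` on each frame's PDB file and collecting the reported volumes would be one way to approximate the per-frame totals plotted in b.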

pyKVFinder vs. parKVFinder

both are parallelized, but pyKVFinder is (even) faster due to "additional possibility to parallelize routines"
2021 paper says that "pyKVFinder will undergo continuous improvements and updates" and its github shows very recent activity (2024); parKVFinder also shows fairly recent github activity but less so (a year ago)
"experienced users requiring scripting routines are encouraged to use pyKVFinder due to its improved performance, while newcomers should prioritize parKVFinder due to its simplicity of installation and execution"