Protein Graph Analytics#
Graphein provides built in utilities for computing various analytics relating graph structure and physicochemical properties. These include functions for getting summaries and metrics out as well as interactive plotting functions.
[1]:
# Install graphein if necessary:
# !pip install graphein
# Install DSSP if necessary:
# !sudo apt-get install dssp (better for colab) OR !conda install -c salilab dssp
Constructing a graph for analysis#
First, we need a graph to work with. Here we construct a graph of 3EIY, a pyrophosphatase. We use the following edges: * Hydrophobic Interactions * Aromatic Interactions * Disulfide Interactions * Peptide Bonds
Using DSSP, we add relative (RSA) and absolute solvent accessibility (ASA) to the nodes.
[2]:
from graphein.protein.config import ProteinGraphConfig, DSSPConfig
from graphein.protein.graphs import construct_graph
from graphein.protein.edges.distance import (
add_aromatic_interactions,
add_disulfide_interactions,
add_hydrophobic_interactions,
add_peptide_bonds,
)
from graphein.protein.features.nodes import asa, rsa
config = ProteinGraphConfig(
edge_construction_functions=[ # List of functions to call to construct edges.
add_hydrophobic_interactions,
add_aromatic_interactions,
add_disulfide_interactions,
add_peptide_bonds,
],
graph_metadata_functions=[asa, rsa], # Add ASA and RSA features.
dssp_config=DSSPConfig(), # Add DSSP config in order to compute ASA and RSA.
)
g = construct_graph(pdb_code="3eiy", config=config)
DEBUG:graphein.protein.graphs:Deprotonating protein. This removes H atoms from the pdb_df dataframe
DEBUG:graphein.protein.graphs:Detected 174 total nodes
DEBUG:graphein.protein.features.nodes.amino_acid:Reading meiler embeddings from: /Users/arianjamasb/github/graphein/graphein/protein/features/nodes/meiler_embeddings.csv
INFO:graphein.protein.edges.distance:Found 413 hydrophobic interactions.
INFO:graphein.protein.edges.distance:Found: 16 aromatic-aromatic interactions
DEBUG:graphein.protein.edges.distance:1 CYS residues found. Cannot add disulfide interactions with fewer than two CYS residues.
174
Downloading PDB structure '3eiy'...
INFO:graphein.protein.utils:Downloaded PDB file for: 3eiy
[3]:
from graphein.protein.visualisation import plotly_protein_structure_graph
plotly_protein_structure_graph(g, node_size_multiplier=1)
Residue Composition#
We can plot the residue composition of the graph with plot_residue_composition
. This can be as a bar (plot_type="bar"
) or pie chart (plot_type="pie"
)
[4]:
from graphein.protein.analysis import plot_residue_composition
fig = plot_residue_composition(g, sort_by="count", plot_type="pie") # Can also sort by "alphabetical"
fig.show()
[5]:
fig = plot_residue_composition(g, sort_by="count", plot_type="bar")
fig.show()
Graph Summary Statistics#
We can compute graph-theoretic summary statistics for a given protein graph based on various centrality measures.
[6]:
from graphein.protein.analysis import graph_summary
graph_summary(g)
[6]:
degree | betweenness_centrality | closeness_centrality | eigenvector_centrality | communicability_betweenness_centrality | residue_type | position | chain | |
---|---|---|---|---|---|---|---|---|
node | ||||||||
A:SER:2 | 1.0 | 0.000000 | 0.178719 | 0.005605 | 0.000954 | SER | 2 | A |
A:PHE:3 | 6.0 | 0.053228 | 0.217337 | 0.036304 | 0.070745 | PHE | 3 | A |
A:SER:4 | 2.0 | 0.002035 | 0.179461 | 0.006686 | 0.003441 | SER | 4 | A |
A:ASN:5 | 2.0 | 0.000750 | 0.174220 | 0.007004 | 0.002090 | ASN | 5 | A |
A:VAL:6 | 5.0 | 0.017827 | 0.204734 | 0.038680 | 0.026333 | VAL | 6 | A |
... | ... | ... | ... | ... | ... | ... | ... | ... |
A:ALA:171 | 2.0 | 0.034513 | 0.170108 | 0.008594 | 0.037133 | ALA | 171 | A |
A:ASN:172 | 2.0 | 0.024214 | 0.147863 | 0.001368 | 0.026655 | ASN | 172 | A |
A:PHE:173 | 3.0 | 0.025774 | 0.135156 | 0.000269 | 0.026477 | PHE | 173 | A |
A:LYS:174 | 2.0 | 0.011561 | 0.119310 | 0.000043 | 0.011769 | LYS | 174 | A |
A:LYS:175 | 1.0 | 0.000000 | 0.106658 | 0.000007 | 0.000126 | LYS | 175 | A |
174 rows × 8 columns
[7]:
graph_summary(g, plot=True)
Degree Distribution#
We can plot the distribution of node degrees over the protein graph
[8]:
from graphein.protein.analysis import plot_degree_distribution
fig = plot_degree_distribution(g)
fig.show()
And the total degree broken down by residue type.
[9]:
from graphein.protein.analysis import plot_degree_by_residue_type
fig = plot_degree_by_residue_type(g, normalise_by_residue_occurrence=False)
fig.show()
Or the total degree normalised by the occurence of each amino acid type
[10]:
fig = plot_degree_by_residue_type(g, normalise_by_residue_occurrence=True)
fig.show()
Edge Type Distribution#
We can plot the distribution of edge types in the graph with plot_edge_type_distribution()
[11]:
from graphein.protein.analysis import plot_edge_type_distribution
fig = plot_edge_type_distribution(g, plot_type="bar")
fig.show()
[12]:
from graphein.protein.analysis import plot_edge_type_distribution
fig = plot_edge_type_distribution(g, plot_type="pie")
fig.show()
Scatter Matrix#
We can use scatter matrices to correlate node-level graph metrics (e.g. centrality measures) with node features.
First, we will plot the scatter matrix for a selection of centrality measures and the relative solvent accessibility (RSA) and absolute solvent accessibility (ASA). N.B. we can include these features in the plot as we have added them as metadata in the initial graph construction. Refer to cell 2 at the top of this notebook for clarity.
A summary of the function parameters:
plot_graph_metric_property_correlation(
g: nx.Graph, # Graph to plot
summary_statistics: List[str] = [ # Graph theoretic metrics to include
"degree",
"betweenness_centrality",
"closeness_centrality",
"eigenvector_centrality",
"communicability_betweenness_centrality",
],
properties: List[str] = ["asa", "rsa"], # Node features to include
colour_by: Optional[str] = "residue_type", # How to colour the points
opacity: float = 0.2, # Opacity of markers
diagonal_visible: bool = True, # Whether or not to show the leading diagonal of the plot
title: Optional[str] = None, # Plot title
height: int = 1000, # Plot height
width: int = 1000, # Plot width
font_size: int = 10, # Font size for axes, title and ticks
)
[13]:
from graphein.protein.analysis import plot_graph_metric_property_correlation
plot_graph_metric_property_correlation(g, diagonal_visible=False)
[14]:
plot_graph_metric_property_correlation(g, diagonal_visible=False, colour_by=None)
However, we can include more features. For instance, we can correlate these metrics with the properties defined by the ExPaSy amino acid scales. We’ll plot a subset for the sake of space.
The full list of properties are:
['b_factor', 'pka_cooh_alpha', 'pka_nh3', 'pka_rgroup', 'isoelectric_points', 'molecularweight', 'numbercodons', 'bulkiness', 'polarityzimmerman', 'polaritygrantham', 'refractivity', 'recognitionfactors', 'hphob_eisenberg', 'hphob_sweet', 'hphob_woods', 'hphob_doolittle', 'hphob_manavalan', 'hphob_leo', 'hphob_black', 'hphob_breese', 'hphob_fauchere', 'hphob_guy', 'hphob_janin', 'hphob_miyazawa', 'hphob_argos', 'hphob_roseman', 'hphob_tanford', 'hphob_wolfenden', 'hphob_welling', 'hphob_wilson', 'hphob_parker', 'hphob_ph3_4', 'hphob_ph7_5', 'hphob_mobility', 'hplchfba', 'hplctfa', 'transmembranetendency', 'hplc2_1', 'hplc7_4', 'buriedresidues', 'accessibleresidues', 'hphob_chothia', 'hphob_rose', 'ratioside', 'averageburied', 'averageflexibility', 'alpha_helixfasman', 'beta_sheetfasman', 'beta_turnfasman', 'alpha_helixroux', 'beta_sheetroux', 'beta_turnroux', 'coilroux', 'alpha_helixlevitt', 'beta_sheetlevitt', 'beta_turnlevitt', 'totalbeta_strand', 'antiparallelbeta_strand', 'parallelbeta_strand', 'a_a_composition', 'a_a_swiss_prot', 'relativemutability']
These can be accessed programmatically with:
for _, d in g.nodes(data=True):
print(d.keys())
break
NB These are just examples, any numerical node feature can be provided - including user-defined features.
[15]:
from functools import partial
from graphein.protein.features.nodes import expasy_protein_scale
# Construct the graph with the expasy features.
config = ProteinGraphConfig(
edge_construction_functions=[
add_hydrophobic_interactions,
add_aromatic_interactions,
add_disulfide_interactions,
add_peptide_bonds,
],
node_metadata_functions=[partial(expasy_protein_scale, add_separate=True)], # Add expasy scale (add partial it so each feature is added under a separate key)
)
g = construct_graph(
pdb_code="3eiy",
config=config
)
# Plot
plot_graph_metric_property_correlation(
g,
diagonal_visible=False,
colour_by="residue_type",
properties=[
"pka_rgroup",
"isoelectric_points",
"bulkiness",
"transmembranetendency",
"coilroux",
"relativemutability"
]
)
DEBUG:graphein.protein.graphs:Deprotonating protein. This removes H atoms from the pdb_df dataframe
DEBUG:graphein.protein.graphs:Detected 174 total nodes
DEBUG:graphein.protein.features.nodes.amino_acid:Reading Expasy protein scales from: /Users/arianjamasb/github/graphein/graphein/protein/features/nodes/amino_acid_properties.csv
INFO:graphein.protein.edges.distance:Found 413 hydrophobic interactions.
INFO:graphein.protein.edges.distance:Found: 16 aromatic-aromatic interactions
DEBUG:graphein.protein.edges.distance:1 CYS residues found. Cannot add disulfide interactions with fewer than two CYS residues.
174