Protein Graph Analytics#

Graphein provides built in utilities for computing various analytics relating graph structure and physicochemical properties. These include functions for getting summaries and metrics out as well as interactive plotting functions.

Open In Colab

[1]:
# Install graphein if necessary:
# !pip install graphein
# Install DSSP if necessary:
# !sudo apt-get install dssp (better for colab) OR !conda install -c salilab dssp

Constructing a graph for analysis#

First, we need a graph to work with. Here we construct a graph of 3EIY, a pyrophosphatase. We use the following edges: * Hydrophobic Interactions * Aromatic Interactions * Disulfide Interactions * Peptide Bonds

Using DSSP, we add relative (RSA) and absolute solvent accessibility (ASA) to the nodes.

[2]:
from graphein.protein.config import ProteinGraphConfig, DSSPConfig
from graphein.protein.graphs import construct_graph
from graphein.protein.edges.distance import (
    add_aromatic_interactions,
    add_disulfide_interactions,
    add_hydrophobic_interactions,
    add_peptide_bonds,
)
from graphein.protein.features.nodes import asa, rsa

config = ProteinGraphConfig(
    edge_construction_functions=[       # List of functions to call to construct edges.
        add_hydrophobic_interactions,
        add_aromatic_interactions,
        add_disulfide_interactions,
        add_peptide_bonds,
    ],
    graph_metadata_functions=[asa, rsa],  # Add ASA and RSA features.
    dssp_config=DSSPConfig(),             # Add DSSP config in order to compute ASA and RSA.
)

g = construct_graph(pdb_code="3eiy", config=config)
DEBUG:graphein.protein.graphs:Deprotonating protein. This removes H atoms from the pdb_df dataframe
DEBUG:graphein.protein.graphs:Detected 174 total nodes
DEBUG:graphein.protein.features.nodes.amino_acid:Reading meiler embeddings from: /Users/arianjamasb/github/graphein/graphein/protein/features/nodes/meiler_embeddings.csv
INFO:graphein.protein.edges.distance:Found 413 hydrophobic interactions.
INFO:graphein.protein.edges.distance:Found: 16 aromatic-aromatic interactions
DEBUG:graphein.protein.edges.distance:1 CYS residues found. Cannot add disulfide interactions with fewer than two CYS residues.
174
Downloading PDB structure '3eiy'...
INFO:graphein.protein.utils:Downloaded PDB file for: 3eiy
[3]:
from graphein.protein.visualisation import plotly_protein_structure_graph

plotly_protein_structure_graph(g, node_size_multiplier=1)

Residue Composition#

We can plot the residue composition of the graph with plot_residue_composition. This can be as a bar (plot_type="bar") or pie chart (plot_type="pie")

[4]:
from graphein.protein.analysis import plot_residue_composition

fig = plot_residue_composition(g, sort_by="count", plot_type="pie") # Can also sort by "alphabetical"
fig.show()
[5]:
fig = plot_residue_composition(g, sort_by="count", plot_type="bar")
fig.show()

Graph Summary Statistics#

We can compute graph-theoretic summary statistics for a given protein graph based on various centrality measures.

[6]:
from graphein.protein.analysis import graph_summary

graph_summary(g)
[6]:
degree betweenness_centrality closeness_centrality eigenvector_centrality communicability_betweenness_centrality residue_type position chain
node
A:SER:2 1.0 0.000000 0.178719 0.005605 0.000954 SER 2 A
A:PHE:3 6.0 0.053228 0.217337 0.036304 0.070745 PHE 3 A
A:SER:4 2.0 0.002035 0.179461 0.006686 0.003441 SER 4 A
A:ASN:5 2.0 0.000750 0.174220 0.007004 0.002090 ASN 5 A
A:VAL:6 5.0 0.017827 0.204734 0.038680 0.026333 VAL 6 A
... ... ... ... ... ... ... ... ...
A:ALA:171 2.0 0.034513 0.170108 0.008594 0.037133 ALA 171 A
A:ASN:172 2.0 0.024214 0.147863 0.001368 0.026655 ASN 172 A
A:PHE:173 3.0 0.025774 0.135156 0.000269 0.026477 PHE 173 A
A:LYS:174 2.0 0.011561 0.119310 0.000043 0.011769 LYS 174 A
A:LYS:175 1.0 0.000000 0.106658 0.000007 0.000126 LYS 175 A

174 rows × 8 columns

[7]:
graph_summary(g, plot=True)

Degree Distribution#

We can plot the distribution of node degrees over the protein graph

[8]:
from graphein.protein.analysis import plot_degree_distribution

fig = plot_degree_distribution(g)
fig.show()

And the total degree broken down by residue type.

[9]:
from graphein.protein.analysis import plot_degree_by_residue_type

fig = plot_degree_by_residue_type(g, normalise_by_residue_occurrence=False)
fig.show()

Or the total degree normalised by the occurence of each amino acid type

[10]:
fig = plot_degree_by_residue_type(g, normalise_by_residue_occurrence=True)
fig.show()

Edge Type Distribution#

We can plot the distribution of edge types in the graph with plot_edge_type_distribution()

[11]:
from graphein.protein.analysis import plot_edge_type_distribution

fig = plot_edge_type_distribution(g, plot_type="bar")
fig.show()
[12]:
from graphein.protein.analysis import plot_edge_type_distribution

fig = plot_edge_type_distribution(g, plot_type="pie")
fig.show()

Scatter Matrix#

We can use scatter matrices to correlate node-level graph metrics (e.g. centrality measures) with node features.

First, we will plot the scatter matrix for a selection of centrality measures and the relative solvent accessibility (RSA) and absolute solvent accessibility (ASA). N.B. we can include these features in the plot as we have added them as metadata in the initial graph construction. Refer to cell 2 at the top of this notebook for clarity.

A summary of the function parameters:

plot_graph_metric_property_correlation(
    g: nx.Graph,                                # Graph to plot
    summary_statistics: List[str] = [           # Graph theoretic metrics to include
        "degree",
        "betweenness_centrality",
        "closeness_centrality",
        "eigenvector_centrality",
        "communicability_betweenness_centrality",
    ],
    properties: List[str] = ["asa", "rsa"],     # Node features to include
    colour_by: Optional[str] = "residue_type",  # How to colour the points
    opacity: float = 0.2,                       # Opacity of markers
    diagonal_visible: bool = True,              # Whether or not to show the leading diagonal of the plot
    title: Optional[str] = None,                # Plot title
    height: int = 1000,                         # Plot height
    width: int = 1000,                          # Plot width
    font_size: int = 10,                        # Font size for axes, title and ticks
)
[13]:
from graphein.protein.analysis import plot_graph_metric_property_correlation

plot_graph_metric_property_correlation(g, diagonal_visible=False)
[14]:
plot_graph_metric_property_correlation(g, diagonal_visible=False, colour_by=None)

However, we can include more features. For instance, we can correlate these metrics with the properties defined by the ExPaSy amino acid scales. We’ll plot a subset for the sake of space.

The full list of properties are:

['b_factor', 'pka_cooh_alpha', 'pka_nh3', 'pka_rgroup', 'isoelectric_points', 'molecularweight', 'numbercodons', 'bulkiness', 'polarityzimmerman', 'polaritygrantham', 'refractivity', 'recognitionfactors', 'hphob_eisenberg', 'hphob_sweet', 'hphob_woods', 'hphob_doolittle', 'hphob_manavalan', 'hphob_leo', 'hphob_black', 'hphob_breese', 'hphob_fauchere', 'hphob_guy', 'hphob_janin', 'hphob_miyazawa', 'hphob_argos', 'hphob_roseman', 'hphob_tanford', 'hphob_wolfenden', 'hphob_welling', 'hphob_wilson', 'hphob_parker', 'hphob_ph3_4', 'hphob_ph7_5', 'hphob_mobility', 'hplchfba', 'hplctfa', 'transmembranetendency', 'hplc2_1', 'hplc7_4', 'buriedresidues', 'accessibleresidues', 'hphob_chothia', 'hphob_rose', 'ratioside', 'averageburied', 'averageflexibility', 'alpha_helixfasman', 'beta_sheetfasman', 'beta_turnfasman', 'alpha_helixroux', 'beta_sheetroux', 'beta_turnroux', 'coilroux', 'alpha_helixlevitt', 'beta_sheetlevitt', 'beta_turnlevitt', 'totalbeta_strand', 'antiparallelbeta_strand', 'parallelbeta_strand', 'a_a_composition', 'a_a_swiss_prot', 'relativemutability']

These can be accessed programmatically with:

for _, d in g.nodes(data=True):
    print(d.keys())
    break

NB These are just examples, any numerical node feature can be provided - including user-defined features.

[15]:
from functools import partial
from graphein.protein.features.nodes import expasy_protein_scale

# Construct the graph with the expasy features.
config = ProteinGraphConfig(
    edge_construction_functions=[
        add_hydrophobic_interactions,
        add_aromatic_interactions,
        add_disulfide_interactions,
        add_peptide_bonds,
    ],
    node_metadata_functions=[partial(expasy_protein_scale, add_separate=True)], # Add expasy scale (add partial it so each feature is added under a separate key)
)
g = construct_graph(
    pdb_code="3eiy",
    config=config
    )

# Plot
plot_graph_metric_property_correlation(
    g,
    diagonal_visible=False,
    colour_by="residue_type",
    properties=[
        "pka_rgroup",
        "isoelectric_points",
        "bulkiness",
        "transmembranetendency",
        "coilroux",
        "relativemutability"
        ]
    )
DEBUG:graphein.protein.graphs:Deprotonating protein. This removes H atoms from the pdb_df dataframe
DEBUG:graphein.protein.graphs:Detected 174 total nodes
DEBUG:graphein.protein.features.nodes.amino_acid:Reading Expasy protein scales from: /Users/arianjamasb/github/graphein/graphein/protein/features/nodes/amino_acid_properties.csv
INFO:graphein.protein.edges.distance:Found 413 hydrophobic interactions.
INFO:graphein.protein.edges.distance:Found: 16 aromatic-aromatic interactions
DEBUG:graphein.protein.edges.distance:1 CYS residues found. Cannot add disulfide interactions with fewer than two CYS residues.
174