Graphein - Residue Graph Tutorial#

In this notebook, we’ll run through residue-level graph construction in Graphein. We start by discsussing the config, the high-level API and spend the bulk of the tutorial running through the various options. At the end we’ll see how the low-level API can be used to control each step of the graph construction process.

Open In Colab

[1]:
# Install Graphein if necessary
# !pip install graphein

# Install pymol if necessary - in this tutorial PyMol is only used for the initial plot. Feel free to skip!
# sudo apt-get install pymol (recommended for colab) OR conda install -c schrodinger pymol

First, let’s checkout the protein we’ll be playing with today:

[2]:
# NBVAL_SKIP
from graphein.utils.pymol import MolViewer
pymol = MolViewer()
pymol.delete("all") # delete all objects from other sessions if necessary.
pymol.fetch("3eiy")
pymol.show_as("cartoon")
pymol.display()
[2]:
../_images/notebooks_residue_graphs_3_0.png

Config#

Graphein is designed for processing datasets of protein structures into graphs. We rely on a global config object, `ProteinGraphConfig <https://graphein.ai/modules/graphein.protein.html#graphein.protein.config.ProteinGraphConfig>`__ to store various parameters in the high-level API. We use `pydantic <https://pydantic-docs.helpmanual.io/>`__ for the config object and provide sane defaults.

[3]:
from graphein.protein.config import ProteinGraphConfig

config = ProteinGraphConfig()
config.dict()
[3]:
{'granularity': 'CA',
 'keep_hets': False,
 'insertions': False,
 'pdb_dir': PosixPath('../examples/pdbs'),
 'verbose': False,
 'exclude_waters': True,
 'deprotonate': False,
 'protein_df_processing_functions': None,
 'edge_construction_functions': [<function graphein.protein.edges.distance.add_peptide_bonds(G: 'nx.Graph') -> 'nx.Graph'>],
 'node_metadata_functions': [<function graphein.protein.features.nodes.amino_acid.meiler_embedding(n, d, return_array: bool = False) -> Union[pandas.core.series.Series, <built-in function array>]>],
 'edge_metadata_functions': None,
 'graph_metadata_functions': None,
 'get_contacts_config': None,
 'dssp_config': None}

Let’s run through the parameters of `ProteinGraphConfig <https://graphein.ai/modules/graphein.protein.html#graphein.protein.config.ProteinGraphConfig>`__:

  • granularity: specifies the granularity of the graph (i.e. what should the nodes be). Possible values are: atom identifiers (e.g. "CA" for \(\alpha\) carbon, "CB" for \(\beta\) carbon), "centroid" to use residue centroids (under the hood, this is the same as "CA", but we use the average x,y,z coordinates for the atoms in the residue) or “atom” for atom-level construction. This is discussed in another notebook.

  • keep_hets: this is a boolean specifying whether or not to keep heteroatoms present in the .pdb file. Heteroatoms are typically non-protein atoms (waters, metal ions, ligands) but can sometimes contain non-standard or modified residues.

  • insertions: boolean specifying whether or not to keep insertions in the PDB file

  • pdb_dir optional path to a folder in which to save pdb files. Otherwise, /tmp/ will be used

  • verbose: bool controlling amount of info printed

  • exclude_waters: not implemented

  • deprotonate: bool indicating whether or not to remove Hydrogen atoms

  • protein_df_processing_functions: list of functions with which to process the PDB dataframe. Discussed in the low-level API.

  • edge_construction_functions: list of functions to compute edges with

  • node_metadata_functions: list of functions to annotate nodes with

  • edge_metadata_functions: list of functions to annotate edges with

  • graph_meta_functions: list of functions to annotate graph with

  • get_contacts_config: A separate config object if using GetContacts edge construction functions

If you wish, you can construct an entirely fresh configuration. Or, if we wish to change only some of these parameters, we can pass a dictionary containing our modifications:

[4]:
params_to_change = {"granularity": "centroids"}

config = ProteinGraphConfig(**params_to_change)
config.dict()
[4]:
{'granularity': 'centroids',
 'keep_hets': False,
 'insertions': False,
 'pdb_dir': PosixPath('../examples/pdbs'),
 'verbose': False,
 'exclude_waters': True,
 'deprotonate': False,
 'protein_df_processing_functions': None,
 'edge_construction_functions': [<function graphein.protein.edges.distance.add_peptide_bonds(G: 'nx.Graph') -> 'nx.Graph'>],
 'node_metadata_functions': [<function graphein.protein.features.nodes.amino_acid.meiler_embedding(n, d, return_array: bool = False) -> Union[pandas.core.series.Series, <built-in function array>]>],
 'edge_metadata_functions': None,
 'graph_metadata_functions': None,
 'get_contacts_config': None,
 'dssp_config': None}

High-level API#

Graphein features a high-level API which should be applicable for most simple graph constructions. This can be used on either a .pdb file (so you can run whatever pre-processing you wish), or we can provide a PDB accession code and retrieve a structure from the PDB itself. If a path is provided, it takes precedence over the PDB code.

To use it we do as follows:

[5]:
from graphein.protein.graphs import construct_graph

g = construct_graph(config=config, pdb_code="3eiy")
DEBUG:graphein.protein.graphs:Deprotonating protein. This removes H atoms from the pdb_df dataframe
DEBUG:graphein.protein.graphs:Converting dataframe to centroids. This averages XYZ coords of the atoms in a residue
DEBUG:graphein.protein.graphs:Calculated 174 centroid nodes
DEBUG:graphein.protein.graphs:Detected 174 total nodes
DEBUG:graphein.protein.features.nodes.amino_acid:Reading meiler embeddings from: /Users/arianjamasb/github/graphein/graphein/protein/features/nodes/meiler_embeddings.csv
174

If you wish to use a local .pdb file, you can run:

g = construct_graph(config=config, pdb_path="../graphein/examples/pdbs/3eiy.pdb")

Let’s check out the results with the in-built visualisation

[6]:
from graphein.protein.visualisation import plotly_protein_structure_graph

p = plotly_protein_structure_graph(
    g,
    colour_edges_by="kind",
    colour_nodes_by="degree",
    label_node_ids=False,
    plot_title="Peptide backbone graph. Nodes coloured by degree.",
    node_size_multiplier=1
    )
p.show()

`Cool! So let’s look at what we’ve got here:

We’ve plotted the graph, positioning nodes according to their x,y,z coordinates and coloured them by their degree. As we can see all the nodes are yellow except the two corresponding to the N and C terminal residues. Why is this? Because we’ve only computed the peptide-bond edges. Now, we probably want more so let’s look at how to do that!

Edge Functions#

Graphein is implemented in a functional fashion. In this case, this means in order to compute edges, we pass a list of edge construction functions to the construction. We have supplied a number of edge computation functions. These are located in: * `graphein.protein.edges.distance <https://graphein.ai/modules/graphein.protein.html#module-graphein.protein.edges.distance>`__ * `graphein.protein.edges.intramolecular <https://graphein.ai/modules/graphein.protein.html#module-graphein.protein.edges.intramolecular>`__ (these rely on an installation of GetContacts, an optional dependency) and a separate GetContactsConfig * `graphein.protein.edges.atomic <https://graphein.ai/modules/graphein.protein.html#module-graphein.protein.edges.atomic>`__ (these are used in atomic-level graphs and we discuss these in another notebook tutorial)

However, edge functions are simple functions that take in an nx.Graph and return an nx.Graph with added edges. This means users can easily define their own to suit their purposes! Let’s take a closer look at some of the in-built functions before defining our own.

Built-in Edge Functions#

[7]:
from graphein.protein.edges.distance import add_hydrogen_bond_interactions, add_peptide_bonds

new_edge_funcs = {"edge_construction_functions": [add_peptide_bonds, add_hydrogen_bond_interactions]}
config = ProteinGraphConfig(**new_edge_funcs)


g = construct_graph(config=config, pdb_code="3eiy")
p = plotly_protein_structure_graph(
    g,
    colour_edges_by="kind",
    colour_nodes_by="seq_position",
    label_node_ids=False,
    plot_title="Protein graph with peptide backbone and H-Bonds. \n Nodes coloured by sequence position. Edges coloured by type.",
    node_size_multiplier=1,
    )
p.show()
DEBUG:graphein.protein.graphs:Deprotonating protein. This removes H atoms from the pdb_df dataframe
DEBUG:graphein.protein.graphs:Detected 174 total nodes
INFO:graphein.protein.edges.distance:Found 75 hbond interactions.
INFO:graphein.protein.edges.distance:Found 7 hbond interactions.
174

Great! So what if we try a bunch of them

[8]:
from graphein.protein.edges.distance import (add_peptide_bonds,
                                             add_hydrogen_bond_interactions,
                                             add_disulfide_interactions,
                                             add_ionic_interactions,
                                             add_aromatic_interactions,
                                             add_aromatic_sulphur_interactions,
                                             add_cation_pi_interactions
                                            )

new_edge_funcs = {"edge_construction_functions": [add_peptide_bonds,
                                                  add_aromatic_interactions,
                                                  add_hydrogen_bond_interactions,
                                                  add_disulfide_interactions,
                                                  add_ionic_interactions,
                                                  add_aromatic_sulphur_interactions,
                                                  add_cation_pi_interactions]
                 }

config = ProteinGraphConfig(**new_edge_funcs)
g = construct_graph(config=config, pdb_code="3eiy")
p = plotly_protein_structure_graph(
    g,
    colour_edges_by="kind",
    colour_nodes_by="seq_position",
    label_node_ids=False,
    plot_title="Protein graph with: peptide backbone, H-Bonds, \n Disulphide, ionic, aromatic, aromatic-sulphur and cation-pi interactions. \n  Nodes coloured by sequence position, edges by type",
    node_size_multiplier=1
    )
p.show()
DEBUG:graphein.protein.graphs:Deprotonating protein. This removes H atoms from the pdb_df dataframe
DEBUG:graphein.protein.graphs:Detected 174 total nodes
INFO:graphein.protein.edges.distance:Found: 16 aromatic-aromatic interactions
INFO:graphein.protein.edges.distance:Found 75 hbond interactions.
INFO:graphein.protein.edges.distance:Found 7 hbond interactions.
DEBUG:graphein.protein.edges.distance:1 CYS residues found. Cannot add disulfide interactions with fewer than two CYS residues.
INFO:graphein.protein.edges.distance:Found 1308 ionic interactions.
174

Cool! So we’ve added a bunch on distance-based computations of intramolecular edges to our graph! We have a few more distance based edge functions that we can visualise. Let’s check them out!

First up, the Delaunay triangulation:

[9]:
from graphein.protein.edges.distance import add_delaunay_triangulation

new_edge_funcs = {"edge_construction_functions": [add_delaunay_triangulation]}
config = ProteinGraphConfig(**new_edge_funcs)

g = construct_graph(config=config, pdb_code="3eiy")
p = plotly_protein_structure_graph(
    G=g,
    colour_edges_by="kind",
    colour_nodes_by="seq_position",
    label_node_ids=False,
    node_size_multiplier=1,
    plot_title="Protein graph created by the Delaunay triangulation of the structure. \n Nodes coloured by sequence position."
    )
p.show()
DEBUG:graphein.protein.graphs:Deprotonating protein. This removes H atoms from the pdb_df dataframe
DEBUG:graphein.protein.graphs:Detected 174 total nodes
DEBUG:graphein.protein.edges.distance:Detected 954 simplices in the Delaunay Triangulation.

Next, let’s consider the last two edge construction functions `add_k_nn_edges <https://graphein.ai/modules/graphein.protein.html#graphein.protein.edges.distance.add_k_nn_edges>`__ and `add_distance_threshold <https://graphein.ai/modules/graphein.protein.html#graphein.protein.edges.distance.add_distance_threshold>`__. These two functions take additional parameters. We can manage this with partial functions. These functions also feature a long_interaction_threshold parameter which specifies the minimum number of residues in the sequence between two residues in order to add an edge. This is because we may not be so interested in residues close together in the sequence being close together in the graph.

[10]:
from functools import partial
from graphein.protein.edges.distance import add_distance_threshold

new_edge_funcs = {"edge_construction_functions": [partial(add_distance_threshold, long_interaction_threshold=5, threshold=10.)]}
config = ProteinGraphConfig(**new_edge_funcs)

g = construct_graph(config=config, pdb_code="3eiy")
p = plotly_protein_structure_graph(
    g,
    colour_edges_by="kind",
    colour_nodes_by="seq_position",
    label_node_ids=False,
    plot_title="Protein graph created by thresholding distance between nodes. \n Nodes must be <10A apart and at least 5 positions apart \n Nodes coloured by sequence position.",
    node_size_multiplier=1
    )
p.show()
DEBUG:graphein.protein.graphs:Deprotonating protein. This removes H atoms from the pdb_df dataframe
DEBUG:graphein.protein.graphs:Detected 174 total nodes
INFO:graphein.protein.edges.distance:Found: 3218 distance edges
INFO:graphein.protein.edges.distance:Added 1904 distance edges. (1314 removed by LIN)
[11]:
from functools import partial
from graphein.protein.edges.distance import add_k_nn_edges

new_edge_funcs = {"edge_construction_functions": [partial(add_k_nn_edges, k=3, long_interaction_threshold=0)]}
config = ProteinGraphConfig(**new_edge_funcs)

g = construct_graph(config=config, pdb_code="3eiy")
p = plotly_protein_structure_graph(
    g,
    colour_edges_by="kind",
    colour_nodes_by="seq_position",
    label_node_ids=False,
    plot_title="Protein graph created from K-NN of each node. Nodes coloured by sequence position",
    node_size_multiplier=1
    )
p.show()
DEBUG:graphein.protein.graphs:Deprotonating protein. This removes H atoms from the pdb_df dataframe
DEBUG:graphein.protein.graphs:Detected 174 total nodes
INFO:graphein.protein.edges.distance:Found: 522 KNN edges

Defining your own edge functions#

Now, let’s turn our attention to defining our own edge construction functions. These must take in a nx.Graph and return a nx.Graph. You can use these custom functions together with all the in-built edge functions as before. If you come up with something cool, consider making a pull-request and sharing it with the community!

Here, as an example, we define an arbitrary function that creates an edge between all histidine residues in the graph.

[12]:
import networkx as nx

def add_histidine_histidine_edges(G: nx.Graph) -> nx.Graph:
    # Iterate over nodes to identify histidines
    histidines = [n for n, d in G.nodes(data=True) if d["residue_name"] == "HIS"]

    # Iterate over histidines and create a bond between them
    [G.add_edge(x, y, kind={"histidine"}) for i, x in enumerate(histidines) for j, y in enumerate(histidines) if i!=j]
    return G
[13]:
new_edge_funcs = {"edge_construction_functions": [add_histidine_histidine_edges]}
config = ProteinGraphConfig(**new_edge_funcs)

g = construct_graph(config=config, pdb_code="3eiy")
p = plotly_protein_structure_graph(
    g,
    colour_edges_by="kind",
    colour_nodes_by="seq_position",
    label_node_ids=False,
    plot_title="Protein graph created using a user-defined function that connects all Histidines."
    )
p.show()
# Let's check this worked:
for u, v, a in g.edges(data=True):
    print(u, v, a)
DEBUG:graphein.protein.graphs:Deprotonating protein. This removes H atoms from the pdb_df dataframe
DEBUG:graphein.protein.graphs:Detected 174 total nodes
A:HIS:111 A:HIS:137 {'kind': {'histidine'}}
A:HIS:111 A:HIS:163 {'kind': {'histidine'}}
A:HIS:137 A:HIS:163 {'kind': {'histidine'}}

Features & Metadata#

Graphein is designed to facilitate geometric deep learning on protein structures and so we have a bunch of feature and metadata annotation functions. These operate on three levels: * node-level annotation * graph-level annotation * edge-level annotation

These behave very similarly to the edge construction functions we just looked at. You simply pass a list of functions that take the right sort of arguments and return the right types.

  • graph-level annotation takes in a nx.Graph and returns an nx.Graph (just like before!).

  • node-level annotation takes in a node, data tuple from G.nodes(data=True) and returns a pd.Series

  • edge-level annotation in node_u, node_v, data tuple from G.edges(data=True) and returns

Built-in Graph Annotation Functions#

These are found in graphein.protein.features.sequence and graphein.protein.features.graph. We make a slight distinction between those functions that operate on the protein sequence for cleaner organisation. N.B. we’ve mentioned the chain_selection parameter briefly - if you include multiple chains in your graph, then will be multiple sequences associated with it. These are accessed as such: G.graph[f"sequence_{CHAIN_ID}"] and features computed from sequence are accessed as: G.graph[f"{FEATURE_NAME}_{CHAIN_ID}]

[14]:
from graphein.protein.features.sequence.sequence import molecular_weight

new_graph_annotation_funcs = {"graph_metadata_functions": [molecular_weight]}
config = ProteinGraphConfig(**new_graph_annotation_funcs)

g = construct_graph(config=config, pdb_code="3eiy")
print("Sequence:", g.graph["sequence_A"])
print("MW:", g.graph["molecular_weight_A"])
DEBUG:graphein.protein.graphs:Deprotonating protein. This removes H atoms from the pdb_df dataframe
DEBUG:graphein.protein.graphs:Detected 174 total nodes
174
Sequence: SFSNVPAGKDLPQDFNVIIEIPAQSEPVKYEADKALGLLVVDRFIGTGMRYPVNYGFIPQTLSGDGDPVDVLVITPFPLLAGSVVRARALGMLKMTDESGVDAKLVAVPHDKVCPMTANLKSIDDVPAYLKDQIKHFFEQYKALEKGKWVKVEGWDGIDAAHKEITDGVANFKK
MW: 19029.71029999999

We also provide some utilities for adding graph-level features in the form of sequence embeddings from various pre-trained language models. Let’s check them out!

[15]:
# Warning! This cell may crash a binder notebook as the pre-trained model download is rather large!
# NBVAL_SKIP
from graphein.protein.features.sequence.embeddings import esm_sequence_embedding, biovec_sequence_embedding

new_graph_annotation_funcs = {"graph_metadata_functions": [esm_sequence_embedding, biovec_sequence_embedding]}
config = ProteinGraphConfig(**new_graph_annotation_funcs)

g = construct_graph(config=config, pdb_code="3eiy")
print("ESM:", g.graph["esm_embedding_A"])
print("biovec:", g.graph["biovec_embedding_A"])
DEBUG:graphein.protein.graphs:Deprotonating protein. This removes H atoms from the pdb_df dataframe
DEBUG:graphein.protein.graphs:Detected 174 total nodes
Using cache found in /Users/arianjamasb/.cache/torch/hub/facebookresearch_esm_master
/Users/arianjamasb/opt/anaconda3/envs/graphein-wip/lib/python3.8/site-packages/torchvision/io/image.py:11: UserWarning:

Failed to load image Python extension: dlopen(/Users/arianjamasb/opt/anaconda3/envs/graphein-wip/lib/python3.8/site-packages/torchvision/image.so, 0x0006): Symbol not found: __ZN2at4_ops19empty_memory_format4callEN3c108ArrayRefIxEENS2_8optionalINS2_10ScalarTypeEEENS5_INS2_6LayoutEEENS5_INS2_6DeviceEEENS5_IbEENS5_INS2_12MemoryFormatEEE
  Referenced from: /Users/arianjamasb/opt/anaconda3/envs/graphein-wip/lib/python3.8/site-packages/torchvision/image.so
  Expected in: /Users/arianjamasb/opt/anaconda3/envs/graphein-wip/lib/python3.8/site-packages/torch/lib/libtorch_cpu.dylib

174
INFO:gensim.utils:loading Word2Vec object from /Users/arianjamasb/github/graphein/graphein/protein/features/pretrained_models/swissprot-reviewed-protvec.model
DEBUG:smart_open.smart_open_lib:{'uri': '/Users/arianjamasb/github/graphein/graphein/protein/features/pretrained_models/swissprot-reviewed-protvec.model', 'mode': 'rb', 'buffering': -1, 'encoding': None, 'errors': None, 'newline': None, 'closefd': True, 'opener': None, 'ignore_ext': False, 'compression': None, 'transport_params': None}
INFO:gensim.utils:loading vocabulary recursively from /Users/arianjamasb/github/graphein/graphein/protein/features/pretrained_models/swissprot-reviewed-protvec.model.vocabulary.* with mmap=None
INFO:gensim.utils:loading wv recursively from /Users/arianjamasb/github/graphein/graphein/protein/features/pretrained_models/swissprot-reviewed-protvec.model.wv.* with mmap=None
INFO:gensim.utils:setting ignored attribute vectors_norm to None
INFO:gensim.utils:loading trainables recursively from /Users/arianjamasb/github/graphein/graphein/protein/features/pretrained_models/swissprot-reviewed-protvec.model.trainables.* with mmap=None
INFO:gensim.utils:setting ignored attribute cum_table to None
INFO:gensim.utils:loaded /Users/arianjamasb/github/graphein/graphein/protein/features/pretrained_models/swissprot-reviewed-protvec.model
ESM: [-0.02279943  0.16464774 -0.00461995 ...  0.02556622 -0.19798394
  0.09682215]
biovec: [array([ -8.865931  ,   5.65769   ,  -1.5052402 ,   3.2893941 ,
        -6.6228924 ,   2.4890645 ,   1.0635861 ,   2.4991994 ,
        -2.2740204 ,  -7.6614437 ,   0.3739053 ,   4.285517  ,
        -1.7950253 ,  -2.0930681 ,   0.19075748,   9.249631  ,
        -1.2537682 ,   6.0140305 ,  -3.0592997 ,   3.1944623 ,
         5.6747837 ,   1.4732779 ,  -9.556535  ,   4.5126605 ,
       -15.789192  ,  -9.917104  ,  10.61495   ,   1.6693095 ,
         1.8446481 , -12.656563  ,   2.5226657 ,   3.2782233 ,
         5.6363025 ,  -1.5776551 ,   9.913681  ,  -1.4776824 ,
         7.4292827 ,   2.6953607 ,  -0.81495905,   1.0415509 ,
       -18.352554  ,  -9.603085  , -11.215957  ,   3.9449768 ,
         2.2988393 ,   2.8612108 ,  -5.647224  ,  -9.894558  ,
         2.8099368 ,  -3.47802   ,  -1.7364154 ,   4.10581   ,
        -1.1337806 ,  -7.851351  ,   3.667841  ,   2.5098174 ,
         0.3878222 ,  -0.282476  ,   4.242093  ,   2.7918775 ,
        -0.3927042 ,  -3.504907  ,   2.2170143 ,  -6.7696056 ,
        -8.574576  ,   7.549987  ,   5.5607357 ,  -7.514735  ,
         5.778782  ,   2.1968088 ,   5.747694  ,  -4.4072075 ,
         1.6411833 ,   2.339804  ,   4.364424  ,  -4.615387  ,
         0.61950064,   7.19126   ,  -0.10569584,  -6.485407  ,
        -3.3555477 ,   0.58533096,   4.150122  ,   4.6470437 ,
        -7.6111784 ,   5.8581285 ,  -9.8836    ,   8.114656  ,
         4.3746543 ,   1.4033681 ,  -4.632345  ,  13.016971  ,
        -2.839704  ,  -6.210647  ,   3.4460034 ,   2.6573007 ,
       -14.987186  ,  -6.545936  ,  -2.3990562 ,   1.2732263 ],
      dtype=float32), array([ -9.079695  ,   3.7650356 ,   0.3614651 ,   2.1325634 ,
        -5.285296  ,   3.0021908 ,   1.2300422 ,   1.8999658 ,
        -2.0472856 ,  -5.5658956 ,   0.5014471 ,   3.8743246 ,
        -1.0728757 ,  -3.0830898 ,   0.73323405,   9.658365  ,
        -0.10142106,   6.2466826 ,  -2.4961333 ,   2.8022783 ,
         6.5060906 ,   1.6311501 , -10.04929   ,   5.4711785 ,
       -16.297068  , -10.765219  ,  10.580414  ,   0.503202  ,
         2.4227042 , -12.57771   ,   2.7075891 ,   3.9921677 ,
         5.5343657 ,  -0.9493397 ,  10.082592  ,   0.301396  ,
         6.978562  ,   2.4752004 ,  -1.1012759 ,  -1.1357698 ,
       -16.81547   ,  -9.117541  , -10.919918  ,   3.8029845 ,
         2.5540738 ,   3.0349758 ,  -5.4471354 ,  -9.438955  ,
         3.067747  ,  -3.1571815 ,  -3.1563768 ,   4.7784586 ,
        -2.3092556 ,  -7.704368  ,   0.82422435,   1.8206171 ,
         2.161718  ,   0.37946355,   4.7007704 ,   2.2930143 ,
         0.75466317,  -3.3875456 ,   2.3561945 ,  -6.173627  ,
        -7.655903  ,   6.066585  ,   5.7083497 ,  -8.785268  ,
         4.0877957 ,   2.1213973 ,   5.8782153 ,  -4.1335964 ,
         0.26376987,   3.185395  ,   3.9498727 ,  -4.5998893 ,
        -0.11535516,   8.4039135 ,  -0.8255909 ,  -5.910516  ,
        -2.2402806 ,  -0.15940607,   3.476339  ,   6.034683  ,
        -6.72777   ,   6.143824  ,  -9.662175  ,   8.747985  ,
         3.985702  ,   2.272303  ,  -3.933255  ,  12.071237  ,
        -1.8998326 ,  -6.6413393 ,   4.2529097 ,   3.4216733 ,
       -15.628299  ,  -5.513835  ,  -3.004577  ,   1.3258756 ],
      dtype=float32), array([ -6.9852    ,   4.425231  ,  -0.5174276 ,   2.0380528 ,
        -7.1311088 ,   2.4794824 ,   1.9649935 ,   1.9006643 ,
        -1.3209196 ,  -5.8989286 ,   1.2391084 ,   3.485902  ,
        -1.9545169 ,  -2.2625084 ,   1.8165749 ,   9.708086  ,
        -0.74385244,   6.5338845 ,  -3.0916584 ,   3.9312694 ,
         6.1540847 ,   3.4793413 , -10.056544  ,   5.7322254 ,
       -13.599701  , -10.923328  ,   9.378405  ,   0.555632  ,
         3.512236  , -11.891929  ,   2.382594  ,   3.2599561 ,
         5.123339  ,  -0.55314994,  10.585077  ,  -0.9078483 ,
         8.167705  ,   2.4789596 ,   0.78942966,  -1.0045129 ,
       -17.38679   ,  -9.496934  , -10.542821  ,   3.5852609 ,
         3.3952198 ,   2.9937162 ,  -5.9167666 ,  -9.517663  ,
         2.2889428 ,  -3.6738756 ,  -2.7072256 ,   4.6205435 ,
        -1.970848  ,  -6.7903805 ,   2.1213295 ,   2.5594456 ,
         0.4943355 ,   0.27333227,   4.096034  ,   1.2748998 ,
        -1.3203058 ,  -3.6010296 ,   2.623712  ,  -8.468901  ,
        -6.3624053 ,   7.388231  ,   4.908528  ,  -7.1587243 ,
         4.9127274 ,   3.2538335 ,   6.45744   ,  -3.3919108 ,
        -0.22479697,   2.4506824 ,   4.864544  ,  -4.201532  ,
        -0.08151506,   5.778758  ,  -0.5880602 ,  -7.52435   ,
        -2.3001645 ,   0.04607171,   4.0975637 ,   5.7935786 ,
        -8.676276  ,   6.952252  , -11.0132    ,   7.6395264 ,
         4.4887743 ,   2.5889583 ,  -5.0494356 ,  12.681698  ,
        -2.3577175 ,  -5.7180576 ,   3.481403  ,   3.9928114 ,
       -14.854708  ,  -6.135002  ,  -2.9227335 ,   1.2122653 ],
      dtype=float32)]

Writing your own graph-level annotation function.#

We provide a couple of useful functions for dealing with multiple chains robustly, namely compute_feature_over_chains and aggregate_feature_over_chains. This takes an input (nx.Graph), your function: (Callable), and feature_name (str). Your function must operate on a sequence (str) to use this utility.

aggregate_feature_over_chains can be used to perform aggregation {"min", "max", "sum" , "mean"} of chain-specific features. It takes: an nx.Graph, a feature name (str) and and aggregation type (str)

If you don’t want to operate on the sequnce, you don’t need to use these!

Let’s use this to construct a graph-level feature that operates on the sequence. This feature is an integer equal to the number of histidines in the chain.

[16]:
from graphein.protein.features.sequence.utils import compute_feature_over_chains, aggregate_feature_over_chains

# Define our graph-level function that operates on sequences
def histidine_count_feature(G: nx.Graph) -> nx.Graph:

    # Define our feature function that operates on a sequence
    def count_histidines(sequence: str) -> int:
         return sequence.count("H")

    # Compute the feature over the chains in the graph
    G = compute_feature_over_chains(G, func=count_histidines, feature_name="histidine_count")

    # Aggregate the feature over the chains
    G = aggregate_feature_over_chains(G, feature_name="histidine_count", aggregation_type="mean")
    return G


new_graph_annotation_funcs = {"graph_metadata_functions": [histidine_count_feature]}
config = ProteinGraphConfig(**new_graph_annotation_funcs)

# Test our new feature
g = construct_graph(config=config, pdb_code="3eiy")
print(g.graph["histidine_count_A"])

# And now on a multi-chain graph
g = construct_graph(config=config, pdb_code="9API", chain_selection="all")
print("Chain A Hists:", g.graph["histidine_count_A"])
print("Chain B Hists:", g.graph["histidine_count_B"])
print("Chain B Seq:", g.graph["sequence_B"])
print("Aggregated HIS count:", g.graph["histidine_count_mean"])
DEBUG:graphein.protein.graphs:Deprotonating protein. This removes H atoms from the pdb_df dataframe
DEBUG:graphein.protein.graphs:Detected 174 total nodes
174
3
DEBUG:graphein.protein.graphs:Deprotonating protein. This removes H atoms from the pdb_df dataframe
DEBUG:graphein.protein.graphs:Detected 375 total nodes
375
Chain A Hists: 11
Chain B Hists: 0
Chain B Seq: SIPPEVKFNKPFVFLMIEQNTKSPLFMGKVVNPTQK
Aggregated HIS count: 5.5

Node-level features#

Similarly, we provide some in-built functions for computing node-level features. For example, we’ll look at some featurisations of amino acid types.

First, we’ll look at a simple one-hot encoding of the amino acid type

[17]:
from graphein.protein.features.nodes.amino_acid import amino_acid_one_hot

config = ProteinGraphConfig(**{"node_metadata_functions": [amino_acid_one_hot]})
g = construct_graph(config=config, pdb_code="3eiy")

for n, d in g.nodes(data=True):
    print(d["amino_acid_one_hot"])
DEBUG:graphein.protein.graphs:Deprotonating protein. This removes H atoms from the pdb_df dataframe
DEBUG:graphein.protein.graphs:Detected 174 total nodes
174
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
[0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]
[1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
[0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0]
[0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
[0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]
[1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
[0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
[0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1]
[0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
[1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
[0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0]
[0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0]
[0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1]
[0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
[0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1]
[0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0]
[0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
[0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
[0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
[0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
[0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]
[0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0]
[1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0]
[1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0]
[1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0]
[0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
[0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
[0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
[1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]
[0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
[0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0]
[1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
[0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]
[1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1]
[0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
[0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0]
[0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1]
[0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
[1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0]
[0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
[0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
[0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0]
[0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0]
[0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
[1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0]
[0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
[18]:
from graphein.protein.features.nodes.amino_acid import expasy_protein_scale

config = ProteinGraphConfig(**{"node_metadata_functions": [expasy_protein_scale]})
g = construct_graph(config=config, pdb_code="3eiy")

for n, d in g.nodes(data=True):
    print(d['expasy'])
    break
DEBUG:graphein.protein.graphs:Deprotonating protein. This removes H atoms from the pdb_df dataframe
DEBUG:graphein.protein.graphs:Detected 174 total nodes
DEBUG:graphein.protein.features.nodes.amino_acid:Reading Expasy protein scales from: /Users/arianjamasb/github/graphein/graphein/protein/features/nodes/amino_acid_properties.csv
174
pka_cooh_alpha               2.21
pka_nh3                      9.15
pka_rgroup                   7.00
isoelectric_points           5.68
molecularweight            105.00
                            ...
antiparallelbeta_strand      0.87
parallelbeta_strand          0.70
a_a_composition              6.90
a_a_swiss_prot               6.56
relativemutability         120.00
Name: SER, Length: 61, dtype: float64

Geometric features#

API Reference

We can also add geometric features to protein structure graphs.

[19]:
from graphein.protein.features.nodes.geometry import add_sidechain_vector, add_beta_carbon_vector, add_sequence_neighbour_vector
from graphein.protein.visualisation import add_vector_to_plot

add_sidechain_vector(g)
add_beta_carbon_vector(g)
add_sequence_neighbour_vector(g)

fig = plotly_protein_structure_graph(g, node_size_multiplier=1)
fig = add_vector_to_plot(g, fig, "sidechain_vector", colour="red", scale=1.5)
fig = add_vector_to_plot(g, fig, "c_beta_vector", colour="blue", scale=1.5)
fig = add_vector_to_plot(g, fig, "sequence_neighbour_vector_n_to_c", colour="green", scale=1.5)
fig

Low-Level API#

Graphein also features a low-level API if you want to get down into the gritty details of the construction. Essentially, every step of the construction is a function that you can tweak/alter the sequencing of as you please. This is very flexible if not unnecessary for most applications

Pre-Processing#

All graphs begin as pd.DataFrame representations of the .PDB file. These are handled by PandasPDB. There is a convenience function here process_dataframe, to which you can pass a list of dataframe processing functions. These are similar to the edge construction and annotation functions described above except they take in a pd.DataFrame and return a pd.DataFrame. We’ll start by reimplementing the construction we’ve been using above.

[20]:
from graphein.protein.graphs import read_pdb_to_dataframe, process_dataframe, deprotonate_structure, convert_structure_to_centroids, subset_structure_to_atom_type, filter_hetatms, remove_insertions

processing_funcs = [deprotonate_structure, convert_structure_to_centroids, remove_insertions]


pdb_code = "3eiy"
# Read dataframe from PDB
raw_df = read_pdb_to_dataframe(pdb_code=pdb_code)

# Apply processing functions
df = process_dataframe(raw_df, atom_df_processing_funcs = processing_funcs)
df.head()
DEBUG:graphein.protein.graphs:Deprotonating protein. This removes H atoms from the pdb_df dataframe
DEBUG:graphein.protein.graphs:Converting dataframe to centroids. This averages XYZ coords of the atoms in a residue
DEBUG:graphein.protein.graphs:Calculated 174 centroid nodes
[20]:
record_name atom_number blank_1 atom_name alt_loc residue_name blank_2 chain_id residue_number insertion ... y_coord z_coord occupancy b_factor blank_4 segment_id element_symbol charge line_idx node_id
0 ATOM 2 CA SER A 2 ... 54.615500 -0.136833 1.0 52.54 C NaN 610 A:SER:2
1 ATOM 8 CA PHE A 3 ... 51.417273 2.656727 1.0 48.73 C NaN 616 A:PHE:3
2 ATOM 19 CA SER A 4 ... 50.090833 -2.889667 1.0 47.00 C NaN 627 A:SER:4
3 ATOM 25 CA ASN A 5 ... 52.759000 -4.146375 1.0 41.42 C NaN 633 A:ASN:5
4 ATOM 33 CA VAL A 6 ... 51.575429 0.109000 1.0 32.41 C NaN 641 A:VAL:6

5 rows × 22 columns

We can define our own processing functions. Here we define a function that removes odd numbered residues from the dataframe. We’ll use this again later.

[21]:
import pandas as pd

def remove_odd_resis(df: pd.DataFrame) -> pd.DataFrame:
    return df.loc[df.index % 2 == 0]

remove_odd_resis(df).head()
[21]:
record_name atom_number blank_1 atom_name alt_loc residue_name blank_2 chain_id residue_number insertion ... y_coord z_coord occupancy b_factor blank_4 segment_id element_symbol charge line_idx node_id
0 ATOM 2 CA SER A 2 ... 54.615500 -0.136833 1.0 52.54 C NaN 610 A:SER:2
2 ATOM 19 CA SER A 4 ... 50.090833 -2.889667 1.0 47.00 C NaN 627 A:SER:4
4 ATOM 33 CA VAL A 6 ... 51.575429 0.109000 1.0 32.41 C NaN 641 A:VAL:6
6 ATOM 47 CA ALA A 8 ... 46.781200 0.618800 1.0 25.67 C NaN 655 A:ALA:8
8 ATOM 56 CA LYS A 10 ... 51.873333 1.231222 1.0 27.63 C NaN 664 A:LYS:10

5 rows × 22 columns

Building the Graph#

First we initialise the graph with the metadata. We then use this to add the nodes

[22]:
from graphein.protein.graphs import initialise_graph_with_metadata, add_nodes_to_graph

g = initialise_graph_with_metadata(protein_df=df, # from above cell
                                   raw_pdb_df=raw_df.df["ATOM"], # Store this for traceability
                                   pdb_id = pdb_code, #and again
                                   granularity = "centroid" # Store this so we know what kind of graph we have
                                  )

g = add_nodes_to_graph(g)
print(g.nodes)
['A:SER:2', 'A:PHE:3', 'A:SER:4', 'A:ASN:5', 'A:VAL:6', 'A:PRO:7', 'A:ALA:8', 'A:GLY:9', 'A:LYS:10', 'A:ASP:11', 'A:LEU:12', 'A:PRO:13', 'A:GLN:14', 'A:ASP:15', 'A:PHE:16', 'A:ASN:17', 'A:VAL:18', 'A:ILE:19', 'A:ILE:20', 'A:GLU:21', 'A:ILE:22', 'A:PRO:23', 'A:ALA:24', 'A:GLN:25', 'A:SER:26', 'A:GLU:27', 'A:PRO:28', 'A:VAL:29', 'A:LYS:30', 'A:TYR:31', 'A:GLU:32', 'A:ALA:33', 'A:ASP:34', 'A:LYS:35', 'A:ALA:36', 'A:LEU:37', 'A:GLY:38', 'A:LEU:39', 'A:LEU:40', 'A:VAL:41', 'A:VAL:42', 'A:ASP:43', 'A:ARG:44', 'A:PHE:45', 'A:ILE:46', 'A:GLY:47', 'A:THR:48', 'A:GLY:49', 'A:MET:50', 'A:ARG:51', 'A:TYR:52', 'A:PRO:53', 'A:VAL:54', 'A:ASN:55', 'A:TYR:56', 'A:GLY:57', 'A:PHE:58', 'A:ILE:59', 'A:PRO:60', 'A:GLN:61', 'A:THR:62', 'A:LEU:63', 'A:SER:64', 'A:GLY:65', 'A:ASP:66', 'A:GLY:67', 'A:ASP:68', 'A:PRO:69', 'A:VAL:70', 'A:ASP:71', 'A:VAL:72', 'A:LEU:73', 'A:VAL:74', 'A:ILE:75', 'A:THR:76', 'A:PRO:77', 'A:PHE:78', 'A:PRO:79', 'A:LEU:80', 'A:LEU:81', 'A:ALA:82', 'A:GLY:83', 'A:SER:84', 'A:VAL:85', 'A:VAL:86', 'A:ARG:87', 'A:ALA:88', 'A:ARG:89', 'A:ALA:90', 'A:LEU:91', 'A:GLY:92', 'A:MET:93', 'A:LEU:94', 'A:LYS:95', 'A:MET:96', 'A:THR:97', 'A:ASP:98', 'A:GLU:99', 'A:SER:100', 'A:GLY:101', 'A:VAL:102', 'A:ASP:103', 'A:ALA:104', 'A:LYS:105', 'A:LEU:106', 'A:VAL:107', 'A:ALA:108', 'A:VAL:109', 'A:PRO:110', 'A:HIS:111', 'A:ASP:112', 'A:LYS:113', 'A:VAL:114', 'A:CYS:115', 'A:PRO:116', 'A:MET:117', 'A:THR:118', 'A:ALA:119', 'A:ASN:120', 'A:LEU:121', 'A:LYS:122', 'A:SER:123', 'A:ILE:124', 'A:ASP:125', 'A:ASP:126', 'A:VAL:127', 'A:PRO:128', 'A:ALA:129', 'A:TYR:130', 'A:LEU:131', 'A:LYS:132', 'A:ASP:133', 'A:GLN:134', 'A:ILE:135', 'A:LYS:136', 'A:HIS:137', 'A:PHE:138', 'A:PHE:139', 'A:GLU:140', 'A:GLN:141', 'A:TYR:142', 'A:LYS:143', 'A:ALA:144', 'A:LEU:145', 'A:GLU:146', 'A:LYS:147', 'A:GLY:148', 'A:LYS:149', 'A:TRP:150', 'A:VAL:151', 'A:LYS:152', 'A:VAL:153', 'A:GLU:154', 'A:GLY:155', 'A:TRP:156', 'A:ASP:157', 'A:GLY:158', 'A:ILE:159', 'A:ASP:160', 'A:ALA:161', 'A:ALA:162', 'A:HIS:163', 'A:LYS:164', 'A:GLU:165', 'A:ILE:166', 'A:THR:167', 'A:ASP:168', 'A:GLY:169', 'A:VAL:170', 'A:ALA:171', 'A:ASN:172', 'A:PHE:173', 'A:LYS:174', 'A:LYS:175']

Next, we add edges to the graph:

[23]:
from graphein.protein.graphs import compute_edges
g = compute_edges(g, get_contacts_config=None, funcs=[add_peptide_bonds, add_hydrogen_bond_interactions])
print(g.edges)
INFO:graphein.protein.edges.distance:Found 75 hbond interactions.
INFO:graphein.protein.edges.distance:Found 7 hbond interactions.
174
[('A:SER:2', 'A:PHE:3'), ('A:PHE:3', 'A:SER:4'), ('A:SER:4', 'A:ASN:5'), ('A:ASN:5', 'A:VAL:6'), ('A:VAL:6', 'A:PRO:7'), ('A:PRO:7', 'A:ALA:8'), ('A:ALA:8', 'A:GLY:9'), ('A:GLY:9', 'A:LYS:10'), ('A:LYS:10', 'A:ASP:11'), ('A:ASP:11', 'A:LEU:12'), ('A:LEU:12', 'A:PRO:13'), ('A:PRO:13', 'A:GLN:14'), ('A:GLN:14', 'A:ASP:15'), ('A:ASP:15', 'A:PHE:16'), ('A:ASP:15', 'A:ARG:87'), ('A:PHE:16', 'A:ASN:17'), ('A:ASN:17', 'A:VAL:18'), ('A:VAL:18', 'A:ILE:19'), ('A:ILE:19', 'A:ILE:20'), ('A:ILE:20', 'A:GLU:21'), ('A:GLU:21', 'A:ILE:22'), ('A:ILE:22', 'A:PRO:23'), ('A:PRO:23', 'A:ALA:24'), ('A:ALA:24', 'A:GLN:25'), ('A:GLN:25', 'A:SER:26'), ('A:SER:26', 'A:GLU:27'), ('A:GLU:27', 'A:PRO:28'), ('A:PRO:28', 'A:VAL:29'), ('A:VAL:29', 'A:LYS:30'), ('A:LYS:30', 'A:TYR:31'), ('A:TYR:31', 'A:GLU:32'), ('A:GLU:32', 'A:ALA:33'), ('A:ALA:33', 'A:ASP:34'), ('A:ASP:34', 'A:LYS:35'), ('A:LYS:35', 'A:ALA:36'), ('A:ALA:36', 'A:LEU:37'), ('A:LEU:37', 'A:GLY:38'), ('A:GLY:38', 'A:LEU:39'), ('A:LEU:39', 'A:LEU:40'), ('A:LEU:40', 'A:VAL:41'), ('A:VAL:41', 'A:VAL:42'), ('A:VAL:42', 'A:ASP:43'), ('A:ASP:43', 'A:ARG:44'), ('A:ARG:44', 'A:PHE:45'), ('A:ARG:44', 'A:LYS:149'), ('A:PHE:45', 'A:ILE:46'), ('A:ILE:46', 'A:GLY:47'), ('A:GLY:47', 'A:THR:48'), ('A:THR:48', 'A:GLY:49'), ('A:GLY:49', 'A:MET:50'), ('A:MET:50', 'A:ARG:51'), ('A:ARG:51', 'A:TYR:52'), ('A:TYR:52', 'A:PRO:53'), ('A:PRO:53', 'A:VAL:54'), ('A:VAL:54', 'A:ASN:55'), ('A:ASN:55', 'A:TYR:56'), ('A:TYR:56', 'A:GLY:57'), ('A:TYR:56', 'A:LYS:105'), ('A:TYR:56', 'A:ASP:71'), ('A:GLY:57', 'A:PHE:58'), ('A:PHE:58', 'A:ILE:59'), ('A:ILE:59', 'A:PRO:60'), ('A:PRO:60', 'A:GLN:61'), ('A:GLN:61', 'A:THR:62'), ('A:THR:62', 'A:LEU:63'), ('A:LEU:63', 'A:SER:64'), ('A:SER:64', 'A:GLY:65'), ('A:GLY:65', 'A:ASP:66'), ('A:ASP:66', 'A:GLY:67'), ('A:GLY:67', 'A:ASP:68'), ('A:ASP:68', 'A:PRO:69'), ('A:PRO:69', 'A:VAL:70'), ('A:VAL:70', 'A:ASP:71'), ('A:ASP:71', 'A:VAL:72'), ('A:ASP:71', 'A:LYS:105'), ('A:VAL:72', 'A:LEU:73'), ('A:LEU:73', 'A:VAL:74'), ('A:VAL:74', 'A:ILE:75'), ('A:ILE:75', 'A:THR:76'), ('A:THR:76', 'A:PRO:77'), ('A:PRO:77', 'A:PHE:78'), ('A:PHE:78', 'A:PRO:79'), ('A:PRO:79', 'A:LEU:80'), ('A:LEU:80', 'A:LEU:81'), ('A:LEU:81', 'A:ALA:82'), ('A:ALA:82', 'A:GLY:83'), ('A:GLY:83', 'A:SER:84'), ('A:SER:84', 'A:VAL:85'), ('A:VAL:85', 'A:VAL:86'), ('A:VAL:86', 'A:ARG:87'), ('A:ARG:87', 'A:ALA:88'), ('A:ALA:88', 'A:ARG:89'), ('A:ARG:89', 'A:ALA:90'), ('A:ARG:89', 'A:ASP:112'), ('A:ALA:90', 'A:LEU:91'), ('A:LEU:91', 'A:GLY:92'), ('A:GLY:92', 'A:MET:93'), ('A:MET:93', 'A:LEU:94'), ('A:LEU:94', 'A:LYS:95'), ('A:LYS:95', 'A:MET:96'), ('A:MET:96', 'A:THR:97'), ('A:THR:97', 'A:ASP:98'), ('A:ASP:98', 'A:GLU:99'), ('A:ASP:98', 'A:LYS:143'), ('A:GLU:99', 'A:SER:100'), ('A:SER:100', 'A:GLY:101'), ('A:GLY:101', 'A:VAL:102'), ('A:VAL:102', 'A:ASP:103'), ('A:ASP:103', 'A:ALA:104'), ('A:ASP:103', 'A:LYS:105'), ('A:ALA:104', 'A:LYS:105'), ('A:LYS:105', 'A:LEU:106'), ('A:LEU:106', 'A:VAL:107'), ('A:VAL:107', 'A:ALA:108'), ('A:ALA:108', 'A:VAL:109'), ('A:VAL:109', 'A:PRO:110'), ('A:PRO:110', 'A:HIS:111'), ('A:HIS:111', 'A:ASP:112'), ('A:ASP:112', 'A:LYS:113'), ('A:LYS:113', 'A:VAL:114'), ('A:VAL:114', 'A:CYS:115'), ('A:CYS:115', 'A:PRO:116'), ('A:CYS:115', 'A:MET:117'), ('A:PRO:116', 'A:MET:117'), ('A:MET:117', 'A:THR:118'), ('A:THR:118', 'A:ALA:119'), ('A:ALA:119', 'A:ASN:120'), ('A:ASN:120', 'A:LEU:121'), ('A:LEU:121', 'A:LYS:122'), ('A:LYS:122', 'A:SER:123'), ('A:SER:123', 'A:ILE:124'), ('A:ILE:124', 'A:ASP:125'), ('A:ASP:125', 'A:ASP:126'), ('A:ASP:126', 'A:VAL:127'), ('A:VAL:127', 'A:PRO:128'), ('A:PRO:128', 'A:ALA:129'), ('A:ALA:129', 'A:TYR:130'), ('A:TYR:130', 'A:LEU:131'), ('A:LEU:131', 'A:LYS:132'), ('A:LYS:132', 'A:ASP:133'), ('A:ASP:133', 'A:GLN:134'), ('A:GLN:134', 'A:ILE:135'), ('A:ILE:135', 'A:LYS:136'), ('A:LYS:136', 'A:HIS:137'), ('A:HIS:137', 'A:PHE:138'), ('A:PHE:138', 'A:PHE:139'), ('A:PHE:139', 'A:GLU:140'), ('A:GLU:140', 'A:GLN:141'), ('A:GLN:141', 'A:TYR:142'), ('A:TYR:142', 'A:LYS:143'), ('A:LYS:143', 'A:ALA:144'), ('A:ALA:144', 'A:LEU:145'), ('A:LEU:145', 'A:GLU:146'), ('A:GLU:146', 'A:LYS:147'), ('A:LYS:147', 'A:GLY:148'), ('A:GLY:148', 'A:LYS:149'), ('A:LYS:149', 'A:TRP:150'), ('A:TRP:150', 'A:VAL:151'), ('A:VAL:151', 'A:LYS:152'), ('A:LYS:152', 'A:VAL:153'), ('A:VAL:153', 'A:GLU:154'), ('A:GLU:154', 'A:GLY:155'), ('A:GLY:155', 'A:TRP:156'), ('A:TRP:156', 'A:ASP:157'), ('A:ASP:157', 'A:GLY:158'), ('A:GLY:158', 'A:ILE:159'), ('A:ILE:159', 'A:ASP:160'), ('A:ASP:160', 'A:ALA:161'), ('A:ALA:161', 'A:ALA:162'), ('A:ALA:162', 'A:HIS:163'), ('A:HIS:163', 'A:LYS:164'), ('A:LYS:164', 'A:GLU:165'), ('A:LYS:164', 'A:ASP:168'), ('A:GLU:165', 'A:ILE:166'), ('A:ILE:166', 'A:THR:167'), ('A:THR:167', 'A:ASP:168'), ('A:ASP:168', 'A:GLY:169'), ('A:GLY:169', 'A:VAL:170'), ('A:VAL:170', 'A:ALA:171'), ('A:ALA:171', 'A:ASN:172'), ('A:ASN:172', 'A:PHE:173'), ('A:PHE:173', 'A:LYS:174'), ('A:LYS:174', 'A:LYS:175')]

Let’s check it out!

[24]:
p = plotly_protein_structure_graph(
    g,
    colour_edges_by="kind",
    colour_nodes_by="seq_position",
    label_node_ids=False,
    plot_title="Protein graph created using low-level API. Shows peptide bond back bond and H-Bonds",
    node_size_multiplier=1
    )
p.show()

Just as we expected! From here we can apply any number of graph, node or edge metadata annotation functions that we may wish to. The advantage here is that we are not tied to the sequencing in the high-level API of:

(annotate node metadata -> compute edges -> annotate graph metadata -> annotate edge metadata).

This means, if we want, we can define graph annotation functions that utilise edge metadata that would otherwise not be possible. E.g. here we annotate bonds containing histidines with a 1. We then define a graph feature that looks for these bonds, sums the residue numbers involved and sums them.

[25]:
from graphein.protein.graphs import annotate_node_metadata, annotate_graph_metadata, annotate_edge_metadata

# Define edge feature that assigns hist_status = 1 to any bond involving a histidine
def bond_has_histidine_edge_feature(u, v, d):
    if "HIS" in u or "HIS" in v:
        d["hist_status"] = 1
    else:
        d["hist_status"] = 0

# Annotate the edge metadata
g = annotate_edge_metadata(g, [bond_has_histidine_edge_feature])

# Check it out
for u, v, d in g.edges(data=True):
    if d["hist_status"] == 1:
        print(u, v, d)

# Annotate graph-level feature that depends on histidine_status edge feature. This adds the sum of the residue numbers involved in hist_status bonds
def sum_hist_status_bond_residue_nums(G: nx.Graph) -> nx.Graph:
    to_add = []

    for u, v, a in G.edges(data=True):
        if a["hist_status"] == 1:
            to_add.append(G.nodes[u]["residue_number"])
            to_add.append(G.nodes[v]["residue_number"])

    total = sum(to_add)

    G.graph["hist_bond_total_residue_number"] = total
    return G


g = annotate_graph_metadata(g, [sum_hist_status_bond_residue_nums])
print("Graph-level hist feature:", g.graph["hist_bond_total_residue_number"])
A:PRO:110 A:HIS:111 {'kind': {'peptide_bond'}, 'hist_status': 1}
A:HIS:111 A:ASP:112 {'kind': {'peptide_bond'}, 'hist_status': 1}
A:LYS:136 A:HIS:137 {'kind': {'peptide_bond'}, 'hist_status': 1}
A:HIS:137 A:PHE:138 {'kind': {'peptide_bond'}, 'hist_status': 1}
A:ALA:162 A:HIS:163 {'kind': {'peptide_bond'}, 'hist_status': 1}
A:HIS:163 A:LYS:164 {'kind': {'peptide_bond'}, 'hist_status': 1}
Graph-level hist feature: 1644

fin