Graphein - Residue Graph Tutorial#

In this notebook, we’ll run through residue-level graph construction in Graphein. We start by discsussing the config, the high-level API and spend the bulk of the tutorial running through the various options. At the end we’ll see how the low-level API can be used to control each step of the graph construction process.

[1]:

# Install Graphein if necessary
# !pip install graphein

# Install pymol if necessary - in this tutorial PyMol is only used for the initial plot. Feel free to skip!
# sudo apt-get install pymol (recommended for colab) OR conda install -c schrodinger pymol

First, let’s checkout the protein we’ll be playing with today:

[2]:

# NBVAL_SKIP
from graphein.utils.pymol import MolViewer
pymol = MolViewer()
pymol.delete("all") # delete all objects from other sessions if necessary.
pymol.fetch("3eiy")
pymol.show_as("cartoon")
pymol.display()

[2]:

../_images/notebooks_residue_graphs_3_0.png

Config#

Graphein is designed for processing datasets of protein structures into graphs. We rely on a global config object, `ProteinGraphConfig <https://graphein.ai/modules/graphein.protein.html#graphein.protein.config.ProteinGraphConfig>`__ to store various parameters in the high-level API. We use `pydantic <https://pydantic-docs.helpmanual.io/>`__ for the config object and provide sane defaults.

[3]:

from graphein.protein.config import ProteinGraphConfig

config = ProteinGraphConfig()
config.dict()

[3]:

{'granularity': 'CA',
 'keep_hets': False,
 'insertions': False,
 'pdb_dir': PosixPath('../examples/pdbs'),
 'verbose': False,
 'exclude_waters': True,
 'deprotonate': False,
 'protein_df_processing_functions': None,
 'edge_construction_functions': [<function graphein.protein.edges.distance.add_peptide_bonds(G: 'nx.Graph') -> 'nx.Graph'>],
 'node_metadata_functions': [<function graphein.protein.features.nodes.amino_acid.meiler_embedding(n, d, return_array: bool = False) -> Union[pandas.core.series.Series, <built-in function array>]>],
 'edge_metadata_functions': None,
 'graph_metadata_functions': None,
 'get_contacts_config': None,
 'dssp_config': None}

Let’s run through the parameters of `ProteinGraphConfig <https://graphein.ai/modules/graphein.protein.html#graphein.protein.config.ProteinGraphConfig>`__:

granularity: specifies the granularity of the graph (i.e. what should the nodes be). Possible values are: atom identifiers (e.g. "CA" for \(\alpha\) carbon, "CB" for \(\beta\) carbon), "centroid" to use residue centroids (under the hood, this is the same as "CA", but we use the average x,y,z coordinates for the atoms in the residue) or “atom” for atom-level construction. This is discussed in another notebook.
keep_hets: this is a boolean specifying whether or not to keep heteroatoms present in the .pdb file. Heteroatoms are typically non-protein atoms (waters, metal ions, ligands) but can sometimes contain non-standard or modified residues.
insertions: boolean specifying whether or not to keep insertions in the PDB file
pdb_dir optional path to a folder in which to save pdb files. Otherwise, /tmp/ will be used
verbose: bool controlling amount of info printed
exclude_waters: not implemented
deprotonate: bool indicating whether or not to remove Hydrogen atoms
protein_df_processing_functions: list of functions with which to process the PDB dataframe. Discussed in the low-level API.
edge_construction_functions: list of functions to compute edges with
node_metadata_functions: list of functions to annotate nodes with
edge_metadata_functions: list of functions to annotate edges with
graph_meta_functions: list of functions to annotate graph with
get_contacts_config: A separate config object if using GetContacts edge construction functions

If you wish, you can construct an entirely fresh configuration. Or, if we wish to change only some of these parameters, we can pass a dictionary containing our modifications:

[4]:

params_to_change = {"granularity": "centroids"}

config = ProteinGraphConfig(**params_to_change)
config.dict()

[4]:

{'granularity': 'centroids',
 'keep_hets': False,
 'insertions': False,
 'pdb_dir': PosixPath('../examples/pdbs'),
 'verbose': False,
 'exclude_waters': True,
 'deprotonate': False,
 'protein_df_processing_functions': None,
 'edge_construction_functions': [<function graphein.protein.edges.distance.add_peptide_bonds(G: 'nx.Graph') -> 'nx.Graph'>],
 'node_metadata_functions': [<function graphein.protein.features.nodes.amino_acid.meiler_embedding(n, d, return_array: bool = False) -> Union[pandas.core.series.Series, <built-in function array>]>],
 'edge_metadata_functions': None,
 'graph_metadata_functions': None,
 'get_contacts_config': None,
 'dssp_config': None}

High-level API#

Graphein features a high-level API which should be applicable for most simple graph constructions. This can be used on either a .pdb file (so you can run whatever pre-processing you wish), or we can provide a PDB accession code and retrieve a structure from the PDB itself. If a path is provided, it takes precedence over the PDB code.

To use it we do as follows:

[5]:

from graphein.protein.graphs import construct_graph

g = construct_graph(config=config, pdb_code="3eiy")

DEBUG:graphein.protein.graphs:Deprotonating protein. This removes H atoms from the pdb_df dataframe
DEBUG:graphein.protein.graphs:Converting dataframe to centroids. This averages XYZ coords of the atoms in a residue
DEBUG:graphein.protein.graphs:Calculated 174 centroid nodes
DEBUG:graphein.protein.graphs:Detected 174 total nodes
DEBUG:graphein.protein.features.nodes.amino_acid:Reading meiler embeddings from: /Users/arianjamasb/github/graphein/graphein/protein/features/nodes/meiler_embeddings.csv

If you wish to use a local .pdb file, you can run:

g = construct_graph(config=config, pdb_path="../graphein/examples/pdbs/3eiy.pdb")

Let’s check out the results with the in-built visualisation

[6]:

from graphein.protein.visualisation import plotly_protein_structure_graph

p = plotly_protein_structure_graph(
    g,
    colour_edges_by="kind",
    colour_nodes_by="degree",
    label_node_ids=False,
    plot_title="Peptide backbone graph. Nodes coloured by degree.",
    node_size_multiplier=1
    )
p.show()

`Cool! So let’s look at what we’ve got here:

We’ve plotted the graph, positioning nodes according to their x,y,z coordinates and coloured them by their degree. As we can see all the nodes are yellow except the two corresponding to the N and C terminal residues. Why is this? Because we’ve only computed the peptide-bond edges. Now, we probably want more so let’s look at how to do that!

Edge Functions#

Graphein is implemented in a functional fashion. In this case, this means in order to compute edges, we pass a list of edge construction functions to the construction. We have supplied a number of edge computation functions. These are located in: * `graphein.protein.edges.distance <https://graphein.ai/modules/graphein.protein.html#module-graphein.protein.edges.distance>`__ * `graphein.protein.edges.intramolecular <https://graphein.ai/modules/graphein.protein.html#module-graphein.protein.edges.intramolecular>`__ (these rely on an installation of GetContacts, an optional dependency) and a separate GetContactsConfig * `graphein.protein.edges.atomic <https://graphein.ai/modules/graphein.protein.html#module-graphein.protein.edges.atomic>`__ (these are used in atomic-level graphs and we discuss these in another notebook tutorial)

However, edge functions are simple functions that take in an nx.Graph and return an nx.Graph with added edges. This means users can easily define their own to suit their purposes! Let’s take a closer look at some of the in-built functions before defining our own.

Built-in Edge Functions#

[7]:

from graphein.protein.edges.distance import add_hydrogen_bond_interactions, add_peptide_bonds

new_edge_funcs = {"edge_construction_functions": [add_peptide_bonds, add_hydrogen_bond_interactions]}
config = ProteinGraphConfig(**new_edge_funcs)


g = construct_graph(config=config, pdb_code="3eiy")
p = plotly_protein_structure_graph(
    g,
    colour_edges_by="kind",
    colour_nodes_by="seq_position",
    label_node_ids=False,
    plot_title="Protein graph with peptide backbone and H-Bonds. \n Nodes coloured by sequence position. Edges coloured by type.",
    node_size_multiplier=1,
    )
p.show()

DEBUG:graphein.protein.graphs:Deprotonating protein. This removes H atoms from the pdb_df dataframe
DEBUG:graphein.protein.graphs:Detected 174 total nodes
INFO:graphein.protein.edges.distance:Found 75 hbond interactions.
INFO:graphein.protein.edges.distance:Found 7 hbond interactions.

Great! So what if we try a bunch of them

[8]:

from graphein.protein.edges.distance import (add_peptide_bonds,
                                             add_hydrogen_bond_interactions,
                                             add_disulfide_interactions,
                                             add_ionic_interactions,
                                             add_aromatic_interactions,
                                             add_aromatic_sulphur_interactions,
                                             add_cation_pi_interactions
                                            )

new_edge_funcs = {"edge_construction_functions": [add_peptide_bonds,
                                                  add_aromatic_interactions,
                                                  add_hydrogen_bond_interactions,
                                                  add_disulfide_interactions,
                                                  add_ionic_interactions,
                                                  add_aromatic_sulphur_interactions,
                                                  add_cation_pi_interactions]
                 }

config = ProteinGraphConfig(**new_edge_funcs)
g = construct_graph(config=config, pdb_code="3eiy")
p = plotly_protein_structure_graph(
    g,
    colour_edges_by="kind",
    colour_nodes_by="seq_position",
    label_node_ids=False,
    plot_title="Protein graph with: peptide backbone, H-Bonds, \n Disulphide, ionic, aromatic, aromatic-sulphur and cation-pi interactions. \n  Nodes coloured by sequence position, edges by type",
    node_size_multiplier=1
    )
p.show()

DEBUG:graphein.protein.graphs:Deprotonating protein. This removes H atoms from the pdb_df dataframe
DEBUG:graphein.protein.graphs:Detected 174 total nodes
INFO:graphein.protein.edges.distance:Found: 16 aromatic-aromatic interactions
INFO:graphein.protein.edges.distance:Found 75 hbond interactions.
INFO:graphein.protein.edges.distance:Found 7 hbond interactions.
DEBUG:graphein.protein.edges.distance:1 CYS residues found. Cannot add disulfide interactions with fewer than two CYS residues.
INFO:graphein.protein.edges.distance:Found 1308 ionic interactions.

Cool! So we’ve added a bunch on distance-based computations of intramolecular edges to our graph! We have a few more distance based edge functions that we can visualise. Let’s check them out!

`graphein.protein.edges.distance.add_delaunay_triangulation <https://graphein.ai/modules/graphein.protein.html#graphein.protein.edges.distance.add_delaunay_triangulation>`__: Adds edges based on the delaunay triangulation
`graphein.protein.edges.distance.add_k_nn_edges <https://graphein.ai/modules/graphein.protein.html#graphein.protein.edges.distance.add_k_nn_edges>`__: Adds edges based on K Nearest neighbours
`graphein.protein.edges.distance.add_distance_threshold <https://graphein.ai/modules/graphein.protein.html#graphein.protein.edges.distance.add_distance_threshold>`__: Adds edges based on distance cutoff in Angstroms

First up, the Delaunay triangulation:

[9]:

from graphein.protein.edges.distance import add_delaunay_triangulation

new_edge_funcs = {"edge_construction_functions": [add_delaunay_triangulation]}
config = ProteinGraphConfig(**new_edge_funcs)

g = construct_graph(config=config, pdb_code="3eiy")
p = plotly_protein_structure_graph(
    G=g,
    colour_edges_by="kind",
    colour_nodes_by="seq_position",
    label_node_ids=False,
    node_size_multiplier=1,
    plot_title="Protein graph created by the Delaunay triangulation of the structure. \n Nodes coloured by sequence position."
    )
p.show()

DEBUG:graphein.protein.graphs:Deprotonating protein. This removes H atoms from the pdb_df dataframe
DEBUG:graphein.protein.graphs:Detected 174 total nodes
DEBUG:graphein.protein.edges.distance:Detected 954 simplices in the Delaunay Triangulation.

Next, let’s consider the last two edge construction functions `add_k_nn_edges <https://graphein.ai/modules/graphein.protein.html#graphein.protein.edges.distance.add_k_nn_edges>`__ and `add_distance_threshold <https://graphein.ai/modules/graphein.protein.html#graphein.protein.edges.distance.add_distance_threshold>`__. These two functions take additional parameters. We can manage this with partial functions. These functions also feature a long_interaction_threshold parameter which specifies the minimum number of residues in the sequence between two residues in order to add an edge. This is because we may not be so interested in residues close together in the sequence being close together in the graph.

[10]:

from functools import partial
from graphein.protein.edges.distance import add_distance_threshold

new_edge_funcs = {"edge_construction_functions": [partial(add_distance_threshold, long_interaction_threshold=5, threshold=10.)]}
config = ProteinGraphConfig(**new_edge_funcs)

g = construct_graph(config=config, pdb_code="3eiy")
p = plotly_protein_structure_graph(
    g,
    colour_edges_by="kind",
    colour_nodes_by="seq_position",
    label_node_ids=False,
    plot_title="Protein graph created by thresholding distance between nodes. \n Nodes must be <10A apart and at least 5 positions apart \n Nodes coloured by sequence position.",
    node_size_multiplier=1
    )
p.show()

DEBUG:graphein.protein.graphs:Deprotonating protein. This removes H atoms from the pdb_df dataframe
DEBUG:graphein.protein.graphs:Detected 174 total nodes
INFO:graphein.protein.edges.distance:Found: 3218 distance edges
INFO:graphein.protein.edges.distance:Added 1904 distance edges. (1314 removed by LIN)

[11]:

from functools import partial
from graphein.protein.edges.distance import add_k_nn_edges

new_edge_funcs = {"edge_construction_functions": [partial(add_k_nn_edges, k=3, long_interaction_threshold=0)]}
config = ProteinGraphConfig(**new_edge_funcs)

g = construct_graph(config=config, pdb_code="3eiy")
p = plotly_protein_structure_graph(
    g,
    colour_edges_by="kind",
    colour_nodes_by="seq_position",
    label_node_ids=False,
    plot_title="Protein graph created from K-NN of each node. Nodes coloured by sequence position",
    node_size_multiplier=1
    )
p.show()

DEBUG:graphein.protein.graphs:Deprotonating protein. This removes H atoms from the pdb_df dataframe
DEBUG:graphein.protein.graphs:Detected 174 total nodes
INFO:graphein.protein.edges.distance:Found: 522 KNN edges

Defining your own edge functions#

Now, let’s turn our attention to defining our own edge construction functions. These must take in a nx.Graph and return a nx.Graph. You can use these custom functions together with all the in-built edge functions as before. If you come up with something cool, consider making a pull-request and sharing it with the community!

Here, as an example, we define an arbitrary function that creates an edge between all histidine residues in the graph.

[12]:

import networkx as nx

def add_histidine_histidine_edges(G: nx.Graph) -> nx.Graph:
    # Iterate over nodes to identify histidines
    histidines = [n for n, d in G.nodes(data=True) if d["residue_name"] == "HIS"]

    # Iterate over histidines and create a bond between them
    [G.add_edge(x, y, kind={"histidine"}) for i, x in enumerate(histidines) for j, y in enumerate(histidines) if i!=j]
    return G

[13]:

new_edge_funcs = {"edge_construction_functions": [add_histidine_histidine_edges]}
config = ProteinGraphConfig(**new_edge_funcs)

g = construct_graph(config=config, pdb_code="3eiy")
p = plotly_protein_structure_graph(
    g,
    colour_edges_by="kind",
    colour_nodes_by="seq_position",
    label_node_ids=False,
    plot_title="Protein graph created using a user-defined function that connects all Histidines."
    )
p.show()
# Let's check this worked:
for u, v, a in g.edges(data=True):
    print(u, v, a)

DEBUG:graphein.protein.graphs:Deprotonating protein. This removes H atoms from the pdb_df dataframe
DEBUG:graphein.protein.graphs:Detected 174 total nodes

A:HIS:111 A:HIS:137 {'kind': {'histidine'}}
A:HIS:111 A:HIS:163 {'kind': {'histidine'}}
A:HIS:137 A:HIS:163 {'kind': {'histidine'}}

Features & Metadata#

Graphein is designed to facilitate geometric deep learning on protein structures and so we have a bunch of feature and metadata annotation functions. These operate on three levels: * node-level annotation * graph-level annotation * edge-level annotation

These behave very similarly to the edge construction functions we just looked at. You simply pass a list of functions that take the right sort of arguments and return the right types.

graph-level annotation takes in a nx.Graph and returns an nx.Graph (just like before!).
node-level annotation takes in a node, data tuple from G.nodes(data=True) and returns a pd.Series
edge-level annotation in node_u, node_v, data tuple from G.edges(data=True) and returns

Built-in Graph Annotation Functions#

These are found in graphein.protein.features.sequence and graphein.protein.features.graph. We make a slight distinction between those functions that operate on the protein sequence for cleaner organisation. N.B. we’ve mentioned the chain_selection parameter briefly - if you include multiple chains in your graph, then will be multiple sequences associated with it. These are accessed as such: G.graph[f"sequence_{CHAIN_ID}"] and features computed from sequence are accessed as: G.graph[f"{FEATURE_NAME}_{CHAIN_ID}]

[14]:

from graphein.protein.features.sequence.sequence import molecular_weight

new_graph_annotation_funcs = {"graph_metadata_functions": [molecular_weight]}
config = ProteinGraphConfig(**new_graph_annotation_funcs)

g = construct_graph(config=config, pdb_code="3eiy")
print("Sequence:", g.graph["sequence_A"])
print("MW:", g.graph["molecular_weight_A"])

DEBUG:graphein.protein.graphs:Deprotonating protein. This removes H atoms from the pdb_df dataframe
DEBUG:graphein.protein.graphs:Detected 174 total nodes

174
Sequence: SFSNVPAGKDLPQDFNVIIEIPAQSEPVKYEADKALGLLVVDRFIGTGMRYPVNYGFIPQTLSGDGDPVDVLVITPFPLLAGSVVRARALGMLKMTDESGVDAKLVAVPHDKVCPMTANLKSIDDVPAYLKDQIKHFFEQYKALEKGKWVKVEGWDGIDAAHKEITDGVANFKK
MW: 19029.71029999999

We also provide some utilities for adding graph-level features in the form of sequence embeddings from various pre-trained language models. Let’s check them out!

[15]:

# Warning! This cell may crash a binder notebook as the pre-trained model download is rather large!
# NBVAL_SKIP
from graphein.protein.features.sequence.embeddings import esm_sequence_embedding, biovec_sequence_embedding

new_graph_annotation_funcs = {"graph_metadata_functions": [esm_sequence_embedding, biovec_sequence_embedding]}
config = ProteinGraphConfig(**new_graph_annotation_funcs)

g = construct_graph(config=config, pdb_code="3eiy")
print("ESM:", g.graph["esm_embedding_A"])
print("biovec:", g.graph["biovec_embedding_A"])

DEBUG:graphein.protein.graphs:Deprotonating protein. This removes H atoms from the pdb_df dataframe
DEBUG:graphein.protein.graphs:Detected 174 total nodes
Using cache found in /Users/arianjamasb/.cache/torch/hub/facebookresearch_esm_master
/Users/arianjamasb/opt/anaconda3/envs/graphein-wip/lib/python3.8/site-packages/torchvision/io/image.py:11: UserWarning:

Failed to load image Python extension: dlopen(/Users/arianjamasb/opt/anaconda3/envs/graphein-wip/lib/python3.8/site-packages/torchvision/image.so, 0x0006): Symbol not found: __ZN2at4_ops19empty_memory_format4callEN3c108ArrayRefIxEENS2_8optionalINS2_10ScalarTypeEEENS5_INS2_6LayoutEEENS5_INS2_6DeviceEEENS5_IbEENS5_INS2_12MemoryFormatEEE
  Referenced from: /Users/arianjamasb/opt/anaconda3/envs/graphein-wip/lib/python3.8/site-packages/torchvision/image.so
  Expected in: /Users/arianjamasb/opt/anaconda3/envs/graphein-wip/lib/python3.8/site-packages/torch/lib/libtorch_cpu.dylib

INFO:gensim.utils:loading Word2Vec object from /Users/arianjamasb/github/graphein/graphein/protein/features/pretrained_models/swissprot-reviewed-protvec.model
DEBUG:smart_open.smart_open_lib:{'uri': '/Users/arianjamasb/github/graphein/graphein/protein/features/pretrained_models/swissprot-reviewed-protvec.model', 'mode': 'rb', 'buffering': -1, 'encoding': None, 'errors': None, 'newline': None, 'closefd': True, 'opener': None, 'ignore_ext': False, 'compression': None, 'transport_params': None}
INFO:gensim.utils:loading vocabulary recursively from /Users/arianjamasb/github/graphein/graphein/protein/features/pretrained_models/swissprot-reviewed-protvec.model.vocabulary.* with mmap=None
INFO:gensim.utils:loading wv recursively from /Users/arianjamasb/github/graphein/graphein/protein/features/pretrained_models/swissprot-reviewed-protvec.model.wv.* with mmap=None
INFO:gensim.utils:setting ignored attribute vectors_norm to None
INFO:gensim.utils:loading trainables recursively from /Users/arianjamasb/github/graphein/graphein/protein/features/pretrained_models/swissprot-reviewed-protvec.model.trainables.* with mmap=None
INFO:gensim.utils:setting ignored attribute cum_table to None
INFO:gensim.utils:loaded /Users/arianjamasb/github/graphein/graphein/protein/features/pretrained_models/swissprot-reviewed-protvec.model

ESM: [-0.02279943  0.16464774 -0.00461995 ...  0.02556622 -0.19798394
  0.09682215]
biovec: [array([ -8.865931  ,   5.65769   ,  -1.5052402 ,   3.2893941 ,
        -6.6228924 ,   2.4890645 ,   1.0635861 ,   2.4991994 ,
        -2.2740204 ,  -7.6614437 ,   0.3739053 ,   4.285517  ,
        -1.7950253 ,  -2.0930681 ,   0.19075748,   9.249631  ,
        -1.2537682 ,   6.0140305 ,  -3.0592997 ,   3.1944623 ,
         5.6747837 ,   1.4732779 ,  -9.556535  ,   4.5126605 ,
       -15.789192  ,  -9.917104  ,  10.61495   ,   1.6693095 ,
         1.8446481 , -12.656563  ,   2.5226657 ,   3.2782233 ,
         5.6363025 ,  -1.5776551 ,   9.913681  ,  -1.4776824 ,
         7.4292827 ,   2.6953607 ,  -0.81495905,   1.0415509 ,
       -18.352554  ,  -9.603085  , -11.215957  ,   3.9449768 ,
         2.2988393 ,   2.8612108 ,  -5.647224  ,  -9.894558  ,
         2.8099368 ,  -3.47802   ,  -1.7364154 ,   4.10581   ,
        -1.1337806 ,  -7.851351  ,   3.667841  ,   2.5098174 ,
         0.3878222 ,  -0.282476  ,   4.242093  ,   2.7918775 ,
        -0.3927042 ,  -3.504907  ,   2.2170143 ,  -6.7696056 ,
        -8.574576  ,   7.549987  ,   5.5607357 ,  -7.514735  ,
         5.778782  ,   2.1968088 ,   5.747694  ,  -4.4072075 ,
         1.6411833 ,   2.339804  ,   4.364424  ,  -4.615387  ,
         0.61950064,   7.19126   ,  -0.10569584,  -6.485407  ,
        -3.3555477 ,   0.58533096,   4.150122  ,   4.6470437 ,
        -7.6111784 ,   5.8581285 ,  -9.8836    ,   8.114656  ,
         4.3746543 ,   1.4033681 ,  -4.632345  ,  13.016971  ,
        -2.839704  ,  -6.210647  ,   3.4460034 ,   2.6573007 ,
       -14.987186  ,  -6.545936  ,  -2.3990562 ,   1.2732263 ],
      dtype=float32), array([ -9.079695  ,   3.7650356 ,   0.3614651 ,   2.1325634 ,
        -5.285296  ,   3.0021908 ,   1.2300422 ,   1.8999658 ,
        -2.0472856 ,  -5.5658956 ,   0.5014471 ,   3.8743246 ,
        -1.0728757 ,  -3.0830898 ,   0.73323405,   9.658365  ,
        -0.10142106,   6.2466826 ,  -2.4961333 ,   2.8022783 ,
         6.5060906 ,   1.6311501 , -10.04929   ,   5.4711785 ,
       -16.297068  , -10.765219  ,  10.580414  ,   0.503202  ,
         2.4227042 , -12.57771   ,   2.7075891 ,   3.9921677 ,
         5.5343657 ,  -0.9493397 ,  10.082592  ,   0.301396  ,
         6.978562  ,   2.4752004 ,  -1.1012759 ,  -1.1357698 ,
       -16.81547   ,  -9.117541  , -10.919918  ,   3.8029845 ,
         2.5540738 ,   3.0349758 ,  -5.4471354 ,  -9.438955  ,
         3.067747  ,  -3.1571815 ,  -3.1563768 ,   4.7784586 ,
        -2.3092556 ,  -7.704368  ,   0.82422435,   1.8206171 ,
         2.161718  ,   0.37946355,   4.7007704 ,   2.2930143 ,
         0.75466317,  -3.3875456 ,   2.3561945 ,  -6.173627  ,
        -7.655903  ,   6.066585  ,   5.7083497 ,  -8.785268  ,
         4.0877957 ,   2.1213973 ,   5.8782153 ,  -4.1335964 ,
         0.26376987,   3.185395  ,   3.9498727 ,  -4.5998893 ,
        -0.11535516,   8.4039135 ,  -0.8255909 ,  -5.910516  ,
        -2.2402806 ,  -0.15940607,   3.476339  ,   6.034683  ,
        -6.72777   ,   6.143824  ,  -9.662175  ,   8.747985  ,
         3.985702  ,   2.272303  ,  -3.933255  ,  12.071237  ,
        -1.8998326 ,  -6.6413393 ,   4.2529097 ,   3.4216733 ,
       -15.628299  ,  -5.513835  ,  -3.004577  ,   1.3258756 ],
      dtype=float32), array([ -6.9852    ,   4.425231  ,  -0.5174276 ,   2.0380528 ,
        -7.1311088 ,   2.4794824 ,   1.9649935 ,   1.9006643 ,
        -1.3209196 ,  -5.8989286 ,   1.2391084 ,   3.485902  ,
        -1.9545169 ,  -2.2625084 ,   1.8165749 ,   9.708086  ,
        -0.74385244,   6.5338845 ,  -3.0916584 ,   3.9312694 ,
         6.1540847 ,   3.4793413 , -10.056544  ,   5.7322254 ,
       -13.599701  , -10.923328  ,   9.378405  ,   0.555632  ,
         3.512236  , -11.891929  ,   2.382594  ,   3.2599561 ,
         5.123339  ,  -0.55314994,  10.585077  ,  -0.9078483 ,
         8.167705  ,   2.4789596 ,   0.78942966,  -1.0045129 ,
       -17.38679   ,  -9.496934  , -10.542821  ,   3.5852609 ,
         3.3952198 ,   2.9937162 ,  -5.9167666 ,  -9.517663  ,
         2.2889428 ,  -3.6738756 ,  -2.7072256 ,   4.6205435 ,
        -1.970848  ,  -6.7903805 ,   2.1213295 ,   2.5594456 ,
         0.4943355 ,   0.27333227,   4.096034  ,   1.2748998 ,
        -1.3203058 ,  -3.6010296 ,   2.623712  ,  -8.468901  ,
        -6.3624053 ,   7.388231  ,   4.908528  ,  -7.1587243 ,
         4.9127274 ,   3.2538335 ,   6.45744   ,  -3.3919108 ,
        -0.22479697,   2.4506824 ,   4.864544  ,  -4.201532  ,
        -0.08151506,   5.778758  ,  -0.5880602 ,  -7.52435   ,
        -2.3001645 ,   0.04607171,   4.0975637 ,   5.7935786 ,
        -8.676276  ,   6.952252  , -11.0132    ,   7.6395264 ,
         4.4887743 ,   2.5889583 ,  -5.0494356 ,  12.681698  ,
        -2.3577175 ,  -5.7180576 ,   3.481403  ,   3.9928114 ,
       -14.854708  ,  -6.135002  ,  -2.9227335 ,   1.2122653 ],
      dtype=float32)]

Writing your own graph-level annotation function.#

We provide a couple of useful functions for dealing with multiple chains robustly, namely compute_feature_over_chains and aggregate_feature_over_chains. This takes an input (nx.Graph), your function: (Callable), and feature_name (str). Your function must operate on a sequence (str) to use this utility.

aggregate_feature_over_chains can be used to perform aggregation {"min", "max", "sum" , "mean"} of chain-specific features. It takes: an nx.Graph, a feature name (str) and and aggregation type (str)

If you don’t want to operate on the sequnce, you don’t need to use these!

Let’s use this to construct a graph-level feature that operates on the sequence. This feature is an integer equal to the number of histidines in the chain.

[16]:

from graphein.protein.features.sequence.utils import compute_feature_over_chains, aggregate_feature_over_chains

# Define our graph-level function that operates on sequences
def histidine_count_feature(G: nx.Graph) -> nx.Graph:

    # Define our feature function that operates on a sequence
    def count_histidines(sequence: str) -> int:
         return sequence.count("H")

    # Compute the feature over the chains in the graph
    G = compute_feature_over_chains(G, func=count_histidines, feature_name="histidine_count")

    # Aggregate the feature over the chains
    G = aggregate_feature_over_chains(G, feature_name="histidine_count", aggregation_type="mean")
    return G


new_graph_annotation_funcs = {"graph_metadata_functions": [histidine_count_feature]}
config = ProteinGraphConfig(**new_graph_annotation_funcs)

# Test our new feature
g = construct_graph(config=config, pdb_code="3eiy")
print(g.graph["histidine_count_A"])

# And now on a multi-chain graph
g = construct_graph(config=config, pdb_code="9API", chain_selection="all")
print("Chain A Hists:", g.graph["histidine_count_A"])
print("Chain B Hists:", g.graph["histidine_count_B"])
print("Chain B Seq:", g.graph["sequence_B"])
print("Aggregated HIS count:", g.graph["histidine_count_mean"])

DEBUG:graphein.protein.graphs:Deprotonating protein. This removes H atoms from the pdb_df dataframe
DEBUG:graphein.protein.graphs:Detected 174 total nodes

174
3

DEBUG:graphein.protein.graphs:Deprotonating protein. This removes H atoms from the pdb_df dataframe
DEBUG:graphein.protein.graphs:Detected 375 total nodes

375
Chain A Hists: 11
Chain B Hists: 0
Chain B Seq: SIPPEVKFNKPFVFLMIEQNTKSPLFMGKVVNPTQK
Aggregated HIS count: 5.5

Node-level features#

Similarly, we provide some in-built functions for computing node-level features. For example, we’ll look at some featurisations of amino acid types.

First, we’ll look at a simple one-hot encoding of the amino acid type

[17]:

from graphein.protein.features.nodes.amino_acid import amino_acid_one_hot

config = ProteinGraphConfig(**{"node_metadata_functions": [amino_acid_one_hot]})
g = construct_graph(config=config, pdb_code="3eiy")

for n, d in g.nodes(data=True):
    print(d["amino_acid_one_hot"])

DEBUG:graphein.protein.graphs:Deprotonating protein. This removes H atoms from the pdb_df dataframe
DEBUG:graphein.protein.graphs:Detected 174 total nodes

174
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
[0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]
[1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
[0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0]
[0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
[0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]
[1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
[0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
[0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1]
[0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
[1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
[0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0]
[0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0]
[0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1]
[0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
[0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1]
[0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0]
[0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
[0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
[0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
[0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
[0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]
[0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0]
[1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0]
[1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0]
[1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0]
[0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
[0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
[0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
[1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]
[0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
[0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0]
[1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
[0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]
[1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1]
[0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
[0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0]
[0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1]
[0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
[1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0]
[0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
[0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
[0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0]
[0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0]
[0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
[1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0]
[0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]

[18]:

from graphein.protein.features.nodes.amino_acid import expasy_protein_scale

config = ProteinGraphConfig(**{"node_metadata_functions": [expasy_protein_scale]})
g = construct_graph(config=config, pdb_code="3eiy")

for n, d in g.nodes(data=True):
    print(d['expasy'])
    break

DEBUG:graphein.protein.graphs:Deprotonating protein. This removes H atoms from the pdb_df dataframe
DEBUG:graphein.protein.graphs:Detected 174 total nodes
DEBUG:graphein.protein.features.nodes.amino_acid:Reading Expasy protein scales from: /Users/arianjamasb/github/graphein/graphein/protein/features/nodes/amino_acid_properties.csv

174
pka_cooh_alpha               2.21
pka_nh3                      9.15
pka_rgroup                   7.00
isoelectric_points           5.68
molecularweight            105.00
                            ...
antiparallelbeta_strand      0.87
parallelbeta_strand          0.70
a_a_composition              6.90
a_a_swiss_prot               6.56
relativemutability         120.00
Name: SER, Length: 61, dtype: float64

Geometric features#

API Reference

We can also add geometric features to protein structure graphs.

[19]:

from graphein.protein.features.nodes.geometry import add_sidechain_vector, add_beta_carbon_vector, add_sequence_neighbour_vector
from graphein.protein.visualisation import add_vector_to_plot

add_sidechain_vector(g)
add_beta_carbon_vector(g)
add_sequence_neighbour_vector(g)

fig = plotly_protein_structure_graph(g, node_size_multiplier=1)
fig = add_vector_to_plot(g, fig, "sidechain_vector", colour="red", scale=1.5)
fig = add_vector_to_plot(g, fig, "c_beta_vector", colour="blue", scale=1.5)
fig = add_vector_to_plot(g, fig, "sequence_neighbour_vector_n_to_c", colour="green", scale=1.5)
fig

Low-Level API#

Graphein also features a low-level API if you want to get down into the gritty details of the construction. Essentially, every step of the construction is a function that you can tweak/alter the sequencing of as you please. This is very flexible if not unnecessary for most applications

Pre-Processing#

All graphs begin as pd.DataFrame representations of the .PDB file. These are handled by PandasPDB. There is a convenience function here process_dataframe, to which you can pass a list of dataframe processing functions. These are similar to the edge construction and annotation functions described above except they take in a pd.DataFrame and return a pd.DataFrame. We’ll start by reimplementing the construction we’ve been using above.

[20]:

from graphein.protein.graphs import read_pdb_to_dataframe, process_dataframe, deprotonate_structure, convert_structure_to_centroids, subset_structure_to_atom_type, filter_hetatms, remove_insertions

processing_funcs = [deprotonate_structure, convert_structure_to_centroids, remove_insertions]


pdb_code = "3eiy"
# Read dataframe from PDB
raw_df = read_pdb_to_dataframe(pdb_code=pdb_code)

# Apply processing functions
df = process_dataframe(raw_df, atom_df_processing_funcs = processing_funcs)
df.head()

DEBUG:graphein.protein.graphs:Deprotonating protein. This removes H atoms from the pdb_df dataframe
DEBUG:graphein.protein.graphs:Converting dataframe to centroids. This averages XYZ coords of the atoms in a residue
DEBUG:graphein.protein.graphs:Calculated 174 centroid nodes

[20]:

	record_name	atom_number	atom_name	residue_name	chain_id	residue_number	...	y_coord	z_coord	occupancy	b_factor	element_symbol	charge	line_idx	node_id
0	ATOM	2	CA	SER	A	2	...	54.615500	-0.136833	1.0	52.54	C	NaN	610	A:SER:2
1	ATOM	8	CA	PHE	A	3	...	51.417273	2.656727	1.0	48.73	C	NaN	616	A:PHE:3
2	ATOM	19	CA	SER	A	4	...	50.090833	-2.889667	1.0	47.00	C	NaN	627	A:SER:4
3	ATOM	25	CA	ASN	A	5	...	52.759000	-4.146375	1.0	41.42	C	NaN	633	A:ASN:5
4	ATOM	33	CA	VAL	A	6	...	51.575429	0.109000	1.0	32.41	C	NaN	641	A:VAL:6

5 rows × 22 columns

We can define our own processing functions. Here we define a function that removes odd numbered residues from the dataframe. We’ll use this again later.

[21]:

import pandas as pd

def remove_odd_resis(df: pd.DataFrame) -> pd.DataFrame:
    return df.loc[df.index % 2 == 0]

remove_odd_resis(df).head()

[21]:

	record_name	atom_number	atom_name	residue_name	chain_id	residue_number	...	y_coord	z_coord	occupancy	b_factor	element_symbol	charge	line_idx	node_id
0	ATOM	2	CA	SER	A	2	...	54.615500	-0.136833	1.0	52.54	C	NaN	610	A:SER:2
2	ATOM	19	CA	SER	A	4	...	50.090833	-2.889667	1.0	47.00	C	NaN	627	A:SER:4
4	ATOM	33	CA	VAL	A	6	...	51.575429	0.109000	1.0	32.41	C	NaN	641	A:VAL:6
6	ATOM	47	CA	ALA	A	8	...	46.781200	0.618800	1.0	25.67	C	NaN	655	A:ALA:8
8	ATOM	56	CA	LYS	A	10	...	51.873333	1.231222	1.0	27.63	C	NaN	664	A:LYS:10

5 rows × 22 columns

Building the Graph#

First we initialise the graph with the metadata. We then use this to add the nodes

[22]:

from graphein.protein.graphs import initialise_graph_with_metadata, add_nodes_to_graph

g = initialise_graph_with_metadata(protein_df=df, # from above cell
                                   raw_pdb_df=raw_df.df["ATOM"], # Store this for traceability
                                   pdb_id = pdb_code, #and again
                                   granularity = "centroid" # Store this so we know what kind of graph we have
                                  )

g = add_nodes_to_graph(g)
print(g.nodes)

['A:SER:2', 'A:PHE:3', 'A:SER:4', 'A:ASN:5', 'A:VAL:6', 'A:PRO:7', 'A:ALA:8', 'A:GLY:9', 'A:LYS:10', 'A:ASP:11', 'A:LEU:12', 'A:PRO:13', 'A:GLN:14', 'A:ASP:15', 'A:PHE:16', 'A:ASN:17', 'A:VAL:18', 'A:ILE:19', 'A:ILE:20', 'A:GLU:21', 'A:ILE:22', 'A:PRO:23', 'A:ALA:24', 'A:GLN:25', 'A:SER:26', 'A:GLU:27', 'A:PRO:28', 'A:VAL:29', 'A:LYS:30', 'A:TYR:31', 'A:GLU:32', 'A:ALA:33', 'A:ASP:34', 'A:LYS:35', 'A:ALA:36', 'A:LEU:37', 'A:GLY:38', 'A:LEU:39', 'A:LEU:40', 'A:VAL:41', 'A:VAL:42', 'A:ASP:43', 'A:ARG:44', 'A:PHE:45', 'A:ILE:46', 'A:GLY:47', 'A:THR:48', 'A:GLY:49', 'A:MET:50', 'A:ARG:51', 'A:TYR:52', 'A:PRO:53', 'A:VAL:54', 'A:ASN:55', 'A:TYR:56', 'A:GLY:57', 'A:PHE:58', 'A:ILE:59', 'A:PRO:60', 'A:GLN:61', 'A:THR:62', 'A:LEU:63', 'A:SER:64', 'A:GLY:65', 'A:ASP:66', 'A:GLY:67', 'A:ASP:68', 'A:PRO:69', 'A:VAL:70', 'A:ASP:71', 'A:VAL:72', 'A:LEU:73', 'A:VAL:74', 'A:ILE:75', 'A:THR:76', 'A:PRO:77', 'A:PHE:78', 'A:PRO:79', 'A:LEU:80', 'A:LEU:81', 'A:ALA:82', 'A:GLY:83', 'A:SER:84', 'A:VAL:85', 'A:VAL:86', 'A:ARG:87', 'A:ALA:88', 'A:ARG:89', 'A:ALA:90', 'A:LEU:91', 'A:GLY:92', 'A:MET:93', 'A:LEU:94', 'A:LYS:95', 'A:MET:96', 'A:THR:97', 'A:ASP:98', 'A:GLU:99', 'A:SER:100', 'A:GLY:101', 'A:VAL:102', 'A:ASP:103', 'A:ALA:104', 'A:LYS:105', 'A:LEU:106', 'A:VAL:107', 'A:ALA:108', 'A:VAL:109', 'A:PRO:110', 'A:HIS:111', 'A:ASP:112', 'A:LYS:113', 'A:VAL:114', 'A:CYS:115', 'A:PRO:116', 'A:MET:117', 'A:THR:118', 'A:ALA:119', 'A:ASN:120', 'A:LEU:121', 'A:LYS:122', 'A:SER:123', 'A:ILE:124', 'A:ASP:125', 'A:ASP:126', 'A:VAL:127', 'A:PRO:128', 'A:ALA:129', 'A:TYR:130', 'A:LEU:131', 'A:LYS:132', 'A:ASP:133', 'A:GLN:134', 'A:ILE:135', 'A:LYS:136', 'A:HIS:137', 'A:PHE:138', 'A:PHE:139', 'A:GLU:140', 'A:GLN:141', 'A:TYR:142', 'A:LYS:143', 'A:ALA:144', 'A:LEU:145', 'A:GLU:146', 'A:LYS:147', 'A:GLY:148', 'A:LYS:149', 'A:TRP:150', 'A:VAL:151', 'A:LYS:152', 'A:VAL:153', 'A:GLU:154', 'A:GLY:155', 'A:TRP:156', 'A:ASP:157', 'A:GLY:158', 'A:ILE:159', 'A:ASP:160', 'A:ALA:161', 'A:ALA:162', 'A:HIS:163', 'A:LYS:164', 'A:GLU:165', 'A:ILE:166', 'A:THR:167', 'A:ASP:168', 'A:GLY:169', 'A:VAL:170', 'A:ALA:171', 'A:ASN:172', 'A:PHE:173', 'A:LYS:174', 'A:LYS:175']

Next, we add edges to the graph:

[23]:

from graphein.protein.graphs import compute_edges
g = compute_edges(g, get_contacts_config=None, funcs=[add_peptide_bonds, add_hydrogen_bond_interactions])
print(g.edges)

INFO:graphein.protein.edges.distance:Found 75 hbond interactions.
INFO:graphein.protein.edges.distance:Found 7 hbond interactions.

174
[('A:SER:2', 'A:PHE:3'), ('A:PHE:3', 'A:SER:4'), ('A:SER:4', 'A:ASN:5'), ('A:ASN:5', 'A:VAL:6'), ('A:VAL:6', 'A:PRO:7'), ('A:PRO:7', 'A:ALA:8'), ('A:ALA:8', 'A:GLY:9'), ('A:GLY:9', 'A:LYS:10'), ('A:LYS:10', 'A:ASP:11'), ('A:ASP:11', 'A:LEU:12'), ('A:LEU:12', 'A:PRO:13'), ('A:PRO:13', 'A:GLN:14'), ('A:GLN:14', 'A:ASP:15'), ('A:ASP:15', 'A:PHE:16'), ('A:ASP:15', 'A:ARG:87'), ('A:PHE:16', 'A:ASN:17'), ('A:ASN:17', 'A:VAL:18'), ('A:VAL:18', 'A:ILE:19'), ('A:ILE:19', 'A:ILE:20'), ('A:ILE:20', 'A:GLU:21'), ('A:GLU:21', 'A:ILE:22'), ('A:ILE:22', 'A:PRO:23'), ('A:PRO:23', 'A:ALA:24'), ('A:ALA:24', 'A:GLN:25'), ('A:GLN:25', 'A:SER:26'), ('A:SER:26', 'A:GLU:27'), ('A:GLU:27', 'A:PRO:28'), ('A:PRO:28', 'A:VAL:29'), ('A:VAL:29', 'A:LYS:30'), ('A:LYS:30', 'A:TYR:31'), ('A:TYR:31', 'A:GLU:32'), ('A:GLU:32', 'A:ALA:33'), ('A:ALA:33', 'A:ASP:34'), ('A:ASP:34', 'A:LYS:35'), ('A:LYS:35', 'A:ALA:36'), ('A:ALA:36', 'A:LEU:37'), ('A:LEU:37', 'A:GLY:38'), ('A:GLY:38', 'A:LEU:39'), ('A:LEU:39', 'A:LEU:40'), ('A:LEU:40', 'A:VAL:41'), ('A:VAL:41', 'A:VAL:42'), ('A:VAL:42', 'A:ASP:43'), ('A:ASP:43', 'A:ARG:44'), ('A:ARG:44', 'A:PHE:45'), ('A:ARG:44', 'A:LYS:149'), ('A:PHE:45', 'A:ILE:46'), ('A:ILE:46', 'A:GLY:47'), ('A:GLY:47', 'A:THR:48'), ('A:THR:48', 'A:GLY:49'), ('A:GLY:49', 'A:MET:50'), ('A:MET:50', 'A:ARG:51'), ('A:ARG:51', 'A:TYR:52'), ('A:TYR:52', 'A:PRO:53'), ('A:PRO:53', 'A:VAL:54'), ('A:VAL:54', 'A:ASN:55'), ('A:ASN:55', 'A:TYR:56'), ('A:TYR:56', 'A:GLY:57'), ('A:TYR:56', 'A:LYS:105'), ('A:TYR:56', 'A:ASP:71'), ('A:GLY:57', 'A:PHE:58'), ('A:PHE:58', 'A:ILE:59'), ('A:ILE:59', 'A:PRO:60'), ('A:PRO:60', 'A:GLN:61'), ('A:GLN:61', 'A:THR:62'), ('A:THR:62', 'A:LEU:63'), ('A:LEU:63', 'A:SER:64'), ('A:SER:64', 'A:GLY:65'), ('A:GLY:65', 'A:ASP:66'), ('A:ASP:66', 'A:GLY:67'), ('A:GLY:67', 'A:ASP:68'), ('A:ASP:68', 'A:PRO:69'), ('A:PRO:69', 'A:VAL:70'), ('A:VAL:70', 'A:ASP:71'), ('A:ASP:71', 'A:VAL:72'), ('A:ASP:71', 'A:LYS:105'), ('A:VAL:72', 'A:LEU:73'), ('A:LEU:73', 'A:VAL:74'), ('A:VAL:74', 'A:ILE:75'), ('A:ILE:75', 'A:THR:76'), ('A:THR:76', 'A:PRO:77'), ('A:PRO:77', 'A:PHE:78'), ('A:PHE:78', 'A:PRO:79'), ('A:PRO:79', 'A:LEU:80'), ('A:LEU:80', 'A:LEU:81'), ('A:LEU:81', 'A:ALA:82'), ('A:ALA:82', 'A:GLY:83'), ('A:GLY:83', 'A:SER:84'), ('A:SER:84', 'A:VAL:85'), ('A:VAL:85', 'A:VAL:86'), ('A:VAL:86', 'A:ARG:87'), ('A:ARG:87', 'A:ALA:88'), ('A:ALA:88', 'A:ARG:89'), ('A:ARG:89', 'A:ALA:90'), ('A:ARG:89', 'A:ASP:112'), ('A:ALA:90', 'A:LEU:91'), ('A:LEU:91', 'A:GLY:92'), ('A:GLY:92', 'A:MET:93'), ('A:MET:93', 'A:LEU:94'), ('A:LEU:94', 'A:LYS:95'), ('A:LYS:95', 'A:MET:96'), ('A:MET:96', 'A:THR:97'), ('A:THR:97', 'A:ASP:98'), ('A:ASP:98', 'A:GLU:99'), ('A:ASP:98', 'A:LYS:143'), ('A:GLU:99', 'A:SER:100'), ('A:SER:100', 'A:GLY:101'), ('A:GLY:101', 'A:VAL:102'), ('A:VAL:102', 'A:ASP:103'), ('A:ASP:103', 'A:ALA:104'), ('A:ASP:103', 'A:LYS:105'), ('A:ALA:104', 'A:LYS:105'), ('A:LYS:105', 'A:LEU:106'), ('A:LEU:106', 'A:VAL:107'), ('A:VAL:107', 'A:ALA:108'), ('A:ALA:108', 'A:VAL:109'), ('A:VAL:109', 'A:PRO:110'), ('A:PRO:110', 'A:HIS:111'), ('A:HIS:111', 'A:ASP:112'), ('A:ASP:112', 'A:LYS:113'), ('A:LYS:113', 'A:VAL:114'), ('A:VAL:114', 'A:CYS:115'), ('A:CYS:115', 'A:PRO:116'), ('A:CYS:115', 'A:MET:117'), ('A:PRO:116', 'A:MET:117'), ('A:MET:117', 'A:THR:118'), ('A:THR:118', 'A:ALA:119'), ('A:ALA:119', 'A:ASN:120'), ('A:ASN:120', 'A:LEU:121'), ('A:LEU:121', 'A:LYS:122'), ('A:LYS:122', 'A:SER:123'), ('A:SER:123', 'A:ILE:124'), ('A:ILE:124', 'A:ASP:125'), ('A:ASP:125', 'A:ASP:126'), ('A:ASP:126', 'A:VAL:127'), ('A:VAL:127', 'A:PRO:128'), ('A:PRO:128', 'A:ALA:129'), ('A:ALA:129', 'A:TYR:130'), ('A:TYR:130', 'A:LEU:131'), ('A:LEU:131', 'A:LYS:132'), ('A:LYS:132', 'A:ASP:133'), ('A:ASP:133', 'A:GLN:134'), ('A:GLN:134', 'A:ILE:135'), ('A:ILE:135', 'A:LYS:136'), ('A:LYS:136', 'A:HIS:137'), ('A:HIS:137', 'A:PHE:138'), ('A:PHE:138', 'A:PHE:139'), ('A:PHE:139', 'A:GLU:140'), ('A:GLU:140', 'A:GLN:141'), ('A:GLN:141', 'A:TYR:142'), ('A:TYR:142', 'A:LYS:143'), ('A:LYS:143', 'A:ALA:144'), ('A:ALA:144', 'A:LEU:145'), ('A:LEU:145', 'A:GLU:146'), ('A:GLU:146', 'A:LYS:147'), ('A:LYS:147', 'A:GLY:148'), ('A:GLY:148', 'A:LYS:149'), ('A:LYS:149', 'A:TRP:150'), ('A:TRP:150', 'A:VAL:151'), ('A:VAL:151', 'A:LYS:152'), ('A:LYS:152', 'A:VAL:153'), ('A:VAL:153', 'A:GLU:154'), ('A:GLU:154', 'A:GLY:155'), ('A:GLY:155', 'A:TRP:156'), ('A:TRP:156', 'A:ASP:157'), ('A:ASP:157', 'A:GLY:158'), ('A:GLY:158', 'A:ILE:159'), ('A:ILE:159', 'A:ASP:160'), ('A:ASP:160', 'A:ALA:161'), ('A:ALA:161', 'A:ALA:162'), ('A:ALA:162', 'A:HIS:163'), ('A:HIS:163', 'A:LYS:164'), ('A:LYS:164', 'A:GLU:165'), ('A:LYS:164', 'A:ASP:168'), ('A:GLU:165', 'A:ILE:166'), ('A:ILE:166', 'A:THR:167'), ('A:THR:167', 'A:ASP:168'), ('A:ASP:168', 'A:GLY:169'), ('A:GLY:169', 'A:VAL:170'), ('A:VAL:170', 'A:ALA:171'), ('A:ALA:171', 'A:ASN:172'), ('A:ASN:172', 'A:PHE:173'), ('A:PHE:173', 'A:LYS:174'), ('A:LYS:174', 'A:LYS:175')]

Let’s check it out!

[24]:

p = plotly_protein_structure_graph(
    g,
    colour_edges_by="kind",
    colour_nodes_by="seq_position",
    label_node_ids=False,
    plot_title="Protein graph created using low-level API. Shows peptide bond back bond and H-Bonds",
    node_size_multiplier=1
    )
p.show()

Just as we expected! From here we can apply any number of graph, node or edge metadata annotation functions that we may wish to. The advantage here is that we are not tied to the sequencing in the high-level API of:

(annotate node metadata -> compute edges -> annotate graph metadata -> annotate edge metadata).

This means, if we want, we can define graph annotation functions that utilise edge metadata that would otherwise not be possible. E.g. here we annotate bonds containing histidines with a 1. We then define a graph feature that looks for these bonds, sums the residue numbers involved and sums them.

[25]:

from graphein.protein.graphs import annotate_node_metadata, annotate_graph_metadata, annotate_edge_metadata

# Define edge feature that assigns hist_status = 1 to any bond involving a histidine
def bond_has_histidine_edge_feature(u, v, d):
    if "HIS" in u or "HIS" in v:
        d["hist_status"] = 1
    else:
        d["hist_status"] = 0

# Annotate the edge metadata
g = annotate_edge_metadata(g, [bond_has_histidine_edge_feature])

# Check it out
for u, v, d in g.edges(data=True):
    if d["hist_status"] == 1:
        print(u, v, d)

# Annotate graph-level feature that depends on histidine_status edge feature. This adds the sum of the residue numbers involved in hist_status bonds
def sum_hist_status_bond_residue_nums(G: nx.Graph) -> nx.Graph:
    to_add = []

    for u, v, a in G.edges(data=True):
        if a["hist_status"] == 1:
            to_add.append(G.nodes[u]["residue_number"])
            to_add.append(G.nodes[v]["residue_number"])

    total = sum(to_add)

    G.graph["hist_bond_total_residue_number"] = total
    return G


g = annotate_graph_metadata(g, [sum_hist_status_bond_residue_nums])
print("Graph-level hist feature:", g.graph["hist_bond_total_residue_number"])

A:PRO:110 A:HIS:111 {'kind': {'peptide_bond'}, 'hist_status': 1}
A:HIS:111 A:ASP:112 {'kind': {'peptide_bond'}, 'hist_status': 1}
A:LYS:136 A:HIS:137 {'kind': {'peptide_bond'}, 'hist_status': 1}
A:HIS:137 A:PHE:138 {'kind': {'peptide_bond'}, 'hist_status': 1}
A:ALA:162 A:HIS:163 {'kind': {'peptide_bond'}, 'hist_status': 1}
A:HIS:163 A:LYS:164 {'kind': {'peptide_bond'}, 'hist_status': 1}
Graph-level hist feature: 1644

fin