# Graphein - Residue Graph Tutorial

In this notebook, we’ll run through residue-level graph construction in Graphein. We start by discussing the config and the high-level API, and spend the bulk of the tutorial running through the various options. At the end we’ll see how the low-level API can be used to control each step of the graph construction process.

[1]:

# Install Graphein if necessary
# !pip install graphein

# Install pymol if necessary - in this tutorial PyMol is only used for the initial plot. Feel free to skip!
# sudo apt-get install pymol (recommended for colab) OR conda install -c schrodinger pymol


First, let’s check out the protein we’ll be playing with today:

[2]:

# NBVAL_SKIP
from graphein.utils.pymol import MolViewer
pymol = MolViewer()
pymol.delete("all") # delete all objects from other sessions if necessary.
pymol.fetch("3eiy")
pymol.show_as("cartoon")
pymol.display()


## Config

Graphein is designed for processing datasets of protein structures into graphs. We rely on a global config object, ProteinGraphConfig <https://graphein.ai/modules/graphein.protein.html#graphein.protein.config.ProteinGraphConfig>__ to store various parameters in the high-level API. We use pydantic <https://pydantic-docs.helpmanual.io/>__ for the config object and provide sane defaults.

[3]:

from graphein.protein.config import ProteinGraphConfig

config = ProteinGraphConfig()
config.dict()

[3]:

{'granularity': 'CA',
'keep_hets': False,
'insertions': False,
'pdb_dir': PosixPath('../examples/pdbs'),
'verbose': False,
'exclude_waters': True,
'deprotonate': False,
'protein_df_processing_functions': None,
'edge_construction_functions': [<function graphein.protein.edges.distance.add_peptide_bonds(G: 'nx.Graph') -> 'nx.Graph'>],
'node_metadata_functions': [<function graphein.protein.features.nodes.amino_acid.meiler_embedding(n, d, return_array: bool = False) -> Union[pandas.core.series.Series, <built-in function array>]>],
'get_contacts_config': None,
'dssp_config': None}


Let’s run through the parameters of ProteinGraphConfig <https://graphein.ai/modules/graphein.protein.html#graphein.protein.config.ProteinGraphConfig>__:

• granularity: specifies the granularity of the graph (i.e. what the nodes should be). Possible values are: atom identifiers (e.g. "CA" for $$\alpha$$ carbon, "CB" for $$\beta$$ carbon), "centroids" to use residue centroids (under the hood, this is the same as "CA", but we use the average x,y,z coordinates of the atoms in the residue) or "atom" for atom-level construction. Atom-level construction is discussed in another notebook.

• keep_hets: this is a boolean specifying whether or not to keep heteroatoms present in the .pdb file. Heteroatoms are typically non-protein atoms (waters, metal ions, ligands) but can sometimes contain non-standard or modified residues.

• insertions: boolean specifying whether or not to keep insertions in the PDB file

• pdb_dir: optional path to a folder in which to save PDB files. Otherwise, /tmp/ will be used

• verbose: bool controlling amount of info printed

• exclude_waters: not implemented

• deprotonate: bool indicating whether or not to remove Hydrogen atoms

• protein_df_processing_functions: list of functions with which to process the PDB dataframe. Discussed in the low-level API.

• edge_construction_functions: list of functions to compute edges with

• node_metadata_functions: list of functions to annotate nodes with

• edge_metadata_functions: list of functions to annotate edges with

• graph_metadata_functions: list of functions to annotate the graph with

• get_contacts_config: A separate config object if using GetContacts edge construction functions
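To make the "centroids" granularity concrete, the sketch below averages the x, y, z coordinates of the atoms in a single residue. It is a plain-Python illustration of the idea and not Graphein's implementation; the `centroid` helper and the coordinates are made up for this example.

```python
from statistics import fmean


def centroid(atom_coords):
    """Average the x, y, z coordinates of a residue's atoms (hypothetical helper)."""
    xs, ys, zs = zip(*atom_coords)
    return (fmean(xs), fmean(ys), fmean(zs))


# Made-up coordinates (in Angstroms) for three atoms of one residue
residue_atoms = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 3.0, 3.0)]
residue_centroid = centroid(residue_atoms)
print(residue_centroid)
```

With "CA" granularity the node would instead sit at the coordinates of the alpha carbon alone.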

If you wish, you can construct an entirely fresh configuration. Alternatively, to change only some of these parameters, you can pass a dictionary containing the modifications:

[4]:

params_to_change = {"granularity": "centroids"}

config = ProteinGraphConfig(**params_to_change)
config.dict()

[4]:

{'granularity': 'centroids',
'keep_hets': False,
'insertions': False,
'pdb_dir': PosixPath('../examples/pdbs'),
'verbose': False,
'exclude_waters': True,
'deprotonate': False,
'protein_df_processing_functions': None,
'edge_construction_functions': [<function graphein.protein.edges.distance.add_peptide_bonds(G: 'nx.Graph') -> 'nx.Graph'>],
'node_metadata_functions': [<function graphein.protein.features.nodes.amino_acid.meiler_embedding(n, d, return_array: bool = False) -> Union[pandas.core.series.Series, <built-in function array>]>],
'get_contacts_config': None,
'dssp_config': None}
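The override pattern used above relies only on Python's keyword unpacking: keys in the dictionary replace the matching defaults, and everything else keeps its default value. The sketch below shows the same pattern on a small dataclass. `SketchConfig` is a hypothetical stand-in with three of the fields listed above; it is not the real ProteinGraphConfig and performs none of pydantic's validation.

```python
from dataclasses import asdict, dataclass


@dataclass
class SketchConfig:
    """Hypothetical stand-in for a config object with defaults."""
    granularity: str = "CA"
    keep_hets: bool = False
    verbose: bool = False


params_to_change = {"granularity": "centroids"}

default_config = SketchConfig()
custom_config = SketchConfig(**params_to_change)  # only granularity changes

print(asdict(default_config))
print(asdict(custom_config))
```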


## High-level API

Graphein features a high-level API which should be applicable for most simple graph constructions. This can be used on either a .pdb file (so you can run whatever pre-processing you wish), or we can provide a PDB accession code and retrieve a structure from the PDB itself. If a path is provided, it takes precedence over the PDB code.

To use it we do as follows:

[5]:

from graphein.protein.graphs import construct_graph

g = construct_graph(config=config, pdb_code="3eiy")

DEBUG:graphein.protein.graphs:Deprotonating protein. This removes H atoms from the pdb_df dataframe
DEBUG:graphein.protein.graphs:Converting dataframe to centroids. This averages XYZ coords of the atoms in a residue
DEBUG:graphein.protein.graphs:Calculated 174 centroid nodes
DEBUG:graphein.protein.graphs:Detected 174 total nodes

174


If you wish to use a local .pdb file, you can run:

g = construct_graph(config=config, pdb_path="../graphein/examples/pdbs/3eiy.pdb")


Let’s check out the results with the in-built visualisation:

[6]:

from graphein.protein.visualisation import plotly_protein_structure_graph

p = plotly_protein_structure_graph(
g,
colour_edges_by="kind",
colour_nodes_by="degree",
label_node_ids=False,
plot_title="Peptide backbone graph. Nodes coloured by degree.",
node_size_multiplier=1
)
p.show()


Cool! So let’s look at what we’ve got here:

We’ve plotted the graph, positioning nodes according to their x,y,z coordinates and coloured them by their degree. As we can see all the nodes are yellow except the two corresponding to the N and C terminal residues. Why is this? Because we’ve only computed the peptide-bond edges. Now, we probably want more so let’s look at how to do that!
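The degree pattern described above follows directly from the topology: a graph with only peptide-bond edges is a simple path, so every interior residue has two neighbours and each terminus has one. A minimal sketch on a made-up chain of six residues, using plain Python rather than Graphein or networkx:

```python
from collections import Counter

n_residues = 6  # hypothetical short chain

# Peptide bonds connect consecutive residues: (0, 1), (1, 2), ...
peptide_bonds = [(i, i + 1) for i in range(n_residues - 1)]

# Count how many edges touch each residue
degree = Counter()
for u, v in peptide_bonds:
    degree[u] += 1
    degree[v] += 1

degrees = [degree[i] for i in range(n_residues)]
print(degrees)
```

Only the first and last entries differ from the rest, which is why exactly two nodes stand out in the plot.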

### Edge Functions

Graphein is implemented in a functional fashion. In this case, this means that in order to compute edges, we pass a list of edge construction functions to the construction. We have supplied a number of edge computation functions. These are located in:

• graphein.protein.edges.distance <https://graphein.ai/modules/graphein.protein.html#module-graphein.protein.edges.distance>__

• graphein.protein.edges.intramolecular <https://graphein.ai/modules/graphein.protein.html#module-graphein.protein.edges.intramolecular>__ (these rely on an installation of GetContacts, an optional dependency, and a separate GetContactsConfig)

• graphein.protein.edges.atomic <https://graphein.ai/modules/graphein.protein.html#module-graphein.protein.edges.atomic>__ (these are used in atomic-level graphs and we discuss these in another notebook tutorial)

However, edge functions are simple functions that take in an nx.Graph and return an nx.Graph with added edges. This means users can easily define their own to suit their purposes! Let’s take a closer look at some of the in-built functions before defining our own.
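The graph-in, graph-out contract is what makes edge functions composable: a list of them can be applied one after another. The sketch below shows that pattern on a toy graph stored as a dictionary of sets. All names here (`build_edges`, `add_backbone_edges`, `add_end_to_end_edge`) are hypothetical and chosen for illustration; this is not how Graphein stores graphs or applies its functions internally.

```python
from typing import Callable, Dict, List

Graph = Dict[str, set]  # toy stand-in for nx.Graph


def add_backbone_edges(G: Graph) -> Graph:
    """Connect consecutive nodes (a toy analogue of peptide bonds)."""
    nodes = sorted(G["nodes"])
    G["edges"].update(zip(nodes, nodes[1:]))
    return G


def add_end_to_end_edge(G: Graph) -> Graph:
    """Connect the first and last node."""
    nodes = sorted(G["nodes"])
    G["edges"].add((nodes[0], nodes[-1]))
    return G


def build_edges(G: Graph, edge_funcs: List[Callable[[Graph], Graph]]) -> Graph:
    """Apply each edge function in turn, feeding the result to the next."""
    for func in edge_funcs:
        G = func(G)
    return G


toy_graph = {"nodes": {0, 1, 2, 3}, "edges": set()}
toy_graph = build_edges(toy_graph, [add_backbone_edges, add_end_to_end_edge])
print(sorted(toy_graph["edges"]))
```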

#### Built-in Edge Functions

[7]:

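from graphein.protein.edges.distance import add_hydrogen_bond_interactions, add_peptide_bonds

# Compute both peptide-bond and hydrogen-bond edges
new_edge_funcs = {"edge_construction_functions": [add_peptide_bonds, add_hydrogen_bond_interactions]}
config = ProteinGraphConfig(**new_edge_funcs)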

g = construct_graph(config=config, pdb_code="3eiy")
p = plotly_protein_structure_graph(
g,
colour_edges_by="kind",
colour_nodes_by="seq_position",
label_node_ids=False,
plot_title="Protein graph with peptide backbone and H-Bonds. \n Nodes coloured by sequence position. Edges coloured by type.",
node_size_multiplier=1,
)
p.show()

DEBUG:graphein.protein.graphs:Deprotonating protein. This removes H atoms from the pdb_df dataframe
DEBUG:graphein.protein.graphs:Detected 174 total nodes
INFO:graphein.protein.edges.distance:Found 75 hbond interactions.
INFO:graphein.protein.edges.distance:Found 7 hbond interactions.

174


Great! So what if we try a bunch of them?

[8]:

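from graphein.protein.edges.distance import (
    add_peptide_bonds,
    add_hydrogen_bond_interactions,
    add_disulfide_interactions,
    add_ionic_interactions,
    add_aromatic_interactions,
    add_aromatic_sulphur_interactions,
    add_cation_pi_interactions,
)

# One edge function per interaction type named in the plot title below
new_edge_funcs = {
    "edge_construction_functions": [
        add_peptide_bonds,
        add_aromatic_interactions,
        add_hydrogen_bond_interactions,
        add_disulfide_interactions,
        add_ionic_interactions,
        add_aromatic_sulphur_interactions,
        add_cation_pi_interactions,
    ]
}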

config = ProteinGraphConfig(**new_edge_funcs)
g = construct_graph(config=config, pdb_code="3eiy")
p = plotly_protein_structure_graph(
g,
colour_edges_by="kind",
colour_nodes_by="seq_position",
label_node_ids=False,
plot_title="Protein graph with: peptide backbone, H-Bonds, \n Disulphide, ionic, aromatic, aromatic-sulphur and cation-pi interactions. \n  Nodes coloured by sequence position, edges by type",
node_size_multiplier=1
)
p.show()

DEBUG:graphein.protein.graphs:Deprotonating protein. This removes H atoms from the pdb_df dataframe
DEBUG:graphein.protein.graphs:Detected 174 total nodes
INFO:graphein.protein.edges.distance:Found: 16 aromatic-aromatic interactions
INFO:graphein.protein.edges.distance:Found 75 hbond interactions.
INFO:graphein.protein.edges.distance:Found 7 hbond interactions.
DEBUG:graphein.protein.edges.distance:1 CYS residues found. Cannot add disulfide interactions with fewer than two CYS residues.
INFO:graphein.protein.edges.distance:Found 1308 ionic interactions.

174


Cool! So we’ve added a bunch of distance-based computations of intramolecular edges to our graph! We have a few more distance-based edge functions that we can visualise. Let’s check them out!

First up, the Delaunay triangulation:

[9]:

from graphein.protein.edges.distance import add_delaunay_triangulation

new_edge_funcs = {"edge_construction_functions": [add_delaunay_triangulation]}
config = ProteinGraphConfig(**new_edge_funcs)

g = construct_graph(config=config, pdb_code="3eiy")
p = plotly_protein_structure_graph(
G=g,
colour_edges_by="kind",
colour_nodes_by="seq_position",
label_node_ids=False,
node_size_multiplier=1,
plot_title="Protein graph created by the Delaunay triangulation of the structure. \n Nodes coloured by sequence position."
)
p.show()

DEBUG:graphein.protein.graphs:Deprotonating protein. This removes H atoms from the pdb_df dataframe
DEBUG:graphein.protein.graphs:Detected 174 total nodes
DEBUG:graphein.protein.edges.distance:Detected 954 simplices in the Delaunay Triangulation.


Next, let’s consider the last two edge construction functions add_k_nn_edges <https://graphein.ai/modules/graphein.protein.html#graphein.protein.edges.distance.add_k_nn_edges>__ and add_distance_threshold <https://graphein.ai/modules/graphein.protein.html#graphein.protein.edges.distance.add_distance_threshold>__. These two functions take additional parameters. We can manage this with partial functions. These functions also feature a long_interaction_threshold parameter which specifies the minimum number of residues in the sequence between two residues in order to add an edge. This is because we may not be so interested in residues close together in the sequence being close together in the graph.

[10]:

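from functools import partial
from graphein.protein.edges.distance import add_distance_threshold

# Fix the extra parameters with partial so the function only needs the graph
new_edge_funcs = {"edge_construction_functions": [partial(add_distance_threshold, long_interaction_threshold=5, threshold=10.)]}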
config = ProteinGraphConfig(**new_edge_funcs)

g = construct_graph(config=config, pdb_code="3eiy")
p = plotly_protein_structure_graph(
g,
colour_edges_by="kind",
colour_nodes_by="seq_position",
label_node_ids=False,
plot_title="Protein graph created by thresholding distance between nodes. \n Nodes must be <10A apart and at least 5 positions apart \n Nodes coloured by sequence position.",
node_size_multiplier=1
)
p.show()

DEBUG:graphein.protein.graphs:Deprotonating protein. This removes H atoms from the pdb_df dataframe
DEBUG:graphein.protein.graphs:Detected 174 total nodes
INFO:graphein.protein.edges.distance:Found: 3218 distance edges
INFO:graphein.protein.edges.distance:Added 1904 distance edges. (1314 removed by LIN)
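To illustrate how a spatial cutoff and a sequence-separation cutoff interact, here is a stdlib-only sketch on a made-up hairpin of eight points. The function `distance_threshold_edges` is a hypothetical stand-in, not Graphein's add_distance_threshold, and it assumes an "at least N positions apart" reading of the sequence filter; Graphein's exact boundary condition may differ. As in the cell above, functools.partial fixes the extra parameters up front.

```python
from functools import partial
from itertools import combinations
from math import dist


def distance_threshold_edges(coords, threshold, long_interaction_threshold):
    """Return index pairs closer than `threshold` in space and at least
    `long_interaction_threshold` positions apart in sequence."""
    edges = []
    for i, j in combinations(range(len(coords)), 2):
        if j - i < long_interaction_threshold:
            continue  # too close in sequence
        if dist(coords[i], coords[j]) < threshold:
            edges.append((i, j))
    return edges


# Made-up hairpin: four points out along x, then four points back at y = 3
coords = [(0, 0, 0), (3, 0, 0), (6, 0, 0), (9, 0, 0),
          (9, 3, 0), (6, 3, 0), (3, 3, 0), (0, 3, 0)]

# Pre-fill the extra parameters so only the coordinates remain to be passed
hairpin_edges = partial(distance_threshold_edges, threshold=4.0, long_interaction_threshold=3)
edges = hairpin_edges(coords)

# Without the sequence filter, sequence neighbours are included as well
all_edges = distance_threshold_edges(coords, threshold=4.0, long_interaction_threshold=0)

print(edges)
print(len(all_edges))
```

The filtered result keeps only the contacts across the hairpin, which is the kind of long-range information the sequence filter is meant to isolate.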

[11]:

from functools import partial
from graphein.protein.edges.distance import add_k_nn_edges

new_edge_funcs = {"edge_construction_functions": [partial(add_k_nn_edges, k=3, long_interaction_threshold=0)]}
config = ProteinGraphConfig(**new_edge_funcs)

g = construct_graph(config=config, pdb_code="3eiy")
p = plotly_protein_structure_graph(
g,
colour_edges_by="kind",
colour_nodes_by="seq_position",
label_node_ids=False,
plot_title="Protein graph created from K-NN of each node. Nodes coloured by sequence position",
node_size_multiplier=1
)
p.show()

DEBUG:graphein.protein.graphs:Deprotonating protein. This removes H atoms from the pdb_df dataframe
DEBUG:graphein.protein.graphs:Detected 174 total nodes
INFO:graphein.protein.edges.distance:Found: 522 KNN edges
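The count of 522 reported above is consistent with each of the 174 nodes contributing k = 3 neighbours (174 × 3 = 522). The sketch below shows the basic k-nearest-neighbour idea on four made-up points using only the standard library. `k_nn_edges` is a hypothetical helper, not Graphein's add_k_nn_edges; unlike the count in the log, it stores each undirected pair only once.

```python
from math import dist


def k_nn_edges(coords, k):
    """Connect every point to its k nearest other points (hypothetical helper)."""
    edges = set()
    for i, point in enumerate(coords):
        others = [j for j in range(len(coords)) if j != i]
        nearest = sorted(others, key=lambda j: dist(point, coords[j]))[:k]
        # Store each pair in sorted order so (i, j) and (j, i) coincide
        edges.update((min(i, j), max(i, j)) for j in nearest)
    return sorted(edges)


# Made-up points on a line with increasing gaps
coords = [(0, 0, 0), (1, 0, 0), (3, 0, 0), (7, 0, 0)]
edges = k_nn_edges(coords, k=1)
print(edges)
```

Note that nearest-neighbour relations are not symmetric: the last point's nearest neighbour is the third point, but not the other way round.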


#### Defining your own edge functions

Now, let’s turn our attention to defining our own edge construction functions. These must take in a nx.Graph and return a nx.Graph. You can use these custom functions together with all the in-built edge functions as before. If you come up with something cool, consider making a pull-request and sharing it with the community!

Here, as an example, we define an arbitrary function that creates an edge between all histidine residues in the graph.

[12]:

import networkx as nx

def add_histidine_histidine_edges(G: nx.Graph) -> nx.Graph:
    # Iterate over nodes to identify histidines
    histidines = [n for n, d in G.nodes(data=True) if d["residue_name"] == "HIS"]

    # Iterate over histidines and create a bond between each pair
    for i, x in enumerate(histidines):
        for j, y in enumerate(histidines):
            if i != j:
                G.add_edge(x, y, kind={"histidine"})
    return G

[13]:

new_edge_funcs = {"edge_construction_functions": [add_histidine_histidine_edges]}
config = ProteinGraphConfig(**new_edge_funcs)

g = construct_graph(config=config, pdb_code="3eiy")
p = plotly_protein_structure_graph(
g,
colour_edges_by="kind",
colour_nodes_by="seq_position",
label_node_ids=False,
plot_title="Protein graph created using a user-defined function that connects all Histidines."
)
p.show()
# Let's check this worked:
for u, v, a in g.edges(data=True):
    print(u, v, a)

DEBUG:graphein.protein.graphs:Deprotonating protein. This removes H atoms from the pdb_df dataframe
DEBUG:graphein.protein.graphs:Detected 174 total nodes

A:HIS:111 A:HIS:137 {'kind': {'histidine'}}
A:HIS:111 A:HIS:163 {'kind': {'histidine'}}
A:HIS:137 A:HIS:163 {'kind': {'histidine'}}
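The three edges printed above are exactly the unordered pairs of the three histidines. As a design note, the same pairs can be enumerated with itertools.combinations, which visits each pair once instead of twice as the `i != j` double loop does (harmless in an undirected graph, but redundant). A small sketch using the node IDs from the output above:

```python
from itertools import combinations

# Node IDs taken from the output above
histidines = ["A:HIS:111", "A:HIS:137", "A:HIS:163"]

# Each unordered pair appears exactly once
pairs = list(combinations(histidines, 2))
for u, v in pairs:
    print(u, v)
```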


Graphein is designed to facilitate geometric deep learning on protein structures and so we have a bunch of feature and metadata annotation functions. These operate on three levels:

• node-level annotation

• graph-level annotation

• edge-level annotation

These behave very similarly to the edge construction functions we just looked at. You simply pass a list of functions that take the right sort of arguments and return the right types.

• graph-level annotation takes in a nx.Graph and returns an nx.Graph (just like before!).

• node-level annotation takes in a node, data tuple from G.nodes(data=True) and returns a pd.Series

• edge-level annotation takes in a node_u, node_v, data tuple from G.edges(data=True)

#### Built-in Graph Annotation Functions

These are found in graphein.protein.features.sequence and graphein.protein.features.graph. For cleaner organisation, we make a slight distinction for those functions that operate on the protein sequence. N.B. if you include multiple chains in your graph (via the chain_selection parameter), then there will be multiple sequences associated with it. These are accessed as G.graph[f"sequence_{CHAIN_ID}"], and features computed from a sequence are accessed as G.graph[f"{FEATURE_NAME}_{CHAIN_ID}"].

[14]:

from graphein.protein.features.sequence.sequence import molecular_weight

new_graph_annotation_funcs = {"graph_metadata_functions": [molecular_weight]}
config = ProteinGraphConfig(**new_graph_annotation_funcs)

g = construct_graph(config=config, pdb_code="3eiy")
print("Sequence:", g.graph["sequence_A"])
print("MW:", g.graph["molecular_weight_A"])

DEBUG:graphein.protein.graphs:Deprotonating protein. This removes H atoms from the pdb_df dataframe
DEBUG:graphein.protein.graphs:Detected 174 total nodes

174
MW: 19029.71029999999


We also provide some utilities for adding graph-level features in the form of sequence embeddings from various pre-trained language models. Let’s check them out!

[15]:

# Warning! This cell may crash a binder notebook as the pre-trained model download is rather large!
# NBVAL_SKIP
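from graphein.protein.features.sequence.embeddings import esm_sequence_embedding, biovec_sequence_embedding

new_graph_annotation_funcs = {"graph_metadata_functions": [esm_sequence_embedding, biovec_sequence_embedding]}
config = ProteinGraphConfig(**new_graph_annotation_funcs)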

g = construct_graph(config=config, pdb_code="3eiy")
print("ESM:", g.graph["esm_embedding_A"])
print("biovec:", g.graph["biovec_embedding_A"])

DEBUG:graphein.protein.graphs:Deprotonating protein. This removes H atoms from the pdb_df dataframe
DEBUG:graphein.protein.graphs:Detected 174 total nodes
/Users/arianjamasb/opt/anaconda3/envs/graphein-wip/lib/python3.8/site-packages/torchvision/io/image.py:11: UserWarning:

Referenced from: /Users/arianjamasb/opt/anaconda3/envs/graphein-wip/lib/python3.8/site-packages/torchvision/image.so
Expected in: /Users/arianjamasb/opt/anaconda3/envs/graphein-wip/lib/python3.8/site-packages/torch/lib/libtorch_cpu.dylib


174

INFO:gensim.utils:loading Word2Vec object from /Users/arianjamasb/github/graphein/graphein/protein/features/pretrained_models/swissprot-reviewed-protvec.model
DEBUG:smart_open.smart_open_lib:{'uri': '/Users/arianjamasb/github/graphein/graphein/protein/features/pretrained_models/swissprot-reviewed-protvec.model', 'mode': 'rb', 'buffering': -1, 'encoding': None, 'errors': None, 'newline': None, 'closefd': True, 'opener': None, 'ignore_ext': False, 'compression': None, 'transport_params': None}
INFO:gensim.utils:setting ignored attribute vectors_norm to None
INFO:gensim.utils:setting ignored attribute cum_table to None

ESM: [-0.02279943  0.16464774 -0.00461995 ...  0.02556622 -0.19798394
0.09682215]
biovec: [array([ -8.865931  ,   5.65769   ,  -1.5052402 ,   3.2893941 ,
-6.6228924 ,   2.4890645 ,   1.0635861 ,   2.4991994 ,
-2.2740204 ,  -7.6614437 ,   0.3739053 ,   4.285517  ,
-1.7950253 ,  -2.0930681 ,   0.19075748,   9.249631  ,
-1.2537682 ,   6.0140305 ,  -3.0592997 ,   3.1944623 ,
5.6747837 ,   1.4732779 ,  -9.556535  ,   4.5126605 ,
-15.789192  ,  -9.917104  ,  10.61495   ,   1.6693095 ,
1.8446481 , -12.656563  ,   2.5226657 ,   3.2782233 ,
5.6363025 ,  -1.5776551 ,   9.913681  ,  -1.4776824 ,
7.4292827 ,   2.6953607 ,  -0.81495905,   1.0415509 ,
-18.352554  ,  -9.603085  , -11.215957  ,   3.9449768 ,
2.2988393 ,   2.8612108 ,  -5.647224  ,  -9.894558  ,
2.8099368 ,  -3.47802   ,  -1.7364154 ,   4.10581   ,
-1.1337806 ,  -7.851351  ,   3.667841  ,   2.5098174 ,
0.3878222 ,  -0.282476  ,   4.242093  ,   2.7918775 ,
-0.3927042 ,  -3.504907  ,   2.2170143 ,  -6.7696056 ,
-8.574576  ,   7.549987  ,   5.5607357 ,  -7.514735  ,
5.778782  ,   2.1968088 ,   5.747694  ,  -4.4072075 ,
1.6411833 ,   2.339804  ,   4.364424  ,  -4.615387  ,
0.61950064,   7.19126   ,  -0.10569584,  -6.485407  ,
-3.3555477 ,   0.58533096,   4.150122  ,   4.6470437 ,
-7.6111784 ,   5.8581285 ,  -9.8836    ,   8.114656  ,
4.3746543 ,   1.4033681 ,  -4.632345  ,  13.016971  ,
-2.839704  ,  -6.210647  ,   3.4460034 ,   2.6573007 ,
-14.987186  ,  -6.545936  ,  -2.3990562 ,   1.2732263 ],
dtype=float32), array([ -9.079695  ,   3.7650356 ,   0.3614651 ,   2.1325634 ,
-5.285296  ,   3.0021908 ,   1.2300422 ,   1.8999658 ,
-2.0472856 ,  -5.5658956 ,   0.5014471 ,   3.8743246 ,
-1.0728757 ,  -3.0830898 ,   0.73323405,   9.658365  ,
-0.10142106,   6.2466826 ,  -2.4961333 ,   2.8022783 ,
6.5060906 ,   1.6311501 , -10.04929   ,   5.4711785 ,
-16.297068  , -10.765219  ,  10.580414  ,   0.503202  ,
2.4227042 , -12.57771   ,   2.7075891 ,   3.9921677 ,
5.5343657 ,  -0.9493397 ,  10.082592  ,   0.301396  ,
6.978562  ,   2.4752004 ,  -1.1012759 ,  -1.1357698 ,
-16.81547   ,  -9.117541  , -10.919918  ,   3.8029845 ,
2.5540738 ,   3.0349758 ,  -5.4471354 ,  -9.438955  ,
3.067747  ,  -3.1571815 ,  -3.1563768 ,   4.7784586 ,
-2.3092556 ,  -7.704368  ,   0.82422435,   1.8206171 ,
2.161718  ,   0.37946355,   4.7007704 ,   2.2930143 ,
0.75466317,  -3.3875456 ,   2.3561945 ,  -6.173627  ,
-7.655903  ,   6.066585  ,   5.7083497 ,  -8.785268  ,
4.0877957 ,   2.1213973 ,   5.8782153 ,  -4.1335964 ,
0.26376987,   3.185395  ,   3.9498727 ,  -4.5998893 ,
-0.11535516,   8.4039135 ,  -0.8255909 ,  -5.910516  ,
-2.2402806 ,  -0.15940607,   3.476339  ,   6.034683  ,
-6.72777   ,   6.143824  ,  -9.662175  ,   8.747985  ,
3.985702  ,   2.272303  ,  -3.933255  ,  12.071237  ,
-1.8998326 ,  -6.6413393 ,   4.2529097 ,   3.4216733 ,
-15.628299  ,  -5.513835  ,  -3.004577  ,   1.3258756 ],
dtype=float32), array([ -6.9852    ,   4.425231  ,  -0.5174276 ,   2.0380528 ,
-7.1311088 ,   2.4794824 ,   1.9649935 ,   1.9006643 ,
-1.3209196 ,  -5.8989286 ,   1.2391084 ,   3.485902  ,
-1.9545169 ,  -2.2625084 ,   1.8165749 ,   9.708086  ,
-0.74385244,   6.5338845 ,  -3.0916584 ,   3.9312694 ,
6.1540847 ,   3.4793413 , -10.056544  ,   5.7322254 ,
-13.599701  , -10.923328  ,   9.378405  ,   0.555632  ,
3.512236  , -11.891929  ,   2.382594  ,   3.2599561 ,
5.123339  ,  -0.55314994,  10.585077  ,  -0.9078483 ,
8.167705  ,   2.4789596 ,   0.78942966,  -1.0045129 ,
-17.38679   ,  -9.496934  , -10.542821  ,   3.5852609 ,
3.3952198 ,   2.9937162 ,  -5.9167666 ,  -9.517663  ,
2.2889428 ,  -3.6738756 ,  -2.7072256 ,   4.6205435 ,
-1.970848  ,  -6.7903805 ,   2.1213295 ,   2.5594456 ,
0.4943355 ,   0.27333227,   4.096034  ,   1.2748998 ,
-1.3203058 ,  -3.6010296 ,   2.623712  ,  -8.468901  ,
-6.3624053 ,   7.388231  ,   4.908528  ,  -7.1587243 ,
4.9127274 ,   3.2538335 ,   6.45744   ,  -3.3919108 ,
-0.22479697,   2.4506824 ,   4.864544  ,  -4.201532  ,
-0.08151506,   5.778758  ,  -0.5880602 ,  -7.52435   ,
-2.3001645 ,   0.04607171,   4.0975637 ,   5.7935786 ,
-8.676276  ,   6.952252  , -11.0132    ,   7.6395264 ,
4.4887743 ,   2.5889583 ,  -5.0494356 ,  12.681698  ,
-2.3577175 ,  -5.7180576 ,   3.481403  ,   3.9928114 ,
-14.854708  ,  -6.135002  ,  -2.9227335 ,   1.2122653 ],
dtype=float32)]


#### Writing your own graph-level annotation function

We provide a couple of useful functions for dealing with multiple chains robustly, namely compute_feature_over_chains and aggregate_feature_over_chains. compute_feature_over_chains takes a graph (nx.Graph), your function (Callable) and a feature_name (str). Your function must operate on a sequence (str) to use this utility.

aggregate_feature_over_chains can be used to perform an aggregation ("min", "max", "sum" or "mean") of chain-specific features. It takes an nx.Graph, a feature name (str) and an aggregation type (str).

If you don’t want to operate on the sequence, you don’t need to use these!

Let’s use this to construct a graph-level feature that operates on the sequence. This feature is an integer equal to the number of histidines in the chain.

[16]:

from graphein.protein.features.sequence.utils import compute_feature_over_chains, aggregate_feature_over_chains

# Define our graph-level function that operates on sequences
def histidine_count_feature(G: nx.Graph) -> nx.Graph:

    # Define our feature function that operates on a sequence
    def count_histidines(sequence: str) -> int:
        return sequence.count("H")

    # Compute the feature over the chains in the graph
    G = compute_feature_over_chains(G, func=count_histidines, feature_name="histidine_count")

    # Aggregate the feature over the chains
    G = aggregate_feature_over_chains(G, feature_name="histidine_count", aggregation_type="mean")
    return G

new_graph_annotation_funcs = {"graph_metadata_functions": [histidine_count_feature]}
config = ProteinGraphConfig(**new_graph_annotation_funcs)

# Test our new feature
g = construct_graph(config=config, pdb_code="3eiy")
print(g.graph["histidine_count_A"])

# And now on a multi-chain graph
g = construct_graph(config=config, pdb_code="9API", chain_selection="all")
print("Chain A Hists:", g.graph["histidine_count_A"])
print("Chain B Hists:", g.graph["histidine_count_B"])
print("Chain B Seq:", g.graph["sequence_B"])
print("Aggregated HIS count:", g.graph["histidine_count_mean"])

DEBUG:graphein.protein.graphs:Deprotonating protein. This removes H atoms from the pdb_df dataframe
DEBUG:graphein.protein.graphs:Detected 174 total nodes

174
3

DEBUG:graphein.protein.graphs:Deprotonating protein. This removes H atoms from the pdb_df dataframe
DEBUG:graphein.protein.graphs:Detected 375 total nodes

375
Chain A Hists: 11
Chain B Hists: 0
Chain B Seq: SIPPEVKFNKPFVFLMIEQNTKSPLFMGKVVNPTQK
Aggregated HIS count: 5.5
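The per-chain keys and the aggregated key seen above follow a simple naming scheme. The sketch below reproduces that bookkeeping on a plain dictionary standing in for G.graph. It only mirrors the key names visible in the output and is not the implementation of compute_feature_over_chains or aggregate_feature_over_chains; the chain A sequence is made up, while chain B is the sequence printed above.

```python
from statistics import fmean


def count_histidines(sequence: str) -> int:
    return sequence.count("H")


# Toy stand-in for G.graph
graph_attrs = {
    "sequence_A": "MHHKLH",  # hypothetical sequence
    "sequence_B": "SIPPEVKFNKPFVFLMIEQNTKSPLFMGKVVNPTQK",
}

# Discover chain IDs from the "sequence_<CHAIN_ID>" keys
chain_ids = sorted(
    key.split("_", 1)[1] for key in graph_attrs if key.startswith("sequence_")
)

# Per-chain feature: "<FEATURE_NAME>_<CHAIN_ID>"
for chain_id in chain_ids:
    sequence = graph_attrs[f"sequence_{chain_id}"]
    graph_attrs[f"histidine_count_{chain_id}"] = count_histidines(sequence)

# Aggregated feature: "<FEATURE_NAME>_<AGGREGATION_TYPE>"
graph_attrs["histidine_count_mean"] = fmean(
    graph_attrs[f"histidine_count_{chain_id}"] for chain_id in chain_ids
)

print(graph_attrs["histidine_count_A"], graph_attrs["histidine_count_B"])
print(graph_attrs["histidine_count_mean"])
```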


#### Node-level features

Similarly, we provide some in-built functions for computing node-level features. For example, we’ll look at some featurisations of amino acid types.

First, we’ll look at a simple one-hot encoding of the amino acid type:

[17]:

from graphein.protein.features.nodes.amino_acid import amino_acid_one_hot

config = ProteinGraphConfig(**{"node_metadata_functions": [amino_acid_one_hot]})
g = construct_graph(config=config, pdb_code="3eiy")

for n, d in g.nodes(data=True):
    print(d["amino_acid_one_hot"])

DEBUG:graphein.protein.graphs:Deprotonating protein. This removes H atoms from the pdb_df dataframe
DEBUG:graphein.protein.graphs:Detected 174 total nodes

174
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
[0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]
[1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
[0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0]
[0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
[0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]
[1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
[0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
[0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1]
[0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
[1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
[0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0]
[0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0]
[0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1]
[0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
[0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1]
[0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0]
[0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
[0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
[0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
[0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
[0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]
[0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0]
[1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0]
[1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0]
[1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0]
[0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
[0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
[0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
[1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]
[0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
[0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0]
[1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
[0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]
[1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1]
[0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
[0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0]
[0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1]
[0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
[1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0]
[0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
[0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
[0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0]
[0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0]
[0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
[1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0]
[0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
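Each row above is a 20-element vector with a single 1 marking the residue type. The sketch below builds such a vector in plain Python. The ordering used here (alphabetical by one-letter code) is an assumption; it does place serine at index 15, matching the first row printed above, but the authoritative ordering is whatever Graphein defines. `one_hot` is a hypothetical helper, not Graphein's amino_acid_one_hot.

```python
# Assumed ordering: the 20 standard amino acids, alphabetical by one-letter code
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(residue: str) -> list:
    """Return a 20-element list with a single 1 at the residue's index."""
    vector = [0] * len(AMINO_ACIDS)
    vector[AMINO_ACIDS.index(residue)] = 1
    return vector


serine = one_hot("S")
print(serine)
```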

[18]:

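from graphein.protein.features.nodes.amino_acid import expasy_protein_scale

config = ProteinGraphConfig(**{"node_metadata_functions": [expasy_protein_scale]})
g = construct_graph(config=config, pdb_code="3eiy")

# Print the features of the first node only
for n, d in g.nodes(data=True):
    print(d['expasy'])
    break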

DEBUG:graphein.protein.graphs:Deprotonating protein. This removes H atoms from the pdb_df dataframe
DEBUG:graphein.protein.graphs:Detected 174 total nodes
DEBUG:graphein.protein.features.nodes.amino_acid:Reading Expasy protein scales from: /Users/arianjamasb/github/graphein/graphein/protein/features/nodes/amino_acid_properties.csv

174
pka_cooh_alpha               2.21
pka_nh3                      9.15
pka_rgroup                   7.00
isoelectric_points           5.68
molecularweight            105.00
...
antiparallelbeta_strand      0.87
parallelbeta_strand          0.70
a_a_composition              6.90
a_a_swiss_prot               6.56
relativemutability         120.00
Name: SER, Length: 61, dtype: float64


#### Geometric features

We can also add geometric features to protein structure graphs.

[19]:

from graphein.protein.features.nodes.geometry import add_sidechain_vector, add_beta_carbon_vector, add_sequence_neighbour_vector