graphein.protein#

Config#

Base Config object for use with Protein Graph Construction.

class graphein.protein.config.GetContactsConfig(*, get_contacts_path: pathlib.Path = PosixPath('/Users/arianjamasb/github/getcontacts'), contacts_dir: pathlib.Path = PosixPath('/Users/arianjamasb/graphein/examples/contacts'), pdb_dir: pathlib.Path = PosixPath('/Users/arianjamasb/graphein/examples/pdbs'), granularity: str = 'CA')[source]#

Config object for parameters relating to running GetContacts. GetContacts is an optional dependency from which intramolecular interactions can be computed and used as edges in the graph.

More information about GetContacts can be found at https://getcontacts.github.io/

Parameters
  • get_contacts_path (pathlib.Path) – Path to GetContacts installation

  • contacts_dir (pathlib.Path) – Path to store output of GetContacts

  • pdb_dir (pathlib.Path) – Path to PDB files to be used to compute intramolecular interactions.

  • granularity (str) – Specifies the node types of the graph, defaults to "CA" for alpha-carbons as nodes. Other options are "CB" (beta-carbon), "atom" for all-atom graphs, and "centroid" for nodes positioned as residue centroids.

graphein.protein.config.GranularityOpts#

Allowable granularity options for nodes in the graph.

alias of Literal['atom', 'centroids']

graphein.protein.config.GraphAtoms#

Allowable atom types for nodes in the graph.

alias of Literal['N', 'CA', 'C', 'O', 'CB', 'OG', 'CG', 'CD1', 'CD2', 'CE1', 'CE2', 'CZ', 'OD1', 'ND2', 'CG1', 'CG2', 'CD', 'CE', 'NZ', 'OD2', 'OE1', 'NE2', 'OE2', 'OH', 'NE', 'NH1', 'NH2', 'OG1', 'SD', 'ND1', 'SG', 'NE1', 'CE3', 'CZ2', 'CZ3', 'CH2', 'OXT']

class graphein.protein.config.ProteinGraphConfig(*, granularity: typing.Union[typing.Literal['N', 'CA', 'C', 'O', 'CB', 'OG', 'CG', 'CD1', 'CD2', 'CE1', 'CE2', 'CZ', 'OD1', 'ND2', 'CG1', 'CG2', 'CD', 'CE', 'NZ', 'OD2', 'OE1', 'NE2', 'OE2', 'OH', 'NE', 'NH1', 'NH2', 'OG1', 'SD', 'ND1', 'SG', 'NE1', 'CE3', 'CZ2', 'CZ3', 'CH2', 'OXT'], typing.Literal['atom', 'centroids']] = 'CA', keep_hets: bool = False, insertions: bool = False, pdb_dir: pathlib.Path = PosixPath('../examples/pdbs'), verbose: bool = False, exclude_waters: bool = True, deprotonate: bool = False, protein_df_processing_functions: typing.List[typing.Callable] = None, edge_construction_functions: typing.List[typing.Union[typing.Callable, str]] = [<function add_peptide_bonds>], node_metadata_functions: typing.List[typing.Union[typing.Callable, str]] = [<function meiler_embedding>], edge_metadata_functions: typing.List[typing.Union[typing.Callable, str]] = None, graph_metadata_functions: typing.List[typing.Callable] = None, get_contacts_config: graphein.protein.config.GetContactsConfig = None, dssp_config: graphein.protein.config.DSSPConfig = None)[source]#

Config Object for Protein Structure Graph Construction.

If you encounter a problematic structure, perusing https://www.umass.edu/microbio/chime/pe_beta/pe/protexpl/badpdbs.htm may provide some additional insight. PDBs are notoriously troublesome and this is an excellent overview.

Parameters
  • granularity (str (Union[graphein.protein.config.GraphAtoms, graphein.protein.config.GranularityOpts])) – Controls the granularity of the graph construction. "atom" builds an atomic-scale graph where nodes are constituent atoms. Residue-level graphs can be built by specifying which constituent atom should represent node positions (see GraphAtoms). Additionally, "centroids" can be specified to compute the centre of gravity for a given residue (specified in GranularityOpts). Defaults to "CA" (alpha-Carbon).

  • keep_hets (bool) –

    Controls whether or not heteroatoms are removed from the PDB file. These are typically modified residues, bound ligands, crystallographic adjuvants, ions or water molecules.

    For more information, see: https://proteopedia.org/wiki/index.php/Hetero_atoms

  • insertions (bool) – Controls whether or not insertions are allowed.

  • pdb_dir (pathlib.Path) – Specifies path to download protein structures into.

  • verbose (bool) – Specifies verbosity of graph creation process.

  • exclude_waters (bool) – Specifies whether or not water molecules are excluded from the structure.

  • deprotonate (bool) – Specifies whether or not to remove H atoms from the graph.

  • protein_df_processing_functions (Optional[List[Callable]]) – List of functions that take a pd.DataFrame and return a pd.DataFrame. This allows users to define their own series of processing functions for the protein structure DataFrame and override the default sequencing of processing steps provided by Graphein. We refer users to our low-level API tutorial for more details.

  • edge_construction_functions (List[Callable]) – List of functions that take an nx.Graph and return an nx.Graph with desired edges added. Prepared edge constructions can be found in graphein.protein.edges

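  • node_metadata_functions (List[Callable], optional) – List of functions that take an nx.Graph and return an nx.Graph with added node-level features and metadata.

  • edge_metadata_functions (List[Callable], optional) – List of functions that take an nx.Graph and return an nx.Graph with added edge-level features and metadata.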

  • graph_metadata_functions (List[Callable], optional) – List of functions that take an nx.Graph and return an nx.Graph with added graph-level features and metadata.

  • get_contacts_config (GetContactsConfig, optional) – Config object containing parameters for running GetContacts for computing intramolecular contact-based edges. Defaults to None.

  • dssp_config (DSSPConfig, optional) – Config Object containing reference to DSSP executable. Defaults to None. NB DSSP must be installed. See installation instructions: https://graphein.ai/getting_started/installation.html#optional-dependencies

class graphein.protein.config.ProteinMeshConfig(*, pymol_command_line_options: str = '-cKq', pymol_commands: List[str] = ['show surface'])[source]#

Config object for parameters relating to Protein Mesh construction with PyMol

NB PyMol must be installed. See: https://graphein.ai/getting_started/installation.html#optional-dependencies

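Parameters
  • pymol_command_line_options (str) – Command-line options passed to PyMol. Defaults to "-cKq".

  • pymol_commands (List[str]) – PyMol commands to run. Defaults to ["show surface"].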

Graphs#

Functions for working with Protein Structure Graphs.

graphein.protein.graphs.add_nodes_to_graph(G: networkx.classes.graph.Graph, protein_df: Optional[pandas.core.frame.DataFrame] = None, verbose: bool = False) networkx.classes.graph.Graph[source]#

Add nodes into protein graph.

Parameters
  • G (nx.Graph) – nx.Graph with metadata to populate with nodes.

  • verbose (bool) – Controls verbosity of this step.

  • protein_df (pd.DataFrame, optional) – DataFrame of protein structure containing nodes & initial node metadata to add to the graph.

Returns

nx.Graph with nodes added.

Return type

nx.Graph

graphein.protein.graphs.assign_node_id_to_dataframe(protein_df: pandas.core.frame.DataFrame, granularity: str) pandas.core.frame.DataFrame[source]#

Assigns the node ID back to the pdb_df dataframe

Parameters
  • protein_df (pd.DataFrame) – Structure Dataframe

  • granularity (str) – Granularity of graph. Atom-level, residue (e.g. CA) or centroids. See: GRAPH_ATOMS and GRANULARITY_OPTS.

Returns

Returns dataframe with added node_ids

Return type

pd.DataFrame

graphein.protein.graphs.calculate_centroid_positions(atoms: pandas.core.frame.DataFrame, verbose: bool = False) pandas.core.frame.DataFrame[source]#

Calculates position of sidechain centroids.

Parameters
  • atoms (pd.DataFrame) – ATOM df of protein structure.

  • verbose (bool) – bool controlling verbosity.

Returns

centroids (df).

Return type

pd.DataFrame
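As a rough illustration of the centroid calculation, the sketch below averages atom coordinates per residue using only the standard library. It is not Graphein's implementation: the function name `residue_centroids` and the plain-dict atom records (keys `chain_id`, `residue_number`, `x`, `y`, `z`) are assumptions made for this example, whereas the documented function operates on a pd.DataFrame.

```python
from collections import defaultdict
from statistics import fmean


def residue_centroids(atoms):
    """Return {(chain_id, residue_number): (x, y, z)} with the mean position per residue.

    `atoms` is a list of dicts; this record layout is hypothetical and only
    stands in for the ATOM dataframe described above.
    """
    grouped = defaultdict(list)
    for atom in atoms:
        key = (atom["chain_id"], atom["residue_number"])
        grouped[key].append((atom["x"], atom["y"], atom["z"]))

    # zip(*coords) regroups the points into one sequence per axis.
    return {
        key: tuple(fmean(axis) for axis in zip(*coords))
        for key, coords in grouped.items()
    }
```

Each residue ends up with a single coordinate triple, which is the shape of result a centroid-granularity graph needs for its node positions.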

graphein.protein.graphs.compute_chain_graph(g: networkx.classes.graph.Graph, chain_list: Optional[List[str]] = None, remove_self_loops: bool = False, return_weighted_graph: bool = False) Union[networkx.classes.graph.Graph, networkx.classes.multigraph.MultiGraph][source]#

Computes a chain-level graph from a protein structure graph.

This graph features nodes as individual chains in a complex and edges as the interactions between constituent nodes in each chain. You have the option of returning an unweighted graph (multigraph, return_weighted_graph=False) or a weighted graph (return_weighted_graph=True). The difference is that the unweighted graph features an edge for each interaction between chains (i.e. the number of edges will be equal to the number of edges in the input protein structure graph), while the weighted graph sums these interactions into a single edge between chains with the counts stored as features.

Parameters
  • g (nx.Graph) – A protein structure graph to compute the chain graph of.

  • chain_list (Optional[List[str]]) – A list of chains to extract from the input graph. If None, all chains will be used. This is provided as input to extract_subgraph_from_chains. Default is None.

  • remove_self_loops (bool) – Whether to remove self-loops from the graph. Default is False.

Returns

A chain-level graph.

Return type

Union[nx.Graph, nx.MultiGraph]
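The unweighted (multigraph) case described above can be sketched without any graph library: every residue-level edge becomes one chain-level edge. This is an illustrative sketch under stated assumptions, not Graphein's code. The helper name `chain_level_edges`, the `chain_of` lookup table and the node labels are all hypothetical.

```python
def chain_level_edges(residue_edges, chain_of, remove_self_loops=False):
    """Map each residue-level edge to an edge between the chains of its endpoints.

    One output edge is produced per input edge, so parallel edges are kept,
    mirroring the multigraph behaviour described in the text.
    """
    chain_edges = []
    for u, v in residue_edges:
        chain_u, chain_v = chain_of[u], chain_of[v]
        # An intra-chain interaction becomes a self-loop on that chain.
        if remove_self_loops and chain_u == chain_v:
            continue
        chain_edges.append((chain_u, chain_v))
    return chain_edges
```

With `remove_self_loops=False` the number of chain-level edges equals the number of input edges, which matches the description of the unweighted graph.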

graphein.protein.graphs.compute_edges(G: networkx.classes.graph.Graph, funcs: List[Callable], get_contacts_config: Optional[graphein.protein.config.GetContactsConfig] = None) networkx.classes.graph.Graph[source]#

Computes edges for the protein structure graph. Will compute a pairwise distance matrix between nodes which is added to the graph metadata to facilitate some edge computations.

Parameters
  • G (nx.Graph) – nx.Graph with nodes to add edges to.

  • funcs (List[Callable]) – List of edge construction functions.

  • get_contacts_config (graphein.protein.config.GetContactsConfig) – Config object for GetContacts if intramolecular edges are being used.

Returns

Graph with added edges.

Return type

nx.Graph

graphein.protein.graphs.compute_secondary_structure_graph(g: networkx.classes.graph.Graph, allowable_ss_elements: Optional[List[str]] = None, remove_non_ss: bool = True, remove_self_loops: bool = False, return_weighted_graph: bool = False) Union[networkx.classes.graph.Graph, networkx.classes.multigraph.MultiGraph][source]#

Computes a secondary structure graph from a protein structure graph.

Parameters
  • g (nx.Graph) – A protein structure graph to compute the secondary structure graph of.

  • remove_non_ss (bool) – Whether to remove non-secondary structure nodes from the graph. These are denoted as "-" by DSSP. Default is True.

  • remove_self_loops (bool) – Whether to remove self-loops from the graph. Default is False.

  • return_weighted_graph (bool) – Whether to return a weighted graph. Default is False.

Raises

ProteinGraphConfigurationError – If the protein structure graph is not configured correctly with secondary structure assignments on all nodes.

Returns

A secondary structure graph.

Return type

Union[nx.Graph, nx.MultiGraph]

graphein.protein.graphs.compute_weighted_graph_from_multigraph(g: networkx.classes.multigraph.MultiGraph) networkx.classes.graph.Graph[source]#

Computes a weighted graph from a multigraph.

This function is used to convert a multigraph to a weighted graph. The weights of the edges are the number of interactions between the nodes.

Parameters

g (nx.MultiGraph) – A multigraph.

Returns

A weighted graph.

Return type

nx.Graph
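To make the multigraph-to-weighted conversion concrete, here is a minimal standard-library sketch that counts parallel edges. It assumes undirected edges given as plain tuples; the name `weighted_edges_from_multiedges` is invented for the example, and the documented function works on nx.MultiGraph objects rather than lists.

```python
from collections import Counter


def weighted_edges_from_multiedges(multi_edges):
    """Collapse parallel undirected edges into {(u, v): weight}.

    The weight is the number of times the (unordered) pair occurs.
    """
    # Sorting each pair makes (u, v) and (v, u) count as the same edge.
    counts = Counter(tuple(sorted(edge)) for edge in multi_edges)
    return dict(counts)
```

The resulting weights are interaction counts, matching the statement that edge weights are the number of interactions between the nodes.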

graphein.protein.graphs.construct_graph(config: Optional[graphein.protein.config.ProteinGraphConfig] = None, pdb_path: Optional[str] = None, pdb_code: Optional[str] = None, chain_selection: str = 'all', df_processing_funcs: Optional[List[Callable]] = None, edge_construction_funcs: Optional[List[Callable]] = None, edge_annotation_funcs: Optional[List[Callable]] = None, node_annotation_funcs: Optional[List[Callable]] = None, graph_annotation_funcs: Optional[List[Callable]] = None) networkx.classes.graph.Graph[source]#

Constructs protein structure graph from a pdb_code or pdb_path.

Users can provide a ProteinGraphConfig object to specify construction parameters.

However, config parameters can be overridden by passing arguments directly to the function.

Parameters
  • config (graphein.protein.config.ProteinGraphConfig, optional) – ProteinGraphConfig object. If None, defaults to config in graphein.protein.config.

  • pdb_path (str, optional) – Path to pdb_file to build graph from. Default is None.

  • pdb_code (str, optional) – 4-character PDB accession code to build graph from. Default is None.

  • chain_selection (str) – String of polypeptide chains to include in graph. E.g "ABDF" or "all". Default is "all".

  • df_processing_funcs (List[Callable], optional) – List of dataframe processing functions. Default is None.

  • edge_construction_funcs (List[Callable], optional) – List of edge construction functions. Default is None.

  • edge_annotation_funcs (List[Callable], optional) – List of edge annotation functions. Default is None.

  • node_annotation_funcs (List[Callable], optional) – List of node annotation functions. Default is None.

  • graph_annotation_funcs (List[Callable]) – List of graph annotation functions. Default is None.

Returns

Protein Structure Graph

Return type

nx.Graph
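A hedged usage sketch follows, built only from the signatures documented on this page. The wrapper name `build_ca_graph` is hypothetical, and the imports are placed inside the function so that defining it does not itself require Graphein. Running it assumes Graphein is installed and that the structure can be fetched into `pdb_dir`.

```python
def build_ca_graph(pdb_code):
    """Build a residue-level (alpha-carbon) graph for one PDB accession code."""
    # Imported lazily: Graphein is only needed when the helper is called.
    from graphein.protein.config import ProteinGraphConfig
    from graphein.protein.edges.distance import add_peptide_bonds
    from graphein.protein.graphs import construct_graph

    # These values restate the documented defaults explicitly.
    config = ProteinGraphConfig(
        granularity="CA",
        edge_construction_functions=[add_peptide_bonds],
    )
    return construct_graph(config=config, pdb_code=pdb_code)
```

Passing a list such as `edge_construction_funcs` directly to `construct_graph` would, per the description above, override the corresponding config entry.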

graphein.protein.graphs.construct_graphs_mp(pdb_code_it: Optional[List[str]] = None, pdb_path_it: Optional[List[str]] = None, chain_selections: Optional[list[str]] = None, config: ProteinGraphConfig = ProteinGraphConfig(granularity='CA', keep_hets=False, insertions=False, pdb_dir=PosixPath('../examples/pdbs'), verbose=False, exclude_waters=True, deprotonate=False, protein_df_processing_functions=None, edge_construction_functions=[<function add_peptide_bonds>], node_metadata_functions=[<function meiler_embedding>], edge_metadata_functions=None, graph_metadata_functions=None, get_contacts_config=None, dssp_config=None), num_cores: int = 16, return_dict: bool = True) Union[List[nx.Graph], Dict[str, nx.Graph]][source]#

Constructs protein graphs for a list of pdb codes or pdb paths using multiprocessing.

Parameters
  • pdb_code_it (Optional[List[str]], defaults to None) – List of pdb codes to use for protein graph construction

  • pdb_path_it (Optional[List[str]], defaults to None) – List of paths to PDB files to use for protein graph construction

  • chain_selections (Optional[List[str]], defaults to None) – List of chains to select from the protein structures (e.g. ["ABC", "A", "L", "CD", …])

  • config (graphein.protein.config.ProteinGraphConfig, defaults to default config params) – ProteinGraphConfig to use.

  • num_cores (int, defaults to 16) – Number of cores to use for multiprocessing. The more the merrier

  • return_dict (bool, defaults to True) – Whether to return a dictionary (indexed by PDB codes/paths) rather than a list of graphs.

Returns

Iterable of protein graphs. None values indicate that there was a problem constructing the graph for that particular PDB.

Return type

Union[List[nx.Graph], Dict[str, nx.Graph]]

graphein.protein.graphs.convert_structure_to_centroids(df: pandas.core.frame.DataFrame) pandas.core.frame.DataFrame[source]#

Overwrite existing (x, y, z) coordinates with centroids of the amino acids.

Parameters

df (pd.DataFrame) – Pandas Dataframe protein structure to convert into a dataframe of centroid positions.

Returns

pd.DataFrame with atoms/residues positions converted into centroid positions.

Return type

pd.DataFrame

graphein.protein.graphs.deprotonate_structure(df: pandas.core.frame.DataFrame) pandas.core.frame.DataFrame[source]#

Remove protons from PDB dataframe.

Parameters

df (pd.DataFrame) – Atomic dataframe.

Returns

Atomic dataframe with all atom_name == "H" removed.

Return type

pd.DataFrame

graphein.protein.graphs.filter_hetatms(df: pandas.core.frame.DataFrame, keep_hets: List[str]) List[pandas.core.frame.DataFrame][source]#

Return hetatms of interest.

Parameters
  • df (pd.DataFrame) – Protein Structure dataframe to filter hetatoms from.

  • keep_hets (List[str]) – List of hetero atom names to keep.

Returns

Protein structure dataframe with heteroatoms removed

Return type

pd.DataFrame

graphein.protein.graphs.initialise_graph_with_metadata(protein_df: pandas.core.frame.DataFrame, raw_pdb_df: pandas.core.frame.DataFrame, pdb_id: str, granularity: str) networkx.classes.graph.Graph[source]#

Initializes the nx Graph object with initial metadata.

Parameters
  • protein_df (pd.DataFrame) – Processed Dataframe of protein structure.

  • raw_pdb_df (pd.DataFrame) – Unprocessed dataframe of protein structure for comparison and traceability downstream.

  • pdb_id (str) – PDB Accession code.

  • granularity (str) – Granularity of the graph (eg "atom", "CA", "CB" etc or "centroid"). See: GRAPH_ATOMS and GRANULARITY_OPTS.

Returns

Returns initial protein structure graph with metadata.

Return type

nx.Graph

graphein.protein.graphs.number_groups_of_runs(list_of_values: List[Any]) List[str][source]#

Numbers groups of runs in a list of values.

E.g. ["A", "A", "B", "A", "A", "A", "B", "B"] -> ["A1", "A1", "B1", "A2", "A2", "A2", "B2", "B2"]

Parameters

list_of_values (List[Any]) – List of values to number.

Returns

List of numbered values.

Return type

List[str]
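The run-numbering behaviour in the example above can be reproduced with a short standard-library sketch. This is one possible way to get the documented output, not necessarily how Graphein implements it; the name `number_runs` is made up for the illustration.

```python
from collections import defaultdict
from itertools import groupby


def number_runs(values):
    """Label each consecutive run of equal values with a per-value run counter."""
    runs_seen = defaultdict(int)
    numbered = []
    # groupby yields one group per consecutive run of identical values.
    for value, run in groupby(values):
        runs_seen[value] += 1
        label = f"{value}{runs_seen[value]}"
        numbered.extend(label for _ in run)
    return numbered


print(number_runs(["A", "A", "B", "A", "A", "A", "B", "B"]))
# ['A1', 'A1', 'B1', 'A2', 'A2', 'A2', 'B2', 'B2']
```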

graphein.protein.graphs.process_dataframe(protein_df: pandas.core.frame.DataFrame, atom_df_processing_funcs: Optional[List[Callable]] = None, hetatom_df_processing_funcs: Optional[List[Callable]] = None, granularity: str = 'centroids', chain_selection: str = 'all', insertions: bool = False, deprotonate: bool = True, keep_hets: List[str] = [], verbose: bool = False) pandas.core.frame.DataFrame[source]#

Process ATOM and HETATM dataframes to produce singular dataframe used for graph construction.

Parameters
  • protein_df (pd.DataFrame) – Dataframe to process. Should be the object returned from read_pdb_to_dataframe().

  • atom_df_processing_funcs (List[Callable], optional) – List of functions to process dataframe. These must take in a dataframe and return a dataframe. Defaults to None.

  • hetatom_df_processing_funcs (List[Callable], optional) – List of functions to process the hetatom dataframe. These must take in a dataframe and return a dataframe

  • granularity (str) – The level of granularity for the graph. This determines the node definition. Acceptable values include: "centroids", "atoms", any of the atom_names in the PDB file (e.g. "CA", "CB", "OG", etc.). See: GRAPH_ATOMS and GRANULARITY_OPTS.

  • insertions (bool) – Whether or not to keep insertions.

  • deprotonate (bool) – Whether or not to remove hydrogen atoms (i.e. deprotonation).

  • keep_hets (List[str]) – Hetatoms to keep. Defaults to an empty list. To keep a hetatom, pass it inside a list of hetatom names to keep.

  • verbose (bool) – Verbosity level.

  • chain_selection (str) – Which protein chain to select. Defaults to "all". Eg can use "ACF" to select 3 chains (A, C & F)

Returns

A protein dataframe that can be consumed by other graph construction functions.

Return type

pd.DataFrame

graphein.protein.graphs.read_pdb_to_dataframe(pdb_path: Optional[str] = None, pdb_code: Optional[str] = None, verbose: bool = False, granularity: str = 'CA') pandas.core.frame.DataFrame[source]#

Reads PDB file to PandasPDB object.

Returns atomic_df, which is a dataframe enumerating all atoms and their cartesian coordinates in 3D space. Also contains associated metadata from the PDB file.

Parameters
  • pdb_path (str, optional) – path to PDB file. Defaults to None.

  • pdb_code (str, optional) – 4-character PDB accession. Defaults to None.

  • verbose (bool) – print dataframe?

  • granularity (str) – Specifies granularity of dataframe. See ProteinGraphConfig for further details.

Returns

pd.DataFrame containing protein structure

Return type

pd.DataFrame

graphein.protein.graphs.remove_insertions(df: pandas.core.frame.DataFrame, keep: str = 'first') pandas.core.frame.DataFrame[source]#

This function removes insertions from PDB dataframes.

Parameters
  • df (pd.DataFrame) – Protein Structure dataframe to remove insertions from.

  • keep (str) – Specifies which insertion to keep. Options are "first" or "last". Default is "first"

Returns

Protein structure dataframe with insertions removed

Return type

pd.DataFrame

graphein.protein.graphs.select_chains(protein_df: pandas.core.frame.DataFrame, chain_selection: str, verbose: bool = False) pandas.core.frame.DataFrame[source]#

Extracts relevant chains from protein_df.

Parameters
  • protein_df (pd.DataFrame) – pandas dataframe of PDB subsetted to relevant atoms (CA, CB).

  • chain_selection (str) – Specifies chains that should be extracted from the larger complexed structure.

  • verbose (bool) – Print dataframe?

Returns

Protein structure dataframe containing only entries in the chain selection.

Return type

pd.DataFrame

graphein.protein.graphs.subset_structure_to_atom_type(df: pandas.core.frame.DataFrame, granularity: str) pandas.core.frame.DataFrame[source]#

Return a subset of atomic dataframe that contains only certain atom names.

Parameters

df (pd.DataFrame) – Protein Structure dataframe to subset.

Returns

Subsetted protein structure dataframe.

Return type

pd.DataFrame

Edges#

Distance#

Functions for computing biochemical edges of graphs.

graphein.protein.edges.distance.add_aromatic_interactions(G: networkx.classes.graph.Graph, pdb_df: Optional[pandas.core.frame.DataFrame] = None)[source]#

Find all aromatic-aromatic interactions.

Criteria: phenyl ring centroids separated by between 4.5A and 7A. Phenyl rings are present on PHE, TRP, HIS, TYR (AROMATIC_RESIS). Phenyl ring atoms on these amino acids are defined by the following atoms:

  • PHE: CG, CD, CE, CZ

  • TRP: CD, CE, CH, CZ

  • HIS: CG, CD, ND, NE, CE

  • TYR: CG, CD, CE, CZ

Centroids of these atoms are computed by taking (mean x), (mean y), (mean z) for each of the ring atoms.

Notes for future self/developers:

  • Because of the requirement to pre-compute ring centroids, we do not use the functions written above (filter_dataframe, compute_distmat, get_interacting_atoms), as they do not return centroid atom euclidean coordinates.
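The aromatic criterion can be illustrated with a small standalone sketch: take the per-axis mean of each ring's atoms, then test the centroid separation against the 4.5A to 7A window. The function names and the choice of inclusive bounds are assumptions made for this example, not details confirmed by the text above.

```python
from math import dist
from statistics import fmean


def ring_centroid(ring_atoms):
    """Return (mean x, mean y, mean z) for a list of (x, y, z) ring atoms."""
    return tuple(fmean(axis) for axis in zip(*ring_atoms))


def is_aromatic_contact(ring_a, ring_b, lower=4.5, upper=7.0):
    """True if the two ring centroids lie between `lower` and `upper` angstroms apart.

    Inclusive bounds are an assumption of this sketch.
    """
    separation = dist(ring_centroid(ring_a), ring_centroid(ring_b))
    return lower <= separation <= upper
```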

graphein.protein.edges.distance.add_aromatic_sulphur_interactions(G: networkx.classes.graph.Graph, rgroup_df: Optional[pandas.core.frame.DataFrame] = None)[source]#

Find all aromatic-sulphur interactions.

graphein.protein.edges.distance.add_cation_pi_interactions(G: networkx.classes.graph.Graph, rgroup_df: Optional[pandas.core.frame.DataFrame] = None)[source]#

Add cation-pi interactions.

graphein.protein.edges.distance.add_delaunay_triangulation(G: networkx.classes.graph.Graph, allowable_nodes: Optional[List[str]] = None)[source]#

Compute the Delaunay triangulation of the protein structure.

This has been used in prior work. References:

Harrison, R. W., Yu, X. & Weber, I. T. Using triangulation to include target structure improves drug resistance prediction accuracy. in 1–1 (IEEE, 2013). doi:10.1109/ICCABS.2013.6629236

Yu, X., Weber, I. T. & Harrison, R. W. Prediction of HIV drug resistance from genotype with encoded three-dimensional protein structure. BMC Genomics 15 Suppl 5, S1 (2014).

Notes:

  1. We do not use the add_interacting_resis function, because this interaction is computed on the CA atoms. Therefore, there is code duplication. For now, I have chosen to leave this code duplication in.

Parameters
  • G (nx.Graph) – The networkx graph to add the triangulation to.

  • allowable_nodes (List[str], optional) – The nodes to include in the triangulation. If None (default), no filtering is done. This parameter is used to filter out nodes that are not desired in the triangulation. Eg if you wanted to construct a delaunay triangulation of the CA atoms of an atomic graph.

graphein.protein.edges.distance.add_distance_threshold(G: networkx.classes.graph.Graph, long_interaction_threshold: int, threshold: float = 5.0)[source]#

Adds edges to any nodes within a given distance of each other. Long interaction threshold is used to specify minimum separation in sequence to add an edge between networkx nodes within the distance threshold

Parameters
  • G (nx.Graph) – Protein Structure graph to add distance edges to

  • long_interaction_threshold (int) – minimum distance in sequence for two nodes to be connected

  • threshold (float) – Distance in angstroms, below which two nodes are connected

Returns

Graph with distance-based edges added

graphein.protein.edges.distance.add_disulfide_interactions(G: networkx.classes.graph.Graph, rgroup_df: Optional[pandas.core.frame.DataFrame] = None)[source]#

Find all disulfide interactions between CYS residues (DISULFIDE_RESIS, DISULFIDE_ATOMS).

Criteria: sulfur atom pairs are within 2.2A of each other.

Parameters
  • G (nx.Graph) – networkx protein graph

  • rgroup_df (pd.DataFrame, optional) – pd.DataFrame containing rgroup data, defaults to None, which retrieves the df from the provided nx graph.

graphein.protein.edges.distance.add_hydrogen_bond_interactions(G: networkx.classes.graph.Graph, rgroup_df: Optional[pandas.core.frame.DataFrame] = None)[source]#

Add all hydrogen-bond interactions.

graphein.protein.edges.distance.add_hydrophobic_interactions(G: networkx.classes.graph.Graph, rgroup_df: Optional[pandas.core.frame.DataFrame] = None)[source]#

Find all hydrophobic interactions.

Performs searches between the following residues: [ALA, VAL, LEU, ILE, MET, PHE, TRP, PRO, TYR] (HYDROPHOBIC_RESIS).

Criteria: R-group residues are within 5A distance.

Parameters
  • G (nx.Graph) – nx.Graph to add hydrophobic interactions to.

  • rgroup_df (pd.DataFrame, optional) – Optional dataframe of R-group atoms.

graphein.protein.edges.distance.add_interacting_resis(G: networkx.classes.graph.Graph, interacting_atoms: numpy.ndarray, dataframe: pandas.core.frame.DataFrame, kind: List[str])[source]#

Add interacting residues to graph.

Returns a list of 2-tuples indicating the interacting residues based on the interacting atoms. This is most typically called after the get_interacting_atoms function above.

Also filters out the list such that the residues have to be at least two apart.

Parameters
  • interacting_atoms: (numpy array) result from get_interacting_atoms function.

  • dataframe: (pandas dataframe) a pandas dataframe that houses the euclidean locations of each atom.

  • kind: (list) the kind of interaction. Contains one of: hydrophobic, disulfide, hbond, ionic, aromatic, aromatic_sulphur, cation_pi, delaunay.

Returns

  • filtered_interacting_resis: (set of tuples) the residues that are in an interaction, with the interaction kind specified.

graphein.protein.edges.distance.add_ionic_interactions(G: networkx.classes.graph.Graph, rgroup_df: Optional[pandas.core.frame.DataFrame] = None)[source]#

Find all ionic interactions.

Criteria: [ARG, LYS, HIS, ASP, and GLU] (IONIC_RESIS) residues are within 6A. We also check for opposing charges (POS_AA, NEG_AA)
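The two-part ionic test (distance cutoff plus opposing charges) can be sketched as below. The charge sets use the standard biochemical assignment (ARG, LYS, HIS positive; ASP, GLU negative); whether they match the POS_AA and NEG_AA constants exactly is assumed here, as are the function name and the inclusive cutoff.

```python
# Standard side-chain charge assignment; assumed to correspond to POS_AA / NEG_AA.
POSITIVE = {"ARG", "LYS", "HIS"}
NEGATIVE = {"ASP", "GLU"}


def is_ionic_pair(res_a, res_b, distance, cutoff=6.0):
    """True if the residues carry opposite charges and lie within `cutoff` angstroms."""
    opposite_charges = (
        (res_a in POSITIVE and res_b in NEGATIVE)
        or (res_a in NEGATIVE and res_b in POSITIVE)
    )
    return opposite_charges and distance <= cutoff
```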

graphein.protein.edges.distance.add_k_nn_edges(G: networkx.classes.graph.Graph, long_interaction_threshold: int, k: int = 5, mode: str = 'connectivity', metric: str = 'minkowski', p: int = 2, include_self: Union[bool, str] = False)[source]#

Adds edges to nodes based on K nearest neighbours. Long interaction threshold is used to specify minimum separation in sequence to add an edge between networkx nodes within the distance threshold

Parameters
  • G (nx.Graph) – Protein Structure graph to add distance edges to

  • long_interaction_threshold (int) – minimum distance in sequence for two nodes to be connected

  • k (int) – Number of neighbors for each sample.

  • mode (str) – Type of returned matrix: "connectivity" will return the connectivity matrix with ones and zeros, and "distance" will return the distances between neighbors according to the given metric.

  • metric (str) – The distance metric used to calculate the k-Neighbors for each sample point. The DistanceMetric class gives a list of available metrics. The default distance is "euclidean" ("minkowski" metric with the p param equal to 2).

  • p (int) – Power parameter for the Minkowski metric. When p = 1, this is equivalent to using manhattan_distance (l1), and euclidean_distance (l2) for p = 2. For arbitrary p, minkowski_distance (l_p) is used. Default is 2 (euclidean).

  • include_self (Union[bool, str]) – Whether or not to mark each sample as the first nearest neighbor to itself. If "auto", then True is used for mode="connectivity" and False for mode="distance". Default is False.

Returns

Graph with knn-based edges added

Return type

nx.Graph
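To show what the `k` and `p` parameters control, the sketch below finds each node's k nearest neighbours under a Minkowski metric in pure Python. It is a simplified stand-in rather than the library's approach: it ignores `long_interaction_threshold`, `mode` and `include_self`, and the function names are hypothetical.

```python
def minkowski(a, b, p=2):
    """Minkowski distance: p=1 is Manhattan, p=2 is Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


def knn_edges(coords, k=5, p=2):
    """Return sorted undirected edges linking each node to its k nearest neighbours.

    `coords` maps node ID -> (x, y, z). A node is never its own neighbour here.
    """
    edges = set()
    for node, pos in coords.items():
        others = [other for other in coords if other != node]
        others.sort(key=lambda other: minkowski(pos, coords[other], p))
        for neighbour in others[:k]:
            # Store each pair in sorted order so (u, v) and (v, u) coincide.
            edges.add(tuple(sorted((node, neighbour))))
    return sorted(edges)
```

Because neighbour relations are not symmetric, a node can end up with more than k edges once the pairs are made undirected.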

graphein.protein.edges.distance.add_peptide_bonds(G: networkx.classes.graph.Graph) networkx.classes.graph.Graph[source]#

Adds peptide backbone as edges to residues in each chain.

Parameters

G (nx.Graph) – networkx protein graph.

Returns

networkx protein graph with added peptide bonds.

Return type

nx.Graph

graphein.protein.edges.distance.add_sequence_distance_edges(G: networkx.classes.graph.Graph, d: int, name: str = 'sequence_edge') networkx.classes.graph.Graph[source]#

Adds edges based on sequence distance to residues in each chain.

Eg. if d=6 then we join: nodes (1,7), (2,8), (3,9).. based on their sequence number.

Parameters
  • G (nx.Graph) – networkx protein graph.

  • d (int) – Sequence separation to add edges on.

  • name (str) – Name of the edge type. Defaults to "sequence_edge".

Returns

networkx protein graph with added sequence-distance edges.

Return type

nx.Graph
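The d=6 example above, joining (1,7), (2,8), (3,9), can be reproduced with a short sketch that pairs residues whose sequence numbers differ by exactly d within each chain. The function name and the dict-of-chains input are assumptions for illustration; the documented function operates on an nx.Graph.

```python
def sequence_distance_edges(residue_numbers_by_chain, d):
    """Return edges between residues exactly `d` apart in sequence, per chain.

    `residue_numbers_by_chain` maps chain ID -> iterable of residue numbers.
    Nodes are represented as (chain_id, residue_number) tuples in this sketch.
    """
    edges = []
    for chain, numbers in residue_numbers_by_chain.items():
        present = set(numbers)
        for number in sorted(present):
            # Only join residues that both exist, so gaps in numbering are skipped.
            if number + d in present:
                edges.append(((chain, number), (chain, number + d)))
    return edges
```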

graphein.protein.edges.distance.compute_distmat(pdb_df: pandas.core.frame.DataFrame) pandas.core.frame.DataFrame[source]#

Compute pairwise euclidean distances between every atom.

Design choice: passed in a DataFrame to enable easier testing on dummy data.

Parameters

pdb_df (pd.DataFrame) – pd.Dataframe containing protein structure. Must contain columns ["x_coord", "y_coord", "z_coord"]

Returns

pd.Dataframe of euclidean distance matrix

Return type

pd.DataFrame
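For readers who want to see the pairwise computation spelled out, here is a dependency-free sketch that builds the full Euclidean distance matrix as nested lists. It only illustrates the idea: the documented function takes and returns pd.DataFrame objects, and the name `pairwise_distances` is invented for this example.

```python
from math import dist


def pairwise_distances(coords):
    """Return a square matrix where entry [i][j] is the distance between points i and j.

    `coords` is a list of (x, y, z) tuples, standing in for the
    x_coord / y_coord / z_coord columns.
    """
    return [[dist(a, b) for b in coords] for a in coords]
```

The matrix is symmetric with zeros on the diagonal, so a radius query only needs to inspect one triangle of it.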

graphein.protein.edges.distance.get_edges_by_bond_type(G: networkx.classes.graph.Graph, bond_type: str) List[Tuple[str, str]][source]#

Return edges of a particular bond type.

Parameters

  • bond_type: (str) one of the elements in the variable BOND_TYPES

Returns

  • resis: (list) a list of tuples, where each tuple is an edge.

graphein.protein.edges.distance.get_interacting_atoms(angstroms: float, distmat: pandas.core.frame.DataFrame)[source]#

Find the atoms that are within a particular radius of one another.

graphein.protein.edges.distance.get_ring_atoms(dataframe: pandas.core.frame.DataFrame, aa: str) pandas.core.frame.DataFrame[source]#

Return ring atoms from a dataframe.

A helper function for add_aromatic_interactions.

Gets the ring atoms from the particular aromatic amino acid.

Parameters

  • dataframe: the dataframe containing the atom records.

  • aa: the amino acid of interest, passed in as 3-letter string.

Returns

  • dataframe: a filtered dataframe containing just those atoms from the particular amino acid selected. e.g. equivalent to selecting just the ring atoms from a particular amino acid.

graphein.protein.edges.distance.get_ring_centroids(ring_atom_df: pandas.core.frame.DataFrame) pandas.core.frame.DataFrame[source]#

Return aromatic ring centroids.

A helper function for add_aromatic_interactions.

Computes the ring centroids for a particular amino acid’s ring atoms.

Ring centroids are computed by taking the mean of the x, y, and z coordinates.

Parameters

  • ring_atom_df: a dataframe computed using get_ring_atoms.

  • aa: the amino acid under study

Returns

  • centroid_df: a dataframe containing just the centroid coordinates of the ring atoms of each residue.

graphein.protein.edges.distance.node_coords(G: networkx.classes.graph.Graph, n: str) Tuple[float, float, float][source]#

Return the x, y, z coordinates of a node. This is a helper function. Simplifies the code.

Parameters
  • G (nx.Graph) – nx.Graph protein structure graph to extract coordinates from

  • n (str) – str node ID in graph to extract coordinates from

Returns

Tuple of coordinates (x, y, z)

Return type

Tuple[float, float, float]

Intramolecular#

Featurization functions for graph edges.

graphein.protein.edges.intramolecular.add_contacts_edge(G: networkx.classes.graph.Graph, interaction_type: str) networkx.classes.graph.Graph[source]#

Adds specific interaction types to the protein graph.

Parameters
  • G (nx.Graph) – networkx protein graph

  • interaction_type (str) – interaction type to be added

Returns

nx.Graph with specified interaction-based edges added.

Return type

nx.Graph

graphein.protein.edges.intramolecular.get_contacts_df(config: GetContactsConfig, pdb_name: str) pd.DataFrame[source]#

Reads a GetContacts file and returns it as a pd.DataFrame.

Parameters
  • config (GetContactsConfig) – GetContactsConfig object

  • pdb_name (str) – Name of PDB file. Contacts files are named {pdb_name}_contacts.tsv

Returns

DataFrame of parsed GetContacts output

Return type

pd.DataFrame

graphein.protein.edges.intramolecular.hydrogen_bond(G: networkx.classes.graph.Graph) networkx.classes.graph.Graph[source]#

Adds hydrogen bonds to protein structure graph

Parameters

G (nx.Graph) – nx.Graph to add hydrogen bonds to

Returns

nx.Graph with hydrogen bonds added

Return type

nx.Graph

graphein.protein.edges.intramolecular.hydrophobic(G: networkx.classes.graph.Graph) networkx.classes.graph.Graph[source]#

Adds hydrophobic interactions to protein structure graph

Parameters

G (nx.Graph) – nx.Graph to add hydrophobic interaction edges to

Returns

nx.Graph with hydrophobic interactions added

Return type

nx.Graph

graphein.protein.edges.intramolecular.peptide_bonds(G: networkx.classes.graph.Graph) networkx.classes.graph.Graph[source]#

Adds peptide backbone to residues in each chain

Parameters

G (nx.Graph) – nx.Graph protein graph

Returns

nx.Graph protein graph with added peptide bonds

Return type

nx.Graph

graphein.protein.edges.intramolecular.pi_cation(G: networkx.classes.graph.Graph) networkx.classes.graph.Graph[source]#

Adds pi-cation interactions to protein structure graph

Parameters

G (nx.Graph) – nx.Graph to add pi-cation interactions to

Returns

nx.Graph with pi-cation interactions added

Return type

nx.Graph

graphein.protein.edges.intramolecular.pi_stacking(G: networkx.classes.graph.Graph) networkx.classes.graph.Graph[source]#

Adds pi-stacking interactions to protein structure graph

Parameters

G (nx.Graph) – nx.Graph to add pi-stacking interactions to

Returns

nx.Graph with pi-stacking interactions added

Return type

nx.Graph

graphein.protein.edges.intramolecular.read_contacts_file(config: GetContactsConfig, contacts_file: str) pd.DataFrame[source]#

Parses GetContacts file to an edgelist (pd.DataFrame)

Parameters
  • config (GetContactsConfig) – GetContactsConfig object (graphein.protein.config.GetContactsConfig)

  • contacts_file (str) – file name of contacts file

Returns

Pandas Dataframe of edge list

Return type

pd.DataFrame
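As a rough sketch of what parsing a contacts file into an edge list involves, the example below uses only the standard library. It is not graphein's implementation. It assumes that a GetContacts output file is tab-separated, that header lines start with `#`, and that each data row holds a frame index, an interaction type, and two atom identifiers; the sample content is invented. The file name follows the {pdb_name}_contacts.tsv convention documented for get_contacts_df.

```python
import csv
import io
from pathlib import Path


def contacts_path(contacts_dir: Path, pdb_name: str) -> Path:
    """Build the file path following the {pdb_name}_contacts.tsv naming."""
    return contacts_dir / f"{pdb_name}_contacts.tsv"


def parse_contacts(handle):
    """Parse tab-separated contact rows into (atom_1, atom_2, interaction_type) edges."""
    edges = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and header comments
        _frame, interaction_type, atom_1, atom_2 = row[:4]
        edges.append((atom_1, atom_2, interaction_type))
    return edges


# Invented sample content, for illustration only (assumed column layout).
sample = io.StringIO(
    "# total_frames:1 interaction_types:sb,hbbb\n"
    "# Columns: frame, interaction_type, atom_1, atom_2\n"
    "0\tsb\tA:ARG:76:NH2\tA:GLU:80:OE1\n"
    "0\thbbb\tA:ALA:5:N\tA:LEU:9:O\n"
)
edges = parse_contacts(sample)
```

In practice the handle would come from opening `contacts_path(config.contacts_dir, pdb_name)`, where `contacts_dir` is the GetContactsConfig field described earlier.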

graphein.protein.edges.intramolecular.run_get_contacts(config: GetContactsConfig, pdb_id: Optional[str] = None, file_name: Optional[str] = None)[source]#

Runs GetContacts on a protein structure. If no file_name is provided, a PDB file is downloaded for the pdb_id

Parameters
  • config (graphein.protein.config.GetContactsConfig) – GetContactsConfig object containing GetContacts parameters

  • pdb_id (str, optional) – 4-character PDB accession code

  • file_name (str, optional) – PDB_name file to use, if annotations to be retrieved from the PDB

graphein.protein.edges.intramolecular.salt_bridge(G: networkx.classes.graph.Graph) networkx.classes.graph.Graph[source]#

Adds salt bridges to protein structure graph

Parameters

G (nx.Graph) – nx.Graph to add salt bridges to

Returns

nx.Graph with salt bridges added

Return type

nx.Graph

graphein.protein.edges.intramolecular.t_stacking(G: networkx.classes.graph.Graph) networkx.classes.graph.Graph[source]#

Adds t-stacking interactions to protein structure graph

Parameters

G (nx.Graph) – nx.Graph to add t-stacking interactions to

Returns

nx.Graph with t-stacking interactions added

Return type

nx.Graph

graphein.protein.edges.intramolecular.van_der_waals(G: networkx.classes.graph.Graph) networkx.classes.graph.Graph[source]#

Adds van der Waals interactions to protein structure graph

Parameters

G (nx.Graph) – nx.Graph to add van der Waals interactions to

Returns

nx.Graph with van der Waals interactions added

Return type

nx.Graph

Atomic#

Functions for computing atomic structure of proteins.

graphein.protein.edges.atomic.add_atomic_edges(G: networkx.classes.graph.Graph, tolerance: float = 0.56) networkx.classes.graph.Graph[source]#

Computes covalent edges based on atomic distances. Covalent radii are assigned to each atom based on its bond state (see assign_bond_states_to_dataframe). The distance matrix is then thresholded to entries less than this distance plus some tolerance to create an adjacency matrix. This adjacency matrix is then parsed into an edge list and covalent edges are added.

Parameters
  • G (nx.Graph) – Atomic graph (nodes correspond to atoms) to populate with atomic bonds as edges

  • tolerance (float) – Tolerance for atomic distance. Default is 0.56 Angstroms. Commonly used values are: 0.4, 0.45, 0.56

Returns

Atomic graph with edges between bonded atoms added

Return type

nx.Graph
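The thresholding step can be illustrated with a small standard-library sketch. It assumes that the distance being compared is the sum of the two atoms' covalent radii, which is an interpretation of the description above rather than a confirmed detail of graphein. The atom names, radii, and coordinates are hypothetical.

```python
from itertools import combinations
from math import dist

# Hypothetical atoms: (name, covalent radius in Angstroms, xyz coordinates).
# All values are illustrative only.
atoms = [
    ("C1", 0.77, (0.0, 0.0, 0.0)),
    ("C2", 0.77, (1.5, 0.0, 0.0)),
    ("O1", 0.67, (5.0, 0.0, 0.0)),
]


def covalent_edges(atoms, tolerance=0.56):
    """Connect atom pairs closer than the sum of their radii plus a tolerance."""
    edges = []
    for (name_a, r_a, xyz_a), (name_b, r_b, xyz_b) in combinations(atoms, 2):
        # Assumed rule: bonded if distance < r_a + r_b + tolerance.
        if dist(xyz_a, xyz_b) < r_a + r_b + tolerance:
            edges.append((name_a, name_b))
    return edges


bonded = covalent_edges(atoms)
```

A larger tolerance admits longer contacts as bonds, which is why the choice among 0.4, 0.45, and 0.56 affects the resulting edge set.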

graphein.protein.edges.atomic.add_bond_order(G: networkx.classes.graph.Graph) networkx.classes.graph.Graph[source]#

Assign bond orders to the covalent bond edges between atoms on the basis of bond length. Values are taken from:

Automatic Assignment of Chemical Connectivity to Organic Molecules in the Cambridge Structural Database. Jon C. Baber and Edward E. Hodgkin

Parameters

G (nx.Graph) – Atomic-level protein graph with covalent edges.

Returns

Atomic-level protein graph with covalent edges annotated with putative bond order.

Return type

nx.Graph

graphein.protein.edges.atomic.add_ring_status(G: networkx.classes.graph.Graph) networkx.classes.graph.Graph[source]#

Identifies rings in the atomic graph. Assigns the edge attribute "RING" to edges in the ring. We do not distinguish between aromatic and non-aromatic rings. Functions by identifying all cycles in the graph.

Parameters

G (nx.Graph) – Atom-level protein structure graph to add ring edge types to

Returns

Atom-level protein structure graph with added "RING" edge attribute

Return type

nx.Graph
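One way to find the edges that belong to a ring is sketched below with the standard library. It relies on the fact that an edge lies on a cycle exactly when its two endpoints remain connected after the edge itself is ignored. This is an illustrative approach, not necessarily how graphein identifies cycles, and the atom names form a hypothetical proline-like five-membered ring.

```python
from collections import deque


def ring_edges(edges):
    """Return the edges that lie on at least one cycle of an undirected graph."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)

    def reachable_without(u, v):
        # Breadth-first search from u to v that ignores the direct u-v edge.
        seen, queue = {u}, deque([u])
        while queue:
            node = queue.popleft()
            for nxt in adjacency[node]:
                if {node, nxt} == {u, v} or nxt in seen:
                    continue
                if nxt == v:
                    return True
                seen.add(nxt)
                queue.append(nxt)
        return False

    return {(u, v) for u, v in edges if reachable_without(u, v)}


# Hypothetical covalent edges: a five-membered ring plus one edge outside it.
bonds = [("N", "CA"), ("CA", "CB"), ("CB", "CG"), ("CG", "CD"), ("CD", "N"), ("CA", "C")]
in_ring = ring_edges(bonds)
```

The edges in `in_ring` are the ones that would receive a "RING" attribute; the ("CA", "C") edge is excluded because removing it disconnects its endpoints.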

graphein.protein.edges.atomic.assign_bond_states_to_dataframe(df: pandas.core.frame.DataFrame) pandas.core.frame.DataFrame[source]#

Takes a PandasPDB atom dataframe and assigns bond states to each atom based on:

Atomic Structures of all the Twenty Essential Amino Acids and a Tripeptide, with Bond Lengths as Sums of Atomic Covalent Radii Heyrovska, 2008

First, maps atoms to their standard bond states (DEFAULT_BOND_STATE). Second, maps non-standard bonds states (RESIDUE_ATOM_BOND_STATE). Fills NaNs with standard bond states.

Parameters

df (pd.DataFrame) – Pandas PDB dataframe

Returns

Dataframe with added atom_bond_state column

Return type

pd.DataFrame
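The two-stage mapping can be sketched with plain dictionaries. The tables and state labels below are hypothetical stand-ins; the real DEFAULT_BOND_STATE and RESIDUE_ATOM_BOND_STATE mappings in graphein will differ in content.

```python
# Hypothetical lookup tables, invented for illustration.
default_bond_state = {"N": "Nsb", "CA": "Csb", "C": "Cdb", "O": "Odb"}
residue_atom_bond_state = {"PHE": {"CG": "Cres", "CD1": "Cres"}}


def assign_bond_state(residue_name, atom_name):
    """Prefer a residue-specific state, falling back to the atom's standard state."""
    specific = residue_atom_bond_state.get(residue_name, {}).get(atom_name)
    if specific is not None:
        return specific
    # No residue-specific entry: use the standard state (None if unknown).
    return default_bond_state.get(atom_name)


states = [
    assign_bond_state("PHE", "CG"),  # residue-specific entry
    assign_bond_state("PHE", "CA"),  # falls back to the standard state
]
```

The end result matches the order described above: residue-specific states win where they exist, and every other atom keeps its standard state.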

graphein.protein.edges.atomic.assign_covalent_radii_to_dataframe(df: pandas.core.frame.DataFrame) pandas.core.frame.DataFrame[source]#

Assigns covalent radius (COVALENT_RADII) to each atom based on its bond state. Adds a covalent_radius column. Using values from:

Atomic Structures of all the Twenty Essential Amino Acids and a Tripeptide, with Bond Lengths as Sums of Atomic Covalent Radii Heyrovska, 2008

Parameters

df (pd.DataFrame) – Pandas PDB dataframe with a bond_states_column

Returns

Pandas PDB dataframe with added covalent_radius column

Return type

pd.DataFrame

graphein.protein.edges.atomic.identify_bond_type_from_mapping(G: networkx.classes.graph.Graph, u: str, v: str, a: Dict[str, Any], query: str)[source]#

Compares the bond length between two atoms in the graph, and the relevant experimental value by performing a lookup against the watershed values in:

Automatic Assignment of Chemical Connectivity to Organic Molecules in the Cambridge Structural Database. Jon C. Baber and Edward E. Hodgkin

Bond orders are assigned in the order triple < double < single. For example, if a bond is shorter than the triple bond watershed (w_dt), it is assigned as a triple bond. Similarly, if a bond is longer than this but shorter than the double bond watershed (w_sd), it is assigned double bond status.

Parameters
  • G (nx.Graph) – nx.Graph of atom-protein structure with atomic edges added

  • u (str) – node 1 in edge

  • v (str) – node 2 in edge

  • a (Dict[str, Any]) – edge data

  • query (str) – "ELEMENTX-ELEMENTY" to perform lookup with (e.g. "C-O", "N-N")

Returns

Graph with atomic edge bond order assigned

Return type

nx.Graph
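The watershed comparison can be sketched as a simple cascade of thresholds. The watershed distances below are placeholder values chosen for illustration, not the published ones, and `bond_order` is a hypothetical helper rather than part of graphein.

```python
# Hypothetical watershed distances in Angstroms; not the published values.
watersheds = {"C-C": {"w_dt": 1.25, "w_sd": 1.42}}


def bond_order(length, query):
    """Classify a bond by comparing its length with the two watershed values."""
    w = watersheds[query]
    if length < w["w_dt"]:
        return "triple"  # shorter than the double/triple watershed
    if length < w["w_sd"]:
        return "double"  # between the two watersheds
    return "single"      # longer than the single/double watershed


orders = [bond_order(length, "C-C") for length in (1.20, 1.34, 1.54)]
```

Checking the shortest threshold first is what gives the triple < double < single ordering.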

Features#

Node#

graphein.protein.features.nodes.aaindex.aaindex1(G: networkx.classes.graph.Graph, accession: str) networkx.classes.graph.Graph[source]#

Adds AAIndex1 datavalues for a given accession as node features.

Parameters
  • G (nx.Graph) – nx.Graph protein structure graph to featurise

  • accession (str) – AAIndex1 accession code for values to use

Returns

Protein Structure graph with AAindex1 node features added

Return type

nx.Graph

graphein.protein.features.nodes.aaindex.fetch_AAIndex(accession: str) Tuple[str, Dict[str, float]][source]#

Fetches AAindex1 dictionary from an accession code. The dictionary maps one-letter AA codes to float values

Parameters

accession (str) – AAIndex1 accession code

Returns

tuple of record title (str) and dictionary of AA:value mappings

Return type

Tuple[str, Dict[str, float]]

Featurization functions for amino acids.

graphein.protein.features.nodes.amino_acid.amino_acid_one_hot(n, d: Dict[str, Any], return_array: bool = True, allowable_set: Optional[List[str]] = None) Union[pandas.core.series.Series, numpy.ndarray][source]#

Adds a one-hot encoding of amino acid types as a node attribute.

Parameters
  • n (str) – node name, this is unused and only included for compatibility with the other functions

  • d (Dict[str, Any]) – Node data

  • return_array (bool) – If True, returns a numpy array of one-hot encoding, otherwise returns a pd.Series. Default is True.

  • allowable_set – Specifies vocabulary of amino acids. Default is None (which uses graphein.protein.resi_atoms.STANDARD_AMINO_ACIDS).

Returns

One-hot encoding of amino acid types

Return type

Union[pd.Series, np.ndarray]
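A one-hot encoding over a residue vocabulary can be sketched as below. The vocabulary order is an assumption; graphein's STANDARD_AMINO_ACIDS may be ordered or spelled differently, and `one_hot` is a hypothetical helper that returns a plain list rather than an array or Series.

```python
# Hypothetical vocabulary order for the 20 standard amino acids (3-letter codes).
amino_acids = [
    "ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY", "HIS", "ILE",
    "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL",
]


def one_hot(residue_name, allowable_set=None):
    """Return a 0/1 list with a single 1 at the residue's vocabulary position."""
    vocabulary = allowable_set if allowable_set is not None else amino_acids
    return [int(residue_name == aa) for aa in vocabulary]


encoding = one_hot("GLY")
```

A residue that is not in the vocabulary yields an all-zero vector in this sketch; passing a custom `allowable_set` changes both the length and the meaning of each position.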

graphein.protein.features.nodes.amino_acid.expasy_protein_scale(n, d, selection: Optional[List[str]] = None, add_separate: bool = False, return_array: bool = False) Union[pandas.core.series.Series, numpy.ndarray][source]#

Return amino acid features that come from the EXPASY protein scale.

Source: https://web.expasy.org/protscale/

Parameters
  • n – Node in a NetworkX graph

  • d – NetworkX node attributes.

  • selection (List[str], optional) – List of columns to select. Viewable in graphein.protein.features.nodes.meiler_embeddings

  • add_separate – Whether or not to add the expasy features as individual entries or as a series.

  • return_array (bool) – Bool indicating whether or not to return a np.ndarray of the features. Default is False, which returns a pd.Series.

Returns

pd.Series of amino acid features

Return type

pd.Series

graphein.protein.features.nodes.amino_acid.hydrogen_bond_acceptor(n, d, sum_features: bool = True, return_array: bool = False) pandas.core.series.Series[source]#

Adds Hydrogen Bond Acceptor status to nodes as a feature.

Parameters
  • n (str) – node id

  • d (Dict[str, Any]) – Dict of node attributes

  • sum_features (bool) – If True, the feature is the number of hydrogen bond acceptors per node. If False, the feature is a boolean indicating whether or not the node has a hydrogen bond acceptor. Default is True.

  • return_array (bool) – If True, returns a np.ndarray, otherwise returns a pd.Series. Default is False.

graphein.protein.features.nodes.amino_acid.hydrogen_bond_donor(n: str, d: Dict[str, Any], sum_features: bool = True, return_array: bool = False) pandas.core.series.Series[source]#

Adds Hydrogen Bond Donor status to nodes as a feature.

Parameters
  • n (str) – node id

  • d (Dict[str, Any]) – Dict of node attributes

  • sum_features (bool) – If True, the feature is the number of hydrogen bond donors per node. If False, the feature is a boolean indicating whether or not the node has a hydrogen bond donor. Default is True.

  • return_array (bool) – If True, returns a np.ndarray, otherwise returns a pd.Series. Default is False.

graphein.protein.features.nodes.amino_acid.load_expasy_scales() pandas.core.frame.DataFrame[source]#

Load pre-downloaded EXPASY scales.

This helps with node featurization.

The function is LRU-cached in memory for fast access on each function call.

Returns

pd.DataFrame containing expasy scales

Return type

pd.DataFrame

graphein.protein.features.nodes.amino_acid.load_meiler_embeddings() pandas.core.frame.DataFrame[source]#

Load pre-downloaded Meiler embeddings.

This helps with node featurization.

The function is LRU-cached in memory for fast access on each function call.

Returns

pd.DataFrame containing Meiler Embeddings from Meiler et al. 2001

Return type

pd.DataFrame

graphein.protein.features.nodes.amino_acid.meiler_embedding(n, d, return_array: bool = False) Union[pandas.core.series.Series, numpy.array][source]#

Return amino acid features from reduced dimensional embeddings of amino acid physicochemical properties.

Source: https://link.springer.com/article/10.1007/s008940100038 doi: https://doi.org/10.1007/s008940100038

Parameters
  • n – Node in a NetworkX graph

  • d – NetworkX node attributes.

Returns

pd.Series of amino acid features

Return type

pd.Series

Featurization functions for graph nodes using DSSP predicted features.

graphein.protein.features.nodes.dssp.add_dssp_df(G: nx.Graph, dssp_config: Optional[DSSPConfig]) nx.Graph[source]#

Construct DSSP dataframe and add as graph level variable to protein graph

Parameters
  • G (nx.Graph) – Input protein graph

  • dssp_config (DSSPConfig, optional) – DSSPConfig object. Specifies which executable to run. Located in graphein.protein.config

Returns

Protein graph with DSSP dataframe added

Return type

nx.Graph

graphein.protein.features.nodes.dssp.add_dssp_feature(G: networkx.classes.graph.Graph, feature: str) networkx.classes.graph.Graph[source]#

Adds the specified amino acid feature, as calculated by DSSP, to every node in a protein graph.

Parameters
  • G (nx.Graph) – Protein structure graph to add the DSSP feature to

  • feature (str) – Name of the DSSP feature to add: "chain", "resnum", "icode", "aa", "ss", "asa", "phi", "psi", "dssp_index", "NH_O_1_relidx", "NH_O_1_energy", "O_NH_1_relidx", "O_NH_1_energy", "NH_O_2_relidx", "NH_O_2_energy", "O_NH_2_relidx", "O_NH_2_energy". These names are accessible in the DSSP_COLS list.

Returns

Protein structure graph with DSSP feature added to nodes

Return type

nx.Graph

graphein.protein.features.nodes.dssp.asa(G: networkx.classes.graph.Graph) networkx.classes.graph.Graph[source]#

Adds ASA of each residue in protein graph as calculated by DSSP.

Parameters

G (nx.Graph) – Input protein graph

Returns

Protein graph with asa values added

Return type

nx.Graph

graphein.protein.features.nodes.dssp.parse_dssp_df(dssp: Dict[str, Any]) pandas.core.frame.DataFrame[source]#

Parse DSSP output to DataFrame

Parameters

dssp (Dict[str, Any]) – Dictionary containing DSSP output

Returns

pd.DataFrame containing parsed DSSP output

Return type

pd.DataFrame

graphein.protein.features.nodes.dssp.phi(G: networkx.classes.graph.Graph) networkx.classes.graph.Graph[source]#

Adds phi-angles of each residue in protein graph as calculated by DSSP.

Parameters

G (nx.Graph) – Input protein graph

Returns

Protein graph with phi-angle values added

Return type

nx.Graph

graphein.protein.features.nodes.dssp.process_dssp_df(df: pandas.core.frame.DataFrame) pandas.core.frame.DataFrame[source]#

Processes a DSSP DataFrame to make indexes align with node IDs

Parameters

df (pd.DataFrame) – pd.DataFrame containing the parsed output from DSSP.

Returns

pd.DataFrame with node IDs

Return type

pd.DataFrame

graphein.protein.features.nodes.dssp.psi(G: networkx.classes.graph.Graph) networkx.classes.graph.Graph[source]#

Adds psi-angles of each residue in protein graph as calculated by DSSP.

Parameters

G (nx.Graph) – Input protein graph

Returns

Protein graph with psi-angle values added

Return type

nx.Graph

graphein.protein.features.nodes.dssp.rsa(G: networkx.classes.graph.Graph) networkx.classes.graph.Graph[source]#

Adds RSA (relative solvent accessibility) of each residue in protein graph as calculated by DSSP.

Parameters

G (nx.Graph) – Input protein graph

Returns

Protein graph with rsa values added

Return type

nx.Graph

graphein.protein.features.nodes.dssp.secondary_structure(G: networkx.classes.graph.Graph) networkx.classes.graph.Graph[source]#

Adds secondary structure of each residue in protein graph as calculated by DSSP in the form of a string

Parameters

G (nx.Graph) – Input protein graph

Returns

Protein graph with secondary structure added

Return type

nx.Graph

Provides geometry-based featurisation functions.

graphein.protein.features.nodes.geometry.add_beta_carbon_vector(g: networkx.classes.graph.Graph, scale: bool = True, reverse: bool = False)[source]#

Adds vector from node (typically alpha carbon) to position of beta carbon.

Glycine does not have a beta carbon, so we set it to np.array([0, 0, 0]). We extract the position of the beta carbon from the unprocessed atomic PDB dataframe. For this we use the raw_pdb_df dataframe. If scale, we scale the vector to the unit vector. If reverse is True, we reverse the vector (C beta - node). If reverse is false (default) we compute (node - C beta).

Parameters
  • g (nx.Graph) – Graph to add vector to.

  • scale (bool) – Scale vector to unit vector. Defaults to True.

  • reverse (bool) – Reverse vector. Defaults to False.
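The direction and scaling conventions described above can be sketched with plain tuples. This is an illustration under stated assumptions, not graphein's implementation: coordinates are passed directly rather than looked up in raw_pdb_df, and a missing beta carbon is represented by `None`.

```python
from math import sqrt


def beta_carbon_vector(node_xyz, cb_xyz, scale=True, reverse=False):
    """Vector between a node and its beta carbon, following the convention above."""
    if cb_xyz is None:  # e.g. glycine, which has no beta carbon
        return (0.0, 0.0, 0.0)
    if reverse:
        vec = tuple(c - n for n, c in zip(node_xyz, cb_xyz))  # C beta - node
    else:
        vec = tuple(n - c for n, c in zip(node_xyz, cb_xyz))  # node - C beta
    if scale:
        norm = sqrt(sum(v * v for v in vec))
        if norm > 0:  # avoid dividing by zero for coincident points
            vec = tuple(v / norm for v in vec)
    return vec


# Toy coordinates: alpha carbon at the origin, beta carbon 5 Angstroms away.
default_vec = beta_carbon_vector((0.0, 0.0, 0.0), (3.0, 4.0, 0.0))
reversed_vec = beta_carbon_vector((0.0, 0.0, 0.0), (3.0, 4.0, 0.0), reverse=True)
```

With `scale=True` both results have unit length and differ only in sign, which makes the effect of `reverse` easy to check.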

graphein.protein.features.nodes.geometry.add_sequence_neighbour_vector(g: networkx.classes.graph.Graph, scale: bool = True, reverse: bool = False, n_to_c: bool = True)[source]#

Computes vector from node to adjacent node in sequence. Typically used with CA (alpha carbon) graphs.

If n_to_c is True (default), we compute the vectors from the N terminus to the C terminus (canonical direction). If reverse is False (default), we compute Node_i - Node_{i+1}. If reverse is True, we compute Node_{i+1} - Node_i.

Parameters
  • g (nx.Graph) – Graph to add vector to.

  • scale (bool) – Scale vector to unit vector. Defaults to True.

  • reverse (bool) – Reverse vector. Defaults to False.

  • n_to_c (bool) – Compute vector from N to C or C to N. Defaults to True.

graphein.protein.features.nodes.geometry.add_sidechain_vector(g: networkx.classes.graph.Graph, scale: bool = True, reverse: bool = False)[source]#

Adds vector from node to average position of sidechain atoms.

We compute the mean of the sidechain atoms for each node. For this we use the rgroup_df dataframe. If the graph does not contain the rgroup_df dataframe, we compute it from the raw_pdb_df. If scale, we scale the vector to the unit vector. If reverse is True, we reverse the vector (sidechain - node). If reverse is false (default) we compute (node - sidechain).

Parameters
  • g (nx.Graph) – Graph to add vector to.

  • scale (bool) – Scale vector to unit vector. Defaults to True.

  • reverse (bool) – Reverse vector. Defaults to False.

Sequence#

Functions to add embeddings from pre-trained language models to protein structure graphs.

graphein.protein.features.sequence.embeddings.biovec_sequence_embedding(G: networkx.classes.graph.Graph) networkx.classes.graph.Graph[source]#

Adds BioVec sequence embedding feature to the graph. Computed over chains.

Source

ProtVec: A Continuous Distributed Representation of Biological Sequences

Paper: http://arxiv.org/pdf/1503.05140v1.pdf

Parameters

G (nx.Graph) – nx.Graph protein structure graph.

Returns

nx.Graph protein structure graph with biovec embedding added. e.g. G.graph["biovec_embedding_A"] for chain A.

Return type

nx.Graph

graphein.protein.features.sequence.embeddings.compute_esm_embedding(sequence: str, representation: str, model_name: str = 'esm1b_t33_650M_UR50S', output_layer: int = 33) np.ndarray[source]#

Computes sequence embedding using Pre-trained ESM model from FAIR

Biological Structure and Function Emerge from Scaling Unsupervised Learning to 250 Million Protein Sequences (2019) Rives, Alexander and Meier, Joshua and Sercu, Tom and Goyal, Siddharth and Lin, Zeming and Liu, Jason and Guo, Demi and Ott, Myle and Zitnick, C. Lawrence and Ma, Jerry and Fergus, Rob

Transformer protein language models are unsupervised structure learners 2020 Rao, Roshan M and Meier, Joshua and Sercu, Tom and Ovchinnikov, Sergey and Rives, Alexander

Pre-trained models:

Shorthand  Full Name             Layers  Params  Dataset  Embedding Dim  Model URL
=========  ====================  ======  ======  =======  =============  =========
ESM-1b     esm1b_t33_650M_UR50S  33      650M    UR50/S   1280           https://dl.fbaipublicfiles.com/fair-esm/models/esm1b_t33_650M_UR50S.pt
ESM1-main  esm1_t34_670M_UR50S   34      670M    UR50/S   1280           https://dl.fbaipublicfiles.com/fair-esm/models/esm1_t34_670M_UR50S.pt
           esm1_t34_670M_UR50D   34      670M    UR50/D   1280           https://dl.fbaipublicfiles.com/fair-esm/models/esm1_t34_670M_UR50D.pt
           esm1_t34_670M_UR100   34      670M    UR100    1280           https://dl.fbaipublicfiles.com/fair-esm/models/esm1_t34_670M_UR100.pt
           esm1_t12_85M_UR50S    12      85M     UR50/S   768            https://dl.fbaipublicfiles.com/fair-esm/models/esm1_t12_85M_UR50S.pt
           esm1_t6_43M_UR50S     6       43M     UR50/S   768            https://dl.fbaipublicfiles.com/fair-esm/models/esm1_t6_43M_UR50S.pt

Parameters
  • sequence (str) – Protein sequence to embed (str)

  • representation (str) – Type of embedding to extract. "residue" or "sequence". Sequence-level embeddings are averaged residue embeddings

  • model_name (str) – Name of pre-trained model to use

  • output_layer (int) – integer indicating which layer the output should be taken from

Returns

embedding (np.ndarray)

Return type

np.ndarray
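The relationship between the two representation types can be sketched as mean pooling over residues. This is a dependency-free illustration with toy 2-dimensional vectors; it does not call the ESM model, and `sequence_embedding` is a hypothetical helper. Any special tokens a real model emits are assumed to have been removed beforehand.

```python
def sequence_embedding(residue_embeddings):
    """Average per-residue vectors into one sequence-level vector."""
    n_residues = len(residue_embeddings)
    # zip(*...) iterates over embedding dimensions instead of residues.
    return [sum(column) / n_residues for column in zip(*residue_embeddings)]


# Three residues with toy 2-dimensional embeddings (invented values).
residue_vectors = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
pooled = sequence_embedding(residue_vectors)
```

The pooled vector has the same dimensionality as each residue embedding, so a sequence of any length maps to a fixed-size representation.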

graphein.protein.features.sequence.embeddings.esm_residue_embedding(G: networkx.classes.graph.Graph, model_name: str = 'esm1b_t33_650M_UR50S', output_layer: int = 33) networkx.classes.graph.Graph[source]#

Computes ESM residue embeddings from a protein sequence and adds them to the graph.

Biological Structure and Function Emerge from Scaling Unsupervised Learning to 250 Million Protein Sequences (2019) Rives, Alexander and Meier, Joshua and Sercu, Tom and Goyal, Siddharth and Lin, Zeming and Liu, Jason and Guo, Demi and Ott, Myle and Zitnick, C. Lawrence and Ma, Jerry and Fergus, Rob

Transformer protein language models are unsupervised structure learners 2020 Rao, Roshan M and Meier, Joshua and Sercu, Tom and Ovchinnikov, Sergey and Rives, Alexander

Pre-trained models are listed in the documentation for compute_esm_embedding.

Parameters
  • G (nx.Graph) – nx.Graph to add esm embedding to.

  • model_name (str) – Name of pre-trained model to use.

  • output_layer (int) – index of output layer in pre-trained model.

Returns

nx.Graph with esm embedding feature added to nodes.

Return type

nx.Graph

graphein.protein.features.sequence.embeddings.esm_sequence_embedding(G: networkx.classes.graph.Graph) networkx.classes.graph.Graph[source]#

Computes ESM sequence embedding feature over chains in a graph.

Parameters

G (nx.Graph) – nx.Graph protein structure graph.

Returns

nx.Graph protein structure graph with esm embedding features added, e.g. G.graph["esm_embedding_A"] for chain A.

Return type

nx.Graph

graphein.protein.features.sequence.propy.aa_dipeptide_composition(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]] = None) networkx.classes.graph.Graph[source]#

Calculate the composition of AADs, dipeptide and 3-mers for a given protein sequence. Contains all composition values of AADs, dipeptide and 3-mers (8420).

Parameters
  • G (nx.Graph) – Protein Graph to featurise

  • aggregation_type (List[Optional[str]]) – Aggregation types to use over chains

Returns

Protein Graph with aa_dipeptide_composition feature added. G.graph["aa_dipeptide_composition_{chain | aggregation_type}"]

Return type

nx.Graph
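The figure of 8420 values corresponds to 20 single residues, 400 dipeptides, and 8000 3-mers. The sketch below reproduces that feature count with the standard library. It uses raw overlapping counts on an invented toy sequence; propy's own counting and normalisation may differ, and `kmer_counts` is a hypothetical helper.

```python
from itertools import product

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids, one-letter codes


def kmer_counts(sequence, k):
    """Count occurrences of every possible k-mer, including those that never appear."""
    counts = {"".join(p): 0 for p in product(ALPHABET, repeat=k)}
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i : i + k]
        if kmer in counts:  # ignore k-mers containing non-standard letters
            counts[kmer] += 1
    return counts


sequence = "MKVLAAGIVMKV"  # invented toy sequence
features = {}
for k in (1, 2, 3):
    features.update(kmer_counts(sequence, k))

n_features = len(features)  # 20 + 400 + 8000
```

Keys of different lengths never collide, so the three dictionaries merge into a single feature map.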

graphein.protein.features.sequence.propy.aa_spectrum(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]] = None) networkx.classes.graph.Graph[source]#

Calculate the spectrum descriptors of 3-mers for a given protein. Contains the composition values of 8000 3-mers

Parameters
  • G (nx.Graph) – Protein Graph to featurise

  • aggregation_type (List[Optional[str]]) – Aggregation types to use over chains

Returns

Protein Graph with aa_spectrum feature added. G.graph["aa_spectrum_{chain | aggregation_type}"]

Return type

nx.Graph

graphein.protein.features.sequence.propy.all_composition_descriptors(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]] = None) networkx.classes.graph.Graph[source]#

Calculate all composition descriptors based on seven different properties of AADs.

Parameters
  • G (nx.Graph) – Protein Graph to featurise

  • aggregation_type (List[Optional[str]]) – Aggregation types to use over chains

Returns

Protein Graph with composition_descriptors feature added. G.graph["composition_descriptors_{chain | aggregation_type}"]

Return type

nx.Graph

graphein.protein.features.sequence.propy.all_ctd_descriptors(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]] = None) networkx.classes.graph.Graph[source]#

Calculate all CTD descriptors based on seven different properties of AADs.

Parameters
  • G (nx.Graph) – Protein Graph to featurise

  • aggregation_type (List[Optional[str]]) – Aggregation types to use over chains

Returns

Protein Graph with ctd_descriptors feature added. G.graph["ctd_descriptors_{chain | aggregation_type}"]

Return type

nx.Graph

graphein.protein.features.sequence.propy.all_distribution_descriptors(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]] = None) networkx.classes.graph.Graph[source]#

Calculate all distribution descriptors based on seven different properties of AADs.

Parameters
  • G (nx.Graph) – Protein Graph to featurise

  • aggregation_type (List[Optional[str]]) – Aggregation types to use over chains

Returns

Protein Graph with distribution_descriptors feature added. G.graph["distribution_descriptors_{chain | aggregation_type}"]

Return type

nx.Graph

graphein.protein.features.sequence.propy.all_transition_descriptors(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]] = None) networkx.classes.graph.Graph[source]#

Calculate all transition descriptors based on seven different properties of AADs.

Parameters
  • G (nx.Graph) – Protein Graph to featurise

  • aggregation_type (List[Optional[str]]) – Aggregation types to use over chains

Returns

Protein Graph with transition_descriptors feature added. G.graph["transition_descriptors_{chain | aggregation_type}"]

Return type

nx.Graph

graphein.protein.features.sequence.propy.amino_acid_composition(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]] = None) networkx.classes.graph.Graph[source]#

Calculate the composition of Amino acids for a given protein sequence.

Parameters
  • G (nx.Graph) – Protein Graph to featurise

  • aggregation_type (Optional[List[str]]) – Aggregation types to use

Returns

Protein Graph with amino_acid_composition feature added. G.graph["amino_acid_composition_{chain | aggregation_type}"]

Return type

nx.Graph

graphein.protein.features.sequence.propy.autocorrelation_geary_all(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]] = None) networkx.classes.graph.Graph[source]#

Compute Geary autocorrelation descriptors based on 8 properties of AADs. The result contains 30*8=240 Geary autocorrelation descriptors based on the given properties (i.e., _AAPropert).

Parameters
  • G (nx.Graph) – Protein graph to featurise

  • aggregation_type (List[Optional[str]]) – Aggregation types to use over chains

Returns

Protein Graph with autocorrelation_geary_all feature added. G.graph["autocorrelation_geary_all_{chain | aggregation_type}"]

Return type

nx.Graph
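For orientation, the standard Geary autocorrelation at a given lag can be sketched as below. It operates on a plain list of per-residue property values and computes one descriptor per lag from 1 to 30, which is where the count of 30 descriptors per property comes from. This is the textbook formula, not propy's code; propy may standardise the property values first, and the toy values are invented.

```python
def geary_autocorrelation(values, lag):
    """Geary's C for a 1-D series of per-residue property values at a given lag."""
    n = len(values)
    mean = sum(values) / n
    # Mean squared difference between residues separated by `lag`, halved.
    numerator = sum(
        (values[i] - values[i + lag]) ** 2 for i in range(n - lag)
    ) / (2 * (n - lag))
    # Sample variance of the property over the whole sequence.
    denominator = sum((v - mean) ** 2 for v in values) / (n - 1)
    return numerator / denominator


# Invented property values for a 40-residue toy sequence.
toy_values = [float(i % 5) for i in range(40)]
descriptors = [geary_autocorrelation(toy_values, lag) for lag in range(1, 31)]
```

Repeating this for 8 properties yields the 30*8=240 values mentioned above. The sequence must be longer than the largest lag for every term to be defined.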

graphein.protein.features.sequence.propy.autocorrelation_geary_av_flexibility(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]]) networkx.classes.graph.Graph[source]#

Calculate the Geary Autocorrelation descriptors based on AvFlexibility. The result contains 30 Geary Autocorrelation descriptors based on AvFlexibility.

Parameters
  • G (nx.Graph) – Protein graph to featurise

  • aggregation_type (List[Optional[str]]) – Aggregation types to use over chains

Returns

Protein Graph with autocorrelation_geary_av_flexibility feature added. G.graph["autocorrelation_geary_av_flexibility_{chain | aggregation_type}"]

Return type

nx.Graph

graphein.protein.features.sequence.propy.autocorrelation_geary_free_energy(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]]) networkx.classes.graph.Graph[source]#

Calculate the Geary Autocorrelation descriptors based on FreeEnergy. The result contains 30 Geary Autocorrelation descriptors based on FreeEnergy.

Parameters
  • G (nx.Graph) – Protein graph to featurise

  • aggregation_type (List[Optional[str]]) – Aggregation types to use over chains

Returns

Protein Graph with autocorrelation_geary_free_energy feature added. G.graph["autocorrelation_geary_av_free_energy_{chain | aggregation_type}"]

Return type

nx.Graph

graphein.protein.features.sequence.propy.autocorrelation_geary_hydrophobicity(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]]) networkx.classes.graph.Graph[source]#

Calculate the Geary Autocorrelation descriptors based on hydrophobicity. The result contains 30 Geary Autocorrelation descriptors based on hydrophobicity.

Parameters
  • G (nx.Graph) – Protein graph to featurise

  • aggregation_type (List[Optional[str]]) – Aggregation types to use over chains

Returns

Protein Graph with autocorrelation_geary_hydrophobicity feature added. G.graph["autocorrelation_geary_hydrophobicity_{chain | aggregation_type}"]

Return type

nx.Graph

graphein.protein.features.sequence.propy.autocorrelation_geary_mutability(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]]) networkx.classes.graph.Graph[source]#

Calculate the Geary Autocorrelation descriptors based on Mutability. The result contains 30 Geary Autocorrelation descriptors based on mutability.

Parameters
  • G (nx.Graph) – Protein graph to featurise

  • aggregation_type (List[Optional[str]]) – Aggregation types to use over chains

Returns

Protein Graph with autocorrelation_geary_mutability feature added. G.graph["autocorrelation_geary_mutability_{chain | aggregation_type}"]

Return type

nx.Graph

graphein.protein.features.sequence.propy.autocorrelation_geary_polarizability(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]]) networkx.classes.graph.Graph[source]#

Calculate the Geary Autocorrelation descriptors based on polarizability. The result contains 30 Geary Autocorrelation descriptors based on polarizability.

Parameters
  • G (nx.Graph) – Protein graph to featurise

  • aggregation_type (List[Optional[str]]) – Aggregation types to use over chains

Returns

Protein Graph with autocorrelation_geary_polarizability feature added. G.graph["autocorrelation_geary_polarizability_{chain | aggregation_type}"]

Return type

nx.Graph

graphein.protein.features.sequence.propy.autocorrelation_geary_residue_asa(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]]) networkx.classes.graph.Graph[source]#

Calculate the Geary Autocorrelation descriptors based on ResidueASA. The result contains 30 Geary Autocorrelation descriptors based on ResidueASA.

Parameters
  • G (nx.Graph) – Protein graph to featurise

  • aggregation_type (List[Optional[str]]) – Aggregation types to use over chains

Returns

Protein Graph with autocorrelation_geary_residue_asa feature added. G.graph["autocorrelation_geary_residue_asa_{chain | aggregation_type}"]

Return type

nx.Graph

graphein.protein.features.sequence.propy.autocorrelation_geary_residue_vol(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]]) networkx.classes.graph.Graph[source]#
Calculate the Geary autocorrelation descriptors based on ResidueVol. The result contains 30 Geary autocorrelation descriptors.

Parameters
  • G (nx.Graph) – Protein graph to featurise

  • aggregation_type (List[Optional[str]]) – Aggregation types to use over chains

Returns

Protein Graph with autocorrelation_geary_residue_vol feature added. G.graph[“autocorrelation_geary_residue_vol_{chain | aggregation_type}”]

Return type

nx.Graph

graphein.protein.features.sequence.propy.autocorrelation_geary_steric(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]]) networkx.classes.graph.Graph[source]#
Calculate the Geary autocorrelation descriptors based on Steric. The result contains 30 Geary autocorrelation descriptors.

Parameters
  • G (nx.Graph) – Protein graph to featurise

  • aggregation_type (List[Optional[str]]) – Aggregation types to use over chains

Returns

Protein Graph with autocorrelation_geary_steric feature added. G.graph[“autocorrelation_geary_steric_{chain | aggregation_type}”]

Return type

nx.Graph

graphein.protein.features.sequence.propy.autocorrelation_moran_all(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]] = None) networkx.classes.graph.Graph[source]#
Compute Moran autocorrelation descriptors based on 8 properties of AADs. The result contains 30*8=240 Moran autocorrelation descriptors based on the given properties (i.e., _AAPropert).

Parameters
  • G (nx.Graph) – Protein graph to featurise

  • aggregation_type (List[Optional[str]]) – Aggregation types to use over chains

Returns

Protein Graph with autocorrelation_moran_all feature added. G.graph[“autocorrelation_moran_all_{chain | aggregation_type}”]

Return type

nx.Graph
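For orientation, the following sketch evaluates the standard Moran autocorrelation formula at one lag. It is a minimal illustration rather than the propy code; moran_autocorrelation and toy_scale are hypothetical names introduced here.

```python
from typing import Dict, List


def moran_autocorrelation(sequence: str, prop: Dict[str, float], lag: int) -> float:
    """Moran autocorrelation of a per-residue property at a given lag (sketch)."""
    values: List[float] = [prop[aa] for aa in sequence]
    n = len(values)
    if not 0 < lag < n:
        raise ValueError("lag must satisfy 0 < lag < len(sequence)")
    mean = sum(values) / n
    # Numerator: mean product of deviations for residues `lag` apart.
    num = sum((values[i] - mean) * (values[i + lag] - mean) for i in range(n - lag)) / (n - lag)
    # Denominator: population variance of the property along the sequence.
    den = sum((v - mean) ** 2 for v in values) / n
    return num / den


# Hypothetical toy property scale (not a real amino acid index).
toy_scale = {"A": 1.0, "G": 0.0, "L": 2.0}
moran_lag1 = moran_autocorrelation("AGLAGLAGL", toy_scale, 1)
moran_lag3 = moran_autocorrelation("AGLAGLAGL", toy_scale, 3)
```

Evaluating 30 lags for each of 8 properties gives the 30*8=240 values described above.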

graphein.protein.features.sequence.propy.autocorrelation_moran_av_flexibility(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]]) networkx.classes.graph.Graph[source]#
Calculate the Moran autocorrelation descriptors based on AvFlexibility. The result contains 30 Moran autocorrelation descriptors.

Parameters
  • G (nx.Graph) – Protein graph to featurise

  • aggregation_type (List[Optional[str]]) – Aggregation types to use over chains

Returns

Protein Graph with autocorrelation_moran_av_flexibility feature added. G.graph[“autocorrelation_moran_av_flexibility_{chain | aggregation_type}”]

Return type

nx.Graph

graphein.protein.features.sequence.propy.autocorrelation_moran_free_energy(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]]) networkx.classes.graph.Graph[source]#
Calculate the Moran autocorrelation descriptors based on FreeEnergy. The result contains 30 Moran autocorrelation descriptors.

Parameters
  • G (nx.Graph) – Protein graph to featurise

  • aggregation_type (List[Optional[str]]) – Aggregation types to use over chains

Returns

Protein Graph with autocorrelation_moran_free_energy feature added. G.graph[“autocorrelation_moran_free_energy_{chain | aggregation_type}”]

Return type

nx.Graph

graphein.protein.features.sequence.propy.autocorrelation_moran_hydrophobicity(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]]) networkx.classes.graph.Graph[source]#
Calculate the Moran autocorrelation descriptors based on hydrophobicity. The result contains 30 Moran autocorrelation descriptors.

Parameters
  • G (nx.Graph) – Protein graph to featurise

  • aggregation_type (List[Optional[str]]) – Aggregation types to use over chains

Returns

Protein Graph with autocorrelation_moran_hydrophobicity feature added. G.graph[“autocorrelation_moran_hydrophobicity_{chain | aggregation_type}”]

Return type

nx.Graph

graphein.protein.features.sequence.propy.autocorrelation_moran_mutability(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]]) networkx.classes.graph.Graph[source]#
Calculate the Moran autocorrelation descriptors based on Mutability. The result contains 30 Moran autocorrelation descriptors.

Parameters
  • G (nx.Graph) – Protein graph to featurise

  • aggregation_type (List[Optional[str]]) – Aggregation types to use over chains

Returns

Protein Graph with autocorrelation_moran_mutability feature added. G.graph[“autocorrelation_moran_mutability_{chain | aggregation_type}”]

Return type

nx.Graph

graphein.protein.features.sequence.propy.autocorrelation_moran_polarizability(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]]) networkx.classes.graph.Graph[source]#
Calculate the Moran autocorrelation descriptors based on polarizability. The result contains 30 Moran autocorrelation descriptors.

Parameters
  • G (nx.Graph) – Protein graph to featurise

  • aggregation_type (List[Optional[str]]) – Aggregation types to use over chains

Returns

Protein Graph with autocorrelation_moran_polarizability feature added. G.graph[“autocorrelation_moran_polarizability_{chain | aggregation_type}”]

Return type

nx.Graph

graphein.protein.features.sequence.propy.autocorrelation_moran_residue_asa(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]]) networkx.classes.graph.Graph[source]#
Calculate the Moran autocorrelation descriptors based on ResidueASA. The result contains 30 Moran autocorrelation descriptors.

Parameters
  • G (nx.Graph) – Protein graph to featurise

  • aggregation_type (List[Optional[str]]) – Aggregation types to use over chains

Returns

Protein Graph with autocorrelation_moran_residue_asa feature added. G.graph[“autocorrelation_moran_residue_asa_{chain | aggregation_type}”]

Return type

nx.Graph

graphein.protein.features.sequence.propy.autocorrelation_moran_residue_vol(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]]) networkx.classes.graph.Graph[source]#
Calculate the Moran autocorrelation descriptors based on ResidueVol. The result contains 30 Moran autocorrelation descriptors.

Parameters
  • G (nx.Graph) – Protein graph to featurise

  • aggregation_type (List[Optional[str]]) – Aggregation types to use over chains

Returns

Protein Graph with autocorrelation_moran_residue_vol feature added. G.graph[“autocorrelation_moran_residue_vol_{chain | aggregation_type}”]

Return type

nx.Graph

graphein.protein.features.sequence.propy.autocorrelation_moran_steric(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]]) networkx.classes.graph.Graph[source]#
Calculate the Moran autocorrelation descriptors based on Steric. The result contains 30 Moran autocorrelation descriptors.

Parameters
  • G (nx.Graph) – Protein graph to featurise

  • aggregation_type (List[Optional[str]]) – Aggregation types to use over chains

Returns

Protein Graph with autocorrelation_moran_steric feature added. G.graph[“autocorrelation_moran_steric_{chain | aggregation_type}”]

Return type

nx.Graph

graphein.protein.features.sequence.propy.autocorrelation_normalized_moreau_broto_all(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]] = None) networkx.classes.graph.Graph[source]#
Compute NormalizedMoreauBroto autocorrelation descriptors based on 8 properties of AADs. The result contains 30*8=240 NormalizedMoreauBroto autocorrelation descriptors based on the given properties (i.e., _AAPropert).

Parameters
  • G (nx.Graph) – Protein graph to featurise

  • aggregation_type (List[Optional[str]]) – Aggregation types to use over chains

Returns

Protein Graph with autocorrelation_normalized_moreau_broto_all feature added. G.graph[“autocorrelation_normalized_moreau_broto_all_{chain | aggregation_type}”]

Return type

nx.Graph
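A normalized Moreau-Broto descriptor at lag d is commonly defined as the sum of property products for residues d apart, divided by the number of such pairs. The sketch below follows that definition and assumes the property scale has already been standardized; it is not the propy code, and normalized_moreau_broto and toy_scale are hypothetical names.

```python
from typing import Dict, List


def normalized_moreau_broto(sequence: str, prop: Dict[str, float], lag: int) -> float:
    """Normalized Moreau-Broto autocorrelation at a given lag (sketch)."""
    values: List[float] = [prop[aa] for aa in sequence]
    n = len(values)
    if not 0 < lag < n:
        raise ValueError("lag must satisfy 0 < lag < len(sequence)")
    # Sum of products over all residue pairs `lag` apart, normalized by pair count.
    return sum(values[i] * values[i + lag] for i in range(n - lag)) / (n - lag)


# Hypothetical, already-centred toy scale (not a real amino acid index).
toy_scale = {"A": 0.0, "G": -1.0, "L": 1.0}
mb_lag1 = normalized_moreau_broto("AGLAGLAGL", toy_scale, 1)
mb_lag3 = normalized_moreau_broto("AGLAGLAGL", toy_scale, 3)
```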

graphein.protein.features.sequence.propy.autocorrelation_normalized_moreau_broto_av_flexibility(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]]) networkx.classes.graph.Graph[source]#
Calculate the NormalizedMoreauBroto autocorrelation descriptors based on AvFlexibility. The result contains 30 NormalizedMoreauBroto autocorrelation descriptors.

Parameters
  • G (nx.Graph) – Protein graph to featurise

  • aggregation_type (List[Optional[str]]) – Aggregation types to use over chains

Returns

Protein Graph with autocorrelation_normalized_moreau_broto_av_flexibility feature added. G.graph[“autocorrelation_normalized_moreau_broto_av_flexibility_{chain | aggregation_type}”]

Return type

nx.Graph

graphein.protein.features.sequence.propy.autocorrelation_normalized_moreau_broto_free_energy(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]]) networkx.classes.graph.Graph[source]#
Calculate the NormalizedMoreauBroto autocorrelation descriptors based on FreeEnergy. The result contains 30 NormalizedMoreauBroto autocorrelation descriptors.

Parameters
  • G (nx.Graph) – Protein graph to featurise

  • aggregation_type (List[Optional[str]]) – Aggregation types to use over chains

Returns

Protein Graph with autocorrelation_normalized_moreau_broto_free_energy feature added. G.graph[“autocorrelation_normalized_moreau_broto_free_energy_{chain | aggregation_type}”]

Return type

nx.Graph

graphein.protein.features.sequence.propy.autocorrelation_normalized_moreau_broto_hydrophobicity(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]]) networkx.classes.graph.Graph[source]#
Calculate the NormalizedMoreauBroto autocorrelation descriptors based on hydrophobicity. The result contains 30 NormalizedMoreauBroto autocorrelation descriptors.

Parameters
  • G (nx.Graph) – Protein graph to featurise

  • aggregation_type (List[Optional[str]]) – Aggregation types to use over chains

Returns

Protein Graph with autocorrelation_normalized_moreau_broto_hydrophobicity feature added. G.graph[“autocorrelation_normalized_moreau_broto_hydrophobicity_{chain | aggregation_type}”]

Return type

nx.Graph

graphein.protein.features.sequence.propy.autocorrelation_normalized_moreau_broto_mutability(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]]) networkx.classes.graph.Graph[source]#
Calculate the NormalizedMoreauBroto autocorrelation descriptors based on Mutability. The result contains 30 NormalizedMoreauBroto autocorrelation descriptors.

Parameters
  • G (nx.Graph) – Protein graph to featurise

  • aggregation_type (List[Optional[str]]) – Aggregation types to use over chains

Returns

Protein Graph with autocorrelation_normalized_moreau_broto_mutability feature added. G.graph[“autocorrelation_normalized_moreau_broto_mutability_{chain | aggregation_type}”]

Return type

nx.Graph

graphein.protein.features.sequence.propy.autocorrelation_normalized_moreau_broto_polarizability(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]]) networkx.classes.graph.Graph[source]#
Calculate the NormalizedMoreauBroto autocorrelation descriptors based on polarizability. The result contains 30 NormalizedMoreauBroto autocorrelation descriptors.

Parameters
  • G (nx.Graph) – Protein graph to featurise

  • aggregation_type (List[Optional[str]]) – Aggregation types to use over chains

Returns

Protein Graph with autocorrelation_normalized_moreau_broto_polarizability feature added. G.graph[“autocorrelation_normalized_moreau_broto_polarizability_{chain | aggregation_type}”]

Return type

nx.Graph

graphein.protein.features.sequence.propy.autocorrelation_normalized_moreau_broto_residue_asa(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]]) networkx.classes.graph.Graph[source]#
Calculate the NormalizedMoreauBroto autocorrelation descriptors based on ResidueASA. The result contains 30 NormalizedMoreauBroto autocorrelation descriptors.

Parameters
  • G (nx.Graph) – Protein graph to featurise

  • aggregation_type (List[Optional[str]]) – Aggregation types to use over chains

Returns

Protein Graph with autocorrelation_normalized_moreau_broto_residue_asa feature added. G.graph[“autocorrelation_normalized_moreau_broto_residue_asa_{chain | aggregation_type}”]

Return type

nx.Graph

graphein.protein.features.sequence.propy.autocorrelation_normalized_moreau_broto_residue_vol(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]]) networkx.classes.graph.Graph[source]#
Calculate the NormalizedMoreauBroto autocorrelation descriptors based on ResidueVol. The result contains 30 NormalizedMoreauBroto autocorrelation descriptors.

Parameters
  • G (nx.Graph) – Protein graph to featurise

  • aggregation_type (List[Optional[str]]) – Aggregation types to use over chains

Returns

Protein Graph with autocorrelation_normalized_moreau_broto_residue_vol feature added. G.graph[“autocorrelation_normalized_moreau_broto_residue_vol_{chain | aggregation_type}”]

Return type

nx.Graph

graphein.protein.features.sequence.propy.autocorrelation_normalized_moreau_broto_steric(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]]) networkx.classes.graph.Graph[source]#
Calculate the NormalizedMoreauBroto autocorrelation descriptors based on Steric. The result contains 30 NormalizedMoreauBroto autocorrelation descriptors.

Parameters
  • G (nx.Graph) – Protein graph to featurise

  • aggregation_type (List[Optional[str]]) – Aggregation types to use over chains

Returns

Protein Graph with autocorrelation_normalized_moreau_broto_steric feature added. G.graph[“autocorrelation_normalized_moreau_broto_steric_{chain | aggregation_type}”]

Return type

nx.Graph

graphein.protein.features.sequence.propy.autocorrelation_total(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]] = None) networkx.classes.graph.Graph[source]#
Compute all autocorrelation descriptors based on 8 properties of AADs. The result contains 30*8*3=720 normalized Moreau-Broto, Moran, and Geary autocorrelation descriptors.

Parameters
  • G (nx.Graph) – Protein graph to featurise

  • aggregation_type (List[Optional[str]]) – Aggregation types to use over chains

Returns

Protein Graph with autocorrelation_total feature added. G.graph[“autocorrelation_total_{chain | aggregation_type}”]

Return type

nx.Graph

graphein.protein.features.sequence.propy.composition_charge(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]] = None) networkx.classes.graph.Graph[source]#

Calculate composition descriptors based on Charge of AADs.

Parameters
  • G (nx.Graph) – Protein Graph to featurise

  • aggregation_type (List[Optional[str]]) – Aggregation types to use over chains

Returns

Protein Graph with composition_charge feature added. G.graph[“composition_charge_{chain | aggregation_type}”]

Return type

nx.Graph

graphein.protein.features.sequence.propy.composition_descriptor(G: networkx.classes.graph.Graph, AAProperty: Dict[Any, Any], AAPName: str, aggregation_type: Optional[List[str]] = None) networkx.classes.graph.Graph[source]#

Compute composition descriptors.

Parameters
  • G (nx.Graph) – Protein Graph to featurise

  • AAProperty (Dict[Any, Any]) – contains classification of amino acids such as _Polarizability.

  • AAPName (str) – Name of the amino acid property (AAP), used to label the feature.

  • aggregation_type (List[Optional[str]]) – Aggregation types to use over chains

Returns

Protein Graph with composition_{AAPName} feature added. G.graph[“composition_{AAPName}_{chain | aggregation_type}”]

Return type

nx.Graph
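To make the composition descriptor concrete, the sketch below computes the fraction of residues that fall into each class of an AAProperty-style dictionary. It is an illustration under stated assumptions, not the library code: ctd_composition is a hypothetical helper, and toy_classes is an assumed three-way charge grouping.

```python
from typing import Dict


def ctd_composition(sequence: str, classes: Dict[str, str]) -> Dict[str, float]:
    """Fraction of residues belonging to each property class (sketch)."""
    n = len(sequence)
    if n == 0:
        raise ValueError("sequence must not be empty")
    result = {}
    for label, members in classes.items():
        # Count residues whose one-letter code is listed for this class.
        count = sum(1 for aa in sequence if aa in members)
        result[label] = round(count / n, 3)
    return result


# Assumed grouping for illustration: class label -> member amino acids.
toy_classes = {"1": "KR", "2": "ANCQGHILMFPSTWYV", "3": "DE"}
composition = ctd_composition("MKRDEAAK", toy_classes)
```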

graphein.protein.features.sequence.propy.composition_hydrophobicity(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]] = None) networkx.classes.graph.Graph[source]#

Calculate composition descriptors based on Hydrophobicity of AADs.

Parameters
  • G (nx.Graph) – Protein Graph to featurise

  • aggregation_type (List[Optional[str]]) – Aggregation types to use over chains

Returns

Protein Graph with composition_hydrophobicity feature added. G.graph[“composition_hydrophobicity_{chain | aggregation_type}”]

Return type

nx.Graph

graphein.protein.features.sequence.propy.composition_normalized_vdwv(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]] = None) networkx.classes.graph.Graph[source]#

Calculate composition descriptors based on NormalizedVDWV of AADs.

Parameters
  • G (nx.Graph) – Protein Graph to featurise

  • aggregation_type (List[Optional[str]]) – Aggregation types to use over chains

Returns

Protein Graph with composition_normalized_vdwv feature added. G.graph[“composition_normalized_vdwv_{chain | aggregation_type}”]

Return type

nx.Graph

graphein.protein.features.sequence.propy.composition_polarity(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]]) networkx.classes.graph.Graph[source]#

Calculate composition descriptors based on Polarity of AADs.

Parameters
  • G (nx.Graph) – Protein Graph to featurise

  • aggregation_type (List[Optional[str]]) – Aggregation types to use over chains

Returns

Protein Graph with composition_polarity feature added. G.graph[“composition_polarity_{chain | aggregation_type}”]

Return type

nx.Graph

graphein.protein.features.sequence.propy.composition_polarizability(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]]) networkx.classes.graph.Graph[source]#

Calculate composition descriptors based on Polarizability of AADs.

Parameters
  • G (nx.Graph) – Protein graph to featurise

  • aggregation_type (List[Optional[str]]) – Aggregation types to use over chains

Returns

Protein Graph with composition_polarizability feature added. G.graph[“composition_polarizability_{chain | aggregation_type}”]

Return type

nx.Graph

graphein.protein.features.sequence.propy.composition_secondary_str(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]]) networkx.classes.graph.Graph[source]#

Calculate composition descriptors based on SecondaryStr of AADs.

Parameters
  • G (nx.Graph) – Protein graph to featurise

  • aggregation_type (List[Optional[str]]) – Aggregation types to use over chains

Returns

Protein Graph with composition_secondary_str feature added. G.graph[“composition_secondary_str_{chain | aggregation_type}”]

Return type

nx.Graph

graphein.protein.features.sequence.propy.composition_solvent_accessibility(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]]) networkx.classes.graph.Graph[source]#

Calculate composition descriptors based on SolventAccessibility of AADs.

Parameters
  • G (nx.Graph) – Protein graph to featurise

  • aggregation_type (List[Optional[str]]) – Aggregation types to use over chains

Returns

Protein Graph with composition_solvent_accessibility feature added. G.graph[“composition_solvent_accessibility_{chain | aggregation_type}”]

Return type

nx.Graph

graphein.protein.features.sequence.propy.compute_propy_feature(G: networkx.classes.graph.Graph, func: Callable, feature_name: str, aggregation_type: Optional[List[str]] = None) networkx.classes.graph.Graph[source]#

Computes Propy Descriptors over chains in a Protein Graph

Parameters
  • G (nx.Graph) – Protein Graph

  • func (Callable) – ProPy wrapper function to compute

  • feature_name (str) – Name of feature to index it in the nx.Graph object

  • aggregation_type (List[str], optional) – Type of aggregation to use when aggregating a feature over multiple chains. One of: [“mean”, “max”, “sum”]. Defaults to None.

Returns

Protein Graph with features added. Features are accessible with G.graph[{feature_name}_{chain | aggregation_type}]

Return type

nx.Graph
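The per-chain and aggregated key layout can be sketched with a plain dictionary standing in for G.graph. This is not the graphein implementation: the keys "chain_ids" and "sequence_{chain}", the helper compute_feature_over_chains, the AGGREGATORS table, and toy_descriptor are all assumptions made for this example.

```python
from statistics import mean
from typing import Callable, Dict, List, Optional

# Assumed set of aggregation names for this sketch.
AGGREGATORS: Dict[str, Callable] = {"mean": mean, "max": max, "sum": sum}


def compute_feature_over_chains(
    graph_attrs: dict,
    func: Callable[[str], List[float]],
    feature_name: str,
    aggregation_type: Optional[List[str]] = None,
) -> dict:
    """Store func(sequence) per chain, then aggregate element-wise over chains."""
    chains = graph_attrs["chain_ids"]
    for chain in chains:
        graph_attrs[f"{feature_name}_{chain}"] = func(graph_attrs[f"sequence_{chain}"])
    for agg in aggregation_type or []:
        per_chain = [graph_attrs[f"{feature_name}_{c}"] for c in chains]
        # zip(*per_chain) groups the i-th descriptor of every chain together.
        graph_attrs[f"{feature_name}_{agg}"] = [AGGREGATORS[agg](v) for v in zip(*per_chain)]
    return graph_attrs


def toy_descriptor(seq: str) -> List[float]:
    """Hypothetical descriptor: sequence length and alanine count."""
    return [float(len(seq)), float(seq.count("A"))]


attrs = {"chain_ids": ["A", "B"], "sequence_A": "AAGL", "sequence_B": "AG"}
attrs = compute_feature_over_chains(attrs, toy_descriptor, "toy", ["mean", "sum"])
```

After the call, attrs holds "toy_A" and "toy_B" for the individual chains and "toy_mean" and "toy_sum" for the aggregates, mirroring the {feature_name}_{chain | aggregation_type} naming pattern.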

graphein.protein.features.sequence.propy.dipeptide_composition(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]] = None) networkx.classes.graph.Graph[source]#

Calculate the composition of dipeptides for a given protein sequence. Contains the composition of 400 dipeptides.

Parameters
  • G (nx.Graph) – Protein Graph to featurise

  • aggregation_type (List[Optional[str]]) – Aggregation types to use over chains

Returns

Protein Graph with dipeptide_composition feature added. G.graph[“dipeptide_composition_{chain | aggregation_type}”]

Return type

nx.Graph
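The 400 values correspond to the 20 x 20 ordered pairs of standard amino acids. The sketch below counts overlapping adjacent pairs and reports each as a percentage of all adjacent pairs; the percentage scaling and the overlapping count are assumptions of this example and may differ from the underlying library.

```python
from itertools import product
from typing import Dict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def dipeptide_composition(sequence: str) -> Dict[str, float]:
    """Percentage of each of the 400 dipeptides among adjacent residue pairs (sketch)."""
    total = len(sequence) - 1
    if total < 1:
        raise ValueError("sequence must contain at least two residues")
    counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
    for i in range(total):
        pair = sequence[i : i + 2]
        if pair in counts:  # skip pairs containing non-standard residues
            counts[pair] += 1
    return {pair: round(100 * c / total, 2) for pair, c in counts.items()}


dipeptides = dipeptide_composition("ACAC")
```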

graphein.protein.features.sequence.propy.distribution_charge(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]] = None) networkx.classes.graph.Graph[source]#

Calculate distribution descriptors based on Charge of AADs.

Parameters
  • G (nx.Graph) – Protein Graph to featurise

  • aggregation_type (List[Optional[str]]) – Aggregation types to use over chains

Returns

Protein Graph with distribution_charge feature added. G.graph[“distribution_charge_{chain | aggregation_type}”]

Return type

nx.Graph

graphein.protein.features.sequence.propy.distribution_descriptor(G: networkx.classes.graph.Graph, AAProperty: Dict[Any, Any], AAPName: str, aggregation_type: Optional[List[str]] = None) networkx.classes.graph.Graph[source]#

Compute distribution descriptors.

Parameters
  • G (nx.Graph) – Protein Graph to featurise

  • AAProperty (Dict[Any, Any]) – contains classification of amino acids such as _Polarizability.

  • AAPName (str) – Name of the amino acid property (AAP), used to label the feature.

  • aggregation_type (List[Optional[str]]) – Aggregation types to use over chains

Returns

Protein Graph with distribution_{AAPName} feature added. G.graph[“distribution_{AAPName}_{chain | aggregation_type}”]

Return type

nx.Graph
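Distribution descriptors in the CTD scheme record where the residues of a class sit along the chain: the positions of the first, 25%, 50%, 75% and last residue of the class, each expressed as a percentage of the sequence length. The sketch below illustrates this for one class; the clamping of the index to at least 1 is a choice made here, and ctd_distribution is a hypothetical helper rather than the library function.

```python
import math
from typing import List


def ctd_distribution(sequence: str, members: str) -> List[float]:
    """Positions (% of length) of the first, 25%, 50%, 75% and 100% residue of a class."""
    n = len(sequence)
    # 1-based positions of residues that belong to the class.
    positions = [i + 1 for i, aa in enumerate(sequence) if aa in members]
    if not positions:
        return [0.0] * 5
    out = []
    for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
        # Rank of the class residue marking this fraction; at least the first one.
        k = max(1, math.floor(len(positions) * frac))
        out.append(round(100 * positions[k - 1] / n, 3))
    return out


distribution = ctd_distribution("KAAKAAKAAK", "KR")
```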

graphein.protein.features.sequence.propy.distribution_hydrophobicity(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]] = None) networkx.classes.graph.Graph[source]#

Calculate distribution descriptors based on Hydrophobicity of AADs.

Parameters
  • G (nx.Graph) – Protein Graph to featurise

  • aggregation_type (List[Optional[str]]) – Aggregation types to use over chains

Returns

Protein Graph with distribution_hydrophobicity feature added. G.graph[“distribution_hydrophobicity_{chain | aggregation_type}”]

Return type

nx.Graph

graphein.protein.features.sequence.propy.distribution_normalized_vdwv(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]]) networkx.classes.graph.Graph[source]#

Calculate distribution descriptors based on NormalizedVDWV of AADs.

Parameters
  • G (nx.Graph) – Protein Graph to featurise

  • aggregation_type (List[Optional[str]]) – Aggregation types to use over chains

Returns

Protein Graph with distribution_normalized_vdwv feature added. G.graph[“distribution_normalized_vdwv_{chain | aggregation_type}”]

Return type

nx.Graph

graphein.protein.features.sequence.propy.distribution_polarity(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]]) networkx.classes.graph.Graph[source]#

Calculate distribution descriptors based on Polarity of AADs.

Parameters
  • G (nx.Graph) – Protein Graph to featurise

  • aggregation_type (List[Optional[str]]) – Aggregation types to use over chains

Returns

Protein Graph with distribution_polarity feature added. G.graph[“distribution_polarity_{chain | aggregation_type}”]

Return type

nx.Graph

graphein.protein.features.sequence.propy.distribution_polarizability(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]]) networkx.classes.graph.Graph[source]#

Calculate distribution descriptors based on Polarizability of AADs.

Parameters
  • G (nx.Graph) – Protein graph to featurise

  • aggregation_type (List[Optional[str]]) – Aggregation types to use over chains

Returns

Protein Graph with distribution_polarizability feature added. G.graph[“distribution_polarizability_{chain | aggregation_type}”]

Return type

nx.Graph

graphein.protein.features.sequence.propy.distribution_secondary_str(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]]) networkx.classes.graph.Graph[source]#

Calculate distribution descriptors based on SecondaryStr of AADs.

Parameters
  • G (nx.Graph) – Protein graph to featurise

  • aggregation_type (List[Optional[str]]) – Aggregation types to use over chains

Returns

Protein Graph with distribution_secondary_str feature added. G.graph[“distribution_secondary_str_{chain | aggregation_type}”]

Return type

nx.Graph

graphein.protein.features.sequence.propy.distribution_solvent_accessibility(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]]) networkx.classes.graph.Graph[source]#

Calculate distribution descriptors based on SolventAccessibility of AADs.

Parameters
  • G (nx.Graph) – Protein graph to featurise

  • aggregation_type (List[Optional[str]]) – Aggregation types to use over chains

Returns

Protein Graph with distribution_solvent_accessibility feature added. G.graph[“distribution_solvent_accessibility_{chain | aggregation_type}”]

Return type

nx.Graph

graphein.protein.features.sequence.propy.quasi_sequence_order(G: networkx.classes.graph.Graph, maxlag: int = 30, weight: float = 0.1) networkx.classes.graph.Graph[source]#

Compute quasi-sequence-order descriptors for a given protein.

Kuo-Chen Chou. Prediction of Protein Subcellular Locations by Incorporating Quasi-Sequence-Order Effect. Biochemical and Biophysical Research Communications 2000, 278, 477-483.

Parameters
  • maxlag (int) – The maximum lag. The length of the protein should be larger than maxlag. Defaults to 30.

  • weight (float) – A weight factor. See the reference above (Chou, 2000) for its choice. Defaults to 0.1.

Returns

Return type

nx.Graph
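Following the definition in the Chou (2000) reference, the quasi-sequence-order vector combines the 20 amino acid frequencies with the weighted coupling numbers for lags 1 to maxlag, all divided by a common normalizer. The sketch below illustrates that formula; toy_distance is a hypothetical stand-in for a real physicochemical distance matrix, and the helper names are assumptions.

```python
from typing import Callable, List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def coupling_number(sequence: str, lag: int, distance: Callable[[str, str], float]) -> float:
    """Sum of squared distances between residues `lag` positions apart."""
    return sum(distance(sequence[i], sequence[i + lag]) ** 2 for i in range(len(sequence) - lag))


def quasi_sequence_order(
    sequence: str,
    distance: Callable[[str, str], float],
    maxlag: int = 30,
    weight: float = 0.1,
) -> List[float]:
    """Return 20 frequency terms followed by `maxlag` weighted coupling terms."""
    if len(sequence) <= maxlag:
        raise ValueError("sequence length must be larger than maxlag")
    freqs = [sequence.count(aa) / len(sequence) for aa in AMINO_ACIDS]
    taus = [coupling_number(sequence, d, distance) for d in range(1, maxlag + 1)]
    denom = sum(freqs) + weight * sum(taus)
    return [f / denom for f in freqs] + [weight * t / denom for t in taus]


def toy_distance(a: str, b: str) -> float:
    """Hypothetical distance for illustration only (not a physicochemical matrix)."""
    return abs(ord(a) - ord(b)) / 25


qso = quasi_sequence_order("ACDEFGHIKL", toy_distance, maxlag=3, weight=0.1)
```

By construction the returned values sum to 1, and the weight controls how much of that total is assigned to the sequence-order terms.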

graphein.protein.features.sequence.propy.sequence_order_coupling_number_total(G: networkx.classes.graph.Graph, maxlag: int = 30) networkx.classes.graph.Graph[source]#

Compute the sequence order coupling numbers from 1 to maxlag for a given protein sequence.

Parameters
  • G (nx.Graph) – Protein Graph

  • maxlag (int, optional) – The maximum lag. The length of the protein should be larger than maxlag. Defaults to 30.

Returns

Return type

nx.Graph
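A sequence-order coupling number at lag d is the sum of squared distances between all residue pairs d positions apart. The sketch below computes the numbers for lags 1 to maxlag from a pair-keyed distance table; toy_scale and the derived toy_distances are hypothetical values for illustration, not a published distance matrix.

```python
from typing import Dict, List


def sequence_order_coupling_numbers(
    sequence: str, distances: Dict[str, float], maxlag: int
) -> List[float]:
    """Coupling numbers for lags 1..maxlag (sketch)."""
    if len(sequence) <= maxlag:
        raise ValueError("sequence length must be larger than maxlag")
    numbers = []
    for lag in range(1, maxlag + 1):
        # Squared distance for every residue pair separated by `lag` positions.
        numbers.append(
            sum(distances[sequence[i] + sequence[i + lag]] ** 2 for i in range(len(sequence) - lag))
        )
    return numbers


# Hypothetical distances derived from a toy scale, keyed by residue pair.
toy_scale = {"A": 0.0, "G": 1.0, "L": 3.0}
toy_distances = {a + b: abs(toy_scale[a] - toy_scale[b]) for a in toy_scale for b in toy_scale}
taus = sequence_order_coupling_numbers("AGLA", toy_distances, maxlag=3)
```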

graphein.protein.features.sequence.propy.transition_charge(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]] = None) networkx.classes.graph.Graph[source]#

Calculate transition descriptors based on Charge of AADs.

Parameters
  • G (nx.Graph) – Protein Graph to featurise

  • aggregation_type (List[Optional[str]]) – Aggregation types to use over chains

Returns

Protein Graph with transition_charge feature added. G.graph[“transition_charge_{chain | aggregation_type}”]

Return type

nx.Graph

graphein.protein.features.sequence.propy.transition_descriptor(G: networkx.classes.graph.Graph, AAProperty: Dict[Any, Any], AAPName: str, aggregation_type: Optional[List[str]] = None) networkx.classes.graph.Graph[source]#

Compute transition descriptors.

Parameters
  • G (nx.Graph) – Protein Graph to featurise

  • AAProperty (Dict[Any, Any]) – contains classification of amino acids such as _Polarizability.

  • AAPName (str) – Name of the amino acid property (AAP), used to label the feature.

  • aggregation_type (List[Optional[str]]) – Aggregation types to use over chains

Returns

Protein Graph with transition_{AAPName} feature added. G.graph[“transition_{AAPName}_{chain | aggregation_type}”]

Return type

nx.Graph
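Transition descriptors in the CTD scheme measure how often adjacent residues switch between two property classes, in either direction, relative to the number of adjacent pairs. The sketch below illustrates this; ctd_transition is a hypothetical helper and toy_classes is an assumed three-way grouping.

```python
from typing import Dict


def ctd_transition(sequence: str, classes: Dict[str, str]) -> Dict[str, float]:
    """Frequency of class changes between adjacent residues (sketch)."""
    label_of = {aa: label for label, members in classes.items() for aa in members}
    encoded = [label_of[aa] for aa in sequence]
    n_pairs = len(encoded) - 1
    if n_pairs < 1:
        raise ValueError("sequence must contain at least two residues")
    labels = sorted(classes)
    result = {}
    for i, a in enumerate(labels):
        for b in labels[i + 1 :]:
            # Count a->b and b->a transitions together.
            count = sum(1 for x, y in zip(encoded, encoded[1:]) if {x, y} == {a, b})
            result[a + b] = round(count / n_pairs, 3)
    return result


# Assumed grouping for illustration: class label -> member amino acids.
toy_classes = {"1": "KR", "2": "ANCQGHILMFPSTWYV", "3": "DE"}
transitions = ctd_transition("MKRDEAAK", toy_classes)
```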

graphein.protein.features.sequence.propy.transition_hydrophobicity(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]] = None) networkx.classes.graph.Graph[source]#

Calculate transition descriptors based on Hydrophobicity of AADs.

Parameters
  • G (nx.Graph) – Protein Graph to featurise

  • aggregation_type (List[Optional[str]]) – Aggregation types to use over chains

Returns

Protein Graph with transition_hydrophobicity feature added. G.graph[“transition_hydrophobicity_{chain | aggregation_type}”]

Return type

nx.Graph

graphein.protein.features.sequence.propy.transition_normalized_vdwv(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]]) networkx.classes.graph.Graph[source]#

Calculate transition descriptors based on NormalizedVDWV of AADs.

Parameters
  • G (nx.Graph) – Protein Graph to featurise

  • aggregation_type (List[Optional[str]]) – Aggregation types to use over chains

Returns

Protein Graph with transition_normalized_vdwv feature added. G.graph[“transition_normalized_vdwv_{chain | aggregation_type}”]

Return type

nx.Graph

graphein.protein.features.sequence.propy.transition_polarity(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]]) networkx.classes.graph.Graph[source]#

Calculate transition descriptors based on Polarity of AADs.

Parameters
  • G (nx.Graph) – Protein Graph to featurise

  • aggregation_type (List[Optional[str]]) – Aggregation types to use over chains

Returns

Protein Graph with transition_polarity feature added. G.graph[“transition_polarity_{chain | aggregation_type}”]

Return type

nx.Graph

graphein.protein.features.sequence.propy.transition_polarizability(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]]) networkx.classes.graph.Graph[source]#

Calculate transition descriptors based on Polarizability of AADs.

Parameters
  • G (nx.Graph) – Protein graph to featurise

  • aggregation_type (List[Optional[str]]) – Aggregation types to use over chains

Returns

Protein Graph with transition_polarizability feature added. G.graph[“transition_polarizability_{chain | aggregation_type}”]

Return type

nx.Graph

graphein.protein.features.sequence.propy.transition_secondary_str(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]]) networkx.classes.graph.Graph[source]#

Calculate transition descriptors based on SecondaryStr of AADs.

Parameters
  • G (nx.Graph) – Protein graph to featurise

  • aggregation_type (List[Optional[str]]) – Aggregation types to use over chains

Returns

Protein Graph with transition_secondary_str feature added. G.graph[“transition_secondary_str_{chain | aggregation_type}”]

Return type

nx.Graph

graphein.protein.features.sequence.propy.transition_solvent_accessibility(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]]) networkx.classes.graph.Graph[source]#

Calculate transition descriptors based on SolventAccessibility of AADs.

Parameters
  • G (nx.Graph) – Protein graph to featurise

  • aggregation_type (List[Optional[str]]) – Aggregation types to use over chains

Returns

Protein Graph with transition_solvent_accessibility feature added. G.graph[“transition_solvent_accessibility_{chain | aggregation_type}”]

Return type

nx.Graph

Functions for graph-level featurization of the sequence of a protein. This submodule is focussed on physicochemical properties of the sequence.

Sequence Utils#

Functions to add embeddings from pre-trained language models to protein structure graphs.

graphein.protein.features.sequence.embeddings.biovec_sequence_embedding(G: networkx.classes.graph.Graph) networkx.classes.graph.Graph[source]#

Adds BioVec sequence embedding feature to the graph. Computed over chains.

Source

ProtVec: A Continuous Distributed Representation of Biological Sequences

Paper: http://arxiv.org/pdf/1503.05140v1.pdf

Parameters

G (nx.Graph) – nx.Graph protein structure graph.

Returns

nx.Graph protein structure graph with biovec embedding added. e.g. G.graph["biovec_embedding_A"] for chain A.

Return type

nx.Graph

graphein.protein.features.sequence.embeddings.compute_esm_embedding(sequence: str, representation: str, model_name: str = 'esm1b_t33_650M_UR50S', output_layer: int = 33) np.ndarray[source]#

Computes a sequence embedding using a pre-trained ESM model from FAIR.

Biological Structure and Function Emerge from Scaling Unsupervised Learning to 250 Million Protein Sequences (2019) Rives, Alexander and Meier, Joshua and Sercu, Tom and Goyal, Siddharth and Lin, Zeming and Liu, Jason and Guo, Demi and Ott, Myle and Zitnick, C. Lawrence and Ma, Jerry and Fergus, Rob

Transformer protein language models are unsupervised structure learners 2020 Rao, Roshan M and Meier, Joshua and Sercu, Tom and Ovchinnikov, Sergey and Rives, Alexander

Pre-trained models:

Shorthand  Full Name             Layers  Params  Dataset  Embedding Dim  Model URL
=========  ====================  ======  ======  =======  =============  =========
ESM-1b     esm1b_t33_650M_UR50S  33      650M    UR50/S   1280           https://dl.fbaipublicfiles.com/fair-esm/models/esm1b_t33_650M_UR50S.pt
ESM1-main  esm1_t34_670M_UR50S   34      670M    UR50/S   1280           https://dl.fbaipublicfiles.com/fair-esm/models/esm1_t34_670M_UR50S.pt
           esm1_t34_670M_UR50D   34      670M    UR50/D   1280           https://dl.fbaipublicfiles.com/fair-esm/models/esm1_t34_670M_UR50D.pt
           esm1_t34_670M_UR100   34      670M    UR100    1280           https://dl.fbaipublicfiles.com/fair-esm/models/esm1_t34_670M_UR100.pt
           esm1_t12_85M_UR50S    12      85M     UR50/S   768            https://dl.fbaipublicfiles.com/fair-esm/models/esm1_t12_85M_UR50S.pt
           esm1_t6_43M_UR50S     6       43M     UR50/S   768            https://dl.fbaipublicfiles.com/fair-esm/models/esm1_t6_43M_UR50S.pt

Parameters
  • sequence (str) – Protein sequence to embed.

  • representation (str) – Type of embedding to extract: "residue" or "sequence". Sequence-level embeddings are averaged residue embeddings.

  • model_name (str) – Name of pre-trained model to use

  • output_layer (int) – integer indicating which layer the output should be taken from

Returns

embedding (np.ndarray)

Return type

np.ndarray
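
The "sequence" representation is described above as the average of the residue embeddings. A minimal sketch of that pooling step follows, using plain Python lists in place of real model output. The helper name mean_pool and the toy values are illustrative assumptions and are not part of graphein or ESM.

```python
from typing import List


def mean_pool(residue_embeddings: List[List[float]]) -> List[float]:
    """Average per-residue vectors into one sequence-level vector.

    Hypothetical helper for illustration; not part of graphein or ESM.
    """
    if not residue_embeddings:
        raise ValueError("Need at least one residue embedding.")
    n_residues = len(residue_embeddings)
    dim = len(residue_embeddings[0])
    # Sum each embedding dimension over all residues, then divide by the
    # number of residues.
    return [
        sum(vec[d] for vec in residue_embeddings) / n_residues
        for d in range(dim)
    ]


# Toy "residue" representation: 3 residues, embedding dimension 2.
residue_level = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
sequence_level = mean_pool(residue_level)
print(sequence_level)
```

With a real model the residue-level array would have one row per residue and as many columns as the embedding dimension listed in the table above.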

graphein.protein.features.sequence.embeddings.esm_residue_embedding(G: networkx.classes.graph.Graph, model_name: str = 'esm1b_t33_650M_UR50S', output_layer: int = 33) networkx.classes.graph.Graph[source]#

Computes ESM residue embeddings from a protein sequence and adds them to the graph.

Biological Structure and Function Emerge from Scaling Unsupervised Learning to 250 Million Protein Sequences (2019) Rives, Alexander and Meier, Joshua and Sercu, Tom and Goyal, Siddharth and Lin, Zeming and Liu, Jason and Guo, Demi and Ott, Myle and Zitnick, C. Lawrence and Ma, Jerry and Fergus, Rob

Transformer protein language models are unsupervised structure learners 2020 Rao, Roshan M and Meier, Joshua and Sercu, Tom and Ovchinnikov, Sergey and Rives, Alexander

Pre-trained models: see compute_esm_embedding for the available model names.

Parameters
  • G (nx.Graph) – nx.Graph to add esm embedding to.

  • model_name (str) – Name of pre-trained model to use.

  • output_layer (int) – index of output layer in pre-trained model.

Returns

nx.Graph with esm embedding feature added to nodes.

Return type

nx.Graph

graphein.protein.features.sequence.embeddings.esm_sequence_embedding(G: networkx.classes.graph.Graph) networkx.classes.graph.Graph[source]#

Computes ESM sequence embedding feature over chains in a graph.

Parameters

G (nx.Graph) – nx.Graph protein structure graph.

Returns

nx.Graph protein structure graph with esm embedding features added, e.g. G.graph["esm_embedding_A"] for chain A.

Return type

nx.Graph

Utils#

Utility functions to work with graph-level features.

graphein.protein.features.utils.aggregate_graph_feature_over_chains(G: networkx.classes.graph.Graph, feature_name: str, aggregation_type: str) networkx.classes.graph.Graph[source]#

Performs aggregation of a feature over the chains. E.g. sums/averages/min/max molecular weights for each chain.

Parameters
  • G (nx.Graph) – nx.Graph of protein containing chain-specific features.

  • feature_name (str) – Name of features to aggregate.

  • aggregation_type (str) – Type of aggregation to perform ("min", "max", "sum", "mean").

Raises

NameError – If aggregation_type is not one of "min", "max", "sum", "mean".

Returns

nx.Graph of protein with a new aggregated feature G.graph[f"{feature_name}_{aggregation_type}"].

Return type

nx.Graph
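
The aggregation described above can be sketched with a plain dict standing in for G.graph. This is only an illustration under stated assumptions: the per-chain key pattern f"{feature_name}_{chain}" is inferred from examples such as G.graph["esm_embedding_A"], and the helper name aggregate_over_chains is hypothetical rather than graphein's implementation.

```python
from statistics import mean
from typing import Callable, Dict, Sequence

# Aggregation names taken from the parameter description above.
AGGREGATORS: Dict[str, Callable[[Sequence[float]], float]] = {
    "min": min,
    "max": max,
    "sum": sum,
    "mean": mean,
}


def aggregate_over_chains(
    graph_attrs: dict,
    feature_name: str,
    chains: Sequence[str],
    aggregation_type: str,
) -> dict:
    """Hypothetical stand-in: aggregate a per-chain feature in a dict."""
    if aggregation_type not in AGGREGATORS:
        raise NameError(f"Unsupported aggregation type: {aggregation_type}")
    # Assumption: per-chain values live under "<feature_name>_<chain>".
    values = [graph_attrs[f"{feature_name}_{chain}"] for chain in chains]
    result_key = f"{feature_name}_{aggregation_type}"
    graph_attrs[result_key] = AGGREGATORS[aggregation_type](values)
    return graph_attrs


attrs = {"molecular_weight_A": 100.0, "molecular_weight_B": 300.0}
attrs = aggregate_over_chains(attrs, "molecular_weight", ["A", "B"], "mean")
print(attrs["molecular_weight_mean"])
```

The result is stored under the same f"{feature_name}_{aggregation_type}" key that the Returns field above names.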

graphein.protein.features.utils.convert_graph_dict_feat_to_series(G: networkx.classes.graph.Graph, feature_name: str) networkx.classes.graph.Graph[source]#

Takes in a graph and a graph-level feature_name. Converts this feature to a pd.Series. This is useful as some features are output as dictionaries and we wish to standardise this.

Parameters
  • G (nx.Graph) – nx.Graph containing G.graph[f"{feature_name}"] (Dict[Any, Any]).

  • feature_name (str) – Name of the dictionary-valued feature to convert to a pd.Series.

Returns

nx.Graph containing G.graph[f"{feature_name}"]: pd.Series.

Return type

nx.Graph

Subgraphs#

Provides functions for extracting subgraphs from protein graphs.

graphein.protein.subgraphs.extract_k_hop_subgraph(g: networkx.classes.graph.Graph, central_node: str, k: int, k_only: bool = False, filter_dataframe: bool = True, update_coords: bool = True, recompute_distmat: bool = False, inverse: bool = False, return_node_list: bool = False) Union[networkx.classes.graph.Graph, List[str]][source]#

Extracts a k-hop subgraph.

Parameters
  • g (nx.Graph) – The graph to extract the subgraph from.

  • central_node (str) – The central node to extract the subgraph from.

  • k (int) – The number of hops to extract.

  • k_only (bool) – Whether to extract only the nodes exactly k hops from the central node (e.g. when True, 2-hop neighbours are left out of a 5-hop subgraph). Defaults to False.

  • filter_dataframe (bool, optional) – Whether to filter the pdb_df of the graph, defaults to True

  • update_coords (bool) – Whether to update the coordinates of the graph. Defaults to True.

  • recompute_distmat (bool) – Whether to recompute the distance matrix of the graph. Defaults to False.

  • inverse (bool, optional) – Whether to invert the selection, defaults to False

  • return_node_list (bool) – Whether to return the node list. Defaults to False.

Returns

The subgraph or node list if return_node_list is True.

Return type

Union[nx.Graph, List[str]]
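
A rough sketch of k-hop selection follows, using a breadth-first search over a plain adjacency dict. It assumes that "exactly k hops" means shortest-path distance k; graphein's own implementation may treat k_only differently. The function name k_hop_nodes and the toy node IDs are illustrative.

```python
from typing import Dict, List, Set


def k_hop_nodes(
    adjacency: Dict[str, List[str]],
    central_node: str,
    k: int,
    k_only: bool = False,
) -> Set[str]:
    """Return the nodes within k hops, or exactly k hops if k_only is True.

    Hypothetical helper; assumes hop count means shortest-path distance.
    """
    shells: List[Set[str]] = [{central_node}]  # shells[i] = nodes i hops away
    visited: Set[str] = {central_node}
    for _ in range(k):
        # Expand the outermost shell by one hop, skipping seen nodes.
        frontier = {
            nbr
            for node in shells[-1]
            for nbr in adjacency.get(node, [])
            if nbr not in visited
        }
        visited |= frontier
        shells.append(frontier)
    return shells[k] if k_only else visited


# Toy linear chain of four residues.
adjacency = {
    "A:MET:1": ["A:GLY:2"],
    "A:GLY:2": ["A:MET:1", "A:SER:3"],
    "A:SER:3": ["A:GLY:2", "A:LYS:4"],
    "A:LYS:4": ["A:SER:3"],
}
print(sorted(k_hop_nodes(adjacency, "A:MET:1", 2)))
print(sorted(k_hop_nodes(adjacency, "A:MET:1", 2, k_only=True)))
```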

graphein.protein.subgraphs.extract_subgraph(g: networkx.classes.graph.Graph, node_list: Optional[List[str]] = None, sequence_positions: Optional[List[int]] = None, chains: Optional[List[str]] = None, residue_types: Optional[List[str]] = None, atom_types: Optional[List[str]] = None, bond_types: Optional[List[str]] = None, centre_point: Optional[Union[numpy.ndarray, Tuple[float, float, float]]] = None, radius: Optional[float] = None, ss_elements: Optional[List[str]] = None, rsa_threshold: Optional[float] = None, k_hop_central_node: Optional[str] = None, k_hops: Optional[int] = None, k_only: Optional[bool] = None, filter_dataframe: bool = True, update_coords: bool = True, recompute_distmat: bool = False, inverse: bool = False, return_node_list: bool = False) Union[networkx.classes.graph.Graph, List[str]][source]#

Extracts a subgraph from a graph based on a list of nodes, sequence positions, chains, residue types, atom types, bond types, centre point and radius, secondary structure elements, RSA threshold, or a k-hop neighbourhood.

Parameters
  • g (nx.Graph) – The graph to extract the subgraph from.

  • node_list (List[str], optional) – List of nodes to extract specified by their node_id. Defaults to None.

  • sequence_positions (List[int], optional) – The sequence positions to extract. Defaults to None.

  • chains (List[str], optional) – The chain(s) to extract. Defaults to None.

  • residue_types (List[str], optional) – List of allowable residue types (3 letter residue names). Defaults to None.

  • atom_types (List[str], optional) – List of allowable atom types. Defaults to None.

  • bond_types (List[str], optional) – List of allowable bond types. Defaults to None.

  • centre_point (Union[np.ndarray, Tuple[float, float, float]], optional) – The centre point to extract the subgraph from. Defaults to None.

  • radius (float, optional) – The radius to extract the subgraph from. Defaults to None.

  • ss_elements (List[str], optional) – List of secondary structure elements to extract. One or more of ["H", "B", "E", "G", "I", "T", "S", "-"], corresponding to alpha helix, beta bridge, strand, helix-3, helix-5, turn, bend, and none. Defaults to None.

  • rsa_threshold (float, optional) – The threshold to use for the RSA. Defaults to None.

  • k_hop_central_node (str, optional) – The central node to extract the k-hop subgraph from. Defaults to None.

  • k_hops (int, optional) – The number of hops to extract. Defaults to None.

  • k_only (bool, optional) – Whether to extract only the nodes exactly k hops from the central node (e.g. when True, 2-hop neighbours are left out of a 5-hop subgraph). Defaults to None.

  • filter_dataframe (bool, optional) – Whether to filter the pdb_df dataframe of the graph. Defaults to True.

  • update_coords (bool) – Whether to update the coordinates of the graph. Defaults to True.

  • recompute_distmat (bool) – Whether to recompute the distance matrix of the graph. Defaults to False.

  • inverse (bool, optional) – Whether to invert the selection. Defaults to False.

  • return_node_list (bool) – Whether to return the node list. Defaults to False.

Returns

The subgraph or node list if return_node_list is True.

Return type

Union[nx.Graph, List[str]]

graphein.protein.subgraphs.extract_subgraph_by_bond_type(g: networkx.classes.graph.Graph, bond_types: List[str], filter_dataframe: bool = True, update_coords: bool = True, recompute_distmat: bool = False, inverse: bool = False, return_node_list: bool = False) Union[networkx.classes.graph.Graph, List[str]][source]#

Extracts a subgraph from a graph based on a list of allowable bond types.

Parameters
  • g (nx.Graph) – The graph to extract the subgraph from.

  • bond_types (List[str]) – List of allowable bond types.

  • filter_dataframe (bool, optional) – Whether to filter the pdb_df of the graph, defaults to True

  • update_coords (bool) – Whether to update the coordinates of the graph. Defaults to True.

  • recompute_distmat (bool) – Whether to recompute the distance matrix of the graph. Defaults to False.

  • inverse (bool, optional) – Whether to invert the selection, defaults to False

  • return_node_list (bool, optional) – Whether to return the node list, defaults to False

Returns

The subgraph or node list if return_node_list is True.

Return type

Union[nx.Graph, List[str]]

graphein.protein.subgraphs.extract_subgraph_by_sequence_position(g: networkx.classes.graph.Graph, sequence_positions: List[int], filter_dataframe: bool = True, update_coords: bool = True, recompute_distmat: bool = False, inverse: bool = False, return_node_list: bool = False) Union[networkx.classes.graph.Graph, List[str]][source]#

Extracts a subgraph from a graph based on a list of sequence positions.

Parameters
  • g (nx.Graph) – The graph to extract the subgraph from.

  • sequence_positions (List[int]) – The sequence positions to extract.

  • filter_dataframe (bool) – Whether to filter the pdb_df dataframe of the graph. Defaults to True.

  • update_coords (bool) – Whether to update the coordinates of the graph. Defaults to True.

  • recompute_distmat (bool) – Whether to recompute the distance matrix of the graph. Defaults to False.

  • inverse (bool) – Whether to invert the selection. Defaults to False.

  • return_node_list (bool) – Whether to return the node list. Defaults to False.

Returns

The subgraph or node list if return_node_list is True.

Return type

Union[nx.Graph, List[str]]

graphein.protein.subgraphs.extract_subgraph_from_atom_types(g: networkx.classes.graph.Graph, atom_types: List[str], filter_dataframe: bool = True, update_coords: bool = True, recompute_distmat: bool = False, inverse: bool = False, return_node_list: bool = False) Union[networkx.classes.graph.Graph, List[str]][source]#

Extracts a subgraph from a graph based on a list of atom types.

Parameters
  • g (nx.Graph) – The graph to extract the subgraph from.

  • atom_types (List[str]) – The list of atom types to extract.

  • filter_dataframe (bool) – Whether to filter the pdb_df dataframe of the graph. Defaults to True.

  • update_coords (bool) – Whether to update the coordinates of the graph. Defaults to True.

  • recompute_distmat (bool) – Whether to recompute the distance matrix of the graph. Defaults to False.

  • inverse (bool) – Whether to invert the selection. Defaults to False.

  • return_node_list (bool) – Whether to return the node list. Defaults to False.

Returns

The subgraph or node list if return_node_list is True.

Return type

Union[nx.Graph, List[str]]

graphein.protein.subgraphs.extract_subgraph_from_chains(g: networkx.classes.graph.Graph, chains: List[str], filter_dataframe: bool = True, update_coords: bool = True, recompute_distmat: bool = False, inverse: bool = False, return_node_list: bool = False) Union[networkx.classes.graph.Graph, List[str]][source]#

Extracts a subgraph from a graph based on a list of chains.

Parameters
  • g (nx.Graph) – The graph to extract the subgraph from.

  • chains (List[str]) – The chain(s) to extract.

  • filter_dataframe (bool) – Whether to filter the pdb_df dataframe of the graph. Defaults to True.

  • update_coords (bool) – Whether to update the coordinates of the graph. Defaults to True.

  • recompute_distmat (bool) – Whether to recompute the distance matrix of the graph. Defaults to False.

  • inverse (bool) – Whether to invert the selection. Defaults to False.

  • return_node_list (bool) – Whether to return the node list. Defaults to False.

Returns

The subgraph or node list if return_node_list is True.

Return type

Union[nx.Graph, List[str]]

graphein.protein.subgraphs.extract_subgraph_from_node_list(g, node_list: Optional[List[str]], filter_dataframe: bool = True, update_coords: bool = True, recompute_distmat: bool = False, inverse: bool = False, return_node_list: bool = False) Union[networkx.classes.graph.Graph, List[str]][source]#

Extracts a subgraph from a graph based on a list of nodes.

Parameters
  • g (nx.Graph) – The graph to extract the subgraph from.

  • node_list (List[str]) – The list of nodes to extract.

  • filter_dataframe (bool) – Whether to filter the pdb_df dataframe of the graph. Defaults to True.

  • update_coords (bool) – Whether to update the coordinates of the graph. Defaults to True.

  • recompute_distmat (bool) – Whether to recompute the distance matrix of the graph. Defaults to False.

  • inverse (bool) – Whether to invert the selection. Defaults to False.

  • return_node_list (bool) – Whether to return the node list. Defaults to False.

Returns

The subgraph or node list if return_node_list is True.

Return type

Union[nx.Graph, List[str]]

graphein.protein.subgraphs.extract_subgraph_from_point(g: networkx.classes.graph.Graph, centre_point: Union[numpy.ndarray, Tuple[float, float, float]], radius: float, filter_dataframe: bool = True, update_coords: bool = True, recompute_distmat: bool = False, inverse: bool = False, return_node_list: bool = False) Union[networkx.classes.graph.Graph, List[str]][source]#

Extracts a subgraph from a graph based on a centre point and radius.

Parameters
  • g (nx.Graph) – The graph to extract the subgraph from.

  • centre_point (Union[np.ndarray, Tuple[float, float, float]]) – The centre point of the subgraph.

  • radius (float) – The radius of the subgraph.

  • filter_dataframe (bool) – Whether to filter the pdb_df dataframe of the graph. Defaults to True.

  • update_coords (bool) – Whether to update the coordinates of the graph. Defaults to True.

  • recompute_distmat (bool) – Whether to recompute the distance matrix of the graph. Defaults to False.

  • inverse (bool) – Whether to invert the selection. Defaults to False.

  • return_node_list (bool) – Whether to return the node list. Defaults to False.

Returns

The subgraph or node list if return_node_list is True.

Return type

Union[nx.Graph, List[str]]
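
Selecting nodes by centre point and radius amounts to a Euclidean distance filter over node coordinates. The sketch below shows that filter on a plain dict of coordinates. Whether the boundary is inclusive is an assumption here, and nodes_within_radius is a hypothetical helper, not the library function.

```python
import math
from typing import Dict, List, Tuple

Coord = Tuple[float, float, float]


def nodes_within_radius(
    coords: Dict[str, Coord],
    centre_point: Coord,
    radius: float,
    inverse: bool = False,
) -> List[str]:
    """Hypothetical helper: pick node IDs by distance from a point."""
    # Assumption: nodes lying exactly on the sphere surface are included.
    inside = [
        node_id
        for node_id, xyz in coords.items()
        if math.dist(xyz, centre_point) <= radius
    ]
    if inverse:
        # Inverting the selection keeps every node that was not selected.
        return [node_id for node_id in coords if node_id not in inside]
    return inside


coords = {
    "A:GLY:1": (0.0, 0.0, 0.0),
    "A:ALA:2": (3.0, 0.0, 0.0),
    "A:SER:3": (10.0, 0.0, 0.0),
}
print(nodes_within_radius(coords, (0.0, 0.0, 0.0), radius=5.0))
print(nodes_within_radius(coords, (0.0, 0.0, 0.0), radius=5.0, inverse=True))
```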

graphein.protein.subgraphs.extract_subgraph_from_residue_types(g: networkx.classes.graph.Graph, residue_types: List[str], filter_dataframe: bool = True, update_coords: bool = True, recompute_distmat: bool = False, inverse: bool = False, return_node_list: bool = False) Union[networkx.classes.graph.Graph, List[str]][source]#

Extracts a subgraph from a graph based on a list of allowable residue types.

Parameters
  • g (nx.Graph) – The graph to extract the subgraph from.

  • residue_types (List[str]) – List of allowable residue types (3 letter residue names).

  • filter_dataframe (bool, optional) – Whether to filter the pdb_df of the graph, defaults to True

  • update_coords (bool) – Whether to update the coordinates of the graph. Defaults to True.

  • recompute_distmat (bool) – Whether to recompute the distance matrix of the graph. Defaults to False.

  • inverse (bool) – Whether to invert the selection. Defaults to False.

  • return_node_list (bool) – Whether to return the node list. Defaults to False.

Returns

The subgraph or node list if return_node_list is True.

Return type

Union[nx.Graph, List[str]]

graphein.protein.subgraphs.extract_subgraph_from_secondary_structure(g: networkx.classes.graph.Graph, ss_elements: List[str], inverse: bool = False, filter_dataframe: bool = True, recompute_distmat: bool = False, update_coords: bool = True, return_node_list: bool = False) Union[networkx.classes.graph.Graph, List[str]][source]#

Extracts subgraphs for nodes that have a secondary structure element in the list.

Parameters
  • g (nx.Graph) – The graph to extract the subgraph from.

  • ss_elements (List[str]) – List of secondary structure elements to extract.

  • inverse (bool) – Whether to invert the selection. Defaults to False.

  • filter_dataframe (bool, optional) – Whether to filter the pdb_df of the graph, defaults to True

  • recompute_distmat (bool) – Whether to recompute the distance matrix of the graph. Defaults to False.

  • update_coords (bool) – Whether to update the coordinates of the graph. Defaults to True.

  • return_node_list (bool) – Whether to return the node list. Defaults to False.

Raises

ProteinGraphConfigurationError – If the graph does not contain ss features on the nodes ("ss" not in d.keys() for _, d in g.nodes(data=True)).

Returns

The subgraph or node list if return_node_list is True.

Return type

Union[nx.Graph, List[str]]

graphein.protein.subgraphs.extract_surface_subgraph(g: networkx.classes.graph.Graph, rsa_threshold: float = 0.2, inverse: bool = False, filter_dataframe: bool = True, recompute_distmat: bool = False, update_coords: bool = True, return_node_list: bool = False) Union[networkx.classes.graph.Graph, List[str]][source]#

Extracts a subgraph based on thresholding the Relative Solvent Accessibility (RSA). This can be used for extracting a surface graph.

Parameters
  • g (nx.Graph) – The graph to extract the subgraph from.

  • rsa_threshold (float) – The threshold to use for the RSA. Defaults to 0.2 (20%).

  • filter_dataframe (bool, optional) – Whether to filter the pdb_df of the graph, defaults to True

  • update_coords (bool) – Whether to update the coordinates of the graph. Defaults to True.

  • recompute_distmat (bool) – Whether to recompute the distance matrix of the graph. Defaults to False.

  • inverse (bool, optional) – Whether to invert the selection, defaults to False

  • return_node_list (bool) – Whether to return the node list. Defaults to False.

Raises

ProteinGraphConfigurationError – If the graph does not contain RSA features on the nodes ("rsa" not in d.keys() for _, d in g.nodes(data=True)).

Returns

The subgraph or node list if return_node_list is True.

Return type

Union[nx.Graph, List[str]]
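
The RSA thresholding can be illustrated on a plain dict of node attributes. This is a sketch under assumptions: nodes at or above the threshold are treated as surface nodes, and a ValueError stands in for ProteinGraphConfigurationError. The helper name surface_node_ids is hypothetical.

```python
from typing import Dict, List


def surface_node_ids(
    node_data: Dict[str, Dict[str, float]],
    rsa_threshold: float = 0.2,
    inverse: bool = False,
) -> List[str]:
    """Hypothetical helper: select node IDs by relative solvent accessibility."""
    for node_id, d in node_data.items():
        if "rsa" not in d:
            # Stand-in for ProteinGraphConfigurationError in this sketch.
            raise ValueError(f"Node {node_id} has no 'rsa' feature.")
    if inverse:
        # Inverted selection: buried nodes below the threshold.
        return [n for n, d in node_data.items() if d["rsa"] < rsa_threshold]
    # Assumption: values equal to the threshold count as surface.
    return [n for n, d in node_data.items() if d["rsa"] >= rsa_threshold]


node_data = {
    "A:LEU:1": {"rsa": 0.05},
    "A:LYS:2": {"rsa": 0.6},
    "A:ASP:3": {"rsa": 0.2},
}
print(surface_node_ids(node_data))
print(surface_node_ids(node_data, inverse=True))
```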

Analysis#

Contains utilities for computing analytics on and plotting summaries of Protein Structure Graphs.

graphein.protein.analysis.graph_summary(G: networkx.classes.graph.Graph, summary_statistics: List[str] = ['degree', 'betweenness_centrality', 'closeness_centrality', 'eigenvector_centrality', 'communicability_betweenness_centrality'], custom_data: Optional[Union[pandas.core.frame.DataFrame, pandas.core.series.Series]] = None, plot: bool = False) pandas.core.frame.DataFrame[source]#

Returns a summary of the graph in a dataframe.

Parameters
  • G (nx.Graph) – NetworkX graph to get summary of.

  • plot (bool) – Whether or not to plot the summary as a heatmap, defaults to False.

Returns

Dataframe of summary or plot.

Return type

pd.DataFrame

graphein.protein.analysis.plot_degree_by_residue_type(g: nx.Graph, normalise_by_residue_occurrence: bool = True) plotly.graph_objects.Figure[source]#

Plots the distribution of node degrees in the graph, grouped by residue type.

Parameters
  • g (nx.Graph) – networkx graph to plot the distribution of node degrees by residue type of.

  • normalise_by_residue_occurrence (bool) – Whether to normalise the degree by the number of residues of the same type.

Returns

Plotly figure.

Return type

plotly.graph_objects.Figure
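
One plausible reading of normalise_by_residue_occurrence is that the summed degree for each residue type is divided by the number of residues of that type. The sketch below computes that quantity from plain dicts. It is an interpretation rather than the library's code, and degree_by_residue_type is a hypothetical helper.

```python
from collections import Counter, defaultdict
from typing import Dict


def degree_by_residue_type(
    degrees: Dict[str, int],
    residue_names: Dict[str, str],
    normalise_by_residue_occurrence: bool = True,
) -> Dict[str, float]:
    """Hypothetical helper: total (or mean) node degree per residue type."""
    totals: Dict[str, float] = defaultdict(float)
    for node_id, degree in degrees.items():
        totals[residue_names[node_id]] += degree
    if normalise_by_residue_occurrence:
        # Divide each total by how many residues of that type occur.
        counts = Counter(residue_names[node_id] for node_id in degrees)
        return {res: total / counts[res] for res, total in totals.items()}
    return dict(totals)


degrees = {"A:GLY:1": 1, "A:ALA:2": 2, "A:GLY:3": 3}
residue_names = {"A:GLY:1": "GLY", "A:ALA:2": "ALA", "A:GLY:3": "GLY"}
print(degree_by_residue_type(degrees, residue_names))
print(degree_by_residue_type(degrees, residue_names, False))
```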

graphein.protein.analysis.plot_degree_distribution(g: nx.Graph, title: Optional[str] = None) plotly.graph_objects.Figure[source]#

Plots the distribution of node degrees in the graph.

Parameters
  • g (nx.Graph) – networkx graph to plot the distribution of node degrees in.

  • title (Optional[str], optional) – Title of plot, defaults to None.

Returns

Plotly figure.

Return type

plotly.graph_objects.Figure

graphein.protein.analysis.plot_edge_type_distribution(g: nx.Graph, plot_type: str = 'bar', title: Optional[str] = None) plotly.graph_objects.Figure[source]#

Plots the distribution of edge types in the graph.

Parameters
  • g (nx.Graph) – NetworkX graph to plot the distribution of edge types in.

  • plot_type (str, optional) – Type of plot to produce, defaults to "bar". One of "bar", "pie".

  • title (Optional[str], optional) – Title of plot, defaults to None.

Returns

Plotly figure.

Return type

plotly.graph_objects.Figure

graphein.protein.analysis.plot_graph_metric_property_correlation(g: nx.Graph, summary_statistics: List[str] = ['degree', 'betweenness_centrality', 'closeness_centrality', 'eigenvector_centrality', 'communicability_betweenness_centrality'], properties: List[str] = ['asa'], colour_by: Optional[str] = 'residue_type', opacity: float = 0.2, diagonal_visible: bool = True, title: Optional[str] = None, height: int = 1000, width: int = 1000, font_size: int = 10) plotly.graph_objects.Figure[source]#

Plots the correlation between graph metrics and properties.

Parameters
  • g (nx.Graph) – Protein graph to plot the correlation of.

  • summary_statistics (List[str], optional) – List of graph metrics to employ in plot, defaults to ["degree", "betweenness_centrality", "closeness_centrality", "eigenvector_centrality", "communicability_betweenness_centrality"].

  • properties (List[str], optional) – List of node properties to use in plot, defaults to ["asa"].

  • colour_by (Optional[str], optional) – Controls colouring of points in plot. Options: "residue_type", "position", "chain", defaults to "residue_type".

  • opacity (float, optional) – Opacity of plot points, defaults to 0.2.

  • diagonal_visible (bool, optional) – Whether or not to show the diagonal plots, defaults to True.

  • title (Optional[str], optional) – Title of plot, defaults to None.

  • height (int, optional) – Height of plot, defaults to 1000.

  • width (int, optional) – Width of plot, defaults to 1000.

  • font_size (int, optional) – Font size for plot text, defaults to 10.

Returns

Scatter plot matrix of graph metrics and protein properties.

Return type

plotly.graph_objects.Figure

graphein.protein.analysis.plot_residue_composition(g: nx.Graph, sort_by: Optional[str] = None, plot_type: str = 'bar') plotly.graph_objects.Figure[source]#

Plots the residue composition of the graph.

Parameters
  • g (nx.Graph) – Protein graph to plot the residue composition of.

  • sort_by (Optional[str], optional) – How to sort the values ("alphabetical", "count"), defaults to None (no sorting).

  • plot_type (str, optional) – How to plot the composition ("bar", "pie"), defaults to "bar".

Raises

ValueError – Raises ValueError if sort_by is not one of "alphabetical", "count".

Returns

Plotly figure.

Return type

plotly.graph_objects.Figure

Meshes#

Functions to create protein meshes via pymol.

graphein.protein.meshes.check_for_pymol_installation()[source]#

Checks for presence of a pymol installation

graphein.protein.meshes.configure_pymol_session(config: Optional[graphein.protein.config.ProteinMeshConfig] = None)[source]#

Configures a PyMol session based on config.parse_pymol_commands. Uses default parameters "-cKq".

See: https://pymolwiki.org/index.php/Command_Line_Options

Parameters

config (graphein.protein.config.ProteinMeshConfig) – ProteinMeshConfig to use. Defaults to None which uses default config.

graphein.protein.meshes.convert_verts_and_face_to_mesh(verts: torch.FloatTensor, faces: NamedTuple) Meshes[source]#

Converts vertices and faces into a pytorch3d.structures Meshes object.

Parameters
  • verts (torch.FloatTensor) – Vertices.

  • faces (NamedTuple) – Faces.

Returns

Meshes object.

Return type

pytorch3d.structures.Meshes

graphein.protein.meshes.create_mesh(pdb_file: Optional[str] = None, pdb_code: Optional[str] = None, out_dir: Optional[str] = None, config: Optional[ProteinMeshConfig] = None) Tuple[torch.FloatTensor, NamedTuple, NamedTuple][source]#

Creates a PyTorch3D mesh from a pdb_file or pdb_code.

Parameters
  • pdb_file (str, optional) – path to pdb_file. Defaults to None.

  • pdb_code (str, optional) – 4-letter PDB accession code. Defaults to None.

  • out_dir (str, optional) – output directory to store .obj file. Defaults to None.

  • config (graphein.protein.config.ProteinMeshConfig) – ProteinMeshConfig config to use. Defaults to default config in graphein.protein.config.

Returns

verts, faces, aux.

Return type

Tuple[torch.FloatTensor, NamedTuple, NamedTuple]

graphein.protein.meshes.get_obj_file(pdb_file: Optional[str] = None, pdb_code: Optional[str] = None, out_dir: Optional[str] = None, config: Optional[graphein.protein.config.ProteinMeshConfig] = None) str[source]#

Runs PyMol to compute surface/mesh for a given protein.

Parameters
  • pdb_file (str, optional) – path to pdb_file to use. Defaults to None.

  • pdb_code (str, optional) – 4-letter pdb accession code. Defaults to None.

  • out_dir (str, optional) – path to output. Defaults to None.

  • config (graphein.protein.config.ProteinMeshConfig) – ProteinMeshConfig containing pymol commands to run. Default is None ("show surface").

Raises

ValueError – If both or neither of pdb_file and pdb_code are provided.

Returns

Path to the .obj file.

Return type

str

graphein.protein.meshes.normalize_and_center_mesh_vertices(verts: torch.FloatTensor) torch.FloatTensor[source]#

Scales, normalizes and centers the target mesh to fit in a sphere of radius 1 centered at (0, 0, 0).

(scale, center) will be used to bring the predicted mesh back to its original center and scale. Note that normalizing the target mesh speeds up the optimization but is not necessary.

Parameters

verts (torch.FloatTensor) – Mesh vertices.

Returns

Normalized and centered vertices.

Return type

torch.FloatTensor
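
As a rough illustration of the centering and scaling step, the sketch below works on plain Python tuples instead of torch tensors. It assumes the center is the vertex mean and the scale is the largest distance from that center, which guarantees the result fits in a unit sphere; the library may compute the scale differently. The name normalize_and_center is hypothetical.

```python
import math
from typing import List, Tuple

Vertex = Tuple[float, ...]


def normalize_and_center(verts: List[Vertex]) -> Tuple[List[Vertex], Vertex, float]:
    """Hypothetical helper: fit vertices into a unit sphere at the origin."""
    n = len(verts)
    # Center: mean of the vertex coordinates (assumption for this sketch).
    center = tuple(sum(v[i] for v in verts) / n for i in range(3))
    centered = [tuple(v[i] - center[i] for i in range(3)) for v in verts]
    # Scale: largest Euclidean distance from the center.
    scale = max(math.hypot(*v) for v in centered)
    if scale == 0:
        raise ValueError("All vertices coincide; cannot normalize.")
    normalized = [tuple(c / scale for c in v) for v in centered]
    return normalized, center, scale


verts = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 2.0, 0.0), (2.0, 2.0, 0.0)]
normalized, center, scale = normalize_and_center(verts)
# Undo the transform to recover the original vertices.
restored = [tuple(c * scale + center[i] for i, c in enumerate(v)) for v in normalized]
print(center, scale)
```

Keeping center and scale around is what allows a mesh in normalized space to be mapped back to the original coordinates, as the last step shows.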

graphein.protein.meshes.parse_pymol_commands(config: graphein.protein.config.ProteinMeshConfig) List[str][source]#

Parses pymol commands from config. At the moment users can only supply a list of string commands.

Parameters

config (ProteinMeshConfig) – ProteinMeshConfig containing pymol commands to run in config.pymol_commands.

Returns

list of pymol commands to run

Return type

List[str]

graphein.protein.meshes.run_pymol_commands(commands: List[str]) None[source]#

Runs Pymol Commands.

Parameters

commands (List[str]) – List of commands to pass to PyMol.

Visualisation#

Functions for plotting protein graphs and meshes.

graphein.protein.visualisation.add_vector_to_plot(g: networkx.classes.graph.Graph, fig, vector: str = 'sidechain_vector', scale: float = 5, colour: str = 'red', width: int = 10) plotly.graph_objs._figure.Figure[source]#

Adds representations of vector features to the protein graph.

Requires all nodes have a vector feature (1 x 3 array).

Parameters
  • g (nx.Graph) – Protein graph containing vector features

  • fig (go.Figure) – 3D plotly figure to add vectors to.

  • vector (str, optional) – Name of node vector feature to add, defaults to “sidechain_vector”

  • scale (float, optional) – How much to scale the vectors by, defaults to 5

  • colour (str, optional) – Colours for vectors, defaults to “red”

Returns

3D Plotly plot with vectors added.

Return type

go.Figure

graphein.protein.visualisation.asteroid_plot(g: nx.Graph, node_id: str, k: int = 2, colour_nodes_by: str = 'shell', colour_edges_by: str = 'kind', edge_colour_map: plt.cm.Colormap = <matplotlib.colors.ListedColormap object>, show_labels: bool = True, title: Optional[str] = None, width: int = 600, height: int = 500, use_plotly: bool = True, show_edges: bool = False, node_size_multiplier: float = 10) Union[plotly.graph_objects.Figure, matplotlib.figure.Figure][source]#

Plots a k-hop subgraph around a node as concentric shells.

Radius of each point is proportional to the degree of the node (modified by node_size_multiplier).

Parameters
  • g (nx.Graph) – NetworkX graph to plot.

  • node_id (str) – Node to centre the plot around.

  • k (int) – Number of hops to plot. Defaults to 2.

  • colour_nodes_by (str) – Colour the nodes by this attribute. Currently only "shell" is supported.

  • colour_edges_by (str) – Colour the edges by this attribute. Currently only "kind" is supported.

  • edge_colour_map (plt.cm.Colormap) – Colour map for edges. Defaults to plt.cm.plasma.

  • title (str) – Title of the plot. Defaults to None.

  • width (int) – Width of the plot. Defaults to 600.

  • height (int) – Height of the plot. Defaults to 500.

  • use_plotly (bool) – Use plotly to render the graph. Defaults to True.

  • show_edges (bool) – Whether or not to show edges in the plot. Defaults to False.

  • node_size_multiplier (float) – Multiplier for the size of the nodes. Defaults to 10.

Returns

Plotly figure or matplotlib figure.

Return type

Union[plotly.graph_objects.Figure, matplotlib.figure.Figure]

graphein.protein.visualisation.colour_edges(G: networkx.classes.graph.Graph, colour_map: matplotlib.colors.ListedColormap, colour_by: str = 'kind') List[Tuple[float, float, float, float]][source]#

Computes edge colours based on the kind of bond/interaction.

Parameters
  • G (nx.Graph) – nx.Graph protein structure graph to compute edge colours from.

  • colour_map (matplotlib.colors.ListedColormap) – Colourmap to use.

  • colour_by (str) – Edge attribute to colour by. Currently only "kind" is supported.

Returns

List of edge colours.

Return type

List[Tuple[float, float, float, float]]

graphein.protein.visualisation.colour_nodes(G: networkx.classes.graph.Graph, colour_by: str, colour_map: matplotlib.colors.ListedColormap = <matplotlib.colors.ListedColormap object>) List[Tuple[float, float, float, float]][source]#

Computes node colours based on "degree", "seq_position" or node attributes.

Parameters
  • G (nx.Graph) – Graph to compute node colours for

  • colour_map (matplotlib.colors.ListedColormap) – Colourmap to use.

  • colour_by (str) – Manner in which to colour nodes. If not "degree" or "seq_position", this must correspond to a node feature.

Returns

List of node colours

Return type

List[Tuple[float, float, float, float]]

graphein.protein.visualisation.plot_chord_diagram(g: networkx.classes.graph.Graph, show_names: bool = True, order: Optional[List] = None, width: float = 0.1, pad: float = 2.0, gap: float = 0.03, chordwidth: float = 0.7, ax=None, colors=None, cmap=None, alpha=0.7, use_gradient: bool = False, chord_colors=None, show: bool = False, **kwargs)[source]#

Plot a chord diagram.

Based on Tanguy Fardet’s implementation: https://github.com/tfardet/mpl_chord_diagram

Parameters
  • g (nx.Graph) – NetworkX graph to plot. Its adjacency matrix provides the flux data, where mat[i, j] is the flux from i to j.

  • show_names (bool) – Whether to show the names of the nodes

  • order – list, optional (default: order of the matrix entries) Order in which the arcs should be placed around the trigonometric circle.

  • width (float) – float, optional (default: 0.1) Width/thickness of the ideogram arc.

  • pad (float) – float, optional (default: 2) Distance between two neighboring ideogram arcs. Unit: degree.

  • gap (float) – float, optional (default: 0.03) Distance between the arc and the beginning of the chord.

  • chordwidth – float, optional (default: 0.7) Position of the control points for the chords, controlling their shape.

  • ax – matplotlib axis, optional (default: new axis) Matplotlib axis where the plot should be drawn.

  • colors – list, optional (default: from cmap) List of user defined colors or floats.

  • cmap – str or colormap object (default: viridis) Colormap that will be used to color the arcs and chords by default. See chord_colors to use different colors for chords.

  • alpha – float in [0, 1], optional (default: 0.7) Opacity of the chord diagram.

  • use_gradient (bool) – bool, optional (default: False) Whether a gradient should be used so that chord extremities have the same color as the arc they belong to.

  • chord_colors

    str, or list of colors, optional (default: None) Specify color(s) to fill the chords differently from the arcs. When the keyword is not used, chord colors default to the colormap given by colors. Possible values for chord_colors are:

    • a single color (do not use an RGB tuple, use hex format instead), e.g. “red” or “#ff0000”; all chords will have this color

    • a list of colors, e.g. ["red", "green", "blue"], one per node (in this case, RGB tuples are accepted as entries to the list). Each chord will get its color from its associated source node, or from both nodes if use_gradient is True.

  • show – bool, optional (default: False) Whether the plot should be displayed immediately via an automatic call to plt.show().

  • kwargs (Dict[str, Any]) – keyword arguments. Available kwargs are:

    Name             Type               Purpose and possible values
    ===============  =================  ==============================
    fontcolor        str or list        Color of the names
    fontsize         int                Size of the font for names
    rotate_names     (list of) bool(s)  Rotate names by 90°
    sort             str                Either "size" or "distance"
    zero_entry_size  float              Size of zero-weight reciprocal

graphein.protein.visualisation.plot_distance_landscape(g: Optional[networkx.classes.graph.Graph] = None, dist_mat: Optional[numpy.ndarray] = None, add_contour: bool = True, title: Optional[str] = None, width: int = 500, height: int = 500, autosize: bool = False) plotly.graph_objs._figure.Figure[source]#

Plots a distance landscape of the graph.

Parameters
  • g (nx.Graph) – Graph to plot (must contain a distance matrix in g.graph["dist_mat"]).

  • add_contour (bool, optional) – Whether or not to show the contour, defaults to True.

  • width (int, optional) – Plot width, defaults to 500.

  • height (int, optional) – Plot height, defaults to 500.

  • autosize (bool, optional) – Whether or not to autosize the plot, defaults to False.

Returns

Plotly figure of distance landscape.

Return type

go.Figure

graphein.protein.visualisation.plot_distance_matrix(g: Optional[networkx.classes.graph.Graph], dist_mat: Optional[numpy.ndarray] = None, use_plotly: bool = True, title: Optional[str] = None, show_residue_labels: bool = True) plotly.graph_objs._figure.Figure[source]#

Plots a distance matrix of the graph.

Parameters
  • g (nx.Graph, optional) – NetworkX graph containing a distance matrix as a graph attribute (g.graph['dist_mat']).

  • dist_mat (np.ndarray, optional) – Distance matrix to plot. If not provided, the distance matrix is taken from the graph. Defaults to None.

  • use_plotly (bool) – Whether to use plotly or seaborn for plotting. Defaults to True.

  • title (str, optional) – Title of the plot. Defaults to None.

  • show_residue_labels (bool) – Whether to show residue labels on the plot. Defaults to True.

Raises

ValueError if neither a graph g nor a dist_mat is provided.

Returns

Plotly figure.

Return type

px.Figure
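The matrix that this function plots can be pictured with a small standard-library sketch. It assumes that dist_mat holds pairwise Euclidean distances between node coordinates. The node IDs and coordinates below are made up for illustration, and pairwise_distance_matrix is a hypothetical helper, not part of Graphein.

```python
import math

# Hypothetical node coordinates in Angstroms (illustrative values only).
coords = {
    "A:MET:1": (0.0, 0.0, 0.0),
    "A:LYS:2": (3.8, 0.0, 0.0),
    "A:VAL:3": (3.8, 3.8, 0.0),
}


def pairwise_distance_matrix(points):
    """Return a square list-of-lists of Euclidean distances between all points."""
    ordered = list(points.values())
    return [[math.dist(p, q) for q in ordered] for p in ordered]


dist_mat = pairwise_distance_matrix(coords)

# Print one row per node, labelled with the node ID.
for node_id, row in zip(coords, dist_mat):
    print(node_id, [round(d, 2) for d in row])
```

A matrix of this shape is symmetric with zeros on the diagonal, which is why the plot looks mirrored about its diagonal.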

graphein.protein.visualisation.plot_pointcloud(mesh: Meshes, title: str = '') Axes3D[source]#

Plots pytorch3d Meshes object as pointcloud.

Parameters
  • mesh (pytorch3d.structures.meshes.Meshes) – Meshes object to plot.

  • title (str) – Title of plot.

Returns

returns Axes3D containing plot

Return type

Axes3D

graphein.protein.visualisation.plot_protein_structure_graph(G: networkx.classes.graph.Graph, angle: int = 30, plot_title: typing.Optional[str] = None, figsize: typing.Tuple[int, int] = (10, 7), node_alpha: float = 0.7, node_size_min: float = 20.0, node_size_multiplier: float = 20.0, label_node_ids: bool = True, node_colour_map=<matplotlib.colors.ListedColormap object>, edge_color_map=<matplotlib.colors.ListedColormap object>, colour_nodes_by: str = 'degree', colour_edges_by: str = 'kind', edge_alpha: float = 0.5, plot_style: str = 'ggplot', out_path: typing.Optional[str] = None, out_format: str = '.png') mpl_toolkits.mplot3d.axes3d.Axes3D[source]#

Plots protein structure graph in Axes3D.

Parameters
  • G (nx.Graph) – nx.Graph Protein Structure graph to plot.

  • angle (int) – View angle. Defaults to 30.

  • plot_title (str, optional) – Title of plot. Defaults to None.

  • figsize (Tuple[int, int]) – Size of figure, defaults to (10, 7).

  • node_alpha (float) – Controls node transparency, defaults to 0.7.

  • node_size_min (float) – Specifies node minimum size, defaults to 20.

  • node_size_multiplier (float) – Scales node size by a constant. Node sizes reflect degree. Defaults to 20.

  • label_node_ids (bool) – bool indicating whether or not to plot node_id labels. Defaults to True.

  • node_colour_map (plt.cm) – colour map to use for nodes. Defaults to plt.cm.plasma.

  • edge_color_map (plt.cm) – colour map to use for edges. Defaults to plt.cm.plasma.

  • colour_nodes_by (str) – Specifies how to colour nodes. "degree", "seq_position" or a node feature.

  • colour_edges_by (str) – Specifies how to colour edges. Currently only "kind" is supported.

  • edge_alpha (float) – Controls edge transparency. Defaults to 0.5.

  • plot_style (str) – matplotlib style sheet to use. Defaults to "ggplot".

  • out_path (str, optional) – If not none, writes plot to this location. Defaults to None (does not save).

  • out_format (str) – File format to use for the plot. Defaults to ".png".

Returns

matplotlib Axes3D object.

Return type

Axes3D

graphein.protein.visualisation.plotly_protein_structure_graph(G: networkx.classes.graph.Graph, plot_title: typing.Optional[str] = None, figsize: typing.Tuple[int, int] = (620, 650), node_alpha: float = 0.7, node_size_min: float = 20.0, node_size_multiplier: float = 20.0, label_node_ids: bool = True, node_colour_map=<matplotlib.colors.ListedColormap object>, edge_color_map=<matplotlib.colors.ListedColormap object>, colour_nodes_by: str = 'degree', colour_edges_by: str = 'kind') plotly.graph_objs._figure.Figure[source]#

Plots protein structure graph using plotly.

Parameters
  • G (nx.Graph) – nx.Graph Protein Structure graph to plot

  • plot_title (str, optional) – Title of plot, defaults to None.

  • figsize (Tuple[int, int]) – Size of figure, defaults to (620, 650).

  • node_alpha (float) – Controls node transparency, defaults to 0.7.

  • node_size_min (float) – Specifies node minimum size. Defaults to 20.0.

  • node_size_multiplier (float) – Scales node size by a constant. Node sizes reflect degree. Defaults to 20.0.

  • label_node_ids (bool) – bool indicating whether or not to plot node_id labels. Defaults to True.

  • node_colour_map (plt.cm) – colour map to use for nodes. Defaults to plt.cm.plasma.

  • edge_color_map (plt.cm) – colour map to use for edges. Defaults to plt.cm.plasma.

  • colour_nodes_by (str) – Specifies how to colour nodes. "degree", "seq_position" or a node feature.

  • colour_edges_by (str) – Specifies how to colour edges. Currently only "kind" is supported.

Returns

Plotly Graph Objects plot

Return type

go.Figure

Utils#

Provides utility functions for use across Graphein.

exception graphein.protein.utils.ProteinGraphConfigurationError(message: str)[source]#

Exception raised when an invalid Graph configuration is provided to a downstream function or method.

__init__(message: str)[source]#

graphein.protein.utils.compute_rgroup_dataframe(pdb_df: pandas.core.frame.DataFrame) pandas.core.frame.DataFrame[source]#

Return the atoms that are in R-groups and not the backbone chain.

Parameters

pdb_df (pd.DataFrame) – DataFrame to compute R group dataframe from.

Returns

Dataframe containing R-groups only (backbone atoms removed).

Return type

pd.DataFrame

graphein.protein.utils.download_alphafold_structure(uniprot_id: str, version: int = 2, out_dir: str = '.', rename: bool = True, pdb: bool = True, mmcif: bool = False, aligned_score: bool = True) Union[str, Tuple[str, str]][source]#

Downloads a structure from the AlphaFold EBI database (https://alphafold.ebi.ac.uk/files/).

Parameters
  • uniprot_id (str) – UniProt ID of desired protein.

  • version (int) – Version of the structure to download

  • out_dir (str) – string specifying desired output location. Default is pwd.

  • rename (bool) – boolean specifying whether to rename the output file to $uniprot_id.pdb. Default is True.

  • pdb (bool) – boolean specifying whether to download the PDB file. Default is True.

  • mmcif (bool) – boolean specifying whether to download the mmCIF file. Default is False.

  • aligned_score (bool) – boolean specifying whether or not to download the score alignment JSON. Default is True.

Returns

path to output. Tuple if several outputs specified.

Return type

Union[str, Tuple[str, str]]

graphein.protein.utils.download_pdb(config, pdb_code: str) pathlib.Path[source]#

Download PDB structure from PDB.

If no structure is found, we perform a lookup against the record of obsolete PDB codes (ftp://ftp.wwpdb.org/pub/pdb/data/status/obsolete.dat)

Parameters

pdb_code (str) – 4 character PDB accession code.

Returns

returns filepath to downloaded structure.

Return type

str

graphein.protein.utils.filter_dataframe(dataframe: pandas.core.frame.DataFrame, by_column: str, list_of_values: List[Any], boolean: bool) pandas.core.frame.DataFrame[source]#

Filter function for dataframe.

Filters the dataframe such that the by_column values have to be in the list_of_values list if boolean == True, or not in the list if boolean == False.

Parameters
  • dataframe (pd.DataFrame) – pd.DataFrame to filter.

  • by_column (str) – str denoting column of dataframe to filter.

  • list_of_values (List[Any]) – List of values to filter with.

  • boolean (bool) – indicates whether to keep or exclude matching list_of_values. True -> in list, False -> not in list.

Returns

Filtered dataframe.

Return type

pd.DataFrame
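The keep/exclude semantics of the boolean flag can be illustrated without pandas. This is a minimal sketch over plain dictionaries rather than the library's implementation. filter_rows and the example rows are hypothetical, and the atom names follow the BACKBONE_ATOMS constant documented on this page. In pandas, the same idea is typically expressed with Series.isin and its negation.

```python
from typing import Any, Dict, List

BACKBONE_ATOMS = ["N", "CA", "C", "O"]


def filter_rows(
    rows: List[Dict[str, Any]],
    by_column: str,
    list_of_values: List[Any],
    boolean: bool,
) -> List[Dict[str, Any]]:
    """Keep rows whose by_column value is in list_of_values (boolean=True),
    or rows whose value is NOT in the list (boolean=False)."""
    allowed = set(list_of_values)
    return [row for row in rows if (row[by_column] in allowed) == boolean]


# Hypothetical atom records for a single serine residue.
rows = [
    {"atom_name": "N", "residue_name": "SER"},
    {"atom_name": "CA", "residue_name": "SER"},
    {"atom_name": "CB", "residue_name": "SER"},
    {"atom_name": "OG", "residue_name": "SER"},
]

backbone = filter_rows(rows, "atom_name", BACKBONE_ATOMS, True)
side_chain = filter_rows(rows, "atom_name", BACKBONE_ATOMS, False)
print([r["atom_name"] for r in backbone])
print([r["atom_name"] for r in side_chain])
```

Passing False with the backbone atom list leaves only side-chain atoms, which is the kind of selection compute_rgroup_dataframe describes.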

graphein.protein.utils.get_obsolete_mapping() Dict[str, str][source]#

Returns a dictionary mapping obsolete PDB codes to their replacement.

Returns

Dictionary mapping obsolete PDB codes to their replacement.

Return type

Dictionary[str, str]

graphein.protein.utils.get_protein_name_from_filename(pdb_path: str) str[source]#

Extracts a filename from a pdb_path

Parameters

pdb_path (str) – Path to extract filename from.

Returns

file name.

Return type

str

graphein.protein.utils.is_tool(name: str) bool[source]#

Checks whether name is on PATH and is marked as an executable.

Source: https://stackoverflow.com/questions/11210104/check-if-a-program-exists-from-a-python-script

Parameters

name (str) – Name of program to check for execution ability.

Returns

Whether name is on PATH and is marked as an executable.

Return type

bool
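A check of this kind can be written with the standard library alone. The sketch below follows the approach from the linked Stack Overflow question and uses shutil.which. The name is_tool_sketch is hypothetical and was chosen to avoid confusion with the library function.

```python
import shutil


def is_tool_sketch(name: str) -> bool:
    """Return True if `name` resolves to an executable on PATH."""
    # shutil.which returns the full path to the executable, or None if absent.
    return shutil.which(name) is not None


# A deliberately implausible program name should not be found.
missing = is_tool_sketch("surely-not-an-installed-program-12345")
print(missing)
```

A guard like this is useful before shelling out to optional external programs, so that a missing dependency produces a clear error instead of a failed subprocess call.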

graphein.protein.utils.save_graph_to_pdb(g: networkx.classes.graph.Graph, path: str, gz: bool = False)[source]#

Saves processed pdb_df (g.graph["pdb_df"]) dataframe to a PDB file.

N.B. PDBs do not contain connectivity information. This only captures the nodes in the graph. Connectivity is filled in according to standard rules by visualisation programs.

Parameters
  • g (nx.Graph) – Protein graph to save dataframe from.

  • path (str) – Path to save PDB file to.

  • gz (bool) – Whether to gzip the file. Defaults to False.

graphein.protein.utils.save_pdb_df_to_pdb(df: pandas.core.frame.DataFrame, path: str, gz: bool = False)[source]#

Saves pdb dataframe to a PDB file.

Parameters
  • df (pd.DataFrame) – Dataframe to save as PDB.

  • path (str) – Path to save PDB file to.

  • gz (bool) – Whether to gzip the file. Defaults to False.

graphein.protein.utils.save_rgroup_df_to_pdb(g: networkx.classes.graph.Graph, path: str, gz: bool = False)[source]#

Saves R-group (g.graph["rgroup_df"]) dataframe to a PDB file.

N.B. PDBs do not contain connectivity information. This only captures the atoms in the r groups. Connectivity is filled in according to standard rules by visualisation programs.

Parameters
  • g (nx.Graph) – Protein graph to save R group dataframe from.

  • path (str) – Path to save PDB file to.

  • gz (bool) – Whether to gzip the file. Defaults to False.

graphein.protein.utils.three_to_one_with_mods(res: str) str[source]#

Converts three letter AA codes into 1 letter. Allows for modified residues.

See: RESI_THREE_TO_1.

Parameters

res (str) – Three letter residue code string.

Returns

1-letter residue code.

Return type

str
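One way to picture a conversion that tolerates modified residues is to resolve a modified residue to its parent first, then look up the one-letter code. This is only a sketch. The actual function consults RESI_THREE_TO_1, whose contents are not shown here. The parent entries below are copied from NON_STANDARD_RESIS_PARENT on this page, while STANDARD_3_TO_1 and three_to_one_sketch are hypothetical names.

```python
# Small subset of the standard three-letter to one-letter codes.
STANDARD_3_TO_1 = {"SER": "S", "PRO": "P", "MET": "M", "TYR": "Y"}

# Subset of NON_STANDARD_RESIS_PARENT: modified residue -> parent residue.
PARENT = {"SEP": "SER", "HYP": "PRO", "FME": "MET", "PTR": "TYR"}


def three_to_one_sketch(res: str) -> str:
    """Map a three-letter code to one letter, resolving modified residues
    to their parent residue first."""
    parent = PARENT.get(res, res)  # unmodified residues map to themselves
    return STANDARD_3_TO_1[parent]


for code in ("SER", "SEP", "HYP", "PTR"):
    print(code, three_to_one_sketch(code))
```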

Constants#

Author: Eric J. Ma, Arian Jamasb

Purpose: This is a set of utility variables and functions that can be used across the Graphein project.

These include various collections of standard & non-standard/modified amino acids and their names, identifiers and properties.

We also include mappings of covalent radii and bond lengths for the amino acids used in assembling atomic protein graphs.

graphein.protein.resi_atoms.AA_RING_ATOMS: Dict[str, List[str]] = {'HIS': ['CG', 'CD', 'CE', 'ND', 'NE'], 'PHE': ['CG', 'CD', 'CE', 'CZ'], 'TRP': ['CD', 'CE', 'CH', 'CZ'], 'TYR': ['CG', 'CD', 'CE', 'CZ']}#

Dictionary mapping amino acid 3-letter codes to lists of atoms that are part of rings.

graphein.protein.resi_atoms.AMINO_ACIDS: List[str] = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z']#

Vocabulary of amino acids with one-letter codes. Includes fuzzy standard amino acids: "B" denotes "ASX", which corresponds to "ASP" ("D") or "ASN" ("N"), and "Z" denotes "GLX", which corresponds to "GLU" ("E") or "GLN" ("Q").

graphein.protein.resi_atoms.AROMATIC_RESIS: List[str] = ['PHE', 'TRP', 'HIS', 'TYR']#

List of aromatic residues.

graphein.protein.resi_atoms.BACKBONE_ATOMS: List[str] = ['N', 'CA', 'C', 'O']#

Atoms present in Amino Acid Backbones.

graphein.protein.resi_atoms.BASE_AMINO_ACIDS: List[str] = ['A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y']#

Vocabulary of 20 standard amino acids.

graphein.protein.resi_atoms.BOND_LENGTHS: Dict[str, Dict[str, float]] = {'As-N': {'i_d': 1.835, 'i_s': 1.86, 'w_sd': 1.845}, 'As-O': {'i_d': 1.66, 'i_s': 1.71, 'w_sd': 1.68}, 'As-S': {'i_d': 2.08, 'i_s': 2.28, 'w_sd': 2.15}, 'C-C': {'i_d': 1.31, 'i_s': 1.49, 'i_t': 1.18, 'w_dt': 1.21, 'w_sd': 1.38}, 'C-N': {'i_d': 1.32, 'i_s': 1.42, 'i_t': 1.14, 'w_dt': 1.2, 'w_sd': 1.34}, 'C-O': {'i_d': 1.22, 'i_s': 1.41, 'w_sd': 1.28}, 'C-S': {'i_d': 1.68, 'i_s': 1.78, 'w_sd': 1.7}, 'C-Te': {'i_d': 1.8, 'i_s': 2.2, 'w_sd': 2.1}, 'N-N': {'i_d': 1.22, 'i_s': 1.4, 'w_sd': 1.32}, 'N-O': {'i_d': 1.22, 'i_s': 1.39, 'w_sd': 1.25}, 'N-P': {'i_d': 1.59, 'i_s': 1.69, 'w_sd': 1.62}, 'N-S': {'i_d': 1.54, 'i_s': 1.66, 'w_sd': 1.58}, 'N-Se': {'i_d': 1.79, 'i_s': 1.83, 'w_sd': 1.8}, 'O-P': {'i_d': 1.48, 'i_s': 1.6, 'w_sd': 1.52}, 'O-S': {'i_d': 1.45, 'i_s': 1.58, 'w_sd': 1.54}, 'P-P': {'i_d': 2.04, 'i_s': 2.23, 'w_sd': 2.06}}#

Dictionary containing idealised single, double and triple bond lengths (i_s, i_d, i_t) and watersheds (w_sd, w_dt), below which a bond is probably double/triple (e.g. triple < double < single). All lengths are in Angstroms.

Taken from:

Automatic Assignment of Chemical Connectivity to Organic Molecules in the Cambridge Structural Database Jon C. Baber and Edward E. Hodgkin* J. Chem. Inf. Comput. Sci. 1992, 32. 401-406
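The watershed values can be read as decision thresholds. A distance below w_dt suggests a triple bond, a distance below w_sd suggests a double bond, and anything longer is treated as single. The sketch below applies that reading to two entries copied from BOND_LENGTHS. guess_bond_order is a hypothetical helper, and the way Graphein itself consumes the table may differ.

```python
# Two entries copied from BOND_LENGTHS above (all lengths in Angstroms).
BOND_LENGTHS = {
    "C-C": {"i_s": 1.49, "i_d": 1.31, "i_t": 1.18, "w_sd": 1.38, "w_dt": 1.21},
    "C-O": {"i_s": 1.41, "i_d": 1.22, "w_sd": 1.28},
}


def guess_bond_order(bond_type: str, distance: float) -> int:
    """Return 1, 2 or 3 by comparing the distance against the watersheds."""
    entry = BOND_LENGTHS[bond_type]
    # Only some bond types define a double/triple watershed.
    if "w_dt" in entry and distance < entry["w_dt"]:
        return 3
    if distance < entry["w_sd"]:
        return 2
    return 1


for bond_type, distance in [("C-C", 1.52), ("C-C", 1.33), ("C-C", 1.19), ("C-O", 1.23)]:
    print(bond_type, distance, guess_bond_order(bond_type, distance))
```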

graphein.protein.resi_atoms.BOND_ORDERS: Dict = {'As-N': [1, 2], 'As-O': [1, 2], 'As-S': [1, 2], 'C-C': [1, 2, 3], 'C-N': [1, 2, 3], 'C-O': [1, 2], 'C-S': [1, 2], 'C-Te': [1, 2], 'N-N': [1, 2], 'N-O': [1, 2], 'N-P': [1, 2], 'N-S': [1, 2], 'N-Se': [1, 2], 'O-P': [1, 2], 'O-S': [1, 2], 'P-P': [1, 2]}#

Dictionary of allowable bond orders for each covalent bond type.

Taken from:

Automatic Assignment of Chemical Connectivity to Organic Molecules in the Cambridge Structural Database Jon C. Baber and Edward E. Hodgkin* J. Chem. Inf. Comput. Sci. 1992, 32. 401-406

graphein.protein.resi_atoms.BOND_TYPES: List[str] = ['hydrophobic', 'disulfide', 'hbond', 'ionic', 'aromatic', 'aromatic_sulphur', 'cation_pi', 'backbone', 'delaunay']#

List of supported bond types.

graphein.protein.resi_atoms.CARBOHYDRATE_CODES: List[str] = ['BGC', 'GLC', 'MAN', 'BMA', 'FUC', 'GAL', 'GLA', 'NAG', 'NGA', 'SIA', 'XYS']#

Three letter codes of commonly found carbohydrates in protein structures.

See http://ligand-expo.rcsb.org/ and https://www.globalphasing.com/buster/manual/maketnt/manual/lib_val/library_validation.html

graphein.protein.resi_atoms.CARBOHYDRATE_CODE_NAME_MAPPING: Dict[str, str] = {'BGC': 'D-GLUCOSE', 'BMA': 'D-MANNOSE', 'FUC': 'FUCOSE', 'GAL': 'D-GALACTOSE', 'GLA': 'D-GALACTOSE', 'GLC': 'D-GLUCOSE', 'MAN': 'D-MANNOSE', 'NAG': 'N-ACETYL-D-GLUCOSAMINE', 'NGA': 'N-ACETYL-D-GALACTOSAMINE', 'SIA': 'O-SIALIC_ACID', 'XYS': 'D-XYLOPYRANOSE'}#

Mapping of 3-letter PDB ligand accession codes for common carbohydrates to their full names.

See http://ligand-expo.rcsb.org/ and https://www.globalphasing.com/buster/manual/maketnt/manual/lib_val/library_validation.html

graphein.protein.resi_atoms.CARBOHYDRATE_NAMES: List[str] = ['D-GLUCOSE', 'D-MANNOSE', 'FUCOSE', 'D-GALACTOSE', 'N-ACETYL-D-GLUCOSAMINE', 'N-ACETYL-D-GALACTOSAMINE', 'O-SIALIC_ACID', 'D-XYLOPYRANOSE']#

Names of commonly found carbohydrates in protein structures.

See http://ligand-expo.rcsb.org/ and https://www.globalphasing.com/buster/manual/maketnt/manual/lib_val/library_validation.html

graphein.protein.resi_atoms.CATION_PI_RESIS: List[str] = ['LYS', 'ARG', 'PHE', 'TYR', 'TRP']#

List of residues involved in cation-pi interactions.

graphein.protein.resi_atoms.CATION_RESIS: List[str] = ['LYS', 'ARG']#

List of cationic residues.

graphein.protein.resi_atoms.COFACTOR_CODES: List[str] = ['ADP', 'AMP', 'ATP', 'CMP', 'COA', 'FAD', 'FMN', 'NAP', 'NDP']#

Three letter codes of cofactors commonly found in PDB structures.

See: http://ligand-expo.rcsb.org/ and https://www.globalphasing.com/buster/manual/maketnt/manual/lib_val/library_validation.html

graphein.protein.resi_atoms.COFACTOR_CODE_NAME_MAPPING: Dict[str, str] = {'ADP': 'ADP', 'AMP': 'AMP', 'ATP': 'ATP', 'CMP': 'cAMP', 'COA': 'COENZYME_A', 'FAD': 'FAD', 'FMN': 'FLAVIN_MONONUCLEOTIDE', 'NAP': 'NADP', 'NDP': 'NADPH'}#

Mapping between 3-letter PDB ligand codes and cofactor names.

See http://ligand-expo.rcsb.org/ and https://www.globalphasing.com/buster/manual/maketnt/manual/lib_val/library_validation.html

graphein.protein.resi_atoms.COFACTOR_NAMES: List[str] = ['ADP', 'AMP', 'ATP', 'cAMP', 'COENZYME_A', 'FAD', 'FLAVIN_MONONUCLEOTIDE', 'NADP', 'NADPH']#

Names of cofactors commonly found in PDB structures.

See: http://ligand-expo.rcsb.org/ and https://www.globalphasing.com/buster/manual/maketnt/manual/lib_val/library_validation.html

graphein.protein.resi_atoms.COVALENT_RADII: Dict[str, float] = {'Cdb': 0.67, 'Cres': 0.72, 'Csb': 0.77, 'Hsb': 0.37, 'Ndb': 0.62, 'Nres': 0.66, 'Nsb': 0.7, 'Odb': 0.6, 'Ores': 0.635, 'Osb': 0.67, 'Ssb': 1.04}#

Covalent radii for OpenSCAD output. Adding Ores between Osb and Odb for Asp and Glu, Nres between Nsb and Ndb for Arg, as PDB does not specify

Covalent radii from:

Heyrovska, Raji : ‘Atomic Structures of all the Twenty Essential Amino Acids and a Tripeptide, with Bond Lengths as Sums of Atomic Covalent Radii’

Paper: https://arxiv.org/pdf/0804.2488.pdf

graphein.protein.resi_atoms.DEFAULT_BOND_STATE: Dict[str, str] = {'1HD2': 'Hsb', '1HH1': 'Hsb', '1HH2': 'Hsb', '2HD2': 'Hsb', '2HH1': 'Hsb', '2HH2': 'Hsb', 'C': 'Cdb', 'CA': 'Csb', 'CB': 'Csb', 'H': 'Hsb', 'HE': 'Hsb', 'HG': 'Hsb', 'HG1': 'Hsb', 'HH': 'Hsb', 'HZ1': 'Hsb', 'HZ2': 'Hsb', 'HZ3': 'Hsb', 'N': 'Nsb', 'O': 'Odb', 'OXT': 'Osb'}#

Assignment of atom classes to atomic radii.

Covalent radii from:

Heyrovska, Raji : ‘Atomic Structures of all the Twenty Essential Amino Acids and a Tripeptide, with Bond Lengths as Sums of Atomic Covalent Radii’

Paper: https://arxiv.org/pdf/0804.2488.pdf
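As the title of the cited paper suggests, a bond length can be approximated as the sum of the two atoms' covalent radii. The sketch below chains the two tables, going from atom name to bond state to radius, using a few entries copied from DEFAULT_BOND_STATE and COVALENT_RADII. estimated_bond_length is a hypothetical helper, and the radii are assumed to be in Angstroms.

```python
# Entries copied from COVALENT_RADII and DEFAULT_BOND_STATE above.
COVALENT_RADII = {"Csb": 0.77, "Cdb": 0.67, "Nsb": 0.7, "Odb": 0.6}
DEFAULT_BOND_STATE = {"N": "Nsb", "CA": "Csb", "C": "Cdb", "O": "Odb"}


def estimated_bond_length(atom_a: str, atom_b: str) -> float:
    """Approximate a bond length as the sum of the two covalent radii."""
    radius_a = COVALENT_RADII[DEFAULT_BOND_STATE[atom_a]]
    radius_b = COVALENT_RADII[DEFAULT_BOND_STATE[atom_b]]
    return radius_a + radius_b


# Backbone N-CA (both single-bonded) and carbonyl C=O (both double-bonded).
print(round(estimated_bond_length("N", "CA"), 2))
print(round(estimated_bond_length("C", "O"), 2))
```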

graphein.protein.resi_atoms.DISULFIDE_ATOMS: List[str] = ['SG']#

List of atoms capable of forming disulfide bonds.

graphein.protein.resi_atoms.DISULFIDE_RESIS: List[str] = ['CYS']#

Residues capable of forming disulfide bonds.

graphein.protein.resi_atoms.GRANTHAM_CHEMICAL_DISTANCE_MATRIX: Dict[str, float] = {'AA': 0.0, 'AC': 0.112, 'AD': 0.819, 'AE': 0.827, 'AF': 0.54, 'AG': 0.208, 'AH': 0.696, 'AI': 0.407, 'AK': 0.891, 'AL': 0.406, 'AM': 0.379, 'AN': 0.318, 'AP': 0.191, 'AQ': 0.372, 'AR': 1.0, 'AS': 0.094, 'AT': 0.22, 'AV': 0.273, 'AW': 0.739, 'AY': 0.552, 'CA': 0.114, 'CC': 0.0, 'CD': 0.847, 'CE': 0.838, 'CF': 0.437, 'CG': 0.32, 'CH': 0.66, 'CI': 0.304, 'CK': 0.887, 'CL': 0.301, 'CM': 0.277, 'CN': 0.324, 'CP': 0.157, 'CQ': 0.341, 'CR': 1.0, 'CS': 0.176, 'CT': 0.233, 'CV': 0.167, 'CW': 0.639, 'CY': 0.457, 'DA': 0.729, 'DC': 0.742, 'DD': 0.0, 'DE': 0.124, 'DF': 0.924, 'DG': 0.697, 'DH': 0.435, 'DI': 0.847, 'DK': 0.249, 'DL': 0.841, 'DM': 0.819, 'DN': 0.56, 'DP': 0.657, 'DQ': 0.584, 'DR': 0.295, 'DS': 0.667, 'DT': 0.649, 'DV': 0.797, 'DW': 1.0, 'DY': 0.836, 'EA': 0.79, 'EC': 0.788, 'ED': 0.133, 'EE': 0.0, 'EF': 0.932, 'EG': 0.779, 'EH': 0.406, 'EI': 0.86, 'EK': 0.143, 'EL': 0.854, 'EM': 0.83, 'EN': 0.599, 'EP': 0.688, 'EQ': 0.598, 'ER': 0.234, 'ES': 0.726, 'ET': 0.682, 'EV': 0.824, 'EW': 1.0, 'EY': 0.837, 'FA': 0.508, 'FC': 0.405, 'FD': 0.977, 'FE': 0.918, 'FF': 0.0, 'FG': 0.69, 'FH': 0.663, 'FI': 0.128, 'FK': 0.903, 'FL': 0.131, 'FM': 0.169, 'FN': 0.541, 'FP': 0.42, 'FQ': 0.459, 'FR': 1.0, 'FS': 0.548, 'FT': 0.499, 'FV': 0.252, 'FW': 0.207, 'FY': 0.179, 'GA': 0.206, 'GC': 0.312, 'GD': 0.776, 'GE': 0.807, 'GF': 0.727, 'GG': 0.0, 'GH': 0.769, 'GI': 0.592, 'GK': 0.894, 'GL': 0.591, 'GM': 0.557, 'GN': 0.381, 'GP': 0.323, 'GQ': 0.467, 'GR': 1.0, 'GS': 0.158, 'GT': 0.272, 'GV': 0.464, 'GW': 0.923, 'GY': 0.728, 'HA': 0.896, 'HC': 0.836, 'HD': 0.629, 'HE': 0.547, 'HF': 0.907, 'HG': 1.0, 'HH': 0.0, 'HI': 0.848, 'HK': 0.566, 'HL': 0.842, 'HM': 0.825, 'HN': 0.754, 'HP': 0.777, 'HQ': 0.716, 'HR': 0.697, 'HS': 0.865, 'HT': 0.834, 'HV': 0.831, 'HW': 0.981, 'HY': 0.821, 'IA': 0.403, 'IC': 0.296, 'ID': 0.942, 'IE': 0.891, 'IF': 0.134, 'IG': 0.592, 'IH': 0.652, 'II': 0.0, 'IK': 0.892, 'IL': 0.013, 'IM': 
0.057, 'IN': 0.457, 'IP': 0.311, 'IQ': 0.383, 'IR': 1.0, 'IS': 0.443, 'IT': 0.396, 'IV': 0.133, 'IW': 0.339, 'IY': 0.213, 'KA': 0.889, 'KC': 0.871, 'KD': 0.279, 'KE': 0.149, 'KF': 0.957, 'KG': 0.9, 'KH': 0.438, 'KI': 0.899, 'KK': 0.0, 'KL': 0.892, 'KM': 0.871, 'KN': 0.667, 'KP': 0.757, 'KQ': 0.639, 'KR': 0.154, 'KS': 0.825, 'KT': 0.759, 'KV': 0.882, 'KW': 1.0, 'KY': 0.848, 'LA': 0.405, 'LC': 0.296, 'LD': 0.944, 'LE': 0.892, 'LF': 0.139, 'LG': 0.596, 'LH': 0.653, 'LI': 0.013, 'LK': 0.893, 'LL': 0.0, 'LM': 0.062, 'LN': 0.452, 'LP': 0.309, 'LQ': 0.376, 'LR': 1.0, 'LS': 0.443, 'LT': 0.397, 'LV': 0.133, 'LW': 0.341, 'LY': 0.205, 'MA': 0.383, 'MC': 0.276, 'MD': 0.932, 'ME': 0.879, 'MF': 0.182, 'MG': 0.569, 'MH': 0.648, 'MI': 0.058, 'MK': 0.884, 'ML': 0.062, 'MM': 0.0, 'MN': 0.447, 'MP': 0.285, 'MQ': 0.372, 'MR': 1.0, 'MS': 0.417, 'MT': 0.358, 'MV': 0.12, 'MW': 0.391, 'MY': 0.255, 'NA': 0.424, 'NC': 0.425, 'ND': 0.838, 'NE': 0.835, 'NF': 0.766, 'NG': 0.512, 'NH': 0.78, 'NI': 0.615, 'NK': 0.891, 'NL': 0.603, 'NM': 0.588, 'NN': 0.0, 'NP': 0.266, 'NQ': 0.175, 'NR': 1.0, 'NS': 0.361, 'NT': 0.368, 'NV': 0.503, 'NW': 0.945, 'NY': 0.641, 'PA': 0.22, 'PC': 0.179, 'PD': 0.852, 'PE': 0.831, 'PF': 0.515, 'PG': 0.376, 'PH': 0.696, 'PI': 0.363, 'PK': 0.875, 'PL': 0.357, 'PM': 0.326, 'PN': 0.231, 'PP': 0.0, 'PQ': 0.228, 'PR': 1.0, 'PS': 0.196, 'PT': 0.161, 'PV': 0.244, 'PW': 0.72, 'PY': 0.481, 'QA': 0.512, 'QC': 0.462, 'QD': 0.903, 'QE': 0.861, 'QF': 0.671, 'QG': 0.648, 'QH': 0.765, 'QI': 0.532, 'QK': 0.881, 'QL': 0.518, 'QM': 0.505, 'QN': 0.181, 'QP': 0.272, 'QQ': 0.0, 'QR': 1.0, 'QS': 0.461, 'QT': 0.389, 'QV': 0.464, 'QW': 0.831, 'QY': 0.522, 'RA': 0.919, 'RC': 0.905, 'RD': 0.305, 'RE': 0.225, 'RF': 0.977, 'RG': 0.928, 'RH': 0.498, 'RI': 0.929, 'RK': 0.141, 'RL': 0.92, 'RM': 0.908, 'RN': 0.69, 'RP': 0.796, 'RQ': 0.668, 'RR': 0.0, 'RS': 0.86, 'RT': 0.808, 'RV': 0.914, 'RW': 1.0, 'RY': 0.859, 'SA': 0.1, 'SC': 0.185, 'SD': 0.801, 'SE': 0.812, 'SF': 0.622, 'SG': 0.17, 'SH': 0.718, 'SI': 
0.478, 'SK': 0.883, 'SL': 0.474, 'SM': 0.44, 'SN': 0.289, 'SP': 0.181, 'SQ': 0.358, 'SR': 1.0, 'SS': 0.0, 'ST': 0.174, 'SV': 0.342, 'SW': 0.827, 'SY': 0.615, 'TA': 0.251, 'TC': 0.261, 'TD': 0.83, 'TE': 0.812, 'TF': 0.604, 'TG': 0.312, 'TH': 0.737, 'TI': 0.455, 'TK': 0.866, 'TL': 0.453, 'TM': 0.403, 'TN': 0.315, 'TP': 0.159, 'TQ': 0.322, 'TR': 1.0, 'TS': 0.185, 'TT': 0.0, 'TV': 0.345, 'TW': 0.816, 'TY': 0.596, 'VA': 0.275, 'VC': 0.165, 'VD': 0.9, 'VE': 0.867, 'VF': 0.269, 'VG': 0.471, 'VH': 0.649, 'VI': 0.135, 'VK': 0.889, 'VL': 0.134, 'VM': 0.12, 'VN': 0.38, 'VP': 0.212, 'VQ': 0.339, 'VR': 1.0, 'VS': 0.322, 'VT': 0.305, 'VV': 0.0, 'VW': 0.472, 'VY': 0.31, 'WA': 0.658, 'WC': 0.56, 'WD': 1.0, 'WE': 0.931, 'WF': 0.196, 'WG': 0.829, 'WH': 0.678, 'WI': 0.305, 'WK': 0.892, 'WL': 0.304, 'WM': 0.344, 'WN': 0.631, 'WP': 0.555, 'WQ': 0.538, 'WR': 0.968, 'WS': 0.689, 'WT': 0.638, 'WV': 0.418, 'WW': 0.0, 'WY': 0.204, 'YA': 0.587, 'YC': 0.478, 'YD': 1.0, 'YE': 0.932, 'YF': 0.202, 'YG': 0.782, 'YH': 0.678, 'YI': 0.23, 'YK': 0.904, 'YL': 0.219, 'YM': 0.268, 'YN': 0.512, 'YP': 0.444, 'YQ': 0.404, 'YR': 0.995, 'YS': 0.612, 'YT': 0.557, 'YV': 0.328, 'YW': 0.244, 'YY': 0.0}#

Grantham Chemical Distance Matrix. Taken from ProPy3 https://github.com/MartinThoma/propy3

Amino Acid Difference Formula to Help Explain Protein Evolution R. Grantham Science Vol 185, Issue 4154 06 September 1974

Paper: https://science.sciencemag.org/content/185/4154/862/tab-pdf
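Keys in this dictionary are two concatenated one-letter codes, and the table as printed is not perfectly symmetric (for example, 'AC' and 'CA' differ slightly), so the order of the pair matters on lookup. A minimal sketch, using a handful of entries copied from the dictionary above and a hypothetical grantham_distance helper:

```python
# A few entries copied from GRANTHAM_CHEMICAL_DISTANCE_MATRIX above.
GRANTHAM_SUBSET = {
    "AC": 0.112,
    "CA": 0.114,
    "IL": 0.013,
    "LI": 0.013,
    "DW": 1.0,
    "WD": 1.0,
}


def grantham_distance(first: str, second: str, matrix=GRANTHAM_SUBSET) -> float:
    """Look up the distance for an ordered pair of one-letter residue codes."""
    return matrix[first + second]


print(grantham_distance("A", "C"))
print(grantham_distance("C", "A"))
print(grantham_distance("I", "L"))
```

If a symmetric measure is needed, one option is to average the two orderings. That is a design choice on the caller's side, not something the table prescribes.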

graphein.protein.resi_atoms.HYDROGEN_BOND_ACCEPTORS: Dict[str, Dict[str, int]] = {'ASN': {'OD1': 2}, 'ASP': {'OD1': 2, 'OD2': 2}, 'GLN': {'OE1': 2}, 'GLU': {'OE1': 2, 'OE2': 2}, 'HIS': {'ND1': 1, 'NE2': 1}, 'SER': {'OG': 2}, 'THR': {'OG1': 2}, 'TYR': {'OH': 1}}#

Number of hydrogen bonds that an acceptor atom can accept, if more than one.

9 amino acids (alanine, cysteine, glycine, isoleucine, leucine, methionine, phenylalanine, proline, valine) have no hydrogen donor or acceptor atoms in their side chains.

https://www.imgt.org/IMGTeducation/Aide-memoire/_UK/aminoacids/charge/

graphein.protein.resi_atoms.HYDROGEN_BOND_DONORS: Dict[str, Dict[str, int]] = {'ARG': {'NE': 1, 'NH1': 2, 'NH2': 2}, 'ASN': {'ND2': 2}, 'GLN': {'NE2': 2}, 'HIS': {'ND1': 2, 'NE2': 2}, 'LYS': {'NZ': 3}, 'SER': {'OG': 1}, 'THR': {'OG1': 1}, 'TRP': {'NE1': 1}, 'TYR': {'OH': 1}}#

Number of hydrogen bonds that a donor atom can donate, if more than one.

9 amino acids (alanine, cysteine, glycine, isoleucine, leucine, methionine, phenylalanine, proline, valine) have no hydrogen donor or acceptor atoms in their side chains.

https://www.imgt.org/IMGTeducation/Aide-memoire/_UK/aminoacids/charge/

graphein.protein.resi_atoms.HYDROPHOBIC_RESIS: List[str] = ['ALA', 'VAL', 'LEU', 'ILE', 'MET', 'PHE', 'TRP', 'PRO', 'TYR']#

List of residues that are considered to be hydrophobic.

graphein.protein.resi_atoms.IONIC_RESIS: List[str] = ['ARG', 'LYS', 'HIS', 'ASP', 'GLU']#

Residues capable of forming ionic interactions.

graphein.protein.resi_atoms.ISOELECTRIC_POINTS: Dict[str, float] = {'ALA': 6.11, 'ARG': 10.76, 'ASN': 10.76, 'ASP': 2.98, 'ASX': 6.87, 'CYS': 5.02, 'GLN': 5.65, 'GLU': 3.08, 'GLX': 4.35, 'GLY': 6.06, 'HIS': 7.64, 'ILE': 6.04, 'LEU': 6.04, 'LYS': 9.74, 'MET': 5.74, 'PHE': 5.91, 'PRO': 6.3, 'SER': 5.68, 'THR': 5.6, 'TRP': 5.88, 'TYR': 5.63, 'UNK': 7.0, 'VAL': 6.02}#

Dictionary of isoelectric points for standard amino acids. For "UNK" residues, neutral (pH 7.0) is assigned. For "ASX" and "GLX" the average of their constituents ("D" and "N", and "E" and "Q", respectively) is assigned.

graphein.protein.resi_atoms.ISOELECTRIC_POINTS_STD = {'ALA': array([[-0.0986554]]), 'ARG': array([[2.33811019]]), 'ASN': array([[2.33811019]]), 'ASP': array([[-1.73888686]]), 'ASX': array([[0.29961166]]), 'CYS': array([[-0.66985422]]), 'GLN': array([[-0.33971178]]), 'GLU': array([[-1.6864833]]), 'GLX': array([[-1.02095808]]), 'GLY': array([[-0.12485718]]), 'HIS': array([[0.70311909]]), 'ILE': array([[-0.13533789]]), 'LEU': array([[-0.13533789]]), 'LYS': array([[1.80359387]]), 'MET': array([[-0.29254858]]), 'PHE': array([[-0.20346252]]), 'PRO': array([[0.00091137]]), 'SER': array([[-0.32399071]]), 'THR': array([[-0.36591356]]), 'TRP': array([[-0.21918359]]), 'TYR': array([[-0.35019249]]), 'UNK': array([[0.36773629]]), 'VAL': array([[-0.1458186]])}#

Standardized (sklearn.StandardScaler) isoelectric points for standard amino acids.

See ISOELECTRIC_POINTS for details.

graphein.protein.resi_atoms.MAX_NEIGHBOURS: Dict[str, int] = {'B': 3, 'Br': 1, 'C': 4, 'F': 1, 'H': 1, 'I': 3, 'O': 2}#

Maximum number of neighbours an atom can have.

Taken from: https://www.daylight.com/meetings/mug01/Sayle/m4xbondage.html

graphein.protein.resi_atoms.MOLECULAR_WEIGHTS: Dict[str, float] = {'ALA': 89.0935, 'ARG': 174.2017, 'ASN': 132.1184, 'ASP': 133.1032, 'ASX': 132.6108, 'CYS': 121.159, 'GLN': 146.1451, 'GLU': 147.1299, 'GLX': 146.6375, 'GLY': 75.0669, 'HIS': 155.1552, 'ILE': 131.1736, 'LEU': 131.1736, 'LYS': 146.1882, 'MET': 149.2124, 'PHE': 165.19, 'PRO': 115.131, 'SER': 105.093, 'THR': 119.1197, 'TRP': 204.2262, 'TYR': 181.1894, 'UNK': 137.1484, 'VAL': 117.1469}#

Mapping of 3-letter amino acid names to molecular weights. UNK is used for unknown residues and takes the mean of known weights. For "ASX" and "GLX" the average of their constituents ("D" and "N", and "E" and "Q", respectively) is assigned.

graphein.protein.resi_atoms.MOLECULAR_WEIGHTS_STD = {'ALA': array([[-1.70781298]]), 'ARG': array([[1.31682834]]), 'ASN': array([[-0.17876066]]), 'ASP': array([[-0.14376208]]), 'ASX': array([[-0.16126137]]), 'CYS': array([[-0.56824433]]), 'GLN': array([[0.31973109]]), 'GLU': array([[0.35472968]]), 'GLX': array([[0.33723039]]), 'GLY': array([[-2.20630119]]), 'HIS': array([[0.63993903]]), 'ILE': array([[-0.2123377]]), 'LEU': array([[-0.2123377]]), 'LYS': array([[0.32126282]]), 'MET': array([[0.42873918]]), 'PHE': array([[0.99656354]]), 'PRO': array([[-0.78247208]]), 'SER': array([[-1.13921032]]), 'THR': array([[-0.64071856]]), 'TRP': array([[2.38386234]]), 'TYR': array([[1.56516265]]), 'UNK': array([[-6.18065683e-07]]), 'VAL': array([[-0.71082946]])}#

Standardized (sklearn.StandardScaler) molecular weights for standard amino acids.

See MOLECULAR_WEIGHTS for details.
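The standardized values can be reproduced by z-scoring the raw weights: subtract the mean and divide by the population standard deviation, which is what sklearn's StandardScaler computes. The sketch below uses only the standard library and the weights copied from MOLECULAR_WEIGHTS. It returns plain floats rather than the 1x1 arrays shown above, and the variable names are illustrative.

```python
from statistics import fmean, pstdev

# Values copied from MOLECULAR_WEIGHTS above.
MOLECULAR_WEIGHTS = {
    "ALA": 89.0935, "ARG": 174.2017, "ASN": 132.1184, "ASP": 133.1032,
    "ASX": 132.6108, "CYS": 121.159, "GLN": 146.1451, "GLU": 147.1299,
    "GLX": 146.6375, "GLY": 75.0669, "HIS": 155.1552, "ILE": 131.1736,
    "LEU": 131.1736, "LYS": 146.1882, "MET": 149.2124, "PHE": 165.19,
    "PRO": 115.131, "SER": 105.093, "THR": 119.1197, "TRP": 204.2262,
    "TYR": 181.1894, "UNK": 137.1484, "VAL": 117.1469,
}

values = list(MOLECULAR_WEIGHTS.values())
mean = fmean(values)
std = pstdev(values)  # population standard deviation (ddof = 0)

# z-score every residue: (weight - mean) / std
standardized = {res: (w - mean) / std for res, w in MOLECULAR_WEIGHTS.items()}

for res in ("GLY", "UNK", "TRP"):
    print(res, round(standardized[res], 4))
```

Because "UNK" is assigned the mean of the known weights, its standardized value is effectively zero.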

graphein.protein.resi_atoms.NEG_AA: List[str] = ['GLU', 'ASP']#

Negatively charged amino acids.

graphein.protein.resi_atoms.NON_STANDARD_AMINO_ACIDS: List[str] = ['O', 'U']#

Non-standard amino acids with one-letter codes.

graphein.protein.resi_atoms.NON_STANDARD_AMINO_ACID_MAPPING_3_TO_1: Dict[str, str] = {'PYL': 'O', 'SEC': 'U'}#

Mapping of 3-letter non-standard amino acids codes to their one-letter form.

See: http://ligand-expo.rcsb.org/

graphein.protein.resi_atoms.NON_STANDARD_RESIS_NAME: List[str] = ['3-SULFINOALANINE', '4-HYDROXYPROLINE', '4-METHYL-4-[(E)-2-BUTENYL]-4,N-METHYL-THREONINE', '5-HYDROXYPROLINE', 'ACETYL_GROUP', 'ALPHA-AMINOBUTYRIC_ACID', 'ALPHA-AMINOISOBUTYRIC_ACID', 'AMINO_GROUP', 'CARBOXY_GROUP', 'CYSTEINE-S-DIOXIDE', 'CYSTEINESULFONIC_ACID', 'D-ALANINE', 'D-ARGININE', 'D-ASPARAGINE', 'D-ASPARTATE', 'D-CYSTEINE', 'DECARBOXY(PARAHYDROXYBENZYLIDENE-IMIDAZOLIDINONE)THREONINE', 'D-GLUTAMATE', 'D-GLUTAMINE', 'D-HISTIDINE', 'D-ISOLEUCINE', 'D-ISOVALINE', 'D-LEUCINE', 'D-LYSINE', 'D-PHENYLALANINE', 'D-PROLINE', 'D-SERINE', 'D-THREONINE', 'D-TRYPTOPHANE', 'D-TYROSINE', 'D-VALINE', 'FORMYL_GROUP', 'GAMMA-CARBOXY-GLUTAMIC_ACID', 'ISOVALERIC_ACID', 'LYSINE_NZ-CARBOXYLIC_ACID', "LYSINE-PYRIDOXAL-5'-PHOSPHATE", 'N-CARBOXYMETHIONINE', 'N-FORMYLMETHIONINE', 'N-METHYLLEUCINE', 'N-METHYLVALINE', 'NORLEUCINE', 'O-PHOSPHOTYROSINE', 'ORNITHINE', 'PHOSPHOSERINE', 'PHOSPHOTHREONINE', 'PYROGLUTAMIC_ACID', 'PYRUVOYL_GROUP', 'SARCOSINE', 'S-HYDROXY-CYSTEINE', 'S-HYDROXYCYSTEINE', 'S-MERCAPTOCYSTEINE', 'S-OXY_CYSTEINE', 'S,S-(2-HYDROXYETHYL)THIOCYSTEINE', 'SULFONATED_TYROSINE', 'TERT-BUTYLOXYCARBONYL_GROUP', 'TOPO-QUINONE', 'TYROSINE-O-SULPHONIC_ACID']#

Non-standard residue info taken from: https://www.globalphasing.com/buster/manual/maketnt/manual/lib_val/library_validation.html

PYL (pyrrolysine) and SEC (selenocysteine) are added.

graphein.protein.resi_atoms.NON_STANDARD_RESIS_PARENT: Dict[str, str] = {'5HP': 'GLU', 'ABA': 'ALA', 'ACE': '-', 'AIB': 'ALA', 'BMT': 'THR', 'BOC': '-', 'CBX': '-', 'CEA': 'CYS', 'CGU': 'GLU', 'CME': 'CYS', 'CRO': 'CRO', 'CSD': 'CYS', 'CSO': 'CYS', 'CSS': 'CYS', 'CSW': 'CYS', 'CSX': 'CYS', 'CXM': 'MET', 'DAL': 'ALA', 'DAR': 'ARG', 'DCY': 'CYS', 'DGL': 'GLU', 'DGN': 'GLN', 'DHI': 'HIS', 'DIL': 'ILE', 'DIV': 'VAL', 'DLE': 'LEU', 'DLY': 'LYS', 'DPN': 'PHE', 'DPR': 'PRO', 'DSG': 'ASN', 'DSN': 'SER', 'DSP': 'ASP', 'DTH': 'THR', 'DTR': 'DTR', 'DTY': 'TYR', 'DVA': 'VAL', 'FME': 'MET', 'FOR': '-', 'HYP': 'PRO', 'IVA': '-', 'KCX': 'LYS', 'LLP': 'LYS', 'MLE': 'LEU', 'MVA': 'VAL', 'NH2': '-', 'NLE': 'LEU', 'OCS': 'CYS', 'ORN': 'ALA', 'PCA': 'GLU', 'PTR': 'TYR', 'PVL': '-', 'PYL': 'LYS', 'SAR': 'GLY', 'SEC': 'CYS', 'SEP': 'SER', 'STY': 'TYR', 'TPO': 'THR', 'TPQ': 'PHE', 'TYS': 'TYR'}#

Mapping of 3-letter non-standard/modified residues to their 3-letter parent residue names.

graphein.protein.resi_atoms.NON_STANDARD_RESI_NAMES: List[str] = ['CSD', 'HYP', 'BMT', '5HP', 'ACE', 'ABA', 'AIB', 'NH2', 'CBX', 'CSW', 'OCS', 'DAL', 'DAR', 'DSG', 'DSP', 'DCY', 'CRO', 'DGL', 'DGN', 'DHI', 'DIL', 'DIV', 'DLE', 'DLY', 'DPN', 'DPR', 'DSN', 'DTH', 'DTR', 'DTY', 'DVA', 'FOR', 'CGU', 'IVA', 'KCX', 'LLP', 'CXM', 'FME', 'MLE', 'MVA', 'NLE', 'PTR', 'ORN', 'SEP', 'SEC', 'TPO', 'PCA', 'PVL', 'PYL', 'SAR', 'CEA', 'CSO', 'CSS', 'CSX', 'CME', 'TYS', 'BOC', 'TPQ', 'STY']#

List of non-standard residue 3-letter names.

Collected from: https://www.globalphasing.com/buster/manual/maketnt/manual/lib_val/library_validation.html

graphein.protein.resi_atoms.PI_RESIS: List[str] = ['PHE', 'TYR', 'TRP']#

List of residues involved in pi interactions.

graphein.protein.resi_atoms.POS_AA: List[str] = ['HIS', 'LYS', 'ARG']#

Positively charged amino acids.
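Membership lists such as PI_RESIS and POS_AA lend themselves to simple lookups. The sketch below flags residues by category. Both lists are copied from the values shown above so the snippet runs without graphein, and the helper `annotate_residues` is hypothetical rather than part of the library.

```python
from typing import Dict, List

# Values copied from the documentation above. With graphein installed:
# from graphein.protein.resi_atoms import PI_RESIS, POS_AA
PI_RESIS: List[str] = ["PHE", "TYR", "TRP"]
POS_AA: List[str] = ["HIS", "LYS", "ARG"]


def annotate_residues(residues: List[str]) -> List[Dict[str, object]]:
    """Hypothetical helper: flag each residue as pi-capable and/or positively charged."""
    pi, pos = set(PI_RESIS), set(POS_AA)
    return [
        {"residue": r, "pi": r in pi, "positive": r in pos}
        for r in residues
    ]


for row in annotate_residues(["PHE", "LYS", "GLY"]):
    print(row)
```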

graphein.protein.resi_atoms.RESIDUE_ATOM_BOND_STATE: Dict[str, Dict[str, str]] = {'ARG': {'CD': 'Csb', 'CG': 'Csb', 'CZ': 'Cdb', 'NE': 'Nsb', 'NH1': 'Nres', 'NH2': 'Nres'}, 'ASN': {'CG': 'Csb', 'ND2': 'Ndb', 'OD1': 'Odb'}, 'ASP': {'CG': 'Csb', 'OD1': 'Ores', 'OD2': 'Ores'}, 'CYS': {'SG': 'Ssb'}, 'GLN': {'CD': 'Csb', 'CG': 'Csb', 'NE2': 'Ndb', 'OE1': 'Odb'}, 'GLU': {'CD': 'Csb', 'CG': 'Csb', 'OE1': 'Ores', 'OE2': 'Ores'}, 'HIS': {'CD2': 'Cdb', 'CE1': 'Cdb', 'CG': 'Cdb', 'ND1': 'Nsb', 'NE2': 'Ndb'}, 'ILE': {'CD1': 'Csb', 'CG1': 'Csb', 'CG2': 'Csb'}, 'LEU': {'CD1': 'Csb', 'CD2': 'Csb', 'CG': 'Csb'}, 'LYS': {'CD': 'Csb', 'CE': 'Csb', 'CG': 'Csb', 'NZ': 'Nsb'}, 'MET': {'CE': 'Csb', 'CG': 'Csb', 'SD': 'Ssb'}, 'PHE': {'CD1': 'Cres', 'CD2': 'Cres', 'CE1': 'Cdb', 'CE2': 'Cdb', 'CG': 'Cdb', 'CZ': 'Cres'}, 'PRO': {'CD': 'Csb', 'CG': 'Csb'}, 'SER': {'OG': 'Osb'}, 'THR': {'CG2': 'Csb', 'OG1': 'Osb'}, 'TRP': {'CD1': 'Cdb', 'CD2': 'Cres', 'CE2': 'Cdb', 'CE3': 'Cdb', 'CG': 'Cdb', 'CH2': 'Cdb', 'CZ2': 'Cres', 'CZ3': 'Cres', 'NE1': 'Nsb'}, 'TYR': {'CD1': 'Cres', 'CD2': 'Cres', 'CE1': 'Cdb', 'CE2': 'Cdb', 'CG': 'Cdb', 'CZ': 'Cres', 'OH': 'Osb'}, 'VAL': {'CG1': 'Csb', 'CG2': 'Csb'}, 'XXX': {'C': 'Cdb', 'CA': 'Csb', 'CB': 'Csb', 'H': 'Hsb', 'N': 'Nsb', 'O': 'Odb', 'OXT': 'Osb'}}#

Assignment of the constituent atoms of each standard residue to the atom classes (bond states) used to assign atomic radii.

Covalent radii from:

Heyrovska, Raji : ‘Atomic Structures of all the Twenty Essential Amino Acids and a Tripeptide, with Bond Lengths as Sums of Atomic Covalent Radii’

Paper: https://arxiv.org/pdf/0804.2488.pdf
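The nested layout (residue name, then atom name, then bond-state label) can be queried as sketched below. The excerpt is copied from the value shown above. Treating the "XXX" entry as a fallback for atoms shared by all residues (backbone atoms and CB) is an assumption inferred from its contents, and the helper `bond_state` is hypothetical.

```python
from typing import Dict, Optional

# Excerpt copied from RESIDUE_ATOM_BOND_STATE above. With graphein installed:
# from graphein.protein.resi_atoms import RESIDUE_ATOM_BOND_STATE
RESIDUE_ATOM_BOND_STATE: Dict[str, Dict[str, str]] = {
    "SER": {"OG": "Osb"},
    "CYS": {"SG": "Ssb"},
    "XXX": {
        "C": "Cdb", "CA": "Csb", "CB": "Csb", "H": "Hsb",
        "N": "Nsb", "O": "Odb", "OXT": "Osb",
    },
}


def bond_state(residue: str, atom: str) -> Optional[str]:
    """Hypothetical helper: look up the bond-state label of an atom.

    Residue-specific entries are checked first. Otherwise the generic
    "XXX" entry is consulted (assumed here to cover shared atoms).
    Returns None if the atom is not found in either place.
    """
    specific = RESIDUE_ATOM_BOND_STATE.get(residue, {})
    if atom in specific:
        return specific[atom]
    return RESIDUE_ATOM_BOND_STATE["XXX"].get(atom)


print(bond_state("SER", "OG"))
print(bond_state("SER", "CA"))
print(bond_state("SER", "ZZ"))
```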

graphein.protein.resi_atoms.RESI_NAMES: List[str] = ['ALA', 'ASX', 'CYS', 'ASP', 'GLU', 'PHE', 'GLY', 'HIS', 'ILE', 'LYS', 'LEU', 'MET', 'ASN', 'PRO', 'GLN', 'ARG', 'SER', 'THR', 'VAL', 'TRP', 'TYR', 'GLX', 'CSD', 'HYP', 'BMT', '5HP', 'ACE', 'ABA', 'AIB', 'NH2', 'CBX', 'CSW', 'OCS', 'DAL', 'DAR', 'DSG', 'DSP', 'DCY', 'CRO', 'DGL', 'DGN', 'DHI', 'DIL', 'DIV', 'DLE', 'DLY', 'DPN', 'DPR', 'DSN', 'DTH', 'DTR', 'DTY', 'DVA', 'FOR', 'CGU', 'IVA', 'KCX', 'LLP', 'CXM', 'FME', 'MLE', 'MVA', 'NLE', 'PTR', 'ORN', 'SEP', 'SEC', 'TPO', 'PCA', 'PVL', 'PYL', 'SAR', 'CEA', 'CSO', 'CSS', 'CSX', 'CME', 'TYS', 'BOC', 'TPQ', 'STY', 'UNK']#

3-letter residue names for all amino acids, covering both the standard residues and the non-standard/modified residues listed in NON_STANDARD_RESI_NAMES. Includes "UNK" to denote unknown residues.

graphein.protein.resi_atoms.RESI_THREE_TO_1: Dict[str, str] = {'3HP': 'X', '4HP': 'X', '5HP': 'Q', 'ABA': 'A', 'ACE': 'X', 'AIB': 'A', 'ALA': 'A', 'ARG': 'R', 'ASN': 'N', 'ASP': 'D', 'ASX': 'B', 'BMT': 'T', 'BOC': 'X', 'CBX': 'X', 'CEA': 'C', 'CGU': 'E', 'CME': 'C', 'CRO': 'TYG', 'CSD': 'C', 'CSO': 'C', 'CSS': 'C', 'CSW': 'C', 'CSX': 'C', 'CXM': 'M', 'CYS': 'C', 'DAL': 'A', 'DAR': 'R', 'DCY': 'C', 'DGL': 'E', 'DGN': 'Q', 'DHI': 'H', 'DIL': 'I', 'DIV': 'V', 'DLE': 'L', 'DLY': 'K', 'DPN': 'F', 'DPR': 'P', 'DSG': 'N', 'DSN': 'S', 'DSP': 'D', 'DTH': 'T', 'DTR': 'W', 'DTY': 'Y', 'DVA': 'V', 'FME': 'M', 'FOR': 'X', 'GLN': 'Q', 'GLU': 'E', 'GLX': 'Z', 'GLY': 'G', 'HIS': 'H', 'HYP': 'P', 'ILE': 'I', 'IVA': 'X', 'KCX': 'K', 'LEU': 'L', 'LLP': 'K', 'LYS': 'K', 'MET': 'M', 'MLE': 'L', 'MVA': 'V', 'NH2': 'X', 'NLE': 'L', 'OCS': 'C', 'ORN': 'A', 'PCA': 'Q', 'PHE': 'F', 'PRO': 'P', 'PTR': 'Y', 'PVL': 'X', 'PYL': 'O', 'SAR': 'G', 'SEC': 'U', 'SEP': 'S', 'SER': 'S', 'STY': 'Y', 'THR': 'T', 'TPO': 'T', 'TPQ': 'Y', 'TRP': 'W', 'TYR': 'Y', 'TYS': 'Y', 'UNK': 'X', 'VAL': 'V'}#

Mapping of 3-letter residue names to 1-letter residue names. Non-standard/modified amino acids are mapped to their parent amino acid. Includes "UNK" to denote unknown residues.
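A minimal sketch of converting a list of 3-letter residue names into a one-letter string follows. The excerpt is copied from the value shown above. The helper `three_to_one` and its choice to emit "X" for names missing from the mapping are assumptions made for illustration. Note that "CRO" maps to the three-character string "TYG", so the output can be longer than the input list.

```python
from typing import Dict, Iterable

# Excerpt copied from RESI_THREE_TO_1 above. With graphein installed:
# from graphein.protein.resi_atoms import RESI_THREE_TO_1
RESI_THREE_TO_1: Dict[str, str] = {
    "MET": "M",
    "SER": "S",
    "SEP": "S",    # phosphoserine maps to its parent, serine
    "LYS": "K",
    "CRO": "TYG",  # maps to three letters
    "UNK": "X",
}


def three_to_one(residues: Iterable[str], unknown: str = "X") -> str:
    """Hypothetical helper: join the 1-letter codes of the given residues.

    Names absent from the mapping are replaced by `unknown`.
    """
    return "".join(RESI_THREE_TO_1.get(r.upper(), unknown) for r in residues)


print(three_to_one(["MET", "SEP", "LYS", "FOO"]))
print(three_to_one(["MET", "CRO"]))
```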

graphein.protein.resi_atoms.SCHNEIDER_WREDE_DISTMAT: Dict[str, float] = {'AA': 0.0, 'AC': 0.112, 'AD': 0.819, 'AE': 0.827, 'AF': 0.54, 'AG': 0.208, 'AH': 0.696, 'AI': 0.407, 'AK': 0.891, 'AL': 0.406, 'AM': 0.379, 'AN': 0.318, 'AP': 0.191, 'AQ': 0.372, 'AR': 1.0, 'AS': 0.094, 'AT': 0.22, 'AV': 0.273, 'AW': 0.739, 'AY': 0.552, 'CA': 0.114, 'CC': 0.0, 'CD': 0.847, 'CE': 0.838, 'CF': 0.437, 'CG': 0.32, 'CH': 0.66, 'CI': 0.304, 'CK': 0.887, 'CL': 0.301, 'CM': 0.277, 'CN': 0.324, 'CP': 0.157, 'CQ': 0.341, 'CR': 1.0, 'CS': 0.176, 'CT': 0.233, 'CV': 0.167, 'CW': 0.639, 'CY': 0.457, 'DA': 0.729, 'DC': 0.742, 'DD': 0.0, 'DE': 0.124, 'DF': 0.924, 'DG': 0.697, 'DH': 0.435, 'DI': 0.847, 'DK': 0.249, 'DL': 0.841, 'DM': 0.819, 'DN': 0.56, 'DP': 0.657, 'DQ': 0.584, 'DR': 0.295, 'DS': 0.667, 'DT': 0.649, 'DV': 0.797, 'DW': 1.0, 'DY': 0.836, 'EA': 0.79, 'EC': 0.788, 'ED': 0.133, 'EE': 0.0, 'EF': 0.932, 'EG': 0.779, 'EH': 0.406, 'EI': 0.86, 'EK': 0.143, 'EL': 0.854, 'EM': 0.83, 'EN': 0.599, 'EP': 0.688, 'EQ': 0.598, 'ER': 0.234, 'ES': 0.726, 'ET': 0.682, 'EV': 0.824, 'EW': 1.0, 'EY': 0.837, 'FA': 0.508, 'FC': 0.405, 'FD': 0.977, 'FE': 0.918, 'FF': 0.0, 'FG': 0.69, 'FH': 0.663, 'FI': 0.128, 'FK': 0.903, 'FL': 0.131, 'FM': 0.169, 'FN': 0.541, 'FP': 0.42, 'FQ': 0.459, 'FR': 1.0, 'FS': 0.548, 'FT': 0.499, 'FV': 0.252, 'FW': 0.207, 'FY': 0.179, 'GA': 0.206, 'GC': 0.312, 'GD': 0.776, 'GE': 0.807, 'GF': 0.727, 'GG': 0.0, 'GH': 0.769, 'GI': 0.592, 'GK': 0.894, 'GL': 0.591, 'GM': 0.557, 'GN': 0.381, 'GP': 0.323, 'GQ': 0.467, 'GR': 1.0, 'GS': 0.158, 'GT': 0.272, 'GV': 0.464, 'GW': 0.923, 'GY': 0.728, 'HA': 0.896, 'HC': 0.836, 'HD': 0.629, 'HE': 0.547, 'HF': 0.907, 'HG': 1.0, 'HH': 0.0, 'HI': 0.848, 'HK': 0.566, 'HL': 0.842, 'HM': 0.825, 'HN': 0.754, 'HP': 0.777, 'HQ': 0.716, 'HR': 0.697, 'HS': 0.865, 'HT': 0.834, 'HV': 0.831, 'HW': 0.981, 'HY': 0.821, 'IA': 0.403, 'IC': 0.296, 'ID': 0.942, 'IE': 0.891, 'IF': 0.134, 'IG': 0.592, 'IH': 0.652, 'II': 0.0, 'IK': 0.892, 'IL': 0.013, 'IM': 0.057, 
'IN': 0.457, 'IP': 0.311, 'IQ': 0.383, 'IR': 1.0, 'IS': 0.443, 'IT': 0.396, 'IV': 0.133, 'IW': 0.339, 'IY': 0.213, 'KA': 0.889, 'KC': 0.871, 'KD': 0.279, 'KE': 0.149, 'KF': 0.957, 'KG': 0.9, 'KH': 0.438, 'KI': 0.899, 'KK': 0.0, 'KL': 0.892, 'KM': 0.871, 'KN': 0.667, 'KP': 0.757, 'KQ': 0.639, 'KR': 0.154, 'KS': 0.825, 'KT': 0.759, 'KV': 0.882, 'KW': 1.0, 'KY': 0.848, 'LA': 0.405, 'LC': 0.296, 'LD': 0.944, 'LE': 0.892, 'LF': 0.139, 'LG': 0.596, 'LH': 0.653, 'LI': 0.013, 'LK': 0.893, 'LL': 0.0, 'LM': 0.062, 'LN': 0.452, 'LP': 0.309, 'LQ': 0.376, 'LR': 1.0, 'LS': 0.443, 'LT': 0.397, 'LV': 0.133, 'LW': 0.341, 'LY': 0.205, 'MA': 0.383, 'MC': 0.276, 'MD': 0.932, 'ME': 0.879, 'MF': 0.182, 'MG': 0.569, 'MH': 0.648, 'MI': 0.058, 'MK': 0.884, 'ML': 0.062, 'MM': 0.0, 'MN': 0.447, 'MP': 0.285, 'MQ': 0.372, 'MR': 1.0, 'MS': 0.417, 'MT': 0.358, 'MV': 0.12, 'MW': 0.391, 'MY': 0.255, 'NA': 0.424, 'NC': 0.425, 'ND': 0.838, 'NE': 0.835, 'NF': 0.766, 'NG': 0.512, 'NH': 0.78, 'NI': 0.615, 'NK': 0.891, 'NL': 0.603, 'NM': 0.588, 'NN': 0.0, 'NP': 0.266, 'NQ': 0.175, 'NR': 1.0, 'NS': 0.361, 'NT': 0.368, 'NV': 0.503, 'NW': 0.945, 'NY': 0.641, 'PA': 0.22, 'PC': 0.179, 'PD': 0.852, 'PE': 0.831, 'PF': 0.515, 'PG': 0.376, 'PH': 0.696, 'PI': 0.363, 'PK': 0.875, 'PL': 0.357, 'PM': 0.326, 'PN': 0.231, 'PP': 0.0, 'PQ': 0.228, 'PR': 1.0, 'PS': 0.196, 'PT': 0.161, 'PV': 0.244, 'PW': 0.72, 'PY': 0.481, 'QA': 0.512, 'QC': 0.462, 'QD': 0.903, 'QE': 0.861, 'QF': 0.671, 'QG': 0.648, 'QH': 0.765, 'QI': 0.532, 'QK': 0.881, 'QL': 0.518, 'QM': 0.505, 'QN': 0.181, 'QP': 0.272, 'QQ': 0.0, 'QR': 1.0, 'QS': 0.461, 'QT': 0.389, 'QV': 0.464, 'QW': 0.831, 'QY': 0.522, 'RA': 0.919, 'RC': 0.905, 'RD': 0.305, 'RE': 0.225, 'RF': 0.977, 'RG': 0.928, 'RH': 0.498, 'RI': 0.929, 'RK': 0.141, 'RL': 0.92, 'RM': 0.908, 'RN': 0.69, 'RP': 0.796, 'RQ': 0.668, 'RR': 0.0, 'RS': 0.86, 'RT': 0.808, 'RV': 0.914, 'RW': 1.0, 'RY': 0.859, 'SA': 0.1, 'SC': 0.185, 'SD': 0.801, 'SE': 0.812, 'SF': 0.622, 'SG': 0.17, 'SH': 0.718, 'SI': 0.478, 
'SK': 0.883, 'SL': 0.474, 'SM': 0.44, 'SN': 0.289, 'SP': 0.181, 'SQ': 0.358, 'SR': 1.0, 'SS': 0.0, 'ST': 0.174, 'SV': 0.342, 'SW': 0.827, 'SY': 0.615, 'TA': 0.251, 'TC': 0.261, 'TD': 0.83, 'TE': 0.812, 'TF': 0.604, 'TG': 0.312, 'TH': 0.737, 'TI': 0.455, 'TK': 0.866, 'TL': 0.453, 'TM': 0.403, 'TN': 0.315, 'TP': 0.159, 'TQ': 0.322, 'TR': 1.0, 'TS': 0.185, 'TT': 0.0, 'TV': 0.345, 'TW': 0.816, 'TY': 0.596, 'VA': 0.275, 'VC': 0.165, 'VD': 0.9, 'VE': 0.867, 'VF': 0.269, 'VG': 0.471, 'VH': 0.649, 'VI': 0.135, 'VK': 0.889, 'VL': 0.134, 'VM': 0.12, 'VN': 0.38, 'VP': 0.212, 'VQ': 0.339, 'VR': 1.0, 'VS': 0.322, 'VT': 0.305, 'VV': 0.0, 'VW': 0.472, 'VY': 0.31, 'WA': 0.658, 'WC': 0.56, 'WD': 1.0, 'WE': 0.931, 'WF': 0.196, 'WG': 0.829, 'WH': 0.678, 'WI': 0.305, 'WK': 0.892, 'WL': 0.304, 'WM': 0.344, 'WN': 0.631, 'WP': 0.555, 'WQ': 0.538, 'WR': 0.968, 'WS': 0.689, 'WT': 0.638, 'WV': 0.418, 'WW': 0.0, 'WY': 0.204, 'YA': 0.587, 'YC': 0.478, 'YD': 1.0, 'YE': 0.932, 'YF': 0.202, 'YG': 0.782, 'YH': 0.678, 'YI': 0.23, 'YK': 0.904, 'YL': 0.219, 'YM': 0.268, 'YN': 0.512, 'YP': 0.444, 'YQ': 0.404, 'YR': 0.995, 'YS': 0.612, 'YT': 0.557, 'YV': 0.328, 'YW': 0.244, 'YY': 0.0}#

Schneider-Wrede physicochemical distance matrix, taken from ProPy3 (https://github.com/MartinThoma/propy3).

Paper:

G. Schneider, P. Wrede: ‘The rational design of amino acid sequences by artificial neural networks and simulated molecular evolution: de novo design of an idealized leader peptidase cleavage site’, Biophysical Journal, Volume 66, Issue 2, Part 1, February 1994, Pages 335-344.
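The matrix is stored flat, keyed by the concatenation of two one-letter codes. The listed values are not exactly symmetric (for example 'AC' is 0.112 while 'CA' is 0.114), so key order matters. The sketch below uses a small excerpt copied from the value shown above. The helpers `sw_distance` and `mean_sw_distance` are hypothetical, and averaging per-position distances over two equal-length sequences is only one illustrative way to use the matrix.

```python
from typing import Dict

# Excerpt copied from SCHNEIDER_WREDE_DISTMAT above. With graphein installed:
# from graphein.protein.resi_atoms import SCHNEIDER_WREDE_DISTMAT
SCHNEIDER_WREDE_DISTMAT: Dict[str, float] = {
    "AA": 0.0, "AC": 0.112,
    "CA": 0.114, "CC": 0.0,
    "IL": 0.013, "LI": 0.013,
}


def sw_distance(a: str, b: str) -> float:
    """Hypothetical helper: distance for the ordered pair (a, b)."""
    return SCHNEIDER_WREDE_DISTMAT[a + b]


def mean_sw_distance(seq_a: str, seq_b: str) -> float:
    """Hypothetical helper: mean per-position distance of two equal-length sequences."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and of equal length")
    total = sum(sw_distance(a, b) for a, b in zip(seq_a, seq_b))
    return total / len(seq_a)


print(sw_distance("A", "C"), sw_distance("C", "A"))
print(round(mean_sw_distance("ACI", "CAL"), 4))
```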

graphein.protein.resi_atoms.STANDARD_AMINO_ACIDS: List[str] = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'X', 'Y', 'Z']#

Vocabulary of amino acids with one-letter codes. Includes fuzzy standard amino acids: "B" denotes "ASX" which corresponds to "ASP" ("D") or "ASN" ("N") and "Z" denotes "GLX" which corresponds to "GLU" ("E") or "GLN" ("Q").
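One possible use of this vocabulary is checking a one-letter sequence before featurisation, as sketched below. The list is copied from the value shown above. The helper `invalid_positions` is hypothetical. As listed, the vocabulary does not contain "O" or "U", so sequences using those codes would be flagged by this check.

```python
from typing import List, Tuple

# Value copied from STANDARD_AMINO_ACIDS above. With graphein installed:
# from graphein.protein.resi_atoms import STANDARD_AMINO_ACIDS
STANDARD_AMINO_ACIDS: List[str] = [
    "A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L",
    "M", "N", "P", "Q", "R", "S", "T", "V", "W", "X", "Y", "Z",
]


def invalid_positions(sequence: str) -> List[Tuple[int, str]]:
    """Hypothetical helper: return (index, character) for each symbol outside the vocabulary."""
    vocab = set(STANDARD_AMINO_ACIDS)
    return [
        (i, aa) for i, aa in enumerate(sequence.upper()) if aa not in vocab
    ]


print(invalid_positions("MKTBZ"))
print(invalid_positions("MKU*"))
```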

graphein.protein.resi_atoms.STANDARD_RESI_NAMES: List[str] = ['ALA', 'ASX', 'CYS', 'ASP', 'GLU', 'PHE', 'GLY', 'HIS', 'ILE', 'LYS', 'LEU', 'MET', 'ASN', 'PRO', 'GLN', 'ARG', 'SER', 'THR', 'VAL', 'TRP', 'TYR', 'GLX', 'UNK']#

List of standard residue 3-letter names. Includes "UNK" for unknown residues. "ASX" denotes "ASP" or "ASN" and "GLX" denotes "GLU" or "GLN".

graphein.protein.resi_atoms.SULPHUR_RESIS: List[str] = ['MET', 'CYS']#

Residues containing sulphur atoms.