graphein.protein#
Config#
Base Config object for use with Protein Graph Construction.
- class graphein.protein.config.GetContactsConfig(*, get_contacts_path: pathlib.Path = PosixPath('/Users/arianjamasb/github/getcontacts'), contacts_dir: pathlib.Path = PosixPath('/Users/arianjamasb/graphein/examples/contacts'), pdb_dir: pathlib.Path = PosixPath('/Users/arianjamasb/graphein/examples/pdbs'), granularity: str = 'CA')[source]#
Config object for parameters relating to running GetContacts. GetContacts is an optional dependency from which intramolecular interactions can be computed and used as edges in the graph.
More information about GetContacts can be found at https://getcontacts.github.io/
- Parameters
get_contacts_path (pathlib.Path) – Path to GetContacts installation.
contacts_dir (pathlib.Path) – Path to store output of GetContacts.
pdb_dir (pathlib.Path) – Path to PDB files to be used to compute intramolecular interactions.
granularity (str) – Specifies the node types of the graph, defaults to "CA" for alpha-carbons as nodes. Other options are "CB" (beta-carbon), "atom" for all-atom graphs, and "centroid" for nodes positioned as residue centroids.
- graphein.protein.config.GranularityOpts#
Allowable granularity options for nodes in the graph.
alias of Literal['atom', 'centroids']
- graphein.protein.config.GraphAtoms#
Allowable atom types for nodes in the graph.
alias of Literal['N', 'CA', 'C', 'O', 'CB', 'OG', 'CG', 'CD1', 'CD2', 'CE1', 'CE2', 'CZ', 'OD1', 'ND2', 'CG1', 'CG2', 'CD', 'CE', 'NZ', 'OD2', 'OE1', 'NE2', 'OE2', 'OH', 'NE', 'NH1', 'NH2', 'OG1', 'SD', 'ND1', 'SG', 'NE1', 'CE3', 'CZ2', 'CZ3', 'CH2', 'OXT']
- class graphein.protein.config.ProteinGraphConfig(*, granularity: typing.Union[typing.Literal['N', 'CA', 'C', 'O', 'CB', 'OG', 'CG', 'CD1', 'CD2', 'CE1', 'CE2', 'CZ', 'OD1', 'ND2', 'CG1', 'CG2', 'CD', 'CE', 'NZ', 'OD2', 'OE1', 'NE2', 'OE2', 'OH', 'NE', 'NH1', 'NH2', 'OG1', 'SD', 'ND1', 'SG', 'NE1', 'CE3', 'CZ2', 'CZ3', 'CH2', 'OXT'], typing.Literal['atom', 'centroids']] = 'CA', keep_hets: bool = False, insertions: bool = False, pdb_dir: pathlib.Path = PosixPath('../examples/pdbs'), verbose: bool = False, exclude_waters: bool = True, deprotonate: bool = False, protein_df_processing_functions: typing.List[typing.Callable] = None, edge_construction_functions: typing.List[typing.Union[typing.Callable, str]] = [<function add_peptide_bonds>], node_metadata_functions: typing.List[typing.Union[typing.Callable, str]] = [<function meiler_embedding>], edge_metadata_functions: typing.List[typing.Union[typing.Callable, str]] = None, graph_metadata_functions: typing.List[typing.Callable] = None, get_contacts_config: graphein.protein.config.GetContactsConfig = None, dssp_config: graphein.protein.config.DSSPConfig = None)[source]#
Config Object for Protein Structure Graph Construction.
If you encounter a problematic structure, perusing https://www.umass.edu/microbio/chime/pe_beta/pe/protexpl/badpdbs.htm may provide some additional insight. PDBs are notoriously troublesome and this is an excellent overview.
- Parameters
granularity (str (Union[graphein.protein.config.GraphAtoms, graphein.protein.config.GranularityOpts])) – Controls the granularity of the graph construction. "atom" builds an atomic-scale graph where nodes are constituent atoms. Residue-level graphs can be built by specifying which constituent atom should represent node positions (see GraphAtoms). Additionally, "centroids" can be specified to place each node at the centre of gravity of its residue (specified in GranularityOpts). Defaults to "CA" (alpha-carbon).
keep_hets (bool) – Controls whether or not heteroatoms are removed from the PDB file. These are typically modified residues, bound ligands, crystallographic adjuvants, ions or water molecules. For more information, see: https://proteopedia.org/wiki/index.php/Hetero_atoms
insertions (bool) – Controls whether or not insertions are allowed.
pdb_dir (pathlib.Path) – Specifies path to download protein structures into.
verbose (bool) – Specifies verbosity of graph creation process.
exclude_waters (bool) – Specifies whether or not water molecules are excluded from the structure.
deprotonate (bool) – Specifies whether or not to remove H atoms from the graph.
protein_df_processing_functions (Optional[List[Callable]]) – List of functions that take a pd.DataFrame and return a pd.DataFrame. This allows users to define their own series of processing functions for the protein structure DataFrame and override the default sequencing of processing steps provided by Graphein. We refer users to our low-level API tutorial for more details.
edge_construction_functions (List[Callable]) – List of functions that take an nx.Graph and return an nx.Graph with desired edges added. Prepared edge constructions can be found in graphein.protein.edges.
node_metadata_functions (List[Callable], optional) – List of functions that take an nx.Graph and return an nx.Graph with added node-level features and metadata.
edge_metadata_functions (List[Callable], optional) – List of functions that take an nx.Graph and return an nx.Graph with added edge-level features and metadata.
graph_metadata_functions (List[Callable], optional) – List of functions that take an nx.Graph and return an nx.Graph with added graph-level features and metadata.
get_contacts_config (GetContactsConfig, optional) – Config object containing parameters for running GetContacts for computing intramolecular contact-based edges. Defaults to None.
dssp_config (DSSPConfig, optional) – Config object containing reference to DSSP executable. Defaults to None. NB DSSP must be installed. See installation instructions: https://graphein.ai/getting_started/installation.html#optional-dependencies
- class graphein.protein.config.ProteinMeshConfig(*, pymol_command_line_options: str = '-cKq', pymol_commands: List[str] = ['show surface'])[source]#
Config object for parameters relating to Protein Mesh construction with PyMol.
NB PyMol must be installed. See: https://graphein.ai/getting_started/installation.html#optional-dependencies
- Parameters
pymol_command_line_options (str, optional) – List of CLI args for running PyMol. See: https://pymolwiki.org/index.php/Command_Line_Options. Defaults to "-cKq".
pymol_commands (List[str], optional) – List of commands passed to PyMol in surface construction.
Graphs#
Functions for working with Protein Structure Graphs.
- graphein.protein.graphs.add_nodes_to_graph(G: networkx.classes.graph.Graph, protein_df: Optional[pandas.core.frame.DataFrame] = None, verbose: bool = False) networkx.classes.graph.Graph [source]#
Add nodes into protein graph.
- Parameters
G (nx.Graph) – nx.Graph with metadata to populate with nodes.
protein_df (pd.DataFrame, optional) – DataFrame of protein structure containing nodes & initial node metadata to add to the graph.
verbose (bool) – Controls verbosity of this step.
- Returns
nx.Graph with nodes added.
- Return type
nx.Graph
- graphein.protein.graphs.assign_node_id_to_dataframe(protein_df: pandas.core.frame.DataFrame, granularity: str) pandas.core.frame.DataFrame [source]#
Assigns the node ID back to the pdb_df dataframe.
- Parameters
protein_df (pd.DataFrame) – Structure Dataframe
granularity (str) – Granularity of graph. Atom-level, residue (e.g. CA) or centroids. See: GRAPH_ATOMS and GRANULARITY_OPTS.
- Returns
Returns dataframe with added
node_ids
- Return type
pd.DataFrame
- graphein.protein.graphs.calculate_centroid_positions(atoms: pandas.core.frame.DataFrame, verbose: bool = False) pandas.core.frame.DataFrame [source]#
Calculates position of sidechain centroids.
- Parameters
atoms (pd.DataFrame) – ATOM df of protein structure.
verbose (bool) – bool controlling verbosity.
- Returns
centroids (df).
- Return type
pd.DataFrame
- graphein.protein.graphs.compute_chain_graph(g: networkx.classes.graph.Graph, chain_list: Optional[List[str]] = None, remove_self_loops: bool = False, return_weighted_graph: bool = False) Union[networkx.classes.graph.Graph, networkx.classes.multigraph.MultiGraph] [source]#
Computes a chain-level graph from a protein structure graph.
This graph features nodes as individual chains in a complex and edges as the interactions between constituent nodes in each chain. You have the option of returning an unweighted graph (multigraph, return_weighted_graph=False) or a weighted graph (return_weighted_graph=True). The difference is that the unweighted graph features an edge for each interaction between chains (i.e. the number of edges will be equal to the number of edges in the input protein structure graph), while the weighted graph sums these interactions into a single edge between chains, with the counts stored as features.
- Parameters
g (nx.Graph) – A protein structure graph to compute the chain graph of.
chain_list (Optional[List[str]]) – A list of chains to extract from the input graph. If None, all chains will be used. This is provided as input to extract_subgraph_from_chains. Default is None.
remove_self_loops (bool) – Whether to remove self-loops from the graph. Default is False.
return_weighted_graph (bool) – Whether to return a weighted graph. Default is False.
- Returns
A chain-level graph.
- Return type
Union[nx.Graph, nx.MultiGraph]
- graphein.protein.graphs.compute_edges(G: networkx.classes.graph.Graph, funcs: List[Callable], get_contacts_config: Optional[graphein.protein.config.GetContactsConfig] = None) networkx.classes.graph.Graph [source]#
Computes edges for the protein structure graph. Will compute a pairwise distance matrix between nodes which is added to the graph metadata to facilitate some edge computations.
- Parameters
G (nx.Graph) – nx.Graph with nodes to add edges to.
funcs (List[Callable]) – List of edge construction functions.
get_contacts_config (graphein.protein.config.GetContactsConfig) – Config object for GetContacts if intramolecular edges are being used.
- Returns
Graph with added edges.
- Return type
nx.Graph
- graphein.protein.graphs.compute_secondary_structure_graph(g: networkx.classes.graph.Graph, allowable_ss_elements: Optional[List[str]] = None, remove_non_ss: bool = True, remove_self_loops: bool = False, return_weighted_graph: bool = False) Union[networkx.classes.graph.Graph, networkx.classes.multigraph.MultiGraph] [source]#
Computes a secondary structure graph from a protein structure graph.
- Parameters
g (nx.Graph) – A protein structure graph to compute the secondary structure graph of.
remove_non_ss (bool) – Whether to remove non-secondary structure nodes from the graph. These are denoted as "-" by DSSP. Default is True.
remove_self_loops (bool) – Whether to remove self-loops from the graph. Default is False.
return_weighted_graph (bool) – Whether to return a weighted graph. Default is False.
- Raises
ProteinGraphConfigurationError – If the protein structure graph is not configured correctly with secondary structure assignments on all nodes.
- Returns
A secondary structure graph.
- Return type
Union[nx.Graph, nx.MultiGraph]
- graphein.protein.graphs.compute_weighted_graph_from_multigraph(g: networkx.classes.multigraph.MultiGraph) networkx.classes.graph.Graph [source]#
Computes a weighted graph from a multigraph.
This function is used to convert a multigraph to a weighted graph. The weights of the edges are the number of interactions between the nodes.
- Parameters
g (nx.MultiGraph) – A multigraph.
- Returns
A weighted graph.
- Return type
nx.Graph
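The multigraph-to-weighted-graph conversion described above can be illustrated without networkx. This is a minimal sketch rather than Graphein's implementation: the helper name `collapse_parallel_edges` and the plain edge-list representation are assumptions made for illustration only.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Edge = Tuple[str, str]


def collapse_parallel_edges(edges: Iterable[Edge]) -> Dict[Edge, int]:
    """Count parallel undirected edges (hypothetical helper, not a Graphein function)."""
    # Sort each pair so that (A, B) and (B, A) count as the same undirected edge.
    return dict(Counter(tuple(sorted(edge)) for edge in edges))


# One entry per residue-level interaction between (or within) chains.
chain_multigraph_edges = [("A", "B"), ("B", "A"), ("A", "B"), ("A", "C"), ("B", "B")]
weights = collapse_parallel_edges(chain_multigraph_edges)
```

Each key is one edge of the weighted graph and each value is the number of interactions it replaces; the `("B", "B")` entry is the kind of self-loop that `remove_self_loops=True` would drop.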
- graphein.protein.graphs.construct_graph(config: Optional[graphein.protein.config.ProteinGraphConfig] = None, pdb_path: Optional[str] = None, pdb_code: Optional[str] = None, chain_selection: str = 'all', df_processing_funcs: Optional[List[Callable]] = None, edge_construction_funcs: Optional[List[Callable]] = None, edge_annotation_funcs: Optional[List[Callable]] = None, node_annotation_funcs: Optional[List[Callable]] = None, graph_annotation_funcs: Optional[List[Callable]] = None) networkx.classes.graph.Graph [source]#
Constructs protein structure graph from a pdb_code or pdb_path.
Users can provide a ProteinGraphConfig object to specify construction parameters.
However, config parameters can be overridden by passing arguments directly to the function.
- Parameters
config (graphein.protein.config.ProteinGraphConfig, optional) – ProteinGraphConfig object. If None, defaults to config in graphein.protein.config.
pdb_path (str, optional) – Path to pdb_file to build graph from. Default is None.
pdb_code (str, optional) – 4-character PDB accession pdb_code to build graph from. Default is None.
chain_selection (str) – String of polypeptide chains to include in graph. E.g. "ABDF" or "all". Default is "all".
df_processing_funcs (List[Callable], optional) – List of dataframe processing functions. Default is None.
edge_construction_funcs (List[Callable], optional) – List of edge construction functions. Default is None.
edge_annotation_funcs (List[Callable], optional) – List of edge annotation functions. Default is None.
node_annotation_funcs (List[Callable], optional) – List of node annotation functions. Default is None.
graph_annotation_funcs (List[Callable]) – List of graph annotation functions. Default is None.
- Returns
Protein Structure Graph
- Return type
nx.Graph
- graphein.protein.graphs.construct_graphs_mp(pdb_code_it: Optional[List[str]] = None, pdb_path_it: Optional[List[str]] = None, chain_selections: Optional[list[str]] = None, config: ProteinGraphConfig = ProteinGraphConfig(granularity='CA', keep_hets=False, insertions=False, pdb_dir=PosixPath('../examples/pdbs'), verbose=False, exclude_waters=True, deprotonate=False, protein_df_processing_functions=None, edge_construction_functions=[<function add_peptide_bonds>], node_metadata_functions=[<function meiler_embedding>], edge_metadata_functions=None, graph_metadata_functions=None, get_contacts_config=None, dssp_config=None), num_cores: int = 16, return_dict: bool = True) Union[List[nx.Graph], Dict[str, nx.Graph]] [source]#
Constructs protein graphs for a list of pdb codes or pdb paths using multiprocessing.
- Parameters
pdb_code_it (Optional[List[str]], defaults to None) – List of pdb codes to use for protein graph construction
pdb_path_it (Optional[List[str]], defaults to None) – List of paths to PDB files to use for protein graph construction
chain_selections (Optional[List[str]], defaults to None) – List of chains to select from the protein structures (e.g. [“ABC”, “A”, “L”, “CD”…])
config (graphein.protein.config.ProteinGraphConfig, defaults to default config params) – ProteinGraphConfig to use.
num_cores (int, defaults to 16) – Number of cores to use for multiprocessing. The more the merrier
return_dict (bool, default to True) – Whether or not to return a dictionary (indexed by pdb codes/paths) or a list of graphs.
- Returns
Iterable of protein graphs. None values indicate there was a problem in constructing the graph for this particular pdb
- Return type
Union[List[nx.Graph], Dict[str, nx.Graph]]
- graphein.protein.graphs.convert_structure_to_centroids(df: pandas.core.frame.DataFrame) pandas.core.frame.DataFrame [source]#
Overwrite existing (x, y, z) coordinates with centroids of the amino acids.
- Parameters
df (pd.DataFrame) – Pandas Dataframe protein structure to convert into a dataframe of centroid positions.
- Returns
pd.DataFrame with atoms/residues positions converted into centroid positions.
- Return type
pd.DataFrame
- graphein.protein.graphs.deprotonate_structure(df: pandas.core.frame.DataFrame) pandas.core.frame.DataFrame [source]#
Remove protons from PDB dataframe.
- Parameters
df (pd.DataFrame) – Atomic dataframe.
- Returns
Atomic dataframe with all atom_name == "H" removed.
- Return type
pd.DataFrame
- graphein.protein.graphs.filter_hetatms(df: pandas.core.frame.DataFrame, keep_hets: List[str]) List[pandas.core.frame.DataFrame] [source]#
Return hetatms of interest.
- Parameters
df (pd.DataFrame) – Protein Structure dataframe to filter hetatoms from.
keep_hets (List[str]) – List of hetero atom names to keep.
- Returns
Protein structure dataframe with heteroatoms removed
- Return type
List[pd.DataFrame]
- graphein.protein.graphs.initialise_graph_with_metadata(protein_df: pandas.core.frame.DataFrame, raw_pdb_df: pandas.core.frame.DataFrame, pdb_id: str, granularity: str) networkx.classes.graph.Graph [source]#
Initializes the nx Graph object with initial metadata.
- Parameters
protein_df (pd.DataFrame) – Processed Dataframe of protein structure.
raw_pdb_df (pd.DataFrame) – Unprocessed dataframe of protein structure for comparison and traceability downstream.
pdb_id (str) – PDB Accession code.
granularity (str) – Granularity of the graph (e.g. "atom", "CA", "CB" etc. or "centroids"). See: GRAPH_ATOMS and GRANULARITY_OPTS.
- Returns
Returns initial protein structure graph with metadata.
- Return type
nx.Graph
- graphein.protein.graphs.number_groups_of_runs(list_of_values: List[Any]) List[str] [source]#
Numbers groups of runs in a list of values.
E.g.
["A", "A", "B", "A", "A", "A", "B", "B"] -> ["A1", "A1", "B1", "A2", "A2", "A2", "B2", "B2"]
- Parameters
list_of_values (List[Any]) – List of values to number.
- Returns
List of numbered values.
- Return type
List[str]
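The run numbering shown in the example above can be reproduced with `itertools.groupby`. This is a sketch of the behaviour described in the docstring, not necessarily how Graphein implements it; the function name `number_runs` is hypothetical.

```python
from collections import defaultdict
from itertools import groupby
from typing import Any, Dict, List


def number_runs(values: List[Any]) -> List[str]:
    """Label each consecutive run of equal values with a per-value counter."""
    runs_seen: Dict[Any, int] = defaultdict(int)
    numbered: List[str] = []
    # groupby yields one group per run of consecutive equal values.
    for value, run in groupby(values):
        runs_seen[value] += 1
        numbered.extend(f"{value}{runs_seen[value]}" for _ in run)
    return numbered


labels = number_runs(["A", "A", "B", "A", "A", "A", "B", "B"])
```

Every element in a run shares one label, and the counter only advances when a value reappears after an interruption.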
- graphein.protein.graphs.process_dataframe(protein_df: pandas.core.frame.DataFrame, atom_df_processing_funcs: Optional[List[Callable]] = None, hetatom_df_processing_funcs: Optional[List[Callable]] = None, granularity: str = 'centroids', chain_selection: str = 'all', insertions: bool = False, deprotonate: bool = True, keep_hets: List[str] = [], verbose: bool = False) pandas.core.frame.DataFrame [source]#
Process ATOM and HETATM dataframes to produce singular dataframe used for graph construction.
- Parameters
protein_df (pd.DataFrame) – Dataframe to process. Should be the object returned from read_pdb_to_dataframe().
atom_df_processing_funcs (List[Callable], optional) – List of functions to process dataframe. These must take in a dataframe and return a dataframe. Defaults to None.
hetatom_df_processing_funcs (List[Callable], optional) – List of functions to process the hetatom dataframe. These must take in a dataframe and return a dataframe.
granularity (str) – The level of granularity for the graph. This determines the node definition. Acceptable values include: "centroids", "atom", any of the atom_names in the PDB file (e.g. "CA", "CB", "OG", etc.). See: GRAPH_ATOMS and GRANULARITY_OPTS.
insertions (bool) – Whether or not to keep insertions.
deprotonate (bool) – Whether or not to remove hydrogen atoms (i.e. deprotonation).
keep_hets (List[str]) – Hetatoms to keep. Defaults to an empty list. To keep a hetatom, pass it inside a list of hetatom names to keep.
verbose (bool) – Verbosity level.
chain_selection (str) – Which protein chain to select. Defaults to "all". E.g. "ACF" can be used to select 3 chains (A, C & F).
- Returns
A protein dataframe that can be consumed by other graph construction functions.
- Return type
pd.DataFrame
- graphein.protein.graphs.read_pdb_to_dataframe(pdb_path: Optional[str] = None, pdb_code: Optional[str] = None, verbose: bool = False, granularity: str = 'CA') pandas.core.frame.DataFrame [source]#
Reads PDB file to PandasPDB object.
Returns atomic_df, which is a dataframe enumerating all atoms and their cartesian coordinates in 3D space. Also contains associated metadata from the PDB file.
- Parameters
pdb_path (str, optional) – Path to PDB file. Defaults to None.
pdb_code (str, optional) – 4-character PDB accession. Defaults to None.
verbose (bool) – Whether to print the dataframe.
granularity (str) – Specifies granularity of dataframe. See ProteinGraphConfig for further details.
- Returns
pd.DataFrame containing protein structure
- Return type
pd.DataFrame
- graphein.protein.graphs.remove_insertions(df: pandas.core.frame.DataFrame, keep: str = 'first') pandas.core.frame.DataFrame [source]#
This function removes insertions from PDB dataframes.
- Parameters
df (pd.DataFrame) – Protein Structure dataframe to remove insertions from.
keep (str) – Specifies which insertion to keep. Options are "first" or "last". Default is "first".
- Returns
Protein structure dataframe with insertions removed
- Return type
pd.DataFrame
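The effect of the keep option can be sketched on plain tuples. This assumes that insertions appear as several records sharing a chain ID and residue number while differing in insertion code; the record layout and the name `drop_insertions` are hypothetical and do not mirror Graphein's DataFrame columns.

```python
from typing import Dict, List, Tuple

# (chain_id, residue_number, insertion_code, residue_name): an assumed layout.
Record = Tuple[str, int, str, str]


def drop_insertions(records: List[Record], keep: str = "first") -> List[Record]:
    """Keep a single record per (chain_id, residue_number) position."""
    if keep not in ("first", "last"):
        raise ValueError("keep must be 'first' or 'last'")
    chosen: Dict[Tuple[str, int], Record] = {}
    for record in records:
        key = (record[0], record[1])
        # "first": only store a position once; "last": later records overwrite.
        if keep == "last" or key not in chosen:
            chosen[key] = record
    return list(chosen.values())


records = [("A", 52, "", "GLY"), ("A", 52, "A", "SER"), ("A", 53, "", "THR")]
kept_first = drop_insertions(records, keep="first")
kept_last = drop_insertions(records, keep="last")
```

With keep="first" residue 52 stays GLY; with keep="last" it becomes the inserted SER.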
- graphein.protein.graphs.select_chains(protein_df: pandas.core.frame.DataFrame, chain_selection: str, verbose: bool = False) pandas.core.frame.DataFrame [source]#
Extracts relevant chains from protein_df.
- Parameters
- Returns
Protein structure dataframe containing only entries in the chain selection.
- Return type
pd.DataFrame
- graphein.protein.graphs.subset_structure_to_atom_type(df: pandas.core.frame.DataFrame, granularity: str) pandas.core.frame.DataFrame [source]#
Return a subset of atomic dataframe that contains only certain atom names.
- Parameters
df (pd.DataFrame) – Protein Structure dataframe to subset.
- Returns
Subsetted protein structure dataframe.
- Return type
pd.DataFrame
Edges#
Distance#
Functions for computing biochemical edges of graphs.
- graphein.protein.edges.distance.add_aromatic_interactions(G: networkx.classes.graph.Graph, pdb_df: Optional[pandas.core.frame.DataFrame] = None)[source]#
Find all aromatic-aromatic interaction.
Criteria: phenyl ring centroids separated between 4.5A to 7A. Phenyl rings are present on PHE, TRP, HIS, TYR (AROMATIC_RESIS). Phenyl ring atoms on these amino acids are defined by the following atoms:
- PHE: CG, CD, CE, CZ
- TRP: CD, CE, CH, CZ
- HIS: CG, CD, ND, NE, CE
- TYR: CG, CD, CE, CZ
Centroids of these atoms are taken by taking: (mean x), (mean y), (mean z) for each of the ring atoms.
Notes for future self/developers:
- Because of the requirement to pre-compute ring centroids, we do not use the functions written above (filter_dataframe, compute_distmat, get_interacting_atoms), as they do not return centroid atom euclidean coordinates.
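The centroid criterion above can be sketched in plain Python. The helper names are hypothetical, and treating both distance bounds as inclusive is an assumption, since the docstring does not say whether 4.5A and 7A themselves qualify.

```python
import math
from typing import List, Tuple

Coord = Tuple[float, float, float]


def ring_centroid(ring_atoms: List[Coord]) -> Coord:
    """Centroid as (mean x, mean y, mean z) over the ring atoms."""
    n = len(ring_atoms)
    return (
        sum(atom[0] for atom in ring_atoms) / n,
        sum(atom[1] for atom in ring_atoms) / n,
        sum(atom[2] for atom in ring_atoms) / n,
    )


def is_aromatic_contact(
    ring_a: List[Coord], ring_b: List[Coord], lower: float = 4.5, upper: float = 7.0
) -> bool:
    """True if the two ring centroids are between lower and upper angstroms apart."""
    distance = math.dist(ring_centroid(ring_a), ring_centroid(ring_b))
    return lower <= distance <= upper  # inclusive bounds are an assumption


# Toy rings: a flat square at z = 0 and copies shifted along z.
ring_1 = [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, -1.0, 0.0)]
ring_2 = [(x, y, z + 5.0) for x, y, z in ring_1]
ring_3 = [(x, y, z + 10.0) for x, y, z in ring_1]
```

Here `ring_1` and `ring_2` have centroids 5A apart and qualify, while `ring_1` and `ring_3` are 10A apart and do not.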
- graphein.protein.edges.distance.add_aromatic_sulphur_interactions(G: networkx.classes.graph.Graph, rgroup_df: Optional[pandas.core.frame.DataFrame] = None)[source]#
Find all aromatic-sulphur interactions.
- graphein.protein.edges.distance.add_cation_pi_interactions(G: networkx.classes.graph.Graph, rgroup_df: Optional[pandas.core.frame.DataFrame] = None)[source]#
Add cation-pi interactions.
- graphein.protein.edges.distance.add_delaunay_triangulation(G: networkx.classes.graph.Graph, allowable_nodes: Optional[List[str]] = None)[source]#
Compute the Delaunay triangulation of the protein structure.
This has been used in prior work. References:
Harrison, R. W., Yu, X. & Weber, I. T. Using triangulation to include target structure improves drug resistance prediction accuracy. in 1–1 (IEEE, 2013). doi:10.1109/ICCABS.2013.6629236
Yu, X., Weber, I. T. & Harrison, R. W. Prediction of HIV drug resistance from genotype with encoded three-dimensional protein structure. BMC Genomics 15 Suppl 5, S1 (2014).
Notes: 1. We do not use the add_interacting_resis function, because this interaction is computed on the CA atoms. Therefore, there is code duplication. For now, I have chosen to leave this code duplication in.
- Parameters
G (nx.Graph) – The networkx graph to add the triangulation to.
allowable_nodes (List[str], optional) – The nodes to include in the triangulation. If None (default), no filtering is done. This parameter is used to filter out nodes that are not desired in the triangulation, e.g. if you wanted to construct a Delaunay triangulation of the CA atoms of an atomic graph.
- graphein.protein.edges.distance.add_distance_threshold(G: networkx.classes.graph.Graph, long_interaction_threshold: int, threshold: float = 5.0)[source]#
Adds edges to any nodes within a given distance of each other. Long interaction threshold is used to specify minimum separation in sequence to add an edge between networkx nodes within the distance threshold
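How the two thresholds interact can be sketched as follows. This is an illustration only: the function name, the use of residue numbers as node keys, and the inclusive comparisons (`>=` for sequence separation, `<=` for distance) are assumptions rather than Graphein's exact behaviour.

```python
import math
from typing import Dict, List, Tuple

Coord = Tuple[float, float, float]


def distance_threshold_edges(
    coords: Dict[int, Coord], long_interaction_threshold: int, threshold: float = 5.0
) -> List[Tuple[int, int]]:
    """Pair residues that are close in space but far enough apart in sequence."""
    residues = sorted(coords)
    edges: List[Tuple[int, int]] = []
    for i, a in enumerate(residues):
        for b in residues[i + 1:]:
            far_in_sequence = (b - a) >= long_interaction_threshold
            close_in_space = math.dist(coords[a], coords[b]) <= threshold
            if far_in_sequence and close_in_space:
                edges.append((a, b))
    return edges


# Residue number -> position of the node (e.g. its CA atom).
coords = {1: (0.0, 0.0, 0.0), 2: (3.0, 0.0, 0.0), 5: (0.0, 3.0, 0.0), 9: (30.0, 0.0, 0.0)}
edges = distance_threshold_edges(coords, long_interaction_threshold=3, threshold=5.0)
```

Residues 1 and 2 are only 3A apart but adjacent in sequence, so they are skipped; residue 9 is far from everything in space.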
- graphein.protein.edges.distance.add_disulfide_interactions(G: networkx.classes.graph.Graph, rgroup_df: Optional[pandas.core.frame.DataFrame] = None)[source]#
Find all disulfide interactions between CYS residues (DISULFIDE_RESIS, DISULFIDE_ATOMS).
Criteria: sulfur atom pairs are within 2.2A of each other.
- Parameters
G (nx.Graph) – networkx protein graph
rgroup_df (pd.DataFrame, optional) – pd.DataFrame containing rgroup data, defaults to None, which retrieves the df from the provided nx graph.
- graphein.protein.edges.distance.add_hydrogen_bond_interactions(G: networkx.classes.graph.Graph, rgroup_df: Optional[pandas.core.frame.DataFrame] = None)[source]#
Add all hydrogen-bond interactions.
- graphein.protein.edges.distance.add_hydrophobic_interactions(G: networkx.classes.graph.Graph, rgroup_df: Optional[pandas.core.frame.DataFrame] = None)[source]#
Find all hydrophobic interactions.
Performs searches between the following residues: [ALA, VAL, LEU, ILE, MET, PHE, TRP, PRO, TYR] (HYDROPHOBIC_RESIS).
Criteria: R-group residues are within 5A distance.
- Parameters
G (nx.Graph) – nx.Graph to add hydrophobic interactions to.
rgroup_df (pd.DataFrame, optional) – Optional dataframe of R-group atoms.
- graphein.protein.edges.distance.add_interacting_resis(G: networkx.classes.graph.Graph, interacting_atoms: numpy.ndarray, dataframe: pandas.core.frame.DataFrame, kind: List[str])[source]#
Add interacting residues to graph.
Returns a list of 2-tuples indicating the interacting residues based on the interacting atoms. This is most typically called after the get_interacting_atoms function above.
Also filters out the list such that the residues have to be at least two apart.
- Parameters
interacting_atoms (numpy array) – Result from get_interacting_atoms function.
dataframe (pandas dataframe) – A pandas dataframe that houses the euclidean locations of each atom.
kind (list) – The kind of interaction. Contains one of: hydrophobic, disulfide, hbond, ionic, aromatic, aromatic_sulphur, cation_pi, delaunay.
- Returns
filtered_interacting_resis (set of tuples) – The residues that are in an interaction, with the interaction kind specified.
- graphein.protein.edges.distance.add_ionic_interactions(G: networkx.classes.graph.Graph, rgroup_df: Optional[pandas.core.frame.DataFrame] = None)[source]#
Find all ionic interactions.
Criteria: [ARG, LYS, HIS, ASP, and GLU] (IONIC_RESIS) residues are within 6A. We also check for opposing charges (POS_AA, NEG_AA).
- graphein.protein.edges.distance.add_k_nn_edges(G: networkx.classes.graph.Graph, long_interaction_threshold: int, k: int = 5, mode: str = 'connectivity', metric: str = 'minkowski', p: int = 2, include_self: Union[bool, str] = False)[source]#
Adds edges to nodes based on K nearest neighbours. Long interaction threshold is used to specify minimum separation in sequence to add an edge between networkx nodes within the distance threshold
- Parameters
G (nx.Graph) – Protein Structure graph to add distance edges to
long_interaction_threshold (int) – minimum distance in sequence for two nodes to be connected
k (int) – Number of neighbors for each sample.
mode (str) – Type of returned matrix: "connectivity" will return the connectivity matrix with ones and zeros, and "distance" will return the distances between neighbors according to the given metric.
metric (str) – The distance metric used to calculate the k-Neighbors for each sample point. The DistanceMetric class gives a list of available metrics. The default distance is "euclidean" ("minkowski" metric with the p param equal to 2).
p (int) – Power parameter for the Minkowski metric. When p = 1, this is equivalent to using manhattan_distance (l1), and euclidean_distance (l2) for p = 2. For arbitrary p, minkowski_distance (l_p) is used. Default is 2 (euclidean).
include_self (Union[bool, str]) – Whether or not to mark each sample as the first nearest neighbor to itself. If "auto", then True is used for mode="connectivity" and False for mode="distance". Default is False.
- Returns
Graph with knn-based edges added
- Return type
nx.Graph
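The roles of k and p can be illustrated with a small brute-force sketch. It is not how Graphein computes neighbours: it ignores long_interaction_threshold, behaves like include_self=False, and the helper names are hypothetical.

```python
from typing import Dict, List, Tuple

Coord = Tuple[float, float, float]


def minkowski(a: Coord, b: Coord, p: int = 2) -> float:
    """Minkowski distance: p=1 is Manhattan (l1), p=2 is Euclidean (l2)."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


def knn_edges(coords: Dict[str, Coord], k: int, p: int = 2) -> List[Tuple[str, str]]:
    """Undirected edges linking every node to its k nearest other nodes."""
    edges = set()
    for node, position in coords.items():
        # Exclude the node itself, mirroring include_self=False.
        others = [n for n in coords if n != node]
        others.sort(key=lambda n: minkowski(position, coords[n], p))
        for neighbour in others[:k]:
            edges.add(tuple(sorted((node, neighbour))))
    return sorted(edges)


coords = {"a": (0.0, 0.0, 0.0), "b": (1.0, 0.0, 0.0), "c": (3.0, 0.0, 0.0), "d": (7.0, 0.0, 0.0)}
edges = knn_edges(coords, k=1)
```

Because nearest-neighbour relations are not symmetric, a node can end up with more than k edges once the directed choices are merged into an undirected edge set.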
- graphein.protein.edges.distance.add_peptide_bonds(G: networkx.classes.graph.Graph) networkx.classes.graph.Graph [source]#
Adds peptide backbone as edges to residues in each chain.
- Parameters
G (nx.Graph) – networkx protein graph.
- Return G
networkx protein graph with added peptide bonds.
- Return type
nx.Graph
- graphein.protein.edges.distance.add_sequence_distance_edges(G: networkx.classes.graph.Graph, d: int, name: str = 'sequence_edge') networkx.classes.graph.Graph [source]#
Adds edges based on sequence distance to residues in each chain.
E.g. if d=6 then we join nodes (1,7), (2,8), (3,9).. based on their sequence number.
- Parameters
G (nx.Graph) – networkx protein graph.
d (int) – Sequence separation to add edges on.
name (str) – Name of the edge type. Defaults to "sequence_edge".
- Return G
networkx protein graph with added sequence distance edges.
- Return type
nx.Graph
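The pairing rule in the d=6 example can be written out in a few lines. This sketch works on the residue numbers of a single chain; the function name is hypothetical and the real function operates on graph nodes rather than bare integers.

```python
from typing import Iterable, List, Tuple


def sequence_distance_pairs(residue_numbers: Iterable[int], d: int) -> List[Tuple[int, int]]:
    """Pair each residue i with residue i + d when both exist in the chain."""
    present = set(residue_numbers)
    return [(i, i + d) for i in sorted(present) if i + d in present]


pairs = sequence_distance_pairs(range(1, 10), d=6)
```

Applying the same rule separately per chain avoids joining residues from different chains that happen to share compatible numbering.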
- graphein.protein.edges.distance.compute_distmat(pdb_df: pandas.core.frame.DataFrame) pandas.core.frame.DataFrame [source]#
Compute pairwise euclidean distances between every atom.
Design choice: passed in a DataFrame to enable easier testing on dummy data.
- Parameters
pdb_df (pd.DataFrame) – pd.Dataframe containing protein structure. Must contain columns ["x_coord", "y_coord", "z_coord"]
- Returns
pd.Dataframe of euclidean distance matrix
- Return type
pd.DataFrame
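For intuition, the pairwise Euclidean distance matrix can be computed on a list of (x, y, z) tuples as below. This is a stdlib-only sketch with a hypothetical function name; the documented function takes and returns DataFrames.

```python
import math
from typing import List, Tuple

Coord = Tuple[float, float, float]


def pairwise_distances(coords: List[Coord]) -> List[List[float]]:
    """Square matrix whose entry [i][j] is the distance between atoms i and j."""
    return [[math.dist(a, b) for b in coords] for a in coords]


# Three atoms given as (x_coord, y_coord, z_coord).
atoms = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (0.0, 0.0, 12.0)]
distances = pairwise_distances(atoms)
```

The result is symmetric with a zero diagonal, which is why threshold-based edge functions only need to inspect one triangle of the matrix.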
- graphein.protein.edges.distance.get_edges_by_bond_type(G: networkx.classes.graph.Graph, bond_type: str) List[Tuple[str, str]] [source]#
Return edges of a particular bond type.
- Parameters
bond_type (str) – One of the elements in the variable BOND_TYPES.
- Returns
resis (list) – A list of tuples, where each tuple is an edge.
- graphein.protein.edges.distance.get_interacting_atoms(angstroms: float, distmat: pandas.core.frame.DataFrame)[source]#
Find the atoms that are within a particular radius of one another.
- graphein.protein.edges.distance.get_ring_atoms(dataframe: pandas.core.frame.DataFrame, aa: str) pandas.core.frame.DataFrame [source]#
Return ring atoms from a dataframe.
A helper function for add_aromatic_interactions.
Gets the ring atoms from the particular aromatic amino acid.
- Parameters
dataframe – The dataframe containing the atom records.
aa – The amino acid of interest, passed in as 3-letter string.
- Returns
dataframe – A filtered dataframe containing just those atoms from the particular amino acid selected, e.g. equivalent to selecting just the ring atoms from a particular amino acid.
- graphein.protein.edges.distance.get_ring_centroids(ring_atom_df: pandas.core.frame.DataFrame) pandas.core.frame.DataFrame [source]#
Return aromatic ring centroids.
A helper function for add_aromatic_interactions.
Computes the ring centroids for each of a particular amino acid's ring atoms.
Ring centroids are computed by taking the mean of the x, y, and z coordinates.
- Parameters
ring_atom_df – A dataframe computed using get_ring_atoms.
aa – The amino acid under study.
- Returns
centroid_df – A dataframe containing just the centroid coordinates of the ring atoms of each residue.
Intramolecular#
Featurization functions for graph edges.
- graphein.protein.edges.intramolecular.add_contacts_edge(G: networkx.classes.graph.Graph, interaction_type: str) networkx.classes.graph.Graph [source]#
Adds specific interaction types to the protein graph.
- Parameters
G (nx.Graph) – networkx protein graph
interaction_type (str) – interaction type to be added
- Return G
nx.Graph with specified interaction-based edges added.
- Return type
nx.Graph
- graphein.protein.edges.intramolecular.get_contacts_df(config: GetContactsConfig, pdb_name: str) pd.DataFrame [source]#
Reads GetContacts file and returns it as a pd.DataFrame
- Parameters
config (GetContactsConfig) – GetContactsConfig object
pdb_name (str) – Name of PDB file. Contacts files are named {pdb_name}_contacts.tsv
- Returns
DataFrame of parsed GetContacts output
- Return type
pd.DataFrame
- graphein.protein.edges.intramolecular.hydrogen_bond(G: networkx.classes.graph.Graph) networkx.classes.graph.Graph [source]#
Adds hydrogen bonds to protein structure graph
- Parameters
G (nx.Graph) – nx.Graph to add hydrogen bonds to
- Returns
nx.Graph with hydrogen bonds added
- Return type
nx.Graph
- graphein.protein.edges.intramolecular.hydrophobic(G: networkx.classes.graph.Graph) networkx.classes.graph.Graph [source]#
Adds hydrophobic interactions to protein structure graph
- Parameters
G (nx.Graph) – nx.Graph to add hydrophobic interaction edges to
- Returns
nx.Graph with hydrophobic interactions added
- Return type
nx.Graph
- graphein.protein.edges.intramolecular.peptide_bonds(G: networkx.classes.graph.Graph) networkx.classes.graph.Graph [source]#
Adds peptide backbone to residues in each chain
- Parameters
G (nx.Graph) – nx.Graph protein graph
- Returns
nx.Graph protein graph with added peptide bonds
- Return type
nx.Graph
- graphein.protein.edges.intramolecular.pi_cation(G: networkx.classes.graph.Graph) networkx.classes.graph.Graph [source]#
Adds pi-cation interactions to protein structure graph
- Parameters
G (nx.Graph) – nx.Graph to add pi-cation interactions to
- Returns
nx.Graph with pi-cation interactions added
- Return type
nx.Graph
- graphein.protein.edges.intramolecular.pi_stacking(G: networkx.classes.graph.Graph) networkx.classes.graph.Graph [source]#
Adds pi-stacking interactions to protein structure graph
- Parameters
G (nx.Graph) – nx.Graph to add pi-stacking interactions to
- Returns
nx.Graph with pi-stacking interactions added
- Return type
nx.Graph
- graphein.protein.edges.intramolecular.read_contacts_file(config: GetContactsConfig, contacts_file: str) pd.DataFrame [source]#
Parses GetContacts file to an edgelist (pd.DataFrame)
- Parameters
config (GetContactsConfig) – GetContactsConfig object (graphein.protein.config.GetContactsConfig)
contacts_file (str) – file name of contacts file
- Returns
Pandas Dataframe of edge list
- Return type
pd.DataFrame
- graphein.protein.edges.intramolecular.run_get_contacts(config: GetContactsConfig, pdb_id: Optional[str] = None, file_name: Optional[str] = None)[source]#
Runs GetContacts on a protein structure. If no file_name is provided, a PDB file is downloaded for the pdb_id
- Parameters
config (graphein.protein.config.GetContactsConfig) – GetContactsConfig object containing GetContacts parameters
pdb_id (str, optional) – 4-character PDB accession code
file_name (str, optional) – Name of the PDB file to use, if annotations are to be retrieved from the PDB
- graphein.protein.edges.intramolecular.salt_bridge(G: networkx.classes.graph.Graph) networkx.classes.graph.Graph [source]#
Adds salt bridges to protein structure graph
- Parameters
G (nx.Graph) – nx.Graph to add salt bridges to
- Returns
nx.Graph with salt bridges added
- Return type
nx.Graph
- graphein.protein.edges.intramolecular.t_stacking(G: networkx.classes.graph.Graph) networkx.classes.graph.Graph [source]#
Adds t-stacking interactions to protein structure graph
- Parameters
G (nx.Graph) – nx.Graph to add t-stacking interactions to
- Returns
nx.Graph with t-stacking interactions added
- Return type
nx.Graph
- graphein.protein.edges.intramolecular.van_der_waals(G: networkx.classes.graph.Graph) networkx.classes.graph.Graph [source]#
Adds van der Waals interactions to protein structure graph
- Parameters
G (nx.Graph) – nx.Graph to add van der Waals interactions to
- Returns
nx.Graph with van der Waals interactions added
- Return type
nx.Graph
Atomic#
Functions for computing atomic structure of proteins.
- graphein.protein.edges.atomic.add_atomic_edges(G: networkx.classes.graph.Graph, tolerance: float = 0.56) networkx.classes.graph.Graph [source]#
Computes covalent edges based on atomic distances. Covalent radii are assigned to each atom based on its bond state (see assign_bond_states_to_dataframe). The distance matrix is then thresholded to entries less than the sum of the two atoms' covalent radii plus some tolerance to create an adjacency matrix. This adjacency matrix is then parsed into an edge list and the covalent edges are added.
- Parameters
G (nx.Graph) – Atomic graph (nodes correspond to atoms) to populate with atomic bonds as edges
tolerance (float) – Tolerance for atomic distance. Default is 0.56 Angstroms. Commonly used values are: 0.4, 0.45, 0.56
- Returns
Atomic graph with edges between bonded atoms added
- Return type
nx.Graph
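The thresholding step can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the library's code: the threshold is taken to be the sum of the two atoms' covalent radii plus the tolerance, the radii and coordinates are placeholders, and `covalent_edges` is a hypothetical name.

```python
from itertools import combinations
from math import dist

# Hypothetical atoms: (node id, (x, y, z) in Angstroms, covalent radius in Angstroms).
# The radii are illustrative placeholders, not the values Graphein uses.
atoms = [
    ("N", (0.0, 0.0, 0.0), 0.70),
    ("CA", (1.46, 0.0, 0.0), 0.77),
    ("C", (2.0, 1.4, 0.0), 0.77),
    ("O", (5.0, 5.0, 5.0), 0.67),
]


def covalent_edges(atoms, tolerance=0.56):
    """Return pairs of atoms closer than (radius_a + radius_b + tolerance)."""
    edges = []
    for (id_a, pos_a, r_a), (id_b, pos_b, r_b) in combinations(atoms, 2):
        if dist(pos_a, pos_b) < r_a + r_b + tolerance:
            edges.append((id_a, id_b))
    return edges


edges = covalent_edges(atoms)
print(edges)
```

A smaller tolerance makes the criterion stricter, which is why several commonly used values are listed for the parameter.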
- graphein.protein.edges.atomic.add_bond_order(G: networkx.classes.graph.Graph) networkx.classes.graph.Graph [source]#
Assign bond orders to the covalent bond edges between atoms on the basis of bond length. Values are taken from:
Automatic Assignment of Chemical Connectivity to Organic Molecules in the Cambridge Structural Database. Jon C. Baber and Edward E. Hodgkin*
- Parameters
G (nx.Graph) – Atomic-level protein graph with covalent edges.
- Returns
Atomic-level protein graph with covalent edges annotated with putative bond order.
- Return type
nx.Graph
- graphein.protein.edges.atomic.add_ring_status(G: networkx.classes.graph.Graph) networkx.classes.graph.Graph [source]#
Identifies rings in the atomic graph. Assigns the edge attribute "RING" to edges in the ring. We do not distinguish between aromatic and non-aromatic rings. Functions by identifying all cycles in the graph.
- Parameters
G (nx.Graph) – Atom-level protein structure graph to add ring edge types to
- Returns
Atom-level protein structure graph with added "RING" edge attribute
- Return type
nx.Graph
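To show what "edges in a ring" means, here is a dependency-free sketch. It deliberately swaps in a different technique from the one named above: instead of enumerating cycles, it marks an edge as a ring edge when its two endpoints stay connected after that edge is ignored. The edge list and the name `edges_in_cycles` are hypothetical.

```python
from collections import deque

# Hypothetical atom-level edges: a six-membered ring (CG ... CD2) plus a CB-CG bond.
edges = [
    ("CB", "CG"),
    ("CG", "CD1"),
    ("CD1", "CE1"),
    ("CE1", "CZ"),
    ("CZ", "CE2"),
    ("CE2", "CD2"),
    ("CD2", "CG"),
]


def edges_in_cycles(edges):
    """Map each edge to True if it lies on at least one cycle."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)

    def connected_without(u, v):
        # Breadth-first search from u to v that never uses the direct u-v edge.
        seen = {u}
        queue = deque([u])
        while queue:
            node = queue.popleft()
            for nbr in adjacency[node]:
                if {node, nbr} == {u, v}:
                    continue
                if nbr == v:
                    return True
                if nbr not in seen:
                    seen.add(nbr)
                    queue.append(nbr)
        return False

    return {(u, v): connected_without(u, v) for u, v in edges}


ring_flags = edges_in_cycles(edges)
print(ring_flags)
```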
- graphein.protein.edges.atomic.assign_bond_states_to_dataframe(df: pandas.core.frame.DataFrame) pandas.core.frame.DataFrame [source]#
Takes a PandasPDB atom dataframe and assigns bond states to each atom based on:
Atomic Structures of all the Twenty Essential Amino Acids and a Tripeptide, with Bond Lengths as Sums of Atomic Covalent Radii Heyrovska, 2008
First, maps atoms to their standard bond states (DEFAULT_BOND_STATE). Second, maps non-standard bond states (RESIDUE_ATOM_BOND_STATE). Fills NaNs with standard bond states.
- Parameters
df (pd.DataFrame) – Pandas PDB dataframe
- Returns
Dataframe with added atom_bond_state column
- Return type
pd.DataFrame
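The two-step lookup described above (defaults first, residue-specific overrides second, defaults again where no override exists) might look like the sketch below. The lookup tables and state labels are invented for illustration and do not reproduce the contents of DEFAULT_BOND_STATE or RESIDUE_ATOM_BOND_STATE. `assign_bond_states` is a hypothetical helper that works on plain tuples rather than a dataframe.

```python
# Hypothetical lookup tables with made-up labels; Graphein's real constants differ.
example_default_state = {"N": "Nsb", "CA": "Csb", "C": "Cdb", "O": "Odb", "CG": "Csb"}
example_residue_state = {"HIS": {"CG": "Cdb"}}


def assign_bond_states(atoms):
    """Assign a bond state to each (residue name, atom name) pair.

    A residue-specific state is used when one exists; otherwise the
    default state for that atom name is kept.
    """
    states = []
    for residue_name, atom_name in atoms:
        default = example_default_state.get(atom_name)
        override = example_residue_state.get(residue_name, {}).get(atom_name)
        states.append(override if override is not None else default)
    return states


states = assign_bond_states([("HIS", "CA"), ("HIS", "CG"), ("LEU", "CG")])
print(states)
```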
- graphein.protein.edges.atomic.assign_covalent_radii_to_dataframe(df: pandas.core.frame.DataFrame) pandas.core.frame.DataFrame [source]#
Assigns covalent radius (COVALENT_RADII) to each atom based on its bond state. Adds a covalent_radius column. Using values from:
Atomic Structures of all the Twenty Essential Amino Acids and a Tripeptide, with Bond Lengths as Sums of Atomic Covalent Radii Heyrovska, 2008
- Parameters
df (pd.DataFrame) – Pandas PDB dataframe with a bond_states_column
- Returns
Pandas PDB dataframe with added covalent_radius column
- Return type
pd.DataFrame
- graphein.protein.edges.atomic.identify_bond_type_from_mapping(G: networkx.classes.graph.Graph, u: str, v: str, a: Dict[str, Any], query: str)[source]#
Compares the bond length between two atoms in the graph with the relevant experimental value by performing a lookup against the watershed values in:
Automatic Assignment of Chemical Connectivity to Organic Molecules in the Cambridge Structural Database. Jon C. Baber and Edward E. Hodgkin*
Bond orders are assigned in the order triple < double < single (e.g. if a bond is shorter than the triple bond watershed (w_dt) then it is assigned as a triple bond. Similarly, if a bond is longer than this but shorter than the double bond watershed (w_sd), it is assigned double bond status).
- Parameters
- Returns
Graph with atomic edge bond order assigned
- Return type
nx.Graph
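The watershed comparison can be sketched as a simple cascade of thresholds. The numeric watershed values used below are placeholders chosen for illustration and are not taken from the cited paper. `bond_order_from_length` is a hypothetical name.

```python
def bond_order_from_length(length, w_dt, w_sd):
    """Classify a bond by its length (Angstroms).

    w_dt: watershed below which the bond is treated as triple.
    w_sd: watershed below which the bond is treated as double.
    Anything at or above w_sd is treated as single.
    """
    if length < w_dt:
        return "triple"
    if length < w_sd:
        return "double"
    return "single"


# Placeholder watersheds for one atom pair type (illustrative values only).
example_w_dt, example_w_sd = 1.27, 1.42
orders = [
    bond_order_from_length(length, example_w_dt, example_w_sd)
    for length in (1.20, 1.34, 1.53)
]
print(orders)
```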
Features#
Node#
- graphein.protein.features.nodes.aaindex.aaindex1(G: networkx.classes.graph.Graph, accession: str) networkx.classes.graph.Graph [source]#
Adds AAIndex1 data values for a given accession as node features.
- Parameters
G (nx.Graph) – nx.Graph protein structure graph to featurise
accession (str) – AAIndex1 accession code for values to use
- Returns
Protein Structure graph with AAindex1 node features added
- Return type
nx.Graph
- graphein.protein.features.nodes.aaindex.fetch_AAIndex(accession: str) Tuple[str, Dict[str, float]] [source]#
Fetches AAindex1 dictionary from an accession code. The dictionary maps one-letter AA codes to float values
Featurization functions for amino acids.
- graphein.protein.features.nodes.amino_acid.amino_acid_one_hot(n, d: Dict[str, Any], return_array: bool = True, allowable_set: Optional[List[str]] = None) Union[pandas.core.series.Series, numpy.ndarray] [source]#
Adds a one-hot encoding of amino acid types as a node attribute.
- Parameters
n (str) – node name, this is unused and only included for compatibility with the other functions
d (Dict[str, Any]) – Node data
return_array (bool) – If True, returns a numpy array of one-hot encoding, otherwise returns a pd.Series. Default is True.
allowable_set – Specifies vocabulary of amino acids. Default is None (which uses graphein.protein.resi_atoms.STANDARD_AMINO_ACIDS).
- Returns
One-hot encoding of amino acid types
- Return type
Union[pd.Series, np.ndarray]
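A one-hot encoding over an allowable set can be sketched without any dependencies. The four-residue vocabulary and the "residue_name" key in the node data are assumptions made for this example. `one_hot` is a hypothetical helper that returns a plain list rather than an array or Series.

```python
def one_hot(residue_name, allowable_set):
    """Return a list with a 1 at the position of residue_name and 0 elsewhere."""
    return [1 if residue_name == aa else 0 for aa in allowable_set]


# Assumed node data layout; the key name is an assumption for this sketch.
node_data = {"residue_name": "GLY"}
vocabulary = ["ALA", "CYS", "GLY", "SER"]  # deliberately small vocabulary

encoding = one_hot(node_data["residue_name"], vocabulary)
print(encoding)
```

A residue outside the vocabulary yields an all-zero vector in this sketch; how the library treats that case is not stated here.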
- graphein.protein.features.nodes.amino_acid.expasy_protein_scale(n, d, selection: Optional[List[str]] = None, add_separate: bool = False, return_array: bool = False) Union[pandas.core.series.Series, numpy.ndarray] [source]#
Return amino acid features that come from the EXPASY protein scale.
Source: https://web.expasy.org/protscale/
- Parameters
n – Node in a NetworkX graph
d – NetworkX node attributes.
selection (List[str], optional) – List of columns to select. Viewable in graphein.protein.features.nodes.meiler_embeddings
add_separate – Whether or not to add the expasy features as individual entries or as a series.
return_array (bool) – Bool indicating whether or not to return a np.ndarray of the features. Default is pd.Series
- Returns
pd.Series of amino acid features
- Return type
pd.Series
- graphein.protein.features.nodes.amino_acid.hydrogen_bond_acceptor(n, d, sum_features: bool = True, return_array: bool = False) pandas.core.series.Series [source]#
Adds Hydrogen Bond Acceptor status to nodes as a feature.
- Parameters
n (str) – node id
d (Dict[str, Any]) – Dict of node attributes
sum_features (bool) – If True, the feature is the number of hydrogen bond acceptors per node. If False, the feature is a boolean indicating whether or not the node has a hydrogen bond acceptor. Default is True.
return_array (bool) – If True, returns a np.ndarray, otherwise returns a pd.Series. Default is False.
- graphein.protein.features.nodes.amino_acid.hydrogen_bond_donor(n: str, d: Dict[str, Any], sum_features: bool = True, return_array: bool = False) pandas.core.series.Series [source]#
Adds Hydrogen Bond Donor status to nodes as a feature.
- Parameters
n (str) – node id
d (Dict[str, Any]) – Dict of node attributes
sum_features (bool) – If True, the feature is the number of hydrogen bond donors per node. If False, the feature is a boolean indicating whether or not the node has a hydrogen bond donor. Default is True.
return_array (bool) – If True, returns a np.ndarray, otherwise returns a pd.Series. Default is False.
- graphein.protein.features.nodes.amino_acid.load_expasy_scales() pandas.core.frame.DataFrame [source]#
Load pre-downloaded EXPASY scales.
This helps with node featurization.
The function is LRU-cached in memory for fast access on each function call.
- Returns
pd.DataFrame containing expasy scales
- Return type
pd.DataFrame
- graphein.protein.features.nodes.amino_acid.load_meiler_embeddings() pandas.core.frame.DataFrame [source]#
Load pre-downloaded Meiler embeddings.
This helps with node featurization.
The function is LRU-cached in memory for fast access on each function call.
- Returns
pd.DataFrame containing Meiler Embeddings from Meiler et al. 2001
- Return type
pd.DataFrame
- graphein.protein.features.nodes.amino_acid.meiler_embedding(n, d, return_array: bool = False) Union[pandas.core.series.Series, numpy.array] [source]#
Return amino acid features from reduced dimensional embeddings of amino acid physicochemical properties.
Source: https://link.springer.com/article/10.1007/s008940100038 doi: https://doi.org/10.1007/s008940100038
- Parameters
n – Node in a NetworkX graph
d – NetworkX node attributes.
- Returns
pd.Series of amino acid features
- Return type
pd.Series
Featurization functions for graph nodes using DSSP predicted features.
- graphein.protein.features.nodes.dssp.add_dssp_df(G: nx.Graph, dssp_config: Optional[DSSPConfig]) nx.Graph [source]#
Construct DSSP dataframe and add as graph level variable to protein graph
- Parameters
G (nx.Graph) – Input protein graph
dssp_config (DSSPConfig, optional) – DSSPConfig object. Specifies which executable to run. Located in graphein.protein.config
- Returns
Protein graph with DSSP dataframe added
- Return type
nx.Graph
- graphein.protein.features.nodes.dssp.add_dssp_feature(G: networkx.classes.graph.Graph, feature: str) networkx.classes.graph.Graph [source]#
Adds the specified amino acid feature as calculated by DSSP to every node in a protein graph.
- Parameters
G (nx.Graph) – Protein structure graph to add the DSSP feature to
feature (str) – Name of the DSSP feature to add: "chain", "resnum", "icode", "aa", "ss", "asa", "phi", "psi", "dssp_index", "NH_O_1_relidx", "NH_O_1_energy", "O_NH_1_relidx", "O_NH_1_energy", "NH_O_2_relidx", "NH_O_2_energy", "O_NH_2_relidx", "O_NH_2_energy". These names are accessible in the DSSP_COLS list.
- Returns
Protein structure graph with DSSP feature added to nodes
- Return type
nx.Graph
- graphein.protein.features.nodes.dssp.asa(G: networkx.classes.graph.Graph) networkx.classes.graph.Graph [source]#
Adds ASA of each residue in protein graph as calculated by DSSP.
- Parameters
G (nx.Graph) – Input protein graph
- Returns
Protein graph with asa values added
- Return type
nx.Graph
- graphein.protein.features.nodes.dssp.parse_dssp_df(dssp: Dict[str, Any]) pandas.core.frame.DataFrame [source]#
Parse DSSP output to DataFrame
- Parameters
dssp (Dict[str, Any]) – Dictionary containing DSSP output
- Returns
pd.DataFrame containing parsed DSSP output
- Return type
pd.DataFrame
- graphein.protein.features.nodes.dssp.phi(G: networkx.classes.graph.Graph) networkx.classes.graph.Graph [source]#
Adds phi-angles of each residue in protein graph as calculated by DSSP.
- Parameters
G (nx.Graph) – Input protein graph
- Returns
Protein graph with phi-angle values added
- Return type
nx.Graph
- graphein.protein.features.nodes.dssp.process_dssp_df(df: pandas.core.frame.DataFrame) pandas.core.frame.DataFrame [source]#
Processes a DSSP DataFrame to make indexes align with node IDs
- Parameters
df (pd.DataFrame) – pd.DataFrame containing the parsed output from DSSP.
- Returns
pd.DataFrame with node IDs
- Return type
pd.DataFrame
- graphein.protein.features.nodes.dssp.psi(G: networkx.classes.graph.Graph) networkx.classes.graph.Graph [source]#
Adds psi-angles of each residue in protein graph as calculated by DSSP.
- Parameters
G (nx.Graph) – Input protein graph
- Returns
Protein graph with psi-angle values added
- Return type
nx.Graph
- graphein.protein.features.nodes.dssp.rsa(G: networkx.classes.graph.Graph) networkx.classes.graph.Graph [source]#
Adds RSA (relative solvent accessibility) of each residue in protein graph as calculated by DSSP.
- Parameters
G (nx.Graph) – Input protein graph
- Returns
Protein graph with rsa values added
- Return type
nx.Graph
- graphein.protein.features.nodes.dssp.secondary_structure(G: networkx.classes.graph.Graph) networkx.classes.graph.Graph [source]#
Adds secondary structure of each residue in protein graph as calculated by DSSP in the form of a string
- Parameters
G (nx.Graph) – Input protein graph
- Returns
Protein graph with secondary structure added
- Return type
nx.Graph
Provides geometry-based featurisation functions.
- graphein.protein.features.nodes.geometry.add_beta_carbon_vector(g: networkx.classes.graph.Graph, scale: bool = True, reverse: bool = False)[source]#
Adds vector from node (typically alpha carbon) to position of beta carbon.
Glycine does not have a beta carbon, so we set it to np.array([0, 0, 0]). We extract the position of the beta carbon from the unprocessed atomic PDB dataframe. For this we use the raw_pdb_df dataframe. If scale, we scale the vector to the unit vector. If reverse is True, we reverse the vector (C beta - node). If reverse is false (default) we compute (node - C beta).
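The sign and scaling conventions above can be made concrete with a small sketch. It works on plain coordinate tuples instead of a graph, passes `None` for a missing beta carbon (the glycine case), and uses the hypothetical name `direction_vector`. None of this mirrors Graphein's internals.

```python
from math import sqrt


def direction_vector(node_pos, target_pos, scale=True, reverse=False):
    """Vector between a node and a target position.

    Default: node - target. With reverse=True: target - node.
    A missing target (e.g. no beta carbon in glycine) gives the zero vector.
    """
    if target_pos is None:
        return (0.0, 0.0, 0.0)
    vec = tuple(n - t for n, t in zip(node_pos, target_pos))
    if reverse:
        vec = tuple(-c for c in vec)
    if scale:
        norm = sqrt(sum(c * c for c in vec))
        if norm > 0:
            vec = tuple(c / norm for c in vec)
    return vec


ca = (0.0, 0.0, 0.0)
cb = (3.0, 4.0, 0.0)  # illustrative coordinates
unit = direction_vector(ca, cb)
raw_reversed = direction_vector(ca, cb, scale=False, reverse=True)
glycine = direction_vector(ca, None)
print(unit, raw_reversed, glycine)
```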
- graphein.protein.features.nodes.geometry.add_sequence_neighbour_vector(g: networkx.classes.graph.Graph, scale: bool = True, reverse: bool = False, n_to_c: bool = True)[source]#
Computes vector from node to adjacent node in sequence. Typically used with CA (alpha carbon) graphs.
If n_to_c is True (default), we compute the vectors from the N terminus to the C terminus (canonical direction). If reverse is False (default), we compute Node_i - Node_{i+1}. If reverse is True, we compute Node_{i+1} - Node_i.
- Parameters
g (nx.Graph) – Graph to add vector to.
scale (bool) – Scale vector to unit vector. Defaults to True.
reverse (bool) – Reverse vector. Defaults to False.
n_to_c (bool) – Compute vector from N to C or C to N. Defaults to True.
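As a rough sketch of the Node_i - Node_{i+1} convention, the example below walks an ordered list of positions in the N-to-C direction only. It does not model the n_to_c=False case and returns vectors only for residues that have a following neighbour. How the library handles chain ends is not covered here. `sequence_neighbour_vectors` is a hypothetical name.

```python
from math import sqrt


def sequence_neighbour_vectors(positions, scale=True, reverse=False):
    """Vectors between consecutive positions, ordered from N to C terminus.

    Default: Node_i - Node_{i+1}. With reverse=True: Node_{i+1} - Node_i.
    """
    vectors = []
    for current, following in zip(positions, positions[1:]):
        vec = [c - f for c, f in zip(current, following)]
        if reverse:
            vec = [-x for x in vec]
        if scale:
            norm = sqrt(sum(x * x for x in vec))
            if norm > 0:
                vec = [x / norm for x in vec]
        vectors.append(vec)
    return vectors


# Illustrative alpha-carbon positions for three consecutive residues.
ca_positions = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0)]
vectors = sequence_neighbour_vectors(ca_positions)
print(vectors)
```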
- graphein.protein.features.nodes.geometry.add_sidechain_vector(g: networkx.classes.graph.Graph, scale: bool = True, reverse: bool = False)[source]#
Adds vector from node to average position of sidechain atoms.
We compute the mean of the sidechain atoms for each node. For this we use the rgroup_df dataframe. If the graph does not contain the rgroup_df dataframe, we compute it from the raw_pdb_df. If scale, we scale the vector to the unit vector. If reverse is True, we reverse the vector (sidechain - node). If reverse is false (default) we compute (node - sidechain).
Sequence#
Functions to add embeddings from pre-trained language models to protein structure graphs.
- graphein.protein.features.sequence.embeddings.biovec_sequence_embedding(G: networkx.classes.graph.Graph) networkx.classes.graph.Graph [source]#
Adds BioVec sequence embedding feature to the graph. Computed over chains.
- Source
ProtVec: A Continuous Distributed Representation of Biological Sequences
Paper: http://arxiv.org/pdf/1503.05140v1.pdf
- Parameters
G (nx.Graph) – nx.Graph protein structure graph.
- Returns
nx.Graph protein structure graph with biovec embedding added. e.g. G.graph["biovec_embedding_A"] for chain A.
- Return type
nx.Graph
- graphein.protein.features.sequence.embeddings.compute_esm_embedding(sequence: str, representation: str, model_name: str = 'esm1b_t33_650M_UR50S', output_layer: int = 33) np.ndarray [source]#
Computes sequence embedding using Pre-trained ESM model from FAIR
Biological Structure and Function Emerge from Scaling Unsupervised Learning to 250 Million Protein Sequences (2019) Rives, Alexander and Meier, Joshua and Sercu, Tom and Goyal, Siddharth and Lin, Zeming and Liu, Jason and Guo, Demi and Ott, Myle and Zitnick, C. Lawrence and Ma, Jerry and Fergus, Rob
Transformer protein language models are unsupervised structure learners 2020 Rao, Roshan M and Meier, Joshua and Sercu, Tom and Ovchinnikov, Sergey and Rives, Alexander
Pre-trained models:
=========  ====================  ======  ======  =======  =============  =========
           Full Name             layers  params  Dataset  Embedding Dim  Model URL
=========  ====================  ======  ======  =======  =============  =========
ESM-1b     esm1b_t33_650M_UR50S  33      650M    UR50/S   1280           https://dl.fbaipublicfiles.com/fair-esm/models/esm1b_t33_650M_UR50S.pt
ESM1-main  esm1_t34_670M_UR50S   34      670M    UR50/S   1280           https://dl.fbaipublicfiles.com/fair-esm/models/esm1_t34_670M_UR50S.pt
           esm1_t34_670M_UR50D   34      670M    UR50/D   1280           https://dl.fbaipublicfiles.com/fair-esm/models/esm1_t34_670M_UR50D.pt
           esm1_t34_670M_UR100   34      670M    UR100    1280           https://dl.fbaipublicfiles.com/fair-esm/models/esm1_t34_670M_UR100.pt
           esm1_t12_85M_UR50S    12      85M     UR50/S   768            https://dl.fbaipublicfiles.com/fair-esm/models/esm1_t12_85M_UR50S.pt
           esm1_t6_43M_UR50S     6       43M     UR50/S   768            https://dl.fbaipublicfiles.com/fair-esm/models/esm1_t6_43M_UR50S.pt
=========  ====================  ======  ======  =======  =============  =========
- Parameters
sequence (str) – Protein sequence to embed (str)
representation (str) – Type of embedding to extract. "residue" or "sequence". Sequence-level embeddings are averaged residue embeddings
model_name (str) – Name of pre-trained model to use
output_layer (int) – integer indicating which layer the output should be taken from
- Returns
embedding (np.ndarray)
- Return type
np.ndarray
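The relationship between the two representation types can be illustrated without loading any model. The sketch below averages a made-up matrix of per-residue embeddings into one sequence-level vector. The numbers and the name `sequence_embedding` are hypothetical, and real ESM embeddings have far more dimensions (see the table above).

```python
def sequence_embedding(residue_embeddings):
    """Average per-residue embedding vectors into one sequence-level vector."""
    n_residues = len(residue_embeddings)
    return [sum(column) / n_residues for column in zip(*residue_embeddings)]


# Made-up embeddings for a three-residue sequence with two dimensions each.
residue_embeddings = [
    [1.0, 2.0],
    [3.0, 4.0],
    [5.0, 6.0],
]

pooled = sequence_embedding(residue_embeddings)
print(pooled)
```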
- graphein.protein.features.sequence.embeddings.esm_residue_embedding(G: networkx.classes.graph.Graph, model_name: str = 'esm1b_t33_650M_UR50S', output_layer: int = 33) networkx.classes.graph.Graph [source]#
Computes ESM residue embeddings from a protein sequence and adds them to the graph.
Biological Structure and Function Emerge from Scaling Unsupervised Learning to 250 Million Protein Sequences (2019) Rives, Alexander and Meier, Joshua and Sercu, Tom and Goyal, Siddharth and Lin, Zeming and Liu, Jason and Guo, Demi and Ott, Myle and Zitnick, C. Lawrence and Ma, Jerry and Fergus, Rob
Transformer protein language models are unsupervised structure learners 2020 Rao, Roshan M and Meier, Joshua and Sercu, Tom and Ovchinnikov, Sergey and Rives, Alexander
Pre-trained models: see the table under compute_esm_embedding.
- graphein.protein.features.sequence.embeddings.esm_sequence_embedding(G: networkx.classes.graph.Graph) networkx.classes.graph.Graph [source]#
Computes ESM sequence embedding feature over chains in a graph.
- Parameters
G (nx.Graph) – nx.Graph protein structure graph.
- Returns
nx.Graph protein structure graph with esm embedding features added e.g. G.graph["esm_embedding_A"] for chain A.
- Return type
nx.Graph
- graphein.protein.features.sequence.propy.aa_dipeptide_composition(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]] = None) networkx.classes.graph.Graph [source]#
Calculate the composition of AADs, dipeptide and 3-mers for a given protein sequence. Contains all composition values of AADs, dipeptide and 3-mers (8420).
- Parameters
G (nx.Graph) – Protein Graph to featurise
aggregation_type (List[Optional[str]]) – Aggregation types to use over chains
- Returns
Protein Graph with aa_dipeptide_composition feature added. G.graph[“aa_dipeptide_composition_{chain | aggregation_type}”]
- Return type
nx.Graph
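The figure of 8420 follows from the 20 standard amino acids: 20 single residues, 20^2 = 400 dipeptides and 20^3 = 8000 3-mers. The sketch below simply enumerates these names to confirm the arithmetic. The ordering of the alphabet and the helper `kmers` are choices made for this example only.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes


def kmers(k):
    """Enumerate every possible k-mer over the 20-letter alphabet."""
    return ["".join(letters) for letters in product(AMINO_ACIDS, repeat=k)]


feature_names = kmers(1) + kmers(2) + kmers(3)
print(len(kmers(1)), len(kmers(2)), len(kmers(3)), len(feature_names))
```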
- graphein.protein.features.sequence.propy.aa_spectrum(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]] = None) networkx.classes.graph.Graph [source]#
Calculate the spectrum descriptors of 3-mers for a given protein. Contains the composition values of 8000 3-mers
- Parameters
G (nx.Graph) – Protein Graph to featurise
aggregation_type (List[Optional[str]]) – Aggregation types to use over chains
- Returns
Protein Graph with aa_spectrum feature added. G.graph[“aa_spectrum_{chain | aggregation_type}”]
- Return type
nx.Graph
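A 3-mer spectrum can be sketched as a count for each of the 8000 possible 3-mers. This example counts overlapping windows and reports raw counts. Both are assumptions, and the underlying library may count or normalise differently. `three_mer_spectrum` is a hypothetical name.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes


def three_mer_spectrum(sequence):
    """Count each possible 3-mer in the sequence using a sliding window."""
    observed = Counter(sequence[i:i + 3] for i in range(len(sequence) - 2))
    all_three_mers = ("".join(letters) for letters in product(AMINO_ACIDS, repeat=3))
    return {three_mer: observed.get(three_mer, 0) for three_mer in all_three_mers}


spectrum = three_mer_spectrum("ACDACD")
print(spectrum["ACD"], spectrum["CDA"], spectrum["AAA"])
```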
- graphein.protein.features.sequence.propy.all_composition_descriptors(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]] = None) networkx.classes.graph.Graph [source]#
Calculate all composition descriptors based on seven different properties of AADs.
- Parameters
G (nx.Graph) – Protein Graph to featurise
aggregation_type (List[Optional[str]]) – Aggregation types to use over chains
- Returns
Protein Graph with composition_descriptors feature added. G.graph[“composition_descriptors_{chain | aggregation_type}”]
- Return type
nx.Graph
- graphein.protein.features.sequence.propy.all_ctd_descriptors(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]] = None) networkx.classes.graph.Graph [source]#
Calculate all CTD descriptors based on seven different properties of AADs.
- Parameters
G (nx.Graph) – Protein Graph to featurise
aggregation_type (List[Optional[str]]) – Aggregation types to use over chains
- Returns
Protein Graph with ctd_descriptors feature added. G.graph[“ctd_descriptors_{chain | aggregation_type}”]
- Return type
nx.Graph
- graphein.protein.features.sequence.propy.all_distribution_descriptors(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]] = None) networkx.classes.graph.Graph [source]#
Calculate all distribution descriptors based on seven different properties of AADs.
- Parameters
G (nx.Graph) – Protein Graph to featurise
aggregation_type (List[Optional[str]]) – Aggregation types to use over chains
- Returns
Protein Graph with distribution_descriptors feature added. G.graph[“distribution_descriptors_{chain | aggregation_type}”]
- Return type
nx.Graph
- graphein.protein.features.sequence.propy.all_transition_descriptors(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]] = None) networkx.classes.graph.Graph [source]#
Calculate all transition descriptors based on seven different properties of AADs.
- Parameters
G (nx.Graph) – Protein Graph to featurise
aggregation_type (List[Optional[str]]) – Aggregation types to use over chains
- Returns
Protein Graph with transition_descriptors feature added. G.graph[“transition_descriptors_{chain | aggregation_type}”]
- Return type
nx.Graph
- graphein.protein.features.sequence.propy.amino_acid_composition(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]] = None) networkx.classes.graph.Graph [source]#
Calculate the composition of Amino acids for a given protein sequence.
- Parameters
G (nx.Graph) – Protein Graph to featurise
aggregation_type (Optional[List[str]]) – Aggregation types to use
- Returns
Protein Graph with amino_acid_composition feature added. G.graph[“amino_acid_composition_{chain | aggregation_type}”]
- Return type
nx.Graph
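Amino acid composition is, in essence, the share of each residue type in a sequence. The sketch below expresses it as a percentage of sequence length, which is an assumption about units rather than a statement of what the library returns. `amino_acid_percentages` is a hypothetical name.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes


def amino_acid_percentages(sequence):
    """Percentage of the sequence made up by each of the 20 amino acids."""
    length = len(sequence)
    return {aa: 100.0 * sequence.count(aa) / length for aa in AMINO_ACIDS}


composition = amino_acid_percentages("AAGC")
print(composition["A"], composition["G"], composition["C"], composition["W"])
```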
- graphein.protein.features.sequence.propy.autocorrelation_geary_all(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]] = None) networkx.classes.graph.Graph [source]#
Compute Geary autocorrelation descriptors based on 8 properties of AADs. Result contains 30*8=240 Geary autocorrelation descriptors based on the given properties (i.e., _AAPropert).
- Parameters
G (nx.Graph) – Protein graph to featurise
aggregation_type (List[Optional[str]]) – Aggregation types to use over chains
- Returns
Protein Graph with autocorrelation_geary_all feature added. G.graph[“autocorrelation_geary_all_{chain | aggregation_type}”]
- Return type
nx.Graph
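For orientation, the textbook Geary autocorrelation at lag d compares squared differences between residues d positions apart with the overall variance of the property. The sketch below implements that standard definition for a single property and a single lag. It is not taken from the library, and the property values are invented. Evaluating 30 lags for each of 8 properties would give the 240 descriptors mentioned above.

```python
def geary_autocorrelation(values, lag):
    """Geary autocorrelation of a per-residue property at a given lag.

    values: property value for each residue, in sequence order.
    lag: distance d between the residues being compared (1 <= d < len(values)).
    """
    n = len(values)
    mean = sum(values) / n
    # Mean squared difference between residues that are `lag` apart, halved.
    numerator = sum(
        (values[i] - values[i + lag]) ** 2 for i in range(n - lag)
    ) / (2 * (n - lag))
    # Sample variance of the property over the whole sequence.
    denominator = sum((v - mean) ** 2 for v in values) / (n - 1)
    return numerator / denominator


# Invented property values for a four-residue sequence.
geary_lag1 = geary_autocorrelation([1.0, 2.0, 3.0, 4.0], lag=1)
print(geary_lag1)
```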
- graphein.protein.features.sequence.propy.autocorrelation_geary_av_flexibility(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]]) networkx.classes.graph.Graph [source]#
Calculate the Geary Autocorrelation descriptors based on AvFlexibility. Result contains 30 Geary Autocorrelation descriptors based on AvFlexibility.
- Parameters
G (nx.Graph) – Protein graph to featurise
aggregation_type (List[Optional[str]]) – Aggregation types to use over chains
- Returns
Protein Graph with autocorrelation_geary_av_flexibility feature added. G.graph[“autocorrelation_geary_av_flexibility_{chain | aggregation_type}”]
- Return type
nx.Graph
- graphein.protein.features.sequence.propy.autocorrelation_geary_free_energy(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]]) networkx.classes.graph.Graph [source]#
Calculate the Geary Autocorrelation descriptors based on FreeEnergy. Result contains 30 Geary Autocorrelation descriptors based on FreeEnergy.
- Parameters
G (nx.Graph) – Protein graph to featurise
aggregation_type (List[Optional[str]]) – Aggregation types to use over chains
- Returns
Protein Graph with autocorrelation_geary_free_energy feature added. G.graph[“autocorrelation_geary_av_free_energy_{chain | aggregation_type}”]
- Return type
nx.Graph
- graphein.protein.features.sequence.propy.autocorrelation_geary_hydrophobicity(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]]) networkx.classes.graph.Graph [source]#
Calculate the Geary Autocorrelation descriptors based on hydrophobicity. Result contains 30 Geary Autocorrelation descriptors based on hydrophobicity.
- Parameters
G (nx.Graph) – Protein graph to featurise
aggregation_type (List[Optional[str]]) – Aggregation types to use over chains
- Returns
Protein Graph with autocorrelation_geary_hydrophobicity feature added. G.graph[“autocorrelation_geary_hydrophobicity_{chain | aggregation_type}”]
- Return type
nx.Graph
- graphein.protein.features.sequence.propy.autocorrelation_geary_mutability(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]]) networkx.classes.graph.Graph [source]#
Calculate the Geary Autocorrelation descriptors based on Mutability. Result contains 30 Geary Autocorrelation descriptors based on mutability.
- Parameters
G (nx.Graph) – Protein graph to featurise
aggregation_type (List[Optional[str]]) – Aggregation types to use over chains
- Returns
Protein Graph with autocorrelation_geary_mutability feature added. G.graph[“autocorrelation_geary_mutability_{chain | aggregation_type}”]
- Return type
nx.Graph
- graphein.protein.features.sequence.propy.autocorrelation_geary_polarizability(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]]) networkx.classes.graph.Graph [source]#
Calculate the Geary Autocorrelation descriptors based on polarizability. Result contains 30 Geary Autocorrelation descriptors based on polarizability.
- Parameters
G (nx.Graph) – Protein graph to featurise
aggregation_type (List[Optional[str]]) – Aggregation types to use over chains
- Returns
Protein Graph with autocorrelation_geary_polarizability feature added. G.graph[“autocorrelation_geary_polarizability_{chain | aggregation_type}”]
- Return type
nx.Graph
- graphein.protein.features.sequence.propy.autocorrelation_geary_residue_asa(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]]) networkx.classes.graph.Graph [source]#
Calculate the Geary Autocorrelation descriptors based on ResidueASA. Result contains 30 Geary Autocorrelation descriptors based on ResidueASA.
- Parameters
G (nx.Graph) – Protein graph to featurise
aggregation_type (List[Optional[str]]) – Aggregation types to use over chains
- Returns
Protein Graph with autocorrelation_geary_residue_asa feature added. G.graph[“autocorrelation_geary_residue_asa_{chain | aggregation_type}”]
- Return type
nx.Graph
- graphein.protein.features.sequence.propy.autocorrelation_geary_residue_vol(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]]) networkx.classes.graph.Graph [source]#
Calculate the Geary Autocorrelation descriptors based on ResidueVol. Result contains 30 Geary Autocorrelation descriptors based on ResidueVol.
- Parameters
G (nx.Graph) – Protein graph to featurise
aggregation_type (List[Optional[str]]) – Aggregation types to use over chains
- Returns
Protein Graph with autocorrelation_geary_residue_vol feature added. G.graph[“autocorrelation_geary_residue_vol_{chain | aggregation_type}”]
- Return type
nx.Graph
- graphein.protein.features.sequence.propy.autocorrelation_geary_steric(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]]) networkx.classes.graph.Graph [source]#
Calculate the Geary Autocorrelation descriptors based on Steric. Result contains 30 Geary Autocorrelation descriptors based on Steric.
- Parameters
G (nx.Graph) – Protein graph to featurise
aggregation_type (List[Optional[str]]) – Aggregation types to use over chains
- Returns
Protein Graph with autocorrelation_geary_steric feature added. G.graph[“autocorrelation_geary_steric_{chain | aggregation_type}”]
- Return type
nx.Graph
- graphein.protein.features.sequence.propy.autocorrelation_moran_all(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]] = None) networkx.classes.graph.Graph [source]#
Compute Moran autocorrelation descriptors based on 8 properties of AADs. Result contains 30*8=240 Moran autocorrelation descriptors based on the given properties (i.e., _AAPropert).
- Parameters
G (nx.Graph) – Protein graph to featurise
aggregation_type (List[Optional[str]]) – Aggregation types to use over chains
- Returns
Protein Graph with autocorrelation_moran_all feature added. G.graph[“autocorrelation_moran_all_{chain | aggregation_type}”]
- Return type
nx.Graph
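The textbook Moran autocorrelation at lag d is the average product of mean-centred property values for residues d positions apart, divided by the variance of the property. The sketch below implements that standard definition for one property and one lag. It is an illustration with invented values, not the library's code.

```python
def moran_autocorrelation(values, lag):
    """Moran autocorrelation of a per-residue property at a given lag.

    values: property value for each residue, in sequence order.
    lag: distance d between the residues being compared (1 <= d < len(values)).
    """
    n = len(values)
    mean = sum(values) / n
    # Average product of mean-centred values for residues `lag` apart.
    numerator = sum(
        (values[i] - mean) * (values[i + lag] - mean) for i in range(n - lag)
    ) / (n - lag)
    # Population variance of the property over the whole sequence.
    denominator = sum((v - mean) ** 2 for v in values) / n
    return numerator / denominator


# Invented property values for a four-residue sequence.
moran_lag1 = moran_autocorrelation([1.0, 2.0, 3.0, 4.0], lag=1)
print(moran_lag1)
```

Values near 1 indicate that residues separated by the lag tend to have similar property values, while negative values indicate alternation.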
- graphein.protein.features.sequence.propy.autocorrelation_moran_av_flexibility(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]]) networkx.classes.graph.Graph [source]#
Calculate the Moran Autocorrelation descriptors based on AvFlexibility. Result contains 30 Moran Autocorrelation descriptors based on AvFlexibility.
- Parameters
G (nx.Graph) – Protein graph to featurise
aggregation_type (List[Optional[str]]) – Aggregation types to use over chains
- Returns
Protein Graph with autocorrelation_moran_av_flexibility feature added. G.graph[“autocorrelation_moran_av_flexibility_{chain | aggregation_type}”]
- Return type
nx.Graph
- graphein.protein.features.sequence.propy.autocorrelation_moran_free_energy(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]]) networkx.classes.graph.Graph [source]#
Calculate the Moran Autocorrelation descriptors based on FreeEnergy. Result contains 30 Moran Autocorrelation descriptors based on FreeEnergy.
- Parameters
G (nx.Graph) – Protein graph to featurise
aggregation_type (List[Optional[str]]) – Aggregation types to use over chains
- Returns
Protein Graph with autocorrelation_moran_free_energy feature added. G.graph[“autocorrelation_moran_free_energy_{chain | aggregation_type}”]
- Return type
nx.Graph
- graphein.protein.features.sequence.propy.autocorrelation_moran_hydrophobicity(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]]) networkx.classes.graph.Graph [source]#
- Calculate the Moran Autocorrelation descriptors based on hydrophobicity. Result contains 30 Moran Autocorrelation
descriptors based on hydrophobicity.
- Parameters
G (nx.Graph) – Protein graph to featurise
aggregation_type (List[Optional[str]]) – Aggregation types to use over chains
- Returns
Protein Graph with autocorrelation_moran_hydrophobicity feature added. G.graph[“autocorrelation_moran_hydrophobicity_{chain | aggregation_type}”]
- Return type
nx.Graph
- graphein.protein.features.sequence.propy.autocorrelation_moran_mutability(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]]) networkx.classes.graph.Graph [source]#
- Calculate the Moran Autocorrelation descriptors based on Mutability. Result contains 30 Moran Autocorrelation
descriptors based on mutability.
- Parameters
G (nx.Graph) – Protein graph to featurise
aggregation_type (List[Optional[str]]) – Aggregation types to use over chains
- Returns
Protein Graph with autocorrelation_moran_mutability feature added. G.graph[“autocorrelation_moran_mutability_{chain | aggregation_type}”]
- Return type
nx.Graph
- graphein.protein.features.sequence.propy.autocorrelation_moran_polarizability(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]]) networkx.classes.graph.Graph [source]#
- Calculate the Moran Autocorrelation descriptors based on polarizability. Result contains 30 Moran Autocorrelation
descriptors based on polarizability.
- Parameters
G (nx.Graph) – Protein graph to featurise
aggregation_type (List[Optional[str]]) – Aggregation types to use over chains
- Returns
Protein Graph with autocorrelation_moran_polarizability feature added. G.graph[“autocorrelation_moran_polarizability_{chain | aggregation_type}”]
- Return type
nx.Graph
- graphein.protein.features.sequence.propy.autocorrelation_moran_residue_asa(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]]) networkx.classes.graph.Graph [source]#
- Calculate the Moran Autocorrelation descriptors based on ResidueASA. Result contains 30 Moran Autocorrelation
descriptors based on ResidueASA.
- Parameters
G (nx.Graph) – Protein graph to featurise
aggregation_type (List[Optional[str]]) – Aggregation types to use over chains
- Returns
Protein Graph with autocorrelation_moran_residue_asa feature added. G.graph[“autocorrelation_moran_residue_asa_{chain | aggregation_type}”]
- Return type
nx.Graph
- graphein.protein.features.sequence.propy.autocorrelation_moran_residue_vol(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]]) networkx.classes.graph.Graph [source]#
- Calculate the Moran Autocorrelation descriptors based on ResidueVol. Result contains 30 Moran Autocorrelation
descriptors based on ResidueVol.
- Parameters
G (nx.Graph) – Protein graph to featurise
aggregation_type (List[Optional[str]]) – Aggregation types to use over chains
- Returns
Protein Graph with autocorrelation_moran_residue_vol feature added. G.graph[“autocorrelation_moran_residue_vol_{chain | aggregation_type}”]
- Return type
nx.Graph
- graphein.protein.features.sequence.propy.autocorrelation_moran_steric(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]]) networkx.classes.graph.Graph [source]#
- Calculate the Moran Autocorrelation descriptors based on Steric. Result contains 30 Moran Autocorrelation
descriptors based on Steric.
- Parameters
G (nx.Graph) – Protein graph to featurise
aggregation_type (List[Optional[str]]) – Aggregation types to use over chains
- Returns
Protein Graph with autocorrelation_moran_steric feature added. G.graph[“autocorrelation_moran_steric_{chain | aggregation_type}”]
- Return type
nx.Graph
- graphein.protein.features.sequence.propy.autocorrelation_normalized_moreau_broto_all(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]] = None) networkx.classes.graph.Graph [source]#
- Compute NormalizedMoreauBroto autocorrelation descriptors based on 8 properties of AADs. Result contains 30*8=240
NormalizedMoreauBroto autocorrelation descriptors based on the given properties (i.e., _AAPropert).
- Parameters
G (nx.Graph) – Protein graph to featurise
aggregation_type (List[Optional[str]]) – Aggregation types to use over chains
- Returns
Protein Graph with autocorrelation_normalized_moreau_broto_all feature added. G.graph[“autocorrelation_normalized_moreau_broto_all_{chain | aggregation_type}”]
- Return type
nx.Graph
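As a rough illustration of what a normalized Moreau-Broto descriptor measures, the sketch below sums the product of a property at positions `lag` apart and divides by the number of such pairs. It follows the textbook definition, not the propy source. `TOY_PROPERTY` and `normalized_moreau_broto` are hypothetical names, and any property standardisation or rounding done by the library is not reproduced.

```python
from typing import Dict

# Hypothetical property values for illustration only.
TOY_PROPERTY: Dict[str, float] = {"A": 0.5, "G": -0.5, "L": 1.0}


def normalized_moreau_broto(sequence: str, prop: Dict[str, float], lag: int) -> float:
    """Average product of property values for residues `lag` positions apart."""
    n = len(sequence)
    if not 0 < lag < n:
        raise ValueError("lag must satisfy 0 < lag < len(sequence)")
    total = sum(prop[sequence[i]] * prop[sequence[i + lag]] for i in range(n - lag))
    # Normalise by the number of residue pairs at this lag.
    return total / (n - lag)


# 8 properties x 30 lags would yield the 240 values described above;
# here a single toy property is evaluated at two lags.
values = [normalized_moreau_broto("AGAG", TOY_PROPERTY, d) for d in (1, 2)]
```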
- graphein.protein.features.sequence.propy.autocorrelation_normalized_moreau_broto_av_flexibility(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]]) networkx.classes.graph.Graph [source]#
- Calculate the NormalizedMoreauBroto Autocorrelation descriptors based on AvFlexibility. Result contains 30
NormalizedMoreauBroto Autocorrelation descriptors based on AvFlexibility.
- Parameters
G (nx.Graph) – Protein graph to featurise
aggregation_type (List[Optional[str]]) – Aggregation types to use over chains
- Returns
Protein Graph with autocorrelation_normalized_moreau_broto_av_flexibility feature added. G.graph[“autocorrelation_normalized_moreau_broto_av_flexibility_{chain | aggregation_type}”]
- Return type
nx.Graph
- graphein.protein.features.sequence.propy.autocorrelation_normalized_moreau_broto_free_energy(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]]) networkx.classes.graph.Graph [source]#
- Calculate the NormalizedMoreauBroto Autocorrelation descriptors based on FreeEnergy. Result contains 30
NormalizedMoreauBroto Autocorrelation descriptors based on FreeEnergy.
- Parameters
G (nx.Graph) – Protein graph to featurise
aggregation_type (List[Optional[str]]) – Aggregation types to use over chains
- Returns
Protein Graph with autocorrelation_normalized_moreau_broto_free_energy feature added. G.graph[“autocorrelation_normalized_moreau_broto_free_energy_{chain | aggregation_type}”]
- Return type
nx.Graph
- graphein.protein.features.sequence.propy.autocorrelation_normalized_moreau_broto_hydrophobicity(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]]) networkx.classes.graph.Graph [source]#
- Calculate the NormalizedMoreauBroto Autocorrelation descriptors based on hydrophobicity. Result contains 30
NormalizedMoreauBroto Autocorrelation descriptors based on hydrophobicity.
- Parameters
G (nx.Graph) – Protein graph to featurise
aggregation_type (List[Optional[str]]) – Aggregation types to use over chains
- Returns
Protein Graph with autocorrelation_normalized_moreau_broto_hydrophobicity feature added. G.graph[“autocorrelation_normalized_moreau_broto_hydrophobicity_{chain | aggregation_type}”]
- Return type
nx.Graph
- graphein.protein.features.sequence.propy.autocorrelation_normalized_moreau_broto_mutability(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]]) networkx.classes.graph.Graph [source]#
- Calculate the NormalizedMoreauBroto Autocorrelation descriptors based on Mutability. Result contains 30
NormalizedMoreauBroto Autocorrelation descriptors based on mutability.
- Parameters
G (nx.Graph) – Protein graph to featurise
aggregation_type (List[Optional[str]]) – Aggregation types to use over chains
- Returns
Protein Graph with autocorrelation_normalized_moreau_broto_mutability feature added. G.graph[“autocorrelation_normalized_moreau_broto_mutability_{chain | aggregation_type}”]
- Return type
nx.Graph
- graphein.protein.features.sequence.propy.autocorrelation_normalized_moreau_broto_polarizability(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]]) networkx.classes.graph.Graph [source]#
- Calculate the NormalizedMoreauBroto Autocorrelation descriptors based on polarizability. Result contains 30
NormalizedMoreauBroto Autocorrelation descriptors based on polarizability.
- Parameters
G (nx.Graph) – Protein graph to featurise
aggregation_type (List[Optional[str]]) – Aggregation types to use over chains
- Returns
Protein Graph with autocorrelation_normalized_moreau_broto_polarizability feature added. G.graph[“autocorrelation_normalized_moreau_broto_polarizability_{chain | aggregation_type}”]
- Return type
nx.Graph
- graphein.protein.features.sequence.propy.autocorrelation_normalized_moreau_broto_residue_asa(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]]) networkx.classes.graph.Graph [source]#
- Calculate the NormalizedMoreauBroto Autocorrelation descriptors based on ResidueASA. Result contains 30
NormalizedMoreauBroto Autocorrelation descriptors based on ResidueASA.
- Parameters
G (nx.Graph) – Protein graph to featurise
aggregation_type (List[Optional[str]]) – Aggregation types to use over chains
- Returns
Protein Graph with autocorrelation_normalized_moreau_broto_residue_asa feature added. G.graph[“autocorrelation_normalized_moreau_broto_residue_asa_{chain | aggregation_type}”]
- Return type
nx.Graph
- graphein.protein.features.sequence.propy.autocorrelation_normalized_moreau_broto_residue_vol(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]]) networkx.classes.graph.Graph [source]#
- Calculate the NormalizedMoreauBroto Autocorrelation descriptors based on ResidueVol. Result contains 30
NormalizedMoreauBroto Autocorrelation descriptors based on ResidueVol.
- Parameters
G (nx.Graph) – Protein graph to featurise
aggregation_type (List[Optional[str]]) – Aggregation types to use over chains
- Returns
Protein Graph with autocorrelation_normalized_moreau_broto_residue_vol feature added. G.graph[“autocorrelation_normalized_moreau_broto_residue_vol_{chain | aggregation_type}”]
- Return type
nx.Graph
- graphein.protein.features.sequence.propy.autocorrelation_normalized_moreau_broto_steric(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]]) networkx.classes.graph.Graph [source]#
- Calculate the NormalizedMoreauBroto Autocorrelation descriptors based on Steric. Result contains 30
NormalizedMoreauBroto Autocorrelation descriptors based on Steric.
- Parameters
G (nx.Graph) – Protein graph to featurise
aggregation_type (List[Optional[str]]) – Aggregation types to use over chains
- Returns
Protein Graph with autocorrelation_normalized_moreau_broto_steric feature added. G.graph[“autocorrelation_normalized_moreau_broto_steric_{chain | aggregation_type}”]
- Return type
nx.Graph
- graphein.protein.features.sequence.propy.autocorrelation_total(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]] = None) networkx.classes.graph.Graph [source]#
- Compute all autocorrelation descriptors based on 8 properties of AADs. Result contains 30*8*3=720 normalized Moreau
Broto, Moran, and Geary autocorrelation descriptors.
- Parameters
G (nx.Graph) – Protein graph to featurise
aggregation_type (List[Optional[str]]) – Aggregation types to use over chains
- Returns
Protein Graph with autocorrelation_total feature added. G.graph[“autocorrelation_total_{chain | aggregation_type}”]
- Return type
nx.Graph
- graphein.protein.features.sequence.propy.composition_charge(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]] = None) networkx.classes.graph.Graph [source]#
Calculate composition descriptors based on Charge of AADs.
- Parameters
G (nx.Graph) – Protein Graph to featurise
aggregation_type (List[Optional[str]]) – Aggregation types to use over chains
- Returns
Protein Graph with composition_charge feature added. G.graph[“composition_charge_{chain | aggregation_type}”]
- Return type
nx.Graph
- graphein.protein.features.sequence.propy.composition_descriptor(G: networkx.classes.graph.Graph, AAProperty: Dict[Any, Any], AAPName: str, aggregation_type: Optional[List[str]] = None) networkx.classes.graph.Graph [source]#
Compute composition descriptors.
- Parameters
- Returns
Protein Graph with composition_{AAPName} feature added. G.graph[“composition_{AAPName}_{chain | aggregation_type}”]
- Return type
nx.Graph
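A composition descriptor can be read as "what fraction of the sequence falls into each property class". The sketch below illustrates that idea under stated assumptions: the three-class charge grouping in `CHARGE_CLASSES` is an assumed mapping chosen for illustration, and `encode_classes` and `composition` are hypothetical helpers, not graphein or propy functions.

```python
from typing import Dict

# Assumed three-class charge grouping, for illustration only.
CHARGE_CLASSES: Dict[str, str] = {"1": "KR", "2": "ANCQGHILMFPSTWYV", "3": "DE"}


def encode_classes(sequence: str, classes: Dict[str, str]) -> str:
    """Replace each residue with the label of the class it belongs to."""
    lookup = {aa: label for label, members in classes.items() for aa in members}
    return "".join(lookup[aa] for aa in sequence)


def composition(sequence: str, classes: Dict[str, str]) -> Dict[str, float]:
    """Fraction of residues in each class."""
    encoded = encode_classes(sequence, classes)
    return {label: encoded.count(label) / len(encoded) for label in classes}


result = composition("KRDEAAAA", CHARGE_CLASSES)
```

Here two of eight residues are positively charged, four are neutral, and two are negatively charged, so the three fractions are 0.25, 0.5, and 0.25.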
- graphein.protein.features.sequence.propy.composition_hydrophobicity(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]] = None) networkx.classes.graph.Graph [source]#
Calculate composition descriptors based on Hydrophobicity of AADs.
- Parameters
G (nx.Graph) – Protein Graph to featurise
aggregation_type (List[Optional[str]]) – Aggregation types to use over chains
- Returns
Protein Graph with composition_hydrophobicity feature added. G.graph[“composition_hydrophobicity_{chain | aggregation_type}”]
- Return type
nx.Graph
- graphein.protein.features.sequence.propy.composition_normalized_vdwv(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]] = None) networkx.classes.graph.Graph [source]#
Calculate composition descriptors based on NormalizedVDWV of AADs.
- Parameters
G (nx.Graph) – Protein Graph to featurise
aggregation_type (List[Optional[str]]) – Aggregation types to use over chains
- Returns
Protein Graph with composition_normalized_vdwv feature added. G.graph[“composition_normalized_vdwv_{chain | aggregation_type}”]
- Return type
nx.Graph
- graphein.protein.features.sequence.propy.composition_polarity(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]]) networkx.classes.graph.Graph [source]#
Calculate composition descriptors based on Polarity of AADs.
- Parameters
G (nx.Graph) – Protein Graph to featurise
aggregation_type (List[Optional[str]]) – Aggregation types to use over chains
- Returns
Protein Graph with composition_polarity feature added. G.graph[“composition_polarity_{chain | aggregation_type}”]
- Return type
nx.Graph
- graphein.protein.features.sequence.propy.composition_polarizability(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]]) networkx.classes.graph.Graph [source]#
Calculate composition descriptors based on Polarizability of AADs.
- Parameters
G (nx.Graph) – Protein graph to featurise
aggregation_type (List[Optional[str]]) – Aggregation types to use over chains
- Returns
Protein Graph with composition_polarizability feature added. G.graph[“composition_polarizability_{chain | aggregation_type}”]
- Return type
nx.Graph
- graphein.protein.features.sequence.propy.composition_secondary_str(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]]) networkx.classes.graph.Graph [source]#
Calculate composition descriptors based on SecondaryStr of AADs.
- Parameters
G (nx.Graph) – Protein graph to featurise
aggregation_type (List[Optional[str]]) – Aggregation types to use over chains
- Returns
Protein Graph with composition_secondary_str feature added. G.graph[“composition_secondary_str_{chain | aggregation_type}”]
- Return type
nx.Graph
- graphein.protein.features.sequence.propy.composition_solvent_accessibility(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]]) networkx.classes.graph.Graph [source]#
Calculate composition descriptors based on SolventAccessibility of AADs.
- Parameters
G (nx.Graph) – Protein graph to featurise
aggregation_type (List[Optional[str]]) – Aggregation types to use over chains
- Returns
Protein Graph with composition_solvent_accessibility feature added. G.graph[“composition_solvent_accessibility_{chain | aggregation_type}”]
- Return type
nx.Graph
- graphein.protein.features.sequence.propy.compute_propy_feature(G: networkx.classes.graph.Graph, func: Callable, feature_name: str, aggregation_type: Optional[List[str]] = None) networkx.classes.graph.Graph [source]#
Computes Propy Descriptors over chains in a Protein Graph
- Parameters
G (nx.Graph) – Protein Graph
func (Callable) – ProPy wrapper function to compute
feature_name (str) – Name of feature to index it in the nx.Graph object
aggregation_type (List[str], optional) – Type of aggregation to use when aggregating a feature over multiple chains. One of: [“min”, “max”, “sum”, “mean”]. Defaults to None.
- Return G
Returns protein Graph with features added. Features are accessible with G.graph[{feature_name}_{chain | aggregation_type}]
- Return type
nx.Graph
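The per-chain key convention described above can be sketched with a plain dictionary standing in for `G.graph`. This is an illustration, not the library's implementation: the `chain_ids` and `sequence_{chain}` keys are assumptions about how the graph stores its metadata, and `compute_feature_per_chain` and `residue_fractions` are hypothetical names.

```python
from typing import Any, Callable, Dict


def compute_feature_per_chain(
    graph_attrs: Dict[str, Any],
    func: Callable[[str], Any],
    feature_name: str,
) -> Dict[str, Any]:
    """Apply `func` to each chain sequence and store it as {feature_name}_{chain}."""
    for chain in graph_attrs["chain_ids"]:  # assumed key
        sequence = graph_attrs[f"sequence_{chain}"]  # assumed key
        graph_attrs[f"{feature_name}_{chain}"] = func(sequence)
    return graph_attrs


def residue_fractions(sequence: str) -> Dict[str, float]:
    """Stand-in descriptor: fraction of each residue type in the sequence."""
    return {aa: sequence.count(aa) / len(sequence) for aa in sorted(set(sequence))}


attrs = {"chain_ids": ["A", "B"], "sequence_A": "AAGG", "sequence_B": "AG"}
compute_feature_per_chain(attrs, residue_fractions, "toy_composition")
```

After the call, `attrs["toy_composition_A"]` and `attrs["toy_composition_B"]` hold one result per chain, which is the shape an aggregation step would then consume.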
- graphein.protein.features.sequence.propy.dipeptide_composition(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]] = None) networkx.classes.graph.Graph [source]#
Calculate the composition of dipeptides for a given protein sequence. Contains the composition of 400 dipeptides.
- Parameters
G (nx.Graph) – Protein Graph to featurise
aggregation_type (List[Optional[str]]) – Aggregation types to use over chains
- Returns
Protein Graph with dipeptide_composition feature added. G.graph[“dipeptide_composition_{chain | aggregation_type}”]
- Return type
nx.Graph
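The figure of 400 comes from the 20 x 20 ordered pairs of standard amino acids. A minimal sketch of such a composition follows. It counts overlapping two-residue windows and reports percentages, both of which are choices made for this illustration and may differ from how the wrapped library counts or scales. The function name is hypothetical, and the sketch assumes the sequence contains only the 20 standard residues.

```python
from itertools import product
from typing import Dict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def dipeptide_composition(sequence: str) -> Dict[str, float]:
    """Percentage of each of the 400 dipeptides among all adjacent residue pairs."""
    if len(sequence) < 2:
        raise ValueError("sequence must contain at least two residues")
    counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
    # Slide a window of width two along the sequence.
    for i in range(len(sequence) - 1):
        counts[sequence[i:i + 2]] += 1
    total = len(sequence) - 1
    return {dipeptide: 100 * count / total for dipeptide, count in counts.items()}


comp = dipeptide_composition("ACAC")
```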
- graphein.protein.features.sequence.propy.distribution_charge(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]] = None) networkx.classes.graph.Graph [source]#
Calculate distribution descriptors based on Charge of AADs.
- Parameters
G (nx.Graph) – Protein Graph to featurise
aggregation_type (List[Optional[str]]) – Aggregation types to use over chains
- Returns
Protein Graph with distribution_charge feature added. G.graph[“distribution_charge_{chain | aggregation_type}”]
- Return type
nx.Graph
- graphein.protein.features.sequence.propy.distribution_descriptor(G: networkx.classes.graph.Graph, AAProperty: Dict[Any, Any], AAPName: str, aggregation_type: Optional[List[str]] = None) networkx.classes.graph.Graph [source]#
Compute distribution descriptors.
- Parameters
- Returns
Protein Graph with distribution_{AAPName} feature added. G.graph[“distribution_{AAPName}_{chain | aggregation_type}”]
- Return type
nx.Graph
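A distribution descriptor records where along the chain the residues of each class occur. The sketch below reports, for each class, the sequence position (as a percentage of chain length) at which the first occurrence and 25%, 50%, 75%, and 100% of that class's residues have been seen. The class grouping, the key format, and the exact index rounding are assumptions for illustration and are not taken from the library source.

```python
from typing import Dict

# Assumed three-class grouping, for illustration only.
TOY_CLASSES: Dict[str, str] = {"1": "KR", "2": "ANCQGHILMFPSTWYV", "3": "DE"}


def distribution(sequence: str, classes: Dict[str, str]) -> Dict[str, float]:
    """Relative positions of the first, 25%, 50%, 75% and last residue of each class."""
    lookup = {aa: label for label, members in classes.items() for aa in members}
    n = len(sequence)
    result: Dict[str, float] = {}
    for label in classes:
        # 1-based positions of every residue belonging to this class.
        positions = [i + 1 for i, aa in enumerate(sequence) if lookup[aa] == label]
        for pct in (0, 25, 50, 75, 100):
            key = f"{label}_{pct:03d}"
            if not positions:
                result[key] = 0.0  # class absent from the sequence
                continue
            # Number of occurrences needed to reach `pct` percent (integer ceiling).
            reached = (len(positions) * pct + 99) // 100
            idx = max(reached - 1, 0)
            result[key] = 100 * positions[idx] / n
    return result


dist = distribution("KAAAKAAAKK", TOY_CLASSES)
```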
- graphein.protein.features.sequence.propy.distribution_hydrophobicity(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]] = None) networkx.classes.graph.Graph [source]#
Calculate distribution descriptors based on Hydrophobicity of AADs.
- Parameters
G (nx.Graph) – Protein Graph to featurise
aggregation_type (List[Optional[str]]) – Aggregation types to use over chains
- Returns
Protein Graph with distribution_hydrophobicity feature added. G.graph[“distribution_hydrophobicity_{chain | aggregation_type}”]
- Return type
nx.Graph
- graphein.protein.features.sequence.propy.distribution_normalized_vdwv(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]]) networkx.classes.graph.Graph [source]#
Calculate distribution descriptors based on NormalizedVDWV of AADs.
- Parameters
G (nx.Graph) – Protein Graph to featurise
aggregation_type (List[Optional[str]]) – Aggregation types to use over chains
- Returns
Protein Graph with distribution_normalized_vdwv feature added. G.graph[“distribution_normalized_vdwv_{chain | aggregation_type}”]
- Return type
nx.Graph
- graphein.protein.features.sequence.propy.distribution_polarity(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]]) networkx.classes.graph.Graph [source]#
Calculate distribution descriptors based on Polarity of AADs.
- Parameters
G (nx.Graph) – Protein Graph to featurise
aggregation_type (List[Optional[str]]) – Aggregation types to use over chains
- Returns
Protein Graph with distribution_polarity feature added. G.graph[“distribution_polarity_{chain | aggregation_type}”]
- Return type
nx.Graph
- graphein.protein.features.sequence.propy.distribution_polarizability(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]]) networkx.classes.graph.Graph [source]#
Calculate distribution descriptors based on Polarizability of AADs.
- Parameters
G (nx.Graph) – Protein graph to featurise
aggregation_type (List[Optional[str]]) – Aggregation types to use over chains
- Returns
Protein Graph with distribution_polarizability feature added. G.graph[“distribution_polarizability_{chain | aggregation_type}”]
- Return type
nx.Graph
- graphein.protein.features.sequence.propy.distribution_secondary_str(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]]) networkx.classes.graph.Graph [source]#
Calculate distribution descriptors based on SecondaryStr of AADs.
- Parameters
G (nx.Graph) – Protein graph to featurise
aggregation_type (List[Optional[str]]) – Aggregation types to use over chains
- Returns
Protein Graph with distribution_secondary_str feature added. G.graph[“distribution_secondary_str_{chain | aggregation_type}”]
- Return type
nx.Graph
- graphein.protein.features.sequence.propy.distribution_solvent_accessibility(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]]) networkx.classes.graph.Graph [source]#
Calculate distribution descriptors based on SolventAccessibility of AADs.
- Parameters
G (nx.Graph) – Protein graph to featurise
aggregation_type (List[Optional[str]]) – Aggregation types to use over chains
- Returns
Protein Graph with distribution_solvent_accessibility feature added. G.graph[“distribution_solvent_accessibility_{chain | aggregation_type}”]
- Return type
nx.Graph
- graphein.protein.features.sequence.propy.quasi_sequence_order(G: networkx.classes.graph.Graph, maxlag: int = 30, weight: float = 0.1) networkx.classes.graph.Graph [source]#
Compute quasi-sequence-order descriptors for a given protein.
- Kuo-Chen Chou. Prediction of Protein Subcellular Locations by Incorporating Quasi-Sequence-Order Effect.
Biochemical and Biophysical Research Communications 2000, 278, 477-483.
- graphein.protein.features.sequence.propy.sequence_order_coupling_number_total(G: networkx.classes.graph.Graph, maxlag: int = 30) networkx.classes.graph.Graph [source]#
Compute the sequence order coupling numbers from 1 to maxlag for a given protein sequence.
- Parameters
G (nx.Graph) – Protein Graph
maxlag (int, optional) – The maximum lag. The length of the protein sequence should be larger than maxlag. Defaults to 30.
- Returns
- Return type
nx.Graph
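The requirement that the sequence be longer than `maxlag` follows from how a coupling number is formed: for lag d it sums a squared residue-to-residue distance over all pairs d positions apart, so no pairs exist once d reaches the sequence length. The sketch below shows this with a toy distance. Real implementations typically use a published amino-acid distance matrix. `TOY_SCALE`, `toy_distance`, and both function names are hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical one-dimensional scale used to define a toy distance.
TOY_SCALE: Dict[str, float] = {"A": 0.0, "G": 1.0, "L": 3.0}


def toy_distance(a: str, b: str) -> float:
    """Stand-in for a physicochemical distance between two residues."""
    return abs(TOY_SCALE[a] - TOY_SCALE[b])


def coupling_number(sequence: str, distance: Callable[[str, str], float], lag: int) -> float:
    """Sum of squared distances between residues `lag` positions apart."""
    n = len(sequence)
    if not 0 < lag < n:
        raise ValueError("sequence must be longer than the lag")
    return sum(distance(sequence[i], sequence[i + lag]) ** 2 for i in range(n - lag))


def coupling_numbers(
    sequence: str, distance: Callable[[str, str], float], maxlag: int = 30
) -> List[float]:
    """Coupling numbers for lags 1..maxlag."""
    return [coupling_number(sequence, distance, d) for d in range(1, maxlag + 1)]


taus = coupling_numbers("AGL", toy_distance, maxlag=2)
```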
- graphein.protein.features.sequence.propy.transition_charge(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]] = None) networkx.classes.graph.Graph [source]#
Calculate transition descriptors based on Charge of AADs.
- Parameters
G (nx.Graph) – Protein Graph to featurise
aggregation_type (List[Optional[str]]) – Aggregation types to use over chains
- Returns
Protein Graph with transition_charge feature added. G.graph[“transition_charge_{chain | aggregation_type}”]
- Return type
nx.Graph
- graphein.protein.features.sequence.propy.transition_descriptor(G: networkx.classes.graph.Graph, AAProperty: Dict[Any, Any], AAPName: str, aggregation_type: Optional[List[str]] = None) networkx.classes.graph.Graph [source]#
Compute transition descriptors.
- Parameters
- Returns
Protein Graph with transition_{AAPName} feature added. G.graph[“transition_{AAPName}_{chain | aggregation_type}”]
- Return type
nx.Graph
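A transition descriptor captures how often adjacent residues switch between two property classes. The sketch below counts, for each unordered pair of classes, the fraction of adjacent residue pairs that cross between them. The class grouping and the function name `transition` are assumptions for illustration, not the library's definitions.

```python
from typing import Dict

# Assumed three-class grouping, for illustration only.
TOY_CLASSES: Dict[str, str] = {"1": "KR", "2": "ANCQGHILMFPSTWYV", "3": "DE"}


def transition(sequence: str, classes: Dict[str, str]) -> Dict[str, float]:
    """Fraction of adjacent residue pairs that switch between each pair of classes."""
    if len(sequence) < 2:
        raise ValueError("sequence must contain at least two residues")
    lookup = {aa: label for label, members in classes.items() for aa in members}
    encoded = [lookup[aa] for aa in sequence]
    pairs = list(zip(encoded, encoded[1:]))
    labels = sorted(classes)
    result: Dict[str, float] = {}
    for i, a in enumerate(labels):
        for b in labels[i + 1:]:
            # Count a->b and b->a together, since direction is ignored.
            changes = sum(1 for x, y in pairs if {x, y} == {a, b})
            result[a + b] = changes / len(pairs)
    return result


trans = transition("KAAK", TOY_CLASSES)
```

For `"KAAK"` the adjacent pairs are K-A, A-A, and A-K, so two of three pairs cross between classes 1 and 2 and none involve class 3.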
- graphein.protein.features.sequence.propy.transition_hydrophobicity(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]] = None) networkx.classes.graph.Graph [source]#
Calculate transition descriptors based on Hydrophobicity of AADs.
- Parameters
G (nx.Graph) – Protein Graph to featurise
aggregation_type (List[Optional[str]]) – Aggregation types to use over chains
- Returns
Protein Graph with transition_hydrophobicity feature added. G.graph[“transition_hydrophobicity_{chain | aggregation_type}”]
- Return type
nx.Graph
- graphein.protein.features.sequence.propy.transition_normalized_vdwv(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]]) networkx.classes.graph.Graph [source]#
Calculate transition descriptors based on NormalizedVDWV of AADs.
- Parameters
G (nx.Graph) – Protein Graph to featurise
aggregation_type (List[Optional[str]]) – Aggregation types to use over chains
- Returns
Protein Graph with transition_normalized_vdwv feature added. G.graph[“transition_normalized_vdwv_{chain | aggregation_type}”]
- Return type
nx.Graph
- graphein.protein.features.sequence.propy.transition_polarity(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]]) networkx.classes.graph.Graph [source]#
Calculate transition descriptors based on Polarity of AADs.
- Parameters
G (nx.Graph) – Protein Graph to featurise
aggregation_type (List[Optional[str]]) – Aggregation types to use over chains
- Returns
Protein Graph with transition_polarity feature added. G.graph[“transition_polarity_{chain | aggregation_type}”]
- Return type
nx.Graph
- graphein.protein.features.sequence.propy.transition_polarizability(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]]) networkx.classes.graph.Graph [source]#
Calculate transition descriptors based on Polarizability of AADs.
- Parameters
G (nx.Graph) – Protein graph to featurise
aggregation_type (List[Optional[str]]) – Aggregation types to use over chains
- Returns
Protein Graph with transition_polarizability feature added. G.graph[“transition_polarizability_{chain | aggregation_type}”]
- Return type
nx.Graph
- graphein.protein.features.sequence.propy.transition_secondary_str(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]]) networkx.classes.graph.Graph [source]#
Calculate transition descriptors based on SecondaryStr of AADs.
- Parameters
G (nx.Graph) – Protein graph to featurise
aggregation_type (List[Optional[str]]) – Aggregation types to use over chains
- Returns
Protein Graph with transition_secondary_str feature added. G.graph[“transition_secondary_str_{chain | aggregation_type}”]
- Return type
nx.Graph
- graphein.protein.features.sequence.propy.transition_solvent_accessibility(G: networkx.classes.graph.Graph, aggregation_type: Optional[List[str]]) networkx.classes.graph.Graph [source]#
Calculate transition descriptors based on SolventAccessibility of AADs.
- Parameters
G (nx.Graph) – Protein graph to featurise
aggregation_type (List[Optional[str]]) – Aggregation types to use over chains
- Returns
Protein Graph with transition_solvent_accessibility feature added. G.graph[“transition_solvent_accessibility_{chain | aggregation_type}”]
- Return type
nx.Graph
Functions for graph-level featurization of the sequence of a protein. This submodule is focussed on physicochemical properties of the sequence.
Sequence Utils#
Functions to add embeddings from pre-trained language models to protein structure graphs.
- graphein.protein.features.sequence.embeddings.biovec_sequence_embedding(G: networkx.classes.graph.Graph) networkx.classes.graph.Graph [source]#
Adds BioVec sequence embedding feature to the graph. Computed over chains.
- Source
ProtVec: A Continuous Distributed Representation of Biological Sequences
Paper: http://arxiv.org/pdf/1503.05140v1.pdf
- Parameters
G (nx.Graph) – nx.Graph protein structure graph.
- Returns
nx.Graph protein structure graph with biovec embedding added, e.g. G.graph["biovec_embedding_A"] for chain A.
- Return type
nx.Graph
- graphein.protein.features.sequence.embeddings.compute_esm_embedding(sequence: str, representation: str, model_name: str = 'esm1b_t33_650M_UR50S', output_layer: int = 33) np.ndarray [source]#
Computes sequence embedding using Pre-trained ESM model from FAIR
Biological Structure and Function Emerge from Scaling Unsupervised Learning to 250 Million Protein Sequences (2019) Rives, Alexander and Meier, Joshua and Sercu, Tom and Goyal, Siddharth and Lin, Zeming and Liu, Jason and Guo, Demi and Ott, Myle and Zitnick, C. Lawrence and Ma, Jerry and Fergus, Rob
Transformer protein language models are unsupervised structure learners 2020 Rao, Roshan M and Meier, Joshua and Sercu, Tom and Ovchinnikov, Sergey and Rives, Alexander
Pre-trained models:
           Full Name             layers  params  Dataset  Embedding Dim  Model URL
=========  ====================  ======  ======  =======  =============  =========
ESM-1b     esm1b_t33_650M_UR50S  33      650M    UR50/S   1280           https://dl.fbaipublicfiles.com/fair-esm/models/esm1b_t33_650M_UR50S.pt
ESM1-main  esm1_t34_670M_UR50S   34      670M    UR50/S   1280           https://dl.fbaipublicfiles.com/fair-esm/models/esm1_t34_670M_UR50S.pt
           esm1_t34_670M_UR50D   34      670M    UR50/D   1280           https://dl.fbaipublicfiles.com/fair-esm/models/esm1_t34_670M_UR50D.pt
           esm1_t34_670M_UR100   34      670M    UR100    1280           https://dl.fbaipublicfiles.com/fair-esm/models/esm1_t34_670M_UR100.pt
           esm1_t12_85M_UR50S    12      85M     UR50/S   768            https://dl.fbaipublicfiles.com/fair-esm/models/esm1_t12_85M_UR50S.pt
           esm1_t6_43M_UR50S     6       43M     UR50/S   768            https://dl.fbaipublicfiles.com/fair-esm/models/esm1_t6_43M_UR50S.pt
- Parameters
sequence (str) – Protein sequence to embed.
representation (str) – Type of embedding to extract: "residue" or "sequence". Sequence-level embeddings are averaged residue embeddings.
model_name (str) – Name of pre-trained model to use.
output_layer (int) – Integer indicating which layer the output should be taken from.
- Returns
embedding (np.ndarray)
- Return type
np.ndarray
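The statement that sequence-level embeddings are averaged residue embeddings amounts to mean pooling over the residue axis. The sketch below shows that operation on plain Python lists. It is not the library's code: real ESM outputs are tensors, and whether special tokens are removed before averaging is not stated here. `mean_pool` is a hypothetical name.

```python
from typing import List


def mean_pool(residue_embeddings: List[List[float]]) -> List[float]:
    """Average per-residue vectors into a single sequence-level vector."""
    if not residue_embeddings:
        raise ValueError("at least one residue embedding is required")
    n = len(residue_embeddings)
    dim = len(residue_embeddings[0])
    # Average each embedding dimension across all residues.
    return [sum(row[j] for row in residue_embeddings) / n for j in range(dim)]


# Two residues with 2-dimensional toy embeddings.
sequence_embedding = mean_pool([[1.0, 2.0], [3.0, 4.0]])
```

With a model whose embedding dimension is 1280, a residue-level result would have one 1280-dimensional row per residue, and the pooled result a single 1280-dimensional vector.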
- graphein.protein.features.sequence.embeddings.esm_residue_embedding(G: networkx.classes.graph.Graph, model_name: str = 'esm1b_t33_650M_UR50S', output_layer: int = 33) networkx.classes.graph.Graph [source]#
Computes ESM residue embeddings from a protein sequence and adds them to the graph.
Biological Structure and Function Emerge from Scaling Unsupervised Learning to 250 Million Protein Sequences (2019) Rives, Alexander and Meier, Joshua and Sercu, Tom and Goyal, Siddharth and Lin, Zeming and Liu, Jason and Guo, Demi and Ott, Myle and Zitnick, C. Lawrence and Ma, Jerry and Fergus, Rob
Transformer protein language models are unsupervised structure learners 2020 Rao, Roshan M and Meier, Joshua and Sercu, Tom and Ovchinnikov, Sergey and Rives, Alexander
Pre-trained models
- graphein.protein.features.sequence.embeddings.esm_sequence_embedding(G: networkx.classes.graph.Graph) networkx.classes.graph.Graph [source]#
Computes ESM sequence embedding feature over chains in a graph.
- Parameters
G (nx.Graph) – nx.Graph protein structure graph.
- Returns
nx.Graph protein structure graph with esm embedding features added, e.g. G.graph["esm_embedding_A"] for chain A.
- Return type
nx.Graph
Utils#
Utility functions to work with graph-level features.
- graphein.protein.features.utils.aggregate_graph_feature_over_chains(G: networkx.classes.graph.Graph, feature_name: str, aggregation_type: str) networkx.classes.graph.Graph [source]#
Performs aggregation of a feature over the chains. E.g. sums/averages/min/max molecular weights for each chain.
- Parameters
- Raises
NameError – If aggregation_type is not one of "min", "max", "sum", "mean".
- Returns
nx.Graph of protein with a new aggregated feature G.graph[f"{feature_name}_{aggregation_type}"].
- Return type
nx.Graph
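The aggregation behaviour described above can be sketched with a dictionary in place of `G.graph`. This is an illustration under assumptions, not the library's implementation: the `chain_ids` key and scalar per-chain values are assumed, and `aggregate_over_chains` is a hypothetical name.

```python
from statistics import mean
from typing import Any, Callable, Dict

AGGREGATORS: Dict[str, Callable] = {"min": min, "max": max, "sum": sum, "mean": mean}


def aggregate_over_chains(
    graph_attrs: Dict[str, Any], feature_name: str, aggregation_type: str
) -> Dict[str, Any]:
    """Combine {feature_name}_{chain} values into {feature_name}_{aggregation_type}."""
    if aggregation_type not in AGGREGATORS:
        raise NameError(f"Unsupported aggregation type: {aggregation_type}")
    # Collect the per-chain values in chain order.
    values = [graph_attrs[f"{feature_name}_{chain}"] for chain in graph_attrs["chain_ids"]]
    graph_attrs[f"{feature_name}_{aggregation_type}"] = AGGREGATORS[aggregation_type](values)
    return graph_attrs


attrs = {"chain_ids": ["A", "B"], "molecular_weight_A": 100.0, "molecular_weight_B": 300.0}
aggregate_over_chains(attrs, "molecular_weight", "mean")
aggregate_over_chains(attrs, "molecular_weight", "sum")
```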
- graphein.protein.features.utils.convert_graph_dict_feat_to_series(G: networkx.classes.graph.Graph, feature_name: str) networkx.classes.graph.Graph [source]#
Takes in a graph and a graph-level feature_name. Converts this feature to a pd.Series. This is useful as some features are output as dictionaries and we wish to standardise this.
- Parameters
G (nx.Graph) – nx.Graph containing G.graph[f"{feature_name}"] (Dict[Any, Any]).
feature_name (str) – Name of the dictionary-valued feature to convert.
- Returns
nx.Graph containing G.graph[f"{feature_name}"]: pd.Series.
- Return type
nx.Graph
Subgraphs#
Provides functions for extracting subgraphs from protein graphs.
- graphein.protein.subgraphs.extract_k_hop_subgraph(g: networkx.classes.graph.Graph, central_node: str, k: int, k_only: bool = False, filter_dataframe: bool = True, update_coords: bool = True, recompute_distmat: bool = False, inverse: bool = False, return_node_list: bool = False) Union[networkx.classes.graph.Graph, List[str]] [source]#
Extracts a k-hop subgraph.
- Parameters
g (nx.Graph) – The graph to extract the subgraph from.
central_node (str) – The central node to extract the subgraph from.
k (int) – The number of hops to extract.
k_only (bool) – Whether to extract only the nodes exactly k hops from the central node (e.g. whether to leave 2-hop neighbours out of a 5-hop subgraph). Defaults to False.
filter_dataframe (bool, optional) – Whether to filter the pdb_df of the graph, defaults to True
update_coords (bool) – Whether to update the coordinates of the graph. Defaults to True.
recompute_distmat (bool) – Whether to recompute the distance matrix of the graph. Defaults to False.
inverse (bool, optional) – Whether to invert the selection, defaults to False
return_node_list (bool) – Whether to return the node list. Defaults to False.
- Returns
The subgraph or node list if return_node_list is True.
- Return type
Union[nx.Graph, List[str]]
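The difference between k_only=False and k_only=True can be sketched with a breadth-first search over a plain adjacency dictionary. This is an illustrative sketch, not Graphein's implementation: k_hop_nodes and the toy graph are hypothetical, and shells are defined here by shortest-path distance, which may differ in detail from how the library collects neighbours.

```python
from collections import deque
from typing import Dict, Hashable, List, Set


def k_hop_nodes(
    adjacency: Dict[Hashable, List[Hashable]],
    central_node: Hashable,
    k: int,
    k_only: bool = False,
) -> Set[Hashable]:
    """Return nodes within k hops, or exactly k hops if k_only is True."""
    # hops[node] is the number of edges on a shortest path from central_node.
    hops = {central_node: 0}
    queue = deque([central_node])
    while queue:
        node = queue.popleft()
        if hops[node] == k:
            continue  # Do not expand beyond the k-th shell.
        for neighbour in adjacency[node]:
            if neighbour not in hops:
                hops[neighbour] = hops[node] + 1
                queue.append(neighbour)
    if k_only:
        return {node for node, distance in hops.items() if distance == k}
    return set(hops)


# A simple path graph: a - b - c - d
path = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
within_two = k_hop_nodes(path, "a", 2)            # {"a", "b", "c"}
exactly_two = k_hop_nodes(path, "a", 2, k_only=True)  # {"c"}
```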
- graphein.protein.subgraphs.extract_subgraph(g: networkx.classes.graph.Graph, node_list: Optional[List[str]] = None, sequence_positions: Optional[List[int]] = None, chains: Optional[List[str]] = None, residue_types: Optional[List[str]] = None, atom_types: Optional[List[str]] = None, bond_types: Optional[List[str]] = None, centre_point: Optional[Union[numpy.ndarray, Tuple[float, float, float]]] = None, radius: Optional[float] = None, ss_elements: Optional[List[str]] = None, rsa_threshold: Optional[float] = None, k_hop_central_node: Optional[str] = None, k_hops: Optional[int] = None, k_only: Optional[bool] = None, filter_dataframe: bool = True, update_coords: bool = True, recompute_distmat: bool = False, inverse: bool = False, return_node_list: bool = False) Union[networkx.classes.graph.Graph, List[str]] [source]#
Extracts a subgraph from a graph based on a list of nodes, sequence positions, chains, residue types, atom types, centre point and radius.
- Parameters
g (nx.Graph) – The graph to extract the subgraph from.
node_list (List[str], optional) – List of nodes to extract specified by their node_id. Defaults to None.
sequence_positions (List[int], optional) – The sequence positions to extract. Defaults to None.
chains (List[str], optional) – The chain(s) to extract. Defaults to None.
residue_types (List[str], optional) – List of allowable residue types (3 letter residue names). Defaults to None.
atom_types (List[str], optional) – List of allowable atom types. Defaults to None.
centre_point (Union[np.ndarray, Tuple[float, float, float]], optional) – The centre point to extract the subgraph from. Defaults to None.
radius (float, optional) – The radius to extract the subgraph from. Defaults to None.
ss_elements (List[str], optional) – List of secondary structure elements to extract. One or more of ["H", "B", "E", "G", "I", "T", "S", "-"], corresponding to Alpha helix, Beta bridge, Strand, Helix-3, Helix-5, Turn, Bend, None. Defaults to None.
rsa_threshold (float, optional) – The threshold to use for the RSA. Defaults to None.
k_hop_central_node (str, optional) – The central node to extract the subgraph from. Defaults to None.
k_hops (int, optional) – The number of hops to extract. Defaults to None.
k_only (bool, optional) – Whether to extract only the nodes reached at the k-th hop. If False, nearer neighbours are kept as well (e.g. 2-hop neighbours in a 5-hop subgraph). Defaults to None.
filter_dataframe (bool, optional) – Whether to filter the pdb_df dataframe of the graph. Defaults to True.
update_coords (bool) – Whether to update the coordinates of the graph. Defaults to True.
recompute_distmat (bool) – Whether to recompute the distance matrix of the graph. Defaults to False.
inverse (bool, optional) – Whether to invert the selection. Defaults to False.
- Returns
The subgraph or node list if return_node_list is True.
- Return type
Union[nx.Graph, List[str]]
- graphein.protein.subgraphs.extract_subgraph_by_bond_type(g: networkx.classes.graph.Graph, bond_types: List[str], filter_dataframe: bool = True, update_coords: bool = True, recompute_distmat: bool = False, inverse: bool = False, return_node_list: bool = False) Union[networkx.classes.graph.Graph, List[str]] [source]#
Extracts a subgraph from a graph based on a list of allowable bond types.
- Parameters
g (nx.Graph) – The graph to extract the subgraph from.
bond_types (List[str]) – List of allowable bond types.
filter_dataframe (bool, optional) – Whether to filter the pdb_df of the graph, defaults to True
update_coords (bool) – Whether to update the coordinates of the graph. Defaults to True.
recompute_distmat (bool) – Whether to recompute the distance matrix of the graph. Defaults to False.
inverse (bool, optional) – Whether to invert the selection, defaults to False
return_node_list (bool, optional) – Whether to return the node list, defaults to False
- Returns
The subgraph or node list if return_node_list is True.
- Return type
Union[nx.Graph, List[str]]
- graphein.protein.subgraphs.extract_subgraph_by_sequence_position(g: networkx.classes.graph.Graph, sequence_positions: List[int], filter_dataframe: bool = True, update_coords: bool = True, recompute_distmat: bool = False, inverse: bool = False, return_node_list: bool = False) Union[networkx.classes.graph.Graph, List[str]] [source]#
Extracts a subgraph from a graph based on a list of sequence positions.
- Parameters
g (nx.Graph) – The graph to extract the subgraph from.
sequence_positions (List[int]) – The sequence positions to extract.
filter_dataframe (bool) – Whether to filter the pdb_df dataframe of the graph. Defaults to True.
update_coords (bool) – Whether to update the coordinates of the graph. Defaults to True.
recompute_distmat (bool) – Whether to recompute the distance matrix of the graph. Defaults to False.
inverse (bool) – Whether to invert the selection. Defaults to False.
- Returns
The subgraph or node list if return_node_list is True.
- Return type
Union[nx.Graph, List[str]]
- graphein.protein.subgraphs.extract_subgraph_from_atom_types(g: networkx.classes.graph.Graph, atom_types: List[str], filter_dataframe: bool = True, update_coords: bool = True, recompute_distmat: bool = False, inverse: bool = False, return_node_list: bool = False) Union[networkx.classes.graph.Graph, List[str]] [source]#
Extracts a subgraph from a graph based on a list of atom types.
- Parameters
g (nx.Graph) – The graph to extract the subgraph from.
atom_types (List[str]) – The list of atom types to extract.
filter_dataframe (bool) – Whether to filter the pdb_df dataframe of the graph. Defaults to True.
update_coords (bool) – Whether to update the coordinates of the graph. Defaults to True.
recompute_distmat (bool) – Whether to recompute the distance matrix of the graph. Defaults to False.
inverse (bool) – Whether to invert the selection. Defaults to False.
- Returns
The subgraph or node list if return_node_list is True.
- Return type
Union[nx.Graph, List[str]]
- graphein.protein.subgraphs.extract_subgraph_from_chains(g: networkx.classes.graph.Graph, chains: List[str], filter_dataframe: bool = True, update_coords: bool = True, recompute_distmat: bool = False, inverse: bool = False, return_node_list: bool = False) Union[networkx.classes.graph.Graph, List[str]] [source]#
Extracts a subgraph from a graph based on one or more chains.
- Parameters
g (nx.Graph) – The graph to extract the subgraph from.
chains (List[str]) – The chain(s) to extract.
filter_dataframe (bool) – Whether to filter the pdb_df dataframe of the graph. Defaults to True.
update_coords (bool) – Whether to update the coordinates of the graph. Defaults to True.
recompute_distmat (bool) – Whether to recompute the distance matrix of the graph. Defaults to False.
inverse (bool) – Whether to invert the selection. Defaults to False.
- Returns
The subgraph or node list if return_node_list is True.
- Return type
Union[nx.Graph, List[str]]
- graphein.protein.subgraphs.extract_subgraph_from_node_list(g, node_list: Optional[List[str]], filter_dataframe: bool = True, update_coords: bool = True, recompute_distmat: bool = False, inverse: bool = False, return_node_list: bool = False) Union[networkx.classes.graph.Graph, List[str]] [source]#
Extracts a subgraph from a graph based on a list of nodes.
- Parameters
g (nx.Graph) – The graph to extract the subgraph from.
node_list (List[str]) – The list of nodes to extract.
filter_dataframe (bool) – Whether to filter the pdb_df dataframe of the graph. Defaults to True.
update_coords (bool) – Whether to update the coordinates of the graph. Defaults to True.
recompute_distmat (bool) – Whether to recompute the distance matrix of the graph. Defaults to False.
inverse (bool) – Whether to invert the selection. Defaults to False.
- Returns
The subgraph or node list if return_node_list is True.
- Return type
Union[nx.Graph, List[str]]
- graphein.protein.subgraphs.extract_subgraph_from_point(g: networkx.classes.graph.Graph, centre_point: Union[numpy.ndarray, Tuple[float, float, float]], radius: float, filter_dataframe: bool = True, update_coords: bool = True, recompute_distmat: bool = False, inverse: bool = False, return_node_list: bool = False) Union[networkx.classes.graph.Graph, List[str]] [source]#
Extracts a subgraph from a graph based on a centre point and radius.
- Parameters
g (nx.Graph) – The graph to extract the subgraph from.
centre_point (Union[np.ndarray, Tuple[float, float, float]]) – The centre point of the subgraph.
radius (float) – The radius of the subgraph.
filter_dataframe (bool) – Whether to filter the pdb_df dataframe of the graph. Defaults to True.
update_coords (bool) – Whether to update the coordinates of the graph. Defaults to True.
recompute_distmat (bool) – Whether to recompute the distance matrix of the graph. Defaults to False.
inverse (bool) – Whether to invert the selection. Defaults to False.
- Returns
The subgraph or node list if return_node_list is True.
- Return type
Union[nx.Graph, List[str]]
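Selecting nodes around a centre point amounts to a Euclidean distance test on node coordinates. The sketch below shows that idea in plain Python; nodes_within_radius and the coordinates are hypothetical, and treating the boundary as inclusive (<=) is an assumption rather than documented behaviour.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float, float]


def nodes_within_radius(
    coords: Dict[str, Point], centre_point: Point, radius: float
) -> List[str]:
    """Return ids of nodes whose coordinates lie within radius of centre_point."""
    return [
        node_id
        for node_id, xyz in coords.items()
        # Inclusive boundary is an assumption made for this sketch.
        if math.dist(xyz, centre_point) <= radius
    ]


# Illustrative coordinates in Angstroms.
coords = {
    "A:ALA:1": (0.0, 0.0, 0.0),
    "A:GLY:2": (3.0, 4.0, 0.0),   # 5 A from the origin
    "A:SER:3": (10.0, 0.0, 0.0),  # 10 A from the origin
}
nearby = nodes_within_radius(coords, (0.0, 0.0, 0.0), 6.0)
```

The resulting node list could then be handed to a node-list based extraction, which is one plausible way of combining the two steps.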
- graphein.protein.subgraphs.extract_subgraph_from_residue_types(g: networkx.classes.graph.Graph, residue_types: List[str], filter_dataframe: bool = True, update_coords: bool = True, recompute_distmat: bool = False, inverse: bool = False, return_node_list: bool = False) Union[networkx.classes.graph.Graph, List[str]] [source]#
Extracts a subgraph from a graph based on a list of allowable residue types.
- Parameters
g (nx.Graph) – The graph to extract the subgraph from.
residue_types (List[str]) – List of allowable residue types (3 letter residue names).
filter_dataframe (bool, optional) – Whether to filter the pdb_df of the graph, defaults to True
update_coords (bool) – Whether to update the coordinates of the graph. Defaults to True.
recompute_distmat (bool) – Whether to recompute the distance matrix of the graph. Defaults to False.
inverse (bool) – Whether to invert the selection. Defaults to False.
- Returns
The subgraph or node list if return_node_list is True.
- Return type
Union[nx.Graph, List[str]]
- graphein.protein.subgraphs.extract_subgraph_from_secondary_structure(g: networkx.classes.graph.Graph, ss_elements: List[str], inverse: bool = False, filter_dataframe: bool = True, recompute_distmat: bool = False, update_coords: bool = True, return_node_list: bool = False) Union[networkx.classes.graph.Graph, List[str]] [source]#
Extracts subgraphs for nodes that have a secondary structure element in the list.
- Parameters
g (nx.Graph) – The graph to extract the subgraph from.
ss_elements (List[str]) – List of secondary structure elements to extract.
inverse (bool) – Whether to invert the selection. Defaults to False.
filter_dataframe (bool, optional) – Whether to filter the pdb_df of the graph, defaults to True
recompute_distmat (bool) – Whether to recompute the distance matrix of the graph. Defaults to False.
update_coords (bool) – Whether to update the coordinates of the graph. Defaults to True.
return_node_list (bool) – Whether to return the node list. Defaults to False.
- Raises
ProteinGraphConfigurationError – If the graph does not contain ss features on the nodes ("ss" not in d.keys() for _, d in g.nodes(data=True)).
- Returns
The subgraph or node list if return_node_list is True.
- Return type
Union[nx.Graph, List[str]]
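A minimal sketch of selecting nodes by secondary structure, including the inverse flag and the check for a missing "ss" attribute, might look as follows. It uses plain dictionaries rather than a graph, and MissingFeatureError is a hypothetical stand-in for ProteinGraphConfigurationError.

```python
from typing import Dict, Iterable, List, Mapping


class MissingFeatureError(ValueError):
    """Hypothetical stand-in for ProteinGraphConfigurationError."""


def select_by_secondary_structure(
    node_attrs: Mapping[str, Mapping[str, str]],
    ss_elements: Iterable[str],
    inverse: bool = False,
) -> List[str]:
    """Return node ids whose "ss" label is in ss_elements (or not, if inverse)."""
    for node_id, attrs in node_attrs.items():
        if "ss" not in attrs:
            raise MissingFeatureError(f"Node {node_id} has no 'ss' feature.")
    wanted = set(ss_elements)
    # With inverse=True the membership test is flipped.
    return [
        node_id
        for node_id, attrs in node_attrs.items()
        if (attrs["ss"] in wanted) != inverse
    ]


# Illustrative DSSP-style labels.
labelled: Dict[str, Dict[str, str]] = {
    "A:ALA:1": {"ss": "H"},
    "A:GLY:2": {"ss": "E"},
    "A:SER:3": {"ss": "-"},
}
helices = select_by_secondary_structure(labelled, ["H", "G", "I"])
non_helices = select_by_secondary_structure(labelled, ["H", "G", "I"], inverse=True)
```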
- graphein.protein.subgraphs.extract_surface_subgraph(g: networkx.classes.graph.Graph, rsa_threshold: float = 0.2, inverse: bool = False, filter_dataframe: bool = True, recompute_distmat: bool = False, update_coords: bool = True, return_node_list: bool = False) Union[networkx.classes.graph.Graph, List[str]] [source]#
Extracts a subgraph based on thresholding the Relative Solvent Accessibility (RSA). This can be used for extracting a surface graph.
- Parameters
g (nx.Graph) – The graph to extract the subgraph from.
rsa_threshold (float) – The threshold to use for the RSA. Defaults to 0.2 (20%)
filter_dataframe (bool, optional) – Whether to filter the pdb_df of the graph, defaults to True
update_coords (bool) – Whether to update the coordinates of the graph. Defaults to True.
recompute_distmat (bool) – Whether to recompute the distance matrix of the graph. Defaults to False.
inverse (bool, optional) – Whether to invert the selection, defaults to False
return_node_list (bool) – Whether to return the node list. Defaults to False.
- Raises
ProteinGraphConfigurationError – If the graph does not contain RSA features on the nodes ("rsa" not in d.keys() for _, d in g.nodes(data=True)).
- Returns
The subgraph or node list if return_node_list is True.
- Return type
Union[nx.Graph, List[str]]
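Thresholding on RSA can be sketched in a few lines. The helper surface_nodes and the RSA values are hypothetical, and keeping nodes with RSA greater than or equal to the threshold is an assumption about the direction and inclusiveness of the comparison.

```python
from typing import Dict, List


def surface_nodes(rsa_by_node: Dict[str, float], rsa_threshold: float = 0.2) -> List[str]:
    """Return node ids whose relative solvent accessibility meets the threshold."""
    # Assumption: exposed residues are those with RSA >= rsa_threshold.
    return [node_id for node_id, rsa in rsa_by_node.items() if rsa >= rsa_threshold]


# Illustrative RSA values in the range [0, 1].
rsa = {"A:ALA:1": 0.05, "A:LYS:2": 0.65, "A:GLU:3": 0.30}
exposed = surface_nodes(rsa)
very_exposed = surface_nodes(rsa, rsa_threshold=0.5)
```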
Analysis#
Contains utilities for computing analytics on and plotting summaries of Protein Structure Graphs.
- graphein.protein.analysis.graph_summary(G: networkx.classes.graph.Graph, summary_statistics: List[str] = ['degree', 'betweenness_centrality', 'closeness_centrality', 'eigenvector_centrality', 'communicability_betweenness_centrality'], custom_data: Optional[Union[pandas.core.frame.DataFrame, pandas.core.series.Series]] = None, plot: bool = False) pandas.core.frame.DataFrame [source]#
Returns a summary of the graph in a dataframe.
- Parameters
G (nx.Graph) – NetworkX graph to get summary of.
plot (bool) – Whether or not to plot the summary as a heatmap, defaults to False.
- Returns
Dataframe of summary or plot.
- Return type
pd.DataFrame
- graphein.protein.analysis.plot_degree_by_residue_type(g: nx.Graph, normalise_by_residue_occurrence: bool = True) plotly.graph_objects.Figure [source]#
Plots the distribution of node degrees by residue type.
- Parameters
g (nx.Graph) – networkx graph whose node degrees are plotted by residue type.
normalise_by_residue_occurrence (bool) – Whether to normalise the degree by the number of residues of the same type.
- Returns
Plotly figure.
- Return type
plotly.graph_objects.Figure
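The quantity being plotted can be sketched without any plotting code: sum the node degrees per residue type and, optionally, divide by how often that residue type occurs. The helper below is a hypothetical illustration under that reading of normalise_by_residue_occurrence, not the library's code.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def degree_by_residue_type(
    residue_of: Dict[str, str],
    edges: Iterable[Tuple[str, str]],
    normalise_by_residue_occurrence: bool = True,
) -> Dict[str, float]:
    """Total (or mean) node degree per residue type."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    total = Counter()
    for node_id, residue in residue_of.items():
        total[residue] += degree[node_id]
    if not normalise_by_residue_occurrence:
        return dict(total)
    occurrences = Counter(residue_of.values())
    # Dividing by the occurrence count gives the mean degree per residue type.
    return {residue: total[residue] / occurrences[residue] for residue in occurrences}


residue_of = {"A:ALA:1": "ALA", "A:GLY:2": "GLY", "A:ALA:3": "ALA"}
edges = [("A:ALA:1", "A:GLY:2"), ("A:GLY:2", "A:ALA:3")]
normalised = degree_by_residue_type(residue_of, edges)
raw = degree_by_residue_type(residue_of, edges, normalise_by_residue_occurrence=False)
```

Normalising makes rare and common residue types comparable, since otherwise abundant residues dominate simply by count.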
- graphein.protein.analysis.plot_degree_distribution(g: nx.Graph, title: Optional[str] = None) plotly.graph_objects.Figure [source]#
Plots the distribution of node degrees in the graph.
- Parameters
g (nx.Graph) – networkx graph to plot the distribution of node degrees in.
title (Optional[str], optional) – Title of plot. Defaults to None.
- Returns
Plotly figure.
- Return type
plotly.graph_objects.Figure
- graphein.protein.analysis.plot_edge_type_distribution(g: nx.Graph, plot_type: str = 'bar', title: Optional[str] = None) plotly.graph_objects.Figure [source]#
Plots the distribution of edge types in the graph.
- Parameters
- Returns
Plotly figure.
- Return type
plotly.graph_objects.Figure
- graphein.protein.analysis.plot_graph_metric_property_correlation(g: nx.Graph, summary_statistics: List[str] = ['degree', 'betweenness_centrality', 'closeness_centrality', 'eigenvector_centrality', 'communicability_betweenness_centrality'], properties: List[str] = ['asa'], colour_by: Optional[str] = 'residue_type', opacity: float = 0.2, diagonal_visible: bool = True, title: Optional[str] = None, height: int = 1000, width: int = 1000, font_size: int = 10) plotly.graph_objects.Figure [source]#
Plots the correlation between graph metrics and properties.
- Parameters
g (nx.Graph) – Protein graph to plot the correlation of.
summary_statistics (List[str], optional) – List of graph metrics to employ in plot, defaults to ["degree", "betweenness_centrality", "closeness_centrality", "eigenvector_centrality", "communicability_betweenness_centrality"].
properties (List[str], optional) – List of node properties to use in plot, defaults to ["asa"].
colour_by (Optional[str], optional) – Controls colouring of points in plot. Options: "residue_type", "position", "chain", defaults to "residue_type".
opacity (float, optional) – Opacity of plot points, defaults to 0.2.
diagonal_visible (bool, optional) – Whether or not to show the diagonal plots, defaults to True.
title (Optional[str], optional) – Title of plot, defaults to None.
height (int, optional) – Height of plot, defaults to 1000.
width (int, optional) – Width of plot, defaults to 1000.
font_size (int, optional) – Font size for plot text, defaults to 10.
- Returns
Scatter plot matrix of graph metrics and protein properties.
- Return type
plotly.graph_objects.Figure
- graphein.protein.analysis.plot_residue_composition(g: nx.Graph, sort_by: Optional[str] = None, plot_type: str = 'bar') plotly.graph_objects.Figure [source]#
Plots the residue composition of the graph.
- Parameters
- Raises
ValueError – If sort_by is not one of "alphabetical", "count".
- Returns
Plotly figure.
- Return type
plotly.graph_objects.Figure
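The data behind a residue composition plot is a count per residue type, optionally sorted. The sketch below is a hypothetical illustration of the two documented sort_by options; sorting counts in descending order is an assumption.

```python
from collections import Counter
from typing import Iterable, List, Optional, Tuple


def residue_composition(
    residue_names: Iterable[str], sort_by: Optional[str] = None
) -> List[Tuple[str, int]]:
    """Count residues by type, optionally sorted alphabetically or by count."""
    counts = Counter(residue_names)
    if sort_by is None:
        return list(counts.items())
    if sort_by == "alphabetical":
        return sorted(counts.items())
    if sort_by == "count":
        # Descending order is an assumption made for this sketch.
        return sorted(counts.items(), key=lambda item: item[1], reverse=True)
    raise ValueError(f"sort_by must be 'alphabetical' or 'count', got {sort_by!r}")


residues = ["GLY", "ALA", "GLY", "SER", "GLY", "ALA"]
by_name = residue_composition(residues, sort_by="alphabetical")
by_count = residue_composition(residues, sort_by="count")
```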
Meshes#
Functions to create protein meshes via pymol.
- graphein.protein.meshes.check_for_pymol_installation()[source]#
Checks for the presence of a PyMol installation.
- graphein.protein.meshes.configure_pymol_session(config: Optional[graphein.protein.config.ProteinMeshConfig] = None)[source]#
Configures a PyMol session based on config.parse_pymol_commands. Uses default parameters "-cKq".
See: https://pymolwiki.org/index.php/Command_Line_Options
- Parameters
config (graphein.protein.config.ProteinMeshConfig) – ProteinMeshConfig to use. Defaults to None which uses default config.
- graphein.protein.meshes.convert_verts_and_face_to_mesh(verts: torch.FloatTensor, faces: NamedTuple) Meshes [source]#
Converts vertices and faces into a pytorch3d.structures Meshes object.
- Parameters
verts (torch.FloatTensor) – Vertices.
faces (NamedTuple) – Faces.
- Returns
Meshes object.
- Return type
pytorch3d.structures.Meshes
- graphein.protein.meshes.create_mesh(pdb_file: Optional[str] = None, pdb_code: Optional[str] = None, out_dir: Optional[str] = None, config: Optional[ProteinMeshConfig] = None) Tuple[torch.FloatTensor, NamedTuple, NamedTuple] [source]#
Creates a PyTorch3D mesh from a pdb_file or pdb_code.
- Parameters
pdb_file (str, optional) – path to pdb_file. Defaults to None.
pdb_code (str, optional) – 4-letter PDB accession code. Defaults to None.
out_dir (str, optional) – output directory to store .obj file. Defaults to None.
config (graphein.protein.config.ProteinMeshConfig) – ProteinMeshConfig config to use. Defaults to default config in graphein.protein.config.
- Returns
verts, faces, aux.
- Return type
Tuple[torch.FloatTensor, NamedTuple, NamedTuple]
- graphein.protein.meshes.get_obj_file(pdb_file: Optional[str] = None, pdb_code: Optional[str] = None, out_dir: Optional[str] = None, config: Optional[graphein.protein.config.ProteinMeshConfig] = None) str [source]#
Runs PyMol to compute surface/mesh for a given protein.
- Parameters
pdb_file (str, optional) – path to pdb_file to use. Defaults to None.
pdb_code (str, optional) – 4-letter pdb accession code. Defaults to None.
out_dir (str, optional) – path to output. Defaults to None.
config (graphein.protein.config.ProteinMeshConfig) – ProteinMeshConfig containing pymol commands to run. Default is None ("show surface").
- Raises
ValueError – If both or neither of pdb_file and pdb_code are provided.
- Returns
Path to the .obj file.
- Return type
str
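The "exactly one of pdb_file or pdb_code" rule can be checked with a single comparison. The following is a minimal sketch of that validation; resolve_structure_source is a hypothetical helper, not part of the library.

```python
from typing import Optional


def resolve_structure_source(
    pdb_file: Optional[str] = None, pdb_code: Optional[str] = None
) -> str:
    """Return whichever of pdb_file or pdb_code was given, requiring exactly one."""
    # Both None, or both set, compare equal here and are rejected.
    if (pdb_file is None) == (pdb_code is None):
        raise ValueError("Provide exactly one of pdb_file or pdb_code.")
    return pdb_file if pdb_file is not None else pdb_code


source = resolve_structure_source(pdb_code="3eiy")
```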
- graphein.protein.meshes.normalize_and_center_mesh_vertices(verts: torch.FloatTensor) torch.FloatTensor [source]#
Scales and centers the target mesh so that it fits in a sphere of radius 1 centered at (0,0,0). (scale, center) will be used to bring the predicted mesh back to its original center and scale. Note that normalizing the target mesh speeds up the optimization but is not necessary.
- Parameters
verts (torch.FloatTensor) – Mesh vertices.
- Returns
Normalized and centered vertices.
- Return type
torch.FloatTensor
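The centering and scaling step can be sketched in plain Python on a list of 3D points. This is an illustration of the idea only: the helper name is hypothetical, it uses tuples rather than torch tensors, and it scales by the largest distance from the centroid, whereas the library may derive the scale differently (for example from the largest absolute coordinate).

```python
import math
from typing import List, Tuple

Point = Tuple[float, float, float]


def normalize_and_center(verts: List[Point]) -> Tuple[List[Point], float, Point]:
    """Center vertices on their centroid and scale them into the unit sphere."""
    n = len(verts)
    center = tuple(sum(v[i] for v in verts) / n for i in range(3))
    centered = [tuple(v[i] - center[i] for i in range(3)) for v in verts]
    # Largest distance from the centroid; assumes the points are not all identical.
    scale = max(math.sqrt(sum(c * c for c in v)) for v in centered)
    normalized = [tuple(c / scale for c in v) for v in centered]
    return normalized, scale, center


verts = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0)]
normalized, scale, center = normalize_and_center(verts)
# The original coordinates are recovered by multiplying by scale and adding center.
restored = [tuple(c * scale + center[i] for i, c in enumerate(v)) for v in normalized]
```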
- graphein.protein.meshes.parse_pymol_commands(config: graphein.protein.config.ProteinMeshConfig) List[str] [source]#
Parses pymol commands from config. At the moment users can only supply a list of string commands.
- Parameters
config (ProteinMeshConfig) – ProteinMeshConfig containing pymol commands to run in config.pymol_commands.
- Returns
list of pymol commands to run
- Return type
List[str]
Visualisation#
Functions for plotting protein graphs and meshes.
- graphein.protein.visualisation.add_vector_to_plot(g: networkx.classes.graph.Graph, fig, vector: str = 'sidechain_vector', scale: float = 5, colour: str = 'red', width: int = 10) plotly.graph_objs._figure.Figure [source]#
Adds representations of vector features to the protein graph.
Requires that all nodes have a vector feature (1 x 3 array).
- Parameters
g (nx.Graph) – Protein graph containing vector features
fig (go.Figure) – 3D plotly figure to add vectors to.
vector (str, optional) – Name of node vector feature to add, defaults to “sidechain_vector”
scale (float, optional) – How much to scale the vectors by, defaults to 5
colour (str, optional) – Colours for vectors, defaults to “red”
- Returns
3D Plotly plot with vectors added.
- Return type
go.Figure
- graphein.protein.visualisation.asteroid_plot(g: nx.Graph, node_id: str, k: int = 2, colour_nodes_by: str = 'shell', colour_edges_by: str = 'kind', edge_colour_map: plt.cm.Colormap = <matplotlib.colors.ListedColormap object>, show_labels: bool = True, title: Optional[str] = None, width: int = 600, height: int = 500, use_plotly: bool = True, show_edges: bool = False, node_size_multiplier: float = 10) Union[plotly.graph_objects.Figure, matplotlib.figure.Figure] [source]#
Plots a k-hop subgraph around a node as concentric shells.
Radius of each point is proportional to the degree of the node (modified by node_size_multiplier).
- Parameters
g (nx.Graph) – NetworkX graph to plot.
node_id (str) – Node to centre the plot around.
k (int) – Number of hops to plot. Defaults to 2.
colour_nodes_by (str) – Colour the nodes by this attribute. Currently only "shell" is supported.
colour_edges_by (str) – Colour the edges by this attribute. Currently only "kind" is supported.
edge_colour_map (plt.cm.Colormap) – Colour map for edges. Defaults to plt.cm.plasma.
title (str) – Title of the plot. Defaults to None.
width (int) – Width of the plot. Defaults to 600.
height (int) – Height of the plot. Defaults to 500.
use_plotly (bool) – Use plotly to render the graph. Defaults to True.
show_edges (bool) – Whether or not to show edges in the plot. Defaults to False.
node_size_multiplier (float) – Multiplier for the size of the nodes. Defaults to 10.
- Returns
Plotly figure or matplotlib figure.
- Return type
Union[plotly.graph_objects.Figure, matplotlib.figure.Figure]
- graphein.protein.visualisation.colour_edges(G: networkx.classes.graph.Graph, colour_map: matplotlib.colors.ListedColormap, colour_by: str = 'kind') List[Tuple[float, float, float, float]] [source]#
Computes edge colours based on the kind of bond/interaction.
- Parameters
G (nx.Graph) – nx.Graph protein structure graph to compute edge colours from.
colour_map (matplotlib.colors.ListedColormap) – Colourmap to use.
colour_by (str) – Edge attribute to colour by. Currently only "kind" is supported.
- Returns
List of edge colours.
- Return type
List[Tuple[float, float, float, float]]
- graphein.protein.visualisation.colour_nodes(G: networkx.classes.graph.Graph, colour_by: str, colour_map: matplotlib.colors.ListedColormap = <matplotlib.colors.ListedColormap object>) List[Tuple[float, float, float, float]] [source]#
Computes node colours based on "degree", "seq_position" or node attributes.
- Parameters
G (nx.Graph) – Graph to compute node colours for.
colour_map (matplotlib.colors.ListedColormap) – Colourmap to use.
colour_by (str) – Manner in which to colour nodes. If not "degree" or "seq_position", this must correspond to a node feature.
- Returns
List of node colours.
- Return type
List[Tuple[float, float, float, float]]
- graphein.protein.visualisation.plot_chord_diagram(g: networkx.classes.graph.Graph, show_names: bool = True, order: Optional[List] = None, width: float = 0.1, pad: float = 2.0, gap: float = 0.03, chordwidth: float = 0.7, ax=None, colors=None, cmap=None, alpha=0.7, use_gradient: bool = False, chord_colors=None, show: bool = False, **kwargs)[source]#
Plot a chord diagram.
Based on Tanguy Fardet’s implementation: https://github.com/tfardet/mpl_chord_diagram
- Parameters
g (nx.Graph) – NetworkX graph to plot. Flux data are read from its adjacency matrix, where mat[i, j] is the flux from i to j.
show_names (bool) – Whether to show the names of the nodes
order – list, optional (default: order of the matrix entries) Order in which the arcs should be placed around the trigonometric circle.
width (float) – float, optional (default: 0.1) Width/thickness of the ideogram arc.
pad (float) – float, optional (default: 2) Distance between two neighboring ideogram arcs. Unit: degree.
gap (float) – float, optional (default: 0.03) Distance between the arc and the beginning of the chord.
chordwidth – float, optional (default: 0.7) Position of the control points for the chords, controlling their shape.
ax – matplotlib axis, optional (default: new axis) Matplotlib axis where the plot should be drawn.
colors – list, optional (default: from cmap) List of user defined colors or floats.
cmap – str or colormap object (default: viridis) Colormap that will be used to color the arcs and chords by default. See chord_colors to use different colors for chords.
alpha – float in [0, 1], optional (default: 0.7) Opacity of the chord diagram.
use_gradient (bool) – bool, optional (default: False) Whether a gradient should be used so that chord extremities have the same color as the arc they belong to.
chord_colors – str, or list of colors, optional (default: None) Specify color(s) to fill the chords differently from the arcs. When the keyword is not used, chord colors default to the colormap given by colors. Possible values for chord_colors are:
a single color (do not use an RGB tuple, use hex format instead), e.g. “red” or “#ff0000”; all chords will have this color
a list of colors, e.g. ["red", "green", "blue"], one per node (in this case, RGB tuples are accepted as entries to the list). Each chord will get its color from its associated source node, or from both nodes if use_gradient is True.
show – bool, optional (default: False) Whether the plot should be displayed immediately via an automatic call to plt.show().
kwargs (Dict[str, Any]) – keyword arguments. Available kwargs are:
fontcolor (str or list) – Color of the names
fontsize (int) – Size of the font for names
rotate_names ((list of) bool(s)) – Rotate names by 90°
sort (str) – Either “size” or “distance”
zero_entry_size (float) – Size of zero-weight reciprocal
- graphein.protein.visualisation.plot_distance_landscape(g: Optional[networkx.classes.graph.Graph] = None, dist_mat: Optional[numpy.ndarray] = None, add_contour: bool = True, title: Optional[str] = None, width: int = 500, height: int = 500, autosize: bool = False) plotly.graph_objs._figure.Figure [source]#
Plots a distance landscape of the graph.
- Parameters
g (nx.Graph) – Graph to plot (must contain a distance matrix in g.graph["dist_mat"]).
add_contour (bool, optional) – Whether or not to show the contour, defaults to True.
width (int, optional) – Plot width, defaults to 500.
height (int, optional) – Plot height, defaults to 500.
autosize (bool, optional) – Whether or not to autosize the plot, defaults to False.
- Returns
Plotly figure of distance landscape.
- Return type
go.Figure
- graphein.protein.visualisation.plot_distance_matrix(g: Optional[networkx.classes.graph.Graph], dist_mat: Optional[numpy.ndarray] = None, use_plotly: bool = True, title: Optional[str] = None, show_residue_labels: bool = True) plotly.graph_objs._figure.Figure [source]#
Plots a distance matrix of the graph.
- Parameters
g (nx.Graph, optional) – NetworkX graph containing a distance matrix as a graph attribute (g.graph['dist_mat']).
dist_mat (np.ndarray, optional) – Distance matrix to plot. If not provided, the distance matrix is taken from the graph. Defaults to None.
use_plotly (bool) – Whether to use plotly or seaborn for plotting. Defaults to True.
title (str, optional) – Title of the plot. Defaults to None.
show_residue_labels (bool) – Whether to show residue labels on the plot. Defaults to True.
- Raises
ValueError – If neither a graph g nor a dist_mat is provided.
- Returns
Plotly figure.
- Return type
go.Figure
- graphein.protein.visualisation.plot_pointcloud(mesh: Meshes, title: str = '') Axes3D [source]#
Plots pytorch3d Meshes object as pointcloud.
- Parameters
mesh (pytorch3d.structures.meshes.Meshes) – Meshes object to plot.
title (str) – Title of plot.
- Returns
returns Axes3D containing plot
- Return type
Axes3D
- graphein.protein.visualisation.plot_protein_structure_graph(G: networkx.classes.graph.Graph, angle: int = 30, plot_title: typing.Optional[str] = None, figsize: typing.Tuple[int, int] = (10, 7), node_alpha: float = 0.7, node_size_min: float = 20.0, node_size_multiplier: float = 20.0, label_node_ids: bool = True, node_colour_map=<matplotlib.colors.ListedColormap object>, edge_color_map=<matplotlib.colors.ListedColormap object>, colour_nodes_by: str = 'degree', colour_edges_by: str = 'kind', edge_alpha: float = 0.5, plot_style: str = 'ggplot', out_path: typing.Optional[str] = None, out_format: str = '.png') mpl_toolkits.mplot3d.axes3d.Axes3D [source]#
Plots protein structure graph in Axes3D.
- Parameters
G (nx.Graph) – nx.Graph Protein Structure graph to plot.
angle (int) – View angle. Defaults to 30.
plot_title (str, optional) – Title of plot. Defaults to None.
figsize (Tuple[int, int]) – Size of figure, defaults to (10, 7).
node_alpha (float) – Controls node transparency, defaults to 0.7.
node_size_min (float) – Specifies node minimum size, defaults to 20.
node_size_multiplier (float) – Scales node size by a constant. Node sizes reflect degree. Defaults to 20.
label_node_ids (bool) – bool indicating whether or not to plot node_id labels. Defaults to True.
node_colour_map (plt.cm) – colour map to use for nodes. Defaults to plt.cm.plasma.
edge_color_map (plt.cm) – colour map to use for edges. Defaults to plt.cm.plasma.
colour_nodes_by (str) – Specifies how to colour nodes. "degree", "seq_position" or a node feature.
colour_edges_by (str) – Specifies how to colour edges. Currently only "kind" is supported.
edge_alpha (float) – Controls edge transparency. Defaults to 0.5.
plot_style (str) – matplotlib style sheet to use. Defaults to "ggplot".
out_path (str, optional) – If not none, writes plot to this location. Defaults to None (does not save).
out_format (str) – Fileformat to use for plot
- Returns
matplotlib Axes3D object.
- Return type
Axes3D
- graphein.protein.visualisation.plotly_protein_structure_graph(G: networkx.classes.graph.Graph, plot_title: typing.Optional[str] = None, figsize: typing.Tuple[int, int] = (620, 650), node_alpha: float = 0.7, node_size_min: float = 20.0, node_size_multiplier: float = 20.0, label_node_ids: bool = True, node_colour_map=<matplotlib.colors.ListedColormap object>, edge_color_map=<matplotlib.colors.ListedColormap object>, colour_nodes_by: str = 'degree', colour_edges_by: str = 'kind') plotly.graph_objs._figure.Figure [source]#
Plots protein structure graph using plotly.
- Parameters
G (nx.Graph) – nx.Graph Protein Structure graph to plot
plot_title (str, optional) – Title of plot, defaults to None.
figsize (Tuple[int, int]) – Size of figure, defaults to (620, 650).
node_alpha (float) – Controls node transparency, defaults to 0.7.
node_size_min (float) – Specifies node minimum size. Defaults to 20.0.
node_size_multiplier (float) – Scales node size by a constant. Node sizes reflect degree. Defaults to 20.0.
label_node_ids (bool) – bool indicating whether or not to plot node_id labels. Defaults to True.
node_colour_map (plt.cm) – colour map to use for nodes. Defaults to plt.cm.plasma.
edge_color_map (plt.cm) – colour map to use for edges. Defaults to plt.cm.plasma.
colour_nodes_by (str) – Specifies how to colour nodes. "degree", "seq_position" or a node feature.
colour_edges_by (str) – Specifies how to colour edges. Currently only "kind" is supported.
- Returns
Plotly Graph Objects plot
- Return type
go.Figure
Utils#
Provides utility functions for use across Graphein.
- exception graphein.protein.utils.ProteinGraphConfigurationError(message: str)[source]#
Exception raised when an invalid graph configuration is provided to a downstream function or method.
- graphein.protein.utils.compute_rgroup_dataframe(pdb_df: pandas.core.frame.DataFrame) pandas.core.frame.DataFrame [source]#
Return the atoms that are in R-groups and not the backbone chain.
- Parameters
pdb_df (pd.DataFrame) – DataFrame to compute R group dataframe from.
- Returns
Dataframe containing R-groups only (backbone atoms removed).
- Return type
pd.DataFrame
- graphein.protein.utils.download_alphafold_structure(uniprot_id: str, version: int = 2, out_dir: str = '.', rename: bool = True, pdb: bool = True, mmcif: bool = False, aligned_score: bool = True) Union[str, Tuple[str, str]] [source]#
Downloads a structure from the AlphaFold EBI database (https://alphafold.ebi.ac.uk/files/).
- Parameters
uniprot_id (str) – UniProt ID of desired protein.
version (int) – Version of the structure to download. Defaults to 2.
out_dir (str) – String specifying desired output location. Defaults to the current working directory.
rename (bool) – Boolean specifying whether to rename the output file to $uniprot_id.pdb. Defaults to True.
pdb (bool) – Boolean specifying whether to download the PDB file. Defaults to True.
mmcif (bool) – Boolean specifying whether to download mmCIF or PDB. Defaults to False (downloads PDB).
aligned_score (bool) – Boolean specifying whether or not to download the score alignment JSON. Defaults to True.
- Returns
path to output. Tuple if several outputs specified.
- Return type
Union[str, Tuple[str, str]]
- graphein.protein.utils.download_pdb(config, pdb_code: str) pathlib.Path [source]#
Download PDB structure from PDB.
If no structure is found, we perform a lookup against the record of obsolete PDB codes (ftp://ftp.wwpdb.org/pub/pdb/data/status/obsolete.dat)
- graphein.protein.utils.filter_dataframe(dataframe: pandas.core.frame.DataFrame, by_column: str, list_of_values: List[Any], boolean: bool) pandas.core.frame.DataFrame [source]#
Filter function for dataframe.
Filters the dataframe such that the by_column values have to be in the list_of_values list if boolean == True, or not in the list if boolean == False.
- Parameters
- Returns
Filtered dataframe.
- Return type
pd.DataFrame
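The membership rule described for filter_dataframe can be illustrated without pandas. The following is a minimal sketch on plain list-of-dict rows; `filter_rows` is a hypothetical helper written for this illustration and is not part of Graphein, whose function operates on a pd.DataFrame. The rows are made-up examples.

```python
from typing import Any, Dict, List


def filter_rows(
    rows: List[Dict[str, Any]],
    by_column: str,
    list_of_values: List[Any],
    boolean: bool,
) -> List[Dict[str, Any]]:
    """Keep rows whose `by_column` value is in `list_of_values` when
    `boolean` is True, or is not in it when `boolean` is False."""
    return [row for row in rows if (row[by_column] in list_of_values) == boolean]


# Illustrative rows only; not taken from a real structure.
rows = [
    {"residue_name": "ALA", "atom_name": "CA"},
    {"residue_name": "HOH", "atom_name": "O"},
    {"residue_name": "GLY", "atom_name": "CA"},
]

# boolean=False drops the listed values; boolean=True keeps only them.
kept = filter_rows(rows, "residue_name", ["HOH"], boolean=False)
only_water = filter_rows(rows, "residue_name", ["HOH"], boolean=True)
```

With a DataFrame, the same rule is usually written with a boolean mask over the column; the list comprehension here only mirrors that logic.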
- graphein.protein.utils.get_obsolete_mapping() Dict[str, str] [source]#
Returns a dictionary mapping obsolete PDB codes to their replacement.
- graphein.protein.utils.get_protein_name_from_filename(pdb_path: str) str [source]#
Extracts a filename from a pdb_path.
- graphein.protein.utils.is_tool(name: str) bool [source]#
Checks whether name is on PATH and is marked as an executable.
Source: https://stackoverflow.com/questions/11210104/check-if-a-program-exists-from-a-python-script
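A check of this kind can be written with the standard library. The sketch below assumes `shutil.which` is an acceptable way to perform the lookup; whether Graphein's is_tool is implemented this way is not stated here, and `is_tool_sketch` and the tool names are hypothetical.

```python
import shutil


def is_tool_sketch(name: str) -> bool:
    """Return True if `name` resolves to an executable file on PATH."""
    # shutil.which returns the resolved path, or None when nothing matches.
    return shutil.which(name) is not None


# The first result depends on the local machine; "mkdssp" is only an example name.
has_dssp = is_tool_sketch("mkdssp")
missing = is_tool_sketch("surely-not-an-installed-tool-0x5f")
```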
- graphein.protein.utils.save_graph_to_pdb(g: networkx.classes.graph.Graph, path: str, gz: bool = False)[source]#
Saves processed pdb_df (g.graph["pdb_df"]) dataframe to a PDB file.
N.B. PDBs do not contain connectivity information. This only captures the nodes in the graph. Connectivity is filled in according to standard rules by visualisation programs.
- graphein.protein.utils.save_pdb_df_to_pdb(df: pandas.core.frame.DataFrame, path: str, gz: bool = False)[source]#
Saves pdb dataframe to a PDB file.
- graphein.protein.utils.save_rgroup_df_to_pdb(g: networkx.classes.graph.Graph, path: str, gz: bool = False)[source]#
Saves R-group (g.graph["rgroup_df"]) dataframe to a PDB file.
N.B. PDBs do not contain connectivity information. This only captures the atoms in the R-groups. Connectivity is filled in according to standard rules by visualisation programs.
Constants#
Author: Eric J. Ma, Arian Jamasb
Purpose: This is a set of utility variables and functions that can be used across the Graphein project.
These include various collections of standard & non-standard/modified amino acids and their names, identifiers and properties.
We also include mappings of covalent radii and bond lengths for the amino acids used in assembling atomic protein graphs.
- graphein.protein.resi_atoms.AA_RING_ATOMS: Dict[str, List[str]] = {'HIS': ['CG', 'CD', 'CE', 'ND', 'NE'], 'PHE': ['CG', 'CD', 'CE', 'CZ'], 'TRP': ['CD', 'CE', 'CH', 'CZ'], 'TYR': ['CG', 'CD', 'CE', 'CZ']}#
Dictionary mapping amino acid 3-letter codes to lists of atoms that are part of rings.
- graphein.protein.resi_atoms.AMINO_ACIDS: List[str] = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z']#
Vocabulary of amino acids with one-letter codes. Includes fuzzy standard amino acids: "B" denotes "ASX" which corresponds to "ASP" ("D") or "ASN" ("N") and "Z" denotes "GLX" which corresponds to "GLU" ("E") or "GLN" ("Q").
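The fuzzy codes can be expanded into the residues they may stand for. This is a small sketch of that expansion; `FUZZY_CODES` and `possible_residues` are hypothetical names introduced here and are not Graphein constants or functions.

```python
from typing import Dict, List, Set

# Ambiguity codes as described above: B is D or N, Z is E or Q.
# Written out for illustration; not a Graphein constant.
FUZZY_CODES: Dict[str, List[str]] = {"B": ["D", "N"], "Z": ["E", "Q"]}


def possible_residues(code: str) -> Set[str]:
    """Return the set of one-letter residues a code may denote."""
    # Unambiguous codes simply map to themselves.
    return set(FUZZY_CODES.get(code, [code]))


expanded = [possible_residues(code) for code in "ABZ"]
```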
- graphein.protein.resi_atoms.AROMATIC_RESIS: List[str] = ['PHE', 'TRP', 'HIS', 'TYR']#
List of aromatic residues.
- graphein.protein.resi_atoms.BACKBONE_ATOMS: List[str] = ['N', 'CA', 'C', 'O']#
Atoms present in Amino Acid Backbones.
- graphein.protein.resi_atoms.BASE_AMINO_ACIDS: List[str] = ['A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y']#
Vocabulary of 20 standard amino acids.
- graphein.protein.resi_atoms.BOND_LENGTHS: Dict[str, Dict[str, float]] = {'As-N': {'i_d': 1.835, 'i_s': 1.86, 'w_sd': 1.845}, 'As-O': {'i_d': 1.66, 'i_s': 1.71, 'w_sd': 1.68}, 'As-S': {'i_d': 2.08, 'i_s': 2.28, 'w_sd': 2.15}, 'C-C': {'i_d': 1.31, 'i_s': 1.49, 'i_t': 1.18, 'w_dt': 1.21, 'w_sd': 1.38}, 'C-N': {'i_d': 1.32, 'i_s': 1.42, 'i_t': 1.14, 'w_dt': 1.2, 'w_sd': 1.34}, 'C-O': {'i_d': 1.22, 'i_s': 1.41, 'w_sd': 1.28}, 'C-S': {'i_d': 1.68, 'i_s': 1.78, 'w_sd': 1.7}, 'C-Te': {'i_d': 1.8, 'i_s': 2.2, 'w_sd': 2.1}, 'N-N': {'i_d': 1.22, 'i_s': 1.4, 'w_sd': 1.32}, 'N-O': {'i_d': 1.22, 'i_s': 1.39, 'w_sd': 1.25}, 'N-P': {'i_d': 1.59, 'i_s': 1.69, 'w_sd': 1.62}, 'N-S': {'i_d': 1.54, 'i_s': 1.66, 'w_sd': 1.58}, 'N-Se': {'i_d': 1.79, 'i_s': 1.83, 'w_sd': 1.8}, 'O-P': {'i_d': 1.48, 'i_s': 1.6, 'w_sd': 1.52}, 'O-S': {'i_d': 1.45, 'i_s': 1.58, 'w_sd': 1.54}, 'P-P': {'i_d': 2.04, 'i_s': 2.23, 'w_sd': 2.06}}#
Dictionary containing idealised single, double and triple bond lengths (i_s, i_d, i_t) and watersheds (w_sd, w_dt), below which a bond is probably double/triple (e.g. triple < double < single). All lengths are in Angstroms.
Taken from:
Automatic Assignment of Chemical Connectivity to Organic Molecules in the Cambridge Structural Database Jon C. Baber and Edward E. Hodgkin* J. Chem. Inf. Comput. Sci. 1992, 32. 401-406
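The watershed values lend themselves to a simple threshold rule. The sketch below copies two entries from BOND_LENGTHS and guesses a bond order from an observed length; `guess_bond_order` is a hypothetical helper, and Graphein's own bond-order assignment may apply further checks that are not reproduced here.

```python
from typing import Dict

# Two entries copied from BOND_LENGTHS above (lengths in Angstroms).
BOND_LENGTHS: Dict[str, Dict[str, float]] = {
    "C-C": {"i_d": 1.31, "i_s": 1.49, "i_t": 1.18, "w_dt": 1.21, "w_sd": 1.38},
    "C-O": {"i_d": 1.22, "i_s": 1.41, "w_sd": 1.28},
}


def guess_bond_order(bond: str, length: float) -> int:
    """Guess 1, 2 or 3 by comparing `length` against the watersheds."""
    params = BOND_LENGTHS[bond]
    # Not every bond type has a double/triple watershed (e.g. C-O).
    if "w_dt" in params and length < params["w_dt"]:
        return 3
    if length < params["w_sd"]:
        return 2
    return 1


orders = [
    guess_bond_order("C-C", 1.54),  # longer than w_sd
    guess_bond_order("C-C", 1.33),  # between w_dt and w_sd
    guess_bond_order("C-C", 1.20),  # shorter than w_dt
    guess_bond_order("C-O", 1.23),  # shorter than w_sd
]
```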
- graphein.protein.resi_atoms.BOND_ORDERS: Dict = {'As-N': [1, 2], 'As-O': [1, 2], 'As-S': [1, 2], 'C-C': [1, 2, 3], 'C-N': [1, 2, 3], 'C-O': [1, 2], 'C-S': [1, 2], 'C-Te': [1, 2], 'N-N': [1, 2], 'N-O': [1, 2], 'N-P': [1, 2], 'N-S': [1, 2], 'N-Se': [1, 2], 'O-P': [1, 2], 'O-S': [1, 2], 'P-P': [1, 2]}#
Dictionary of allowable bond orders for each covalent bond type.
Taken from:
Automatic Assignment of Chemical Connectivity to Organic Molecules in the Cambridge Structural Database Jon C. Baber and Edward E. Hodgkin* J. Chem. Inf. Comput. Sci. 1992, 32. 401-406
- graphein.protein.resi_atoms.BOND_TYPES: List[str] = ['hydrophobic', 'disulfide', 'hbond', 'ionic', 'aromatic', 'aromatic_sulphur', 'cation_pi', 'backbone', 'delaunay']#
List of supported bond types.
- graphein.protein.resi_atoms.CARBOHYDRATE_CODES: List[str] = ['BGC', 'GLC', 'MAN', 'BMA', 'FUC', 'GAL', 'GLA', 'NAG', 'NGA', 'SIA', 'XYS']#
Three letter codes of commonly found carbohydrates in protein structures.
See http://ligand-expo.rcsb.org/ and https://www.globalphasing.com/buster/manual/maketnt/manual/lib_val/library_validation.html
- graphein.protein.resi_atoms.CARBOHYDRATE_CODE_NAME_MAPPING: Dict[str, str] = {'BGC': 'D-GLUCOSE', 'BMA': 'D-MANNOSE', 'FUC': 'FUCOSE', 'GAL': 'D-GALACTOSE', 'GLA': 'D-GALACTOSE', 'GLC': 'D-GLUCOSE', 'MAN': 'D-MANNOSE', 'NAG': 'N-ACETYL-D-GLUCOSAMINE', 'NGA': 'N-ACETYL-D-GALACTOSAMINE', 'SIA': 'O-SIALIC_ACID', 'XYS': 'D-XYLOPYRANOSE'}#
Mapping of 3-letter PDB ligand accession codes for common carbohydrates to their full names.
See http://ligand-expo.rcsb.org/ and https://www.globalphasing.com/buster/manual/maketnt/manual/lib_val/library_validation.html
- graphein.protein.resi_atoms.CARBOHYDRATE_NAMES: List[str] = ['D-GLUCOSE', 'D-MANNOSE', 'FUCOSE', 'D-GALACTOSE', 'N-ACETYL-D-GLUCOSAMINE', 'N-ACETYL-D-GALACTOSAMINE', 'O-SIALIC_ACID', 'D-XYLOPYRANOSE']#
Names of commonly found carbohydrates in protein structures.
See http://ligand-expo.rcsb.org/ and https://www.globalphasing.com/buster/manual/maketnt/manual/lib_val/library_validation.html
- graphein.protein.resi_atoms.CATION_PI_RESIS: List[str] = ['LYS', 'ARG', 'PHE', 'TYR', 'TRP']#
List of residues involved in cation-pi interactions.
- graphein.protein.resi_atoms.COFACTOR_CODES: List[str] = ['ADP', 'AMP', 'ATP', 'CMP', 'COA', 'FAD', 'FMN', 'NAP', 'NDP']#
Three letter codes of cofactors commonly found in PDB structures.
See: http://ligand-expo.rcsb.org/ and https://www.globalphasing.com/buster/manual/maketnt/manual/lib_val/library_validation.html
- graphein.protein.resi_atoms.COFACTOR_CODE_NAME_MAPPING: Dict[str, str] = {'ADP': 'ADP', 'AMP': 'AMP', 'ATP': 'ATP', 'CMP': 'cAMP', 'COA': 'COENZYME_A', 'FAD': 'FAD', 'FMN': 'FLAVIN_MONONUCLEOTIDE', 'NAP': 'NADP', 'NDP': 'NADPH'}#
Mapping between 3-letter PDB ligand codes and cofactor names.
See http://ligand-expo.rcsb.org/ and https://www.globalphasing.com/buster/manual/maketnt/manual/lib_val/library_validation.html
- graphein.protein.resi_atoms.COFACTOR_NAMES: List[str] = ['ADP', 'AMP', 'ATP', 'cAMP', 'COENZYME_A', 'FAD', 'FLAVIN_MONONUCLEOTIDE', 'NADP', 'NADPH']#
Names of cofactors commonly found in PDB structures.
See: http://ligand-expo.rcsb.org/ and https://www.globalphasing.com/buster/manual/maketnt/manual/lib_val/library_validation.html
- graphein.protein.resi_atoms.COVALENT_RADII: Dict[str, float] = {'Cdb': 0.67, 'Cres': 0.72, 'Csb': 0.77, 'Hsb': 0.37, 'Ndb': 0.62, 'Nres': 0.66, 'Nsb': 0.7, 'Odb': 0.6, 'Ores': 0.635, 'Osb': 0.67, 'Ssb': 1.04}#
Covalent radii for OpenSCAD output. Adding Ores between Osb and Odb for Asp and Glu, Nres between Nsb and Ndb for Arg, as PDB does not specify
Covalent radii from:
Heyrovska, Raji : ‘Atomic Structures of all the Twenty Essential Amino Acids and a Tripeptide, with Bond Lengths as Sums of Atomic Covalent Radii’
- graphein.protein.resi_atoms.DEFAULT_BOND_STATE: Dict[str, str] = {'1HD2': 'Hsb', '1HH1': 'Hsb', '1HH2': 'Hsb', '2HD2': 'Hsb', '2HH1': 'Hsb', '2HH2': 'Hsb', 'C': 'Cdb', 'CA': 'Csb', 'CB': 'Csb', 'H': 'Hsb', 'HE': 'Hsb', 'HG': 'Hsb', 'HG1': 'Hsb', 'HH': 'Hsb', 'HZ1': 'Hsb', 'HZ2': 'Hsb', 'HZ3': 'Hsb', 'N': 'Nsb', 'O': 'Odb', 'OXT': 'Osb'}#
Assignment of atom classes to atomic radii.
Covalent radii from:
Heyrovska, Raji : ‘Atomic Structures of all the Twenty Essential Amino Acids and a Tripeptide, with Bond Lengths as Sums of Atomic Covalent Radii’
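Following the cited idea of bond lengths as sums of covalent radii, an expected backbone bond length can be looked up in two steps: atom name to bond state, then bond state to radius. The sketch copies a subset of COVALENT_RADII and DEFAULT_BOND_STATE from above; `expected_bond_length` is a hypothetical helper and not a Graphein function.

```python
from typing import Dict

# Subsets copied from COVALENT_RADII and DEFAULT_BOND_STATE above.
COVALENT_RADII: Dict[str, float] = {"Cdb": 0.67, "Csb": 0.77, "Nsb": 0.7, "Odb": 0.6}
DEFAULT_BOND_STATE: Dict[str, str] = {"N": "Nsb", "CA": "Csb", "C": "Cdb", "O": "Odb"}


def expected_bond_length(atom_a: str, atom_b: str) -> float:
    """Sum the covalent radii assigned to two backbone atom names."""
    radius_a = COVALENT_RADII[DEFAULT_BOND_STATE[atom_a]]
    radius_b = COVALENT_RADII[DEFAULT_BOND_STATE[atom_b]]
    return radius_a + radius_b


n_ca = expected_bond_length("N", "CA")  # Nsb + Csb
c_o = expected_bond_length("C", "O")    # Cdb + Odb
```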
- graphein.protein.resi_atoms.DISULFIDE_ATOMS: List[str] = ['SG']#
List of atoms capable of forming disulphide bonds.
- graphein.protein.resi_atoms.DISULFIDE_RESIS: List[str] = ['CYS']#
Residues capable of forming disulfide bonds.
- graphein.protein.resi_atoms.GRANTHAM_CHEMICAL_DISTANCE_MATRIX: Dict[str, float] = {'AA': 0.0, 'AC': 0.112, 'AD': 0.819, 'AE': 0.827, 'AF': 0.54, 'AG': 0.208, 'AH': 0.696, 'AI': 0.407, 'AK': 0.891, 'AL': 0.406, 'AM': 0.379, 'AN': 0.318, 'AP': 0.191, 'AQ': 0.372, 'AR': 1.0, 'AS': 0.094, 'AT': 0.22, 'AV': 0.273, 'AW': 0.739, 'AY': 0.552, 'CA': 0.114, 'CC': 0.0, 'CD': 0.847, 'CE': 0.838, 'CF': 0.437, 'CG': 0.32, 'CH': 0.66, 'CI': 0.304, 'CK': 0.887, 'CL': 0.301, 'CM': 0.277, 'CN': 0.324, 'CP': 0.157, 'CQ': 0.341, 'CR': 1.0, 'CS': 0.176, 'CT': 0.233, 'CV': 0.167, 'CW': 0.639, 'CY': 0.457, 'DA': 0.729, 'DC': 0.742, 'DD': 0.0, 'DE': 0.124, 'DF': 0.924, 'DG': 0.697, 'DH': 0.435, 'DI': 0.847, 'DK': 0.249, 'DL': 0.841, 'DM': 0.819, 'DN': 0.56, 'DP': 0.657, 'DQ': 0.584, 'DR': 0.295, 'DS': 0.667, 'DT': 0.649, 'DV': 0.797, 'DW': 1.0, 'DY': 0.836, 'EA': 0.79, 'EC': 0.788, 'ED': 0.133, 'EE': 0.0, 'EF': 0.932, 'EG': 0.779, 'EH': 0.406, 'EI': 0.86, 'EK': 0.143, 'EL': 0.854, 'EM': 0.83, 'EN': 0.599, 'EP': 0.688, 'EQ': 0.598, 'ER': 0.234, 'ES': 0.726, 'ET': 0.682, 'EV': 0.824, 'EW': 1.0, 'EY': 0.837, 'FA': 0.508, 'FC': 0.405, 'FD': 0.977, 'FE': 0.918, 'FF': 0.0, 'FG': 0.69, 'FH': 0.663, 'FI': 0.128, 'FK': 0.903, 'FL': 0.131, 'FM': 0.169, 'FN': 0.541, 'FP': 0.42, 'FQ': 0.459, 'FR': 1.0, 'FS': 0.548, 'FT': 0.499, 'FV': 0.252, 'FW': 0.207, 'FY': 0.179, 'GA': 0.206, 'GC': 0.312, 'GD': 0.776, 'GE': 0.807, 'GF': 0.727, 'GG': 0.0, 'GH': 0.769, 'GI': 0.592, 'GK': 0.894, 'GL': 0.591, 'GM': 0.557, 'GN': 0.381, 'GP': 0.323, 'GQ': 0.467, 'GR': 1.0, 'GS': 0.158, 'GT': 0.272, 'GV': 0.464, 'GW': 0.923, 'GY': 0.728, 'HA': 0.896, 'HC': 0.836, 'HD': 0.629, 'HE': 0.547, 'HF': 0.907, 'HG': 1.0, 'HH': 0.0, 'HI': 0.848, 'HK': 0.566, 'HL': 0.842, 'HM': 0.825, 'HN': 0.754, 'HP': 0.777, 'HQ': 0.716, 'HR': 0.697, 'HS': 0.865, 'HT': 0.834, 'HV': 0.831, 'HW': 0.981, 'HY': 0.821, 'IA': 0.403, 'IC': 0.296, 'ID': 0.942, 'IE': 0.891, 'IF': 0.134, 'IG': 0.592, 'IH': 0.652, 'II': 0.0, 'IK': 0.892, 'IL': 0.013, 
'IM': 0.057, 'IN': 0.457, 'IP': 0.311, 'IQ': 0.383, 'IR': 1.0, 'IS': 0.443, 'IT': 0.396, 'IV': 0.133, 'IW': 0.339, 'IY': 0.213, 'KA': 0.889, 'KC': 0.871, 'KD': 0.279, 'KE': 0.149, 'KF': 0.957, 'KG': 0.9, 'KH': 0.438, 'KI': 0.899, 'KK': 0.0, 'KL': 0.892, 'KM': 0.871, 'KN': 0.667, 'KP': 0.757, 'KQ': 0.639, 'KR': 0.154, 'KS': 0.825, 'KT': 0.759, 'KV': 0.882, 'KW': 1.0, 'KY': 0.848, 'LA': 0.405, 'LC': 0.296, 'LD': 0.944, 'LE': 0.892, 'LF': 0.139, 'LG': 0.596, 'LH': 0.653, 'LI': 0.013, 'LK': 0.893, 'LL': 0.0, 'LM': 0.062, 'LN': 0.452, 'LP': 0.309, 'LQ': 0.376, 'LR': 1.0, 'LS': 0.443, 'LT': 0.397, 'LV': 0.133, 'LW': 0.341, 'LY': 0.205, 'MA': 0.383, 'MC': 0.276, 'MD': 0.932, 'ME': 0.879, 'MF': 0.182, 'MG': 0.569, 'MH': 0.648, 'MI': 0.058, 'MK': 0.884, 'ML': 0.062, 'MM': 0.0, 'MN': 0.447, 'MP': 0.285, 'MQ': 0.372, 'MR': 1.0, 'MS': 0.417, 'MT': 0.358, 'MV': 0.12, 'MW': 0.391, 'MY': 0.255, 'NA': 0.424, 'NC': 0.425, 'ND': 0.838, 'NE': 0.835, 'NF': 0.766, 'NG': 0.512, 'NH': 0.78, 'NI': 0.615, 'NK': 0.891, 'NL': 0.603, 'NM': 0.588, 'NN': 0.0, 'NP': 0.266, 'NQ': 0.175, 'NR': 1.0, 'NS': 0.361, 'NT': 0.368, 'NV': 0.503, 'NW': 0.945, 'NY': 0.641, 'PA': 0.22, 'PC': 0.179, 'PD': 0.852, 'PE': 0.831, 'PF': 0.515, 'PG': 0.376, 'PH': 0.696, 'PI': 0.363, 'PK': 0.875, 'PL': 0.357, 'PM': 0.326, 'PN': 0.231, 'PP': 0.0, 'PQ': 0.228, 'PR': 1.0, 'PS': 0.196, 'PT': 0.161, 'PV': 0.244, 'PW': 0.72, 'PY': 0.481, 'QA': 0.512, 'QC': 0.462, 'QD': 0.903, 'QE': 0.861, 'QF': 0.671, 'QG': 0.648, 'QH': 0.765, 'QI': 0.532, 'QK': 0.881, 'QL': 0.518, 'QM': 0.505, 'QN': 0.181, 'QP': 0.272, 'QQ': 0.0, 'QR': 1.0, 'QS': 0.461, 'QT': 0.389, 'QV': 0.464, 'QW': 0.831, 'QY': 0.522, 'RA': 0.919, 'RC': 0.905, 'RD': 0.305, 'RE': 0.225, 'RF': 0.977, 'RG': 0.928, 'RH': 0.498, 'RI': 0.929, 'RK': 0.141, 'RL': 0.92, 'RM': 0.908, 'RN': 0.69, 'RP': 0.796, 'RQ': 0.668, 'RR': 0.0, 'RS': 0.86, 'RT': 0.808, 'RV': 0.914, 'RW': 1.0, 'RY': 0.859, 'SA': 0.1, 'SC': 0.185, 'SD': 0.801, 'SE': 0.812, 'SF': 0.622, 'SG': 0.17, 'SH': 0.718, 
'SI': 0.478, 'SK': 0.883, 'SL': 0.474, 'SM': 0.44, 'SN': 0.289, 'SP': 0.181, 'SQ': 0.358, 'SR': 1.0, 'SS': 0.0, 'ST': 0.174, 'SV': 0.342, 'SW': 0.827, 'SY': 0.615, 'TA': 0.251, 'TC': 0.261, 'TD': 0.83, 'TE': 0.812, 'TF': 0.604, 'TG': 0.312, 'TH': 0.737, 'TI': 0.455, 'TK': 0.866, 'TL': 0.453, 'TM': 0.403, 'TN': 0.315, 'TP': 0.159, 'TQ': 0.322, 'TR': 1.0, 'TS': 0.185, 'TT': 0.0, 'TV': 0.345, 'TW': 0.816, 'TY': 0.596, 'VA': 0.275, 'VC': 0.165, 'VD': 0.9, 'VE': 0.867, 'VF': 0.269, 'VG': 0.471, 'VH': 0.649, 'VI': 0.135, 'VK': 0.889, 'VL': 0.134, 'VM': 0.12, 'VN': 0.38, 'VP': 0.212, 'VQ': 0.339, 'VR': 1.0, 'VS': 0.322, 'VT': 0.305, 'VV': 0.0, 'VW': 0.472, 'VY': 0.31, 'WA': 0.658, 'WC': 0.56, 'WD': 1.0, 'WE': 0.931, 'WF': 0.196, 'WG': 0.829, 'WH': 0.678, 'WI': 0.305, 'WK': 0.892, 'WL': 0.304, 'WM': 0.344, 'WN': 0.631, 'WP': 0.555, 'WQ': 0.538, 'WR': 0.968, 'WS': 0.689, 'WT': 0.638, 'WV': 0.418, 'WW': 0.0, 'WY': 0.204, 'YA': 0.587, 'YC': 0.478, 'YD': 1.0, 'YE': 0.932, 'YF': 0.202, 'YG': 0.782, 'YH': 0.678, 'YI': 0.23, 'YK': 0.904, 'YL': 0.219, 'YM': 0.268, 'YN': 0.512, 'YP': 0.444, 'YQ': 0.404, 'YR': 0.995, 'YS': 0.612, 'YT': 0.557, 'YV': 0.328, 'YW': 0.244, 'YY': 0.0}#
Grantham Chemical Distance Matrix. Taken from ProPy3 https://github.com/MartinThoma/propy3
Amino Acid Difference Formula to Help Explain Protein Evolution R. Grantham Science Vol 185, Issue 4154 06 September 1974
Paper: https://science.sciencemag.org/content/185/4154/862/tab-pdf
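The matrix is keyed by a two-character string formed from a pair of one-letter codes. The sketch below copies four entries from the listing above and shows that the listed values are not symmetric, so the order of the pair matters; `chemical_distance` is a hypothetical helper.

```python
from typing import Dict

# Four entries copied from GRANTHAM_CHEMICAL_DISTANCE_MATRIX above.
GRANTHAM_SUBSET: Dict[str, float] = {"AA": 0.0, "AC": 0.112, "CA": 0.114, "CC": 0.0}


def chemical_distance(first: str, second: str) -> float:
    """Look up the distance for an ordered pair of one-letter codes."""
    return GRANTHAM_SUBSET[first + second]


forward = chemical_distance("A", "C")
backward = chemical_distance("C", "A")
```

If a symmetric measure is needed, averaging the two directions is one option, but that is a modelling choice and not something the listing itself prescribes.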
- graphein.protein.resi_atoms.HYDROGEN_BOND_ACCEPTORS: Dict[str, Dict[str, int]] = {'ASN': {'OD1': 2}, 'ASP': {'OD1': 2, 'OD2': 2}, 'GLN': {'OE1': 2}, 'GLU': {'OE1': 2, 'OE2': 2}, 'HIS': {'ND1': 1, 'NE2': 1}, 'SER': {'OG': 2}, 'THR': {'OG1': 2}, 'TYR': {'OH': 1}}#
Number of hydrogen bonds that an acceptor atom can accept, if more than one.
9 amino acids (alanine, cysteine, glycine, isoleucine, leucine, methionine, phenylalanine, proline, valine) have no hydrogen donor or acceptor atoms in their side chains.
https://www.imgt.org/IMGTeducation/Aide-memoire/_UK/aminoacids/charge/
- graphein.protein.resi_atoms.HYDROGEN_BOND_DONORS: Dict[str, Dict[str, int]] = {'ARG': {'NE': 1, 'NH1': 2, 'NH2': 2}, 'ASN': {'ND2': 2}, 'GLN': {'NE2': 2}, 'HIS': {'ND1': 2, 'NE2': 2}, 'LYS': {'NZ': 3}, 'SER': {'OG': 1}, 'THR': {'OG1': 1}, 'TRP': {'NE1': 1}, 'TYR': {'OH': 1}}#
Number of hydrogen bonds that a donor atom can donate, if more than one.
9 amino acids (alanine, cysteine, glycine, isoleucine, leucine, methionine, phenylalanine, proline, valine) have no hydrogen donor or acceptor atoms in their side chains.
https://www.imgt.org/IMGTeducation/Aide-memoire/_UK/aminoacids/charge/
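The donor and acceptor tables can be summed per residue to get a side-chain hydrogen-bonding capacity. The sketch copies a few entries from the two dictionaries above; `side_chain_capacity` is a hypothetical helper, and treating residues missing from a table as having zero capacity is an assumption based on the note about the nine residues without donor or acceptor atoms.

```python
from typing import Dict

# Subsets copied from HYDROGEN_BOND_DONORS and HYDROGEN_BOND_ACCEPTORS above.
DONORS: Dict[str, Dict[str, int]] = {
    "ARG": {"NE": 1, "NH1": 2, "NH2": 2},
    "SER": {"OG": 1},
}
ACCEPTORS: Dict[str, Dict[str, int]] = {
    "ASP": {"OD1": 2, "OD2": 2},
    "SER": {"OG": 2},
}


def side_chain_capacity(residue: str, table: Dict[str, Dict[str, int]]) -> int:
    """Total number of hydrogen bonds the side-chain atoms can form."""
    # Residues absent from the table are assumed to contribute nothing.
    return sum(table.get(residue, {}).values())


arg_donor = side_chain_capacity("ARG", DONORS)
asp_acceptor = side_chain_capacity("ASP", ACCEPTORS)
ala_donor = side_chain_capacity("ALA", DONORS)
```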
- graphein.protein.resi_atoms.HYDROPHOBIC_RESIS: List[str] = ['ALA', 'VAL', 'LEU', 'ILE', 'MET', 'PHE', 'TRP', 'PRO', 'TYR']#
List of residues that are considered to be hydrophobic.
- graphein.protein.resi_atoms.IONIC_RESIS: List[str] = ['ARG', 'LYS', 'HIS', 'ASP', 'GLU']#
Residues capable of forming ionic interactions.
- graphein.protein.resi_atoms.ISOELECTRIC_POINTS: Dict[str, float] = {'ALA': 6.11, 'ARG': 10.76, 'ASN': 10.76, 'ASP': 2.98, 'ASX': 6.87, 'CYS': 5.02, 'GLN': 5.65, 'GLU': 3.08, 'GLX': 4.35, 'GLY': 6.06, 'HIS': 7.64, 'ILE': 6.04, 'LEU': 6.04, 'LYS': 9.74, 'MET': 5.74, 'PHE': 5.91, 'PRO': 6.3, 'SER': 5.68, 'THR': 5.6, 'TRP': 5.88, 'TYR': 5.63, 'UNK': 7.0, 'VAL': 6.02}#
Dictionary of isoelectric points for standard amino acids. For "UNK" residues, neutral (pH 7.0) is assigned. For "ASX" and "GLX" the average of their constituents ("D" and "N", and "E" and "Q", respectively) is assigned.
- graphein.protein.resi_atoms.ISOELECTRIC_POINTS_STD = {'ALA': array([[-0.0986554]]), 'ARG': array([[2.33811019]]), 'ASN': array([[2.33811019]]), 'ASP': array([[-1.73888686]]), 'ASX': array([[0.29961166]]), 'CYS': array([[-0.66985422]]), 'GLN': array([[-0.33971178]]), 'GLU': array([[-1.6864833]]), 'GLX': array([[-1.02095808]]), 'GLY': array([[-0.12485718]]), 'HIS': array([[0.70311909]]), 'ILE': array([[-0.13533789]]), 'LEU': array([[-0.13533789]]), 'LYS': array([[1.80359387]]), 'MET': array([[-0.29254858]]), 'PHE': array([[-0.20346252]]), 'PRO': array([[0.00091137]]), 'SER': array([[-0.32399071]]), 'THR': array([[-0.36591356]]), 'TRP': array([[-0.21918359]]), 'TYR': array([[-0.35019249]]), 'UNK': array([[0.36773629]]), 'VAL': array([[-0.1458186]])}#
Standardized (sklearn.StandardScaler) isoelectric points for standard amino acids.
See ISOELECTRIC_POINTS for details.
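The standardised values appear consistent with a z-score taken over all 23 entries of ISOELECTRIC_POINTS, using the population standard deviation as sklearn's StandardScaler does. The sketch below reproduces that calculation with the standard library only; it returns plain floats where the listing shows 1x1 arrays, and it is an illustration, not the code Graphein runs.

```python
from statistics import mean, pstdev
from typing import Dict

# Values copied from ISOELECTRIC_POINTS above.
ISOELECTRIC_POINTS: Dict[str, float] = {
    "ALA": 6.11, "ARG": 10.76, "ASN": 10.76, "ASP": 2.98, "ASX": 6.87,
    "CYS": 5.02, "GLN": 5.65, "GLU": 3.08, "GLX": 4.35, "GLY": 6.06,
    "HIS": 7.64, "ILE": 6.04, "LEU": 6.04, "LYS": 9.74, "MET": 5.74,
    "PHE": 5.91, "PRO": 6.3, "SER": 5.68, "THR": 5.6, "TRP": 5.88,
    "TYR": 5.63, "UNK": 7.0, "VAL": 6.02,
}

values = list(ISOELECTRIC_POINTS.values())
centre = mean(values)
# Population standard deviation (divides by N, not N - 1).
spread = pstdev(values)

standardised = {
    name: (value - centre) / spread
    for name, value in ISOELECTRIC_POINTS.items()
}
```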
- graphein.protein.resi_atoms.MAX_NEIGHBOURS: Dict[str, int] = {'B': 3, 'Br': 1, 'C': 4, 'F': 1, 'H': 1, 'I': 3, 'O': 2}#
Maximum number of neighbours an atom can have.
Taken from: https://www.daylight.com/meetings/mug01/Sayle/m4xbondage.html
- graphein.protein.resi_atoms.MOLECULAR_WEIGHTS: Dict[str, float] = {'ALA': 89.0935, 'ARG': 174.2017, 'ASN': 132.1184, 'ASP': 133.1032, 'ASX': 132.6108, 'CYS': 121.159, 'GLN': 146.1451, 'GLU': 147.1299, 'GLX': 146.6375, 'GLY': 75.0669, 'HIS': 155.1552, 'ILE': 131.1736, 'LEU': 131.1736, 'LYS': 146.1882, 'MET': 149.2124, 'PHE': 165.19, 'PRO': 115.131, 'SER': 105.093, 'THR': 119.1197, 'TRP': 204.2262, 'TYR': 181.1894, 'UNK': 137.1484, 'VAL': 117.1469}#
Mapping of 3-letter amino acid names to molecular weights.
UNK is used for unknown residues and takes the mean of known weights. For "ASX" and "GLX" the average of their constituents ("D" and "N", and "E" and "Q", respectively) is assigned.
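The averaging rule for the ambiguous codes can be checked against the listed weights. The sketch copies four entries from MOLECULAR_WEIGHTS; `ambiguous_weight` is a hypothetical helper, and rounding to four decimals is an assumption made to match the precision of the listing.

```python
from typing import Dict

# Entries copied from MOLECULAR_WEIGHTS above.
MOLECULAR_WEIGHTS: Dict[str, float] = {
    "ASP": 133.1032,
    "ASN": 132.1184,
    "GLU": 147.1299,
    "GLN": 146.1451,
}


def ambiguous_weight(first: str, second: str) -> float:
    """Average the weights of the two residues an ambiguous code may denote."""
    total = MOLECULAR_WEIGHTS[first] + MOLECULAR_WEIGHTS[second]
    return round(total / 2, 4)


asx = ambiguous_weight("ASP", "ASN")
glx = ambiguous_weight("GLU", "GLN")
```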
- graphein.protein.resi_atoms.MOLECULAR_WEIGHTS_STD = {'ALA': array([[-1.70781298]]), 'ARG': array([[1.31682834]]), 'ASN': array([[-0.17876066]]), 'ASP': array([[-0.14376208]]), 'ASX': array([[-0.16126137]]), 'CYS': array([[-0.56824433]]), 'GLN': array([[0.31973109]]), 'GLU': array([[0.35472968]]), 'GLX': array([[0.33723039]]), 'GLY': array([[-2.20630119]]), 'HIS': array([[0.63993903]]), 'ILE': array([[-0.2123377]]), 'LEU': array([[-0.2123377]]), 'LYS': array([[0.32126282]]), 'MET': array([[0.42873918]]), 'PHE': array([[0.99656354]]), 'PRO': array([[-0.78247208]]), 'SER': array([[-1.13921032]]), 'THR': array([[-0.64071856]]), 'TRP': array([[2.38386234]]), 'TYR': array([[1.56516265]]), 'UNK': array([[-6.18065683e-07]]), 'VAL': array([[-0.71082946]])}#
Standardized (sklearn.StandardScaler) molecular weights for standard amino acids.
See MOLECULAR_WEIGHTS for details.
- graphein.protein.resi_atoms.NON_STANDARD_AMINO_ACIDS: List[str] = ['O', 'U']#
Non-standard amino acids with one-letter codes.
- graphein.protein.resi_atoms.NON_STANDARD_AMINO_ACID_MAPPING_3_TO_1: Dict[str, str] = {'PYL': 'O', 'SEC': 'U'}#
Mapping of 3-letter non-standard amino acids codes to their one-letter form.
- graphein.protein.resi_atoms.NON_STANDARD_RESIS_NAME: List[str] = ['3-SULFINOALANINE', '4-HYDROXYPROLINE', '4-METHYL-4-[(E)-2-BUTENYL]-4,N-METHYL-THREONINE', '5-HYDROXYPROLINE', 'ACETYL_GROUP', 'ALPHA-AMINOBUTYRIC_ACID', 'ALPHA-AMINOISOBUTYRIC_ACID', 'AMINO_GROUP', 'CARBOXY_GROUP', 'CYSTEINE-S-DIOXIDE', 'CYSTEINESULFONIC_ACID', 'D-ALANINE', 'D-ARGININE', 'D-ASPARAGINE', 'D-ASPARTATE', 'D-CYSTEINE', 'DECARBOXY(PARAHYDROXYBENZYLIDENE-IMIDAZOLIDINONE)THREONINE', 'D-GLUTAMATE', 'D-GLUTAMINE', 'D-HISTIDINE', 'D-ISOLEUCINE', 'D-ISOVALINE', 'D-LEUCINE', 'D-LYSINE', 'D-PHENYLALANINE', 'D-PROLINE', 'D-SERINE', 'D-THREONINE', 'D-TRYPTOPHANE', 'D-TYROSINE', 'D-VALINE', 'FORMYL_GROUP', 'GAMMA-CARBOXY-GLUTAMIC_ACID', 'ISOVALERIC_ACID', 'LYSINE_NZ-CARBOXYLIC_ACID', "LYSINE-PYRIDOXAL-5'-PHOSPHATE", 'N-CARBOXYMETHIONINE', 'N-FORMYLMETHIONINE', 'N-METHYLLEUCINE', 'N-METHYLVALINE', 'NORLEUCINE', 'O-PHOSPHOTYROSINE', 'ORNITHINE', 'PHOSPHOSERINE', 'PHOSPHOTHREONINE', 'PYROGLUTAMIC_ACID', 'PYRUVOYL_GROUP', 'SARCOSINE', 'S-HYDROXY-CYSTEINE', 'S-HYDROXYCYSTEINE', 'S-MERCAPTOCYSTEINE', 'S-OXY_CYSTEINE', 'S,S-(2-HYDROXYETHYL)THIOCYSTEINE', 'SULFONATED_TYROSINE', 'TERT-BUTYLOXYCARBONYL_GROUP', 'TOPO-QUINONE', 'TYROSINE-O-SULPHONIC_ACID']#
Non-standard residue info taken from: https://www.globalphasing.com/buster/manual/maketnt/manual/lib_val/library_validation.html
PYL (pyrrolysine) and SEC (selenocysteine) are added.
- graphein.protein.resi_atoms.NON_STANDARD_RESIS_PARENT: Dict[str, str] = {'5HP': 'GLU', 'ABA': 'ALA', 'ACE': '-', 'AIB': 'ALA', 'BMT': 'THR', 'BOC': '-', 'CBX': '-', 'CEA': 'CYS', 'CGU': 'GLU', 'CME': 'CYS', 'CRO': 'CRO', 'CSD': 'CYS', 'CSO': 'CYS', 'CSS': 'CYS', 'CSW': 'CYS', 'CSX': 'CYS', 'CXM': 'MET', 'DAL': 'ALA', 'DAR': 'ARG', 'DCY': 'CYS', 'DGL': 'GLU', 'DGN': 'GLN', 'DHI': 'HIS', 'DIL': 'ILE', 'DIV': 'VAL', 'DLE': 'LEU', 'DLY': 'LYS', 'DPN': 'PHE', 'DPR': 'PRO', 'DSG': 'ASN', 'DSN': 'SER', 'DSP': 'ASP', 'DTH': 'THR', 'DTR': 'DTR', 'DTY': 'TYR', 'DVA': 'VAL', 'FME': 'MET', 'FOR': '-', 'HYP': 'PRO', 'IVA': '-', 'KCX': 'LYS', 'LLP': 'LYS', 'MLE': 'LEU', 'MVA': 'VAL', 'NH2': '-', 'NLE': 'LEU', 'OCS': 'CYS', 'ORN': 'ALA', 'PCA': 'GLU', 'PTR': 'TYR', 'PVL': '-', 'PYL': 'LYS', 'SAR': 'GLY', 'SEC': 'CYS', 'SEP': 'SER', 'STY': 'TYR', 'TPO': 'THR', 'TPQ': 'PHE', 'TYS': 'TYR'}#
Mapping of 3-letter non-standard/modified residues to their 3-letter parent residue names.
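A common use of such a mapping is to normalise modified residues before featurisation. The sketch copies four entries from NON_STANDARD_RESIS_PARENT; `to_parent` is a hypothetical helper, and interpreting a "-" parent as "no parent residue" (for example a capping group) is an assumption.

```python
from typing import Dict, List, Optional

# Entries copied from NON_STANDARD_RESIS_PARENT above.
PARENT_SUBSET: Dict[str, str] = {"SEP": "SER", "TPO": "THR", "HYP": "PRO", "ACE": "-"}


def to_parent(residue_name: str) -> Optional[str]:
    """Map a residue name to its parent; unmapped names pass through."""
    parent = PARENT_SUBSET.get(residue_name, residue_name)
    # Assumption: "-" marks entries without a parent amino acid.
    return None if parent == "-" else parent


chain: List[str] = ["ACE", "SEP", "ALA", "HYP"]
# Entries without a parent are dropped from the normalised chain.
normalised = [p for p in (to_parent(name) for name in chain) if p is not None]
```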
- graphein.protein.resi_atoms.NON_STANDARD_RESI_NAMES: List[str] = ['CSD', 'HYP', 'BMT', '5HP', 'ACE', 'ABA', 'AIB', 'NH2', 'CBX', 'CSW', 'OCS', 'DAL', 'DAR', 'DSG', 'DSP', 'DCY', 'CRO', 'DGL', 'DGN', 'DHI', 'DIL', 'DIV', 'DLE', 'DLY', 'DPN', 'DPR', 'DSN', 'DTH', 'DTR', 'DTY', 'DVA', 'FOR', 'CGU', 'IVA', 'KCX', 'LLP', 'CXM', 'FME', 'MLE', 'MVA', 'NLE', 'PTR', 'ORN', 'SEP', 'SEC', 'TPO', 'PCA', 'PVL', 'PYL', 'SAR', 'CEA', 'CSO', 'CSS', 'CSX', 'CME', 'TYS', 'BOC', 'TPQ', 'STY']#
List of non-standard residue 3-letter names.
Collected from: https://www.globalphasing.com/buster/manual/maketnt/manual/lib_val/library_validation.html
- graphein.protein.resi_atoms.PI_RESIS: List[str] = ['PHE', 'TYR', 'TRP']#
List of residues involved in pi interactions.
- graphein.protein.resi_atoms.POS_AA: List[str] = ['HIS', 'LYS', 'ARG']#
Positively charged amino acids.
- graphein.protein.resi_atoms.RESIDUE_ATOM_BOND_STATE: Dict[str, Dict[str, str]] = {'ARG': {'CD': 'Csb', 'CG': 'Csb', 'CZ': 'Cdb', 'NE': 'Nsb', 'NH1': 'Nres', 'NH2': 'Nres'}, 'ASN': {'CG': 'Csb', 'ND2': 'Ndb', 'OD1': 'Odb'}, 'ASP': {'CG': 'Csb', 'OD1': 'Ores', 'OD2': 'Ores'}, 'CYS': {'SG': 'Ssb'}, 'GLN': {'CD': 'Csb', 'CG': 'Csb', 'NE2': 'Ndb', 'OE1': 'Odb'}, 'GLU': {'CD': 'Csb', 'CG': 'Csb', 'OE1': 'Ores', 'OE2': 'Ores'}, 'HIS': {'CD2': 'Cdb', 'CE1': 'Cdb', 'CG': 'Cdb', 'ND1': 'Nsb', 'NE2': 'Ndb'}, 'ILE': {'CD1': 'Csb', 'CG1': 'Csb', 'CG2': 'Csb'}, 'LEU': {'CD1': 'Csb', 'CD2': 'Csb', 'CG': 'Csb'}, 'LYS': {'CD': 'Csb', 'CE': 'Csb', 'CG': 'Csb', 'NZ': 'Nsb'}, 'MET': {'CE': 'Csb', 'CG': 'Csb', 'SD': 'Ssb'}, 'PHE': {'CD1': 'Cres', 'CD2': 'Cres', 'CE1': 'Cdb', 'CE2': 'Cdb', 'CG': 'Cdb', 'CZ': 'Cres'}, 'PRO': {'CD': 'Csb', 'CG': 'Csb'}, 'SER': {'OG': 'Osb'}, 'THR': {'CG2': 'Csb', 'OG1': 'Osb'}, 'TRP': {'CD1': 'Cdb', 'CD2': 'Cres', 'CE2': 'Cdb', 'CE3': 'Cdb', 'CG': 'Cdb', 'CH2': 'Cdb', 'CZ2': 'Cres', 'CZ3': 'Cres', 'NE1': 'Nsb'}, 'TYR': {'CD1': 'Cres', 'CD2': 'Cres', 'CE1': 'Cdb', 'CE2': 'Cdb', 'CG': 'Cdb', 'CZ': 'Cres', 'OH': 'Osb'}, 'VAL': {'CG1': 'Csb', 'CG2': 'Csb'}, 'XXX': {'C': 'Cdb', 'CA': 'Csb', 'CB': 'Csb', 'H': 'Hsb', 'N': 'Nsb', 'O': 'Odb', 'OXT': 'Osb'}}#
Assignment of constituent atom classes within each standard residue to atomic radii.
Covalent radii from:
Heyrovska, Raji : ‘Atomic Structures of all the Twenty Essential Amino Acids and a Tripeptide, with Bond Lengths as Sums of Atomic Covalent Radii’
- graphein.protein.resi_atoms.RESI_NAMES: List[str] = ['ALA', 'ASX', 'CYS', 'ASP', 'GLU', 'PHE', 'GLY', 'HIS', 'ILE', 'LYS', 'LEU', 'MET', 'ASN', 'PRO', 'GLN', 'ARG', 'SER', 'THR', 'VAL', 'TRP', 'TYR', 'GLX', 'CSD', 'HYP', 'BMT', '5HP', 'ACE', 'ABA', 'AIB', 'NH2', 'CBX', 'CSW', 'OCS', 'DAL', 'DAR', 'DSG', 'DSP', 'DCY', 'CRO', 'DGL', 'DGN', 'DHI', 'DIL', 'DIV', 'DLE', 'DLY', 'DPN', 'DPR', 'DSN', 'DTH', 'DTR', 'DTY', 'DVA', 'FOR', 'CGU', 'IVA', 'KCX', 'LLP', 'CXM', 'FME', 'MLE', 'MVA', 'NLE', 'PTR', 'ORN', 'SEP', 'SEC', 'TPO', 'PCA', 'PVL', 'PYL', 'SAR', 'CEA', 'CSO', 'CSS', 'CSX', 'CME', 'TYS', 'BOC', 'TPQ', 'STY', 'UNK']#
3-letter residue names for all amino acids. Non-standard/modified amino acids are mapped to their parent amino acid. Includes "UNK" to denote unknown residues.
- graphein.protein.resi_atoms.RESI_THREE_TO_1: Dict[str, str] = {'3HP': 'X', '4HP': 'X', '5HP': 'Q', 'ABA': 'A', 'ACE': 'X', 'AIB': 'A', 'ALA': 'A', 'ARG': 'R', 'ASN': 'N', 'ASP': 'D', 'ASX': 'B', 'BMT': 'T', 'BOC': 'X', 'CBX': 'X', 'CEA': 'C', 'CGU': 'E', 'CME': 'C', 'CRO': 'TYG', 'CSD': 'C', 'CSO': 'C', 'CSS': 'C', 'CSW': 'C', 'CSX': 'C', 'CXM': 'M', 'CYS': 'C', 'DAL': 'A', 'DAR': 'R', 'DCY': 'C', 'DGL': 'E', 'DGN': 'Q', 'DHI': 'H', 'DIL': 'I', 'DIV': 'V', 'DLE': 'L', 'DLY': 'K', 'DPN': 'F', 'DPR': 'P', 'DSG': 'N', 'DSN': 'S', 'DSP': 'D', 'DTH': 'T', 'DTR': 'W', 'DTY': 'Y', 'DVA': 'V', 'FME': 'M', 'FOR': 'X', 'GLN': 'Q', 'GLU': 'E', 'GLX': 'Z', 'GLY': 'G', 'HIS': 'H', 'HYP': 'P', 'ILE': 'I', 'IVA': 'X', 'KCX': 'K', 'LEU': 'L', 'LLP': 'K', 'LYS': 'K', 'MET': 'M', 'MLE': 'L', 'MVA': 'V', 'NH2': 'X', 'NLE': 'L', 'OCS': 'C', 'ORN': 'A', 'PCA': 'Q', 'PHE': 'F', 'PRO': 'P', 'PTR': 'Y', 'PVL': 'X', 'PYL': 'O', 'SAR': 'G', 'SEC': 'U', 'SEP': 'S', 'SER': 'S', 'STY': 'Y', 'THR': 'T', 'TPO': 'T', 'TPQ': 'Y', 'TRP': 'W', 'TYR': 'Y', 'TYS': 'Y', 'UNK': 'X', 'VAL': 'V'}#
Mapping of 3-letter residue names to 1-letter residue names. Non-standard/modified amino acids are mapped to their parent amino acid. Includes "UNK" to denote unknown residues.
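A mapping like this is typically used to turn a list of residue names into a sequence string. The sketch copies five entries from RESI_THREE_TO_1; `to_one_letter` is a hypothetical helper, "FOO" is a made-up name, and falling back to "X" for names missing from the mapping is an assumption that mirrors how "UNK" is encoded.

```python
from typing import Dict, Iterable

# Entries copied from RESI_THREE_TO_1 above.
THREE_TO_ONE_SUBSET: Dict[str, str] = {
    "MET": "M",
    "SEP": "S",
    "GLY": "G",
    "PYL": "O",
    "UNK": "X",
}


def to_one_letter(residues: Iterable[str]) -> str:
    """Join one-letter codes for a sequence of 3-letter residue names."""
    # Assumption: names not in the mapping are treated as unknown ("X").
    return "".join(THREE_TO_ONE_SUBSET.get(name, "X") for name in residues)


sequence = to_one_letter(["MET", "SEP", "GLY", "PYL", "FOO"])
```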
- graphein.protein.resi_atoms.SCHNEIDER_WREDE_DISTMAT: Dict[str, float] = {'AA': 0.0, 'AC': 0.112, 'AD': 0.819, 'AE': 0.827, 'AF': 0.54, 'AG': 0.208, 'AH': 0.696, 'AI': 0.407, 'AK': 0.891, 'AL': 0.406, 'AM': 0.379, 'AN': 0.318, 'AP': 0.191, 'AQ': 0.372, 'AR': 1.0, 'AS': 0.094, 'AT': 0.22, 'AV': 0.273, 'AW': 0.739, 'AY': 0.552, 'CA': 0.114, 'CC': 0.0, 'CD': 0.847, 'CE': 0.838, 'CF': 0.437, 'CG': 0.32, 'CH': 0.66, 'CI': 0.304, 'CK': 0.887, 'CL': 0.301, 'CM': 0.277, 'CN': 0.324, 'CP': 0.157, 'CQ': 0.341, 'CR': 1.0, 'CS': 0.176, 'CT': 0.233, 'CV': 0.167, 'CW': 0.639, 'CY': 0.457, 'DA': 0.729, 'DC': 0.742, 'DD': 0.0, 'DE': 0.124, 'DF': 0.924, 'DG': 0.697, 'DH': 0.435, 'DI': 0.847, 'DK': 0.249, 'DL': 0.841, 'DM': 0.819, 'DN': 0.56, 'DP': 0.657, 'DQ': 0.584, 'DR': 0.295, 'DS': 0.667, 'DT': 0.649, 'DV': 0.797, 'DW': 1.0, 'DY': 0.836, 'EA': 0.79, 'EC': 0.788, 'ED': 0.133, 'EE': 0.0, 'EF': 0.932, 'EG': 0.779, 'EH': 0.406, 'EI': 0.86, 'EK': 0.143, 'EL': 0.854, 'EM': 0.83, 'EN': 0.599, 'EP': 0.688, 'EQ': 0.598, 'ER': 0.234, 'ES': 0.726, 'ET': 0.682, 'EV': 0.824, 'EW': 1.0, 'EY': 0.837, 'FA': 0.508, 'FC': 0.405, 'FD': 0.977, 'FE': 0.918, 'FF': 0.0, 'FG': 0.69, 'FH': 0.663, 'FI': 0.128, 'FK': 0.903, 'FL': 0.131, 'FM': 0.169, 'FN': 0.541, 'FP': 0.42, 'FQ': 0.459, 'FR': 1.0, 'FS': 0.548, 'FT': 0.499, 'FV': 0.252, 'FW': 0.207, 'FY': 0.179, 'GA': 0.206, 'GC': 0.312, 'GD': 0.776, 'GE': 0.807, 'GF': 0.727, 'GG': 0.0, 'GH': 0.769, 'GI': 0.592, 'GK': 0.894, 'GL': 0.591, 'GM': 0.557, 'GN': 0.381, 'GP': 0.323, 'GQ': 0.467, 'GR': 1.0, 'GS': 0.158, 'GT': 0.272, 'GV': 0.464, 'GW': 0.923, 'GY': 0.728, 'HA': 0.896, 'HC': 0.836, 'HD': 0.629, 'HE': 0.547, 'HF': 0.907, 'HG': 1.0, 'HH': 0.0, 'HI': 0.848, 'HK': 0.566, 'HL': 0.842, 'HM': 0.825, 'HN': 0.754, 'HP': 0.777, 'HQ': 0.716, 'HR': 0.697, 'HS': 0.865, 'HT': 0.834, 'HV': 0.831, 'HW': 0.981, 'HY': 0.821, 'IA': 0.403, 'IC': 0.296, 'ID': 0.942, 'IE': 0.891, 'IF': 0.134, 'IG': 0.592, 'IH': 0.652, 'II': 0.0, 'IK': 0.892, 'IL': 0.013, 'IM': 0.057, 
'IN': 0.457, 'IP': 0.311, 'IQ': 0.383, 'IR': 1.0, 'IS': 0.443, 'IT': 0.396, 'IV': 0.133, 'IW': 0.339, 'IY': 0.213, 'KA': 0.889, 'KC': 0.871, 'KD': 0.279, 'KE': 0.149, 'KF': 0.957, 'KG': 0.9, 'KH': 0.438, 'KI': 0.899, 'KK': 0.0, 'KL': 0.892, 'KM': 0.871, 'KN': 0.667, 'KP': 0.757, 'KQ': 0.639, 'KR': 0.154, 'KS': 0.825, 'KT': 0.759, 'KV': 0.882, 'KW': 1.0, 'KY': 0.848, 'LA': 0.405, 'LC': 0.296, 'LD': 0.944, 'LE': 0.892, 'LF': 0.139, 'LG': 0.596, 'LH': 0.653, 'LI': 0.013, 'LK': 0.893, 'LL': 0.0, 'LM': 0.062, 'LN': 0.452, 'LP': 0.309, 'LQ': 0.376, 'LR': 1.0, 'LS': 0.443, 'LT': 0.397, 'LV': 0.133, 'LW': 0.341, 'LY': 0.205, 'MA': 0.383, 'MC': 0.276, 'MD': 0.932, 'ME': 0.879, 'MF': 0.182, 'MG': 0.569, 'MH': 0.648, 'MI': 0.058, 'MK': 0.884, 'ML': 0.062, 'MM': 0.0, 'MN': 0.447, 'MP': 0.285, 'MQ': 0.372, 'MR': 1.0, 'MS': 0.417, 'MT': 0.358, 'MV': 0.12, 'MW': 0.391, 'MY': 0.255, 'NA': 0.424, 'NC': 0.425, 'ND': 0.838, 'NE': 0.835, 'NF': 0.766, 'NG': 0.512, 'NH': 0.78, 'NI': 0.615, 'NK': 0.891, 'NL': 0.603, 'NM': 0.588, 'NN': 0.0, 'NP': 0.266, 'NQ': 0.175, 'NR': 1.0, 'NS': 0.361, 'NT': 0.368, 'NV': 0.503, 'NW': 0.945, 'NY': 0.641, 'PA': 0.22, 'PC': 0.179, 'PD': 0.852, 'PE': 0.831, 'PF': 0.515, 'PG': 0.376, 'PH': 0.696, 'PI': 0.363, 'PK': 0.875, 'PL': 0.357, 'PM': 0.326, 'PN': 0.231, 'PP': 0.0, 'PQ': 0.228, 'PR': 1.0, 'PS': 0.196, 'PT': 0.161, 'PV': 0.244, 'PW': 0.72, 'PY': 0.481, 'QA': 0.512, 'QC': 0.462, 'QD': 0.903, 'QE': 0.861, 'QF': 0.671, 'QG': 0.648, 'QH': 0.765, 'QI': 0.532, 'QK': 0.881, 'QL': 0.518, 'QM': 0.505, 'QN': 0.181, 'QP': 0.272, 'QQ': 0.0, 'QR': 1.0, 'QS': 0.461, 'QT': 0.389, 'QV': 0.464, 'QW': 0.831, 'QY': 0.522, 'RA': 0.919, 'RC': 0.905, 'RD': 0.305, 'RE': 0.225, 'RF': 0.977, 'RG': 0.928, 'RH': 0.498, 'RI': 0.929, 'RK': 0.141, 'RL': 0.92, 'RM': 0.908, 'RN': 0.69, 'RP': 0.796, 'RQ': 0.668, 'RR': 0.0, 'RS': 0.86, 'RT': 0.808, 'RV': 0.914, 'RW': 1.0, 'RY': 0.859, 'SA': 0.1, 'SC': 0.185, 'SD': 0.801, 'SE': 0.812, 'SF': 0.622, 'SG': 0.17, 'SH': 0.718, 'SI': 0.478, 
'SK': 0.883, 'SL': 0.474, 'SM': 0.44, 'SN': 0.289, 'SP': 0.181, 'SQ': 0.358, 'SR': 1.0, 'SS': 0.0, 'ST': 0.174, 'SV': 0.342, 'SW': 0.827, 'SY': 0.615, 'TA': 0.251, 'TC': 0.261, 'TD': 0.83, 'TE': 0.812, 'TF': 0.604, 'TG': 0.312, 'TH': 0.737, 'TI': 0.455, 'TK': 0.866, 'TL': 0.453, 'TM': 0.403, 'TN': 0.315, 'TP': 0.159, 'TQ': 0.322, 'TR': 1.0, 'TS': 0.185, 'TT': 0.0, 'TV': 0.345, 'TW': 0.816, 'TY': 0.596, 'VA': 0.275, 'VC': 0.165, 'VD': 0.9, 'VE': 0.867, 'VF': 0.269, 'VG': 0.471, 'VH': 0.649, 'VI': 0.135, 'VK': 0.889, 'VL': 0.134, 'VM': 0.12, 'VN': 0.38, 'VP': 0.212, 'VQ': 0.339, 'VR': 1.0, 'VS': 0.322, 'VT': 0.305, 'VV': 0.0, 'VW': 0.472, 'VY': 0.31, 'WA': 0.658, 'WC': 0.56, 'WD': 1.0, 'WE': 0.931, 'WF': 0.196, 'WG': 0.829, 'WH': 0.678, 'WI': 0.305, 'WK': 0.892, 'WL': 0.304, 'WM': 0.344, 'WN': 0.631, 'WP': 0.555, 'WQ': 0.538, 'WR': 0.968, 'WS': 0.689, 'WT': 0.638, 'WV': 0.418, 'WW': 0.0, 'WY': 0.204, 'YA': 0.587, 'YC': 0.478, 'YD': 1.0, 'YE': 0.932, 'YF': 0.202, 'YG': 0.782, 'YH': 0.678, 'YI': 0.23, 'YK': 0.904, 'YL': 0.219, 'YM': 0.268, 'YN': 0.512, 'YP': 0.444, 'YQ': 0.404, 'YR': 0.995, 'YS': 0.612, 'YT': 0.557, 'YV': 0.328, 'YW': 0.244, 'YY': 0.0}#
Schneider-Wrede physicochemical distance matrix, taken from ProPy3: https://github.com/MartinThoma/propy3.
Paper
G. Schneider, P. Wrede. The rational design of amino acid sequences by artificial neural networks and simulated molecular evolution: de novo design of an idealized leader peptidase cleavage site. Biophysical Journal, Volume 66, Issue 2, Part 1, February 1994, Pages 335-344.
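The matrix is keyed by two concatenated one-letter codes, and the stored values are not symmetric (for example, 'KR' is 0.154 while 'RK' is 0.141). The following is a minimal sketch of a lookup, using a hand-copied excerpt of the values above. The names `distance_excerpt`, `pair_distance` and `mean_pair_distance` are hypothetical helpers for illustration and are not part of graphein.

```python
# Hand-copied excerpt of the matrix printed above; `distance_excerpt` is a
# hypothetical name used only for this illustration.
distance_excerpt = {
    "KK": 0.0,
    "KR": 0.154,
    "RK": 0.141,
    "LI": 0.013,
    "WY": 0.204,
    "YW": 0.244,
}


def pair_distance(matrix, first, second):
    """Return the stored distance for the ordered pair (first, second)."""
    return matrix[first + second]


def mean_pair_distance(matrix, first, second):
    """Average both orderings, since the stored values are not symmetric."""
    forward = pair_distance(matrix, first, second)
    reverse = pair_distance(matrix, second, first)
    return (forward + reverse) / 2
```

Whether to use the ordered value or an average of both orderings depends on the downstream use; averaging is only one possible way to obtain a symmetric quantity.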
- graphein.protein.resi_atoms.STANDARD_AMINO_ACIDS: List[str] = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'X', 'Y', 'Z']#
Vocabulary of amino acids with one-letter codes. Includes fuzzy standard amino acids: "B" denotes "ASX", which corresponds to "ASP" ("D") or "ASN" ("N"), and "Z" denotes "GLX", which corresponds to "GLU" ("E") or "GLN" ("Q").
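The fuzzy codes described above can be expanded into the concrete residues they may stand for. This is a small sketch under the assumption that only "B" and "Z" need expanding; `AMBIGUOUS_CODES`, `expand_code` and `expand_sequence` are hypothetical names, not graphein functions.

```python
# Mapping taken from the description above: B -> ASP or ASN, Z -> GLU or GLN.
# The names here are hypothetical and exist only for this illustration.
AMBIGUOUS_CODES = {
    "B": ("D", "N"),
    "Z": ("E", "Q"),
}


def expand_code(code):
    """Return the concrete one-letter codes a (possibly fuzzy) code may denote."""
    return AMBIGUOUS_CODES.get(code, (code,))


def expand_sequence(sequence):
    """Expand every position of a one-letter sequence into its candidates."""
    return [expand_code(code) for code in sequence]
```

For instance, `expand_sequence("AZB")` yields one candidate tuple per position, with the unambiguous "A" passed through unchanged.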
- graphein.protein.resi_atoms.STANDARD_RESI_NAMES: List[str] = ['ALA', 'ASX', 'CYS', 'ASP', 'GLU', 'PHE', 'GLY', 'HIS', 'ILE', 'LYS', 'LEU', 'MET', 'ASN', 'PRO', 'GLN', 'ARG', 'SER', 'THR', 'VAL', 'TRP', 'TYR', 'GLX', 'UNK']#
List of standard residue 3-letter names. Includes "UNK" for unknown residues. "ASX" denotes "ASP" or "ASN" and "GLX" denotes "GLU" or "GLN".
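One way such a list might be used is to map any residue name outside the vocabulary to "UNK". The sketch below copies the list shown above into a local variable so that it runs without graphein installed; `normalise_residue_name` is a hypothetical helper and not part of the library.

```python
# Copied from the STANDARD_RESI_NAMES value documented above.
STANDARD_RESI_NAMES = [
    "ALA", "ASX", "CYS", "ASP", "GLU", "PHE", "GLY", "HIS", "ILE", "LYS",
    "LEU", "MET", "ASN", "PRO", "GLN", "ARG", "SER", "THR", "VAL", "TRP",
    "TYR", "GLX", "UNK",
]

# A set gives constant-time membership checks.
_KNOWN_NAMES = set(STANDARD_RESI_NAMES)


def normalise_residue_name(name):
    """Return the upper-cased name if it is in the vocabulary, else "UNK"."""
    cleaned = name.strip().upper()
    return cleaned if cleaned in _KNOWN_NAMES else "UNK"
```

With this helper, a name such as "MSE" that is absent from the list is reported as "UNK", while "ala" is normalised to "ALA".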