Hello! I'm still new to learning on proteins and I was wondering how to train on the Structure Similarity Task (at least in an efficient manner) when using the graph format for PyTorch Geometric.
For loading the data I am using the following lines:
"""## Load the task and the dataset"""
datapath = './data/ec'
task = ps_tasks.StructureSimilarityTask(root=datapath)
dset = task.dataset
"""We convert the protein 3D structures to $\epsilon$-graphs ($\epsilon=8$ here):"""
def transform(data):
data, protein_dict = data
data.y = protein_dict['protein']['ID']
return data
dset2 = dset.to_graph(eps=8.0).pyg(transform=transform)
from torch.utils.data import Subset
from torch_geometric.loader import DataLoader
batch_size = args.batch_size
train_loader = DataLoader(Subset(dset2, task.train_index), batch_size=batch_size,shuffle=True, num_workers=0)
val_loader = DataLoader(Subset(dset2, task.val_index), batch_size=batch_size,shuffle=False, num_workers=0)
test_loader = DataLoader(Subset(dset2, task.test_index), batch_size=batch_size,shuffle=False, num_workers=0)
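For reference, a quick way to check the pair structure of a batch (as far as I can tell, each batch comes out as a pair of PyG Batch objects, one per protein in the pair):

# Quick sanity check of the pair structure of a batch (exploratory, not part of training).
example = next(iter(train_loader))
print(type(example), len(example))          # I get a sequence of length 2
print(example[0])                           # Batch of the first proteins in the pairs
print(example[1])                           # Batch of the second proteins in the pairs
print(example[0].y[:3], example[1].y[:3])   # protein IDs set by the transform above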
My understanding is that we need to take two graph (protein) samples, embed them, and predict a regression value for their similarity. Since the PyG DataLoader tries to collate all the protein dictionaries together, I decided to keep only the protein ID as the per-graph label, and accordingly removed the ['protein']['ID'] part from the target function in structure_similarity.py. As a result, my model looks as follows (a sketch of the layer definitions is given after the forward pass):
def forward(self, batch):
    # `batch` is a pair of PyG Batch objects, one per protein in the pair.
    pooled = []
    for sample in batch:  # embed the two batched proteins separately
        x = self.x_embedding(sample.x)
        edge_index = sample.edge_index

        x = F.leaky_relu(self.conv1(x, edge_index))
        x = self.bano1(x)
        # x = F.dropout(x, p=0.2, training=self.training)  # optional dropout, currently disabled

        x = F.leaky_relu(self.conv2(x, edge_index))
        x = self.bano2(x)

        x = F.relu(self.conv3(x, edge_index))
        x = self.bano3(x)

        x = self.conv4(x, edge_index)

        pooled.append(global_add_pool(x, sample.batch))  # one graph-level embedding per protein

    s1, s2 = pooled
    return self.mlpRep(s1 + s2)  # regress the similarity from the combined pair embedding
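For context, a minimal sketch of the kind of module the forward pass above assumes (the concrete layer types and sizes here, GCNConv, BatchNorm1d, a node-label Embedding, and a two-layer MLP head, are placeholders; node features are assumed to be integer amino-acid labels):

import torch.nn as nn
from torch_geometric.nn import GCNConv, global_add_pool

class PairGNN(nn.Module):
    """Sketch of the module behind the forward pass above (layer choices are placeholders)."""
    def __init__(self, num_residue_types=26, hidden=128):
        super().__init__()
        # Node features are assumed to be integer amino-acid labels, hence an Embedding.
        self.x_embedding = nn.Embedding(num_residue_types, hidden)
        self.conv1 = GCNConv(hidden, hidden)
        self.conv2 = GCNConv(hidden, hidden)
        self.conv3 = GCNConv(hidden, hidden)
        self.conv4 = GCNConv(hidden, hidden)
        self.bano1 = nn.BatchNorm1d(hidden)
        self.bano2 = nn.BatchNorm1d(hidden)
        self.bano3 = nn.BatchNorm1d(hidden)
        # Regression head applied to the summed pair embedding.
        self.mlpRep = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1))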
And here is the evaluation function, where I have to loop over each batch to collect the ground-truth similarity values:
@torch.no_grad()
def eval_epoch(model, loader):
    model.eval()
    y_true = []
    y_pred = []
    for step, batch in enumerate(loader):
        size = len(batch[0].y)
        batch[0] = batch[0].to(device)
        batch[1] = batch[1].to(device)
        y_hat = model(batch)
        truths = []
        for g in range(size):  # look up the ground-truth similarity for each pair in the batch
            truths.append(task.targetBatch(batch[0].y[g], batch[1].y[g]))
        y_pred.append(y_hat)
        y_true.append(torch.tensor(truths))
    y_true = torch.hstack(y_true).cpu().numpy()
    y_pred = torch.vstack(y_pred).cpu().numpy()
    scores = task.evaluate(y_true, y_pred)
    return scores
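For completeness, this is roughly how I call it on the validation split (just a sketch; as far as I understand, task.evaluate returns a dict of metrics, Spearman correlation for this task, though the exact keys may differ):

# Rough usage sketch: evaluate on the validation split and inspect the returned metrics.
val_scores = eval_epoch(model, val_loader)
print(val_scores)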
Of course the training is taking too long, and I would appreciate any tips on how to use the proteinshake package more efficiently for this task. Thanks a lot in advance!
Dear @ahariri13, thank you for contacting us! Glad to hear you are using the tool.
I am very busy for the next week and haven't had a chance to look closely at your code, but I know @claying has run some experiments on this dataset and would have some suggestions.