Structure Similarity Task training #273

Open
ahariri13 opened this issue May 15, 2024 · 1 comment

Comments

@ahariri13

Hello! I'm still new to learning on proteins, and I was wondering how to train on the Structure Similarity Task (at least in an efficient manner) when using the graph format for PyTorch Geometric.

To load the data, I am using the following lines:

"""## Load the task and the dataset"""
datapath = './data/ec'
task = ps_tasks.StructureSimilarityTask(root=datapath)
dset = task.dataset

"""We convert the protein 3D structures to $\epsilon$-graphs ($\epsilon=8$ here):"""

def transform(data):
    data, protein_dict = data
    data.y = protein_dict['protein']['ID']
    return data

dset2 = dset.to_graph(eps=8.0).pyg(transform=transform)

from torch.utils.data import Subset
from torch_geometric.loader import DataLoader

batch_size = args.batch_size  # batch size from my argument parser

train_loader = DataLoader(Subset(dset2, task.train_index), batch_size=batch_size, shuffle=True, num_workers=0)
val_loader = DataLoader(Subset(dset2, task.val_index), batch_size=batch_size, shuffle=False, num_workers=0)
test_loader = DataLoader(Subset(dset2, task.test_index), batch_size=batch_size, shuffle=False, num_workers=0)
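
To check what the loaders actually yield, I printed one batch (a quick sanity check with plain PyG, nothing proteinshake-specific):

# Peek at one training batch to see how PyG collates the pairs.
first = next(iter(train_loader))
print(type(first), first)
# In my runs each batch appears to be a pair of PyG Batch objects,
# which is why I index batch[0] / batch[1] in the evaluation code below.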

My understanding is that we need to take two graph (protein) samples, embed them, and predict a regression value for their similarity. Using the PyG dataloader would batch all the dictionaries together, which is why I decided to keep only the protein ID as the batched label, and so I removed the ['protein']['ID'] part from the target function of the task in structure_similarity.py. As a result, my model looks as follows:

    def forward(self, batch):
        s1, s2 = None, None
        for it, sample in enumerate(batch):  # embed each of the two batched graphs separately
            x = sample.x
            edge_index = sample.edge_index

            x = self.x_embedding(x)
            x = self.conv1(x, edge_index)
            x = F.leaky_relu(x)
            x = self.bano1(x)
            # x = F.dropout(x, training=self.training, p=0.2)

            x = self.conv2(x, edge_index)
            x = F.leaky_relu(x)
            x = self.bano2(x)
            # x = F.dropout(x, training=self.training, p=0.2)

            x = self.conv3(x, edge_index)
            x = F.relu(x)
            x = self.bano3(x)
            # x = F.dropout(x, training=self.training, p=0.2)

            x = self.conv4(x, edge_index)
            # x = F.relu(x)
            # x = self.bano3(x)
            # x = F.dropout(x, training=self.training, p=0.2)

            if it == 0:
                s1 = global_add_pool(x, sample.batch)
            else:
                s2 = global_add_pool(x, sample.batch)

        final = self.mlpRep(s1 + s2)
        return final
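
For reference, here is the same forward pass with the shared encoder factored out into its own method, which is how I picture the Siamese setup (a sketch only; it reuses exactly the same layers and imports as above, and assumes batch is the pair of PyG Batch objects described earlier):

    def encode(self, data):
        # Shared GNN encoder applied to one whole PyG Batch at a time.
        x = self.x_embedding(data.x)
        x = self.bano1(F.leaky_relu(self.conv1(x, data.edge_index)))
        x = self.bano2(F.leaky_relu(self.conv2(x, data.edge_index)))
        x = self.bano3(F.relu(self.conv3(x, data.edge_index)))
        x = self.conv4(x, data.edge_index)
        return global_add_pool(x, data.batch)  # one embedding per graph

    def forward(self, batch):
        # batch = (graphs_1, graphs_2), each a PyG Batch holding one side of the pairs.
        s1 = self.encode(batch[0])
        s2 = self.encode(batch[1])
        return self.mlpRep(s1 + s2)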

And here is the evaluation function, where I loop over the batch to collect the ground-truth similarity labels:

@torch.no_grad()
def eval_epoch(model, loader):
    model.eval()

    y_true = []
    y_pred = []

    for step, batch in enumerate(loader):  # use the loader that was passed in, not val_loader
        size = len(batch[0].y)
        batch[0] = batch[0].to(device)
        batch[1] = batch[1].to(device)

        y_hat = model(batch)

        # Recompute the ground-truth similarity for every pair in the batch.
        truths = [task.targetBatch(batch[0].y[g], batch[1].y[g]) for g in range(size)]
        y_pred.append(y_hat)
        y_true.append(torch.Tensor(truths))

    y_true = torch.hstack(y_true).detach().cpu().numpy()
    y_pred = torch.vstack(y_pred).detach().cpu().numpy()
    scores = task.evaluate(y_true, y_pred)
    return scores
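
Since the validation and test pairs are fixed (shuffle=False), one idea I have been considering is to precompute their ground-truth similarities once and reuse them across epochs, roughly like this (a sketch only, relying on my modified targetBatch helper; I am not sure whether proteinshake already exposes these targets directly):

@torch.no_grad()
def precompute_targets(loader):
    # The pairs in an unshuffled loader come in a fixed order, so their
    # ground-truth similarities only need to be computed once.
    targets = []
    for batch in loader:
        for y0, y1 in zip(batch[0].y, batch[1].y):
            targets.append(task.targetBatch(y0, y1))
    return torch.Tensor(targets)

val_targets = precompute_targets(val_loader)  # computed once, reused every epoch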

Of course, training is taking far too long, and I would appreciate any tips on how to use the proteinshake package more efficiently for this task. Thanks a lot in advance!

@cgoliver
Collaborator

cgoliver commented May 16, 2024

Dear @ahariri13, thank you for contacting us! Glad to hear you are using the tool.

I am very busy for the next week and haven't had a chance to look closely at your code, but I know @claying has run some experiments on this dataset and would have some suggestions.

Best,
Carlos
