@wconnell
Contributor

  1. With lazy=True, MoleculeDataset still converts each SMILES string to an RDKit Mol object and then overwrites it with None. This severely slows down loading large datasets such as ZINC2m, so this PR skips the unnecessary conversion to a Mol when lazy=True.
  2. Check the download/extract logic for the ZINC2m dataset to avoid redundant work and conflicts.
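
To illustrate the first point, here is a minimal sketch of skipping the parse step under lazy loading. It is not TorchDrug's actual implementation: `load_molecules`, `get_molecule`, and `stand_in_parse` are hypothetical names, and `stand_in_parse` only stands in for an expensive parser such as RDKit's `Chem.MolFromSmiles` so that the example runs without RDKit installed.

```python
from typing import Callable, List, Optional

# Records every SMILES string that actually gets parsed (for illustration only).
parse_calls: List[str] = []


def stand_in_parse(smiles: str) -> object:
    """Hypothetical stand-in for an expensive parser such as Chem.MolFromSmiles."""
    parse_calls.append(smiles)
    return ("mol", smiles)


def load_molecules(
    smiles_list: List[str],
    parse: Callable[[str], object],
    lazy: bool = False,
) -> List[Optional[object]]:
    """Return one entry per SMILES string; None placeholders when lazy."""
    if lazy:
        # Skip parsing entirely instead of parsing and then discarding the result.
        return [None] * len(smiles_list)
    return [parse(s) for s in smiles_list]


def get_molecule(
    data: List[Optional[object]],
    smiles_list: List[str],
    index: int,
    parse: Callable[[str], object],
) -> object:
    """Parse on demand if the slot still holds the lazy placeholder."""
    if data[index] is None:
        return parse(smiles_list[index])
    return data[index]
```

With `lazy=True`, loading costs no parser calls at all; each molecule is parsed only when it is requested through `get_molecule`.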

@KiddoZhu
Member

Thanks for your contribution! That's a really important step for large datasets.

As for the file check in ZINC2m, I would prefer to always extract the zip file. We don't check the MD5 of the extracted file, so an extracted file that already exists might be broken (e.g. if extraction was interrupted with Ctrl+C).
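
A rough sketch of that policy, assuming the archive itself has a known MD5: verify the archive, then extract unconditionally so that a partially written file from an interrupted run is overwritten. `compute_md5` and `extract_always` are hypothetical helpers written for this illustration, not functions from the library.

```python
import hashlib
import os
import zipfile


def compute_md5(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def extract_always(zip_path, out_dir, expected_md5=None):
    """Verify the archive (if a checksum is given), then extract unconditionally.

    Existing extracted files are overwritten, so a file left truncated by an
    interrupted earlier extraction is replaced with a complete copy.
    """
    if expected_md5 is not None and compute_md5(zip_path) != expected_md5:
        raise ValueError("archive checksum mismatch: %s" % zip_path)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(out_dir)
        return [os.path.join(out_dir, name) for name in archive.namelist()]
```

The trade-off is extra extraction time on every load in exchange for never trusting an unverified extracted file.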

@KiddoZhu
Member

Manually rebased and merged

2 participants

@wconnell @KiddoZhu