pallatom.helpers.data
Classes
ProteinDataset | Lazy-loading Dataset backed by a JSONL file.
Functions
make_data_loaders | Build train, validation, and test DataLoaders from a JSONL file and a splits JSON.
make_ddp_data_loaders | Build train/val/test DataLoaders backed by DistributedSampler for DDP training.
Module Contents
- class pallatom.helpers.data.ProteinDataset(jsonl_path: str | pathlib.Path, names: list[str], max_seq_length: int = 256)
Bases: torch.utils.data.Dataset
Lazy-loading Dataset backed by a JSONL file.
Scans the file once at construction to build a name→byte-offset index (only offsets are kept in RAM, not the protein data). Each __getitem__ seeks to the relevant line and parses only that entry.
Compatible with num_workers > 0: the open file handle is excluded from pickling and re-opened lazily inside each worker process.
- JSONL format expected per line:
- {"name": "1abc.A", "seq": "ACDEF…", "coords": {"N": [[x,y,z],…], "CA": […], "C": […], "O": […]}, …}
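The offset-index idea described above can be sketched with the standard library alone. This is an illustrative stand-in, not the pallatom implementation: the class name OffsetIndexedJsonl, the private attribute _fh, and the demo helper are all hypothetical, and the real class additionally subclasses torch.utils.data.Dataset and returns padded data rather than raw dictionaries.

```python
import json
import tempfile
from pathlib import Path


class OffsetIndexedJsonl:
    """Hypothetical sketch of a lazy JSONL reader keyed by entry name."""

    def __init__(self, jsonl_path, names):
        self.jsonl_path = Path(jsonl_path)
        wanted = set(names)
        self.offsets = {}
        # Single pass over the file: record the byte offset where each
        # wanted line starts. The parsed entry itself is discarded.
        with open(self.jsonl_path, "rb") as fh:
            while True:
                offset = fh.tell()
                line = fh.readline()
                if not line:
                    break
                if not line.strip():
                    continue
                name = json.loads(line)["name"]
                if name in wanted:
                    self.offsets[name] = offset
        # Keep the caller's ordering, dropping names absent from the file.
        self.names = [n for n in names if n in self.offsets]
        self._fh = None  # opened lazily, once per process

    def __len__(self):
        return len(self.names)

    def __getitem__(self, idx):
        if self._fh is None:
            self._fh = open(self.jsonl_path, "rb")
        self._fh.seek(self.offsets[self.names[idx]])
        return json.loads(self._fh.readline())

    def __getstate__(self):
        # Open file objects cannot be pickled, so the handle is dropped
        # here and re-opened on first access in the receiving process.
        state = self.__dict__.copy()
        state["_fh"] = None
        return state


def demo():
    records = [
        {"name": "1abc.A", "seq": "ACD", "coords": {"CA": [[0.0, 0.0, 0.0]] * 3}},
        {"name": "2xyz.B", "seq": "EF", "coords": {"CA": [[1.0, 1.0, 1.0]] * 2}},
    ]
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "toy.jsonl"
        path.write_text("".join(json.dumps(r) + "\n" for r in records), encoding="utf-8")
        ds = OffsetIndexedJsonl(path, names=["2xyz.B"])
        entry = ds[0]                 # opens the handle and seeks
        state = ds.__getstate__()     # what pickle would serialize
        ds._fh.close()
        return len(ds), entry["seq"], state["_fh"] is None
```

Because pickle calls __getstate__ when a DataLoader ships the dataset to a worker, each worker receives only the path and the offsets and opens its own handle on first access.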
- Parameters:
jsonl_path – Path to the JSONL file.
names – List of entry names (e.g. “1abc.A”) to include.
max_seq_length – Sequences longer than this are truncated; shorter ones are zero-padded to this length.
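The truncate-or-pad rule for max_seq_length can be illustrated on a plain list of per-residue coordinates. This is a minimal sketch under assumptions: the function name pad_or_truncate is hypothetical, and the actual tensor layout produced by ProteinDataset (and whether a padding mask accompanies it) is not specified here.

```python
def pad_or_truncate(coords, max_seq_length, pad_value=0.0):
    """Return exactly max_seq_length rows of [x, y, z].

    coords is a list with one [x, y, z] triple per residue.
    """
    # Longer inputs are cut at max_seq_length.
    clipped = [list(xyz) for xyz in coords[:max_seq_length]]
    # Shorter inputs are filled with zero rows up to max_seq_length.
    n_missing = max_seq_length - len(clipped)
    return clipped + [[pad_value] * 3 for _ in range(n_missing)]
```

Every output has the same length, which is what lets a default collate function stack entries into a batch.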
- jsonl_path
- max_seq_length = 256
- pallatom.helpers.data.make_data_loaders(cfg: train.train_config.TrainConfig, jsonl_path: str | pathlib.Path, splits_path: str | pathlib.Path, num_workers: int = 0, debug_run: bool = True) → tuple[torch.utils.data.DataLoader, torch.utils.data.DataLoader, torch.utils.data.DataLoader]
Build train, validation, and test DataLoaders from a JSONL file and a splits JSON.
- Parameters:
cfg – TrainConfig instance; batch_size and max_seq_length are read from cfg.loader.
jsonl_path – Path to the JSONL protein dataset.
splits_path – Path to a JSON file with keys “train”, “validation”, and “test”, each a list of entry names.
num_workers – DataLoader worker processes (0 = main process only).
- Returns:
(train_loader, val_loader, test_loader)
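As an illustration of the splits file described above, the sketch below parses a splits JSON document and checks for the three expected keys. The helper name parse_splits and the error handling are assumptions for this example; how make_data_loaders validates its input is not documented here.

```python
import json

REQUIRED_KEYS = ("train", "validation", "test")


def parse_splits(json_text):
    """Parse splits JSON text into {split name: list of entry names}."""
    splits = json.loads(json_text)
    missing = [key for key in REQUIRED_KEYS if key not in splits]
    if missing:
        raise KeyError(f"splits file is missing keys: {missing}")
    return {key: list(splits[key]) for key in REQUIRED_KEYS}


example = '{"train": ["1abc.A"], "validation": [], "test": ["2xyz.B"]}'
splits = parse_splits(example)
```

Each list would then serve as the names argument of one ProteinDataset, giving one dataset per split.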
- pallatom.helpers.data.make_ddp_data_loaders(cfg: train.train_config.TrainConfig, jsonl_path: str | pathlib.Path, splits_path: str | pathlib.Path, rank: int, world_size: int, num_workers: int = 0) → tuple[torch.utils.data.DataLoader, torch.utils.data.DataLoader, torch.utils.data.DataLoader]
Build train/val/test DataLoaders backed by DistributedSampler for DDP training.
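To show what rank and world_size control, the following standard-library sketch imitates the strided sharding that torch.utils.data.DistributedSampler applies when shuffling is off and drop_last is False. The function shard_indices is a hypothetical illustration only; it is not part of pallatom or PyTorch, and the sampler settings that make_ddp_data_loaders actually uses are not stated here.

```python
import math


def shard_indices(dataset_len, rank, world_size):
    """Indices that one rank would see, without shuffling."""
    if dataset_len <= 0:
        raise ValueError("dataset_len must be positive")
    # Every rank gets the same number of samples.
    num_samples = math.ceil(dataset_len / world_size)
    total_size = num_samples * world_size
    indices = list(range(dataset_len))
    # Pad by wrapping around to the start so the list divides evenly.
    padding = total_size - len(indices)
    repeats = math.ceil(padding / len(indices)) if padding else 0
    indices += (indices * repeats)[:padding]
    # Rank r takes every world_size-th index, starting at r.
    return indices[rank:total_size:world_size]
```

With 10 entries and 4 ranks, each rank receives 3 indices and two entries are repeated to fill the last shards, so per-rank batch counts stay equal.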