pallatom.helpers.data

Classes

ProteinDataset

Lazy-loading Dataset backed by a JSONL file.

Functions

make_data_loaders(→ tuple[torch.utils.data.DataLoader, ...)

Build train, validation, and test DataLoaders from a JSONL file and a splits JSON.

make_ddp_data_loaders(...)

Build train/val/test DataLoaders backed by DistributedSampler for DDP training.

Module Contents

class pallatom.helpers.data.ProteinDataset(jsonl_path: str | pathlib.Path, names: list[str], max_seq_length: int = 256)

Bases: torch.utils.data.Dataset

Lazy-loading Dataset backed by a JSONL file.

Scans the file once at construction to build a name→byte-offset index (only offsets are kept in RAM, not the protein data). Each __getitem__ seeks to the relevant line and parses only that entry.

Compatible with num_workers > 0: the open file handle is excluded from pickling and re-opened lazily inside each worker process.

JSONL format expected per line:
{"name": "1abc.A", "seq": "ACDEF…", "coords": {"N": [[x,y,z],…], "CA": […], "C": […], "O": […]}, …}
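The indexing scheme described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the ProteinDataset source: the class name OffsetIndexedJsonl, its attribute names, and the two sample records (including the entry name "2xyz.B") are made up for the example.

```python
import json
import pickle
import tempfile


class OffsetIndexedJsonl:
    """Hypothetical stand-in that illustrates byte-offset indexing of a JSONL file."""

    def __init__(self, path, names):
        self.path = str(path)
        wanted = set(names)
        self.offsets = {}
        # Single pass over the file: record where each wanted line starts.
        with open(self.path, "rb") as fh:
            while True:
                pos = fh.tell()
                line = fh.readline()
                if not line:
                    break
                if not line.strip():
                    continue  # tolerate blank lines
                name = json.loads(line)["name"]
                if name in wanted:
                    self.offsets[name] = pos
        self.names = [n for n in names if n in self.offsets]
        self._fh = None  # opened lazily, once per process

    def __len__(self):
        return len(self.names)

    def __getitem__(self, idx):
        if self._fh is None:
            self._fh = open(self.path, "rb")
        # Jump straight to the stored offset and parse only that line.
        self._fh.seek(self.offsets[self.names[idx]])
        return json.loads(self._fh.readline())

    def __getstate__(self):
        # File handles cannot be pickled; drop it so a worker reopens on demand.
        state = self.__dict__.copy()
        state["_fh"] = None
        return state


# Made-up records in the per-line format shown above (only CA coords for brevity).
records = [
    {"name": "1abc.A", "seq": "ACD", "coords": {"CA": [[0.0, 0.0, 0.0]] * 3}},
    {"name": "2xyz.B", "seq": "EF", "coords": {"CA": [[1.0, 1.0, 1.0]] * 2}},
]
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as tmp:
    for rec in records:
        tmp.write(json.dumps(rec) + "\n")
    path = tmp.name

ds = OffsetIndexedJsonl(path, ["2xyz.B"])
entry = ds[0]                          # seeks to the second line only
clone = pickle.loads(pickle.dumps(ds))  # what a DataLoader worker would receive
print(entry["seq"], clone[0]["name"])
```

Opening the file in binary mode keeps tell() and seek() in plain byte offsets, which is why the sketch decodes each line through json.loads rather than reading in text mode.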

Parameters:
  • jsonl_path – Path to the JSONL file.

  • names – List of entry names (e.g. “1abc.A”) to include.

  • max_seq_length – Sequences longer than this are truncated; shorter ones are zero-padded to this length.
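The effect of max_seq_length can be pictured with a small helper. This is a sketch of the truncate-or-zero-pad rule only; the function name pad_or_truncate and its width parameter are hypothetical, and the actual tensor layout that ProteinDataset returns is not specified here.

```python
def pad_or_truncate(rows, max_len, width=3):
    """Return exactly max_len rows: drop extra rows, append all-zero rows."""
    clipped = [list(r) for r in rows[:max_len]]
    padding = [[0.0] * width for _ in range(max_len - len(clipped))]
    return clipped + padding


# One residue padded up to three rows.
short = pad_or_truncate([[1.0, 2.0, 3.0]], max_len=3)

# Five residues cut down to the first three.
long = pad_or_truncate([[float(i)] * 3 for i in range(5)], max_len=3)

print(short)
print(long)
```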

__del__() → None
__getitem__(idx: int) → dict
__getstate__() → dict
__len__() → int
jsonl_path
max_seq_length = 256
pallatom.helpers.data.make_data_loaders(cfg: train.train_config.TrainConfig, jsonl_path: str | pathlib.Path, splits_path: str | pathlib.Path, num_workers: int = 0, debug_run: bool = True) → tuple[torch.utils.data.DataLoader, torch.utils.data.DataLoader, torch.utils.data.DataLoader]

Build train, validation, and test DataLoaders from a JSONL file and a splits JSON.

Parameters:
  • cfg – TrainConfig instance; batch_size and max_seq_length are read from cfg.loader.

  • jsonl_path – Path to the JSONL protein dataset.

  • splits_path – Path to a JSON file with keys “train”, “validation”, and “test”, each a list of entry names.

  • num_workers – DataLoader worker processes (0 = main process only).

Returns:

(train_loader, val_loader, test_loader)
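A splits file of the shape described for splits_path could be written and checked as follows. This is a minimal sketch using only the standard library: the helper load_splits, the constant REQUIRED_KEYS, and the entry names are assumptions for illustration, and the commented call at the end presumes a TrainConfig instance named cfg that has been built elsewhere.

```python
import json
import os
import tempfile

REQUIRED_KEYS = ("train", "validation", "test")


def load_splits(path):
    """Read a splits JSON and check that it has the three expected keys."""
    with open(path, encoding="utf-8") as fh:
        splits = json.load(fh)
    missing = [k for k in REQUIRED_KEYS if k not in splits]
    if missing:
        raise KeyError(f"splits file is missing keys: {missing}")
    return {k: list(splits[k]) for k in REQUIRED_KEYS}


# Write a tiny splits file; each key maps to a list of entry names.
splits_path = os.path.join(tempfile.mkdtemp(), "splits.json")
with open(splits_path, "w", encoding="utf-8") as fh:
    json.dump({"train": ["1abc.A"], "validation": ["2xyz.B"], "test": []}, fh)

splits = load_splits(splits_path)
print({k: len(v) for k, v in splits.items()})

# With pallatom installed and cfg available, the documented call would be:
# train_loader, val_loader, test_loader = make_data_loaders(
#     cfg, jsonl_path, splits_path, num_workers=2
# )
```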

pallatom.helpers.data.make_ddp_data_loaders(cfg: train.train_config.TrainConfig, jsonl_path: str | pathlib.Path, splits_path: str | pathlib.Path, rank: int, world_size: int, num_workers: int = 0) → tuple[torch.utils.data.DataLoader, torch.utils.data.DataLoader, torch.utils.data.DataLoader]

Build train/val/test DataLoaders backed by DistributedSampler for DDP training.
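To show what rank and world_size control, the sketch below mimics in pure Python the rank-strided partitioning that torch.utils.data.DistributedSampler applies with shuffling disabled and drop_last left at its default. The function shard_indices is hypothetical and is not part of pallatom or torch; whether make_ddp_data_loaders shuffles any of its splits is not stated here.

```python
import math


def shard_indices(n_items, rank, world_size):
    """Indices one rank would see: pad by wrapping around, then stride by world_size."""
    if n_items == 0:
        return []
    per_rank = math.ceil(n_items / world_size)
    total = per_rank * world_size
    # Repeat the index list as needed so every rank gets the same number of items.
    indices = (list(range(n_items)) * math.ceil(total / n_items))[:total]
    return indices[rank:total:world_size]


# Ten samples split across four processes: each rank receives three indices,
# and the two padding slots reuse indices 0 and 1.
shards = [shard_indices(10, rank, 4) for rank in range(4)]
for rank, shard in enumerate(shards):
    print(rank, shard)
```

Because of the wrap-around padding, a few samples can be seen by more than one rank in an epoch, which is worth keeping in mind when aggregating validation or test metrics across processes.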