pallatom.helpers.data

Classes

ProteinDataset

Lazy-loading Dataset backed by a JSONL file.

Functions

`make_data_loaders`(→ tuple[torch.utils.data.DataLoader, ...)	Build train, validation, and test DataLoaders from a JSONL file and a splits JSON.
`make_ddp_data_loaders`(...)	Build train/val/test DataLoaders backed by DistributedSampler for DDP training.

Module Contents

class pallatom.helpers.data.ProteinDataset(jsonl_path: str | pathlib.Path, names: list[str], max_seq_length: int = 256)

Bases: torch.utils.data.Dataset

Lazy-loading Dataset backed by a JSONL file.

Scans the file once at construction to build a name→byte-offset index (only offsets are kept in RAM, not the protein data). Each __getitem__ seeks to the relevant line and parses only that entry.

Compatible with num_workers > 0: the open file handle is excluded from pickling and re-opened lazily inside each worker process.
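The indexing and pickling behaviour described above can be sketched with the standard library alone. This is an illustrative stand-in, not the ProteinDataset source: the class name `OffsetIndexedJsonl` and the private attributes `_offsets`, `_names`, and `_fh` are hypothetical, and the real class may order or validate entries differently.

```python
import copy
import json
import tempfile
from pathlib import Path


class OffsetIndexedJsonl:
    """Hypothetical sketch of a name -> byte-offset index over a JSONL file."""

    def __init__(self, jsonl_path, names):
        self.jsonl_path = Path(jsonl_path)
        wanted = set(names)
        self._offsets = {}
        # One pass in binary mode, so offsets are exact byte positions.
        with open(self.jsonl_path, "rb") as fh:
            offset = 0
            for line in fh:
                if line.strip():
                    name = json.loads(line)["name"]
                    if name in wanted:
                        self._offsets[name] = offset
                offset += len(line)
        self._names = [n for n in names if n in self._offsets]
        self._fh = None  # opened lazily, once per process

    def __len__(self):
        return len(self._names)

    def __getitem__(self, idx):
        if self._fh is None:
            self._fh = open(self.jsonl_path, "rb")
        self._fh.seek(self._offsets[self._names[idx]])
        return json.loads(self._fh.readline())  # parse only this entry

    def __getstate__(self):
        state = self.__dict__.copy()
        state["_fh"] = None  # an open file handle cannot be pickled
        return state

    def __del__(self):
        fh = getattr(self, "_fh", None)
        if fh is not None:
            fh.close()


records = [{"name": "1abc.A", "seq": "ACD"}, {"name": "2xyz.B", "seq": "EFGH"}]
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "demo.jsonl"
    path.write_text("".join(json.dumps(r) + "\n" for r in records), encoding="utf-8")
    ds = OffsetIndexedJsonl(path, names=["2xyz.B"])
    entry = ds[0]
    # copy.copy goes through the same __getstate__ hook that pickle uses.
    clone = copy.copy(ds)
    clone_handle_dropped = clone._fh is None
    clone_entry = clone[0]  # the clone re-opens the file on first access
    for obj in (ds, clone):
        obj._fh.close()
        obj._fh = None
```

Dropping the handle in `__getstate__` is what lets each DataLoader worker open its own file descriptor instead of sharing a seek position with the parent process.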

JSONL format expected per line:

{"name": "1abc.A", "seq": "ACDEF…", "coords": {"N": [[x,y,z],…], "CA": […], "C": […], "O": […]}, …}

Parameters:

jsonl_path – Path to the JSONL file.
names – List of entry names (e.g. “1abc.A”) to include.
max_seq_length – Sequences longer than this are truncated; shorter ones are zero-padded to this length.
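The effect of max_seq_length on a single per-residue array can be illustrated as follows. This is a minimal sketch on plain Python lists; the helper name `fit_to_length` and the returned mask are assumptions for illustration, and the documentation does not say whether ProteinDataset returns a mask or which tensor types it uses.

```python
def fit_to_length(coords, max_seq_length, pad_value=0.0):
    """Truncate or zero-pad a list of [x, y, z] rows to exactly max_seq_length rows.

    Returns the fixed-length rows plus a 0/1 mask marking real residues
    (the mask is an illustrative addition, not documented behaviour).
    """
    rows = [list(row) for row in coords[:max_seq_length]]  # truncate
    n_real = len(rows)
    n_pad = max_seq_length - n_real
    rows.extend([pad_value] * 3 for _ in range(n_pad))  # zero-pad
    mask = [1] * n_real + [0] * n_pad
    return rows, mask


short_rows, short_mask = fit_to_length([[1.0, 2.0, 3.0]], max_seq_length=3)
long_rows, long_mask = fit_to_length(
    [[0.0, 0.0, 1.0], [0.0, 0.0, 2.0], [0.0, 0.0, 3.0]], max_seq_length=2
)
```

Fixing every entry to the same length is what allows a default DataLoader collate step to stack entries into one batch.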

__del__() → None

__getitem__(idx: int) → dict

__getstate__() → dict

__len__() → int

jsonl_path

max_seq_length = 256

pallatom.helpers.data.make_data_loaders(cfg: train.train_config.TrainConfig, jsonl_path: str | pathlib.Path, splits_path: str | pathlib.Path, num_workers: int = 0, debug_run: bool = True) → tuple[torch.utils.data.DataLoader, torch.utils.data.DataLoader, torch.utils.data.DataLoader]

Build train, validation, and test DataLoaders from a JSONL file and a splits JSON.

Parameters:

cfg – TrainConfig instance; batch_size and max_seq_length are read from cfg.loader.
jsonl_path – Path to the JSONL protein dataset.
splits_path – Path to a JSON file with keys “train”, “validation”, and “test”, each a list of entry names.
num_workers – DataLoader worker processes (0 = main process only).
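A splits file matching the splits_path description can be written and checked with the standard library. Apart from "1abc.A", the entry names below are made up for illustration, and the disjointness check is a suggested sanity test rather than something the loader is documented to enforce.

```python
import json
import tempfile
from pathlib import Path

# Three required keys, each mapping to a list of entry names.
splits = {
    "train": ["1abc.A", "2xyz.B"],
    "validation": ["3def.C"],
    "test": ["4ghi.D"],
}

with tempfile.TemporaryDirectory() as tmp:
    splits_path = Path(tmp) / "splits.json"
    splits_path.write_text(json.dumps(splits, indent=2), encoding="utf-8")
    loaded = json.loads(splits_path.read_text(encoding="utf-8"))

# Sanity checks worth running before training.
missing_keys = {"train", "validation", "test"} - loaded.keys()
train_test_overlap = set(loaded["train"]) & set(loaded["test"])
```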

Returns:

(train_loader, val_loader, test_loader)

pallatom.helpers.data.make_ddp_data_loaders(cfg: train.train_config.TrainConfig, jsonl_path: str | pathlib.Path, splits_path: str | pathlib.Path, rank: int, world_size: int, num_workers: int = 0) → tuple[torch.utils.data.DataLoader, torch.utils.data.DataLoader, torch.utils.data.DataLoader]

Build train/val/test DataLoaders backed by DistributedSampler for DDP training.
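To show what rank and world_size control, the sketch below mimics in pure Python how a DistributedSampler partitions dataset indices when shuffling is off and no samples are dropped. The function `shard_indices` is a hypothetical illustration; whether make_ddp_data_loaders shuffles, drops the last incomplete shard, or applies the sampler to all three splits is not stated here.

```python
import math


def shard_indices(n_items, rank, world_size):
    """Return the dataset indices one rank would see (non-shuffled case).

    Indices are padded by wrapping around so every rank gets the same
    number of samples, then taken in a strided pattern starting at rank.
    """
    if n_items == 0:
        return []
    per_rank = math.ceil(n_items / world_size)
    total = per_rank * world_size
    padded = [i % n_items for i in range(total)]  # wrap around to pad
    return padded[rank:total:world_size]


# 10 samples across 4 processes: each rank receives 3 indices.
shards = [shard_indices(10, rank, world_size=4) for rank in range(4)]
```

Because of the padding, a few samples can appear on more than one rank within an epoch, which matters when aggregating validation metrics across processes.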