pallatom.helpers.data
=====================

.. py:module:: pallatom.helpers.data


Classes
-------

.. autoapisummary::

   pallatom.helpers.data.ProteinDataset


Functions
---------

.. autoapisummary::

   pallatom.helpers.data.make_data_loaders
   pallatom.helpers.data.make_ddp_data_loaders


Module Contents
---------------

.. py:class:: ProteinDataset(jsonl_path: str | pathlib.Path, names: list[str], max_seq_length: int = 256)

   Bases: :py:obj:`torch.utils.data.Dataset`

   Lazy-loading Dataset backed by a JSONL file.

   Scans the file once at construction to build a name→byte-offset index
   (only offsets are kept in RAM, not the protein data). Each
   ``__getitem__`` seeks to the relevant line and parses only that entry.

   Compatible with ``num_workers > 0``: the open file handle is excluded
   from pickling and re-opened lazily inside each worker process.

   JSONL format expected per line::

      {"name": "1abc.A", "seq": "ACDEF...", "coords": {"N": [[x,y,z],...], "CA": [...], "C": [...], "O": [...]}, ...}

   :param jsonl_path: Path to the JSONL file.
   :param names: List of entry names (e.g. "1abc.A") to include.
   :param max_seq_length: Sequences longer than this are truncated; shorter
                          ones are zero-padded to this length.


   .. py:method:: __del__() -> None


   .. py:method:: __getitem__(idx: int) -> dict


   .. py:method:: __getstate__() -> dict


   .. py:method:: __len__() -> int


   .. py:attribute:: jsonl_path


   .. py:attribute:: max_seq_length
      :value: 256


.. py:function:: make_data_loaders(cfg: train.train_config.TrainConfig, jsonl_path: str | pathlib.Path, splits_path: str | pathlib.Path, num_workers: int = 0, debug_run: bool = True) -> tuple[torch.utils.data.DataLoader, torch.utils.data.DataLoader, torch.utils.data.DataLoader]

   Build train, validation, and test DataLoaders from a JSONL file and a
   splits JSON.

   :param cfg: TrainConfig: ``batch_size`` and ``max_seq_length`` are read
               from ``cfg.loader``.
   :param jsonl_path: Path to the JSONL protein dataset.
   :param splits_path: Path to a JSON file with keys "train", "validation",
                       and "test", each a list of entry names.
   :param num_workers: DataLoader worker processes (0 = main process only).

   :returns: (train_loader, val_loader, test_loader)


.. py:function:: make_ddp_data_loaders(cfg: train.train_config.TrainConfig, jsonl_path: str | pathlib.Path, splits_path: str | pathlib.Path, rank: int, world_size: int, num_workers: int = 0) -> tuple[torch.utils.data.DataLoader, torch.utils.data.DataLoader, torch.utils.data.DataLoader]

   Build train/val/test DataLoaders backed by DistributedSampler for DDP
   training.
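

The lazy-loading scheme described for ``ProteinDataset`` (one scan to record byte offsets, a seek per item, and a file handle that is dropped on pickling) could look roughly like the sketch below. It is not the module's actual implementation: the class name ``OffsetIndexedJsonl`` is hypothetical, it does not subclass ``torch.utils.data.Dataset`` so that it runs with the standard library alone, and it leaves out truncation and padding to ``max_seq_length``.

```python
import json
from pathlib import Path


class OffsetIndexedJsonl:
    """Hypothetical stand-in illustrating the technique; not ProteinDataset."""

    def __init__(self, jsonl_path, names):
        self.jsonl_path = Path(jsonl_path)
        wanted = set(names)
        offsets = {}
        # Single pass over the file: record where each wanted line starts.
        # Binary mode keeps tell()/seek() positions exact byte offsets.
        with open(self.jsonl_path, "rb") as fh:
            while True:
                pos = fh.tell()
                line = fh.readline()
                if not line:
                    break  # end of file
                if not line.strip():
                    continue  # skip blank lines
                name = json.loads(line)["name"]
                if name in wanted:
                    offsets[name] = pos  # the parsed record is discarded
        # Keep the caller's ordering; drop names missing from the file.
        self.names = [n for n in names if n in offsets]
        self.offsets = offsets
        self._fh = None  # opened lazily on first item access

    def __len__(self):
        return len(self.names)

    def __getitem__(self, idx):
        if self._fh is None:
            # First access in this process (main process or a worker).
            self._fh = open(self.jsonl_path, "rb")
        self._fh.seek(self.offsets[self.names[idx]])
        return json.loads(self._fh.readline())

    def __getstate__(self):
        # Open file objects cannot be pickled, so send the state without
        # the handle; each unpickled copy reopens the file on demand.
        state = self.__dict__.copy()
        state["_fh"] = None
        return state

    def __del__(self):
        fh = getattr(self, "_fh", None)
        if fh is not None:
            fh.close()
```

Because the pickled state carries no handle, each copy of such an object opens its own file the first time it is indexed, which is the behaviour the class description attributes to worker processes.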