MLOps E2E - 2-1. CT : Data Load (kubeflow pipeline)


Hongma 2022. 4. 27. 20:54

The training code consists of two main parts: data load and training.

This post covers the data load part.

The dataset is cifar10 (CIFAR-10), which is convenient for quick experiments in image classification.

I chose it because the PyTorch example in Kubeflow Pipelines also uses this data.

One option is to simply create a volume and store the dataset on it, but I decided to follow the example and use webdataset to store the data split into shards.

(The example appears to be aimed at large-scale data.)
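To make the idea of "split storage" concrete, here is a minimal sketch of sharding using only the standard library. It is a simplified illustration of the concept, not webdataset's implementation: the `write_shards` helper, the `bin` extension, and the dummy samples are all assumptions made for this example. Each sample becomes a group of tar members sharing one key, and a new tar file is started every `maxcount` samples.

```python
import io
import tarfile
import tempfile
from pathlib import Path


def write_shards(samples, out_dir, prefix, maxcount):
    """Write (payload_bytes, label) samples into tar shards of at most maxcount samples.

    Hypothetical helper: a simplified stand-in for the idea behind sharded
    writing, not the webdataset implementation.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    shard_paths = []
    tar = None
    for index, (payload, label) in enumerate(samples):
        if index % maxcount == 0:
            # Start a new shard every maxcount samples.
            if tar is not None:
                tar.close()
            path = out_dir / f"{prefix}-{len(shard_paths)}.tar"
            tar = tarfile.open(path, "w")
            shard_paths.append(path)
        key = "%06d" % index
        # Members sharing a key form one sample: <key>.bin and <key>.cls
        for ext, data in (("bin", payload), ("cls", str(label).encode())):
            info = tarfile.TarInfo(name=f"{key}.{ext}")
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    if tar is not None:
        tar.close()
    return shard_paths


def demo():
    # 2500 dummy samples with maxcount=1000 give shards of 1000, 1000 and 500 samples.
    samples = [(bytes([i % 256]), i % 10) for i in range(2500)]
    with tempfile.TemporaryDirectory() as tmp:
        paths = write_shards(samples, tmp, "train", maxcount=1000)
        counts = []
        for p in paths:
            with tarfile.open(p) as t:
                # Two members (.bin and .cls) per sample.
                counts.append(len(t.getnames()) // 2)
        return [p.name for p in paths], counts
```

Because every shard is an ordinary tar file, a reader can stream shards one after another instead of needing the whole dataset on a single mounted volume, which is presumably why the example targets large data.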

./DataLoad/data_load.py

"""Cifar10 pre-process module."""
import subprocess
from pathlib import Path
from argparse import ArgumentParser
import torchvision
import webdataset as wds
from sklearn.model_selection import train_test_split


if __name__ == "__main__":
    parser = ArgumentParser()
    parser.add_argument("--output_path", default="../data", type=str)

    args = vars(parser.parse_args())
    output_path = args["output_path"]

    Path(output_path).mkdir(parents=True, exist_ok=True)

    # Download the CIFAR-10 training and test sets into output_path.
    trainset = torchvision.datasets.CIFAR10(root=output_path, train=True, download=True)
    testset = torchvision.datasets.CIFAR10(root=output_path, train=False, download=True)

    # One directory of shards per split.
    Path(output_path + "/train").mkdir(parents=True, exist_ok=True)
    Path(output_path + "/val").mkdir(parents=True, exist_ok=True)
    Path(output_path + "/test").mkdir(parents=True, exist_ok=True)

    # Stratified 80/20 split of the training set into train and validation,
    # so each class keeps the same proportion in both parts.
    RANDOM_SEED = 42
    y = trainset.targets
    trainset, valset, y_train, y_val = train_test_split(
        trainset, y, stratify=y, shuffle=True, test_size=0.2, random_state=RANDOM_SEED
    )

    # Write each split as tar shards holding at most 1000 samples each.
    for dataset, split in [(trainset, "train"), (valset, "val"), (testset, "test")]:
        with wds.ShardWriter(
            output_path + "/" + split + "/" + split + "-%d.tar",
            maxcount=1000,
        ) as sink:
            for index, (image, cls) in enumerate(dataset):
                sink.write({"__key__": "%06d" % index, "ppm": image, "cls": cls})

    # List the generated files so they appear in the component logs.
    # text=True decodes stdout to str, so the listing prints as readable lines.
    entry_point = ["ls", "-R", output_path]
    run_code = subprocess.run(
        entry_point, stdout=subprocess.PIPE, text=True, check=False
    )
    print(run_code.stdout)

I gave output_path a default value so that the script can be unit tested.

I plan to add the visualization features from the example one by one later, once the pipeline is complete.
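As a rough sketch of why the default helps with testing: if argument parsing is pulled into a small function, a test can call it without any flags and still get a usable path. The `parse_args` helper and the two test functions below are hypothetical and are not part of data_load.py; `/tmp/cifar10` is just an illustrative path.

```python
from argparse import ArgumentParser


def parse_args(argv=None):
    """Parse CLI arguments; argv=None falls back to sys.argv[1:]."""
    parser = ArgumentParser()
    parser.add_argument("--output_path", default="../data", type=str)
    return vars(parser.parse_args(argv))


def test_default_output_path():
    # No flags given: the default is used, so no pipeline-supplied path is needed.
    assert parse_args([])["output_path"] == "../data"


def test_explicit_output_path():
    # A pipeline component can still override the path explicitly.
    args = parse_args(["--output_path", "/tmp/cifar10"])
    assert args["output_path"] == "/tmp/cifar10"
```

Passing `argv` explicitly keeps the tests independent of whatever arguments the test runner itself was started with.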
