My Layer training takes forever

I have a Layer project running a PyTorch training process with data from the public dataset. The number of epochs is set to 1, but training takes forever.

Hi Henry,

Thank you for raising the issue!

I suggest trying our Pipeline logs feature, which will help you debug your model training. Please refer to https://docs.beta.layer.co/docs/guides/pipeline-logs

Please let me know if this helps you figure out what is going on and if you need further assistance.

Thanks,
Dimitar

The end of the log says the process exited with code 1, but training keeps going.

Command: layer log --follow <PIPELINE_ID>

Here is the entire log:

[2021-11-04 20:53:40, FEATURESET_BUILD, category, python-runner-0f47cd8a-115d-4cdf-8881-4453732966e9-slvhh] Using selector: EpollSelector
[2021-11-04 20:53:40, FEATURESET_BUILD, category, python-runner-0f47cd8a-115d-4cdf-8881-4453732966e9-slvhh] Successfully logged into https://beta.layer.co
[2021-11-04 20:53:40, FEATURESET_BUILD, category, python-runner-0f47cd8a-115d-4cdf-8881-4453732966e9-slvhh] Using selector: EpollSelector
[2021-11-04 20:53:40, FEATURESET_BUILD, category, python-runner-0f47cd8a-115d-4cdf-8881-4453732966e9-slvhh] Creating ~/source dir
[2021-11-04 20:53:40, FEATURESET_BUILD, category, python-runner-0f47cd8a-115d-4cdf-8881-4453732966e9-slvhh] Place init.py in ~/source
[2021-11-04 20:53:40, FEATURESET_BUILD, category, python-runner-0f47cd8a-115d-4cdf-8881-4453732966e9-slvhh] Download binary(9ccf7931-18a8-42fc-9553-fad607842abc/cat_and_dog_features/category/91e3ed44-7026-476e-b6d2-867830d07c97/cat_and_dog_features.category.tgz) to temp directory
[2021-11-04 20:53:40, FEATURESET_BUILD, category, python-runner-0f47cd8a-115d-4cdf-8881-4453732966e9-slvhh] Binary archive cat_and_dog_features.category.tgz downloaded and extracted successfully
[2021-11-04 20:53:40, FEATURESET_BUILD, category, python-runner-0f47cd8a-115d-4cdf-8881-4453732966e9-slvhh] Installing Python dependencies
[2021-11-04 20:53:42, FEATURESET_BUILD, category, python-runner-0f47cd8a-115d-4cdf-8881-4453732966e9-slvhh] Collecting numpy==1.20.3
[2021-11-04 20:53:42, FEATURESET_BUILD, category, python-runner-0f47cd8a-115d-4cdf-8881-4453732966e9-slvhh] Downloading numpy-1.20.3-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (15.4 MB)
[2021-11-04 20:53:43, FEATURESET_BUILD, category, python-runner-0f47cd8a-115d-4cdf-8881-4453732966e9-slvhh] Installing collected packages: numpy
[2021-11-04 20:53:43, FEATURESET_BUILD, category, python-runner-0f47cd8a-115d-4cdf-8881-4453732966e9-slvhh] Attempting uninstall: numpy
[2021-11-04 20:53:43, FEATURESET_BUILD, category, python-runner-0f47cd8a-115d-4cdf-8881-4453732966e9-slvhh] Found existing installation: numpy 1.21.3
[2021-11-04 20:53:43, FEATURESET_BUILD, category, python-runner-0f47cd8a-115d-4cdf-8881-4453732966e9-slvhh] Uninstalling numpy-1.21.3:
[2021-11-04 20:53:43, FEATURESET_BUILD, category, python-runner-0f47cd8a-115d-4cdf-8881-4453732966e9-slvhh] Successfully uninstalled numpy-1.21.3
[2021-11-04 20:53:45, FEATURESET_BUILD, category, python-runner-0f47cd8a-115d-4cdf-8881-4453732966e9-slvhh] Successfully installed numpy-1.20.3
[2021-11-04 20:53:45, FEATURESET_BUILD, category, python-runner-0f47cd8a-115d-4cdf-8881-4453732966e9-slvhh] Python dependencies installed successfully
[2021-11-04 20:53:45, FEATURESET_BUILD, category, python-runner-0f47cd8a-115d-4cdf-8881-4453732966e9-slvhh] Importing user code(category.py) from ~/source
[2021-11-04 20:53:45, FEATURESET_BUILD, category, python-runner-0f47cd8a-115d-4cdf-8881-4453732966e9-slvhh] build_feature function imported successfully
[2021-11-04 20:53:45, FEATURESET_BUILD, category, python-runner-0f47cd8a-115d-4cdf-8881-4453732966e9-slvhh] Injecting the dependencies
[2021-11-04 20:53:45, FEATURESET_BUILD, category, python-runner-0f47cd8a-115d-4cdf-8881-4453732966e9-slvhh] Annotations: {‘sdf’: Dataset(name=‘catsdogs’, datasource=DatasourceRef(name=’’, type=<DatasourceType.STORAGE: ‘storage’>, id=UUID(‘94c2415d-d043-4d13-aa94-aaccc55d0b1d’)), description=’’, id=UUID(‘eb62029e-78f2-4a40-a26c-c67d797c4dc6’), version=’’, schema=’{}’, uri=’’, metadata={}, build=DatasetBuild(id=UUID(‘a7d73c27-feac-434d-8783-c01095517b4d’), status=<DatasetBuildStatus.INVALID: 0>, info=’’)), ‘return’: typing.Any}
[2021-11-04 20:53:45, FEATURESET_BUILD, category, python-runner-0f47cd8a-115d-4cdf-8881-4453732966e9-slvhh] Entity dependencies: {‘featuresets’: {}, ‘models’: {}, ‘datasets’: {‘sdf’: Dataset(name=‘catsdogs’, datasource=DatasourceRef(name=’’, type=<DatasourceType.STORAGE: ‘storage’>, id=UUID(‘94c2415d-d043-4d13-aa94-aaccc55d0b1d’)), description=’’, id=UUID(‘eb62029e-78f2-4a40-a26c-c67d797c4dc6’), version=’’, schema=’{}’, uri=’’, metadata={}, build=DatasetBuild(id=UUID(‘a7d73c27-feac-434d-8783-c01095517b4d’), status=<DatasetBuildStatus.INVALID: 0>, info=’’))}, ‘context’: None, ‘train’: None}
[2021-11-04 20:53:45, FEATURESET_BUILD, category, python-runner-0f47cd8a-115d-4cdf-8881-4453732966e9-slvhh] Injecting catsdogs dataset
[2021-11-04 20:53:46, FEATURESET_BUILD, category, python-runner-0f47cd8a-115d-4cdf-8881-4453732966e9-slvhh] Using selector: EpollSelector
[2021-11-04 20:53:46, FEATURESET_BUILD, category, python-runner-0f47cd8a-115d-4cdf-8881-4453732966e9-slvhh] Injected dependencies successfully
[2021-11-04 20:53:46, FEATURESET_BUILD, category, python-runner-0f47cd8a-115d-4cdf-8881-4453732966e9-slvhh] Executing the build_feature
[2021-11-04 20:54:23, FEATURESET_BUILD, category, python-runner-0f47cd8a-115d-4cdf-8881-4453732966e9-slvhh] Executed build_feature successfully
[2021-11-04 20:54:23, FEATURESET_BUILD, category, python-runner-0f47cd8a-115d-4cdf-8881-4453732966e9-slvhh] Storing built features
[2021-11-04 20:54:30, FEATURESET_BUILD, category, python-runner-0f47cd8a-115d-4cdf-8881-4453732966e9-slvhh] Built features stored successfully
[2021-11-04 20:55:17, MODEL_TRAIN, custom_pytorch_loss_function, model-training-9fa73418-6cb8-4f11-80ae-4507505796bf-jcgjf] Starting job.
[2021-11-04 20:55:18, MODEL_TRAIN, custom_pytorch_loss_function, model-training-9fa73418-6cb8-4f11-80ae-4507505796bf-jcgjf] Using selector: EpollSelector
[2021-11-04 20:55:19, MODEL_TRAIN, custom_pytorch_loss_function, model-training-9fa73418-6cb8-4f11-80ae-4507505796bf-jcgjf] Using selector: EpollSelector
[2021-11-04 20:55:19, MODEL_TRAIN, custom_pytorch_loss_function, model-training-9fa73418-6cb8-4f11-80ae-4507505796bf-jcgjf] Successfully logged into https://beta.layer.co
[2021-11-04 20:55:19, MODEL_TRAIN, custom_pytorch_loss_function, model-training-9fa73418-6cb8-4f11-80ae-4507505796bf-jcgjf] Using selector: EpollSelector
[2021-11-04 20:55:19, MODEL_TRAIN, custom_pytorch_loss_function, model-training-9fa73418-6cb8-4f11-80ae-4507505796bf-jcgjf] Creating ~/source dir
[2021-11-04 20:55:19, MODEL_TRAIN, custom_pytorch_loss_function, model-training-9fa73418-6cb8-4f11-80ae-4507505796bf-jcgjf] Place init.py in ~/source
[2021-11-04 20:55:19, MODEL_TRAIN, custom_pytorch_loss_function, model-training-9fa73418-6cb8-4f11-80ae-4507505796bf-jcgjf] Download binary(9ccf7931-18a8-42fc-9553-fad607842abc/29867f72-9328-44c9-9210-bcc763797f1d/custom_pytorch_loss_function.tgz) to temp directory
[2021-11-04 20:55:19, MODEL_TRAIN, custom_pytorch_loss_function, model-training-9fa73418-6cb8-4f11-80ae-4507505796bf-jcgjf] Binary archive custom_pytorch_loss_function.tgz downloaded and extracted successfully
[2021-11-04 20:55:19, MODEL_TRAIN, custom_pytorch_loss_function, model-training-9fa73418-6cb8-4f11-80ae-4507505796bf-jcgjf] Installing python dependencies
[2021-11-04 20:55:20, MODEL_TRAIN, custom_pytorch_loss_function, model-training-9fa73418-6cb8-4f11-80ae-4507505796bf-jcgjf] Collecting numpy==1.20.3
[2021-11-04 20:55:20, MODEL_TRAIN, custom_pytorch_loss_function, model-training-9fa73418-6cb8-4f11-80ae-4507505796bf-jcgjf] Downloading numpy-1.20.3-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (15.4 MB)
[2021-11-04 20:55:21, MODEL_TRAIN, custom_pytorch_loss_function, model-training-9fa73418-6cb8-4f11-80ae-4507505796bf-jcgjf] Collecting pillow==8.2.0
[2021-11-04 20:55:21, MODEL_TRAIN, custom_pytorch_loss_function, model-training-9fa73418-6cb8-4f11-80ae-4507505796bf-jcgjf] Downloading Pillow-8.2.0-cp38-cp38-manylinux1_x86_64.whl (3.0 MB)
[2021-11-04 20:55:21, MODEL_TRAIN, custom_pytorch_loss_function, model-training-9fa73418-6cb8-4f11-80ae-4507505796bf-jcgjf] Collecting torch==1.7.0
[2021-11-04 20:55:21, MODEL_TRAIN, custom_pytorch_loss_function, model-training-9fa73418-6cb8-4f11-80ae-4507505796bf-jcgjf] Downloading torch-1.7.0-cp38-cp38-manylinux1_x86_64.whl (776.8 MB)
[2021-11-04 20:55:45, MODEL_TRAIN, custom_pytorch_loss_function, model-training-9fa73418-6cb8-4f11-80ae-4507505796bf-jcgjf] Collecting pandas==1.3.3
[2021-11-04 20:55:45, MODEL_TRAIN, custom_pytorch_loss_function, model-training-9fa73418-6cb8-4f11-80ae-4507505796bf-jcgjf] Downloading pandas-1.3.3-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (11.5 MB)
[2021-11-04 20:55:46, MODEL_TRAIN, custom_pytorch_loss_function, model-training-9fa73418-6cb8-4f11-80ae-4507505796bf-jcgjf] Collecting torchvision==0.8.2
[2021-11-04 20:55:46, MODEL_TRAIN, custom_pytorch_loss_function, model-training-9fa73418-6cb8-4f11-80ae-4507505796bf-jcgjf] Downloading torchvision-0.8.2-cp38-cp38-manylinux1_x86_64.whl (12.8 MB)
[2021-11-04 20:55:46, MODEL_TRAIN, custom_pytorch_loss_function, model-training-9fa73418-6cb8-4f11-80ae-4507505796bf-jcgjf] Collecting dataclasses
[2021-11-04 20:55:46, MODEL_TRAIN, custom_pytorch_loss_function, model-training-9fa73418-6cb8-4f11-80ae-4507505796bf-jcgjf] Downloading dataclasses-0.6-py3-none-any.whl (14 kB)
[2021-11-04 20:55:46, MODEL_TRAIN, custom_pytorch_loss_function, model-training-9fa73418-6cb8-4f11-80ae-4507505796bf-jcgjf] Requirement already satisfied: typing-extensions in /venv/lib/python3.8/site-packages (from torch==1.7.0->-r /root/source/requirements.txt (line 3)) (3.10.0.2)
[2021-11-04 20:55:46, MODEL_TRAIN, custom_pytorch_loss_function, model-training-9fa73418-6cb8-4f11-80ae-4507505796bf-jcgjf] Requirement already satisfied: future in /venv/lib/python3.8/site-packages (from torch==1.7.0->-r /root/source/requirements.txt (line 3)) (0.18.2)
[2021-11-04 20:55:46, MODEL_TRAIN, custom_pytorch_loss_function, model-training-9fa73418-6cb8-4f11-80ae-4507505796bf-jcgjf] Requirement already satisfied: python-dateutil>=2.7.3 in /venv/lib/python3.8/site-packages (from pandas==1.3.3->-r /root/source/requirements.txt (line 4)) (2.8.2)
[2021-11-04 20:55:46, MODEL_TRAIN, custom_pytorch_loss_function, model-training-9fa73418-6cb8-4f11-80ae-4507505796bf-jcgjf] Requirement already satisfied: pytz>=2017.3 in /venv/lib/python3.8/site-packages (from pandas==1.3.3->-r /root/source/requirements.txt (line 4)) (2021.3)
[2021-11-04 20:55:46, MODEL_TRAIN, custom_pytorch_loss_function, model-training-9fa73418-6cb8-4f11-80ae-4507505796bf-jcgjf] Requirement already satisfied: six>=1.5 in /venv/lib/python3.8/site-packages (from python-dateutil>=2.7.3->pandas==1.3.3->-r /root/source/requirements.txt (line 4)) (1.15.0)
[2021-11-04 20:55:47, MODEL_TRAIN, custom_pytorch_loss_function, model-training-9fa73418-6cb8-4f11-80ae-4507505796bf-jcgjf] Installing collected packages: numpy, pillow, dataclasses, torch, pandas, torchvision
[2021-11-04 20:55:47, MODEL_TRAIN, custom_pytorch_loss_function, model-training-9fa73418-6cb8-4f11-80ae-4507505796bf-jcgjf] Attempting uninstall: numpy
[2021-11-04 20:55:47, MODEL_TRAIN, custom_pytorch_loss_function, model-training-9fa73418-6cb8-4f11-80ae-4507505796bf-jcgjf] Found existing installation: numpy 1.20.2
[2021-11-04 20:55:47, MODEL_TRAIN, custom_pytorch_loss_function, model-training-9fa73418-6cb8-4f11-80ae-4507505796bf-jcgjf] Uninstalling numpy-1.20.2:
[2021-11-04 20:55:47, MODEL_TRAIN, custom_pytorch_loss_function, model-training-9fa73418-6cb8-4f11-80ae-4507505796bf-jcgjf] Successfully uninstalled numpy-1.20.2
[2021-11-04 20:55:49, MODEL_TRAIN, custom_pytorch_loss_function, model-training-9fa73418-6cb8-4f11-80ae-4507505796bf-jcgjf] Attempting uninstall: pillow
[2021-11-04 20:55:49, MODEL_TRAIN, custom_pytorch_loss_function, model-training-9fa73418-6cb8-4f11-80ae-4507505796bf-jcgjf] Found existing installation: Pillow 8.4.0
[2021-11-04 20:55:49, MODEL_TRAIN, custom_pytorch_loss_function, model-training-9fa73418-6cb8-4f11-80ae-4507505796bf-jcgjf] Uninstalling Pillow-8.4.0:
[2021-11-04 20:55:49, MODEL_TRAIN, custom_pytorch_loss_function, model-training-9fa73418-6cb8-4f11-80ae-4507505796bf-jcgjf] Successfully uninstalled Pillow-8.4.0
[2021-11-04 20:55:50, MODEL_TRAIN, custom_pytorch_loss_function, model-training-9fa73418-6cb8-4f11-80ae-4507505796bf-jcgjf] Attempting uninstall: torch
[2021-11-04 20:55:50, MODEL_TRAIN, custom_pytorch_loss_function, model-training-9fa73418-6cb8-4f11-80ae-4507505796bf-jcgjf] Found existing installation: torch 1.7.1
[2021-11-04 20:55:50, MODEL_TRAIN, custom_pytorch_loss_function, model-training-9fa73418-6cb8-4f11-80ae-4507505796bf-jcgjf] Uninstalling torch-1.7.1:
[2021-11-04 20:55:53, MODEL_TRAIN, custom_pytorch_loss_function, model-training-9fa73418-6cb8-4f11-80ae-4507505796bf-jcgjf] Successfully uninstalled torch-1.7.1
[2021-11-04 20:56:24, MODEL_TRAIN, custom_pytorch_loss_function, model-training-9fa73418-6cb8-4f11-80ae-4507505796bf-jcgjf] Attempting uninstall: pandas
[2021-11-04 20:56:24, MODEL_TRAIN, custom_pytorch_loss_function, model-training-9fa73418-6cb8-4f11-80ae-4507505796bf-jcgjf] Found existing installation: pandas 1.2.3
[2021-11-04 20:56:25, MODEL_TRAIN, custom_pytorch_loss_function, model-training-9fa73418-6cb8-4f11-80ae-4507505796bf-jcgjf] Uninstalling pandas-1.2.3:
[2021-11-04 20:56:25, MODEL_TRAIN, custom_pytorch_loss_function, model-training-9fa73418-6cb8-4f11-80ae-4507505796bf-jcgjf] Successfully uninstalled pandas-1.2.3
[2021-11-04 20:56:29, MODEL_TRAIN, custom_pytorch_loss_function, model-training-9fa73418-6cb8-4f11-80ae-4507505796bf-jcgjf] Successfully installed dataclasses-0.6 numpy-1.20.3 pandas-1.3.3 pillow-8.2.0 torch-1.7.0 torchvision-0.8.2
[2021-11-04 20:56:30, MODEL_TRAIN, custom_pytorch_loss_function, model-training-9fa73418-6cb8-4f11-80ae-4507505796bf-jcgjf] Python dependencies installed successfully
[2021-11-04 20:56:30, MODEL_TRAIN, custom_pytorch_loss_function, model-training-9fa73418-6cb8-4f11-80ae-4507505796bf-jcgjf] Importing user code(model.py) from /source
[2021-11-04 20:56:30, MODEL_TRAIN, custom_pytorch_loss_function, model-training-9fa73418-6cb8-4f11-80ae-4507505796bf-jcgjf] train_model function imported successfully
[2021-11-04 20:56:31, MODEL_TRAIN, custom_pytorch_loss_function, model-training-9fa73418-6cb8-4f11-80ae-4507505796bf-jcgjf] Injecting the dependencies
[2021-11-04 20:56:31, MODEL_TRAIN, custom_pytorch_loss_function, model-training-9fa73418-6cb8-4f11-80ae-4507505796bf-jcgjf] Annotations: {‘train’: <class ‘layer.client.Train’>, ‘ds’: Dataset(name=‘catsdogs’, datasource=DatasourceRef(name=’’, type=<DatasourceType.STORAGE: ‘storage’>, id=UUID(‘ae38d12a-0bdb-4b69-9242-de1d3f3368f9’)), description=’’, id=UUID(‘71043bc0-a788-4c47-9272-c9923451ed1e’), version=’’, schema=’{}’, uri=’’, metadata={}, build=DatasetBuild(id=UUID(‘69c7be9a-b559-4dec-9964-3d8658a7f55f’), status=<DatasetBuildStatus.INVALID: 0>, info=’’)), ‘pf’: Featureset(name=‘cat_and_dog_features’, datasource=DatasourceRef(name=’’, type=<DatasourceType.STORAGE: ‘storage’>, id=UUID(‘ae38d12a-0bdb-4b69-9242-de1d3f3368f9’)), description=’’, id=UUID(‘f97ed49b-ebc2-471c-a444-66fb6ce10094’), features=, feature_names=, dependencies=), ‘return’: typing.Any}
[2021-11-04 20:56:31, MODEL_TRAIN, custom_pytorch_loss_function, model-training-9fa73418-6cb8-4f11-80ae-4507505796bf-jcgjf] Entity dependencies: {‘featuresets’: {‘pf’: Featureset(name=‘cat_and_dog_features’, datasource=DatasourceRef(name=’’, type=<DatasourceType.STORAGE: ‘storage’>, id=UUID(‘ae38d12a-0bdb-4b69-9242-de1d3f3368f9’)), description=’’, id=UUID(‘f97ed49b-ebc2-471c-a444-66fb6ce10094’), features=, feature_names=, dependencies=)}, ‘models’: {}, ‘datasets’: {‘ds’: Dataset(name=‘catsdogs’, datasource=DatasourceRef(name=’’, type=<DatasourceType.STORAGE: ‘storage’>, id=UUID(‘ae38d12a-0bdb-4b69-9242-de1d3f3368f9’)), description=’’, id=UUID(‘71043bc0-a788-4c47-9272-c9923451ed1e’), version=’’, schema=’{}’, uri=’’, metadata={}, build=DatasetBuild(id=UUID(‘69c7be9a-b559-4dec-9964-3d8658a7f55f’), status=<DatasetBuildStatus.INVALID: 0>, info=’’))}, ‘context’: None, ‘train’: ‘train’}
[2021-11-04 20:56:31, MODEL_TRAIN, custom_pytorch_loss_function, model-training-9fa73418-6cb8-4f11-80ae-4507505796bf-jcgjf] Injecting cat_and_dog_features featureset with individual features
[2021-11-04 20:56:31, MODEL_TRAIN, custom_pytorch_loss_function, model-training-9fa73418-6cb8-4f11-80ae-4507505796bf-jcgjf] Using selector: EpollSelector
[2021-11-04 20:56:31, MODEL_TRAIN, custom_pytorch_loss_function, model-training-9fa73418-6cb8-4f11-80ae-4507505796bf-jcgjf] Injecting catsdogs dataset
[2021-11-04 20:56:31, MODEL_TRAIN, custom_pytorch_loss_function, model-training-9fa73418-6cb8-4f11-80ae-4507505796bf-jcgjf] Using selector: EpollSelector
[2021-11-04 20:56:31, MODEL_TRAIN, custom_pytorch_loss_function, model-training-9fa73418-6cb8-4f11-80ae-4507505796bf-jcgjf] Injected dependencies successfully: {‘pf’: Featureset(name=‘cat_and_dog_features’, datasource=DatasourceRef(name=’’, type=<DatasourceType.STORAGE: ‘storage’>, id=UUID(‘ae38d12a-0bdb-4b69-9242-de1d3f3368f9’)), description=’’, id=UUID(‘a2eade29-238c-4017-9213-b3575c5d136c’), features=, feature_names=, dependencies=), ‘ds’: Dataset(name=‘catsdogs’, datasource=DatasourceRef(name=’’, type=<DatasourceType.STORAGE: ‘storage’>, id=UUID(‘ae38d12a-0bdb-4b69-9242-de1d3f3368f9’)), description=’’, id=UUID(‘5a64d0a6-166d-43e8-8d09-bb8a14b39a38’), version=’’, schema=’{}’, uri=’’, metadata={}, build=DatasetBuild(id=UUID(‘f862959d-1a00-4856-a19c-fe9bcb2781ad’), status=<DatasetBuildStatus.INVALID: 0>, info=’’)), ‘train’: <layer.train.Train object at 0x7efd70388910>}
[2021-11-04 20:56:31, MODEL_TRAIN, custom_pytorch_loss_function, model-training-9fa73418-6cb8-4f11-80ae-4507505796bf-jcgjf] Executing the train_model
[2021-11-04 20:56:56, MODEL_TRAIN, custom_pytorch_loss_function, model-training-9fa73418-6cb8-4f11-80ae-4507505796bf-jcgjf] [’/venv/bin/python -X faulthandler -m pyruntime.model.train_executor’ exited with 1]
[2021-11-04 20:56:56, MODEL_TRAIN, custom_pytorch_loss_function, model-training-9fa73418-6cb8-4f11-80ae-4507505796bf-jcgjf] Process exit code: 1
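
For readers following along, a log in this format can be scanned for the exit code and for the longest pauses between lines. The following is a minimal Python sketch, not part of the Layer CLI: the regular expression is inferred from the lines shown above, and the function names (`parse`, `slowest_gaps`, `exit_codes`) and the shortened pod name in the sample are made up for illustration.

```python
import re
from datetime import datetime

# Line format inferred from the log above:
# [timestamp, STAGE, entity name, pod name] message
LINE_RE = re.compile(
    r"^\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}), "
    r"(?P<stage>\w+), (?P<name>[^,]+), (?P<pod>[^\]]+)\] (?P<msg>.*)$"
)


def parse(lines):
    """Yield (timestamp, stage, message) for every line that matches the format."""
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:
            ts = datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S")
            yield ts, m["stage"], m["msg"]


def slowest_gaps(lines, top=3):
    """Return the `top` largest pauses as (seconds, stage, message before the pause)."""
    entries = list(parse(lines))
    gaps = [
        ((nxt[0] - cur[0]).total_seconds(), cur[1], cur[2])
        for cur, nxt in zip(entries, entries[1:])
    ]
    return sorted(gaps, key=lambda g: g[0], reverse=True)[:top]


def exit_codes(lines):
    """Collect the integer codes from 'Process exit code: N' messages."""
    return [
        int(msg.rsplit(":", 1)[1])
        for _, _, msg in parse(lines)
        if msg.startswith("Process exit code:")
    ]


# Excerpt of the MODEL_TRAIN stage; the pod name is shortened for readability.
PREFIX = "MODEL_TRAIN, custom_pytorch_loss_function, model-training-jcgjf"
SAMPLE = [
    f"[2021-11-04 20:56:30, {PREFIX}] train_model function imported successfully",
    f"[2021-11-04 20:56:31, {PREFIX}] Injecting the dependencies",
    f"[2021-11-04 20:56:31, {PREFIX}] Executing the train_model",
    f"[2021-11-04 20:56:56, {PREFIX}] ['/venv/bin/python -X faulthandler "
    "-m pyruntime.model.train_executor' exited with 1]",
    f"[2021-11-04 20:56:56, {PREFIX}] Process exit code: 1",
]

print(exit_codes(SAMPLE))           # [1]
print(slowest_gaps(SAMPLE, top=1))  # [(25.0, 'MODEL_TRAIN', 'Executing the train_model')]
```

On this excerpt, the longest pause is the 25 seconds after "Executing the train_model", immediately before the process exits with code 1.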

Hi Henry,

Thank you for getting back to us on this! Unfortunately, we can’t currently identify the issue from the logs because of a bug we are working on.

In the meantime, can you please share your whole project so we can help you debug it?

Hi Henry,

Thank you very much for sharing your project over Slack.

I’ve managed to get past this error by using a larger fabric; please refer to https://docs.beta.layer.co/docs/reference/fabrics#fabrics-for-models for details. I recommend you use the f-medium fabric for your model.

After this change I got the following error:

⠼ 2021-11-05 16:30:48 | model       custom_pytorch_loss_function   ━━━━━━━━━━━━━━━━━━━━━━ ERROR     [96610ms]
                                     RuntimeError('mat1 and mat2 shapes cannot be multiplied (8x36864 and 400x120)')
Aborting...

I believe this is an error in your code.
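
The shapes in the error give a hint about where to look: `8x36864` is a batch of 8 samples with 36864 flattened features each, while `400x120` belongs to a linear layer that expects 400 input features. The thread does not show the model code, so the following is only a sketch under an assumed architecture: a LeNet-style stack (two 5x5 convolutions, each followed by 2x2 max pooling, ending with 16 channels), as in the PyTorch CIFAR-10 tutorial. The helper names and the 204x204 input size are hypothetical; they are chosen because they happen to reproduce both numbers.

```python
def conv_out(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Output side length of a square conv or pooling layer (floor mode)."""
    return (size + 2 * padding - kernel) // stride + 1


def flattened_features(side: int, channels: int = 16) -> int:
    """Feature count after flattening, for the ASSUMED LeNet-style stack.

    conv 5x5 -> maxpool 2x2 -> conv 5x5 -> maxpool 2x2, `channels` output maps.
    """
    side = conv_out(side, kernel=5)            # first convolution
    side = conv_out(side, kernel=2, stride=2)  # first max pool
    side = conv_out(side, kernel=5)            # second convolution
    side = conv_out(side, kernel=2, stride=2)  # second max pool
    return channels * side * side


# 32x32 inputs give the 400 features the first linear layer expects.
print(flattened_features(32))   # 400
# A hypothetical 204x204 input gives the 36864 features seen in the error.
print(flattened_features(204))  # 36864
```

If the model does look like this, either resizing the input images to the size the network was designed for, or changing the first linear layer's `in_features` to match the flattened size, should resolve the mismatch.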

Let me know if this unblocks you or if you have any more questions!

Hi Dimitar,
Can you share the entire modification with me? It’s still taking forever on my side.

This is what model.yaml looks like after the modification:

apiVersion: 1

# Name and description of our model
name: 'custom_pytorch_loss_function'
description: 'Image classification using a Custom Binary Cross entropy Loss function'

training:
  name: custom_pytorch_loss_function
  description: 'Model Training'

  # The source model definition file with a `train_model` method
  entrypoint: model.py
  # File listing the required Python libraries with their correct versions
  environment: requirements.txt
  # Larger fabric for training (see the fabrics reference linked above)
  fabric: "f-medium"

I hope that helps!

It is working now. Thanks for the assist!

Hi Henry,

Glad to hear that the suggested solution worked for you. Please do not hesitate to contact us in the future if you face any problems. I’m closing this ticket in the meantime.

Thanks,

Emin