Continuing a Web-UI-Configured Training Run in the Background
First, interrupt the web-UI training (important: this avoids conflicts). Go back to the web UI, click the "Interrupt" button, and wait until the log shows that the checkpoint has finished saving. If the page is already stuck in a "fake pause", simply closing the browser or the terminal is also fine (the process may still be running), but interrupting first is recommended. Then switch to the project root, confirm that the yaml and the checkpoint exist, and run training inside screen, as follows.
Author: lh
cd /mnt/workspace/LLaMA-Factory
ls -la /mnt/workspace/train/train_2026-02-08-14-09-58 | grep -i yaml        # should show training_args.yaml
ls -la /mnt/workspace/train/train_2026-02-08-14-09-58 | grep checkpoint     # check for checkpoint-xxx folders

Install screen (if not already installed):
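If the interrupted run produced several checkpoint-xxx folders, a small sketch for picking the newest one (same train directory as above; `sort -V` orders by step number, so checkpoint-90 sorts before checkpoint-100, which plain lexical sort would not):

```shell
# Print the checkpoint folder with the highest step number, if any exist.
TRAIN_DIR=/mnt/workspace/train/train_2026-02-08-14-09-58
ls -d "$TRAIN_DIR"/checkpoint-* 2>/dev/null | sort -V | tail -n 1
```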
apt update && apt install -y screen

Start screen:
screen -S train_0208

Run the CLI command inside screen:
llamafactory-cli train /mnt/workspace/train/train_2026-02-08-14-09-58/training_args.yaml

If you hit the following (known) error:
Example:
[rank1]: Traceback (most recent call last):
[rank1]: File "/mnt/workspace/LLaMA-Factory/src/llamafactory/launcher.py", line 185, in <module>
[rank1]: run_exp()
[rank1]: File "/mnt/workspace/LLaMA-Factory/src/llamafactory/train/tuner.py", line 125, in run_exp
[rank1]: _training_function(config={"args": args, "callbacks": callbacks})
[rank1]: File "/mnt/workspace/LLaMA-Factory/src/llamafactory/train/tuner.py", line 60, in _training_function
[rank1]: model_args, data_args, training_args, finetuning_args, generating_args = get_train_args(args)
[rank1]: ^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/mnt/workspace/LLaMA-Factory/src/llamafactory/hparams/parser.py", line 431, in get_train_args
[rank1]: raise ValueError("Output directory already exists and is not empty. Please set `overwrite_output_dir`.")
[rank1]: ValueError: Output directory already exists and is not empty. Please set `overwrite_output_dir`.
[rank0]: Traceback (most recent call last):
[rank0]: File "/mnt/workspace/LLaMA-Factory/src/llamafactory/launcher.py", line 185, in <module>
[rank0]: run_exp()
[rank0]: File "/mnt/workspace/LLaMA-Factory/src/llamafactory/train/tuner.py", line 125, in run_exp
[rank0]: _training_function(config={"args": args, "callbacks": callbacks})
[rank0]: File "/mnt/workspace/LLaMA-Factory/src/llamafactory/train/tuner.py", line 60, in _training_function
[rank0]: model_args, data_args, training_args, finetuning_args, generating_args = get_train_args(args)
[rank0]: ^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/mnt/workspace/LLaMA-Factory/src/llamafactory/hparams/parser.py", line 431, in get_train_args
[rank0]: raise ValueError("Output directory already exists and is not empty. Please set `overwrite_output_dir`.")
[rank0]: ValueError: Output directory already exists and is not empty. Please set `overwrite_output_dir`.
[rank0]:[W208 16:50:04.273985955 ProcessGroupNCCL.cpp:1553] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
W0208 16:50:05.768000 1468 site-packages/torch/distributed/elastic/multiprocessing/api.py:1010] Sending process 1502 closing signal SIGTERM
E0208 16:50:06.032000 1468 site-packages/torch/distributed/elastic/multiprocessing/api.py:984] failed (exitcode: 1) local_rank: 1 (pid: 1503) of binary: /root/miniconda3/envs/llama-factory/bin/python3.11
Traceback (most recent call last):
File "/root/miniconda3/envs/llama-factory/bin/torchrun", line 7, in <module>
sys.exit(main())
^^^^^^
File "/root/miniconda3/envs/llama-factory/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 362, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/llama-factory/lib/python3.11/site-packages/torch/distributed/run.py", line 991, in main
run(args)
File "/root/miniconda3/envs/llama-factory/lib/python3.11/site-packages/torch/distributed/run.py", line 982, in run
elastic_launch(
File "/root/miniconda3/envs/llama-factory/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 170, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/llama-factory/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 317, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/mnt/workspace/LLaMA-Factory/src/llamafactory/launcher.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2026-02-08_16:50:06
host : dsw-706316-6f594f85db-z7xpn
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 1502)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2026-02-08_16:50:05
host : dsw-706316-6f594f85db-z7xpn
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 1503)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Traceback (most recent call last):
File "/root/miniconda3/envs/llama-factory/bin/llamafactory-cli", line 7, in <module>
sys.exit(main())
^^^^^^
File "/mnt/workspace/LLaMA-Factory/src/llamafactory/cli.py", line 24, in main
launcher.launch()
File "/mnt/workspace/LLaMA-Factory/src/llamafactory/launcher.py", line 115, in launch
process = subprocess.run(
^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/llama-factory/lib/python3.11/subprocess.py", line 571, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['torchrun', '--nnodes', '1', '--node_rank', '0', '--nproc_per_node', '2', '--master_addr', '127.0.0.1', '--master_port', '56879', '/mnt/workspace/LLaMA-Factory/src/llamafactory/launcher.py', '/mnt/workspace/train/train_2026-02-08-14-09-58/training_args.yaml']' returned non-zero exit status 1.
(llama-factory) root@dsw-706316-6f594f85db-z7xpn:/mnt/workspace/LLaMA-Factory#

Solution:
Edit the yaml file and add overwrite_output_dir: true
cd /mnt/workspace/LLaMA-Factory
vim /mnt/workspace/train/train_2026-02-08-14-09-58/training_args.yaml

In the file (usually in the training_args section, or at the end), add or modify:
overwrite_output_dir: true

Save and quit (:wq). Then start a new screen session and rerun training:
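If you would rather not edit the file in vim, the same change can be made non-interactively; a sketch that appends the flag only when it is not already present (idempotent, so running it twice does not duplicate the line):

```shell
# Append overwrite_output_dir: true to the yaml unless it is already set.
YAML=/mnt/workspace/train/train_2026-02-08-14-09-58/training_args.yaml
if [ -f "$YAML" ] && ! grep -q '^overwrite_output_dir:' "$YAML"; then
  echo 'overwrite_output_dir: true' >> "$YAML"
fi
```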
screen -S train_0208_resume
llamafactory-cli train /mnt/workspace/train/train_2026-02-08-14-09-58/training_args.yaml
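Once training is running inside screen, detach with Ctrl-A then d and the job keeps going. A sketch for checking on it later — the screen commands are standard GNU screen; trainer_log.jsonl is my assumption about the per-step log LLaMA-Factory writes into output_dir, so verify the filename for your version:

```shell
# List sessions; detach again later with Ctrl-A, then d.
screen -ls || true            # train_0208 / train_0208_resume should appear
# Reattach (force-detach from any other terminal first):
# screen -rd train_0208_resume

# Follow progress without reattaching: read output_dir from the yaml and
# show the last lines of the per-step log (log filename assumed; verify).
YAML=/mnt/workspace/train/train_2026-02-08-14-09-58/training_args.yaml
OUTPUT_DIR=$(grep '^output_dir:' "$YAML" 2>/dev/null | awk '{print $2}')
{ [ -n "$OUTPUT_DIR" ] && tail -n 5 "$OUTPUT_DIR/trainer_log.jsonl"; } || true
```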