How to continue web-UI-configured training in the background


Author: lh

  • First, interrupt the training in the web UI (important! avoids conflicts)
  • Go back to the web UI, click the "Interrupt" button, and wait until the log shows the checkpoint has finished saving.
  • If the web UI is already stuck in a fake "paused" state, simply closing the browser or the terminal is also fine (the process may still be running), but interrupting first is recommended.
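Before relaunching from the CLI, it can help to confirm no training workers were left behind by the web UI. A small sketch (the process names are assumptions based on how LLaMA-Factory typically launches training):

```shell
# Look for leftover web UI / training processes before relaunching.
# The [t]/[l] bracket trick keeps pgrep from matching this command itself.
pgrep -af "[l]lamafactory" || echo "no llamafactory process found"
pgrep -af "[t]orchrun" || echo "no torchrun workers found"
```

If anything is still running, kill it (or reattach and interrupt it properly) before starting the CLI run, since two jobs would fight over the same GPUs and ports.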
  • Switch to the project root directory
  • cd /mnt/workspace/LLaMA-Factory
  • Confirm the yaml and checkpoints exist
  • ls -la /mnt/workspace/train/train_2026-02-08-14-09-58 | grep -i yaml   # should show training_args.yaml
    ls -la /mnt/workspace/train/train_2026-02-08-14-09-58 | grep checkpoint   # check for checkpoint-xxx folders
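If several checkpoint folders exist, version-aware sorting picks the newest one reliably (`sort -V` orders checkpoint-1000 after checkpoint-900, which plain lexical sort gets wrong):

```shell
# Print the newest checkpoint folder in the run directory, if any.
RUN_DIR=/mnt/workspace/train/train_2026-02-08-14-09-58
latest=$(ls -d "$RUN_DIR"/checkpoint-* 2>/dev/null | sort -V | tail -n 1)
echo "latest checkpoint: ${latest:-<none found>}"
```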
  • Run it in the background with screen (most reliable)
  • Install screen (if not already installed):

    apt update && apt install -y screen

    Start screen:

    screen -S train_0208

    Run the CLI command inside screen:


    llamafactory-cli train /mnt/workspace/train/train_2026-02-08-14-09-58/training_args.yaml
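Screen basics worth knowing at this point: Ctrl-a then d detaches without stopping training, `screen -r train_0208` reattaches, and `screen -ls` lists sessions. To keep a log that survives detaching, piping through `tee` is a common pattern; the sketch below only echoes the command (the log path is my assumption, not from the original setup):

```shell
# Inside the screen session, run training with tee so the log survives detaching:
#   llamafactory-cli train "$YAML" 2>&1 | tee -a "$LOG"
YAML=/mnt/workspace/train/train_2026-02-08-14-09-58/training_args.yaml
LOG=/mnt/workspace/train/train_0208.log
echo "log will be appended to: $LOG"
```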

    If you hit this error (known issue):

    Example:

    [rank1]: Traceback (most recent call last):
    [rank1]: File "/mnt/workspace/LLaMA-Factory/src/llamafactory/launcher.py", line 185, in <module>
    [rank1]: run_exp()
    [rank1]: File "/mnt/workspace/LLaMA-Factory/src/llamafactory/train/tuner.py", line 125, in run_exp
    [rank1]: _training_function(config={"args": args, "callbacks": callbacks})
    [rank1]: File "/mnt/workspace/LLaMA-Factory/src/llamafactory/train/tuner.py", line 60, in _training_function
    [rank1]: model_args, data_args, training_args, finetuning_args, generating_args = get_train_args(args)
    [rank1]: ^^^^^^^^^^^^^^^^^^^^
    [rank1]: File "/mnt/workspace/LLaMA-Factory/src/llamafactory/hparams/parser.py", line 431, in get_train_args
    [rank1]: raise ValueError("Output directory already exists and is not empty. Please set `overwrite_output_dir`.")
    [rank1]: ValueError: Output directory already exists and is not empty. Please set `overwrite_output_dir`.
    [rank0]: Traceback (most recent call last):
    [rank0]: File "/mnt/workspace/LLaMA-Factory/src/llamafactory/launcher.py", line 185, in <module>
    [rank0]: run_exp()
    [rank0]: File "/mnt/workspace/LLaMA-Factory/src/llamafactory/train/tuner.py", line 125, in run_exp
    [rank0]: _training_function(config={"args": args, "callbacks": callbacks})
    [rank0]: File "/mnt/workspace/LLaMA-Factory/src/llamafactory/train/tuner.py", line 60, in _training_function
    [rank0]: model_args, data_args, training_args, finetuning_args, generating_args = get_train_args(args)
    [rank0]: ^^^^^^^^^^^^^^^^^^^^
    [rank0]: File "/mnt/workspace/LLaMA-Factory/src/llamafactory/hparams/parser.py", line 431, in get_train_args
    [rank0]: raise ValueError("Output directory already exists and is not empty. Please set `overwrite_output_dir`.")
    [rank0]: ValueError: Output directory already exists and is not empty. Please set `overwrite_output_dir`.
    [rank0]:[W208 16:50:04.273985955 ProcessGroupNCCL.cpp:1553] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
    W0208 16:50:05.768000 1468 site-packages/torch/distributed/elastic/multiprocessing/api.py:1010] Sending process 1502 closing signal SIGTERM
    E0208 16:50:06.032000 1468 site-packages/torch/distributed/elastic/multiprocessing/api.py:984] failed (exitcode: 1) local_rank: 1 (pid: 1503) of binary: /root/miniconda3/envs/llama-factory/bin/python3.11
    Traceback (most recent call last):
      File "/root/miniconda3/envs/llama-factory/bin/torchrun", line 7, in <module>
        sys.exit(main())
                 ^^^^^^
      File "/root/miniconda3/envs/llama-factory/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 362, in wrapper
        return f(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^
      File "/root/miniconda3/envs/llama-factory/lib/python3.11/site-packages/torch/distributed/run.py", line 991, in main
        run(args)
      File "/root/miniconda3/envs/llama-factory/lib/python3.11/site-packages/torch/distributed/run.py", line 982, in run
        elastic_launch(
      File "/root/miniconda3/envs/llama-factory/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 170, in __call__
        return launch_agent(self._config, self._entrypoint, list(args))
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/root/miniconda3/envs/llama-factory/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 317, in launch_agent
        raise ChildFailedError(
    torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
    ============================================================
    /mnt/workspace/LLaMA-Factory/src/llamafactory/launcher.py FAILED
    ------------------------------------------------------------
    Failures:
    [1]:
      time : 2026-02-08_16:50:06
      host : dsw-706316-6f594f85db-z7xpn
      rank : 0 (local_rank: 0)
      exitcode : 1 (pid: 1502)
      error_file: <N/A>
      traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
    ------------------------------------------------------------
    Root Cause (first observed failure):
    [0]:
      time : 2026-02-08_16:50:05
      host : dsw-706316-6f594f85db-z7xpn
      rank : 1 (local_rank: 1)
      exitcode : 1 (pid: 1503)
      error_file: <N/A>
      traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
    ============================================================
    Traceback (most recent call last):
      File "/root/miniconda3/envs/llama-factory/bin/llamafactory-cli", line 7, in <module>
        sys.exit(main())
                 ^^^^^^
      File "/mnt/workspace/LLaMA-Factory/src/llamafactory/cli.py", line 24, in main
        launcher.launch()
      File "/mnt/workspace/LLaMA-Factory/src/llamafactory/launcher.py", line 115, in launch
        process = subprocess.run(
                  ^^^^^^^^^^^^^^^
      File "/root/miniconda3/envs/llama-factory/lib/python3.11/subprocess.py", line 571, in run
        raise CalledProcessError(retcode, process.args,
    subprocess.CalledProcessError: Command '['torchrun', '--nnodes', '1', '--node_rank', '0', '--nproc_per_node', '2', '--master_addr', '127.0.0.1', '--master_port', '56879', '/mnt/workspace/LLaMA-Factory/src/llamafactory/launcher.py', '/mnt/workspace/train/train_2026-02-08-14-09-58/training_args.yaml']' returned non-zero exit status 1.
    (llama-factory) root@dsw-706316-6f594f85db-z7xpn:/mnt/workspace/LLaMA-Factory#

    Fix:

    Edit the yaml file and add overwrite_output_dir: true

    cd /mnt/workspace/LLaMA-Factory
    vim /mnt/workspace/train/train_2026-02-08-14-09-58/training_args.yaml

    In the file (usually in the training_args section or at the end), add or modify:

    overwrite_output_dir: true

    Save and quit (:wq).
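The same edit can be made from the shell without vim; this sketch is idempotent and just prints a message if the yaml path does not exist. One hedged caveat: in Transformers, `overwrite_output_dir` only suppresses the "directory exists" check — whether training truly resumes from `checkpoint-xxx` rather than restarting depends on `resume_from_checkpoint` also being set in the yaml, so check yours before relaunching.

```shell
# Ensure overwrite_output_dir: true is present exactly once in the run's yaml.
YAML=/mnt/workspace/train/train_2026-02-08-14-09-58/training_args.yaml
if [ -f "$YAML" ]; then
  if grep -q '^overwrite_output_dir:' "$YAML"; then
    # Key already there: rewrite its value in place.
    sed -i 's/^overwrite_output_dir:.*/overwrite_output_dir: true/' "$YAML"
  else
    # Key missing: append it at the end of the file.
    printf 'overwrite_output_dir: true\n' >> "$YAML"
  fi
  grep -n '^overwrite_output_dir' "$YAML"
else
  echo "yaml not found: $YAML"
fi
```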

    Then relaunch in a new screen session and rerun training:

    screen -S train_0208_resume
    llamafactory-cli train /mnt/workspace/train/train_2026-02-08-14-09-58/training_args.yaml
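Once it is running detached, progress can be checked without reattaching: LLaMA-Factory writes a `trainer_log.jsonl` into the run's `output_dir`. The path below assumes `output_dir` matches the directory used throughout this guide — substitute the `output_dir` from your yaml if it differs:

```shell
# Peek at the last few logged training steps (loss, lr, epoch) from outside screen.
OUT_DIR=/mnt/workspace/train/train_2026-02-08-14-09-58
tail -n 5 "$OUT_DIR/trainer_log.jsonl" 2>/dev/null || echo "no trainer_log.jsonl yet"
```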