
[Training Camp] Learning rate scheduler implementation #113

Open

littleotherut wants to merge 8 commits into InfiniTensor:master from littleotherut:lr_scheduler

Conversation

@littleotherut

No description provided.

std::vector<std::shared_ptr<Tensor>> params_;
float learning_rate_ = 0.0f;
float initial_learning_rate_ = 0.0f;
bool initial_lr_set_ = false;
Contributor


This part is somewhat redundant. The optimizer only needs to store learning_rate_, representing the current learning rate; there is no need for extra initial-lr state. Semantically, the initial learning rate can live solely in the lr scheduler (which you are in fact already doing: it is stored as the scheduler's base_lr).

Author



This is to match PyTorch's behavior at scheduler initialization (source link).

When PyTorch initializes a scheduler, it walks the associated optimizer's param groups and calls setdefault to set initial_lr: for an optimizer being attached to a scheduler for the first time, the current learning rate is recorded as initial_lr; for subsequent attachments, the existing value is returned unchanged.

The only use I can currently think of is that when multiple schedulers are attached to the same optimizer (ChainedScheduler, SequentialLR, etc.), their base_lr_ values all equal the optimizer's learning rate at the time the first scheduler was initialized. I am not aware of other scenarios, but I added the parameter to stay consistent with PyTorch. If only ChainedScheduler or SequentialLR are involved, there are indeed alternatives; should I change it?

Contributor


Sure, let's stay aligned with torch and keep the initial_lr member.


std::shared_ptr<Optimizer> optimizer_;
int64_t last_step_;
float current_lr_;
Contributor


current_lr_ also seems somewhat redundant. Semantically, current_lr_ and optimizer_->GetLearningRate() should be equal at all times, yet in the current design the two are stored separately and used interchangeably (after reading through, current_lr_ looks like a copy of optimizer_->GetLearningRate()). Your handling is numerically correct, but this design is likely to create ambiguity for whoever extends it later.

I suggest keeping a single source of truth for the "current learning rate": either track it entirely via optimizer_->GetLearningRate() and drop current_lr_ from the lr scheduler, or track it in the scheduler and set it back into the optimizer after each computation. Personally I think the former is more appropriate.

Author


Fixed. Since the schedulers need to support resuming training, and schedulers such as SequentialLR or ChainedScheduler do not support closed-form computation (the learning rate cannot be derived quickly from base_lr and last_epoch), I kept the field solely for learning-rate recovery and renamed it recover_lr to avoid confusion.

Copilot AI review requested due to automatic review settings March 20, 2026 20:34

Copilot AI left a comment


Pull request overview

This PR introduces a learning-rate scheduler system to infini_train, integrates it with optimizers (including distributed optimizer), and adds standalone C++ test executables plus example CLI wiring to exercise the new schedulers.

Changes:

  • Add LRScheduler base + concrete schedulers (ConstantLR/StepLR/LinearLR/LambdaLR/SequentialLR/ChainedScheduler) and a CreateLRScheduler factory.
  • Extend Optimizer with runtime-settable learning rate and initial learning rate tracking; propagate LR to DistributedOptimizer.
  • Add scheduler coverage tests and wire scheduler flags into example/gpt2 and example/llama3; register new test executables in CMake.

Reviewed changes

Copilot reviewed 18 out of 18 changed files in this pull request and generated 8 comments.

Summary per file:

  • infini_train/include/lr_scheduler.h: Declares scheduler APIs, configs, and concrete scheduler types.
  • infini_train/src/lr_scheduler.cc: Implements scheduler logic, factory creation, state save/load, and sequential/chained behavior.
  • infini_train/include/optimizer.h: Adds LR getters/setters and initial LR tracking to support schedulers.
  • infini_train/src/optimizer.cc: Implements optimizer LR plumbing and updates SGD/Adam to use base LR storage.
  • infini_train/include/nn/parallel/ddp/distributed_optimizer.h: Overrides LR get/set for the distributed optimizer so schedulers affect the real base optimizer.
  • infini_train/src/nn/parallel/ddp/distributed_optimizer.cc: Implements LR propagation to/from the wrapped base optimizer.
  • example/gpt2/main.cc: Adds scheduler CLI flags and steps the scheduler during training.
  • example/llama3/main.cc: Adds scheduler CLI flags and steps the scheduler during training.
  • test/lr_scheduler/test_helpers.h: Shared minimal test helpers/macros for scheduler tests.
  • test/lr_scheduler/test_*.cc: Adds functional, state, and validation tests for schedulers.
  • CMakeLists.txt: Adds new scheduler test executables to the build.


@Chamberlain0w0
Contributor

A few more development-convention changes are needed:

  1. Use English comments; there are currently some Chinese comments under test/, please fix them globally.
  2. Rebase onto master and resolve the conflicts.
  3. The commits are currently numerous and scattered; please squash them into a few main commits organized by feature/change.

kinorw and others added 8 commits April 2, 2026 00:06
…r accessors, passthrough SetLearningRate/GetLearningRate, and add initial_learning_rate and its accessors
…base class, add factory method Create<T>() with two-phase init and update all tests to use Create<T>() factory method.

- Change Step() to virtual with default implementation
- Add pure virtual ComputeLR() for subclasses to implement.
- Adapt test helpers (IdentityScheduler, LinearDecayScheduler) to implement ComputeLR() instead of Step().
- All existing tests pass without behavioral changes.

BREAKING CHANGE: Subclasses must implement ComputeLR() instead of Step().
…closed and chained form, adjust LinearLR, SequentialLR

- enhance LRScheduler with chained and closed form learning rate methods
- adapt methods (Step, InitialStep, GetClosedFormLR, GetChainedFormLR) to match PyTorch's design
- add tests for consistency
- refactor LinearLR: add end_factor, and rename this class
- add SequentialLR InitialStep and UndoChildInitialSteps

BREAKING CHANGE: Subclasses must implement GetClosedFormLR() instead of ComputeLR(). Use LinearLR instead of LinearwarmupLR.
- Add LRSchedulerConfig struct with parameters for all basic schedulers (constant, linear, step)
- Add CreateLRScheduler() factory function
- Support automatic warmup wrapping via SequentialLR when warmup_steps > 0
- Adapt test files
…ogs, and integrate scheduler into training loop
…s, add validation tests for learning rate schedulers

- it is now only used for learning rate recovery when using loadstate

3 participants