RL Algorithms¶

Common¶

class rl4co.models.rl.common.base.RL4COLitModule(env, policy, batch_size=512, val_batch_size=None, test_batch_size=None, train_data_size=1280000, val_data_size=10000, test_data_size=10000, optimizer='Adam', optimizer_kwargs={'lr': 0.0001}, lr_scheduler=None, lr_scheduler_kwargs={'gamma': 0.1, 'milestones': [80, 95]}, lr_scheduler_interval='epoch', lr_scheduler_monitor='val/reward', generate_data=True, shuffle_train_dataloader=True, dataloader_num_workers=0, data_dir='data/', log_on_step=True, metrics={}, **litmodule_kwargs)[source]¶

Bases: LightningModule

Base class for Lightning modules for RL4CO. This defines the general training loop in terms of RL algorithms. Subclasses should mainly implement shared_step, which defines the loss function and optimization routine specific to each algorithm.

Parameters:

env¶ (RL4COEnvBase) – RL4CO environment
policy¶ (Module) – policy network (actor)
batch_size¶ (int) – batch size (general one, default used for training)
val_batch_size¶ (Optional[int]) – specific batch size for validation
test_batch_size¶ (Optional[int]) – specific batch size for testing
train_data_size¶ (int) – size of training dataset for one epoch
val_data_size¶ (int) – size of validation dataset for one epoch
test_data_size¶ (int) – size of testing dataset for one epoch
optimizer¶ (Union[str, Optimizer, partial]) – optimizer or optimizer name
optimizer_kwargs¶ (dict) – optimizer kwargs
lr_scheduler¶ (Union[str, LRScheduler, partial, None]) – learning rate scheduler or learning rate scheduler name
lr_scheduler_kwargs¶ (dict) – learning rate scheduler kwargs
lr_scheduler_interval¶ (str) – learning rate scheduler interval
lr_scheduler_monitor¶ (str) – learning rate scheduler monitor
generate_data¶ (bool) – whether to generate data
shuffle_train_dataloader¶ (bool) – whether to shuffle training dataloader
dataloader_num_workers¶ (int) – number of workers for dataloader
data_dir¶ (str) – data directory
metrics¶ (dict) – metrics
litmodule_kwargs¶ – kwargs for LightningModule
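
As a rough illustration of the defaults above: lr_scheduler is None by default, so no schedule is applied unless one is chosen. The sketch below assumes the default lr_scheduler_kwargs are intended for a MultiStepLR-style schedule (an assumption, not stated here), in which the learning rate is multiplied by gamma each time a milestone epoch is reached. The helper names multistep_lr and batches_per_epoch are hypothetical.

```python
import math
from bisect import bisect_right


def multistep_lr(base_lr, epoch, milestones, gamma):
    """Learning rate at `epoch` under an assumed MultiStepLR-style schedule."""
    # Count how many milestones have already been reached at this epoch
    n_decays = bisect_right(sorted(milestones), epoch)
    return base_lr * gamma ** n_decays


def batches_per_epoch(data_size, batch_size):
    """Number of batches needed to cover `data_size` samples once."""
    return math.ceil(data_size / batch_size)


# Default values taken from the class signature
lr_by_epoch = {e: multistep_lr(1e-4, e, [80, 95], 0.1) for e in (0, 79, 80, 95)}
n_train_batches = batches_per_epoch(1_280_000, 512)
```

With these defaults, one training epoch covers 1,280,000 instances in batches of 512.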

configure_optimizers(parameters=None)[source]¶

Parameters:: parameters¶ – parameters to be optimized. If None, will use `self.policy.parameters()`

forward(td, **kwargs)[source]¶: Forward pass for the model. Simple wrapper around policy. Uses env from the module if not provided.

instantiate_metrics(metrics)[source]¶: Dictionary of metrics to be logged at each phase

log_metrics(metric_dict, phase)[source]¶: Log metrics to logger and progress bar

on_train_epoch_end()[source]¶: Called at the end of the training epoch. This can be used for instance to update the train dataset with new data (which is the case in RL).

post_setup_hook()[source]¶: Hook to be called after setup. Can be used to set up subclasses without overriding setup

setup(stage='fit')[source]¶: Base LightningModule setup method. This will setup the datasets and dataloaders

Note

We also send to the loggers all hyperparams that are not nn.Module (i.e. the policy). Apparently PyTorch Lightning does not do this by default.

setup_loggers()[source]¶: Log all hyperparameters except those in nn.Module

shared_step(batch, batch_idx, phase)[source]¶: Shared step between train/val/test. To be implemented in subclass
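
A minimal sketch of the pattern this implies, assuming each of the per-phase step methods simply forwards to shared_step with the matching phase string. SharedStepSketch is a hypothetical stand-in written in plain Python, not the RL4CO implementation.

```python
class SharedStepSketch:
    """Toy stand-in for a Lightning module (illustrative only)."""

    def __init__(self):
        self.calls = []  # records (phase, batch_idx) for inspection

    def shared_step(self, batch, batch_idx, phase):
        # A subclass would compute the loss and metrics for `phase` here
        self.calls.append((phase, batch_idx))
        return {"loss": None, "phase": phase}

    # Assumed dispatch: every phase reuses the same step logic
    def training_step(self, batch, batch_idx):
        return self.shared_step(batch, batch_idx, phase="train")

    def validation_step(self, batch, batch_idx):
        return self.shared_step(batch, batch_idx, phase="val")

    def test_step(self, batch, batch_idx):
        return self.shared_step(batch, batch_idx, phase="test")


module = SharedStepSketch()
module.training_step(batch=[1, 2], batch_idx=0)
module.validation_step(batch=[3], batch_idx=0)
module.test_step(batch=[4], batch_idx=1)
```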

test_dataloader()[source]¶

An iterable or collection of iterables specifying test samples.

For more information about multiple dataloaders, see the PyTorch Lightning documentation on multiple dataloaders.

For data processing use the following pattern:

download in prepare_data()

process and split in setup()

However, the above are only necessary for distributed processing.

Warning

do not assign state in prepare_data

test()
prepare_data()
setup()

Note

Lightning tries to add the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.

Note

If you don’t need a test dataset and a test_step(), you don’t need to implement this method.

test_step(batch, batch_idx)[source]¶

Operates on a single batch of data from the test set. In this step you’d normally generate examples or calculate anything of interest such as accuracy.

Parameters:

batch¶ (Any) – The output of your DataLoader.
batch_idx¶ (int) – The index of this batch.
dataloader_idx¶ – The index of the dataloader that produced this batch. (only if multiple test dataloaders used).

Returns:

Any of:

Any object or value

None - Testing will skip to the next batch

# if you have one test dataloader:
def test_step(self, batch, batch_idx):
    ...


# if you have multiple test dataloaders:
def test_step(self, batch, batch_idx, dataloader_idx=0):
    ...

Examples:

# CASE 1: A single test dataset
def test_step(self, batch, batch_idx):
    x, y = batch

    # implement your own
    out = self(x)
    loss = self.loss(out, y)

    # log 6 example images
    # or generated text... or whatever
    sample_imgs = x[:6]
    grid = torchvision.utils.make_grid(sample_imgs)
    self.logger.experiment.add_image('example_images', grid, 0)

    # calculate acc
    labels_hat = torch.argmax(out, dim=1)
    test_acc = torch.sum(y == labels_hat).item() / (len(y) * 1.0)

    # log the outputs!
    self.log_dict({'test_loss': loss, 'test_acc': test_acc})

If you pass in multiple test dataloaders, test_step() will have an additional argument. We recommend setting the default value of 0 so that you can quickly switch between single and multiple dataloaders.

# CASE 2: multiple test dataloaders
def test_step(self, batch, batch_idx, dataloader_idx=0):
    # dataloader_idx tells you which dataset this is.
    ...

Note

If you don’t need to test you don’t need to implement this method.

Note

When the test_step() is called, the model has been put in eval mode and PyTorch gradients have been disabled. At the end of the test epoch, the model goes back to training mode and gradients are enabled.

train_dataloader()[source]¶

An iterable or collection of iterables specifying training samples.

For more information about multiple dataloaders, see the PyTorch Lightning documentation on multiple dataloaders.

The dataloader you return will not be reloaded unless you set reload_dataloaders_every_n_epochs to a positive integer.

For data processing use the following pattern:

download in prepare_data()

process and split in setup()

However, the above are only necessary for distributed processing.

Warning

do not assign state in prepare_data

fit()
prepare_data()
setup()

Note

Lightning tries to add the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.

training_step(batch, batch_idx)[source]¶

Here you compute and return the training loss and some additional metrics, e.g. for the progress bar or logger.

Parameters:

batch¶ (Tensor | (Tensor, …) | [Tensor, …]) – The output of your DataLoader. A tensor, tuple or list.
batch_idx¶ (int) – Integer displaying index of this batch

Returns:

Any of:

Tensor - The loss tensor
dict - A dictionary. Can include any keys, but must include the key 'loss'
None - Training will skip to the next batch. This is only for automatic optimization.
This is not supported for multi-GPU, TPU, IPU, or DeepSpeed.

In this step you’d normally do the forward pass and calculate the loss for a batch. You can also do fancier things like multiple forward passes or something model specific.

Example:

def training_step(self, batch, batch_idx):
    x, y, z = batch
    out = self.encoder(x)
    loss = self.loss(out, x)
    return loss

To use multiple optimizers, you can switch to ‘manual optimization’ and control their stepping:

def __init__(self):
    super().__init__()
    self.automatic_optimization = False


# Multiple optimizers (e.g.: GANs)
def training_step(self, batch, batch_idx):
    opt1, opt2 = self.optimizers()

    # do training_step with encoder
    ...
    opt1.step()
    # do training_step with decoder
    ...
    opt2.step()

Note

When accumulate_grad_batches > 1, the loss returned here will be automatically normalized by accumulate_grad_batches internally.
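
The normalization can be checked arithmetically: dividing each loss by accumulate_grad_batches before accumulating makes the accumulated quantity equal to the mean loss over those batches. The sketch below uses plain floats as stand-ins for loss tensors; accumulated_loss is a hypothetical helper, not a Lightning function.

```python
def accumulated_loss(batch_losses, accumulate_grad_batches):
    """Sum of per-batch losses after each is divided by the accumulation count."""
    assert len(batch_losses) == accumulate_grad_batches
    return sum(loss / accumulate_grad_batches for loss in batch_losses)


losses = [2.0, 4.0, 6.0, 8.0]
effective = accumulated_loss(losses, accumulate_grad_batches=4)
mean_loss = sum(losses) / len(losses)
```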

val_dataloader()[source]¶

An iterable or collection of iterables specifying validation samples.

For more information about multiple dataloaders, see the PyTorch Lightning documentation on multiple dataloaders.

The dataloader you return will not be reloaded unless you set reload_dataloaders_every_n_epochs to a positive integer.

It’s recommended that all data downloads and preparation happen in prepare_data().

fit()
validate()
prepare_data()
setup()

Note

Lightning tries to add the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.

Note

If you don’t need a validation dataset and a validation_step(), you don’t need to implement this method.

validation_step(batch, batch_idx)[source]¶

Operates on a single batch of data from the validation set. In this step you might generate examples or calculate anything of interest like accuracy.

Parameters:

batch¶ (Any) – The output of your DataLoader.
batch_idx¶ (int) – The index of this batch.
dataloader_idx¶ – The index of the dataloader that produced this batch. (only if multiple val dataloaders used)

Returns:

Any object or value
None - Validation will skip to the next batch

# if you have one val dataloader:
def validation_step(self, batch, batch_idx):
    ...


# if you have multiple val dataloaders:
def validation_step(self, batch, batch_idx, dataloader_idx=0):
    ...

Examples:

# CASE 1: A single validation dataset
def validation_step(self, batch, batch_idx):
    x, y = batch

    # implement your own
    out = self(x)
    loss = self.loss(out, y)

    # log 6 example images
    # or generated text... or whatever
    sample_imgs = x[:6]
    grid = torchvision.utils.make_grid(sample_imgs)
    self.logger.experiment.add_image('example_images', grid, 0)

    # calculate acc
    labels_hat = torch.argmax(out, dim=1)
    val_acc = torch.sum(y == labels_hat).item() / (len(y) * 1.0)

    # log the outputs!
    self.log_dict({'val_loss': loss, 'val_acc': val_acc})

If you pass in multiple val dataloaders, validation_step() will have an additional argument. We recommend setting the default value of 0 so that you can quickly switch between single and multiple dataloaders.

# CASE 2: multiple validation dataloaders
def validation_step(self, batch, batch_idx, dataloader_idx=0):
    # dataloader_idx tells you which dataset this is.
    ...

Note

If you don’t need to validate you don’t need to implement this method.

Note

When the validation_step() is called, the model has been put in eval mode and PyTorch gradients have been disabled. At the end of validation, the model goes back to training mode and gradients are enabled.

wrap_dataset(dataset)[source]¶: Wrap dataset with policy-specific wrapper. This is useful e.g. in REINFORCE, where we need to collect the greedy rollout baseline outputs.

class rl4co.models.rl.common.critic.CriticNetwork(env_name=None, encoder=None, embedding_dim=128, hidden_dim=512, num_layers=3, num_heads=8, normalization='batch', force_flash_attn=False, **unused_kwargs)[source]¶

Bases: Module

We make the critic network compatible with any problem by using an encoder together with the environment init embedding. It is refactored from the critic of Kool et al. (2019), which only worked for TSP.

Parameters:

env_name¶ (Optional[str]) – environment name to solve
encoder¶ (Optional[Module]) – Encoder to use for the critic
embedding_dim¶ (int) – Dimension of the embeddings
hidden_dim¶ (int) – Hidden dimension for the feed-forward network
num_layers¶ (int) – Number of layers for the encoder
num_heads¶ (int) – Number of heads for the attention
normalization¶ (str) – Normalization to use for the attention
force_flash_attn¶ (bool) – Whether to force the use of flash attention. If True, cast to fp16

Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(x)[source]¶

Forward pass of the critic network: encode the input in embedding space and return the value

Parameters:: x¶ (Union[Tensor, TensorDict]) – Input containing the environment state. Can be a Tensor or a TensorDict
Return type:: Tensor
Returns:: Value of the input state

PPO¶

class rl4co.models.rl.ppo.ppo.PPO(env, policy, critic, clip_range=0.2, ppo_epochs=2, mini_batch_size=0.25, vf_lambda=0.5, entropy_lambda=0.0, normalize_adv=False, max_grad_norm=0.5, metrics={'train': ['loss', 'surrogate_loss', 'value_loss', 'entropy']}, **kwargs)[source]¶

Bases: RL4COLitModule

An implementation of the Proximal Policy Optimization (PPO) algorithm (https://arxiv.org/abs/1707.06347) is presented with modifications for autoregressive decoding schemes.
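
For orientation, the clipped surrogate objective from the PPO paper can be sketched for a single sample as below. Plain floats stand in for tensors, and new_logp and old_logp denote log-likelihoods under the current and the data-collecting policy. How the three logged terms (surrogate_loss, value_loss, entropy) are combined in ppo_loss is an assumption inferred from the parameter names vf_lambda and entropy_lambda, not a statement about the actual implementation.

```python
import math


def clipped_surrogate(new_logp, old_logp, advantage, clip_range=0.2):
    """PPO clipped objective for one sample, returned as a loss (negated)."""
    ratio = math.exp(new_logp - old_logp)
    clipped = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    # Take the pessimistic (smaller) of the two objectives, then negate
    return -min(ratio * advantage, clipped * advantage)


def ppo_loss(surrogate_loss, value_loss, entropy, vf_lambda=0.5, entropy_lambda=0.0):
    """Assumed combination of the three terms into one scalar loss."""
    return surrogate_loss + vf_lambda * value_loss - entropy_lambda * entropy


# A ratio of 1.5 is clipped to 1.2 when the advantage is positive
s = clipped_surrogate(new_logp=math.log(1.5), old_logp=0.0, advantage=2.0)
total = ppo_loss(s, value_loss=1.0, entropy=0.3)
```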

In contrast to the original PPO algorithm, this implementation does not consider autoregressive decoding steps as part of the MDP transition. While many Neural Combinatorial Optimization (NCO) studies model decoding steps as transitions in a solution-construction MDP, we treat autoregressive solution construction as an algorithmic choice for tractable CO solution generation. This choice aligns with the Attention Model (AM) (https://openreview.net/forum?id=ByxBFsRqYm), which treats decoding steps as a single-step MDP in Equation 9.

Modeling autoregressive decoding steps as a single-step MDP introduces significant changes to the PPO implementation, including:

- Generalized Advantage Estimation (GAE) (https://arxiv.org/abs/1506.02438) is not applicable since we are dealing with a single-step MDP.
- The definition of policy entropy can differ from the commonly implemented manner.

The commonly implemented definition of policy entropy is the entropy of the policy distribution, given by:

\[H(\pi(x_t)) = - \sum_{a_t \in A_t} \pi(a_t|x_t) \log \pi(a_t|x_t) \]

where \(x_t\) represents the given state at step \(t\), \(A_t\) is the set of all (admissible) actions at step \(t\), and \(a_t\) is the action taken at step \(t\).

If we interpret autoregressive decoding steps as transition steps of an MDP, the entropy for the entire decoding process can be defined as the sum of entropies for each decoding step:

\[H(\pi) = \sum_t H(\pi(x_t)) \]

However, if we consider autoregressive decoding steps as an algorithmic choice, the entropy for the entire decoding process is defined as:

\[H(\pi) = - \sum_{a \in A} \pi(a|x) \log \pi(a|x) \]

where \(x\) represents the given CO problem instance, and \(A\) is the set of all feasible solutions.

Due to the intractability of computing the entropy of the policy distribution over all feasible solutions, we approximate it by computing the entropy over solutions generated by the policy itself. This approximation serves as a proxy for the second definition of entropy, utilizing Monte Carlo sampling.
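
The Monte Carlo approximation can be illustrated on a toy problem small enough to enumerate. In the sketch below, the two-step policy and its probabilities are hypothetical; the log-likelihood of a full solution is the sum of its per-step log-probabilities, and the entropy estimate is the average of \(-\log \pi(a|x)\) over solutions sampled from the policy. This is an illustration of the idea, not the library's code.

```python
import math
import random

# Toy autoregressive policy over two-step solutions (hypothetical numbers)
FIRST = {0: 0.7, 1: 0.3}
SECOND = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.9, 1: 0.1}}


def sample_solution(rng):
    """Sample a full solution and return it with its log-likelihood."""
    a1 = rng.choices(list(FIRST), weights=list(FIRST.values()))[0]
    a2 = rng.choices(list(SECOND[a1]), weights=list(SECOND[a1].values()))[0]
    # log pi(a|x) is the sum of the per-step log-probabilities
    logp = math.log(FIRST[a1]) + math.log(SECOND[a1][a2])
    return (a1, a2), logp


def exact_entropy():
    """Entropy over all solutions; tractable only because the toy is tiny."""
    probs = [FIRST[a1] * SECOND[a1][a2] for a1 in FIRST for a2 in SECOND[a1]]
    return -sum(p * math.log(p) for p in probs)


def mc_entropy(n_samples, seed=0):
    """Monte Carlo estimate: average of -log pi(a|x) over sampled solutions."""
    rng = random.Random(seed)
    return -sum(sample_solution(rng)[1] for _ in range(n_samples)) / n_samples


h_exact = exact_entropy()
h_mc = mc_entropy(20_000)
```

On real CO instances the enumeration in exact_entropy is infeasible, which is why only the sampled estimate is available in practice.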

It is worth noting that our modeling of decoding steps and the implementation of the PPO algorithm align with recent work in the Natural Language Processing (NLP) community, specifically RL with Human Feedback (RLHF) (e.g., https://github.com/lucidrains/PaLM-rlhf-pytorch).

configure_optimizers()[source]¶

Parameters:: parameters¶ – parameters to be optimized. If None, will use `self.policy.parameters()`

on_train_epoch_end()[source]¶: ToDo: Add support for other schedulers.

shared_step(batch, batch_idx, phase)[source]¶: Shared step between train/val/test. To be implemented in subclass

REINFORCE¶

class rl4co.models.rl.reinforce.baselines.CriticBaseline(critic=None, **unused_kw)[source]¶

Bases: REINFORCEBaseline

Critic baseline: use critic network as baseline

Parameters:: critic¶ (Optional[Module]) – Critic network to use as baseline. If None, create a new critic network based on the environment

Initializes internal Module state, shared by both nn.Module and ScriptModule.

eval(x, c, env=None)[source]¶: Evaluate baseline

setup(model, env, **kwargs)[source]¶: To be called before training during the setup phase. This follows PyTorch Lightning’s setup() convention

class rl4co.models.rl.reinforce.baselines.ExponentialBaseline(beta=0.8, **kw)[source]¶

Bases: REINFORCEBaseline

Exponential baseline: return exponential moving average of reward as baseline

Parameters:: beta¶ – Beta value for the exponential moving average

Initializes internal Module state, shared by both nn.Module and ScriptModule.

eval(td, reward, env=None)[source]¶: Evaluate baseline
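
A minimal sketch of an exponential moving average baseline, using plain Python floats in place of tensors. The initialization from the first batch mean is an assumption, and ExponentialBaselineSketch is a hypothetical name; only beta and its default come from the signature above.

```python
class ExponentialBaselineSketch:
    """Exponential moving average of the mean batch reward (illustrative)."""

    def __init__(self, beta=0.8):
        self.beta = beta
        self.value = None  # no history before the first batch

    def eval(self, rewards):
        batch_mean = sum(rewards) / len(rewards)
        if self.value is None:
            # Assumed initialization: start from the first batch mean
            self.value = batch_mean
        else:
            self.value = self.beta * self.value + (1.0 - self.beta) * batch_mean
        return self.value


baseline = ExponentialBaselineSketch(beta=0.8)
first = baseline.eval([-10.0, -12.0])  # batch mean is -11.0
second = baseline.eval([-6.0, -6.0])   # 0.8 * first + 0.2 * (-6.0)
```

A larger beta makes the baseline react more slowly to changes in the batch reward.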

class rl4co.models.rl.reinforce.baselines.NoBaseline(*args, **kw)[source]¶

Bases: REINFORCEBaseline

No baseline: return 0 for baseline and neg_loss

Initializes internal Module state, shared by both nn.Module and ScriptModule.

eval(td, reward, env=None)[source]¶: Evaluate baseline

class rl4co.models.rl.reinforce.baselines.REINFORCEBaseline(*args, **kw)[source]¶

Bases: Module

Base class for REINFORCE baselines

Initializes internal Module state, shared by both nn.Module and ScriptModule.

epoch_callback(*args, **kw)[source]¶: Callback at the end of each epoch. For example, update baseline parameters and obtain baseline values

eval(td, reward, env=None)[source]¶: Evaluate baseline

setup(*args, **kw)[source]¶: To be called before training during the setup phase. This follows PyTorch Lightning’s setup() convention

wrap_dataset(dataset, *args, **kw)[source]¶: Wrap dataset with baseline-specific functionality

class rl4co.models.rl.reinforce.baselines.RolloutBaseline(bl_alpha=0.05, progress_bar=False, **kw)[source]¶

Bases: REINFORCEBaseline

Rollout baseline: use greedy rollout as baseline

Parameters:

bl_alpha¶ – Alpha value for the baseline T-test
progress_bar¶ – Whether to show progress bar for rollout

Initializes internal Module state, shared by both nn.Module and ScriptModule.

epoch_callback(model, env, batch_size=64, device='cpu', epoch=None, dataset_size=None)[source]¶: Challenges the current baseline with the model and replaces the baseline model if it is improved
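
The challenge step can be sketched as a one-sided paired test on per-instance rewards from the candidate model and the current baseline model. The sketch assumes that higher reward is better and approximates the t-distribution with a normal distribution, which is only reasonable for large evaluation sets; should_replace_baseline is a hypothetical helper, and the actual test used by the library may differ.

```python
import math
from statistics import NormalDist, mean, stdev


def should_replace_baseline(candidate_rewards, baseline_rewards, bl_alpha=0.05):
    """One-sided paired test: is the candidate significantly better?"""
    diffs = [c - b for c, b in zip(candidate_rewards, baseline_rewards)]
    if mean(diffs) <= 0:
        return False  # candidate is not better on average
    sd = stdev(diffs)
    if sd == 0:
        return True  # better by the same margin on every instance
    t_stat = mean(diffs) / (sd / math.sqrt(len(diffs)))
    # Normal approximation to the t-distribution (assumption; fine for large n)
    p_value = 1.0 - NormalDist().cdf(t_stat)
    return p_value < bl_alpha


baseline_rewards = [-10.0 - 0.01 * i for i in range(100)]
# Consistently better candidate
better = [b + 0.5 + 0.1 * (i % 2) for i, b in enumerate(baseline_rewards)]
# Candidate whose small average gain is swamped by noise
noisy = [b + (1.0 if i % 2 == 0 else -0.9) for i, b in enumerate(baseline_rewards)]
```

A smaller bl_alpha makes the baseline harder to replace, since stronger evidence of improvement is required.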

eval(td, reward, env)[source]¶: Evaluate rollout baseline

Warning

This is not differentiable and should only be used for evaluation. Also, it is recommended to use the rollout method directly instead of this method.

rollout(model, env, batch_size=64, device='cpu', dataset=None)[source]¶: Rollout the model on the given dataset

setup(*args, **kw)[source]¶: To be called before training during the setup phase. This follows PyTorch Lightning’s setup() convention

wrap_dataset(dataset, env, batch_size=64, device='cpu', **kw)[source]¶: Wrap the dataset in a baseline dataset

Note

This is an alternative to eval that does not require the model to be passed at every call but just once. Values are added to the dataset. This also allows for larger batch sizes since we evaluate the model without gradients.

class rl4co.models.rl.reinforce.baselines.SharedBaseline(*args, **kw)[source]¶

Bases: REINFORCEBaseline

Shared baseline: return mean of reward as baseline

Initializes internal Module state, shared by both nn.Module and ScriptModule.

eval(td, reward, env=None, on_dim=1)[source]¶: Evaluate baseline
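
A small sketch of the shared baseline, assuming rewards are arranged as [batch, num_solutions] so that on_dim=1 averages over the solutions sampled for each instance. Nested lists stand in for tensors, and the helper names are hypothetical.

```python
def shared_baseline(rewards):
    """Mean reward per instance over its sampled solutions (dim 1)."""
    return [sum(row) / len(row) for row in rewards]


def advantages(rewards):
    """Reward minus the per-instance shared baseline."""
    baselines = shared_baseline(rewards)
    return [[r - b for r in row] for row, b in zip(rewards, baselines)]


# 2 instances x 3 solutions each; rewards are negative tour lengths
rewards = [[-4.0, -5.0, -6.0], [-10.0, -10.0, -13.0]]
bl = shared_baseline(rewards)
adv = advantages(rewards)
```

By construction, the advantages of each instance sum to zero, so solutions are only compared against others for the same instance.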

class rl4co.models.rl.reinforce.baselines.WarmupBaseline(baseline, n_epochs=1, warmup_exp_beta=0.8, **kw)[source]¶

Bases: REINFORCEBaseline

Warmup baseline: return convex combination of baseline and exponential baseline

Parameters:

baseline¶ – Baseline to use after warmup
n_epochs¶ – Number of epochs to warmup
warmup_exp_beta¶ – Beta value for the exponential baseline during warmup

Initializes internal Module state, shared by both nn.Module and ScriptModule.

epoch_callback(*args, **kw)[source]¶: Callback at the end of each epoch. For example, update baseline parameters and obtain baseline values

eval(td, reward, env=None)[source]¶: Evaluate baseline
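
The convex combination can be sketched as follows. The linear ramp in warmup_weight is an assumption about how the mixing weight evolves over the warmup epochs; both helper names are hypothetical and plain floats stand in for tensors.

```python
def warmup_weight(epoch, n_epochs):
    """Assumed linear ramp: weight on the target baseline once `epoch` has finished."""
    return min(1.0, (epoch + 1) / n_epochs)


def warmup_baseline_value(target_value, exp_value, alpha):
    """Convex combination of the target and exponential baseline values."""
    return alpha * target_value + (1.0 - alpha) * exp_value


# alpha = 0 uses only the exponential baseline, alpha = 1 only the target baseline
start = warmup_baseline_value(target_value=-8.0, exp_value=-10.0, alpha=0.0)
middle = warmup_baseline_value(target_value=-8.0, exp_value=-10.0, alpha=0.5)
end = warmup_baseline_value(target_value=-8.0, exp_value=-10.0, alpha=1.0)
```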

setup(*args, **kw)[source]¶: To be called before training during the setup phase. This follows PyTorch Lightning’s setup() convention

wrap_dataset(dataset, *args, **kw)[source]¶: Wrap dataset with baseline-specific functionality

rl4co.models.rl.reinforce.baselines.get_reinforce_baseline(name, **kw)[source]¶: Get a REINFORCE baseline by name. The rollout baseline defaults to a warmup baseline that uses one epoch of exponential baseline and then the greedy rollout

class rl4co.models.rl.reinforce.reinforce.REINFORCE(env, policy, baseline='rollout', baseline_kwargs={}, **kwargs)[source]¶

Bases: RL4COLitModule

REINFORCE algorithm, also known as policy gradients. See superclass RL4COLitModule for more details.

Parameters:

env¶ (RL4COEnvBase) – Environment to use for the algorithm
policy¶ (Module) – Policy to use for the algorithm
baseline¶ (Union[REINFORCEBaseline, str]) – REINFORCE baseline
baseline_kwargs¶ (dict) – Keyword arguments for baseline. Ignored if baseline is not a string
**kwargs¶ – Keyword arguments passed to the superclass

calculate_loss(td, batch, policy_out, reward=None, log_likelihood=None)[source]¶

Calculate loss for REINFORCE algorithm.

Parameters:

td¶ (TensorDict) – TensorDict containing the current state of the environment
batch¶ (TensorDict) – Batch of data. This is used to get the extra loss terms, e.g., REINFORCE baseline
policy_out¶ (dict) – Output of the policy network
reward¶ (Optional[Tensor]) – Reward tensor. If None, it is taken from policy_out
log_likelihood¶ (Optional[Tensor]) – Log-likelihood tensor. If None, it is taken from policy_out
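
The standard REINFORCE loss with a baseline can be sketched as below, with plain floats standing in for tensors. The optional baseline_loss term (for example, a critic's regression loss) and the function name reinforce_loss are assumptions for illustration; in an autograd setting the advantage would be treated as a constant.

```python
def reinforce_loss(rewards, log_likelihoods, baseline_values, baseline_loss=0.0):
    """Mean of -(advantage * log-likelihood), plus an optional baseline loss."""
    advantages = [r - b for r, b in zip(rewards, baseline_values)]
    weighted = [a * ll for a, ll in zip(advantages, log_likelihoods)]
    policy_loss = -sum(weighted) / len(weighted)
    return policy_loss + baseline_loss


loss = reinforce_loss(
    rewards=[-5.0, -7.0],
    log_likelihoods=[-1.0, -2.0],
    baseline_values=[-6.0, -6.0],
)
```

Minimizing this loss raises the log-likelihood of solutions with positive advantage and lowers it for those with negative advantage.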

classmethod load_from_checkpoint(checkpoint_path, map_location=None, hparams_file=None, strict=False, load_baseline=True, **kwargs)[source]¶: Load model from checkpoint.

Return type:: Self

Note

This is a modified version of load_from_checkpoint from pytorch_lightning.core.saving. It deals with matching keys for the baseline by first running setup.

on_train_epoch_end()[source]¶: Callback for end of training epoch: we evaluate the baseline

post_setup_hook(stage='fit')[source]¶: Hook to be called after setup. Can be used to set up subclasses without overriding setup

set_decode_type_multistart(phase)[source]¶

Set decode type to multistart for train, val and test in policy. For example, if the decode type is greedy, it will be set to greedy_multistart.

Parameters:: phase¶ (str) – Phase to set decode type for. Must be one of train, val or test.
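
A plain-Python sketch of the renaming described above, following the greedy to greedy_multistart example. The dictionary of per-phase decode types and both helper functions are hypothetical; how the policy actually stores its decode types is not specified here.

```python
VALID_PHASES = ("train", "val", "test")


def multistart_decode_type(decode_type):
    """Append the multistart suffix unless it is already present."""
    if "multistart" in decode_type:
        return decode_type
    return f"{decode_type}_multistart"


def set_decode_type_multistart(decode_types, phase):
    """Return a copy of the per-phase decode types with `phase` switched."""
    if phase not in VALID_PHASES:
        raise ValueError(f"phase must be one of {VALID_PHASES}, got {phase!r}")
    updated = dict(decode_types)
    updated[phase] = multistart_decode_type(updated[phase])
    return updated


decode_types = {"train": "sampling", "val": "greedy", "test": "greedy"}
updated = set_decode_type_multistart(decode_types, "val")
```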

shared_step(batch, batch_idx, phase)[source]¶: Shared step between train/val/test. To be implemented in subclass

wrap_dataset(dataset)[source]¶: Wrap dataset from baseline evaluation. Used in greedy rollout baseline