Understanding PyTorch Eager and Graph Mode

PyTorch originally used an eager execution mode, which operates in a dynamic, or “define-by-run,” paradigm; this remains PyTorch’s default mode of operation. More recently, PyTorch introduced a graph mode, or “define-and-run,” paradigm, provided mainly by PyTorch’s JIT (Just-In-Time) compiler through TorchScript.

  1. Eager Mode (Define-by-run): In eager mode, the computation is performed as you write the code. This makes it very flexible, intuitive, and user-friendly, since the Python code and the actual computation directly correspond to each other. Eager execution allows for dynamic control flow in the network, including loops, ifs, and other Python control structures. It is a great fit for complex, imperative programs.
  2. Graph Mode (Define-and-run): Graph mode, on the other hand, builds the entire computation graph before running the computation. The graph-based execution in PyTorch is provided via TorchScript, which uses a Just-In-Time (JIT) compiler to convert PyTorch models to a graph representation. The major advantage of this mode is that it can perform various optimizations to speed up execution, and it allows models to be run in non-Python environments, which is crucial for deployment scenarios.

In other words, eager mode allows for more Pythonic and interactive programming, while graph mode can lead to better performance and deployment capabilities. PyTorch provides the flexibility to use both modes based on the needs of the project.

Let’s look at a simple example. Suppose you have a small feed-forward model defined as follows:

import torch

class MyModel(torch.nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(MyModel, self).__init__()
        
        # Define the layers here
        self.layer1 = torch.nn.Linear(input_size, hidden_size)
        self.layer2 = torch.nn.Linear(hidden_size, hidden_size)
        self.output_layer = torch.nn.Linear(hidden_size, output_size)
        self.relu = torch.nn.ReLU()

    def forward(self, x):
        x = self.relu(self.layer1(x))
        x = self.relu(self.layer2(x))
        x = self.output_layer(x)
        return x

When you create an instance of this model and pass an input tensor to it, the operations defined in the forward method are run immediately:

model = MyModel(input_size=1, hidden_size=8, output_size=1)
x = torch.tensor([2.0])
y = model(x)  # The operations are run immediately here.

This is in contrast to a define-and-run system where the model(x) line wouldn’t actually perform any computation; instead, it would add operations to a computational graph to be run later.
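
To see the graph-mode side in PyTorch itself, you can compile this same model with TorchScript. Below is a minimal sketch (reusing model and x from above; the file name is arbitrary): torch.jit.trace records the operations performed on an example input into a static graph that can be inspected, optimized, and executed outside of Python.

scripted = torch.jit.trace(model, x)   # record one forward pass into a graph
print(scripted.graph)                  # the captured graph representation
print(scripted.code)                   # TorchScript code generated from the graph
y = scripted(x)                        # runs the recorded graph rather than the Python forward()
scripted.save("my_model.pt")           # can later be loaded without Python, e.g. via libtorch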

The __init__() and forward() methods define the model, and output = model(input) runs it. Those are clearly separate steps, so why call it define-by-run?

“Define-by-run” refers to the execution paradigm used by PyTorch, where the computational graph of a model is defined on the fly as the operations are run. It means that each line of code that performs an operation on a tensor is executed immediately, instead of building a static graph to run later, which is the case in a “define-and-run” execution paradigm.

Consider the forward() function in your model. In a “define-and-run” paradigm (like the one TensorFlow 1.x used), operations in this function wouldn’t be run immediately. Instead, they would define a part of the computation graph, which would be executed later when you run the whole graph using a separate command.

However, in PyTorch’s “define-by-run” paradigm, the operations are actually executed as they are encountered in the code. For example, when you write x = self.layer1(x) in the forward() function and then call output = model(input), the self.layer1(x) operation is run immediately. This results in the dynamic creation of the computational graph, with each operation adding nodes to the graph as they are run.

So, while it’s correct that defining the __init__() and forward() methods and then calling the model with some data are separate steps, within each of those steps, the operations are performed immediately as they’re encountered, which is the essence of the “define-by-run” paradigm. This offers a lot of flexibility and ease-of-use, particularly when working with complex models that have dynamic behavior.
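
One easy way to convince yourself of this is to drop a print() into forward(): it fires on every call, because the Python code in forward() really is executed each time. A small sketch (NoisyModel is just an illustrative name):

class NoisyModel(torch.nn.Module):
    def __init__(self):
        super(NoisyModel, self).__init__()
        self.layer = torch.nn.Linear(1, 1)

    def forward(self, x):
        print("running forward")        # executes immediately, every time, in eager mode
        return self.layer(x)

m = NoisyModel()
m(torch.tensor([1.0]))   # prints "running forward" and returns a result right away
m(torch.tensor([2.0]))   # prints again: the operations run as they are encountered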

Do you mean the graph is created when output = model(input) is called for the first time with some data?

Yes, that’s correct. In PyTorch, the computational graph is created dynamically at runtime, when you actually run data through the model. So when you execute output = model(input), the operations defined in your model’s forward() function are performed on the input data, and the computational graph is built as these operations are executed.

This computational graph is essential for automatic differentiation. PyTorch’s autograd system uses this graph to compute gradients during backpropagation. Importantly, because the graph is generated dynamically (“define-by-run”), PyTorch allows for complex, dynamic control flows in your model (like loops and conditionals), because the graph can change from one run to the next.
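
Here is a rough sketch of that idea (the exact grad_fn class names, like SumBackward0, can differ between PyTorch versions):

a = torch.tensor([2.0, -1.0], requires_grad=True)

# Ordinary Python control flow decides which operations run,
# and only the operations that actually run become graph nodes.
if a.sum() > 0:
    b = (a * 3).sum()
else:
    b = (a ** 2).sum()

print(b.grad_fn)   # e.g. <SumBackward0 ...>: the last node recorded in this run's graph
b.backward()       # autograd walks the recorded graph backwards to fill a.grad
print(a.grad)      # tensor([3., 3.]) for the branch taken here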

In contrast, in a “define-and-run” system, the computational graph is built once, when you define the model, before any actual computations are performed. Then, this static graph can be run multiple times with different input data. This approach was used in TensorFlow 1.x, though TensorFlow 2.x has moved to a dynamic (“define-by-run”) system, like PyTorch, as its default.

This dynamic graph creation is one of the features that make PyTorch very flexible and intuitive for developing complex models.

So after the graph is built during the first run, what happens when I call model() again?

When you call model(input) again, PyTorch will create a new computational graph. This is because, in PyTorch’s “define-by-run” paradigm, the computational graph is built dynamically each time you perform a forward pass.

The new graph is constructed based on the operations defined in your model’s forward() method, just like in the first forward pass. However, the graph can potentially be different from the one in the first forward pass, since the operations can be dynamic (e.g., they can depend on Python control flow like if statements or for loops).

Importantly, the weights of the model (i.e., the parameters learned during training) are retained across different forward passes, assuming you’re using the same model instance. So if you’ve updated the model’s weights during training (using backpropagation and an optimizer step), those updated weights will be used in the subsequent forward pass when you call model(input) again.

This process of creating a new graph for each forward pass allows for dynamic computation, which is particularly useful for models with control flows that can change from one input to the next, such as recurrent neural networks (RNNs) or models that involve loops or conditional statements.
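
A minimal sketch of that flow with a single linear layer (the sizes, loss, and learning rate are arbitrary, chosen just for illustration):

layer = torch.nn.Linear(2, 1)
opt = torch.optim.SGD(layer.parameters(), lr=0.1)

x = torch.tensor([1.0, 2.0])
loss = layer(x).sum()    # first forward pass: a graph is built
loss.backward()          # gradients computed, graph discarded
opt.step()               # parameters updated in place
opt.zero_grad()

out2 = layer(x)          # second forward pass: a brand-new graph, but using the updated weights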

So on the second run, the graph built during the first run is discarded?

Yes, that’s correct. In PyTorch, the computational graph is discarded after each forward and backward pass (i.e., after the gradients have been computed). This is because PyTorch uses dynamic computational graphs (or “define-by-run” graphs), which means a new graph is created for each forward pass.

When you perform a backward pass using something like loss.backward(), PyTorch computes the gradients and then immediately discards the graph. This is done to save memory, since keeping around old graphs for which gradients have already been computed would be wasteful. This is especially important when working with large models and large batches of input data.

If you want to retain the graph for some reason (usually for higher-order derivatives or when you want to call backward more than once for a given computation), you can call loss.backward(retain_graph=True) to prevent the graph from being discarded. However, you should use this option sparingly, as retaining the graph requires more memory.
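
For example (a small sketch; without retain_graph=True the second backward() call below would raise a runtime error, because the first call frees the graph's saved tensors):

w = torch.tensor([1.0, 2.0], requires_grad=True)
loss = (w ** 2).sum()

loss.backward(retain_graph=True)   # keep the graph alive for another backward pass
loss.backward()                    # OK: gradients accumulate into w.grad
print(w.grad)                      # tensor([4., 8.]) -- twice the single-pass gradient 2*w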

It’s important to note that, although the graph is discarded, the values of the model’s weights (parameters) are retained and updated during the optimization step. So the learning from one forward and backward pass is captured in the updated parameters, and those updated parameters are used in the next forward pass.

Aren’t parameters (weights) associated with the edges in the graph? How would it be meaningful to keep parameters while discarding the graph? How would the old parameters be used for a completely new graph?

Great question! It might seem confusing at first, but it comes down to how we define the computational graph and what we mean when we say that the graph is “discarded”.

In PyTorch, a computational graph is a dynamic entity that represents the operations and computations performed during the forward pass of your model for a specific set of inputs. It’s used to compute the gradients during the backward pass. Once the gradients are computed, the graph is discarded, which essentially means that the references to the intermediate tensors and operations are discarded, freeing up memory.

However, the parameters (weights and biases) of your model are not part of this dynamic computational graph in the sense that they are not discarded. They are part of your model instance and are stored separately. You can think of them as associated with the “nodes” (the layers or functions) of the computational graph rather than the edges.

So, when you perform a forward pass, PyTorch dynamically creates a computational graph where the nodes are the operations (like matrix multiplications, activations, etc.), and the edges represent the tensors flowing between these operations. The weights (and biases) of your model are inputs to these operations, so they take part in the computations that the graph records. However, they are not discarded after the forward and backward passes.

This means that, when you perform another forward pass (thus creating a new computational graph), the operations in the graph will use the same weights as before, unless they have been updated (like during an optimization step in training).

So, the parameters (weights and biases) persist across different computational graphs, as they are part of your model and are not discarded like the graph itself. The new computational graph on each forward pass represents the computations performed by your model with the current parameters and the specific input data.

There are actually two separate but related structures to think about here: the model architecture and the computational graph.

The model architecture refers to the high-level structure of your model: the types of layers it has, the number of neurons in each layer, how the layers are connected, etc. This is defined in your __init__() and forward() methods in PyTorch. In the MyModel example above, this would be the two hidden linear layers plus the output layer. This structure doesn’t change unless you redefine your model.

The computational graph, on the other hand, is a representation of the specific computations that happen when you pass data through your model. It includes the operations performed by each layer, the tensors that flow between layers, and how these operations are interconnected. This graph is dynamically built each time you perform a forward pass through your model with specific input data, and then discarded after each backward pass.

So while your model architecture remains the same across different runs, the computational graph is rebuilt for each individual forward pass. In this graph, the nodes represent the operations (like matrix multiplications, activations, etc.), and the edges represent the tensors flowing between these operations. The parameters (weights and biases) of your model are used in these operations, but are not discarded like the graph itself after the backward pass.

This dynamic computational graph approach is what allows PyTorch to handle variable-length inputs, complex control flow, and other dynamic behaviors in your model. It also allows PyTorch’s autograd system to compute gradients automatically, since it can use the graph to keep track of which operations were performed and in what order.
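
If you are curious, you can peek at this recorded graph through the grad_fn attribute of the output tensor. A small sketch (the printed node names, e.g. AddmmBackward0, depend on the PyTorch version):

layer = torch.nn.Linear(3, 1)
out = layer(torch.randn(1, 3))

print(out.grad_fn)                  # the last autograd node recorded for this forward pass
print(out.grad_fn.next_functions)   # the nodes that fed into it, i.e. the graph's edges
print(list(layer.parameters()))     # the weights live on the module and are only referenced by the graph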

Let’s consider a simple two-layer model (an input layer and an output layer), and go through two runs.

Model Definition:

class SimpleModel(torch.nn.Module):
    def __init__(self):
        super(SimpleModel, self).__init__()
        self.layer1 = torch.nn.Linear(3, 3)  # Input layer
        self.layer2 = torch.nn.Linear(3, 1)  # Output layer

    def forward(self, x):
        x = torch.relu(self.layer1(x))
        x = self.layer2(x)
        return x

Let’s say we create an instance of this model and pass it two different inputs at two different times.

First Run:

model = SimpleModel()

input1 = torch.tensor([1.0, 2.0, 3.0])
output1 = model(input1)

In this case, the computational graph would look something like this (simplified for clarity):

input1 --> Linear --> ReLU --> Linear --> output1

Each arrow represents a tensor operation; the “Linear” nodes use the parameters (weights and biases) of the model, while “ReLU” has no parameters of its own. After computing output1, if we call output1.backward(), PyTorch would use this graph to compute the gradients and then discard the graph.

Second Run:

input2 = torch.tensor([4.0, 5.0, 6.0])
output2 = model(input2)

For this second forward pass, PyTorch would construct a new computational graph:

input2 --> Linear --> ReLU --> Linear --> output2

Again, each arrow represents an operation, and the nodes use the same parameters as in the first run (unless they were updated in the meantime). After computing output2, if we call output2.backward(), PyTorch would use this new graph to compute the gradients and then discard it.

While these two graphs look structurally identical, they are separate instances. The tensors (input, output, and intermediates) in the first graph are distinct from the tensors in the second graph. However, the operations in the graphs are based on the same model parameters.
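
You can check this directly: the two outputs point at different graph objects even though the graphs have the same shape (a tiny sketch building on the code above):

print(output1.grad_fn, output2.grad_fn)     # same kind of node, two distinct objects
print(output1.grad_fn is output2.grad_fn)   # False: each forward pass recorded its own graph
print(model.layer1.weight.shape)            # the parameters themselves live on the model instance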

Also, it’s worth noting that these are simplified diagrams. A complete computational graph would include more detail, like separate nodes for each operation (e.g., the matrix multiplication and bias addition in a linear layer) and the individual components of more complex functions (like the comparison and multiplication operations within a ReLU).

Finally, these examples assume a model with a fixed structure. If your model included control flow that depends on the input (e.g., a loop that runs for a number of iterations based on the input data), then the computational graphs for different inputs could have different structures.

So, since the two graphs are structurally identical, the example doesn’t reflect the benefit of define-by-run, right?

Correct, the example I provided uses a simple feed-forward neural network, where the structure of the computational graph is the same for every input. This is often the case for many common types of models, like Convolutional Neural Networks (CNNs) and simple Multi-Layer Perceptrons (MLPs).

The real advantage of the “define-by-run” paradigm shines when your model has dynamic structure, such as loops or conditionals that depend on input data, or when different inputs should be processed in different ways. This dynamic behavior is common in certain types of models like Recurrent Neural Networks (RNNs), or models that work on data with a variable-length sequence or hierarchical structure.

For instance, in an RNN, the computational graph for a sentence with five words would be different from the graph for a sentence with ten words. The “define-by-run” approach allows the graph to be dynamically created for each individual input, handling this variability naturally.

Let’s take a simple example of a custom RNN (for simplicity, not using PyTorch’s built-in RNN layers):

class SimpleRNN(torch.nn.Module):
    def __init__(self):
        super(SimpleRNN, self).__init__()
        self.hidden_size = 10
        self.input_size = 20
        # A single shared layer maps [input_t, hidden_state] to the next hidden state.
        self.linear = torch.nn.Linear(self.hidden_size + self.input_size, self.hidden_size)

    def forward(self, input_sequence):
        hidden_state = torch.zeros(self.hidden_size)
        # The number of loop iterations depends on the input, so the graph
        # is unrolled once per time-step of this particular sequence.
        for input_t in input_sequence:
            combined = torch.cat([input_t, hidden_state])
            hidden_state = self.linear(combined)
        return hidden_state

Now, if you pass in an input sequence of length 5, the computational graph will have 5 steps of the loop. If you pass in a sequence of length 10, the graph will have 10 steps. So the structure of the computational graph changes dynamically based on the input data, which is where the “define-by-run” approach really shows its advantage.
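
For example (a sketch; the random tensors stand in for two sequences of different lengths, each row being one 20-dimensional time-step):

rnn = SimpleRNN()

seq5 = torch.randn(5, 20)     # 5 time-steps
seq10 = torch.randn(10, 20)   # 10 time-steps

h5 = rnn(seq5)     # the loop runs 5 times, so this run's graph has 5 unrolled steps
h10 = rnn(seq10)   # the loop runs 10 times, so this run's graph has 10 unrolled steps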

Does the 5 or 10 steps here mean 5 or 10 layers in the neural network?

Not exactly. In the context of a Recurrent Neural Network (RNN), when we say “5 steps” or “10 steps,” we’re typically referring to the sequence length, not the number of layers in the neural network.

In an RNN, each “step” corresponds to one time-step in the input sequence. At each time-step, the RNN performs some computation on the current input and the previous hidden state, and produces an output and a new hidden state. The same weights (parameters) are used at each time-step, so we usually think of this as one “layer” being applied repeatedly, rather than multiple distinct layers.

So, if you have an input sequence of length 5, you would run the RNN for 5 steps, applying the same computation (the same “layer”) 5 times. If your input sequence is of length 10, you would run the RNN for 10 steps. The key here is that the structure of the computational graph depends on the length of the input sequence, which is where the “define-by-run” approach becomes valuable.

This flexibility allows RNNs to process sequences of varying lengths, which is critical for many tasks, like language modeling, where the input could be a sentence or paragraph of any length.

So to clarify, “5 steps” or “10 steps” refers to the number of time-steps in the sequence, not the number of layers in the neural network. The network itself typically has a fixed architecture with a certain number of layers, but the computations it performs can be unrolled over many steps for sequence data.
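
One quick way to see this is that the number of parameters does not depend on the sequence length (a sketch using the SimpleRNN above):

rnn = SimpleRNN()
print(sum(p.numel() for p in rnn.parameters()))
# 310 here (a 10x30 weight matrix plus 10 biases), whether the sequence
# has 5 or 10 time-steps -- the same layer is simply reused at every step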

What’s a Backtrace

A backtrace is a summary of how a program got to a certain point. It’s often used in debugging to figure out what sequence of function calls and events led to a particular error or crash. The backtrace provides a list of function calls that are currently active in a thread.

Here’s a simplified example:

  1. main() (the starting point of the program)
  2. functionA() called by main
  3. functionB() called by functionA
  4. functionC() called by functionB

If functionC() causes an error, the backtrace would look something like this:

  1. functionC()
  2. functionB()
  3. functionA()
  4. main()

This backtrace tells you that functionC() was called by functionB(), which was called by functionA(), which was originally called by main(). This can be useful to understand the “path” taken through the program to arrive at the error.
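
In Python, for example, you get a backtrace (called a traceback) automatically when an uncaught exception occurs. A small sketch, with the caveat that Python prints the frames in the opposite order, most recent call last:

def functionC():
    raise ValueError("something went wrong")

def functionB():
    functionC()

def functionA():
    functionB()

functionA()
# Traceback (most recent call last):
#   File "...", in <module>:   functionA()
#   File "...", in functionA:  functionB()
#   File "...", in functionB:  functionC()
#   File "...", in functionC:  raise ValueError("something went wrong")
# ValueError: something went wrong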

In real-life situations, backtraces can be significantly more complex due to things like recursion, multiple threads, or complicated control flows. But the principle remains the same: it’s a way to trace the path of execution that led to a specific point in the code.