1 torch.nn.LSTM

torch.nn.LSTM is PyTorch's built-in LSTM module.

For each element of the input sequence, torch.nn.LSTM applies the classic LSTM computation:

\begin{array}{c}
i_{t}=\sigma\left(W_{i i} x_{t}+b_{i i}+W_{h i} h_{t-1}+b_{h i}\right) \\
f_{t}=\sigma\left(W_{i f} x_{t}+b_{i f}+W_{h f} h_{t-1}+b_{h f}\right) \\
g_{t}=\tanh \left(W_{i g} x_{t}+b_{i g}+W_{h g} h_{t-1}+b_{h g}\right) \\
o_{t}=\sigma\left(W_{i o} x_{t}+b_{i o}+W_{h o} h_{t-1}+b_{h o}\right) \\
c_{t}=f_{t} \odot c_{t-1}+i_{t} \odot g_{t} \\
h_{t}=o_{t} \odot \tanh \left(c_{t}\right)
\end{array}

Here h_{t} is the hidden state at time t, c_{t} is the cell state at time t, x_{t} is the input at time t, and h_{t-1} is the hidden state at time t-1 (or the initial hidden state at time 0). i_{t} is the input gate, f_{t} the forget gate, g_{t} the cell gate, and o_{t} the output gate; \sigma is the sigmoid function and \odot is the Hadamard (element-wise) product.
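The gate equations above can be checked against PyTorch's single-step cell, torch.nn.LSTMCell, whose parameters store the four gates concatenated in the order i, f, g, o. A minimal sketch (sizes are toy values chosen only for illustration):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
input_size, hidden_size = 3, 4          # toy sizes, for illustration only
cell = nn.LSTMCell(input_size, hidden_size)

x = torch.randn(1, input_size)          # one input vector, batch of 1
h_prev = torch.zeros(1, hidden_size)
c_prev = torch.zeros(1, hidden_size)

# Apply the gate equations by hand using the cell's own parameters.
# PyTorch stores the four gates concatenated in the order i, f, g, o.
gates = x @ cell.weight_ih.t() + cell.bias_ih + h_prev @ cell.weight_hh.t() + cell.bias_hh
i, f, g, o = gates.chunk(4, dim=1)
i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
g = torch.tanh(g)
c = f * c_prev + i * g                  # c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t
h = o * torch.tanh(c)                   # h_t = o_t ⊙ tanh(c_t)

h_ref, c_ref = cell(x, (h_prev, c_prev))
print(torch.allclose(h, h_ref, atol=1e-6))  # True
```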

In a multi-layer LSTM, the input x_{t}^{(l)} of the l-th layer (l \ge 2) is the hidden state h_{t}^{(l-1)} of the previous layer multiplied by a dropout mask \delta_{t}^{(l-1)}, where each \delta_{t}^{(l-1)} is a Bernoulli random variable that equals 0 with probability dropout.

If proj_size > 0 is set, an LSTM with projections is used, which changes the LSTM cell in the following ways.

First, the dimension of h_{t} changes from hidden_size to proj_size (the dimensions of W_{hi} change accordingly);

Second, the output hidden state of each layer is multiplied by a learnable projection matrix: h_{t} = W_{hr} h_{t}. As a result, the outputs of the LSTM network also have different shapes. For details on the shape changes, see the paper https://arxiv.org/abs/1402.1128
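A quick sketch (with illustrative sizes) of how proj_size changes the output shapes: the hidden state is projected down to proj_size while the cell state keeps hidden_size.

```python
import torch
import torch.nn as nn

# Illustrative sizes: hidden_size=20 projected down to proj_size=5.
lstm = nn.LSTM(input_size=10, hidden_size=20, num_layers=1, proj_size=5)
x = torch.randn(7, 3, 10)        # (seq_len, batch, input_size)
out, (h_n, c_n) = lstm(x)

print(out.shape)  # torch.Size([7, 3, 5])   last dim is proj_size, not hidden_size
print(h_n.shape)  # torch.Size([1, 3, 5])   hidden state is projected
print(c_n.shape)  # torch.Size([1, 3, 20])  cell state keeps hidden_size
```

Note that proj_size must be strictly smaller than hidden_size.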

1.1 Creating a torch.nn.LSTM

Signature

torch.nn.LSTM(*args, **kwargs)

Parameters

  • input_size: the expected feature dimension of the input x. Given the overall network input of shape (seq_len, batch, input_size), input_size determines the dimension of each element (e.g. each word vector);
  • hidden_size: the feature dimension of the hidden state h;
  • num_layers: number of recurrent layers, default 1. For example, setting it to 2 stacks two LSTMs, with the second LSTM taking the outputs of the first and computing the final result;
  • bias: default True. If False, the layer does not use the bias weights b_ih and b_hh;
  • batch_first: default False. If True, the input and output shapes change from (seq, batch, feature) to (batch, seq, feature);
  • dropout: default 0. If non-zero, a Dropout layer is applied to the outputs of every LSTM layer except the last, with dropout probability equal to this value;
  • bidirectional: default False. If True, a bidirectional LSTM is used;
  • proj_size: default 0. If non-zero, a projected LSTM of the corresponding size is used;
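The shape effects of the batch_first and bidirectional flags can be sketched as follows (sizes are illustrative):

```python
import torch
import torch.nn as nn

# Illustrative configuration: 2 stacked layers, bidirectional, batch-first input.
lstm = nn.LSTM(input_size=10, hidden_size=20, num_layers=2,
               batch_first=True, bidirectional=True)
x = torch.randn(8, 5, 10)        # (batch, seq, feature) because batch_first=True

out, (h_n, c_n) = lstm(x)
print(out.shape)  # torch.Size([8, 5, 40])  last dim = D * hidden_size = 2 * 20
print(h_n.shape)  # torch.Size([4, 8, 20])  first dim = D * num_layers = 2 * 2
```

Note that batch_first only affects the layout of input and output; h_n and c_n always keep the shape (D * num_layers, batch, hidden_size).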

1.2 Using torch.nn.LSTM

Signature

output, (h_n, c_n) = lstm(input, (h_0, c_0))

Inputs

  • input: for unbatched input, a Tensor of shape (L, H_{in}); for batched input, shape (L, N, H_{in}) when batch_first = False and (N, L, H_{in}) when batch_first = True. input contains the features of the input sequence. input may also be a packed variable-length sequence; see torch.nn.utils.rnn.pack_padded_sequence() or torch.nn.utils.rnn.pack_sequence() for details.
  • h_0: for unbatched input, a Tensor of shape (D * num_layers, H_{out}); for batched input, shape (D * num_layers, N, H_{out}). It contains the initial hidden state for each element in the input sequence.
  • c_0: for unbatched input, a Tensor of shape (D * num_layers, H_{cell}); for batched input, shape (D * num_layers, N, H_{cell}). It contains the initial cell state for each element in the input sequence. If (h_0, c_0) is not provided, both default to zeros.
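The packed-sequence input mentioned above can be sketched as follows, padding two variable-length sequences and packing them before the LSTM call (sizes are illustrative):

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

lstm = nn.LSTM(input_size=4, hidden_size=6)

# Two sequences of true lengths 5 and 3, zero-padded to length 5 (batch_first=False).
padded = torch.randn(5, 2, 4)
lengths = torch.tensor([5, 3])           # must be descending when enforce_sorted=True

packed = pack_padded_sequence(padded, lengths, enforce_sorted=True)
packed_out, (h_n, c_n) = lstm(packed)    # the output is also a packed sequence
out, out_lengths = pad_packed_sequence(packed_out)

print(out.shape)             # torch.Size([5, 2, 6])
print(out_lengths.tolist())  # [5, 3]
```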

The symbols above have the following meanings:

\begin{aligned}
N &= \text{batch size} \\
L &= \text{sequence length} \\
D &= 2 \text{ if bidirectional} = \text{True, otherwise } 1 \\
H_{\text{in}} &= \text{input\_size} \\
H_{\text{cell}} &= \text{hidden\_size} \\
H_{\text{out}} &= \text{proj\_size if proj\_size} > 0 \text{, otherwise hidden\_size}
\end{aligned}

In summary, for batched input (with batch_first = False and proj_size = 0), the input dimensions of torch.nn.LSTM are:

  • input: (seq_len, batch_size, input_size)
  • h_0: (num_directions * num_layers, batch_size, hidden_size)
  • c_0: (num_directions * num_layers, batch_size, hidden_size)

The input of torch.nn.LSTM can be understood as follows:

  • seq_len is the length of each sequence: for text, the length of each sentence (usually fixed); for stock data, how many data points fall within the chosen time window. This parameter also determines how many time steps process the input data.
  • batch_size is the number of inputs: for text, how many sentences are fed in at once; for stock data, how many windows of the chosen time unit;
  • input_size is the dimension of each input element: for text, the dimension of the vector representing each word; for stock data, how many values are collected at each specific moment within the time unit, e.g. the low, high, mean price, 5-day moving average, 10-day moving average, and so on.
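As a concrete sketch of this interpretation for text (the vocabulary size and word ids below are made up for illustration), word ids are embedded into vectors and then rearranged into the (seq_len, batch, input_size) layout the LSTM expects:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Hypothetical toy vocabulary: 100 word ids, each embedded into 10 dimensions.
embedding = nn.Embedding(num_embeddings=100, embedding_dim=10)

token_ids = torch.randint(0, 100, (8, 5))   # 8 sentences, 5 words each
embedded = embedding(token_ids)             # (batch=8, seq=5, input_size=10)
lstm_input = embedded.transpose(0, 1)       # (seq_len=5, batch=8, input_size=10)

lstm = nn.LSTM(input_size=10, hidden_size=20)
out, _ = lstm(lstm_input)
print(out.shape)  # torch.Size([5, 8, 20])
```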

Outputs

  • output: for unbatched input, a Tensor of shape (L, D * H_{out}); for batched input, shape (L, N, D * H_{out}) when batch_first = False and (N, L, D * H_{out}) when batch_first = True. It contains the output features from the last layer of the LSTM for every time step t. If the LSTM received a packed sequence (torch.nn.utils.rnn.pack_sequence()) as input, the output is also a packed sequence.

  • h_n: for unbatched input, a Tensor of shape (D * num_layers, H_{out}); for batched input, shape (D * num_layers, N, H_{out}). It contains the final hidden state for each element in the sequence.

  • c_n: for unbatched input, a Tensor of shape (D * num_layers, H_{cell}); for batched input, shape (D * num_layers, N, H_{cell}). It contains the final cell state for each element in the sequence.

In summary, for batched input (with batch_first = False and proj_size = 0), the output dimensions of torch.nn.LSTM are:

  • output: (seq_len, batch_size, num_directions * hidden_size)
  • h_n: (num_directions * num_layers, batch_size, hidden_size)
  • c_n: (num_directions * num_layers, batch_size, hidden_size)
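For a single-layer, unidirectional LSTM, h_n is simply the last time step of output, which can be verified with a quick sketch:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lstm = nn.LSTM(input_size=10, hidden_size=20, num_layers=1)  # single layer, unidirectional
x = torch.randn(5, 8, 10)
out, (h_n, c_n) = lstm(x)

# h_n has shape (1, 8, 20); out[-1] is the final time step of out, shape (8, 20).
print(torch.allclose(out[-1], h_n[0]))  # True
```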

1.3 torch.nn.LSTM usage example

Suppose each sentence has 5 words, each word is represented by a 10-dimensional vector, and batch_size is 8. Then:

# -*- coding: utf-8 -*-

import torch
import torch.nn as nn

if __name__ == '__main__':
    lstm_input = torch.randn(5,8,10) # lstm_input => (seq_len = 5,batch_size = 8,input_size = 10)

    lstm = nn.LSTM(10,20,1) # (input_size=10,hidden_size = 20,num_layers = 1)

    out,(h_n,c_n) = lstm(lstm_input) # out =>(seq_len = 5,batch_size = 8,D * hidden_size = 1 * 20)

    print(out.shape)

Output

torch.Size([5, 8, 20])

1.4 MNIST image classification with an LSTM

import torch 
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms


# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Hyper-parameters
sequence_length = 28
input_size = 28
hidden_size = 128
num_layers = 2
num_classes = 10
batch_size = 100
num_epochs = 2
learning_rate = 0.01

# MNIST dataset
train_dataset = torchvision.datasets.MNIST(root='../../data/',
                                           train=True, 
                                           transform=transforms.ToTensor(),
                                           download=True)

test_dataset = torchvision.datasets.MNIST(root='../../data/',
                                          train=False, 
                                          transform=transforms.ToTensor())

# Data loader
train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
                                           batch_size=batch_size, 
                                           shuffle=True)

test_loader = torch.utils.data.DataLoader(dataset=test_dataset,
                                          batch_size=batch_size, 
                                          shuffle=False)

# Recurrent neural network (many-to-one)
class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, num_classes):
        super(RNN, self).__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        # Set initial hidden and cell states 
        h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size).to(device) 
        c0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size).to(device)

        # Forward propagate LSTM
        out, _ = self.lstm(x, (h0, c0))  # out: tensor of shape (batch_size, seq_length, hidden_size)

        # Decode the hidden state of the last time step
        out = self.fc(out[:, -1, :])
        return out

model = RNN(input_size, hidden_size, num_layers, num_classes).to(device)


# Loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

# Train the model
total_step = len(train_loader)
for epoch in range(num_epochs):
    for i, (images, labels) in enumerate(train_loader):
        images = images.reshape(-1, sequence_length, input_size).to(device)
        labels = labels.to(device)

        # Forward pass
        outputs = model(images)
        loss = criterion(outputs, labels)

        # Backward and optimize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if (i+1) % 100 == 0:
            print ('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}' 
                   .format(epoch+1, num_epochs, i+1, total_step, loss.item()))

# Test the model
model.eval()
with torch.no_grad():
    correct = 0
    total = 0
    for images, labels in test_loader:
        images = images.reshape(-1, sequence_length, input_size).to(device)
        labels = labels.to(device)
        outputs = model(images)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

    print('Test Accuracy of the model on the 10000 test images: {} %'.format(100 * correct / total)) 

# Save the model checkpoint
torch.save(model.state_dict(), 'model.ckpt')

Output

Epoch [1/2], Step [100/600], Loss: 0.5480
Epoch [1/2], Step [200/600], Loss: 0.4485
Epoch [1/2], Step [300/600], Loss: 0.2460
Epoch [1/2], Step [400/600], Loss: 0.1604
Epoch [1/2], Step [500/600], Loss: 0.2246
Epoch [1/2], Step [600/600], Loss: 0.1616
Epoch [2/2], Step [100/600], Loss: 0.0647
Epoch [2/2], Step [200/600], Loss: 0.1051
Epoch [2/2], Step [300/600], Loss: 0.1356
Epoch [2/2], Step [400/600], Loss: 0.0415
Epoch [2/2], Step [500/600], Loss: 0.0389
Epoch [2/2], Step [600/600], Loss: 0.0801
Test Accuracy of the model on the 10000 test images: 97.45 %
