Pelaksanaan kod PyTorch dan penjelasan langkah demi langkah pembelajaran pengukuhan DDPG-Tutorial Python-php.cn

Deep Deterministic Policy Gradient (DDPG) ialah algoritma peneguhan dalam tanpa model yang diilhamkan oleh Deep Q-Network Ia berdasarkan Actor-Critic menggunakan kecerunan dasar Artikel ini akan menggunakan pytorch untuk melaksanakannya pelaksanaan dan penjelasan

Pelaksanaan kod PyTorch dan penjelasan langkah demi langkah pembelajaran pengukuhan DDPG

Komponen utama DDPG ialah

Replay Buffer
Aktor -Pengkritik rangkaian saraf
Bunyi Penerokaan
Rangkaian sasaran
Kemas Kini Sasaran Lembut untuk Rangkaian Sasaran

Mari kita laksanakan langkah demi langkah:

Replay Buffer

DDPG menggunakan Replay Buffer untuk menyimpan proses dan ganjaran (Sₜ, aₜ, Rₜ, Sₜ+₁) yang disampel dengan menerokai persekitaran. Replay Buffer memainkan peranan penting dalam membantu ejen mempercepatkan pembelajaran dan kestabilan DDPG:

Meminimumkan korelasi antara sampel: menyimpan pengalaman lalu dalam Replay Buffer, dengan itu Membolehkan ejen belajar daripada pelbagai pengalaman.
Dayakan pembelajaran dasar luar talian: membenarkan ejen untuk mencuba peralihan daripada penimbal main semula dan bukannya peralihan pensampelan daripada dasar semasa.
Persampelan Cekap: Simpan pengalaman lepas dalam penimbal, membolehkan ejen belajar daripada pengalaman berbeza beberapa kali.

class Replay_buffer():
 '''
Code based on:
https://github.com/openai/baselines/blob/master/baselines/deepq/replay_buffer.py
Expects tuples of (state, next_state, action, reward, done)
'''
 def __init__(self, max_size=capacity):
 """Create Replay buffer.
Parameters
----------
size: int
Max number of transitions to store in the buffer. When the buffer
overflows the old memories are dropped.
"""
 self.storage = []
 self.max_size = max_size
 self.ptr = 0
 
 def push(self, data):
 if len(self.storage) == self.max_size:
 self.storage[int(self.ptr)] = data
 self.ptr = (self.ptr + 1) % self.max_size
 else:
 self.storage.append(data)
 
 def sample(self, batch_size):
 """Sample a batch of experiences.
Parameters
----------
batch_size: int
How many transitions to sample.
Returns
-------
state: np.array
batch of state or observations
action: np.array
batch of actions executed given a state
reward: np.array
rewards received as results of executing action
next_state: np.array
next state next state or observations seen after executing action
done: np.array
done[i] = 1 if executing ation[i] resulted in
the end of an episode and 0 otherwise.
"""
 ind = np.random.randint(0, len(self.storage), size=batch_size)
 state, next_state, action, reward, done = [], [], [], [], []
 
 for i in ind:
 st, n_st, act, rew, dn = self.storage[i]
 state.append(np.array(st, copy=False))
 next_state.append(np.array(n_st, copy=False))
 action.append(np.array(act, copy=False))
 reward.append(np.array(rew, copy=False))
 done.append(np.array(dn, copy=False))
 
 return np.array(state), np.array(next_state), np.array(action), np.array(reward).reshape(-1, 1), np.array(done).reshape(-1, 1)

Salin selepas log masuk

Rangkaian Neural Actor-Critic

Ini ialah pelaksanaan PyTorch bagi algoritma pembelajaran pengukuhan Actor-Critic. Kod ini mentakrifkan dua model rangkaian saraf, Pelakon dan Pengkritik.

Input model Aktor: keadaan persekitaran;

Input model Kritik: keadaan persekitaran dan tindakan; output model Kritik: nilai Q, iaitu jumlah ganjaran yang dijangkakan bagi pasangan keadaan-tindakan semasa.

class Actor(nn.Module):
 """
The Actor model takes in a state observation as input and
outputs an action, which is a continuous value.
 
It consists of four fully connected linear layers with ReLU activation functions and
a final output layer selects one single optimized action for the state
"""
 def __init__(self, n_states, action_dim, hidden1):
 super(Actor, self).__init__()
 self.net = nn.Sequential(
 nn.Linear(n_states, hidden1),
 nn.ReLU(),
 nn.Linear(hidden1, hidden1),
 nn.ReLU(),
 nn.Linear(hidden1, hidden1),
 nn.ReLU(),
 nn.Linear(hidden1, 1)
)
 
 def forward(self, state):
 return self.net(state)
 
 class Critic(nn.Module):
 """
The Critic model takes in both a state observation and an action as input and
outputs a Q-value, which estimates the expected total reward for the current state-action pair.
 
It consists of four linear layers with ReLU activation functions,
State and action inputs are concatenated before being fed into the first linear layer.
 
The output layer has a single output, representing the Q-value
"""
 def __init__(self, n_states, action_dim, hidden2):
 super(Critic, self).__init__()
 self.net = nn.Sequential(
 nn.Linear(n_states + action_dim, hidden2),
 nn.ReLU(),
 nn.Linear(hidden2, hidden2),
 nn.ReLU(),
 nn.Linear(hidden2, hidden2),
 nn.ReLU(),
 nn.Linear(hidden2, action_dim)
)
 
 def forward(self, state, action):
 return self.net(torch.cat((state, action), 1))

Salin selepas log masuk

Bunyi Penerokaan

Menambah hingar pada tindakan yang dipilih oleh Pelakon ialah teknik yang digunakan dalam DDPG untuk menggalakkan penerokaan dan menambah baik proses pembelajaran.

Bunyi Gaussian atau bunyi Ornstein-Uhlenbeck boleh digunakan. Kebisingan Gaussian adalah ringkas dan mudah untuk dilaksanakan, dan hingar Ornstein-Uhlenbeck menghasilkan hingar berkorelasi masa yang boleh membantu ejen meneroka ruang tindakan dengan lebih cekap. Tetapi turun naik hingar Ornstein-Uhlenbeck adalah lebih lancar dan kurang rawak daripada kaedah hingar Gaussian.

import numpy as np
 import random
 import copy
 
 class OU_Noise(object):
 """Ornstein-Uhlenbeck process.
code from :
https://math.stackexchange.com/questions/1287634/implementing-ornstein-uhlenbeck-in-matlab
The OU_Noise class has four attributes
 
size: the size of the noise vector to be generated
mu: the mean of the noise, set to 0 by default
theta: the rate of mean reversion, controlling how quickly the noise returns to the mean
sigma: the volatility of the noise, controlling the magnitude of fluctuations
"""
 def __init__(self, size, seed, mu=0., theta=0.15, sigma=0.2):
 self.mu = mu * np.ones(size)
 self.theta = theta
 self.sigma = sigma
 self.seed = random.seed(seed)
 self.reset()
 
 def reset(self):
 """Reset the internal state (= noise) to mean (mu)."""
 self.state = copy.copy(self.mu)
 
 def sample(self):
 """Update internal state and return it as a noise sample.
This method uses the current state of the noise and generates the next sample
"""
 dx = self.theta * (self.mu - self.state) + self.sigma * np.array([np.random.normal() for _ in range(len(self.state))])
 self.state += dx
 return self.state

Salin selepas log masuk

Untuk menggunakan hingar Gaussian dalam DDPG, anda boleh menambah hingar Gaussian terus pada proses pemilihan tindakan ejen.

DDPG

DDPG (Deep Deterministic Policy Gradient) menggunakan dua set rangkaian neural Actor-Critic untuk penghampiran fungsi. Dalam DDPG, rangkaian sasaran ialah Actor-Critic, yang mempunyai struktur dan parameterisasi yang sama seperti rangkaian Actor-Critic.

Sepanjang tempoh latihan, ejen menggunakan rangkaian Actor-Criticnya untuk berinteraksi dengan persekitaran dan menyimpan tuple pengalaman (Sₜ, Aₜ, Rₜ, Sₜ+₁) dalam Penampan Replay. Ejen kemudian mengambil sampel daripada Penampan Replay dan mengemas kini rangkaian Actor-Critic dengan data. Daripada mengemas kini berat rangkaian sasaran dengan menyalin terus daripada rangkaian Actor-Critic, algoritma DDPG mengemas kini pemberat rangkaian sasaran secara perlahan melalui proses yang dipanggil pengemaskinian sasaran lembut.

Pelaksanaan kod PyTorch dan penjelasan langkah demi langkah pembelajaran pengukuhan DDPG

Sasaran lembut dikemas kini sebagai sebahagian kecil daripada pemberat yang dipindahkan daripada rangkaian Actor-Critic ke rangkaian sasaran yang dipanggil kadar kemas kini sasaran (τ) .

Formula kemas kini sasaran lembut adalah seperti berikut:

Pelaksanaan kod PyTorch dan penjelasan langkah demi langkah pembelajaran pengukuhan DDPG

Dengan menggunakan teknologi sasaran lembut, kestabilan pembelajaran boleh menjadi sangat baik. bertambah baik.

#Set Hyperparameters
 # Hyperparameters adapted for performance from
 capacity=1000000
 batch_size=64
 update_iteration=200
 tau=0.001 # tau for soft updating
 gamma=0.99 # discount factor
 directory = './'
 hidden1=20 # hidden layer for actor
 hidden2=64. #hiiden laye for critic
 
 class DDPG(object):
 def __init__(self, state_dim, action_dim):
 """
Initializes the DDPG agent.
Takes three arguments:
state_dim which is the dimensionality of the state space,
action_dim which is the dimensionality of the action space, and
max_action which is the maximum value an action can take.
 
Creates a replay buffer, an actor-critic networks and their corresponding target networks.
It also initializes the optimizer for both actor and critic networks alog with
counters to track the number of training iterations.
"""
 self.replay_buffer = Replay_buffer()
 
 self.actor = Actor(state_dim, action_dim, hidden1).to(device)
 self.actor_target = Actor(state_dim, action_dim,hidden1).to(device)
 self.actor_target.load_state_dict(self.actor.state_dict())
 self.actor_optimizer = optim.Adam(self.actor.parameters(), lr=3e-3)
 
 self.critic = Critic(state_dim, action_dim,hidden2).to(device)
 self.critic_target = Critic(state_dim, action_dim,hidden2).to(device)
 self.critic_target.load_state_dict(self.critic.state_dict())
 self.critic_optimizer = optim.Adam(self.critic.parameters(), lr=2e-2)
 # learning rate
 
 
 
 self.num_critic_update_iteration = 0
 self.num_actor_update_iteration = 0
 self.num_training = 0
 
 def select_action(self, state):
 """
takes the current state as input and returns an action to take in that state.
It uses the actor network to map the state to an action.
"""
 state = torch.FloatTensor(state.reshape(1, -1)).to(device)
 return self.actor(state).cpu().data.numpy().flatten()
 
 
 def update(self):
 """
updates the actor and critic networks using a batch of samples from the replay buffer.
For each sample in the batch, it computes the target Q value using the target critic network and the target actor network.
It then computes the current Q value
using the critic network and the action taken by the actor network.
 
It computes the critic loss as the mean squared error between the target Q value and the current Q value, and
updates the critic network using gradient descent.
 
It then computes the actor loss as the negative mean Q value using the critic network and the actor network, and
updates the actor network using gradient ascent.
 
Finally, it updates the target networks using
soft updates, where a small fraction of the actor and critic network weights are transferred to their target counterparts.
This process is repeated for a fixed number of iterations.
"""
 
 for it in range(update_iteration):
 # For each Sample in replay buffer batch
 state, next_state, action, reward, done = self.replay_buffer.sample(batch_size)
 state = torch.FloatTensor(state).to(device)
 action = torch.FloatTensor(action).to(device)
 next_state = torch.FloatTensor(next_state).to(device)
 done = torch.FloatTensor(1-done).to(device)
 reward = torch.FloatTensor(reward).to(device)
 
 # Compute the target Q value
 target_Q = self.critic_target(next_state, self.actor_target(next_state))
 target_Q = reward + (done * gamma * target_Q).detach()
 
 # Get current Q estimate
 current_Q = self.critic(state, action)
 
 # Compute critic loss
 critic_loss = F.mse_loss(current_Q, target_Q)
 
 # Optimize the critic
 self.critic_optimizer.zero_grad()
 critic_loss.backward()
 self.critic_optimizer.step()
 
 # Compute actor loss as the negative mean Q value using the critic network and the actor network
 actor_loss = -self.critic(state, self.actor(state)).mean()
 
 # Optimize the actor
 self.actor_optimizer.zero_grad()
 actor_loss.backward()
 self.actor_optimizer.step()
 
 
 """
Update the frozen target models using
soft updates, where
tau,a small fraction of the actor and critic network weights are transferred to their target counterparts.
"""
 for param, target_param in zip(self.critic.parameters(), self.critic_target.parameters()):
 target_param.data.copy_(tau * param.data + (1 - tau) * target_param.data)
 
 for param, target_param in zip(self.actor.parameters(), self.actor_target.parameters()):
 target_param.data.copy_(tau * param.data + (1 - tau) * target_param.data)
 
 
 self.num_actor_update_iteration += 1
 self.num_critic_update_iteration += 1
 def save(self):
 """
Saves the state dictionaries of the actor and critic networks to files
"""
 torch.save(self.actor.state_dict(), directory + 'actor.pth')
 torch.save(self.critic.state_dict(), directory + 'critic.pth')
 
 def load(self):
 """
Loads the state dictionaries of the actor and critic networks to files
"""
 self.actor.load_state_dict(torch.load(directory + 'actor.pth'))
 self.critic.load_state_dict(torch.load(directory + 'critic.pth'))

Salin selepas log masuk

Melatih DDPG

Di sini kami menggunakan "MountainCarContinuous-v0" OpenAI Gym untuk melatih model DDPG RL kami Persekitaran di sini menyediakan ruang tindakan dan pemerhatian berterusan, dan matlamatnya adalah untuk melatih ia secepat mungkin Biarkan kereta sampai ke puncak gunung.

Pelaksanaan kod PyTorch dan penjelasan langkah demi langkah pembelajaran pengukuhan DDPG

Pelbagai parameter algoritma ditakrifkan di bawah, seperti bilangan maksimum masa latihan, hingar penerokaan dan selang rakaman, dsb. Menggunakan benih rawak tetap membolehkan proses itu diundur.

import gym
 
 # create the environment
 env_name='MountainCarContinuous-v0'
 env = gym.make(env_name)
 device = 'cuda' if torch.cuda.is_available() else 'cpu'
 
 # Define different parameters for training the agent
 max_episode=100
 max_time_steps=5000
 ep_r = 0
 total_step = 0
 score_hist=[]
 # for rensering the environmnet
 render=True
 render_interval=10
 # for reproducibility
 env.seed(0)
 torch.manual_seed(0)
 np.random.seed(0)
 #Environment action ans states
 state_dim = env.observation_space.shape[0]
 action_dim = env.action_space.shape[0]
 max_action = float(env.action_space.high[0])
 min_Val = torch.tensor(1e-7).float().to(device)
 
 # Exploration Noise
 exploration_noise=0.1
 exploration_noise=0.1 * max_action

Salin selepas log masuk

Mencipta contoh kelas ejen DDPG untuk melatih ejen untuk beberapa kali tertentu. Kaedah kemas kini() ejen dipanggil pada penghujung setiap pusingan untuk mengemas kini parameter, dan kaedah save() digunakan selepas setiap sepuluh pusingan untuk menyimpan parameter ejen ke fail.

# Create a DDPG instance
 agent = DDPG(state_dim, action_dim)
 
 # Train the agent for max_episodes
 for i in range(max_episode):
 total_reward = 0
 step =0
 state = env.reset()
 fort in range(max_time_steps):
 action = agent.select_action(state)
 # Add Gaussian noise to actions for exploration
 action = (action + np.random.normal(0, 1, size=action_dim)).clip(-max_action, max_action)
 #action += ou_noise.sample()
 next_state, reward, done, info = env.step(action)
 total_reward += reward
 if render and i >= render_interval : env.render()
 agent.replay_buffer.push((state, next_state, action, reward, np.float(done)))
 state = next_state
 if done:
 break
 step += 1
 
 score_hist.append(total_reward)
 total_step += step+1
 print("Episode: t{} Total Reward: t{:0.2f}".format( i, total_reward))
 agent.update()
 if i % 10 == 0:
 agent.save()
 env.close()

Salin selepas log masuk

Menguji DDPG

test_iteration=100
 
 for i in range(test_iteration):
 state = env.reset()
 for t in count():
 action = agent.select_action(state)
 next_state, reward, done, info = env.step(np.float32(action))
 ep_r += reward
 print(reward)
 env.render()
 if done:
 print("reward{}".format(reward))
 print("Episode t{}, the episode reward is t{:0.2f}".format(i, ep_r))
 ep_r = 0
 env.render()
 break
 state = next_state

Salin selepas log masuk

Kami menggunakan parameter berikut untuk menjadikan model menumpu:

Sampel hingar daripada taburan normal piawai dan bukannya pensampelan rawak.
Tukar pemalar poliak (tau) daripada 0.99 kepada 0.001
Ubah suai saiz lapisan tersembunyi rangkaian Kritik kepada [64,64]. Pengaktifan ReLU dialih keluar selepas lapisan kedua rangkaian Kritik. Tukar kepada (Linear, ReLU, Linear, Linear).
Tukar saiz penimbal maksimum kepada 1000000
Tukar saiz batch_size daripada 128 kepada 64

Kesan selepas latihan selama 75 pusingan adalah seperti berikut:

Pelaksanaan kod PyTorch dan penjelasan langkah demi langkah pembelajaran pengukuhan DDPG

Ringkasan

Algoritma DDPG ialah algoritma Actor-Critic luar dasar tanpa model yang diilhamkan oleh Rangkaian Q (DQN) yang mendalam ) algoritma. Ia menggabungkan kelebihan kaedah kecerunan dasar dan Q-pembelajaran untuk mempelajari dasar deterministik dalam ruang tindakan berterusan.

Sama seperti DQN, ia menggunakan penimbal main semula untuk menyimpan pengalaman lalu dan rangkaian sasaran untuk melatih rangkaian, dengan itu meningkatkan kestabilan proses latihan.

Algoritma DDPG memerlukan penalaan hiperparameter yang teliti untuk prestasi optimum. Hiperparameter termasuk kadar pembelajaran, saiz kelompok, kadar kemas kini rangkaian sasaran dan parameter hingar pengesanan. Perubahan kecil dalam hiperparameter boleh memberi kesan yang ketara ke atas prestasi algoritma.

Atas ialah kandungan terperinci Pelaksanaan kod PyTorch dan penjelasan langkah demi langkah pembelajaran pengukuhan DDPG. Untuk maklumat lanjut, sila ikut artikel berkaitan lain di laman web China PHP!