Let's play Blackjack with Monte Carlo

Now let's understand Monte Carlo better with the Blackjack game. Blackjack, also called 21, is a popular card game played in casinos. The goal of the game is to get the sum of your cards as close to 21 as possible without exceeding 21. The value of the cards J, K, and Q is 10. The value of an ace can be 1 or 11, depending on the player's choice. The value of the remaining cards (2 to 10) is the same as the numbers they show.

The rules of the game are very simple:

  • The game can be played with one or many players and one dealer.
  • Each player competes only with the dealer and not another player.
  • Initially, a player is given two cards. Both of these cards are face up, that is, visible to others.
  • A dealer is also given two cards. One card is face up and the other is face down. That is, the dealer only shows one of his cards.
  • If the sum of a player's cards is 21 immediately after receiving two cards (say a player has received a jack and an ace, which is 10 + 11 = 21), then it is called a natural or Blackjack, and the player wins.
  • If the dealer's sum of cards is also 21 immediately after receiving two cards, then it is called a draw as both of them have 21.
  • In each round, the player decides whether they need another card to bring the sum of their cards closer to 21.
  • If a player needs a card, then it is called a hit.
  • If a player doesn't need a card, then it is called a stand.
  • If a player's sum of cards exceeds 21, then it is called a bust and the dealer wins the game.

Let's understand Blackjack better by playing. I'll let you be the player and I'll be the dealer:

In the preceding diagram, we have one player and a dealer. Both of them are given two cards: both of the player's cards are face up (visible), while the dealer has one card face up (visible) and the other face down (invisible). In the first round, you are given two cards, say a jack and a 7, which is 10 + 7 = 17, and I, as the dealer, will show you only one card, which is a 2; my other card is face down. Now you have to decide whether to hit (take another card) or stand (take no more cards). If you choose to hit and receive a 3, you get 10 + 7 + 3 = 20, which is close to 21, and you win:

But if you receive, say, a 7, then 10 + 7 + 7 = 24, which exceeds 21; that is a bust and you lose the game. If you decide to stand with your initial cards, you have only 10 + 7 = 17. We then check the dealer's sum of cards: if it is greater than 17 and does not exceed 21, the dealer wins; otherwise, you win:

The rewards here are:

  • +1 if the player wins the game
  • -1 if the player loses the game
  • 0 if the game is a draw

The possible actions are:

  • Hit: If the player needs a card
  • Stand: If the player doesn't need a card

The player has to decide the value of an ace. If the player's sum of cards is 10 and the player gets an ace after a hit, they can count it as 11, and 10 + 11 = 21. But if the player's sum of cards is 15 and the player gets an ace after a hit, counting it as 11 gives 15 + 11 = 26, which is a bust. An ace the player can count as 11 without going bust is called a usable ace; if counting the ace as 11 would cause a bust, it is called a nonusable ace and must count as 1.
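To make the usable ace rule concrete, here is a minimal sketch (not part of the environment's code; the helper name hand_value is just for illustration) of how a hand's value could be computed:

def hand_value(cards):
    # cards hold face values: J, Q, K already counted as 10, an ace given as 1
    total = sum(cards)
    # an ace is usable if counting it as 11 (adding 10) does not bust the hand
    if 1 in cards and total + 10 <= 21:
        return total + 10, True    # usable ace
    return total, False            # no ace, or nonusable ace

print(hand_value([10, 1]))       # (21, True): jack + ace is a natural
print(hand_value([10, 5, 1]))    # (16, False): counting the ace as 11 would bust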

Now we will see how to implement Blackjack using the first visit Monte Carlo algorithm. 

First, we will import our necessary libraries:

import gym
from matplotlib import pyplot
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from collections import defaultdict
from functools import partial
%matplotlib inline
plt.style.use('ggplot')

Now we will create the Blackjack environment using OpenAI's Gym:

env = gym.make('Blackjack-v0')
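As a quick check (this assumes the classic Blackjack-v0 environment from OpenAI Gym), we can inspect the action and observation spaces and reset the environment; an observation is a tuple of the player's current sum, the dealer's showing card, and a usable ace flag:

print(env.action_space)        # Discrete(2): 0 = stand, 1 = hit
print(env.observation_space)   # Tuple(Discrete(32), Discrete(11), Discrete(2))
print(env.reset())             # e.g. (14, 10, False)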

Then we define the policy function, which takes the current observation and checks whether the score is greater than or equal to 20; if it is, we return 0, otherwise we return 1. That is, if the score is greater than or equal to 20, we stand (0), or else we hit (1):

def sample_policy(observation):
    score, dealer_score, usable_ace = observation
    return 0 if score >= 20 else 1
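For example, with an illustrative observation of (21, 10, True) the policy stands, and with (13, 2, False) it hits:

print(sample_policy((21, 10, True)))   # 0 -> stand
print(sample_policy((13, 2, False)))   # 1 -> hit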

Now we will see how to generate an episode. An episode is a single round of a game. We will see it step by step and then look at the complete function.

We define states, actions, and rewards as lists, and initialize the environment using env.reset, storing the result in an observation variable:

states, actions, rewards = [], [], []
observation = env.reset()

Then, until we reach the terminal state, that is, until the end of the episode, we do the following:

  1. Append the observation to the states list:
states.append(observation)
  2. Now, we choose an action using our sample_policy function and append the action to the actions list:
action = sample_policy(observation)
actions.append(action)
  3. Then we take a step in the environment using env.step, which returns the next observation, the reward, done (which specifies whether we have reached the terminal state), and some extra info; we append the reward to the rewards list:
observation, reward, done, info = env.step(action)
rewards.append(reward)
  4. If we have reached the terminal state, then we break:
if done:
    break
  5. The complete generate_episode function is as follows:
def generate_episode(policy, env):
    states, actions, rewards = [], [], []
    observation = env.reset()
    while True:
        states.append(observation)
        action = policy(observation)
        actions.append(action)
        observation, reward, done, info = env.step(action)
        rewards.append(reward)
        if done:
            break

    return states, actions, rewards
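As a quick sanity check, we can generate one episode with our sample policy and print what it returns; the exact values will vary from run to run because the cards are dealt randomly:

states, actions, rewards = generate_episode(sample_policy, env)
print(states)    # e.g. [(13, 10, False), (20, 10, False)]
print(actions)   # e.g. [1, 0]
print(rewards)   # e.g. [0.0, 1.0]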

This is how we generate an episode. How can we play the game? For that, we need to know the value of each state. Now we will see how to get the value of each state using the first visit Monte Carlo method.

First, we initialize the empty value table as a dictionary for storing the values of each state:

value_table = defaultdict(float)
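Because we use defaultdict(float), any state we have not visited yet simply returns a default value of 0.0, so we never have to check whether a key exists before reading or updating it:

print(value_table[(21, 3, True)])   # 0.0 -> unseen states default to zero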

Then, for a certain number of episodes, we do the following:

  1. First, we generate an episode and store the states and rewards; we initialize returns as 0, which will hold the sum of rewards:
states, _, rewards = generate_episode(policy, env)
returns = 0
  2. Then, for each step (iterating backwards from the last step), we store the reward in a variable R and the state in S, and we calculate returns as the sum of rewards:
for t in range(len(states) - 1, -1, -1):
    R = rewards[t]
    S = states[t]
    returns += R
  3. Now we perform the first visit Monte Carlo update: we check whether the state S is being visited for the first time in the episode, that is, whether it does not appear earlier in the episode. If so, we increment its visit count and update its value with the incremental average value_table[S] += (returns - value_table[S]) / N[S], which is equivalent to taking the average of the returns observed for that state:
    if S not in states[:t]:
        N[S] += 1
        value_table[S] += (returns - value_table[S]) / N[S]
  4. Look at the complete function for better understanding:
def first_visit_mc_prediction(policy, env, n_episodes):
    value_table = defaultdict(float)
    N = defaultdict(int)

    for _ in range(n_episodes):
        states, _, rewards = generate_episode(policy, env)
        returns = 0
        for t in range(len(states) - 1, -1, -1):
            R = rewards[t]
            S = states[t]
            returns += R
            if S not in states[:t]:
                N[S] += 1
                value_table[S] += (returns - value_table[S]) / N[S]
    return value_table
  5. We can get the value of each state:
value = first_visit_mc_prediction(sample_policy, env, n_episodes=500000)
  6. Let's see the value of a few states:
print(value)
defaultdict(float, {(4, 1, False): -1.024292170184644, (4, 2, False): -1.8670191351012455, (4, 3, False): 2.211363314854649, (4, 4, False): 16.903201033000823, (4, 5, False): -5.786238030898542, (4, 6, False): -16.218211752577602,

We can also plot the state values to see how they have converged.
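For a quick look (this is only a sketch; the book's full 3D wireframe plot is produced by the plot_blackjack function in the complete code below), we could plot the estimated values against the player's sum for one slice of the state space, say hands without a usable ace when the dealer shows a 10:

player_sums = range(12, 22)
vals = [value[(s, 10, False)] for s in player_sums]   # dealer showing 10, no usable ace
plt.plot(player_sums, vals, marker='o')
plt.xlabel('player sum')
plt.ylabel('estimated state value')
plt.show()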

The complete code is given as follows:

import numpy
import gym
from matplotlib import pyplot
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from collections import defaultdict
from functools import partial
%matplotlib inline

plt.style.use('ggplot')

## Blackjack Environment

env = gym.make('Blackjack-v0')

env.action_space, env.observation_space

def sample_policy(observation):
    score, dealer_score, usable_ace = observation
    return 0 if score >= 20 else 1

def generate_episode(policy, env):
    states, actions, rewards = [], [], []
    observation = env.reset()
    while True:
        states.append(observation)
        action = policy(observation)
        actions.append(action)
        observation, reward, done, info = env.step(action)
        rewards.append(reward)
        if done:
            break

    return states, actions, rewards


def first_visit_mc_prediction(policy, env, n_episodes):
    value_table = defaultdict(float)
    N = defaultdict(int)

    for _ in range(n_episodes):
        states, _, rewards = generate_episode(policy, env)
        returns = 0
        for t in range(len(states) - 1, -1, -1):
            R = rewards[t]
            S = states[t]
            returns += R
            if S not in states[:t]:
                N[S] += 1
                value_table[S] += (returns - value_table[S]) / N[S]
    return value_table

# estimate the state values under the sample policy
value = first_visit_mc_prediction(sample_policy, env, n_episodes=500000)

def plot_blackjack(V, ax1, ax2):
    player_sum = numpy.arange(12, 21 + 1)
    dealer_show = numpy.arange(1, 10 + 1)
    usable_ace = numpy.array([False, True])

    state_values = numpy.zeros((len(player_sum),
                                len(dealer_show),
                                len(usable_ace)))

    for i, player in enumerate(player_sum):
        for j, dealer in enumerate(dealer_show):
            for k, ace in enumerate(usable_ace):
                state_values[i, j, k] = V[player, dealer, ace]

    X, Y = numpy.meshgrid(player_sum, dealer_show)

    ax1.plot_wireframe(X, Y, state_values[:, :, 0])
    ax2.plot_wireframe(X, Y, state_values[:, :, 1])
    for ax in ax1, ax2:
        ax.set_zlim(-1, 1)
        ax.set_ylabel('player sum')
        ax.set_xlabel('dealer showing')
        ax.set_zlabel('state-value')

fig, axes = pyplot.subplots(nrows=2, figsize=(5, 8), subplot_kw={'projection': '3d'})
axes[0].set_title('value function without usable ace')
axes[1].set_title('value function with usable ace')
plot_blackjack(value, axes[0], axes[1])