It's great that you can implement a deep Q-learning model to build a self-driving car. Really, once again, huge congratulations to you for that. But I also want you to be able to use deep Q-learning to solve a real-world business problem. With this next application, you'll be more than ready to add value to your work or business by leveraging AI. Even though we'll once again use a specific application, this chapter will provide you with a general AI framework, a blueprint containing the general steps of the process you have to follow when solving a real-world problem with deep Q-learning. This chapter is very important for you and your career; I don't want you to close this book before you feel confident with the skills you'll learn here. Let's smash this next application together!
When I said we were going to solve a real-world business problem, I wasn't overstating the case; the problem we're about to tackle with deep Q-learning is very similar to one that was solved in the real world the same way.
In 2016, DeepMind minimized a big part of Google's yearly costs by reducing Google's data center cooling bill by 40% using their DQN AI model (deep Q-learning). Check the link here:
https://deepmind.com/blog/deepmind-ai-reduces-google-data-centre-cooling-bill-40
In this case study, we'll do something very similar. We'll set up our own server environment, and we'll build an AI that controls the cooling and heating of the server so that it stays in an optimal range of temperatures while using the minimum of energy, therefore minimizing the costs.
Just as the DeepMind AI did, our goal will be to achieve at least 40% energy savings! Are you ready for this? Let's bring it on!
As ever, my first question to you is: What's our first step?
I'm sure by this point I don't need to spell out the answer. Let's get straight to building our environment!
Before we define the states, actions, and rewards, we need to set up the server and explain how it operates. We'll do that in several steps: we'll list the environment's fixed parameters and its fluctuating variables, state the simplifying assumptions we'll rely on, and then describe how the whole simulation runs, minute by minute.
Here is a list of all the parameters of the server environment, which keep their values fixed:

- the average atmospheric temperature over each month of the year
- the optimal range of server temperatures, which will be [18°C, 24°C]
- the minimum temperature of the server below which it fails to operate, which will be −20°C
- the maximum temperature of the server above which it fails to operate, which will be 80°C
- the minimum number of users on the server, which will be 10
- the maximum number of users on the server, which will be 100
- the maximum number of users that can log on to or off the server per minute, which will be 5
- the minimum rate of data transmission on the server, which will be 20
- the maximum rate of data transmission on the server, which will be 300
- the maximum change in the rate of data transmission per minute, which will be 10
Next, we'll list all the variables of the server environment, which have values that fluctuate over time:

- the temperature of the server at any minute
- the number of users on the server at any minute
- the rate of data transmission on the server at any minute
- the energy spent by the AI on the server at any minute (to cool it down or heat it up)
- the energy that the server's integrated cooling system would spend at any minute if the AI were deactivated
All these parameters and variables will be part of the environment, and will influence the actions of our AI.
Next, we'll explain the two core assumptions of the environment. It's important to understand that these assumptions are not AI related, but just used to simplify the environment so that we can focus on creating a functional AI solution.
We'll rely on the following two essential assumptions:
The temperature of the server can be approximated through Multiple Linear Regression, that is, by a linear function of the atmospheric temperature, the number of users and the rate of data transmission, like so:
server temperature = b0 + b1 × atmospheric temperature + b2 × number of users + b3 × rate of data transmission

where b0 is a constant, and b1 > 0, b2 > 0, and b3 > 0.
The raison d'être of this assumption, and the reason why b1 > 0, b2 > 0, and b3 > 0, are intuitive to understand. It makes sense that when the atmospheric temperature increases, the temperature of the server increases. The more users that are connected to the server, the more energy the server has to spend handling them, and therefore the higher the temperature of the server will be. Finally, the more data is transmitted inside the server, the more energy the server has to spend processing it, and therefore the higher the temperature of the server will be.
For simplicity's sake, we can just suppose that these correlations are linear. However, you could absolutely run the same simulation by assuming they were quadratic or logarithmic, and altering the code to reflect those equations. This is just my simulation of a virtual server environment; feel free to tweak it as you like!
Let's assume further that after performing this Multiple Linear Regression, we obtained the following values of the coefficients: b0 = 0, b1 = 1, b2 = 1.25, and b3 = 1.25. Accordingly:

server temperature = atmospheric temperature + 1.25 × number of users + 1.25 × rate of data transmission
Now, if we were facing this problem in real life, we could get the dataset of temperatures for our server and calculate these values directly. Here, we're just assuming values that are easy to code and understand, because our goal in this chapter is not to perfectly model a real server; it's to go through the steps of solving a real-world problem with AI.
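As a quick sketch (not the project's code), Assumption 1 with these assumed coefficients is a one-liner:

```python
def server_temperature(atmospheric_temperature, number_users, rate_data):
    # Assumption 1 with the assumed coefficients b0 = 0, b1 = 1, b2 = 1.25, b3 = 1.25
    return atmospheric_temperature + 1.25 * number_users + 1.25 * rate_data
```

For instance, with an atmospheric temperature of 1°C, 10 users, and a data rate of 60, this gives 88.5°C, which is exactly the initial intrinsic temperature you'll see computed in the environment code later.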
The energy spent by any cooling system, either our AI or the server's integrated cooling system that we'll compare our AI to, that changes the server's temperature from T_t to T_{t+1} within 1 unit of time (in our case, 1 minute), can be approximated again through regression by a linear function of the server's absolute temperature change, as so:

E_t = α|ΔT_t| + β

where:

- E_t is the energy spent by the system on the server between times t and t+1,
- ΔT_t = T_{t+1} − T_t is the change of the server's temperature caused by the system between times t and t+1,
- α > 0,
- β ≥ 0.
Let's explain why it intuitively makes sense to make this assumption with α > 0. That's simply because the more the AI or the old-fashioned integrated cooling system heats up or cools down the server, the more energy it spends to achieve that heat transfer.
For example, imagine the server suddenly has overheating issues and just reached 50°C; then within one unit of time (1 minute), either system will need much more energy to bring the server's temperature back to its optimal temperature of 24°C than to bring it back to, say, 40°C.
For simplicity's sake, in this example we suppose that these correlations are linear, instead of calculating true values from a real dataset. In case you're wondering why we take the absolute value, that's simply because when the AI cools down the server, T_{t+1} < T_t, so ΔT_t < 0. Since an energy cost is always positive, we have to take the absolute value of ΔT_t.
Keeping our desired simplicity in mind, we'll assume that the results of the regression are α = 1 and β = 0, so that we get the following final equation based on Assumption 2:

E_t = |ΔT_t| = |T_{t+1} − T_t|

thus:

E_t = T_{t+1} − T_t if T_{t+1} > T_t, that is, if the server is heated up,
E_t = T_t − T_{t+1} if T_{t+1} < T_t, that is, if the server is cooled down.
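With α = 1 and β = 0, Assumption 2 also collapses to a one-liner; a minimal sketch:

```python
def energy_spent(temperature_before, temperature_after):
    # Assumption 2 with alpha = 1 and beta = 0: energy = |delta T|
    return abs(temperature_after - temperature_before)
```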
Now we've got our assumptions covered, let's explain how we'll simulate the operation of the server, with users logging on and off and data coming in and out.
The number of users and the rate of data transmission will randomly fluctuate, to simulate the unpredictable user activity and data requirements of an actual server. This leads to randomness in the temperature. The AI needs to learn how much cooling or heating power it should transfer to the server so as to not deteriorate the server performance, and at the same time, expend as little energy as possible by optimizing its heat transfer.
Now that we have the full picture, I'll explain the overall functioning of the server and the AI inside this environment.
Inside a data center, we're dealing with a specific server that is controlled by the parameters and variables listed previously. Every minute, some new users log on to the server and some current users log off, therefore updating the number of active users in the server. Also, every minute some new data is transmitted into the server, and some existing data is transmitted outside the server, therefore updating the rate of data transmission happening inside the server.
Hence, based on Assumption 1 given earlier, the temperature of the server is updated every minute. Now please focus, because this is where you'll understand the huge role the AI has to play on the server.
Two possible systems can regulate the temperature of the server: the AI, or the server's integrated cooling system. The server's integrated cooling system is an unintelligent system that automatically brings the server's temperature back inside its optimal temperature range.
Every minute, the server's temperature is updated. If the server is using the integrated cooling system, that system watches to see what happens; that update can either leave the temperature within the range of optimal temperatures (18°C to 24°C), or move it outside this range. If it goes outside the optimal range, for example to 30°C, the server's integrated cooling system automatically brings the temperature back to the closest bound of the optimal range, in this case 24°C. For the purposes of our simulation, we're assuming that no matter how big the change in temperature is, the integrated cooling system can bring it back into the optimal range in under a minute. This is, obviously, an unrealistic assumption, but the purpose of this chapter is for you to build a functioning AI capable of solving the problem, not to perfectly simulate the thermal dynamics of a real server. Once we've completed our example together, I highly recommend that you tinker with the code and try to make it more realistic; for now, to keep things simple, we'll believe in our magically effective integrated cooling system.
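The integrated cooling system's rule, clamp to the nearest bound and pay the corresponding energy, can be sketched like this (the function name and return convention are mine, for illustration):

```python
def integrated_cooling(temperature, optimal_range=(18.0, 24.0)):
    lower, upper = optimal_range
    if temperature < lower:
        # Heat back up to the lower bound; the energy equals the temperature change
        return lower, lower - temperature
    if temperature > upper:
        # Cool back down to the upper bound
        return upper, temperature - upper
    # Already inside the optimal range: nothing to do, no energy spent
    return temperature, 0.0
```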
If the server is instead using the AI, then the server's integrated cooling system is deactivated and it is the AI itself that updates the temperature of the server in order to regulate it as well as possible. The AI changes the temperature after making some prior predictions, not in a purely deterministic way as with the unintelligent integrated cooling system. Before there's an update to the number of users and the rate of data transmission, causing a change in the temperature of the server, the AI predicts whether it should cool down the server, do nothing, or heat up the server, and acts accordingly. Then the temperature change happens and the AI reiterates.
Since these two systems are distinct from one another, we can evaluate them separately to compare their performance: we train or run the AI on the server while keeping track of how much energy the integrated cooling system would have used in the same circumstances.
That brings us to the energy. Remember that one primary goal of the AI is to lower the energy cost of running this server. Accordingly, our AI has to try and use less energy than the unintelligent cooling system would use on the server. Since, based on Assumption 2 given previously, the energy spent on the server (by any system) is proportional to the change of temperature within one unit of time:

E_t = |ΔT_t| = |T_{t+1} − T_t|

thus:

E_t = T_{t+1} − T_t if T_{t+1} > T_t, that is, if the server is heated up,
E_t = T_t − T_{t+1} if T_{t+1} < T_t, that is, if the server is cooled down,

then that means that the energy saved by the AI at each iteration t (each minute) is equal to the difference between the absolute temperature changes that the unintelligent integrated cooling system and the AI would cause in the server between t and t+1:

Energy saved by the AI between t and t+1 = |ΔT_t^noAI| − |ΔT_t^AI|

where:

- ΔT_t^noAI is the change of temperature that the server's integrated cooling system would cause in the server between t and t+1, that is, if the AI were deactivated,
- ΔT_t^AI is the change of temperature that the AI causes in the server between t and t+1.
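In code, the per-minute energy saving is just the difference of the two absolute temperature changes; a tiny sketch under Assumption 2:

```python
def energy_saved(delta_t_noai, delta_t_ai):
    # Energy saved by the AI = |delta T without AI| - |delta T with AI|
    return abs(delta_t_noai) - abs(delta_t_ai)
```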
The AI's goal is to save as much energy as it can every minute, therefore saving the maximum total energy over 1 full year of simulation, and eventually saving the business the maximum cost possible on its cooling/heating electricity bill. That's how we do business in the 21st century: with AI!
Now that we fully understand how our server environment works, and how it's simulated, it's time to proceed with what absolutely must be done when defining an AI environment. You know the next steps already: defining the states, the actions, and the rewards.
Remember, when you're doing deep Q-learning, the input state is always a 1D vector. (Unless you're doing deep convolutional Q-learning, in which case the input state is a 2D image, but that's getting ahead of ourselves! Wait for Chapter 12, Deep Convolutional Q-Learning.) So, what will the input state vector be in this server environment? What information will it contain in order to describe each state of the environment well enough? These are the questions you must ask yourself when modeling an AI problem and building the environment. Try to answer these questions on your own and figure out the input state vector in this case; you can find out what we're using in the next paragraph. Hint: have another look at the variables defined previously.
The input state at time t is composed of the following three elements:

- the temperature of the server at time t
- the number of users on the server at time t
- the rate of data transmission on the server at time t
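As a sketch of how this three-element state vector might be assembled (the min-max scaling here anticipates what we'll do in the training script; the bounds are the fixed environment parameters):

```python
import numpy as np

# Fixed environment bounds (the parameters listed earlier)
MIN_TEMPERATURE, MAX_TEMPERATURE = -20.0, 80.0
MIN_USERS, MAX_USERS = 10, 100
MIN_RATE, MAX_RATE = 20, 300

def make_state(temperature, number_users, rate_data):
    # Min-max scale each element into [0, 1] so the network sees comparable ranges
    return np.array([
        (temperature - MIN_TEMPERATURE) / (MAX_TEMPERATURE - MIN_TEMPERATURE),
        (number_users - MIN_USERS) / (MAX_USERS - MIN_USERS),
        (rate_data - MIN_RATE) / (MAX_RATE - MIN_RATE),
    ])
```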
Thus, the input state will be an input vector of these three elements. Our future AI will take this vector as input, and will return an action to perform at each time, t. Speaking of the actions, what are they going to be? Let's find out.
To figure out which actions to perform, we need to remember the goal, which is to optimally regulate the temperature of the server. The actions are simply going to be the temperature changes that the AI can cause inside the server, in order to heat it up or cool it down. In deep Q-learning, the actions must always be discrete; they can't be plucked from a range, we need a defined number of possible actions. Therefore, we'll consider five possible temperature changes, from −3°C to +3°C, so that we end up with five possible actions that the AI can perform to regulate the temperature of the server:

- Action 0: the AI cools down the server by 3°C
- Action 1: the AI cools down the server by 1.5°C
- Action 2: the AI transfers no heat to the server (no temperature change)
- Action 3: the AI heats up the server by 1.5°C
- Action 4: the AI heats up the server by 3°C

Figure 1: Defining the actions
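This mapping can be encoded compactly. One scheme (in the same spirit as what we'll use in the training script, though the helper function here is mine for illustration) uses a "direction boundary" at the middle action and a temperature step of 1.5°C:

```python
def action_to_temperature_change(action, direction_boundary=2, temperature_step=1.5):
    # Actions below the middle index cool the server; actions at or above it heat it
    direction = -1 if action < direction_boundary else 1
    # The magnitude grows with the distance from the middle (do-nothing) action
    energy = abs(action - direction_boundary) * temperature_step
    return direction * energy
```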
Great. Finally, let's see how we're going to reward and punish our AI.
You might have guessed from the earlier Overall functioning section what the reward is going to be. The reward at iteration t is the energy saved by the AI, with respect to how much energy the server's integrated cooling system would have spent; that is, the difference between the energy that the unintelligent cooling system would spend if the AI was deactivated, and the energy that the AI spends on the server:

Reward_t = E_t^noAI − E_t^AI
Since, according to Assumption 2, the energy spent is equal to the change of temperature induced in the server (by any system, including the AI or the unintelligent cooling system):

E_t = |ΔT_t| = |T_{t+1} − T_t|

thus:

E_t = T_{t+1} − T_t if T_{t+1} > T_t, that is, if the server is heated up,
E_t = T_t − T_{t+1} if T_{t+1} < T_t, that is, if the server is cooled down,

we receive a reward at time t that is the difference between the changes of temperature caused in the server by the unintelligent cooling system (that is, when there is no AI) and by the AI:

Reward_t = Energy saved by the AI between t and t+1 = |ΔT_t^noAI| − |ΔT_t^AI|

where:

- ΔT_t^noAI is the change of temperature that the server's integrated cooling system would cause between t and t+1 if the AI were deactivated,
- ΔT_t^AI is the change of temperature that the AI causes in the server between t and t+1.
Important note: It's important to understand that the two systems (our AI and the server's integrated cooling system) will be evaluated separately, in order to compute the rewards. Since at each time point the actions of the two different systems lead to different temperatures, we have to keep track of the two temperatures separately, as T_t^AI and T_t^noAI. In other words, we're performing two separate simulations at the same time, following the same fluctuations of users and data; one for the AI, and one for the server's integrated cooling system.
To complete this section, we'll do a small simulation of 2 iterations (that is, 2 minutes) as an example to make everything crystal clear.
Let's say that we're at time t = 4:00 pm, and that the temperature of the server is 27°C, both with the AI and without it. At this exact time, the AI predicts an action: 0, 1, 2, 3, or 4. Since, right now, the server's temperature is outside the optimal temperature range of [18°C, 24°C], the AI will probably predict actions 0, 1, or 2. Let's say that it predicts 1, which corresponds to cooling the server down by 1.5°C. Therefore, between 4:00 pm and 4:01 pm, the AI makes the server's temperature go from 27°C to 25.5°C:

ΔT^AI = −1.5°C

Thus, based on Assumption 2, the energy spent by the AI on the server is:

E^AI = |ΔT^AI| = 1.5

Now only one piece of information is missing to compute the reward: the energy that the server's integrated cooling system would have spent if the AI was deactivated between 4:00 pm and 4:01 pm. Remember that this unintelligent cooling system automatically brings the server's temperature back to the closest bound of the optimal temperature range, [18°C, 24°C]. Since at 4:00 pm the temperature was 27°C, the closest bound of the optimal temperature range at that time was 24°C. Thus, the server's integrated cooling system would have changed the temperature from 27°C to 24°C, and the server's temperature change that would have occurred if there was no AI is:

ΔT^noAI = −3°C

Based on Assumption 2, the energy that the unintelligent cooling system would have spent if there was no AI is:

E^noAI = |ΔT^noAI| = 3

In conclusion, the reward the AI gets after playing this action at time t = 4:00 pm is:

Reward = E^noAI − E^AI = 3 − 1.5 = 1.5
I'm sure you'll have noticed that as it stands, our AI system doesn't involve itself with the optimal range of temperatures for the server; as I've mentioned before, everything comes from the rewards, and the AI doesn't get any reward for being inside the optimal range or any penalty for being outside it. Once we've built the AI completely, I recommend that you play around with the code and try adding some rewards or penalties that get the AI to stick close to the optimal range; but for now, to keep things simple and get our AI up and running, we'll leave the reward as entirely linked to energy saved.
Then, still between 4:00 pm and 4:01 pm, new things happen: some new users log on to the server, some existing users log off, some new data transmits into the server, and some existing data transmits out. Based on Assumption 1, these factors make the server's temperature change. Let's say that, overall, they increase the server's temperature by 6°C:

ΔT_intrinsic = +6°C

Now, remember that we're evaluating two systems separately: our AI, and the server's integrated cooling system. Therefore, we must compute the two temperatures we would get with each of these two systems separately, one without the other, at 4:01 pm. Let's start with the AI.

The temperature we get at 4:01 pm when the AI is activated is:

T^AI = 25.5 + 6 = 31.5°C

And the temperature we get at 4:01 pm if the AI is not activated is:

T^noAI = 24 + 6 = 30°C

Now we have our two separate temperatures: T^AI = 31.5°C when the AI is activated, and T^noAI = 30°C when the AI is not activated.
Let's simulate what happens between 4:01 pm and 4:02 pm. Again, our AI will make a prediction, and since the server is heating up, let's say it predicts action 0, which corresponds to cooling down the server by 3°C, bringing it down from 31.5°C to 28.5°C. Therefore, the energy spent by the AI between 4:01 pm and 4:02 pm is:

E^AI = |−3| = 3

Now, regarding the server's integrated cooling system (that is, when there is no AI): since at 4:01 pm we had T^noAI = 30°C, the closest bound of the optimal range of temperatures is still 24°C, and so the energy that the server's unintelligent cooling system would spend between 4:01 pm and 4:02 pm is:

E^noAI = 30 − 24 = 6

Hence, the reward obtained between 4:01 pm and 4:02 pm, which is only and entirely based on the amount of energy saved, is:

Reward = E^noAI − E^AI = 6 − 3 = 3

Finally, the total reward obtained between 4:00 pm and 4:02 pm is:

Total Reward = 1.5 + 3 = 4.5
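To double-check the arithmetic, here's the same two-iteration example in plain Python, with assumed numbers: both simulations start at 27°C, the AI cools by 1.5°C and then by 3°C, and an intrinsic heating of +6°C happens in between:

```python
optimal = (18.0, 24.0)

# Iteration 1: both simulations start at 27 degrees C
t_ai, t_noai = 27.0, 27.0
energy_ai_1 = 1.5                      # AI plays action 1: cool by 1.5
t_ai -= 1.5                            # 25.5
energy_noai_1 = t_noai - optimal[1]    # 3.0: integrated system snaps to the 24 bound
t_noai = optimal[1]                    # 24.0
reward_1 = energy_noai_1 - energy_ai_1

# Intrinsic heating of +6 degrees C affects both simulations equally
t_ai += 6.0     # 31.5
t_noai += 6.0   # 30.0

# Iteration 2: AI plays action 0 (cool by 3)
energy_ai_2 = 3.0
t_ai -= 3.0                            # 28.5
energy_noai_2 = t_noai - optimal[1]    # 6.0
t_noai = optimal[1]
reward_2 = energy_noai_2 - energy_ai_2

total_reward = reward_1 + reward_2
```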
That was an example of the whole process happening for two minutes. In our implementation we'll run the same process over 1000 epochs of 5-month periods for the training, and then, once our AI is trained, we'll run the same process over 1 full year of simulation for the testing.
Now that we've defined and built the environment in detail, it's time for our AI to take action! This is where deep Q-learning comes into play. Our model will be more advanced than the previous one because I'm introducing some new tricks, called dropout and early stopping, which are great techniques for you to have in your toolkit; they usually improve the training performance of deep Q-learning.
Don't forget, you'll also get an AI Blueprint, which will allow you to adapt what we do here to any other business problem that you want to solve with deep Q-learning.
Ready? Let's smash this.
Let's start by reminding ourselves of the whole deep Q-learning model, while adapting it to this case study, so that you don't have to scroll or turn many pages back into the previous chapters. Repetition is never bad; it sticks the knowledge into our heads more firmly. Here's the deep Q-learning algorithm for you again:
Initialization:

1. The memory of experience replay is initialized to an empty list, called memory in the code (the dqn.py Python file in the Chapter 11 folder of the GitHub repo).
2. The maximum size of the memory is chosen, called max_memory in the code (the same dqn.py file).

At each time t (each minute), we repeat the following process, until the end of the epoch:

1. We predict the Q-values of the current state s_t. Since five actions can be performed, we get five predicted Q-values.
2. We play the action with the highest Q-value, a_t = argmax_a Q(s_t, a) (during training, we'll occasionally replace it with a random action to encourage exploration).
3. We receive the reward R(s_t, a_t).
4. We reach the next state, s_{t+1}.
5. We append the transition (s_t, a_t, r_t, s_{t+1}) to the memory.
6. We take a random batch of transitions from the memory. For each transition of that random batch, we get the prediction Q(s_t, a_t) and the target R(s_t, a_t) + γ max_a Q(s_{t+1}, a), and we compute the loss between the predictions and the targets over the whole batch.
And then finally we backpropagate this loss error back into the neural network, and through stochastic gradient descent we update the weights according to how much they contributed to the loss error.
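To make steps 5 and 6 concrete, here's a minimal sketch of an experience replay memory; the class and method names echo the spirit of dqn.py but are illustrative, not the repo's exact code:

```python
import random

class ReplayMemory:
    def __init__(self, max_memory=100):
        self.max_memory = max_memory
        self.memory = []

    def remember(self, transition):
        # A transition is a tuple (state, action, reward, next_state, game_over)
        self.memory.append(transition)
        # Drop the oldest transition once the memory is full
        if len(self.memory) > self.max_memory:
            del self.memory[0]

    def get_batch(self, batch_size):
        # Sample a random batch of transitions to train on
        return random.sample(self.memory, min(batch_size, len(self.memory)))
```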
I hope the refresher was refreshing! Let's move on to the brain of the outfit.
By the brain, I mean of course the artificial neural network of our AI.
Our brain will be a fully connected neural network, composed of two hidden layers, the first one with 64 neurons, and the second one with 32 neurons. As a reminder, this neural network takes as inputs the states of the environment, and returns as outputs the Q-values for each of the five possible actions.
This particular design of a neural network, with two hidden layers of 64 and 32 neurons respectively, is considered something of a classic architecture. It's suitable to solve a lot of problems, and it will work well for us here.
This artificial brain will be trained with a Mean Squared Error (MSE) loss and an Adam optimizer. We choose the MSE loss because we want to measure and reduce the squared difference between the predicted values and the target values, and Adam because it's a classic optimizer that's used by default in practice.
Here is what this artificial brain looks like:
Figure 2: The artificial brain of our AI
This artificial brain looks complex to create, but we can build it very easily thanks to the amazing Keras library. In
the last chapter, we used PyTorch because it's the neural network
library I'm more familiar with; but I want you to be able to use as many
AI tools as possible, so in this chapter we're going to power on with
Keras. Here's a preview of the full implementation containing the part
that builds this brain all by itself (taken from the brain_nodropout.py
file):
# Importing the required Keras classes (these imports sit at the top of brain_nodropout.py)
from keras.layers import Input, Dense
from keras.models import Model
from keras.optimizers import Adam

# BUILDING THE BRAIN

class Brain(object):

    # BUILDING A FULLY CONNECTED NEURAL NETWORK DIRECTLY INSIDE THE INIT METHOD
    def __init__(self, learning_rate = 0.001, number_actions = 5):
        self.learning_rate = learning_rate
        # BUILDING THE INPUT LAYER COMPOSED OF THE INPUT STATE
        states = Input(shape = (3,))
        # BUILDING THE FULLY CONNECTED HIDDEN LAYERS
        x = Dense(units = 64, activation = 'sigmoid')(states)
        y = Dense(units = 32, activation = 'sigmoid')(x)
        # BUILDING THE OUTPUT LAYER, FULLY CONNECTED TO THE LAST HIDDEN LAYER
        q_values = Dense(units = number_actions, activation = 'softmax')(y)
        # ASSEMBLING THE FULL ARCHITECTURE INSIDE A MODEL OBJECT
        self.model = Model(inputs = states, outputs = q_values)
        # COMPILING THE MODEL WITH A MEAN-SQUARED ERROR LOSS AND A CHOSEN OPTIMIZER
        self.model.compile(loss = 'mse', optimizer = Adam(lr = learning_rate))
As you can see, it only takes a couple of lines of code, and I'll explain every line of that code to you in a later section. Now let's move on to the implementation.
This implementation will be divided into five parts, each part having its own Python file. You can find the full implementation in the Chapter 11
folder of the GitHub repository. These five parts constitute the
general AI framework, or AI Blueprint, that should be followed whenever
you build an environment to solve any business problem with deep
reinforcement learning.
Here they are, from Step 1 to Step 5:

1. Building the environment (environment.py)
2. Building the brain (brain_nodropout.py or brain_dropout.py)
3. Implementing the deep reinforcement learning algorithm (dqn.py)
4. Training the AI (training_noearlystopping.py or training_earlystopping.py)
5. Testing the AI (testing.py)

In order, those are the main steps of the general AI framework.
We'll follow this AI Blueprint to implement the AI for our specific case in the following five sections, each corresponding to one of these five main steps. Within each step, we'll distinguish the sub-steps that are still part of the general AI framework from the sub-steps that are specific to our project by writing the titles of the code sections in capital letters for all the sub-steps of the general AI framework, and in lowercase letters for all the sub-steps specific to our project.
That means that anytime you see a new code section where the title is written in capital letters, then it is the next sub-step of the general AI framework, which you should also follow when building an AI for your own business problem.
This next step, building the environment, is the largest Python implementation file for this project. Make sure you're rested and your batteries are recharged, and as soon as you are ready, let's tackle this together!
In this first step, we are going to build the environment inside a class. Why a class? Because we would like our environment to be an object which we can easily create with any values we choose for some parameters.
For example, we can create one environment object for a server that has a certain number of connected users and a certain rate of data at a specific time, and another environment object for a different server that has a different number of connected users and a different rate of data. Thanks to the advanced structure of this class, we can easily plug-and-play the environment objects we create on different servers which have their own parameters, regulating their temperatures with several different AIs, so that we can minimize the energy consumption of a whole data center, just as Google DeepMind did for Google's data centers with its DQN (deep Q-learning) algorithm.
This class follows these sub-steps, which are part of the general AI framework inside Step 1 – Building the environment:

- Introducing and initializing all the parameters and variables of the environment
- Making a method that updates the environment right after the AI plays an action
- Making a method that resets the environment
- Making a method that gives us, at any time, the current state, the last reward obtained, and whether the game is over
You'll find the whole implementation of this Environment
class in this section. Remember the most important thing: all the code
sections with their titles written in capital letters are steps of the
general AI framework/Blueprint, and all the code sections having their
titles written in lowercase letters are specific to our case study.
The implementation of the environment has 144 lines of code. I won't explain each line of it, for two reasons: first, I'm confident you'll have no problems understanding it; second, the code section titles and the chosen variable names are clear enough for you to understand the structure and the flow of the code at face value. I'll walk you through the code broadly. Here we go!
First, we start building the Environment
class with its first method, the __init__
method, which introduces and initializes all the parameters and variables, as we described earlier:
# Importing the libraries (at the top of environment.py)
import numpy as np

# BUILDING THE ENVIRONMENT IN A CLASS

class Environment(object):

    # INTRODUCING AND INITIALIZING ALL THE PARAMETERS AND VARIABLES OF THE ENVIRONMENT
    def __init__(self, optimal_temperature = (18.0, 24.0), initial_month = 0, initial_number_users = 10, initial_rate_data = 60):
        self.monthly_atmospheric_temperatures = [1.0, 5.0, 7.0, 10.0, 11.0, 20.0, 23.0, 24.0, 22.0, 10.0, 5.0, 1.0]
        self.initial_month = initial_month
        self.atmospheric_temperature = self.monthly_atmospheric_temperatures[initial_month]
        self.optimal_temperature = optimal_temperature
        self.min_temperature = -20
        self.max_temperature = 80
        self.min_number_users = 10
        self.max_number_users = 100
        self.max_update_users = 5
        self.min_rate_data = 20
        self.max_rate_data = 300
        self.max_update_data = 10
        self.initial_number_users = initial_number_users
        self.current_number_users = initial_number_users
        self.initial_rate_data = initial_rate_data
        self.current_rate_data = initial_rate_data
        self.intrinsic_temperature = self.atmospheric_temperature + 1.25 * self.current_number_users + 1.25 * self.current_rate_data
        self.temperature_ai = self.intrinsic_temperature
        self.temperature_noai = (self.optimal_temperature[0] + self.optimal_temperature[1]) / 2.0
        self.total_energy_ai = 0.0
        self.total_energy_noai = 0.0
        self.reward = 0.0
        self.game_over = 0
        self.train = 1
You'll notice the self.monthly_atmospheric_temperatures
variable; that's a list containing the average monthly atmospheric temperatures for each of the 12 months: 1°C in January, 5°C in February, 7°C in March, and so on.
The self.atmospheric_temperature
variable is the current average atmospheric temperature of the month
we're in during the simulation, and it's initialized as the atmospheric
temperature of the initial month, which we'll set later as January.
The self.game_over
variable tells the AI whether or not we should reset the temperature of
the server, in case it goes outside the allowed range of [-20°C, 80°C].
If it does, self.game_over
will be set equal to 1, otherwise it will remain at 0.
Finally, the self.train
variable tells us whether we're in training mode or inference mode. If we're in training mode, self.train = 1
. If we're in inference mode, self.train = 0
. The rest is just putting into code everything we defined in words at the beginning of this chapter.
Let's move on!
Now, we make the second method, update_env, which updates the environment after the AI performs an action. This method takes three arguments as inputs:

- direction: A variable describing the direction of the heat transfer the AI imposes on the server, like so: if direction == 1, the AI is heating up the server; if direction == -1, the AI is cooling down the server. We'll need to have the value of this direction before calling the update_env method, since this method is called after the action is performed.
- energy_ai: The energy spent by the AI to heat up or cool down the server at this specific time when the action is played. Based on Assumption 2, it will be equal to the temperature change caused by the AI in the server.
- month: Simply the month we're in at the specific time when the action is played.

The first thing the program does inside this method is compute the reward. Indeed, right after the action is played, we can immediately deduce the reward, since it is the difference between the energy that the server's integrated cooling system would spend if there was no AI, and the energy spent by the AI:
    # MAKING A METHOD THAT UPDATES THE ENVIRONMENT RIGHT AFTER THE AI PLAYS AN ACTION
    def update_env(self, direction, energy_ai, month):

        # GETTING THE REWARD

        # Computing the energy spent by the server's cooling system when there is no AI
        energy_noai = 0
        if (self.temperature_noai < self.optimal_temperature[0]):
            energy_noai = self.optimal_temperature[0] - self.temperature_noai
            self.temperature_noai = self.optimal_temperature[0]
        elif (self.temperature_noai > self.optimal_temperature[1]):
            energy_noai = self.temperature_noai - self.optimal_temperature[1]
            self.temperature_noai = self.optimal_temperature[1]
        # Computing the Reward
        self.reward = energy_noai - energy_ai
        # Scaling the Reward
        self.reward = 1e-3 * self.reward
You have probably noticed that we choose to scale the reward at the end. In short, scaling is bringing the values (here the rewards) down into a short range. For example, normalization is a scaling technique where all the values are brought down into a range between 0 and 1. Another widely used scaling technique is standardization, which will be explained a bit later on.
Scaling is a common practice that is usually recommended in research papers when performing deep reinforcement learning, as it stabilizes training and improves the performance of the AI.
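As a toy illustration (not code from the project): with the 1e-3 factor above, a raw reward of 2.5 becomes 0.0025, while min-max normalization would squash a whole set of values into [0, 1]:

```python
def scale_reward(raw_reward, factor=1e-3):
    # Shrink the reward into a small range, as done in update_env above
    return factor * raw_reward

def normalize(values):
    # Min-max normalization: the smallest value maps to 0, the largest to 1
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]
```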
After getting the reward, we reach the next state. Remember that each state is composed of the following elements:

- the temperature of the server at time t
- the number of users on the server at time t
- the rate of data transmission on the server at time t
So, as we reach the next state, we update each of these elements one by one, following the sub-steps highlighted as comments in this next code section:
        # GETTING THE NEXT STATE

        # Updating the atmospheric temperature
        self.atmospheric_temperature = self.monthly_atmospheric_temperatures[month]
        # Updating the number of users
        self.current_number_users += np.random.randint(-self.max_update_users, self.max_update_users)
        if (self.current_number_users > self.max_number_users):
            self.current_number_users = self.max_number_users
        elif (self.current_number_users < self.min_number_users):
            self.current_number_users = self.min_number_users
        # Updating the rate of data
        self.current_rate_data += np.random.randint(-self.max_update_data, self.max_update_data)
        if (self.current_rate_data > self.max_rate_data):
            self.current_rate_data = self.max_rate_data
        elif (self.current_rate_data < self.min_rate_data):
            self.current_rate_data = self.min_rate_data
        # Computing the Delta of Intrinsic Temperature
        past_intrinsic_temperature = self.intrinsic_temperature
        self.intrinsic_temperature = self.atmospheric_temperature + 1.25 * self.current_number_users + 1.25 * self.current_rate_data
        delta_intrinsic_temperature = self.intrinsic_temperature - past_intrinsic_temperature
        # Computing the Delta of Temperature caused by the AI
        if (direction == -1):
            delta_temperature_ai = -energy_ai
        elif (direction == 1):
            delta_temperature_ai = energy_ai
        # Updating the new Server's Temperature when there is the AI
        self.temperature_ai += delta_intrinsic_temperature + delta_temperature_ai
        # Updating the new Server's Temperature when there is no AI
        self.temperature_noai += delta_intrinsic_temperature
Then, we update the self.game_over variable if needed, that is, if the temperature of the server goes outside the allowed range of [-20°C, 80°C]. This can happen if the server temperature goes below the minimum temperature of -20°C, or if it goes higher than the maximum temperature of 80°C. Plus, we do two extra things: we bring the server temperature back into the optimal temperature range (to the closest bound), and, since doing this spends some energy, we update the total energy spent by the AI (self.total_energy_ai). That's exactly what is coded in the next code section:
# GETTING GAME OVER
if (self.temperature_ai < self.min_temperature):
    if (self.train == 1):
        self.game_over = 1
    else:
        self.total_energy_ai += self.optimal_temperature[0] - self.temperature_ai
        self.temperature_ai = self.optimal_temperature[0]
elif (self.temperature_ai > self.max_temperature):
    if (self.train == 1):
        self.game_over = 1
    else:
        self.total_energy_ai += self.temperature_ai - self.optimal_temperature[1]
        self.temperature_ai = self.optimal_temperature[1]
Now, I know it seems unrealistic for the server to snap right back to 24 degrees from 80, or to 18 from -20, but this is an action the magically efficient integrated cooling system we defined earlier is perfectly capable of. Think of it as the AI switching to the integrated system for a moment in the case of a temperature disaster. Once again, this is an area that will benefit enormously from your ongoing tinkering once we've got the AI up and running; after that, you can play around with these figures as you like in the interests of a more realistic server model.
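To make the energy accounting concrete, here's a small worked example of the inference-mode branch above (these numbers assume the bounds used in this chapter: an allowed range of [-20°C, 80°C] and an optimal range of [18°C, 24°C]):

```python
# Worked example of the clamping logic above, in inference mode (train == 0)
optimal_temperature = (18.0, 24.0)
max_temperature = 80.0

temperature_ai = 85.0   # the server has overheated past the allowed maximum
total_energy_ai = 0.0

if temperature_ai > max_temperature:
    # energy spent to bring the server back to the closest optimal bound
    total_energy_ai += temperature_ai - optimal_temperature[1]  # 85 - 24 = 61
    temperature_ai = optimal_temperature[1]                     # back to 24.0
```

So one such temperature disaster costs the integrated system 61 energy units in this example, which is why the AI is rewarded for avoiding it in the first place.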
Then, we update the two scores coming from the two separate simulations, which are:
self.total_energy_ai: The total energy spent by the AI

self.total_energy_noai: The total energy spent by the server's integrated cooling system when there is no AI

# UPDATING THE SCORES
# Updating the Total Energy spent by the AI
self.total_energy_ai += energy_ai
# Updating the Total Energy spent by the server's cooling system when there is no AI
self.total_energy_noai += energy_noai
Then, to improve the performance, we scale the next state by scaling each of its three elements (server temperature, number of users, and data transmission rate). To do so, we perform a simple min-max normalization, which consists of subtracting the minimum value of the variable, and then dividing by the range of the variable (its maximum minus its minimum):
# SCALING THE NEXT STATE
scaled_temperature_ai = (self.temperature_ai - self.min_temperature) / (self.max_temperature - self.min_temperature)
scaled_number_users = (self.current_number_users - self.min_number_users) / (self.max_number_users - self.min_number_users)
scaled_rate_data = (self.current_rate_data - self.min_rate_data) / (self.max_rate_data - self.min_rate_data)
next_state = np.matrix([scaled_temperature_ai, scaled_number_users, scaled_rate_data])
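As a quick sanity check of that scaling, a server temperature of 30°C within the allowed range of [-20°C, 80°C] should map to exactly 0.5:

```python
# Worked example of the min-max scaling used for the server temperature
min_temperature, max_temperature = -20.0, 80.0

temperature_ai = 30.0
scaled_temperature_ai = (temperature_ai - min_temperature) / (max_temperature - min_temperature)
# (30 - (-20)) / (80 - (-20)) = 50 / 100 = 0.5
```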
Finally, we end this update_env
method by returning the next state, the reward received, and whether the game is over or not:
# RETURNING THE NEXT STATE, THE REWARD, AND GAME OVER
return next_state, self.reward, self.game_over
Great! We're done with this long, but important, method that updates the environment at each time step (each minute). Now there are two final and very easy methods to go: one that resets the environment, and one that gives us three pieces of information at any time: the current state, the last reward received, and whether or not the game is over.
Here's the reset
method, which resets the environment when a new training episode
starts, by resetting all the variables of the environment to their
originally initialized values:
# MAKING A METHOD THAT RESETS THE ENVIRONMENT
def reset(self, new_month):
    self.atmospheric_temperature = self.monthly_atmospheric_temperatures[new_month]
    self.initial_month = new_month
    self.current_number_users = self.initial_number_users
    self.current_rate_data = self.initial_rate_data
    self.intrinsic_temperature = self.atmospheric_temperature + 1.25 * self.current_number_users + 1.25 * self.current_rate_data
    self.temperature_ai = self.intrinsic_temperature
    self.temperature_noai = (self.optimal_temperature[0] + self.optimal_temperature[1]) / 2.0
    self.total_energy_ai = 0.0
    self.total_energy_noai = 0.0
    self.reward = 0.0
    self.game_over = 0
    self.train = 1
Finally, here's the observe
method, which lets us know at any given time the current state, the last reward received, and whether the game is over:
# MAKING A METHOD THAT GIVES US AT ANY TIME THE CURRENT STATE, THE LAST REWARD AND WHETHER THE GAME IS OVER
def observe(self):
    scaled_temperature_ai = (self.temperature_ai - self.min_temperature) / (self.max_temperature - self.min_temperature)
    scaled_number_users = (self.current_number_users - self.min_number_users) / (self.max_number_users - self.min_number_users)
    scaled_rate_data = (self.current_rate_data - self.min_rate_data) / (self.max_rate_data - self.min_rate_data)
    current_state = np.matrix([scaled_temperature_ai, scaled_number_users, scaled_rate_data])
    return current_state, self.reward, self.game_over
Awesome! We're done with the first step of the implementation, building the environment. Now let's move on to the next step and start building the brain.
In this step, we're going to build the artificial brain of our AI, which is nothing other than a fully connected neural network. Here it is again:
Figure 3: The artificial brain of our AI
We'll build this artificial brain inside a class for the same reason as before: to allow us to create several artificial brains for different servers inside a data center. Maybe some servers will need artificial brains with different hyperparameters than others. That's why, thanks to Python's class/object structure, we can easily switch from one brain to another, to regulate the temperature of a new server that requires an AI with different neural network parameters. That's the beauty of Object-Oriented Programming (OOP).
We're building this artificial brain with the amazing Keras library. From this library, we use the Dense() class to create our two fully connected hidden layers, the first one with 64 hidden neurons, and the second one with 32. Remember, this is a classic neural network architecture, often used by default as common practice, and seen in many research papers. At the end, we use the Dense() class again to return the Q-values, which are the outputs of the artificial neural network.
Later on, when we code the training and testing
files, we'll use the argmax method to select the action that has the
maximum Q-value. Then, we assemble all the components of the brain,
including the inputs and outputs, by creating it as an object of the Model()
class (which is very useful in that we can save and load a model with
specific weights). Finally, we'll compile it with a mean squared error
loss and an Adam optimizer. I'll explain all this in more detail later.
Here are the new steps of the general AI framework:
The implementation of this is presented to you in a choice of two different files:

brain_nodropout.py: An implementation file that builds the artificial brain without the dropout regularization technique (I'll explain what it is very soon).

brain_dropout.py: An implementation file that builds the artificial brain with the dropout regularization technique.

First let me give you the implementation without dropout, and then I'll provide one with dropout and explain it.
Here is the full implementation of the artificial brain, without any dropout regularization technique:
# AI for Business - Minimize cost with Deep Q-Learning #1
# Building the Brain without Dropout #2
#3
# Importing the libraries #4
from keras.layers import Input, Dense #5
from keras.models import Model #6
from keras.optimizers import Adam #7
#8
# BUILDING THE BRAIN #9
#10
class Brain(object): #11
#12
    # BUILDING A FULLY CONNECTED NEURAL NETWORK DIRECTLY INSIDE THE INIT METHOD #13
#14
    def __init__(self, learning_rate = 0.001, number_actions = 5): #15
        self.learning_rate = learning_rate #16
#17
        # BUILDING THE INPUT LAYER COMPOSED OF THE INPUT STATE #18
        states = Input(shape = (3,)) #19
#20
        # BUILDING THE FULLY CONNECTED HIDDEN LAYERS #21
        x = Dense(units = 64, activation = 'sigmoid')(states) #22
        y = Dense(units = 32, activation = 'sigmoid')(x) #23
#24
        # BUILDING THE OUTPUT LAYER, FULLY CONNECTED TO THE LAST HIDDEN LAYER #25
        q_values = Dense(units = number_actions, activation = 'softmax')(y) #26
#27
        # ASSEMBLING THE FULL ARCHITECTURE INSIDE A MODEL OBJECT #28
        self.model = Model(inputs = states, outputs = q_values) #29
#30
        # COMPILING THE MODEL WITH A MEAN-SQUARED ERROR LOSS AND A CHOSEN OPTIMIZER #31
        self.model.compile(loss = 'mse', optimizer = Adam(lr = learning_rate)) #32
Now, let's go through the code in detail.
Line 5: We import the Input
and Dense
classes from the layers
module in the keras
library. The Input
class allows us to build the input layer, and the Dense
class allows us to build the fully-connected layers.
Line 6: We import the Model
class from the models
module in the keras
library. It allows us to build the whole neural network model by assembling its different layers.
Line 7: We import the Adam
class from the optimizers
module in the keras
library. It allows us to use the Adam optimizer, used to update the
weights of the neural network through stochastic gradient descent, when
backpropagating the loss error in each iteration of the training.
Line 11: We introduce the Brain
class, which will contain not only the whole architecture of the
artificial neural network, but also the connection of the model to the
loss (Mean-Squared Error) and the Adam optimizer.
Line 15: We introduce the __init__ method, which will be the only method of this class. We define the whole architecture of the neural network inside it, just by creating successive variables which together assemble the neural network. This method takes as inputs two arguments:

The learning rate (learning_rate), which is a measure of how fast you want the neural network to learn (the higher the learning rate, the faster the neural network learns, but at the cost of quality). The default value is 0.001.

The number of actions (number_actions), which is of course the number of actions that our AI can perform. Now you might be thinking: why do we need to put that as an argument? Well, that's just in case you want to build another AI that can perform more or fewer actions. In which case, you would simply need to change the value of the argument and that's it. Pretty practical, isn't it?

Line 16: We create an object variable for the learning rate, self.learning_rate, initialized as the value of the learning_rate argument provided in the __init__ method (therefore the argument of the Brain class when we create the object in the future).
Line 19: We create the input states layer, called states, as an object of the Input class. Into this Input class we enter one argument, shape = (3,), which simply specifies that the input layer is a 1D vector composed of three elements (the server temperature, the number of users, and the data transmission rate).
Line 22: We create the first fully-connected hidden layer, called x, as an object of the Dense class, which takes as input two arguments:

units: The number of hidden neurons we want to have in this first hidden layer. Here, we choose to have 64 hidden neurons.

activation: The activation function used to pass on the signal when forward-propagating the inputs into this first hidden layer. Here we choose, by default, a sigmoid activation function, which is as follows:

Figure 4: The sigmoid activation function
The ReLU activation function would also have worked
well here; I encourage you to experiment! Note also how the connection
from the input layer to this first hidden layer is made by calling the states
variable right after the Dense
class.
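For reference, here's a minimal NumPy sketch of the two activation functions mentioned above (an illustration only; Keras applies them for you when you pass activation = 'sigmoid' or activation = 'relu'):

```python
import numpy as np

def sigmoid(x):
    # squashes any input into the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # zero for negative inputs, identity for positive ones
    return np.maximum(0.0, x)
```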
Line 23: We create the second fully-connected hidden layer, called y, as an object of the Dense class, which takes as input the same two arguments:

units: The number of hidden neurons we want to have in this second hidden layer. This time we choose to have 32 hidden neurons.

activation: The activation function used to pass on the signal when forward-propagating the inputs into this second hidden layer. Here, again, we choose a sigmoid activation function.

Note once again how the connection from the first hidden layer to this second hidden layer is made by calling the x variable right after the Dense class.
Line 26: We create the output layer, called q_values
, fully connected to the second hidden layer, as an object of the Dense
class. This time, we input number_actions
units since the output layer contains the actions to play, and a softmax
activation function, as seen in Chapter 5, Your First AI Model – Beware the Bandits!, on the deep Q-learning theory.
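If you want to see what that softmax output computes, here's a minimal sketch (an illustration only; Keras applies it for you via activation = 'softmax'): it turns the five raw output values into five positive numbers that sum to 1, one per action, while preserving which action has the highest value.

```python
import numpy as np

def softmax(z):
    # subtracting the max is a standard trick for numerical stability
    e = np.exp(z - np.max(z))
    return e / e.sum()

q_values = softmax(np.array([1.0, 2.0, 0.5, 0.1, 1.5]))
# five positive outputs, one per action, summing to 1;
# np.argmax(q_values) still picks the action with the highest raw value
```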
Line 29: Using the Model
class, we assemble the successive layers of the neural network, by just inputting the states
as the inputs, and the q_values
as the outputs.
Line 32: Using the compile
method taken from the Model
class, we connect our model to the Mean-Squared Error loss and the Adam optimizer. The latter takes the learning_rate
argument as input.
It'll be valuable for you to add one more powerful technique to your toolkit: dropout.
Dropout is a regularization technique that prevents overfitting, which is the situation where the AI model performs well on the training set, but poorly on the test set. Dropout simply consists of deactivating a randomly selected portion of neurons during each step of forward- and back-propagation. That means not all the neurons learn the same way, which prevents the neural network from overfitting the training data.
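Conceptually, here's what dropout does during one training step, sketched in plain NumPy (an illustration only, not Keras's internal implementation; real implementations such as Keras also rescale the kept activations by 1 / (1 - rate), a variant known as inverted dropout, so that no rescaling is needed at test time):

```python
import numpy as np

def dropout_forward(activations, rate, rng):
    # each neuron is kept with probability 1 - rate, dropped (zeroed) otherwise
    mask = rng.random(activations.shape) >= rate
    return activations * mask

rng = np.random.default_rng(42)
hidden = np.ones(10)                       # pretend these are 10 neuron activations
dropped = dropout_forward(hidden, 0.1, rng)
# roughly 10% of the values are now 0, the rest are unchanged
```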
Adding dropout is very easy with keras
. You simply need to call the Dropout
class right after the Dense
class, and input the proportion of neurons you want to deactivate, like so:
# AI for Business - Minimize cost with Deep Q-Learning
# Building the Brain with Dropout

# Importing the libraries
from keras.layers import Input, Dense, Dropout
from keras.models import Model
from keras.optimizers import Adam

# BUILDING THE BRAIN

class Brain(object):

    # BUILDING A FULLY CONNECTED NEURAL NETWORK DIRECTLY INSIDE THE INIT METHOD
    def __init__(self, learning_rate = 0.001, number_actions = 5):
        self.learning_rate = learning_rate

        # BUILDING THE INPUT LAYER COMPOSED OF THE INPUT STATE
        states = Input(shape = (3,))

        # BUILDING THE FIRST FULLY CONNECTED HIDDEN LAYER WITH DROPOUT ACTIVATED
        x = Dense(units = 64, activation = 'sigmoid')(states)
        x = Dropout(rate = 0.1)(x)

        # BUILDING THE SECOND FULLY CONNECTED HIDDEN LAYER WITH DROPOUT ACTIVATED
        y = Dense(units = 32, activation = 'sigmoid')(x)
        y = Dropout(rate = 0.1)(y)

        # BUILDING THE OUTPUT LAYER, FULLY CONNECTED TO THE LAST HIDDEN LAYER
        q_values = Dense(units = number_actions, activation = 'softmax')(y)

        # ASSEMBLING THE FULL ARCHITECTURE INSIDE A MODEL OBJECT
        self.model = Model(inputs = states, outputs = q_values)

        # COMPILING THE MODEL WITH A MEAN-SQUARED ERROR LOSS AND A CHOSEN OPTIMIZER
        self.model.compile(loss = 'mse', optimizer = Adam(lr = learning_rate))
Here, we apply dropout to the first and second fully-connected layers, by deactivating 10% of their neurons each. Now, let's move on to the next step of our general AI framework: Step 3 – Implementing the deep reinforcement learning algorithm.
In this new implementation (given in the dqn.py
file), we simply have to follow the deep Q-learning algorithm provided before. Hence, this implementation follows the following sub-steps, which are part of the general AI framework:
First, have a look at the whole code, and then I'll explain it line by line:
# AI for Business - Minimize cost with Deep Q-Learning #1
# Implementing Deep Q-Learning with Experience Replay #2
#3
# Importing the libraries #4
import numpy as np #5
#6
# IMPLEMENTING DEEP Q-LEARNING WITH EXPERIENCE REPLAY #7
#8
class DQN(object): #9
#10
    # INTRODUCING AND INITIALIZING ALL THE PARAMETERS AND VARIABLES OF THE DQN #11
    def __init__(self, max_memory = 100, discount = 0.9): #12
        self.memory = list() #13
        self.max_memory = max_memory #14
        self.discount = discount #15
#16
    # MAKING A METHOD THAT BUILDS THE MEMORY IN EXPERIENCE REPLAY #17
    def remember(self, transition, game_over): #18
        self.memory.append([transition, game_over]) #19
        if len(self.memory) > self.max_memory: #20
            del self.memory[0] #21
#22
    # MAKING A METHOD THAT BUILDS TWO BATCHES OF INPUTS AND TARGETS BY EXTRACTING TRANSITIONS FROM THE MEMORY #23
    def get_batch(self, model, batch_size = 10): #24
        len_memory = len(self.memory) #25
        num_inputs = self.memory[0][0][0].shape[1] #26
        num_outputs = model.output_shape[-1] #27
        inputs = np.zeros((min(len_memory, batch_size), num_inputs)) #28
        targets = np.zeros((min(len_memory, batch_size), num_outputs)) #29
        for i, idx in enumerate(np.random.randint(0, len_memory, size = min(len_memory, batch_size))): #30
            current_state, action, reward, next_state = self.memory[idx][0] #31
            game_over = self.memory[idx][1] #32
            inputs[i] = current_state #33
            targets[i] = model.predict(current_state)[0] #34
            Q_sa = np.max(model.predict(next_state)[0]) #35
            if game_over: #36
                targets[i, action] = reward #37
            else: #38
                targets[i, action] = reward + self.discount * Q_sa #39
        return inputs, targets #40
Line 5: We import the numpy
library, because we'll be working with numpy
arrays.
Line 9: We introduce the DQN
class (DQN stands for Deep Q-Network), which contains the main parts of the deep Q-Learning algorithm, including experience replay.
Line 12: We introduce the __init__
method, which creates the three following object variables of the DQN
model: the experience replay memory, the capacity (maximum size of the
memory), and the discount factor in the formula of the target. It takes
as arguments max_memory
(the capacity) and discount
(the discount factor), in case we want to build other experience replay memories
with different capacities, or if we want to change the value of the
discount factor in the computation of the target. The default values of
these arguments are respectively 100
and 0.9
,
which were chosen arbitrarily and turned out to work quite well; these
are good arguments to experiment with, to see what difference it makes
when you set them differently.
Line 13: We create the experience replay memory object variable, self.memory
, and we initialize it as an empty list.
Line 14: We create the object variable for the memory capacity, self.max_memory
, and we initialize it as the value of the max_memory
argument.
Line 15: We create the object variable for the discount factor, self.discount
, and we initialize it as the value of the discount
argument.
Line 18: We introduce the remember
method, which takes as input a transition to be added to the memory, and game_over
, which states whether or not this transition leads the server's temperature to go outside of the allowed range of temperatures.
Line 19: Using the append
function called from the memory
list, we add the transition with the game_over
boolean into the memory (in the last position).
Line 20: If, after adding this transition, the size of the memory exceeds the memory capacity (self.max_memory)...

Line 21: ...we delete the first element of the memory, which is the oldest transition.
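As a side note on this design, Python's standard library offers an equivalent way to express a bounded memory: collections.deque with a maxlen silently drops the oldest element when a new one is appended, which does the same job as the append/delete pair above.

```python
from collections import deque

# a memory limited to the 3 most recent transitions (3 just for the demo)
memory = deque(maxlen = 3)
for transition in range(5):
    memory.append(transition)

# the oldest elements (0 and 1) were dropped automatically
print(list(memory))  # [2, 3, 4]
```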
Line 24: We introduce the get_batch
method, which takes as inputs the model we built in the previous Python file (model
) and a batch size (batch_size
), and builds two batches of inputs and targets by extracting 10
transitions from the memory (if the batch size is 10).
Line 25: We get the current number of elements in the memory and put it into a new variable, len_memory
.
Line 26: We get the
number of elements in the input state vector (which is 3), but instead
of directly entering 3, we access this number from the shape
attribute of the input state vector element of the memory, which we get by taking the [0][0][0]
indexes. Each element of the memory is structured as follows:
[[current_state, action, reward, next_state], game_over]
Thus in [0][0][0]
, the first [0]
corresponds to the first element of the memory (meaning the first transition), the second [0]
corresponds to the tuple [current_state
, action
, reward
, next_state
], and so the third [0]
corresponds to the current_state
element of that tuple. Hence, self.memory[0][0][0]
corresponds to the first current state, and by adding .shape[1]
we get the number of elements in that input state vector. You might be
wondering why we didn't enter 3 directly; that's because we want to
generalize this code to any input state vector dimension you might want
to have in your environment. For example, you might want to consider an
input state with more information about your server, such as the
humidity. Thanks to this line of code, you won't have to change anything
regarding your new number of state elements.
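If the chain of [0][0][0] indexes feels abstract, this tiny example (with made-up numbers) reproduces the memory structure and shows what each index retrieves:

```python
import numpy as np

# one transition, structured exactly like an element of the replay memory:
# [[current_state, action, reward, next_state], game_over]
current_state = np.matrix([0.44, 0.20, 0.30])   # shape (1, 3)
next_state = np.matrix([0.45, 0.21, 0.30])
memory = [[[current_state, 2, -1.5, next_state], 0]]

transition = memory[0][0]          # the [current_state, action, reward, next_state] list
first_state = memory[0][0][0]      # the current state of the first transition
num_inputs = first_state.shape[1]  # 3, the number of state elements
```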
Line 27: We get the
number of elements of the model output, meaning the number of actions.
Just like on the previous line, instead of entering directly 5, we
generalize by accessing this from the shape
attribute called from our model
object of the Model
class. -1
means that we get the last index of that shape
attribute, where the number of actions is contained.
Line 28: We introduce and initialize the batch of inputs as a numpy
array, of batch_size
= 10 rows and 3 columns
corresponding to input state elements, with only zeros. If the memory
doesn't have 10 transitions yet, the number of rows will just be the
length of the memory.
If the memory already has at least 10 transitions, what we get with this line of code is the following:
Figure 5: Batch of inputs (1/2)
Line 29: We introduce and initialize the batch of targets as a numpy
array of batch_size
= 10 rows and 5 columns corresponding to the five possible actions,
with only zeros. Just like before, if the memory doesn't have 10
transitions yet, the number of rows will just be the length of the
memory. If the memory already has at least 10 transitions, what we get
with this line of code is the following:
Figure 6: Batch of targets (1/3)
Line 30: We do a double iteration inside the same for loop. The first iterative variable i goes from 0 up to the batch size (exclusive), or up to len_memory if len_memory < batch_size:

i = 0, 1, 2, 3, 4, 5, 6, 7, 8, 9

That way, i iterates over each element of the batch. The second iterative variable idx takes 10 random indexes of the memory, in order to extract 10 random transitions from the memory. Inside the for loop, we populate the two batches of inputs and targets with their right values by iterating through each of their elements.
Line 31: We get the transition of the sampled index idx from the memory, composed of the current state, the action, the reward, and the next state. The reason we add [0] is because an element of the memory is structured as follows:

[[current_state, action, reward, next_state], game_over]

We'll get the game_over value separately, in the next line of code.
Line 32: We get the game_over value corresponding to that same index idx of the memory. As you can see, this time we add [1] on the end to get the second element of a memory element:

[[current_state, action, reward, next_state], game_over]
Line 33: We populate the batch of inputs with all the current states, in order to get this at the end of the for
loop:
Figure 7: Batch of inputs (2/2)
Line 34: Now we start populating the batch of targets with the right values. First, we populate it with all the Q-values
that the model predicts for the different state-action pairs: (current
state, action 0), (current state, action 1), (current state, action 2),
(current state, action 3), and (current state, action 4). Thus we first
get this (at the end of the for
loop):
Figure 8: Batch of targets (2/3)
Remember that for the action that is played, the formula of the target must be this one:

target = R + γ max_a' Q(next_state, a')

where R is the reward received and γ is the discount factor.
What we do in the following lines of code is to put this formula into the column of each action that was played within the 10 selected transitions. In other words, we get this:
Figure 9: Batch of targets (3/3)
In that example, Action 1 was performed in the first transition (Target 1), Action 3 was performed in the second transition (Target 2), Action 0 was performed in the third transition (Target 3), and so on. Let's populate this in the following lines of code.
Line 35: We first get the max_a' Q(next_state, a') part of the formula of the target, that is, the maximum of the Q-values predicted by the model for the next state.
Line 36: We check if game_over = 1, meaning that the server has gone outside the allowed range of server temperatures. Because if it has, there's actually no next state (we basically reset the environment by putting the server's temperature back into the optimal range, so we start from a new state); and therefore we shouldn't consider the max_a' Q(next_state, a') term.

Line 37: In that case, we only keep the reward R part of the target.
Line 38: However, if the game is not over (game_over = 0)...

Line 39: ...we keep the whole formula of the target, R + γ max_a' Q(next_state, a'), but of course only for the action that was performed.
Hence, we get the following batch of targets, as you saw earlier:
Figure 10: Batch of targets (3/3)
Line 40: At last, we return
the final batches of inputs
and targets
.
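To see the target computation with concrete (made-up) numbers: suppose a sampled transition has reward -1.5, the game is not over, and the model predicts the following Q-values for the next state:

```python
import numpy as np

discount = 0.9
reward = -1.5
game_over = 0

# hypothetical Q-values predicted by the model for the next state (5 actions)
q_next = np.array([0.10, 0.40, 0.20, 0.05, 0.25])

Q_sa = np.max(q_next)  # 0.4, the best Q-value reachable from the next state
if game_over:
    target = reward                    # no next state to look ahead to
else:
    target = reward + discount * Q_sa  # -1.5 + 0.9 * 0.4 = -1.14
```

This -1.14 is the value that would be written into the column of the played action in the batch of targets.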
That was epic—you've successfully created an artificial brain. Now that you've done it, we're ready to start the training.
Now that our AI has a fully functional brain, it's time to train it. That's exactly what we do in this fourth Python implementation. You actually have a choice of two files to use for this:

training_noearlystopping.py, which trains your AI on a full 100 epochs of 5-month periods.

training_earlystopping.py, which trains your AI on 100 epochs as well, but which can stop the training early if the performance no longer improves over the iterations. This technique is called early stopping.

Both these implementations are long, but very simple. We start by setting all the parameters, then we build the environment by creating an object of the Environment() class, then we build the brain of the AI by creating an object of the Brain() class, then we build the deep Q-learning model by creating an object of the DQN() class, and finally we launch the training, connecting all these objects together over 100 epochs of 5-month periods.
You'll notice in the training loop that we also do some exploration when performing the actions, performing some random actions from time to time. In our case, this will be done 30% of the time, since we use an exploration parameter ε = 0.3, and we force the AI to perform a random action whenever we draw a random value between 0 and 1 that is below ε. The reason we do some exploration is because it improves the deep reinforcement learning process, as we discussed in Chapter 9, Going Pro with Artificial Brains – Deep Q-Learning, and the reason we don't use softmax exploration in this project is just to give you a look at how to implement a different exploration method.
Later, you'll be introduced to another little improvement in the training_earlystopping.py file, where we use an early stopping technique which stops the training early if there's no improvement in the performance.
Let's highlight the new steps, which still belong to our general AI framework/blueprint:

Building the environment by creating an object of the Environment class.

Building the brain by creating an object of the Brain class.

Building the DQN model by creating an object of the DQN class.

Launching the training with a for loop over 100 epochs of 5-month periods.

Ready to implement this? Maybe get a good coffee or tea first, because this is going to be a bit long (88 lines of code, but easy ones!). We'll start without early stopping, and then at the end I'll explain how to add the early stopping technique. The file to follow along with is training_noearlystopping.py. Since this is pretty long, let's do it section by section this time, starting with the first one:
# AI for Business - Minimize cost with Deep Q-Learning #1
# Training the AI without Early Stopping #2
#3
# Importing the libraries and the other python files #4
import os #5
import numpy as np #6
import random as rn #7
import environment #8
import brain_nodropout #9
import dqn #10
Line 5: We import the os
library, which will be used to set a seed for reproducibility so that
if you run the training several times, you'll get the same result each
time. You can, of course, choose to remove this when you tinker with the
code yourself!
Line 6: We import the numpy
library, since we'll work with numpy
arrays.
Line 7: We import the random
library, which we'll use to do some exploration.
Line 8: We import the environment.py
file, implemented in Step 1, which contains the whole defined environment.
Line 9: We import the brain_nodropout.py
file, our artificial brain without dropout that we implemented in Step 2. This contains the whole neural network of our AI.
Line 10: We import the dqn.py
file implemented in Step 3, which contains the main parts of the deep Q-learning algorithm, including experience replay.
Moving on to the next section:
# Setting seeds for reproducibility #12
os.environ['PYTHONHASHSEED'] = '0' #13
np.random.seed(42) #14
rn.seed(12345) #15
#16
# SETTING THE PARAMETERS #17
epsilon = .3 #18
number_actions = 5 #19
direction_boundary = (number_actions - 1) / 2 #20
number_epochs = 100 #21
max_memory = 3000 #22
batch_size = 512 #23
temperature_step = 1.5 #24
#25
# BUILDING THE ENVIRONMENT BY SIMPLY CREATING AN OBJECT OF THE ENVIRONMENT CLASS #26
env = environment.Environment(optimal_temperature = (18.0, 24.0), initial_month = 0, initial_number_users = 20, initial_rate_data = 30) #27
#28
# BUILDING THE BRAIN BY SIMPLY CREATING AN OBJECT OF THE BRAIN CLASS #29
brain = brain_nodropout.Brain(learning_rate = 0.00001, number_actions = number_actions) #30
#31
# BUILDING THE DQN MODEL BY SIMPLY CREATING AN OBJECT OF THE DQN CLASS #32
dqn = dqn.DQN(max_memory = max_memory, discount = 0.9) #33
#34
# CHOOSING THE MODE #35
train = True #36
Lines 13, 14, and 15: We set seeds for reproducibility, to get the same results after several rounds of training. This is only important if you want to reproduce your findings; some people prefer to use seeds and others don't. If you don't want the seeds, you can simply remove these lines.
Line 18: We introduce the exploration parameter ε, and we set it to 0.3, meaning that there will be 30% exploration (performing random actions) vs. 70% exploitation (performing the actions of the AI).
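The epsilon-greedy rule described here boils down to a few lines; this sketch (the helper name choose_action is made up for illustration) shows the decision in isolation:

```python
import numpy as np

def choose_action(q_values, epsilon, number_actions):
    # explore: with probability epsilon, play a random action
    if np.random.rand() <= epsilon:
        return np.random.randint(0, number_actions)
    # exploit: otherwise, play the action with the highest predicted Q-value
    return int(np.argmax(q_values))

q_values = np.array([0.05, 0.60, 0.10, 0.15, 0.10])
action = choose_action(q_values, epsilon = 1.0, number_actions = 5)  # always random here
```

With epsilon = 0.3, roughly 3 calls out of 10 would ignore the Q-values and return a random action instead.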
Line 19: We set the number of actions to 5
.
Line 20: We set the direction boundary, meaning the action index below which we cool down the server, and above which we heat up the server. Since actions 0 and 1 cool down the server, and actions 3 and 4 heat up the server, that direction boundary is (5-1)/2 = 2, which corresponds to the action that transfers no heat to the server (action 2).
Line 21: We set the number of training epochs to 100
.
Line 22: We set the memory capacity, meaning its maximum size, to 3000
.
Line 23: We set the batch size to 512
.
Line 24: We introduce the temperature step, meaning the absolute temperature change that the AI causes in the server by playing actions 0, 1, 3, or 4. That's of course 1.5°C.
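Putting the direction boundary and the temperature step together, each of the five actions maps to a direction and an amount of energy; this small sketch (the helper name action_to_effect is made up for illustration) makes the mapping explicit:

```python
number_actions = 5
direction_boundary = (number_actions - 1) / 2   # 2.0
temperature_step = 1.5

def action_to_effect(action):
    # actions below the boundary cool the server, actions at or above it heat it up
    direction = -1 if action - direction_boundary < 0 else 1
    energy_ai = abs(action - direction_boundary) * temperature_step
    return direction, energy_ai

# action 0 -> cool by 3.0 degrees, action 2 -> no heat transfer,
# action 4 -> heat by 3.0 degrees
```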
Line 27: We create the environment
object, as an instance of the Environment
class which we call from the environment
file. Inside this Environment
class, we enter all the arguments of the init
method:
optimal_temperature = (18.0, 24.0),
initial_month = 0,
initial_number_users = 20,
initial_rate_data = 30
Line 30: We create the brain
object as an instance of the Brain
class, which we call from the brain_nodropout
file. Inside this Brain
class, we enter all the arguments of the init
method:
learning_rate = 0.00001,
number_actions = number_actions
Line 33: We create the dqn
object as an instance of the DQN
class, which we call from the dqn
file. Inside this DQN
class we enter all the arguments of the init
method:
max_memory = max_memory,
discount = 0.9
Line 36: We set the training mode to True
, because the next code section will contain the big for
loop that performs all the training.
All good so far? Don't forget to take a break or a step back by reading the previous paragraphs again anytime you feel a bit overwhelmed or lost.
Now let's begin the big training loop; that's the last code section of this file:
# TRAINING THE AI #38
env.train = train #39
model = brain.model #40
if (env.train): #41
    # STARTING THE LOOP OVER ALL THE EPOCHS (1 Epoch = 5 Months) #42
    for epoch in range(1, number_epochs): #43
        # INITIALIZING ALL THE VARIABLES OF BOTH THE ENVIRONMENT AND THE TRAINING LOOP #44
        total_reward = 0 #45
        loss = 0. #46
        new_month = np.random.randint(0, 12) #47
        env.reset(new_month = new_month) #48
        game_over = False #49
        current_state, _, _ = env.observe() #50
        timestep = 0 #51
        # STARTING THE LOOP OVER ALL THE TIMESTEPS (1 Timestep = 1 Minute) IN ONE EPOCH #52
        while ((not game_over) and timestep <= 5 * 30 * 24 * 60): #53
            # PLAYING THE NEXT ACTION BY EXPLORATION #54
            if np.random.rand() <= epsilon: #55
                action = np.random.randint(0, number_actions) #56
                if (action - direction_boundary < 0): #57
                    direction = -1 #58
                else: #59
                    direction = 1 #60
                energy_ai = abs(action - direction_boundary) * temperature_step #61
            # PLAYING THE NEXT ACTION BY INFERENCE #62
            else: #63
                q_values = model.predict(current_state) #64
                action = np.argmax(q_values[0]) #65
                if (action - direction_boundary < 0): #66
                    direction = -1 #67
                else: #68
                    direction = 1 #69
                energy_ai = abs(action - direction_boundary) * temperature_step #70
            # UPDATING THE ENVIRONMENT AND REACHING THE NEXT STATE #71
            next_state, reward, game_over = env.update_env(direction, energy_ai, ( new_month + int(timestep/(30*24*60)) ) % 12) #72
            total_reward += reward #73
            # STORING THIS NEW TRANSITION INTO THE MEMORY #74
            dqn.remember([current_state, action, reward, next_state], game_over) #75
            # GATHERING IN TWO SEPARATE BATCHES THE INPUTS AND THE TARGETS #76
            inputs, targets = dqn.get_batch(model, batch_size = batch_size) #77
            # COMPUTING THE LOSS OVER THE TWO WHOLE BATCHES OF INPUTS AND TARGETS #78
            loss += model.train_on_batch(inputs, targets) #79
            timestep += 1 #80
            current_state = next_state #81
        # PRINTING THE TRAINING RESULTS FOR EACH EPOCH #82
        print("\n") #83
        print("Epoch: {:03d}/{:03d}".format(epoch, number_epochs)) #84
        print("Total Energy spent with an AI: {:.0f}".format(env.total_energy_ai)) #85
        print("Total Energy spent with no AI: {:.0f}".format(env.total_energy_noai)) #86
    # SAVING THE MODEL #87
    model.save("model.h5") #88
Line 39: We set the env.train object variable (this is a variable of our environment object) to the value of the train variable entered just before, which is of course equal to True, meaning we are indeed in training mode.
Line 40: We get the model from our brain object. This model contains the whole architecture of the neural network, plus its optimizer. It also has extra practical tools, like for example the save and load methods, which will allow us respectively to save the weights after the training or load them anytime in the future.
Line 41: If we are in training mode…
Line 43: We start the main training for loop, iterating the training epochs from 1 to 100.
Line 45: We set the total reward (the reward accumulated over the training iterations) to 0.
Line 46: We set the loss to 0. (written 0. with a dot, because the loss will be a float).
Line 47: We set the starting month of the training, called new_month, to a random integer between 0 and 11. For example, if the random integer is 2, we start the training in March.
Line 48: By calling the reset method from our env object of the Environment class built in Step 1, we reset the environment starting from that new_month.
Line 49: We set the game_over variable to False, because we're starting in the allowed range of server temperatures.
Line 50: By calling the observe method from our env object of the Environment class built in Step 1, we get the current state only, which is our starting state.
Line 51: We set the first timestep to 0. This is the first minute of the training.
Line 53: We start the while loop that will iterate all the timesteps (minutes) for the whole period of the epoch, which is 5 months. Therefore, we iterate through 5 * 30 * 24 * 60 minutes; that is, 216,000 timesteps. If, however, during those timesteps we go outside the allowed range of server temperatures (that is, if game_over = 1), then we stop the epoch and start a new one.
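A quick sanity check of that timestep arithmetic:

```python
# 1 epoch = 5 months of 30 days, each day having 24 hours of 60 minutes
minutes_per_epoch = 5 * 30 * 24 * 60
print(minutes_per_epoch)  # 216000
```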
Lines 55 to 61 make sure the AI performs a random action 30% of the time. This is exploration. The trick to it in this case is to sample a random number between 0 and 1, and if this random number is between 0 and 0.3, the AI performs a random action. That means the AI will perform a random action 30% of the time, because this sampled number has a 30% chance to be between 0 and 0.3.
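Here's a minimal standalone sketch of that epsilon-greedy trick (a hypothetical helper, assuming epsilon = 0.3 and five actions as in our parameters):

```python
import numpy as np

epsilon = 0.3          # exploration rate
number_actions = 5

def select_action(q_values):
    # Sample a number in [0, 1); 30% of the time it falls below epsilon
    if np.random.rand() <= epsilon:
        return int(np.random.randint(0, number_actions))   # exploration: random action
    return int(np.argmax(q_values))                        # exploitation: best Q-value
```

Over many calls, the action with the highest Q-value is picked roughly 76% of the time here (70% exploitation, plus the 1-in-5 chance a random action happens to hit it).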
Line 55: If a sampled number between 0 and 1 is below epsilon (that is, below 0.3)...
Line 56: ...we play a random action index from 0 to 4.
Line 57: Now that we've just performed an action, we compute the direction and the energy spent; remember that they're the required arguments of the update_env method of the Environment class, which we'll call later to update the environment. The AI distinguishes between two cases by checking whether the action is below or above the direction boundary of 2. If the action is below the direction boundary of 2, meaning the AI cools down the server...
Line 58: ...then the heating direction is equal to -1 (cooling down).
Lines 59 and 60: Otherwise, the heating direction is equal to +1 (heating up).
Line 61: We compute the energy spent by the AI on the server, which according to Assumption 2 is: |action - direction_boundary| * temperature_step = |action - 2| * 1.5 Joules
For example, if the action is 4, then the AI heats up the server by 3°C, and so according to Assumption 2 the energy spent is 3 Joules. And we check indeed that |4-2|*1.5 = 3.
Line 63: Now we play the actions by inference, meaning directly from our AI's predictions. The inference starts from the else statement, which corresponds to the if statement of line 55. This else corresponds to the situation where the sampled number is between 0.3 and 1, which happens 70% of the time.
Line 64: By calling the predict method from our model object (predict is a pre-built method of the Model class), we get the five predicted Q-values from our AI model.
Line 65: Using the argmax function from numpy, we select the action that has the maximum Q-value among the five predicted at Line 64.
Lines 66 to 70: We do exactly the same as in Lines 57 to 61, but this time with the action performed by inference.
Line 72: Now we have everything ready to update the environment. We call the big update_env method made in the Environment class of Step 1, inputting the heating direction, the energy spent by the AI, and the month we're in at that specific timestep of the while loop. We get in return the next state, the reward received, and whether the game is over (that is, whether we went outside the optimal range of server temperatures).
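A quick check of the month arithmetic on line 72, with hypothetical values: if the epoch starts in November (month index 10, since 0 is January), then three simulated months later the modulo wraps us around to February (index 1):

```python
new_month = 10                    # November (0 = January)
timestep = 3 * 30 * 24 * 60       # three simulated months, in minutes
month = (new_month + int(timestep / (30 * 24 * 60))) % 12
print(month)  # 1, i.e. February
```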
Line 73: We add this last reward received to the total reward.
Line 75: By calling the remember method from our dqn object of the DQN class built in Step 3, we store the new transition [[current_state, action, reward, next_state], game_over] into the memory.
Line 77: By calling the get_batch method from our dqn object of the DQN class built in Step 3, we create two separate batches of inputs and targets, each one having 512 elements (since batch_size = 512).
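Under the hood, get_batch builds its targets from the Bellman equation. Here's a hedged sketch of how one such target is typically computed (simplified; the actual batching logic lives in the DQN class of Step 3, which uses discount = 0.9):

```python
import numpy as np

# Simplified, hypothetical target computation; the real logic is in the
# DQN class built in Step 3.
def q_target(reward, q_values_next, game_over, discount=0.9):
    # If the game is over, there's no future reward to discount
    if game_over:
        return reward
    # Otherwise: immediate reward plus discounted best future Q-value
    return reward + discount * float(np.max(q_values_next))
```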
Line 79: By calling the train_on_batch method from our model object (train_on_batch is a pre-built method of the Model class), we compute the loss error between the predictions and the targets over the whole batch. As a reminder, this loss error is the mean squared error. Then, in this same line, we add this loss error to the total loss of the epoch, in case we want to check how the total loss evolves over the epochs during the training.
Line 80: We increment the timestep.
Line 81: We update the current state, which becomes the new state reached.
Line 83: We print a new line to separate out the training results so we can look them over easily.
Line 84: We print the epoch reached (the one we are in at this specific moment of the main training for loop).
Line 85: We print the total energy spent by the AI over that specific epoch.
Line 86: We print the total energy spent by the server's integrated cooling system over that same specific epoch.
Line 88: We save the model's weights at the end of the training, in order to load them in the future, anytime we want to use our pre-trained model to regulate a server's temperature.
That's it for training our AI without early stopping; now let's have a look at what you'd need to change to implement it.
Now open the training_earlystopping.py file. Compare it to the previous file; all the lines of code from 1 to 40 are the same. Then, in the last code section, TRAINING THE AI, we have the same process, to which is added the early stopping technique. As a reminder, it consists of stopping the training if there's no more improvement in the performance, which could be assessed in two different ways: through the loss, or through the total reward. Here, we'll use the total reward.
Let's see how we do this.
First, we introduce four new variables just before the main training for
loop:
# TRAINING THE AI #38
env.train = train #39
model = brain.model #40
early_stopping = True #41
patience = 10 #42
best_total_reward = -np.inf #43
patience_count = 0 #44
if (env.train): #45
    # STARTING THE LOOP OVER ALL THE EPOCHS (1 Epoch = 5 Months) #46
    for epoch in range(1, number_epochs): #47
Line 41: We introduce a new variable, early_stopping, which is set equal to True if we decide to activate the early stopping technique, meaning if we decide to stop the training when the performance no longer improves.
Line 42: We introduce a new variable, patience, which is the number of epochs we wait without performance improvement before stopping the training. Here we choose a patience of 10 epochs, which means that if the best total reward of an epoch doesn't increase during the next 10 epochs, we will stop the training.
Line 43: We introduce a new variable, best_total_reward, which is the best total reward recorded over a full epoch. If we don't beat that best total reward within 10 epochs, the training stops. It's initialized to -np.inf, which represents -infinity. That's just a trick to say that nothing can be lower than that best total reward at the beginning. Then, as soon as we get the first total reward over the first epoch, best_total_reward becomes that first total reward.
Line 44: We introduce a new variable, patience_count, which is a counter starting from 0, and is incremented by 1 each time the total reward of an epoch doesn't beat the best total reward. If patience_count reaches 10 (the patience), we stop the training. And if one epoch beats the best total reward, patience_count is reset to 0.
Then, the main training for loop is the same as before, but just before saving the model we add the following:
        # EARLY STOPPING #91
        if (early_stopping): #92
            if (total_reward <= best_total_reward): #93
                patience_count += 1 #94
            elif (total_reward > best_total_reward): #95
                best_total_reward = total_reward #96
                patience_count = 0 #97
            if (patience_count >= patience): #98
                print("Early Stopping") #99
                break #100
    # SAVING THE MODEL #101
    model.save("model.h5") #102
Line 92: If the early_stopping variable is True, meaning the early stopping technique is activated…
Line 93: And if the total reward of the current epoch (we are still in the main training for loop that iterates the epochs) is lower than the best total reward of an epoch obtained so far…
Line 94: ...we increment the patience_count variable by 1.
Line 95: However, if the total reward of the current epoch is higher than the best total reward of an epoch obtained so far…
Line 96: ...we update the best total reward, which becomes that new total reward of the current epoch.
Line 97: ...and we reset the patience_count variable to 0.
Line 98: Then, in a new if condition, we check whether the patience_count variable has reached the patience of 10…
Line 99: ...in which case we print Early Stopping...
Line 100: ...and we stop the main training for loop with a break statement.
That's the whole thing. Easy and intuitive, right? Now you know how to implement early stopping.
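The same bookkeeping can be wrapped in a small helper class. This is just a hypothetical refactoring sketch; the chapter's code deliberately keeps plain variables inside the training loop:

```python
# Hypothetical helper; the chapter's code uses plain variables instead.
class EarlyStopping:
    def __init__(self, patience=10):
        self.patience = patience
        self.best_total_reward = float("-inf")
        self.patience_count = 0

    def should_stop(self, total_reward):
        if total_reward > self.best_total_reward:
            self.best_total_reward = total_reward   # new best: reset the counter
            self.patience_count = 0
        else:
            self.patience_count += 1                # no improvement this epoch
        return self.patience_count >= self.patience

# At the end of each epoch you would then call:
#     if stopper.should_stop(total_reward): break
```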
After executing the code (I'll explain how to run this in a bit), we'll already see some good performances from our AI during the training, spending less energy than the server's integrated cooling system most of the time. But that's only training; now we need to see if we get good performance from the AI on a new 1-year simulation. That's where our next and final Python file comes into play.
Now we need to test the performance of our AI in a brand-new situation. To do so, we run a 1-year simulation in inference mode, meaning that there's no training happening at any time. Our AI only returns predictions over a full year of simulation. Then, thanks to our environment object, in the end we'll be able to see the total energy spent by the AI over the full year, as well as the total energy that would have been spent in the exact same year by the server's integrated cooling system. Finally, we compare these two total energies spent, by computing their relative difference (in %) which shows us precisely the total energy saved by the AI. Buckle up for the final results—we'll reveal them very soon!
In terms of the AI blueprint, for the testing implementation we have almost the same process as the training implementation, except that this time we don't need to create a brain object nor a DQN model object; and, of course, we won't run the deep Q-learning process over some training epochs. However, we do have to create a new environment object, and instead of creating a brain, we'll load our artificial brain with its pre-trained weights from the previous training that we executed in Step 4 – Training the AI. Let's take a look at the final sub-steps of this final part of the AI framework/Blueprint:
1. Building a new environment by creating an object of the Environment class.
2. Loading the artificial brain with its pre-trained weights.
3. Running a full 1-year simulation in inference mode.
The implementation is a piece of cake to understand. It's actually the same as the training file, except that:
1. Instead of creating a brain object from the Brain class, we load the pre-trained weights resulting from the training.
2. There's no training, so there's no big training for loop; we simply run the simulation in a plain for loop. You've got this!
Have a look at the full testing implementation in the following code:
# AI for Business - Minimize cost with Deep Q-Learning
# Testing the AI

# Installing Keras
# conda install -c conda-forge keras

# Importing the libraries and the other python files
import os
import numpy as np
import random as rn
from keras.models import load_model
import environment

# Setting seeds for reproducibility
os.environ['PYTHONHASHSEED'] = '0'
np.random.seed(42)
rn.seed(12345)

# SETTING THE PARAMETERS
number_actions = 5
direction_boundary = (number_actions - 1) / 2
temperature_step = 1.5

# BUILDING THE ENVIRONMENT BY SIMPLY CREATING AN OBJECT OF THE ENVIRONMENT CLASS
env = environment.Environment(optimal_temperature = (18.0, 24.0), initial_month = 0, initial_number_users = 20, initial_rate_data = 30)

# LOADING A PRE-TRAINED BRAIN
model = load_model("model.h5")

# CHOOSING THE MODE
train = False

# RUNNING A 1 YEAR SIMULATION IN INFERENCE MODE
env.train = train
current_state, _, _ = env.observe()
for timestep in range(0, 12 * 30 * 24 * 60):
    q_values = model.predict(current_state)
    action = np.argmax(q_values[0])
    if (action - direction_boundary < 0):
        direction = -1
    else:
        direction = 1
    energy_ai = abs(action - direction_boundary) * temperature_step
    next_state, reward, game_over = env.update_env(direction, energy_ai, int(timestep / (30 * 24 * 60)))
    current_state = next_state

# PRINTING THE RESULTS AT THE END OF THE 1 YEAR SIMULATION
print("\n")
print("Total Energy spent with an AI: {:.0f}".format(env.total_energy_ai))
print("Total Energy spent with no AI: {:.0f}".format(env.total_energy_noai))
print("ENERGY SAVED: {:.0f} %".format((env.total_energy_noai - env.total_energy_ai) / env.total_energy_noai * 100))
Everything's more or less the same as before; we just removed the parts related to the training.
Given the different files we have, make sure to understand that there are four possible ways to run the program:
1. Without dropout and without early stopping.
2. With dropout, but without early stopping.
3. Without dropout, but with early stopping.
4. With both dropout and early stopping.
Then, for each of these four combinations, the way to run this is the same: we first execute the training file, and then the testing file. In this demo section, we'll execute the 4th option, with both dropout and early stopping.
Now how do we run this? We have two options: with or without Google Colab.
I'll explain how to do it on Google Colab, and I'll even give you a Google Colab file where you only have to hit the play button. For those of you who want to execute this without Colab, on your favorite Python IDE, or through the terminal, let me explain how it's done. It's easy; you just need to download the main repository from GitHub, then in your Python IDE set the right working directory folder, which is the Chapter 11 folder, and then run the following two files in this order:
1. training_earlystopping.py, inside which you should make sure to import brain_dropout at line 9. This will execute the training, and you'll have to wait until it finishes (which will take about 10 minutes).
2. testing.py, which will test the model on one full year of data.
Now, back to Google Colab. First, open a new Colaboratory file, and call it Deep Q-Learning for Business. Then add all your files from the Chapter 11 folder of GitHub into this Colaboratory file, right here:
Figure 11: Google Colab – Step 1
Unfortunately, it's not easy to add the different files manually. You can only do this by using the os library, which we won't bother with. Instead, copy-paste the five Python implementations into five different cells of our Colaboratory file, in the following order:
1. The environment.py implementation.
2. The brain_dropout.py implementation.
3. The dqn.py implementation.
4. The training_earlystopping.py implementation.
5. The testing.py implementation.
Here's what it looks like, after adding some snazzy titles:
Figure 12: Google Colab – Step 2
Figure 13: Google Colab – Step 3
Figure 14: Google Colab – Step 4
Figure 15: Google Colab – Step 5
Figure 16: Google Colab – Step 6
Now, before we execute each of these cells in order from one through five, we need to remove the import commands of the Python files. The reason is that now that the implementations are in cells, they act as a single Python implementation, and we don't have to import the interdependent files in every single cell. First, remove the following three rows in the training file:
Figure 17: Google Colab – Step 7
After doing that, we end up with this:
Figure 18: Google Colab – Step 8
Then, since we removed these imports, we also have to remove the three filename prefixes for the environment, the brain, and the dqn, when creating the objects:
First the environment:
Figure 19: Google Colab – Step 9
Then the brain:
Figure 20: Google Colab – Step 10
And finally the dqn:
Figure 21: Google Colab – Step 11
Now the training file's good to go. In the testing file, we just have to remove two things: the environment import at line 12:
Figure 22: Google Colab – Step 12
and the environment. prefix at line 25:
Figure 23: Google Colab – Step 13
That's it; now you're all set! You're ready to literally hit the play button on each of the cells, from top to bottom.
First, execute the first cell. After executing it, no output is displayed. That's fine!
Then execute the second cell:
Using TensorFlow backend.
After executing it, you can see the output Using TensorFlow backend.
Then execute the third cell, after which no output is displayed.
Now it gets a bit exciting! You're about to execute the training, and follow the training performance in real time. Do this by executing the fourth cell. After executing it, the training launches, and you should see the following results:
Figure 24: The output
Don't worry about those warnings; everything's running the way it should. Since early stopping is activated, you'll reach the end of the training way before the 100 epochs, at the 15th epoch:
Figure 25: The output at the 15th epoch
Note that the pre-trained weights are saved in Files, in the model.h5
file:
Figure 26: The model.h5 file
The training results look promising. Most of the time, the AI spends less energy than the server's integrated cooling system. Let's check that this is still the case with a full test, on one new year of simulation.
Execute the final cell, and when it finishes running (which takes approximately 3 minutes), you'll see in the printed results that the total energy saved by the AI is…
Total Energy spent with an AI: 261985
Total Energy spent with no AI: 1978293
ENERGY SAVED: 87 %
Total Energy saved by the AI = 87%
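You can verify that percentage yourself from the two printed totals, using the same relative-difference formula as the testing file:

```python
def energy_saved_pct(energy_noai, energy_ai):
    # Relative difference between the two total energies, in percent
    return (energy_noai - energy_ai) / energy_noai * 100

print(round(energy_saved_pct(1978293, 261985)))  # 87
```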
That's a lot of energy saved! Google DeepMind achieved similarly impressive results in 2016. If you look up the results by searching "DeepMind reduces Google cooling bill," you'll see that the result they achieved was 40%. Not bad! Of course, let's be critical: their data center environment is much more complex than our server environment and has many more parameters, so even though they have one of the most talented AI teams in the world, a 40% reduction of the cooling bill was already a remarkable result.
Our environment's very simple, and if you dig into it (which I recommend you do) you'll likely find that the variations of users and data, and therefore the variation of temperature, follow a uniform distribution. Accordingly, the server's temperature usually stays around the optimal range of temperatures. The AI understands that well, and thus chooses most of the time to take no action and cause no change of temperature, thus consuming very little energy.
I highly recommend that you play around with your server cooling model; make it as complex as you like, and try out different rewards to see if you can cause different behaviors.
Even though our environment is simple, you can be proud of your achievement. What matters is that you were able to build a deep Q-learning model for a real-world business problem. The environment itself is less important; what's most important is that you know how to connect a deep reinforcement learning model to an environment, and how to train the model inside.
Now, after your successes with the self-driving car plus this business application, you know how to do just that!
What we've built is excellent for our business client, as our AI will seriously reduce their costs. Remember that thanks to our object-oriented structure (working with classes and objects), we could very easily take the objects created in this implementation for one server and plug them into other servers, so that in the end we lower the total energy consumption of a whole data center! That's how Google saved billions of dollars in energy-related costs, thanks to the DQN model built by their DeepMind AI.
My heartiest congratulations to you for smashing this new application. You've just made huge progress with your AI skills.
Finally, here's the link to the Colaboratory file with this whole implementation as promised. You don't have to install anything, Keras and NumPy are already pre-installed (this is the beauty of Google Colab!):
https://colab.research.google.com/drive/1KGAoT7S60OC3UGHNnrr_FuN5Hcil0cHk
Before we finish this chapter and move onto the world of deep convolutional Q-learning, let me give you a useful recap of the whole general AI blueprint when building a deep reinforcement learning model.
Let's recap the whole AI Blueprint, so that you can print it out and put it on your wall.
Step 1: Building the environment
Step 2: Building the brain
Step 3: Implementing the deep reinforcement learning algorithm
- Building the DQN model.
Step 4: Training the AI
- Building the environment by creating an object of the Environment class built in Step 1.
- Building the artificial brain by creating an object of the Brain class built in Step 2.
- Building the DQN model by creating an object of the DQN class built in Step 3.
- Training the AI with a for loop over a chosen number of epochs.
Step 5: Testing the AI
- Building a new environment by creating an object of the Environment class built in Step 1.
- Loading the artificial brain with its pre-trained weights and running a new simulation in inference mode.
In this chapter you re-applied deep Q-learning to a new business problem. You had to find the best strategy to cool down and heat up the server. Before defining the AI strategy, you had to make some assumptions about your environment, for example the way the temperature is calculated. As inputs to your ANN, you had information about the server at any given time, like the temperature and the data transmission. As outputs, your AI predicted whether to cool down or heat up your server by a certain amount. The reward was the energy saved with respect to the traditional, integrated cooling system. Your AI was able to save 87% of the energy.