11

AI for Business – Minimize Costs with Deep Q-Learning

It's great that you can implement a deep Q-learning model to build a self-driving car. Really, once again, huge congratulations to you for that. But I also want you to be able to use deep Q-learning to solve a real-world business problem. With this next application, you'll be more than ready to add value to your work or business by leveraging AI. Even though we'll once again use a specific application, this chapter will provide you with a general AI framework, a blueprint with the general steps to follow when solving a real-world problem with deep Q-learning. This chapter is very important for you and your career; I don't want you to close this book before you feel confident with the skills you'll learn here. Let's smash this next application together!

Problem to solve

When I said we were going to solve a real-world business problem, I wasn't overstating it; the problem we're about to tackle with deep Q-learning is very similar to one that was solved in the real world with exactly this technique.

In 2016, DeepMind cut a big part of Google's yearly costs by reducing the cooling bill of Google's data centers by 40%, using its DQN AI model (deep Q-learning). Check the link here:

https://deepmind.com/blog/deepmind-ai-reduces-google-data-centre-cooling-bill-40

In this case study, we'll do something very similar. We'll set up our own server environment, and we'll build an AI that controls the cooling and heating of the server so that it stays in an optimal range of temperatures while using the minimum of energy, therefore minimizing the costs.

Just as the DeepMind AI did, our goal will be to achieve at least 40% energy savings! Are you ready for this? Let's bring it on!

As ever, my first question to you is: What's our first step?

I'm sure by this point I don't need to spell out the answer. Let's get straight to building our environment!

Building the environment

Before we define the states, actions, and rewards, we need to set up the server and explain how it operates. We'll do that in several steps:

  1. First, we'll list all the environment parameters and variables by which the server is controlled.
  2. After that we'll set the essential assumptions of the problem, on which your AI will rely to provide a solution.
  3. Then we'll specify how you'll simulate the whole process.
  4. Finally, we'll explain the overall functioning of the server, and how the AI plays its role.

Parameters and variables of the server environment

Here is a list of all the parameters, which keep their values fixed, of the server environment:

  1. The average atmospheric temperature for each month.
  2. The optimal temperature range of the server, which we'll set as [18°C, 24°C].
  3. The minimum temperature, below which the server fails to operate, which we'll set as -20°C.
  4. The maximum temperature, above which the server fails to operate, which we'll set as 80°C.
  5. The minimum number of users in the server, which we'll set as 10.
  6. The maximum number of users in the server, which we'll set as 100.
  7. The maximum change of users in the server per minute, which we'll set as 5; so every minute, the server can only have a change of 5 extra users or 5 fewer users at most.
  8. The minimum rate of data transmission in the server, which we'll set as 20.
  9. The maximum rate of data transmission in the server, which we'll set as 300.
  10. The maximum change of the rate of data transmission per minute, which we'll set as 10; so every minute, the rate of data transmission can only change by a maximum value of 10 in either direction.

Next, we'll list all the variables, which have values that fluctuate over time, of the server environment:

  1. The temperature of the server at a given minute.
  2. The number of users connected to the server at a given minute.
  3. The rate of data transmission at a given minute.
  4. The energy spent by the AI on the server (to cool it down or heat it up) at a given minute.
  5. The energy that would be spent by the server's integrated cooling system to automatically bring the server's temperature back to the optimal range, whenever the server's temperature goes outside this optimal range. This is to keep track of how much energy a non-AI system would use, so we can compare our AI system to it.

All these parameters and variables will be part of the environment, and will influence the actions of our AI.

Next, we'll explain the two core assumptions of the environment. It's important to understand that these assumptions are not AI related, but just used to simplify the environment so that we can focus on creating a functional AI solution.

Assumptions of the server environment

We'll rely on the following two essential assumptions:

Assumption 1 – We can approximate the server temperature

The temperature of the server can be approximated through Multiple Linear Regression, that is, by a linear function of the atmospheric temperature, the number of users and the rate of data transmission, like so:

server temperature = $b_0$ + $b_1$ × atmospheric temperature + $b_2$ × number of users + $b_3$ × rate of data transmission

where $b_0 \in \mathbb{R}$, $b_1 > 0$, $b_2 > 0$, and $b_3 > 0$.

The raison d'être of this assumption, and the reason why $b_1 > 0$, $b_2 > 0$, and $b_3 > 0$, is intuitive to understand. It makes sense that when the atmospheric temperature increases, the temperature of the server increases. The more users that are connected to the server, the more energy the server has to spend handling them, and therefore the higher the temperature of the server will be. Finally, the more data is transmitted inside the server, the more energy the server has to spend processing it, and therefore the higher the temperature of the server will be.

For simplicity's sake, we can just suppose that these correlations are linear. However, you could absolutely run the same simulation by assuming they were quadratic or logarithmic, and altering the code to reflect those equations. This is just my simulation of a virtual server environment; feel free to tweak it as you like!

Let's assume further that after performing this Multiple Linear Regression, we obtained the following values of the coefficients: $b_0 = 0$, $b_1 = 1$, $b_2 = 1.25$, and $b_3 = 1.25$. Accordingly:

server temperature = atmospheric temperature + 1.25 × number of users + 1.25 × rate of data transmission

Now, if we were facing this problem in real life, we could get the dataset of temperatures for our server and calculate these values directly. Here, we're just assuming values that are easy to code and understand, because our goal in this chapter is not to perfectly model a real server; it's to go through the steps of solving a real-world problem with AI.
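
If you did have such a dataset, here's a minimal sketch of how you could estimate those coefficients yourself with scikit-learn; the file name and column names are hypothetical:

# A minimal sketch of estimating the regression coefficients from real data.
# Assumes a hypothetical CSV file with columns: atmospheric_temperature,
# number_users, rate_data, server_temperature.
import pandas as pd
from sklearn.linear_model import LinearRegression

data = pd.read_csv('server_logs.csv')
X = data[['atmospheric_temperature', 'number_users', 'rate_data']]
y = data['server_temperature']

regressor = LinearRegression()
regressor.fit(X, y)

print('b0 =', regressor.intercept_)     # the bias term
print('b1, b2, b3 =', regressor.coef_)  # one coefficient per feature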

Assumption 2 – We can approximate the energy costs

The energy spent by any cooling system, either our AI or the server's integrated cooling system that we'll compare our AI to, that changes the server's temperature from $T_t$ to $T_{t+1}$ within 1 unit of time (in our case 1 minute), can be approximated again through regression by a linear function of the server's absolute temperature change, like so:

$$E_t = \alpha |\Delta T_t| + \beta$$

where:

  1. $E_t$ is the energy spent by the system on the server between times t and t+1 minute.
  2. $\Delta T_t = T_{t+1} - T_t$ is the change in the server's temperature caused by the system, between times t and t+1 minute.
  3. $T_t$ is the temperature of the server at time t.
  4. $T_{t+1}$ is the temperature of the server at time t+1 minute.
  5. $\alpha > 0$ is the regression coefficient.
  6. $\beta \geq 0$ is the regression intercept.

Let's explain why it intuitively makes sense to make this assumption with $\alpha > 0$. That's simply because the more the AI or the old-fashioned integrated cooling system heats up or cools down the server, the more energy it spends to achieve that heat transfer.

For example, imagine the server suddenly has overheating issues and its temperature shoots well above the optimal range; then within one unit of time (1 minute), either system will need much more energy to bring the server's temperature all the way back to the upper bound of the optimal range, 24°C, than it would need to correct a small overshoot of just a degree or two.

For simplicity's sake, in this example we suppose that these correlations are linear, instead of calculating true values from a real dataset. In case you're wondering why we take the absolute value, that's simply because when the AI cools down the server, $T_{t+1} < T_t$, so $\Delta T_t < 0$. Since an energy cost is always positive, we have to take the absolute value of $\Delta T_t$.

Keeping our desired simplicity in mind, we'll assume that the results of the regression are $\alpha = 1$ and $\beta = 0$, so that we get the following final equation based on Assumption 2:

$$E_t = |\Delta T_t| = |T_{t+1} - T_t|$$

thus:

$E_t = \Delta T_t = T_{t+1} - T_t$ if $T_{t+1} > T_t$, that is, if the server is heated up,

$E_t = -\Delta T_t = T_t - T_{t+1}$ if $T_{t+1} < T_t$, that is, if the server is cooled down.
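
In code, Assumption 2 with $\alpha = 1$ and $\beta = 0$ boils down to a one-line helper like the following sketch (the actual environment code later in this chapter computes this inline):

# Sketch of Assumption 2 with alpha = 1 and beta = 0: the energy spent over
# one minute equals the absolute change of the server's temperature.
def energy_spent(temperature_before, temperature_after):
    return abs(temperature_after - temperature_before)

print(energy_spent(27.0, 25.5))  # cooling by 1.5 degrees costs 1.5 units of energy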

Now we've got our assumptions covered, let's explain how we'll simulate the operation of the server, with users logging on and off and data coming in and out.

Simulation

The number of users and the rate of data transmission will randomly fluctuate, to simulate the unpredictable user activity and data requirements of an actual server. This leads to randomness in the temperature. The AI needs to learn how much cooling or heating power it should transfer to the server so as to not deteriorate the server performance, and at the same time, expend as little energy as possible by optimizing its heat transfer.

Now that we have the full picture, I'll explain the overall functioning of the server and the AI inside this environment.

Overall functioning

Inside a data center, we're dealing with a specific server that is controlled by the parameters and variables listed previously. Every minute, some new users log on to the server and some current users log off, therefore updating the number of active users in the server. Also, every minute some new data is transmitted into the server, and some existing data is transmitted outside the server, therefore updating the rate of data transmission happening inside the server.

Hence, based on Assumption 1 given earlier, the temperature of the server is updated every minute. Now please focus, because this is where you'll understand the huge role the AI has to play on the server.

Two possible systems can regulate the temperature of the server: the AI, or the server's integrated cooling system. The server's integrated cooling system is an unintelligent system that automatically brings the server's temperature back inside its optimal temperature range.

Every minute, the server's temperature is updated. If the server is using the integrated cooling system, that system watches to see what happens; that update can either leave the temperature within the range of optimal temperatures (18°C to 24°C), or move it outside this range. If it goes outside the optimal range, for example above 24°C, the server's integrated cooling system automatically brings the temperature back to the closest bound of the optimal range, in this case 24°C. For the purposes of our simulation, we're assuming that no matter how big the change in temperature is, the integrated cooling system can bring it back into the optimal range in under a minute. This is, obviously, an unrealistic assumption, but the purpose of this chapter is for you to build a functioning AI capable of solving the problem, not to perfectly simulate the thermal dynamics of a real server. Once we've completed our example together, I highly recommend that you tinker with the code and try to make it more realistic; for now, to keep things simple, we'll believe in our magically effective integrated cooling system.

If the server is instead using the AI, the server's integrated cooling system is deactivated, and it's the AI itself that updates the server's temperature to regulate it as well as possible. Rather than acting in the purely deterministic way of the unintelligent integrated cooling system, the AI changes the temperature after making some predictions. Before the update to the number of users and the rate of data transmission causes a change in the server's temperature, the AI predicts whether it should cool down the server, do nothing, or heat up the server, and then acts. Then the temperature change happens, and the AI reiterates.

Since these two systems are distinct from one another, we can evaluate them separately to compare their performance: we train or run the AI on the server while keeping track of how much energy the integrated cooling system would have used in the same circumstances.

That brings us to the energy. Remember that one primary goal of the AI is to lower the energy cost of running this server. Accordingly, our AI has to try to use less energy than the unintelligent cooling system would use on the server. Since, based on Assumption 2 given previously, the energy spent on the server (by any system) is equal to the absolute change of temperature it causes within one unit of time:

$$E_t = |\Delta T_t| = |T_{t+1} - T_t|$$

thus:

$E_t = \Delta T_t = T_{t+1} - T_t$ if $T_{t+1} > T_t$, that is, if the server is heated up,

$E_t = -\Delta T_t = T_t - T_{t+1}$ if $T_{t+1} < T_t$, that is, if the server is cooled down,

then the energy saved by the AI at each iteration t (each minute) is equal to the difference between the absolute changes of temperature caused in the server by the unintelligent integrated cooling system and by the AI from t to t+1:

Energy saved by the AI between t and t+1 $= |\Delta T_t^{noAI}| - |\Delta T_t^{AI}|$

where:

  1. $\Delta T_t^{noAI}$ is the change of temperature that the server's integrated cooling system would cause in the server during the iteration t, that is, from t to t+1 minute.
  2. $\Delta T_t^{AI}$ is the change of temperature that the AI would cause in the server during the iteration t, that is, from t to t+1 minute.

The AI's goal is to save as much energy as it can every minute, therefore saving the maximum total energy over 1 full year of simulation, and eventually saving the business the maximum possible cost on its cooling/heating electricity bill. That's how we do business in the 21st century: with AI!

Now that we fully understand how our server environment works, and how it's simulated, it's time to proceed with what absolutely must be done when defining an AI environment. You know the next steps already:

  1. Defining the states.
  2. Defining the actions.
  3. Defining the rewards.

Defining the states

Remember, when you're doing deep Q-learning, the input state is always a 1D vector. (Unless you're doing deep convolutional Q-learning, in which case the input state is a 2D image, but that's getting ahead of ourselves! Wait for Chapter 12, Deep Convolutional Q-Learning.) So, what will the input state vector be in this server environment? What information will it need to contain to describe each state of the environment well enough? These are the questions you must ask yourself when modeling an AI problem and building the environment. Try to answer these questions first on your own and figure out the input state vector in this case; you can find out what we're using in the next paragraph. Hint: have a look again at the variables defined previously.

The input state at time t is composed of the following three elements:

  1. The temperature of the server at time t
  2. The number of users in the server at time t
  3. The rate of data transmission in the server at time t

Thus, the input state will be an input vector of these three elements. Our future AI will take this vector as input, and will return an action to perform at each time, t. Speaking of the actions, what are they going to be? Let's find out.
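
As a small preview, here's what such a state vector looks like in numpy, with made-up values; the Environment class we write later builds exactly this vector, with each element additionally scaled between 0 and 1:

import numpy as np

# Sketch of the input state at time t: a 1x3 vector containing the server
# temperature, the number of users, and the rate of data transmission.
temperature = 25.5      # example values only
number_users = 40
rate_data = 120

current_state = np.array([[temperature, number_users, rate_data]])
print(current_state.shape)  # (1, 3)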

Defining the actions

To figure out which actions to perform, we need to remember the goal, which is to optimally regulate the temperature of the server. The actions are simply going to be the temperature changes that the AI can cause inside the server, in order to heat it up or cool it down. In deep Q-learning, the actions must always be discrete; they can't be picked from a continuous range, so we need a defined number of possible actions. Therefore, we'll consider five possible temperature changes, from -3°C to +3°C, so that we end up with five possible actions that the AI can perform to regulate the temperature of the server:

Figure 1: Defining the actions
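
If you'd like to see that mapping in code, here's a tiny sketch that converts an action index into its temperature change, assuming a step of 1.5°C between consecutive actions (which matches the action list given in the algorithm summary later in this chapter):

# Mapping each discrete action to a temperature change, assuming a 1.5 degree step:
# 0 -> -3.0, 1 -> -1.5, 2 -> 0.0, 3 -> +1.5, 4 -> +3.0
number_actions = 5
temperature_step = 1.5

for action in range(number_actions):
    delta_temperature = (action - (number_actions - 1) / 2) * temperature_step
    print(action, '->', delta_temperature)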

Great. Finally, let's see how we're going to reward and punish our AI.

Defining the rewards

You might have guessed from the earlier Overall functioning section what the reward is going to be. The reward at iteration t is the energy saved by the AI, with respect to how much energy the server's integrated cooling system would have spent; that is, the difference between the energy that the unintelligent cooling system would spend if the AI was deactivated, and the energy that the AI spends on the server:

$$\text{Reward}_t = E_t^{noAI} - E_t^{AI}$$

Since, according to Assumption 2, the energy spent is equal to the absolute change of temperature induced in the server (by any system, including the AI or the unintelligent cooling system):

$$E_t = |\Delta T_t| = |T_{t+1} - T_t|$$

thus:

$E_t = \Delta T_t = T_{t+1} - T_t$ if $T_{t+1} > T_t$, that is, if the server is heated up,

$E_t = -\Delta T_t = T_t - T_{t+1}$ if $T_{t+1} < T_t$, that is, if the server is cooled down,

then the reward we receive at time t is the difference between the absolute changes of temperature caused in the server by the unintelligent cooling system (that is, when there is no AI) and by the AI:

$$\text{Reward}_t = \text{Energy saved by the AI between } t \text{ and } t+1 = |\Delta T_t^{noAI}| - |\Delta T_t^{AI}|$$

where:

  1. $\Delta T_t^{noAI}$ is the change of temperature that the server's integrated cooling system would cause in the server during the iteration t, that is, from t to t+1 minute.
  2. $\Delta T_t^{AI}$ is the change of temperature that the AI would cause in the server during the iteration t, that is, from t to t+1 minute.

Important note: It's important to understand that the two systems (our AI and the server's integrated cooling system) will be evaluated separately, in order to compute the rewards. Since at each time point the actions of the two different systems lead to different temperatures, we have to keep track of the two temperatures separately, as $T_t^{AI}$ and $T_t^{noAI}$. In other words, we're performing two separate simulations at the same time, following the same fluctuations of users and data; one for the AI, and one for the server's integrated cooling system.

To complete this section, we'll do a small simulation of 2 iterations (that is, 2 minutes) as an example to make everything crystal clear.

Final simulation example

Let's say that we're at time t = 4:00 pm, and that the temperature of the server is 28°C, both with the AI and without it. At this exact time, the AI predicts an action: 0, 1, 2, 3 or 4. Since, right now, the server's temperature is outside the optimal temperature range [18°C, 24°C], the AI will probably predict actions 0, 1 or 2. Let's say that it predicts 1, which corresponds to cooling the server down by 1.5°C. Therefore, between 4:00 pm and 4:01 pm, the AI makes the server's temperature go from 28°C to 26.5°C:

$$\Delta T_t^{AI} = 26.5 - 28 = -1.5$$

Thus, based on Assumption 2, the energy spent by the AI on the server is:

$$E_t^{AI} = |\Delta T_t^{AI}| = 1.5$$

Now only one piece of information is missing to compute the reward: the energy that the server's integrated cooling system would have spent if the AI was deactivated between 4:00 pm and 4:01 pm. Remember that this unintelligent cooling system automatically brings the server's temperature back to the closest bound of the optimal temperature range [18°C, 24°C]. Since at 4:00 pm the temperature was 28°C, the closest bound of the optimal temperature range at that time was 24°C. Thus, the server's integrated cooling system would have changed the temperature from 28°C to 24°C, and the server's temperature change that would have occurred if there was no AI is:

$$\Delta T_t^{noAI} = 24 - 28 = -4$$

Based on Assumption 2, the energy that the unintelligent cooling system would have spent if there was no AI is:

$$E_t^{noAI} = |\Delta T_t^{noAI}| = 4$$

In conclusion, the reward the AI gets after playing this action at time t = 4:00 pm is:

$$\text{Reward}_t = E_t^{noAI} - E_t^{AI} = 4 - 1.5 = 2.5$$

I'm sure you'll have noticed that as it stands, our AI system doesn't involve itself with the optimal range of temperatures for the server; as I've mentioned before, everything comes from the rewards, and the AI doesn't get any reward for being inside the optimal range or any penalty for being outside it. Once we've built the AI completely, I recommend that you play around with the code and try adding some rewards or penalties that get the AI to stick close to the optimal range; but for now, to keep things simple and get our AI up and running, we'll leave the reward as entirely linked to energy saved.

Then, between 4:00 pm and 4:01 pm, new things also happen: some new users log on to the server, some existing users log off, some new data transmits into the server, and some existing data transmits out. Based on Assumption 1, these factors make the server's temperature change. Let's say that overall, they increase the server's temperature by 5°C.

Now, remember that we're evaluating two systems separately: our AI, and the server's integrated cooling system. Therefore, we must compute the two temperatures we would get with each of these two systems separately, one without the other, at 4:01 pm. Let's start with the AI.

The temperature we get at 4:01 pm when the AI is activated is:

$$T_{t+1}^{AI} = 26.5 + 5 = 31.5°C$$

And the temperature we get at 4:01 pm if the AI is not activated is:

$$T_{t+1}^{noAI} = 24 + 5 = 29°C$$

Now we have our two separate temperatures: $T_{t+1}^{AI} = 31.5°C$ when the AI is activated, and $T_{t+1}^{noAI} = 29°C$ when the AI is not activated.

Let's simulate what happens between 4:01 pm and 4:02 pm. Again, our AI will make a prediction, and since the server is heating up, let's say it predicts action 0, which corresponds to cooling down the server by 3°C, bringing it down to 28.5°C. Therefore, the energy spent by the AI between 4:01 pm and 4:02 pm is:

$$E_{t+1}^{AI} = |28.5 - 31.5| = 3$$

Now, regarding the server's integrated cooling system (that is, when there is no AI): since at 4:01 pm we had $T_{t+1}^{noAI} = 29°C$, the closest bound of the optimal range of temperatures is still 24°C, and so the energy that the server's unintelligent cooling system would spend between 4:01 pm and 4:02 pm is:

$$E_{t+1}^{noAI} = |24 - 29| = 5$$

Hence, the reward obtained between 4:01 pm and 4:02 pm, which is only and entirely based on the amount of energy saved, is:

$$\text{Reward}_{t+1} = E_{t+1}^{noAI} - E_{t+1}^{AI} = 5 - 3 = 2$$

Finally, the total reward obtained between 4:00 pm and 4:02 pm is:

$$\text{Total Reward} = \text{Reward}_t + \text{Reward}_{t+1} = 2.5 + 2 = 4.5$$
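
If you'd like to double-check the arithmetic of this two-minute example, here's a tiny script that replays it with the numbers we just used, applying the energy formula from Assumption 2:

# Replaying the two-minute example (energy = absolute temperature change).
def energy(temperature_before, temperature_after):
    return abs(temperature_after - temperature_before)

upper_bound = 24.0  # closest bound of the optimal range in this example

# Minute 1 (4:00 pm -> 4:01 pm): server at 28 degrees, AI cools it by 1.5 degrees.
reward_1 = energy(28.0, upper_bound) - energy(28.0, 26.5)    # 4.0 - 1.5 = 2.5

# Minute 2 (4:01 pm -> 4:02 pm): 31.5 degrees with the AI, 29 degrees without it.
reward_2 = energy(29.0, upper_bound) - energy(31.5, 28.5)    # 5.0 - 3.0 = 2.0

print(reward_1 + reward_2)  # total reward over the two minutes: 4.5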

That was an example of the whole process happening for two minutes. In our implementation we'll run the same process over 1000 epochs of 5-month periods for the training, and then, once our AI is trained, we'll run the same process over 1 full year of simulation for the testing.

Now that we've defined and built the environment in detail, it's time for our AI to take action! This is where deep Q-learning comes into play. Our model will be more advanced than the previous one because I'm introducing some new tricks, called dropout and early stopping, which are great techniques for you to have in your toolkit; they usually improve the training performance of deep Q-learning.

Don't forget, you'll also get an AI Blueprint, which will allow you to adapt what we do here to any other business problem that you want to solve with deep Q-learning.

Ready? Let's smash this.

AI solution

Let's start by reminding ourselves of the whole deep Q-learning model, while adapting it to this case study, so that you don't have to scroll or turn many pages back into the previous chapters. Repetition is never bad; it sticks the knowledge into our heads more firmly. Here's the deep Q-learning algorithm for you again:

Initialization:

  1. The memory of the experience replay is initialized to an empty list, called memory in the code (the dqn.py Python file in the Chapter 11 folder of the GitHub repo).
  2. We choose a maximum size for the memory, called max_memory in the code (the dqn.py Python file in the Chapter 11 folder of the GitHub repo).

At each time t (each minute), we repeat the following process, until the end of the epoch:

  1. We predict the Q-values of the current state $s_t$. Since five actions can be performed (0 == Cooling 3°C, 1 == Cooling 1.5°C, 2 == No Heat Transfer, 3 == Heating 1.5°C, 4 == Heating 3°C), we get five predicted Q-values.
  2. We perform the action selected by the argmax method, which simply consists of selecting the action that has the highest of the five predicted Q-values: $a_t = \text{argmax}_a Q(s_t, a)$.
  3. We get the reward $R_t$, which is the difference $E_t^{noAI} - E_t^{AI}$.
  4. We reach the next state $s_{t+1}$, which is composed of the three following elements:
    • The temperature of the server at time t+1
    • The number of users in the server at time t+1
    • The rate of data transmission in the server at time t+1
  5. We append the transition $(s_t, a_t, R_t, s_{t+1})$ in the memory.
  6. We take a random batch $B$ of transitions. For all the transitions $(s_{tB}, a_{tB}, R_{tB}, s_{tB+1})$ of the random batch $B$:
    • We get the predictions: $Q(s_{tB}, a_{tB})$
    • We get the targets: $R_{tB} + \gamma \max_{a} Q(s_{tB+1}, a)$
    • We compute the loss between the predictions and the targets over the whole batch $B$: $\text{Loss} = \frac{1}{2} \sum_{B} \left[ R_{tB} + \gamma \max_{a} Q(s_{tB+1}, a) - Q(s_{tB}, a_{tB}) \right]^2$

And then finally we backpropagate this loss error back into the neural network, and through stochastic gradient descent we update the weights according to how much they contributed to the loss error.
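
To make that loop concrete, here's a stripped-down sketch of a single training step. It assumes env, brain, and dqn are instances of the Environment, Brain, and DQN classes we'll implement in the following sections, and uses the action-to-temperature mapping described earlier (a 1.5°C step); the real training file in Step 4 adds exploration, epochs, and early stopping on top of this.

import numpy as np

def training_step(env, brain, dqn, month, number_actions = 5, temperature_step = 1.5):
    # One illustrative step of the deep Q-learning loop described above.
    current_state, _, _ = env.observe()

    # 1-2. Predict the Q-values of the current state and pick the best action.
    q_values = brain.model.predict(current_state)
    action = np.argmax(q_values[0])
    direction = -1 if action < (number_actions - 1) / 2 else 1
    energy_ai = abs(action - (number_actions - 1) / 2) * temperature_step

    # 3-4. Play the action, get the reward, and reach the next state.
    next_state, reward, game_over = env.update_env(direction, energy_ai, month)

    # 5. Store the transition in the experience replay memory.
    dqn.remember([current_state, action, reward, next_state], game_over)

    # 6. Sample a random batch of transitions and train the brain on it.
    inputs, targets = dqn.get_batch(brain.model, batch_size = 10)
    brain.model.train_on_batch(inputs, targets)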

I hope the refresher was refreshing! Let's move on to the brain of the outfit.

The brain

By the brain, I mean of course the artificial neural network of our AI.

Our brain will be a fully connected neural network, composed of two hidden layers, the first one with 64 neurons, and the second one with 32 neurons. As a reminder, this neural network takes as inputs the states of the environment, and returns as outputs the Q-values for each of the five possible actions.

This particular design of a neural network, with two hidden layers of 64 and 32 neurons respectively, is considered something of a classic architecture. It's suitable to solve a lot of problems, and it will work well for us here.

This artificial brain will be trained with a Mean Squared Error (MSE) loss, and an Adam optimizer. The choice for the MSE loss is because we want to measure and reduce the squared difference between the predicted value and the target value, and the Adam optimizer is a classic optimizer used, in practice, by default.

Here is what this artificial brain looks like:

Figure 2: The artificial brain of our AI

This artificial brain looks complex to create, but we can build it very easily thanks to the amazing Keras library. In the last chapter, we used PyTorch because it's the neural network library I'm more familiar with; but I want you to be able to use as many AI tools as possible, so in this chapter we're going to power on with Keras. Here's a preview of the full implementation containing the part that builds this brain all by itself (taken from the brain_nodropout.py file):

# BUILDING THE BRAIN
class Brain(object):
    
    # BUILDING A FULLY CONNECTED NEURAL NETWORK DIRECTLY INSIDE THE INIT METHOD
    
    def __init__(self, learning_rate = 0.001, number_actions = 5):
        self.learning_rate = learning_rate
        
        # BUILDING THE INPUT LAYER COMPOSED OF THE INPUT STATE
        states = Input(shape = (3,))
        
        # BUILDING THE FULLY CONNECTED HIDDEN LAYERS
        x = Dense(units = 64, activation = 'sigmoid')(states)
        y = Dense(units = 32, activation = 'sigmoid')(x)
        
        # BUILDING THE OUTPUT LAYER, FULLY CONNECTED TO THE LAST HIDDEN LAYER
        q_values = Dense(units = number_actions, activation = 'softmax')(y)
        
        # ASSEMBLING THE FULL ARCHITECTURE INSIDE A MODEL OBJECT
        self.model = Model(inputs = states, outputs = q_values)
        
        # COMPILING THE MODEL WITH A MEAN-SQUARED ERROR LOSS AND A CHOSEN OPTIMIZER
        self.model.compile(loss = 'mse', optimizer = Adam(lr = learning_rate))

As you can see, it only takes a couple of lines of code, and I'll explain every line of that code to you in a later section. Now let's move on to the implementation.

Implementation

This implementation will be divided into five parts, each part having its own Python file. You can find the full implementation in the Chapter 11 folder of the GitHub repository. These five parts constitute the general AI framework, or AI Blueprint, that should be followed whenever you build an environment to solve any business problem with deep reinforcement learning.

Here they are, from Step 1 to Step 5:

  • Step 1: Building the environment (environment.py)
  • Step 2: Building the brain (brain_nodropout.py or brain_dropout.py)
  • Step 3: Implementing the deep reinforcement learning algorithm, which in our case is a deep Q-learning model (dqn.py)
  • Step 4: Training the AI (training_noearlystopping.py or training_earlystopping.py)
  • Step 5: Testing the AI (testing.py)

In order, those are the main steps of the general AI framework.

We'll follow this AI Blueprint to implement the AI for our specific case in the following five sections, each corresponding to one of these five main steps. Within each step, we'll distinguish the sub-steps that are still part of the general AI framework from the sub-steps that are specific to our project by writing the titles of the code sections in capital letters for all the sub-steps of the general AI framework, and in lowercase letters for all the sub-steps specific to our project.

That means that anytime you see a new code section where the title is written in capital letters, then it is the next sub-step of the general AI framework, which you should also follow when building an AI for your own business problem.

This next step, building the environment, is the largest Python implementation file for this project. Make sure you're rested and your batteries are recharged, and as soon as you are ready, let's tackle this together!

Step 1 – Building the environment

In this first step, we are going to build the environment inside a class. Why a class? Because we would like our environment to be an object which we can easily create with any values we choose for some parameters.

For example, we can create one environment object for a server that has a certain number of connected users and a certain rate of data at a specific time, and another environment object for a different server that has a different number of connected users and a different rate of data. Thanks to the advanced structure of this class, we can easily plug-and-play the environment objects we create on different servers which have their own parameters, regulating their temperatures with several different AIs, so that we can minimize the energy consumption of a whole data center, just as Google DeepMind did for Google's data centers with its DQN (deep Q-learning) algorithm.

This class follows the following sub-steps, which are part of the general AI Framework inside Step 1 – Building the environment:

  • Step 1-1: Introducing and initializing all the parameters and variables of the environment.
  • Step 1-2: Making a method that updates the environment right after the AI plays an action.
  • Step 1-3: Making a method that resets the environment.
  • Step 1-4: Making a method that gives us at any time the current state, the last reward obtained, and whether the game is over.

You'll find the whole implementation of this Environment class in this section. Remember the most important thing: all the code sections with their titles written in capital letters are steps of the general AI framework/Blueprint, and all the code sections having their titles written in lowercase letters are specific to our case study.

The implementation of the environment has 144 lines of code. I won't explain each line of code for two reasons:

  1. It would make this chapter really overwhelming.
  2. The code is very simple, commented for clarity, and just creates everything we've defined so far in this chapter.

I'm confident you'll have no problems understanding it. Besides, the code section titles and the chosen variable names are clear enough to understand the structure and the flow of the code at face value. I'll walk you through the code broadly. Here we go!

First, we start building the Environment class with its first method, the __init__ method, which introduces and initializes all the parameters and variables, as we described earlier:

# Importing the libraries
import numpy as np

# BUILDING THE ENVIRONMENT IN A CLASS
class Environment(object):
    
    # INTRODUCING AND INITIALIZING ALL THE PARAMETERS AND VARIABLES OF THE ENVIRONMENT
    
    def __init__(self, optimal_temperature = (18.0, 24.0), initial_month = 0, initial_number_users = 10, initial_rate_data = 60):
        self.monthly_atmospheric_temperatures = [1.0, 5.0, 7.0, 10.0, 11.0, 20.0, 23.0, 24.0, 22.0, 10.0, 5.0, 1.0]
        self.initial_month = initial_month
        self.atmospheric_temperature = self.monthly_atmospheric_temperatures[initial_month]
        self.optimal_temperature = optimal_temperature
        self.min_temperature = -20
        self.max_temperature = 80
        self.min_number_users = 10
        self.max_number_users = 100
        self.max_update_users = 5
        self.min_rate_data = 20
        self.max_rate_data = 300
        self.max_update_data = 10
        self.initial_number_users = initial_number_users
        self.current_number_users = initial_number_users
        self.initial_rate_data = initial_rate_data
        self.current_rate_data = initial_rate_data
        self.intrinsic_temperature = self.atmospheric_temperature + 1.25 * self.current_number_users + 1.25 * self.current_rate_data
        self.temperature_ai = self.intrinsic_temperature
        self.temperature_noai = (self.optimal_temperature[0] + self.optimal_temperature[1]) / 2.0
        self.total_energy_ai = 0.0
        self.total_energy_noai = 0.0
        self.reward = 0.0
        self.game_over = 0
        self.train = 1

You'll notice the self.monthly_atmospheric_temperatures variable; that's a list containing the average monthly atmospheric temperatures for each of the 12 months: 1°C in January, 5°C in February, 7°C in March, and so on.

The self.atmospheric_temperature variable is the current average atmospheric temperature of the month we're in during the simulation, and it's initialized as the atmospheric temperature of the initial month, which we'll set later as January.

The self.game_over variable tells the AI whether or not we should reset the temperature of the server, in case it goes outside the allowed range of [-20°C, 80°C]. If it does, self.game_over will be set equal to 1, otherwise it will remain at 0.

Finally, the self.train variable tells us whether we're in training mode or inference mode. If we're in training mode, self.train = 1. If we're in inference mode, self.train = 0. The rest is just putting into code everything we defined in words at the beginning of this chapter.
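
To see the constructor in action, here's a quick usage sketch; the initial number of users and initial rate of data below are made-up values within the allowed bounds:

# Creating an environment object for a simulation starting in January (month index 0).
env = Environment(optimal_temperature = (18.0, 24.0),
                  initial_month = 0,
                  initial_number_users = 20,
                  initial_rate_data = 30)

print(env.atmospheric_temperature)  # 1.0 (January's average atmospheric temperature)
print(env.temperature_ai)           # 1.0 + 1.25 * 20 + 1.25 * 30 = 63.5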

Let's move on!

Now, we make the second method, update_env, which updates the environment after the AI performs an action. This method takes three arguments as inputs:

  1. direction: A variable describing the direction of the heat transfer the AI imposes on the server, like so: if direction == 1, the AI is heating up the server. If direction == -1, the AI is cooling down the server. We'll need to have the value of this direction before calling the update_env method, since this method is called after the action is performed.
  2. energy_ai: The energy spent by the AI to heat up or cool down the server at this specific time when the action is played. Based on assumption 2, it will be equal to the temperature change caused by the AI in the server.
  3. month: Simply the month we're in at the specific time when the action is played.

The first thing the program does inside this method is compute the reward. Indeed, right after the action is played, we can immediately deduce the reward, since it's the difference between the energy that the server's integrated cooling system would spend if there was no AI, and the energy spent by the AI:

    # MAKING A METHOD THAT UPDATES THE ENVIRONMENT RIGHT AFTER THE AI PLAYS AN ACTION
    
    def update_env(self, direction, energy_ai, month):
        
        # GETTING THE REWARD
        
        # Computing the energy spent by the server's cooling system when there is no AI
        energy_noai = 0
        if (self.temperature_noai < self.optimal_temperature[0]):
            energy_noai = self.optimal_temperature[0] - self.temperature_noai
            self.temperature_noai = self.optimal_temperature[0]
        elif (self.temperature_noai > self.optimal_temperature[1]):
            energy_noai = self.temperature_noai - self.optimal_temperature[1]
            self.temperature_noai = self.optimal_temperature[1]
        # Computing the Reward
        self.reward = energy_noai - energy_ai
        # Scaling the Reward
        self.reward = 1e-3 * self.reward

You have probably noticed that we choose to scale the reward at the end. In short, scaling means bringing the values (here, the rewards) down into a small range. For example, normalization is a scaling technique where all the values are brought into a range between 0 and 1, by subtracting the minimum value and dividing by the difference between the maximum and the minimum. Another widely used scaling technique is standardization, which subtracts the mean and divides by the standard deviation.

Scaling is a common practice that is usually recommended in research papers when performing deep reinforcement learning, as it stabilizes training and improves the performance of the AI.
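
To make the difference between these two scaling techniques concrete, here's a tiny, self-contained illustration on a made-up array of values:

import numpy as np

values = np.array([-2.0, 0.5, 1.0, 3.5])

# Normalization (min-max scaling): brings the values into the [0, 1] range.
normalized = (values - values.min()) / (values.max() - values.min())

# Standardization: subtracts the mean and divides by the standard deviation.
standardized = (values - values.mean()) / values.std()

print(normalized)    # [0.         0.45454545 0.54545455 1.        ]
print(standardized)  # approximately [-1.41 -0.13  0.13  1.41]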

After getting the reward, we reach the next state. Remember that each state is composed of the following elements:

  1. The temperature of the server at time t
  2. The number of users in the server at time t
  3. The rate of data transmission in the server at time t

So, as we reach the next state, we update each of these elements one by one, following the sub-steps highlighted as comments in this next code section:

        # GETTING THE NEXT STATE
        
        # Updating the atmospheric temperature
        self.atmospheric_temperature = self.monthly_atmospheric_temperatures[month]
        # Updating the number of users
        self.current_number_users += np.random.randint(-self.max_update_users, self.max_update_users + 1)
        if (self.current_number_users > self.max_number_users):
            self.current_number_users = self.max_number_users
        elif (self.current_number_users < self.min_number_users):
            self.current_number_users = self.min_number_users
        # Updating the rate of data
        self.current_rate_data += np.random.randint(-self.max_update_data, self.max_update_data + 1)
        if (self.current_rate_data > self.max_rate_data):
            self.current_rate_data = self.max_rate_data
        elif (self.current_rate_data < self.min_rate_data):
            self.current_rate_data = self.min_rate_data
        # Computing the Delta of Intrinsic Temperature
        past_intrinsic_temperature = self.intrinsic_temperature
        self.intrinsic_temperature = self.atmospheric_temperature + 1.25 * self.current_number_users + 1.25 * self.current_rate_data
        delta_intrinsic_temperature = self.intrinsic_temperature - past_intrinsic_temperature
        # Computing the Delta of Temperature caused by the AI
        if (direction == -1):
            delta_temperature_ai = -energy_ai
        elif (direction == 1):
            delta_temperature_ai = energy_ai
        # Updating the new Server's Temperature when there is the AI
        self.temperature_ai += delta_intrinsic_temperature + delta_temperature_ai
        # Updating the new Server's Temperature when there is no AI
        self.temperature_noai += delta_intrinsic_temperature

Then, we update the self.game_over variable if needed, that is, if the temperature of the server goes outside the allowed range of [-20°C, 80°C]. This can happen if the server temperature drops below the minimum temperature of -20°C, or rises above the maximum temperature of 80°C. In training mode (self.train == 1), we simply set self.game_over to 1 so that the environment can be reset and a new training episode can start. In inference mode, we instead do two extra things: we bring the server's temperature back to the closest bound of the optimal temperature range, and, since doing this spends some energy, we add that energy to the total energy spent by the AI (self.total_energy_ai). That's exactly what is coded in the next code section:

        # GETTING GAME OVER
        
        if (self.temperature_ai < self.min_temperature):
            if (self.train == 1):
                self.game_over = 1
            else:
                self.total_energy_ai += self.optimal_temperature[0] - self.temperature_ai
                self.temperature_ai = self.optimal_temperature[0]
        elif (self.temperature_ai > self.max_temperature):
            if (self.train == 1):
                self.game_over = 1
            else:
                self.total_energy_ai += self.temperature_ai - self.optimal_temperature[1]
                self.temperature_ai = self.optimal_temperature[1]

Now, I know it seems unrealistic for the server to snap right back to 24 degrees from 80, or to 18 from -20, but this is an action the magically efficient integrated cooling system we defined earlier is perfectly capable of. Think of it as the AI switching to the integrated system for a moment in the case of a temperature disaster. Once again, this is an area that will benefit enormously from your ongoing tinkering once we've got the AI up and running; after that, you can play around with these figures as you like in the interests of a more realistic server model.

Then, we update the two scores coming from the two separate simulations, which are:

  1. self.total_energy_ai: The total energy spent by the AI
  2. self.total_energy_noai: The total energy spent by the server's integrated cooling system when there is no AI.

        # UPDATING THE SCORES
        
        # Updating the Total Energy spent by the AI
        self.total_energy_ai += energy_ai
        # Updating the Total Energy spent by the server's cooling system when there is no AI
        self.total_energy_noai += energy_noai

Then, to improve performance, we scale the next state by scaling each of its three elements (server temperature, number of users, and data transmission rate). To do so, we apply a simple min-max normalization, which consists of subtracting the minimum value of the variable and then dividing by the difference between its maximum and minimum values, so that each element ends up between 0 and 1:

        # SCALING THE NEXT STATE
        
        scaled_temperature_ai = (self.temperature_ai - self.min_temperature) / (self.max_temperature - self.min_temperature)
        scaled_number_users = (self.current_number_users - self.min_number_users) / (self.max_number_users - self.min_number_users)
        scaled_rate_data = (self.current_rate_data - self.min_rate_data) / (self.max_rate_data - self.min_rate_data)
        next_state = np.matrix([scaled_temperature_ai, scaled_number_users, scaled_rate_data])

Finally, we end this update_env method by returning the next state, the reward received, and whether the game is over or not:

        # RETURNING THE NEXT STATE, THE REWARD, AND GAME OVER
        
        return next_state, self.reward, self.game_over

Great! We're done with this long, but important, method that updates the environment at each time step (each minute). Now there are two final and very easy methods to go: one that resets the environment, and one that gives us three pieces of information at any time: the current state, the last reward received, and whether or not the game is over.

Here's the reset method, which resets the environment when a new training episode starts, by resetting all the variables of the environment to their originally initialized values:

    # MAKING A METHOD THAT RESETS THE ENVIRONMENT
    
    def reset(self, new_month):
        self.atmospheric_temperature = self.monthly_atmospheric_temperatures[new_month]
        self.initial_month = new_month
        self.current_number_users = self.initial_number_users
        self.current_rate_data = self.initial_rate_data
        self.intrinsic_temperature = self.atmospheric_temperature + 1.25 * self.current_number_users + 1.25 * self.current_rate_data
        self.temperature_ai = self.intrinsic_temperature
        self.temperature_noai = (self.optimal_temperature[0] + self.optimal_temperature[1]) / 2.0
        self.total_energy_ai = 0.0
        self.total_energy_noai = 0.0
        self.reward = 0.0
        self.game_over = 0
        self.train = 1

Finally, here's the observe method, which lets us know at any given time the current state, the last reward received, and whether the game is over:

    # MAKING A METHOD THAT GIVES US AT ANY TIME THE CURRENT STATE, THE LAST REWARD AND WHETHER THE GAME IS OVER
    
    def observe(self):
        scaled_temperature_ai = (self.temperature_ai - self.min_temperature) / (self.max_temperature - self.min_temperature)
        scaled_number_users = (self.current_number_users - self.min_number_users) / (self.max_number_users - self.min_number_users)
        scaled_rate_data = (self.current_rate_data - self.min_rate_data) / (self.max_rate_data - self.min_rate_data)
        current_state = np.matrix([scaled_temperature_ai, scaled_number_users, scaled_rate_data])
        return current_state, self.reward, self.game_over

Awesome! We're done with the first step of the implementation, building the environment. Now let's move on to the next step and start building the brain.

Step 2 – Building the brain

In this step, we're going to build the artificial brain of our AI, which is nothing other than a fully connected neural network. Here it is again:

Figure 3: The artificial brain of our AI

We'll build this artificial brain inside a class for the same reason as before, which is to allow us to create several artificial brains for different servers inside a data center. Maybe some servers will need artificial brains with different hyperparameters than others. That's why, thanks to this object-oriented structure, we can easily switch from one brain to another, to regulate the temperature of a new server that requires an AI with different neural network parameters. That's the beauty of Object-Oriented Programming (OOP).

We're building this artificial brain with the amazing Keras library. From this library, we use the Dense() class to create our two fully connected hidden layers, the first one with 64 hidden neurons, and the second one with 32 neurons. Remember, this is a classic neural network architecture often used by default, as common practice, and seen in many research papers. At the end, we use the Dense() class again to return the Q-values, which are the outputs of the artificial neural network.

Later on, when we code the training and testing files, we'll use the argmax method to select the action that has the maximum Q-value. Then, we assemble all the components of the brain, including the inputs and outputs, by creating it as an object of the Model() class (which is very useful in that we can save and load a model with specific weights). Finally, we'll compile it with a mean squared error loss and an Adam optimizer. I'll explain all this in more detail later.
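
As a quick preview of that argmax selection (a sketch only; the real version lives in the training and testing files), assuming model is the compiled Keras model and current_state is the scaled 1x3 state vector returned by the environment:

import numpy as np

def select_action(model, current_state):
    # The brain predicts one Q-value per action for the given state,
    # and we simply pick the action with the highest predicted Q-value.
    q_values = model.predict(current_state)   # shape (1, number_actions)
    return np.argmax(q_values[0])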

Here are the new steps of the general AI framework:

  • Step 2-1: Build the input layer, composed of the input states.
  • Step 2-2: Build a defined number of hidden layers with a defined number of neurons inside each layer, fully connected to the input layer and between each other.
  • Step 2-3: Build the output layer, fully connected to the last hidden layer.
  • Step 2-4: Assemble the full architecture inside a model object.
  • Step 2-5: Compile the model with a mean squared error loss function and a chosen optimizer.

The implementation of this is presented to you in a choice of two different files:

  1. brain_nodropout.py: An implementation file that builds the artificial brain without the dropout regularization technique (I'll explain what it is very soon).
  2. brain_dropout.py: An implementation file that builds the artificial brain with the dropout regularization technique.

First let me give you the implementation without dropout, and then I'll provide one with dropout and explain it.

Without dropout

Here is the full implementation of the artificial brain, without any dropout regularization technique:

# AI for Business - Minimize cost with Deep Q-Learning   #1
# Building the Brain without Dropout   #2
#3
# Importing the libraries   #4
from keras.layers import Input, Dense   #5
from keras.models import Model   #6
from keras.optimizers import Adam   #7
   #8
# BUILDING THE BRAIN   #9
   #10
class Brain(object):   #11
    #12
    # BUILDING A FULLY CONNECTED NEURAL NETWORK DIRECTLY INSIDE THE INIT METHOD   #13
    #14
    def __init__(self, learning_rate = 0.001, number_actions = 5):   #15
        self.learning_rate = learning_rate   #16
        #17
        # BUILDING THE INPUT LAYER COMPOSED OF THE INPUT STATE   #18
        states = Input(shape = (3,))   #19
        #20
        # BUILDING THE FULLY CONNECTED HIDDEN LAYERS   #21
        x = Dense(units = 64, activation = 'sigmoid')(states)   #22
        y = Dense(units = 32, activation = 'sigmoid')(x)   #23
        #24
        # BUILDING THE OUTPUT LAYER, FULLY CONNECTED TO THE LAST HIDDEN LAYER   #25
        q_values = Dense(units = number_actions, activation = 'softmax')(y)   #26
        #27
        # ASSEMBLING THE FULL ARCHITECTURE INSIDE A MODEL OBJECT   #28
        self.model = Model(inputs = states, outputs = q_values)   #29
        #30
        # COMPILING THE MODEL WITH A MEAN-SQUARED ERROR LOSS AND A CHOSEN OPTIMIZER   #31
        self.model.compile(loss = 'mse', optimizer = Adam(lr = learning_rate))   #32

Now, let's go through the code in detail.

Line 5: We import the Input and Dense classes from the layers module in the keras library. The Input class allows us to build the input layer, and the Dense class allows us to build the fully-connected layers.

Line 6: We import the Model class from the models module in the keras library. It allows us to build the whole neural network model by assembling its different layers.

Line 7: We import the Adam class from the optimizers module in the keras library. It allows us to use the Adam optimizer, used to update the weights of the neural network through stochastic gradient descent, when backpropagating the loss error in each iteration of the training.

Line 11: We introduce the Brain class, which will contain not only the whole architecture of the artificial neural network, but also the connection of the model to the loss (Mean-Squared Error) and the Adam optimizer.

Line 15: We introduce the __init__ method, which will be the only method of this class. We define the whole architecture of the neural network inside it, just by creating successive variables which together assemble the neural network. This method takes as inputs two arguments:

  1. The learning rate (learning_rate), which is a measure of how fast you want the neural network to learn (the higher the learning rate, the faster the neural network learns; but at the cost of quality). The default value is 0.001.
  2. The number of actions (number_actions), which is of course the number of actions that our AI can perform. Now you might be thinking: why do we need to put that as an argument? Well that's just in case you want to build another AI that can perform more or fewer actions. In which case you would simply need to change the value of the argument and that's it. Pretty practical, isn't it?

Line 16: We create an object variable for the learning rate, self.learning_rate, initialized as the value of the learning_rate argument provided in the __init__ method (therefore the argument of the Brain class when we create the object in the future).

Line 19: We create the input states layer, called states, as an object of the Input class. Into this Input class we enter one argument, shape = (3,), which simply tells that the input layer is a 1D vector composed of three elements (the server temperature, the number of users, and the data transmission rate).

Line 22: We create the first fully-connected hidden layer, called x, as an object of the Dense class, which takes as input two arguments:

  1. units: The number of hidden neurons we want to have in this first hidden layer. Here, we choose to have 64 hidden neurons.
  2. activation: The activation function used to pass on the signal when forward-propagating the inputs into this first hidden layer. Here we choose, by default, a sigmoid activation function, which is as follows:

Figure 4: The sigmoid activation function
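
For reference, here's the formula behind that curve, written as a tiny Python function:

import numpy as np

# The sigmoid activation squashes any real number into the (0, 1) interval.
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(0.0))  # 0.5
print(sigmoid(4.0))  # about 0.982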

The ReLU activation function would also have worked well here; I encourage you to experiment! Note also how the connection from the input layer to this first hidden layer is made by calling the states variable right after the Dense class.

Line 23: We create the second fully-connected hidden layer, called y, as an object of the Dense class, which takes as input the same two arguments:

  1. units: The number of hidden neurons we want to have in this second hidden layer. This time we choose to have 32 hidden neurons.
  2. activation: The activation function used to pass on the signal when forward-propagating the inputs into this second hidden layer. Here, again, we choose a sigmoid activation function.

Note once again how the connection from the first hidden layer to this second hidden layer is made by calling the x variable right after the Dense class.

Line 26: We create the output layer, called q_values, fully connected to the second hidden layer, as an object of the Dense class. This time, we input number_actions units, since the output layer contains the actions to play, and a softmax activation function, as seen in the deep Q-learning theory chapter.

Line 29: Using the Model class, we assemble the successive layers of the neural network, by just inputting the states as the inputs, and the q_values as the outputs.

Line 32: Using the compile method taken from the Model class, we connect our model to the Mean-Squared Error loss and the Adam optimizer. The latter takes the learning_rate argument as input.
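
Here's a quick usage sketch of this class, with the default hyperparameter values:

# Creating the artificial brain and grabbing the underlying Keras model.
brain = Brain(learning_rate = 0.001, number_actions = 5)
model = brain.model

model.summary()  # prints the 3 -> 64 -> 32 -> 5 architecture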

With dropout

It'll be valuable for you to add one more powerful technique to your toolkit: dropout.

Dropout is a regularization technique that prevents overfitting, which is the situation where the AI model performs well on the training set, but poorly on the test set. Dropout simply consists of deactivating a randomly selected portion of neurons during each step of forward- and back-propagation. That means not all the neurons learn the same way, which prevents the neural network from overfitting the training data.

Adding dropout is very easy with keras. You simply need to call the Dropout class right after the Dense class, and input the proportion of neurons you want to deactivate, like so:

# AI for Business - Minimize cost with Deep Q-Learning
# Building the Brain with Dropout
# Importing the libraries
from keras.layers import Input, Dense, Dropout
from keras.models import Model
from keras.optimizers import Adam
# BUILDING THE BRAIN
class Brain(object):
    
    # BUILDING A FULLY CONNECTED NEURAL NETWORK DIRECTLY INSIDE THE INIT METHOD
    
    def __init__(self, learning_rate = 0.001, number_actions = 5):
        self.learning_rate = learning_rate
        
        # BUILDING THE INPUT LAYER COMPOSED OF THE INPUT STATE
        states = Input(shape = (3,))
        
        # BUILDING THE FIRST FULLY CONNECTED HIDDEN LAYER WITH DROPOUT ACTIVATED
        x = Dense(units = 64, activation = 'sigmoid')(states)
        x = Dropout(rate = 0.1)(x)
        
        # BUILDING THE SECOND FULLY CONNECTED HIDDEN LAYER WITH DROPOUT ACTIVATED
        y = Dense(units = 32, activation = 'sigmoid')(x)
        y = Dropout(rate = 0.1)(y)
        
        # BUILDING THE OUTPUT LAYER, FULLY CONNECTED TO THE LAST HIDDEN LAYER
        q_values = Dense(units = number_actions, activation = 'softmax')(y)
        
        # ASSEMBLING THE FULL ARCHITECTURE INSIDE A MODEL OBJECT
        self.model = Model(inputs = states, outputs = q_values)
        
        # COMPILING THE MODEL WITH A MEAN-SQUARED ERROR LOSS AND A CHOSEN OPTIMIZER
        self.model.compile(loss = 'mse', optimizer = Adam(lr = learning_rate))

Here, we apply dropout to the first and second fully-connected layers, by deactivating 10% of their neurons each. Now, let's move on to the next step of our general AI framework: Step 3 – Implementing the deep reinforcement learning algorithm.

Step 3 – Implementing the deep reinforcement learning algorithm

In this new implementation (given in the dqn.py file), we simply have to follow the deep Q-learning algorithm provided before. Hence, this implementation consists of the following sub-steps, which are part of the general AI framework:

  • Step 3-1: Introduce and initialize all the parameters and variables of the deep Q-learning model.
  • Step 3-2: Make a method that builds the memory in experience replay.
  • Step 3-3: Make a method that builds and returns two batches of 10 inputs and 10 targets.

First, have a look at the whole code, and then I'll explain it line by line:

# AI for Business - Minimize cost with Deep Q-Learning   #1
# Implementing Deep Q-Learning with Experience Replay   #2
#3
# Importing the libraries   #4
import numpy as np   #5
#6
# IMPLEMENTING DEEP Q-LEARNING WITH EXPERIENCE REPLAY   #7
#8
class DQN(object):   #9
    #10
    # INTRODUCING AND INITIALIZING ALL THE PARAMETERS AND VARIABLES OF THE DQN   #11
    def __init__(self, max_memory = 100, discount = 0.9):   #12
        self.memory = list()   #13
        self.max_memory = max_memory   #14
        self.discount = discount   #15
#16
    # MAKING A METHOD THAT BUILDS THE MEMORY IN EXPERIENCE REPLAY   #17
    def remember(self, transition, game_over):   #18
        self.memory.append([transition, game_over])   #19
        if len(self.memory) > self.max_memory:   #20
            del self.memory[0]   #21
#22
    # MAKING A METHOD THAT BUILDS TWO BATCHES OF INPUTS AND TARGETS BY EXTRACTING TRANSITIONS FROM THE MEMORY   #23
    def get_batch(self, model, batch_size = 10):   #24
        len_memory = len(self.memory)   #25
        num_inputs = self.memory[0][0][0].shape[1]   #26
        num_outputs = model.output_shape[-1]   #27
        inputs = np.zeros((min(len_memory, batch_size), num_inputs))   #28
        targets = np.zeros((min(len_memory, batch_size), num_outputs))   #29
        for i, idx in enumerate(np.random.randint(0, len_memory, size = min(len_memory, batch_size))):   #30
            current_state, action, reward, next_state = self.memory[idx][0]   #31
            game_over = self.memory[idx][1]   #32
            inputs[i] = current_state   #33
            targets[i] = model.predict(current_state)[0]   #34
            Q_sa = np.max(model.predict(next_state)[0])   #35
            if game_over:   #36
                targets[i, action] = reward   #37
            else:   #38
                targets[i, action] = reward + self.discount * Q_sa   #39
        return inputs, targets   #40

Line 5: We import the numpy library, because we'll be working with numpy arrays.

Line 9: We introduce the DQN class (DQN stands for Deep Q-Network), which contains the main parts of the deep Q-Learning algorithm, including experience replay.

Line 12: We introduce the __init__ method, which creates the three following object variables of the DQN model: the experience replay memory, the capacity (maximum size of the memory), and the discount factor in the formula of the target. It takes as arguments max_memory (the capacity) and discount (the discount factor), in case we want to build other experience replay memories with different capacities, or if we want to change the value of the discount factor in the computation of the target. The default values of these arguments are respectively 100 and 0.9, which were chosen arbitrarily and turned out to work quite well; these are good arguments to experiment with, to see what difference it makes when you set them differently.

Line 13: We create the experience replay memory object variable, self.memory, and we initialize it as an empty list.

Line 14: We create the object variable for the memory capacity, self.max_memory, and we initialize it as the value of the max_memory argument.

Line 15: We create the object variable for the discount factor, self.discount, and we initialize it as the value of the discount argument.

Line 18: We introduce the remember method, which takes as input a transition to be added to the memory, and game_over, which states whether or not this transition leads the server's temperature to go outside of the allowed range of temperatures.

Line 19: Using the append function called from the memory list, we add the transition with the game_over boolean into the memory (in the last position).

Line 20: If, after adding this transition, the size of the memory exceeds the memory capacity (self.max_memory)...

Line 21: ...we delete the first (and therefore oldest) element of the memory.

Line 24: We introduce the get_batch method, which takes as inputs the model we built in the previous Python file (model) and a batch size (batch_size), and builds two batches of inputs and targets by extracting 10 transitions from the memory (if the batch size is 10).

Line 25: We get the current number of elements in the memory and put it into a new variable, len_memory.

Line 26: We get the number of elements in the input state vector (which is 3), but instead of directly entering 3, we access this number from the shape attribute of the input state vector element of the memory, which we get by taking the [0][0][0] indexes. Each element of the memory is structured as follows:

[[current_state, action, reward, next_state], game_over]

Thus in [0][0][0], the first [0] corresponds to the first element of the memory (meaning the first transition), the second [0] corresponds to the tuple [current_state, action, reward, next_state], and so the third [0] corresponds to the current_state element of that tuple. Hence, self.memory[0][0][0] corresponds to the first current state, and by adding .shape[1] we get the number of elements in that input state vector. You might be wondering why we didn't enter 3 directly; that's because we want to generalize this code to any input state vector dimension you might want to have in your environment. For example, you might want to consider an input state with more information about your server, such as the humidity. Thanks to this line of code, you won't have to change anything regarding your new number of state elements.
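
To make the indexing concrete, here's a tiny sketch (with a made-up 3-element state, purely for illustration) showing how the [0][0][0] chain and .shape[1] recover the number of state elements:

import numpy as np

# A made-up transition: the states are 2D numpy arrays of shape (1, 3),
# which is the format the environment produces
current_state = np.array([[0.5, 0.2, 0.1]])
next_state = np.array([[0.6, 0.2, 0.1]])
memory = [[[current_state, 2, -0.3, next_state], 0]]   # one memory element

print(memory[0][0][0])           # the first current state, shape (1, 3)
print(memory[0][0][0].shape[1])  # 3, the number of state elements
print(memory[0][1])              # the game_over value of that transition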

Line 27: We get the number of elements of the model output, meaning the number of actions. Just like on the previous line, instead of directly entering 5, we generalize by accessing this number from the output_shape attribute of our model object of the Model class. The -1 index means we take the last element of that shape, which is where the number of actions is stored.

Line 28: We introduce and initialize the batch of inputs as a numpy array, of batch_size = 10 rows and 3 columns corresponding to input state elements, with only zeros. If the memory doesn't have 10 transitions yet, the number of rows will just be the length of the memory.

If the memory already has at least 10 transitions, what we get with this line of code is the following:

Figure 5: Batch of inputs (1/2)

Line 29: We introduce and initialize the batch of targets as a numpy array of batch_size = 10 rows and 5 columns corresponding to the five possible actions, with only zeros. Just like before, if the memory doesn't have 10 transitions yet, the number of rows will just be the length of the memory. If the memory already has at least 10 transitions, what we get with this line of code is the following:

Figure 6: Batch of targets (1/3)

Line 30: We do a double iteration inside the same for loop. The first iterative variable i goes from 0 to the batch size (or up to len_memory if len_memory < batch_size):

i = 0, 1, 2, 3, 4, 5, 6, 7, 8, 9

That way, i will iterate each element of the batch. The second iterative variable idx takes 10 random indexes of the memory, in order to extract 10 random transitions from the memory. Inside the for loop, we populate the two batches of inputs and targets with their right values by iterating through each of their elements.
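
Here's that double iteration in isolation, as a tiny sketch with made-up sizes (a memory of 50 transitions and a batch size of 10):

import numpy as np

len_memory, batch_size = 50, 10   # made-up values, just for illustration
for i, idx in enumerate(np.random.randint(0, len_memory, size = min(len_memory, batch_size))):
    # i runs over the rows of the batch (0 to 9), while idx is a random position in the memory
    print(i, idx)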

Line 31: We get the transition of the sampled index idx from the memory, composed of the current state, the action, the reward, and the next state. The reason we add [0] is because an element of the memory is structured as follows:

[[current_state, action, reward, next_state], game_over]

We'll get the game_over value separately, in the next line of code.

Line 32: We get the game_over value corresponding to that same index idx of the memory. As you can see, this time we add [1] on the end to get the second element of a memory element:

[[current_state, action, reward, next_state], game_over]

Line 33: We populate the batch of inputs with all the current states, in order to get this at the end of the for loop:

Figure 7: Batch of inputs (2/2)

Line 34: Now we start populating the batch of targets with the right values. First, we populate it with all the Q-values that the model predicts for the different state-action pairs: (current state, action 0), (current state, action 1), (current state, action 2), (current state, action 3), and (current state, action 4). Thus we first get this (at the end of the for loop):

Figure 8: Batch of targets (2/3)

Remember that for the action that is played, the formula of the target must be this one:

target = reward + discount * max_a Q(next_state, a)

What we do in the following lines of code is to put this formula into the column of each action that was played within the 10 selected transitions. In other words, we get this:

Figure 9: Batch of targets (3/3)

In that example, Action 1 was performed in the first transition (Target 1), Action 3 was performed in the second transition (Target 2), Action 0 was performed in the third transition (Target 3), and so on. Let's populate this in the following lines of code.

Line 35: We first get the max_a Q(next_state, a) part of the formula of the target, that is, the maximum of the Q-values that the model predicts for the next state.

Line 36: We check if game_over = 1, meaning that the server's temperature has gone outside the allowed range. If it has, there's actually no next state (because we basically reset the environment by putting the server's temperature back into the optimal range, so we start from a new state), and therefore we shouldn't consider the discount * max_a Q(next_state, a) term.

Line 37: In that case, we only keep the reward part of the target.

Line 38: However, if the game is not over (game_over = 0)...

Line 39: We keep the whole formula of the target, but of course only for the action that was performed, meaning here: targets[i, action] = reward + self.discount * Q_sa, which is exactly reward + discount * max_a Q(next_state, a).

Hence, we get the following batch of targets, as you saw earlier:

Figure 10: Batch of targets (3/3)

Line 40: At last, we return the final batches of inputs and targets.
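
Before moving on, here's a minimal usage sketch of the DQN class, assuming the Brain class from Step 2 is saved in brain_nodropout.py and the class above in dqn.py (the filenames we'll use in the training file), with made-up transitions just to check the shapes:

import numpy as np
from brain_nodropout import Brain
from dqn import DQN

brain = Brain(learning_rate = 0.001, number_actions = 5)
dqn = DQN(max_memory = 100, discount = 0.9)

# Storing a few made-up transitions: [current_state, action, reward, next_state], game_over
for _ in range(20):
    current_state = np.random.rand(1, 3)
    next_state = np.random.rand(1, 3)
    action = np.random.randint(0, 5)
    dqn.remember([current_state, action, -1.0, next_state], False)

# Extracting two batches of 10 inputs and 10 targets, ready for model.train_on_batch
inputs, targets = dqn.get_batch(brain.model, batch_size = 10)
print(inputs.shape, targets.shape)   # (10, 3) (10, 5)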

That was epic—you've successfully created an artificial brain. Now that you've done it, we're ready to start the training.

Step 4: Training the AI

Now that our AI has a fully functional brain, it's time to train it. That's exactly what we do in this fourth Python implementation. You actually have a choice of two files to use for this:

  1. training_noearlystopping.py, which trains your AI over a full 100 epochs of 5-month periods.
  2. training_earlystopping.py, which also trains your AI over 100 epochs, but which can stop the training early if the performance no longer improves over the iterations. This technique is called early stopping.

Both these implementations are long, but very simple. We start by setting all the parameters, then we build the environment by creating an object of the Environment() class, then we build the brain of the AI by creating an object of the Brain() class, then we build the deep Q-learning model by creating an object of the DQN() class, and finally we launch the training, connecting all these objects together over 100 epochs of 5-month periods.

You'll notice in the training loop that we also do some exploration when performing the actions, performing some random actions from time to time. In our case, this happens 30% of the time, since we use an exploration parameter ε = 0.3, and we force the AI to perform a random action whenever we draw a random value between 0 and 1 that is below ε. The reason we do some exploration is that it improves the deep reinforcement learning process, as we discussed in Chapter 9, Going Pro with Artificial Brains – Deep Q-Learning; the reason we don't use Softmax in this project is simply to show you how to implement a different exploration method.
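
By the way, if you're curious what Softmax exploration would look like here instead, a minimal sketch could be the following (the temperature parameter tau is my own addition, not part of the project files); it samples the action from a softmax distribution over the predicted Q-values rather than flipping an ε coin:

import numpy as np

def softmax_action(q_values, tau = 1.0):
    # q_values: the 5 Q-values predicted by the model; lower tau means greedier sampling
    preferences = q_values / tau
    probabilities = np.exp(preferences - np.max(preferences))   # subtracting the max for numerical stability
    probabilities = probabilities / np.sum(probabilities)
    return np.random.choice(len(q_values), p = probabilities)

# Hypothetical usage inside the training loop:
# q_values = model.predict(current_state)[0]
# action = softmax_action(q_values, tau = 0.5)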

Later, you'll be introduced to another little improvement in the training_earlystopping.py file, where we use an early stopping technique which stops the training early if there's no improvement in the performance.

Let's highlight the new steps which still belong to our general AI framework/Blueprint:

  • Step 4-1: Building the environment by creating an object of the Environment class.
  • Step 4-2: Building the artificial brain by creating an object of the Brain class.
  • Step 4-3: Building the DQN model by creating an object of the DQN class.
  • Step 4-4: Selecting the training mode.
  • Step 4-5: Starting the training with a for loop over 100 epochs of 5-month periods.
  • Step 4-6: During each epoch we repeat the whole deep Q-learning process, while also doing some exploration 30% of the time.

No early stopping

Ready to implement this? Maybe get a good coffee or tea first because this is going to be a bit long (88 lines of code, but easy ones!). We'll start without early stopping and then at the end I'll explain how to add the early stopping technique. The file to follow along with is training_noearlystopping.py. Since this is pretty long, let's do it section by section this time, starting with the first one:

# AI for Business - Minimize cost with Deep Q-Learning   #1
# Training the AI without Early Stopping   #2
#3
# Importing the libraries and the other python files   #4
import os   #5
import numpy as np   #6
import random as rn   #7
import environment   #8
import brain_nodropout   #9
import dqn   #10

Line 5: We import the os library, which will be used to set a seed for reproducibility so that if you run the training several times, you'll get the same result each time. You can, of course, choose to remove this when you tinker with the code yourself!

Line 6: We import the numpy library, since we'll work with numpy arrays.

Line 7: We import the random library, which we'll use to do some exploration.

Line 8: We import the environment.py file, implemented in Step 1, which contains the whole defined environment.

Line 9: We import the brain_nodropout.py file, our artificial brain without dropout that we implemented in Step 2. This contains the whole neural network of our AI.

Line 10: We import the dqn.py file implemented in Step 3, which contains the main parts of the deep Q-learning algorithm, including experience replay.

Moving on to the next section:

# Setting seeds for reproducibility   #12
os.environ['PYTHONHASHSEED'] = '0'   #13
np.random.seed(42)   #14
rn.seed(12345)   #15
#16
# SETTING THE PARAMETERS   #17
epsilon = .3   #18
number_actions = 5   #19
direction_boundary = (number_actions - 1) / 2   #20
number_epochs = 100   #21
max_memory = 3000   #22
batch_size = 512   #23
temperature_step = 1.5   #24
#25
# BUILDING THE ENVIRONMENT BY SIMPLY CREATING AN OBJECT OF THE ENVIRONMENT CLASS   #26
env = environment.Environment(optimal_temperature = (18.0, 24.0), initial_month = 0, initial_number_users = 20, initial_rate_data = 30)   #27
#28
# BUILDING THE BRAIN BY SIMPLY CREATING AN OBJECT OF THE BRAIN CLASS   #29
brain = brain_nodropout.Brain(learning_rate = 0.00001, number_actions = number_actions)   #30
#31
# BUILDING THE DQN MODEL BY SIMPLY CREATING AN OBJECT OF THE DQN CLASS   #32
dqn = dqn.DQN(max_memory = max_memory, discount = 0.9)   #33
#34
# CHOOSING THE MODE   #35
train = True   #36

Lines 13, 14, and 15: We set seeds for reproducibility, so you get the same results across several runs of the training. This only matters if you want to reproduce your findings; if you don't need that, you can simply remove these lines.

Line 18: We introduce the exploration factor ε and we set it to 0.3, meaning that there will be 30% exploration (performing random actions) versus 70% exploitation (performing the actions chosen by the AI).

Line 19: We set the number of actions to 5.

Line 20: We set the direction boundary, meaning the action index below which we cool down the server, and above which we heat up the server. Since actions 0 and 1 cool down the server, and actions 3 and 4 heat up the server, that direction boundary is (5-1)/2 = 2, which corresponds to the action that transfers no heat to the server (action 2).

Line 21: We set the number of training epochs to 100.

Line 22: We set the memory capacity, meaning its maximum size, to 3000.

Line 23: We set the batch size to 512.

Line 24: We introduce the temperature step, meaning the unit of temperature change that the AI causes in the server when playing actions 0, 1, 3, or 4; that's of course 1.5°C.

Line 27: We create the environment object, as an instance of the Environment class which we call from the environment file. Inside this Environment class, we enter all the arguments of the init method:

optimal_temperature = (18.0, 24.0),
initial_month = 0,
initial_number_users = 20,
initial_rate_data = 30

Line 30: We create the brain object as an instance of the Brain class, which we call from the brain_nodropout file. Inside this Brain class, we enter all the arguments of the init method:

learning_rate = 0.00001,
number_actions = number_actions

Line 33: We create the dqn object as an instance of the DQN class, which we call from the dqn file. Inside this DQN class we enter all the arguments of the init method:

max_memory = max_memory,
discount = 0.9

Line 36: We set the training mode to True, because the next code section will contain the big for loop that performs all the training.

All good so far? Don't forget to take a break or a step back by reading the previous paragraphs again anytime you feel a bit overwhelmed or lost.

Now let's begin the big training loop; that's the last code section of this file:

# TRAINING THE AI   #38
env.train = train   #39
model = brain.model   #40
if (env.train):   #41
    # STARTING THE LOOP OVER ALL THE EPOCHS (1 Epoch = 5 Months)   #42
    for epoch in range(1, number_epochs):   #43
        # INITIALIAZING ALL THE VARIABLES OF BOTH THE ENVIRONMENT AND THE TRAINING LOOP   #44
        total_reward = 0   #45
        loss = 0.   #46
        new_month = np.random.randint(0, 12)   #47
        env.reset(new_month = new_month)   #48
        game_over = False   #49
        current_state, _, _ = env.observe()   #50
        timestep = 0   #51
        # STARTING THE LOOP OVER ALL THE TIMESTEPS (1 Timestep = 1 Minute) IN ONE EPOCH   #52
        while ((not game_over) and timestep <= 5 * 30 * 24 * 60):   #53
            # PLAYING THE NEXT ACTION BY EXPLORATION   #54
            if np.random.rand() <= epsilon:   #55
                action = np.random.randint(0, number_actions)   #56
                if (action - direction_boundary < 0):   #57
                    direction = -1   #58
                else:   #59
                    direction = 1   #60
                energy_ai = abs(action - direction_boundary) * temperature_step   #61
            # PLAYING THE NEXT ACTION BY INFERENCE   #62
            else:   #63
                q_values = model.predict(current_state)   #64
                action = np.argmax(q_values[0])   #65
                if (action - direction_boundary < 0):   #66
                    direction = -1   #67
                else:   #68
                    direction = 1   #69
                energy_ai = abs(action - direction_boundary) * temperature_step   #70
            # UPDATING THE ENVIRONMENT AND REACHING THE NEXT STATE   #71
            next_state, reward, game_over = env.update_env(direction, energy_ai, ( new_month + int(timestep/(30*24*60)) ) % 12)   #72
            total_reward += reward   #73
            # STORING THIS NEW TRANSITION INTO THE MEMORY   #74
            dqn.remember([current_state, action, reward, next_state], game_over)   #75
            # GATHERING IN TWO SEPARATE BATCHES THE INPUTS AND THE TARGETS   #76
            inputs, targets = dqn.get_batch(model, batch_size = batch_size)   #77
            # COMPUTING THE LOSS OVER THE TWO WHOLE BATCHES OF INPUTS AND TARGETS   #78
            loss += model.train_on_batch(inputs, targets)   #79
            timestep += 1   #80
            current_state = next_state   #81
        # PRINTING THE TRAINING RESULTS FOR EACH EPOCH   #82
        print("
")   #83
        print("Epoch: {:03d}/{:03d}".format(epoch, number_epochs))   #84
        print("Total Energy spent with an AI: {:.0f}".format(env.total_energy_ai))   #85
        print("Total Energy spent with no AI: {:.0f}".format(env.total_energy_noai))   #86
        # SAVING THE MODEL   #87
        model.save("model.h5")   #88

Line 39: We set the env.train object variable (this is a variable of our environment object) to the value of the train variable entered just before, which is of course equal to True, meaning we are indeed in training mode.

Line 40: We get the model from our brain object. This model contains the whole architecture of the neural network, plus its optimizer. It also has extra practical tools, like for example the save and load methods, which will allow us respectively to save the weights after the training or load them anytime in the future.
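
As a reminder of how those two tools work, here's a tiny self-contained sketch with a stand-in model (the real model is the one returned by our Brain class):

from keras.layers import Input, Dense
from keras.models import Model, load_model

# A small stand-in model, just to illustrate saving and loading
states = Input(shape = (3,))
q_values = Dense(units = 5, activation = 'softmax')(states)
stand_in_model = Model(inputs = states, outputs = q_values)
stand_in_model.compile(loss = 'mse', optimizer = 'adam')

stand_in_model.save("model.h5")        # writes the architecture, weights, and optimizer state to a file
same_model = load_model("model.h5")    # restores everything, ready to call predict on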

Line 41: If we are in training mode…

Line 43: We start the main training for loop, iterating the training epochs from 1 up to number_epochs. (Note that range(1, number_epochs) actually stops at epoch 99; if you want exactly 100 epochs, you can use range(1, number_epochs + 1).)

Line 45: We set the total reward (total reward accumulated over the training iterations) to 0.

Line 46: We set the loss to 0 (0 because the loss will be a float).

Line 47: We set the starting month of the training, called new_month, to a random integer between 0 and 11. For example, if the random integer is 2, we start the training in March.

Line 48: By calling the reset method from our env object of the Environment class built in Step 1, we reset the environment starting from that new_month.

Line 49: We set the game_over variable to False, because we're starting in the allowed range of server temperatures.

Line 50: By calling the observe method from our env object of the Environment class built in Step 1, we get the current state only, which is our starting state.

Line 51: We set the first timestep to 0. This is the first minute of the training.

Line 53: We start the while loop that will iterate all the timesteps (minutes) for the whole period of the epoch, which is 5 months. Therefore, we iterate through 5 * 30 * 24 * 60 minutes; that is, 216,000 timesteps.

If, however, during those timesteps we go outside the allowed range of server temperatures (that is, if game_over = 1), then we stop the epoch and we start a new one.

Lines 55 to 61 make sure the AI performs a random action 30% of the time. This is exploration. The trick to it in this case is to sample a random number between 0 and 1, and if this random number is between 0 and 0.3, the AI performs a random action. That means the AI will perform a random action 30% of the time, because this sampled number has a 30% chance to be between 0 and 0.3.

Line 55: If a sampled number between 0 and 1 is below ε (that is, below 0.3)...

Line 56: ... we play a random action index from 0 to 4.

Line 57: Now that we've just performed an action, we compute the direction and the energy spent; remember that they are the required arguments of the update_env method of the Environment class, which we'll call later to update the environment. The AI distinguishes between two cases by checking whether the action is below or above the direction boundary of 2. If the action is below the direction boundary of 2, meaning the AI cools down the server...

Line 58: ...then the heating direction is equal to -1 (cooling down).

Lines 59 and 60: Otherwise, the heating direction is equal to +1 (heating up).

Line 61: We compute the energy spent by the AI onto the server, which according to Assumption 2 is:

|action - direction_boundary| * temperature_step = |action - 2| * 1.5 Joules

For example, if the action is 4, then the AI heats up the server by 3°C, and so according to Assumption 2 the energy spent is 3 Joules. And we check indeed that |4-2|*1.5 = 3.
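
If it helps, here's the same mapping wrapped in a small helper function of my own (not part of the project files), applied to all five actions:

def action_to_direction_energy(action, direction_boundary = 2, temperature_step = 1.5):
    # Returns the heating direction (-1 or +1) and the energy spent in joules
    direction = -1 if action - direction_boundary < 0 else 1
    energy_ai = abs(action - direction_boundary) * temperature_step
    return direction, energy_ai

for action in range(5):
    print(action, action_to_direction_energy(action))
# 0 -> (-1, 3.0), 1 -> (-1, 1.5), 2 -> (1, 0.0), 3 -> (1, 1.5), 4 -> (1, 3.0)
# Note that action 2 gets direction +1 but spends 0 joules, exactly like in the code above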

Line 63: Now we play the actions by inference, meaning directly from our AI's predictions. The inference starts from the else statement, which corresponds to the if statement of line 55. This else corresponds to the situation where the sampled number is between 0.3 and 1, which happens 70% of the time.

Line 64: By calling the predict method from our model object (predict is a pre-built method of the Model class), we get the five predicted Q-values from our AI model.

Line 65: Using the argmax function from numpy, we select the action that has the maximum Q-value among the five predicted ones at Line 64.

Lines 66 to 70: We do exactly the same as in Lines 57 to 61, but this time with the action performed by inference.

Line 72: Now we have everything ready to update the environment. We call the big update_env method made in the Environment class of Step 1, by inputting the heating direction, the energy spent by the AI, and the month we're in at that specific timestep of the while loop. We get in return the next state, the reward received, and whether the game is over (that is, whether or not we went outside the optimal range of server temperatures).

Line 73: We add this last reward received to the total reward.

Line 75: By calling the remember method from our dqn object of the DQN class built in Step 3, we store the new transition [[current_state, action, reward, next_state], game_over] into the memory.

Line 77: By calling the get_batch method from our dqn object of the DQN class built in Step 3, we create two separate batches of inputs and targets, each with 512 elements (since batch_size = 512), or fewer while the memory still contains fewer than 512 transitions.

Line 79: By calling the train_on_batch method from our model object (train_on_batch is a pre-built method of the Model class), we compute the loss error between the predictions and the targets over the whole batch. As a reminder, this loss error is the mean-squared error loss. Then in this same line, we add this loss error to the total loss of the epoch, in case we want to check how this total loss evolves over the epochs during the training.

Line 80: We increment the timestep.

Line 81: We update the current state, which becomes the new state reached.

Line 83: We print a new line to separate out the training results so we can look them over easily.

Line 84: We print the epoch reached (the one we are in at this specific moment of the main training for loop).

Line 85: We print the total energy spent by the AI over that specific epoch (the one we are in at this specific moment of the main training for loop).

Line 86: We print the total energy spent by the server's integrated cooling system over that same specific epoch.

Line 88: We save the model's weights at the end of the training, in order to load them in the future, anytime we want to use our pre-trained model to regulate a server's temperature.

That's it for training our AI without early stopping; now let's have a look at what you'd need to change to implement it.

Early stopping

Now open the training_earlystopping.py file. Compare it to the previous file; all the lines of code from 1 to 40 are the same. Then, in the last code section, TRAINING THE AI, we have the same process, to which is added the early stopping technique. As a reminder, it consists of stopping the training if there's no more improvement of the performance, which could be assessed two different ways:

  1. If the total reward of an epoch no longer increases much over the epochs.
  2. If the total loss of an epoch no longer decreases much over the epochs.

Let's see how we do this.

First, we introduce four new variables just before the main training for loop:

# TRAINING THE AI   #38
env.train = train   #39
model = brain.model   #40
early_stopping = True   #41
patience = 10   #42
best_total_reward = -np.inf   #43
patience_count = 0   #44
if (env.train):   #45
    # STARTING THE LOOP OVER ALL THE EPOCHS (1 Epoch = 5 Months)   #46
    for epoch in range(1, number_epochs):   #47

Line 41: We introduce a new variable, early_stopping, which is set equal to True if we decide to activate the early stopping technique, meaning if we decide to stop the training when the performance no longer improves.

Line 42: We introduce a new variable, patience, which is the number of epochs we wait without performance improvement before stopping the training. Here we choose a patience of 10 epochs, which means that if the best total reward of an epoch doesn't increase during the next 10 epochs, we will stop the training.

Line 43: We introduce a new variable, best_total_reward, which is the best total reward recorded over a full epoch. If we don't beat that best total reward before 10 epochs go by, the training stops. It's initialized to -np.inf, which represents -infinity. That's just a trick to say that nothing can be lower than that best total reward at the beginning. Then as soon as we get the first total reward over the first epoch, best_total_reward becomes that first total reward.

Line 44: We introduce a new variable, patience_count, which is a counter starting from 0, and is incremented by 1 each time the total reward of an epoch doesn't beat the best total reward. If patience_count reaches 10 (the patience), we stop the training. And if one epoch beats the best total reward, patience_count is reset to 0.

Then, the main training for loop is the same as before, but just before saving the model we add the following:

        # EARLY STOPPING   #91
        if (early_stopping):   #92
            if (total_reward <= best_total_reward):   #93
                patience_count += 1   #94
            elif (total_reward > best_total_reward):   #95
                best_total_reward = total_reward   #96
                patience_count = 0   #97
            if (patience_count >= patience):   #98
                print("Early Stopping")   #99
                break   #100
        # SAVING THE MODEL   #101
        model.save("model.h5")   #102

Line 92: If the early_stopping variable is True, meaning if the early stopping technique is activated…

Line 93: And if the total reward of the current epoch (we are still in the main training for loop that iterates the epochs) is lower than the best total reward of an epoch obtained so far…

Line 94: ...we increment the patience_count variable by 1.

Line 95: However, if the total reward of the current epoch is higher than the best total reward of an epoch obtained so far…

Line 96: ...we update the best total reward, which becomes that new total reward of the current epoch.

Line 97: ...and we reset the patience_count variable to 0.

Line 98: Then, in a new if condition, we check whether the patience_count variable has reached the patience of 10…

Line 99: ...we print Early Stopping,

Line 100: ...and we stop the main training for loop with a break statement.

That's the whole thing. Easy and intuitive, right? Now you know how to implement early stopping.
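
As an aside, if you preferred the second criterion from the list above (watching the total loss instead of the total reward), the same pattern applies with the comparison flipped. Here's a sketch of that variant, written as a small helper of my own:

import numpy as np

def check_early_stopping(loss, best_total_loss, patience_count, patience = 10):
    # Loss-based variant: returns (stop_training, best_total_loss, patience_count)
    if loss >= best_total_loss:
        patience_count += 1
    else:
        best_total_loss = loss
        patience_count = 0
    return patience_count >= patience, best_total_loss, patience_count

# Hypothetical usage inside the epoch loop, with best_total_loss initialized to np.inf:
# stop, best_total_loss, patience_count = check_early_stopping(loss, best_total_loss, patience_count)
# if stop:
#     print("Early Stopping")
#     break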

After executing the code (I'll explain how to run this in a bit), we'll already see some good performances from our AI during the training, spending less energy than the server's integrated cooling system most of the time. But that's only training; now we need to see if we get good performance from the AI on a new 1-year simulation. That's where our next and final Python file comes into play.

Step 5 – Testing the AI

Now we need to test the performance of our AI in a brand-new situation. To do so, we run a 1-year simulation in inference mode, meaning that there's no training happening at any time. Our AI only returns predictions over a full year of simulation. Then, thanks to our environment object, in the end we'll be able to see the total energy spent by the AI over the full year, as well as the total energy that would have been spent in the exact same year by the server's integrated cooling system. Finally, we compare these two total energies spent, by computing their relative difference (in %) which shows us precisely the total energy saved by the AI. Buckle up for the final results—we'll reveal them very soon!

In terms of the AI blueprint, for the testing implementation we have almost the same process as the training implementation, except that this time we don't need to create a brain object nor a DQN model object; and, of course, we won't run the deep Q-learning process over some training epochs. However, we do have to create a new environment object, and instead of creating a brain, we'll load our artificial brain with its pre-trained weights from the previous training that we executed in Step 4 – Training the AI. Let's take a look at the final sub-steps of this final part of the AI framework/Blueprint:

  • Step 5-1: Build a new environment by creating an object of the Environment class.
  • Step 5-2: Load the artificial brain with its pre-trained weights from the previous training.
  • Step 5-3: Choose the inference mode.
  • Step 5-4: Start the 1-year simulation.
  • Step 5-5: In each iteration (each minute), our AI only performs the action that results from its prediction, and no exploration or deep Q-learning training happens whatsoever.

The implementation is a piece of cake to understand. It's actually the same as the training file, except that:

  1. Instead of creating a brain object from the Brain class, we load the pre-trained weights resulting from the training.
  2. Instead of running a training loop over 100 epochs of 5-month periods, we run an inference loop over a single 12-month period. Inside this inference loop, you'll find exactly the same code as the inference part of the training for loop. You've got this!

Have a look at the full testing implementation in the following code:

# AI for Business - Minimize cost with Deep Q-Learning
# Testing the AI

# Installing Keras
# conda install -c conda-forge keras

# Importing the libraries and the other python files
import os
import numpy as np
import random as rn
from keras.models import load_model
import environment

# Setting seeds for reproducibility
os.environ['PYTHONHASHSEED'] = '0'
np.random.seed(42)
rn.seed(12345)

# SETTING THE PARAMETERS
number_actions = 5
direction_boundary = (number_actions - 1) / 2
temperature_step = 1.5

# BUILDING THE ENVIRONMENT BY SIMPLY CREATING AN OBJECT OF THE ENVIRONMENT CLASS
env = environment.Environment(optimal_temperature = (18.0, 24.0), initial_month = 0, initial_number_users = 20, initial_rate_data = 30)

# LOADING A PRE-TRAINED BRAIN
model = load_model("model.h5")

# CHOOSING THE MODE
train = False

# RUNNING A 1 YEAR SIMULATION IN INFERENCE MODE
env.train = train
current_state, _, _ = env.observe()
for timestep in range(0, 12 * 30 * 24 * 60):
    q_values = model.predict(current_state)
    action = np.argmax(q_values[0])
    if (action - direction_boundary < 0):
        direction = -1
    else:
        direction = 1
    energy_ai = abs(action - direction_boundary) * temperature_step
    next_state, reward, game_over = env.update_env(direction, energy_ai, int(timestep / (30 * 24 * 60)))
    current_state = next_state

# PRINTING THE RESULTS OF THE 1-YEAR SIMULATION
print("\n")
print("Total Energy spent with an AI: {:.0f}".format(env.total_energy_ai))
print("Total Energy spent with no AI: {:.0f}".format(env.total_energy_noai))
print("ENERGY SAVED: {:.0f} %".format((env.total_energy_noai - env.total_energy_ai) / env.total_energy_noai * 100))

Everything's more or less the same as before; we just removed the parts related to the training.

The demo

Given the different files we have, make sure to understand that there are four possible ways to run the program:

  1. Without dropout and without early stopping
  2. Without dropout and with early stopping
  3. With dropout and without early stopping
  4. With dropout and with early stopping

Then, for each of these four combinations, the way to run this is the same: we first execute the training file, and then the testing file. In this demo section, we'll execute the 4th option, with both dropout and early stopping.

Now how do we run this? We have two options: with or without Google Colab.

I'll explain how to do it on Google Colab, and I'll even give you a Google Colab file where you only have to hit the play button. For those of you who want to execute this without Colab, on your favorite Python IDE, or through the terminal, let me explain how it's done. It's easy; you just need to download the main repository from GitHub, then in your Python IDE set the right working directory folder, which is the Chapter 11 folder, and then run the following two files in this order:

  1. training_earlystopping.py, inside which you should make sure to import brain_dropout at line 9. This will execute the training, and you'll have to wait until that finishes (which will take about 10 minutes).
  2. testing.py, which will test the model on one full year of data.

Now, back to Google Colab. First, open a new Colaboratory file, and call it Deep Q-Learning for Business. Then add all your files from the Chapter 11 folder of GitHub into this Colaboratory file, right here:

Figure 11: Google Colab – Step 1

Unfortunately, it's not easy to add the different files manually. You can only do this by using the os library, which we won't bother with. Instead, copy-paste the five Python implementations in five different cells of our Colaboratory file, in the following order:

  1. A first cell containing the whole environment.py implementation.
  2. A second cell containing the whole brain_dropout.py implementation.
  3. A third cell containing the whole dqn.py implementation.
  4. A fourth cell containing the whole training_earlystopping.py implementation.
  5. And a last cell containing the whole testing.py implementation.

Here's what it looks like, after adding some snazzy titles:

Figure 12: Google Colab – Step 2

Figure 13: Google Colab – Step 3

Figure 14: Google Colab – Step 4

Figure 15: Google Colab – Step 5

Figure 16: Google Colab – Step 6

Now, before we execute each of these cells in order from one through five, we need to remove the import commands of the Python files. The reason is that, now that the implementations are in cells, they behave like a single Python program, so we don't have to import the interdependent files in each cell. First, remove the following three lines in the training file:

Figure 17: Google Colab – Step 7

After doing that, we end up with this:

Figure 18: Google Colab – Step 8

Then, since we removed these imports, we also have to remove the three filenames for the environment, the brain, and the dqn, when creating the objects:

First the environment:

Figure 19: Google Colab – Step 9

Then the brain:

Figure 20: Google Colab – Step 10

And finally the dqn:

Figure 21: Google Colab – Step 11

Now the training file's good to go. In the testing file, we just have to remove two things: first, the environment import at line 12:

Figure 22: Google Colab – Step 12

and the environment. prefix at line 25:

Figure 23: Google Colab – Step 13
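
In other words, after these edits the object-creation lines should look roughly like this (a sketch assuming the Environment, Brain, and DQN cells above have already been run, so the class names are available directly):

# In the training cell, without the module prefixes:
env = Environment(optimal_temperature = (18.0, 24.0), initial_month = 0, initial_number_users = 20, initial_rate_data = 30)
brain = Brain(learning_rate = 0.00001, number_actions = number_actions)
dqn = DQN(max_memory = max_memory, discount = 0.9)

# In the testing cell, the same idea for the environment:
env = Environment(optimal_temperature = (18.0, 24.0), initial_month = 0, initial_number_users = 20, initial_rate_data = 30)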

That's it; now you're all set! You're ready to hit the play button on each of the cells, from top to bottom.

First, execute the first cell. After executing it, no output is displayed. That's fine!

Then execute the second cell:

Using TensorFlow backend.

After executing it, you can see the output Using TensorFlow backend.

Then execute the third cell, after which no output is displayed.

Now it gets a bit exciting! You're about to execute the training, and follow the training performance in real time. Do this by executing the fourth cell. After executing it, the training launches, and you should see the following results:

Figure 24: The output

Don't worry about those warnings; everything's running the way it should. Since early stopping is activated, you'll reach the end of the training well before the 100 epochs, at the 15th epoch:

Figure 25: The output at the 15th epoch

Note that the pre-trained weights are saved in Files, in the model.h5 file:

Figure 26: The model.h5 file

The training results look promising: most of the time, the AI spends less energy than the server's integrated cooling system. Let's check that this is still the case with a full test, on a new year of simulation.

Execute the final cell. When it finishes running (which takes approximately 3 minutes), you'll see in the printed results the total energy saved by the AI…

Total Energy spent with an AI: 261985
Total Energy spent with no AI: 1978293
ENERGY SAVED: 87%

Total Energy saved by the AI = 87%
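
You can double-check that percentage from the two printed totals:

total_energy_ai = 261985
total_energy_noai = 1978293
energy_saved = (total_energy_noai - total_energy_ai) / total_energy_noai * 100
print("{:.0f} %".format(energy_saved))   # prints 87 %, since the exact value is about 86.8%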

That's a lot of energy saved! Google DeepMind achieved similarly impressive results in 2016. If you look up the results by searching "DeepMind reduces Google cooling bill," you'll see that the figure they achieved was 40%. Not bad! Of course, let's be critical: their data center environment is much more complex than our server environment and has many more parameters, so the two numbers aren't directly comparable; even with one of the most talented AI teams in the world, 40% was the result in that far harder setting.

Our environment's very simple, and if you dig into it (which I recommend you do) you'll likely find that the per-minute variations of users and data, and therefore of temperature, follow a uniform distribution centered on zero. Accordingly, the server's temperature usually stays around the optimal range. The AI understands that well, and so most of the time it chooses to take no action and cause no temperature change, consuming very little energy.
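
If you want to see where that comes from, here's a tiny sketch of my own, assuming the per-minute changes of users and data rate are drawn uniformly within the bounds of our environment (at most 5 users and 10 units of data rate per minute); you'll see the changes are centered on zero, so there's no systematic push toward overheating or overcooling:

import numpy as np

np.random.seed(0)
minutes = 30 * 24 * 60   # one month of minutes
user_changes = np.random.randint(-5, 5 + 1, size = minutes)
data_changes = np.random.randint(-10, 10 + 1, size = minutes)
print(user_changes.mean(), data_changes.mean())   # both close to 0: no drift in either direction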

I highly recommend that you play around with your server cooling model; make it as complex as you like, and try out different rewards to see if you can cause different behaviors.

Even though our environment is simple, you can be proud of your achievement. What matters is that you were able to build a deep Q-learning model for a real-world business problem. The environment itself is less important; what's most important is that you know how to connect a deep reinforcement learning model to an environment, and how to train the model inside.

Now, after your successes with the self-driving car plus this business application, you know how to do just that!

What we've built is excellent for our business client, as our AI will seriously reduce their costs. Remember that thanks to our object-oriented structure (working with classes and objects), we could very easily take the objects created in this implementation for one server and plug them into other servers, lowering the total energy consumption of a whole data center. That's how Google was able to save an enormous amount in energy-related costs, thanks to the DQN model built by DeepMind.

My heartiest congratulations to you for smashing this new application. You've just made huge progress with your AI skills.

Finally, here's the link to the Colaboratory file with this whole implementation as promised. You don't have to install anything, Keras and NumPy are already pre-installed (this is the beauty of Google Colab!):

https://colab.research.google.com/drive/1KGAoT7S60OC3UGHNnrr_FuN5Hcil0cHk

Before we finish this chapter and move onto the world of deep convolutional Q-learning, let me give you a useful recap of the whole general AI blueprint when building a deep reinforcement learning model.

Recap – The general AI framework/Blueprint

Let's recap the whole AI Blueprint, so that you can print it out and put it on your wall.

Step 1: Building the environment

  • Step 1-1: Introducing and initializing all the parameters and variables of the environment.
  • Step 1-2: Making a method that updates the environment right after the AI plays an action.
  • Step 1-3: Making a method that resets the environment.
  • Step 1-4: Making a method that gives us at any time the current state, the last reward obtained, and whether the game is over.

Step 2: Building the brain

  • Step 2-1: Building the input layer composed of the input states.
  • Step 2-2: Building the hidden layers with a chosen number of these layers and neurons inside each, fully connected to the input layer and between each other.
  • Step 2-3: Building the output layer, fully connected to the last hidden layer.
  • Step 2-4: Assembling the full architecture inside a model object.
  • Step 2-5: Compiling the model with a mean squared error loss function and a chosen optimizer (a good one is Adam).

Step 3: Implementing the deep reinforcement learning algorithm

  • Step 3-1: Introducing and initializing all the parameters and variables of the DQN model.
  • Step 3-2: Making a method that builds the memory in experience replay.
  • Step 3-3: Making a method that builds and returns two batches of 10 inputs and 10 targets.

Step 4: Training the AI

  • Step 4-1: Building the environment by creating an object of the Environment class built in Step 1.
  • Step 4-2: Building the artificial brain by creating an object of the Brain class built in Step 2.
  • Step 4-3: Building the DQN model by creating an object of the DQN class built in Step 3.
  • Step 4-4: Choosing the training mode.
  • Step 4-5: Starting the training with a for loop over a chosen number of epochs.
  • Step 4-6: During each epoch we repeat the whole deep Q-learning process, while also doing some exploration 30% of the time.

Step 5: Testing the AI

  • Step 5-1: Building a new environment by creating an object of the Environment class built in Step 1.
  • Step 5-2: Loading the artificial brain with its pre-trained weights from the previous training.
  • Step 5-3: Choosing the inference mode.
  • Step 5-4: Starting the simulation.
  • Step 5-5: At each iteration (each minute), our AI only plays the action that results from its prediction, and no exploration or deep Q-learning training is happening whatsoever.

Summary

In this chapter you re-applied deep Q-learning to a new business problem: finding the best strategy to cool down and heat up a server. Before defining the AI strategy, you had to make some assumptions about your environment, for example the way the temperature is calculated. As inputs to your ANN, you had information about the server at any given time, such as its temperature and rate of data transmission. As outputs, your AI predicted whether to cool down or heat up the server, and by how much. The reward was the energy saved with respect to the server's integrated cooling system. In the end, your AI was able to save 87% of the energy.
