© Puneet Mathur 2020
P. Mathur, IoT Machine Learning Applications in Telecom, Energy, and Agriculture, https://doi.org/10.1007/978-1-4842-5549-0_9

9. Agriculture Industry Case Study: Predicting a Cash Crop Yield

Puneet Mathur
Bangalore, Karnataka, India
 

This chapter covers an agriculture industry case study on predicting a cash crop yield. The case study gives you an idea of the challenges faced by a mid-sized agri-conglomerate trying to reach the next level and become as big as the vision of its founder. Crop yield matters a great deal to such organizations because they want to make the most of their land resources and earn the highest revenue possible. The main lesson of the case study is that a company’s vision should be tied to the machine learning operation it is undertaking; otherwise the operation will be a wasteful expenditure. This is what I tell most of my clients when they hire me to review how their current machine learning application helps them. Does it lower costs? If so, by how much? Does it increase revenue? Then by how much? The goal of the project has to be quantifiable; otherwise the project will not succeed and will leave the business owners and stakeholders dissatisfied. So read on…

Agriculture Industry Case Study Overview

An international agriculture conglomerate named Aystsaga Agro has investments in farming, fertilizer production, and tractors. The president, Tamio Polskab, is also the founder of the company. He has steered the company from its humble origins as a single North American farm inherited from his father to an agri-business that spans three continents: North America, Africa, and Asia. Agri-business is not an easy one, as it faces many challenges emerging from tariff wars and climate change. Worldwide, there is an agriculture crisis, with countries from South Asia to Africa trying to protect small farmers from the onslaught of such rapid changes. There is also one big silent change happening: the use of technology, such as robotic machines, to replace humans in farming. Agriculture conglomerates such as Aystsaga Agro, which are highly commercial in their operations, are embracing technology to see how they can adopt and benefit from it. Since agriculture is less profitable than other industries such as services, funding for such operations is hard to get due to low ROI. This is why such commercial companies focus on cash crops like corn, sugarcane, wheat, and soybeans. Sugarcane, sugar beets, and tomatoes fetch a better yield than other crops for Aystsaga Agro, as shown in Table 9-1.
Table 9-1

Global Yearly Production at Aystsaga Agro for Three Cash Crops

Aystsaga Agro 2017-2018 Worldwide Production    Tonnes       Total Land Size
Sugarcane                                       470817.99    5361.70
Sugar beet                                      503999.80    5361.70
Tomatoes                                        273446.70    5361.70

In 2013, Aystsaga Agro purchased big farms in Brazil and South Africa. The Brazil farms were in the region of São Paulo, and the South African farms were in the interior regions. Tamio, the founder of the company, succeeded in expanding the company’s farming operations by acquiring land and using it for commercial operations. His vision was to establish his company as a major agri-business in the world.

While drinking his morning green tea, Tamio looked at the yearly financial results his CFO sent him that morning for his review. See Table 9-2.
Table 9-2

Worldwide Production Numbers and Revenue Figures for Aystsaga Agro

2017-2018 Worldwide Production        Tonnes       Total Land Size    US$
Sugarcane                             470817.99    5361.70            241058809.3
Sugar beet                            503999.80    5361.70            25199990
Tomatoes                              178446.70    5361.70            180231167
Income from Agricultural Operation                                    446489966.3
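The figures in Table 9-2 can be cross-checked with a few lines of arithmetic. The sketch below is only an illustration: the dictionary layout and the helper name revenue_per_tonne are assumptions made for this example, while the numbers themselves are copied from the table.

```python
# Figures copied from Table 9-2: tonnes produced and revenue in US$ per crop.
# The dictionary layout is an assumption made for this illustration.
production = {
    "Sugarcane": {"tonnes": 470817.99, "revenue_usd": 241058809.3},
    "Sugar beet": {"tonnes": 503999.80, "revenue_usd": 25199990.0},
    "Tomatoes": {"tonnes": 178446.70, "revenue_usd": 180231167.0},
}


def revenue_per_tonne(row):
    """Return the revenue earned per tonne for one crop row."""
    return row["revenue_usd"] / row["tonnes"]


# Sum the revenue column and derive the per-tonne revenue of each crop.
total_revenue = sum(row["revenue_usd"] for row in production.values())
per_tonne = {crop: revenue_per_tonne(row) for crop, row in production.items()}

for crop, value in per_tonne.items():
    print(f"{crop}: {value:,.2f} US$ per tonne")
print(f"Income from agricultural operation: {total_revenue:,.1f} US$")
```

The sum reproduces the income line of the table, and the per-tonne values come out at roughly 512 US$ for sugarcane, 50 US$ for sugar beet, and 1,010 US$ for tomatoes.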

While he was looking at the numbers, there was a knock on the door of his plush South African office cabin. He raised his head to see his Chief Operations Officer, Glanzo, smiling. He signaled for him to come in and sit in front of him. As Glanzo sat down, Tamio rose and went to the right side of the window. He looked at the breathtaking view of the sea in the South African capital as he started speaking slowly.

The Problem

“Have you seen the numbers sent by Nambi this morning?” asked Tamio.

Glanzo nodded and said, “I think they are pretty impressive given that we have had several storms around our farms in Brazil and South Africa this year.”

“You don’t understand my vision, do you?” asked Tamio, looking straight at his right-hand man. “My vision is to achieve a turnover of USD 500 million and reach USD 1 billion in six years. I told the board about this in the last AGM. You were there too,” said the boss.

Glanzo said, raising his shoulders, “We were all there, but I thought you were saying that to please the investors and the board.”

“No, that was a goal that I have been working towards for the past 15 years, ever since this company went public. This is not just a vision or a goal; this is a challenge that I have thrown up to you all,” said Tamio decidedly.

Glanzo retorted, “Yes, I understand the need to grow; otherwise, we will be eaten by the big fish in the industry. But you must understand the challenges that our operations are facing today, and unless we find real solutions for them, we will not grow at the rate that you want it to happen.”

Tamio now had a frown on his face as he sat down and rested his head on his huge leather executive chair. He was listening intently as Glanzo spoke further. “The single thing that is preventing us from growing is our inability to predict the yield of crops on a given piece of land.” Glanzo continued, “Our agriculture operations are a major problem for us in regions where we made a bad decision in buying infertile, low-yielding land. Low-yielding land requires us to rectify it by applying different chemicals across hectares of land, which increases the cost of production and eats into our profitability. If we have to grow at the rate that you are spelling out, we need a way to determine which land is high yielding for a particular crop. If we are able to achieve this, we can avoid buying farms that are unproductive and low yielding, like the ones we have in Brazil where the soil is highly acidic and has to be treated with lime to make it more alkaline.”

Tamio raised his head slightly, signaling that he understood what Glanzo was talking about. He asked, “Do you know of a way to find out how to predict the yield for a particular type of land, Glanzo?”

Glanzo responded, “I have been looking at some of the research that has been happening in universities around the world; however, nothing concrete is available. But, in my opinion, machine learning and AI show some promise to solve the problem. We can hire some machine learning engineers and data scientists who can help us create a model for ascertaining the yield of a crop at a particular land location.”

“That sounds great. Why don’t you form a team to look into this problem and propose possible solutions?” asked Tamio, smiling at Glanzo.

Glanzo said, “Yes, that is what I intend to do. First, I’ll hire machine learning engineers and data scientists, and then I’ll add people from our business operations to the team.”

“Send me an email for approval to go ahead,” said Tamio, picking up the eyeglasses from his desk as Glanzo rose to leave his cabin.

Machine Learning to the Rescue?

The machine learning team was formed in two months’ time, with Hert Liu hired as the machine learning engineer for the pilot project. Along with three data scientists, he was taken on an extensive tour of Aystsaga Agro’s agriculture operations in Brazil, South Africa, and India. The business operations team members were introduced to them once their induction program was over. Through detailed briefings, Hert and his team added various new words typical of farming to their vocabulary. They also came to understand the inside processes that went into producing crops. The detailed tours really helped the team dig deeper into the company’s operations. However, what they lacked was experiential knowledge, and that is the gap the business operations team members were going to fill in this pilot team.

Hert and his team met a couple of times after coming together. They met Glanzo several times too. He effectively communicated the company founder’s vision and the problem at hand, which was linked to its growth. Hert was an experienced machine learning engineer who had previously worked in the insurance domain. His only brush with agriculture was creating a model for predicting claims for agricultural farmers. So this was a steep learning curve for him: he had to understand the intricacies of the commercial farming business and of managing it, such as crop failures due to insect infestations, changes in weather patterns, and the soil constitution and its effects on farm output. Hert and his team started gathering some parameters that they felt could help in predicting the farm output of a crop. They divided their analysis into weather, soil, and economic environments. They shared their understanding with Glanzo in bi-weekly meetings with him. Hert showed how weather was damaging crops in their Indian farms due to unpredictable rains and floods near the farm. Glanzo pointed out that the biggest problem they faced was determining the profile of high-yielding farmland before purchase. “As per our founder’s vision, we are looking to buy a lot of farmland around the world, such as in Australia, Thailand, and other regions; however, we can’t do this by blindly buying land and then finding out it yields less crop than the average farming operation. This means disaster for our ROI. You as a team should build a model that helps us in determining the profile of high-yield cash crop farmland. To do so, you should look at the soil profile and what makes soil give high yield or low yield. Use whatever instruments you want; we can buy them. There is no shortage of funds, but you must build a system that will benefit us the most in our bid to grow exponentially,” he said.

After getting clear direction on which way to proceed, Hert and his team got together with the business operations team members to understand what constituted a soil profile for a sugarcane crop, which they chose for their pilot project. They selected the parameters shown in Table 9-3.
Table 9-3

Parameters for Monitoring Soil Nutrients

Soil pH

Organic Carbon %

Nitrogen kg/ha

Phosphorus kg/ha

Potassium kg/ha

Zinc mg/kg

Iron mg/kg

Copper mg/kg

Manganese mg/kg

Sulphur kg/ha

After settling on these parameters for the soil profile, the team wanted to collect data from each of Aystsaga Agro’s farms globally. Glanzo helped them buy the following equipment:
  1. Commercial-grade IoT sensor kits to read soil nutrient data
  2. Soil nutrient manual testing kits for places where IoT sensors were hard to run

We are now going to build a solution for the problem of predicting the yield of a sugarcane cash crop based on the dataset from the file cashcrop_Yield_dataset.csv.

Solution

We can assume that the dataset is produced by reading soil samples from the IoT sensors and the manual soil nutrient test kits for the various parameters given in the dataset file cashcrop_Yield_dataset.csv. The code in Listing 9-1 is not the definitive solution to this problem but one that is simple and quick to achieve, as in any pilot project. In a real-world scenario, the data collected for soil nutrients will have many more parameters. Such a data collection exercise may take months to complete if the operations span several continents. Agriculture operations are spread far and wide, away from urban places, so collecting the data is a demanding job for any company. We’ll keep these factors in mind when designing a solution. Note that we use only linear regression because it already gives a high score on this dataset; however, I leave it up to you to try other regressor algorithms.
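To make the data collection assumption concrete, the following minimal sketch shows one way readings from IoT sensors and manual kits could be gathered into CSV rows. The SoilReading class, its field names, the source labels, and the second farm ID are hypothetical and are not taken from cashcrop_Yield_dataset.csv; the first row reuses the soil values printed in Listing 9-2.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class SoilReading:
    """One soil sample; a hypothetical record layout for illustration."""
    farm_id: int
    source: str  # assumed labels: "iot_sensor" or "manual_kit"
    soil_ph: float
    organic_carbon_pct: float
    nitrogen_kg_ha: float
    phosphorus_kg_ha: float
    potassium_kg_ha: float
    sulphur_kg_ha: float


def readings_to_csv(readings):
    """Serialize a list of SoilReading objects to CSV text with a header row."""
    buffer = io.StringIO()
    column_names = [f.name for f in fields(SoilReading)]
    writer = csv.DictWriter(buffer, fieldnames=column_names)
    writer.writeheader()
    for reading in readings:
        writer.writerow(asdict(reading))
    return buffer.getvalue()


readings = [
    # Soil values of the first dataset row shown in Listing 9-2
    SoilReading(1234, "iot_sensor", 7.81, 0.88, 355.0, 36.0, 485.0, 33.0),
    # Hypothetical farm ID with values entered from a manual test kit
    SoilReading(1235, "manual_kit", 6.80, 0.90, 367.0, 32.0, 490.0, 35.0),
]
csv_text = readings_to_csv(readings)
print(csv_text)
```

Keeping a source column is a design choice of this sketch: it lets you check later whether sensor readings and manual readings disagree systematically.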

The Python Code

Listing 9-1 gives the Python-based solution for the problem that Aystsaga Agro is facing. The code starts with the usual imports of common Python libraries such as pandas, StringIO, and requests. After this, we load the dataset from the CSV file cashcrop_Yield_dataset.csv. You can use a SQLite database instead. Next comes the exploratory data analysis, where we look at the mean, median, mode, standard deviation, and minimum and maximum values of each column of the dataframe. We then look at the outliers and count them, for example with the code df['Yield_per_ha'].loc[df['Yield_per_ha'] <=151.50000].count() for the yield per hectare column. This column is our target, the variable we want to predict, because the company wants to know the parameters that increase this yield, as discussed in the case study. We then visualize the data using boxplots and look at the skewness and kurtosis of the numeric columns. We also visualize it using an area plot and histogram. After this, it’s time to look at the relationships between the columns with reference to the target column Yield_per_ha with the df.corr() code. The rest of the code is a three-step process. Step 1 splits the data into features and a target variable. Step 2 shuffles and splits the final dataset into training and testing datasets for building the prediction model. Step 3 is model building and evaluation, where we use linear regression since our goal is to predict a numerical variable, yield per hectare. The last lines of the code take the values of any new farm and predict its yield per hectare through the code predicted= regr.predict([[26,1500,6.8,0.9,367,32,490,35]]). To understand more about the result of the code and how it is executed, look at the discussion after Listing 9-1.
# -*- coding: utf-8 -*-
"""
Created on Tue Oct 08 19:33:25 2019
@author: PUNEETMATHUR
"""
#importing python libraries
import pandas as pd
from io import StringIO
import requests
import os
os.getcwd()
#Loading dataset
fname="cashcrop_Yield_dataset.csv"
agriculture= pd.read_csv(fname, low_memory=False, index_col=False)
df= pd.DataFrame(agriculture)
print(df.head(1))
#Checking data sanctity
print(df.size)
print(df.shape)
print(df.columns)
df.dtypes
#Check if there are any columns with empty/null dataset
df.isnull().any()
#Checking how many columns have null values
df.info()
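#Using individual functions to do EDA
#Checking out statistical data: mean, median, mode
#numeric_only=True skips the text columns (Crop, Center), which recent pandas versions no longer drop silently
df.mean(numeric_only=True)
df.median(numeric_only=True)
df.mode()
#How is the data distributed and detecting outliers
df.std(numeric_only=True)
df.max()
df.min()
#Thresholds used in this listing: first and third quartiles scaled by 1.5
df.quantile(0.25, numeric_only=True)*1.5
df.quantile(0.75, numeric_only=True)*1.5
#How many outliers in the Yield_per_ha column
df.columns
df.dtypes
#set_index returns a new dataframe; df itself is left unchanged here
df.set_index(['FarmID'])
df['Yield_per_ha'].loc[df['Yield_per_ha'] <=151.50000].count()
df['Yield_per_ha'].loc[df['Yield_per_ha'] >=159.285].count()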
#Visualizing the dataset
df.boxplot(figsize=(17, 10))
df.plot.box(vert=False)
df.kurtosis(numeric_only=True)
df.skew(numeric_only=True)
import scipy.stats as sp
sp.skew(df['Yield_per_ha'])
#Visualizing dataset
df.plot()
df.hist(figsize=(10, 6))
df.plot.area()
df.plot.area(stacked=False)
#Now look at correlation and patterns (numeric columns only)
df.corr(numeric_only=True)
#Change to dataset columns and look at scatter plots closely
df.plot.scatter(x='Yield_per_ha', y="Soil_pH",s=df['Yield_per_ha']*2)
df.plot.hexbin(x='Yield_per_ha', y="Soil_pH", gridsize=20)
#Data Preparation Steps
#Step 1 Split data into features and target variable
cropyield = pd.DataFrame(df['Yield_per_ha'])
#Drop the target, the identifier/text columns, and the two weakly correlated nutrients
dropp=['Iron mg/kg','Copper mg/kg','Crop','Center','FarmID','Yield_per_ha']
features= df.drop(columns=dropp)
cropyield.columns
features.columns
#Step 2 Shuffle & Split Final Dataset
# Import train_test_split (it lives in sklearn.model_selection; the old sklearn.cross_validation module was removed)
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle
# Shuffle features and target together so that the rows stay aligned
features, cropyield = shuffle(features, cropyield, random_state=0)
# Split the 'features' and 'cropyield' data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(features, cropyield, test_size = 0.2, random_state = 0)
# Show the results of the split
print("Training set has {} samples.".format(X_train.shape[0]))
print("Testing set has {} samples.".format(X_test.shape[0]))
# Step 3 Model Building & Evaluation
#Creating the model for prediction
#Loading model Libraries
import matplotlib.pyplot as plt
import numpy as np
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score
#Creating Linear Regression object
regr = linear_model.LinearRegression()
regr.fit(X_train,y_train)
y_pred= regr.predict(X_test)
#Printing coefficients
print('Coefficients: ',regr.coef_)
#print(LinearSVC().fit(X_train,y_train).coef_)
regr.score(X_train,y_train)
#Mean squared error
print("mean squared error:  %.2f" %mean_squared_error(y_test,y_pred))
#Variance score (R squared on the test set; 1.0 is a perfect fit)
print("Variance score: %2f"  % r2_score(y_test, y_pred))
#Plot and visualize the Linear Regression plot
plt.plot(X_test, y_pred, linewidth=3)
plt.show()
#Predicting Yield per hectare for a new farmland
X_test.dtypes
predicted= regr.predict([[26,1500,6.8,0.9,367,32,490,35]])
print(predicted)
Listing 9-1

Code for the Solution of the Case Study

This is straightforward code. It first loads the dataset into a pandas dataframe and then checks data sanctity through df.size, df.dtypes, and other statements. The EDA is done with the df.mean(), df.median(), and df.mode() statements. The correlation is shown in Figure 9-1. We can look at the mean values for soil pH and the other soil nutrients in the output of Listing 9-2 and in Figures 9-2 through 9-9. Outlier detection tells us that the yield data column has no outliers beyond the upper threshold limit; however, all of the values fall below the lower threshold limit. This can be seen visually by plotting the histogram using the command df.hist() in the code. After this, we can look at the correlation.
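Listing 9-1 derives its thresholds by multiplying the quartiles by 1.5. A more common convention, shown here only as an alternative sketch and not as the method used in the listing, is the interquartile range (IQR) rule, which flags values more than 1.5 times the IQR below the first quartile or above the third. The yield values below are made up for illustration, and the function names are hypothetical.

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) outlier fences based on the interquartile range."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def find_outliers(values, k=1.5):
    """Return the values that fall outside the IQR fences."""
    lower, upper = iqr_bounds(values, k)
    return [v for v in values if v < lower or v > upper]


# Made-up yields in tonnes per hectare; only the last one is extreme.
yields = [101.0, 102.5, 103.0, 104.2, 105.0, 106.1, 107.3, 112.3, 160.0]

lower, upper = iqr_bounds(yields)
print("fences:", lower, upper)
print("outliers:", find_outliers(yields))
```

With these sample values only 160.0 lies outside the fences. On the real dataset the two approaches can give different counts, so it is worth stating which rule a reported outlier count is based on.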
../images/484167_1_En_9_Chapter/484167_1_En_9_Fig1_HTML.png
Figure 9-1

Correlation between the variables

In Figure 9-1, we are interested in the target variable Yield_per_ha, or yield per hectare. We think it depends on the other variables, the soil nutrients; this can be confirmed by looking at the correlation values between Yield_per_ha and the other variables. Soil_pH has a correlation of 0.936215238 with the yield variable, Organic Carbon % has 0.868255792, Nitrogen kg/ha has 0.831095999, Phosphorus kg/ha has 0.831095999, Potassium kg/ha has 0.78733627, Iron mg/kg has -0.030561931, Copper mg/kg has 0.006008786, and Sulphur kg/ha has 0.792147864. To build our model, we pick the ones that have a significant correlation, namely
  • Soil_pH

  • Organic Carbon %

  • Nitrogen kg/ha

  • Phosphorus kg/ha

  • Potassium kg/ha

  • Sulphur kg/ha

We can ignore and remove iron and copper because their correlation with the yield is negligible, so they add little to our model building exercise.
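The selection step just described can be sketched as a small screening function that keeps only the columns whose absolute Pearson correlation with the target reaches a cutoff. The cutoff of 0.5, the function names, and the sample values are assumptions made for this illustration; the chapter itself picks the features by inspecting the df.corr() output.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def select_features(columns, target, threshold=0.5):
    """Score every column against the target and keep the strong ones."""
    scores = {name: pearson(values, target) for name, values in columns.items()}
    selected = [name for name, r in scores.items() if abs(r) >= threshold]
    return scores, selected


# Made-up sample: yield rises with soil pH, while iron fluctuates independently.
yield_per_ha = [100.0, 104.0, 108.0, 112.0, 116.0, 120.0]
columns = {
    "Soil_pH": [6.0, 6.4, 6.9, 7.2, 7.8, 8.1],
    "Iron mg/kg": [9.7, 9.1, 9.9, 9.0, 9.8, 9.2],
}

scores, selected = select_features(columns, yield_per_ha)
for name, r in scores.items():
    print(f"{name}: {r:+.3f}")
print("selected:", selected)
```

A fixed cutoff is a blunt instrument: it ignores non-linear relationships and redundancy among the features, so treat it as a first screen and not as a final decision.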
        Crop  Center  FarmID  landsize_in_ha  Crop_in_tonnes  Yield_per_ha  
0  Sugarcane  Africa    1234            11.0          1235.3          112.3
   Soil_pH  Organic Carbon %  Nitrogen kg/ha  Phosphorus kg/ha  
0     7.81              0.88           355.0              36.0
   Potassium kg/ha  Iron mg/kg  Copper mg/kg  Sulphur kg/ha
0            485.0        9.73          4.73  33.0
6314
(451, 14)
Index([u'Crop', u'Center', u'FarmID', u'landsize_in_ha', u'Crop_in_tonnes', u'Yield_per_ha', u'Soil_pH', u'Organic Carbon %', u'Nitrogen kg/ha', u'Phosphorus kg/ha', u'Potassium kg/ha', u'Iron mg/kg', u'Copper mg/kg', u'Sulphur kg/ha'],
      dtype='object')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 451 entries, 0 to 450
Data columns (total 14 columns):
Crop                451 non-null object
Center              451 non-null object
FarmID              451 non-null int64
landsize_in_ha      451 non-null float64
Crop_in_tonnes      451 non-null float64
Yield_per_ha        451 non-null float64
Soil_pH             451 non-null float64
Organic Carbon %    451 non-null float64
Nitrogen kg/ha      451 non-null float64
Phosphorus kg/ha    451 non-null float64
Potassium kg/ha     451 non-null float64
Iron mg/kg          374 non-null float64
Copper mg/kg        337 non-null float64
Sulphur kg/ha       451 non-null float64
dtypes: float64(11), int64(1), object(2)
memory usage: 49.4+ KB
Training set has 360 samples.
Testing set has 91 samples.
('Coefficients: ', array([[-2.11198278,  0.02322665,  5.78698466,  3.69351487,  0.01497725,  0.17939622,  0.01021635,
         0.15396713]]))
mean squared error:  16.50
Variance score: 0.950297
Listing 9-2

Output of Code from Listing 9-1

../images/484167_1_En_9_Chapter/484167_1_En_9_Fig2_HTML.jpg
Figure 9-2

Vertical boxplot visualization

../images/484167_1_En_9_Chapter/484167_1_En_9_Fig3_HTML.jpg
Figure 9-3

Horizontal boxplot visualization

../images/484167_1_En_9_Chapter/484167_1_En_9_Fig4_HTML.jpg
Figure 9-4

Area graph of the numeric variables

../images/484167_1_En_9_Chapter/484167_1_En_9_Fig5_HTML.jpg
Figure 9-5

Histogram of the numeric variables

../images/484167_1_En_9_Chapter/484167_1_En_9_Fig6_HTML.jpg
Figure 9-6

Stacked area plot of the numeric variables

../images/484167_1_En_9_Chapter/484167_1_En_9_Fig7_HTML.jpg
Figure 9-7

Cumulative stacked area graph

../images/484167_1_En_9_Chapter/484167_1_En_9_Fig8_HTML.jpg
Figure 9-8

Scatter plot of Yield_per_ha and Soil_pH variables

../images/484167_1_En_9_Chapter/484167_1_En_9_Fig9_HTML.jpg
Figure 9-9

Chart plot of Yield_per_ha and Soil_pH variables

In Figures 9-2 through 9-8, we can see that the boxplot shows a widely spread range of values for Crop_in_tonnes. One way to avoid this is to apply scaling to all of the numerical variables, like I did in the case study solutions for healthcare, retail, and finance in my book Machine Learning Applications Using Python. The histogram shows the distribution of the variables, and we see that most of the soil nutrients are right-skewed except for copper and iron. We also note that Yield_per_ha and Soil_pH are closely related, as shown by the plot produced with the Python code df.plot.hexbin(x='Yield_per_ha', y="Soil_pH", gridsize=20).

After completing the usual data preparation steps by dividing the data into a target variable named cropyield and a features variable with the other features, we split the dataset into training and testing datasets, with 360 samples in the training dataset and 91 samples in the testing dataset. After this, we load the linear_model module from scikit-learn to run the linear regression algorithm on the training and testing datasets through the Python code regr.fit(X_train,y_train) and y_pred= regr.predict(X_test). Then we look at the score on our testing dataset before starting to make a prediction, through the code print("Variance score: %2f" % r2_score(y_test, y_pred)), which gives us 0.9503. This is the R-squared value, meaning the model explains about 95% of the variance in yield on the test set. Since this is good, we can proceed with making a prediction. Please remember that this is fictitious data, which is why we are able to achieve such a good score. In the real world, you may need to run more regressors or fine-tune your data gathering efforts in order to reach this level.
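To make the two reported metrics less opaque, the sketch below computes the mean squared error and the R-squared (variance) score by hand using their standard definitions. The function names and the four sample values are made up for illustration; in Listing 9-1 the same quantities come from mean_squared_error and r2_score in sklearn.metrics.

```python
def mse(y_true, y_pred):
    """Mean of the squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """1 minus the ratio of residual variation to total variation."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up actual and predicted yields in tonnes per hectare.
actual = [100.0, 110.0, 120.0, 130.0]
predicted = [102.0, 108.0, 121.0, 129.0]

print("mean squared error:", mse(actual, predicted))
print("variance score:", r_squared(actual, predicted))
```

For these sample values the residual sum of squares is 10 and the total sum of squares is 500, so the score is 1 - 10/500 = 0.98. A score near 1 means the predictions track the variation in the target closely; it is not a percentage of correct predictions, which is a classification concept.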

To predict, we use the code predicted= regr.predict([[26,1500,6.8,0.9,367,32,490,35]]). The values passed in describe a piece of agricultural land that Aystsaga Agro is evaluating in another country, with the following characteristic features:
landsize_in_ha      26
Crop_in_tonnes      1500
Soil_pH             6.8
Organic Carbon %    0.9
Nitrogen kg/ha      367
Phosphorus kg/ha    32
Potassium kg/ha     490
Sulphur kg/ha       35
Glanzo wants to know from the model what the probable yearly yield will be after they buy this farmland. The Python program gives out a value of 75.2637 tonnes per hectare, as shown in Listing 9-3.
predicted= regr.predict([[26,1500,6.8,0.9,367,32,490,35]])
print(predicted)
[[75.26373654]]
Listing 9-3

Predicted Value by the Python Program

Please remember that this program does not take into account other conditions such as weather and seed varieties, which also affect crop production and yield. Building such a dataset would be a huge exercise that is beyond the scope of this book.
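One more check is worth running on any prediction. The sketch below assumes that Yield_per_ha is simply Crop_in_tonnes divided by landsize_in_ha; this is an assumption, although the first row printed in Listing 9-2 is consistent with it. The helper name yield_per_ha is hypothetical.

```python
def yield_per_ha(crop_in_tonnes, landsize_in_ha):
    """Yield in tonnes per hectare, assuming yield = tonnes / land size."""
    if landsize_in_ha <= 0:
        raise ValueError("land size must be positive")
    return crop_in_tonnes / landsize_in_ha


# First dataset row from Listing 9-2: 1235.3 tonnes on 11.0 ha, reported yield 112.3
sample_row_yield = yield_per_ha(1235.3, 11.0)

# New farmland passed to regr.predict: 1500 tonnes on 26 ha
new_farm_yield = yield_per_ha(1500, 26)

# Value printed in Listing 9-3
model_prediction = 75.26373654

print(f"sample row: {sample_row_yield:.2f} t/ha")
print(f"new farm by division: {new_farm_yield:.2f} t/ha")
print(f"new farm by model: {model_prediction:.2f} t/ha")
```

Under that assumption the new farm works out to about 57.69 tonnes per hectare by plain division, against 75.26 from the model. It also means that two of the model inputs already determine the target, and the harvested tonnage of a farm is normally not known before purchase. When adapting this pilot, consider training on the soil parameters alone and comparing the scores.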

Summary

We have now come to the end of this case study and the book. I have thoroughly enjoyed bringing you these IoT-based solutions to modern-day, practical business problems and trying to solve them through machine learning. I hope you enjoyed learning from them too. Do consider leaving feedback on the forums at www.pmauthor.com/raspbian.
