P. Mathur, IoT Machine Learning Applications in Telecom, Energy, and Agriculture, https://doi.org/10.1007/978-1-4842-5549-0_9

9. Agriculture Industry Case Study: Predicting a Cash Crop Yield

Puneet Mathur¹

(1) Bangalore, Karnataka, India

This chapter covers an agriculture industry case study on predicting a cash crop yield. The case study gives you an idea of the challenges faced by a mid-sized agri-conglomerate trying to reach the next level and become as big as its founder’s vision. Crop yield matters greatly to such organizations because they want to get the highest possible revenue from their land. The main lesson of the case study is that a company’s vision should be tied to the machine learning operation it undertakes; otherwise, the project is wasteful expenditure. This is what I tell most of my clients when they hire me to assess how their current machine learning application helps them. Does it lower costs? If so, by how much? Does it increase revenue? Then by how much? The goal of the project has to be quantifiable, or the project will not succeed and will leave the business owners and stakeholders dissatisfied. So read on…

Agriculture Industry Case Study Overview

An international agriculture conglomerate named Aystsaga Agro has investments in farming, fertilizer production, and tractors. The president, Tamio Polskab, is also the founder of the company. He has steered the company from its humble origins as a single North American farm inherited from his father to an agri-business that spans three continents: North America, Africa, and Asia. Agri-business is not an easy one, as it faces many challenges emerging from tariff wars and climate change. Worldwide, there is an agriculture crisis, with countries from South Asia to Africa trying to protect small farmers from the onslaught of such rapid changes. There is one big silent change that is happening: the use of technology, such as robotic machines, to replace humans in farming. Agriculture conglomerates such as Aystsaga Agro, which are highly commercial in their operations, are embracing technology to see how they can adopt and benefit from it. Because agriculture is less profitable than other industries, such as services, funding for such operations is hard to get due to low ROI. This is the reason why such commercial companies focus on cash crops like corn, sugarcane, wheat, and soybeans. Sugarcane, sugar beets, and tomatoes fetch better yields than other crops for Aystsaga Agro, as shown in Table 9-1.

Table 9-1

Global Yearly Production at Aystsaga Agro for Three Cash Crops

Aystsaga Agro
2017-2018 Worldwide Production	Tonnes	Total Land Size
Sugarcane	470817.99	5361.70
Sugar beet	503999.80	5361.70
Tomatoes	273446.70	5361.70

In 2013, Aystsaga Agro purchased big farms in Brazil and South Africa. The Brazil farms were in the region of São Paulo, and the South African farms were in the interior regions. Tamio, the founder of the company, succeeded in expanding the company’s farming operations by acquiring land and using it for commercial operations. His vision was to establish his company as a major agri-business in the world.

While drinking his morning green tea, Tamio looked at the yearly financial results his CFO sent him that morning for his review. See Table 9-2.

Table 9-2

Worldwide Production Numbers and Revenue Figures for Aystsaga Agro

2017-2018 Worldwide Production	Tonnes	Total Land Size	US$
Sugarcane	470817.99	5361.70	241058809.3
Sugar beet	503999.80	5361.70	25199990
Tomatoes	178446.70	5361.70	180231167
Income from Agricultural Operation			446489966.3
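As a quick check on Table 9-2, the short sketch below re-adds the three revenue rows and derives the revenue per tonne for each crop. The figures are copied from the table; the variable names are illustrative and do not come from the book.

```python
# Figures copied from Table 9-2: crop -> (tonnes produced, revenue in US$)
production = {
    "Sugarcane": (470817.99, 241058809.3),
    "Sugar beet": (503999.80, 25199990.0),
    "Tomatoes": (178446.70, 180231167.0),
}

# Re-add the revenue column to confirm the reported total
total_revenue = sum(revenue for _, revenue in production.values())
print(f"Income from agricultural operations: US$ {total_revenue:,.1f}")

# Revenue per tonne shows which crop earns the most per unit produced
revenue_per_tonne = {
    crop: revenue / tonnes for crop, (tonnes, revenue) in production.items()
}
for crop, value in revenue_per_tonne.items():
    print(f"{crop}: US$ {value:,.2f} per tonne")
```

A check like this is cheap to run whenever the CFO's figures feed into a quantified project goal.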

While he was looking at the numbers, there was a knock on the door of his plush South African office cabin. He raised his head to see his Chief Operations Officer, Glanzo, smiling. Tamio signaled for him to come in and sit across from him. As Glanzo sat down, Tamio rose and walked to the right side of the window. He looked at the breathtaking view of the sea in the South African capital as he started speaking slowly.

The Problem

“Have you seen the numbers sent by Nambi this morning?” asked Tamio.

Glanzo nodded and said, “I think they are pretty impressive given that we have had several storms around our farms in Brazil and South Africa this year.”

“You don’t understand my vision, do you?” asked Tamio, looking straight at his right-hand man. “My vision is to achieve a turnover of USD 500 million and reach USD 1 billion in six years. I told the board about this in the last AGM. You were there too,” said the boss.

Glanzo said, raising his shoulders, “We were all there, but I thought you were saying that to please the investors and the board.”

“No, that was a goal that I have been working towards for the past 15 years, ever since this company went public. This is not just a vision or a goal; this is a challenge that I have thrown up to you all,” said Tamio decidedly.

Glanzo retorted, “Yes, I understand the need to grow; otherwise, we will be eaten by the big fish in the industry. But you must understand the challenges that our operations are facing today, and unless we find real solutions for them, we will not grow at the rate that you want it to happen.”

Tamio now had a frown on his face as he sat down and rested his head against his huge leather executive chair. He was listening intently as Glanzo spoke further. “The single thing that is preventing us from growing is our inability to predict the yield of crops on a given piece of land.” Glanzo continued, “Our agriculture operations are a major problem for us in regions where we made a bad decision in buying infertile land that is low yielding. Low yielding land requires us to rectify it by applying different chemicals on hectares of land, which increases the cost of production and eats into our profitability. If we have to grow at the rate that you are spelling out, we need a way to determine which land is high yielding for a particular crop. If we are able to achieve this, we can avoid buying farms that are unproductive and low yielding, like the ones we have in Brazil where the soil is highly acidic and has to be treated with lime to make it more alkaline.”

Tamio raised his head slightly, signaling that he understood what Glanzo was talking about. He asked, “Do you know of a way to find out how to predict the yield for a particular type of land, Glanzo?”

Glanzo responded, “I have been looking at some of the research that has been happening in universities around the world; however, nothing concrete is available. But, in my opinion, machine learning and AI show some promise to solve the problem. We can hire some machine learning engineers and data scientists who can help us create a model for ascertaining the crop yield at a particular location.”

“That sounds great. Why don’t you form a team to look into this problem and propose possible solutions?” asked Tamio, smiling at Glanzo.

Glanzo said, “Yes, that is what I intend to do. First, I’ll hire machine learning engineers and data scientists, and then I’ll add people from our business operations to the team.”

“Send me an email for approval to go ahead,” said Tamio, picking up the eyeglasses from his desk as Glanzo rose to leave his cabin.

Machine Learning to the Rescue?

The machine learning team was formed in two months’ time, with Hert Liu hired as the machine learning engineer for the pilot project. Along with three data scientists, he was taken on a thorough tour of Aystsaga Agro’s agriculture operations in Brazil, South Africa, and India. The business operations team members were introduced to them once their induction program was over. Through detailed briefings, Hert and his team added various new words typical of farming to their vocabulary. They also came to understand the internal processes that went into producing crops. The detailed tours really helped the team dig deeper into the company’s operations. However, what they lacked was experiential knowledge, and that is the gap the business operations team members were going to fill in this pilot team.

Hert and his team met a couple of times after coming together. They met Glanzo several times too. He effectively communicated the company founder’s vision and the problem at hand, which was linked to its growth. Hert was an experienced machine learning engineer who had previously worked in the insurance domain. His only brush with agriculture was creating a model for predicting claims for agricultural farmers. So this was a steep learning curve for him: he had to understand the intricacies of the commercial farming business and of managing it, such as crop failures due to insect infestations, changes in weather patterns, and the soil constitution and its effects on farm output. Hert and his team started gathering some parameters that they felt could help in predicting the farm output of a crop. They divided their analysis into weather, soil, and economic environments. They shared their understanding with Glanzo in bi-weekly meetings. Hert showed how weather was damaging crops on their Indian farms due to unpredictable rains and floods near the farms. Glanzo pointed out that the biggest problem they faced was determining the profile of high yielding farmland before purchase. “As per our founder’s vision, we are looking to buy a lot of farmland around the world, such as in Australia, Thailand, and other regions; however, we can’t do this by blindly buying land and then finding out it yields less crop than the average farming operation. This means disaster for our ROI. You as a team should build a model that helps us determine the profile of high yield cash crop farmland. To do so, you should look at the soil profile and what makes soil give high yield or low yield. Use whatever instruments you want; we can buy them. There is no shortage of funds, but you must build a system that will benefit us the most in our bid to grow exponentially,” he said.

After getting clear direction on which way to proceed, Hert and his team got together with the business operations team members to understand what constituted a soil profile for a sugarcane crop, which they chose for their pilot project. They selected the parameters shown in Table 9-3.

Table 9-3

Parameters for Monitoring Soil Nutrients

Soil pH
Organic Carbon %
Nitrogen kg/ha
Phosphorus kg/ha
Potassium kg/ha
Zinc mg/kg
Iron mg/kg
Copper mg/kg
Manganese mg/kg
Sulphur kg/ha
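To make the soil profile concrete, here is a minimal sketch of how one reading covering the parameters in Table 9-3 could be represented and sanity-checked before it is stored. The class name, field names, and validation rules are hypothetical, and the zinc and manganese values in the example are placeholders; the remaining values are borrowed from the first dataset row printed later in Listing 9-2.

```python
from dataclasses import dataclass, asdict


@dataclass
class SoilSample:
    """One soil reading for a farm plot; field names are hypothetical."""
    farm_id: int
    soil_ph: float
    organic_carbon_pct: float
    nitrogen_kg_ha: float
    phosphorus_kg_ha: float
    potassium_kg_ha: float
    zinc_mg_kg: float
    iron_mg_kg: float
    copper_mg_kg: float
    manganese_mg_kg: float
    sulphur_kg_ha: float

    def problems(self) -> list:
        """Return a list of basic sanity-check failures (empty if none)."""
        issues = []
        if not 0.0 <= self.soil_ph <= 14.0:
            issues.append("soil_ph outside the 0-14 pH scale")
        if not 0.0 <= self.organic_carbon_pct <= 100.0:
            issues.append("organic_carbon_pct is not a valid percentage")
        for name, value in asdict(self).items():
            if name not in ("farm_id", "soil_ph") and value < 0:
                issues.append(f"{name} is negative")
        return issues


# Zinc (1.2) and manganese (3.4) are placeholder values for illustration
sample = SoilSample(1234, 7.81, 0.88, 355.0, 36.0, 485.0, 1.2, 9.73, 4.73, 3.4, 33.0)
print(sample.problems())
```

Rejecting implausible readings at collection time is one way to keep sensor faults and manual entry mistakes out of the training data.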

Having settled on these parameters for the soil profile, the team now wanted to collect data from each of Aystsaga Agro’s farms globally. Glanzo helped them buy the following equipment:

1. Commercial-grade IoT sensor kits to read soil nutrient data
2. Soil nutrient manual testing kits for places where IoT sensors were hard to run

We are now going to build a solution for the problem of predicting the yield of a sugarcane cash crop based on the dataset from the file cashcrop_Yield_dataset.csv.

Solution

We can assume that the dataset was produced by reading soil samples with the IoT sensors and manual soil nutrient test kits for the various parameters given in the dataset file cashcrop_Yield_dataset.csv. The code in Listing 9-1 is not the definitive solution to this problem but one that is simple and quick to achieve, as in any pilot project. In a real-world scenario, the data collected for soil nutrients will have many more parameters. Such a data collection exercise may take months to complete if the operations span several continents. Agriculture operations are spread far and wide, away from urban places, so collecting the data is a demanding job for any company. We’ll keep these factors in mind when designing a solution. Note that we’re only using linear regression, as it gives a high score on this dataset; however, I leave it up to you to try other regressor algorithms.
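If you do want to try other regressors, one possible approach is to score several candidates with the same cross-validation procedure. The sketch below is not from the book: it uses a small synthetic dataset as a stand-in for the soil data, and the choice of candidates (Ridge and a random forest alongside linear regression) is only an assumption about what might be worth comparing.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the soil dataset (NOT the book's data):
# 200 farms, 3 made-up nutrient features, yield is linear plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(200, 3))
y = 100.0 + X @ np.array([30.0, 15.0, 5.0]) + rng.normal(0.0, 1.0, size=200)

candidates = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

# Mean R^2 over 5 cross-validation folds for each candidate
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    for name, model in candidates.items()
}
for name, score in scores.items():
    print(f"{name}: mean R^2 = {score:.3f}")
```

With the real dataset, X and y would be replaced by the features and target prepared in Listing 9-1; cross-validation gives a steadier comparison than a single train/test split.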

The Python Code

Listing 9-1 gives the Python-based solution for the problem that Aystsaga Agro is facing. The code starts with the usual imports of common Python libraries such as pandas, StringIO, and requests. We then load the dataset from the CSV file cashcrop_Yield_dataset.csv; you could use a SQLite database instead. Next comes the exploratory data analysis, where we look at the mean, median, mode, standard deviation, and minimum and maximum values of each column of the dataframe. We then count the outliers with code such as df['Yield_per_ha'].loc[df['Yield_per_ha'] <=151.50000].count() for the yield per hectare column. This column is our target variable, the value we want to predict, because the company wants to know which parameters increase this yield, as discussed in the case study. We then visualize the data using boxplots and look at the skewness and kurtosis of the numeric columns. We also visualize it using an area plot and a histogram. After this, we look at the relationships between the columns and the target column Yield_per_ha with the df.corr() code. The rest of the code is a three-step process. Step 1 splits the data into features and the target variable. Step 2 shuffles and splits the final dataset into training and testing datasets for building the prediction model. Step 3 is model building and evaluation, where we use linear regression since our goal is to predict a numerical variable, yield per hectare. The last lines of the code take the values for any new farm and predict its yield per hectare through the code predicted= regr.predict([[26,1500,6.8,0.9,367,32,490,35]]). To understand more about the result of the code and how it is executed, look at the discussion after Listing 9-1.

# -*- coding: utf-8 -*-

"""

Created on Tue Oct 08 19:33:25 2019

@author: PUNEETMATHUR

"""

#importing python libraries

import pandas as pd

from io import StringIO

import requests

import os

os.getcwd()

#Loading dataset

fname="cashcrop_Yield_dataset.csv"

agriculture= pd.read_csv(fname, low_memory=False, index_col=False)

df= pd.DataFrame(agriculture)

print(df.head(1))

#Checking data sanctity

print(df.size)

print(df.shape)

print(df.columns)

df.dtypes

#Check if there are any columns with empty/null dataset

df.isnull().any()

#Checking how many columns have null values

df.info()

#Using individual functions to do EDA

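#Checking out Statistical data Mean Median Mode correlation

#numeric_only=True skips the text columns (Crop, Center), which recent pandas versions no longer drop silently

df.mean(numeric_only=True)

df.median(numeric_only=True)

df.mode()

#How is the data distributed and detecting Outliers

df.std(numeric_only=True)

df.max()

df.min()

#Rough thresholds used in this listing (quartile * 1.5)

df.quantile(0.25, numeric_only=True)*1.5

df.quantile(0.75, numeric_only=True)*1.5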

#How many Outliers in the Yield_per_ha column

df.columns

df.dtypes

#set_index returns a new dataframe; df itself is left unchanged here

df.set_index(['FarmID'])

df['Yield_per_ha'].loc[df['Yield_per_ha'] <=151.50000].count()

df['Yield_per_ha'].loc[df['Yield_per_ha'] >=159.285].count()

#Visualizing the dataset

df.boxplot(figsize=(17, 10))

df.plot.box(vert=False)

df.kurtosis(numeric_only=True)

df.skew(numeric_only=True)

import scipy.stats as sp

sp.skew(df['Yield_per_ha'])

#Visualizing dataset

df.plot()

df.hist(figsize=(10, 6))

df.plot.area()

df.plot.area(stacked=False)

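#Now look at correlation and patterns

#numeric_only=True (pandas 1.5 and later) restricts the correlation to numeric columns

df.corr(numeric_only=True)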

#Change to dataset columns and look at scatter plots closely

df.plot.scatter(x='Yield_per_ha', y="Soil_pH",s=df['Yield_per_ha']*2)

df.plot.hexbin(x='Yield_per_ha', y="Soil_pH", gridsize=20)

#Data Preparation Steps

#Step 1 Split data into features and target variable

# Split the data into features and target label

cropyield = pd.DataFrame(df['Yield_per_ha'])

#Columns left out of the feature set: weakly correlated nutrients, identifiers, and the target itself

dropp=['Iron mg/kg','Copper mg/kg','Crop','Center','FarmID','Yield_per_ha']

features= df.drop(dropp, axis=1)

cropyield.columns

features.columns

#Step 2 Shuffle & Split Final Dataset

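# Import train_test_split (it lives in sklearn.model_selection; the old sklearn.cross_validation module was removed in scikit-learn 0.20)

from sklearn.model_selection import train_test_split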

from sklearn.utils import shuffle

# Shuffle and split the data into training and testing subsets

features=shuffle(features, random_state=0)

cropyield=shuffle(cropyield, random_state=0)

# Split the 'features' and 'cropyield' data into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(features, cropyield, test_size = 0.2, random_state = 0)

# Show the results of the split

print("Training set has {} samples.".format(X_train.shape[0]))

print("Testing set has {} samples.".format(X_test.shape[0]))

# Step 3 Model Building & Evaluation

#Creating the Model for prediction

#Loading model Libraries

import matplotlib.pyplot as plt

import numpy as np

from sklearn import linear_model

from sklearn.metrics import mean_squared_error, r2_score

#Creating Linear Regression object

regr = linear_model.LinearRegression()

regr.fit(X_train,y_train)

y_pred= regr.predict(X_test)

#Printing Coefficients

print('Coefficients: ',regr.coef_)

#print(LinearSVC().fit(X_train,y_train).coef_)

regr.score(X_train,y_train)

#Mean squared error

print("mean squared error: %.2f" %mean_squared_error(y_test,y_pred))

#Variance score

print("Variance score: %2f" % r2_score(y_test, y_pred))

#Plot and visualize the Linear Regression plot

plt.plot(X_test, y_pred, linewidth=3)

plt.show()

#Predicting Yield per hectare for a new farmland

X_test.dtypes

predicted= regr.predict([[26,1500,6.8,0.9,367,32,490,35]])

print(predicted)

Listing 9-1

Code for the Solution of the Case Study

This is straightforward code. It first loads the dataset into a pandas dataframe and then checks data sanctity through df.size, df.dtypes, and other statements. The EDA is done with the df.mean(), df.median(), and df.mode() statements. The correlation is shown in Figure 9-1. In our case, we can look at the mean values for soil pH and the other soil nutrients in the output of Listing 9-2 and Figures 9-2 through 9-9. Outlier detection tells us that the yield data column has no values beyond the upper threshold limit; however, all of its values are below the lower threshold limit. This can be seen visually by plotting the histogram using the command df.hist() in the code. After this, we can look at the correlation.
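For comparison, a commonly used alternative to fixed thresholds is the interquartile range (IQR) rule, which flags values lying more than 1.5 times the IQR below the first quartile or above the third quartile. The sketch below is not the method used in Listing 9-1; the function name is hypothetical and the yield values are made up for illustration.

```python
import statistics


def iqr_bounds(values):
    """Return (lower, upper) outlier fences using the 1.5 * IQR rule."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


# Made-up yield values (tonnes per hectare), not taken from the dataset
yields = [98.0, 101.0, 103.0, 104.0, 105.0, 106.0, 108.0, 110.0, 160.0]

lower, upper = iqr_bounds(yields)
outliers = [v for v in yields if v < lower or v > upper]
print(f"fences: ({lower:.2f}, {upper:.2f}), outliers: {outliers}")
```

Under this rule, only the value far above the rest of the sample is flagged, which is usually closer to what a boxplot shows as an outlier.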

Figure 9-1
Correlation between the variables

As we can see from Figure 9-1, we are interested in the target variable Yield_per_ha, or yield per hectare. We think it depends on the other variables, the soil nutrients; this can be confirmed by looking at the correlation between Yield_per_ha and each of them. Soil_pH has a correlation of 0.936215238 with the yield variable, Organic Carbon % has 0.868255792, Nitrogen kg/ha has 0.831095999, Phosphorus kg/ha has 0.831095999, Potassium kg/ha has 0.78733627, Iron mg/kg has -0.030561931, Copper mg/kg has 0.006008786, and Sulphur kg/ha has 0.792147864. To build our model, we pick the ones that have a significant correlation, namely

Soil_pH
Organic Carbon %
Nitrogen kg/ha
Phosphorus kg/ha
Potassium kg/ha
Sulphur kg/ha

We can ignore and remove iron and copper because they do not show any significant correlation; their values are too close to zero for them to be worth considering in our model building exercise.
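The selection described above can be expressed as a simple filter on the absolute correlation values. In this sketch, the correlations are the ones quoted in the text, while the 0.5 cut-off is a hypothetical choice made only for illustration; the book does not state a numeric threshold.

```python
# Correlations with Yield_per_ha as quoted in the text above
correlations = {
    "Soil_pH": 0.936215238,
    "Organic Carbon %": 0.868255792,
    "Nitrogen kg/ha": 0.831095999,
    "Phosphorus kg/ha": 0.831095999,
    "Potassium kg/ha": 0.78733627,
    "Iron mg/kg": -0.030561931,
    "Copper mg/kg": 0.006008786,
    "Sulphur kg/ha": 0.792147864,
}

# Hypothetical cut-off: keep features whose |r| is at least 0.5
THRESHOLD = 0.5
selected = [name for name, r in correlations.items() if abs(r) >= THRESHOLD]
dropped = [name for name in correlations if name not in selected]

print("Selected:", selected)
print("Dropped:", dropped)
```

Using the absolute value matters because a strongly negative correlation would be just as useful to the model as a strongly positive one.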

Crop Center FarmID landsize_in_ha Crop_in_tonnes Yield_per_ha

0 Sugarcane Africa 1234 11.0 1235.3 112.3

Soil_pH Organic Carbon % Nitrogen kg/ha Phosphorus kg/ha

0 7.81 0.88 355.0 36.0

Potassium kg/ha Iron mg/kg Copper mg/kg Sulphur kg/ha

0 485.0 9.73 4.73 33.0

6314

(451, 14)

Index([u'Crop', u'Center', u'FarmID', u'landsize_in_ha', u'Crop_in_tonnes', u'Yield_per_ha', u'Soil_pH', u'Organic Carbon %', u'Nitrogen kg/ha', u'Phosphorus kg/ha', u'Potassium kg/ha', u'Iron mg/kg', u'Copper mg/kg', u'Sulphur kg/ha'],

dtype='object')

RangeIndex: 451 entries, 0 to 450

Data columns (total 14 columns):

Crop 451 non-null object

Center 451 non-null object

FarmID 451 non-null int64

landsize_in_ha 451 non-null float64

Crop_in_tonnes 451 non-null float64

Yield_per_ha 451 non-null float64

Soil_pH 451 non-null float64

Organic Carbon % 451 non-null float64

Nitrogen kg/ha 451 non-null float64

Phosphorus kg/ha 451 non-null float64

Potassium kg/ha 451 non-null float64

Iron mg/kg 374 non-null float64

Copper mg/kg 337 non-null float64

Sulphur kg/ha 451 non-null float64

dtypes: float64(11), int64(1), object(2)

memory usage: 49.4+ KB

Training set has 360 samples.

Testing set has 91 samples.

('Coefficients: ', array([[-2.11198278, 0.02322665, 5.78698466, 3.69351487, 0.01497725, 0.17939622, 0.01021635,

0.15396713]]))

mean squared error: 16.50

Variance score: 0.950297

Listing 9-2

Output of Code from Listing 9-1

Figure 9-2
Vertical boxplot visualization

Figure 9-3
Horizontal boxplot visualization

Figure 9-4
Area graph of the numeric variables

Figure 9-5
Histogram of the numeric variables

Figure 9-6
Stacked area plot of the numeric variables

Figure 9-7
Cumulative stacked area graph

Figure 9-8
Scatter plot of Yield_per_ha and Soil_pH variables

Figure 9-9
Hexbin plot of Yield_per_ha and Soil_pH variables

In Figures 9-2 through 9-8, we can see that the boxplot shows a widely spread range of values for Crop_in_tonnes. One way to avoid this is to apply scaling to all of the numerical variables, like I did in the case study solutions for healthcare, retail, and finance in my book Machine Learning Applications Using Python. The histogram shows the distribution of the variables, and we see that most of the soil nutrients are right-skewed except for copper and iron. We also note that Yield_per_ha and Soil_pH are closely related, as the plot produced by the Python code df.plot.hexbin(x='Yield_per_ha', y="Soil_pH", gridsize=20) shows. After completing the usual data preparation steps, dividing the data into a target variable named cropyield and a features variable with all the other features, we split the dataset into training and testing datasets, with 360 samples in the training dataset and 91 samples in the testing dataset. After this, we load the linear_model Python library to run the linear regression algorithm on the training and testing datasets through the Python code regr.fit(X_train,y_train) and y_pred= regr.predict(X_test). Then, before making a prediction, we look at the score on our testing dataset through the code print("Variance score: %2f" % r2_score(y_test, y_pred)), which gives us about 0.9503. This is the R² score, meaning the model explains roughly 95% of the variance in yield on the test set. Since this is good, we can proceed with making a prediction. Please remember that this is fictitious data, which is why we are able to achieve such a good score. In the real world, you may need to run more regressors or fine-tune your data gathering efforts in order to achieve such a level.
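To see what the two reported metrics measure, the sketch below computes the mean squared error and the R² score by hand, and then reproduces the 360/91 split sizes for 451 rows. The function names and the sample values are made up for illustration; the rounding-up of the test share reflects how scikit-learn's train_test_split sizes the test set for a fractional test_size.

```python
import math


def mse(y_true, y_pred):
    """Average of the squared prediction errors."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up yields and predictions (tonnes per hectare) for illustration
y_true = [100.0, 110.0, 120.0, 130.0]
y_pred = [102.0, 108.0, 121.0, 129.0]
print(f"MSE = {mse(y_true, y_pred):.2f}")
print(f"R^2 = {r_squared(y_true, y_pred):.3f}")

# Split sizes for 451 rows with test_size=0.2: the test share is rounded up
n_rows, test_size = 451, 0.2
n_test = math.ceil(n_rows * test_size)
n_train = n_rows - n_test
print(n_train, n_test)
```

Note that R² describes the share of variance explained by a regression model, which is a different notion from classification accuracy.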

To predict, we use the code predicted= regr.predict([[26,1500,6.8,0.9,367,32,490,35]]), where the values describe a piece of agricultural land that Aystsaga Agro is evaluating in another country, with the following characteristic features:

landsize_in_ha 26

Crop_in_tonnes 1500

Soil_pH 6.8

Organic Carbon % 0.9

Nitrogen kg/ha 367

Phosphorus kg/ha 32

Potassium kg/ha 490

Sulphur kg/ha 35

Glanzo wants to know from the model what the probable yearly yield will be after they buy this farmland. The Python program gives out a value of 75.2637 tonnes per hectare, as shown in Listing 9-3.

predicted= regr.predict([[26,1500,6.8,0.9,367,32,490,35]])

print(predicted)

[[75.26373654]]

Listing 9-3

Predicted Value by the Python Program
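One sanity check a reader may want to run on this prediction is sketched below. It relies on an observation from the printed output rather than on anything the book states: in the first dataset row shown in Listing 9-2, Yield_per_ha equals Crop_in_tonnes divided by landsize_in_ha. Both of those columns remain among the model's features, so the same ratio can be computed for the new farmland and compared with the model's output. The variable names are illustrative.

```python
# Feature values passed to regr.predict() in Listing 9-3
landsize_in_ha = 26
crop_in_tonnes = 1500

# First row of the dataset as printed in Listing 9-2
row_land, row_tonnes, row_yield = 11.0, 1235.3, 112.3

# In that row the yield equals tonnes divided by land size
implied_row_yield = row_tonnes / row_land
print(f"Row 0 implied yield: {implied_row_yield:.1f} t/ha (reported {row_yield})")

# The same ratio for the new farmland, compared with the model output
implied_new_yield = crop_in_tonnes / landsize_in_ha
model_prediction = 75.26373654
print(f"New farm implied yield: {implied_new_yield:.2f} t/ha")
print(f"Model prediction:       {model_prediction:.2f} t/ha")
```

If the relationship holds across the whole dataset, it may be worth retraining without Crop_in_tonnes, since the tonnage harvested would presumably not be known for land that has not yet been purchased and farmed.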

Please remember that this program does not take into account other conditions, such as weather and seed varieties, which also affect crop production and yield. Building such a dataset would be a huge exercise that is beyond the scope of this book.

Summary

We have now come to the end of this case study and the book. I have thoroughly enjoyed bringing you these IoT-based solutions to modern-day, practical business problems and trying to solve them through machine learning. I hope you enjoyed learning from them too. Do consider leaving feedback on the forums at www.pmauthor.com/raspbian.
