Data loading and review

To begin with, load magrittr (for the %>% pipe) and install the other packages, which are called below with the package:: prefix:

 

> library(magrittr)

> install.packages("caret")

> install.packages("DataExplorer")

> install.packages("earth")

> install.packages("ggthemes")

> install.packages("psych")

> install.packages("tidyverse")

> install.packages("sjmisc")

> options(scipen = 999)
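
Re-running install.packages() in every session is slow. As a minimal sketch, assuming you would rather install only what is missing, the following uses a hypothetical helper, missing_packages(), built on base R's requireNamespace(). The install call is left commented out so that nothing is downloaded unless you opt in:

```r
# Hypothetical helper (not part of any package): return the subset of
# `pkgs` that cannot be loaded, i.e. is not installed yet.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Packages used in this section. sjmisc is listed because
# sjmisc::descr() is called further down.
needed <- c("caret", "DataExplorer", "earth", "ggthemes",
            "psych", "tidyverse", "sjmisc")

to_install <- missing_packages(needed)
print(to_install)

# Uncomment to install whatever is missing:
# if (length(to_install) > 0) install.packages(to_install)
```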

Now, read the data into your environment:

 

> army_ansur <- readRDS("army_ansur.RData")
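
A note on the file name: readRDS() restores a single object written by saveRDS(), whereas load() restores objects written by save(). The file extension does not decide which one applies, so a file named .RData can still be an RDS file, as appears to be the case here. A small round trip with a made-up data frame, as a sketch:

```r
# Made-up stand-in for the real data; the values are illustrative only.
toy <- data.frame(
  Gender    = c("Female", "Male"),
  Weightlbs = c(150L, 180L)
)

# Write with saveRDS() to a temporary file that carries an .RData extension.
path <- tempfile(fileext = ".RData")
saveRDS(toy, path)

# readRDS() returns the object, so you choose the name it is bound to.
restored <- readRDS(path)
print(identical(toy, restored))

unlink(path)  # clean up the temporary file
```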

The feature names are fairly straightforward. Here, I show only the last few features of the output:

 

> colnames(army_ansur)
[93] "wristcircumference" "wristheight"
[95] "Gender" "Date"
[97] "Installation" "Component"
[99] "Branch" "PrimaryMOS"
[101] "SubjectsBirthLocation" "SubjectNumericRace"
[103] "Ethnicity" "DODRace"
[105] "Age" "Heightin"
[107] "Weightlbs" "WritingPreference"
[109] "SubjectId"

I'm interested in looking at the breakdown of the "Component" and "Gender" columns:

> table(army_ansur$Component)

Army National Guard        Army Reserve        Regular Army
               2708                 220                3140

> table(army_ansur$Gender)

Female   Male
  1986   4082
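
Raw counts are easier to compare as proportions. Here is a minimal sketch using prop.table() from base R, with the counts re-entered by hand from the output above so that it runs without the data file:

```r
# Counts copied from the table() output above.
gender_counts <- c(Female = 1986, Male = 4082)
component_counts <- c("Army National Guard" = 2708,
                      "Army Reserve"        = 220,
                      "Regular Army"        = 3140)

# prop.table() divides each count by the total.
gender_share    <- round(prop.table(gender_counts), 3)
component_share <- round(prop.table(component_counts), 3)

print(gender_share)
print(component_share)

# On the real data, the same idea would be:
# prop.table(table(army_ansur$Gender))
```

Roughly two-thirds of the subjects are male, and the Army Reserve makes up less than 4 percent of the sample.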

If we look at missing values, we can see something of interest. Here is the abbreviated output:

> sapply(army_ansur, function(x) sum(is.na(x)))
           PrimaryMOS SubjectsBirthLocation    SubjectNumericRace
                    0                     0                     0
            Ethnicity               DODRace                   Age
                 4647                     0                     0
             Heightin             Weightlbs     WritingPreference
                    0                     0                     0
            SubjectId
                 4082
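
With more than a hundred columns, most of the entries in that output are zeros. As a sketch, colSums(is.na(...)) gives the same counts as the sapply() call, and subsetting keeps only the columns that actually have missing values. The toy data frame below is made up for illustration:

```r
# Made-up stand-in for army_ansur with a few missing values.
toy <- data.frame(
  Ethnicity = c(NA, "Mexican", NA, NA),
  Age       = c(41L, 35L, 42L, 31L),
  SubjectId = c(10027L, NA, NA, 10037L)
)

# Number of NAs per column.
na_counts <- colSums(is.na(toy))

# Keep only the columns that have at least one NA, largest first.
with_missing <- sort(na_counts[na_counts > 0], decreasing = TRUE)
print(with_missing)
```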

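SubjectId is missing for 4,082 of the 6,068 observations. Fine, let's take care of that right now by adding a complete sequential ID. Because R is case-sensitive, this creates a new subjectid column rather than overwriting SubjectId:

> army_ansur$subjectid <- seq_len(nrow(army_ansur))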

Since weight is what we will predict after we build our unsupervised model, let's have a look at it:

> sjmisc::descr(army_ansur$Weightlbs)

## Basic descriptive statistics

var    type label    n NA.prc  mean    sd   se  md trimmed       range skew
 dd integer    dd 6068      0 174.8 33.69 0.43 173   173.4 321 (0-321) 0.39

Look at the range! We have someone who weighs zero. A plot of this data is in order, I believe:

> ggplot2::ggplot(army_ansur, ggplot2::aes(x = Weightlbs)) + 
ggplot2::geom_density() +
ggthemes::theme_wsj()

The output of the preceding code is as follows:

[Figure: density plot of Weightlbs]

From the plot, I would estimate that only one or two observations have implausible weight values. Sorting the weights confirms that assumption:

> dplyr::select(army_ansur, Weightlbs) %>%
dplyr::arrange(Weightlbs)
# A tibble: 6,068 x 1
   Weightlbs
       <int>
 1         0
 2        86
 3        88
 4        90
 5        95
 6        95
 7        95
 8        96
 9        98
10       100
# ... with 6,058 more rows

Removing that observation is important, as a weight of zero cannot be a valid measurement:

> armyClean <- dplyr::filter(army_ansur, Weightlbs > 0)
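
It is worth checking that a filter removed exactly what you expected and nothing more. The following is a minimal base-R sketch on a made-up data frame. The names toy, toy_clean, and n_bad are hypothetical, and the subsetting matches dplyr::filter() only when the column has no NAs, which is the case for Weightlbs here:

```r
# Made-up stand-in for army_ansur; only Weightlbs matters here.
toy <- data.frame(
  subjectid = 1:5,
  Weightlbs = c(0L, 86L, 174L, 210L, 321L)
)

# Count implausible weights before filtering.
n_bad <- sum(toy$Weightlbs <= 0)

# Base-R counterpart of dplyr::filter(toy, Weightlbs > 0),
# valid here because Weightlbs contains no NAs.
toy_clean <- toy[toy$Weightlbs > 0, ]

# The cleaned data should have lost exactly the implausible rows.
stopifnot(nrow(toy_clean) == nrow(toy) - n_bad)
stopifnot(min(toy_clean$Weightlbs) > 0)
```

Applied to the real data, the same checks would compare nrow(armyClean) with nrow(army_ansur).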

We can now transition to bundling our features for PCA and creating training and testing dataframes.
