Data loading and review

To begin, load magrittr and install the other packages we will need:


> library(magrittr)

> install.packages("caret")

> install.packages("DataExplorer")

> install.packages("earth")

> install.packages("ggthemes")

> install.packages("psych")

> install.packages("tidyverse")

> options(scipen = 999)
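Re-running every install.packages() call on each pass through the code is slow. As a minimal sketch, the check below installs only what is absent. Note that `missing_pkgs` is a hypothetical helper name introduced here for illustration, not something from the packages above, and the install line is left commented out so the snippet has no side effects:

```r
# Hypothetical helper (not from the original text): returns the subset of
# `pkgs` that is not found among the currently installed packages.
missing_pkgs <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

# The packages installed in the listing above.
needed <- c("caret", "DataExplorer", "earth", "ggthemes", "psych", "tidyverse")
to_install <- missing_pkgs(needed)

# Install only what is absent; uncomment to actually run the installation.
# if (length(to_install) > 0) install.packages(to_install)
```

If everything is already installed, `to_install` is an empty character vector and nothing happens.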

Now, read the data into your environment:


> army_ansur <- readRDS("army_ansur.RData")

The feature names are fairly straightforward. Here, I show only the last few features of the output:


> colnames(army_ansur)
[93] "wristcircumference" "wristheight"
[95] "Gender" "Date"
[97] "Installation" "Component"
[99] "Branch" "PrimaryMOS"
[101] "SubjectsBirthLocation" "SubjectNumericRace"
[103] "Ethnicity" "DODRace"
[105] "Age" "Heightin"
[107] "Weightlbs" "WritingPreference"
[109] "SubjectId"

I'm interested in looking at the breakdown of the "Component" and "Gender" columns:

> table(army_ansur$Component)

Army National Guard        Army Reserve        Regular Army
               2708                 220                3140

> table(army_ansur$Gender)

Female   Male
  1986   4082

If we look at missing values, we can see something of interest. Here is the abbreviated output:

> sapply(army_ansur, function(x) sum(is.na(x)))
           PrimaryMOS SubjectsBirthLocation    SubjectNumericRace
                    0                     0                     0
            Ethnicity               DODRace                   Age
                 4647                     0                     0
             Heightin             Weightlbs     WritingPreference
                    0                     0                     0

We have a bunch of missing subject IDs. Fine, let's take care of that right now:

> army_ansur$subjectid <- seq_len(nrow(army_ansur))
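The two steps above, counting missing values per column and rebuilding the identifier, can be tried on a small stand-in before touching the real data. The following is only a sketch: the `toy` data frame and its values are made up for illustration, and colSums(is.na()) is shown as one possible alternative to the sapply() call, not as the method used in this chapter:

```r
# Toy data frame standing in for army_ansur (all values are invented).
toy <- data.frame(
  SubjectId = c(NA, NA, 3L, NA),
  Weightlbs = c(150L, 0L, 180L, 210L),
  Ethnicity = c(NA, "A", NA, NA),
  stringsAsFactors = FALSE
)

# Count missing values per column; is.na() yields a logical matrix and
# colSums() adds up the TRUEs in each column.
na_counts <- colSums(is.na(toy))
na_counts

# Rebuild the identifier as a 1..n sequence sized from the data itself
# instead of a hard-coded row count.
toy$subjectid <- seq_len(nrow(toy))
```

Sizing the sequence with nrow() keeps the assignment valid if rows are added or removed upstream.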

Since weight is what we will predict after we build our unsupervised model, let's have a look at it:

> sjmisc::descr(army_ansur$Weightlbs)

## Basic descriptive statistics
  var    type label    n NA.prc  mean    sd   se  md trimmed       range skew
   dd integer    dd 6068      0 174.8 33.69 0.43 173   173.4 321 (0-321) 0.39

Look at the range! We have someone who weighs zero. A plot of this data is in order, I believe:

> ggplot2::ggplot(army_ansur, ggplot2::aes(x = Weightlbs)) +
    ggplot2::geom_density()

The output of the preceding code is as follows:

So, I would estimate we only have one or two observations of implausible weight values. Indeed, this code will confirm that assumption:

> dplyr::select(army_ansur, Weightlbs) %>%
    dplyr::arrange(Weightlbs)
# A tibble: 6,068 x 1
   Weightlbs
       <int>
 1         0
 2        86
 3        88
 4        90
 5        95
 6        95
 7        95
 8        96
 9        98
10       100
# ... with 6,058 more rows

Removing that observation is important:

> armyClean <- dplyr::filter(army_ansur, Weightlbs > 0)
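It is worth confirming that a filter like this removes exactly the rows you expect. Here is a minimal base R sketch of that check on invented data. The `toy` data frame is hypothetical, and which() is used so that any NA weights would be dropped, as dplyr::filter() does:

```r
# Toy weights standing in for army_ansur$Weightlbs (values are invented).
toy <- data.frame(Weightlbs = c(0L, 86L, 95L, 173L, 321L))

# Sort ascending to surface the smallest values first.
sorted <- toy[order(toy$Weightlbs), , drop = FALSE]

# Count the non-positive weights before removing them.
n_bad <- sum(toy$Weightlbs <= 0, na.rm = TRUE)

# Keep only rows with a positive weight; which() discards NA comparisons.
toy_clean <- toy[which(toy$Weightlbs > 0), , drop = FALSE]

# The cleaned data should be shorter by exactly the number of bad rows.
stopifnot(nrow(toy_clean) == nrow(toy) - n_bad)
```

The same comparison of row counts before and after can be applied to army_ansur and armyClean.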

We can now transition to bundling our features for PCA and creating training and testing dataframes.

