Writing the code

Up to this point, we have only done the cleanup in our heads. I personally find this mental exercise rewarding: cleaning up the data in my head before actually writing any cleanup code. This is not because I'm highly confident that I will have handled all the irregularities in the data. Rather, I like this process because it clarifies what needs to be done, and that in turn guides the data structures required for the job.

But once the thinking is done, it's time to validate it with code.

We start with the clean function:

// hints is a slice of bools indicating whether it's a categorical variable
func clean(hdr []string, data [][]string, indices []map[string][]int, hints []bool, ignored []string) (int, int, []float64, []float64, []string, []bool) {
    modes := mode(indices)
    var Xs, Ys []float64
    var newHints []bool
    var newHdr []string
    var cols int

    for i, row := range data {
        for j, col := range row {
            if hdr[j] == "Id" { // skip id
                continue
            }
            if hdr[j] == "SalePrice" { // we'll put SalePrice into Ys
                cxx, _ := convert(col, false, nil, hdr[j])
                Ys = append(Ys, cxx...)
                continue
            }

            if inList(hdr[j], ignored) {
                continue
            }

            if hints[j] {
                col = imputeCategorical(col, j, hdr, modes)
            }
            cxx, newHdrs := convert(col, hints[j], indices[j], hdr[j])
            Xs = append(Xs, cxx...)

            if i == 0 {
                h := make([]bool, len(cxx))
                for k := range h {
                    h[k] = hints[j]
                }
                newHints = append(newHints, h...)
                newHdr = append(newHdr, newHdrs...)
            }
        }
        // add bias

        if i == 0 {
            cols = len(Xs)
        }
    }
    rows := len(data)
    if len(Ys) == 0 { // it's possible that there are no Ys (i.e. the test.csv file)
        Ys = make([]float64, len(data))
    }
    return rows, cols, Xs, Ys, newHdr, newHints
}

clean takes the data (in the form of [][]string) and, with the help of the indices built earlier, builds a matrix of Xs (which will be float64) and a vector of Ys. In Go, this is a simple loop: we read over the input data and convert each value. A hints slice is also passed in, to tell us whether a variable should be treated as categorical or continuous.

In particular, the treatment of year variables is a point of contention. Some statisticians think it's fine to treat a year variable as a discrete, non-categorical variable, while others think otherwise. I'm personally of the opinion that it doesn't really matter: if treating a year variable as a categorical variable improves the model score, then by all means use it. It's unlikely to, though.
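To make the hints concrete, here is a minimal sketch of what such a slice could look like for the first few columns. The column names come from the dataset, but this fragment is purely illustrative; it is not the actual datahints variable used in main later:

// A hypothetical fragment of a hints slice, aligned with the header columns.
// true marks a categorical variable, false a continuous one.
hints := []bool{
    false, // Id - numeric (and skipped by clean anyway)
    true,  // MSZoning - categorical
    false, // LotFrontage - continuous
    false, // LotArea - continuous
    true,  // Street - categorical
}
// To treat a year variable such as YrSold as categorical instead,
// we would simply flip its entry to true and let convert one-hot encode it.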

The meat of the preceding code is the conversion of a string into []float64, which is what the convert function does. We will look at that function in a bit, but it's important to note that the data has to be imputed before conversion. This is because Go's slices are well-typed: a []float64 can only contain float64 values.

While it's true that we can also replace any unknown data with NaN, that would not be helpful, especially in the case of categorical data, where NA might actually have semantic meaning. So, we impute categorical data before converting them. This is what imputeCategorical looks like:

// imputeCategorical replaces "NA" with the mode of categorical values
func imputeCategorical(a string, col int, hdr []string, modes []string) string {
    if a != "NA" && a != "" {
        return a
    }
    switch hdr[col] {
    case "MSZoning", "BsmtFullBath", "BsmtHalfBath", "Utilities", "Functional", "Electrical", "KitchenQual", "SaleType", "Exterior1st", "Exterior2nd":
        return modes[col]
    }
    return a
}

What this function says is this: if the value is neither NA nor an empty string, then it's a valid value, so we return early. Otherwise, we have to consider whether NA should be treated as a valid category.

For some specific categories, NAs are not valid categories, and they are replaced by the most-commonly occurring value. This is a logical thing to do—a shed in the middle of nowhere with no electricity, no gas, and no bath is a very rare occurrence. There are techniques to deal with that (such as LASSO regression), but we're not going to do that right now. Instead, we'll just replace them with the mode.

The modes were calculated earlier, in the clean function. The definition used for finding the mode is very simple: for each variable, we find the value whose list of row indices is the longest, and return that value:

// mode finds the most common value for each variable
func mode(index []map[string][]int) []string {
    retVal := make([]string, len(index))
    for i, m := range index {
        var max int
        for k, v := range m {
            if len(v) > max {
                max = len(v)
                retVal[i] = k
            }
        }
    }
    return retVal
}

After we've imputed the categorical data, we'll convert all the data to []float64. For numerical data, the result is a slice with a single value. For categorical data, the result is a slice of 0s and 1s.

For the purposes of this chapter, any NAs found in the numerical data will be converted to 0.0. There are other valid strategies that can improve the results of the model slightly, but they are more involved.
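One such strategy is mean imputation: replacing a missing numerical value with the mean of the non-missing values in that column. Here is a minimal sketch of the idea (imputeMean is a name I'm using for illustration, not part of the chapter's code; it reuses strconv, which convert already depends on):

// imputeMean is a hypothetical alternative to the zero-value strategy:
// it replaces "NA"/empty entries in a numerical column with the column mean.
func imputeMean(col []string) []float64 {
    // first pass: compute the mean of the parseable values
    var sum float64
    var n int
    for _, s := range col {
        if f, err := strconv.ParseFloat(s, 64); err == nil {
            sum += f
            n++
        }
    }
    var mean float64
    if n > 0 {
        mean = sum / float64(n)
    }

    // second pass: keep valid values, substitute the mean for missing ones
    retVal := make([]float64, len(col))
    for i, s := range col {
        if f, err := strconv.ParseFloat(s, 64); err == nil {
            retVal[i] = f
        } else {
            retVal[i] = mean // missing value gets the column mean
        }
    }
    return retVal
}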

And so, the conversion code looks simple:

// convert converts a string into a slice of floats
func convert(a string, isCat bool, index map[string][]int, varName string) ([]float64, []string) {
    if isCat {
        return convertCategorical(a, index, varName)
    }
    // here we deliberately ignore the error, because the zero value of float64 is, well, zero
    f, _ := strconv.ParseFloat(a, 64)
    return []float64{f}, []string{varName}
}

// convertCategorical is a basic function that encodes a categorical variable as a slice of floats.
// There are no smarts involved at the moment.
// The encoder takes the first value of the sorted keys as the default value, encoding it as []float64{0, 0, 0, ...}.
func convertCategorical(a string, index map[string][]int, varName string) ([]float64, []string) {
    retVal := make([]float64, len(index)-1)

    // important: Go randomizes map iteration order, so we need to extract the keys and sort them
    // optimization point: this function can be made stateful.
    tmp := make([]string, 0, len(index))
    for k := range index {
        tmp = append(tmp, k)
    }

    // numerical "categories" should be sorted numerically
    tmp = tryNumCat(a, index, tmp)

    // find the NA category and swap it to the front, making NA the default value
    var naIndex int
    for i, v := range tmp {
        if v == "NA" {
            naIndex = i
            break
        }
    }
    tmp[0], tmp[naIndex] = tmp[naIndex], tmp[0]

    // build the one-hot encoding: the first (default) category maps to all zeroes
    for i, v := range tmp[1:] {
        if v == a {
            retVal[i] = 1
            break
        }
    }

    // name each column after the variable and its category, e.g. MSZoning_RL
    for i, v := range tmp {
        tmp[i] = fmt.Sprintf("%v_%v", varName, v)
    }

    return retVal, tmp[1:]
}

I would like to draw your attention to the convertCategorical function. There is some verbosity involved in the code, but the verbosity wills away the magic. Because Go randomizes map iteration order, it's important to get a list of keys and sort them. This way, all subsequent access will be deterministic.
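As a standalone illustration of this pattern, here is a minimal sketch with made-up data (it assumes fmt and sort are imported): extract the keys, sort them, and every later pass over the map becomes deterministic:

func sortedKeysDemo() {
    m := map[string][]int{"RL": {0, 3}, "RM": {1}, "NA": {2}}

    // iterating over m directly yields a different order on different runs,
    // so we extract the keys first...
    keys := make([]string, 0, len(m))
    for k := range m {
        keys = append(keys, k)
    }
    // ...and sort them: keys is now [NA RL RM], on every run
    sort.Strings(keys)

    for _, k := range keys {
        fmt.Println(k, len(m[k]))
    }
}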

The function also leaves room for optimization: making it a stateful function would speed it up further, but for this project we shan't bother.

This is our main function so far:

func main() {
    f, err := os.Open("train.csv")
    mHandleErr(err)
    hdr, data, indices, err := ingest(f)
    mHandleErr(err)
    fmt.Printf("Original Data: \nRows: %d, Cols: %d\n========\n", len(data), len(hdr))
    c := cardinality(indices)
    for i, h := range hdr {
        fmt.Printf("%v: %v\n", h, c[i])
    }
    fmt.Println("")
    fmt.Printf("Building into matrices\n=============\n")
    rows, cols, XsBack, YsBack, newHdr, _ := clean(hdr, data, indices, datahints, nil)
    _ = newHdr // newHdr will be used in later sections
    Xs := tensor.New(tensor.WithShape(rows, cols), tensor.WithBacking(XsBack))
    Ys := tensor.New(tensor.WithShape(rows, 1), tensor.WithBacking(YsBack))
    fmt.Printf("Xs: %+1.1s\nYs: %1.1s\n", Xs, Ys)
    fmt.Println("")
}

And the output of the code is as follows:

Original Data:
Rows: 1460, Cols: 81
========
Id: 1460
MSSubClass: 15
MSZoning: 5
LotFrontage: 111
LotArea: 1073
Street: 2

Building into matrices
=============
Xs:
⎡ 0 0 ⋯ 1 0⎤
⎢ 0 0 ⋯ 1 0⎥

⎢ 0 0 ⋯ 1 0⎥
⎣ 0 0 ⋯ 1 0⎦
Ys:
C[2e+05 2e+05 ⋯ 1e+05 1e+05]

Note that while the original data had 81 variables, by the time we are done with the encoding there are 615 variables. This is what we want to pass into the regression. At this point, the seasoned data scientist may notice a few things that may not sit well with her. For example, the number of variables (615) is too close to the number of observations (1,460) for comfort, so we might run into some issues. We will address those issues later.

Another point to note is that we're converting the data to *tensor.Dense. You can think of the *tensor.Dense data structure as a matrix. It is an efficient data structure with a lot of niceties that we will make use of later.
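For a feel of the API, here is a minimal sketch, independent of our housing data, of creating a *tensor.Dense and reading a value back. It assumes the same gorgonia.org/tensor package that main already imports, and reuses the mHandleErr helper:

// Create a 2x3 matrix backed by a flat []float64, in row-major order.
m := tensor.New(tensor.WithShape(2, 3), tensor.WithBacking([]float64{1, 2, 3, 4, 5, 6}))

// At returns the value at the given (row, column) coordinates.
v, err := m.At(1, 2)
mHandleErr(err)
fmt.Println(v)         // 6
fmt.Println(m.Shape()) // (2, 3)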
