Filtering, renaming, and deleting columns in Weka datasets

Generally, data won't be quite in the form we'll need for our analyses. We spent a lot of time transforming data in Clojure in Chapter 2, Cleaning and Validating Data. Weka contains several methods for renaming columns and filtering the ones that will make it into the dataset.

Most datasets have one or more columns that will throw off clustering (row identifiers or name fields, for instance), so we must filter the columns in the datasets before we perform any analysis. We'll see a lot of examples of this in the recipes to come.

Getting ready

We'll use the same dependencies, imports, and data files that we used in the Loading CSV and ARFF files into Weka recipe, along with the dataset that we loaded in that recipe. We'll also need a different set of Weka classes, as well as the clojure.string library:

(import [weka.filters Filter]
        [weka.filters.unsupervised.attribute Remove])
(require '[clojure.string :as str])

How to do it…

In this recipe, we'll first rename the columns in the dataset. Then we'll look at two different ways to remove columns, one destructive and one not.

Renaming columns

We'll create a function to rename the attributes with a sequence of keywords, and then we'll see this function in action:

  1. First, we'll define a function that takes a dataset and a sequence of field names, and then renames the columns in the dataset to match those passed in:
    (defn set-fields [instances field-seq]
      (doseq [n (range (.numAttributes instances))]
        (.renameAttribute instances
                          (.attribute instances n)
                          (name (nth field-seq n)))))
  2. Now, let's look at the dataset's current column names:
    user=> (map #(.. data (attribute %) name)
                (range (.numAttributes data)))
    ("Country-Code" "Year" "AG.SRF.TOTL.K2" "AG.LND.AGRI.ZS" "AG.LND.AGRI.K2")
  3. These are the names that the World Bank gives these fields, but we can change the field names to something more obvious:
    (set-fields data
                [:country-code :year
                 :total-land :agri-percent :agri-total])
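
Note that set-fields uses nth, so it throws an exception partway through if we pass fewer names than the dataset has attributes, leaving some columns renamed and others not. The following is a minimal sketch of a guard we could run first. The field-names helper is hypothetical (it is not part of Weka or of this recipe), and it works on plain Clojure data so that it can be tried without a dataset:

```clojure
;; Hypothetical helper, not part of the recipe: converts keywords,
;; symbols, or strings into attribute-name strings and checks that
;; there is exactly one name per attribute.
(defn field-names [n-attrs field-seq]
  (let [names (mapv name field-seq)]
    (when (not= n-attrs (count names))
      (throw (IllegalArgumentException.
               (str "Expected " n-attrs " field names, got "
                    (count names)))))
    names))

;; name turns each keyword into the string that renameAttribute needs.
(field-names 5 [:country-code :year
                :total-land :agri-percent :agri-total])
```

With a real dataset, we could pass (.numAttributes data) as the first argument before calling set-fields.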

Removing columns

This dataset also contains a number of columns that we won't use, for example, the field agri-percent. Since it won't ever be used, we'll destructively remove it from the dataset:

  1. Weka allows us to delete attributes by index, but we want to specify them by name. We'll write a function that takes an attribute name and returns the index:
    (defn attr-n [instances attr-name]
      (->> instances
        (.numAttributes)
        range
        (map #(vector % (.. instances (attribute %)
                            name)))
        (filter #(= (second %) (name attr-name)))
        ffirst))
  2. We can use that function to call reduce on the instances and remove the attributes as we go:
    (defn delete-attrs [instances attr-names]
      (reduce (fn [is n]
                (.deleteAttributeAt is (attr-n is n))
                is)
              instances
              attr-names))
  3. Finally, we can use the following to delete the attribute mentioned earlier:
    (delete-attrs data [:agri-percent])
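
To see why delete-attrs looks up the index again on every step of the reduce, it can help to mimic the same logic on a plain vector of names. This is only a sketch: index-of-name and remove-names are hypothetical stand-ins for attr-n and delete-attrs, and unlike Weka's deleteAttributeAt, they return new vectors instead of changing anything in place:

```clojure
;; Hypothetical stand-in for attr-n: returns the zero-based position
;; of attr-name in a vector of name strings, or nil if it is absent.
(defn index-of-name [attr-names attr-name]
  (first (keep-indexed (fn [i n]
                         (when (= n (name attr-name)) i))
                       attr-names)))

;; Hypothetical stand-in for delete-attrs: every removal shifts the
;; later positions down by one, so the index is looked up again
;; against the current vector on each step.
(defn remove-names [attr-names to-remove]
  (reduce (fn [names n]
            (if-let [i (index-of-name names n)]
              (into (subvec names 0 i) (subvec names (inc i)))
              names))
          (vec attr-names)
          to-remove))

(remove-names ["country-code" "year" "total-land" "agri-percent"]
              [:year :agri-percent])
```

This sketch skips names that it can't find. The attr-n function in this recipe returns nil in that case, which is not a valid index for deleteAttributeAt, so it is worth checking attribute names for typos before deleting.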

Hiding columns

There are a few attributes that we'll hide rather than delete. Instead of destructively removing attributes from one set of instances, a filter creates a new dataset without the hidden attributes and leaves the original untouched. It can be useful to have one dataset for clustering and another that keeps the complete information (for example, a name or ID attribute). For this example, I'll take out the country code:

  1. Weka does this by applying a filter class to a dataset to create a new dataset. We'll use the Remove filter in this function. This also uses the attr-n function, which was used earlier in this recipe:
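    (defn filter-attributes [dataset remove-attrs]
      ;; attr-n is zero-based; Remove's -R option expects one-based indexes.
      (let [attrs (map inc
                       (map (partial attr-n dataset)
                            remove-attrs))
            ;; ->options is not defined in this recipe; it is assumed to
            ;; build the String array that .setOptions expects.
            options (->options
                       "-R"
                       (str/join \, (map str attrs)))
            rm (doto (Remove.)
                 (.setOptions options)
                 (.setInputFormat dataset))]
        (Filter/useFilter dataset rm)))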
  2. We can call this function with the attribute names that we want to filter out:
    (def data-numbers
          (filter-attributes data [:country-code]))

    We can then check the column names of the new dataset:

    user=> (map #(.. data-numbers (attribute %) name)
                (range (.numAttributes data-numbers)))
    ("year" "total-land" "agri-total")
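
The options passed to the Remove filter are easy to get subtly wrong, so here is a small sketch of how they can be built, using plain Clojure only. Both helpers are hypothetical: options-array is an assumed stand-in for the ->options function that this recipe calls without defining, and remove-range is just the index-to-string step pulled out on its own. The separator must be the character literal \, because a bare comma is whitespace to the Clojure reader:

```clojure
(require '[clojure.string :as str])

;; Hypothetical stand-in for ->options: Weka's setOptions expects a
;; Java String array, so convert every argument to a string.
(defn options-array [& opts]
  (into-array String (map str opts)))

;; Hypothetical helper: converts zero-based attribute indexes into the
;; comma-separated, one-based list that the -R option takes.
(defn remove-range [zero-based-indexes]
  (str/join \, (map (comp str inc) zero-based-indexes)))

(remove-range [0 3])
;; => "1,4"

(vec (options-array "-R" (remove-range [0 3])))
;; => ["-R" "1,4"]

;; A bare comma is ignored by the reader, so this joins with no
;; separator at all.
(str/join , ["1" "4"])
;; => "14"
```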

How it works…

Weka's attributes are an integral part of its data model. Moreover, the algorithms that we'll see later can be sensitive to which columns are in the dataset. To work with only the attributes that matter, we can hide the others with a filter or delete them altogether using the functions in this recipe.
