Michael Paluszek and Stephanie Thomas, MATLAB Machine Learning, 10.1007/978-1-4842-2250-8_4

4. Representation of Data for Machine Learning in MATLAB

Michael Paluszek¹ and Stephanie Thomas¹

(1)New Jersey, USA

4.1 Introduction to MATLAB Data Types

4.1.1 Matrices

By default, all variables in MATLAB are double-precision matrices. You do not need to declare a type for these variables. Matrices can be multidimensional and are accessed using 1-based indices via parentheses. You can address elements of a matrix using a single index, taken column-wise, or one index per dimension. To create a matrix variable, simply assign a value to it, like this 2 × 2 matrix a:

>> a = [1 2; 3 4];

>> a(1,1)

>> a(3)
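The two indexing forms can be compared directly. The following is a minimal sketch (the variable names first, linear, idx, and same are illustrative): a single index counts down the columns first, so a(3) is the element in row 1, column 2.

```matlab
a = [1 2; 3 4];                 % 2-by-2 matrix, stored column by column

first  = a(1,1);                % row 1, column 1 -> 1
linear = a(3);                  % third element counting down the columns -> a(1,2) = 2

% sub2ind converts (row, column) subscripts to the equivalent single index
idx  = sub2ind(size(a), 1, 2);  % -> 3
same = (a(idx) == a(1,2));      % true
```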

You can simply add, subtract, multiply, and divide matrices with no special syntax. The matrices must be the correct size for the linear algebra operation requested. A transpose is indicated using a single quote suffix, A', and the matrix power uses the operator ^.

>> b = a'*a;

>> c = a^2;

>> d = b + c;

By default, every variable is a numerical variable. You can initialize matrices to a given size using the zeros, ones, eye, or rand functions, which produce zeros, ones, identity matrices (ones on the diagonal), and random numbers, respectively. Use isnumeric to identify numeric variables.

Table 4.1 summarizes some key functions for interacting with matrices.

Table 4.1 Key Functions for Matrices

Function	Purpose
zeros	Initialize a matrix to zeros
ones	Initialize a matrix to ones
eye	Initialize an identity matrix
rand, randn	Initialize a matrix of random numbers
isnumeric	Identify a matrix or scalar numeric value
isscalar	Identify a scalar value (a 1 × 1 matrix)
size	Return the size of the matrix

4.1.2 Cell Arrays

One variable type unique to MATLAB is cell arrays. This is really a list container, and you can store variables of any type in elements of a cell array. Cell arrays can be multidimensional, just like matrices, and are useful in many contexts.

Cell arrays are indicated by curly braces, {}. They can be of any dimension and contain any data, including strings, structures, and objects. You can initialize them using the cell function, recursively display the contents using celldisp, and access subsets using parentheses just like for a matrix. A short example is below.

>> c = cell(3,1);

>> c{1} = 'string';

>> c{2} = false;

>> c{3} = [1 2; 3 4];

>> b = c(1:2);

>> celldisp(b)

b{1} =

string

b{2} =

0

Using curly braces for access gives you the element data as the underlying type. When you access elements of a cell array using parentheses, the contents are returned as another cell array, rather than the cell contents. MATLAB help has a special section called Comma-Separated Lists that highlights the use of cell arrays as lists. The code analyzer will also suggest more efficient ways to use cell arrays. For instance,

Replace

a = {b{:} c};

with

a = [b {c}];

Cell arrays are especially useful for sets of strings, with many of MATLAB’s string search functions optimized for cell arrays, such as strcmp.

Use iscell to identify cell array variables. Use deal to manipulate structure array and cell array contents.
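The difference between the two access styles, and the use of strcmp and deal, can be sketched as follows. This is an illustrative example only; the variable names and sample values are assumptions, not code from the chapter.

```matlab
c = {'alpha', false, [1 2; 3 4]};   % 1-by-3 cell array of mixed types

sub = c(1:2);       % parentheses return a 1-by-2 cell array
str = c{1};         % curly braces return the contents, here a char array

% strcmp accepts a cell array of strings and returns a logical array
names = {'cat', 'dog', 'cat'};
isCat = strcmp(names, 'cat');       % [true false true]

% deal copies one input to several outputs, here two cell elements
[c{1:2}] = deal(0);                 % c{1} and c{2} are now both 0
```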

Table 4.2 summarizes some key functions for interacting with cell arrays.

Table 4.2 Key Functions for Cell Arrays

Function	Purpose
cell	Initialize a cell array
cellstr	Create cell array from a character array
iscell	Identify a cell array
iscellstr	Identify a cell array containing only strings
celldisp	Recursively display the contents of a cell array

4.1.3 Data Structures

Data structures in MATLAB are highly flexible, leaving it up to the user to enforce consistency in fields and types. You are not required to initialize a data structure before assigning fields to it, but it is a good idea to do so, especially in scripts, to avoid variable conflicts.

Replace

d.fieldName = 0;

with

d = struct;

d.fieldName = 0;

In fact, we have found it generally a good idea to create a special function to initialize larger structures that are used throughout a set of functions. This is similar to creating a class definition. Generating your data structure from a function, instead of typing out the fields in a script, means you always start with the correct fields. Having an initialization function also allows you to specify the types of variables and provide sample or default data. Remember, since MATLAB does not require you to declare variable types, doing so yourself with default data makes your code that much clearer.

■ TIP Create an initialization function for data structures.

You make a data structure into an array simply by assigning an additional copy. The fields must be in the same order, which is yet another reason to use a function to initialize your structure. You can nest data structures with no limit on depth.

d = MyStruct;

d(2) = MyStruct;

function d = MyStruct

d = struct;

d.a = 1.0;

d.b = 'string';

MATLAB now allows for dynamic field names using variables, that is, structName.(dynamicExpression). This provides improved performance over getfield, where the field name is passed as a string, and it allows for all sorts of inventive structure programming. Take our data structure array in the previous code snippet, and let's get the values of field a using a dynamic field name; the values are returned in a cell array.

>> field = 'a';

>> values = {d.(field)}

values =

[1] [1]

Use isstruct to identify structure variables and isfield to check for the existence of fields. Note that isempty will return false for a struct initialized with struct, even if it has no fields.
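The behavior of isempty on a field-less struct is easy to check. The sketch below uses illustrative variable names and also shows one common pattern: looping over fieldnames with dynamic field names.

```matlab
d = struct;                            % 1-by-1 struct with no fields
emptyCheck = isempty(d);               % false: the array itself is 1-by-1
noFields   = isempty(fieldnames(d));   % true: there are no fields yet

d.a = 1.0;
d.b = 'string';
hasA = isfield(d, 'a');                % true
hasC = isfield(d, 'c');                % false

% Visit every field using a dynamic field name
names = fieldnames(d);
for k = 1:numel(names)
    value = d.(names{k});
    fprintf('%s is of class %s\n', names{k}, class(value));
end
```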

Table 4.3 provides key functions for structs.

Table 4.3 Key Functions for Structs

Function	Purpose
struct	Initialize a structure with or without fields
isstruct	Identify a structure
isfield	Determine if a field exists in a structure
fieldnames	Get the fields of a structure in a cell array
rmfield	Remove a field from a structure
deal	Set fields in a structure to a value

4.1.4 Numerics

While MATLAB defaults to doubles for any data entered at the command line or in a script, you can specify a variety of other numeric types, including single, uint8, uint16, uint32, uint64, and logical (i.e., an array of booleans). Use of the integer types is especially relevant to large data sets such as images. Use the smallest data type that can represent your data, especially when your data sets are large.
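The memory savings can be measured with whos. The following sketch assumes a 480 × 640 array as a stand-in for a grayscale image; the sizes and variable names are illustrative.

```matlab
imgDouble = zeros(480, 640);            % double: 8 bytes per element
imgUint8  = zeros(480, 640, 'uint8');   % uint8: 1 byte per element

sD = whos('imgDouble');
sU = whos('imgUint8');
ratio = sD.bytes / sU.bytes;            % the double array is 8 times larger

% Integer arithmetic saturates at the type limits instead of wrapping around
x = uint8(200) + uint8(100);            % 255, the largest uint8 value
```

The saturation behavior is worth keeping in mind: convert to double before doing arithmetic whose intermediate results may exceed the integer range.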

4.1.5 Images

MATLAB supports a variety of formats, including GIF, JPG, TIFF, PNG, HDF, FITS, and BMP. You can read in an image directly using imread, which can determine the type automatically from the extension, or fitsread. (FITS stands for Flexible Image Transport System and the interface is provided by the CFITSIO library.) imread has special syntaxes for some image types, such as handling alpha channels for PNG, so you should review the options for your specific images. imformats manages the file format registry and allows you to specify handling of new user-defined types if you can provide read and write functions.

You can display an image using imshow, image, or imagesc; the last of these scales the colormap for the range of data in the image.

For example, we use a set of images of cats in Chapter 7, Face Recognition. We can look at the image info for one of these sample images using imfinfo:

>> imfinfo('IMG_4901.JPG')
ans =
Filename: 'MATLAB/Cats/IMG_4901.JPG'
FileModDate: '28-Sep-2016 12:48:15'
FileSize: 1963302
Format: 'jpg'
FormatVersion: ''
Width: 3264
Height: 2448
BitDepth: 24
ColorType: 'truecolor'
FormatSignature: ''
NumberOfSamples: 3
CodingMethod: 'Huffman'
CodingProcess: 'Sequential'
Comment: {}
Make: 'Apple'
Model: 'iPhone 6'
Orientation: 1
XResolution: 72
YResolution: 72
ResolutionUnit: 'Inch'
Software: '9.3.5'
DateTime: '2016:09:17 22:05:08'
YCbCrPositioning: 'Centered'
DigitalCamera: [1x1 struct]
GPSInfo: [1x1 struct]
ExifThumbnail: [1x1 struct]

If we view this image using imshow, it will issue a warning that the image is too big to fit on the screen and that it is displayed at 33%. If we view it using image, there will be a visible set of axes. image is useful for displaying other two-dimensional matrix data as individual elements per pixel. Both functions return a handle to an image object; only the axes' properties are different. Figure 4.1 shows the resulting figures. Note the labeled axes on the right figure.

Figure 4.1 Image display options. A figure created using imshow is on the left and a figure using image is on the right.

Table 4.4 provides key functions for interacting with images.

Table 4.4 Key Functions for Images

Function	Purpose
imread	Read an image in a variety of formats
imfinfo	Gather information about an image file
imformats	Manage the image file format registry
imwrite	Write data to an image file
image	Display image from array
imagesc	Display image data scaled to the current colormap
imshow	Display an image, optimizing figure, axes, and image object properties, and taking an array or a filename as an input
rgb2gray	Convert an RGB image or colormap to grayscale
ind2rgb	Convert index data to RGB
rgb2ind	Convert RGB data to indexed image data
fitsread	Read a FITS file
fitswrite	Write data to a FITS file
fitsinfo	Information about a FITS file returned in a data structure
fitsdisp	Display FITS file metadata for all Header Data Units (HDUs) in the file
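Several of these functions can be tried without any image files on hand. The sketch below writes a small synthetic image to a temporary PNG file, then reads it back; the file name, image size, and variable names are assumptions made for illustration.

```matlab
% Build a random 32-by-48 truecolor image and write it to a temporary file
img   = uint8(randi([0 255], 32, 48, 3));
fname = fullfile(tempdir, 'example_image.png');   % hypothetical file name
imwrite(img, fname);

info = imfinfo(fname);             % info.Width is 48, info.Height is 32
img2 = imread(fname);              % format is inferred from the extension
identical = isequal(img, img2);    % true, since PNG is lossless

% Average the three channels and display the result with a scaled colormap
intensity = mean(double(img2), 3);
imagesc(intensity);
axis image
colorbar

delete(fname);                     % clean up the temporary file
```

The channel average here is a plain mean, not the weighted conversion that rgb2gray performs.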

4.1.6 Datastore

Datastores allow you to interact with files containing data that are too large to fit in memory. There are different types of datastores for tabular data, images, spreadsheets, databases, and custom files. Each datastore provides functions to extract smaller amounts of data that do fit in memory for analysis. For example, you can search a collection of images for those with the brightest pixels or maximum saturation values. We will use our directory of cat images as an example.

>> location = pwd

location =

/Users/Shared/svn/Manuals/MATLABMachineLearning/MATLAB/Cats

>> ds = datastore(location)

ds =

ImageDatastore with properties:

Files: {

' .../Shared/svn/Manuals/MATLABMachineLearning/MATLAB/Cats/IMG_0191.png';

' .../Shared/svn/Manuals/MATLABMachineLearning/MATLAB/Cats/IMG_1603.png';

' .../Shared/svn/Manuals/MATLABMachineLearning/MATLAB/Cats/IMG_1625.png'

... and 19 more

}

Labels: {}

ReadFcn: @readDatastoreImage

Once the datastore is created, you use the applicable class functions to interact with it. Datastores have standard container-style functions like read, partition, and reset. Each type of datastore has different properties. The DatabaseDatastore requires the Database Toolbox and allows you to use SQL queries.
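A typical pattern is to loop with hasdata and read so that only one item is in memory at a time. The following sketch builds a small temporary folder of synthetic images to stand in for a real collection; the folder, file names, and the "brightest image" criterion are illustrative assumptions.

```matlab
% Create three small uniform images of increasing brightness
folder = tempname;                 % unique temporary folder name
mkdir(folder);
for k = 1:3
    img = uint8(k*50*ones(8, 8, 3));
    imwrite(img, fullfile(folder, sprintf('img%d.png', k)));
end

ds = imageDatastore(folder);

% Read one image at a time and track the one with the largest mean value
brightest     = -Inf;
brightestFile = '';
while hasdata(ds)
    [img, info] = read(ds);        % info.Filename is the full path of the file
    m = mean(double(img(:)));
    if m > brightest
        brightest     = m;
        brightestFile = info.Filename;
    end
end

reset(ds);                         % rewind so the datastore can be read again
```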

MATLAB provides the MapReduce framework for working with out-of-memory data in datastores. The input data can be any of the datastore types, and the output is a key-value datastore. The map function processes the datastore input in chunks and the reduce function calculates the output values for each key. mapreduce can be sped up by using it with the MATLAB Parallel Computing Toolbox, Distributed Computing Server, or Compiler. Table 4.5 gives key functions for using datastores.

Table 4.5 Key Functions for Datastore

Function	Purpose
datastore	Create a datastore for a collection of data
read	Read a subset of data from the datastore
readall	Read all of the data in the datastore
hasdata	Check to see if there is more data in the datastore
reset	Reset the datastore to its initial state
partition	Excerpt a portion of the datastore
numpartitions	Estimate a reasonable number of partitions
ImageDatastore	Datastore of a list of image files
TabularTextDatastore	A collection of one or more tabular text files
SpreadsheetDatastore	Datastore of spreadsheets
FileDatastore	Datastore for files with a custom format, for which you provide a reader function
KeyValueDatastore	Datastore of key-value pairs
DatabaseDatastore	Datastore for a database connection; requires Database Toolbox

4.1.7 Tall Arrays

Tall arrays are new to release R2016b of MATLAB. They are allowed to have more rows than will fit in memory. You can use them to work with datastores that might have millions of rows. Tall arrays can use almost any MATLAB type as a column variable, including numeric data, cell arrays, strings, datetimes, and categoricals. The MATLAB documentation provides a list of functions that support tall arrays. Results for operations on the array are only evaluated when they are explicitly requested using the gather function. The histogram function can be used with tall arrays and will execute immediately.
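Deferred evaluation can be seen on a small scale by converting an in-memory array to a tall array. This is a minimal sketch with illustrative variable names; with real data the tall array would normally be created from a datastore, and if the Parallel Computing Toolbox is installed, gather may start a parallel pool.

```matlab
x = (1:1000)';            % in-memory column, standing in for a large data set
t = tall(x);              % tall array; operations on it are deferred

m = mean(t);              % nothing is computed yet; m is itself a tall array
isDeferred = istall(m);   % true

result = gather(m);       % triggers the evaluation and returns a double
```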

The MATLAB Statistics and Machine Learning Toolbox™, Database Toolbox, Parallel Computing Toolbox, Distributed Computing Server, and Compiler all provide additional extensions for working with tall arrays. For more information about this new feature, use the following topics in the documentation:

Tall Arrays
Analysis of Big Data with Tall Arrays
Functions That Support Tall Arrays (A-Z)
Index and View Tall Array Elements
Visualization of Tall Arrays
Extend Tall Arrays with Other Products
Tall Array Support, Usage Notes, and Limitations

Table 4.6 gives key functions for using Tall Arrays.

Table 4.6 Key Functions for Tall Arrays

Function	Purpose
tall	Initialize a tall array
gather	Execute the requested operations
summary	Display summary information to the command line
head	Access first rows of a tall array
tail	Access last rows of a tall array
istall	Check the type of the array to determine if it is tall
write	Write the tall array to disk

4.1.8 Sparse Matrices

Sparse matrices are a special category of matrix in which most of the elements are zero. They appear commonly in large optimization problems and are used by many such packages. The zeros are “squeezed” out and MATLAB stores only the nonzero elements along with index data such that the full matrix can be recreated. Many regular MATLAB functions, such as chol or diag, preserve the sparseness of an input matrix. Table 4.7 gives key functions for sparse matrices.

Table 4.7 Key Functions for Sparse Matrices

Function	Purpose
sparse	Create a sparse matrix from a full matrix or from a list of indices and values
issparse	Determine if a matrix is sparse
nnz	Number of nonzero elements in a sparse matrix
spalloc	Allocate nonzero space for a sparse matrix
spy	Visualize a sparsity pattern
spfun	Selectively apply a function to the nonzero elements of a sparse matrix
full	Convert a sparse matrix to full form
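The storage savings and the preservation of sparsity can be checked with a simple diagonal matrix. The sketch below is illustrative only; the matrix size and variable names are assumptions.

```matlab
n = 1000;
S = sparse(1:n, 1:n, 2*ones(1, n), n, n);   % n-by-n diagonal matrix, n nonzeros
F = full(S);                                % same matrix in full storage

sS = whos('S');
sF = whos('F');
fprintf('sparse: %d bytes, full: %d bytes\n', sS.bytes, sF.bytes);

count = nnz(S);                 % 1000 nonzero elements

% S is symmetric positive definite, so chol applies; the factor stays sparse
R = chol(S);
stillSparse = issparse(R);      % true
```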

4.1.9 Tables and Categoricals

Tables were introduced in release R2013b of MATLAB and allow tabular data to be stored with metadata in one workspace variable. They are an effective way to store and interact with data that one might put in, or import from, a spreadsheet. The table columns can be named, assigned units and descriptions, and accessed as one would fields in a data structure, that is, T.DataName. See readtable on creating a table from a file, or try out the Import Data button from the command window.

Categorical arrays allow for storage of discrete nonnumeric data, and they are often used within a table to define groups of rows. For example, time data may have the day of the week, or geographic data may be organized by state or county. They can be leveraged to rearrange data in a table using unstack.

You can also combine multiple data sets into single tables using join, innerjoin, and outerjoin, which will be familiar to you if you have worked with databases.
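As a small illustration, the sketch below builds a table with a categorical column, selects rows by category, and joins it to a second table on a shared key. All of the data, variable names, and units are made up for the example.

```matlab
% A table with a categorical grouping variable
ID   = [1; 2; 3; 4];
Day  = categorical({'Mon'; 'Tue'; 'Mon'; 'Wed'});
Temp = [20.5; 22.1; 19.8; 23.4];
T = table(ID, Day, Temp);
T.Properties.VariableUnits = {'', '', 'degC'};   % metadata stored with the table

% Select rows by category and access a column like a struct field
mondayRows = T(T.Day == 'Mon', :);
meanMonday = mean(mondayRows.Temp);              % average of 20.5 and 19.8

% A second table sharing the key variable ID
S = table([1; 2; 5], {'North'; 'South'; 'East'}, ...
          'VariableNames', {'ID', 'Site'});

% innerjoin keeps only rows whose key appears in both tables (ID 1 and 2)
J = innerjoin(T, S);
```

By default the join keys are the variables that have the same name in both tables, which is ID here.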

Table 4.8 lists key functions for using tables.

Table 4.8 Key Functions for Tables

Function	Purpose
table	Create a table with data in the workspace
readtable	Create a table from a file
join	Merge tables by matching up variables
innerjoin	Join tables A and B retaining only the rows that match
outerjoin	Join tables including all rows
stack	Stack data from multiple table variables into one variable
unstack	Unstack data from a single variable into multiple variables
summary	Calculate and display summary data for the table
categorical	Create an array of discrete categorical data
iscategorical	Determine whether an input is a categorical array
categories	List of categories in the array
iscategory	Test for a particular category
addcats	Add categories to an array
removecats	Remove categories from an array
mergecats	Merge categories

4.1.10 Large MAT-Files

You can access parts of a large MAT-file without loading the entire file into memory by using the matfile function. This creates an object that is connected to the requested MAT-file without loading it. Data are only loaded when you request a particular variable, or part of a variable. You can also dynamically add new data to the MAT-file.

For example, we can load a MAT-file of neural net weights generated in a later chapter.

>> m = matfile('PitchNNWeights','Writable',true)

m =

matlab.io.MatFile

Properties:

Properties.Source: '/Users/Shared/svn/Manuals/MATLABMachineLearning/MATLAB/PitchNNWeights.mat'

Properties.Writable: true

w: [1x8 double]

We can access a portion of the previously unloaded w variable, or add a new variable name, all using this object m.

>> y = m.w(1:4)

y =

1 1 1 1

>> m.name = 'Pitch Weights'

m =

matlab.io.MatFile

Properties:

Properties.Source: '/Users/Shared/svn/Manuals/MATLABMachineLearning/MATLAB/PitchNNWeights.mat'

Properties.Writable: true

name: [1x13 char]

w: [1x8 double]

>> d = load(’PitchNNWeights’)

d =

w: [1 1 1 1 1 1 1 1]

name: 'Pitch Weights'

There are some limits to indexing into unloaded data, for example with struct arrays and sparse arrays. Also, matfile requires MAT-files using version 7.3, which is not the default for a generic save operation as of R2016b. You must either create the MAT-file using matfile to take advantage of these features or use the '-v7.3' flag when saving the file.
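The version requirement can be handled explicitly when saving. The following sketch writes a version 7.3 MAT-file to the temporary folder, then reads part of a variable and adds a new one through a matfile object; the file name and variable contents are assumptions for illustration.

```matlab
fname = fullfile(tempdir, 'ExampleWeights.mat');   % hypothetical file name
w = ones(1, 8);
save(fname, 'w', '-v7.3');         % version 7.3 allows partial loading

m = matfile(fname, 'Writable', true);
y = m.w(1, 1:4);                   % loads only the first four elements of w
m.name = 'Example Weights';        % adds a new variable to the file

d = load(fname);                   % d now has the fields w and name
delete(fname);                     % clean up the temporary file
```

This sketch indexes with one subscript per dimension, m.w(1, 1:4), which is the conservative form for matfile objects.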

4.2 Initializing a Data Structure Using Parameters

4.2.1 Problem

It's always a good idea to use a special function to define a data structure you are using as a type in your codebase, similar to writing a class but with less overhead. Users can then overwrite individual fields in their code, but there is an alternative way to set many fields at once: an initialization function that handles a parameter-pair input list. This allows you to do additional processing in your initialization function. Also, your parameter string names can be more descriptive than you would choose to make your field names.

4.2.2 Solution

The simplest way to implement the parameter pairs is using varargin and a switch statement. Alternatively, you could use an inputParser object, which allows you to specify required and optional inputs as well as named parameters. In that case, you have to write separate or anonymous functions for validation that can be passed to the inputParser, rather than just writing out the validation in your code.

4.2.3 How It Works

We will use the data structure developed for the automobile simulation in Chapter 12 as an example. The header lists the input parameters along with the input dimensions and units, if applicable.

%% AUTOMOBILEINITIALIZE Initialize the automobile data structure.
%% Form:
% d = AutomobileInitialize( varargin )
%% Description
% Initializes the data structure using parameter pairs.
%% Inputs
% varargin: ('parameter',value,...)
%
% 'mass' (1,1) (kg)
% 'steering angle' (1,1) (rad)
% 'position tires' (2,4) (m)
% 'frontal drag coefficient' (1,1)
% 'side drag coefficient' (1,1)
% 'tire friction coefficient' (1,1)
% 'tire radius' (1,1) (m)
% 'engine torque' (1,1) (Nm)
% 'rotational inertia' (1,1) (kg-m^2)
% 'state' (6,1) [m;m;m/s;m/s;rad;rad/s]

The function first creates the data structure using a set of defaults and then handles the parameter pairs entered by a user. After the parameters have been processed, two areas are calculated using the dimensions and the height.

function d = AutomobileInitialize( varargin )

% Defaults
d.mass = 1513;
d.delta = 0;
d.r = [ 1.17 1.17 -1.68 -1.68;...
       -0.77 0.77 -0.77 0.77];
d.cDF = 0.25;
d.cDS = 0.5;
d.cF = 0.01; % Ordinary car tires on concrete
d.radiusTire = 0.4572; % m
d.torque = d.radiusTire*200.0; % N m
d.inr = 2443.26;
d.x = [0;0;0;0;0;0];
d.fRR = [0.013 6.5e-6];
d.dim = [1.17+1.68 2*0.77];
d.h = 2/0.77;
d.errOld = 0;
d.passState = 0;

% Overwrite the defaults with any parameter pairs that were passed in
n = length(varargin);
for k = 1:2:n
  switch lower(varargin{k})
    case 'mass'
      d.mass = varargin{k+1};
    case 'steering angle'
      d.delta = varargin{k+1};
    case 'position tires'
      d.r = varargin{k+1};
    case 'frontal drag coefficient'
      d.cDF = varargin{k+1};
    case 'side drag coefficient'
      d.cDS = varargin{k+1};
    case 'tire friction coefficient'
      d.cF = varargin{k+1};
    case 'tire radius'
      d.radiusTire = varargin{k+1};
    case 'engine torque'
      d.torque = varargin{k+1};
    case 'rotational inertia'
      d.inr = varargin{k+1}; % same field as the default above
    case 'state'
      d.x = varargin{k+1};
    case 'rolling resistance coefficients'
      d.fRR = varargin{k+1};
    case 'height automobile'
      d.h = varargin{k+1};
    case 'side and frontal automobile dimensions'
      d.dim = varargin{k+1};
  end
end

% Processing
d.areaF = d.dim(2)*d.h;
d.areaS = d.dim(1)*d.h;

To perform the same tasks with inputParser, you add an addRequired, addOptional, or addParameter call for every item in the switch statement. The named parameters require default values. You can optionally specify a validation function; in the example below we use isnumeric to limit the values to numeric data.

p = inputParser;
p.FunctionName = 'AutomobileInitialize'; % throw errors as from AutomobileInitialize
p.PartialMatching = false; % disallow partial matches

cDF_Default = 0.25;
mass_Default = 1513;
addParameter(p,'mass',mass_Default,@isnumeric);
addParameter(p,'cDF',cDF_Default,@isnumeric);
parse(p,varargin{:});
d = p.Results;

In this case, the results of the parsed parameters are stored in a Results substructure.
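Validation can be made stricter by passing an anonymous function to addParameter. The sketch below is written as a script, with a cell array standing in for varargin; the function name ExampleInitialize and the isPositiveScalar check are hypothetical and not part of the chapter's code.

```matlab
p = inputParser;
p.FunctionName = 'ExampleInitialize';   % hypothetical name used in error messages

% Anonymous validation function: numeric, scalar, and strictly positive
isPositiveScalar = @(x) isnumeric(x) && isscalar(x) && x > 0;
addParameter(p, 'mass', 1513, isPositiveScalar);
addParameter(p, 'cDF',  0.25, isPositiveScalar);

args = {'mass', 1200};             % stands in for varargin
parse(p, args{:});
d = p.Results;                     % d.mass is 1200, d.cDF keeps its default
usedDefaults = p.UsingDefaults;    % names of parameters that were not passed

% A value that fails validation makes parse throw an error
try
    parse(p, 'mass', -5);
    failed = false;
catch
    failed = true;
end
```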

4.3 Performing mapreduce on an Image Datastore

4.3.1 Problem

We discussed the datastore class in the introduction to the chapter. Now let’s use it to perform analysis on the full set of cat images using mapreduce, which is scalable to very large numbers of images.

4.3.2 Solution

We create the datastore by passing in the path to the folder of cat images. We also need to create a map function and a reduce function, to pass into mapreduce. If you are using additional toolboxes like the Parallel Computing Toolbox, you would specify the execution environment using mapreducer.

4.3.3 How It Works

First, create the datastore using the path to the images.

>> imds = imageDatastore('MATLAB/Cats')

imds =

ImageDatastore with properties:

Files: {

' .../Shared/svn/Manuals/MATLABMachineLearning/MATLAB/Cats/IMG_0191.png';

' .../Shared/svn/Manuals/MATLABMachineLearning/MATLAB/Cats/IMG_1603.png';

' .../Shared/svn/Manuals/MATLABMachineLearning/MATLAB/Cats/IMG_1625.png'

... and 19 more

}

Labels: {}

ReadFcn: @readDatastoreImage

Second, we write the map function. This must generate and store the intermediate values that will be processed by the reduce function. Each intermediate value must be stored under a key in the intermediate key-value datastore using add. In this case, the map function will receive one image each time it is called.

function catColorMapper(data, info, intermediateStore)

% The channels of an RGB image are ordered red (1), green (2), blue (3)
add(intermediateStore, 'Avg Red', struct('Filename', info.Filename, 'Val', mean(mean(data(:,:,1)))) );
add(intermediateStore, 'Avg Blue', struct('Filename', info.Filename, 'Val', mean(mean(data(:,:,3)))) );
add(intermediateStore, 'Avg Green', struct('Filename', info.Filename, 'Val', mean(mean(data(:,:,2)))) );
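The per-channel averages can be checked on a synthetic image outside of mapreduce. In this sketch the image contents and variable names are assumptions; it relies only on the standard ordering of the third dimension of a truecolor array: red, green, blue.

```matlab
% A 4-by-6 truecolor image with a different constant value in each channel
data = zeros(4, 6, 3, 'uint8');
data(:,:,1) = 200;    % red channel
data(:,:,2) = 50;     % green channel
data(:,:,3) = 100;    % blue channel

avgRed   = mean(mean(data(:,:,1)));   % column means, then their mean -> 200
avgGreen = mean(mean(data(:,:,2)));   % 50
avgBlue  = mean(mean(data(:,:,3)));   % 100

% Equivalent form: flatten the plane and take a single mean
redPlane = data(:,:,1);
avgRed2  = mean(redPlane(:));         % also 200
```

mean returns a double even for uint8 input, so the averages are not limited to integer values.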

The reduce function will then receive the list of the image files from the datastore once for each key in the intermediate data. It receives an iterator to the intermediate datastore as well as an output datastore. Again, each output must be a key-value pair. The hasnext and getnext functions used are part of the mapreduce ValueIterator class.

function catColorReducer(key, intermediateIter, outputStore)

% Iterate over values for each key
maxVal = -Inf;
maxImageFilename = '';
while hasnext(intermediateIter)
  value = getnext(intermediateIter);
  % Compare values to find the maximum
  if value.Val > maxVal
    maxVal = value.Val;
    maxImageFilename = value.Filename;
  end
end

% Add final key-value pair
add(outputStore, ['Maximum ' key], maxImageFilename);

Finally, we call mapreduce using function handles to our two helper functions.

maxRGB = mapreduce(imds, @catColorMapper, @catColorReducer);

********************************

* MAPREDUCE PROGRESS *

********************************

Map 0% Reduce 0%

Map 13% Reduce 0%

Map 27% Reduce 0%

Map 40% Reduce 0%

Map 50% Reduce 0%

Map 63% Reduce 0%

Map 77% Reduce 0%

Map 90% Reduce 0%

Map 100% Reduce 0%

Map 100% Reduce 33%

Map 100% Reduce 67%

Map 100% Reduce 100%

The results are stored in a MAT-file, for example, results_1_28-Sep-2016_16-28-38_347. The store returned is a key-value store to this MAT-file, which in turn contains the store with the final key-value results.

>> output = readall(maxRGB)

output =

Key                    Value
___________________    ___________________________

'Maximum Avg Red'      '/MATLAB/Cats/IMG_1625.png'
'Maximum Avg Blue'     '/MATLAB/Cats/IMG_4866.JPG'
'Maximum Avg Green'    '/MATLAB/Cats/IMG_4866.JPG'

4.4 Creating a Table from a File

Summary

There are a variety of data containers in MATLAB to assist you in analyzing your data for machine learning. If you have access to a computer cluster or one of the specialized computing toolboxes, you have even more options. Table 4.9 gives a listing of the code presented in this chapter.

Table 4.9 Chapter Code Listing

File	Description
AutomobileInitialize	Data structure initialization example from Chapter 12
catReducer	Image datastore used with mapreduce
