Defining a methodology

Let's start by clarifying the role of the data scientist, software engineer, and domain expert.

A domain or subject-matter expert is a person with authoritative or credited expertise in a particular area or topic. A chemist is an expert in the domain of chemistry and possibly related fields.

A data scientist solves problems related to data in a variety of fields such as biological sciences, health care, marketing, or finance. Data and text mining, signal processing, statistical analysis, and modeling using machine learning algorithms are some of the activities performed by a data scientist.

A software developer performs all the tasks related to creating software applications, including analysis, design, coding, testing, and deployment.

A data scientist has many options in selecting and implementing a classification or clustering algorithm.

Firstly, a mathematical or statistical model has to be selected to extract knowledge from the raw input data or from the output of an upstream data transformation. The selection of the model is constrained by the following parameters:

  • Business requirements, such as accuracy of results or computation time
  • Availability of training data, algorithms, and libraries
  • Access to a domain or subject-matter expert, if needed

Secondly, the engineer has to select a computational and deployment framework suitable for the amount of data to be processed. The computational context is defined by the following parameters:

  • Available resources, such as machines, CPU, memory, or I/O bandwidth
  • Implementation strategy, such as iterative versus recursive computation or caching (a brief sketch follows this list)
  • Requirement for responsiveness of the overall process, such as duration of computation or display of intermediate results
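
The implementation strategy options can be illustrated with a short Scala sketch. The object and method names (ImplementationChoices, sumIterative, sumRecursive, memoize) are hypothetical and chosen only for illustration; the sketch contrasts an explicit loop, a tail-recursive loop that the compiler turns into a loop, and a simple cache built with a higher-order function:

    import scala.annotation.tailrec
    import scala.collection.mutable

    object ImplementationChoices {
      // Iterative computation: an explicit while loop with a mutable accumulator
      def sumIterative(xs: Array[Double]): Double = {
        var acc = 0.0
        var i = 0
        while (i < xs.length) { acc += xs(i); i += 1 }
        acc
      }

      // Recursive computation: a tail-recursive loop producing the same result
      def sumRecursive(xs: Array[Double]): Double = {
        @tailrec
        def loop(i: Int, acc: Double): Double =
          if (i >= xs.length) acc else loop(i + 1, acc + xs(i))
        loop(0, 0.0)
      }

      // Caching: memoize an expensive, pure transformation keyed by its input
      def memoize[A, B](f: A => B): A => B = {
        val cache = mutable.Map.empty[A, B]
        (a: A) => cache.getOrElseUpdate(a, f(a))
      }
    }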

Thirdly, a domain expert has to tag or label the observations in order to generate an accurate classifier.

Finally, the model has to be validated against a reliable test dataset.
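
As a minimal, hypothetical sketch of this validation step, the following Scala snippet measures the accuracy of a trained classifier against a labeled test set; the type aliases and the choice of accuracy as the quality metric are illustrative, not a prescribed API:

    object Validation {
      type Features = Array[Double]      // a single observation
      type Classifier = Features => Int  // a trained model mapping features to a label

      // Fraction of test observations whose predicted label matches the expected one
      def accuracy(classifier: Classifier, testSet: Seq[(Features, Int)]): Double = {
        require(testSet.nonEmpty, "The test set must not be empty")
        val hits = testSet.count { case (features, label) => classifier(features) == label }
        hits.toDouble / testSet.size
      }
    }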

The following diagram illustrates the selection process to create a workflow:

Diagram: Statistical and computational modelling for machine learning applications

The parameters of a data transformation may need to be reconfigured according to the output of the upstream data transformation. Scala's higher-order functions are particularly suitable for implementing configurable data transformations.
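
As a minimal sketch of this idea (the names normalize, threshold, and pipeline are illustrative, not taken from a specific library), each transformation is written as a higher-order function that accepts its configuration and returns a plain function; the parameters of the downstream stage can be recomputed from the upstream output before the stages are composed:

    object ConfigurableTransforms {
      // A configurable transformation: applying the configuration (the range)
      // returns a reusable function Array[Double] => Array[Double]
      def normalize(range: (Double, Double)): Array[Double] => Array[Double] = {
        val (lo, hi) = range
        xs => xs.map(x => (x - lo) / (hi - lo))
      }

      // A second stage whose parameter (the cutoff) may be derived from the
      // output of the upstream normalization
      def threshold(cutoff: Double): Array[Double] => Array[Int] =
        xs => xs.map(x => if (x >= cutoff) 1 else 0)

      // Composing the configured stages yields the complete data workflow
      def pipeline(range: (Double, Double), cutoff: Double): Array[Double] => Array[Int] =
        normalize(range) andThen threshold(cutoff)
    }

For instance, ConfigurableTransforms.pipeline((0.0, 10.0), 0.5) maps Array(2.0, 8.0) to Array(0, 1).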
