© Stefania Loredana Nita and Marius Mihailescu 2017

Stefania Loredana Nita and Marius Mihailescu, Practical Concurrent Haskell, https://doi.org/10.1007/978-1-4842-2781-7_8

8. Debugging Techniques Used in Big Data

Stefania Loredana Nita and Marius Mihailescu1

(1)Bucharest, Romania

In this chapter, you learn what big data means and how Haskell can be integrated with big data. You also see some debugging techniques.

Data Science

There is critical and growing interest in data science among data-savvy experts at companies, public agencies, and nonprofits. The supply of professionals who can work effectively with data at scale is limited, however, which is reflected in the rapidly rising salaries of data engineers, data scientists, statisticians, and data analysts.

In a recent review by the McKinsey Global Institute, an expert remarked: “A shortage of the analytical and managerial talent necessary to make the most of big data is a significant and pressing challenge (for the United States).”

The report states that by 2018 there will be 4 to 5 million jobs in the United States that require data analysis skills. A large number of these positions will have to be filled through training or retraining. There will also be a need for 1.5 million managers and analysts with the analytical and technical skills “who can ask the right questions and consume the results of analysis of big data effectively.”

Information has become inexpensive and omnipresent. We are digitizing content that was created over centuries, and gathering new kinds of data from web logs, cell phones, sensors, instruments, and transactions. The amount of digital data in existence is growing at a tremendous rate—doubling every two years—and changing the way we live. International Business Machines (IBM) estimates that 2.5 billion gigabytes (GB) of data were produced each day in 2012, and that 90% of all the data in history had been created in just the preceding two years. An article in Forbes states that data is arriving faster than ever, and that by the year 2020, around 1.7 megabytes of new data will be created each second—for every person on the planet.

Meanwhile, new technologies are being developed to make sense of, and use, this avalanche of unstructured data gathered from many fields of activity. They provide a way to recognize patterns across all of the available information, which contains data of different types, helping to advance research, improve the human condition, and create business and social value. The rise of big data can deepen our understanding of everything from physical and natural systems to human social and economic behavior.

Essentially, each area of the economy now has access to far more information than would have been possible even 10 years ago. Organizations today are amassing new data at a rate that surpasses their ability to extract value from it. The question confronting each organization is how to use the data effectively—not only its own data, but also all of the data that is available and relevant.

Our capacity to derive social and economic value from this newly available data is constrained by a shortage of expertise. Working with it requires new skills and tools. The data is often too voluminous to fit on one computer, to be handled by conventional databases, or to be processed with standard desktop software. It is also more heterogeneous than the curated data of the past. Digitized text, audio, video, sensor, and blog content are often messy, incomplete, and unstructured. Such data typically has uncertain provenance and quality, and in many cases it must be combined with other data to be useful. Working with user-generated data additionally raises privacy, security, and ethical issues.

Data science is emerging at the convergence of the social sciences and computer science. It deals with both unstructured and structured data, and covers everything related to data cleansing, preparation, and analysis. Data science is the unification of mathematics, statistics, programming, and critical thinking. It requires the ability to find and examine data in clever new ways, and to clean, prepare, and align it. Essentially, it is the umbrella term for the techniques used when trying to extract information from data.

Data science has applications in a lot of fields, including Internet search, digital advertising, health care, travel, gaming, and energy management, among many others.

Big Data

Big data is a term for data sets that are so large or complex that conventional data processing applications are inadequate to handle them. The term dates back to the 1990s. To understand this technology, this section offers a comprehensive description of big data.

Characteristics

To date, there is no standard definition of big data, but three characteristics are commonly agreed upon within the scientific community.

  • Volume. In 2012, about 2.5 exabytes (2.5 billion gigabytes) of data were created every day, and that rate doubles approximately every two years. Today, more data crosses the Internet every second than was stored on the entire Internet 20 years ago. This gives organizations the chance to work with many petabytes of data in a single data set—and not only from the web. For example, it is estimated that Walmart gathers more than 2.5 petabytes of data from its customer transactions. A petabyte is one quadrillion bytes, or the equivalent of about 20 million filing cabinets of text. An exabyte is 1,000 times that amount, or 1 billion gigabytes.

  • Velocity. For some applications, the speed of data creation is even more important than the volume. Real-time or nearly real-time data makes it possible for an organization to be far more agile than its rivals. Retrieving fast, useful information is a big advantage for a company.

  • Variety. Big data is collected from messages, updates, and pictures posted on social media; data sent from sensors; GPS readings; and many other sources. Many of the most important sources of big data are relatively new. The immense amounts of data from social networks, for instance, are only as old as the networks themselves: Facebook launched in 2004, Twitter in 2006. Smartphones are similar; they now produce huge amounts of data tied to people, activities, and places. As a result, the structured databases that store corporate information are ill suited for handling such data. At the same time, the declining costs of storage, memory, processing, bandwidth, and so on mean that previously expensive data-intensive approaches are quickly becoming economical.

The preceding characteristics are known as the 3Vs, defined by analyst Doug Laney of the META Group (now Gartner) in 2001. Although these 3Vs are widely accepted by the community, some reports describe big data in terms of 5Vs. The other two Vs are the following.

  • Veracity. This refers to the messiness of the data, or how much we trust it. With many forms of big data, quality and accuracy are less controllable (consider Twitter posts with hash tags, typos, and casual speech, as well as their questionable reliability and accuracy); however, big data and analytics technologies now allow us to work with this kind of data. The sheer volume often compensates for the lack of quality or accuracy.

  • Value. It is fine to have access to enormous amounts of data, but unless we can turn it into something of value, it is useless. So this last V can be seen as the most important V of big data. Organizations need to make a business case for any effort to collect and leverage massive amounts of data. It is easy to fall into the buzz trap and launch big data initiatives without a clear understanding of the costs and the benefits.

Figure 8-1 summarizes the characteristics of big data.

Figure 8-1. Big data represented as 5Vs

Tools

When working with data, it passes through several stages before useful information is extracted. Let’s look at the stages involved in extracting information from big data, as well as some of the tools used in each category.

Storage and Management

Classical database systems cannot handle large amounts of data, so new systems for storage and management have been developed.

  • Hadoop. Hadoop has become synonymous with big data. It is an open source software framework for distributed storage and processing of huge data sets on computer clusters. This means storage and processing can be scaled up and down across clusters of commodity hardware. Hadoop provides enormous amounts of storage for any kind of data, tremendous processing power, and the ability to handle a virtually unlimited number of tasks or jobs.

  • Cloudera. Essentially, a branded Hadoop distribution with additional management services. It helps businesses build an information hub that gives employees better access to data. Because it is built on open source components, Cloudera helps organizations manage their Hadoop systems. It also delivers a certain measure of data security, which is very important when handling sensitive or personal information.

  • MongoDB. Consider it a replacement for relational databases. It is very good for data that changes rapidly, and for unstructured or semi-structured data. Common uses include storing data for mobile applications, product catalogs, real-time personalization, content management, and applications delivering a single view across multiple systems.

  • Talend. A company that provides open source integration technologies. One of its best-known products is Master Data Management, which combines real-time data, application, and process integration with data quality assurance.

Cleaning

Owning data is powerful, but extracting knowledge from it is more important. Obtaining information from data sets as they are is complicated, because in many cases they are unstructured. The following products clean data sets and bring them into a usable form.

  • OpenRefine is a free tool for cleaning data. Data sets can be explored quickly and without complication, even if the data is unstructured.

  • DataCleaner is a tool that converts semi-structured data sets into cleansed data sets that can be read without effort.

Data Mining

Data mining means discovering knowledge inside a database, rather than extracting information from web pages into databases. The goal of data mining is to make predictions that support decision making. The following are the most commonly used data-mining tools.

  • RapidMiner is a tool for predictive analytics. It is powerful, simple to use, and has an open source community behind it. You can even add your own techniques to RapidMiner through its dedicated APIs.

  • IBM SPSS Modeler contains an entire suite of data-mining techniques. It includes text analytics, entity analytics, and decision management and optimization.

Other data-mining products include Oracle Data Mining (ODM), Teradata, FrameData, and Kaggle.

Languages

Some languages provide great support for big data. The following are the most popular.

  • R is a language for statistical computing and graphics.

  • Python has become one of the most commonly used programming languages. It is flexible, free, and contains many libraries for data management and analysis. It is also used in web applications that need great scalability.

Haskell for Big Data

Of course, one of the most important languages used in big data is Haskell. The following are some of the tools that Haskell provides for working with large data sets.

  • hspark

  • Hadron

  • Cloud Haskell

  • ZeroMQ

  • Krapsh library

Next, we briefly describe these Haskell tools.

hspark

hspark is a new library inspired by Apache Spark that is useful for distributed, in-memory computations. It performs simple MapReduce jobs over networked nodes.

hspark implements a simple and extensible domain-specific language (DSL) for specifying a job. A cluster configuration is taken as input and the jobs are translated into a set of distributed jobs, making use of the distributed-process Cloud Haskell library.

The following are components of hspark.

  • Context. Provides information about the cluster.

  • Resilient Distributed Dataset (RDD) DSL. Expresses hspark jobs.

  • Execution. Executes RDD and its dependencies.

Let’s look at Alp Mestanogullari and Mathieu Boespflug’s “Hello World!” example, which uses the closely related sparkle library ( http://www.tweag.io/posts/2016-02-25-hello-sparkle.html ). It runs on an Amazon cluster.

# Build it
$ stack build hello
# Package it
$ mvn package -f sparkle -Dsparkle.app=sparkle-example-hello
# Run it
$ spark-submit --master 'spark://IP:PORT' sparkle/target/sparkle-0.1.jar

The following code counts the number of lines from the input file that contains at least one “a” character.

{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE StaticPointers #-}


import Control.Distributed.Spark as RDD
import Data.Text (isInfixOf)


main :: IO ()
main = do
    conf <- newSparkConf "Hello sparkle!"
    sc <- newSparkContext conf
    rdd <- textFile sc "s3://some-bucket/some-file.txt"
    as <- RDD.filter (static (\line -> "a" `isInfixOf` line)) rdd
    numAs <- RDD.count as
    putStrLn $ show numAs ++ " lines with the letter 'a'."

Hadron

The purpose of the Hadron library is to use type-safety in the complex and sensitive world of Hadoop Streaming MapReduce. It will be presented in more detail in Chapter 13.

The following are its main characteristics.

  • Binds into Hadoop using the Streaming interface.

  • Runs Hadoop jobs with multiple steps, so programmers do not need to orchestrate them manually.

  • Allows the user to interact with input/output data from the Hadoop Distributed File System (HDFS), Amazon S3, or other systems that Hadoop supports.

  • Makes a set of long and complicated jobs easier to design and maintain.

  • Provides built-in support for map-side joins.

  • Provides many combinators in the Controller module, which cover common tasks.

It provides three modules.

  • Hadron.Basic: Constructs one MapReduce step, but it is not recommended for direct usage.

  • Hadron.Controller: Automates instrumentation of map-reduce jobs with multiple stages.

  • Hadron.Protocol: Describes data encode/decode strategies according to Protocol type.

Cloud Haskell

Cloud Haskell is a library for distributed concurrency. It is covered in Chapter 9. The aim of this library is to provide support for writing programs for clusters. The supplied model is message-passing communication, very similar to Erlang’s.

It is implemented in the distributed-process package. The following are some of its characteristics.

  • Builds concurrent programs making use of asynchronous message passing.

  • Builds distributed computing programs.

  • Builds fault-tolerant systems.

  • Runs on different network transport implementations.

  • Supports static values (necessary for remote communication).

An important design goal of Cloud Haskell is the separation between the transport and process layers, so that the process layer does not depend on any particular transport back-end. New transport back-ends can be plugged in underneath programs written with Control.Distributed.Process.
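The following is a minimal sketch of the message-passing model, assuming the distributed-process and network-transport-tcp packages; the host, port, and message text are placeholders, and the three-argument createTransport call matches older network-transport-tcp releases (the signature changed in newer ones).

import Control.Distributed.Process
import Control.Distributed.Process.Node
import Network.Transport.TCP (createTransport, defaultTCPParameters)

main :: IO ()
main = do
  -- Create a TCP transport and start a local node on it.
  Right transport <- createTransport "127.0.0.1" "10501" defaultTCPParameters
  node <- newLocalNode transport initRemoteTable
  runProcess node $ do
    -- Send a message to ourselves and wait for it to arrive.
    self <- getSelfPid
    send self "hello, Cloud Haskell"
    message <- expect :: Process String
    liftIO (putStrLn message)

The example sends a message to the process’s own mailbox and then receives it with expect, which is enough to show the asynchronous send/receive style without a second node.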

ZeroMQ

The ZeroMQ library extends the classical socket interface, adding features that are usually supplied by specialized messaging middleware. Its sockets provide an abstraction of asynchronous message queues, multiple messaging patterns, message filtering, and so forth.

The zeromq4-haskell package provides Haskell bindings to the ZeroMQ library. The following is an example of ZeroMQ usage that displays the input arriving on a socket ( https://gitlab.com/twittner/zeromq-haskell ).

{-# LANGUAGE OverloadedStrings #-}
import Control.Monad
import System.Exit
import System.IO
import System.Environment
import System.ZMQ4.Monadic
import qualified Data.ByteString.Char8 as CS


main :: IO ()
main = do
    args <- getArgs
    when (length args < 1) $ do
        hPutStrLn stderr "usage: display <address> [<address>, ...]"
        exitFailure
    runZMQ $ do
        sub <- socket Sub
        subscribe sub ""
        mapM_ (connect sub) args
        forever $ do
            receive sub >>= liftIO . CS.putStrLn
            liftIO $ hFlush stdout

The preceding code defines a socket and then prints the data it receives.
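For completeness, here is a hypothetical publisher counterpart to the display example; the endpoint tcp://*:5555 and the message text are arbitrary choices, not part of the original code.

{-# LANGUAGE OverloadedStrings #-}
import Control.Concurrent (threadDelay)
import Control.Monad (forever)
import System.ZMQ4.Monadic

main :: IO ()
main = runZMQ $ do
    -- Bind a PUB socket and emit a message once per second.
    pub <- socket Pub
    bind pub "tcp://*:5555"
    forever $ do
        send pub [] "hello from the publisher"
        liftIO (threadDelay 1000000)

With this publisher running, the subscriber above would be started as ./display tcp://localhost:5555 and would print each message as it arrives.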

The Krapsh Library

This project explores an alternative API to run complex workflows on top of Apache Spark. It is available at https://github.com/krapsh/kraps-haskell .

The Developer’s Perspective

Through this API, complex transforms can be applied on top of Spark, making use of different programming languages. The advantage is that there is no need to communicate with Java objects. Every programming language only needs to implement an interface that does not rely on features specific to Java virtual machines, and that can be implemented using standard REST technologies.

A set of bindings is being developed in the Haskell programming language, and it is used as the reference implementation. Despite Haskell’s limited usage in data science, it is a useful tool for designing strongly principled APIs that work across various programming languages.

The User’s Perspective

Citing the README file from the Krapsh GitHub page, the following are some features of the Krapsh library.

  • Lazy computations. No call to Spark is issued until a result is required. Unlike standard Spark interfaces, even aggregation operations such as collect() or sum() are lazy. This allows Krapsh to perform whole-program analyses of the computation and to make optimizations that are currently beyond the reach of Spark.

  • Strong checks. Thanks to lazy evaluation, a complete data science pipeline can be checked for correctness before it is evaluated. This is useful when composing multiple notebooks or commands. For example, a lot of interesting operations in Spark, such as machine learning algorithms, involve an aggregation step. In Spark, such a step would break the analysis of the program and prevent the Spark analyzer from checking further transforms. Krapsh does not suffer from such limitations and checks the entire pipeline at once.

  • Automatic resource management. Because Krapsh has a global view of the pipeline, it can check when data needs to be cached or uncached. It is able to schedule caching and uncaching operations automatically, and it refuses to run a program that may be incorrect with respect to caching (for example, when uncaching happens before the data is accessed again).

  • Complex nested pipelines. Computations can be arbitrarily nested and composed. It allows the conceptual condensing of complex sequences in operations into a single high-level operation. This is useful for debugging and understanding a pipeline at a high level without being distracted by the implementation details in each step. This follows the same approach as TensorFlow.

  • Stable format and language agnostic. A complex pipeline may be stored as a JSON file in one programming language and read/restored in a different programming language. If this program is run on the same session, that other language can access the cached data .

Haskell vs. Data Science

As a programming language, Haskell offers a few appealing features. In particular, it is one of only a few modern languages that are non-strict, which means an expression can have a value even if some of its subexpressions do not; this keeps the language from requiring everything to be evaluated. The code can be cleaner because functions do not need all of their inputs fully evaluated in order to work.

This is especially valuable when working with values that may be required later: rather than evaluating everything now, they are used only if and when needed (recursive definitions benefit in particular).

While non-strictness specifies what Haskell computes, the how also matters—lazy evaluation. It influences the reduction order by reducing the outermost function application first, which means a function’s arguments are evaluated only if necessary. Although lazy evaluation tends to achieve the same or better time complexity than eager evaluation, it can cause space leaks. In short, lazy evaluation has good time performance, but may lead to space leaks.
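As a minimal illustration of this trade-off (the lazySum and strictSum names and the sample input are ours), the classic example is a lazy left fold versus its strict variant:

import Data.List (foldl')

-- Lazy left fold: the accumulator becomes a growing chain of
-- unevaluated thunks, a classic space leak on large inputs.
lazySum :: [Int] -> Int
lazySum = foldl (+) 0

-- Strict left fold: the accumulator is forced at every step,
-- so the fold runs in constant space.
strictSum :: [Int] -> Int
strictSum = foldl' (+) 0

main :: IO ()
main = print (strictSum [1 .. 1000000])

Both functions compute the same result; the difference is only in how much memory is held onto while the list is consumed.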

Therefore, thanks to Haskell’s non-strictness and lazy evaluation, functions like bind (>>=) can be used even when a value is not yet available; bind decides when to call the function, so control effectively belongs to bind. Moreover, you can label values that result from impure computations and control the evaluation order.

Haskell is used in many areas, including financial services and bioinformatics. Bank of America Merrill Lynch, BAE Systems, Capital IQ, Ericsson AB, Facebook, Google, Glyde, Intel, IVU Traffic Technologies AG, Microsoft, and NVIDIA are on the impressive list of companies that use Haskell.

Haskell has a solid base of financial code that is normally inaccessible to the public, yet it is the subject of a considerable number of blog articles and analyses on how to construct highly efficient, capable streaming systems that interact with Excel.

Haskell has a tendency to be a memory hog; memory leaks can occur without careful attention. This weakens its appeal for large data sets, but such issues are easy to prevent: you need to know where new laziness is created and decide whether it is appropriate. Strict data types and the UNPACK pragma reduce space usage and leaks.

For strict data types, there are two pragmas: Strict and StrictData (a subset of Strict). With Strict, constructor fields and bindings (let, where, case, and function arguments) are made strict. StrictData is almost the same, but it applies only to constructor fields, as if each carried a strictness annotation (i.e., !).

data T = T !Int !Int

In the preceding definition, neither constructor field can lead to a space leak, because both are fully evaluated to Ints when the constructor is called. (For more about the Strict and StrictData pragmas, visit https://wiki.haskell.org/Performance/Data_types or http://blog.johantibell.com/2015/11/the-design-of-strict-haskell-pragma.html ).
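As a small sketch of StrictData (the Point type and main are our own example, not taken from the text):

{-# LANGUAGE StrictData #-}

-- With StrictData enabled, both fields of Point are strict even though
-- they carry no explicit '!' annotation; without the pragma the same
-- effect would require: data Point = Point !Int !Int
data Point = Point Int Int
  deriving Show

main :: IO ()
main = print (Point 3 4)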

The UNPACK pragma tells the compiler that the content of a constructor field should be unpacked into the constructor itself, removing one level of indirection. The following example is from the official documentation ( https://downloads.haskell.org/~ghc/8.0.2/docs/html/users_guide/glasgow_exts.html#unpack-pragma ). It generates a constructor, T, containing two unboxed floats.

data T = T {-# UNPACK #-} !Float
           {-# UNPACK #-} !Float

Note that this approach can fail to be an optimization if used incorrectly: for example, when the T constructor is scrutinized and the floats are passed to a non-strict function, they need to be boxed again (automatically)—a process that requires additional computation. To have an effect, the UNPACK pragma should be used together with the -O optimization flag; without optimization, the pragma is ignored and no unboxing takes place.

Haskell provides a very useful library called vector, which is used very often in data science. Other useful libraries include ad, linear, vector-space, statistics, compensated, and log-domain. They serve different purposes; we will let you explore them further.
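To give a flavor of the vector library, the following is a small sketch (the mean function and the sample numbers are ours):

import qualified Data.Vector.Unboxed as U

-- Mean of a column of numbers stored in an unboxed vector.
mean :: U.Vector Double -> Double
mean xs
  | U.null xs = 0
  | otherwise = U.sum xs / fromIntegral (U.length xs)

main :: IO ()
main = print (mean (U.fromList [1.0, 2.0, 3.0, 4.0]))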

Haskell’s best dense matrix library, hmatrix, is good, but it falls under the General Public License (GPL). A disadvantage of these libraries is that they do not work together as smoothly as the vector library. The Repa library is another good one; it is optimized for images and parallel matrix operations, such as the Discrete Fourier Transform (DFT).

If work needs to be offloaded to a graphics processing unit (GPU), this is very easy using the algorithms provided by the Accelerate library.

The following are the main reasons why Haskell works well for data science.

  • Type system doubles as a design language, crystallizes thoughts

  • Catches errors early, refactors aggressively (in comparison to Ruby or Python)

  • Purity of functions in Haskell is a huge win for long-term solutions/applications

  • Stays at a very high level, yet still gets solid performance

  • QuickCheck is very good, testing is better

  • Simple multicore concurrency

  • Promising future for parallel algorithms

Debugging Techniques

The debugging techniques for big data are the same as classical Haskell debugging techniques. For example, Microsoft Azure provides support for Haskell. Additional information about the Microsoft Azure cloud platform is at https://docs.microsoft.com/en-us/azure .

In this section, we present a simple example of how to use Azure with Haskell. It is from Phil Freeman’s short blog tutorial “Haskell on Azure” ( http://blog.functorial.com/posts/2012-04-29-Haskell-On-Azure.html ). It uses the tablestorage package.

Microsoft Azure supplies storage services (table, queue, and blob storage) exposed through REST APIs. In the following, the Happstack web server and blaze-html are combined to build a simple note-taking web application. It would be easy to adapt the code to a different web server.

module Main where

import Data.Maybe
import Control.Monad
import Control.Monad.Trans
import System.Time
import System.Directory
import Happstack.Server
import qualified Text.Blaze.Html4.Strict as H
import qualified Text.Blaze.Html4.Strict.Attributes as A
import Network.TableStorage
import Text.Blaze ((!), toValue)

The next step defines the account that will be used. It contains a service endpoint, a user name, and the secret key.

This example uses the development account, which means the Windows Azure storage emulator must be installed and running on the machine where the example runs.

account :: Account
account = developmentAccount

When deploying to staging or production, developmentAccount can be replaced with a call to defaultAccount, with the account information taken from the Azure management portal.

The application is very simple. It has three parts: there will be many notebooks, it will be possible to view the latest notes in a notebook, and it will be possible to add a new note to a specific notebook. The notes can be partitioned by notebook if necessary; in the table storage model, this means the notebook in use determines the note entities’ partition key.

We need to be able to generate a new key. We will sort the note IDs by insertion date so that the most recent notes are displayed first. We will generate a key based on the current time, subtracting the seconds component from a large value to reverse the order of the generated keys.

newId :: IO String
newId = do
  (TOD seconds picos) <- getClockTime
  return $ show (9999999999 - seconds) ++ show picos

With the preceding piece of code, now we can implement the method through which a new note is added. This will be a POST method that receives the arguments as form-encoded data in the request body.

After the key is generated, the insertEntity IO action adds the newly produced note entity to the notes table. The note’s elements are the author and text properties, which are string columns.

If the insertion completes successfully, a redirect-and-get is performed to display the newly added note; otherwise, a 500 Internal Server Error message is returned.

postNote :: ServerPartT IO Response
postNote = do
  methodM POST
  tmp <- liftIO getTemporaryDirectory
  decodeBody $ defaultBodyPolicy tmp 0 1000 1000
  text <- look "text"
  author <- look "author"
  partition <- look "partition"
  result <- liftIO $ do
    id <- newId
    let entity = Entity { entityKey = EntityKey { ekPartitionKey = partition,
                                                  ekRowKey = id },
                          entityColumns = [ ("text", EdmString $ Just text),
                                            ("author", EdmString $ Just author)] }
    insertEntity account "notes" entity
  case result of
    Left err -> internalServerError $ toResponse err
    Right _ -> seeOther ("?partition=" ++ partition) $ toResponse ()

When a notebook is chosen, the last 10 notes are displayed. This is done using the queryEntities function, which retrieves the notes from the table and filters them by the partition key.

getNotes :: ServerPartT IO Response
getNotes = do
  methodM GET
  partition <- look "partition" `mplus` return "default"
  let query = defaultEntityQuery { eqPageSize = Just 10,
                                   eqFilter = Just $ CompareString "PartitionKey" Equal partition }
  result <- liftIO $ queryEntities account "notes" query
  case result of
    Left err -> internalServerError $ toResponse err
    Right notes -> ok $ setHeader "Content-Type" "text/html" $ toResponse $ root partition notes

The preceding implementation uses a very simple query, but the tablestorage package allows more complicated queries, including multiple filters at once.

If the query completes successfully, an HTML page created with blaze-html is returned; otherwise, a 500 error code is returned. The root function builds this page.

root :: String -> [Entity] -> H.Html
root partition notes = H.html $ do
  H.head $
    H.title $ H.toHtml "Notes"
  H.body $ do
    H.h1 $ H.toHtml "Add Note"
    H.form ! A.method (toValue "POST") $ do
      H.input ! A.type_ (toValue "hidden") ! A.name (toValue "partition") ! A.value (toValue partition)
      H.div $ do
        H.label ! A.for (toValue "text") $ H.toHtml "Text: "
        H.input ! A.type_ (toValue "text") ! A.name (toValue "text")
      H.div $ do
        H.label ! A.for (toValue "author") $ H.toHtml "Author: "
        H.input ! A.type_ (toValue "text") ! A.name (toValue "author")
      H.div $
        H.input ! A.type_ (toValue "submit") ! A.value (toValue "Add Note")
    H.h1 $ H.toHtml "Recent Notes"
    H.ul $ void $ mapM displayNote notes

The page has a form where the user can submit a new note, with input fields for the text and author arguments. The partition key is included as a hidden field, but the user can modify it by changing the query string.

Next, the latest notes from the current notebook (partition key) are listed. These recent notes are rendered with the help of the displayNote function.

displayNote :: Entity -> H.Html
displayNote note = fromMaybe (return ()) $ do
  text <- edmString "text" note
  author <- edmString "author" note
  return $ H.li $ do
    H.toHtml "'"
    H.toHtml text
    H.toHtml "'"
    H.i $ do
      H.toHtml " says "
      H.toHtml author

The edmString helper function is used for extracting a column whose type is string from the available columns retrieved through the query.

This Maybe-valued function returns Just a string when the named column contains string values, and Nothing when the column is missing or is not of string type. Functions for other column types can be found in the tablestorage package.

We use two routes—getNotes and postNote—to create the web application.

routes :: [ServerPartT IO Response]
routes = [ getNotes, postNote ]

The main function creates the notes table if it does not already exist, using the createTableIfNecessary function.

main :: IO ()
main = do
  result <- createTableIfNecessary account "notes"
  case result of
    Left err -> putStrLn err
    Right _ -> simpleHTTP nullConf $ msum routes

In order to deploy the application, the cspack and csrun command-line utilities are used.

Some basic configuration files are needed. The web server will run as a worker role, which needs to be set up in the ServiceDefinition.csdef file. We need to add a worker role definition, as in the following.

-- <ServiceDefinition name="Notes" xmlns="http://schemas.microsoft.com/ServiceHosting/2008/10/ServiceDefinition">
--   <WorkerRole name="WebServer" vmsize="Small">
--     <Runtime>
--       <EntryPoint>
--         <ProgramEntryPoint commandLine="Main.exe" setReadyOnProcessStart="true" />
--       </EntryPoint>
--     </Runtime>
--     <Endpoints>
--       <InputEndpoint name="happstackEndpoint" protocol="tcp" port="80" />
--     </Endpoints>
--   </WorkerRole>
-- </ServiceDefinition>

The ServiceConfiguration.*.cfg files should contain a role section with the same worker role name.

-- <Role name="WebServer">
--   <ConfigurationSettings />
--   <Instances count="1" />
-- </Role>

After building the application, an executable Main.exe is obtained, which needs to be placed in the WebServer role subdirectory. The command used to build the deployment package is:

-- cspack /copyOnly ServiceDefinition.csdef

The package can then be deployed locally using

-- csrun Notes.csx ServiceConfiguration.Local.cscfg

At this step, the available operations are adding a note, displaying the content of a note, and moving to another notebook by simply editing the query string.

To deploy the application to Azure, a cspkg file needs to be built using the following command (note that the development account credentials in the code need to be replaced with credentials from the management portal):

-- cspack ServiceDefinition.csdef

Next, we present some debugging methods that can be applied to big data.

Stack Trace

Recent versions of GHC can dump a stack trace (of cost centres) when an error is raised. To enable this, compile with -prof and run with +RTS -xc. (Since only the cost-centre stack is printed, you may want to add -fprof-auto -fprof-cafs to the compilation step to include all definitions in the trace.) Since GHC 7.8, the errorWithStackTrace function can be used to dump the stack trace programmatically.

The following is a simple example.

  crash = sum [1,2,3,undefined,5,6,7,8]
  main  = print crash


 > ghc-7.6.3 test.hs -prof -fprof-auto -fprof-cafs && ./test +RTS -xc
 *** Exception (reporting due to +RTS -xc): (THUNK_2_0), stack trace:
   GHC.Err.CAF
   --> evaluated by: Main.crash,
   called from Main.CAF:crash_reH
 test: Prelude.undefined

A CAF (constant applicative form) is a supercombinator that is not a lambda abstraction. Constant expressions and partial applications are included. Since it is a supercombinator, a CAF has no free variables; and since it is not a lambda abstraction, it has no bound variables either. It can, however, contain identifiers that refer to other CAFs.

y = x 3 where x = (*) 2   -- a top-level CAF: a constant expression with no arguments

A CAF can be placed at the top level of a program. It can be compiled either to a piece of graph that is shared by all uses, or to shared code that overwrites itself with a graph the first time it is evaluated. A CAF such as the following can grow without bound, but may be accessible only from within the code of one or more functions. In order for garbage collectors to reclaim such structures, each function is associated with a list of the CAFs it references; when a function is garbage collected, its associated CAFs can be collected as well.

ints = from 1 where from n = n : from (n+1)

The following is an example of errorWithStackTrace.

import GHC.Stack

main = putStrLn $! (show $! fact  7)

fact :: Int -> Int
fact 0 = 1
fact 1 = 1
fact 3 = errorWithStackTrace "error"
fact n | n > 0 = n * fact (n-1)
  | otherwise = errorWithStackTrace "wrong"


∼/D/r/testTraceing $ ghc -prof -fprof-auto main.hs
[1 of 1] Compiling Main             ( main.hs, main.o )
Linking main ...
∼/D/r/testTraceing $ ./main
main: error
Stack trace:
  Main.fact (main.hs:(6,1)-(10,74))
  Main.main (main.hs:3:1-36)
  Main.CAF (<entire-module>)

Printf and Friends

Printf debugging is a technique in which the flow of execution is traced and targeted values are printed. The easiest way to print a message on the screen is to use Debug.Trace.trace.

trace :: String -> a -> a

According to its documentation, trace outputs the string given as its first argument and then returns its second argument as its result.

The following is the usual context in which trace is used.

myfun a b | trace ("myfun " ++ show a ++ " " ++ show b) False = undefined
myfun a b = ...

A benefit is that enabling and disabling the trace requires commenting out only a single line.

Remember that because of lazy evaluation, a trace only prints if the value it wraps is ever demanded.
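Here is a minimal, runnable version of the pattern above (the arithmetic in myfun is our own placeholder):

import Debug.Trace (trace)

-- The guard prints the arguments and always fails, so the second
-- equation does the real work; the trace goes to stderr.
myfun :: Int -> Int -> Int
myfun a b | trace ("myfun " ++ show a ++ " " ++ show b) False = undefined
myfun a b = a + b

main :: IO ()
main = print (myfun 2 3)   -- prints "myfun 2 3", then 5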

The trace function lives in the base package. The htrace package defines a trace function like the one in base, but adds indentation for better visual effect. Other tools can be found in the debug category on Hackage.

A more capable option is Hood, which works well with the current GHC distribution; Hugs has it already integrated. Simply import Observe and start inserting observations in the code. Note that because Hugs is no longer under development, this is only for readers who want to experiment with or explore the system; for example:

import Hugs.Observe
f'  = observe "Informative name for f" f
f x = if odd x then x*2 else 0
And then in hugs:
Main> map f' [1..5]
[2,0,6,0,10]


>>>>>>> Observations <<<<<<

Informative name for f
  { 5  -> 10
  , 4  -> 0
  , 3  -> 6
  , 2  -> 0
  , 1  -> 2
  }

The preceding code produces a report of all calls of f and the result of every call. The GHood library adds graphical support to Hood.

The Safe Library

The safe library provides safer alternatives to the Prelude functions that can crash. If you get an error message (for example, a pattern match failure in head []), you can use headNote "extra information" to get a more detailed error message for that specific call to head. The safe library also has functions that take default values or wrap their results in Maybe as needed.
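The following is a small sketch of how these functions might be used, assuming the Safe module from the safe package (firstOrZero and the note text are ours):

import Safe (headMay, headNote)

-- headMay returns Nothing on an empty list instead of crashing.
firstOrZero :: [Int] -> Int
firstOrZero xs = maybe 0 id (headMay xs)

main :: IO ()
main = do
  print (firstOrZero [])                    -- prints 0, no crash
  print (headNote "empty input list" [1])   -- prints 1; on an empty list it
                                            -- would fail with this message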

Offline Analysis of Traces

The most advanced debugging methods are based on offline analysis of traces. The following tools come from academic research, and some of them work only with older versions of Haskell.

Haskell Tracer HAT

Hat is perhaps the most advanced source-level tracer. It is a large suite of tools.

The drawback of traditional Haskell tracers is that they either transform the entire program or require a specialized run-time system. They are not always compatible with the most recent libraries, but you can still use them quite often.

Hoed: The Lightweight Haskell Tracer and Debugger

Hoed is a tracer and debugger that borrows many techniques from Hat. Because it works with untransformed libraries, it can be used to debug more programs than traditional tracers.

To localize a failure, you annotate the suspected function and then compile as usual. When the program runs, it records data about the annotated function. The last step is connecting to a debugging session in a web browser.

Dynamic Breakpoints in GHCi

The GHCi debugger enables dynamic breakpoints and the observation of intermediate values.

The breakpoints can be established directly in the code from the GHCi command prompt, as shown in the following example.
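The transcript below assumes a module qsort2.hs roughly like the following hypothetical reconstruction (chosen so that the bindings x, xs, left, and right are in scope at the breakpoint; exact line and column numbers may differ):

qsort [] = []
qsort (x:xs) = qsort left ++ [x] ++ qsort right
  where (left, right) = (filter (< x) xs, filter (>= x) xs)

main :: IO ()
main = print (qsort [10, 9 .. 1])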

*main:Main> :break Main 2
Breakpoint set at (2,15)
*main:Main> qsort [10,9..1]

Local bindings in scope.

  x :: a, xs :: [a], left :: [a], right :: [a]
qsort2.hs:2:15-46> :sprint x
x = _
qsort2.hs:2:15-46> x

At this point, x is an unevaluated computation whose concrete type is not yet known. seq is used to force its evaluation, and :print (abbreviated :p) recovers its value and type.

qsort2.hs:2:15-46> seq x ()
()
qsort2.hs:2:15-46> :p x
x = 10

When a breakpoint is reached, the bindings in scope can be explored and expressions can be evaluated, just as at the normal GHCi prompt. Laziness can be explored with the :print instruction. A comprehensive description of how to use breakpoints in Haskell is at https://downloads.haskell.org/~ghc/7.4.1/docs/html/users_guide/ghci-debugger.html .

Source-Located Errors

The LocH library provides wrappers over assert to generate source-located errors. The following example locates a fromJust failure.

import Debug.Trace.Location
import qualified Data.Map as M
import Data.Maybe
main = do print f
f = let m = M.fromList
                [(1,"1")
                ,(2,"2")
                ,(3,"3")]
        s = M.lookup 4 m
    in fromJustSafe assert s
fromJustSafe a s = check a (fromJust s)
This will result in:
$ ./a.out
a.out: A.hs:12:20-25: Maybe.fromJust: Nothing

This can be done automatically using the LocH preprocessor. For example, consider a program that fails as follows…

$ ghc A.hs --make -no-recomp
[1 of 1] Compiling Main             ( A.hs, A.o )
Linking A ...
$ ./A
A: Maybe.fromJust: Nothing

This can be converted into a source-located error by adding the following import line.

import Debug.Trace.Location

The last step is to recompile.

$ ghc A.hs --make -pgmF loch -F -no-recomp
[1 of 1] Compiling Main             ( A.hs, A.o )
Linking A ...
$ ./A
A: A.hs:14:14-19: Maybe.fromJust: Nothing

Other Tricks

When compiled with GHC as described earlier, a program shows a stack trace in the console when an error condition occurs.

Locating a Failure in a Library Function

The easiest way to locate a pattern-match run-time error raised from library functions such as head, tail, or fromJust is to avoid those functions; as an alternative, use explicit matching.

The following is an example.

g x = h $ fromJust $ f x

The references to the g, f, and h functions are lost: when f returns Nothing, the error message does not say where the fromJust failed. Instead, consider the following.

g x = let Just y = f x in h y

GHC displays

Main: M1.hs:9:11-22:
Irrefutable pattern failed for pattern Data.Maybe.Just y

This indicates the exact source of the error.
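A hypothetical, self-contained reconstruction of this example is shown below; the concrete f, h, and the map contents are ours.

import qualified Data.Map as M

-- A lookup that can fail: 4 is not in the map.
f :: Int -> Maybe Int
f x = M.lookup x (M.fromList [(1, 10), (2, 20)])

h :: Int -> Int
h y = y * 2

-- Explicit matching ties the failure to this definition, so GHC's
-- "Irrefutable pattern failed" message points at this source location.
g :: Int -> Int
g x = let Just y = f x in h y

main :: IO ()
main = print (g 4)   -- f 4 is Nothing, so the pattern match fails here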

Mysterious Parse Errors

GHC supplies -ferror-spans to indicate the beginning and the end of a wrong expression (for example, x:xs->x instead of (x:xs)->x).

Infinite Loops

To debug an infinite loop, suppose a function loop is called with one parameter; a minimal example is sketched after the steps below.

  1. Activate the -fbreak-on-error (:set -fbreak-on-error in GHCi).

  2. Call the statement with :trace (:trace loop 'a').

  3. Press Ctrl+C while the program is stuck in the loop, so that the debugger interrupts it.

  4. Use the :history and :back commands to locate the loop.
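The following is a minimal sketch to try these steps on; the loop function is our own example, and the GHCi commands appear as comments.

-- A function that never terminates, no matter the argument.
loop :: Char -> Int
loop c = 1 + loop c

{- An illustrative GHCi session:
     :set -fbreak-on-error
     :trace loop 'a'
     -- press Ctrl+C once it hangs; the debugger stops inside loop
     :history
     :back
-}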

Summary

In this chapter, you saw

  • what big data means and the phases through which it passes to obtain relevant information.

  • how Haskell can be integrated with big data and the tools that do this.

  • techniques for debugging.
