© Stefania Loredana Nita and Marius Mihailescu 2017

Stefania Loredana Nita and Marius Mihailescu, Practical Concurrent Haskell, https://doi.org/10.1007/978-1-4842-2781-7_14

14. Interactive Debugger for Development and Portability Applications Based on Big Data

Stefania Loredana Nita and Marius Mihailescu1

(1)Bucharest, Romania

In computer programming and engineering, debugging is a multistep process in which a problem is identified, its source is isolated, and the fault is corrected. In the last step of debugging, the programmer tests the modification to make sure that it works as expected.

In software development, debugging means locating and correcting code errors in a computer program. It is also part of the testing process and an integrated part of the whole software development life cycle. The programmer can begin to debug as soon as she writes code, and debugging continues in gradual stages as the software product is developed. If the program is complex, the code can be debugged more easily through unit testing (in the first part of debugging), code reviews, pair programming, or other types of testing.

When an unexpected behavior is discovered, the programmer has to identify the code that generated the error. In this step, it can be useful to inspect the logs produced by the code and to use a debugger tool or the default debugging component of the IDE.

A common approach is to set a breakpoint in the code that is suspected to generate the error, and to run the code line by line in debugging mode. The debugger of an IDE usually lets the programmer examine the memory and variables, running the program to the next breakpoint or to the next line. Some debuggers provide the capability to change a variable’s values while the program is in debugging mode, or even to change a line of code.

As a fun fact, the term debugging is actually named after a moth. In 1947, while Grace Hopper was working on the Mark II computer at Harvard University, her team found a moth stuck in a relay, so she said that she and her team were "debugging" the system.

As you saw in previous chapters, big data applications can be integrated with Haskell, so for the moment it is not necessary to create a dedicated debugger for big data applications written in Haskell. In this chapter, we present the debugger incorporated into GHCi, proposed and implemented by Simon Marlow et al. in “A Lightweight Interactive Debugger for Haskell” ( https://pdfs.semanticscholar.org/f718/caebfb4b70212b8553ae0d865a9e6702f041.pdf ).

Approaches to Run-Time Type Reconstruction

Type reconstruction means that the data type of an expression is determined automatically; it is also known as type inference. For debuggers of functional languages like Haskell, there are two fundamental ways to do run-time type reconstruction.

  • Backward traversal of the call stack. All polymorphically typed values are created in calls to polymorphic functions, but at the call site, the arguments are actually monomorphic. Let’s take an example: c = map ord "abc". Inside the definition of map, there is no information about the type of the elements. When map is called in the body of c, the element type is determined, so an element x has type Char. This approach works in the general case, but the debugger needs access to the function call stack and to the chain of bindings for every variable in scope. The type of a function can also inform the types of its arguments: given f :: (Int -> Int) -> Int, in the call f (\x -> x), the variable x must have type Int.

  • Decoding the types by inspecting heap representations. In this approach, information such as specific type tags may have to be added to the compiled code, which makes it less portable than the first approach.
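The call-site idea can be made concrete with a minimal sketch: map by itself knows nothing about its element type, but at this particular call site the elements are known to be Chars, which is exactly the information a stack-walking debugger would recover.

```haskell
import Data.Char (ord)

-- map is polymorphic: nothing in its definition fixes the element type.
mapped :: [Int]
mapped = map ord "abc"   -- at this call site, the elements are Chars

main :: IO ()
main = print mapped      -- [97,98,99]
```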

Run-Time Type Inference

To evaluate arbitrary expressions, run-time type inference can be invoked at any point, initiated by the programmer when needed, on a given run-time term t. Run-time type inference occurs in two phases.

First, a type T for t is deduced, treating the term as a structure that contains only constructor applications.

[Figure: grammar of run-time terms — a term is either a constructor application C t1 … tn or an unevaluated expression □]

In this definition, □ marks expressions and functions that have not been evaluated in the run-time term.

Before continuing, note that in computer science, unification is an algorithm that solves equations between symbolic expressions. In the second phase, T and T' are unified, where T' is the type of t determined at compile time. Note that T' may contain type variables. A refined type for t is obtained by applying the substitution produced by unifying T and T' to T'. The same substitution is applied to the types of the other run-time terms. If type variables still exist after the two phases, they correspond to run-time types that remain unknown.

There is a difference between run-time type inference and compile-time inference. In standard type inference, it is acceptable to deduce a type more specific than the principal type: the program will still work correctly, although a type check may fail that would otherwise succeed. At run-time, it is usually not safe to assume a type that is not the most general. In run-time type inference, the type variables need to be read existentially rather than universally, because each of them stands for one specific, if unknown, type.

The debugger proposed by Simon Marlow et al. combines these two kinds of type inference to obtain as much type information as possible while preserving safety.

A difficult case is when the user creates their own type with newtype, because such types cannot be recovered at run-time. Let’s take the following example: newtype T = T Int. A variable of type T cannot be distinguished from a variable of type Int. This problem arises because GHC represents both types in the same way under the hood, and no type tags are kept at run-time (in contrast to the JVM or .NET run-times).
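The shared representation can be made visible with unsafeCoerce. This is only a demonstration sketch, not something the debugger itself does:

```haskell
import Unsafe.Coerce (unsafeCoerce)

newtype T = T Int

-- T and Int have identical heap representations, so this coercion
-- succeeds; it is exactly this indistinguishability that prevents the
-- debugger from recovering the newtype at run-time.
asInt :: T -> Int
asInt = unsafeCoerce

main :: IO ()
main = print (asInt (T 5))  -- 5
```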

The debugger that Simon Marlow et al. created improves on this by constructing a source term that corresponds to the run-time term, which is displayed to the user. The source term may have gaps, in the sense that some expressions have not been evaluated; the user does not see raw gaps. Instead, each gap is bound to a fresh variable name, which is useful because it can be used in further expressions.

As an example, the following term is partially evaluated at run-time.

t = Just □ : (Just (1 : □) : □)

In Figure 14-1, the term t is shown in the heap; the dark rectangles mark expressions that are not evaluated. Let’s assume that the compile-time type of t is the partial type [a].

Figure 14-1. Heap representation for the example

In run-time type inference, a constraint is generated for every closure: by convention, the data constructor’s type is placed on the right-hand side, and the left-hand side is built from the types of the subterms found in the heap. The constructors have the following signatures.

(:) :: a -> [a] -> [a]
[] :: [a]
Just :: a -> Maybe a

All subterms are visited, generating the following constraints.

t = [a1]
t1 -> t2 -> t = a2 -> [a2] -> [a2]
t3 -> t1 = a3 -> Maybe a3
t4 -> t5 -> t2 = a4 -> [a4] -> [a4]
t6 -> t4 = a5 -> Maybe a5
t7 -> t8 -> t6 = a6 -> [a6] -> [a6]
t7 = Int

The first equation comes from the type information available at compile time. The equations are solved using a classical unification technique, and the result is a substitution that includes the types of every closure, in particular t :: [Maybe [Int]]. This type is unified with the compile-time type t :: [a], and the final result is the substitution a -> Maybe [Int], which is applied in the run-time environment to refine the types of everything inside t.
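The unification step can be sketched with a toy unifier (an illustration only, not GHC's actual implementation, and without an occurs check): unifying the reconstructed type [Maybe [Int]] with the static type [a] yields the substitution a -> Maybe [Int].

```haskell
import Control.Monad (foldM)
import qualified Data.Map as M

-- A tiny first-order type language: variables and constructor applications.
data Ty = TVar String | TCon String [Ty] deriving (Eq, Show)

type Subst = M.Map String Ty

-- Apply a substitution to a type.
apply :: Subst -> Ty -> Ty
apply s (TVar v)    = maybe (TVar v) (apply s) (M.lookup v s)
apply s (TCon c ts) = TCon c (map (apply s) ts)

-- Syntactic unification (no occurs check in this sketch).
unify :: Subst -> Ty -> Ty -> Maybe Subst
unify s a b = case (apply s a, apply s b) of
  (t1, t2) | t1 == t2 -> Just s
  (TVar v, t)         -> Just (M.insert v t s)
  (t, TVar v)         -> Just (M.insert v t s)
  (TCon c ts, TCon d us)
    | c == d && length ts == length us ->
        foldM (\acc (x, y) -> unify acc x y) s (zip ts us)
  _ -> Nothing

list, maybeT :: Ty -> Ty
list t   = TCon "[]" [t]
maybeT t = TCon "Maybe" [t]

int :: Ty
int = TCon "Int" []

main :: IO ()
main = do
  -- static type of t is [a]; reconstructed run-time type is [Maybe [Int]]
  let Just s = unify M.empty (list (TVar "a")) (list (maybeT (list int)))
  print (apply s (TVar "a"))  -- TCon "Maybe" [TCon "[]" [TCon "Int" []]]
```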

RTTI and New Types

RTTI is the abbreviation for run-time type information. After the compiler checks the types, it erases the newtypes. Newtype constructors are not represented in the heap, but they still show up on the right-hand side of constraints and in compile-time type signatures. In this case, additional implicit equations are needed to solve the constraints. When a new type is declared as newtype Set a = Set [a], it introduces the equation Set a = [a].

Reconsider the previous example with a minor modification: now t :: Set a is the compile-time type information for t. The type constraints are again obtained by inspecting the heap; the term t has the same heap representation.

t = Set a1
t1 -> t2 -> t = a2 -> [a2] -> [a2]
t3 -> t1 = a3 -> Maybe a3
t4 -> t5 -> t2 = a4 -> [a4] -> [a4]
t6 -> t4 = a5 -> Maybe a5
t7 -> t8 -> t6 = a6 -> [a6] -> [a6]
t7 = Int

In this case, unification fails because one equation gives t the type Set a1 and another gives it [a1], even though the two are isomorphic.

Newtype equations should be applied only where needed, so that static and run-time types unify successfully with as few applications as possible. The algorithm in the proposed debugger works as follows: it attempts to unify the constraints generated by inference; where unification fails, it applies newtype equivalences to the offending terms and retries. This approach is a little ad hoc, but the heuristic works well on typical examples.

Termination and Efficiency

The number of constraints generated by the RTTI algorithm is proportional to the number of closures processed; since this number is finite, unification over a set of closures terminates. However, the constraints are generated at the same time as the unification, and the generation process may fail to terminate when applied to cyclic data structures. A solution is to keep a log of the nodes already visited, which restores termination.
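The visited-node log can be illustrated with a small sketch (hypothetical integer node ids standing in for heap pointers): without the seen set, the traversal below would loop forever on the cycle.

```haskell
import qualified Data.Map as M
import qualified Data.Set as S

-- A cyclic "heap": node ids mapped to their successor ids.
heap :: M.Map Int [Int]
heap = M.fromList [(1, [2]), (2, [3]), (3, [1])]  -- 1 -> 2 -> 3 -> 1

-- Visit every node once, logging visited ids to guarantee termination.
visit :: S.Set Int -> Int -> S.Set Int
visit seen n
  | n `S.member` seen = seen
  | otherwise = foldl visit (S.insert n seen) (M.findWithDefault [] n heap)

main :: IO ()
main = print (S.toList (visit S.empty 1))  -- [1,2,3]
```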

For this debugger, the creators discuss two improvements, both based on stopping as soon as a complete reconstructed type is available.

The constraints are solved in a breadth-first manner, and unification proceeds in a series of stages. When a complete type is found, the process stops and returns it. If this solution seems strange, remember that this is not an ordinary type inference problem: here we only care about the top-level term, whereas many of the generated constraints are used for typing subterms. This approach also improves termination on cyclic data structures, with the exception of cyclic structures whose spine is fully unrolled but whose contents are suspended.

The second approach targets situations in which the type of every subterm needs to be recovered, and is based on a depth-first walk through all the subterms of the tree. When a fully monomorphic type is retrieved, it is propagated through the tree and unification is replaced by matching. This process needs care, because type variables may appear deep in the tree.

Practical Concerns

In the previous sections, you saw how a partial type is reconstructed. In practice, there are some issues. One of them is inspecting the structure of a run-time term that is not fully evaluated in a Haskell program and obtaining the type signatures of the constructors used in the term.

The run-time system supplies an operation called unpackClosure#, whose purpose is to inspect closures. Essentially, a closure has two components: an info pointer and a payload.

  • The info pointer points to an info table, which describes the closure’s layout and the code that evaluates it.

  • The payload stores the fields of the closure.

The data constructor corresponding to a closure is determined by looking at its info table, from which the type signature must then be retrieved. To support this, the info table is extended with a special field containing the fully qualified name of the constructor. Because this name is unique within a program, the information about the data constructor can easily be retrieved from GHC’s internal data structures.
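For comparison, ordinary Haskell code can observe constructor names at run-time through the Data class. This is a different mechanism from the debugger's extended info tables, shown here only as an analogy:

```haskell
{-# LANGUAGE DeriveDataTypeable #-}
import Data.Data (Data, toConstr, showConstr)

data Color = Red | Green | Blue deriving (Show, Data)

-- toConstr recovers a run-time description of the constructor used to
-- build a value; showConstr renders its name.
main :: IO ()
main = putStrLn (showConstr (toConstr Green))  -- Green
```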

The authors made this change to GHC’s compilation technique to enable debugging. The space overhead is small because a program contains only a limited number of data constructors.

In some situations, the fields from a heap data structure do not match the data constructors in the source code.

  • Additional type-class dictionaries could be stored in the constructor due to existential quantification (called existential dictionaries).

  • Strict fields may be unpacked. For example, a strict field of type (a, b) may be represented as two fields of types a and b instead of as a single field.

In GHC, every constructor carries a record containing the types of its fields as written in the source code. When types are reconstructed, this record is used to match fields to their types.
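The unpacked-field situation can be reproduced directly (a sketch; the exact heap layout depends on optimization flags): the source constructor below has one field, but with UNPACK the heap object stores the pair's two components.

```haskell
-- With optimization, Wrap's payload holds the pair's two components
-- directly rather than a pointer to a (,) cell, so the heap object has
-- more fields than the source constructor declares.
data Wrap = Wrap {-# UNPACK #-} !(Int, Int)

total :: Wrap -> Int
total (Wrap (a, b)) = a + b

main :: IO ()
main = print (total (Wrap (3, 4)))  -- 7
```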

Implementation in Haskell

The debugger should consume as few resources as possible, and it needs to be integrated with GHCi, for two reasons.

  • The mapping between the source-code representation of a data constructor and its heap representation can be complicated to recover while the heap is traversed. To know how values are represented at run-time, it is necessary to know how GHC derives that representation.

  • Fully interactive Haskell evaluation is necessary while debugging.

Accessibility is a key characteristic of the debugger: it needs to work with everything and to be available all the time. The profiling libraries provide cost-center stacks, which are very useful to a debugger.

The implementation of the debugger does not depend on a particular user interface: it is exposed through the GHC API. Compilation and dynamic evaluation in GHC are driven through the programmatic interface of the GHC API. The GHCi interface is text based and is built on top of the GHC API. The GHC API is useful for interoperability in production systems, in which different programming languages may be combined. This is the case with big data, where different tools are used at every stage that data passes through before becoming relevant knowledge. If Haskell is used in one of these stages, the GHC API can be called to build the Haskell code.

A light version of the API for debugging is shown here.

runStmt :: Session -> String -> IO RunResult
resume  :: Session -> IO RunResult

data RunResult
  = RunOk
  | RunFailed
  | RunException Exception
  | RunBreak BreakInfo

getResumeContext :: Session -> IO [Resume]

data Resume
resumeStmt      :: Resume -> String
resumeBreakInfo :: Resume -> BreakInfo

abandon :: Session -> IO ()

An interactive statement is started when the client of the GHC API calls runStmt. The result returned by runStmt may be RunBreak, which means that a breakpoint stopped the execution. Next, the client can call getResumeContext (used by the :show context command) to find where the breakpoint is. The getResumeContext function returns a list of Resume values, because several breakpoints can be active at once: while stopped at a breakpoint, the program can run a statement that hits another breakpoint, and so on.

For every Resume, information about the breakpoint can be requested through resumeBreakInfo, which returns a value of type BreakInfo, storing the module and the location of the source code containing the breakpoint.

An execution can be resumed with the resume command or abandoned with the abandon command, which exits the execution. There are more options for breakpoints: list, enable/disable, single-step (:step command), trace (:trace command), and history (:history, :back, :forward).

A common challenge for every debugger is relating compiled code to source code: intuitively, a mapping between the original source code and the compiled code must be maintained through every stage of compilation. The good news is that this problem is already solved by the Haskell Program Coverage tool ( https://wiki.haskell.org/Haskell_program_coverage ). To determine coverage, you need to find out, for every targeted place in the original source code, whether the expression there was evaluated at run-time. When such an expression is evaluated, this fact is recorded, as a side effect, in a table containing the coverage information for the current run. To gather coverage information about an expression E in the source code, E is replaced by a tick with some parameters. At this stage, a tick is just an annotation, and the corresponding source code is easy to find, based on a mapping from ticks to source locations.

Breakpoints are similar to coverage ticks. A tick technique is used for annotating the program with breakpoint sites. Still, there are differences between ticks used for breakpoint and those used for coverage.

  • The way in which locations of ticks are discovered is different.

  • Breakpoint sites are annotated with their set of free variables; this information is not needed for coverage.

When a breakpoint is found while evaluating, the debugger should do the following.

  • The interpreter verifies whether the breakpoint site is enabled by :break or whether execution is in single-step mode. If neither is the case, execution continues normally.

  • Otherwise, GHCi takes control, but it allows the computation to be resumed later, and the values of the free variables can be accessed.

Remember that runStmt starts executing a new statement, which fails (RunFailed), completes successfully (RunOk), throws an exception (RunException), or hits a breakpoint (RunBreak). resume resumes the most recent computation stopped at a breakpoint. The following is the implementation of runStmt.

runStmt stmt = do
  status_mvar <- newEmptyMVar
  break_mvar  <- newEmptyMVar
  let on_break info = do
        putMVar status_mvar (Break info)
        takeMVar break_mvar
  forkIO $ withBreakAction on_break $ do
    result <- try stmt
    putMVar status_mvar (Complete result)
  result <- takeMVar status_mvar
  case result of
    Complete (Left ex) -> return (RunException ex)
    Complete (Right r) -> return RunOk
    Break info -> do
      setResume session (break_mvar, status_mvar)
      return (RunBreak info)

Threads and MVars are used to implement breakpoints. A new thread runs the computation, and the final result is stored in status_mvar. The withBreakAction wrapper arranges for the on_break action to be invoked when a breakpoint occurs. When a breakpoint is hit, the interpreter runs on_break, which communicates through status_mvar to the main thread that a breakpoint has occurred, and then waits on break_mvar. On receiving Break, the main thread records in the Session the data needed to resume the computation and returns RunBreak to the caller.
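The protocol can be modeled in miniature with ordinary Control.Concurrent primitives. This sketch is not the real GHC code: runWithBreak plays the role of runStmt, and the body signals hypothetical breakpoints itself rather than being interrupted by the interpreter.

```haskell
import Control.Concurrent
import Control.Exception

-- Either the computation finished, or it stopped at a "breakpoint".
data Status = Complete (Either SomeException ()) | Break String

-- Run the body on its own thread; it reports breakpoints through
-- status_mvar and blocks on break_mvar until the driver resumes it.
-- The driver logs every breakpoint and returns the log at the end.
runWithBreak :: ((String -> IO ()) -> IO ()) -> IO [String]
runWithBreak body = do
  status_mvar <- newEmptyMVar
  break_mvar  <- newEmptyMVar
  let on_break info = do
        putMVar status_mvar (Break info)   -- tell the driver we stopped
        takeMVar break_mvar                -- block until resumed
  _ <- forkIO $ do
         r <- try (body on_break)
         putMVar status_mvar (Complete r)
  let loop acc = do
        st <- takeMVar status_mvar
        case st of
          Break info -> do
            putMVar break_mvar ()          -- resume immediately
            loop (info : acc)
          Complete _ -> return (reverse acc)
  loop []

main :: IO ()
main = do
  hits <- runWithBreak $ \breakpoint -> do
    breakpoint "lines.hs:4"
    breakpoint "lines.hs:6"
  print hits  -- ["lines.hs:4","lines.hs:6"]
```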

The debugger can also stop execution when an exception occurs, even if it arises in compiled code (for example, evaluating head [] throws an exception at run-time). This is a natural choice, because exceptions are thrown by a compiler primitive (raise#), whose implementation invokes the breakpoint handler if one is set. The breakpoint handler is the IO action passed to withBreakAction. The behavior is the same as when a breakpoint occurs, except that there is no breakpoint location. One issue is the Ctrl+C keyboard combination, which raises an asynchronous exception that does not go through raise#. A solution is to catch this exception and rethrow it as a synchronous exception.

The following briefly shows how the debugger works, using a simple implementation of Data.List.lines as an example.

lines    :: String -> [String]
lines "" =  []
lines s  =  let (l, s') = break (== '\n') s
            in  l : case s' of
                      []      -> []
                      (_:s'') -> lines s''

To compile the program, just load it as normal. Let’s look at how lines behaves for leading and trailing newlines.

*Main> lines "
a"
["","a"]
*Main> lines "a
"
["a"]

We can place a breakpoint somewhere in the program. This is a place where execution will be interrupted so that the values of the local variables can be inspected. A breakpoint can be placed on a line containing an expression, or on a top-level function.

*Main> :break lines
Breakpoint 1 activated at lines.hs:(4,0)-(8,24)

Execution is interrupted when the breakpoint is met in lines.

*Main> lines "a
"
Stopped at lines.hs:(4,0)-(8,24)
_result :: [String]
[lines.hs:(4,0)-(8,24)] *Main>

The user is notified that a breakpoint has been hit, and the prompt changes to show the current source location. The _result variable is bound to the value of the expression at the breakpoint, which lets the user work with it. The function’s arguments can be inspected only after pattern matching has occurred. The :step command is used to debug step by step.

[lines.hs:(4,0)-(8,24)] *Main> :step
Stopped at lines.hs:(6,10)-(8,24)
_result :: [String]
s’ :: [Char]
l :: [Char]
[lines.hs:(6,10)-(8,24)] *Main>

Execution is now interrupted in the second equation of the program, at the outermost expression in the body of the let. Using the :list command, the source code around the current breakpoint is shown with the current expression highlighted. The values of l and s', bound by the let expression, can be inspected:

[lines.hs:(6,9)-(8,23)] *Main> (l,s')
("a","\n")

The line was split as expected. Stepping again executes the next piece of code.

[lines.hs:(6,13)-(8,23)] *Main> :step
Stopped at lines.hs:8:15-23
_result :: [String]
s'' :: [Char]

We can show the value of s''.

[lines.hs:8:15-23] *Main> s''
""

Clearly, the recursive call will now hit the base case of lines, returning the empty list. This explains why lines drops a trailing newline from the input.

Haskell programs run using a lazy evaluation strategy. Laziness is useful, but in many cases it hurts performance because it adds overhead to everything. To mitigate this, GHC uses strictness analysis, which tries to identify function arguments that are always evaluated, so that they can be evaluated by the caller instead. This approach can bring big improvements.
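The effect that strictness analysis automates can be reproduced by forcing an accumulator manually: foldl' evaluates its accumulator at every step, avoiding the chain of thunks that lazy foldl would build.

```haskell
import Data.List (foldl')

-- With lazy foldl, the accumulator would grow into a million-thunk
-- chain before being forced; foldl' forces it at each step, which is
-- the kind of transformation strictness analysis performs automatically.
total :: Int
total = foldl' (+) 0 [1 .. 1000000]

main :: IO ()
main = print total  -- 500000500000
```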

In Haskell, type inference occurs at compile time, when all the types are checked. Implementations may erase types at run-time, as they have compile-time proof of type safety.

Summary

This chapter provided a short description of how the GHCi debugger works. Due to its modularity and compositionality, Haskell is well suited to big data. When large volumes of data are involved, it is useful for a system’s tools to automatically support modularity in the interactions between components. Haskell manages software complexity very well.

As Don Stewart claims in his Google Tech Talk presentation “Haskell in the Large,” Haskell is also useful in big data applications because

  • the errors are caught earlier.

  • accidental interactions between components are limited.

  • pieces in isolation could be easily changed.

  • it provides strong and expressive types that lead to machine-checkable and modular software.

  • it is impossible to combine values in nonsensical ways.

  • phantom types could be used. Augmenting types with origin, security, or other metadata makes it possible to prove security and information-flow properties; the technique is very lightweight yet expressive, and a first step down the road to GADTs and type-level programs.

  • algebraic data types could be used. For example, suppose we need to produce only JSON (JavaScript Object Notation) data. For this, we define the JSON grammar as a (recursive) data type.

    data JSValue
      = JSNULL
      | JSDouble Double
      | JSString String
      | JSRecord [(String, JSValue)]
      | JSArray  [JSValue]
  • Abstract data types can gain primitives as new variants. For example, consider a query language for a key-value store.

    data Query
      = MetadataKey Is Key
      | MetadataValue Is Value
      | Size Is Integer [Integer]
      | SizeRange Is Integer Integer
      | IngestTime Is Datetime
      | Custom Is String

    data Is = Is | IsNot
  • Interfaces between Haskell and external systems are built from interpreters and compilers. For example, we want to compile a Query:

    compile :: [Query] -> QueryString
    compile [] = "*:*"
    compile qs = String.intercalate " " (map compileQuery qs)

    compileQuery :: Query -> String
    compileQuery (MetadataKey is k) =
        implode [compileIs is, "(", "customMetadataContent:", "\"k.", escape k, "\"", ")"]
    compileQuery (MetadataValue is v) =
        implode [compileIs is, "(", "customMetadataContent:", "\"v.", escape v, "\"", ")"]
    compileQuery (Size is s ss) =
        implode [compileIs is, "(", "size:", "("
                , String.intercalate " " (map compileInteger (s:ss)), ")", ")"]
    ...
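The snippet above relies on helpers (implode, escape, compileIs, compileInteger) that are not shown. The following self-contained sketch fills them in with hypothetical stand-ins, just enough to make the Size case runnable; the real definitions in Don Stewart's talk differ.

```haskell
import Data.List (intercalate)

type QueryString = String

data Is = Is | IsNot

-- Only the Size variant is kept for this sketch.
data Query = Size Is Integer [Integer]

-- Hypothetical stand-ins for the helpers assumed by the chapter's snippet.
implode :: [String] -> String
implode = concat

compileIs :: Is -> String
compileIs Is    = ""
compileIs IsNot = "-"

compileInteger :: Integer -> String
compileInteger = show

compileQuery :: Query -> String
compileQuery (Size is s ss) =
    implode [ compileIs is, "(", "size:", "("
            , intercalate " " (map compileInteger (s:ss)), ")", ")" ]

compile :: [Query] -> QueryString
compile [] = "*:*"
compile qs = intercalate " " (map compileQuery qs)

main :: IO ()
main = putStrLn (compile [Size Is 10 [20, 30]])  -- (size:(10 20 30))
```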