APPENDIX F: Answers to Chapter Challenges


Chapter 2

  1. Juno would be your safest bet if you haven’t used Julia before. Otherwise, you can get the same benefits by using IJulia with the latest Julia kernel (shown at the top right part of the IJulia screen).
  2. JuliaBox is the most well-known option. If, however, you don’t have a Google account or you don’t wish to give Google access to your code, you can use the one at tutorialspoint.com.
  3. IDE over REPL: Reasons include: ability to load/save scripts (.jl files), more user-friendly interface, easier to organize your code, easier to clean up code, code is retained if Julia crashes, built-in features such as a file explorer, and color-coding of the code that makes scripts easier to understand.

IJulia over other IDEs: Reasons include: scripts are much easier to share with non-programmers, plots are usually in the same area as the code, code can be exported as a variety of formats (e.g. .jl scripts, .ipynb notebooks, .html pages, .pdf documents), easier to use if you are coming from a Python or Graphlab background, formatting capabilities for relevant text.

  4. Internal reasons: Auxiliary functions allow you to organize your thoughts in a more high-level and functional way. They make the whole process of building a solution much more efficient, and make debugging much more straightforward and quick.

External reasons: Auxiliary functions allow other users of your code to grasp its functionality faster, and they make your code more reusable (the auxiliary functions can be used by themselves in other applications). They also make your main function much simpler (allowing for high-level understanding of what’s going on).

  5. It’s a function that combines (wraps) several auxiliary functions to accomplish a greater objective. If there is only one wrapper function, it is often referred to as the main function. Wrapper functions are common in complex applications and in various packages.
  6. sqrt(): It yields the square root of a number. It takes a non-negative variable of type Number as an input. It also works on arrays of such numbers, as well as on complex numbers (variables that are of type Number but not of type Real).

indmin(): This provides you with the index of the minimum value of a numeric array. It takes an array of any dimension as input. It also works on other collection types.

length(): This identifies the number of elements of a collection or the number of characters in a string. It takes an Array, Dict, Range, or AbstractString as an input, and always yields a single integer.
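A few illustrative calls of these functions (a minimal sketch, assuming the Julia version used throughout the book, where indmin() is available):

  sqrt(16.0)          # yields 4.0
  sqrt(complex(-4.0)) # yields 0.0 + 2.0im (negative numbers require a complex input)
  indmin([5, 2, 8])   # yields 2, the index of the smallest element
  length([5, 2, 8])   # yields 3
  length("Julia")     # yields 5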

  7. Assuming that y and z are of the same size, the expression always yields a float between 0 and 1 (inclusive). The first part of the expression, sum(y .== z), yields an integer between 0 and n, where n is the number of elements in y; n is also the value of the second part of the expression, length(y). The expression is the ratio of the two, so it must lie between 0 and 1.
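A quick illustration of this, with hypothetical vectors y and z:

  y = [1, 0, 1, 1]
  z = [1, 1, 1, 0]
  sum(y .== z) / length(y) # 2 matches out of 4 elements, so the result is 0.5
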
  8. For a file named Array A.csv in the path d:\data, use the command writecsv("d:\\data\\Array A.csv", A) (note the doubled backslashes, since \ is the escape character in Julia strings).
  9. Pick the most important variables, say A, B, and C, and a name for your data file, say “workspace - Friday afternoon.” Then load the JLD package:

  using JLD

Afterward, run the following commands:

      f = jldopen("workspace - Friday afternoon.jld", "w")

      @write f A
      @write f B
      @write f C

      close(f)

You may need to change the first command slightly, if you wish to add a path, as in the previous question.

  10. max() works on a pair of numbers and takes two inputs of type Number. maximum() is the generalization of max(), applicable to a series of numbers; it takes a single input: the array containing these numbers. To find whether a or b is larger (where a, b <: Number), you need to use the former: max(a, b).
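For example:

  a, b = 3, 8
  max(a, b)          # yields 8 (pairwise comparison of two numbers)
  maximum([3, 8, 5]) # yields 8 (maximum over a whole array)
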
  11. Use Pkg.add(“NMF”); afterwards, you can update and load the package as follows:

  Pkg.update()

  using NMF

  12. No. kNN, like any other classifier, is unable to work with text data. The distance calculation that lies at its core only works with numeric data. Nevertheless, kNN could be used in text analytics with the right data engineering, as we’ll see in Chapter 6.
  13. The fastest way to do this is to change the reference to the distance() function in the kNN() function to a Manhattan distance function, if you already have one implemented. If not, you can edit the distance() function by replacing the line

  dist += (x[i] - y[i])^2

  with the following:

  dist += abs(x[i] - y[i])

Since you might need to revert to the original distance calculation when working with different datasets, you may want to keep the original line as a comment (i.e. add a “#” in front of it).
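To illustrate, here is a minimal sketch of a standalone Manhattan distance function of the same general shape (the function name and signature are hypothetical, not necessarily those of the book's distance() function):

  function manhattan_distance(x::Array{Float64, 1}, y::Array{Float64, 1})
      dist = 0.0
      for i in 1:length(x)
          # dist += (x[i] - y[i])^2 # Euclidean-style term, kept as a comment
          dist += abs(x[i] - y[i])  # Manhattan (city-block) term
      end
      return dist
  end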

Chapter 3

  1. Yes, I have!
  2. Julia functions are your best choice in most cases. However, if you have a function implemented in C, you may want to make use of it, since in many cases it may improve performance slightly.
  3. A dictionary structure (Dict) would be most suitable.
  4. It does make sense if you know what you are doing and don’t plan to share your functions with other users. If you accidentally use these functions with inputs of irrelevant types, however, this may cause Julia to throw exceptions or errors. Also, the performance of the function is bound to be suboptimal.

Chapter 4

  1. Yes, using the multiple dispatch characteristic of Julia. So, if the function awesome_fun(A::Array) is designed to calculate the entropy of an array A, you can extend it by writing another method, awesome_fun(f::ASCIIString), that calculates the entropy of the contents of a file f. This way, you can run awesome_fun() on either one of these two cases seamlessly.
  2. The hdist() function takes AbstractString inputs, whereas a and b are Char variables.
  3. Yes. Multiple dispatch.
  4. The answer is:

  function word_counter(text::AbstractString)
      words = split(text, " ")
      word_count = length(words)
      return word_count
  end

  text = "Data science is the coolest field today."
  println(word_counter(text)) # yields the number 7

  5. The answer is:

  function non_space_characters_prop(text::AbstractString)
      chars = split(text, "")
      N = length(chars)
      n = sum(chars .!= " ")
      return n / N
  end

  text = "Data science is the coolest field today."
  non_space_characters_prop(text) # yields approximately 0.85

  6. Here the main function is digit_freq(). The functions convert_to_string() and find_most_popular_item() are auxiliary.

  function convert_to_string(A::Array{Float64, 1})
      temp = string(A)
      return temp[2:(end-1)] # strip the enclosing brackets from the string representation
  end

  function find_most_popular_item(D::Dict)
      ind = indmax(values(D))
      temp = collect(keys(D))
      return temp[ind]
  end

  function digit_freq(A::Array{Float64, 1})
      temp = convert_to_string(A)
      freqs = Dict{Char, Int64}()

      # count the occurrences of each digit character
      for character in temp
          if character in "1234567890"
              if haskey(freqs, character)
                  freqs[character] += 1
              else
                  freqs[character] = 1
              end
          end
      end

      digit = find_most_popular_item(freqs)
      return digit
  end

For large enough samples of numbers, it is expected that ‘1’ will be the most common digit (Benford’s law).
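A hypothetical usage example of the above (the sample itself is arbitrary):

  A = rand(1000) * 100 # a hypothetical sample of Float64 values
  digit_freq(A)        # yields the most frequent digit as a Char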

Chapter 5

  1. This is the part of the data science pipeline that has to do with getting the data ready for analysis. It involves data preparation, exploration, and representation. Often referred to as “data wrangling,” it is one of the most necessary aspects of data science.
  2. It is important as it makes the data ready for data exploration, taking care of the missing values, pinpointing potential issues, cleaning it, normalizing it, and in many cases removing at least some of the outliers.
  3. The data science pipeline is geared towards complex and/or messy data, with the goal of creating a data product and/or actionable insights about the future. Other data analysis pipelines focus more on creating some summary or interesting conclusion, usually related to the past or present. Also, the data in other data analysis pipelines is relatively simpler and cleaner.
  4. We can clean it up a bit and transform it into something like this:

  Variable Text: “the customer appeared to be dissatisfied with product 1A2345 released last may”

  Variable RelatedProduct: 1A2345

  5. It entails understanding the dataset through examination of the variables’ distributions, how they are related to each other, and their relationship to the target variable. All this is done by looking into the geometry of the feature space and by applying various visuals.
  6. The main process involved is encoding the data into the most appropriate data types. In many cases, it also involves the featurization of the data (especially in the case of text data).
  7. This is anything related to the discovery of patterns and connections within the data, usually directly applicable to the data learning phase that ensues.
  8. This entails getting the computer(s) to learn from the dataset we have prepared in the previous stages. It involves the intelligent analysis of that data and the use of regression, classification, clustering, and other techniques to derive some kind of generalization or applicable insight.
  9. This is the process of deploying the model created in the previous stage into production. This is usually done through the development of an API, an application, or a dashboard; it aims to make the model accessible to an audience different from the data science and engineering teams (often the clients of the organization owning this product).
  10. During this phase a number of visuals and reports are created. The target audience is management, and sometimes other parts of the organization owning this pipeline. This is different from the data product, which is more often than not made available to a broader audience, either at a price or for promotional purposes. The insight, deliverance, and visualization phase also encompasses the results and insights from the data product (e.g. how successful it was, what feedback was received from the users) and is essential for the evaluation of the whole data science project. It is a return to the starting point, in a way, often spawning a new cycle for the project.
  11. It is highly non-linear, since there are often detours from the “normal” flow as new tasks come about or things don’t go as expected.
  12. An app that calculates your average calorie consumption based on your movement throughout the day, a recommender system on a social media platform, or a risk evaluation system for potential clients.
  13. These tend to be more polished and convey more high-level information.
  14. Yes, since every one of them contributes to either transforming, analyzing, or distilling the data as it is gradually transmuted into information. Omitting certain ones may be possible in some cases (e.g. when all the data is tidy and ready to be analyzed), but this is quite rare.

Chapter 6

  1. Data engineering is important since it allows you to identify the most information-rich aspects of your data and prepare them for further analysis.
  2. Data frames are capable of handling missing values (coded as NA). Also, it is possible to host all kinds of data types in the same data frame.
  3. By opening the file and parsing it with the JSON package:

  using JSON

  f = open("file.json")
  X = JSON.parse(f)
  close(f)

  4. There are three main strategies:
  a. You can process it on a cluster or cloud environment.
  b. You can process it natively by breaking it down into manageable pieces (see the sketch below).
  c. You can load the whole thing into an SFrame, after installing the corresponding package.
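Here is a minimal sketch of strategy (b), processing a large delimited file one record at a time instead of loading it all into memory (the file name and the per-record processing are hypothetical):

  f = open("big_dataset.csv") # hypothetical file name
  counter = 0

  for line in eachline(f)
      fields = split(line, ",") # process one record at a time
      counter += 1              # here we merely count the records
  end

  close(f)
  println(counter)
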
  5. It entails the following processes, depending on the kind of data we have:
  a. For numeric data: eliminate missing values, handle outliers.
  b. For text data: remove unnecessary characters, remove stop words (in the case of text analytics).
  6. Because they allow for more efficient resource management, and they represent your data in a way that makes sense both to the user and to the Julia functions used in the stages that ensue.
  7. We’d do this by normalizing them, ideally to (0, 1). However, a normalization to [0, 1] would also be very good, assuming there are no outliers in the dataset. Alternatively, a normalization using the mean and the standard deviation is also a feasible option, as in the sketch below.
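A minimal sketch of these normalization options, for a hypothetical numeric vector x (mean() and std() are available out of the box in the Julia version used in the book):

  x = [1.0, 5.0, 3.0, 9.0] # hypothetical data

  x_minmax = (x .- minimum(x)) ./ (maximum(x) - minimum(x)) # scaling to [0, 1]
  x_zscore = (x .- mean(x)) ./ std(x)                       # mean 0, standard deviation 1
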
  8. The AbstractString type is the obvious choice. However, various subtypes of Integer would also be useful, particularly if you are planning to do a frequency analysis. The Bool type is also useful if you want to perform a lexicon analysis or something along these lines. In all cases, a SparseMatrixCSC data type will come in handy, especially if you are dealing with a dataset having a large vocabulary.
  9. A BitArray, a Bool array, or an array of Int8 type would work.
  10. OnlineNewsPopularity: Correlation

  Spam: Jaccard Similarity, Mutual Information, or Similarity Index

Chapter 7

  1. Remove one of them, taking into account its correlation to the other variables and its similarity to the target variable.
  2. Never. You can only disprove it, with a certain degree of confidence.
  3. If the target variable is ordinal and each class is equally different from its neighboring classes. Since this is rarely the case, it’s best not to use correlation in classification problems.
  4. Be sure to perform data preparation first, creating as many plots as you can. An observation could be something like “variable A appears to be closely related to variable B,” “variable C seems to be a very good predictor for the target variable,” “variable D has a few outliers,” “the variables need to be normalized,” etc.
  5. Yes, although it works best on variables following the normal distribution. Since this rarely happens, it is often used regardless.
  6. No. In order to draw statistically significant conclusions you need a larger sample.
  7. Use a scatter plot, as in the example below.
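For instance, using the Gadfly package (the variables here are hypothetical):

  using Gadfly

  A = rand(100)      # hypothetical numeric variable
  B = 2A + rand(100) # another variable, related to A
  plot(x = A, y = B, Geom.point) # scatter plot of A versus B
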
  8. Visualizing the whole dataset, transforming it into a smaller dimensionality, without much distortion. Based on this, you can deduce that this method allows for better depiction of meaningful patterns, such as clusters, and assessment of classification difficulty.

Chapter 8

  1. Yes, as the number of features is very high for most data analytics methods. Any one of the described methods would work. If labels are available, use the methods based on feature evaluation, as they would be considerably faster and less prone to memory overflow problems for this particular dataset.
  2. Yes, as the number of features is still somewhat high for many data analytics methods. Any one of the available methods would work, though PCA would be preferable for performance reasons.
  3. A method based on feature evaluation, as the dataset is too small for PCA to work properly.
  4. Possibly, though not as a first choice. If the attempted data analytics methods don’t work well, or if some of the features are shown to be redundant based on the data exploration stage, try applying one of the dimensionality reduction methods.
  5. No. Data science projects usually involve a large number of features, making the total number of possible feature combinations extremely large. Applying such a method would be a waste of resources, while the additional benefit of the optimal reduced feature set may not improve the performance of the data analysis much more than the result of a sub-optimal feature reduction method.
  6. The use of additional information residing in the target vector (i.e. what we are trying to predict).
  7. In this case the ICA method would be best, as it yields statistically independent meta-features.
  8. Any one of the more sophisticated approaches would work. If you don’t mind playing around with parameters, the GA-based method would be best.

Chapter 9

  1. It is possible, but only if there is a discrete variable in the dataset and it is used as the target variable. After the stratified sampling, you can merge the two outputs together and use the new matrix as your sample dataset.
  2. Stratified sampling would maintain the signals of the minority class(es) to some extent. Using basic sampling might result in a sample with the smaller class(es) poorly represented.
  3. Randomization. As any statistics book would stress, this is key for any decent sampling, as it is the best way to deal with potential biases in the sample.
  4. Yes, although it’s not a method you would see in any data science book. To normalize the total cost to be between 0 and 1, first find the maximum possible cost (assume that all elements of each class are misclassified in the worst possible way, i.e. yielding the highest misclassification cost). Then take the ratio of the total cost of a given classification to the maximum cost, as in the sketch below.
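A minimal sketch of this idea, assuming a (hypothetical) cost matrix C where C[i, j] is the cost of classifying a point of class i as class j, along with true labels t and predicted labels p:

  C = [0.0 1.0 2.0;
       1.0 0.0 1.0;
       4.0 2.0 0.0]
  t = [1, 2, 3, 3, 1] # true classes
  p = [1, 3, 3, 1, 2] # predicted classes

  total_cost = sum([C[t[i], p[i]] for i in 1:length(t)])
  max_cost = sum([maximum(C[t[i], :]) for i in 1:length(t)]) # worst possible misclassifications
  normalized_cost = total_cost / max_cost                    # a value between 0 and 1
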
  5. No, unless you break the problem into three smaller ones, each having a binary result (e.g. “class 1” and “Any other class”). Then you can create an ROC curve for each one of these sub-problems.
  6. You can use KFCV for any problem with a target variable that doesn’t have many different values. So, although this is designed for classification problems only, you could theoretically use it for regression too, if the number of distinct values in the target variable were relatively small.

Chapter 10

  1. Distances are important in clustering and any other distance-based algorithms. This is due to the fact that they directly influence the function and the results of these algorithms. Picking the right distance metric can make or break the clustering system, yielding either insightful or nonsensical outputs.
  2. Use DBSCAN or some variant of it.
  3. Those metrics rely on the presence of a target variable, containing information independent from the data points in the feature set. Applying these metrics to clustering would be confusing, if not entirely inappropriate.
  4. Only numeric data can be clustered directly; non-numeric data can be clustered if it is first transformed into binary features. All data must be normalized before being used in clustering, in order to obtain unbiased results.
  5. Partitional clustering is not limited to two or three dimensions, and its objective is a meaningful grouping of the data at hand. t-SNE aims to limit the dimensionality to a manageable amount without significantly distorting the dataset structure, so that it can be visualized well. Theoretically the output of t-SNE could be used for clustering, although that’s not often the case.
  6. Technically no, since its output is not directly applicable in the rest of the pipeline. However, it is useful in datasets having a certain level of complexity, as it allows us to understand them better and make more interesting hypotheses about these datasets.
  7. High dimensionality can be a real issue with clustering, as the distance metrics generally fail to provide a robust way to express the dissimilarities of the data points (particularly when it comes to the diversity of these dissimilarities). However, this issue can be tackled effectively by making use of dimensionality reduction methods, such as PCA.
  8. You can multiply this feature by a coefficient c that is greater than 1. The larger the coefficient, the more pronounced the feature will be in the clustering algorithm.
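A minimal sketch of this (the data matrix and the choice of feature are hypothetical):

  X = rand(100, 5) # hypothetical normalized dataset (rows = data points, columns = features)
  c = 2.5          # coefficient greater than 1

  X[:, 3] = X[:, 3] * c # the 3rd feature now weighs more in all distance calculations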

Chapter 11

  1. Try either a random forest, a Bayesian network, or an SVM. If it’s a regression problem, statistical regression would also be an option. If the dimensionality is reduced to a manageable number of features, a transductive system would also work. The fact that the original features are independent of each other would make any tree-based approach ideal, though a Bayesian network would also benefit.
  2. Any one of the tree-based systems would do. Bayesian networks would also work well in this case. Depending on the amount of data available and the dimensionality, ANNs and transductive systems would also be viable options.
  3. Neural networks, since they can handle noisy data well, are ideal for large datasets, and excel at the task of classification.
  4. No. Even a robust supervised learning system would benefit a lot from intelligent data engineering.
  5. Tree-based systems and, in the case of regression applications, the generalized statistical regression model.
  6. Neural networks are an extension of statistical regression, where everything is automated. Each node in the second layer can be seen as the output of a statistical regression model, making a 3-layer ANN an ensemble of sorts, comprising statistical regression systems.
  7. Dimensionality. The use of feature selection or of meta-features instead of the original features would solve the problem effectively.
  8. Deep learning systems are not panaceas. They work well for certain types of problems, where data points are abundant and resource allocation is generous. It would be overkill to use them for every problem encountered, especially if interpretability is a requirement.

Chapter 12

  1. They help us model and visualize complex datasets in a simple and intuitive way, and they can also tackle certain datasets (e.g. social network data) that are practically impossible to model with conventional data science techniques.
  2. If we denote each feature as a node and its similarities to (or distances from) other features as the corresponding edges, we can graph the whole feature set. Then we can see how independent the features are using basic graph analysis. Also, if we correlate the features with the target variable (using an appropriate metric), we could evaluate the various features. All this analysis can help us get an idea of the robustness of the feature set.
  3. Possibly, depending on the data. Certain datasets should be modeled this way, but sometimes conventional techniques work well enough or better than graphs. Graph analysis generally works best in cases where the relationships among the data points, features, or classes are important.
  4. Without a doubt. In fact, we already implemented such a system in MATLAB several years back. However, this never reached the public domain, as there were other more pressing innovations that took precedence. Anyway, an MST-based classification system could be implemented as follows:
  a. Inputs: a feature set for training, labels for training, a feature set for testing, and a distance metric.
  b. Training: Create an MST for each class based on the data points of the training set and their distances (calculated using the given distance metric), and calculate the average edge weight.
  c. Testing: For each data point in the test set and for each class, add the point to that class, create a new MST, and compare its average edge weight with that of the original MST for that class. Assign the class label corresponding to the smallest change in average edge weight.

It would be best to try building it yourself before checking out the provided code (MSTC.jl). Nevertheless, you are welcome to use the code provided, as long as you reference it properly.

Despite its simplicity, the MST-based classifier is powerful due to its holistic nature (even though this is not so apparent at first). However, it inherits the shortcomings of all distance-based classifiers, and it doesn’t scale well.

  5. No. You need to do a lot of preprocessing first, and model the data in such a way that the graph makes sense.
  6. The answer is:

  function MST2(g, w) # MAXimum Spanning Tree
      E, W = kruskal_minimum_spantree(g, -w)
      return E, -W
  end

  7. No, because there is additional information related to the graph that is not contained in the graph object of either package. Most importantly, the edge weights are not stored in the graph data file.

Chapter 13

See corresponding IJulia notebook.
