At this stage in the Guerrilla Analytics workflow of Figure 19, data has been extracted from a source system and received by the analytics team. This data has been successfully loaded into the target DME. You have now reached a stage in the analytics workflow where you can do some actual analytics. This involves writing program code in your data analytics environment. Program code has several purposes in analytics work.
The outputs of your program code will be one or more datasets and perhaps some graphical outputs. Keeping in mind the highly dynamic Guerrilla Analytics environment, you need to strive to write code that is easy to understand and review, and that preserves the provenance of your data. You need to do this with minimal documentation and process overhead.
7.1.1. Example Activities
Here are some examples of analytics program code.
• A single SQL code file that connects to a relational database, goes through a supplier address table, and identifies the address country for each supplier by looking for recognized zip code patterns in the supplier address fields. This derived address country is added to the dataset as a new data field.
• A single R code file that reads a CSV data file, classifies its columns into variables, runs a statistical regression analysis on the variables, and then outputs the analysis results as tables and as plots to image files.
• A Python script that runs through a directory of thousands of office documents, calls an external tool to convert them to XML format, and saves each XML file beside its original office document under the same name. This prepares the data for further entity enrichment with another tool.
• Twenty code files that must be run in a particular order to manipulate and reshape data so it can be imported into a data-mining tool.
• A SQL code file that creates a predefined subset of data according to business rules and exports this subset of data into a spreadsheet with mark-up columns for the customer to review and complete.
• A collection of code files that incorporate spreadsheet inputs from users and build them into a data repository so they can be summarized for Management Information (MI) reporting and checked for inconsistencies.
• A direct export of a dataset from the DME so that customers can do their own work with the data.
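To make the first example concrete, here is a minimal sketch of the zip-code heuristic, written in Python rather than SQL. The country patterns, field names, and supplier records are illustrative assumptions only; a real rule set would cover many more countries and formats.

```python
import re

# Illustrative postal-code patterns only; a production rule set would
# cover many more countries and be validated against reference data.
COUNTRY_PATTERNS = {
    "US": re.compile(r"\b\d{5}(-\d{4})?\b"),
    "UK": re.compile(r"\b[A-Z]{1,2}\d[A-Z\d]? ?\d[A-Z]{2}\b"),
    "NL": re.compile(r"\b\d{4} ?[A-Z]{2}\b"),
}

def derive_country(address: str) -> str:
    """Return the first country whose postal-code pattern matches."""
    for country, pattern in COUNTRY_PATTERNS.items():
        if pattern.search(address):
            return country
    return "UNKNOWN"

# The derived value is added as a new field; the raw address is left
# untouched, which preserves the provenance of the source data.
suppliers = [{"supplier": "Acme Corp", "address": "12 Main St, Springfield, 62704"}]
for row in suppliers:
    row["address_country"] = derive_country(row["address"])
```

Note that the derived country is stored alongside, not in place of, the source field, so the original data remains available for review.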
There are of course many more examples. I have deliberately chosen these examples to draw attention to some important Guerrilla Analytics themes.
• Many languages: A variety of data manipulation languages may be in use in the team either because of the functionality required or because they are familiar to particular team members. These languages will have code files with different structures because of the languages’ designs.
• Multiple code files: Some work requires a single program code file while other work involves several code files that must be executed together in a particular order. It is the analyst’s choice how the data flow is split into code files.
• Multiple outputs and output formats: Some work involves multiple outputs in a variety of formats such as data samples in spreadsheets and graphical charts in image files.
• Multiple environments: Some work runs on the file system, while other work runs within a database or other DME.
• Human-keyed inputs: Some work incorporates user inputs and mark-ups.
• Tiny code: Even the simplest code – extracting a copy of a dataset – is something that needs to be traceable to maintain data provenance.
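One lightweight way to keep even a trivial export traceable is to write a small provenance note beside the exported file. The following sketch is illustrative only; the function name, file layout, and recorded fields are assumptions, not a prescription from the workflow.

```python
import csv
import datetime
import pathlib

def export_dataset(rows, out_path, source_name):
    """Export a copy of a dataset to CSV and write a sidecar provenance
    note, so even this trivial operation remains traceable."""
    out = pathlib.Path(out_path)
    with out.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)
    # Minimal provenance record: where the data came from, when it was
    # exported, and how many rows were written.
    note = out.with_name(out.stem + ".provenance.txt")
    note.write_text(
        f"source: {source_name}\n"
        f"exported: {datetime.datetime.now().isoformat()}\n"
        f"rows: {len(rows)}\n"
    )
```

A call such as `export_dataset(rows, "suppliers_copy.csv", "supplier_db")` then leaves behind both the data copy and a record of its origin.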