Chapter 9


Choosing your technologies

You’ll find it’s become much easier to start collecting and using big data thanks to recent developments in technology, but the range of technologies can be bewildering. Do an image search for the most recent ‘Big Data Landscape’ and you’ll see what I mean. We should never start our big data journey by focusing on technology, but we won’t get far into the journey without it.

We’ll talk about your technology stack, which is the set of components that together make up your technology solution. For example, your production stack might be Java or C++ code running on an Ubuntu (Linux) operating system, which may in turn be running inside Docker containers on HP servers, possibly cloud-based. Most components of your stack will already be in place, courtesy of your IT department.

In assembling your technology for big data, you’ll need to make decisions such as:

  1. What computer hardware will you use?
  2. Where will that hardware be located?
  3. How will you construct the data pipeline (including choice of source systems and sensors, transfer mechanisms, data cleaning processes and destination databases or applications)?
  4. What software will you use: programming languages, libraries, frameworks and third-party tooling?
  5. How will you deliver the results to the internal and external end users?

Some of these decisions are more critical than others, and many will be dictated or heavily influenced by your industry sector and your organization. The second question above, hardware location, is the newer consideration in the world of IT, and I’ll devote more space to it below.

Choosing your hardware

Look closely at your current and projected requirements for data volume, processing and transfer. Most software applications will specify minimum and recommended hardware specs, including details of number and power of processors, memory (RAM), storage (disk) and networking capabilities. Big data solutions will often require clusters of machines, typically 3–6 to cover basic functionality, but scaling up to tens of thousands for large applications. Neural networks are fastest when run on specialized processors, such as GPUs, rather than on standard CPUs, so don’t plan to re-deploy standard processors for such applications.

Choosing where your technology is located: cloud solutions

We introduced cloud computing in Chapter 5, where we described public and private clouds, the latter being when a large company dynamically allocates centralized computing resources to internal business units. Cloud technologies include hardware and software applications such as email, databases, CRM systems, HR systems, disaster recovery systems, etc.

A study by Dell reported that 82 per cent of mid-market organizations across the globe were already using cloud resources in 2015, with 55 per cent of organizations utilizing more than one type of cloud.1 Organizations actively using cloud reported higher revenue growth rates than those who weren’t. The main perceived benefits of cloud computing are outlined in Figure 9.1.

Cloud computing enables you to quickly provision cloud-based hardware, a service known as Infrastructure as a Service (IaaS). This is key to your big data initiatives. The underlying principle in choosing technology for big data is moving quickly and with flexibility. IaaS provides this, allowing you to scale storage and processor capacity within minutes.

Figure 9.1 Top perceived benefits of cloud computing.1

Case study – 984 leftover computers

To illustrate the advantage of rapidly scaling infrastructure without committing to purchase, consider Google’s earlier image detection efforts. Initially built using CPUs running on 1000 computers, the hardware cost Google roughly one million dollars. They subsequently redeployed the project on GPUs, which worked so much better that they were able to run the model at a fraction of the cost on just sixteen computers (roughly $20,000).71 Most companies could not afford such a large hardware purchase for an experimental project, nor could they easily explain to their finance department the hundreds of computers purchased but no longer needed.

You’ll still need to install software on the cloud-based hardware (the operating system, middleware, storage, etc.), but if you want to move straight to running applications, you can utilize a Platform as a Service (PaaS) offering, which may be proprietary or may be an open-sourced product implemented and maintained by a service provider. In this way, your analytics programme can outsource both the hardware and the foundational software and work directly on your application.

You may have concerns regarding security in the cloud. For public clouds and Software as a Service (SaaS), security is in fact the biggest barrier for companies considering the cloud. In the Dell study cited above, 42 per cent of companies not yet using cloud cited security as the reason, far more than any other factor. European companies often want to keep their data within Europe, particularly following Edward Snowden’s revelations and the resulting turmoil around the EU–US Safe Harbor provisions.

In industries such as finance, where security, reliability and compliance are particularly critical, companies have traditionally opted to manage their own data centres to keep tighter control of security and reliability. However, cloud providers have steadily alleviated the security concerns of these sectors, and companies in the financial, pharmaceutical, and oil and gas sectors have started utilizing cloud technologies.72

Some companies attest that running applications in the cloud leads to more secure applications, as it forces them to leave behind insecure legacy software. Because they are built on more modern foundations, applications designed for the cloud generally have stronger controls, better monitoring capabilities and better overall security. The cloud providers also offer a degree of consistency that enhances security, and they are themselves very conscious of securing their assets.

Moving, cleaning and storing your data: data pipelines

You’ll need to architect your data pipeline, selecting data warehousing and middleware such as messaging systems for transferring information in real time (e.g. Kafka, RabbitMQ).
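
To make this concrete, here is a minimal sketch of publishing events to a messaging system using the kafka-python client. The broker address, topic name and event fields are hypothetical placeholders chosen for illustration, not a recommendation of any particular setup.

  # Minimal sketch: publishing a clickstream event to a Kafka topic.
  # Broker address, topic name and event fields are hypothetical.
  import json
  from kafka import KafkaProducer

  producer = KafkaProducer(
      bootstrap_servers="localhost:9092",
      value_serializer=lambda event: json.dumps(event).encode("utf-8"),
  )

  event = {"user_id": 42, "page": "/pricing", "timestamp": "2024-01-01T12:00:00Z"}
  producer.send("clickstream", event)  # asynchronous send
  producer.flush()                     # block until the event has been delivered

A messaging layer like this decouples the systems producing data from the databases and applications that consume it, which is what makes real-time pipelines manageable.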

Moving and cleaning data is generally the most time-consuming part of an analytics effort. You can purchase an ETL tool to do much of the heavy lifting in data processing. It should provide useful ancillary functionality, such as for documentation. A good ETL tool can make it easy to add a new data source, pulling data not only from a traditional database but also from newer sources such as web analytics servers, social media, cloud-based noSQL databases, etc.
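
For a concrete picture of the extract-transform-load pattern itself, here is a minimal sketch in Python using pandas and SQLAlchemy. The file name, column names and connection string are hypothetical; a commercial ETL tool adds scheduling, monitoring and documentation on top of steps like these.

  # Minimal ETL sketch: extract a raw export, clean it, load it into a warehouse table.
  # File name, column names and connection string are hypothetical.
  import pandas as pd
  from sqlalchemy import create_engine

  # Extract: read a raw export from a source system
  orders = pd.read_csv("raw_orders.csv", parse_dates=["order_date"])

  # Transform: basic cleaning
  orders = orders.drop_duplicates(subset="order_id")
  orders["country"] = orders["country"].str.upper().str.strip()
  orders = orders.dropna(subset=["customer_id", "order_total"])

  # Load: append the cleaned rows to the destination table
  engine = create_engine("postgresql://analytics:secret@warehouse:5432/dw")
  orders.to_sql("clean_orders", engine, if_exists="append", index=False)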

You’ll also need to select and prepare the destination database(s). As we discussed earlier, there are hundreds of database solutions to choose from. If you’re deeply rooted in a vendor technology, you may want to continue within that vendor’s product ecosystem, or you may consider adding new technologies, running several systems in parallel or as separate projects. Migrating from a proprietary to an open-source database can bring significant cost savings. One company recently reported cutting its cost per terabyte in half. You’ll also need to invest significant effort in setting up the logical and physical structures of the database tables in a way that best fits your intended use.

Choosing software

We should emphasize again that extensive libraries for analytics have been developed for the major programming languages, and you should start your analytics effort by working with what is already available. Popular languages and tools such as Python, R, SAS and SPSS already include extensive analytic libraries with large support communities. In Python, a developer can build a neural network with only a few lines of code by leveraging existing software packages such as Keras and TensorFlow.
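
As a rough illustration of how little code this can take, here is a minimal sketch of a small feed-forward classifier in Keras (shipped with TensorFlow). The input width, layer sizes and training data are placeholders chosen for the example, not tuned for any real problem.

  # Minimal sketch: a small feed-forward classifier in Keras.
  # Input width, layer sizes and data are placeholders.
  import numpy as np
  from tensorflow import keras

  model = keras.Sequential([
      keras.layers.Dense(32, activation="relu", input_shape=(20,)),
      keras.layers.Dense(16, activation="relu"),
      keras.layers.Dense(1, activation="sigmoid"),  # binary prediction
  ])
  model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

  # Train on a placeholder feature matrix X and label vector y
  X = np.random.rand(1000, 20)
  y = np.random.randint(0, 2, size=1000)
  model.fit(X, y, epochs=5, batch_size=32, validation_split=0.2)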

Don’t expect to find off-the-shelf software that completely solves your analytic challenge, but existing software should give you a good head start, particularly if it integrates seamlessly with your data pipeline and automates data processing. Keep in mind that solutions still need to be customized to your problem, and you’ll want to apply subject-matter expertise to engineer the model features that work best for your application. In addition, it is often the case that off-the-shelf solutions are simply implementing a common analytic model.

When purchasing analytic software, you should always ask yourself the standard questions that you would ask for any software purchase (cost, reliability, required training, etc.).

Keep in mind

Don’t expect to find an off-the-shelf solution that solves your problem without substantial additional effort.

Delivery to end users

If you’re building an analytic tool that will be used in a production environment, such as delivering a real-time recommendation for a customer or setting an optimal price based on real-time supply and demand, you’ll need to choose a delivery technology that fits the technical requirements and constraints of your delivery end-point. For example, your web page may access content by calling a REST service on your analytic server or by executing a direct database call within the network.
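
To make the REST option concrete, here is a minimal sketch of a web application requesting a recommendation from an analytic service over HTTP, using the Python requests library. The endpoint URL, request fields and response shape are hypothetical.

  # Minimal sketch: fetching a real-time recommendation from an analytic REST service.
  # Endpoint URL, request fields and response shape are hypothetical.
  import requests

  response = requests.post(
      "https://analytics.internal.example.com/recommendations",
      json={"customer_id": "C-1042", "context": {"page": "checkout", "basket_value": 87.50}},
      timeout=0.2,  # a production page cannot wait long for a recommendation
  )
  response.raise_for_status()
  recommendation = response.json()  # e.g. {"product_id": "P-889", "score": 0.93}

Whichever mechanism you choose, the latency and availability requirements of the production system will usually constrain your technology choices as much as the analytics themselves.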

Internal users will access your outputs either directly from a database, on reports or dashboards, or using a self-service BI tool. Reports and dashboards have the advantage of standardization and quality control and can be generated manually or with special-purpose software.

But data in reports ages quickly and may exclude important details. Their readers also cannot dig deeper to get more insights. This is one reason self-service BI is so important, and BI tools have come a long way over the past few years in providing this functionality. Like MS Excel but much more powerful, these self-service tools allow users to create graphs and pivot tables and explore relationships and segments not shown in other reports. Look for these self-service capabilities when choosing your BI technology.

Considerations in choosing technologies

As you select technologies for your big data projects, consider the points outlined below.

  1. Capabilities matching business requirements Look critically at your need-to-haves vs nice-to-haves, and consider how those may develop over time. Products may be ‘best in class’ because of features not important for your application. Interview your stakeholders to understand requirements such as:
    • How frequently should data be refreshed?
    • What needs to happen in real time rather than batch (once daily, typically overnight)?
    • Which data sources will be accessed?
    • Is it important that the system is available 100 per cent of the time?
    • What technologies can your colleagues easily work with, based on their skill sets?
    As you speak with technology vendors and users in other organizations, you’ll become aware of additional features and use cases that may not have surfaced during your internal discussions.
    You should consult a wide range of stakeholders, including:
    • Budget holders, who will oversee costs and have preferences for CapEx or OpEx.
    • Legal and privacy officers, who will have requirements related to data location, governance, fair use and accessibility.
    • IT teams, who will help you leverage technologies and skills already in your organization (Python, for example, is commonly used across IT teams). They will also have technical requirements that you must satisfy.
    • Business units, who will have requirements related to usability and delivery. Their input will certainly impact your choice of BI tools and could potentially impact any part of the big data technology stack, with requirements related to latency, accuracy, speed, concurrency, consistency, transparency or delivery.
    You may need to set aside your agile mindset when choosing technology. Some solutions, after an initial test period and a limited proof of concept, require a significant deployment decision. In such cases, conduct a thorough requirements analysis before making significant investments or deployment efforts.
    To illustrate, assume you work in the financial services industry, where security, reliability and compliance are critical. Companies in this industry have traditionally opted to manage their own data centres and keep complete control over security and reliability. They avoid early adoption of technologies, particularly open-source ones. Early versions of Spark and Kafka were not even an option, as they did not support SSL security protocols.
    In financial services, you would have stringent requirements related to auditing, which is typically more difficult with open-source software. Whereas most companies plan their systems assuming a certain degree of system failure, you would require extreme reliability from each system component.
    If you were in financial services, your big data technology choices would thus be guided by the following principles:
    • You would not be an early adopter of new technologies.
    • You would choose the most reliable software, regardless of whether it is open-source.
    • You would maximize reliability by purchasing support.
    • You would be very cautious when deciding to use cloud-based servers.
  2. Technology recommendations You’ll find it’s often quite difficult to evaluate a technology. You’ll see product features listed on marketing material, but you need insights into usability, performance, reliability, and the spectrum of undocumented features that will determine the success or failure of the technology within your organization.
    Start by gathering insights and experiences from within your own organization and professional network. If your organization holds a Gartner or Forrester subscription, you’ll want to set up analyst interviews and request relevant analyst papers. If you don’t have such a subscription, you can often speak with one of these analysts at a conference. Bear in mind that their expertise may be stronger in vendor technologies than in open-source ones.
    Some independent thought leaders publish reviews and recommendations, but be aware that they are often paid for their endorsements. Look also at online forums, including Slack channels, which provide a continuous stream of insights into technologies. These are frequented by some very knowledgeable practitioners, and user voting systems help keep quality high. In fact, the developers of the technologies are themselves often active on such forums.
    Take care when attempting to replicate solutions chosen by others. Small differences in requirements can lead to completely different technology choices. To illustrate, Spark is a widely referenced technology for streaming analytics, so we may see frequent mention of it online. But because Spark processes data in micro-batches, it is generally not appropriate for solutions requiring a latency of under 500 milliseconds (½ second), and Apache Flink, a technology that originated in Germany, would probably be more appropriate for such applications.
  3. Integration with existing technology Consider how you’ll integrate your analytic solution internally as well as with your customers’ technologies. Try to choose solutions that are modular (and hence provide more versatility). However, the pricing and usability benefits of packaged capabilities, combined with automated data transfer features, may make more coupled solutions attractive. Larger vendors tend to create solutions that span multiple applications, including basic analytics within a visualization tool (e.g. Tableau), machine learning within a cloud environment or a larger software suite (e.g. Microsoft, SAS or IBM), ETL and delivery solutions coupled with a data warehouse (Microsoft’s BI stack) or AI capabilities within a CRM system (Salesforce’s Einstein). For such applications, you’ll want to consider whether such an offering fits your requirements in a way that better optimizes data flow or minimizes incremental software costs. Understand the technology platforms of your target B2B customers, which may lead you to develop integrations with or parallel solutions within those technologies or cloud environments.
  4. Total cost of ownership Many organizations see cost as a large barrier to using big data. In Dell’s 2015 survey, the top barriers to increasing the use of big data included the costs of IT infrastructure and the cost of outsourcing analysis or operations. Consider both direct and indirect costs, including licensing, hardware, training, installation and maintenance, system migration and third-party resources. Your IT department should already be familiar with this costing process, having performed similar analysis for existing technology.
    These costs continue to fall, and if you’ve done your homework in preparing your business case, you should be able to choose the projects and solutions that will result in positive ROI.
  5. Scalability Consider how the technology can handle increases in data, replications, number of users and innovative data sources. Consider also how the licensing model scales. The license for BI tools may be manageable when deployed to a dozen users, but prohibitive at the point where you want to empower several hundred employees with its self-service capabilities. A lack of planning in this area can lead to some painful budgeting moments later.
  6. Extent of user base If you choose a fringe technology, it will impact your ability to find external support as well as to hire and train internal staff to operate the technology. The broader the adoption of the technology, particularly within your geography and industry sector, the more likely you will be able to hire qualified staff. There will also be more support available, both from third parties and from online forums such as Stack Overflow and Slack groups. Similarly, a widely used open-source technology is more likely to be kept up to date and to have bugs and usability issues quickly flagged and repaired.
  7. Open source vs proprietary If you use open-source technology, you’ll be able to quickly leverage the efforts of the wider community and save development time and licensing fees. As we mentioned above, your situation may dictate that you use proprietary technologies considered to be tried and true, and which come with strong service-level agreements.
  8. Industry buzz Recruiting talent within the big data and data science domains is very difficult. Using the newest software frameworks, databases, algorithms and libraries will increase your ability to recruit top talent.
  9. Future vision of the technology If your organization is an early technology adopter, you’ll want to give preference to technologies that are quick to integrate and adapt to the technology space around them. For example, we mentioned earlier how Python is often the first language supported by new big data technologies, but that many algorithms in academia are developed in R. In addition, early consumers of new data types will want to choose an ETL or BI tool known to quickly add new data sources.
    Ask vendors about their forward-looking visions. One of Gartner’s two axes in their Magic Quadrant is ‘completeness of vision,’ which incorporates vendors’ product strategies.
  10. Freedom to customize the technology Will you be satisfied using the technology out of the box, or will you want to view and modify the code? If you are integrating the technology into a product for resale, check the licensing restrictions.
  11. Risks involved with adopting a technology Cutting-edge technologies will be less well tested, and hence higher risk. Outsourced as-a-Service offerings bring additional reliance on third parties, and vendor solutions depend on vendor support.

Big data technologies are fascinating, and they are developing rapidly. But you can’t build a programme on technology alone. In the next chapter, I’ll talk about the most critical resource you’ll need, which is also the most difficult to secure: your analytics team.

Takeaways

  • You’ll need to make choices related to hardware, use of cloud, data transfer, analytic tools and data delivery (BI).
  • Companies are increasing use of cloud solutions, but some concerns remain.
  • As-a-Service offerings can free you to focus on your core differentiators.
  • Stakeholder requirements and preferences will play a crucial role in technology decisions, particularly for BI tooling.
  • Consider several important factors as you decide between competing technologies.

Ask yourself

  • What parts of your infrastructure and software could you replace with as-a-Service offerings to allow you to focus more on your core differentiators?
  • Are you experiencing integration difficulties from utilizing too many different technologies? What steps are you taking to assess these difficulties and to standardize your technology where necessary? Consider the tradeoff between costs and benefits for such a process.
  • Who in your company or professional network can provide you with broad, unbiased insights into available technologies? Consider what industry conferences might be helpful in this.
  • Consider your organization’s growth projections. How long will it be before the technologies you are using today are either unable to handle your data needs or become prohibitively expensive for the scale you’ll need to use them at?