A: What is clustering in Data Science?

In computer science, the term "clustering" refers to multiple computers working together simultaneously to perform one or more related tasks.

This typically provides more capacity, either to complete a single task or to handle a large number of tasks that can be executed in parallel (simultaneously).

For example, access to a database can be provided through many computers in order to provide simultaneous access to many users.
A typical example is a large search engine such as Google, which reportedly runs on the better part of a million servers.

Another use case is when multiple machines work simultaneously on trying to solve one particular problem.
The algorithms developed to solve the problem need to be carefully designed so that the problem can be divided into simpler tasks which can be parallelised (run simultaneously).
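
As a rough illustration of dividing a problem into simpler tasks, the sketch below splits a list into chunks, computes a partial result for each chunk independently, and then combines the partial results. The function names (`split_into_chunks`, `sum_of_squares`, `parallel_sum_of_squares`) are hypothetical, and local threads stand in for cluster nodes; a real cluster would send each chunk to a separate machine.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, n_chunks):
    """Divide a list into n_chunks nearly equal contiguous slices."""
    size, remainder = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `remainder` chunks receive one extra item.
        end = start + size + (1 if i < remainder else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def sum_of_squares(chunk):
    """The simpler sub-task that each worker runs independently."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(items, n_workers=4):
    """Split the work, run the sub-tasks concurrently, combine the results."""
    chunks = split_into_chunks(items, n_workers)
    # Assumption: threads stand in for cluster nodes in this sketch.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_results = list(pool.map(sum_of_squares, chunks))
    return sum(partial_results)
```

The design only works because the sub-tasks do not depend on each other: each chunk can be processed without knowing the result of any other chunk, and the final step is a cheap combination of the partial results.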

Some use cases for the above are weather forecasting and artificial intelligence for game analysis, such as AlphaGo.

In other scenarios, clustering is used to provide redundancy of service. That is, if a server fails, another server in the cluster takes over.
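
The takeover behaviour can be sketched in a few lines. This is a simplified illustration, not how any particular cluster product works: the `Node` class, its `healthy` flag, and `serve_with_failover` are hypothetical names, and a failed server is modelled simply as one that raises `ConnectionError`.

```python
class Node:
    """A hypothetical server that either answers or is down."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def handle(self, request):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"


def serve_with_failover(nodes, request):
    """Try nodes in order; fall back to the next one when a node fails."""
    for node in nodes:
        try:
            return node.handle(request)
        except ConnectionError:
            continue  # this node failed, let the next one take over
    raise RuntimeError("no healthy node available")
```

For example, with a failed primary and a working standby, `serve_with_failover([Node("primary", healthy=False), Node("standby")], "GET /")` is answered by the standby node.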

One of the difficulties with clustering is distributing work evenly between nodes (servers). This role is usually fulfilled by a few selected nodes in the cluster, which must keep track of the workload currently running on each worker node.
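
A minimal sketch of such a coordinating node is shown below, assuming a simple "least loaded first" policy. The `Dispatcher` class and its `cost` parameter are hypothetical and chosen only for illustration; real schedulers track far more than a single number per worker.

```python
class Dispatcher:
    """Tracks the outstanding work on each worker and assigns new tasks
    to whichever worker currently has the least."""

    def __init__(self, worker_names):
        self.load = {name: 0 for name in worker_names}

    def assign(self, cost=1):
        # Pick the worker with the smallest current load.
        # Ties go to the worker that was registered first.
        worker = min(self.load, key=self.load.get)
        self.load[worker] += cost
        return worker

    def complete(self, worker, cost=1):
        # Called when a worker reports that a task has finished.
        self.load[worker] -= cost
```

The dispatcher has to be told when tasks finish (`complete`), otherwise its view of the workload drifts away from reality. Keeping that bookkeeping accurate is a large part of why load distribution is hard in practice.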

Another problem is the management of the nodes themselves. In particular, how is the cluster expanded or reduced by adding or removing nodes without affecting the service provided by existing nodes?
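
The answer above does not name a technique for this. One widely used approach, brought in here purely as an illustration, is consistent hashing: work is mapped to nodes in such a way that adding or removing a node only moves the items associated with that node, leaving the rest where they were. The `HashRing` class, its `replicas` parameter, and the node and key names below are all hypothetical.

```python
import bisect
import hashlib


def _hash(key):
    """Map a string to a large integer position on the ring."""
    return int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes=(), replicas=100):
        self.replicas = replicas  # ring positions per node, for smoother spread
        self._ring = []           # sorted list of (position, node) pairs
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        for i in range(self.replicas):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove_node(self, node):
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def node_for(self, key):
        """Return the node owning the first ring position at or after the key."""
        if not self._ring:
            raise LookupError("ring is empty")
        idx = bisect.bisect_left(self._ring, (_hash(key), ""))
        return self._ring[idx % len(self._ring)][1]
```

With this layout, calling `add_node` on a running ring can only reassign keys to the new node; keys never move between the existing nodes, so most of the cluster is unaffected by the change.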

All of the above requires careful design of the cluster and usually involves specific solutions rather than generic designs that meet many needs.

Clustering and Cloud Computing

The advent of hardware virtualization and cloud computing means that clusters are now often composed of nodes running in virtual machines or operating-system containers, which can be deployed across vast geographical distances.

Typically, cluster nodes communicate with each other over Local Area Networks using Ethernet protocols.

As a result, when clusters are deployed over IP Wide Area Networks, such as the Internet, these Ethernet protocols need to be tunnelled over IP networks.

In practice, many clusters are deployed in just a couple of places, rather than a large number of locations, for safety reasons.

In Summary

Computer clusters can be used to increase the computing capability of a particular service, such as a search engine or scientific analysis.

Clusters can also be used for resilience against a single node (computer) failure.

Both resilience and capacity are often implemented within a single cluster.

Clusters can be deployed across multiple geographical areas, or within a limited number of locations where data safety is desired.

StemQ Notice: This post was originally submitted on StemQ.io, a Q&A application for STEM subjects powered by the Steem blockchain.