This section describes how the processing works, beginning with an overview of the workflow.
Take a look at Figure 1. To describe each part and how the parts interact, we use a three-layer model. In the first layer, the analyst sets up and configures the data science operations. For this, the analyst uses the OPA_TAD extension for RapidMiner on the client device [read more about the client software in the documentation: Client].
The client extension produces configuration files in XML format, which serve as the input for layer two. Through our RapidMiner extension, the client sends the XML files to the edge server, which receives them in the second layer. We restrict which executable code may run on the cluster in order to prevent security breaches. For that reason, the edge server checks the configuration files. If they have been set up correctly, the edge server instructs the cluster to start the specified jobs, i.e. the edge server asks the master node (YARN) to start and manage the jobs within the cluster. Finally, in layer three, the cluster nodes execute the jobs via Apache Spark.
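The check performed by the edge server could look roughly like the following sketch. It is not the actual OPA_TAD implementation: the element name "operator", the attribute "class", the class name ConfigCheck, and the allowlist entries are all assumptions made for illustration. The idea is simply that only operations known to the server are accepted, so no arbitrary code reaches the cluster.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical sketch: the XML layout ("operator" elements with a "class"
// attribute) and the allowlist are assumptions, not the real OPA_TAD format.
final class ConfigCheck {
    // Only operations known to the server may be scheduled on the cluster.
    static final Set<String> ALLOWED = Set.of("filter", "join", "aggregate");

    /** Returns the operator names that are not allowed; an empty list means the file passes. */
    static List<String> rejectedOperators(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Refuse DOCTYPE declarations to block XML external entity (XXE) attacks.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList operators = doc.getElementsByTagName("operator");
            List<String> rejected = new ArrayList<>();
            for (int i = 0; i < operators.getLength(); i++) {
                String name = ((Element) operators.item(i)).getAttribute("class");
                if (!ALLOWED.contains(name)) {
                    rejected.add(name);
                }
            }
            return rejected;
        } catch (Exception e) {
            // A file that cannot be parsed is treated as invalid.
            throw new IllegalArgumentException("unparsable configuration", e);
        }
    }
}
```

In this sketch, the edge server would forward a job to the master node only if rejectedOperators returns an empty list.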
Frequently Asked Questions
We will now dive deeper into the execution and job processing. In case you have questions regarding the terms used above, have a look at the following links:
What the hell is an edge server? [External]
What is an XML File? [Wikipedia]
How does the translation work exactly? [Read more in our documentation.]
Why does OPA_TAD restrict individual code to be executed on the server? [Read more on our start page (German).]
What are YARN and a cluster? [Read more about our cluster here]
How does Apache Spark work? [Wikipedia]
This section discusses how the processing is carried out. We therefore switch from the abstract level to a more concrete description of the Java classes. The Java code is located in the OPA_TAD server extension. (Take a look at our JavaDoc files, too! [Link]).
Our processing engine aims to provide a formal class hierarchy that allows flexible use of Apache Spark. Moreover, the data science operations configured on the client need a counterpart on the server.
The Apache Spark code itself is executed on the cluster, but the edge server starts these Apache Spark jobs. To locate the processing in Figure 1: the server extension is executed by the edge server (on layer two). The processing takes place after the edge server has received the configuration files.
To execute the configurations provided by the client, we introduced a counterpart on the edge server: the so-called operator classes. Each data science operation on the client is represented by an operator class in the server extension. The operator classes represent these data science operations on the server and pass the configurations, which the analyst declares via a graphical user interface, on to the processing.
Since our goal is to provide a flexible structure, we decided to create so-called CreationFunction (CF) classes. The CF classes provide the configuration for the Apache Spark code and store the concrete Apache Spark functions within their objects (as a field). Each operator class has a corresponding CF. Furthermore, each CF provides a create function, which starts a SparkFunction. Since a few specific data science operations (represented as operator classes) have very different representations in Apache Spark, the CF also decides which SparkFunctions should be executed.
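The relationship between operator class, CF, and SparkFunction can be illustrated with a minimal sketch. All names and signatures below (FilterOperator, FilterCreationFunction, the parameters "contains" and "invert", and the shape of the SparkFunction interface) are assumptions for illustration and are not taken from the server extension. A plain List<String> stands in for a Spark dataset so that the example runs without a Spark installation.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical stand-in for a SparkFunction: a List<String> replaces the Spark dataset.
interface SparkFunction {
    List<String> apply(List<String> input);
}

// Server-side counterpart of one client operation; carries the analyst's settings.
final class FilterOperator {
    final Map<String, String> parameters;

    FilterOperator(Map<String, String> parameters) {
        this.parameters = parameters;
    }
}

// CF for the filter operator: reads the configuration and decides which function to build.
final class FilterCreationFunction {
    private final FilterOperator operator;
    private SparkFunction function; // the concrete function is stored as a field

    FilterCreationFunction(FilterOperator operator) {
        this.operator = operator;
    }

    SparkFunction create() {
        String needle = operator.parameters.getOrDefault("contains", "");
        boolean invert = Boolean.parseBoolean(operator.parameters.getOrDefault("invert", "false"));
        // One operator, two possible implementations: the CF picks the matching one.
        function = invert
                ? rows -> rows.stream().filter(r -> !r.contains(needle)).collect(Collectors.toList())
                : rows -> rows.stream().filter(r -> r.contains(needle)).collect(Collectors.toList());
        return function;
    }
}
```

In this sketch, the operator only holds the configuration, while the CF turns it into an executable function. This separation is what lets one operator map to different Spark implementations.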
The SparkFunction classes consist of transformations and actions (the Spark terminology for its operations). Normally, each Apache Spark operation needs a so-called environment. Within this environment, the beginning and the end of a data science job (on the cluster) are defined. To deploy all desired data science operations together, we introduced a special topology for these operations. This topology is generated by the CFs: each CF holds a reference to another CF, so the CFs form a linked list. The first CF holds a reference to a special CF, which declares the beginning of the Spark execution environment. Execution is triggered on the last CF in the list, which first triggers the CF before it, and so on back to the beginning of the list.
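The chained topology described above can be sketched as follows. This is only an illustration under assumptions: the class names StartCreationFunction and StepCreationFunction and the create signature are hypothetical, and a List<Integer> stands in for a Spark dataset so the example runs without Spark.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical base type: every CF can be asked to produce its result.
abstract class CreationFunction {
    abstract List<Integer> create();
}

// Special CF marking the beginning of the execution environment; it supplies the input data.
final class StartCreationFunction extends CreationFunction {
    private final List<Integer> source;

    StartCreationFunction(List<Integer> source) {
        this.source = source;
    }

    @Override
    List<Integer> create() {
        return new ArrayList<>(source);
    }
}

// Ordinary CF: triggers the CF before it, then applies its own step to that result.
final class StepCreationFunction extends CreationFunction {
    private final CreationFunction previous;
    private final UnaryOperator<List<Integer>> step;

    StepCreationFunction(CreationFunction previous, UnaryOperator<List<Integer>> step) {
        this.previous = previous;
        this.step = step;
    }

    @Override
    List<Integer> create() {
        return step.apply(previous.create());
    }
}
```

Calling create on the last CF walks the chain back to the start CF and then applies the steps in their declared order. In this sketch the caller therefore only needs a reference to the last CF to run the whole job.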
The concept discussed here is only a brief overview. As you might imagine, there are many more classes (many of them extensions of the CF classes). You can explore them yourself in our JavaDoc files here [Link].