System Overview

The journey of a measure

To better understand how BBData works, let's follow the journey of a measure through the BBData pipeline.

[Figure: BBData components]

In (1), different kinds of sensors produce measures. Those sensors may come from different manufacturers and use various data encodings. To keep the platform compatible with any kind of equipment, BBData uses the concept of virtual objects. A virtual object has metadata, such as a name, a unit and a type, and can be mapped to a real sensor through an ID. This mapping is handled by collectors (2). A collector creates a BBData record for each measure and sends it to the input API in a structured JSON format. If a sensor lacks an internal clock, the collector can also produce the timestamp. The input API (3) is a Java EE REST service running on GlassFish whose role is to validate the incoming measure and ensure its authenticity using the object's ID and a secure token. If the check succeeds, the security information is dropped and the resulting JSON is added to an Apache Kafka message queue (4) for processing. The content of this queue is also dumped periodically to HDFS, so the raw inputs can be replayed at any time. Before processing, the measure is first "augmented" with the virtual object's metadata pulled from a MySQL database (5). The result is stored in a second message queue using a compressed format (6).
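
As an illustration, here is a minimal sketch of how a collector could build such a record using the standard javax.json API; the field names (objectId, token, timestamp, value) are assumptions for illustration, not the authoritative input API schema:

```java
import javax.json.Json;
import javax.json.JsonObject;

public class CollectorRecordSketch {

    public static void main(String[] args) {
        // Hypothetical field names: the real input API schema may differ.
        JsonObject record = Json.createObjectBuilder()
                .add("objectId", 42)                                    // ID of the virtual object
                .add("token", "0123456789abcdef0123456789abcdef")       // 32-character security token
                .add("timestamp", System.currentTimeMillis())           // added by the collector if the sensor has no clock
                .add("value", 21.5)                                     // the raw measure
                .build();

        // The resulting JSON document is what the collector would POST to the input API.
        System.out.println(record.toString());
    }
}
```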

BBData also provides an MQTT endpoint (10), which allows IoT devices to use a lighter protocol if necessary. The endpoint feeds the same data processing pipeline as the input API (3), meaning that you will be able to access your measures through the output API (8) just as if you had used the input API (3). As of now, it is not possible to subscribe to an MQTT topic; only the publish functionality (with the three levels of QoS) is supported. The measure is expected to be a JSON document similar to the format used by the input API (3) and is sent in the publish packet payload. The MQTT endpoint (10) immediately forwards the measure to Kafka (11), so that it is replicated and saved to disk as quickly as possible. The measure is then validated (12) using the same validation process as the input API (3): the authenticity of the object's ID and token is verified, as well as the data types of the required fields. If the measure is valid, it is sent to the standard processing queue (4); otherwise it is sent to an error queue (13). Users are alerted by email (14) every 15 minutes if one of their measures falls into the error queue.
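
The sketch below shows how an IoT device could publish a measure with the Eclipse Paho Java client; the broker address, topic name and JSON field names are assumptions chosen for illustration:

```java
import org.eclipse.paho.client.mqttv3.MqttClient;
import org.eclipse.paho.client.mqttv3.MqttException;
import org.eclipse.paho.client.mqttv3.MqttMessage;
import org.eclipse.paho.client.mqttv3.persist.MemoryPersistence;

public class MqttPublishSketch {

    public static void main(String[] args) throws MqttException {
        // Hypothetical broker address and topic: replace with the actual BBData MQTT endpoint.
        String broker = "tcp://bbdata.example.org:1883";
        String topic = "bbdata/measures";

        MqttClient client = new MqttClient(broker, "collector-42", new MemoryPersistence());
        client.connect();

        // Payload in the same JSON format as the input API (field names are assumptions).
        String payload = "{\"objectId\": 42, \"token\": \"0123456789abcdef0123456789abcdef\", \"value\": 21.5}";
        MqttMessage message = new MqttMessage(payload.getBytes());
        message.setQos(1); // QoS 0, 1 and 2 are all supported for publishing

        client.publish(topic, message);
        client.disconnect();
    }
}
```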

In BBData, processing covers a wide range of tasks, from saving raw values into a persistent store to detecting anomalies or computing time aggregations. Each of these tasks is handled by a specific processor. Processors are independent streaming applications running in a Hadoop cluster. They subscribe to the augmented Kafka topic, carry out their task and save their output, if any, in a Cassandra database. This design makes it possible to add or remove processors without any impact on the rest of the system. We currently have two kinds of processors, both implemented with Apache Flink: the first saves the raw records to Cassandra, the second computes live time aggregates (mean, max, last measure, standard deviation) with granularities of fifteen minutes, one hour and one day.
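
As a rough sketch of what such a processor could look like with the Flink DataStream API (the topic name, the parsing and the choice of aggregate are assumptions; only a 15-minute maximum is shown here, printed instead of written to Cassandra):

```java
import java.util.Properties;

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

public class AggregationProcessorSketch {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kafka:9092");   // assumed broker address
        props.setProperty("group.id", "aggregation-sketch");

        env
            // subscribe to the augmented Kafka topic (the topic name is an assumption)
            .addSource(new FlinkKafkaConsumer010<>("bbdata-augmented", new SimpleStringSchema(), props))
            // turn each JSON record into (objectId, value); a real processor would parse the JSON here
            .map(new MapFunction<String, Tuple2<Integer, Double>>() {
                @Override
                public Tuple2<Integer, Double> map(String json) {
                    return Tuple2.of(42, 21.5); // placeholder values instead of real parsing
                }
            })
            .keyBy(0)                       // group measures by virtual object ID
            .timeWindow(Time.minutes(15))   // 15-minute windows; hourly and daily windows work the same way
            .max(1)                         // maximum of the value; mean, stddev, etc. need custom window functions
            .print();                       // the real processor writes its results to Cassandra instead

        env.execute("bbdata-aggregation-sketch");
    }
}
```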

Users and building automation applications can access the data and manage virtual objects through a standard REST interface called the output API (7), or via HTML5 web applications.

The journey of a BBData user

Here is an overview of what a user has to do to make use of BBData.

[Figure: user APIs]

Say Foo just decided to use BBData. Foo has one sensor connected to a Raspberry Pi which measures both the temperature and the humidity of the room every minute.

The first thing Foo needs to do is create two virtual objects using the BBData admin interface (1). Among other information, Foo specifies the type (float for both) and the unit (degrees and percentage) of the objects. The system then gives Foo a unique ID for each virtual object, which Foo can use to send new measures.

To ensure only Foo can post data linked to his virtual objects, BBData uses tokens. A token is a 32-character string that must accompany every new measure for it to be accepted by the system. Using the interface, Foo thus creates a token for each of his virtual objects. The configuration phase is done.

Foo then writes a little program on the Raspberry Pi which sends an HTTP POST to the input API every time a new measure is available. Depending on the measure type (temperature or humidity), Foo's program specifies a different object ID and token in the request body.
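
A minimal sketch of such a program in Java, assuming a hypothetical input API URL and the same illustrative field names as above (the real endpoint path and schema may differ):

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class PostMeasureSketch {

    public static void main(String[] args) throws Exception {
        // Hypothetical endpoint: replace with the real input API URL.
        URL url = new URL("https://bbdata.example.org/input/measures");

        // Field names are assumptions; objectId and token identify the temperature object.
        String body = "{\"objectId\": 42, \"token\": \"0123456789abcdef0123456789abcdef\", \"value\": 21.5}";

        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body.getBytes(StandardCharsets.UTF_8));
        }

        System.out.println("Input API answered with status " + conn.getResponseCode());
        conn.disconnect();
    }
}
```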

That's it! All his data are now safe and stored on the DAPLAB. Foo can use the visualisation webapp (2) to visualise his data as graphs, or use the output API directly to get raw values or aggregations over a period of time (3).
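
Querying the output API over a period of time could then look like the sketch below; the endpoint path, query parameters and authentication header are purely hypothetical and only illustrate the idea of a time-bounded REST query:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class QueryMeasuresSketch {

    public static void main(String[] args) throws Exception {
        // Hypothetical endpoint and parameters: the real output API paths may differ.
        URL url = new URL("https://bbdata.example.org/output/objects/42/values"
                + "?from=2017-01-01T00:00:00Z&to=2017-01-02T00:00:00Z");

        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("Authorization", "Bearer 0123456789abcdef0123456789abcdef"); // hypothetical auth scheme

        try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // raw JSON values for the requested period
            }
        }
    }
}
```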

Now, let's say Foo's coworkers are also interested in the temperature and humidity of Foo's room. Foo can create a group on the BBData admin interface (1), add his colleagues to it and then give them read-only access to his sensors' data. If he changes his mind afterwards, a simple click on the admin interface will revoke the permissions.

BBData Main Advantages

Here are some of the major advantages of BBData:

  1. Uniform and Standardised Storage: with BBData, all measures are stored using the same abstraction, allowing users to interact with data from any source in the same way;
  2. Horizontal Scalability: the BBData pipeline is built on top of Hadoop technologies and configured to process terabytes of data if needed without losing performance;
  3. User Management and Permissions: every data source in BBData belongs to a user, and the user can control who has access to their data in a fine-grained manner. This keeps data protected while still being easily shared among groups of users;
  4. Stream Processing: many systems let users get statistics and derived information about their data, such as mean, max, aggregations, etc. But most of the time, those statistics are computed on-the-fly and on-demand. In BBData, processors run as soon as the data are available and precompute all the statistics a user could need, so queries are very fast.

Technologies

The BBData ecosystem runs on Hadoop and uses free, open-source technologies.

[Figure: technologies]

Software versions

Component           Version
YARN                2.7.3
HDFS                2.7.3
Apache Flink        1.2.0
Apache Kafka        0.10.0
Apache Cassandra    3.0.7
GlassFish           4.1 (patched)
MySQL               5.7.14
RAML                1.0

Source code

The sources for the different components of BBData are separate Git projects stored on the GitLab of the HEIA-Fr under the group BBData. Here is a short overview of the different projects:

APIs:

The API is split between the input API, used to post new measures, and the output API, which takes care of permissions, object and user management and querying. Both are Java EE applications running on GlassFish.

The MQTT endpoint for submitting new values (described above) is developed in the input-mqtt-gateway project.

Processing:

We currently have three processors, each written as a Flink application. The project flink-basic-processing contains the code for the two most important processors: (a) augmentation and (b) saving of raw values.

The project flink-aggregations contains the code for doing aggregations on float data.

Web Interface:

The code for the admin interface is available in the webapp project. It is a simple NodeJS application using AngularJS and Bootstrap 4.

Development:

The docker-infrastructure project contains Docker image definitions and docker-compose files to run part or all of the BBData pipeline on a local computer. This is intended for development only, as the pipeline has many resource-consuming components.

data-faker is a small command-line tool written in Go to generate and submit fake measures to the input API.

bbcheck-pytg is a small Python program that monitors the different parts of the pipeline and sends a message via Telegram in case of failure.

Other:

dbs contains the scripts to set up the MySQL database and the Cassandra keyspace.

(deprecated) bbdata-commons is a Java project (Maven) defining common utility classes for dealing with measures and UTC dates. It has since been replaced by jodatime-utils, hosted on GitLab and available on Bintray.

(deprecated) wiki doesn't contain any source, but was used to centralise information through the GitLab wiki feature. It has been replaced by the current website.

(deprecated) raml-and-tests contained the RAML definition of the output API, which has since been migrated to the output API project itself, along with a test suite using rest-assured. Testing is now mostly done using Postman.