

We assume that the machines of the cluster are running CentOS 6+.


To deploy the BBData system, you need a Hadoop cluster with Cassandra and Kafka available (see /components/#software-versions for versions).

Other requirements are Java, Maven and Node.js.


Install the latest version of MySQL. Here are the steps to follow on CentOS 7:

# install the MySQL community yum repository (download the rpm from first)
sudo yum localinstall mysql57-community-release-el7-8.noarch.rpm
yum repolist enabled | grep "mysql.*-community.*"
sudo yum install mysql-community-server

sudo vim /etc/my.cnf
## change the bind-address to the public ip of the machine
## example:
##   bind-address=

sudo service mysqld start
mysql_secure_installation # secure your installation
sudo service mysqld restart
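After editing /etc/my.cnf, it is easy to sanity-check which address the server will bind to. A minimal sketch over a throwaway sample file (the 10.0.0.5 value is hypothetical; point `cnf` at /etc/my.cnf on a real host):

```shell
# sample my.cnf in a temp file, standing in for /etc/my.cnf
cnf=$(mktemp)
cat > "$cnf" <<'EOF'
[mysqld]
datadir=/var/lib/mysql
bind-address=10.0.0.5
EOF
# extract the configured bind address
addr=$(sed -n 's/^bind-address=//p' "$cnf")
echo "$addr"
rm -f "$cnf"
```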

Once MySQL is running, create a database and a user for bbdata:

-- launch the mysql console via: sudo mysql -u root -p
create database bbdata2;
create user 'bbdata-admin'@'localhost' identified by 'CHANGE_ME';
create user 'bbdata-admin'@'%' identified by 'CHANGE_ME';

grant all on bbdata2.* to 'bbdata-admin'@'localhost';
grant all on bbdata2.* to 'bbdata-admin'@'%';
grant select on mysql.proc to 'bbdata-admin'@'localhost';
grant select on mysql.proc to 'bbdata-admin'@'%';
flush privileges;

Note that the user must have read access to the mysql.proc table in order to run stored procedures.

Finally, populate the bbdata2 database. For that, download the sql script from the bbdata/dbs project and execute it using:

mysql -u bbdata-admin -p bbdata2 < bbdata.sql
This will set up all the necessary tables and procedures.


SSH into one of the machines running Cassandra and ensure the cqlsh program is available by launching a CQL interactive shell. For example:

ssh daplab-wn-12.fri.lan

Create the keyspace and the necessary tables using the CQL script available in the repo bbdata/dbs:

cqlsh -f cassandra.cql
This will create the bbdata2 keyspace with a replication factor of 3 and two tables, raw_values and aggregations.


SSH to one of the machines running a Kafka broker and execute the following:

# make the kafka command line tools available
export PATH=$PATH:/usr/hdp/current/kafka-broker/bin
# replace the following with the correct urls of zookeeper
export ZK="daplab-wn-22.fri.lan:2181,daplab-wn-25.fri.lan:2181,daplab-wn-33.fri.lan:2181" 

# create the two topics bbdata2-input and bbdata2-augmented --create --topic bbdata2-input --zookeeper $ZK --partitions 1 --replication-factor 2 --create --topic bbdata2-augmented --zookeeper $ZK --partitions 1 --replication-factor 2

You can specify another replication factor if you need to.
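The creation commands only differ by the topic name, so they can be generated in a loop. A sketch that just echoes the commands for review (it assumes the HDP `` script is on the PATH when you run them for real; the zookeeper quorum below is a placeholder):

```shell
# hypothetical zookeeper quorum; replace with your own hosts
ZK="daplab-wn-22.fri.lan:2181,daplab-wn-25.fri.lan:2181"
created=""
for topic in bbdata2-input bbdata2-augmented; do
  # echoed here for safety; drop 'echo' to actually create the topic
  echo " --create --topic $topic --zookeeper $ZK --partitions 1 --replication-factor 2"
  created="$created $topic"
done
```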


Flink and GlassFish applications use rsyslog as a logging facility. This is especially useful for Flink applications, since they will execute on different machines of the cluster. Rsyslog is a simple way to have the logs available as soon as they are produced instead of waiting for YARN to aggregate logs (this can take days!).

On many Linux distributions, rsyslog is installed by default. The only thing left to do is to configure rsyslog to allow UDP requests and to redirect messages with the bbdata prefix to a custom file.

To allow incoming UDP requests, define the following in /etc/rsyslog.conf:

$ModLoad imudp
$UDPServerRun 514

To filter everything from bbdata, create a file /etc/rsyslog.d/10-bbdata.conf and add the following:

if $programname == 'bbdata' then /var/log/bbdata.log
& stop

The rule will match any log beginning with bbdata:.

To deal with tabs, new lines and special characters, define the following property in /etc/rsyslog.conf:

$EscapeControlCharactersOnReceive off
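Dropping the filter rule in place can be scripted. A sketch that writes the rule to a scratch directory instead of the live config (on a real host the target directory would be /etc/rsyslog.d, followed by an rsyslog restart):

```shell
# scratch directory standing in for /etc/rsyslog.d
conf_dir=$(mktemp -d)
cat > "$conf_dir/10-bbdata.conf" <<'EOF'
if $programname == 'bbdata' then /var/log/bbdata.log
& stop
EOF
# show what was written
cat "$conf_dir/10-bbdata.conf"
```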


APIs run on a slightly modified version of GlassFish 4.1.

Install from bintray

A patched version is available on bintray. From the Files tab, download the latest zip or tar archive:

tar xvzf glassfish4.1-patched.tar.gz

and you are ready to go.

Manual install

If you want to patch GlassFish manually, here are the steps to follow:

  1. download and unzip glassfish 4.1 (download here);

  2. modify glassfish/domains/domain1/config/domain.xml : search for the http listeners and change the default ports;

  3. download the latest versions of guava and jersey-guava:

    mvn dependency:get -DremoteRepositories= \
                       -DgroupId=com.google.guava \
                       -DartifactId=guava \
                       -Dversion=19.0
    mvn dependency:get -DremoteRepositories= \
                       -DgroupId=org.glassfish.jersey.bundles.repackaged \
                       -DartifactId=jersey-guava -Dversion=2.23

    and copy guava.jar and jersey-guava.jar to glassfish/modules;

  4. download and unzip glassfish 4.1.1 (latest);

  5. copy/paste the libraries modules/bean-validator.jar and modules/bean-validator-cdi.jar from 4.1.1 to 4.1;

  6. download the latest MySQL JDBC connector from here and add the jar to the lib folder of the domain (e.g. glassfish/domains/domain1/lib).

Enable the console from remote hosts

To start the server and allow the glassfish admin console to be reached from the outside, launch the asadmin command line (glassfish/bin/asadmin) and enter the following commands:

> start-domain
> change-admin-root-password
> enable-secure-admin
> restart-domain
> exit

Configure the database connections

For MySQL, you need to create a JDBC resource and a connection pool. Create a temporary file (e.g. resources.xml) with the following content, replacing values in uppercase with your mysql configuration:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE resources PUBLIC "-//GlassFish.org//DTD GlassFish Application Server 3.1 Resource Definitions//EN" "">
<resources>
    <jdbc-connection-pool allow-non-component-callers="false" associate-with-thread="false" connection-creation-retry-attempts="0" connection-creation-retry-interval-in-seconds="10" connection-leak-reclaim="false" connection-leak-timeout-in-seconds="0" connection-validation-method="auto-commit" datasource-classname="com.mysql.jdbc.jdbc2.optional.MysqlDataSource" fail-all-connections="false" idle-timeout-in-seconds="300" is-connection-validation-required="false" is-isolation-level-guaranteed="true" lazy-connection-association="false" lazy-connection-enlistment="false" match-connections="false" max-connection-usage-count="0" max-pool-size="32" max-wait-time-in-millis="60000" name="mysql_bbdata_pool" non-transactional-connections="false" pool-resize-quantity="2" res-type="javax.sql.DataSource" statement-timeout-in-seconds="-1" steady-pool-size="8" validate-atmost-once-period-in-seconds="0" wrap-jdbc-objects="false">
        <property name="serverName" value="SERVER_NAME"/>
        <property name="portNumber" value="3306"/>
        <property name="databaseName" value="DATABASE_NAME"/>
        <property name="User" value="DATABASE_USER"/>
        <property name="Password" value="DATABASE_PASSWORD"/>
        <property name="URL" value="jdbc:mysql://SERVER_NAME:3306/DATABASE_NAME?zeroDateTimeBehavior=convertToNull"/>
        <property name="driverClass" value="com.mysql.jdbc.Driver"/>
    </jdbc-connection-pool>

    <jdbc-resource jndi-name="bbdata2" pool-name="mysql_bbdata_pool" />
</resources>


Then, use asadmin to create the resources:

glassfish4/bin/asadmin add-resources resources.xml 
glassfish4/bin/asadmin restart-domain

To connect to Cassandra and Kafka, the APIs use system properties. To make the input and the output API work, you need to define at least two system properties using the asadmin create-system-properties subcommand:

glassfish4/bin/asadmin create-system-properties --target domain \
    CASSANDRA_CONTACTPOINTS= # replace by the cassandra contact point IPs (comma-separated)

glassfish4/bin/asadmin create-system-properties --target domain \
    KAFKA_BOOTSTRAP_SERVERS= # at least one bootstrap server IP
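Since both properties follow the same pattern, the two asadmin calls can be generated from environment variables. A sketch that echoes the commands for review (the IP addresses are hypothetical placeholders; drop the `echo` on the glassfish host to execute them):

```shell
# hypothetical addresses; replace with your cluster's
CASSANDRA_CONTACTPOINTS="10.0.0.10,10.0.0.11"
KAFKA_BOOTSTRAP_SERVERS="10.0.0.20:9092"
cmds=""
for assignment in "CASSANDRA_CONTACTPOINTS=$CASSANDRA_CONTACTPOINTS" \
                  "KAFKA_BOOTSTRAP_SERVERS=$KAFKA_BOOTSTRAP_SERVERS"; do
  # echoed for safety; remove 'echo' to actually set the property
  echo "glassfish4/bin/asadmin create-system-properties --target domain $assignment"
  cmds="$cmds $assignment"
done
```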

Input and output APIs

Now that MySQL, Kafka, Cassandra and GlassFish are up, it is time to deploy the applications.

War files

War files for both the input-api and the output-api are available on jenkins or on gitlab-forge. You can also clone the repositories and build the war files using maven.


The API projects use GitLab CI to automatically build war files upon new commits. You can list the latest builds by clicking on the "passed" icon on the home page (top left). Each build has a "download" icon on the right.

Wars can usually be deployed as is. Nonetheless, if you didn't follow the steps above in detail, here are the main configurations to look at:

  • persistence.xml in src/main/resources/META-INF defines the jdbc resource to use (bbdata2 by default, created by the resources.xml template above);
  • glassfish-web.xml in src/main/webapp/WEB-INF states the context root upon deployment (/input and /api by default);
  • the logging configuration in the src/main/resources folder is set to use rsyslog by default.

If you want to change any of those, check out the code, make the modifications and build the wars using maven (mvn clean package).

Deployment on GlassFish

To deploy an application on GlassFish, you have basically two options.

Using the autodeploy folder: you can copy the .war archives to the glassfish4/glassfish/domains/domain1/autodeploy folder. GlassFish will automatically deploy the application using the context root defined in glassfish-web.xml.

Using the admin console: the GlassFish admin console is available on port 4848 by default. Once logged in, navigate to the Applications panel and click on the deploy... button.


Before clicking ok, ensure that the context root field is empty! By default, the interface sets the context root to the name of the war file, which would override the context root stated in glassfish-web.xml.

To ensure everything ran smoothly, have a look at the log file in the domain folder or the rsyslog file:

tail -f glassfish4/glassfish/domains/domain1/logs/server.log
tail -f /var/log/bbdata.log

To run a Flink application, you must first check that the flink executable is available on the machine. If not, you can download it here (select hadoop 2.7.3 and scala 2.10.5).


To configure Flink, first create an empty directory, for example bbdata-flink-conf, and create symlinks for all the files in flink/conf (use absolute paths, so that the links resolve from anywhere):

mkdir bbdata-flink-conf
ln -s /path/to/flink/conf/* bbdata-flink-conf/
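The absolute-path detail matters: a link created from a relative path like flink/conf/* would be resolved relative to the new directory and dangle. A sketch using throwaway directories (stand-ins for the real flink/conf and bbdata-flink-conf):

```shell
# throwaway directories; on a real host $flink_conf would be the absolute
# path to flink/conf and $work_conf the bbdata-flink-conf directory
flink_conf=$(mktemp -d)
work_conf=$(mktemp -d)
touch "$flink_conf/flink-conf.yaml" "$flink_conf/"
# mktemp returns absolute paths, so the links resolve from anywhere
ln -s "$flink_conf"/* "$work_conf"/
readlink "$work_conf/flink-conf.yaml"
```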

Logging is configured in To use rsyslog (see section above), remove the symlink and replace it by a file with the following content:

log4j.rootLogger=INFO, file

# TODO: change package logtest to whatever
log4j.logger.logtest=INFO, SYSLOG

# Log all infos in the given file
log4j.appender.file=org.apache.log4j.FileAppender
log4j.appender.file.file=${log.file}
log4j.appender.file.append=false
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=bbdata: %d{yyyy-MM-dd HH:mm:ss,SSS} %-5p %-60c %x - %m%n

# suppress the irrelevant (wrong) warnings from the netty channel handler
log4j.logger.org.jboss.netty.channel.DefaultChannelPipeline=ERROR, file

# rsyslog
# configure Syslog facility LOCAL1 appender
log4j.appender.SYSLOG=org.apache.log4j.net.SyslogAppender
log4j.appender.SYSLOG.syslogHost=localhost # replace with your rsyslog host if remote
log4j.appender.SYSLOG.facility=LOCAL1
log4j.appender.SYSLOG.layout=org.apache.log4j.PatternLayout
#log4j.appender.SYSLOG.appName=bbdata # define the prefix to use
log4j.appender.SYSLOG.layout.conversionPattern=bbdata: [%p] %c:%L - %m %throwable %n

The default Flink configuration doesn't define checkpoint directories. Open flink-conf.yaml and add or update the following properties, changing the paths if needed:

## configure the checkpoint directories, typically on hdfs

state.backend: filesystem
# Directory for storing checkpoints 
state.backend.fs.checkpointdir: hdfs:///user/flink/checkpoints

# Default savepoints directory (flink 1.2+)
state.savepoints.dir: hdfs:///user/flink/savepoints

# externalized checkpoints (metadata) directory
state.checkpoints.dir: hdfs:///user/flink/ext-checkpoints

You should also ensure that the property fs.hdfs.hadoopconf is set and points to the correct directory.

First, export the following variables:

export HADOOP_CONF_DIR=/etc/hadoop/conf
export FLINK_CONF_DIR=/path/to/bbdata-flink-conf-dir

To launch a flink application on YARN, you have two options: either launch a flink session on YARN and then run applications inside that session, or launch each flink application in its own embedded session. The first option is easier, since you have one web interface to monitor all your applications, but it also means that all applications share the same flink configuration.

To create a new session, use:

yarn-session.sh -d -n 1 -s 4 -jm 1024 -tm 2048

The options are:

  • -d: run in detached mode
  • -n: number of containers (i.e. task managers) to allocate
  • -s: number of slots per task manager
  • -jm: job manager memory, in MB
  • -tm: task manager memory, in MB
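With the example values (-n 1 -s 4 -jm 1024 -tm 2048), the session offers 4 processing slots and asks YARN for roughly 3 GB: one 1024 MB job manager plus one 2048 MB task manager. A quick arithmetic sketch:

```shell
# session parameters from the example above
n=1; s=4; jm=1024; tm=2048
slots=$(( n * s ))          # total processing slots
total_mb=$(( jm + n * tm )) # total memory requested from YARN
echo "slots=$slots total_mb=$total_mb"
```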

Now, you can launch new jobs inside the session using the flink run command line without any special argument:

flink run myapp.jar 

To launch an application in its own session, use flink run and specify the jobmanager:

flink run -d --yarnstreaming --jobmanager yarn-cluster -yn 1 myapp.jar

The options are:

  • --yarnstreaming: start flink in streaming mode
  • --jobmanager yarn-cluster: use the YARN cluster
  • -yn: number of yarn containers (i.e. task managers)
  1. Clone the flink-basic-processing project and build the jar using maven: mvn clean package.
  2. Copy the configuration file template src/main/config/ and fill the blanks.
  3. Launch the two jobs. Assuming a flink YARN session is already running:

    export HADOOP_CONF_DIR=/etc/hadoop/conf
    export FLINK_CONF_DIR=/path/to/bbdata-flink-conf-dir
    # launch the augmenter
    flink run target/flink-basic-processing-XX.jar augmenter
    # launch the saver
    flink run target/flink-basic-processing-XX.jar saver
  4. Ensure all is running fine using either the flink interface or the command line: flink list

Clone the flink-aggregations project and build the jar using maven: mvn clean package.


You need one processor (i.e. application process) per granularity. For quarters and hours, this means launching two Flink applications. The only difference between the two is the properties file passed as a first parameter.

Copy the file src/main/resources/ and fill the blanks.

Note: to configure the aggregation granularity, you need to update at least window.granularity (size of the window in minutes) and to adapt the window.allowed_lateness (minutes the window is kept in memory for late records) and the window.timeout (how long before we consider the stream to be stopped).

To launch the application (assuming a flink session is already running):

export HADOOP_CONF_DIR=/etc/hadoop/conf
export FLINK_CONF_DIR=/path/to/bbdata-flink-conf-dir

# launch the aggregations processor for quarters
flink run target/flink-aggregations-XX.jar 
# launch the aggregations processor for hours
flink run target/flink-aggregations-XX.jar 
# etc.

Webapp (admin console)

The admin console is a standalone NodeJS application. Procedure:

  1. Install NodeJS and NPM
  2. Clone the webapp project
  3. Run npm install at the root of the project
  4. Open server.js and edit the API_SERVER_HOST variable at the top of the file
  5. Launch it using nodejs server.js
git clone
cd webapp
npm install
nodejs server.js

Note that there is no option to run the application in the background. The best way is to run the application inside a screen session.
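If screen is not available, a common alternative is the nohup + pid file pattern. A sketch using sleep as a stand-in for the real nodejs server.js command:

```shell
# start the process detached and remember its pid
# (replace 'sleep 5' with 'nodejs server.js' on a real host)
nohup sleep 5 >/dev/null 2>&1 &
echo $! > webapp.pid
# later: check it is still alive (kill -0 sends no signal, only probes)
if kill -0 "$(cat webapp.pid)" 2>/dev/null; then status=running; else status=dead; fi
echo "$status"
# clean up the demo process and pid file
kill "$(cat webapp.pid)" 2>/dev/null
rm -f webapp.pid
```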