Deployment
We assume that the machines of the cluster are running CentOS 6+.
Requirements
To deploy the BBData system, you need a Hadoop cluster with Cassandra and Kafka available (see /components/#software-versions for versions).
Other requirements are Java, Maven and NodeJS.
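If you are unsure whether these are already installed, a quick check (assuming the binaries are on the PATH; on some distributions the NodeJS binary is called nodejs instead of node) is:

java -version
mvn -version
node --version   # or: nodejs --version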
MySQL
Install the latest version of MySQL. Here are the steps to follow on CentOS 7:
sudo yum localinstall mysql57-community-release-el7-8.noarch.rpm
yum repolist enabled | grep "mysql.*-community.*"
sudo yum install mysql-community-server

sudo vim /etc/my.cnf
## change the bind-address to the public ip of the machine instead of 127.0.0.1
## example:
## bind-address=10.10.10.103

sudo service mysqld start
mysql_secure_installation  # secure your installation
sudo service mysqld restart
Once MySQL is running, create a database and a user for bbdata:
-- launch the mysql console via: sudo mysql -u root -p
create database bbdata2;

create user 'bbdata-admin'@'localhost' identified by 'CHANGE_ME';
create user 'bbdata-admin'@'%' identified by 'CHANGE_ME';

grant all on bbdata2.* to 'bbdata-admin'@'localhost';
grant all on bbdata2.* to 'bbdata-admin'@'%';
grant select on mysql.proc to 'bbdata-admin'@'localhost';
grant select on mysql.proc to 'bbdata-admin'@'%';

flush privileges;
exit;
Note that the user must have read access to the mysql.proc table in order to run stored procedures.
Finally, populate the bbdata2 database: download the SQL script from the bbdata/dbs project and execute it using:
mysql -u bbdata-admin -p bbdata2 < bbdata.sql
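As a quick sanity check that the import went through, you can list the tables created by the script:

mysql -u bbdata-admin -p -e "show tables" bbdata2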
Cassandra
SSH to one of the machines running Cassandra and ensure the cqlsh program is available by launching a CQL interactive shell. For example:
ssh daplab-wn-12.fri.lan
cqlsh 10.10.10.12
Create the keyspace and the necessary tables using the CQL script available in the repo bbdata/dbs:

cqlsh 10.10.10.12 -f cassandra.cql

The script creates the bbdata2 keyspace with a replication factor of 3 and two tables, raw_values and aggregations.
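To verify the result, you can describe the keyspace (a minimal check, assuming your cqlsh version supports the -e option and the node address is the same as above):

cqlsh 10.10.10.12 -e "DESCRIBE KEYSPACE bbdata2"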
Kafka
SSH to one of the machines running a Kafka broker and execute the following:
# make kafka-topics.sh available
export PATH=$PATH:/usr/hdp/current/kafka-broker/bin

# replace the following with the correct urls of zookeeper
export ZK="daplab-wn-22.fri.lan:2181,daplab-wn-25.fri.lan:2181,daplab-wn-33.fri.lan:2181"

# create the two topics bbdata2-input and bbdata2-augmented
kafka-topics.sh --create --topic bbdata2-input --zookeeper $ZK --partitions 1 --replication-factor 2
kafka-topics.sh --create --topic bbdata2-augmented --zookeeper $ZK --partitions 1 --replication-factor 2
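To check that both topics were created with the expected partitions and replication factor, you can describe them (reusing the $ZK variable defined above):

kafka-topics.sh --describe --zookeeper $ZK --topic bbdata2-input
kafka-topics.sh --describe --zookeeper $ZK --topic bbdata2-augmented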
Rsyslog
Flink and GlassFish applications use rsyslog as a logging facility. This is especially useful for Flink applications, since they will execute on different machines of the cluster. Rsyslog is a simple way to have the logs available as soon as they are produced instead of waiting for YARN to aggregate logs (this can take days!).
On many Linux distributions, rsyslog is installed by default. The only thing left to do is to configure rsyslog to allow UDP requests and to redirect messages with the bbdata prefix to a custom file.
To allow incoming UDP requests, define the following in /etc/rsyslog.conf:
$ModLoad imudp
$UDPServerRun 514
To filter everything from bbdata, create a file /etc/rsyslog.d/10-bbdata.conf and add the following:
if $programname == 'bbdata' then /var/log/bbdata.log
& stop
The rule will match any log beginning with bbdata:.
To deal with tabs, new lines and special characters, define the following property in /etc/rsyslog.conf:
$EscapeControlCharactersOnReceive off
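To test the setup, reload rsyslog and send a message tagged bbdata with the logger utility (a minimal local check; the message should land in /var/log/bbdata.log):

# reload rsyslog, then emit a test message with programname 'bbdata'
sudo service rsyslog restart
logger -t bbdata "hello from rsyslog"
tail -n 1 /var/log/bbdata.log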
GlassFish
The APIs run on a slightly modified version of GlassFish 4.1.
Install from bintray
A patched version is available on bintray. From the files tab, download the latest zip or tar archive:
wget -O glassfish4.1-patched.tar.gz "https://bintray.com/derlin/glassfish4.1-patched/download_file?file_path=glassfish4.1-patched.tar.gz"
tar xvzf glassfish4.1-patched.tar.gz
and you are ready to go.
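As a quick check that the unpacked server works (assuming the archive unpacks into a glassfish4 directory; adapt the path if yours differs):

glassfish4/bin/asadmin version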
Manual install
If you want to patch GlassFish manually, here are the steps to follow:
- download and unzip glassfish 4.1 (download here);
- modify glassfish/domains/domain1/config/domain.xml: search for the http listeners and change the default ports;
- download the latest versions of guava and jersey-guava:
mvn dependency:get -DremoteRepositories=http://repo1.maven.org/maven2/ \
    -DgroupId=com.google.guava \
    -DartifactId=guava \
    -Dversion=19.0 \
    -Dtransitive=false

mvn dependency:get -DremoteRepositories=http://repo1.maven.org/maven2/ \
    -DgroupId=org.glassfish.jersey.bundles.repackaged \
    -DartifactId=jersey-guava \
    -Dversion=2.23 \
    -Dtransitive=false
  and copy guava.jar and jersey-guava.jar to glassfish/modules;
- download and unzip glassfish 4.1.1 (latest);
- copy/paste the libraries modules/bean-validator.jar and modules/bean-validator-cdi.jar from 4.1.1 to 4.1;
- download the latest jdbc mysql connector from here and add the jar to the lib folder of the domain (e.g. glassfish/domains/domain1/lib).
Enable the console from remote hosts
To start the server and allow the GlassFish admin console to be reached from the outside, launch the asadmin command line (glassfish/bin/asadmin) and enter the following commands:
glassfish/bin/asadmin
> start-domain
> change-admin-root-password
> enable-secure-admin
> restart-domain
> exit
Configure the database connections
For MySQL, you need to create a JDBC resource and a connection pool. Create a temporary file, for example resources.xml, with the following content, replacing the values in uppercase with your MySQL configuration:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE resources PUBLIC "-//GlassFish.org//DTD GlassFish Application Server 3.1 Resource Definitions//EN" "http://glassfish.org/dtds/glassfish-resources_1_5.dtd">
<resources>
  <jdbc-connection-pool allow-non-component-callers="false" associate-with-thread="false"
      connection-creation-retry-attempts="0" connection-creation-retry-interval-in-seconds="10"
      connection-leak-reclaim="false" connection-leak-timeout-in-seconds="0"
      connection-validation-method="auto-commit"
      datasource-classname="com.mysql.jdbc.jdbc2.optional.MysqlDataSource"
      fail-all-connections="false" idle-timeout-in-seconds="300"
      is-connection-validation-required="false" is-isolation-level-guaranteed="true"
      lazy-connection-association="false" lazy-connection-enlistment="false"
      match-connections="false" max-connection-usage-count="0" max-pool-size="32"
      max-wait-time-in-millis="60000" name="mysql_bbdata_pool"
      non-transactional-connections="false" pool-resize-quantity="2"
      res-type="javax.sql.DataSource" statement-timeout-in-seconds="-1"
      steady-pool-size="8" validate-atmost-once-period-in-seconds="0"
      wrap-jdbc-objects="false">
    <property name="serverName" value="SERVER_NAME"/>
    <property name="portNumber" value="3306"/>
    <property name="databaseName" value="DATABASE_NAME"/>
    <property name="User" value="DATABASE_USER"/>
    <property name="Password" value="DATABASE_PASSWORD"/>
    <property name="URL" value="jdbc:mysql://SERVER_NAME:3306/DATABASE_NAME?zeroDateTimeBehavior=convertToNull"/>
    <property name="driverClass" value="com.mysql.jdbc.Driver"/>
  </jdbc-connection-pool>
  <jdbc-resource jndi-name="bbdata2" pool-name="mysql_bbdata_pool" />
</resources>
Then, use asadmin to create the resources:
glassfish4/bin/asadmin add-resources resources.xml
glassfish4/bin/asadmin restart-domain
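You can verify that the pool and the resource were registered, and that the pool can actually reach MySQL, with:

glassfish4/bin/asadmin list-jdbc-resources
glassfish4/bin/asadmin ping-connection-pool mysql_bbdata_pool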
To connect to Cassandra and Kafka, the APIs use system properties. To make the input and the output API work, you need to define at least two system properties using asadmin create-system-properties:
glassfish4/bin/asadmin create-system-properties --target domain \
    CASSANDRA_CONTACTPOINTS=10.10.10.13,10.10.10.14  # replace by the cassandra entrypoints IPs

glassfish4/bin/asadmin create-system-properties --target domain \
    KAFKA_BOOTSTRAP_SERVERS=10.10.10.81  # at least one bootstrap server IP
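To double-check the values, list the system properties defined on the domain:

glassfish4/bin/asadmin list-system-properties domain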
Input and output APIs
Now that MySQL, Kafka, Cassandra and GlassFish are up, it is time to deploy the applications.
War files
War files for both the input-api and the output-api are available in Jenkins (https://jenkins.daplab.ch) or on gitlab-forge. You can also clone the repositories and build the war files using Maven.
Tips
The API projects use GitLab CI to automatically build war files upon new commits. You can list the latest builds by clicking on the "passed" icon on the home page (top left). Each build has a "download" icon on the right.
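If you prefer to build the wars yourself, the usual Maven workflow applies. The repository URL below is only an example following the same gitlab-forge group as the webapp; adapt it to the actual project location:

git clone https://gitlab.forge.hefr.ch/bbdata/input-api   # example URL, adapt to the actual repository
cd input-api
mvn clean package   # the war ends up in target/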
Wars can usually be deployed as is. Nonetheless, if you didn't follow the steps above in detail, here are the main configurations to look at:
- persistence.xml in src/main/resources/META-INF defines the jdbc resource to use (bbdata2 by default, created by the resources.xml template above);
- glassfish-web.xml in src/main/webapp/WEB-INF states the context root upon deployment (/input and /api by default);
- log4j.properties in the src/main/resources folder is configured to use rsyslog by default.
If you want to change any of those, check out the code, make the modifications and build the wars using maven (mvn clean package).
Deployment on GlassFish
To deploy an application on GlassFish, you have basically two options.
Using the autodeploy folder: you can copy the .war archives to the glassfish4/glassfish/domains/domain1/autodeploy folder. GlassFish will automatically deploy the application using the context root defined in glassfish-web.xml. See the example below.
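For example, assuming the war files are in the current directory (the file names below are just placeholders, use the archives you downloaded or built):

cp input-api.war output-api.war glassfish4/glassfish/domains/domain1/autodeploy/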
Using the admin console: the GlassFish admin console is available on port 4848 by default. Once logged in, navigate to the Applications panel and click on the deploy... button.
Warning
Before clicking ok, ensure that the context root is empty! By default, the interface sets the context root to the name of the war file, which will override the context root stated in glassfish-web.xml.
To ensure everything ran smoothly, have a look at the log file in the domain folder or the rsyslog file:
tail -f glassfish4/glassfish/domains/domain1/logs/server.log
tail -f /var/log/bbdata.log
Flink
To run Flink applications, you must first check that the flink executable is available on the machine. If not, you can download it here (select Hadoop 2.7.3 and Scala 2.10.5).
Configuration
To configure Flink, first create an empty directory, for example bbdata-flink-conf, and create symlinks for all the files in flink/conf:
mkdir bbdata-flink-conf
# use an absolute path so the symlinks do not end up broken
ln -s "$(pwd)/flink/conf/"* bbdata-flink-conf/
Logging is configured in log4j.properties. To use rsyslog (see the section above), remove the symlink and replace it with a file with the following content:
log4j.rootLogger=INFO, file
# TODO: change package logtest to whatever
log4j.logger.logtest=INFO, SYSLOG

# Log all infos in the given file
log4j.appender.file=org.apache.log4j.FileAppender
log4j.appender.file.file=${log.file}
log4j.appender.file.append=false
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=bbdata: %d{yyyy-MM-dd HH:mm:ss,SSS} %-5p %-60c %x - %m%n

# suppress the irrelevant (wrong) warnings from the netty channel handler
log4j.logger.org.jboss.netty.channel.DefaultChannelPipeline=ERROR, file

# rsyslog
# configure Syslog facility LOCAL1 appender
log4j.appender.SYSLOG=org.apache.log4j.net.SyslogAppender
log4j.appender.SYSLOG.syslogHost=10.10.10.102
log4j.appender.SYSLOG.port=514
#log4j.appender.SYSLOG.appName=bbdata
# define the prefix to use
log4j.appender.SYSLOG.layout=org.apache.log4j.EnhancedPatternLayout
log4j.appender.SYSLOG.layout.conversionPattern=bbdata: [%p] %c:%L - %m %throwable %n
The default Flink configuration doesn't define checkpoint directories. Open flink-conf.yaml and add or update the following properties, changing the paths if needed:
## configure the checkpoint directories, typically on hdfs
state.backend: filesystem

# Directory for storing checkpoints
state.backend.fs.checkpointdir: hdfs:///user/flink/checkpoints

# Default savepoints directory (flink 1.2+)
state.savepoints.dir: hdfs:///user/flink/savepoints

# externalized checkpoints (metadata) directory
state.checkpoints.dir: hdfs:///user/flink/ext-checkpoints
You should also ensure that the property fs.hdfs.hadoopconf is set and points to the correct directory.
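For example, pointing it at the Hadoop configuration directory used elsewhere in this guide (adapt the path to your cluster):

fs.hdfs.hadoopconf: /etc/hadoop/conf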
Running Flink Applications
First, export the following variables:
export HADOOP_CONF_DIR=/etc/hadoop/conf
export FLINK_CONF_DIR=/path/to/bbdata-flink-conf-dir
To launch a flink application on YARN, you have two options: either launch a flink session on YARN and then run applications inside that session or launch each flink application in its own embedded session. The first one is easier since you will have one web interface to monitor all your applications, but it also means that all applications will share the same flink configuration.
To create a new session, use:
yarn-session.sh -d -n 1 -s 4 -jm 1024 -tm 2048
The options are:
- -d: run in detached mode
- -n: number of containers (i.e. task managers) to allocate
- -s: number of slots per task manager
- -jm: job manager memory, in MB
- -tm: task manager memory, in MB
Now, you can launch new jobs inside the session using the flink run command line without any special argument:
flink run myapp.jar
To launch an application in its own session, use flink run and specify the jobmanager:

flink run -d --yarnstreaming --jobmanager yarn-cluster -yn 1 myapp.jar
The options are:
- --yarnstreaming: start flink in streaming mode
- --jobmanager yarn-cluster: use the YARN cluster
- -yn: number of yarn containers (i.e. task managers)
flink-basic-processing
- Clone the flink-basic-processing project and build the jar using maven: mvn clean package.
- Copy the configuration file template src/main/config/flink-basic.properties.template and fill in the blanks.
- Launch the two jobs. Assuming a flink YARN session is already running:
export HADOOP_CONF_DIR=/etc/hadoop/conf
export FLINK_CONF_DIR=/path/to/bbdata-flink-conf-dir

# launch the augmenter
flink run target/flink-basic-processing-XX.jar flink-basic.properties augmenter

# launch the saver
flink run target/flink-basic-processing-XX.jar flink-basic.properties saver
- Ensure all is running fine using either the flink interface or the command line:

flink list
flink-aggregations
Clone the flink-aggregations project and build the jar using maven: mvn clean package.
Info
You need one processor (i.e. application process) per granularity. For quarters and hours, this means launching two Flink applications. The only difference between the two is the properties file passed as the first parameter.
Copy the file src/main/resources/config.properties-template and fill in the blanks.
Note: to configure the aggregation granularity, you need to update at least window.granularity (size of the window in minutes) and to adapt the window.allowed_lateness (minutes the window is kept in memory for late records) and the window.timeout (how long before we consider the stream to be stopped).
To launch the application (assuming a flink session is already running):
export HADOOP_CONF_DIR=/etc/hadoop/conf
export FLINK_CONF_DIR=/path/to/bbdata-flink-conf-dir

# launch the aggregations processor for each quarter
flink run target/flink-aggregations-XX.jar aggregations-quarter.properties

# launch the aggregations processor for each hour
flink run target/flink-aggregations-XX.jar aggregations-hours.properties
# etc.
Webapp (admin console)
The admin console is a standalone NodeJS application. Procedure:
- Install NodeJS and NPM
- Clone the webapp project
- Run npm install at the root of the project
- Open server.js and edit the API_SERVER_HOST variable at the top of the file
- Launch it using nodejs server.js
git clone https://gitlab.forge.hefr.ch/bbdata/webapp
cd webapp
npm install
nodejs server.js
Note that there is no option to run the application in the background. The best way is to run it inside a screen session.
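For example, with screen installed (a minimal sketch; the session name is arbitrary):

# start the webapp in a detached screen session named bbdata-webapp
screen -dmS bbdata-webapp nodejs server.js
# reattach later with: screen -r bbdata-webapp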