Data Cafe can be used to construct data lakes in Hadoop HDFS from heterogeneous data sources, and to query those data lakes efficiently by leveraging Apache Drill.
The steps are outlined below:
Data Cafe is implemented for, and tested with, the following origin data sources:
- MongoDB
- MySQL
The data lake is constructed in:
- Hadoop HDFS and Hive
- Start MongoDB.
$ sudo systemctl start mongodb
- You may also configure MongoDB to start automatically at system start-up.
$ sudo systemctl enable mongodb
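- You may verify that MongoDB is accepting connections; a quick check, assuming the mongo shell is installed, is
$ mongo --eval 'db.runCommand({ ping: 1 })'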
- Start the Hadoop NameNode and DataNode daemons
$HADOOP_HOME/sbin/start-dfs.sh
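- While the daemons are running, you may verify that HDFS is reachable by listing its root directory, for example
$HADOOP_HOME/bin/hdfs dfs -ls /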
- Browse the NameNode web interface at http://localhost:50070/
- Once the execution is complete, you may stop the daemons with
$HADOOP_HOME/sbin/stop-dfs.sh
- Run the Hive Metastore service
$HIVE_HOME/bin/hive --service metastore &
- Start HiveServer2
$HIVE_HOME/bin/hiveserver2
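- To confirm that HiveServer2 is up, you may connect to it with Beeline, assuming the default port 10000 and no authentication
$HIVE_HOME/bin/beeline -u jdbc:hive2://localhost:10000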
Make sure Java is installed in order to start Drill.
- Launch Drill in Embedded mode
$DRILL_HOME/bin/drill-embedded
- If Drill is secured, log in with your credentials.
$DRILL_HOME/bin/drill-embedded -n username -p password
- Browse the Drill web interface at http://localhost:8047/
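- Once Drill is running, the data lake can be queried from the Drill shell. A minimal sketch, assuming the Hive storage plugin is enabled in Drill and a Hive table named patients has been created by Data Cafe (the table name is hypothetical):
0: jdbc:drill:zk=local> SELECT * FROM hive.`patients` LIMIT 10;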
Make sure to set the extractHeader element appropriately in the Drill storage plugin configuration, according to your data sources.
"extractHeader": true,
Data Cafe can be built using Apache Maven 3.x and Java 1.8.x or higher.
$ mvn clean install
Built and tested with Apache Maven 3.1.1 and Oracle Java 1.8.0.
You may execute Data Cafe by writing a server application on top of the Data Cafe Server, as well as a client application by extending the Data Cafe Client.
Server and client samples are provided in the datacafe-server-samples and datacafe-client-samples modules.
Make sure to include log4j2-test.xml in your classpath to be able to configure and view the logs. The default log level is WARN.
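A minimal log4j2-test.xml that writes WARN-and-above messages to the console could look like the following; the pattern and level are illustrative and may be adjusted:
<?xml version="1.0" encoding="UTF-8"?>
<Configuration>
  <Appenders>
    <Console name="Console" target="SYSTEM_OUT">
      <PatternLayout pattern="%d{HH:mm:ss} [%t] %-5level %logger{36} - %msg%n"/>
    </Console>
  </Appenders>
  <Loggers>
    <Root level="warn">
      <AppenderRef ref="Console"/>
    </Root>
  </Loggers>
</Configuration>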
The samples described below use PhysioNet MIMIC-III data sources, downloaded and configured as MongoDB databases.
Executor offers a server sample with Mongo as the origin data source and HDFS as the integrated data source.
To execute the Executor sample,
$ java -classpath lib/datacafe-server-samples-1.0-SNAPSHOT.jar:lib/*:conf/ edu.emory.bmi.datacafe.server.samples.mongo.physionet.Executor
ExecutorClient, on the other hand, offers a sample client implementation with partially auto-generated queries.
To execute the ExecutorClient sample,
$ java -classpath lib/datacafe-client-samples-1.0-SNAPSHOT.jar:lib/*:conf/ edu.emory.bmi.datacafe.client.samples.ExecutorClient
HzServer may be executed to create additional Hazelcast members that join the Hazelcast cluster created by the Executor.
$ java -classpath lib/datacafe-server-1.0-SNAPSHOT.jar:lib/*:conf/ edu.emory.bmi.datacafe.hazelcast.HzServer
DatacafeEngine exposes the Data Cafe server and client methods as RESTful APIs for data lake creation and access.
$ java -classpath lib/datacafe-rest-1.0-SNAPSHOT.jar:lib/*:conf/ edu.emory.bmi.datacafe.rest.DatacafeEngine
This project depends on the below major projects:
- Hazelcast
- MongoDB
- Apache Hadoop
- Apache Hive
- Apache Drill
- Apache Log4j2
- MySQL
- SparkJava
Please cite the below if you use Data Cafe in your research.
[1] Kathiravelu, P. & Sharma, A. (2016). A Dynamic Data Warehousing Platform for Creating and Accessing Biomedical Data Lakes. In Second International Workshop on Data Management and Analytics for Medicine and Healthcare (DMAH'16), co-located with the 42nd International Conference on Very Large Data Bases (VLDB 2016). Sep. 2016. LNCS. pp. 101-120. https://link.springer.com/chapter/10.1007/978-3-319-57741-8_7
[2] Kathiravelu, P., Kazerouni, A., & Sharma, A. (2016). Data Café — A Platform For Creating Biomedical Data Lakes. In AMIA 2016 Joint Summits on Translational Science. Mar. 2016.