Stackable Documentation
Welcome to Stackable! This documentation gives you an overview of the Stackable Data Platform, shows you how to install and manage it, and provides some tutorials.
Introduction
The Stackable Data Platform allows you to deploy, scale, and manage data infrastructure in any environment running Kubernetes.
You can find an overview of the supported components below, as well as a full list of all supported product versions here.
If you have any feedback regarding the documentation, please open an issue, ask a question, or look at the source for this documentation in its repository.
Goal of the project
We are building a distribution of existing Open Source tools that together comprise the components of a modern data platform.
There are components to ingest, store, process, and visualize data, and much more. While the platform got its start in the Big Data ecosystem, it is in no way limited to big data workloads.
You can declaratively build these environments, and we don’t stop at the tool level: we also provide ways for users to interact with the platform following the "as Code" approach.
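For example, here is a minimal sketch of what defining a product cluster "as Code" could look like, using the Kubernetes Python client. The KafkaCluster group, version, plural, and spec fields are assumptions based on the Kafka operator's custom resource and may not match the current CRD schema exactly:

```python
# Minimal sketch of the declarative "as Code" approach, assuming the
# Stackable Kafka operator is already installed in the cluster.
# The apiVersion, kind, plural, and spec fields below are illustrative
# assumptions; consult the operator documentation for the exact schema.
from kubernetes import client, config

config.load_kube_config()  # authenticate using the current kubectl context

kafka_cluster = {
    "apiVersion": "kafka.stackable.tech/v1alpha1",  # assumed group/version
    "kind": "KafkaCluster",
    "metadata": {"name": "simple-kafka", "namespace": "default"},
    "spec": {
        "image": {"productVersion": "3.4.0"},  # assumed field names
        "brokers": {"roleGroups": {"default": {"replicas": 3}}},
    },
}

# Hand the declarative description to Kubernetes; the operator reconciles
# the actual broker pods, services, and configuration from it.
client.CustomObjectsApi().create_namespaced_custom_object(
    group="kafka.stackable.tech",
    version="v1alpha1",
    namespace="default",
    plural="kafkaclusters",  # assumed plural name of the CRD
    body=kafka_cluster,
)
```

The same description can live in version control, which is what makes an environment reproducible and reviewable like any other code.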
We are leveraging the Open Policy Agent to provide Security-as-Code.
We are building a distribution that includes the “best of breed” of existing Open Source tools, bundled so that it is easy to deploy a fully working stack of software. Most of the existing tools are “single purpose” tools which often do not play nicely together out of the box.
Components
We use Kubernetes as our deployment platform and are building Operators for each of the products we support. The Stackable Data Platform supports the following products (a short sketch of inspecting an operator-managed deployment follows the list):
Apache Hadoop HDFS
HDFS is a distributed file system that provides high-throughput access to application data.
Apache Hive
The Apache Hive data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. We support the Hive Metastore.
Apache Kafka
Apache Kafka is an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications.
Apache Spark
Apache Spark is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.
Trino
Trino is a fast distributed SQL query engine for big data analytics that helps you explore your data universe.
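Because each product is rolled out by an operator, the resulting workloads are ordinary Kubernetes objects that you can inspect with standard tooling. Here is a minimal sketch using the Kubernetes Python client, where the label selector is an assumption (the actual labels depend on the operator and its version):

```python
# Sketch: list the pods an operator created for a deployed product.
# The label selector is an assumption; check the labels your operator
# actually applies to its workloads.
from kubernetes import client, config

config.load_kube_config()

pods = client.CoreV1Api().list_namespaced_pod(
    namespace="default",
    label_selector="app.kubernetes.io/name=kafka",  # assumed label
)
for pod in pods.items:
    print(pod.metadata.name, pod.status.phase)
```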