Stackable Documentation
Welcome to Stackable! This documentation gives you an overview of the Stackable Data Platform, shows you how to install and manage it, and provides some tutorials.
Introduction
The Stackable Data Platform allows you to deploy, scale, and manage data infrastructure in any environment running Kubernetes.
You can find an overview of the supported components below, as well as a full list of all supported product versions here.
If you have any feedback regarding the documentation, please open an issue, ask a question, or look at the source for this documentation in its repository.
Goal of the project
We are building a distribution of existing Open Source tools that together comprise the components of a modern data platform.
There are components to ingest, store, process, and visualize data, and much more. While the platform got its start in the Big Data ecosystem, it is in no way limited to big data workloads.
You can declaratively build these environments, and we don’t stop at the tool level: we also provide ways for users to interact with the platform following the "as Code" approach.
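For example, here is a minimal sketch of what defining a product cluster "as Code" could look like, using the Kubernetes Python client. The KafkaCluster group, version, plural, and spec fields are assumptions based on the Kafka operator's custom resource and may not match the current CRD schema exactly:

```python
# Minimal sketch of the declarative "as Code" approach, assuming the
# Stackable Kafka operator is already installed in the cluster.
# The apiVersion, kind, plural, and spec fields below are illustrative
# assumptions; consult the operator documentation for the exact schema.
from kubernetes import client, config

config.load_kube_config()  # authenticate using the current kubectl context

kafka_cluster = {
    "apiVersion": "kafka.stackable.tech/v1alpha1",  # assumed group/version
    "kind": "KafkaCluster",
    "metadata": {"name": "simple-kafka", "namespace": "default"},
    "spec": {
        "image": {"productVersion": "3.4.0"},  # assumed field names
        "brokers": {"roleGroups": {"default": {"replicas": 3}}},
    },
}

# Hand the declarative description to Kubernetes; the operator reconciles
# the actual broker pods, services, and configuration from it.
client.CustomObjectsApi().create_namespaced_custom_object(
    group="kafka.stackable.tech",
    version="v1alpha1",
    namespace="default",
    plural="kafkaclusters",  # assumed plural name of the CRD
    body=kafka_cluster,
)
```

The same description can live in version control, which is what makes an environment reproducible and reviewable like any other code.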
We are leveraging the Open Policy Agent to provide Security-as-Code.
We are building a distribution that includes the “best of breed” of existing Open Source tools, bundled so that it is easy to deploy a fully working stack of software. Most of the existing tools are “single purpose” tools which often do not play nicely together out of the box.
Components
We use Kubernetes as our deployment platform and are building Operators for each of the products we support. The Stackable Data Platform supports the following products (a short sketch of inspecting an operator-managed deployment follows the list):
Apache Hadoop HDFS
HDFS is a distributed file system that provides high-throughput access to application data.
Apache Hive
The Apache Hive data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. We support the Hive Metastore.
Apache Kafka
Apache Kafka is an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications.
Apache Spark
Apache Spark is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.
Trino
Trino is a fast distributed SQL query engine for big data analytics that helps you explore your data universe.
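Because each product is rolled out by an operator, the resulting workloads are ordinary Kubernetes objects that you can inspect with standard tooling. Here is a minimal sketch using the Kubernetes Python client, where the label selector is an assumption (the actual labels depend on the operator and its version):

```python
# Sketch: list the pods an operator created for a deployed product.
# The label selector is an assumption; check the labels your operator
# actually applies to its workloads.
from kubernetes import client, config

config.load_kube_config()

pods = client.CoreV1Api().list_namespaced_pod(
    namespace="default",
    label_selector="app.kubernetes.io/name=kafka",  # assumed label
)
for pod in pods.items:
    print(pod.metadata.name, pod.status.phase)
```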