The Koalja platform is a pipeline construction platform designed to make data processing pipelines simple to build and run. Ultimately, users will not have to know about Kubernetes or containers: they will simply drop or POST code and data into endpoints, and the necessary outcomes will follow. Scaling of processes can be automated. The provenance of data artifacts is tracked and turned into metadata, using smart tasks and smart links, managed by simple policy. The goal is to make the experience as simple as possible, so that the focus stays on what matters to the data scientist or business analyst.
export DOCKERNAMESPACE=<your-docker-hub-account-name>
make build-image docker
# Install heptio contour
kubectl apply -f https://j.hept.io/contour-deployment-rbac
Once Contour is deployed, look up the load balancer Service in the heptio-contour namespace. Make sure to configure DNS for all domains (see Domain management) so that these domains are mapped to the Contour load balancer.
# Install the CRD
make install
# Deploy operator
make deploy
For every pipeline, a hostname is created and served on the Contour ingress load balancer. The hostname is <pipeline-name>.<namespace>.<domain-suffix>.
The domain-suffix can be configured using a ConfigMap in the namespace of the pipeline (or globally in the koalja-system namespace). Using the ConfigMap you can specify one or more domain suffixes. Each suffix has a label selector, and the suffix with the most specific label selector that matches the labels of the Pipeline is used.
The ConfigMap must be named koalja-domain-config and requires the following data format.
apiVersion: v1
kind: ConfigMap
metadata:
  name: koalja-domain-config
data:
  # These are example domain settings.
  # example.org will be used for pipelines having app=prod.
  example.org: |
    selector:
      app: prod
  # Default value for domain, for pipelines that do not have the app=prod label.
  # Although it will match all pipelines, it is the least-specific rule so it
  # will only be used if no other domain matches.
  example.com: ""
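The selection rule described above can be sketched in a few lines of Python. This is an illustration only, not the operator's implementation: the function names `select_domain_suffix` and `pipeline_hostname` are hypothetical, and "most specific" is assumed here to mean "the matching selector with the most labels", with ties going to the first entry encountered.

```python
from typing import Dict, Optional

# Hypothetical in-memory form of the example ConfigMap above:
# each domain suffix maps to its label selector (empty = matches everything).
SUFFIXES: Dict[str, Dict[str, str]] = {
    "example.org": {"app": "prod"},
    "example.com": {},
}


def select_domain_suffix(
    suffixes: Dict[str, Dict[str, str]],
    pipeline_labels: Dict[str, str],
) -> Optional[str]:
    """Return the suffix whose selector matches the pipeline labels and
    contains the most labels (assumed measure of specificity)."""
    best_suffix = None
    best_size = -1
    for suffix, selector in suffixes.items():
        # A selector matches when every key/value pair is present on the pipeline.
        matches = all(pipeline_labels.get(k) == v for k, v in selector.items())
        if matches and len(selector) > best_size:
            best_suffix, best_size = suffix, len(selector)
    return best_suffix


def pipeline_hostname(name: str, namespace: str, suffix: str) -> str:
    """Build <pipeline-name>.<namespace>.<domain-suffix>."""
    return f"{name}.{namespace}.{suffix}"


if __name__ == "__main__":
    suffix = select_domain_suffix(SUFFIXES, {"app": "prod"})
    print(pipeline_hostname("my-pipeline", "default", suffix))
```

With the example settings, a pipeline labeled app=prod resolves to example.org, while any other pipeline falls back to example.com because the empty selector matches everything but is the least specific.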
The Koalja Pipeline resource controller launches multiple agents and services for every pipeline resource. The containers for those agents and services are configured using custom resources (see config/agents) and a ConfigMap that selects which resource is used.
This way, Koalja can support multiple implementations of services and agents, and you can choose which one to use. For example, you can choose between the (default) in-memory implementations and ArangoDB database-based implementations.
To select non-default implementations, create a ConfigMap like the one below and add it to the namespace in which you're deploying your pipeline resources.
apiVersion: v1
kind: ConfigMap
metadata:
  name: koalja-services-config
data:
  config: |
    annotated-value-registry-name: arangodb-annotatedvalue-registry
The Koalja Operator deploys pipeline components that rely on a FileSystem Storage Service for storing long-term data assets. This service is deployed separately from the operator itself.
There are multiple implementations of this service.
- Local: A test-only implementation that stores data on the file system of the k8s nodes.
- S3: An implementation that stores data in an S3-compatible object store.
The selection between these implementations is made using an environment variable named STORAGETYPE, which can be set to either local or s3.
The make deploy target uses this variable to deploy the appropriate resources.
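As a small pre-flight check before running the deploy target, the value of STORAGETYPE can be validated against the two documented options. The sketch below is an assumption-laden illustration: the helper name `storage_type_from_env` is hypothetical, and because the text does not say what happens when the variable is unset, the sketch simply treats a missing or unknown value as an error.

```python
import os
from typing import Mapping

# The two values documented for STORAGETYPE.
VALID_STORAGE_TYPES = ("local", "s3")


def storage_type_from_env(env: Mapping[str, str] = os.environ) -> str:
    """Return the STORAGETYPE value, or raise if it is unset or unknown.

    Rejecting an unset variable is an assumption of this sketch, not
    documented behaviour of the Makefile.
    """
    value = env.get("STORAGETYPE", "")
    if value not in VALID_STORAGE_TYPES:
        raise ValueError(
            f"STORAGETYPE must be one of {VALID_STORAGE_TYPES}, got {value!r}"
        )
    return value


if __name__ == "__main__":
    try:
        print("Deploying with storage type:", storage_type_from_env())
    except ValueError as err:
        print("Refusing to deploy:", err)
```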
The S3 storage service must be configured before it can be used. To do so, deploy a yaml file like this:
apiVersion: v1
kind: Secret
metadata:
  name: koalja-s3-default-secrets
  namespace: koalja-system
data:
  access-key: <base64 encoded access key>
  secret-key: <base64 encoded secret key>
type: aljabrio/koalja-flex-s3
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: koalja-s3-storage-config
  namespace: koalja-system
data:
  default: |
    name: <name of the s3 bucket>
    endpoint: <endpoint of the s3 server>
    secretName: koalja-s3-default-secrets
The ConfigMap may contain multiple object storages; only a default object storage is required.
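The access-key and secret-key entries in the Secret's data section must be base64 encoded. A minimal sketch of producing those values is shown below; the helper names and the sample credential are hypothetical placeholders, not real keys.

```python
import base64


def encode_secret_value(value: str) -> str:
    """Base64-encode a plain-text value for a Secret's data field."""
    return base64.b64encode(value.encode("utf-8")).decode("ascii")


def decode_secret_value(encoded: str) -> str:
    """Reverse of encode_secret_value, useful for checking a manifest."""
    return base64.b64decode(encoded).decode("utf-8")


if __name__ == "__main__":
    # Placeholder credential for illustration only.
    print("access-key:", encode_secret_value("admin"))
```

Note that base64 is an encoding, not encryption, so the resulting manifest should be handled as carefully as the plain-text keys.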
- On GKE:
  - Ensure that the nodes support fusermount (e.g. use ubuntu images)
  - kubectl apply -f config/platforms/gke
- cmd: Package containing entry points (main functions) for the binaries of this project.
- config: Directory containing (customizable) deployment configuration.
- docker: Directory containing docker image build files.
- docs: Directory containing design & specification documents and experiments.
- examples: Example pipelines.
- frontend: Pipeline frontend (react application).
- hack: Kubebuilder related template files.
- pkg: Package root for all koalja libraries & services.
- third_party: Directory containing 3rd party code mainly related to GRPC & protocol buffers.
- vendor: External go libraries.