Apache NiFi
Visual, role-based pipelines with built-in lineage. Real-time and scheduled acquisition, transformation, and routing across heterogeneous systems.
D4L™ · By Deasil Works, Inc. · 6 U.S. facilities · Bare metal · K8s
D4L is a private cloud built entirely on permissively-licensed open source: Apache NiFi, Trino, Postgres, Cassandra, OpenSearch, Kafka, Iceberg, Ceph, Superset, DataHub, plus the broader catalog further down. We operate it on bare metal Deasil owns across 6 U.S. facilities. The same operational ease you bought Snowflake or Tableau for, at a fraction of the price, with no per-seat tax and no SaaS vendor between you and your data. The pipelines, dashboards, and SQL belong to you. Moving D4L off our iron is a configuration change, not a migration project.
Closed-source data products market on the promise of stability. What you actually buy is a vendor that hides its failures, deprecations, support backlog, and acquisition risk behind a status page. Deasil has been on the other side of those tickets for 25 years. We know what the box hides because we run the equivalent OSS stack ourselves, in the open, on-call.
The OSS projects in the canonical D4L stack run in production at companies whose engineering scale dwarfs any SaaS vendor's. Postgres runs Bloomberg, Reddit, Apple, and the largest Atlassian Jira tenant. Apache Cassandra runs Apple iCloud, Netflix, and Discord. Trino is what Netflix and Bloomberg query Iceberg with. OpenSearch is what AWS itself runs. Apache Kafka is what LinkedIn, Walmart, and Uber move events on. The notion that these projects need a SaaS vendor wrapped around them to be "production ready" is upside down: they are more battle-tested than the proprietary layers built on top of them.
What the SaaS box actually delivers is opacity. When Snowflake has an incident you read a status page. When Tableau deprecates a feature you read a release note. When Salesforce raises prices 9% you read a press release. When Looker is acquired by Google you read a blog post. None of these are inherent to the data system. They are inherent to the relationship.
And the features SaaS markets are usually the surface. Apache Superset ships more chart types and more native database connectors than Tableau. Trino federates more sources than Snowflake natively. OpenSearch's k-NN matches Elastic Cloud at typical RAG scale. NiFi's visual provenance is richer than Fivetran's lineage. Once the operations are taken care of, the OSS stack is usually the more capable one, not the compromise.
The cost of "convenience" is visible line by line. Each commercial product on the left is paired with the OSS project D4L would run in its place on the right, along with the portability and pricing trade you get when you switch.
No per-seat tax, no Salesforce shareholder. Dashboards are JSON in your git, not artifacts in a vendor cloud.
Same SQL surface, federated across Postgres / Cassandra / OpenSearch. Iceberg is the format Snowflake itself now reads.
The components Databricks repackages, run directly. Notebooks live in JupyterHub. Models live in MLflow. Pipelines live in Airflow.
Visual pipelines with provenance, role-based access, and lineage. The flow files are XML you keep. Moving NiFi off D4L is moving a config.
A fully Apache-2.0 observability stack: metrics, logs, dashboards. No per-host meter and no AGPL network-copyleft surprise (which is why Grafana and Loki are not in the canonical D4L pick).
The S3 API, not the S3 bill. We mount disks once. Egress is a network wire, not a billable event.
Kafka itself is still Apache. We can swap to Redpanda for the same wire protocol with less ops surface.
Same Lucene under the hood. Since 2024, OpenSearch has had its own governance, its own foundation, and its own release cadence.
OIDC, OAuth 2.0, SAML 2.0, 2FA. Federate with whatever IdP your org already runs. No MAU meter.
pgvector is a column type on the database you already run. No second cluster, no separate billing rail.
The same DAGs, the same operators, the same UI. The bill is the cluster, not the project.
The Linux Foundation forks of Terraform and Vault. Same HCL, same APIs, no BSL. The drop-in is real.
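The drop-in claim is concrete enough to sketch. The module below is illustrative, not a D4L configuration — provider, version constraint, and resource names are assumptions — but the point it makes is real: the same HCL runs unchanged under `terraform apply` and `tofu apply`.

```hcl
# Identical under Terraform and OpenTofu: same HCL syntax, same
# state format, same provider registry protocol. Nothing to port.
terraform {
  required_providers {
    kubernetes = {
      source  = "hashicorp/kubernetes"
      version = "~> 2.0"
    }
  }
}

resource "kubernetes_namespace" "analytics" {
  metadata {
    name = "analytics"
  }
}
```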
If you're paying for one of these, send us your last invoice. We'll come back within a week with a hardware spec, an OSS replacement plan, and a fixed monthly number.
D4L runs any modern, permissively licensed OSS data application your team has standardized on, configured to your spec, on iron we own, billed at one fixed monthly number. The 18 components below are the canonical D4L stack: a Sample Deployment we ship when a customer has no strong preference. The Explorer that follows lets you filter the broader universe of permissively licensed OSS data tools D4L will operate on request.
Visual, role-based pipelines with built-in lineage. Real-time and scheduled acquisition, transformation, and routing across heterogeneous systems.
One SQL surface across PostgreSQL, Ceph (S3), OpenSearch, Cassandra, and Hive. Federated queries at petabyte scale without copying data.
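What "one SQL surface" means in practice is a single query that spans sources. A hedged sketch — the catalog names (`postgres`, `cassandra`) and the schema/table names are hypothetical, assuming those connectors are configured in Trino:

```sql
-- Join a relational orders table against a Cassandra event table
-- in one statement; Trino pushes work down to each source and
-- joins the results without first copying data into a warehouse.
SELECT o.customer_id,
       count(e.event_id) AS events_last_30d
FROM postgres.public.orders AS o
JOIN cassandra.telemetry.events AS e
  ON e.customer_id = o.customer_id
WHERE e.event_time > current_timestamp - INTERVAL '30' DAY
GROUP BY o.customer_id;
```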
A web-native, open-source replacement for Tableau and Looker. Charts, dashboards, geospatial. Backed by Trino and PostgreSQL.
Per-user JupyterLab environments with Python 3, R, Julia, Octave, Bash kernels and the standard data-science stack pre-installed.
Twenty-plus years of community development. Full SQL, JSONB, extensions. The reliable spine of nearly every D4L deployment.
Linear scalability, fault-tolerance proven on commodity hardware. The wide-column store under mission-critical write paths.
Apache 2.0 search and analytics for application search, log analytics, and observability. Lucene under the hood. No licensing trapdoors.
SQL over distributed storage with a schema-on-read mindset. The catalog and metastore Trino reads to do its job.
Petabyte-scale, S3-API object storage for structured and unstructured data. Objects up to five terabytes each (the S3 API limit); total capacity scales horizontally toward near-limitless.
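Because the surface is the standard S3 API, any existing S3 client points at it with a one-line endpoint change. A sketch of an rclone remote — the endpoint host and credentials here are placeholders, not real D4L values:

```ini
; ~/.config/rclone/rclone.conf — the only D4L-specific detail is
; the endpoint; everything else is ordinary S3 client config.
[d4l]
type = s3
provider = Ceph
access_key_id = YOUR_ACCESS_KEY
secret_access_key = YOUR_SECRET_KEY
endpoint = https://s3.example-d4l.internal
```

After that, `rclone ls d4l:my-bucket` works the same way it would against AWS — minus the egress meter.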
The metadata platform across the D4L stack. Asset discovery, column-level lineage, ownership, glossary, and data-product modelling for everything we run (NiFi, Trino, Postgres, OpenSearch, dbt, Airflow). Open-source, originally built at LinkedIn, Apache 2.0 throughout.
OpenID Connect, OAuth 2.0, SAML 2.0 with 2FA. One identity surface across the platform. Federate with your existing IdP if you have one.
Durable event log behind everything that needs an audit trail or a real-time pipeline. Confluent Cloud bills per throughput and per partition; Apache Kafka itself does not.
Python-defined DAGs for ETL and scheduled jobs. Astronomer and AWS MWAA charge per-environment and per-worker for what is otherwise a free Apache project.
The dashboard layer over OpenSearch and Prometheus. The Apache-2.0 fork of Kibana, kept permissive when Elastic moved Kibana to SSPL in 2021. Replaces Grafana in the canonical D4L stack because Grafana itself moved to AGPL the same year.
Pull-based metrics collection and alerting. The standard the entire CNCF ecosystem speaks. Cloud-managed equivalents charge per-metric per-month; the protocol is the same.
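The pull model is a few lines of YAML. A minimal, hypothetical scrape config — job names and target addresses are illustrative only:

```yaml
# prometheus.yml — Prometheus pulls /metrics from each target on
# its own schedule; nothing here is metered or vendor-specific.
global:
  scrape_interval: 30s

scrape_configs:
  - job_name: "node"
    static_configs:
      - targets: ["10.0.0.11:9100", "10.0.0.12:9100"]
  - job_name: "kafka"
    static_configs:
      - targets: ["kafka-0.example:7071"]
```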
The open table format that won. Default for AWS Athena, Glue, and EMR by 2024. Used here as the lakehouse table layer over Ceph so Trino, Spark, and Flink read the same bytes.
PostgreSQL extension for vector similarity. Replaces Pinecone-class managed vector DBs with a column type on the database you already trust. No per-vector pricing, no separate cluster to operate.
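A sketch of what "a column type" means in practice. The table, the 768-wide dimension, and the elided query literal are all hypothetical; the operators and index type are pgvector's own:

```sql
CREATE EXTENSION IF NOT EXISTS vector;

-- Embeddings live next to the rows they describe; no second system.
CREATE TABLE documents (
    id        bigserial PRIMARY KEY,
    body      text,
    embedding vector(768)   -- dimension matches your embedding model
);

-- Nearest-neighbour search by cosine distance (the <=> operator),
-- with an HNSW index so the scan does not go linear at scale.
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops);

SELECT id, body
FROM documents
ORDER BY embedding <=> '[...]'::vector   -- query embedding, elided
LIMIT 10;
```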
The substrate everything else runs on. Deployments, scaling, secrets, networking, storage classes. Your platform with grown-up controls.
Curated from the GitHub universe at >5,000 stars and a permissive or weak-copyleft license (Apache 2.0, MIT, BSD, MPL 2.0, LGPL 2.1, ISC, PostgreSQL). No SSPL, BUSL, RSAL, or AGPL. D4L will configure and operate any of them, on the same iron, under the same fixed monthly bill. Bring this list to a vendor and ask them to do the same.
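The curation rule is mechanical enough to sketch in a few lines of Python. The repo records below are invented for illustration — the real catalog is not embedded here — but the star floor and the license allow/deny lists are the ones stated above:

```python
# Sketch of the Explorer's curation rule: >5,000 stars and a
# permissive or weak-copyleft license. Repo records are invented
# examples, not the real catalog.
ALLOWED_LICENSES = {
    "Apache-2.0", "MIT", "BSD-3-Clause", "BSD-2-Clause",
    "MPL-2.0", "LGPL-2.1", "ISC", "PostgreSQL",
}
EXCLUDED_LICENSES = {"SSPL-1.0", "BUSL-1.1", "RSALv2", "AGPL-3.0"}
STAR_FLOOR = 5_000

def qualifies(repo: dict) -> bool:
    """A repo makes the Explorer if it clears the star floor and
    carries an allowed license; excluded licenses never qualify."""
    if repo["license"] in EXCLUDED_LICENSES:
        return False
    return repo["stars"] > STAR_FLOOR and repo["license"] in ALLOWED_LICENSES

repos = [
    {"name": "example-oss-db",    "stars": 42_000, "license": "Apache-2.0"},
    {"name": "example-sspl-db",   "stars": 90_000, "license": "SSPL-1.0"},
    {"name": "example-small-lib", "stars": 1_200,  "license": "MIT"},
]

catalog = [r["name"] for r in repos if qualifies(r)]
print(catalog)  # only the Apache-2.0 repo clears both gates
```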
Semantic data platform MCP server that composes multiple data tools with bidirectional cross-injection. Tool responses automatically include critical context from other services.
DataHub MCP server and composable Go library for building custom MCP servers that integrate DataHub metadata capabilities. Part of the txn2 MCP toolkit.
Model Context Protocol server for Trino. Lets AI assistants query and explore data warehouses through the Trino federated SQL surface.
Model Context Protocol server for Amazon S3 and S3-compatible object storage. Enables AI assistants to browse buckets, read and write objects, and generate presigned URLs.
MCP test server for validating Model Context Protocol integrations. Used internally for D4L MCP toolkit conformance.
The Postgres development platform. Supabase gives you a dedicated Postgres database to build your web, mobile, and AI applications.
A fancy self-hosted monitoring tool
Apache Superset is a Data Visualization and Data Exploration Platform
The Prometheus monitoring system and time series database.
Distributed reliable key-value store for the most critical data of a distributed system
ClickHouse® is a real-time analytics database management system
Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
Streamlit — A faster way to build and share data apps.
Milvus is a high-performance, cloud-native vector database built for scalable vector ANN search
Apache Spark - A unified analytics engine for large-scale data processing
Build and share delightful machine learning apps, all in Python.
TiDB is built for agentic workloads that grow unpredictably, with ACID guarantees and native support for transactions, analytics, and vector search. No data silos. No noisy neighbors. No infrastructure ceiling.
Umami is a modern, privacy-focused analytics platform. An open-source alternative to Google Analytics, Mixpanel and Amplitude.
Apache Kafka - A distributed event streaming platform
SeaweedFS is a distributed storage system for object storage (S3), file systems, and Iceberg tables, designed to handle billions of files with O(1) disk access and effortless horizontal scaling.
Scalable datastore for metrics, events, and real-time analytics
Qdrant - High-performance, massive-scale Vector Database and Vector Search Engine for the next generation of AI.
Make Your Company Data Driven. Connect to any data source, easily visualize, dashboard and share your data.
Search infrastructure for AI
Label Studio is a multi-type data labeling and annotation tool with standardized output format
2.3x faster than MinIO for 4KB object payloads. RustFS is an open-source, S3-compatible high-performance object storage system supporting migration and coexistence with other S3-compatible platforms such as MinIO and Ceph.
Event Driven Orchestration & Scheduling Platform for Mission Critical Applications
Apache Flink
The open source AI engineering platform for agents, LLMs, and ML models. MLflow enables teams of all sizes to debug, evaluate, monitor, and optimize production-quality AI applications while controlling costs and managing access to models and data.
APM, Application Performance Monitoring System
eBPF-based Networking, Security, and Observability
Data Apps & Dashboards for Python. No JavaScript Required.
CNCF Jaeger, a Distributed Tracing Platform
Dolt – Git for Data
Apache RocketMQ is a cloud native messaging and streaming platform, making it simple to build event-driven applications.
Prefect is a workflow orchestration framework for building resilient data pipelines in Python.
A high-performance observability data pipeline.
Neon: Serverless Postgres. We separated storage and compute to offer autoscaling, code-like database branching, and scale to zero.
high-performance graph database for real-time use cases
Open-source IoT Platform - Device management, data collection, processing and visualization.
Fast, lightweight & schema-less search backend. An alternative to Elasticsearch that runs on a few MBs of RAM.
A reactive notebook for Python — run reproducible experiments, query with SQL, execute as a script, deploy as an app, and version with git. Stored as pure Python. All in a modern, AI-native editor.
Empowering Data Intelligence with Distributed SQL for Sharding, Scalability, and Security Across All Databases.
GoAccess is a real-time web log analyzer and interactive viewer that runs in a terminal in *nix systems or through your browser.
Zipkin is a distributed tracing system
VictoriaMetrics: fast, cost-effective monitoring solution and time series database
QuestDB is a high performance, open-source, time-series database
Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
The official home of the Presto distributed SQL query engine for big data
Workflow Engine for Kubernetes
Weaviate is an open-source vector database that stores both objects and vectors, allowing for the combination of vector search with structured filtering with the fault tolerance and scalability of a cloud-native database.
An orchestration platform for the development, production, and observation of data assets.
Apache Doris is an easy-to-use, high performance and unified analytics database.
Apache Pulsar - distributed pub-sub messaging system
MCP Toolbox for Databases is an open source MCP server for databases.
Apache DolphinScheduler is the modern data orchestration platform. Agile to create high performance workflow with low-code
Gel supercharges Postgres with a modern data model, graph queries, Auth & AI solutions, and much more.
Highly available Prometheus setup with long term storage capabilities. A CNCF Incubating project.
APM, (Application Performance Management) tool for large-scale distributed systems.
OpenMetadata is a unified metadata platform for data discovery, data observability, and data governance powered by a central metadata repository, in-depth column level lineage, and seamless team collaboration.
JuiceFS is a distributed POSIX file system built on top of Redis and S3.
Proxy based Redis cluster solution supporting pipeline and scaling dynamically
Nightingale is to monitoring and alerting what Grafana is to visualization.
Open source distributed and RESTful search engine.
Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
Change data capture for a variety of databases.
Run any open-source LLMs, such as DeepSeek and Llama, as OpenAI compatible API endpoint in the cloud.
A distributed, fast open-source graph database featuring horizontal scalability and high availability
Open-Source Web UI for Apache Kafka Management
CMAK is a tool for managing Apache Kafka clusters
eBPF-powered network observability for Kubernetes. Indexes L4/L7 traffic with full K8s context, decrypts TLS without keys. Queryable by AI agents via MCP and humans via dashboard.
The Context Platform for your Data and AI Stack
The world's fastest open query engine for sub-second analytics both on and off the data lakehouse. With the flexibility to support nearly any scenario, StarRocks provides best-in-class performance for multi-dimensional analytics, real-time analytics, and ad-hoc queries. A Linux Foundation project.
A self-hosted, ad-free, privacy-respecting metasearch engine
Always know what to expect from your data.
TensorZero is an open-source LLMOps platform that unifies an LLM gateway, observability, evaluation, optimization, and experimentation.
Cloud-native search engine for observability. An open-source alternative to Datadog, Elasticsearch, Loki, and Tempo.
The AI developer platform. Use Weights & Biases to train and fine-tune models, and manage models from experimentation to production.
A truly Open Source MongoDB alternative
Automated developer-oriented status page with alerting and incident support
Scalable real-time messaging server in a language-agnostic way. Self-hosted alternative to Pubnub, Pusher, Ably, socket.io, Phoenix.PubSub, SignalR. Set up once and forever.
Build, Manage and Deploy AI/ML Systems
Open-source cron job and background task monitoring service, written in Python & Django
Run, manage, and scale AI workloads on any AI infrastructure. Use one system to access & manage all AI compute (Kubernetes, Slurm, 20+ clouds, on-prem).
Prometheus Operator creates/configures/manages Prometheus clusters atop Kubernetes
AutoMQ is a diskless Kafka® on S3. 10x Cost-Effective. No Cross-AZ Traffic Cost. Autoscale in seconds. Single-digit ms latency. Multi-AZ Availability.
Open source transactional distributed database. Linear scalability and proven fault-tolerance on commodity hardware or cloud infrastructure without compromising performance.
Real-time Geospatial and Geofencing
Resolve production issues, fast. An open source observability platform unifying session replays, logs, metrics, traces and errors powered by ClickHouse and OpenTelemetry.
SeaTunnel is a multimodal, high-performance, distributed, massive data integration tool.
Deeplake is AI Data Runtime for Agents. It provides serverless postgres with a multimodal datalake, enabling scalable retrieval and training.
Event streaming platform for agentic AI. Continuously ingest, transform, and serve event streams in real time, at scale.
An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs
Build, run, and manage data pipelines for integrating and transforming data.
Apache DataFusion SQL Query Engine
The easiest way to serve AI apps and models - Build Model Inference APIs, Job queues, LLM apps, Multi-model pipelines, and more!
Apache Beam is a unified programming model for Batch and Streaming data processing.
CloudNativePG is a comprehensive platform designed to seamlessly manage PostgreSQL databases within Kubernetes environments, covering the entire operational lifecycle from initial deployment to ongoing maintenance
A Web UI for Elasticsearch and OpenSearch: Import, browse and edit data with rich filters and query views, create reference search UIs.
Prometheus Alertmanager
SoTA production-ready AI retrieval system. Agentic Retrieval-Augmented Generation (RAG) with a RESTful API.
Enterprise job scheduling middleware with distributed computing ability.
Incremental engine for long-horizon agents.
Coroot is an open-source observability and APM tool with AI-powered Root Cause Analysis. It combines metrics, logs, traces, continuous profiling, and SLO-based alerting with predefined dashboards and inspections.
H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
Evidently is an open-source ML and LLM observability framework. Evaluate, test, and monitor any AI-powered system or data pipeline. From tabular data to Gen AI. 100+ metrics.
Arkime is an open source, large scale, full packet capturing, indexing, and database system.
An AI-powered next-generation open source real-time observability system.
Open-source observability for your GenAI or LLM application, based on OpenTelemetry
The Open Source Feature Store for AI/ML
Dynamic, resilient AI orchestration. Coordinate data, models, and compute as you build AI workflows.
OpenTelemetry Collector
Complete open-source monitoring and observability platform.
AI + Data, online. https://vespa.ai
Seamless multi-primary syncing database with an intuitive HTTP/JSON API, designed for reliability
Postgres with GPUs for ML/AI apps.
Database backup tool (PostgreSQL, MySQL\MariaDB and MongoDB)
Open Source, Google Zanzibar-inspired database for scalably storing and querying fine-grained authorization data
ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution
Web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala and more.
Cost monitoring for Kubernetes workloads and cloud costs
Docker hosts and containers monitoring with Prometheus, Grafana, cAdvisor, NodeExporter and AlertManager
Instant Kubernetes-Native Application Observability
Flink CDC is a streaming data integration tool
Business intelligence as code: build fast, interactive data visualizations in SQL and markdown
The open-source Observability 2.0 database. One engine for metrics, logs, and traces — replacing Prometheus, Loki & ES.
Kafka Web UI
Aim — An easy-to-use & supercharged open-source experiment tracker.
Apache Hive
A horizontally scalable, highly available, multi-tenant, long term Prometheus.
Apache Kafka® running on Kubernetes
Self-hosted, easily-deployable monitoring and alerts service - like a lightweight PagerDuty
Real-time monitoring for Proxmox, Docker, and Kubernetes with AI-powered insights, smart alerts, and a beautiful unified dashboard
Standardized Distributed Generative and Predictive AI Inference Platform for Scalable, Multi-Framework Deployment on Kubernetes
ZenML : One AI Platform from Pipelines to Agents. https://zenml.io.
lakeFS - Data version control for your data lake | Git for data
Open Source Cloud Native Application Protection Platform (CNAPP)
The canonical components compose into very different production systems. Each diagram below is a real combination D4L customers run today. Swipe or scroll horizontally; click a number above to jump.
Three core sites and three secondary facilities, interconnected on private fibre and federated under one Kubernetes control plane. Egress is wiring, not a billable event.
The rules D4L holds itself to so the customer doesn't have to carry the risk. Each one is the inverse of a way SaaS vendors have hurt their customers in the last decade.
Apache 2.0, BSD, MIT, MPL 2.0, LGPL 2.1, PostgreSQL. Never SSPL, BUSL, RSAL, or AGPL. The license is what determines whether the platform you walk away with is yours to operate elsewhere. We constrain ourselves to permissive and weak-copyleft so portability is a property of the stack, not a promise from the vendor.
Fixed monthly or annual billing against owned hardware. No per-seat licensing. No per-query metering. No per-GB egress. A backfill is free. The bill in March is the bill in December. We over-provision so you do not pay a tax for being efficient.
Every D4L surface is reachable through an industry-standard protocol: S3, JDBC, REST, OIDC, CQL, PromQL. There is no proprietary client to install. The day you decide to leave us, the platform you walk away with is recognisable to anyone who has read the Apache documentation.
A fair question we have heard more than once: isn't D4L just reselling free software? An honest accounting of what you are paying for, who actually wrote the code, and where the credit (and the money) belongs.
The honest answer is no. D4L charges for operational labour and the iron the software runs on. The OSS projects themselves are free, and they remain free for any customer who wants to take them off our hardware and run them somewhere else. What we sell is the on-call engineer at 4 a.m., the Kubernetes upgrade pathway, the Ceph rebalance under load, the lease at the data centre, the disks themselves, and 25 years of running these systems in production. Not the bits on disk.
Compare that to most enterprise SaaS. The vast majority of commercial data products either fully repackage open source — Confluent is Kafka, AWS RDS is Postgres / MySQL / MariaDB, Elastic Cloud is Elasticsearch, MongoDB Atlas is MongoDB, Datadog runs on a FOSS stack — or use OSS for major components, with little or no credit to the upstream projects on the marketing site. Nearly all commercial software is built on, with, or against open source: from compilers to kernels to TLS stacks to format parsers. The exception is the rare pure-proprietary green-field, and even that ships in a Linux container.
D4L does not re-brand, hide, or obfuscate the OSS we run. Every component is named on this page — NiFi, Trino, Postgres, Cassandra, OpenSearch, Kafka, Iceberg, DataHub, Keycloak, Kubernetes, plus the 130+ projects in the Explorer above. Every license is shown. Every upstream is one click away.
If you are profiting from the heavy use of any of these projects, donate to them. OSS thrives on three things: contributors, popularity, and money. D4L provides the second by name on every customer engagement. The third is yours.