Apache NiFi
Visual, role-based pipelines with built-in lineage. Real-time and scheduled acquisition, transformation, and routing across heterogeneous systems.
D4L™ · By Deasil Works, Inc. · 6 U.S. facilities · Bare metal · K8s
D4L is a private cloud built entirely on permissively-licensed open source: Apache NiFi, Trino, Postgres, Cassandra, OpenSearch, Kafka, Iceberg, Ceph, Superset, DataHub, plus the broader catalog further down. We operate it on bare metal Deasil owns across 6 U.S. facilities. The same operational ease you bought Snowflake or Tableau for, at a fraction of the price, with no per-seat tax and no SaaS vendor between you and your data. The pipelines, dashboards, and SQL belong to you. Moving D4L off our iron is a configuration change, not a migration project.
Closed-source data products market on the promise of stability. What you actually buy is a vendor that hides its failures, deprecations, support backlog, and acquisition risk behind a status page. Deasil has been on the other side of those tickets for 25 years. We know what the box hides because we run the equivalent OSS stack ourselves, in the open, on-call.
The OSS projects in the canonical D4L stack run in production at companies whose engineering scale dwarfs any SaaS vendor's. Postgres runs Bloomberg, Reddit, Apple, and the largest Atlassian Jira tenant. Apache Cassandra runs Apple iCloud, Netflix, and Discord. Trino is what Netflix and Bloomberg query Iceberg with. OpenSearch is what AWS itself runs. Apache Kafka is what LinkedIn, Walmart, and Uber move events on. The notion that these projects need a SaaS vendor wrapped around them to be "production ready" is upside down: they are more battle-tested than the proprietary layers built on top of them.
What the SaaS box actually delivers is opacity. When Snowflake has an incident you read a status page. When Tableau deprecates a feature you read a release note. When Salesforce raises prices 9% you read a press release. When Looker is acquired by Google you read a blog post. None of these are inherent to the data system. They are inherent to the relationship.
And the features SaaS markets are usually the surface. Apache Superset ships more chart types and more native database connectors than Tableau. Trino federates more sources than Snowflake natively. OpenSearch's k-NN matches Elastic Cloud at typical RAG scale. NiFi's visual provenance is richer than Fivetran's lineage. Once the operations are taken care of, the OSS stack is usually the more capable one, not the compromise.
The cost of "convenience" is visible line by line. Each commercial product on the left is paired with the OSS project D4L would run in its place on the right, along with the portability and pricing trade you get when you switch.
No per-seat tax, no Salesforce shareholder. Dashboards are JSON in your git, not artifacts in a vendor cloud.
Same SQL surface, federated across Postgres / Cassandra / OpenSearch. Iceberg is the format Snowflake itself now reads.
The components Databricks repackages, run directly. Notebooks live in JupyterHub. Models live in MLflow. Pipelines live in Airflow.
Visual pipelines with provenance, role-based access, and lineage. The flow files are XML you keep. Moving NiFi off D4L is moving a config.
A fully Apache-2.0 observability stack: metrics, logs, dashboards. No per-host meter and no AGPL network-copyleft surprise (which is why Grafana and Loki are not in the canonical D4L pick).
The S3 API, not the S3 bill. We mount disks once. Egress is a network wire, not a billable event.
Kafka itself is still Apache. We can swap to Redpanda for the same wire protocol with less ops surface.
Same Lucene under the hood. Since 2024, OpenSearch has had its own governance, its own foundation, and its own release cadence.
OIDC, OAuth 2.0, SAML 2.0, 2FA. Federate with whatever IdP your org already runs. No MAU meter.
pgvector is a column type on the database you already run. No second cluster, no separate billing rail.
The same DAGs, the same operators, the same UI. The bill is the cluster, not the project.
The Linux Foundation forks of Terraform and Vault. Same HCL, same APIs, no BSL. The drop-in is real.
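The drop-in claim is concrete enough to sketch. The module below is illustrative, not a D4L configuration — provider, version constraint, and resource names are assumptions — but the point it makes is real: the same HCL runs unchanged under `terraform apply` and `tofu apply`.

```hcl
# Identical under Terraform and OpenTofu: same HCL syntax, same
# state format, same provider registry protocol. Nothing to port.
terraform {
  required_providers {
    kubernetes = {
      source  = "hashicorp/kubernetes"
      version = "~> 2.0"
    }
  }
}

resource "kubernetes_namespace" "analytics" {
  metadata {
    name = "analytics"
  }
}
```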
If you're paying for one of these, send us your last invoice. We'll come back within a week with a hardware spec, an OSS replacement plan, and a fixed monthly number.
D4L runs any modern, permissively licensed OSS data application your team has standardized on, configured to your spec, on iron we own, billed at one fixed monthly number. The 18 components below are the canonical D4L stack: a Sample Deployment we ship when a customer has no strong preference. The Explorer that follows lets you filter the broader universe of permissively licensed OSS data tools D4L will operate on request.
Visual, role-based pipelines with built-in lineage. Real-time and scheduled acquisition, transformation, and routing across heterogeneous systems.
One SQL surface across PostgreSQL, Ceph (S3), OpenSearch, Cassandra, and Hive. Federated queries at petabyte scale without copying data.
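What "one SQL surface" means in practice is a single query that spans sources. A hedged sketch — the catalog names (`postgres`, `cassandra`) and the schema/table names are hypothetical, assuming those connectors are configured in Trino:

```sql
-- Join a relational orders table against a Cassandra event table
-- in one statement; Trino pushes work down to each source and
-- joins the results without first copying data into a warehouse.
SELECT o.customer_id,
       count(e.event_id) AS events_last_30d
FROM postgres.public.orders AS o
JOIN cassandra.telemetry.events AS e
  ON e.customer_id = o.customer_id
WHERE e.event_time > current_timestamp - INTERVAL '30' DAY
GROUP BY o.customer_id;
```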
A web-native, open-source replacement for Tableau and Looker. Charts, dashboards, geospatial. Backed by Trino and PostgreSQL.
Per-user JupyterLab environments with Python 3, R, Julia, Octave, Bash kernels and the standard data-science stack pre-installed.
Twenty-plus years of community development. Full SQL, JSONB, extensions. The reliable spine of nearly every D4L deployment.
Linear scalability, fault-tolerance proven on commodity hardware. The wide-column store under mission-critical write paths.
Apache 2.0 search and analytics for application search, log analytics, and observability. Lucene under the hood. No licensing trapdoors.
SQL over distributed storage with a schema-on-read mindset. The catalog and metastore Trino reads to do its job.
Petabyte-scale, S3-API object storage for structured and unstructured data. Objects up to five terabytes each (the S3 API limit); total capacity scales horizontally toward near-limitless.
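Because the surface is the standard S3 API, any existing S3 client points at it with a one-line endpoint change. A sketch of an rclone remote — the endpoint host and credentials here are placeholders, not real D4L values:

```ini
; ~/.config/rclone/rclone.conf — the only D4L-specific detail is
; the endpoint; everything else is ordinary S3 client config.
[d4l]
type = s3
provider = Ceph
access_key_id = YOUR_ACCESS_KEY
secret_access_key = YOUR_SECRET_KEY
endpoint = https://s3.example-d4l.internal
```

After that, `rclone ls d4l:my-bucket` works the same way it would against AWS — minus the egress meter.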
The metadata platform across the D4L stack. Asset discovery, column-level lineage, ownership, glossary, and data-product modelling for everything we run (NiFi, Trino, Postgres, OpenSearch, dbt, Airflow). Open-source, originally built at LinkedIn, Apache 2.0 throughout.
OpenID Connect, OAuth 2.0, SAML 2.0 with 2FA. One identity surface across the platform. Federate with your existing IdP if you have one.
Durable event log behind everything that needs an audit trail or a real-time pipeline. Confluent Cloud bills per throughput and per partition; Apache Kafka itself does not.
Python-defined DAGs for ETL and scheduled jobs. Astronomer and AWS MWAA charge per-environment and per-worker for what is otherwise a free Apache project.
The dashboard layer over OpenSearch and Prometheus. The Apache-2.0 fork of Kibana, kept permissive when Elastic moved Kibana to SSPL in 2021. Replaces Grafana in the canonical D4L stack because Grafana itself moved to AGPL the same year.
Pull-based metrics collection and alerting. The standard the entire CNCF ecosystem speaks. Cloud-managed equivalents charge per-metric per-month; the protocol is the same.
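The pull model is a few lines of YAML. A minimal, hypothetical scrape config — job names and target addresses are illustrative only:

```yaml
# prometheus.yml — Prometheus pulls /metrics from each target on
# its own schedule; nothing here is metered or vendor-specific.
global:
  scrape_interval: 30s

scrape_configs:
  - job_name: "node"
    static_configs:
      - targets: ["10.0.0.11:9100", "10.0.0.12:9100"]
  - job_name: "kafka"
    static_configs:
      - targets: ["kafka-0.example:7071"]
```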
The open table format that won. Default for AWS Athena, Glue, and EMR by 2024. Used here as the lakehouse table layer over Ceph so Trino, Spark, and Flink read the same bytes.
PostgreSQL extension for vector similarity. Replaces Pinecone-class managed vector DBs with a column type on the database you already trust. No per-vector pricing, no separate cluster to operate.
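A sketch of what "a column type" means in practice. The table, the 768-wide dimension, and the elided query literal are all hypothetical; the operators and index type are pgvector's own:

```sql
CREATE EXTENSION IF NOT EXISTS vector;

-- Embeddings live next to the rows they describe; no second system.
CREATE TABLE documents (
    id        bigserial PRIMARY KEY,
    body      text,
    embedding vector(768)   -- dimension matches your embedding model
);

-- Nearest-neighbour search by cosine distance (the <=> operator),
-- with an HNSW index so the scan does not go linear at scale.
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops);

SELECT id, body
FROM documents
ORDER BY embedding <=> '[...]'::vector   -- query embedding, elided
LIMIT 10;
```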
The substrate everything else runs on. Deployments, scaling, secrets, networking, storage classes. Your platform with grown-up controls.
Curated from the GitHub universe at >5,000 stars and a permissive or weak-copyleft license (Apache 2.0, MIT, BSD, MPL 2.0, LGPL 2.1, ISC, PostgreSQL). No SSPL, BUSL, RSAL, or AGPL. D4L will configure and operate any of them, on the same iron, under the same fixed monthly bill. Bring this list to a vendor and ask them to do the same.
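The curation rule is mechanical enough to sketch in a few lines of Python. The repo records below are invented for illustration — the real catalog is not embedded here — but the star floor and the license allow/deny lists are the ones stated above:

```python
# Sketch of the Explorer's curation rule: >5,000 stars and a
# permissive or weak-copyleft license. Repo records are invented
# examples, not the real catalog.
ALLOWED_LICENSES = {
    "Apache-2.0", "MIT", "BSD-3-Clause", "BSD-2-Clause",
    "MPL-2.0", "LGPL-2.1", "ISC", "PostgreSQL",
}
EXCLUDED_LICENSES = {"SSPL-1.0", "BUSL-1.1", "RSALv2", "AGPL-3.0"}
STAR_FLOOR = 5_000

def qualifies(repo: dict) -> bool:
    """A repo makes the Explorer if it clears the star floor and
    carries an allowed license; excluded licenses never qualify."""
    if repo["license"] in EXCLUDED_LICENSES:
        return False
    return repo["stars"] > STAR_FLOOR and repo["license"] in ALLOWED_LICENSES

repos = [
    {"name": "example-oss-db",    "stars": 42_000, "license": "Apache-2.0"},
    {"name": "example-sspl-db",   "stars": 90_000, "license": "SSPL-1.0"},
    {"name": "example-small-lib", "stars": 1_200,  "license": "MIT"},
]

catalog = [r["name"] for r in repos if qualifies(r)]
print(catalog)  # only the Apache-2.0 repo clears both gates
```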
Semantic data platform MCP server that composes multiple data tools with bidirectional cross-injection. Tool responses automatically include critical context from other services.
DataHub MCP server and composable Go library for building custom MCP servers that integrate DataHub metadata capabilities. Part of the txn2 MCP toolkit.
Model Context Protocol server for Trino. Lets AI assistants query and explore data warehouses through the Trino federated SQL surface.
Model Context Protocol server for Amazon S3 and S3-compatible object storage. Enables AI assistants to browse buckets, read and write objects, and generate presigned URLs.
MCP test server for validating Model Context Protocol integrations. Used internally for D4L MCP toolkit conformance.
The Postgres development platform. Supabase gives you a dedicated Postgres database to build your web, mobile, and AI applications.
A fancy self-hosted monitoring tool
Apache Superset is a Data Visualization and Data Exploration Platform
The Prometheus monitoring system and time series database.
Distributed reliable key-value store for the most critical data of a distributed system
ClickHouse® is a real-time analytics database management system
Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
Streamlit — A faster way to build and share data apps.
Milvus is a high-performance, cloud-native vector database built for scalable vector ANN search
Apache Spark - A unified analytics engine for large-scale data processing
Build and share delightful machine learning apps, all in Python.
TiDB is built for agentic workloads that grow unpredictably, with ACID guarantees and native support for transactions, analytics, and vector search. No data silos. No noisy neighbors. No infrastructure ceiling.
Umami is a modern, privacy-focused analytics platform. An open-source alternative to Google Analytics, Mixpanel and Amplitude.
Apache Kafka - A distributed event streaming platform
SeaweedFS is a distributed storage system for object storage (S3), file systems, and Iceberg tables, designed to handle billions of files with O(1) disk access and effortless horizontal scaling.
Scalable datastore for metrics, events, and real-time analytics
Qdrant - High-performance, massive-scale Vector Database and Vector Search Engine for the next generation of AI.
Make Your Company Data Driven. Connect to any data source, easily visualize, dashboard and share your data.
Search infrastructure for AI
Label Studio is a multi-type data labeling and annotation tool with standardized output format
2.3x faster than MinIO for 4KB object payloads. RustFS is an open-source, S3-compatible high-performance object storage system supporting migration and coexistence with other S3-compatible platforms such as MinIO and Ceph.
Event Driven Orchestration & Scheduling Platform for Mission Critical Applications
Apache Flink
The open source AI engineering platform for agents, LLMs, and ML models. MLflow enables teams of all sizes to debug, evaluate, monitor, and optimize production-quality AI applications while controlling costs and managing access to models and data.
APM, Application Performance Monitoring System
eBPF-based Networking, Security, and Observability
Data Apps & Dashboards for Python. No JavaScript Required.
CNCF Jaeger, a Distributed Tracing Platform
Dolt – Git for Data
Apache RocketMQ is a cloud native messaging and streaming platform, making it simple to build event-driven applications.
Prefect is a workflow orchestration framework for building resilient data pipelines in Python.
A high-performance observability data pipeline.
Neon: Serverless Postgres. We separated storage and compute to offer autoscaling, code-like database branching, and scale to zero.
high-performance graph database for real-time use cases
Open-source IoT Platform - Device management, data collection, processing and visualization.
Fast, lightweight & schema-less search backend. An alternative to Elasticsearch that runs on a few MBs of RAM.
A reactive notebook for Python — run reproducible experiments, query with SQL, execute as a script, deploy as an app, and version with git. Stored as pure Python. All in a modern, AI-native editor.
Empowering Data Intelligence with Distributed SQL for Sharding, Scalability, and Security Across All Databases.
GoAccess is a real-time web log analyzer and interactive viewer that runs in a terminal in *nix systems or through your browser.
Zipkin is a distributed tracing system
VictoriaMetrics: fast, cost-effective monitoring solution and time series database
QuestDB is a high performance, open-source, time-series database
Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
The official home of the Presto distributed SQL query engine for big data
Workflow Engine for Kubernetes
Weaviate is an open-source vector database that stores both objects and vectors, allowing for the combination of vector search with structured filtering with the fault tolerance and scalability of a cloud-native database.
An orchestration platform for the development, production, and observation of data assets.
Apache Doris is an easy-to-use, high performance and unified analytics database.
Apache Pulsar - distributed pub-sub messaging system
MCP Toolbox for Databases is an open source MCP server for databases.
Apache DolphinScheduler is the modern data orchestration platform. Agile to create high performance workflow with low-code
Gel supercharges Postgres with a modern data model, graph queries, Auth & AI solutions, and much more.
Highly available Prometheus setup with long term storage capabilities. A CNCF Incubating project.
APM, (Application Performance Management) tool for large-scale distributed systems.
OpenMetadata is a unified metadata platform for data discovery, data observability, and data governance powered by a central metadata repository, in-depth column level lineage, and seamless team collaboration.
JuiceFS is a distributed POSIX file system built on top of Redis and S3.
Proxy based Redis cluster solution supporting pipeline and scaling dynamically
Nightingale is to monitoring and alerting what Grafana is to visualization.
Open source distributed and RESTful search engine.
Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
Change data capture for a variety of databases.
Run any open-source LLMs, such as DeepSeek and Llama, as OpenAI compatible API endpoint in the cloud.
A distributed, fast open-source graph database featuring horizontal scalability and high availability
Open-Source Web UI for Apache Kafka Management
CMAK is a tool for managing Apache Kafka clusters
eBPF-powered network observability for Kubernetes. Indexes L4/L7 traffic with full K8s context, decrypts TLS without keys. Queryable by AI agents via MCP and humans via dashboard.
The Context Platform for your Data and AI Stack
The world's fastest open query engine for sub-second analytics both on and off the data lakehouse. With the flexibility to support nearly any scenario, StarRocks provides best-in-class performance for multi-dimensional analytics, real-time analytics, and ad-hoc queries. A Linux Foundation project.
A self-hosted, ad-free, privacy-respecting metasearch engine
Always know what to expect from your data.
TensorZero is an open-source LLMOps platform that unifies an LLM gateway, observability, evaluation, optimization, and experimentation.
Cloud-native search engine for observability. An open-source alternative to Datadog, Elasticsearch, Loki, and Tempo.
The AI developer platform. Use Weights & Biases to train and fine-tune models, and manage models from experimentation to production.
A truly Open Source MongoDB alternative
Automated developer-oriented status page with alerting and incident support
Scalable real-time messaging server in a language-agnostic way. Self-hosted alternative to Pubnub, Pusher, Ably, socket.io, Phoenix.PubSub, SignalR. Set up once and forever.
Build, Manage and Deploy AI/ML Systems
Open-source cron job and background task monitoring service, written in Python & Django
Run, manage, and scale AI workloads on any AI infrastructure. Use one system to access & manage all AI compute (Kubernetes, Slurm, 20+ clouds, on-prem).
Prometheus Operator creates/configures/manages Prometheus clusters atop Kubernetes
AutoMQ is a diskless Kafka® on S3. 10x Cost-Effective. No Cross-AZ Traffic Cost. Autoscale in seconds. Single-digit ms latency. Multi-AZ Availability.
Open source transactional distributed database. Linear scalability and proven fault-tolerance on commodity hardware or cloud infrastructure without compromising performance.
Real-time Geospatial and Geofencing
Resolve production issues, fast. An open source observability platform unifying session replays, logs, metrics, traces and errors powered by ClickHouse and OpenTelemetry.
SeaTunnel is a multimodal, high-performance, distributed, massive data integration tool.
Deeplake is AI Data Runtime for Agents. It provides serverless postgres with a multimodal datalake, enabling scalable retrieval and training.
Event streaming platform for agentic AI. Continuously ingest, transform, and serve event streams in real time, at scale.
An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs
Build, run, and manage data pipelines for integrating and transforming data.
Apache DataFusion SQL Query Engine
The easiest way to serve AI apps and models - Build Model Inference APIs, Job queues, LLM apps, Multi-model pipelines, and more!
Apache Beam is a unified programming model for Batch and Streaming data processing.
CloudNativePG is a comprehensive platform designed to seamlessly manage PostgreSQL databases within Kubernetes environments, covering the entire operational lifecycle from initial deployment to ongoing maintenance
A Web UI for Elasticsearch and OpenSearch: Import, browse and edit data with rich filters and query views, create reference search UIs.
Prometheus Alertmanager
SoTA production-ready AI retrieval system. Agentic Retrieval-Augmented Generation (RAG) with a RESTful API.
Enterprise job scheduling middleware with distributed computing ability.
Incremental engine for long-horizon agents.
Coroot is an open-source observability and APM tool with AI-powered Root Cause Analysis. It combines metrics, logs, traces, continuous profiling, and SLO-based alerting with predefined dashboards and inspections.
H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
Evidently is an open-source ML and LLM observability framework. Evaluate, test, and monitor any AI-powered system or data pipeline. From tabular data to Gen AI. 100+ metrics.
Arkime is an open source, large scale, full packet capturing, indexing, and database system.
An AI-powered next-generation open source real-time observability system.
Open-source observability for your GenAI or LLM application, based on OpenTelemetry
The Open Source Feature Store for AI/ML
Dynamic, resilient AI orchestration. Coordinate data, models, and compute as you build AI workflows.
OpenTelemetry Collector
Complete open-source monitoring and observability platform.
AI + Data, online. https://vespa.ai
Seamless multi-primary syncing database with an intuitive HTTP/JSON API, designed for reliability
Postgres with GPUs for ML/AI apps.
Database backup tool (PostgreSQL, MySQL\MariaDB and MongoDB)
Open Source, Google Zanzibar-inspired database for scalably storing and querying fine-grained authorization data
ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution
Web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala and more.
Cost monitoring for Kubernetes workloads and cloud costs
Docker hosts and containers monitoring with Prometheus, Grafana, cAdvisor, NodeExporter and AlertManager
Instant Kubernetes-Native Application Observability
Flink CDC is a streaming data integration tool
Business intelligence as code: build fast, interactive data visualizations in SQL and markdown
The open-source Observability 2.0 database. One engine for metrics, logs, and traces — replacing Prometheus, Loki & ES.
Kafka Web UI
Aim — An easy-to-use & supercharged open-source experiment tracker.
Apache Hive
A horizontally scalable, highly available, multi-tenant, long term Prometheus.
Apache Kafka® running on Kubernetes
Self-hosted, easily-deployable monitoring and alerts service - like a lightweight PagerDuty
Real-time monitoring for Proxmox, Docker, and Kubernetes with AI-powered insights, smart alerts, and a beautiful unified dashboard
Standardized Distributed Generative and Predictive AI Inference Platform for Scalable, Multi-Framework Deployment on Kubernetes
ZenML : One AI Platform from Pipelines to Agents. https://zenml.io.
lakeFS - Data version control for your data lake | Git for data
Open Source Cloud Native Application Protection Platform (CNAPP)
The canonical components compose into very different production systems. Each diagram below is a real combination D4L customers run today. Swipe or scroll horizontally; click a number above to jump.
Three core sites and three secondary facilities, interconnected on private fibre and federated under one Kubernetes control plane. Egress is wiring, not a billable event.
The rules D4L holds itself to so the customer doesn't have to carry the risk. Each one is the inverse of a way SaaS vendors have hurt their customers in the last decade.
Apache 2.0, BSD, MIT, MPL 2.0, LGPL 2.1, PostgreSQL. Never SSPL, BUSL, RSAL, or AGPL. The license is what determines whether the platform you walk away with is yours to operate elsewhere. We constrain ourselves to permissive and weak-copyleft so portability is a property of the stack, not a promise from the vendor.
Fixed monthly or annual billing against owned hardware. No per-seat licensing. No per-query metering. No per-GB egress. A backfill is free. The bill in March is the bill in December. We over-provision so you do not pay a tax for being efficient.
Every D4L surface is reachable through an industry-standard protocol: S3, JDBC, REST, OIDC, CQL, PromQL. There is no proprietary client to install. The day you decide to leave us, the platform you walk away with is recognisable to anyone who has read the Apache documentation.
A fair question we have heard more than once: isn't D4L just reselling free software? An honest accounting of what you are paying for, who actually wrote the code, and where the credit (and the money) belongs.
The honest answer is no. D4L charges for operational labour and the iron the software runs on. The OSS projects themselves are free, and they remain free for any customer who wants to take them off our hardware and run them somewhere else. What we sell is the on-call engineer at 4 a.m., the Kubernetes upgrade pathway, the Ceph rebalance under load, the lease at the data centre, the disks themselves, and 25 years of running these systems in production. Not the bits on disk.
Compare that to most enterprise SaaS. The vast majority of commercial data products either fully repackage open source — Confluent is Kafka, AWS RDS is Postgres / MySQL / MariaDB, Elastic Cloud is Elasticsearch, MongoDB Atlas is MongoDB, Datadog runs on a FOSS stack — or use OSS for major components, with little or no credit to the upstream projects on the marketing site. Nearly all commercial software is built on, with, or against open source: from compilers to kernels to TLS stacks to format parsers. The exception is the rare pure-proprietary green-field, and even that ships in a Linux container.
D4L does not re-brand, hide, or obfuscate the OSS we run. Every component is named on this page — NiFi, Trino, Postgres, Cassandra, OpenSearch, Kafka, Iceberg, DataHub, Keycloak, Kubernetes, plus the 130+ projects in the Explorer above. Every license is shown. Every upstream is one click away.
If you are profiting from the heavy use of any of these projects, donate to them. OSS thrives on three things: contributors, popularity, and money. D4L provides the second by name on every customer engagement. The third is yours.