Let’s Flink on EKS: Data Lake Primer

by Howard Hill
October 22, 2023

This post is a getting-started guide intended to help engineers set up an open data lake infrastructure for real-time processing.

Data is established as the driving force behind many industries today, and having a modern data architecture is pivotal for organizations to be successful. One key component that plays a central role in modern data architectures is the data lake, which allows organizations to store and analyze large amounts of data cost-effectively and run advanced analytics and machine learning (ML) at scale.

Here at OpenCredo we love projects based around Kafka and/or Data/Platform Engineering; in one of our recent projects, we built an open data lake using Kafka, Flink, Nessie and Iceberg. This first part of the blog covers the Flink and S3 infrastructure design.

Apache Flink is designed for distributed stream and batch processing, handling both real-time and historical data. It integrates well with the Hadoop and Presto ecosystems, allowing it to use distributed storage systems such as HDFS or AWS S3 as the storage layer.

Architecture

Flink excels at processing streaming data, providing low-latency performance and advanced windowing functions. It has evolved from version 1.4 to 1.17 and now includes a Kubernetes Operator, which makes managing jobs and tasks considerably easier.
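To show how the operator can be installed from the same infrastructure code, here is a minimal Terraform sketch using the Helm provider. The resource name, namespace, chart repository URL and version are illustrative assumptions; check the Flink Kubernetes Operator release pages for the values that match your cluster.

# Minimal sketch: install the Flink Kubernetes Operator on EKS via Helm.
# The repository URL and chart version are assumptions -- pin them to the
# operator release you actually intend to run.
resource "helm_release" "flink_kubernetes_operator" {
  name             = "flink-kubernetes-operator"
  namespace        = "flink"
  create_namespace = true

  # Apache publishes the Helm chart under a version-specific URL (example version shown).
  repository = "https://downloads.apache.org/flink/flink-kubernetes-operator-1.6.0/"
  chart      = "flink-kubernetes-operator"

  # The operator's admission webhook expects cert-manager to be present in the cluster.
  set {
    name  = "webhook.create"
    value = "true"
  }
}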

For this solution our data lake follows a medallion architecture, with each bucket containing bronze, silver and gold folders. We provisioned it using Terraform.
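As an illustration of what the bucket layout could look like inside such a module, here is a hedged Terraform sketch; the resource and bucket names are hypothetical, and since S3 has no real directories the medallion layers are simply empty key prefixes.

# Hypothetical per-tenant medallion bucket; the naming convention is an assumption.
resource "aws_s3_bucket" "datalake" {
  bucket = "datalake-${var.tenant_id}-${var.region}"
}

# S3 has no real folders, so bronze/silver/gold are zero-byte objects acting as prefixes.
resource "aws_s3_object" "medallion_layers" {
  for_each = toset(["bronze/", "silver/", "gold/"])

  bucket  = aws_s3_bucket.datalake.id
  key     = each.value
  content = ""
}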

Terraform was orchestrated with Terragrunt to handle multiple tenants, where a tenant is the owner of the data. The main acceptance criteria are to classify the data and segment it by region; access to the buckets is secured through a Virtual Private Cloud (VPC) or VPC endpoint (VPCe). We expect applications to be deployed inside a VPC, in this case an EKS cluster running the Flink apps.
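A hedged sketch of the kind of bucket policy such a module could attach to enforce VPC endpoint access is shown here; the statement, variable and resource names are placeholders, and the Terragrunt inputs that drive the module for a tenant follow below.

# Illustrative only: deny access to the tenant bucket unless the request
# arrives through an approved VPC endpoint. Names and variables are placeholders.
resource "aws_s3_bucket_policy" "vpce_only" {
  bucket = aws_s3_bucket.datalake.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Sid       = "DenyUnlessThroughVPCe"
      Effect    = "Deny"
      Principal = "*"
      Action    = "s3:*"
      Resource = [
        aws_s3_bucket.datalake.arn,
        "${aws_s3_bucket.datalake.arn}/*",
      ]
      Condition = {
        StringNotEquals = {
          "aws:SourceVpce" = var.vpc_endpoint_ids # hypothetical module variable
        }
      }
    }]
  })
}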

terraform {
  source = "git::git@github.com:opencredo/terraform-modules.git//s3_datalake"
}

inputs = {
  region              = local.environment_vars.region_name # eu-west
  tenant_id           = "897823709432"                     # some randomized id
  data_classification = ["GDPR"]
  account_id          = "614871886104"                     # aws account
  vpc_ids             = ["vpc-18d8eee21dfcf1807"]