Let’s Flink on EKS: Data Lake Primer

by Howard Hill
October 22, 2023

This post is a getting-started guide intended to help engineers set up an open data lake infrastructure for real-time processing.

Data is established as the driving force behind many industries today, and having a modern data architecture is pivotal for organizations to be successful. One key component that plays a central role in modern data architectures is the data lake, which allows organizations to store and analyze large amounts of data cost-effectively and run advanced analytics and machine learning (ML) at scale.

Here at OpenCredo we love projects based around Kafka and/or Data/Platform Engineering; in one of our recent projects, we built an open data lake using Kafka, Flink, Nessie and Iceberg. This first part of the blog covers the Flink and S3 infrastructure design.

Apache Flink is designed for distributed stream and batch processing, handling both real-time and historical data. It integrates well with the Hadoop and Presto ecosystems, allowing it to use distributed storage systems such as HDFS or AWS S3 as the storage layer.

Architecture

Flink excels at processing streaming data, providing low-latency performance and advanced windowing functions. It has evolved from version 1.4 to 1.17 and now includes a Kubernetes Operator, which makes managing jobs and tasks considerably easier.
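To show how the operator can be installed from the same infrastructure code, here is a minimal Terraform sketch using the Helm provider. The resource name, namespace, chart repository URL and version are illustrative assumptions; check the Flink Kubernetes Operator release pages for the values that match your cluster.

# Minimal sketch: install the Flink Kubernetes Operator on EKS via Helm.
# The repository URL and chart version are assumptions -- pin them to the
# operator release you actually intend to run.
resource "helm_release" "flink_kubernetes_operator" {
  name             = "flink-kubernetes-operator"
  namespace        = "flink"
  create_namespace = true

  # Apache publishes the Helm chart under a version-specific URL (example version shown).
  repository = "https://downloads.apache.org/flink/flink-kubernetes-operator-1.6.0/"
  chart      = "flink-kubernetes-operator"

  # The operator's admission webhook expects cert-manager to be present in the cluster.
  set {
    name  = "webhook.create"
    value = "true"
  }
}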

For this solution our data lake follows a medallion architecture, with each bucket containing bronze, silver and gold folders. We provisioned it using Terraform.
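As an illustration of what the bucket layout could look like inside such a module, here is a hedged Terraform sketch; the resource and bucket names are hypothetical, and since S3 has no real directories the medallion layers are simply empty key prefixes.

# Hypothetical per-tenant medallion bucket; the naming convention is an assumption.
resource "aws_s3_bucket" "datalake" {
  bucket = "datalake-${var.tenant_id}-${var.region}"
}

# S3 has no real folders, so bronze/silver/gold are zero-byte objects acting as prefixes.
resource "aws_s3_object" "medallion_layers" {
  for_each = toset(["bronze/", "silver/", "gold/"])

  bucket  = aws_s3_bucket.datalake.id
  key     = each.value
  content = ""
}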

Terraform was orchestrated with Terragrunt to handle multiple tenants, where a tenant is the owner of the data. The main acceptance criteria are to classify the data and segment it by region; access to the buckets is secured through a Virtual Private Cloud (VPC) or VPC endpoint (VPCe). We expect applications to be deployed inside a VPC, in this case an EKS cluster running the Flink apps.
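A hedged sketch of the kind of bucket policy such a module could attach to enforce VPC endpoint access is shown here; the statement, variable and resource names are placeholders, and the Terragrunt inputs that drive the module for a tenant follow below.

# Illustrative only: deny access to the tenant bucket unless the request
# arrives through an approved VPC endpoint. Names and variables are placeholders.
resource "aws_s3_bucket_policy" "vpce_only" {
  bucket = aws_s3_bucket.datalake.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Sid       = "DenyUnlessThroughVPCe"
      Effect    = "Deny"
      Principal = "*"
      Action    = "s3:*"
      Resource = [
        aws_s3_bucket.datalake.arn,
        "${aws_s3_bucket.datalake.arn}/*",
      ]
      Condition = {
        StringNotEquals = {
          "aws:SourceVpce" = var.vpc_endpoint_ids # hypothetical module variable
        }
      }
    }]
  })
}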

terraform {
  source = "git::git@github.com:opencredo/terraform-modules.git//s3_datalake"
}

inputs = {
  region              = local.environment_vars.region_name # eu-west
  tenant_id           = "897823709432"                     # some randomized id
  data_classification = ["GDPR"]
  account_id          = "614871886104"                     # aws account
  vpc_ids             = ["vpc-18d8eee21dfcf1807"]