Automating Large-Scale ETL Migration with AI

From Talend to Python & Spark, without months of manual work

Machine Learning
Financials & BFSI

Our client needed to migrate approximately 300 data pipeline jobs from Talend, a proprietary ETL platform, to a modern Python and Spark-based stack. Manual migration would have required months of repetitive, error-prone work. We built NextGen, an AI-assisted migration tool that performs SQL-centric extraction and Java code simplification.

Implementation time: 2-3 months

Talend jobs processed: 300

Estimated reduction in migration time: 30%

Size of all jobs: 0.4 GB

The Opportunity

Talend migrations needed a scalable alternative to manual rewrites.

Opaque, auto-generated pipelines

Talend’s visual jobs compile into massive Java files with embedded SQL and complex variable substitutions, making the real business logic hard to inspect and extract.

Manual migration doesn’t scale

Reverse-engineering each job by hand requires hours of repetitive work per pipeline, turning large migrations into months of error-prone effort.

High risk at enterprise scale

At hundreds of jobs, small inconsistencies compound into operational risk, higher costs, and long delivery timelines that are difficult to justify or maintain.

The Solution

We built NextGen, an AI-assisted migration tool that supports two complementary migration approaches: SQL-centric extraction and Java logic simplification. Together, they automate the most time-consuming parts of the process while keeping engineers in control of validation and integration.

Approach 1 – SQL-Centric Migration

In the first approach, NextGen parses Talend XML files to extract embedded SQL queries. It cleans and normalizes the SQL, resolves variable substitutions, and sends only compact, high-signal SQL to Azure OpenAI for translation into Spark SQL and PySpark.

By avoiding raw Talend exports and auto-generated Java noise, this approach reduces prompt size, improves translation accuracy, and keeps LLM usage cost-effective.
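The extraction step described above can be sketched roughly as follows. This is a minimal illustration, not NextGen's actual implementation: the XML layout in `SAMPLE_ITEM` is a simplified stand-in for a Talend job file, and the function names, the `context.<name>` resolution rule, and the sample query are all assumptions made for the example.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, simplified job export: one input component carrying a QUERY
# parameter written as a Java string expression with a context variable.
SAMPLE_ITEM = '''<job>
  <node componentName="tOracleInput">
    <elementParameter name="UNIQUE_NAME" value="tOracleInput_1"/>
    <elementParameter name="QUERY" value="&quot;SELECT id,   amount FROM &quot; + context.schema + &quot;.payments WHERE amount &gt; 100&quot;"/>
  </node>
  <node componentName="tLogRow">
    <elementParameter name="UNIQUE_NAME" value="tLogRow_1"/>
  </node>
</job>'''

# Matches either a double-quoted Java string literal or a context.<name> reference.
TOKEN = re.compile(r'"((?:[^"\\]|\\.)*)"|context\.(\w+)')


def resolve_expression(expr: str, context: dict) -> str:
    """Flatten a Java string concatenation into plain SQL text.

    String literals are kept as-is (escape sequences are not decoded in this
    sketch); context variables are replaced from `context`, and unknown ones
    are left as a visible ${name} placeholder for a human to resolve.
    """
    parts = []
    for match in TOKEN.finditer(expr):
        literal, variable = match.group(1), match.group(2)
        if literal is not None:
            parts.append(literal)
        else:
            parts.append(context.get(variable, "${" + variable + "}"))
    return "".join(parts)


def extract_queries(item_xml: str, context: dict) -> dict:
    """Return {component name: normalized SQL} for every node with a QUERY."""
    root = ET.fromstring(item_xml)
    queries = {}
    for node in root.iter("node"):
        params = {
            p.get("name"): p.get("value", "")
            for p in node.iter("elementParameter")
        }
        if "QUERY" not in params:
            continue  # components without SQL are skipped entirely
        sql = resolve_expression(params["QUERY"], context)
        # Collapse whitespace runs; note this would also alter whitespace
        # inside SQL string literals, which a real tool should protect.
        queries[params.get("UNIQUE_NAME", "unknown")] = " ".join(sql.split())
    return queries


queries = extract_queries(SAMPLE_ITEM, {"schema": "finance"})
for name, sql in queries.items():
    print(name, "->", sql)
```

Only the resulting compact SQL strings would then be placed in the translation prompt, which is what keeps the prompt small compared with sending the raw export.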

Approach 2 – Java Logic Reduction & Translation

In the second approach, NextGen operates directly on the Talend-generated Java code. Instead of translating it as-is, the tool programmatically strips away redundant constructs and non-essential code, while preserving the original execution logic and data flow.

The resulting simplified Java representation is then sent to the LLM for translation into Python and Spark constructs.
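As a rough sketch of what such a reduction pass could look like, the snippet below filters bookkeeping lines and block comments out of a Java fragment before it would be sent for translation. The noise patterns, the sample fragment, and the `simplify_java` helper are illustrative assumptions, not the rules NextGen actually applies; a production tool would more likely work on a parsed syntax tree than on regular expressions.

```python
import re

# Hypothetical fragment in the style of generated ETL code: one line of real
# logic surrounded by status tracking, counters, and marker comments.
SAMPLE_JAVA = '''/**
 * [tMap_1 main ] start
 */
currentComponent = "tMap_1";
ok_Hash.put("tMap_1", false);
start_Hash.put("tMap_1", System.currentTimeMillis());
out1.total = row1.amount * row1.rate;
tos_count_tMap_1++;

if (out1.total > 0) {
    emit(out1);
}
'''

# Line-level patterns treated as non-essential in this sketch (assumed, not
# an exhaustive or authoritative list).
NOISE_PATTERNS = [
    re.compile(r"^\s*(ok_Hash|start_Hash|end_Hash)\.put\("),  # status maps
    re.compile(r"^\s*currentComponent\s*="),                  # component marker
    re.compile(r"^\s*tos_count_\w+"),                         # row counters
    re.compile(r"^\s*$"),                                     # blank lines
]

# Removes /* ... */ comments. This naive regex would also hit a "/*" inside
# a string literal, which is acceptable for a sketch but not for real use.
BLOCK_COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)


def simplify_java(source: str) -> str:
    """Drop comments and bookkeeping lines, keeping statement order intact."""
    without_comments = BLOCK_COMMENT.sub("", source)
    kept = [
        line.rstrip()
        for line in without_comments.splitlines()
        if not any(pattern.search(line) for pattern in NOISE_PATTERNS)
    ]
    return "\n".join(kept)


simplified = simplify_java(SAMPLE_JAVA)
print(simplified)
print(f"kept {len(simplified)} of {len(SAMPLE_JAVA)} characters")
```

Because lines are only removed, never reordered or rewritten, the surviving statements keep their original execution order, which is the property the approach depends on.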

Meet the team behind this project

Adriano Campinho

Data Scientist

Inês Ferreira

Senior Data Scientist

The Impact

NextGen turned a months-long manual migration into an automated, repeatable pipeline. Human review is still required, but the most time-consuming parts of the process are now automated.

Across nearly 300 jobs, the tool reduced overall migration time by approximately 30%, improved consistency across pipelines, and allowed teams to focus on higher-value engineering tasks instead of repetitive rewrites.

Book a call with us

Nuno Brás

Co-founder@DareData

Get to know your new AI & Data Partner
Book now


