fbpx

Blog

Building the Batch Processing Framework

Building the Batch Processing Framework

Written by Hugh Han

As a heavily-regulated fintech company, Valon takes service-level agreements (SLAs) very seriously. Many of these SLAs involve processing large sets of data, which is trickier at scale.

Back in 2021, we began exiting our minimum viable product phase and entered the very beginning of our hyper-growth phase. Specifically, we faced the problem of scaling from a few hundred loans to hundreds of thousands of loans. We sought a solution for our scaling requirements that didn’t sacrifice developer productivity.

1. Problem

While a few hundred thousand loans may not seem like a lot, consider that each loan is associated with numerous sets of entities supporting Valon’s complex business logic. Some of our business logic involves (but is certainly not limited to):

  • Supporting various financial accounts (e.g. consumer, escrow, investor) for each loan.
  • Handling payments and notifications for homeowners.
  • Handling money movements and reconciliation.
  • Supporting data ingestion and sanitation protocols.

So it turns out that a few hundred thousand loans results in the number of entities attached to each loan being of an order of magnitude of tens of millions.

1.1 Prior to Batch Processing

Before the Batch Processing framework existed, workloads that processed large datasets were executed serially. This implied increased workload latencies as our datasets scaled up, putting our critical SLAs at risk.

Before the Batch Processing framework existed, workloads that processed large datasets were executed serially. This implied increased workload latencies as our datasets scaled up, putting our critical SLAs at risk.

Even at small scale, the serial GenerateInvestorReportCron workload could take many hours.

2. Goals

We wanted to design a solution that allowed for the parallel execution of workloads. Consider a variation of the GenerateInvestorReportCron workload that uses a parallelized approach where each loan and underlying transaction could be processed concurrently.

This type of execution model allowed us to horizontally scale—at the infrastructure level—the number of workers responsible for processing these workloads, ensuring that we’ll continue to meet our SLAs as we scale.

3. Requirements

3.1 Performance and Reliability

The solution must allow for the performant and reliable processing of an arbitrarily large number of entities to support our growth—at minimum—for the next two years.

3.2 Tracking Workload Statuses

The solution must provide the ability to observe the execution status of a workload (e.g. INITIATED, EXECUTING, SUCCESS, FAILURE, or CANCELED). Such a feature allows for:

  • Real-time monitoring and observability of large workloads.
  • The ability to execute functions that depend on a completion status of the workload.
3.3 Developer Experience

The solution must provide an intuitive interface for developers. We believe that developer tools should empower engineering productivity, not hinder it. The ideal solution would allow engineers to:

  • Submit large workloads using the same core libraries used to build other services at Valon.
  • Deploy these workloads following the same deployment cadence as all other services at Valon.
  • Ephemerally store execution results (e.g. data) generated by these workloads.
  • Not worry about infrastructure complexity (e.g. memory or vCPU configurations).

4. Alternatives Considered

When considering the solution space, some questions we asked ourselves were:

  • Could multithreading or multiprocessing initiated at the application level fit our needs?
  • Could increasing the number of workers for a workload in an ad-hoc manner fit our needs?
  • Could a popular, open-source solution (e.g. Apache Beam, Apache Spark) fit our needs?

When diving a bit deeper, we quickly realized that while each of these alternative solutions seemed simple at the surface level, none of them was ideal.

Multithreading or multiprocessing initiated at the application level is constrained by the hardware that the application runs on. This means that application developers would need to be cognizant of how to optimally configure infrastructure-level details if multithreading or multiprocessing. Similarly, increasing the number of workers for a workload also requires knowledge of infrastructure-level details.

Open-source solutions would have prevented developers from submitting large workloads using the same core libraries or deployment cadences used by all other core services at Valon. Such solutions would have resulted in stand-alone builds, more complex deployment pipelines, and a steeper learning curve for application developers.

Note that each of these alternative solutions violates our requirements in 3.3 Developer Experience, and hence none of these alternatives was taken.

5. Concepts

The design of the Batch Processing framework revolved around two major entities:

  • Batch Process: a workload composed of many units (batch process components) of work.
  • Batch Process Component: a unit of work, many of which compose a workload (batch process).

6. Design

A high-level architecture diagram of the design of the first iteration of the Batch Processing framework is shown below.

There are a few core pieces to this design.

  • Service Interfaces: entry point for developers to invoke the Batch Processing framework
  • Internal Services + Message Queue: used to power execution of batch processes
  • Cron Jobs: used for eventual consistency
  • Database Tables: data stores for batch processing metadata

Each of the internal services would be able to autoscale up and down depending on the throughput of the incoming traffic, allowing application developers to not have to worry about infrastructure-level details.

6.1 Defining a Batch Process

Developers would be able to define a custom batch process by subclassing the following abstract base classes.

In abstract (no pun intended), developers would be required to answer the following questions in their subclass definitions:

  • What inputs are required for the workload to execute?
  • What inputs are required for each unit of work to execute? How are those inputs generated?
  • How should each unit of work execute?

Developers would be able to submit workloads to the Batch Processing service after defining a subclass.

6.2 Submitting a Batch Process

Submission of batch processes would be done via a submit function within the Service Interfaces. When this function is invoked, a batch process record would be persisted to the database, and a corresponding batch process message would be added to a message queue.

The message would then be received by the BatchProcessMessageHandler service, after which batch process components are generated and persisted to the database. A corresponding batch process component would be added to a message queue for each generated batch process component. Each batch process component message would then be received by the BatchProcessComponentMessageHandler service, where execution is done and results are ephemerally persisted.

A BatchProcessingConsistencyCron cron job would run on a schedule to advance the state of any batch process or batch process component for eventual consistency.

6.3 Retrieving a Batch Process

Developers could retrieve the status of a batch process via the get function within the Service Interfaces. This could be used to await batch processes via polling; after a batch process completes, dependent code could be executed by the client code.

6.4 Canceling a Batch Process

Developers could cancel a batch process via the cancel function within the Service Interfaces. When this function is invoked, the status of the batch process would be set to CANCELED, and all resulting batch process components would no-oped as soon as they are received from the message queue.

7. Example Usage

Consider the GenerateInvestorReportCron example from 2 Goals. A batch process used to execute a parallelized version could be defined as follows.

Afterwards, the following two crons could be implemented as follows.

  • ProcessInvestorReportResultsCron, which initiates the batch process.
  • GenerateInvestorReportCron, which generates the report using the results of the batch process.

8. Conclusion

Today, the Batch Processing framework is widely used at Valon by every engineering team with strict SLAs; it is at the core of almost every single one of our heavy workloads.

In this post, we’ve described the initial design of our Batch Processing framework: our in-house solution to processing large workloads at scale. Since the time of writing, we’ve made many improvements to this original design. Stay tuned for future blog posts on how we’ve improved upon the existing design.

Interested in joining us and working on the Batch Processing framework or other interesting challenges? We’re hiring!

9. Acknowledgements

Special thanks to Michael Morel, Paul Veevers, Sean Snyder, and Shobhit Srivastava for their contributions to the first iteration of the Batch Processing Framework.