AWS Partner Story: JPMC Bank

Project Summary

JPMC’s main requirement was a data ingestion solution that could collect data from more than 40 data sources with different schemas and formats and align it to a common convention, so that other services and pipelines could consume it for reporting and analysis. In short, they needed a data lake architecture. The ingestion pipeline had to be resilient to failures so that no incoming data was missed, and it had to handle data files containing thousands of records. Because the data includes personally identifiable information, it must always be encrypted in flight and at rest to be HIPAA compliant.

Challenges

  • JPMC required a solution that would let them not only ingest and store data but also analyse it and display analytics generated from a range of data sources with varied access protocols and formats, delivering insights that help departments give better investment support and advice.
  • The pipeline had to accommodate the millions of records uploaded for various transactions each month and store them in a data warehouse.

Benefits

  • With the new architecture, data storage and analysis happen in a single infrastructure.
  • Lambda processes millions of data points per month, scaling up and down dynamically in response to traffic.
  • The solution is cost effective.
  • Data insights are delivered more quickly.
  • New consumer services launch faster thanks to continuous data storage, on-demand data analysis, and dashboards with analysed data.

Partner Solution

Before implementing the solution, we needed answers to the following questions:

  • What will be the frequency of the incoming data?
  • How much information would need to be filtered in the early stages?
  • What kind of data formats would be necessary for processing?
  • What will the data’s hotkey or partition key be when it is accessed for
    reporting and analysis?

Our Solution

To keep our architecture primarily serverless, we used API Gateway, Lambda functions, S3 buckets, Redshift, Glue, Aurora, SageMaker, and SQS. This serverless design enables cost savings and increased scalability without the burden of infrastructure administration.

API Gateway provides multiple endpoints for storing, viewing, and retrieving analyzed data. Combined with Lambda functions and Glue, it can handle the amount of data that we collect each month. Redshift handles and stores the large-scale datasets.

We use CloudWatch to manage metadata: it stores pipeline metrics that let our Lambda functions make decisions at runtime, and it keeps track of the history of datasets obtained from source systems.

We use SQS as a dead-letter queue to manage pipeline failures. The queue retains messages and activates alerting services when a pipeline fails. We chose SageMaker because we have a large volume of data that has to be analyzed in order to extract the relevant information.

To keep costs down, we use Aurora to store the analyzed data.

Figure: JPMC case study mu stack

How API Gateway connects the whole architecture.

For data storage, the first API Gateway endpoint, which is the only entry point for storing data, is connected to an S3 bucket on which we have configured custom triggers. These triggers are queued in SQS, from which Lambda picks up the triggered jobs in batches. As part of our pipeline, an AWS Glue crawler automatically catalogs the source and target tables, and AWS Glue ETL jobs use these catalogs to retrieve data from S3 and publish it to Redshift.
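The batch step can be sketched as a Lambda handler fed by SQS. This is a minimal sketch, not the pipeline's actual code: it assumes the queued triggers are standard S3 event notifications, that the SQS event source mapping has `ReportBatchItemFailures` enabled, and that `process_object` is a hypothetical placeholder for the real per-file ingestion step.

```python
import json
from urllib.parse import unquote_plus


def process_object(bucket: str, key: str) -> None:
    """Hypothetical placeholder for the real per-file ingestion step."""
    if not key:
        raise ValueError("empty object key")


def ingest_handler(event, context=None, process=process_object):
    """Handle one SQS batch whose message bodies wrap S3 event notifications."""
    failures = []
    for message in event.get("Records", []):
        try:
            s3_event = json.loads(message["body"])
            for record in s3_event.get("Records", []):
                bucket = record["s3"]["bucket"]["name"]
                # S3 URL-encodes object keys in its event notifications.
                key = unquote_plus(record["s3"]["object"]["key"])
                process(bucket, key)
        except Exception:
            # Report only this message as failed so SQS redelivers it alone.
            failures.append({"itemIdentifier": message["messageId"]})
    return {"batchItemFailures": failures}
```

Returning the failed message IDs, rather than raising, means one bad file does not force the whole batch to be reprocessed.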

Accessing the Data.

In our pipeline, the second API Gateway endpoint retrieves data from Redshift through Lambda, so that we can run the necessary operations on the data before delivering it to the shared dashboard where users see the results.
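A read endpoint of this kind might look like the following sketch, assuming an API Gateway Lambda proxy integration. The `dataset` query parameter and the `run_query` function are hypothetical; `run_query` stands in for the real Redshift query and returns canned rows here.

```python
import json


def run_query(dataset: str) -> list:
    """Hypothetical stand-in for the Redshift query; returns canned rows."""
    return [{"dataset": dataset, "total": 0}]


def _response(status: int, payload: dict) -> dict:
    """Build a response in the shape the Lambda proxy integration expects."""
    return {
        "statusCode": status,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(payload),
    }


def read_handler(event, context=None, query=run_query):
    """Return rows for the requested dataset, or 400 if none was named."""
    # queryStringParameters is None when the request has no query string.
    params = event.get("queryStringParameters") or {}
    dataset = params.get("dataset")
    if not dataset:
        return _response(400, {"error": "missing 'dataset' query parameter"})
    return _response(200, {"dataset": dataset, "rows": query(dataset)})
```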

Utilising data for analysis

The next tool is SageMaker, which processes the data and analyses it to reveal hidden patterns and trends in transactions. These insights help us understand ongoing trends in various groups of users and suggest to users what kind of investment plans to make or how to cut costs.

Here, Redshift is the more appropriate choice because it serves as both a data warehouse for SageMaker and a data store for our architecture.

We must send a query to the database for each record, which adds to the workload and lengthens execution times. We addressed this by using multithreading in the Lambda function, so that each record is processed in its own task and the queries run concurrently. This reduced the Lambda function’s execution time to a few minutes, which in turn helped keep costs down.
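One way to run the per-record queries concurrently is a thread pool, which suits this case because the work is I/O-bound (waiting on the database). This is a sketch under assumptions: `lookup` is a hypothetical stand-in for the real database call, and the worker count of 8 is an arbitrary illustrative value.

```python
from concurrent.futures import ThreadPoolExecutor


def lookup(record: dict) -> dict:
    """Hypothetical per-record database query; replace with a real call."""
    return {**record, "enriched": True}


def process_records(records, query=lookup, max_workers=8):
    """Run one query per record concurrently, preserving input order."""
    if not records:
        return []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in input order and re-raises worker errors.
        return list(pool.map(query, records))
```

In a real function, `max_workers` would also be bounded by the number of connections the database can accept.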

We used AWS Secrets Manager to manage database credentials, keeping each one encrypted with KMS and limiting access with the appropriate IAM roles and policies.
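Reading a credential at runtime could look like the sketch below. It assumes the secret is stored as a JSON string with `username` and `password` fields, and it takes the client as a parameter so that it can be exercised without AWS access; both choices are illustrative, not taken from the project.

```python
import json


def load_db_credentials(client, secret_id: str) -> dict:
    """Fetch and decode a JSON secret using a Secrets Manager client."""
    response = client.get_secret_value(SecretId=secret_id)
    # Assumption: the secret is a JSON document with these two fields.
    secret = json.loads(response["SecretString"])
    return {"username": secret["username"], "password": secret["password"]}
```

Inside a Lambda function, the client would typically be `boto3.client("secretsmanager")`, and the function's IAM role would need permission to read that secret and to decrypt with its KMS key.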

Handling pipeline failures.

Handling Lambda execution failures, which could occur for any reason at runtime, was one of the pipeline’s challenges. Our solution was to use SQS as a Dead Letter Queue (DLQ) that receives a message once retries have failed.

If a Lambda function fails to execute, it first retries three times in case the failure was caused by a dependency problem. If the retries do not succeed, a message with all the relevant information, such as the dataset name, Lambda name, error message, and timestamp, is written to the DLQ, which is set up as the Lambda on-failure destination. Another service then polls the DLQ to send notifications.
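The retry-then-DLQ flow can be illustrated locally with the sketch below. In the deployed pipeline the retries and the delivery to the on-failure destination are handled by the Lambda service itself; this code only mimics that behaviour. The function name, the field names in the message, and the reading of "three times" as three attempts in total are assumptions.

```python
import json
from datetime import datetime, timezone


def run_with_retries(task, dataset, lambda_name, send_to_dlq, attempts=3):
    """Call task up to `attempts` times; on final failure, emit a DLQ message."""
    last_error = None
    for _ in range(attempts):
        try:
            return task()
        except Exception as exc:
            last_error = exc
    # All attempts failed: describe the failure for whoever polls the DLQ.
    message = {
        "dataset": dataset,
        "lambda": lambda_name,
        "error": str(last_error),
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    send_to_dlq(json.dumps(message))
    return None
```

Here `send_to_dlq` is any callable that accepts the message body, for example a thin wrapper around an SQS `send_message` call.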

Deployment and maintenance.

For this pipeline, deployment is automated through GitLab pipelines, since we manage our code repositories in GitLab. For infrastructure deployment, we used Terraform scripts as infrastructure as code. We maintain two separate environments, one for development and one for production. GitLab’s automated CI/CD pipeline has triggers on code lookup and deployment.

Technologies

  1. Amazon API Gateway
  2. Amazon S3
  3. AWS Lambda
  4. Amazon SageMaker
  5. Amazon CloudWatch
  6. Amazon Redshift
  7. Amazon Aurora
  8. Amazon SQS (Simple Queue Service)
  9. AWS Glue

Success Metrics

JPMC Bank needed a data lake architecture for reporting and analysis.

The ingestion pipeline had to be resilient to failures so that no incoming data was missed, and it had to handle data files containing thousands of records. Because the data includes personally identifiable information, it must always be encrypted in flight and at rest to be HIPAA compliant.

JPMC required a solution that would let them not only ingest and store data but also analyze it and display analytics generated from a range of data sources with varied access protocols and formats, delivering insights that help departments give better investment support and advice.

Handling Lambda execution failures, which could occur for any reason at runtime, was one of the pipeline’s challenges. Our solution was to use SQS as a Dead Letter Queue (DLQ) that receives a message once retries have failed.

We used API Gateway, Lambda functions, S3 buckets, Redshift, Glue, Aurora, SageMaker, and SQS. This serverless design enables cost savings and increased scalability without the burden of infrastructure administration.

We chose SageMaker because we have extensive data that has to be analyzed in order to extract the relevant information. To keep costs down, we use Aurora to store the analyzed data.