AWS Partner Story: Pinellas County Human Services – Helping the ones in need

About Company

Pinellas County Human Services (PCHS) has been creating solutions for a stronger community by serving those most in need since 1955. With a network of more than 105 partner agencies and over 190 contracts and grants, Human Services helps Pinellas County residents access medical care and emergency financial assistance, connect to county judicial resources, optimize benefits for veterans and dependents, investigate consumer complaints, and find support when experiencing homelessness.

Pinellas County Human Services has partnered with the Pinellas County Department of Health and the Turley Family Health Center to provide prevention-focused health care to eligible Pinellas County residents. The Pinellas County Health Program moves clients from a “sick care” model toward a “disease management” model using medical homes.

Executive Summary

Pinellas County’s main requirement was a data ingestion solution that could collect data from more than 40 sources, each with its own schema and format, and align it to a common convention so that downstream services and pipelines could consume it for reporting and analysis. Overall, they needed a data lake architecture for reporting and analysis.
The ingestion pipeline needs to be fail-proof so that no incoming data is missed, and it must be capable of handling data files with thousands of records. Because the pipeline deals with personally identifiable information, the data must always be encrypted in transit and at rest to remain HIPAA compliant.


• PCHS needed a solution to capture, analyze, and visualize analytics data generated from a variety of data sources with varying access protocols and structures, providing insights that help departments deliver better health services and medical policies.
• The pipeline must support thousands of records uploaded every month from different hospital and medical home servers.
• It had to protect information with encryption in transit and at rest, and scale elastically to handle peak loads.
• It had to be HIPAA compliant to protect sensitive patient data.


• Eliminates the need to provision and manage infrastructure for each microservice.
• Lambda automatically scales up and down with load, processing millions of data points monthly.
• Reduces cost.
• Provides a HIPAA-compliant solution.
• Delivers faster insight into the data.
• Speeds time to market for new customer services, since each feature is a new microservice that runs and scales independently of every other microservice.
• Decouples product engineering from the platform analytics pipeline, so new microservices can access the data stream without being bundled into the main analytics application.

Partner Solution

Before implementing the solution, we needed to answer the following questions:
• What will be the frequency of the incoming data?
• How much data will need to be filtered at the initial stage?
• What data formats will need to be processed?
• What will be the hot key/partition key when querying the data for reporting and analysis?

Our Solution

Based on the above requirements, we used Lambda functions, S3 buckets, DynamoDB, and SQS, along with KMS for encryption, to keep the architecture entirely serverless. A serverless architecture reduces cost and achieves higher scalability without the burden of infrastructure management.

Lambda functions can comfortably handle the volume of data we collect from several sources each month, and Python libraries such as Pandas and NumPy let us transform the data very efficiently.

S3 is well suited to data lake solutions because it can hold data in almost any format, and with storage classes and lifecycle events we reduced the cost of storing data for the long term.

For metadata management we used DynamoDB. It stores the pipeline metrics our Lambda functions use to make decisions at runtime, and it tracks the history of datasets arriving from source systems.

To handle pipeline failures, we used SQS as a dead-letter queue that holds failure messages and triggers notification services when the pipeline fails.

Because we must deal with personally identifiable information (PII), we must keep our architecture under a reliably secure roof, for which we used AWS GovCloud. GovCloud safeguards the sensitive data files, which keeps our solution HIPAA compliant.

Using S3 bucket to collect and store the data from Source

Every partner hospital first uploads its data files from its secured network to the S3 bucket we provide, using the Transfer Family service, which is highly secure for B2B file transfers. Once a data file lands in the S3 bucket, it is automatically encrypted with SSE-KMS, which keeps the data encrypted at rest with encryption keys managed by KMS. The S3 file upload event then triggers the Lambda function. We configured S3 bucket policies to allow connections only over HTTPS, which keeps the data encrypted in transit, and to permit uploads only from hospital networks, which prevents each hospital from accessing data that does not belong to it.
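As an illustration of the HTTPS-only restriction, the sketch below builds the kind of bucket policy that denies any request not made over TLS, using the standard `aws:SecureTransport` condition key. The bucket name and statement id are placeholders, not the project's real configuration.

```python
import json


def https_only_policy(bucket_name):
    """Build a bucket policy that denies any S3 request made over plain HTTP.

    Attaching this policy keeps data encrypted in transit; bucket_name is a
    placeholder for whichever ingestion bucket the hospitals upload into.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [
                    f"arn:aws:s3:::{bucket_name}",
                    f"arn:aws:s3:::{bucket_name}/*",
                ],
                # Requests over HTTP carry aws:SecureTransport == "false".
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            }
        ],
    }


# The policy would then be applied with Boto3, for example:
# boto3.client("s3").put_bucket_policy(
#     Bucket="ingestion-bucket",
#     Policy=json.dumps(https_only_policy("ingestion-bucket")))
```

A similar statement with an `aws:SourceIp` condition could restrict uploads to the hospital networks, as described above.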

To make querying the S3 bucket more efficient, we partitioned the data by source name so that the Lambda functions used later in the pipeline can easily retrieve the files.

In our pipeline we used two S3 buckets. The first is the Ingestion bucket, which stores the data coming directly from source systems; the Lambda functions watch this bucket and fetch each data file for processing as soon as a partner hospital uploads it. The second is the Raw data bucket, which holds the filtered data produced by the Lambda functions during the ingestion phase. The Raw bucket stores data in Parquet format, whose columnar storage makes the data faster to query, and all datasets in the Raw bucket share a common schema.

We only need frequent access to the past 6 months of data for reporting and analysis, so with the help of S3 lifecycle policies we migrate data to the S3 Infrequent Access storage class after 6 months and then to S3 Glacier after 24 months, which reduces the storage cost of managing historical data.
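The 6-month and 24-month transitions can be expressed as an S3 lifecycle configuration. The sketch below builds one such configuration as a plain dictionary (rule id and prefix are illustrative) in the shape accepted by Boto3's `put_bucket_lifecycle_configuration`:

```python
def lifecycle_rules(prefix=""):
    """Lifecycle rules matching the retention scheme described above:
    Infrequent Access after ~6 months, Glacier after ~24 months.

    The rule id and prefix are illustrative, not the production values.
    """
    return {
        "Rules": [
            {
                "ID": "archive-historical-data",
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                "Transitions": [
                    # ~6 months: move to Standard-Infrequent Access.
                    {"Days": 180, "StorageClass": "STANDARD_IA"},
                    # ~24 months: move to Glacier for archival.
                    {"Days": 730, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }


# Applied with Boto3, for example:
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="raw-data-bucket", LifecycleConfiguration=lifecycle_rules())
```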

Processing the datasets

After a file has been successfully uploaded to the Ingestion S3 bucket, the first Lambda function in the pipeline, “handle_s3_event_ingestion”, is triggered. It receives a payload from the invoking event, based on which it generates the metadata and decides which Lambda function to invoke next. It uses Python 3.7 as its runtime environment and is configured for concurrent executions in case multiple hospitals upload data files simultaneously. Its main business logic is to build a payload and pass it to the next Lambda function, chosen from a set of several in the pipeline by looking at the dataset's source and format type.
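The routing step can be sketched as follows. This is a simplified, hypothetical version of the dispatcher: it assumes object keys follow the source-name partitioning described earlier (`source_name/filename.ext`) and that worker names follow the `semi_structured_event_workflow_*` convention; the real function also writes metadata to DynamoDB.

```python
def route_dataset(event):
    """Choose the downstream worker from the uploaded object's key.

    Assumes (for illustration) keys of the form 'source_name/filename.ext',
    matching the partitioning scheme used in the Ingestion bucket.
    """
    record = event["Records"][0]
    key = record["s3"]["object"]["key"]
    source, filename = key.split("/", 1)
    fmt = filename.rsplit(".", 1)[-1].lower()
    return {
        "function": f"semi_structured_event_workflow_{source}_{fmt}",
        "payload": {
            "bucket": record["s3"]["bucket"]["name"],
            "key": key,
            "source": source,
            "format": fmt,
        },
    }


def handle_s3_event_ingestion(event, context=None):
    """Simplified dispatcher: pick the worker and hand over the payload."""
    target = route_dataset(event)
    # In the real pipeline the chosen function is invoked asynchronously:
    # boto3.client("lambda").invoke(
    #     FunctionName=target["function"], InvocationType="Event",
    #     Payload=json.dumps(target["payload"]))
    return target
```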

Next, we have a set of Lambda functions (“semi_structured_event_workflow_*”), of which exactly one is invoked. Each function in the set contains the business logic to filter and transform one specific source dataset in one specific data format. Filtering and transformation include operations such as removing duplicate data, normalizing data, mapping correct data types, adding or removing certain columns, and other schema changes. To use third-party libraries, we rely on a custom Lambda governance layer containing our predefined functions and classes that expose specific functionality from those libraries; this use of Lambda layers gives us abstraction and an extra layer of security over the data. To read the data files from the S3 bucket we used the Boto3 library, which communicates with S3 over HTTPS and keeps the data encrypted in transit.
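A minimal Pandas sketch of the kind of filtering and transformation step described above. The column names and target schema here are invented for illustration; in the pipeline the schema would come from the metadata in DynamoDB.

```python
import pandas as pd

# Illustrative target schema; real field names and types come from metadata.
SCHEMA = {"patient_id": "string", "provider": "string", "amount": "float64"}


def transform(df):
    """Typical cleanup pass: normalize column names, drop exact duplicates,
    keep only columns in the common schema, and coerce data types."""
    df = df.copy()
    # Normalize headers, e.g. "Patient ID" -> "patient_id".
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
    df = df.drop_duplicates()
    # Keep only columns the common schema knows about, in schema order.
    df = df[[c for c in SCHEMA if c in df.columns]]
    for col, dtype in SCHEMA.items():
        if col in df.columns:
            df[col] = df[col].astype(dtype)
    return df
```

The cleaned frame would then be written back to the Raw bucket as Parquet (e.g. with `df.to_parquet`).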

Each Lambda function interacts with DynamoDB to fetch and update metadata for the data files currently being processed. DynamoDB stores information such as the dataset name, last run status, fields and their data types, timestamps, S3 source and target locations, and details of related operations that need to be performed.
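The shape of such a metadata update might look like the following. Table and attribute names here are illustrative, not the production schema; the dictionary is built in the low-level DynamoDB wire format so it can be passed straight to Boto3's `update_item`.

```python
from datetime import datetime, timezone


def metadata_update(dataset, status, fields, target_key):
    """Build the update_item arguments for one dataset's pipeline metadata.

    All names (table, attributes) are illustrative assumptions.
    """
    return {
        "TableName": "pipeline_metadata",
        "Key": {"dataset_name": {"S": dataset}},
        "UpdateExpression": (
            "SET last_run_status = :s, fields = :f, "
            "target_location = :t, updated_at = :ts"
        ),
        "ExpressionAttributeValues": {
            ":s": {"S": status},
            ":f": {"SS": fields},             # set of field names
            ":t": {"S": target_key},          # S3 target location
            ":ts": {"S": datetime.now(timezone.utc).isoformat()},
        },
    }


# boto3.client("dynamodb").update_item(**metadata_update(
#     "hospital_a_claims", "SUCCESS", ["patient_id", "amount"], "raw/hospital_a/"))
```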

After filtering, some records with missing fields or data need to be validated, for which another Lambda function, “pii_matching”, is triggered in the pipeline. It takes the bad records and dataset name as input from the previous Lambda function and performs record-matching operations to validate them. The “pii_matching” function searches for every bad record in the Aurora database and, based on config files passed at runtime, checks for exact and proposed matches; the config files define the fields and operators used to search for the record in the database. For every exact-match record we update the dataset in the S3 bucket, while for every proposed match we create a new entry in the database so that data scientists can validate it manually through a web application.
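The config-driven lookup can be sketched as a small query builder. The table and field names are hypothetical, and the config format (a list of field/operator pairs) is an assumption about how such a runtime config might look; the query is parameterized so record values never get interpolated into the SQL.

```python
def build_match_query(config, record):
    """Build a parameterized lookup from a config of (field, operator) pairs.

    Example config: [("last_name", "="), ("dob", "=")] -- illustrative only.
    Returns the SQL text and the parameter list for the driver to bind.
    """
    clauses, params = [], []
    for field, op in config:
        clauses.append(f"{field} {op} %s")   # %s: placeholder, not formatting
        params.append(record[field])
    sql = "SELECT * FROM patients WHERE " + " AND ".join(clauses)
    return sql, params
```

The Lambda would run the returned query against Aurora, treat a single hit as an exact match, and record multiple or fuzzy hits as proposed matches for manual review.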

Firing a query to the database for each record increases the load and drives up execution time. To resolve this, we used multithreading in the Lambda function, processing each record on its own thread. Multithreading reduced the function's execution time to a few minutes and helped reduce cost.
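In Python this fan-out is naturally expressed with a thread pool, which overlaps the network round-trips of the per-record lookups instead of running them serially. The lookup body below is a placeholder for the Aurora query:

```python
from concurrent.futures import ThreadPoolExecutor


def validate_record(record):
    """Placeholder for the per-record database lookup; in the pipeline this
    would issue the match query against Aurora."""
    return {"record": record, "matched": bool(record)}


def validate_all(records, workers=16):
    """Validate records concurrently; results come back in input order.

    The worker count is an illustrative tuning knob, balanced against the
    database's connection limits.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(validate_record, records))
```

Threads (rather than processes) fit here because the work is I/O-bound, so the GIL is released while each thread waits on the database.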

To manage database credentials, we used the Secrets Manager service, where every credential is kept encrypted with KMS and access is restricted through appropriate IAM policies and roles.
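Retrieving such a credential is a one-call affair with Boto3's Secrets Manager client; the secret id below is an invented example. KMS decryption happens server-side, so the function only has to decode the JSON payload:

```python
import json


def get_db_credentials(client, secret_id="pchs/aurora/credentials"):
    """Fetch a JSON secret from Secrets Manager and decode it.

    `client` is a boto3 Secrets Manager client (injected for testability);
    the secret id shown is an illustrative placeholder.
    """
    resp = client.get_secret_value(SecretId=secret_id)
    return json.loads(resp["SecretString"])


# Typical use inside a Lambda function:
# creds = get_db_credentials(boto3.client("secretsmanager"))
# connect(user=creds["user"], password=creds["password"], ...)
```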

Handling Pipeline Failures

One challenge in the pipeline was handling Lambda execution failures, which can happen for any number of reasons at runtime. Our solution combines retries on failure with SQS as a dead-letter queue (DLQ). Whenever a Lambda function fails during execution, it first retries up to 3 times to resolve any transient dependency issues. If it still fails after the retries, a message containing details such as the dataset name, Lambda name, error statement, and timestamp is generated in the DLQ, which is configured as the function's on-failure destination. The DLQ is later polled by another service to send notifications.
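A sketch of the failure payload and the on-failure wiring. The field names mirror the details listed above but are otherwise illustrative, and the commented configuration call assumes asynchronous invocation (where Lambda's built-in retry count is configurable):

```python
from datetime import datetime, timezone


def failure_message(dataset, function_name, error):
    """Build the message pushed to the SQS dead-letter queue once a
    function has exhausted its retries. Field names are illustrative."""
    return {
        "dataset_name": dataset,
        "lambda_name": function_name,
        "error_statement": str(error),
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }


# Wiring SQS as the on-failure destination for an async-invoked function:
# boto3.client("lambda").put_function_event_invoke_config(
#     FunctionName="semi_structured_event_workflow_hospital_a_csv",
#     MaximumRetryAttempts=2,   # built-in async retries, tuned per function
#     DestinationConfig={"OnFailure": {"Destination": dlq_arn}})
```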

Deployment and Maintenance

For this pipeline, our deployment process is automated through GitLab pipelines, since we manage our code repositories in GitLab. For infrastructure deployment, we used Terraform scripts as Infrastructure as Code. We maintain two separate environments, one for development and one for production. GitLab's automated CI/CD pipeline triggers on code changes for checks and deployment.