OCR Authorization Numbers

Context and Problem Statement

The CCAH Health Plan requires obtaining an authorization number prior to claim submission. This involves initiating a request with the payer and subsequently receiving a response via fax. The care team manually processes each fax, a time-consuming and error-prone task.

Proposed Solution

Use an OCR tool to process incoming faxes from CCAH and extract the authorization number data.

Considered Options

Accuracy is the key criterion. Text recognition must be highly precise to be a viable solution, because wrong authorization number data results in claim rejections and hard-to-trace errors. A success rate of at least 80% was required. Pricing, ease of use, documentation and support were also taken into account.

  1. AWS Textract: managed OCR service from AWS.
  2. RTesseract: Ruby library to interact with the Tesseract open source OCR engine by Google.

Decision Outcome

Chosen option: AWS Textract

AWS Textract

Advantages:

  • HIPAA compliant.
  • Ruby SDK is actively maintained.
  • Full documentation and support from AWS.
  • Integrated with other AWS services:
    • Process files stored in our S3 buckets.
    • Notify OCR async job results to SNS topics.
  • Price is acceptable: $1.5 per 500 pages.
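
As a rough illustration of the S3 and SNS integration listed above, the sketch below builds the request for an asynchronous text detection job. It assumes the `aws-sdk-textract` gem. The bucket, key, topic ARN and role ARN are placeholders, and `textract_job_request` and `start_ocr_job` are hypothetical helpers, not part of the SDK.

```ruby
# Builds the parameters for Textract's StartDocumentTextDetection operation.
# Every value here is a placeholder supplied by the caller.
def textract_job_request(bucket:, key:, sns_topic_arn:, role_arn:)
  {
    # The fax PDF is read directly from S3.
    document_location: {
      s3_object: { bucket: bucket, name: key }
    },
    # Textract publishes the job completion status to this SNS topic.
    notification_channel: {
      sns_topic_arn: sns_topic_arn,
      role_arn: role_arn
    }
  }
end

# Starts the async job and returns its job id.
# `client` is expected to be an Aws::Textract::Client, injected by the caller.
def start_ocr_job(client, **args)
  client.start_document_text_detection(**textract_job_request(**args)).job_id
end
```

With the gem installed, `client` would typically be created with `Aws::Textract::Client.new`, and once SNS reports completion the results could be fetched with `client.get_document_text_detection(job_id: job_id)`.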

Test

To measure the accuracy of AWS Textract, the OCR process was tested against 500 faxes that had already been processed manually. For each fax, the OCR data was compared against the data loaded in ARC by the care team:

True Positives (TP): 412 (82.4%)
All OCR data matched ARC's data.

False Positives (FP): 7 (1.4%)
AWS Textract recognized all values, but one or more values didn't match ARC's data.

True Negatives (TN): 76 (15.2%)
One or more values failed to be recognized by AWS Textract.

False Negatives (FN): 5 (1%)
All OCR data was correctly recognized, but ARC's data was incorrect.

Precision: 0.9833
Recall: 0.988
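
For reference, the two metrics follow from the standard definitions, precision = TP / (TP + FP) and recall = TP / (TP + FN). The sketch below recomputes them. It assumes 5 false negatives, which is the count that reproduces the reported recall of 0.988. The helper names are illustrative.

```ruby
# Precision: of the faxes where OCR produced a complete result,
# the share whose data was correct.
def precision(tp:, fp:)
  tp.to_f / (tp + fp)
end

# Recall: TP / (TP + FN), using the standard definition.
def recall(tp:, fn:)
  tp.to_f / (tp + fn)
end

precision_value = precision(tp: 412, fp: 7) # 412 / 419
recall_value = recall(tp: 412, fn: 5)       # 412 / 417

puts format("precision=%.4f recall=%.4f", precision_value, recall_value)
```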

The test results were mainly positive. For more than 80% of the OCR'd faxes, the correct data was extracted. In 15% of the cases some data was incomplete; those faxes would be discarded by the OCR flow and handled manually. In less than 2% of the analyzed faxes, incorrect data was extracted.
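
The handling rule described above (accept complete extractions, send incomplete ones to the care team) could look roughly like the sketch below. The field names and label patterns are hypothetical, since the actual CCAH fax layout is not described here. The input is assumed to be the text lines returned by the OCR step.

```ruby
# Hypothetical fields expected on each fax; the real layout may differ.
REQUIRED_FIELDS = %i[authorization_number member_id].freeze

# Hypothetical label patterns, for illustration only.
FIELD_PATTERNS = {
  authorization_number: /Authorization\s*(?:Number|#)\s*:?\s*([A-Z0-9-]+)/i,
  member_id: /Member\s*ID\s*:?\s*([A-Z0-9-]+)/i
}.freeze

# Returns a hash of field name => first matching value (nil when not found).
def extract_fields(lines)
  FIELD_PATTERNS.transform_values do |pattern|
    lines.filter_map { |line| line[pattern, 1] }.first
  end
end

# Returns [:automatic, fields] when every required field was recognized,
# and [:manual, fields] when the fax should go back to the care team.
def route_fax(lines)
  fields = extract_fields(lines)
  complete = REQUIRED_FIELDS.all? { |name| fields[name] }
  [complete ? :automatic : :manual, fields]
end
```

A fax with every required field would be routed as `:automatic`, while one missing any field falls back to `:manual`, which keeps partially recognized data out of claim submission.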

RTesseract

Tesseract is the standard OCR solution: it is open source and widely used. RTesseract is a Ruby gem to interact with a Tesseract engine.

  • It can only OCR images. PDFs must be downloaded and converted to images beforehand.
  • The result is a collection of recognized words, not a structured construct like lines or pages, so parsing would be more complex.
  • The gem is actively maintained but lacks features.
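
To give a sense of the extra parsing involved, the sketch below regroups word boxes into text lines. It assumes input hashes shaped like the ones RTesseract's `to_box` is understood to return (`:word`, `:x_start`, `:y_start`). The helper name and the 10-pixel tolerance are arbitrary, illustrative choices.

```ruby
# Groups word boxes into lines: a word joins an existing line when its top
# edge is within `tolerance` pixels of that line's first word.
def group_into_lines(boxes, tolerance: 10)
  lines = []
  boxes.sort_by { |box| [box[:y_start], box[:x_start]] }.each do |box|
    line = lines.find { |l| (l.first[:y_start] - box[:y_start]).abs <= tolerance }
    if line
      line << box
    else
      lines << [box]
    end
  end
  # Order the words of each line left to right and join them into a string.
  lines.map do |line|
    line.sort_by { |box| box[:x_start] }.map { |box| box[:word] }.join(" ")
  end
end
```

Even this simple heuristic has to deal with skewed scans and multi-column layouts, which is work AWS Textract already does when it returns line-level results.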