OCR Authorization Numbers
Context and Problem Statement
The CCAH Health Plan requires obtaining an authorization number prior to claim submission. This involves initiating a request with the payer and subsequently receiving a response via fax. The care team manually processes each fax, a time-consuming and error-prone task.
Proposed Solution
Use an OCR tool to process incoming faxes from CCAH and extract the authorization number data.
Considered Options
Accuracy is the key aspect to consider. Text recognition must be highly precise to be a viable solution, because wrong authorization number data results in claim rejections and hard-to-trace errors. A success rate of at least 80% was required. Pricing, ease of use, documentation, and support were also taken into account.
- AWS Textract: managed OCR service from AWS.
- RTesseract: Ruby library to interact with Tesseract, the open source OCR engine by Google.
Decision Outcome
Chosen option: AWS Textract
AWS Textract
Advantages:
- HIPAA compliant.
- Ruby SDK is actively maintained.
- Full documentation and support from AWS.
- Integrated with other AWS services:
  - Process files stored in our S3 buckets.
  - Notify OCR async job results to SNS topics.
- Price is acceptable: $1.5 per 500 pages.
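The S3 and SNS integration listed above could look roughly like the following sketch, which uses the `aws-sdk-textract` gem. The bucket, object key, region, and ARNs are placeholders, and the helper names `textract_job_params` and `start_fax_ocr` are hypothetical; this is an illustration under those assumptions, not the actual integration code.

```ruby
# Builds the parameters for an asynchronous Textract text-detection job.
# Every bucket, key, and ARN passed in here is a placeholder for illustration.
def textract_job_params(bucket:, key:, sns_topic_arn:, role_arn:)
  {
    # The fax is read directly from S3.
    document_location: { s3_object: { bucket: bucket, name: key } },
    # Textract publishes the job completion status to this SNS topic.
    notification_channel: { sns_topic_arn: sns_topic_arn, role_arn: role_arn }
  }
end

# Starts the job and returns its id. The gem is required lazily so the
# parameter builder above can be used without any AWS dependency.
def start_fax_ocr(params, region: "us-west-2")
  require "aws-sdk-textract"
  client = Aws::Textract::Client.new(region: region)
  client.start_document_text_detection(params).job_id
end

params = textract_job_params(
  bucket: "example-fax-bucket",
  key: "faxes/example.pdf",
  sns_topic_arn: "arn:aws:sns:us-west-2:123456789012:example-ocr-results",
  role_arn: "arn:aws:iam::123456789012:role/example-textract-sns"
)
```

Calling `start_fax_ocr(params)` requires valid AWS credentials; once the SNS notification arrives, the recognized text can be fetched with `get_document_text_detection(job_id: ...)`.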
Test
To measure the accuracy of AWS Textract, the OCR process was tested against 500 faxes that had already been processed manually. For each fax, the OCR data was compared against the data loaded into ARC by the care team:
True Positives (TP): 412 (82.4%)
All OCR data matched ARC's data.
False Positives (FP): 7 (1.4%)
AWS Textract was able to recognize all values, but one or more values didn't match ARC's data.
True Negatives (TN): 5 (1%)
One or more values failed to be recognized by AWS Textract.
False Negatives (FN): 76 (15.2%)
All OCR data was correctly recognized, but ARC's data was incorrect.
Precision: 0.9832
Recall: 0.988
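As a cross-check, the figures above can be recomputed from the four counts using the standard definitions, precision = TP / (TP + FP) and recall = TP / (TP + FN). The sketch below assumes the counts exactly as labelled in this section. Under that assumption precision comes out at about 0.983 and recall at about 0.844; the reported recall of 0.988 equals 412 / (412 + 5), so it appears to have been computed with the 5 cases rather than the 76 as false negatives.

```ruby
# Counts as reported in the test above (500 faxes in total).
counts = { tp: 412, fp: 7, tn: 5, fn: 76 }

total = counts.values.sum

# Share of each outcome as a percentage of all faxes.
shares = counts.transform_values { |n| (100.0 * n / total).round(1) }

# Standard definitions of precision and recall.
precision = counts[:tp].to_f / (counts[:tp] + counts[:fp])
recall    = counts[:tp].to_f / (counts[:tp] + counts[:fn])

puts "total=#{total} shares=#{shares}"
puts format("precision=%.4f recall=%.4f", precision, recall)
```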
The test results were mostly positive. The correct data was extracted for more than 80% of the OCR'd faxes. In about 15% of the cases some data was incomplete; those faxes would be discarded and handled manually. Incorrect data was extracted in fewer than 2% of the analyzed faxes.
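The handling rule described above, where a fax with incomplete data goes back to manual processing, can be sketched as a small routing function. The field names in `REQUIRED_FIELDS` and the `route_fax` helper are hypothetical, since the exact set of extracted fields is not listed here.

```ruby
# Hypothetical list of fields that must be present before a fax is
# processed automatically; the real field set is an assumption.
REQUIRED_FIELDS = %i[authorization_number member_id].freeze

# Returns :automatic when every required field was recognized,
# otherwise :manual so the care team handles the fax as before.
def route_fax(extracted)
  missing = REQUIRED_FIELDS.select { |field| extracted[field].to_s.strip.empty? }
  missing.empty? ? :automatic : :manual
end

complete   = route_fax({ authorization_number: "A123", member_id: "M9" }) # => :automatic
incomplete = route_fax({ authorization_number: "A123", member_id: nil })  # => :manual
```

Routing on missing fields only catches incomplete extractions; values that are recognized but wrong would still pass through.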
RTesseract
Tesseract is the standard open source OCR solution and is widely used. RTesseract is a Ruby gem to interact with a Tesseract engine.
Disadvantages:
- It can only OCR images. PDFs must be downloaded and converted to images beforehand.
- The result is a collection of recognized words, not a richer structure like lines or pages. Parsing would be more complex.
- Gem is actively maintained but lacks features.
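To illustrate the extra parsing work mentioned above, the sketch below regroups word-level results into lines by their vertical position. The word hashes (`:word`, `:x_start`, `:y_start`), the sample values, and the `tolerance` default are assumptions made for this example; no OCR engine is called.

```ruby
# Groups word boxes into text lines: a word joins an existing line when its
# top edge is within `tolerance` pixels of that line's first word.
def group_into_lines(words, tolerance: 5)
  lines = []
  words.sort_by { |w| [w[:y_start], w[:x_start]] }.each do |word|
    line = lines.find { |l| (l.first[:y_start] - word[:y_start]).abs <= tolerance }
    if line
      line << word
    else
      lines << [word]
    end
  end
  # Order the words of each line left to right and join them into a string.
  lines.map { |l| l.sort_by { |w| w[:x_start] }.map { |w| w[:word] }.join(" ") }
end

# Made-up sample data in an assumed word-box shape.
words = [
  { word: "Number:", x_start: 120, y_start: 42 },
  { word: "Authorization", x_start: 10, y_start: 40 },
  { word: "12345", x_start: 200, y_start: 41 },
  { word: "Status:", x_start: 10, y_start: 80 },
  { word: "Approved", x_start: 90, y_start: 79 }
]

lines = group_into_lines(words)
# => ["Authorization Number: 12345", "Status: Approved"]
```

Even this simple grouping needs a tuned tolerance and would struggle with skewed scans, which is part of why a line- and page-aware result is easier to work with.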