This guide provides a high-level overview of the newly added PaddleOCR application, focusing on its internal structure and advanced functionality. The app performs end-to-end text recognition using a two-stage OCR pipeline accelerated by Hailo-8 and Hailo-10H devices.
The pipeline combines:
- A text detector to locate text regions
- A text recognizer to decode the text inside each region
Example Runs
Single Image:
python3 paddle_ocr.py -n ocr_det.hef ocr.hef -i ocr_img1.png
Folder of Images:
python3 paddle_ocr.py -n ocr_det.hef ocr.hef -i ./my_images/
Video File:
python3 paddle_ocr.py -n ocr_det.hef ocr.hef -i input.mp4
Camera:
python3 paddle_ocr.py -n ocr_det.hef ocr.hef -i camera
Optional: Spell Correction
To improve OCR text accuracy, enable dictionary-based spelling correction (powered by symspellpy) with the `--use-corrector` flag:
python3 paddle_ocr.py … --use-corrector
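To illustrate the idea of dictionary-based correction, here is a minimal sketch. It does not use symspellpy; it swaps in the standard library's `difflib` as a stand-in so the example stays self-contained. The vocabulary, the `cutoff` value, and the function names are assumptions made for illustration only.

```python
import difflib

# Hypothetical mini-dictionary; a real run would load a full word list.
VOCAB = ["hello", "world", "street", "parking", "exit"]

def correct_word(word, vocab=VOCAB, cutoff=0.75):
    """Return the closest vocabulary word, or the word unchanged if none is close."""
    matches = difflib.get_close_matches(word.lower(), vocab, n=1, cutoff=cutoff)
    return matches[0] if matches else word

def correct_text(text):
    """Correct each whitespace-separated token independently."""
    return " ".join(correct_word(tok) for tok in text.split())
```

Tokens with no sufficiently similar dictionary entry are passed through untouched, which avoids "correcting" names or numbers into unrelated words.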
Full Pipeline Description
The PaddleOCR app uses a multi-threaded, queue-based pipeline to process input efficiently and asynchronously across multiple stages.
Preprocessing
↓
Text Detector (HEF 1)
↓
Detection Postprocess → [No Text] → Visualize
↓
Text Recognizer (HEF 2)
↓
OCR Postprocess
↓
Visualization
↓ [Output]
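As a rough illustration of the queue-based design, the sketch below chains one worker thread per stage using `queue.Queue`. The names `stage`, `run_pipeline`, and `SENTINEL` are hypothetical and do not come from the app. The real pipeline also branches (the no-text path) and relies on inference callbacks, both of which this linear sketch omits.

```python
import queue
import threading

SENTINEL = None  # end-of-stream marker; a convention assumed for this sketch

def stage(fn, in_q, out_q):
    """Pull items from in_q, apply fn, and push results to out_q until SENTINEL."""
    while True:
        item = in_q.get()
        if item is SENTINEL:
            out_q.put(SENTINEL)  # propagate shutdown to the next stage
            break
        out_q.put(fn(item))

def run_pipeline(frames, fns):
    """Run each function in its own thread, connected by bounded queues."""
    queues = [queue.Queue(maxsize=4) for _ in range(len(fns) + 1)]
    workers = [
        threading.Thread(target=stage, args=(fn, queues[i], queues[i + 1]))
        for i, fn in enumerate(fns)
    ]

    def feed():
        # Feed from a separate thread so bounded queues cannot deadlock the caller.
        for frame in frames:
            queues[0].put(frame)
        queues[0].put(SENTINEL)

    feeder = threading.Thread(target=feed)
    for t in workers + [feeder]:
        t.start()

    results = []
    while True:
        out = queues[-1].get()
        if out is SENTINEL:
            break
        results.append(out)

    for t in workers + [feeder]:
        t.join()
    return results
```

Bounded queues provide back-pressure: a slow stage makes upstream stages block instead of letting frames pile up in memory.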
1. Preprocessing
- Input source can be:
  - A single image
  - A folder of images
  - A video file
  - A live camera stream
- Each frame is:
  - Resized and padded to fit the detector’s input size (while preserving aspect ratio)
  - Batched (if `batch_size` > 1)
- Outputs:
  - `input_frame` (for visualization)
  - `preprocessed_frame` (ready for inference)
- Sent to: `detector_hailo_infer` via `det_input_queue`
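The resize-and-pad step can be sketched as a "letterbox" operation. This is a minimal NumPy-only version under stated assumptions: the function name `letterbox`, centered padding, a pad value of 0, and nearest-neighbor resampling are all illustrative choices, and the app itself may use an image library and a different padding layout.

```python
import numpy as np

def letterbox(frame, target_h, target_w, pad_value=0):
    """Resize `frame` (H, W[, C]) to fit inside the target size, then pad."""
    h, w = frame.shape[:2]
    scale = min(target_h / h, target_w / w)  # one factor keeps the aspect ratio
    new_h = min(target_h, max(1, int(round(h * scale))))
    new_w = min(target_w, max(1, int(round(w * scale))))

    # Nearest-neighbor resize via index lookup (stand-in for a library resize).
    rows = np.minimum((np.arange(new_h) / scale).astype(int), h - 1)
    cols = np.minimum((np.arange(new_w) / scale).astype(int), w - 1)
    resized = frame[rows][:, cols]

    # Paste the resized image at the center of a constant-valued canvas.
    canvas = np.full((target_h, target_w) + frame.shape[2:], pad_value,
                     dtype=frame.dtype)
    top = (target_h - new_h) // 2
    left = (target_w - new_w) // 2
    canvas[top:top + new_h, left:left + new_w] = resized
    return canvas, scale, (top, left)
```

Returning `scale` and the `(top, left)` offset makes it possible to map detected boxes back to original-frame coordinates later.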
2. Text Detection (HEF 1)
- Uses the first HEF model to detect text regions
- Runs asynchronously using `HailoInfer.run()`
- On inference completion, triggers a callback:
  - Packs `(original_frame, raw_output_tensor)`
- Sent to: `det_postprocess_queue`
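The callback pattern described above can be sketched with the standard library. The signature of `HailoInfer.run()` is not shown in this guide, so the example uses `concurrent.futures` as a stand-in; `fake_infer`, `on_done`, and `submit_async` are hypothetical names, and the "inference" simply doubles its input.

```python
import queue
from concurrent.futures import ThreadPoolExecutor
from functools import partial

det_postprocess_queue = queue.Queue()

def fake_infer(preprocessed_frame):
    """Stand-in for the device call; returns a dummy 'raw output tensor'."""
    return [v * 2 for v in preprocessed_frame]

def on_done(original_frame, future):
    """Completion callback: pair the original frame with the raw output."""
    det_postprocess_queue.put((original_frame, future.result()))

def submit_async(executor, original_frame, preprocessed_frame):
    """Start inference without blocking and register the callback."""
    future = executor.submit(fake_infer, preprocessed_frame)
    # partial binds the frame so the callback knows which input it belongs to.
    future.add_done_callback(partial(on_done, original_frame))
    return future

with ThreadPoolExecutor(max_workers=1) as pool:
    submit_async(pool, "frame-0", [1, 2, 3])
```

Binding the original frame into the callback is what lets later stages crop and draw on the unmodified image rather than on the padded detector input.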
3. Detection Postprocessing
- Converts the raw heatmap into bounding boxes using `DBPostProcess`
- For each box:
  - Crops the region from the original frame
  - Resizes it to fit the OCR model’s expected input size (with padding)
  - Attaches metadata: frame ID and box location
- If no boxes are detected:
  - Sends the original frame with empty OCR results directly to visualization
- Otherwise:
  - Sends: `(frame, [resized_crop], (frame_id, box))` to `ocr_input_queue`
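The routing logic of this stage can be sketched as follows. This is a simplified illustration, not the app's code: boxes are assumed to be axis-aligned `(x_min, y_min, x_max, y_max)` tuples (the real detector output may be quadrilaterals), the resize step is omitted, the shape of the empty-result message is a guess, and `route_detections` is a hypothetical name.

```python
import queue
import numpy as np

ocr_input_queue = queue.Queue()
vis_output_queue = queue.Queue()

def route_detections(frame, frame_id, boxes):
    """Crop each box and enqueue it, or bypass OCR when no text was found.

    Returns the number of crops sent, so a later stage can know how many
    OCR results to expect for this frame.
    """
    if not boxes:
        # No text: send the frame with empty results straight to visualization.
        vis_output_queue.put((frame, [], []))
        return 0
    for box in boxes:
        x0, y0, x1, y1 = box
        crop = frame[y0:y1, x0:x1]  # NumPy slicing is rows (y) first, then columns (x)
        ocr_input_queue.put((frame, [crop], (frame_id, box)))
    return len(boxes)
```

Each crop travels with its `(frame_id, box)` metadata because recognition results may complete out of order and must be matched back to their frame.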
4. Text Recognition (HEF 2)
- Uses the second HEF model to recognize text in each cropped region
- Also runs asynchronously using `HailoInfer.run()`
- On completion, a callback sends: `(frame_id, original_frame, ocr_result, box)` to `ocr_postprocess_queue`
5. OCR Postprocessing
- Collects all OCR outputs for a given frame (tracked by `frame_id`)
- Keeps track of how many boxes are expected for that frame
- Once all OCR results are collected:
  - Groups them into one bundle: `(frame, list_of_results, list_of_boxes)`
  - Sends to: `vis_output_queue` for visualization
- Cleans up memory (removes processed `frame_id` entries)
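The grouping behavior can be sketched with a small collector class. `FrameCollector` and its methods are hypothetical names; how the app actually communicates the expected box count between stages is not described here, so `expect()` is an assumption.

```python
from collections import defaultdict

class FrameCollector:
    """Group per-box OCR results until every box of a frame has arrived."""

    def __init__(self):
        self.expected = {}                # frame_id -> number of boxes expected
        self.pending = defaultdict(list)  # frame_id -> [(ocr_result, box), ...]

    def expect(self, frame_id, num_boxes):
        """Register how many OCR results this frame will produce."""
        self.expected[frame_id] = num_boxes

    def add(self, frame_id, frame, ocr_result, box):
        """Store one result; return the bundle once the frame is complete, else None."""
        self.pending[frame_id].append((ocr_result, box))
        if len(self.pending[frame_id]) < self.expected.get(frame_id, float("inf")):
            return None
        # Frame complete: remove its entries so memory does not grow over time.
        items = self.pending.pop(frame_id)
        self.expected.pop(frame_id, None)
        results = [r for r, _ in items]
        boxes = [b for _, b in items]
        return (frame, results, boxes)
```

A caller would put each non-`None` return value of `add()` onto the visualization queue.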
6. Visualization & Rendering
- Uses the `inference_result_handler()` to:
  - Decode OCR model outputs into readable text
  - (Optionally) apply spell correction using `SymSpell` if `--use-corrector` is set
- Draws the results:
  - Left side: original image
  - Right side: same image with OCR results written inside white boxes
- Saves each frame (image/video) to `--output-dir`
- Optionally displays FPS if `--show-fps` is enabled
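This guide does not specify how the recognizer output is decoded. One common approach for recognizers of this kind is greedy CTC decoding, sketched below purely as an assumption: the output is taken to be a `(timesteps, num_classes)` score matrix with class 0 as the blank symbol, and `ctc_greedy_decode` and `charset` are hypothetical names.

```python
import numpy as np

BLANK = 0  # assumed index of the CTC blank class

def ctc_greedy_decode(scores, charset):
    """Decode a (timesteps, num_classes) score matrix into a string.

    Class 0 is the blank; class k (k >= 1) maps to charset[k - 1].
    """
    best = scores.argmax(axis=1)  # most likely class at each timestep
    chars = []
    prev = BLANK
    for idx in best:
        # Emit a character only when it is not blank and not a repeat
        # of the previous timestep; a blank in between allows doubles.
        if idx != BLANK and idx != prev:
            chars.append(charset[idx - 1])
        prev = idx
    return "".join(chars)
```

The decoded string is what an optional spell-correction pass would then operate on before rendering.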
Threads Overview
Each of these stages runs in a separate thread:
Thread | Role |
---|---|
`preprocess_thread` | Prepares and resizes input |
`det_thread` | Runs text detection HEF |
`detection_postprocess` | Extracts boxes, crops, resizes |
`ocr_thread` | Runs text recognition HEF |
`ocr_postprocess` | Groups and synchronizes OCR results |
`vis_postprocess` | Handles decoding, correction, and rendering |
Internal Queues
Queue Name | Purpose |
---|---|
`det_input_queue` | Holds original + preprocessed frames for the detector inference engine |
`det_postprocess_queue` | Receives detection outputs (raw tensors + original frames) for postprocessing |
`ocr_input_queue` | Carries cropped text regions + metadata to the OCR inference engine |
`ocr_postprocess_queue` | Receives OCR model outputs along with original frame and box info |
`vis_output_queue` | Collects final grouped results (frame, texts, boxes) for visualization and output |