You have just successfully compiled a custom Android build, flashed it to your reference board, and launched the Camera app. The preview starts, looks crisp, and the UI is responsive. But the moment you tap the shutter button to capture a high-resolution image, the preview freezes. The UI becomes unresponsive, and ten seconds later, the app crashes. Checking `logcat`, you don't see a simple NullPointerException. Instead, you are greeted with the dreaded framework timeout:
```
E Camera3-Device: RequestThread: Timed out waiting for flush to complete
E CameraService: binderDied: Java client's binder died, removing it
W CameraService: Disconnecting camera client 0 causing commits to fail.
```
This isn't an application error. This is a breakdown in the contract between the Android Framework and your Hardware Abstraction Layer (HAL). In a recent project bringing up a new image sensor on an AOSP (Android Open Source Project) build for a custom embedded platform, this exact scenario halted development for three days. The issue wasn't the sensor driver—it was a subtle mishandling of buffer fences within the HAL3 pipeline.
Architecture Analysis: The Buffer Traffic Jam
To understand why the camera service hangs, we need to look at the Camera HAL3 architecture. Unlike the legacy Camera API where the HAL pushed buffers directly, HAL3 is request-driven. The framework sends a capture request containing output buffers to the HAL. The HAL must fill these buffers and return them asynchronously.
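To make the contract concrete, here is a deliberately toy, single-threaded model of the request-driven flow. The type names echo the real HIDL/AIDL ones but are hypothetical simplifications (the real definitions carry stream handles, fences, and metadata blobs), and `FakeHal` exists only for illustration:

```cpp
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical, heavily simplified stand-ins for the HAL3 types.
struct CaptureRequest { uint32_t frameNumber; std::vector<int> outputBuffers; };
struct CaptureResult  { uint32_t frameNumber; std::vector<int> outputBuffers; };

// The framework hands the HAL a callback; the HAL must return every
// buffer of every request through it, asynchronously.
class FakeHal {
public:
    explicit FakeHal(std::function<void(const CaptureResult&)> cb)
        : mCallback(std::move(cb)) {}

    // Request-driven: the framework pushes requests *into* the HAL.
    // (A real HAL would invoke the callback from a worker thread.)
    void processCaptureRequest(const CaptureRequest& request) {
        mCallback({request.frameNumber, request.outputBuffers});
    }

private:
    std::function<void(const CaptureResult&)> mCallback;
};

// Drives two requests through the fake HAL and returns the frame
// numbers in the order their results came back.
std::vector<uint32_t> runTwoRequests() {
    std::vector<uint32_t> returned;
    FakeHal hal([&](const CaptureResult& r) { returned.push_back(r.frameNumber); });
    hal.processCaptureRequest({1, {100, 101}});  // e.g. preview + snapshot buffers
    hal.processCaptureRequest({2, {100}});
    return returned;
}
```

The key inversion versus the legacy API: buffers originate with the framework, and the HAL's only job is to fill them and hand them back through the callback.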
In our specific environment—running Android 14 on a quad-core ARM64 SoC with 4GB RAM—the deadlock occurred specifically during mixed use cases (e.g., keeping the Viewfinder preview stream alive while capturing a JPEG snapshot). The high-level symptoms pointed to a resource leak, but standard memory profilers showed nothing unusual.
If you also see `Surface: queueBuffer: Fence::wait returned -62 (Timer expired)` in the log alongside the camera service timeout, your HAL is holding onto a buffer longer than the framework allows.
The root cause lies in how the HAL handles the "Result Metadata" versus the "Output Buffers". The Android Camera Service expects the `SHUTTER` notify event to arrive relatively quickly. If your Image Signal Processor (ISP) takes 500ms to process a JPEG, and you block the metadata return thread waiting for that JPEG, you starve the framework's request queue. The framework thinks the device has hung and initiates a flush, which fails because the HAL is stuck waiting on hardware.
The "Sleep" Misconception
My first attempt to fix this was naive. I assumed the Image Signal Processor (ISP) was simply overloaded. I tried reducing the frame rate in the configuration files and adding small `usleep()` calls in the request processing loop to let the hardware "catch up."
This failed miserably. Adding delays only exacerbated the synchronization issues. The Camera2 API is designed to be fully asynchronous. By sleeping in the request thread, I was preventing the HAL from picking up new requests, causing the buffer queue to fill up even faster. The issue wasn't speed; it was order of operations.
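A toy, deterministic model (not HAL code) makes the arithmetic obvious: if a request arrives every tick but a sleeping HAL dequeues only one request every few ticks, the backlog grows without bound:

```cpp
#include <cstddef>

// Back-of-the-envelope model of a request queue. One request arrives per
// "tick"; a HAL that sleeps in its request loop dequeues only one request
// every `ticksPerRequest` ticks. Returns the queue depth after `totalTicks`.
std::size_t backlogAfter(std::size_t totalTicks, std::size_t ticksPerRequest) {
    std::size_t queued = 0;
    for (std::size_t t = 1; t <= totalTicks; ++t) {
        ++queued;                                  // framework submits a request
        if (t % ticksPerRequest == 0 && queued > 0) {
            --queued;                              // HAL finally dequeues one
        }
    }
    return queued;
}
```

A HAL that keeps pace (`ticksPerRequest == 1`) holds the queue at zero; one that sleeps through three out of four ticks leaves 75 requests stranded after 100 ticks. The sleep does not relieve pressure; it manufactures it.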
The Solution: Decoupling Metadata and Buffers
The fix involved re-architecting the `processCaptureRequest` method in the HAL implementation. We must separate the return of the capture result metadata from the return of the filled buffers. The framework allows (and prefers) partial results.
Here is the corrected logic pattern in C++. We immediately notify the framework that the shutter has occurred, and then we process the buffers in a separate worker thread. Crucially, we ensure that failed requests still return their buffers with an error status, rather than dropping them.
```cpp
// CameraDeviceSession.cpp snippet
// Correct way to handle asynchronous results in HAL3
Return<void> CameraDeviceSession::processCaptureRequest(
        const CaptureRequest& request) {
    // 1. Validate the request and its output streams.
    if (request.outputBuffers.empty()) {
        ALOGE("Request has no output buffers!");
        return Void();
    }

    // 2. IMMEDIATE: Notify the shutter event.
    // This tells the framework "we started capturing" and
    // prevents the "Timed out" error on the initial trigger.
    NotifyMsg msg;
    msg.type = MsgType::SHUTTER;
    msg.msg.shutter.frameNumber = request.frameNumber;
    msg.msg.shutter.timestamp = systemTime(SYSTEM_TIME_MONOTONIC);
    mCallback->notify({msg});

    // 3. Dispatch buffer processing to a worker thread.
    // DO NOT block here waiting for the ISP.
    std::thread([this, request]() {
        CaptureResult result;
        result.frameNumber = request.frameNumber;
        result.result = mMetadata;  // camera metadata for this frame

        // Acquire fence handling is critical: wait on each fence BEFORE
        // the hardware writes into the buffer, then close the fd so it
        // does not leak.
        for (const auto& buf : request.outputBuffers) {
            if (buf.acquireFence >= 0) {
                sync_wait(buf.acquireFence, /*timeout ms*/ 3000);
                close(buf.acquireFence);
            }
        }

        // ... (hardware/ISP processing happens here) ...

        // 4. IMPORTANT: Return every buffer explicitly.
        // Even if the hardware fails, return the buffer with
        // BufferStatus::ERROR rather than dropping it.
        for (const auto& buf : request.outputBuffers) {
            StreamBuffer resultBuffer;
            resultBuffer.bufferId = buf.bufferId;
            resultBuffer.status = BufferStatus::OK;  // or ERROR if the ISP failed
            result.outputBuffers.push_back(resultBuffer);
        }

        // Send the completed result back to the framework.
        mCallback->processCaptureResult({result});
    }).detach();

    return Void();
}
```
In the code above, step 2 is the game changer. By sending the `SHUTTER` notification immediately, we satisfy the framework's watchdog timer. Moving the heavy lifting to a detached thread (or a dedicated thread pool in a production environment) keeps the request queue moving.
Furthermore, handling the `acquireFence` in step 4 is mandatory. If the framework passes a fence, the HAL must wait on it before touching the buffer (or simply close it if the buffer is being returned untouched). Leaking these file descriptors is a common source of system instability that often requires a reboot or a re-flash via fastboot to clear.
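One low-tech way to make fence leaks structurally impossible is an RAII wrapper around the fd. In real HAL code `android::base::unique_fd` plays this role; the minimal sketch below (`UniqueFence` is a hypothetical name) shows the idea:

```cpp
#include <unistd.h>

// Minimal RAII owner for a fence file descriptor. Whatever path the
// request takes (success, ISP error, early return), the fd is closed
// exactly once when the wrapper goes out of scope.
class UniqueFence {
public:
    explicit UniqueFence(int fd = -1) : mFd(fd) {}
    ~UniqueFence() { reset(); }

    // Non-copyable: exactly one owner per fd.
    UniqueFence(const UniqueFence&) = delete;
    UniqueFence& operator=(const UniqueFence&) = delete;

    int get() const { return mFd; }

    // Closes the owned fd (if any) and optionally adopts a new one.
    void reset(int fd = -1) {
        if (mFd >= 0) close(mFd);
        mFd = fd;
    }

private:
    int mFd;
};
```

With this in place, the per-buffer fence logic becomes "wrap, wait, let it fall out of scope," and the error paths stop being special cases.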
Performance & Stability Verification
After applying this patch and rebuilding the `vendor.img`, I flashed the updated partition using fastboot: `fastboot flash vendor vendor.img`. The results were immediately apparent.
| Metric | Legacy Implementation | Async HAL Patch |
|---|---|---|
| Shutter Latency | 1200ms (avg) | 150ms (avg) |
| Success Rate (100 shots) | 12% (Crashed after ~12) | 100% |
| Viewfinder Jitter | Visible Stutter | Smooth 60fps |
The dramatic reduction in shutter latency (from 1.2s to 150ms) confirms that the bottleneck was indeed software-induced blocking. By adhering to the non-blocking philosophy of the Camera2 API, we allow the pipeline to fill efficiently. The viewfinder jitter disappeared because the preview request thread was no longer being blocked by the still capture processing.
Edge Cases & Warnings
While decoupling the threads solves the deadlock, it introduces a new challenge: concurrency management. If your underlying hardware ISP is not re-entrant or thread-safe, blindly dispatching threads will cause memory corruption.
In cases where the hardware only supports a single active stream, you must implement an internal serialization queue within the HAL. The HAL accepts requests from the framework immediately (to prevent timeouts) but queues them internally and processes them one by one. Do not confuse "Async Framework Contract" with "Parallel Hardware Execution."
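A minimal sketch of such a serialization queue, using standard C++ threading (the name `SerialDispatcher` is hypothetical): `submit()` returns immediately, so the framework-facing thread never blocks, while a single worker drains the queue in order so the non-reentrant ISP only ever sees one job at a time.

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Accepts tasks immediately; executes them one at a time, in order.
class SerialDispatcher {
public:
    SerialDispatcher() : mWorker([this] { run(); }) {}

    // Drains any remaining tasks, then joins the worker.
    ~SerialDispatcher() {
        {
            std::lock_guard<std::mutex> lock(mMutex);
            mStop = true;
        }
        mCv.notify_one();
        mWorker.join();
    }

    // Returns right away; the task runs later, in submission order.
    void submit(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(mMutex);
            mTasks.push(std::move(task));
        }
        mCv.notify_one();
    }

private:
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mMutex);
                mCv.wait(lock, [this] { return mStop || !mTasks.empty(); });
                if (mTasks.empty()) return;  // stop requested, queue drained
                task = std::move(mTasks.front());
                mTasks.pop();
            }
            task();  // only one task ever touches the hardware at a time
        }
    }

    std::mutex mMutex;
    std::condition_variable mCv;
    std::queue<std::function<void()>> mTasks;
    bool mStop = false;
    std::thread mWorker;  // declared last so other members exist before run()
};
```

In a HAL, each submitted task would be one capture request's hardware work; the framework-visible `processCaptureRequest` does nothing but validate, notify the shutter, and `submit()`.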
Use `adb logcat -b all | grep -E "Camera|HAL"` to monitor buffer states. If you see fence leaks, restarting the `media.camera` service can help, but a full fastboot reboot is often cleaner because it resets the kernel driver state.
Conclusion
Developing for the Android Camera HAL requires a mindset shift from linear programming to event-driven asynchronous design. The transition from the legacy API to Camera2 and HAL3 in AOSP provides immense power, but it punishes blocking operations severely. By ensuring your `processCaptureResult` logic is non-blocking and strictly managing fence file descriptors, you can achieve the buttery smooth camera performance that modern users expect.