Add stream sync timeout recovery to avoid hanging on unresponsive devices #702
indigo1973 wants to merge 1 commit into hw-native-sys:main from
Code Review
This pull request adds handling for host-side stream synchronization timeouts and the device unresponsiveness that follows, targeting cases where the CANN library might hang during cleanup after a device reset. It adds a device_unresponsive flag across the C++ and Python layers, uses aclrtSynchronizeStreamWithTimeout for stream synchronization, and provides 'abandon' methods that clear memory tracking without triggering potentially blocking rtFree calls. On a timeout, the system now resets the device directly and, in Python tests, calls os._exit to avoid hanging during interpreter shutdown.

Feedback suggests moving the manual cleanup logic in DeviceRunner::finalize into a dedicated method for better maintainability, and making resource release calls idempotent. It also recommends RAII guards for the performance buffers in ProfMemoryManager to make resource management more robust.
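The difference between a normal release and an 'abandon', and what idempotent release means in practice, can be sketched in a few lines. This is a minimal Python model, not the PR's C++ code: `TrackedAllocator`, `track`, `release_all`, and the `device_free` callback are hypothetical names chosen for illustration, and only `abandon` mirrors a method named in the PR.

```python
class TrackedAllocator:
    """Hypothetical stand-in for the PR's MemoryAllocator; not the real class."""

    def __init__(self, device_free):
        # device_free: callable that releases one device pointer (assumed).
        self._device_free = device_free
        self._live = set()

    def track(self, ptr):
        self._live.add(ptr)

    def release_all(self):
        # Normal path: free each pointer on the device, then forget it.
        # Each pointer is removed from tracking before it is freed, so a
        # second call finds nothing left and does nothing.
        while self._live:
            self._device_free(self._live.pop())

    def abandon(self):
        # Unresponsive-device path: drop the bookkeeping without any device
        # call, because a free could block. Returns the number dropped.
        dropped = len(self._live)
        self._live.clear()
        return dropped


freed = []
alloc = TrackedAllocator(device_free=freed.append)
alloc.track(0x1000)
alloc.track(0x2000)
alloc.release_all()
alloc.release_all()  # idempotent: nothing left to free

alloc.track(0x3000)
dropped = alloc.abandon()  # no device call is made for 0x3000
```

Because both methods leave the tracker empty, calling either one again is harmless, which is the property the review asks for in the cleanup path.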
Replace blocking rtStreamSynchronize with aclrtSynchronizeStreamWithTimeout (1s timeout). On timeout, set device_unresponsive flag, skip all rt* cleanup (which would block), reset the device via aclrtResetDevice, and propagate the flag to Python so scene_test calls os._exit() to avoid CANN fini hangs.

- DeviceRunner: add synchronize_stream_with_timeout(), device_unresponsive flag, timeout recovery path in finalize(), extract reset_device_and_acl()
- MemoryAllocator::abandon(), ProfMemoryManager::stop(skip_device_free)
- New device_unresponsive() in C API across all platform backends
- ChipWorker + Python bindings: expose device_unresponsive property
- scene_test: os._exit(1) when device_unresponsive is set after run
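The control flow described in this commit message can be modeled roughly as follows. It is a sketch under stated assumptions, not the PR's implementation: `FakeDevice`, `Runner`, and `hard_exit_if_unresponsive` are hypothetical, the real logic lives in the C++ DeviceRunner, and resetting the device on the healthy path as well is an assumption. The one real API used is `os._exit`, which terminates the process without running interpreter shutdown handlers.

```python
import os

SYNC_TIMEOUT_MS = 1000  # the commit message describes a 1 s timeout


class FakeDevice:
    """Hypothetical device model, used only to exercise the control flow."""

    def __init__(self, hangs):
        self.hangs = hangs
        self.calls = []

    def synchronize_with_timeout(self, timeout_ms):
        self.calls.append("sync")
        return not self.hangs  # False means the wait timed out

    def free_all(self):
        self.calls.append("free")

    def reset(self):
        self.calls.append("reset")


class Runner:
    """Hypothetical Python model of the finalize() recovery path."""

    def __init__(self, device):
        self.device = device
        self.device_unresponsive = False

    def finalize(self):
        if not self.device.synchronize_with_timeout(SYNC_TIMEOUT_MS):
            self.device_unresponsive = True
        if not self.device_unresponsive:
            # Only a responsive device gets the normal, possibly blocking cleanup.
            self.device.free_all()
        # Assumption: the device is reset on both paths.
        self.device.reset()


def hard_exit_if_unresponsive(runner, exit_fn=os._exit):
    # exit_fn is injectable so the sketch can be tested without killing
    # the process; a real test harness would use os._exit directly.
    if runner.device_unresponsive:
        exit_fn(1)
```

With a healthy device the call order is sync, free, reset. With a hung device the free step is skipped, and the harness exits with status 1 instead of going through normal interpreter shutdown.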
8a1326f to 6718c91