
Add stream sync timeout recovery to avoid hanging on unresponsive devices #702

Open
indigo1973 wants to merge 1 commit into hw-native-sys:main from indigo1973:prof_0425_test

@indigo1973
Contributor

Replace blocking rtStreamSynchronize with aclrtSynchronizeStreamWithTimeout (1s timeout). On timeout, set device_unresponsive flag, skip all rt* cleanup (which would block), reset the device via aclrtResetDevice, and propagate the flag to Python so scene_test calls os._exit() to avoid CANN fini hangs.

  • DeviceRunner: add synchronize_stream_with_timeout(), device_unresponsive flag, timeout recovery path in finalize(), extract reset_device_and_acl()
  • MemoryAllocator::abandon(), ProfMemoryManager::stop(skip_device_free)
  • New device_unresponsive() in C API across all platform backends
  • ChipWorker + Python bindings: expose device_unresponsive property
  • scene_test: os._exit(1) when device_unresponsive is set after run
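
The control flow described above can be modelled roughly as follows. This is a minimal Python sketch, not the PR's C++ code: `FakeDevice`, `RunnerSketch`, and `exit_if_unresponsive` are hypothetical names, the device calls are stand-ins for `aclrtSynchronizeStreamWithTimeout`, the blocking `rt*` cleanup, and `aclrtResetDevice`, and the assumption that the healthy path also ends with a device reset is mine.

```python
import os


class FakeDevice:
    """Hypothetical stand-in for the runtime; records which calls were made."""

    def __init__(self, responsive):
        self.responsive = responsive
        self.calls = []

    def sync_with_timeout(self, timeout_ms):
        # Models a timed stream sync: True on success, False on timeout.
        self.calls.append(f"sync({timeout_ms})")
        return self.responsive

    def free(self, ptr):
        # Models an rt* cleanup call that would block on a hung device.
        self.calls.append(f"free({ptr})")

    def reset(self):
        # Models the device reset step.
        self.calls.append("reset")


class RunnerSketch:
    """Hypothetical model of the timeout recovery path."""

    def __init__(self, device, allocations):
        self.device = device
        self.allocations = list(allocations)
        self.device_unresponsive = False

    def run(self):
        # 1 s timeout, as in the PR description.
        if not self.device.sync_with_timeout(1000):
            self.device_unresponsive = True

    def finalize(self):
        if self.device_unresponsive:
            # Abandon: drop tracking without touching the device.
            self.allocations.clear()
        else:
            for ptr in self.allocations:
                self.device.free(ptr)
            self.allocations.clear()
        # Assumption: both paths finish with a device reset.
        self.device.reset()


def exit_if_unresponsive(runner, exit_fn=os._exit):
    # os._exit skips interpreter shutdown hooks, which is the point here:
    # library teardown is never given the chance to hang.
    if runner.device_unresponsive:
        exit_fn(1)
```

Passing `exit_fn` as a parameter is only there so the sketch can be exercised without killing the interpreter; a test harness would use the default `os._exit`.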


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a mechanism to handle host-side stream synchronization timeouts and subsequent device unresponsiveness, specifically targeting issues where the CANN library might hang during cleanup after a device reset. It adds a device_unresponsive flag across the C++ and Python layers, implements aclrtSynchronizeStreamWithTimeout for stream synchronization, and provides 'abandon' methods to clear memory tracking without triggering potentially blocking rtFree calls. In the event of a timeout, the system now performs a direct device reset and, in Python tests, uses os._exit to avoid hanging during interpreter shutdown. Feedback suggests refactoring the manual cleanup logic in DeviceRunner::finalize into a dedicated method for better maintainability and ensuring resource release calls are idempotent. Additionally, there is a recommendation to use RAII guards for managing performance buffers in ProfMemoryManager to improve resource management robustness.
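
The two review suggestions (idempotent release, and a scope guard around performance buffers) could look something like the sketch below. It uses a Python context manager as the closest analogue of a C++ RAII guard; `TrackedAllocator` and `perf_buffers` are hypothetical names, not types from this repository, and `free_fn` stands in for whatever device free call the real allocator uses.

```python
from contextlib import contextmanager


class TrackedAllocator:
    """Hypothetical allocator that tracks live device pointers."""

    def __init__(self, free_fn):
        self._free_fn = free_fn
        self._live = set()

    def track(self, ptr):
        self._live.add(ptr)
        return ptr

    def live_count(self):
        return len(self._live)

    def release_all(self):
        # Idempotent: each pointer is removed from tracking before it is
        # freed, so a second call finds nothing left to release.
        while self._live:
            self._free_fn(self._live.pop())

    def abandon(self):
        # Clear tracking without calling free_fn, for a device that
        # would block on any free call.
        self._live.clear()


@contextmanager
def perf_buffers(allocator, ptrs, is_unresponsive):
    """Scope guard: buffers are released or abandoned on every exit path."""
    for ptr in ptrs:
        allocator.track(ptr)
    try:
        yield allocator
    finally:
        if is_unresponsive():
            allocator.abandon()
        else:
            allocator.release_all()
```

The guard decides between release and abandon at scope exit rather than at entry, since the device may become unresponsive while the buffers are in use.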

Comment thread src/a2a3/platform/onboard/host/device_runner.cpp
Comment thread src/a2a3/platform/src/host/l2_perf_collector.cpp