Conversation
|
|
|
Apparently there are some problems with shutdown that make the tests |
946b8b5 to
338a371
Compare
|
|
||
| --echo # case 1: Abort after resetting TRX_SYS page rollback segments | ||
| let $restart_parameters=--innodb_undo_tablespaces=4 --debug_dbug="+d,after_rseg_reset_abort"; | ||
| let $restart_parameters=--innodb_undo_tablespaces=4 --debug_dbug="+d,after_rseg_reset_abort,ib_log_checkpoint_avoid_hard"; |
There was a problem hiding this comment.
This was an attempt to avoid a test failure on Windows. It did not work out:
CURRENT_TEST: innodb.undo_upgrade_debug
--- C:/buildbot/workers/prod/amd64-windows/build/mysql-test/suite/innodb/r/undo_upgrade_debug.result 2026-03-09 10:49:32.801270600 +0100
+++ C:\buildbot\workers\prod\amd64-windows\build\mysql-test\suite\innodb\r\undo_upgrade_debug.reject 2026-03-09 10:52:37.306331700 +0100
@@ -18,26 +18,15 @@
# restart: --innodb_undo_tablespaces=2 --debug_dbug=+d,after_deleting_old_undo_abort,ib_log_checkpoint_avoid_hard
# restart: --innodb_undo_tablespaces=2
# Should list 2 undo log tablespaces
-undo001
-undo002
# case 3: Abort after successfully deleting the old undo tablespace
# restart: --innodb_undo_tablespaces=3 --debug_dbug=+d,after_deleting_old_undo_success
# restart: --innodb_undo_tablespaces=3
# Should list 3 undo log tablespaces
-undo001
-undo002
-undo003
# case 4: Abort after re-creating new undo tablespaces
# restart: --innodb_undo_tablespaces=4 --debug_dbug=+d,after_reinit_undo_abort
# restart: --innodb_undo_tablespaces=4
# Should list 4 undo log tablespaces
-undo001
-undo002
-undo003
-undo004
# case 5: Abort after re-creating new undo tablespaces successfully
# restart: --innodb_undo_tablespaces=2 --debug_dbug=+d,after_reinit_undo_success
# restart: --innodb_undo_tablespaces=2
# Should list 2 undo log tablespaces
-undo001
-undo002
Result length mismatchThe following code is related to this:
/** Delete the old undo tablespaces present in the undo log directory */
static dberr_t srv_undo_delete_old_tablespaces()
{
/* Delete the old undo tablespaces*/
for (uint32_t i= 0; i < srv_undo_tablespaces_open; ++i)
fil_close_tablespace(srv_undo_space_id_start + i);
DBUG_EXECUTE_IF("after_deleting_old_undo_abort", return DB_ERROR;);
/* Do checkpoint to get rid of old undo log tablespaces redo logs */
log_make_checkpoint();
DBUG_EXECUTE_IF("after_deleting_old_undo_success", return DB_ERROR;);
return DB_SUCCESS;
}On buildbot, this failure only seems to be reproducible on Microsoft Windows. I can reproduce it on Linux by extracting the files from this var.tar.gz and it looks like crash recovery really is broken when there is no checkpoint after deleting the files:
sql/mariadbd --innodb-log-file-size=10m --innodb-page-size=8k --datadir=/tmp/mysql-test/var/log/innodb.undo_upgrade_debug-8k/mysqld.1/data2026-03-09 12:12:28 0 [Note] InnoDB: Buffered log writes (block size=512 bytes)
2026-03-09 12:12:28 0 [Note] InnoDB: End of log at LSN=117100
2026-03-09 12:12:28 0 [ERROR] InnoDB: Failed to open the undo tablespace undo008
2026-03-09 12:12:28 0 [Note] InnoDB: Retry with innodb_force_recovery=5
2026-03-09 12:12:28 0 [ERROR] InnoDB: Plugin initialization aborted at srv0start.cc[1484] with error Generic error
2026-03-09 12:12:28 0 [Note] InnoDB: Starting shutdown...
I get the same result for every innodb_page_size. Only the name of the reported missing undo tablespace is occasionally varying. @Thirunarayanan, it seems to me that this is a genuine problem that used to be hidden for some reason.
There was a problem hiding this comment.
Thinking aloud, the problem ought to be solved by ensuring that another checkpoint will be completed before the fil_close_tablespace() loop. Possibly, we will need another log_make_checkpoint() call. I will check where the checkpoints are taking place in my environment, where this failure is not reproducible.
There was a problem hiding this comment.
On my system, I observe that checkpoints are taking place due to calls to log_make_checkpoint(). The problem appears to be that on Windows, log_make_checkpoint() occasionally fails to trigger and wait for a checkpoint.
There was a problem hiding this comment.
In the end, in 26fcc4b I solved this by ensuring that log_make_checkpoint() is called at each critical step of srv_undo_tablespaces_reinit(). Changes of innodb_undo_tablespaces were never fully crash-safe.
9b9ae8a to
26fcc4b
Compare
Concurrent calls to log_checkpoint_low() were possible from multiple threads, and they could cause redundant writes of FILE_CHECKPOINT records. Let us simplify the logic by making the dedicated buf_flush_page_cleaner() thread responsible for checkpoints. log_t::write_checkpoint(lsn_t end_lsn): Add the parameter checkpoint, which replaces log_sys.next_checkpoint_lsn. log_sys.checkpoint_pending: Remove. Only the buf_flush_page_cleaner thread will write checkpoints, hence there is no possibility of a race condition. log_checkpoint_low(), log_checkpoint(): Remove the return value, because there cannot be any concurrent log checkpoint in progress. buf_flush_wait(): Add a parameter for waiting for a full checkpoint. This function replaces buf_flush_wait_flushed(). log_t::checkpoint_margin(): Replaces log_checkpoint_margin(). log_make_checkpoint(): Delegate all work to buf_flush_page_cleaner(). buf_flush_sync_batch(): Add the parameter bool checkpoint, to wait for an empty FILE_CHECKPOINT record to be written. Outside recovery, pass lsn=0. buf_flush_sync_for_checkpoint(): On shutdown, update the systemd watchdog and keep flushing until a final checkpoint has been written. buf_flush_page_cleaner(): Revise the shutdown logic so that all changes will be written out and a checkpoint with just a FILE_CHECKPOINT record can be written. buf_flush_buffer_pool(): Remove. buf_flush_wait_flushed(): Require the caller to acquire buf_pool.flush_list_mutex. logs_empty_and_mark_files_at_shutdown(): Simplify the logic, and return the shutdown LSN. fil_names_clear(): Fix an off-by-one error that would prevent removal from fil_system.named_spaces. innodb_shutdown(): Always invoke logs_empty_and_mark_files_at_shutdown(). srv_undo_tablespaces_reinit(): Simplify the logic and remove the fault injection point after_reinit_undo_abort, which would cause a failure on Microsoft Windows. Changing innodb_undo_tablespaces is not fully crash-safe.
Description
Concurrent calls to
log_checkpoint_low()were possible from multiple threads, and they could cause redundant writes ofFILE_CHECKPOINTrecords. Let us simplify the logic by making the dedicatedbuf_flush_page_cleaner()thread responsible for checkpoints.log_t::write_checkpoint(lsn_t end_lsn): Add the parametercheckpoint, which replaceslog_sys.next_checkpoint_lsn.log_sys.checkpoint_pending: Remove. Only thebuf_flush_page_cleanerthread will write checkpoints, hence there is no possibility of a race condition.log_checkpoint_low(),log_checkpoint(): Remove the return value, because there cannot be any concurrent log checkpoint in progress.buf_flush_wait(): Add special handling forlog_sys.check_for_checkpoint()as well as shutdown.buf_flush_wait_flushed(): Assert thatbuf_flush_page_cleaner()is available.log_make_checkpoint(): Delegate all work to the page cleaner.buf_flush_sync_for_checkpoint(): Update the systemd watchdog. On shutdown, keep flushing until a checkpoint has been written.buf_flush_page_cleaner(): Revise the shutdown logic so that all changes will be written out.buf_flush_buffer_pool(): Remove.buf_flush_wait_flushed(): Require the caller to acquirebuf_pool.flush_list_mutex.logs_empty_and_mark_files_at_shutdown(): Simplify the logic.fil_names_clear(): Fix an off-by-one error that would prevent removal fromfil_system.named_spaces.Release Notes
InnoDB shutdown was simplified.
How can this PR be tested?
This ensures that despite
cmake -DWITH_INNODB_PMEM=ONthepwritebased log write code path will be used.Basing the PR against the correct MariaDB version
mainbranch.PR quality check