[fix](filecache) self-heal stale DOWNLOADED entries on local NOT_FOUND #60977

Open
freemandealer wants to merge 1 commit into apache:master from freemandealer:self-heal-NOT_FOUND
@freemandealer
Contributor

Problem:
In a rare restart window, BE can rebuild file-cache metadata in memory while
the corresponding cache files are not yet durable on disk. If that metadata is
also restored via LRU dump/load, blocks may appear as DOWNLOADED even though
the local files are missing. Subsequent reads then produce false-positive cache
hits, fail on local read, and repeatedly fall back to S3. This preserves
correctness but causes avoidable cache thrashing and latency jitter.

Root cause:
The read path treated DOWNLOADED as a valid local hit source and fell back to
remote reads on failure, but it did not actively invalidate stale metadata when
the local cache file was gone.
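
The self-healing idea described above can be sketched with a toy model. This is only an illustration under stated assumptions, not the code in this PR: `ToyFileCache`, `restore_metadata_only`, `local_not_found_count`, and the `self_heal` switch are hypothetical names, and two in-memory maps stand in for the cache metadata and the local disk. With `self_heal` off, the stale DOWNLOADED entry survives every failed local read; with it on, the first local NOT_FOUND drops the entry so the next read takes the normal miss path and re-populates the cache.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <unordered_map>

// Hypothetical toy model; these are NOT the Doris BE classes.
enum class BlockState { EMPTY, DOWNLOADED };

struct ReadResult {
    std::string data;
    bool served_from_cache;
};

using RemoteReader = std::function<std::string(const std::string&)>;

class ToyFileCache {
public:
    explicit ToyFileCache(bool self_heal) : _self_heal(self_heal) {}

    // Simulates metadata rebuilt after restart (e.g. from an LRU dump)
    // for a block whose local file never became durable.
    void restore_metadata_only(const std::string& key) {
        _meta[key] = BlockState::DOWNLOADED;
    }

    BlockState state(const std::string& key) const {
        auto it = _meta.find(key);
        return it == _meta.end() ? BlockState::EMPTY : it->second;
    }

    int local_not_found_count() const { return _local_not_found; }

    ReadResult read(const std::string& key, const RemoteReader& remote_read) {
        if (state(key) == BlockState::DOWNLOADED) {
            auto it = _disk.find(key);
            if (it != _disk.end()) {
                return {it->second, true};  // genuine local hit
            }
            // Local NOT_FOUND: the metadata is stale.
            ++_local_not_found;
            if (!_self_heal) {
                // Old behavior: fall back to remote, leave the stale entry.
                return {remote_read(key), false};
            }
            // Self-heal: invalidate the stale entry, then treat as a miss.
            _meta.erase(key);
        }
        // Ordinary miss path: fetch remotely and populate the local cache.
        std::string data = remote_read(key);
        _disk[key] = data;
        _meta[key] = BlockState::DOWNLOADED;
        assert(state(key) == BlockState::DOWNLOADED);
        return {data, false};
    }

private:
    bool _self_heal;
    int _local_not_found = 0;
    std::unordered_map<std::string, BlockState> _meta;  // in-memory metadata
    std::unordered_map<std::string, std::string> _disk; // stand-in for local files
};
```

In this model, two consecutive reads of a block restored with `restore_metadata_only` cost two remote reads and two local failures when `self_heal` is off, but one remote read and one local failure when it is on, because the second read becomes a genuine cache hit.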

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

Signed-off-by: freemandealer <freeman.zhang1992@gmail.com>
@Thearas
Contributor

Thearas commented Mar 3, 2026

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@freemandealer
Contributor Author

run buildall

@freemandealer
Contributor Author

/review

@doris-robot

TPC-H: Total hot run time: 28667 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit dbb7e5c41d34bcccc757a4484d71c4a5f7a0cf65, data reload: false

------ Round 1 ----------------------------------
============================================
query	cold (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot (ms)
q1	17679	4559	4292	4292
q2	
q3	10641	784	522	522
q4	4677	353	258	258
q5	7558	1196	1026	1026
q6	181	177	145	145
q7	770	846	660	660
q8	9305	1447	1349	1349
q9	4906	4706	4733	4706
q10	6820	1858	1623	1623
q11	451	263	243	243
q12	759	563	462	462
q13	17773	4217	3406	3406
q14	228	228	218	218
q15	949	795	787	787
q16	750	708	662	662
q17	707	867	412	412
q18	6302	5415	5246	5246
q19	1152	966	611	611
q20	513	487	379	379
q21	4499	1839	1420	1420
q22	360	285	240	240
Total cold run time: 96980 ms
Total hot run time: 28667 ms

----- Round 2, with runtime_filter_mode=off -----
============================================
query	cold (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot (ms)
q1	4407	4347	4341	4341
q2	
q3	1762	2186	1721	1721
q4	832	1156	760	760
q5	3998	4362	4260	4260
q6	176	172	137	137
q7	1734	1655	1486	1486
q8	2397	2652	2508	2508
q9	7424	7431	7443	7431
q10	2676	2915	2426	2426
q11	524	456	450	450
q12	520	622	435	435
q13	3947	4403	3606	3606
q14	285	291	272	272
q15	819	802	821	802
q16	712	785	719	719
q17	1167	1493	1298	1298
q18	7062	6957	6544	6544
q19	868	873	875	873
q20	2079	2163	2066	2066
q21	4042	3720	3379	3379
q22	473	453	374	374
Total cold run time: 47904 ms
Total hot run time: 45888 ms

@doris-robot

TPC-DS: Total hot run time: 183456 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit dbb7e5c41d34bcccc757a4484d71c4a5f7a0cf65, data reload: false

query	cold (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot (ms)
query5	4760	649	513	513
query6	338	220	207	207
query7	4209	465	268	268
query8	338	241	226	226
query9	8738	2679	2710	2679
query10	522	394	321	321
query11	16956	17440	17096	17096
query12	205	134	134	134
query13	1325	510	369	369
query14	6759	3492	3103	3103
query14_1	2978	2851	3017	2851
query15	217	196	192	192
query16	1040	474	475	474
query17	1945	800	602	602
query18	2606	532	387	387
query19	449	213	178	178
query20	141	128	126	126
query21	212	143	119	119
query22	5429	5258	4663	4663
query23	17152	16717	16564	16564
query23_1	16629	16638	16545	16545
query24	7191	1624	1232	1232
query24_1	1237	1237	1215	1215
query25	541	477	409	409
query26	1225	257	148	148
query27	2793	488	289	289
query28	4494	1886	1889	1886
query29	823	594	505	505
query30	311	249	214	214
query31	884	718	648	648
query32	88	74	77	74
query33	523	369	300	300
query34	917	917	568	568
query35	646	694	606	606
query36	1105	1142	980	980
query37	141	101	83	83
query38	3013	2977	2849	2849
query39	903	862	843	843
query39_1	814	841	820	820
query40	239	159	145	145
query41	87	65	67	65
query42	111	108	105	105
query43	389	383	361	361
query44	
query45	201	195	181	181
query46	885	1025	611	611
query47	2149	2130	2049	2049
query48	312	317	232	232
query49	617	467	391	391
query50	682	283	223	223
query51	4137	4078	4094	4078
query52	108	108	99	99
query53	290	332	284	284
query54	300	273	257	257
query55	89	85	87	85
query56	323	332	309	309
query57	1351	1325	1282	1282
query58	297	289	278	278
query59	2561	2701	2506	2506
query60	352	341	332	332
query61	153	146	150	146
query62	633	592	547	547
query63	320	275	279	275
query64	4900	1288	986	986
query65	
query66	1467	464	353	353
query67	16315	16416	16266	16266
query68	
query69	406	312	279	279
query70	1022	997	937	937
query71	346	328	299	299
query72	2801	2680	2402	2402
query73	532	543	322	322
query74	9949	9930	9741	9741
query75	2834	2752	2484	2484
query76	2307	1049	666	666
query77	359	385	305	305
query78	11031	11329	10675	10675
query79	1137	794	613	613
query80	684	628	559	559
query81	484	279	252	252
query82	1363	149	117	117
query83	368	264	245	245
query84	293	117	106	106
query85	847	474	431	431
query86	380	337	300	300
query87	3197	3112	3087	3087
query88	3521	2682	2689	2682
query89	419	371	342	342
query90	1925	182	173	173
query91	169	158	137	137
query92	78	77	75	75
query93	915	843	533	533
query94	455	326	302	302
query95	595	418	310	310
query96	652	527	230	230
query97	2467	2551	2406	2406
query98	233	231	221	221
query99	1018	983	930	930
Total cold run time: 254363 ms
Total hot run time: 183456 ms

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 100.00% (15/15) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 52.61% (19652/37352)
Line Coverage 36.22% (183455/506560)
Region Coverage 32.53% (142413/437805)
Branch Coverage 33.47% (61742/184448)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 100.00% (15/15) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 71.66% (26209/36574)
Line Coverage 54.42% (274826/505007)
Region Coverage 51.66% (228308/441942)
Branch Coverage 53.04% (98135/185012)
