Commit 6db30c9

Update by hanx@lhws312.ihep.ac.cn at 2026-01-20 19:52:59
1 parent 56b2413 commit 6db30c9

4 files changed

+167
-108
lines changed

public/images/legacyAdaptor.png

230 KB
196 KB
194 KB

slides.md

Lines changed: 167 additions & 108 deletions
Original file line number | Diff line number | Diff line change
@@ -213,7 +213,6 @@ title: Summary of Performance Issues
213213

214214
* **Queue Management**: Job scheduling inefficiencies under peak loads
215215

216-
217216
---
218217
layout: top-title-two-cols
219218
color: gray-light
@@ -224,54 +223,7 @@ columns: is-7
224223

225224
:: title ::
226225

227-
# Memory Peak for Sandbox Store Service
228-
229-
:: left ::
230-
231-
Memory usage of all DIRAC services are monitored
232-
233-
### Memory Overflow
234-
235-
Still investigating
236-
237-
but the DIRAC system is too complex, and the runit process management is outdated.
238-
239-
### Quick Fix
240-
241-
- A daemon process restarts services when services overload.
242-
- Weekly restart of critical services
243-
244-
245-
:: right ::
246-
247-
<div style="width: 100%; height: 20vh; overflow: hidden;">
248-
<iframe
249-
src="https://dci-grafana.ihep.ac.cn/d-solo/9s0Ru8wVk/service-monitoring?var-bin=30m&orgId=1&from=1767977729000&to=1768030252000&timezone=browser&var-ServiceName=WorkloadManagement_SandboxStore&var-HostName=prod-dirac.ihep.ac.cn&theme=light&panelId=3&__feature.dashboardSceneSolo"
250-
style="
251-
width: 200%;
252-
height: 40vh;
253-
transform: scale(0.5);
254-
transform-origin: 0 0;
255-
border: 0;
256-
"
257-
></iframe>
258-
</div>
259-
260-
Service was restarted manually during the campaign to clear memory usage.
261-
262-
And Sandbox recovered after the restart.
263-
264-
---
265-
layout: top-title-two-cols
266-
color: gray-light
267-
align: c-l-l
268-
title: Performance Screenshots and Analysis
269-
columns: is-7
270-
---
271-
272-
:: title ::
273-
274-
# High disk I/O for prod-dirac.ihep.ac.cn
226+
# High disk I/O for DIRAC main server
275227

276228
:: left ::
277229

@@ -282,7 +234,7 @@ Machine metrics for all DCI servers are monitored.
282234
- Disk I/O usage reaches 90% for prod-dirac.ihep.ac.cn
283235
<!-- - Disk I/O bottlenecks during mass data transfers -->
284236
- Read operations queued during high I/O periods
285-
- Sandbox not responding
237+
<!-- - Sandbox not responding -->
286238

287239
### Quick Fix
288240

@@ -319,6 +271,54 @@ Machine metrics for all DCI servers are monitored.
319271
></iframe>
320272
</div>
321273
274+
---
275+
layout: top-title-two-cols
276+
color: gray-light
277+
align: c-l-l
278+
title: Performance Screenshots and Analysis
279+
columns: is-7
280+
---
281+
282+
:: title ::
283+
284+
# Memory Peak for Sandbox Store Service
285+
286+
:: left ::
287+
288+
Memory usage of all DIRAC services is monitored
289+
290+
### Memory Overflow
291+
292+
Still investigating
293+
294+
but the DIRAC system is too complex, and the runit process management is outdated.
295+
296+
### Quick Fix
297+
298+
- A daemon process restarts services when services overload.
299+
- Weekly restart of critical services
300+
301+
302+
:: right ::
303+
304+
<div style="width: 100%; height: 20vh; overflow: hidden;">
305+
<iframe
306+
src="https://dci-grafana.ihep.ac.cn/d-solo/9s0Ru8wVk/service-monitoring?var-bin=30m&orgId=1&from=1767977729000&to=1768030252000&timezone=browser&var-ServiceName=WorkloadManagement_SandboxStore&var-HostName=prod-dirac.ihep.ac.cn&theme=light&panelId=3&__feature.dashboardSceneSolo"
307+
style="
308+
width: 200%;
309+
height: 40vh;
310+
transform: scale(0.5);
311+
transform-origin: 0 0;
312+
border: 0;
313+
"
314+
></iframe>
315+
</div>
316+
317+
Service was restarted manually during the campaign to clear memory usage.
318+
319+
And Sandbox recovered after the restart.
320+
321+
322322
---
323323
layout: top-title
324324
color: gray-light
@@ -371,9 +371,9 @@ columns: is-7
371371

372372
:: content ::
373373

374-
- Logs for all DIRAC services are monitored
374+
<!-- - Logs for all DIRAC services are monitored -->
375375

376-
High load for Configuration Service / Sandbox Store
376+
- High load for Configuration Service / Sandbox Store
377377

378378
![Config service](/public/images/cs_log.jpg)
379379

@@ -383,6 +383,8 @@ columns: is-7
383383

384384
- Added a load-balanced Configuration Service
385385

386+
- Adjusted the scheduling policy to reduce the number of jobs that end at the same time
387+
386388

387389
<!-- --- -->
388390
<!-- layout: top-title-two-cols -->
@@ -448,7 +450,7 @@ columns: is-6
448450

449451
### Long-term Solutions
450452

451-
**DiracX, "the neXt Dirac incarnation"**
453+
**Service Scaling**
452454
- Horizontal scaling of Dirac services
453455
- Database sharding and replication strategies
454456

@@ -457,6 +459,7 @@ columns: is-6
457459
- Network optimization between key sites
458460

459461
**Architecture Evolution**
462+
- DiracX, "the neXt Dirac incarnation"
460463
- Migration to DiracX services
461464
- Containerization of critical components
462465

@@ -495,25 +498,22 @@ timeline
495498
2006-2007 : DIRAC3<br> Full rewriting, development of the DISET protocol -- still in use today!
496499
: the current DIRAC framework is still based on this work
497500
section Open sourced, wider adoption
498-
2008 : Large-ish reshuffling to become multi-VO
501+
2008 - 2010s : Large-ish reshuffling to become multi-VO
499502
: LHCbDIRAC extension separated from core DIRAC code
500503
: CLIC community adopts DIRAC
501504
: France-Grilles is the first multi-VO DIRAC installation
502505
: 2010s - Belle2, BES3, CTA adopt DIRAC
503-
2021-2022 : Python3 full support
504-
: DIRAC v8.0
505-
2023-2025: First DiracX demo (during)
506-
: LHCb deploys in production alpha versions of DIRAC (v9.0.0aX) and DiracX (v0.0.1aX)
507-
: All existing DiracX components are tested and extended.
508-
: Several scalability and performance issues were addressed
506+
2020s : Python3 full support
507+
: DIRAC v8.0
508+
: First DiracX demo (during)
509509
section DIRAC and DiracX coexisting
510-
Q4 2025 : Release of DIRAC v9.0 and DiracX 0.0.1
511-
: JobStateUpdate service migrated and tested
512-
: Possible adoption by non-LHCb DIRAC users
513-
section DIRAC and DiracX coexisting (plan)
514-
Q1 2026 : Release DiracX 0.1.0
510+
2025 - 2026
511+
: LHCb deploys DIRAC and DiracX
512+
: Release of DIRAC v9.0 and DiracX 0.0.1
513+
: JobStateUpdate service migrated and tested
514+
: Release DiracX 0.1.0
515515
: DiracX introduces a task management system
516-
Q3 2026 : Release DIRAC v9.1 and 0.1.X (or 0.2.0)
516+
Plan : Release DIRAC v9.1 and 0.1.X (or 0.2.0)
517517
: First DiracX-only service.
518518
: Adoption of new Pilot security mechanism
519519
: First DiracX replacements for DIRAC agents or executors using DiracX tasks
@@ -522,63 +522,122 @@ timeline
522522

523523

524524
---
525-
layout: section
526-
color: blue-light
525+
layout: top-title
526+
color: gray-light
527+
align: c
528+
title: System Evolution Plan
527529
---
528530

529-
# Future Upgrade Plans
531+
:: title ::
532+
# DIRAC Upgrade & Migration Steps
533+
534+
:: content ::
535+
536+
# Key Technical Features
537+
538+
<ul class="text-base leading-7">
539+
<li>Containerized deployment to ensure standardized and portable runtime environments</li>
540+
<li><b>Kubernetes orchestration</b> to support automated operations and elastic scaling</li>
541+
<li><b>MySQL</b> + <b>OpenSearch</b> to improve log/metadata search efficiency</li>
542+
<li><b>S3 object storage</b> to optimize job sandbox management and data access performance</li>
543+
<li>A modern authentication system based on <b>OAuth2/OpenID</b> to enhance security</li>
544+
</ul>
545+
546+
547+
<div class="flex justify-between text-xs text-gray-500 mt-2 px-2">
548+
<span>Start</span>
549+
<span>Complete</span>
550+
</div>
551+
552+
<div class="flex items-center gap-4 text-center">
553+
<div class="rounded-xl border px-4 py-3 w-[220px] bg-blue-100">
554+
<div class="font-semibold text-sm">Database Upgrade</div>
555+
<div class="text-xs text-gray-700 mt-1">Add VO fields, rename tables, etc.</div>
556+
</div>
557+
558+
<div class="text-xl font-bold">→</div>
559+
560+
<div class="rounded-xl border px-4 py-3 w-[220px] bg-green-100">
561+
<div class="font-semibold text-sm">Containerized Deployment</div>
562+
<div class="text-xs text-gray-700 mt-1">Host → Container</div>
563+
</div>
564+
565+
<div class="text-xl font-bold">→</div>
566+
567+
<div class="rounded-xl border px-4 py-3 w-[220px] bg-orange-100">
568+
<div class="font-semibold text-sm">Dual-Stack Parallel Operation</div>
569+
<div class="text-xs text-gray-700 mt-1">DIRAC 9 + DiracX</div>
570+
</div>
571+
572+
<div class="text-xl font-bold">→</div>
573+
574+
<div class="rounded-xl border px-4 py-3 w-[220px] bg-red-100">
575+
<div class="font-semibold text-sm">Extension Module Adaptation</div>
576+
<div class="text-xs text-gray-700 mt-1">IHEPDIRAC migration</div>
577+
</div>
578+
579+
<div class="text-xl font-bold">→</div>
580+
581+
<div class="rounded-xl border px-4 py-3 w-[220px] bg-teal-100">
582+
<div class="font-semibold text-sm">DiracX Only</div>
583+
<div class="text-xs text-gray-700 mt-1">Migration of key subsystems</div>
584+
</div>
585+
</div>
586+
587+
530588

531589
---
532590
layout: top-title
533591
color: gray-light
534592
align: c
535-
title: System Upgrade Roadmap
593+
title: Migration
536594
---
537595

538596
:: title ::
539597

540-
# System Upgrade Roadmap
598+
### Migration to DiracX
599+
541600

542601
:: content ::
543602

544-
```mermaid
545-
%%{init: {'theme': 'base', 'timeline': {'disableMulticolor': true}}}%%
546-
timeline
547-
section Phase 1 - Foundation (2026)
548-
Q1-Q2 : Monitoring system enhancements<br>Critical service optimizations
549-
Q3-Q4 : Storage performance improvements<br>DiracX readiness assessment
550-
section Phase 2 - Scaling (2027)
551-
Q1-Q2 : Network infrastructure upgrades<br>Compute resource expansion
552-
Q2-Q3 : Containerization pilot projects<br>Storage system procurement
553-
Q3-Q4 : Advanced orchestration deployment<br>Network upgrade completion
554-
section Phase 3 - Transformation (2028)
555-
Q1-Q2 : DiracX migration completion<br>Production migration approval
556-
Q2-Q3 : Full system validation<br>Advanced monitoring implementation
557-
Q3-Q4 : Performance optimization<br>Scalability validation
558-
```
603+
Services of DIRAC v9 and DiracX will need to live together for some time
604+
605+
<Arrow x1="300" y1="170" x2="370" y2="170" />
606+
<!-- <Line :x1=345 :y1=200 :x2=345 :y2=500 :width=1 /> -->
607+
608+
<Arrow x1="610" y1="170" x2="680" y2="170" />
609+
<!-- <Line :x1=633 :y1=200 :x2=633 :y2=500 :width=1 /> -->
610+
611+
<div style="display: flex; align-items: center; justify-content: center;">
612+
<img id="D_X" src="/public/images/legacy_before_Adaptor.png" class="mx-auto w-1/4"> </img>
613+
<img id="D_Ad" src="/public/images/legacyAdaptor.png" class="mx-auto w-1/4"> </img>
614+
<img id="X" src="/public/images/legacy_after_Adaptor.png" class="mx-auto w-1/4"> </img>
615+
</div>
616+
617+
<SpeechBubble position="r" color='cyan' shape="round" v-drag="[100,350,40,60]">
618+
1
619+
</SpeechBubble>
620+
621+
<SpeechBubble position="r" color='cyan' shape="round" v-drag="[370,350,40,60]">
622+
2
623+
</SpeechBubble>
624+
625+
<SpeechBubble position="r" color='cyan' shape="round" v-drag="[660,350,40,60]">
626+
3
627+
</SpeechBubble>
628+
629+
<SpeechBubble position="t" color='amber' shape="round" v-drag="[160,350,120,180]">
630+
DIRAC and DiracX share the databases
631+
</SpeechBubble>
632+
633+
<SpeechBubble position="t" color='amber' shape="round" v-drag="[430,350,160,180]">
634+
A legacy adaptor moves traffic from DIRAC to DiracX services
635+
</SpeechBubble>
636+
637+
<SpeechBubble position="t" color='amber' shape="round" v-drag="[720,350,120,140]">
638+
DIRAC services can be removed
639+
</SpeechBubble>
559640

560-
<!-- ## JUNO DCI System Evolution Plan -->
561-
<!-- -->
562-
<!-- ### Hardware Infrastructure -->
563-
<!-- - **Compute Resources**: Expand GPU capabilities for ML workloads -->
564-
<!-- - **Storage Systems**: Upgrade to high-performance parallel file systems -->
565-
<!-- - **Network Fabric**: Implement dedicated science DMZ connections -->
566-
<!-- -->
567-
<!-- ### Software Stack -->
568-
<!-- - **Dirac Evolution**: Complete migration from DIRAC v8 to DiracX -->
569-
<!-- - **Containerization**: Full Docker/Kubernetes deployment for all services -->
570-
<!-- - **Orchestration**: Advanced workflow management and automation -->
571-
<!-- -->
572-
<!-- ### Architecture Optimization -->
573-
<!-- - **Microservices**: Decompose monolithic services into scalable components -->
574-
<!-- - **Data Management**: Implement tiered storage with intelligent caching -->
575-
<!-- - **Monitoring**: AI-driven predictive analytics and anomaly detection -->
576-
<!-- -->
577-
<!-- ### Key Objectives -->
578-
<!-- 1. **Scalability**: Support 5x current workload capacity -->
579-
<!-- 2. **Reliability**: Achieve 99.9% system availability -->
580-
<!-- 3. **Performance**: Reduce average job completion time by 50% -->
581-
<!-- 4. **Maintainability**: Simplify operations and reduce manual intervention -->
582641

583642
---
584643
layout: top-title-two-cols
