@@ -213,7 +213,6 @@ title: Summary of Performance Issues
213213
214214* **Queue Management**: Job scheduling inefficiencies under peak loads
215215
216-
217216---
218217layout: top-title-two-cols
219218color: gray-light
@@ -224,54 +223,7 @@ columns: is-7
224223
225224:: title ::
226225
227- # Memory Peak for Sandbox Store Service
228-
229- :: left ::
230-
231- Memory usage of all DIRAC services are monitored
232-
233- ### Memory Overflow
234-
235- Still investigating
236-
237- but the DIRAC system is too complex, and the runit process management is outdated.
238-
239- ### Quick Fix
240-
241- - A daemon process restarts services when services overload.
242- - Weekly restart of critical services
243-
244-
245- :: right ::
246-
247- <div style =" width : 100% ; height : 20vh ; overflow : hidden ;" >
248- <iframe
249- src="https://dci-grafana.ihep.ac.cn/d-solo/9s0Ru8wVk/service-monitoring?var-bin=30m&orgId=1&from=1767977729000&to=1768030252000&timezone=browser&var-ServiceName=WorkloadManagement_SandboxStore&var-HostName=prod-dirac.ihep.ac.cn&theme=light&panelId=3&__feature.dashboardSceneSolo"
250- style="
251- width: 200%;
252- height: 40vh;
253- transform: scale(0.5);
254- transform-origin: 0 0;
255- border: 0;
256- "
257- > </iframe >
258- </div>
259-
260- Service was restarted manually during the campaign to clear memory usage.
261-
262- And Sandbox recovered after the restart.
263-
264- ---
265- layout: top-title-two-cols
266- color: gray-light
267- align: c-l-l
268- title: Performance Screenshots and Analysis
269- columns: is-7
270- ---
271-
272- :: title ::
273-
274- # High disk I/O for prod-dirac.ihep.ac.cn
226+ # High disk I/O for DIRAC main server
275227
276228:: left ::
277229
@@ -282,7 +234,7 @@ Machine metrics for all DCI servers are monitored.
282234- Disk I/O usage reaches 90% for prod-dirac.ihep.ac.cn
283235<!-- - Disk I/O bottlenecks during mass data transfers -->
284236- Read operations queued during high I/O periods
285- - Sandbox not responding
237+ <!-- - Sandbox not responding -->
286238
287239### Quick Fix
288240
@@ -319,6 +271,54 @@ Machine metrics for all DCI servers are monitored.
319271> </iframe >
320272 </div>
321273
274+ ---
275+ layout: top-title-two-cols
276+ color: gray-light
277+ align: c-l-l
278+ title: Performance Screenshots and Analysis
279+ columns: is-7
280+ ---
281+
282+ :: title ::
283+
284+ # Memory Peak for Sandbox Store Service
285+
286+ :: left ::
287+
288+ Memory usage of all DIRAC services is monitored
289+
290+ ### Memory Overflow
291+
292+ Still investigating
293+
294+ but the DIRAC system is too complex, and the runit process management is outdated.
295+
296+ ### Quick Fix
297+
298+ - A daemon process restarts services when services overload.
299+ - Weekly restart of critical services
300+
301+
302+ :: right ::
303+
304+ <div style =" width : 100% ; height : 20vh ; overflow : hidden ;" >
305+ <iframe
306+ src="https://dci-grafana.ihep.ac.cn/d-solo/9s0Ru8wVk/service-monitoring?var-bin=30m&orgId=1&from=1767977729000&to=1768030252000&timezone=browser&var-ServiceName=WorkloadManagement_SandboxStore&var-HostName=prod-dirac.ihep.ac.cn&theme=light&panelId=3&__feature.dashboardSceneSolo"
307+ style="
308+ width: 200%;
309+ height: 40vh;
310+ transform: scale(0.5);
311+ transform-origin: 0 0;
312+ border: 0;
313+ "
314+ > </iframe >
315+ </div>
316+
317+ Service was restarted manually during the campaign to clear memory usage.
318+
319+ And Sandbox recovered after the restart.
320+
321+
322322---
323323layout: top-title
324324color: gray-light
@@ -371,9 +371,9 @@ columns: is-7
371371
372372:: content ::
373373
374- - Logs for all DIRAC services are monitored
374+ <!-- - Logs for all DIRAC services are monitored -->
375375
376- High load for Configuration Service / Sandbox Store
376+ - High load for Configuration Service / Sandbox Store
377377
378378![Config service](/public/images/cs_log.jpg)
379379
@@ -383,6 +383,8 @@ columns: is-7
383383
384384- Added a load-balanced Configuration Service
385385
386+ - Adjusted the scheduling policy to reduce the number of jobs that end at the same time
387+
386388
387389<!-- --- -->
388390<!-- layout: top-title-two-cols -->
@@ -448,7 +450,7 @@ columns: is-6
448450
449451### Long-term Solutions
450452
451- **DiracX, "the neXt Dirac incarnation"**
453+ **Service Scaling**
452454- Horizontal scaling of Dirac services
453455- Database sharding and replication strategies
454456
@@ -457,6 +459,7 @@ columns: is-6
457459- Network optimization between key sites
458460
459461**Architecture Evolution**
462+ - DiracX, "the neXt Dirac incarnation"
460463- Migration to DiracX services
461464- Containerization of critical components
462465
@@ -495,25 +498,22 @@ timeline
495498 2006-2007 : DIRAC3<br> Full rewriting, development of the DISET protocol -- still in use today!
496499 : the current DIRAC framework is still based on this work
497500 section Open sourced, wider adoption
498- 2008 : Large-ish reshuffling to become multi-VO
501+ 2008 - 2010s : Large-ish reshuffling to become multi-VO
499502 : LHCbDIRAC extension separated from core DIRAC code
500503 : CLIC community adopts DIRAC
501504 : France-Grilles is the first multi-VO DIRAC installation
502505 : 2010s - Belle2, BES3, CTA adopt DIRAC
503- 2021-2022 : Python3 full support
504- : DIRAC v8.0
505- 2023-2025: First DiracX demo (during)
506- : LHCb deploys in production alpha versions of DIRAC (v9.0.0aX) and DiracX (v0.0.1aX)
507- : All existing DiracX components are tested and extended.
508- : Several scalability and performance issues were addressed
506+ 2020s : Python3 full support
507+ : DIRAC v8.0
508+ : First DiracX demo (during)
509509 section DIRAC and DiracX coexisting
510- Q4 2025 : Release of DIRAC v9.0 and DiracX 0.0.1
511- : JobStateUpdate service migrated and tested
512- : Possible adoption by non-LHCb DIRAC users
513- section DIRAC and DiracX coexisting (plan)
514- Q1 2026 : Release DiracX 0.1.0
510+ 2025 - 2026
511+ : LHCb deploys DIRAC and DiracX
512+ : Release of DIRAC v9.0 and DiracX 0.0.1
513+ : JobStateUpdate service migrated and tested
514+ : Release DiracX 0.1.0
515515 : DiracX introduces a task management system
516- Q3 2026 : Release DIRAC v9.1 and 0.1.X (or 0.2.0)
516+ Plan : Release DIRAC v9.1 and 0.1.X (or 0.2.0)
517517 : First DiracX-only service.
518518 : Adoption of new Pilot security mechanism
519519 : First DiracX replacements for DIRAC agents or executors using DiracX tasks
@@ -522,63 +522,122 @@ timeline
522522
523523
524524---
525- layout: section
526- color: blue-light
525+ layout: top-title
526+ color: gray-light
527+ align: c
528+ title: System Evolution Plan
527529---
528530
529- # Future Upgrade Plans
531+ :: title ::
532+ # DIRAC Upgrade & Migration Steps
533+
534+ :: content ::
535+
536+ # Key Technical Features
537+
538+ <ul class =" text-base leading-7 " >
539+ <li >Containerized deployment to ensure standardized and portable runtime environments</li >
540+ <li ><b >Kubernetes orchestration</b > to support automated operations and elastic scaling</li >
541+ <li ><b >MySQL</b > + <b >OpenSearch</b > to improve log/metadata search efficiency</li >
542+ <li ><b >S3 object storage</b > to optimize job sandbox management and data access performance</li >
543+ <li >A modern authentication system based on <b >OAuth2/OpenID</b > to enhance security</li >
544+ </ul >
545+
546+
547+ <div class =" flex justify-between text-xs text-gray-500 mt-2 px-2 " >
548+ <span >Start</span >
549+ <span >Complete</span >
550+ </div >
551+
552+ <div class =" flex items-center gap-4 text-center " >
553+ <div class =" rounded-xl border px-4 py-3 w-[220px] bg-blue-100 " >
554+ <div class="font-semibold text-sm">Database Upgrade</div>
555+ <div class="text-xs text-gray-700 mt-1">Add VO fields, rename tables, etc.</div>
556+ </div >
557+
558+ <div class =" text-xl font-bold " >→</div >
559+
560+ <div class =" rounded-xl border px-4 py-3 w-[220px] bg-green-100 " >
561+ <div class="font-semibold text-sm">Containerized Deployment</div>
562+ <div class="text-xs text-gray-700 mt-1">Host → Container</div>
563+ </div >
564+
565+ <div class =" text-xl font-bold " >→</div >
566+
567+ <div class =" rounded-xl border px-4 py-3 w-[220px] bg-orange-100 " >
568+ <div class="font-semibold text-sm">Dual-Stack Parallel Operation</div>
569+ <div class="text-xs text-gray-700 mt-1">DIRAC 9 + DiracX</div>
570+ </div >
571+
572+ <div class =" text-xl font-bold " >→</div >
573+
574+ <div class =" rounded-xl border px-4 py-3 w-[220px] bg-red-100 " >
575+ <div class="font-semibold text-sm">Extension Module Adaptation</div>
576+ <div class="text-xs text-gray-700 mt-1">IHEPDIRAC migration</div>
577+ </div >
578+
579+ <div class =" text-xl font-bold " >→</div >
580+
581+ <div class =" rounded-xl border px-4 py-3 w-[220px] bg-teal-100 " >
582+ <div class="font-semibold text-sm">DiracX Only</div>
583+ <div class="text-xs text-gray-700 mt-1">Migration of key subsystems</div>
584+ </div >
585+ </div >
586+
587+
530588
531589---
532590layout: top-title
533591color: gray-light
534592align: c
535- title: System Upgrade Roadmap
593+ title: Migration
536594---
537595
538596:: title ::
539597
540- # System Upgrade Roadmap
598+ ### Migration to DiracX
599+
541600
542601:: content ::
543602
544- ``` mermaid
545- %%{init: {'theme': 'base', 'timeline': {'disableMulticolor': true}}}%%
546- timeline
547- section Phase 1 - Foundation (2026)
548- Q1-Q2 : Monitoring system enhancements<br>Critical service optimizations
549- Q3-Q4 : Storage performance improvements<br>DiracX readiness assessment
550- section Phase 2 - Scaling (2027)
551- Q1-Q2 : Network infrastructure upgrades<br>Compute resource expansion
552- Q2-Q3 : Containerization pilot projects<br>Storage system procurement
553- Q3-Q4 : Advanced orchestration deployment<br>Network upgrade completion
554- section Phase 3 - Transformation (2028)
555- Q1-Q2 : DiracX migration completion<br>Production migration approval
556- Q2-Q3 : Full system validation<br>Advanced monitoring implementation
557- Q3-Q4 : Performance optimization<br>Scalability validation
558- ```
603+ Services of DIRAC v9 and DiracX will need to live together for some time
604+
605+ <Arrow x1 =" 300 " y1 =" 170 " x2 =" 370 " y2 =" 170 " />
606+ <!-- <Line :x1=345 :y1=200 :x2=345 :y2=500 :width=1 /> -->
607+
608+ <Arrow x1 =" 610 " y1 =" 170 " x2 =" 680 " y2 =" 170 " />
609+ <!-- <Line :x1=633 :y1=200 :x2=633 :y2=500 :width=1 /> -->
610+
611+ <div style =" display : flex ; align-items : center ; justify-content : center ;" >
612+ <img id="D_X" src="/public/images/legacy_before_Adaptor.png" class="mx-auto w-1/4" />
613+ <img id="D_Ad" src="/public/images/legacyAdaptor.png" class="mx-auto w-1/4" />
614+ <img id="X" src="/public/images/legacy_after_Adaptor.png" class="mx-auto w-1/4" />
615+ </div >
616+
617+ <SpeechBubble position =" r " color =' cyan ' shape =" round " v-drag =" [100,350,40,60] " >
618+ 1
619+ </SpeechBubble >
620+
621+ <SpeechBubble position =" r " color =' cyan ' shape =" round " v-drag =" [370,350,40,60] " >
622+ 2
623+ </SpeechBubble >
624+
625+ <SpeechBubble position =" r " color =' cyan ' shape =" round " v-drag =" [660,350,40,60] " >
626+ 3
627+ </SpeechBubble >
628+
629+ <SpeechBubble position =" t " color =' amber ' shape =" round " v-drag =" [160,350,120,180] " >
630+ DIRAC and DiracX share the databases
631+ </SpeechBubble >
632+
633+ <SpeechBubble position =" t " color =' amber ' shape =" round " v-drag =" [430,350,160,180] " >
634+ A legacy adaptor moves traffic from DIRAC to DiracX services
635+ </SpeechBubble >
636+
637+ <SpeechBubble position =" t " color =' amber ' shape =" round " v-drag =" [720,350,120,140] " >
638+ DIRAC services can be removed
639+ </SpeechBubble >
559640
560- <!-- ## JUNO DCI System Evolution Plan -->
561- <!-- -->
562- <!-- ### Hardware Infrastructure -->
563- <!-- - **Compute Resources**: Expand GPU capabilities for ML workloads -->
564- <!-- - **Storage Systems**: Upgrade to high-performance parallel file systems -->
565- <!-- - **Network Fabric**: Implement dedicated science DMZ connections -->
566- <!-- -->
567- <!-- ### Software Stack -->
568- <!-- - **Dirac Evolution**: Complete migration from DIRAC v8 to DiracX -->
569- <!-- - **Containerization**: Full Docker/Kubernetes deployment for all services -->
570- <!-- - **Orchestration**: Advanced workflow management and automation -->
571- <!-- -->
572- <!-- ### Architecture Optimization -->
573- <!-- - **Microservices**: Decompose monolithic services into scalable components -->
574- <!-- - **Data Management**: Implement tiered storage with intelligent caching -->
575- <!-- - **Monitoring**: AI-driven predictive analytics and anomaly detection -->
576- <!-- -->
577- <!-- ### Key Objectives -->
578- <!-- 1. **Scalability**: Support 5x current workload capacity -->
579- <!-- 2. **Reliability**: Achieve 99.9% system availability -->
580- <!-- 3. **Performance**: Reduce average job completion time by 50% -->
581- <!-- 4. **Maintainability**: Simplify operations and reduce manual intervention -->
582641
583642---
584643layout: top-title-two-cols