
Pierre Charrue BE/CO - The LHC Controls - PowerPoint PPT Presentation



  1. Pierre Charrue – BE/CO

  2. • Preamble • The LHC Controls Infrastructure • External Dependencies • Redundancies • Control Room Power Loss • Conclusion

  3. • Preamble • The LHC Controls Infrastructure • External Dependencies • Redundancies • Control Room Power Loss • Conclusion

  4. • The Controls Infrastructure is designed to control the beams in the accelerators • It is not designed to protect the machine nor to ensure personnel safety • See the Machine Protection or Access infrastructures

  5. • Preamble • The LHC Controls Infrastructure • External Dependencies • Redundancies • Control Room Power Loss • Conclusion

  6. • The 3-tier architecture: hardware infrastructure and software layers • Resource tier (hardware): VME crates, PC gateways & PLCs dealing with high-performance acquisitions and real-time processing; the database where all the settings and configuration of all LHC devices exist • Server tier (business layer): application servers, data servers, file servers, central timing • Client tier (applications layer): interactive consoles, fixed displays, GUI applications • Communication to the equipment goes through the Controls Middleware (CMW)
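
As a rough illustration of the three tiers, the hypothetical sketch below shows a client-tier program reading a device property through a middleware layer that fronts the resource tier. The class and method names (MiddlewareClient, get_property) are invented for illustration and are not the actual CMW API.

```python
# Hypothetical illustration of the 3-tier pattern: a client-tier console or
# fixed display asks a CMW-like middleware layer for a device property that
# is ultimately served by a front-end in the resource tier.
# None of these names are the real CERN CMW API.

class MiddlewareClient:
    """Stand-in for the controls middleware used by client-tier programs."""

    def __init__(self, broker_address: str):
        self.broker_address = broker_address  # hypothetical middleware endpoint

    def get_property(self, device: str, prop: str) -> dict:
        # In reality this would perform a network call to the front-end
        # (VME crate, PC gateway or PLC) owning the device.
        raise NotImplementedError("network call to the resource tier")

    def set_property(self, device: str, prop: str, value: dict) -> None:
        # Settings normally flow from the business tier (application servers)
        # rather than directly from an operator console.
        raise NotImplementedError("network call to the resource tier")


if __name__ == "__main__":
    client = MiddlewareClient("cmw-broker.example.cern.ch")  # hypothetical host
    # A fixed display might poll a reading like this:
    # reading = client.get_property("LHC.BLM.EXAMPLE", "Acquisition")
```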

  7. • Since January 2006, accelerator operation is done from the CERN Control Centre (CCC) on the Prévessin site • The CCC hosts around 100 consoles and around 300 screens • The CCR is the rack room next to the CCC; it hosts more than 400 servers

  8. • Preamble • The LHC Controls Infrastructure • External Dependencies • Redundancies • Control Room Power Loss • Conclusion

  9. External dependencies. HARDWARE: • Electricity • Cooling and Ventilation • Network (Technical Network / General Purpose Network). SOFTWARE: • Oracle • IT Authentication • Oracle servers in IT

  10. • All Linux servers are HP ProLiants with dual power supplies • They are cabled to two separate 230 V UPS sources • High power consumption will drain the UPS batteries rapidly: 1 hour maximum autonomy • Each ProLiant consumes an average of 250 W
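
As a back-of-the-envelope check only, the sketch below relates the 250 W per server quoted on this slide to the one-hour autonomy. The server count is taken from slide 7, while the usable UPS energy is an illustrative assumption, not a figure from the presentation.

```python
# Rough UPS autonomy estimate. Only the 250 W/server figure comes from this
# slide; the server count (slide 7) is approximate and the usable UPS energy
# is an illustrative assumption.
servers = 400                      # CCR hosts "more than 400 servers"
watts_per_server = 250             # average consumption quoted on slide 10
total_load_kw = servers * watts_per_server / 1000.0   # = 100 kW

usable_ups_energy_kwh = 100.0      # assumed usable battery energy across the UPS sources
autonomy_hours = usable_ups_energy_kwh / total_load_kw

print(f"Total load: {total_load_kw:.0f} kW")
print(f"Estimated autonomy: {autonomy_hours * 60:.0f} minutes")
```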

  11. • Feb 2009: upgrade of the air flow and cooling circuits in the CCR; the CCR vulnerability to cooling problems has been resolved • In the event of loss of refrigeration, the CCR will overheat very quickly • Monitoring with temperature sensors and alarms is in place to ensure rapid intervention by TI operators • The CCR cooling state is monitored by the Technical Infrastructure Monitoring (TIM) with views which can show trends over the last 2 weeks
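
For illustration only, a minimal temperature-threshold check in the spirit of the monitoring described above could look like the sketch below; the sensor-reading function and the 28 °C threshold are invented, not real TIM parameters.

```python
# Minimal sketch of a temperature-threshold alarm for the CCR rack room.
# The sensor source and the threshold value are illustrative assumptions,
# not parameters of the real TIM system.
import random
import time

ALARM_THRESHOLD_C = 28.0   # hypothetical trip level


def read_ccr_temperature() -> float:
    """Stand-in for a real sensor read; returns a simulated value."""
    return random.uniform(20.0, 32.0)


def check_once() -> None:
    temperature = read_ccr_temperature()
    if temperature > ALARM_THRESHOLD_C:
        # A real system would raise an alarm towards the TI operators;
        # here we just print.
        print(f"ALARM: CCR temperature {temperature:.1f} C above threshold")
    else:
        print(f"OK: CCR temperature {temperature:.1f} C")


if __name__ == "__main__":
    for _ in range(3):
        check_once()
        time.sleep(1)
```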

  12. • Very reliable network topology • Redundant network routes • Redundant power supplies in routers and switches

  13. • The LHC Controls infrastructure is highly DATA centric • All accelerator parameters & settings are stored in a DB located in B513 (HWC Measurements, Logging, Controls Configuration, Measurements, LSA Settings, E-Logbook, CESAR) • Database hardware shown on the slide: 2 x quad-core 2.8 GHz CPU, 8 GB RAM, 11.4 TB usable; clustered NAS shelves with 14 x 146 GB FC disks and 14 x 300 GB SATA disks • Additional server for testing: standby database for LSA • Service availability: the new infrastructure has high redundancy for high availability; each service is deployed on a dedicated Oracle Real Application Cluster; the use of a standby database will be investigated (objective of reaching 100% uptime for LSA) • The Logging infrastructure can sustain a 24 h unavailability of the DB by keeping data in local buffers • A 'golden' level support with intervention in 24 h • Secure database access by granting specific privileges to dedicated DB accounts
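
To make the "keep data in local buffers" idea concrete, here is a minimal sketch, not the actual Logging service code: records are held in a local buffer while the database is unreachable and drained once it comes back. The class name, buffer size and the stubbed DB write are assumptions.

```python
# Minimal sketch of buffering log records while the DB is unavailable,
# in the spirit of the Logging infrastructure described on the slide.
from collections import deque
from typing import Any


class BufferedLogger:
    def __init__(self, max_buffered: int = 1_000_000):
        # In this toy model the buffer is simply "large enough" to ride out
        # an extended DB outage; real sizing would be capacity-planned.
        self.buffer: deque = deque(maxlen=max_buffered)

    def db_write(self, record: Any) -> bool:
        """Stand-in for an INSERT into the logging database."""
        return False  # pretend the DB is currently unavailable

    def log(self, record: Any) -> None:
        if not self.db_write(record):
            self.buffer.append(record)   # DB down: keep the record locally

    def flush(self) -> None:
        while self.buffer and self.db_write(self.buffer[0]):
            self.buffer.popleft()        # drain the backlog once the DB is back


if __name__ == "__main__":
    logger = BufferedLogger()
    logger.log({"device": "LHC.BLM.EXAMPLE", "value": 0.42})
    print(f"{len(logger.buffer)} record(s) buffered while the DB is unavailable")
```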

  14. • Needed online for Role Based Access Control (RBAC) and various web pages used by operators • Not used for operational logins on Linux • Windows caches recently used passwords

  15. • Preamble • The LHC Controls Infrastructure • External Dependencies • Redundancies • Control Room Power Loss • Conclusion

  16. • Remote reboot and terminal server functionality built in • Excellent power supply and fan redundancy, partial CPU redundancy, ECC memory • Excellent disk redundancy • Automatic warnings in case of a disk failure • Several backup methods: ADSM towards IT backup; daily or weekly rsync towards a storage place in Meyrin (data will be recovered in case of catastrophic failure in the CCR) • We are able to restore a back-end with destroyed disks in a few hours • 2nd PostMortem server installed on the Meyrin site
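
For illustration, a daily rsync job of the kind mentioned above might be wrapped as in the sketch below; the source path, destination host and directory are placeholders, not the real backup configuration.

```python
# Illustrative wrapper for a daily rsync backup towards a storage place in
# Meyrin, as mentioned on the slide. Paths and hostname are placeholders.
import subprocess

SOURCE_DIR = "/var/controls/backend/"                   # hypothetical data to protect
DEST = "backup-host.meyrin.example:/backup/ccr/"        # hypothetical Meyrin target


def run_backup() -> int:
    cmd = [
        "rsync",
        "-a",          # archive mode: preserve permissions, times, symlinks
        "--delete",    # keep the mirror in sync with the source
        SOURCE_DIR,
        DEST,
    ]
    return subprocess.call(cmd)


if __name__ == "__main__":
    raise SystemExit(run_backup())
```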

  17. VMEs: • Can survive limited fan failure • Some VME systems have redundant power supplies • Otherwise no additional redundancy • Remote reboot and terminal server are vital. PLCs: • Generally very reliable • Rarely have remote reboot because of the previous point (some LHC alcove PLCs do have a remote reboot)

  18. • LHC central timing: Master, Slave and Gateway using reflective memory, and a hot-standby switch • Timing is distributed over a dedicated network to timing receivers (CTRx) in the front ends
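
As a purely illustrative sketch of the hot-standby idea (not the actual timing-system logic), a standby node could promote itself when the master's heartbeat goes stale; the timeout value and heartbeat mechanism below are assumptions.

```python
# Toy illustration of master/standby failover, in the spirit of the LHC
# central timing description. Timeout and heartbeat mechanism are invented
# for the example, not the real implementation.
import time

HEARTBEAT_TIMEOUT_S = 2.0   # hypothetical: promote the standby after this silence


class StandbyTimingNode:
    def __init__(self):
        self.last_heartbeat = time.monotonic()
        self.is_master = False

    def on_heartbeat(self) -> None:
        """Called whenever the master's heartbeat arrives (e.g. via reflective memory)."""
        self.last_heartbeat = time.monotonic()

    def tick(self) -> None:
        """Periodic check: take over if the master has gone silent."""
        if not self.is_master and time.monotonic() - self.last_heartbeat > HEARTBEAT_TIMEOUT_S:
            self.is_master = True
            print("Master heartbeat lost: standby node taking over timing generation")


if __name__ == "__main__":
    node = StandbyTimingNode()
    node.on_heartbeat()
    time.sleep(0.1)
    node.tick()   # still within the timeout, so no failover happens here
```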

  19. Isolation of the Technical Network from external access: • CNIC initiative to separate the General Purpose Network from the Technical Network • No dependencies on GPN resources for operating the machines • Very few hosts from the GPN are allowed to access the TN • Regular Technical Network security scans
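
As a toy illustration of the "very few hosts allowed" policy, the sketch below checks a hostname against an allow-list; the hostnames are invented and the real CNIC rules are enforced in the network infrastructure, not in application code like this.

```python
# Toy allow-list check illustrating the GPN -> TN access restriction.
# Hostnames are invented; real filtering is done by the network
# infrastructure (CNIC rules), not by code like this.
ALLOWED_GPN_HOSTS = {
    "trusted-gateway-1.example.cern.ch",
    "trusted-gateway-2.example.cern.ch",
}


def may_access_technical_network(hostname: str) -> bool:
    return hostname in ALLOWED_GPN_HOSTS


if __name__ == "__main__":
    print(may_access_technical_network("random-office-pc.example.cern.ch"))   # False
    print(may_access_technical_network("trusted-gateway-1.example.cern.ch"))  # True
```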

  20. • High-level tools to diagnose and monitor the controls infrastructure (DIAMON and LASER) • Easy-to-use first-line diagnostics and tools to solve problems or help to decide about responsibilities for first-line intervention • Protecting the device access: RBAC initiative • Device access is authorized upon RULES applied to ROLES given to specific USERS • Protecting the Machine Critical Settings (e.g. BLM thresholds): can only be changed by an authorized person; uses RBAC for authentication & authorization; signs the data with a unique signature to ensure critical parameters have not been tampered with since the last update [DIAMON screenshot: Group View, Navigation Tree, Monitoring, Tests, Details, Repair tools]
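
As a schematic illustration of the two ideas on this slide (rules applied to roles, plus a signature over critical settings), the sketch below uses an HMAC as the "unique signature". The role names, rule table and key handling are assumptions for the example, not the actual RBAC or Machine Critical Settings implementation.

```python
# Schematic sketch of RBAC-style rules and signed critical settings.
# Roles, rules and the signing key are invented for illustration; the real
# RBAC / MCS implementation differs.
import hashlib
import hmac

# RULES applied to ROLES: which role may write which device/property.
RULES = {
    ("LHC.BLM", "threshold"): {"MCS-expert"},   # only this role may change BLM thresholds
}

USER_ROLES = {
    "alice": {"MCS-expert"},
    "bob": {"operator"},
}

SIGNING_KEY = b"hypothetical-secret-key"        # a real system manages keys properly


def may_set(user: str, device: str, prop: str) -> bool:
    allowed_roles = RULES.get((device, prop), set())
    return bool(USER_ROLES.get(user, set()) & allowed_roles)


def sign_setting(device: str, prop: str, value: str) -> str:
    payload = f"{device}/{prop}={value}".encode()
    return hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()


if __name__ == "__main__":
    print(may_set("bob", "LHC.BLM", "threshold"))    # False: operator role not allowed
    print(may_set("alice", "LHC.BLM", "threshold"))  # True: MCS-expert role allowed
    print(sign_setting("LHC.BLM", "threshold", "42"))
```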

  21. • Preamble • The LHC Controls Infrastructure • External Dependencies • Redundancies • Control Room Power Loss • Conclusion

  22. • Power loss in any LHC site: no access to equipment from this site (machine protection or OP will take action) • Power loss in the CCC/CCR: the CCC can sustain 1 hour on UPS; CCR cooling will be a problem; some CCR servers will still be up if the 2nd power source is not affected • 10 minutes on UPS for EOD1 • 1 hour on UPS for EOD2 and EOD9

  23. 6 March 2009, Pierre Charrue - BE/CO - LHC Risk Review
