  1. Computing at LHC experiments in the first year of data taking at 7 TeV
Daniele Bonacorsi [deputy CMS Computing coordinator - University of Bologna, Italy]
on behalf of ALICE, ATLAS, CMS, LHCb Computing
ISGC'11 - Taipei - 22 March 2011

  2. Growing up with Grids
LHC Computing Grid (LCG) approved by CERN Council in 2001
✦ First Grid Deployment Board (GDB) in 2002
Since then, LCG was built on services developed in the EU and US
✦ LCG has collaborated with a number of Grid projects: EGEE, NorduGrid, and Open Science Grid (OSG)
It evolved into the Worldwide LCG (WLCG)
✦ Coordination and service support for the operations of the 4 LHC experiments
Computing for LHC experiments grew up together with Grids
✦ Distributed computing achieved by previous experiments
LHC experiments started in this environment, in which most resources were located away from CERN
✦ A huge collaborative effort throughout the years, and massive cross-fertilizations

  3. WLCG today for LHC experiments
11 Tier-1 centres, >140 Tier-2 centres (plus Tier-3s)
✦ ~150k CPU cores, hitting 1M jobs/day
✦ >50 PB disk

  4. Site reliability in WLCG
Basic monitoring of WLCG services
✦ at Tier-0/1/2 levels
Site reliability is a key ingredient in the success of LHC Computing
✦ Result of a huge collaborative work
✦ Thanks to WLCG and site admins!
[Plots: WLCG site reliability from Jul'06 to Feb'11 and over 2009-2010, with the start of 2010 data taking at 7 TeV marked]

  5. Readiness of WLCG Tiers
Site Availability Monitoring
✦ Critical tests, per Tier, per experiment
Some experiments built their own readiness criteria on top of the basic ones
✦ e.g. CMS defines a "site readiness" based on a boolean 'AND' of many tests (illustrated in the sketch below)
‐ Easy to be OK on some
‐ Hard to be OK on all, and in a stable manner...
[Plots: number of "ready" CMS Tier-1 sites (plateau at ~7) and Tier-2 sites (plateau at ~40), Sep'08 to Mar'11, with the start of 2010 data taking at 7 TeV marked]
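A minimal sketch of such an AND-based criterion, in Python; the test names and values below are hypothetical, chosen only to illustrate the idea, not the actual CMS Site Readiness implementation.

# Minimal sketch (not the actual CMS Site Readiness code): a site counts as
# "ready" only if every critical test passes, i.e. a boolean AND over all results.
from typing import Dict

def site_ready(test_results: Dict[str, bool]) -> bool:
    """Return True only if all critical tests passed for this site."""
    return all(test_results.values())

# Hypothetical results for one site on one day: passing most tests is easy,
# passing all of them (and doing so stably, day after day) is the hard part.
results = {
    "sam_availability": True,
    "job_robot_efficiency": True,
    "data_transfer_quality": False,   # a single failing test flags the site
    "no_unscheduled_downtime": True,
}
print(site_ready(results))  # False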

  6. LHC Computing models
LHC Computing models are based on the MONARC model
✦ Tiered computing facilities to meet the needs of the LHC experiments
MONARC was developed more than a decade ago
✦ It served the community remarkably well, evolutions in progress
[Diagrams: ATLAS and CMS examples of T0/T1/T2 topologies, one a full mesh and one a hierarchical "cloud"; a toy comparison of the two follows below]
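A toy sketch of the difference between the two topologies, assuming a handful of hypothetical sites; it is not any experiment's data-management code, just a way to count how many transfer routes each model exposes.

# Toy comparison (not any experiment's actual data-management code) of the two
# topologies: a full mesh, where any pair of sites may exchange data directly,
# versus a MONARC-style "cloud", where each T2 talks only to its associated T1.
from itertools import combinations

# Hypothetical Tier-1 sites and their associated Tier-2s
clouds = {
    "T1_A": ["T2_A1", "T2_A2"],
    "T1_B": ["T2_B1"],
    "T1_C": ["T2_C1", "T2_C2"],
}

def cloud_routes(clouds):
    """Hierarchical model: T1-T1 links plus links from each T1 to its own T2s."""
    routes = set(combinations(sorted(clouds), 2))
    for t1, t2s in clouds.items():
        routes.update((t1, t2) for t2 in t2s)
    return routes

def full_mesh_routes(clouds):
    """Full-mesh model: every pair of sites, whatever the tier, is a possible route."""
    sites = sorted(set(clouds) | {t2 for t2s in clouds.values() for t2 in t2s})
    return set(combinations(sites, 2))

print(len(cloud_routes(clouds)), len(full_mesh_routes(clouds)))  # 8 vs 28

Even at this toy scale, the full mesh exposes several times more routes than the cloud model.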

  7. From commissioning to data taking
Timeline (2004 → 2011):
✦ DC2 (ATLAS) and DC04 (ALICE, CMS, LHCb): first full chain of the computing models on grids
✦ SC1 and SC2: network transfer tests
✦ SC3: sustained transfer rates, DM, service reliability
✦ SC4: nominal LHC rates, disk → tape tests, all T1s, some T2s
✦ more experiment-specific challenges along the way...
✦ CCRC08, phases I and II: readiness challenge, all experiments, ~full computing models
✦ STEP'09: scale challenges, all experiments + multi-VO overlap, + FULL computing models
✦ pp+HI data taking in 2010 and 2011
"Data Challenges": experiment-specific, independent tests
"Service Challenges" (since 2004): to demonstrate service aspects - DM and sustained data transfers, WM and scaling of job workloads, support processes, interoperability, security incidents ("fire drills")
"Readiness/Scale Challenges": Data/Service Challenges exercising aspects of the overall service at the same time, if possible with VO overlap
Run the service(s): focus on real and continuous production use of the services over several years - simulations (since 2003), cosmics data taking, …

  8. LHC data taking 2010
[Plot: integrated luminosity ramp-up in 2010 (log scale, preliminary)]
Remarkable ramp-up in luminosity in 2010
✦ At the beginning, a "good" weekend could double or triple the dataset
✦ a significant failure or outage for a fill would be a big fraction of the total data
Original planning for Computing in the first 6 months foresaw higher data volumes (tens of pb⁻¹)
✦ Time in stable beams per week reached 40% only a few times
Load on computing systems was lower than expected, no stress on resources
✦ The slower ramp has allowed the predicted activities to be performed more frequently
This will definitely not happen again in 2011: we will be resource constrained

  9. Networks
OPN links are now fully redundant
✦ Means no service interruptions
‐ See the fiber cut during STEP'09

  10. Networks in operations
Excellent monitoring systems

  11. CERN → T1 data transfers
CERN outbound traffic showed high performance and reliability
✦ Serving the needs of the LHC experiments very well
✦ A joint and long commissioning and testing effort to achieve this
[Plot: CERN → T1 transfer volumes for all experiments, up to ~1 PB, with the CCRC'08 challenge (phases I and II), the STEP'09 challenge, and the ICHEP'10 conference marked]

  12. An example: ATLAS data transfers
[Plot: ATLAS transfer throughput (GB/s, daily averages) through 2010, broken down by activity (T0 export incl. calibration streams, data consolidation, MC transfers in clouds, data brokering of analysis data, user subscriptions, extra-cloud MC transfers), spanning the MC production, 2009 data, data+MC and 2010 pp/PbPb data-taking periods, with the reprocessing campaigns at the T1s marked]
Transfers on all routes (among all Tier levels)
✦ Average: ~2.3 GB/s (daily average)
✦ Peak: ~7 GB/s (daily average)
Traffic on the OPN measured up to 70 Gbps
ATLAS massive reprocessing campaigns
✦ Data available on-site after a few hrs
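As a quick sanity check on these numbers (simple unit arithmetic, not taken from the slide), a daily average of ~2.3 GB/s corresponds to roughly 200 TB per day, of the same order as the >200 TB/day CMS figure quoted on the next slide:

# Quick sanity check on the quoted throughput figures (plain unit arithmetic;
# only the 2.3 GB/s, 7 GB/s and 70 Gbps numbers come from the slide).
SECONDS_PER_DAY = 86_400

avg_gb_s, peak_gb_s = 2.3, 7.0
print(f"average: ~{avg_gb_s * SECONDS_PER_DAY / 1000:.0f} TB/day")   # ~199 TB/day
print(f"peak:    ~{peak_gb_s * SECONDS_PER_DAY / 1000:.0f} TB/day")  # ~605 TB/day

opn_gbps = 70  # peak traffic seen on the Optical Private Network, in gigabits/s
print(f"70 Gbps = {opn_gbps / 8:.2f} GB/s")                          # 8.75 GB/s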

  13. An example: CMS data transfers
[Plot: CMS PhEDEx transfer volumes (log scale)]
CMS improved through ad-hoc challenges of increasing complexity and through computing commissioning activities
Massive commissioning, now in continuous production mode of operations
✦ Can sustain up to >200 TB/day of production transfers over the overall topology

  14. More examples: ALICE and LHCb data transfers
LHCb data is successfully transferred on a regular basis
✦ RAW data is replicated to one of the T1 sites
[Plot: LHCb T0 → T1 transfer volume, up to ~80k GB]
ALICE transfers among all Tiers
[Plot: number of completed ALICE transfers, ~325k]

  15. Reprocessing
Once landed at the T1 level, LHC data gets reprocessed as needed
✦ New calibrations, improved software, new data formats
ATLAS: 4 reprocessing campaigns in 2010
✦ Feb'10: 2009 pp data + cosmics
✦ Apr'10: 2009/2010 data
✦ May'10: 2009/2010 data + MC
✦ Nov'10: full 2010 data + MC (from tapes)
+ HI reprocessing foreseen in Mar'11
CMS: ~a dozen reprocessing passes in 2010
ALICE: HI reconstruction with opportunistic usage of resources
LHCb: Pass-2 reconstruction
[Plots: # jobs vs time (reprocessing passes only) for ATLAS (~6k), CMS (~16k), and ALICE/LHCb (~6k)]

  16. Reprocessing profile
In 2010, it was possible to reprocess even more frequently than originally planned
ATLAS reprocessed 100% of the data
✦ RAW → ESD
✦ ESD merge
✦ ESD → dESD, AOD
✦ Grid distribution of derived data
[Plot: ATLAS fraction complete per T1 (CA, DE, ES, FR, IT, ND, NL, UK, US), normalised for each T1, vs campaign day 0-12; from day 7 onwards the campaign was mostly dealing with tails]
About a dozen CMS reprocessing passes in 2010
[Plot: CMS reprocessed events in 2010, up to ~1.5G events]
