Releases: cmsdaq/hltd

hltd-1.9.5-1

12 May 12:07

Deployed on 12 May 2016

List of changes wrt. 1.9.4:

  • FU output bandwidth variable provided in resource_summary (only DAT files are counted, but those account for most of the transfer size). It is summed over the appliance FUs. The output of the current run alone is also provided.
  • boxinfo "max LS with output" variable (min. over all FUs in appliance is given in resource_summary)
  • all events declared error events if copying data fails in output merging to BU
  • detection of DQMHistograms stream based on file extension instead of stream name in the micro-merging script
  • detecting the EoR file if it appears before inotify is set up on BU, and handling creation of the output dir if all FUs are in cloud for the run, to trigger deletion
  • parallel herod to FUs, with the timeout increased to 10 s and one retry added; it now also contacts BUs that are stale but reported within the last 300 seconds
  • herod/tsunami now kills merging and elastic scripts (there is however a new "brutus" command which allows those to finish)
  • herod/tsunami/brutus command followed by run number in the name will purge only that run
  • retry 5 times on cloud script error on service startup
  • create missing EoLS on FU if all processes were crashed and run has ended in the meantime (to close open lumi on FU before quitting)
  • fixed empty "activeRuns" in elasticsearch when two run numbers are present at the same time (fixes premature exit of River plugin and massive "index closed" errors in F3mon)
  • support for MergeType attribute in output json files (the CMSSW patch adding it to the json will be pushed afterwards). It will be placed after "TransferDestination" and before "HLTErrorEvents".
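
The parallel herod forwarding listed above (10 s timeout, one retry) could look roughly like the following. This is only a sketch of the general pattern, not the hltd implementation: `forward_to_all`, `contact_with_retry` and the `send(host, timeout)` callable are hypothetical names, and the transport used to reach each FU is left to the caller.

```python
from concurrent.futures import ThreadPoolExecutor


def contact_with_retry(host, send, timeout=10.0, retries=1):
    """Call send(host, timeout); on failure try again up to `retries` times."""
    for _ in range(retries + 1):
        try:
            send(host, timeout)  # hypothetical transport supplied by the caller
            return True
        except Exception:
            # A real service would catch narrower errors (timeouts, refused
            # connections) and log them before retrying.
            continue
    return False


def forward_to_all(hosts, send, timeout=10.0, retries=1):
    """Contact every host in parallel and return {host: succeeded}."""
    hosts = list(hosts)
    if not hosts:
        return {}
    with ThreadPoolExecutor(max_workers=len(hosts)) as pool:
        outcomes = pool.map(
            lambda h: contact_with_retry(h, send, timeout, retries), hosts
        )
        return dict(zip(hosts, outcomes))
```

One thread per host keeps a single slow or unreachable FU from delaying the others, which is the point of moving away from a sequential loop.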

hltd-1.9.4-0

22 Apr 10:01

hltd version installed in production DAQ on 22.4.2016.

Changes:

  • control network IP addresses of FUs are written in box files, so hltd services on BU do not have to contact DNS at run start (this was timing out sometimes).
  • added run information to hltd logs inserted into elastic
  • boot timestamp propagated from BU to FUs through BU box file, so that FUs can remount if stale handle is detected and boot timestamp differs from the previous one. This will help with stale handle issues after BU is rebooted in case there was no clean shutdown sequence to trigger remounts.
  • added bu_stop_requests_flag which will be set in resource_summary file if all non-stale and non-blacklisted FUs are switched to or in transition to cloud.
  • fixed setting of the quarantined condition flag, after which resources are not restored automatically.
  • error stream output is not generated if the EoLS file was not written out (when ending a run, the FU was sometimes creating error stream files for a non-existent last lumisection in FU and BU output)

hltd-1.9.3-3

22 Mar 16:41

Changes:

  • Using correct dat file destination path for micromerge-by-hltd mode
  • catching connection timeout exception which was able to prevent monitoring updates until hltd restart on BU

Installed in CDAQ on 22.3.2016.

hltd-1.9.3-2

18 Mar 12:46

Major changes in 1.9.3

  • Removing elasticsearch dependency
  • micromerge-by-hltd is now the default
  • HOME variable set for CMSSW_8_0_X
  • fix flagging run as quarantined (caused problems when switching to cloud)

hltd-1.9.1-0

08 Feb 09:54

Release for 2016 MWGR1.
Highlights:

  • new elasticsearch infrastructure (phasing out local appliance elasticsearch) and index changes for compatibility with elasticsearch 2.2
  • refactored main process code (hltd.py) into modules
  • jsns/data split in BU output
  • new document type (heartbeat) to replace host state monitoring through es-tribe. also provides system information and cloud state, independently from data network availability.

hltd-1.8.0-0

19 Nov 11:02

Tagging 1.8.0 hltd and fffmeta release used in production.

hltd-1.7.0-0

11 May 12:53

Changes:

  • major refactoring of internal bookkeeping of runs; instead of multiple lists used to keep runs in various states (active, ending, all runs ...), a single object now holds the run state
  • json format used for boxinfo files, new information (per-run status) injected into elasticsearch boxinfo files.
  • stale detection in boxinfo files compares the pre-write timestamp on FU with the read time on BU from ramdisk (to handle the case where a very late file update by a slow FU succeeds)
  • resource summary written to elasticsearch
    • improved detection of mapping changes
  • improved herod command when issued on BU: deletes ramdisk and output and forwards command to FUs
  • support for cloud mode switchover, using igniter scripts. detection at hltd start, and even reboot if core files are present in the cloud
  • supports zero-event stream output files and empty lumisections (experimental and could change)
  • if working directory is not a dedicated partition, hltd will report the directory size (compared to a configurable threshold - default is 2 GB)
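
The directory size check in the last item above can be sketched as follows. This is an illustration under stated assumptions rather than the hltd code: the function names are hypothetical, and "2 GB" is interpreted here as 2 GiB, which the release notes do not specify.

```python
import os

# Assumption: the default "2 GB" threshold is taken to mean 2 GiB.
DEFAULT_THRESHOLD_BYTES = 2 * 1024 ** 3


def directory_size(path):
    """Return the total size in bytes of all files below `path`."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                # The file may have been deleted between listing and stat.
                pass
    return total


def exceeds_threshold(path, threshold=DEFAULT_THRESHOLD_BYTES):
    """True if the directory content is larger than the threshold."""
    return directory_size(path) > threshold
```

A full walk like this scales with the number of files, which is consistent with the slowness noted under the known problems when many files accumulate in the working directory.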

Known problems:

  • local ES template should be updated at the run start, but is currently broken (queued for 1.7.1)
  • directory size check on FUs can be slow if the elasticsearch script does not delete files in time (will be improved in 1.7.1 to ignore monitoring subdirectories in the working directory)
  • trace logs from elastic.py timeouts now appear in the central log. These are known issues (temporary timeouts) when creating an index and do not cause any loss of monitoring info; they appear mostly on legacy C6100 nodes. This was suppressed in 1.6.X releases (and will be suppressed again in 1.7.1).
  • paramedic and HQ plugins were removed (unused and significantly increased fffmeta rpm size). Updated to a version of "bigdesk" plugin which works with elasticsearch 1.4.2

hltd-1.6.3-1

22 Apr 12:25

Changes compared to 1.6.3-0:

  • DQM changes - run_type is renamed to run_key and the RE that extracts the run_key from the .global file is loosened.
  • 10MB wsize parameter used for the output mountpoint
  • fixed waiting for elastic process (wrong variable name was used). Because of this, the fix in 1.3.0 was not cleaning defunct child processes

hltd-1.6.3-0

16 Mar 18:41

New features:

  • fffParameters.jsn in ramdisk hlt subdirectory is now used to read CMSSW SW_ARCH parameters and selected transfer mode.
    • this release is not backwards compatible with the old method of providing only SCRAM_ARCH and CMSSW_VERSION files
    • name of the file is configurable with paramfile_name parameter in [HLT] section of hltd.conf
  • HLT menu invoked with transferMode parameter
    • Incompatible with HLT python menus not defining the new "selectedTransferMode" VarArgs parameter
  • new "fff" init script which manages elasticsearch and hltd services, and fetches updated configuration from HWConfDB before (re)starting services.
    • usage:
      • /sbin/service fff stop|start|restart|status
      • /sbin/service fff hltd stop|start|restart|status
      • /sbin/service fff elasticsearch stop|start|restart|status
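
As a sketch of how the configurable parameter file name from the list above might be resolved, the snippet below reads `paramfile_name` from the `[HLT]` section of an hltd.conf-style text. The option and section names come from the notes above; the function name, the use of `fffParameters.jsn` as the fallback, and the directory argument are assumptions made for illustration.

```python
import configparser
import os


def param_file_path(conf_text, hlt_dir):
    """Return the parameter file path inside the hlt subdirectory."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    # Assumption: fall back to fffParameters.jsn when the option (or the
    # whole [HLT] section) is absent.
    name = parser.get("HLT", "paramfile_name", fallback="fffParameters.jsn")
    return os.path.join(hlt_dir, name)
```

For example, `param_file_path("[HLT]\nparamfile_name = custom.jsn\n", hlt_dir)` selects `custom.jsn` in the given directory, while a configuration without the option uses the fallback name.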

Bugfixes:

  • avoid trying to terminate elasticbu.py on shutdown by properly moving internal subprocess object out of scope (avoids killing wrong process in case PIDs were rotated)
  • bus config name written in FFF configuration even if BU data address is not in DNS. hltd will still not be able to mount ramdisk and start if target can not be mounted (will poll until mountpoint is available)

hltd-1.6.2-2

05 Mar 16:24

Changes:

  • added lookupcache=positive for NFS ramdisk mountpoints
  • mount point stale file handle detection
  • added missing comma in minidaq machine list (the missing comma caused the two affected FUs to send their logs into the cdaq index)
  • included DQM PR which sets dqm_globallock = True by default