forked from PanDAWMS/pilot
-
Notifications
You must be signed in to change notification settings - Fork 0
/
CHANGES
415 lines (350 loc) · 27.9 KB
/
CHANGES
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
Complete change log for PanDA Pilot version PICARD
--------------------------------------------------
65.0
General changes
- Added taskId argument to addEnvVars2Cmd() (ATLASExperiment)
- Now sending job.taskID to addEnvVars2Cmd() from getJobExecutionCommand() and getJobExecutionCommandOld() (ATLASExperiment, NordugridATLASExperiment)
- Added taskId to FRONTIER_ID (taskId_jobId) in addEnvVars2Cmd(), getEnvVars2Cmd() requested by Rodney Walker (ATLASExperiment)
- Simplified call to mover_put_data() from stageOut() (RunJob, RunJobEvent)
- Simplified call to mover_put_data() from transferActualLogFile(), transferAdditionalFile() (JobLog)
- Simplified call to mover_put_data() from TransferFiles() (DeferredStageout)
- Simplified call to mover_put_data() from moveLostOutputFiles() (pilot)
- Created removeNoOutputFiles() (FileHandling)
- Now removing files from output file list if they are listed in allowNoOutput and do not exist, in __main__(). Requested by Tadashi Maeno (RunJob)
- Renamed output_files_json to extracted_output_files in __main__() (RunJob)
- Now explicitly removing 'madevent' in removeRedundantFiles(), requested by Rodney Walker (ATLASExperiment)
- Removed unnecessary functions, getProdCmd3() and getProdCmd3Old() and updated usages from getJobExecutionCommandOld() (ATLASExperiment)
- Simplified the implementation of getAnalysisRunCommand() in Experiment and moved the actual implementation to ATLASExperiment (Experiment, ATLASExperiment)
- Removed LFC_HOME variable and single usage in class declaration and to_native_lfn() (SiteMover)
- Added --makeflags=$MAKEFLAGS option to asetup in getJobExecutionCommand() and getProperASetup(), requested by Asoka De Silva (ATLASExperiment)
- Added exception for log file in __checkPayloadStdout() (Monitor)
- Created extractOutputFiles() used by __main__ in RunJob instead of extractOutputFilesJSON() (FileHandling, RunJob)
- Removed some debug messages that caused too much stdout leading to large log sizes, e.g. in getSURLDictionary() (SiteMover)
- Removed some debug messages that caused too much stdout leading to large log sizes, e.g. in getDatasetDict() (pUtil)
- Using os.getcwd() as default for wntmpdir instead of /tmp, in set_environment(). Requested by Andrej Filipcic (environment)
- Out-commented call to deprecated getNewQueuedata() in handleQueuedata() (pUtil)
Send DB info in jobMetrics, requested by Rodney Walker
- Created getDBInfo() (FileHandling)
- Added dbTime, dbData job object strings (Job)
- Added call to getDBInfo() from extractJobInformation() (ErrorDiagnosis)
- Now adding dbTime and dbData to jobMetrics in getJobMetrics() (PandaServerClient)
- Added dbTime and dbData to updateJobInfo() (RunJobUtilities)
- Added dbTime and dbData to handle() (UpdateHandler)
Limit number of MemoryMonitor restarts, requested by Rodney Walker
- Added utility_subprocess_launches variables + handling to __main__() (RunJob, RunJobEvent)
Memory monitoring
- Fix for dangling utility_subprocess (RunJob, RunJobEvent)
Same user multi-job request by Emmanouil Vamvakopoulos
- Added allowSameUser boolean and taskID (environment)
- Enabled pilot option -B <allowSameUser> (pilot)
- Sending prodUserID with job request in case allowSameUser is set, in getDispatcherDictionary() (pilot)
- Setting prodUserID in getNewJob() (pilot)
Bug fixes
- Removed tmp return from handleDBRelease() which caused skipping of DBRelease file to fail (Mover)
- Removed duplicated getFileInfoFromRucio() (SiteMover)
- Corrected _output to output in _getRemoteFileSizeLCGLS() (SiteMover)
- Removed -ALL suffix from agis_schedconf.json which seems to prevent the caching of the file to work properly, in getNewQueuedata() (SiteInformation)
- Sending job.prodDBlockToken to RunJobUtilities.updateRunCommandList() from __main__() (RunJob)
- Added prodDBlockToken argument to updateRunCommandList() (RunJobUtilities)
- Now removing --directIn payload option if 'local' is present in prodDBlockToken list, updateRunCommandList() (RunJobUtilities)
- Updated call to updateRunCommandList() in RunJob modules (RunJob, RunJobEvent, RunJobAnselm, RunJobEdison, RunJobHopper, RunJobTitan)
- Changed jobDic argument to job object in storeWorkDirSize() and removed unnecessary loop over job objects which caused problems with big Yoda jobs (FileHandling)
- Sending self.__env['jobDic'][k][1] (job object) instead self.__env['jobDic'] (job dictionary) to storeWorkDirSize() from __checkWorkDir() (Monitor)
- Sending job instead jobDic to storeWorkDirSize() from createLogFile() (JobLog)
- Renamed config to configSiteMover to prevent problems seen on ND with late releases due to presence of a config module from TDAQ in the PYTHONPATH
- Updated config to configSiteMover import (S3ObjectstoreHttpSiteMover, S3ObjectstoreSiteMover, S3SiteMover, SiteMover, mvSiteMover, objectstoreSiteMover, xrootdObjectstoreSiteMover, xrootdSiteMover)
- Removed unused config import (pUtil)
- Now recognizing copytool=fax in mover_get_data() when counting FAX statistics (Mover)
Server update for ND true pilots (pre-placed job def but with server updates) after out of local disk space, requested by Andrej Filipcic
- (i.e. updateServerFlag == True and jobRequestFlag == False)
- Avoiding checking local disk space in runMain(), before the getJob() (pilot)
Alt stage-out
- Added altStageOut data member plus handling (Job)
- Setting altStageOut in mover_put_data() (Mover)
- Changed argument flag to **pdict in allowAlternativeStageOut(), forceAlternativeStageOut() (SiteInformation)
- Using altStageOut in allowAlternativeStageOut() and forceAlternativeStageOut() (ATLASSiteInformation)
- Sending job.altStageOut to forceAlternativeStageOut() and allowAlternativeStageOut() from mover_put_data() and mover_put_data_new() (Mover)
- Turning off any forced alt stage-out for objectstore transfers in forceAlternativeStageOut() (ATLASSiteInformation)
- Created new functions for stage-in/out optimization: getObjectstoreBucketID(), getObjectstorePath(), getObjectstoreDDMEndpoint(),
getAlternativeObjectstoreDDMEndpoint(), getObjectstoreEndpointID(), getObjectstoreKeyInfo() (SiteInformation)
- Updated getLogPath() to use getObjectstoreDDMEndpoint(), getObjectstorePath(), getObjectstoreBucketID() (JobLog)
- Updated getFileInfo() to use getObjectstoreDDMEndpoint(), getObjectstoreBucketID() and shortened logic (no special case for log files) (Mover)
- Updated getNewOSStoragePath() to use getObjectstoreDDMEndpoint(), getAlternativeObjectstoreDDMEndpoint(), getObjectstorePath()
and getObjectstoreBucketID() (Mover)
- Removed unused variables from getDDMStorage() (os_bucket_id, queuename, jobId, ub) (Mover)
- Updated getDDMStorage() to use getObjectstoreDDMEndpoint(), getObjectstorePath(), getObjectstoreBucketID() (Mover)
- Removed unnecessary OS functions in transferToObjectStore() (RunJobEvent)
- Created getObjectstoreDDMEndpointFromBucketID() (SiteInformation)
- Now using getObjectstoreDDMEndpointFromBucketID() in transferLogFile() (JobLog)
- Now using getObjectstoreDDMEndpointFromBucketID() in transferToObjectStore() (RunJobEvent)
- Now using os_ddmendpoint instead of os_bucket_endpoint in addToOSTransferDictionary() (FileHandling)
- Removed unwanted methods getObjectstoreBucketEndpoint(), getObjectstoreName() (SiteInformation)
- Changed os_name to os_ddmendpoint (S3ObjectstoreHttpSiteMover)
- Updated setup() to use getObjectstoreDDMEndpointFromBucketID(), getObjectstoreEndpointID(), getObjectstoreKeyInfo() (S3ObjectstoreSiteMover)
- Added argument label to setup() and its calls (S3ObjectstoreSiteMover)
- Removed deprecated methods getAlternativeOS(), convertBucketIDsToOSIDs(), getOSInfoFromBucketID(), getBucketID(),
findObjectStore(), getOSIDFromName(), getQueuenameFromOSID(), findAllObjectStores(),
findOSEnabledQueuesInFullQueuedata(), getObjectstoresList() (SiteInformation)
- Added getObjectstoreFilename() (ATLASSiteInformation)
Event service
- Setting PILOT_EVENTRANGECHANNEL env variable instead of manipulating jobPars, in __main__() (RunJobEvent)
- Wen Guan: Fix in init function which prevented the S3 site mover from being used (S3ObjectstoreSiteMover)
- Sending os_bucket_id to getQueuedataFileName() from getObjectstoresList(), getNewQueuedata(), getField() (SiteInformation)
- Added os_bucket_id argument to getQueuedataFileName() (SiteInformation)
- Using os_bucket_id in queuedata filename in getQueuedataFileName() (SiteInformation)
- Will abandon getNewQueuedata() if file already exists (filename should now be unique to the bucket id) (SiteInformation)
- Added os_bucket_id argument to getNewQueuedata() (SiteInformation)
- Sending os_bucket_id to getField() from getObjectstoresList() (SiteInformation)
- Added os_bucket_id argument to getField() (SiteInformation)
- Extracting os_bucket_id in get_data(), put_data() (S3ObjectstoreSiteMover, S3SiteMover)
- Sending os_bucket_id to stageOut() from put_data() (S3ObjectstoreSiteMover, S3SiteMover)
- Sending os_bucket_id to stageIn() from get_data() (S3ObjectstoreSiteMover, S3SiteMover)
- Added os_bucket_id argument to setup(), stageIn(), stageOut() (S3ObjectstoreSiteMover)
- Sending os_bucket_id to setup() from stageIn(), stageOut() (S3ObjectstoreSiteMover)
- Sending os_bucket_id to getObjectstoresField() from setup() (S3ObjectstoreSiteMover)
- Sending os_bucket_id to sitemover_get_data() from mover_get_data(), _mover_get_data_new() (Mover)
- Added os_bucket_id argument to sitemover_get_data() (Mover)
- Sending os_bucket_id to get_data() from sitemover_get_data() (Mover)
- Sending os_bucket_id to sitemover_put_data() from mover_put_data(), mover_put_data_new() (Mover)
- Added os_bucket_id argument to sitemover_put_data() (Mover)
- Sending os_bucket_id to put_data() from sitemover_put_data() (Mover)
-- Changes below affects the changes above --
- Created getObjectstoreInfoFile(), getObjectstorePath(), getObjectstoreFilename(), getObjectstoreBucketID() (SiteInformation)
- Renamed old getObjectstorePath() to getObjectstorePathOld() - to be removed shortly (SiteInformation)
- Created getObjectstoresInfo() (SiteInformation)
- Created getObjectstoresField() (SiteInformation)
- Created getObjectstorePath() (SiteInformation)
- Now using getObjectstoresField() instead of getObjectstoreName() and getObjectstoreBucketEndpoint() in transferLogFile() (JobLog)
- Now using getObjectstoresField() and new getObjectstorePath() instead of old getObjectstorePath() in transferLogFile() (JobLog)
- Updated call to getObjectstoresField() in getFileInfo() (Mover)
- Updated two calls to getObjectstorePath(), and using getObjectstoresField() in getFileInfo() (Mover)
- Using new getObjectstoresField() instead of getObjectstoreName() in getNewOSStoragePath() (Mover)
- Using new getObjectstoresField() and getObjectstorePath() instead of old getObjectstorePath() in getDDMStorage() (Mover)
- Updated calls to getObjectstoresField() (S3ObjectstoreHttpSiteMover, S3ObjectstoreSiteMover, SiteInformation)
- Avoiding None values in prodDBlockToken list, in hasOSBucketIDs() (SiteInformation)
- Created getObjectstoreDDMEndpoint() (SiteInformation)
- Created getCPUTimes(), used in RunJobEvent::__main__() (FileHandling, RunJobEvent)
- Updated interpretMessage() to handle multiple files (RunJobEvent)
- Updated listener() to handle list of files from interpretMessage() (RunJobEvent)
Logs to OS:
- Added putLogToOS to Job class (Job)
- Changed argument from eventService to **argdict in doSpecialLogtransfer() (Experiment, ATLASExperiment)
- Now sending log to OS if job.putLogToOS is set, in doSpecialLogFileTransfer() (JobLog)
- Added logToOS boolean argument to PFCxml(), used to add an endpoint tag to the xml (pUtil)
- Sending logToOS to PFCxml() from createFileMetadata() (RunJobEvent, RunJob)
Direct access updates (to enable this functionality for ES)
- Using transferType fax in shouldPFC4TURLsBeCreated()
- Sending job.transferType to getFileAccessInfo() from stageIn() (RunJob)
- Sending transferType to getFileAccessInfo() from shouldPFC4TURLsBeCreated() (Mover)
- Sending transferType to createPFC4TURLs() from PFC4TURLs() (Mover)
- Added transferType argument to createPFC4TURLs() (Mover)
- Sending transferType to getTURLs() from createPFC4TURLs() (Mover)
- Added transferType argument to getTURLs() (Mover)
- Sending transferType to getPrefices() from getTURLs() (Mover)
- Added transferType argument to getPrefices() (Mover)
- Sending transferType to getFileAccessInfo() from getPrefices() (Mover)
- Sending job.transferType to pUtil.getFileAccessInfo() (RunJobHpcEvent)
- Added transferType argument to getFileAccessInfo() (pUtil)
- Using transferType and direct_access_wan in getFileAccessInfo() (pUtil)
- Added transferType argument to getCopyFileAccessInfo() (SiteInformation)
- Sending transferType to getCopyFileAccessInfo() from getDirectInAccessMode() (SiteInformation)
- Removed useFileStager from getCopyFileAccessInfo() and getDirectInAccessMode() (SiteInformation)
- Using transferType and direct_access_wan in getCopyFileAccessInfo() (SiteInformation)
- Moved getDirectAccess(), useDirectAccessLAN(), useDirectAccessWAN() from Mover to FileHandling
- Now using getDirectAccess() in getDirectInAccessMode() (SiteInformation)
- Removed directIn from getCopyFileAccessInfo() (SiteInformation)
- Now using self.__siteInfo.getCopyFileAccessInfo() instead of pUtil.getFileAccessInfo() (RunJobHpcEvent)
- Now using si.getCopyFileAccessInfo() instead of pUtil.getFileAccessInfo() (RunJob)
- Added experiment argument to getTURLs(), createPFC4TURLs(), PFC4TURLs() (Mover)
- Sending experiment to getTURLs() from createPFC4TURLs() (Mover)
- Sending experiment to createPFC4TURLs() from PFC4TURLs() (Mover)
- Sending experiment to PFC4TURLs() from _mover_get_data_new() and mover_get_data() (Mover)
- Added experiment argument to getPrefices() (Mover)
- Sending experiment to getPrefices() from getTURLs() (Mover)
- Now using si.getCopyFileAccessInfo() instead of pUtil.getFileAccessInfo() in getPrefices(), shouldPFC4TURLsBeCreated() (Mover)
- Added experiment argument to shouldPFC4TURLsBeCreated() (Mover)
- Sending experiment to shouldPFC4TURLsBeCreated() from PFC4TURLs() (Mover)
- Removed getFileAccessInfo() which has now become a duplicate of SiteInformation.getCopyFileAccessInfo() (pUtil)
- Renamed getCopyFileAccessInfo() to getFileAccessInfo() (SiteInformation, Mover, RunJob, RunJobHpcEvent)
- Removed sending stageIn boolean to getFileAccessInfo() from getDirectInAccessMode() (SiteInformation)
- Added transferType argument to getDirectInAccessMode() (SiteInformation)
- Sending transferType to getDirectInAccessMode() from getStageInMode() (FAXSiteMover, GFAL2SiteMover, LocalSiteMover, xrdcpSiteMover, xrootdObjectstoreSiteMover)
- Sending transferType to getStageInMode() from get_data() (FAXSiteMover, GFAL2SiteMover, LocalSiteMover, xrdcpSiteMover, xrootdObjectstoreSiteMover)
- Added transferType argument to getStageInMode() (FAXSiteMover, GFAL2SiteMover, LocalSiteMover, xrdcpSiteMover, xrootdObjectstoreSiteMover)
- Extracting transferType in get_data() (FAXSiteMover, GFAL2SiteMover, LocalSiteMover, xrdcpSiteMover, xrootdObjectstoreSiteMover)
- Sending transferType to get_data() from sitemover_get_data() (Mover)
- Sending transferType to sitemover_get_data() from _mover_get_data_new(), mover_get_data() (Mover)
- Added transferType argument to sitemover_get_data() (Mover)
- For transferType direct and fax, now calling si.updateDirectAccess() from __main__() (RunJob, RunJobEvent)
- Created updateDirectAccess() (SiteInformation)
Updates from Wen Guan:
- Now using curl to get keypair (ATLASSiteInformation, pUtil)
- Add hostname check with http proxy (S3ObjectstoreSiteMover)
- Fixed hostname is None issue (S3ObjectstoreSiteMover)
- Solve the region problem with AWS OS (S3ObjectstoreSiteMover, objectstoreSiteMover)
CMS Experiment
- Removed allowAlternativeStageOut() (CMSExperiment)
AMS Experiment
- Changed argument flag to **pdict in allowAlternativeStageOut(), forceAlternativeStageOut() (AMSTaiwanSiteInformation)
////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
65.1:
Bug fixes:
- Correction for unexpected values (dbData/dbTime=null) in jobReport, in getDBInfo() (FileHandling)
- Corrected for missing error code in errorToReport() masking the actual error [problem seen with older pilot version 64.5] (LocalSiteMover)
General changes:
- Added try-statement around cpu time adding in getCPUTimes() to preempt any strange values in jobReport (FileHandling)
- Now removing objectstore json files prior to log file creation, in removeRedundantFiles() (ATLASExperiment)
////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
65.2:
General changes:
- Added debugging (sys.exc_info()) info when list_replicas() fails with 'empty dictionary' in getRucioReplicaDictionary() (Mover)
- Removed duplicate definition of objectstore variable in mover_put_data() (Mover)
- Added &schemes=https,http to all occurrences of ?select=geoip when the schedconfig.httpredirector is set, requested by R. Walker, V. Garonne (aria2cSiteMover)
Bug fixes:
- Removed call to getDDMStorage() which overwrote alternative bucket id (Mover)
- Adding --directIn for transferType fax and direct_access_wan = True, in getAnalysisRunCommand() (ATLASExperiment)
- Setting old/newPrefix="" in getFileAccessInfo() (SiteInformation)
OS log transfers:
- Ignoring failed OS log transfer and resetting os_bucket_id, in transferActualLogFile() (JobLog)
- Only allow alt OS stage-out if "allow_alt_os_stageout" is present in catchall field, in mover_put_data() (Mover)
Direct Access on ND:
- Relaxed the input file verification in get_data() to allow for direct access, requested by D. Cameron and A. Filipcic (mvSiteMover)
Direct access updates:
- Now accepting writeToFile directive for non-ES jobs in setJobDef() (Job)
- Sending eventservice variable to writeToInputFile() from setJobDef() (Job)
- Added eventservice argument to writeToInputFile() (pUtil)
- Selecting --inputHitsFile or --inputAODFile depending on value of eventservice variable in writeToInputFile() (pUtil)
- Updated updateDispatcherData4ES(), updateJobPars() to work not only with ES jobs (pUtil)
- Created getPooFilenameFromJobPars(), updateInputFileWithTURLs() (pUtil) [not merged with main-dev]
- Generating the LFN_to_TURL_dictionary in getTURLs() (Mover)
- Receiving LFN_to_TURL_dictionary from PFC4TURLs() from _mover_get_data_new(), mover_get_data() (Mover)
- Returning LFN_to_TURL_dictionary from getTURLs(), createPFC4TURLs(), PFC4TURLs() (Mover)
- Importing updateInputFileWithTURLs (Mover)
- Calling updateInputFileWithTURLs() from _mover_get_data_new() and mover_get_data() (Mover)
- Added dummy return value from PFC4TURLs() call in getPoolFileCatalog() (RunJobEvent)
Memory monitoring:
- Now enforcing maxPSS from memory monitor is less than 2*maxRSS from schedconfig, otherwise kill job for non-CGROUPS sites, in __check_memory_usage() (Monitor)
- Updated memory monitor from release area 20.1.5.2 to 20.7.5.8 (ATLASExperiment)
Updates from Wen Guan:
- Yoda accounting and monitoring: report cpu hour info and processed events to pilot. Pilot then reports this info to panda with jobMetrics
- ES zip: Yoda ES zip outputs; Grid job ES zip outputs(RunJobEvent), unzip ES when merging(RunJob)
- Report MPI job id to panda in jobMetrics
- Added bulk update event ranges to RunJobEvent
- Fixed os_bucket_id in Yoda
- Fixed RunJobEvent with race condition: When asynchronousOutputStager, updateEventRange and main thread getEventRange are running at the same time, the curl.config will be overwritten and asynchronousOutputStager may crash
////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
65.3:
Added missing factor of 2 in maxrss (Monitor)
////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
65.4:
Updates from Wen Guan:
- Created init_guid_list() (RunJobEvent)
- Using new function init_guid_list() to fix problem with having more than one input file in event service job (RunJobEvent)
////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
65.5:
General changes:
- Cleanup limit raised from 2 to 12 hours in purgeEmptyDirs(), purgeWorkDirs(), purgeMaxedoutDirs() (Cleaner)
- Avoiding sending coreCount = NULL/null in getJobMetrics(), getNodeStructure() (PandaServerClient)
Bug fixes:
- Corrected from variable name (errorOuput -> errorOutput) in put_data() (GFAL2SiteMover)
Spill-over guids:
- Now extracting guids for spill-over files in extractOutputFilesFromJSON() (FileHandling)
- Sending job.outFilesGuids to extractOutputFiles() (RunJob)
- Updated output file guids (RunJob)
- Sending outFilesGuids to removeNoOutputFiles() (FileHandling)
- Added outFilesGuids argument and handling to removeNoOutputFiles() (FileHandling)
- Added argument fromJSON to createFileMetadata() (RunJob)
- Sending fromJSON to createFileMetadata() (RunJob)
Fix for bad TURLs (aria2cSiteMover)
- Added "?select=geoip&schemes=https,http"
Improving the "Empty rucio dictionary"-error
- Now returning pilotErrorDiag from getRucioReplicaDictionary() (Mover)
- Receiving pilotErrorDiag in getReplicaDictionaryRucio() (Mover)
- Sending pilotErrorDiag to verifySURLGUIDDictionary() from getReplicaDictionaryRucio() (Mover)
- Added pilotErrorDiag argument to verifySURLGUIDDictionary() (Mover)
Mover updates from Danila Oleynik:
- Improved pilotErrorDiag for globus system errors in _mover_get_data_new() and mover_get_data(). Requested in GGUS ticket https://ggus.eu/index.php?mode=ticket_info&ticket_id=121542 (Mover)
AES@HPC updates from Wen Guan:
- add accounting for droid waiting transfer time
- enable retry in droid stager
- report batchid
- fix log collection in multiple job mode
- load balance dns name in s3 objectstore
- fix bug in get objectstore info file
- fix postJob in multiple job mode
- using different curl.config for different job
- fix to avoid for ipv6 in s3objectstore balance
- fix bug in monitor
- fix bug in Timercommand
- support bulk getJobs with json returns
- new jobmetrics from Yoda
- new yoda accounting info
- update job to support new jobmetrics and bulk getjobs
- disable jobmetrics copying in log collection
- more flexible to schedule jobs in yoda
- fix yoda arc plugin
- merge 20160624
Site mover update from Alexey Anisenkov:
- sitemovers: upgraded stagein workflow:
- Merge branch 'main-dev' of https://github.com/PanDAWMS/pilot into main-dev
- cosmetic fix
- movers: stage-out workflow updated
- movers: cosmetic fixes
- sitemovers: transferring logfiles to normal SE switched to new sitemover implementation (JobMover.stageout_logfiles via wrapper mover.put_data_new)
- sitemovers updates
- fix scopeLog initialization in Job
Documentation
Sitemovers workflow upgraded to properly consider DDMEndpoint protocols declaration for transfers.
1 . General concept and workflow description:
-- Pilot executes Panda's command in terms of DDMEndpoint (as a main identifier) where to store data/log files and from which DDMEndpoints (actually from all local ddmendpoints belong to same ATLAS site resolved by suggested ddms from Job.ddmendpointIn input)
-- PandaQueue specifies in the shedconfig fields which copytools (e.g. xrdcp, rucio, lsm) are supported/enabled by given PandaQueue. It's possible to specify separately prioritized list of copytools specific for stage-in (activity='pr'), stage-out (activity='pw') and stage-out-logs
(activity='pl') (via PQ.acopytools field). If no specific copytools assigned to given activity (e.g. 'pr') then all supported copytools defined in PQ.copytools field will be used by default.
-- Each sitemover implementation should provide the list of accepted protocol schemes that can be used by this sitemover instance (e.g. the xrdcp sitemover expects to operate with root:// protocol)
-- Top sitemover manager takes each Job file to be transferred, fetches all protocol settings from the DDM JSON export (specific for given panda activity - e.g. activity='pw' if not set then fallback to normal DDM activity, e.g activity='r'), then iterates over supported copytools and considers only accepted protocols (matched to copytool.schemes) for real transfers.
-- If one copytool is not able to transfer file (for whatever reasons: code exception, site mover logic broken or resource/file is not accessible through given copytool/protocol) then next available copytool will be used for transfer the file
-- PandaQueue Specific protocols/copytools can be declared at the level of PandaQueue - PQ.aprotocols filed - and will be used first (overwrites default settings from the DDM protocols)
-- for stageg-in sitemover considers input replicas resolved by Rucio as is. It's still possible to overwrite base protocol path using PQ.aprotocols settings. Real filename (lfn value) is resolved from one known replica by matching SURL replica (srm, or whatever default protocol specified under specific activity 'SE' -> then fallback to 'a' then to 'r' of DDMEndpoint,aprotocols) against all replicas returned by Rucio.
Stage-in updates:
-- logic changed to following:
Fist process/resolve copytool values from PQ specific aprotocols settings (used for corner-cases, test settings) (the protocol of input replica is modified with appropriate PQ.protocol.SE settings attached to PQ)
then check allowed schemes from PQ.acopytools or PQ.copytools settings (consider input replica provided by Rucio as is)
Stage-out updates:
-- SURL protocol look up fixed to following protocol activity order: 'SE' -> 'a' -> 'r'
-- changed logic of protocol settings look up:
-- first resolve protocol settings from PQ specific aprotocols settings (use them as a master data)
-- then for unknown ddmendpoints resolve settings from default ddm.protocols and consider only protocols supported by specified copytools
Stage-out of log files updates:
-- transferring logfiles to normal SE switched to new sitemover implementation (JobMover.stageout_logfiles via wrapper mover.put_data_new)
other updates
-- Various sitemovers fixes and updates
-- fix scopeLog initialization in Job
-- new log_transfer integration input field introduced to mover.mover_put_data wrapper to specify if it's normal stageout of logfile
-- new schedconfig settings structures:
PQ.acopytools should represent accepted/support copytool names for given PQ.
for example:
PQ.acopytools = {'pr':['xrdcp', 'rucio'], 'pw':['lsm', 'rucio']}
PQ.copytools exposes settings for copytools enabled for given PQ
PQ.copytools = {'xrdcp': {'setup':''}, 'lsm': {'setup':''} }
////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
65.6:
User job setup:
- Updated getCacheInfo() to handle AtlasDerivation nightlies for user jobs (ATLASExperiment)
- Removed outdated code for release 13.0.25 from getCacheInfo() (ATLASExperiment)
- Updated getJobExecutionCommand() to setup user jobs with atlasLocalSetup() (ATLASExperiment)
////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
TODO:
OS log transfers:
[TODO: set bucket to -1 for failed log transfers in jobMetrics]
Direct access updates:
- [TODO: switch to fax endpoints if direct i/o is requested but not enough info available in copyprefix.]
[TODO: See HC test "732: RootAnalysis 2.3.14 QuickAna-00-00-60 Analy - rc test", ANALY_IFAE]
Remove singletons:
- Change __experiment to experiment in Experiment classes
- Use super() in __init__() (ATLASExperiment)
- Same changes to SiteInformation classes
REMOVE OTHERSITEMOVER FROM DIST
todo: remove the explicit usages of schedconfig.lfchost and replace with an experiment specific method (getFileCatalog())
todo: rename pUtil.getExperiment to pUtil.getExperimentObject, correct import in SiteInformation
#### add new error codes 1217-1219 to proddb
Update prodDB for ERR_RUNJOBEXC : "Exception caught by runJob" -> "Exception caught by RunJob*" ? not necessary??
Added new error codes; 1224 (ERR_ESRECOVERABLE), 1225 (ERR_ESMERGERECOVERABLE) (PilotErrors)