By Ofer Rind

Useful Links for Monitoring US ATLAS Grid Sites

  • Site-oriented dashboards:
  • Panda Monitoring pages for the US cloud:
    • Look here for failures at active sites: All online US Panda Queues.
      • Sort by failure rate or number of failed jobs to identify bad sites.
      • Remember that analysis jobs typically fail 3 to 5 times more frequently than production jobs. The higher failure rate for analysis jobs is usually (but not always) due to user errors.
      • Click on the failures at a site to investigate why they are happening.
        • Look at the list of most frequent failures and try to determine whether the issue is a site problem, a problem with the task, or some other external cause.
        • If you see a task or group of tasks failing due to a software issue, email DPA and the production manager (production) or the user (analysis) so that the bad tasks can be aborted.
    • To find sites that have been placed in test mode and need to be put back online, look here: Offline US Panda Queues in test mode
    • Links for looking at failures from individual server issues at each site:

Debugging Issues Found

  • First determine if the problem is affecting multiple sites by looking at the monitoring links above.
    • If several sites are showing high failure rates, are drained, or have been put offline:
      • Drill down in the Panda Monitor by clicking the total number of failed jobs in the links for the affected sites. The linked page summarizes how many jobs failed for each type of error in the last 12 hours. If a single user or production task is causing a large number of errors, you can safely assume that it is a problem with the job and not the site. (A scripted version of this drill-down is sketched after this list.)
      • Transform errors/Athena failures are generally (but not always) caused by coding or configuration problems with the jobs rather than a site problem.
      • Similarly, jobs that the pilot aborts because of excessive memory usage or an infinite loop are usually problems with the job.
      • The page has links to the 100 most recent failed jobs. Following those links, you can view the log files and look for error messages. If it looks like a job issue, send email to DPA and the production manager (production jobs) or the user (analysis jobs).
      • If a site is failing jobs for other errors (particularly if only a single site has the problem), then it is likely a site problem.
      • Not complete - More to come later.
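
The drill-down above can also be done from the command line. The sketch below is a minimal example, assuming the BigPanDA monitor at bigpanda.cern.ch returns JSON when the Accept header requests it, and that the response carries a jobs list with pandaid, produsername, piloterrorcode, and exeerrorcode fields. Those field names and the availability of jq are assumptions to verify against the real output, not a documented interface.

    #!/bin/bash
    # Sketch: summarize recent failed jobs at one Panda queue via the BigPanDA monitor.
    # The JSON field names below are assumptions; inspect the actual response and adjust.
    SITE=${1:?usage: $0 <panda-queue-name>}
    HOURS=12

    curl -s -H 'Accept: application/json' -H 'Content-Type: application/json' \
         "https://bigpanda.cern.ch/jobs/?computingsite=${SITE}&jobstatus=failed&hours=${HOURS}&limit=1000" |
      jq -r '.jobs[] | [.pandaid, .produsername, .piloterrorcode, .exeerrorcode] | @tsv' |
      awk -F'\t' '{ users[$2]++; errs[$3 "/" $4]++ }
          END { print "failures per user:";       for (u in users) print "  " u, users[u];
                print "failures per error code:"; for (e in errs)  print "  " e, errs[e] }'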

Solutions to Common Problems

Problem: Servers Are Running Out of Memory

  1. Identify which jobs are causing the problem.
    • Start by running "ps auxf" on sluggish or overloaded compute servers; the process tree that the f option provides lets you trace a high-memory process back to its Panda job ID.
    • To scan many servers quickly, execute one of the attached scripts remotely on each of them (a rough sketch of this kind of scan follows this list):
      • find-fat-jobs.sh will print the Panda job numbers for all jobs on the server with RSS > 6 GB. You can then use the Panda Monitor to look at the jobs individually.
      • find-fat-procs.sh will print the PID for all processes on the server with RSS > 6 GB. Use this script to find high-memory processes not associated with Panda payloads (e.g. BOINC, OSG, glide-ins, the pilot itself, etc.). You can then use ps to track down which users are responsible, since in this case the Panda Monitor can't help you. NB: Jobs owned by usatlas1 are production jobs, while jobs owned by usatlas3 are user analysis jobs. Memory problems mainly happen for the user jobs; if you do find a production task with memory issues, report it immediately, as some productions launch enormous numbers of jobs. Most of the rest of the instructions require a Panda job ID, so if you can't figure out who is responsible, mail the US ATLAS WBS 2.3 list for help.
      • The SIZE parameter in the scripts can be changed to look for jobs/processes with RSS larger or smaller than the default value of 6 GB.
  2. Next, use the Panda Monitor to look up some of the Panda job IDs you have found and identify the corresponding request number / task number.
  3. To be completed.
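
For reference, here is a rough sketch of the kind of per-host scan find-fat-jobs.sh / find-fat-procs.sh perform. It is not the attached scripts themselves: the threshold handling mirrors their SIZE parameter, but the idea that the Panda job ID can be read off the payload's working-directory path is an assumption about the local pilot setup.

    #!/bin/bash
    # Sketch of a find-fat-jobs.sh/find-fat-procs.sh-style scan (not the attached scripts).
    # Prints user, PID, RSS and working directory for every process whose RSS exceeds SIZE;
    # the Panda job ID typically appears somewhere in the payload's working-directory path,
    # but that layout depends on the local pilot configuration.
    SIZE=${SIZE:-6}                      # threshold in GB, analogous to the scripts' SIZE parameter
    THRESH_KB=$(( SIZE * 1024 * 1024 ))  # ps reports RSS in kB

    ps -eo user:12,pid,ppid,rss,comm --no-headers |
    while read -r user pid ppid rss comm; do
        if [ "$rss" -gt "$THRESH_KB" ]; then
            cwd=$(readlink "/proc/$pid/cwd" 2>/dev/null)
            printf '%s pid=%s ppid=%s rss=%dGB cmd=%s cwd=%s\n' \
                   "$user" "$pid" "$ppid" $(( rss / 1024 / 1024 )) "$comm" "$cwd"
        fi
    done

To cover many servers at once, the same sketch can be pushed out over ssh, e.g. for h in $(cat worker-nodes.txt); do ssh "$h" bash -s < fat-scan.sh; done, where worker-nodes.txt and the script name are placeholders for whatever host list and distribution mechanism your site uses.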

Problem: Blocking an Unresponsive User With Many Failing Jobs

  1. Send email to the DPA mailing list. Only the central team can block individual users because all jobs run under one of two IDs: usatlas1 for production jobs and usatlas3 for user analysis jobs.
  2. Once you get in contact with the user and the issue is resolved, don't forget to email DPA to unblock the user.

Problem: Site Draining / Cores Going Unused

  1. Determine whether the problem is a lack of jobs from Harvester / the Production System or something local to the site. Use the local monitoring to check whether jobs are queued in the batch system; how to do this depends on how your site is configured and which batch system it runs (HTCondor, Slurm, SGE, etc.). A rough sketch of such a check is shown after this list. If things look OK, go to step 2.
  2. Look at the number of activated jobs for the queue in the Panda Monitor. It should be in the hundreds and a significant fraction of the number of jobs running. NB: Looking at the analysis and production jobs separately (check the box labeled job type under "Split by:") can be helpful when looking at a unified queue. If there are activated jobs and the site is not full, go to step 3.
  3. Look at the Elasticsearch/Kibana page for the site. For example, for AGLT2 it's this page; delete the computingsite filter specifying AGLT2 (upper left) and add a new filter specifying the site that is draining.
  4. The plots show a 24-hour history of Harvester submission to the site broken down by several parameters: submit host, job state (jobs not completed), and job state (jobs completed).
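
As a starting point for step 1, the snippet below is a rough sketch of checking whether pilots are queued but not starting in the local batch system. The account name usatlas1 and the choice of commands are placeholders; use whatever account your pilots map to and the block that matches your batch system.

    #!/bin/bash
    # Rough sketch: are pilots queued but idle, or is the site simply not receiving any?
    # The usatlas1 account is a placeholder for whatever user your pilots run under.

    # HTCondor: count jobs per JobStatus for the pilot account (1 = idle, 2 = running).
    condor_q -allusers -constraint 'Owner == "usatlas1"' -format '%d\n' JobStatus | sort | uniq -c

    # Slurm: count jobs per state for the pilot account.
    squeue -u usatlas1 -h -o '%T' | sort | uniq -c

If nothing is queued at all, the shortage is upstream (Harvester / the Production System); if many pilots sit idle while cores are free, the problem is local scheduling.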