FBSNG v1.0 Maintenance Guide

 

Revision 3

 

Jim Fromm, Krzysztof Genser, Tanya Levshina, Igor Mandrichenko

 


 

Maintenance Tasks

Starting FBSNG

FBSNG has the following permanently running processes (daemons): BMGR, Logd, and a Launcher on each farm node.

 

 

 

 

Although not required, it is recommended to set up an unprivileged account for bmgr and logd and to create the <FBSNG_ROOT> directory in the NFS-exported home area of that account.

 

To start each daemon, the administrator should use the start_bmgr.sh, start_logd.sh and start_lch.sh scripts found in the <FBSNG_ROOT>/bin directory. Before they can be used, the scripts must be adjusted to the particular installation following the instructions found in the script files. The administrator must 'su' to the proper account before using the scripts. The scripts are designed to be called from system boot-time start-up scripts.

 

The FBSNG distribution includes the “sendcmd” script. This script, located in the $FBSNG_ROOT/bin directory, can be used to perform certain operations on a number of nodes using the “rsh” command. For example, “sendcmd” can be used to start launchers on farm nodes in the following way:

 

cd $FBSNG_ROOT/bin

./sendcmd fnpc 101 120 /home/farms/fbsng_root/bin/start_lch.sh

 

This command will run the start_lch.sh script on nodes fnpc101 through fnpc120.
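The node-range expansion described above can be illustrated with a short Python sketch. This is not the implementation of sendcmd; the function names node_names and rsh_commands are hypothetical, and the sketch only builds the equivalent "rsh <host> <command>" argument lists without executing anything.

```python
def node_names(prefix, first, last):
    """Return host names prefix<first> .. prefix<last>, inclusive."""
    return ["%s%d" % (prefix, number) for number in range(first, last + 1)]


def rsh_commands(prefix, first, last, command):
    """Build one 'rsh <host> <command>' argument list per node."""
    return [["rsh", host, command] for host in node_names(prefix, first, last)]


# Same arguments as the sendcmd example above.
commands = rsh_commands("fnpc", 101, 120,
                        "/home/farms/fbsng_root/bin/start_lch.sh")
for argv in commands[:2]:
    print(" ".join(argv))
```

Each argument list could be handed to a process-spawning call such as subprocess.call, but whether that matches what sendcmd does internally is an assumption.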

 

When using the start_xxx.sh scripts is not acceptable due to site-specific policies or procedures, the FBSNG administrator can use lower-level commands to start FBSNG components. The following commands run the corresponding daemons:

 

fbs bmgr          # run BMGR

fbs launcher      # run Launcher

fbs logd          # run Logd

How to Tell Whether FBSNG is Running

A quick first check is to use the UNIX “ps” command. Here is what the ps output should look like for BMGR:

 

fnpcb> ps axw | grep bmgr

17177 pts/12   S      0:01 sh -f /home/farms/fbsng_v1_1/bin/fbs_run.sh bmgr

20163 pts/12   S      0:00 sh /fnal/ups/prd/fbsng/v1_1/Linux/bin/fbsng bmgr

20165 pts/12   S      0:09 python /fnal/ups/prd/fbsng/v1_1/Linux/bin/bmgr.py

 

for launcher:

 

fnpcb> ps axw | grep launcher

13660 ?        S      0:00 sh -f /home/farms/fbsng_v1_1/bin/fbs_run.sh launcher

20488 ?        S      0:00 sh /fnal/ups/prd/fbsng/v1_1/Linux/bin/fbsng launcher

20509 ?        S     28:50 python /fnal/ups/prd/fbsng/v1_1/Linux/bin/launcher.py

 

and for logd:

 

fnpcb> ps axw | grep logd

18921 pts/12   S      0:00 sh -f /home/farms/fbsng_v1_1/bin/fbs_run.sh logd

19007 pts/12   S      0:00 sh /fnal/ups/prd/fbsng/v1_1/Linux/bin/fbsng logd

19008 pts/12   S      0:21 python /fnal/ups/prd/fbsng/v1_1/Linux/bin/logd.py
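The check above can also be scripted. The following Python sketch is only an illustration: it assumes the "ps axw" column layout shown in the samples (PID, TTY, STAT, TIME, COMMAND) and that each daemon appears as a python process running <daemon>.py. The function name daemon_pids is hypothetical.

```python
import os


def daemon_pids(ps_output, daemon):
    """Return PIDs of python processes running <daemon>.py.

    ps_output is the text printed by 'ps axw', without the shell prompt.
    """
    script = daemon + ".py"
    pids = []
    for line in ps_output.splitlines():
        fields = line.split()
        # Expect at least: PID TTY STAT TIME interpreter script
        if len(fields) < 6 or not fields[0].isdigit():
            continue
        interpreter, args = fields[4], fields[5:]
        if not os.path.basename(interpreter).startswith("python"):
            continue  # skip the 'sh' wrapper processes
        if any(os.path.basename(arg) == script for arg in args):
            pids.append(int(fields[0]))
    return pids


SAMPLE = """\
17177 pts/12   S      0:01 sh -f /home/farms/fbsng_v1_1/bin/fbs_run.sh bmgr
20163 pts/12   S      0:00 sh /fnal/ups/prd/fbsng/v1_1/Linux/bin/fbsng bmgr
20165 pts/12   S      0:09 python /fnal/ups/prd/fbsng/v1_1/Linux/bin/bmgr.py
"""
print(daemon_pids(SAMPLE, "bmgr"))
```

An empty result for a daemon that should be running suggests that it has exited or was never started on that node.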

 

Also, to see whether FBSNG is working properly, the administrator can use the “fbs nodes” command. If the command prints the list of farm nodes, the important FBSNG components are configured properly and the BMGR process is running. Here is what normal “fbs nodes” output might look like:

 

fnpcb> fbs nodes

 HOST               STATUS   CLASS       PROCESSES

 ----               ------   -----       --------

 fnpcb              up       IO          []

 fnpc104            down     Worker      []

 fnpc105            down     Worker      []

 fnpc106            down     Worker      []

 fnpc107            down     Worker      []

 fnpc101            down     Worker      []

 fnpc102            down     Worker      []

 fnpc103            down     Worker      []

 fnpc108            down     Worker      []

 

Note that after starting, the BMGR process ignores all requests while it performs its initial recovery functions. The recovery time interval is defined in the fbs.cfg file.

 

The “fbs nodes” command prints the status of all farm nodes. A “down” node status may indicate that the node itself is down, that the launcher process on the node cannot connect to the BMGR process, or that the launcher is not running there.
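To pick out the nodes that need attention, the "fbs nodes" listing can be parsed by a script. The sketch below is a minimal illustration that assumes the four-column layout shown above and that the input contains only the command output, without the shell prompt line. The function names parse_fbs_nodes and down_nodes are hypothetical.

```python
def parse_fbs_nodes(text):
    """Parse 'fbs nodes' output into a list of row dictionaries."""
    rows = []
    for line in text.splitlines():
        fields = line.split(None, 3)
        if len(fields) < 3:
            continue
        # Skip the column header and the dashed separator line.
        if fields[0] == "HOST" or set(fields[0]) == {"-"}:
            continue
        rows.append({
            "host": fields[0],
            "status": fields[1],
            "class": fields[2],
            # The PROCESSES column is kept as raw text.
            "processes": fields[3].strip() if len(fields) > 3 else "",
        })
    return rows


def down_nodes(rows):
    """Return the host names whose status is 'down'."""
    return [row["host"] for row in rows if row["status"] == "down"]


SAMPLE = """\
 HOST               STATUS   CLASS       PROCESSES
 ----               ------   -----       --------
 fnpcb              up       IO          []
 fnpc104            down     Worker      []
 fnpc105            down     Worker      []
"""
rows = parse_fbs_nodes(SAMPLE)
print(down_nodes(rows))
```

For each host reported as down, the three possible causes listed above still have to be checked by hand, for example by looking for the launcher process on that node.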

Re-starting FBSNG Components

If started using the corresponding start-up scripts, FBSNG components are re-started automatically about a minute after they exit. Therefore, to re-start an FBSNG component, it is sufficient to kill the corresponding process. The <FBSNG_ROOT>/bin directory contains the kill_lch.sh, kill_bmgr.sh and kill_logd.sh scripts, which kill the launcher, BMGR and logd processes respectively. They must be run on the node where the process is running and require the privileges necessary to kill the process.

Shutting Down FBSNG Components

To shut down an FBSNG component so that it does not re-start afterwards, use the shutdown_lch.sh, shutdown_bmgr.sh and shutdown_logd.sh scripts found in the <FBSNG_ROOT>/bin directory.

FBSNG Re-configuration Procedure

After most FBSNG configuration parameters are changed, the corresponding FBSNG daemon must be re-started for the changes to take effect:

 

If you modified set…      Daemons to be restarted
--------------------      -----------------------
bmgr                      BMGR
logd                      Logd
launcher                  Launcher(s)
global                    BMGR, Launchers
queue                     BMGR
host_class                BMGR
host_list                 BMGR
proc_type                 BMGR
history                   BMGR
jobdb                     BMGR
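The restart rules above lend themselves to a simple lookup. The following Python sketch encodes the table as a dictionary and computes which daemons need a restart after a batch of configuration changes. The names RESTART_AFTER_CHANGE and daemons_to_restart are hypothetical and not part of FBSNG.

```python
# Configuration set type -> daemons that must be re-started after a change.
RESTART_AFTER_CHANGE = {
    "bmgr": {"BMGR"},
    "logd": {"Logd"},
    "launcher": {"Launcher"},
    "global": {"BMGR", "Launcher"},
    "queue": {"BMGR"},
    "host_class": {"BMGR"},
    "host_list": {"BMGR"},
    "proc_type": {"BMGR"},
    "history": {"BMGR"},
    "jobdb": {"BMGR"},
}


def daemons_to_restart(modified_set_types):
    """Return the sorted list of daemons affected by the modified sets."""
    needed = set()
    for set_type in modified_set_types:
        if set_type not in RESTART_AFTER_CHANGE:
            raise ValueError("unknown configuration set type: %s" % set_type)
        needed |= RESTART_AFTER_CHANGE[set_type]
    return sorted(needed)


print(daemons_to_restart(["queue", "global"]))
```

Here "Launcher" stands for the launchers on all affected farm nodes.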

 

 

Certain rules must be followed when modifying the FBSNG configuration:

 

If a non-empty queue is removed from the configuration, the BMGR process will not re-start until the queue is restored to the configuration.

 

 

If a process type is removed from the FBSNG configuration, the BMGR process will not re-start until the process type is restored to the configuration.

 

 

If a node with batch processes running is removed, the processes will continue to run, but FBSNG will consider them terminated as if the node was shut down.

 

 

 

If a resource or pool is removed, all pending sections that require the resource or pool will remain pending indefinitely.

FBSNG Recovery Limitations

FBSNG is designed so that it can recover from most planned or unexpected component failures. However, when planning a re-start of FBSNG components, for example after modifying the FBSNG configuration, the administrator should follow certain rules:

 

A Launcher on a farm node should never be re-started while any batch processes are running on that node. Launcher termination is interpreted as a shutdown of the node it runs on, and any processes running there are considered to be terminated. If it is necessary to shut down or re-start a launcher, the administrator should hold the node using the “fbs hold node” command, then either kill all jobs or job sections running on the node or wait until all processes running on the node exit, and only then re-start or shut down the launcher process.

 

To prevent FBSNG database corruption, the BMGR process must not be killed with SIGKILL (usually signal 9). It should be killed only with the SIGINT signal. It may take several seconds for BMGR to exit after receiving the signal, depending on what state it was in when the signal was received. It is recommended to always use kill_bmgr.sh to re-start the BMGR process or shutdown_bmgr.sh to shut it down permanently. As long as this is done properly, it is safe to shut down or re-start the BMGR process at any time. When the BMGR process re-starts, it will recover run-time information about pending and running batch jobs.
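The SIGINT-only rule can be illustrated with a short Python sketch. It is not a replacement for kill_bmgr.sh or shutdown_bmgr.sh, and it makes no claim about how those scripts work. The function name stop_with_sigint and its parameters are hypothetical; the kill and sleep arguments exist only so that the logic can be exercised without a real process.

```python
import os
import signal
import time


def stop_with_sigint(pid, timeout=30.0, poll=0.5, kill=os.kill, sleep=time.sleep):
    """Send SIGINT to pid and wait for it to exit. Never sends SIGKILL.

    Returns True if the process disappeared within 'timeout' seconds and
    False otherwise. On False the caller should investigate rather than
    escalate to SIGKILL.
    """
    kill(pid, signal.SIGINT)
    waited = 0.0
    while waited < timeout:
        try:
            kill(pid, 0)  # signal 0 only probes whether the process exists
        except ProcessLookupError:
            return True
        sleep(poll)
        waited += poll
    return False


# Example (the PID is a placeholder):
# stop_with_sigint(20165, timeout=60.0)
```

Because BMGR may take several seconds to exit, the timeout should be generous. The probe with signal 0 assumes that the caller is not the parent of the target process; a parent would have to wait for the child instead.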

 

The Logd process can be killed at virtually any time. If logd re-starts shortly afterwards, no log information is lost. However, if it does not re-start soon enough, some old log information may be discarded.

 

FBSNG Log Files

Unless configured otherwise, FBSNG keeps its own log files in the <FBSNG_ROOT> directory tree. They are organized as follows:

 

<FBSNG_ROOT>/logs/bmgr       -       BMGR log files

<FBSNG_ROOT>/logs/lch       -       Launcher log files

<FBSNG_ROOT>/slog           -       Section log files

 

BMGR and Launcher log file names include the date when the file was opened and the node where the process is running. For Launcher, the log file name also includes the UNIX process ID of the Launcher process. The log files are closed around midnight every day, and a new log file with the new date encoded in its name is created. Examples of BMGR and Launcher log files are:

 

lch.fnpc105.941.20000628.log – this log file contains messages received from the Launcher running on the fnpc105 computer with process ID 941 during the day of 6/28/2000.

 

bmgr.fnpcb.20000705.log – contains messages received from BMGR running on fnpcb during the day of 7/5/2000.
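The naming scheme of the two examples above can be decoded mechanically. The Python sketch below assumes that the patterns are lch.<node>.<pid>.<YYYYMMDD>.log and bmgr.<node>.<YYYYMMDD>.log, as inferred from the examples, and that node names contain no dots. The function name parse_log_name is hypothetical.

```python
import re
from datetime import datetime

# Patterns inferred from the two example file names (an assumption).
LCH_NAME = re.compile(r"^lch\.(?P<node>[^.]+)\.(?P<pid>\d+)\.(?P<date>\d{8})\.log$")
BMGR_NAME = re.compile(r"^bmgr\.(?P<node>[^.]+)\.(?P<date>\d{8})\.log$")


def parse_log_name(name):
    """Return daemon, node, pid (Launcher only) and date for a log file name."""
    match = LCH_NAME.match(name)
    daemon = "launcher"
    if match is None:
        match = BMGR_NAME.match(name)
        daemon = "bmgr"
    if match is None:
        raise ValueError("not a BMGR or Launcher log file name: %s" % name)
    info = {
        "daemon": daemon,
        "node": match.group("node"),
        "date": datetime.strptime(match.group("date"), "%Y%m%d").date(),
    }
    if daemon == "launcher":
        info["pid"] = int(match.group("pid"))
    return info


print(parse_log_name("lch.fnpc105.941.20000628.log"))
print(parse_log_name("bmgr.fnpcb.20000705.log"))
```

A parser like this can help when selecting the log files of one node or one day from a large logs directory.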

 

These log files are created by logd process, and therefore would not exist if the logd process was not running.

 

Section log files are created by the BMGR process itself and can be viewed using the “fbs slog <section id>” command.

 

Along with the log files maintained by logd, BMGR and Launchers create stderr and stdout files in the /tmp directories on the nodes where they are running. These files usually contain only Python stack traces from crashes and should be consulted if the corresponding log file is unavailable or does not contain the necessary information.

Log Clean-up Procedures

The FBSNG distribution package includes a set of scripts used to maintain FBSNG logs and purge old log files. The scripts are designed to run periodically, e.g. daily, as cron tasks. The scripts are located in <FBSNG_DIR>/templates/bin and should be copied to the <FBSNG_ROOT>/bin directory during installation. The scripts are:

 

logd.run.cron – this script performs log file maintenance functions. Its parameters can be customized in the “logd” set of the fbs.cfg file.

 

history.run.cron – this script performs history database maintenance functions. It is configured in “history” set of fbs.cfg file.

Known Bugs

There are 2 known bugs in FBSNG v1.0.

Suspended Sections

Symptoms: “fbs nodes” does not show any processes running on farm nodes, while “fbs ls” shows that there are some sections running. For example:

 

$ fbs nodes

 HOST               STATUS   CLASS       PROCESSES

 ----               ------   -----       --------

 fncdf1             up       Worker      []

 fncdf10            up       Worker      []

 fncdf11            up       Worker      []

 fncdf12            up       Worker      []

 fncdf13            up       Worker      []

 fncdf14            up       Worker      []

 

$ fbs ls

SectID          User     St Queue      Pr  Proc Type  N   Time               

-----           ----     -- -----      --  ---------  -   ----               

1617.exe        cdfprod0 *  D          0   MC_MDC2    4   09/27 03:10:09-Str 

1616.exe        cdfprod0 *  D          0   MC_MDC2    4   09/27 02:03:01-Str 

1613.exe        cdfprod0 *  D          0   MC_MDC2    4   09/26 23:09:42-Str 

1612.exe        cdfprod0 *  D          0   MC_MDC2    4   09/26 22:56:20-Str 

1615.exe        cdfprod0 *  D          0   MC_MDC2    4   09/27 00:39:32-Str 

1618.exe        cdfprod0 *  D          0   MC_MDC2    4   09/27 03:53:07-Str 

 

Solution: find out which nodes the suspended sections “think” they are running on, using “fbs status <section-id>” and “fbs nodes -l <node name>”, and re-start the Launchers there:

 

$ fbs status 1617.exe

Job_ID: 1617

  Section Name: exe

 

  Per Process Rsrc: {'cpu': 100}          Per Sect Rsrc: {}

  Start Time: Wed Sep 27 03:10:09 2000    End Time: Not Finished            

  Hold Time:                           

 

  State:      running                     Depend:

  ProcType:   MC_MDC2                     Queue: D

  NumProc:    4                           Nice: 0

-------------------------------------------------------------------------

 

  PROCESS INFO FOR ID = 1617.exe.1

  Host: fncdf47

 

 

  PROCESS INFO FOR ID = 1617.exe.2

  Host: fncdf22

 

 

$ fbs nodes -l fncdf48 fncdf22 …

---------------------------------------------------------------------

HOST: fncdf22            CLASS: Worker           STATUS: up       

RESOURCES:  {'disk': (0, 18), 'cpu': (0, 200)}

PROCESSES:  []

 

---------------------------------------------------------------------

HOST: fncdf48            CLASS: Worker           STATUS: up       

RESOURCES:  {'disk': (0, 18), 'cpu': (100, 200)}

PROCESSES:  []

 

Caution: make sure no “normal” (not “suspended”) sections are running on the node before re-starting the launcher. If necessary, use “fbs hold node <node name>” to hold all the nodes to be re-started, and then wait for processes of all “normal” sections to exit there.
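The cross-check in the solution above can be sketched in Python. The sketch assumes the "fbs status" layout shown in the example, with a "PROCESS INFO FOR ID = ..." line followed by a "Host: ..." line for each process. The function names process_hosts and suspect_hosts are hypothetical, and the node process lists are passed in as raw text from the PROCESSES field.

```python
import re

PROCESS_HEADER = re.compile(r"PROCESS INFO FOR ID = (\S+)")
HOST_LINE = re.compile(r"^\s*Host:\s*(\S+)")


def process_hosts(status_output):
    """Map each process ID in 'fbs status' output to its host name."""
    hosts = {}
    current = None
    for line in status_output.splitlines():
        header = PROCESS_HEADER.search(line)
        if header:
            current = header.group(1)
            continue
        host = HOST_LINE.match(line)
        if host and current is not None:
            hosts[current] = host.group(1)
            current = None
    return hosts


def suspect_hosts(hosts, node_processes):
    """Return hosts a section claims to run on but that list no processes.

    node_processes maps a host name to the raw PROCESSES text printed
    by 'fbs nodes -l' for that host.
    """
    return sorted(set(h for h in hosts.values()
                      if node_processes.get(h, "[]").strip() == "[]"))


STATUS = """\
  PROCESS INFO FOR ID = 1617.exe.1
  Host: fncdf47

  PROCESS INFO FOR ID = 1617.exe.2
  Host: fncdf22
"""
hosts = process_hosts(STATUS)
print(suspect_hosts(hosts, {"fncdf47": "[]", "fncdf22": "[]"}))
```

The result is only a list of candidates. The caution above still applies before any launcher on those nodes is re-started.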

Resources Remain Allocated on Empty Held Nodes

Symptoms: “fbs nodes” shows certain node or nodes as “held” without any processes running there, but “fbs nodes -l <node name>” and/or “fbs resources -n <node name>” show non-zero resource utilization on the node.

 

More generally, resource utilization shown by “fbs resources -n <node name>” is inconsistent with what “fbs ls” and “fbs status” show.

 

For example, “fbs nodes -l <node name>” shows that no processes are running on a certain node, but 100 units of resource “cpu” are allocated:

 

> fbs nodes -l fnpc238

---------------------------------------------------------------------

HOST: fnpc238            CLASS: Worker           STATUS: up       

RESOURCES:  {'disk': (0, 8), 'cpu': (100, 100), 'Worker': (0, 0)}

PROCESSES:  []

 

The PROCESSES list is empty ([]), yet cpu utilization (the first number in the parentheses in 'cpu': (100, 100)) is 100.

 

Solution: Re-start BMGR.
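Before re-starting BMGR, the symptom can be confirmed automatically. The Python sketch below parses "fbs nodes -l" output in the layout shown above and reports nodes with an empty process list but non-zero utilization. It assumes that the RESOURCES field is printed as a Python dictionary literal, as in the example, and that the first number of each pair is the utilization. The function names parse_nodes_long and leaked_resources are hypothetical.

```python
import ast


def parse_nodes_long(text):
    """Parse 'fbs nodes -l' output into {host: {resources, processes}}."""
    nodes = {}
    host = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("HOST:"):
            host = line.split()[1]
            nodes[host] = {"resources": {}, "processes": ""}
        elif host and line.startswith("RESOURCES:"):
            literal = line.split(":", 1)[1].strip()
            nodes[host]["resources"] = ast.literal_eval(literal)
        elif host and line.startswith("PROCESSES:"):
            nodes[host]["processes"] = line.split(":", 1)[1].strip()
    return nodes


def leaked_resources(nodes):
    """Return {host: {resource: used}} for idle nodes with used > 0."""
    leaks = {}
    for host, info in nodes.items():
        if info["processes"] != "[]":
            continue
        used = {name: pair[0] for name, pair in info["resources"].items()
                if pair[0] > 0}
        if used:
            leaks[host] = used
    return leaks


SAMPLE = """\
HOST: fnpc238            CLASS: Worker           STATUS: up
RESOURCES:  {'disk': (0, 8), 'cpu': (100, 100), 'Worker': (0, 0)}
PROCESSES:  []
"""
print(leaked_resources(parse_nodes_long(SAMPLE)))
```

An empty result means that no node shows this inconsistency in the parsed output.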


Appendix A: Format of FBSNG Configuration Files

The FBSNG configuration consists of two plain text files, fbs.cfg and farm.cfg. Both files have the same general structure. Each file consists of sets of parameters. Each set has a set type and a set identification (set id). A file should contain at most one set with a given combination of set type and set id. Each set begins with a set header line:

 

%set <set type> [<set id>]

 

The set id is optional; a set without one is called the type default set. There should be no more than one default set for each type. Values of all parameters specified in the type default set are used as defaults for all other sets of the same type.

 

For example, the following configuration specification

 

%set car

wheels = 4

seats = 5

 

%set car van

seats = 8

 

%set car convertible

convertible_top = yes

 

is equivalent to:

 

%set car van

wheels = 4

seats = 8

 

%set car convertible

wheels = 4

seats = 5

convertible_top = yes
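The default-inheritance rule illustrated by the car example can be expressed as a small Python sketch. It operates on already-parsed sets held in a dictionary keyed by (set type, set id), with None as the id of the type default set. That data layout and the function name apply_defaults are assumptions made for illustration, not FBSNG internals.

```python
def apply_defaults(sets):
    """Merge each type default set into the other sets of the same type.

    sets maps (set_type, set_id) to a parameter dictionary; set_id is None
    for the type default set. Returns only the named sets, fully resolved.
    """
    resolved = {}
    for (set_type, set_id), params in sets.items():
        if set_id is None:
            continue
        merged = dict(sets.get((set_type, None), {}))  # start from defaults
        merged.update(params)                          # own values win
        resolved[(set_type, set_id)] = merged
    return resolved


car_config = {
    ("car", None): {"wheels": "4", "seats": "5"},
    ("car", "van"): {"seats": "8"},
    ("car", "convertible"): {"convertible_top": "yes"},
}
for key, params in sorted(apply_defaults(car_config).items()):
    print(key, params)
```

Values given in a named set take precedence over the defaults, which is why the van ends up with 8 seats.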

 

The set header line is followed by one or more parameter definitions. Parameters can be specified in one of 3 formats:

 

<name> = <value>

<name> = <item> <item> …

<name> = <key>[:<value>] <key>[:<value>] …

 

The first format represents a parameter with a single value, the second a parameter with a list of values, and the third a dictionary. For a dictionary, some values may be omitted together with their separating colons. For keys without values, the application uses a default value.

 

When a list or a dictionary does not fit on a single line, it can be continued on the next line. For example:

 

%set car

tires = left-front:34psi left-rear:30psi

       right-front:34psi right-rear:30psi

       spare:unknown

 

Words in parameter values in the first two formats can be enclosed in single or double quotes to allow white space inside a word. The pound sign (#) can be used as a comment character. Unless the pound sign is in the middle of a word, the rest of the line starting with the pound sign is ignored. For example:

 

%set parser                      # this set describes some

                                   # sort of parser

special_chars = +-#*/        # Pound in the middle does NOT

                                   # begin comment

error_format = 'ERROR: %t%d'       # use quotes to enter spaces

#uncomment to allow a time-out

#time_out = 10
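The comment and continuation rules described in this appendix can be sketched in Python as follows. This is not the FBSNG parser. It assumes that a line matching "<name> = ..." starts a new parameter and that any other non-empty line continues the previous one. It does not give quotes special treatment, so a pound sign at the start of a word inside quotes would also be read as a comment. Values are kept as raw strings. The names strip_comment and parse_config are hypothetical.

```python
import re

COMMENT = re.compile(r"(^|\s)#.*$")   # '#' at line start or after white space
HEADER = re.compile(r"^%set\s+(\S+)(?:\s+(\S+))?\s*$")
PARAM = re.compile(r"^\s*(\w+)\s*=\s*(.*)$")


def strip_comment(line):
    """Drop a comment; a '#' in the middle of a word is left alone."""
    return COMMENT.sub("", line).rstrip()


def parse_config(text):
    """Return {(set_type, set_id): {name: raw_value}}; set_id may be None."""
    sets = {}
    params = None
    last_name = None
    for raw in text.splitlines():
        line = strip_comment(raw)
        if not line.strip():
            continue
        header = HEADER.match(line)
        if header:
            params = sets.setdefault((header.group(1), header.group(2)), {})
            last_name = None
            continue
        if params is None:
            raise ValueError("parameter outside of a set: %r" % raw)
        param = PARAM.match(line)
        if param:
            last_name = param.group(1)
            params[last_name] = param.group(2).strip()
        elif last_name is not None:
            # Continuation line: append to the previous parameter's value.
            params[last_name] = (params[last_name] + " " + line.strip()).strip()
        else:
            raise ValueError("continuation without a parameter: %r" % raw)
    return sets


SAMPLE = """\
%set car
tires = left-front:34psi left-rear:30psi
       right-front:34psi right-rear:30psi
       spare:unknown

%set parser                      # this set describes some
                                   # sort of parser
special_chars = +-#*/        # Pound in the middle does NOT
                                   # begin comment
error_format = 'ERROR: %t%d'       # use quotes to enter spaces
#time_out = 10
"""
config = parse_config(SAMPLE)
print(config[("parser", None)])
```

Splitting the raw values into single values, lists or dictionaries is left out, because the choice among the three formats depends on the parameter and is made by the application.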