FBSNG Installation and Administration Guide
Jim Fromm, Krzysztof Genser, Tanya Levshina, Igor Mandrichenko
Farms and Clustered Systems Group
Fermi National Accelerator Laboratory
Daemons Configuration File fbs.cfg
Farm Configuration File farm.cfg
How to Tell Whether FBSNG is Running
Shutting Down FBSNG Components
Appendix A: Format of FBSNG Configuration Files
Appendix B: Sample fbs.cfg file
Appendix C: Sample farm.cfg file
FBSNG is distributed as a Fermilab Unix Environment (FUE) product. FUE simplifies the installation of FBSNG; however, FBSNG is fully capable of operating in a non-FUE environment. This document covers the initial installation procedures for both FUE and non-FUE environments. Refer to the FBSNG Release Notes document if you are upgrading FBSNG to a newer version.
FBSNG requires the following products to be installed:
The following is the set of steps the administrator has to perform in order to install FBSNG, and the products it depends on, on a FUE farm.
1. Using UPD/UPS, install and declare the FCSLIB product. It is a FUE-compliant product distributed through KITS. It is portable across the supported UNIX versions, so it should be declared for the NULL UPS flavor. FCSLIB comes with the fcslib.table UPS action table file, which must be specified in the "ups declare" command:
ups declare -0 -c -m fcslib.table ... fcslib <version>
2. Using UPS/UPD, install and declare the fbsng product. This will unwind the product into a directory referred to as $FBSNG_DIR. Currently, fbsng is portable across different known versions of Linux and IRIX, so only the OS flavor, not the version, has to be included in the UPS declaration.
FBSNG comes with the UPS action table file fbsng.table, which must be mentioned in the "ups declare" command:
ups declare -1 -m fbsng.table ... fbsng <version>
ups declare -f Linux -m fbsng.table ... fbsng <version>
ups declare -f IRIX -m fbsng.table ... fbsng <version>
3. Create a directory, preferably in an NFS-shared area, for FBSNG configuration files, databases and utility scripts. This directory should be shared by nodes running different flavors of UNIX. This directory will be referred to as <FBSNG_ROOT> or $FBSNG_ROOT.
4. Using UPS, "tailor" the product:
ups tailor -O <FBSNG_ROOT> fbsng <version>
ups tailor -O <FBSNG_ROOT> -c fbsng
This step must be performed from the same account as was used in step 1.
5. Copy the directory tree from <FBSNG_DIR>/templates to <FBSNG_ROOT> using:
cp -r <FBSNG_DIR>/templates/* <FBSNG_ROOT>
This should create the directories:
<FBSNG_ROOT>/cfg - for configuration
<FBSNG_ROOT>/bin - for utility scripts
<FBSNG_ROOT>/history - for historical DB
<FBSNG_ROOT>/logs - for internal FBSNG logs
<FBSNG_ROOT>/slog - for section log files
<FBSNG_ROOT>/archive - for compressed log archives
<FBSNG_ROOT>/jdb - for job database
6. Configure FBSNG (see FBSNG Configuration section below). Create FBSNG configuration files fbs.cfg and farm.cfg in <FBSNG_ROOT>/cfg. You can use fbs.cfg.example file in the same directory as an example. Note that interactive FBSNG configuration tools can not be used until BMGR process is running. Therefore, it is recommended to either
· Create initial configuration editing farm.cfg file with a text editor, or
· Rename farm.cfg.template file into farm.cfg, start BMGR (see Starting FBSNG section below), and then use FBSNG Configuration Utility or FBSNG GUI to create desired configuration.
7. Follow the instructions in the scripts in <FBSNG_ROOT>/bin to adjust them to your installation, in particular the location of the setups.(sh|csh) scripts.
The following procedure should be used to initially install FBSNG on a non-FUE farm.
1. Install the FCSLIB product:
Download the fcslib_<version>_<flavor>.tar file.
Unwind the tar file into a directory, preferably NFS-shared. This directory will be referred to as <FCSLIB_DIR>. FCSLIB contains some Python modules used by FBSNG. It is portable across different flavors of UNIX, and therefore <FCSLIB_DIR> can be shared by all nodes of the farm.
2. Designate a directory for FBSNG configuration files, databases and utility scripts. This directory should be shared by nodes running different flavors of UNIX, therefore, it is recommended to have it in NFS-shared space. This directory will be referred to as <FBSNG_ROOT> or $FBSNG_ROOT.
3. Designate a directory for FBSNG scripts, executables and libraries. This directory will be referred to as <FBSNG_DIR> or $FBSNG_DIR. Each different flavor of UNIX (not each separate version of the same OS) must have its own <FBSNG_DIR>.
4. Download and unwind the fbsng_<version>_<flavor>.tar file(s) into the corresponding <FBSNG_DIR>s. This will create:
<FBSNG_DIR>/bin - with FBSNG binaries and Python sources
<FBSNG_DIR>/lib - with FBSNG API and miscellaneous modules
<FBSNG_DIR>/templates - with configuration and script templates
5. Copy the directory tree from <FBSNG_DIR>/templates to <FBSNG_ROOT> using:
cp -r <FBSNG_DIR>/templates/* <FBSNG_ROOT>
This should create the directories:
<FBSNG_ROOT>/cfg - for configuration
<FBSNG_ROOT>/bin - for utility scripts
<FBSNG_ROOT>/history - for historical DB
<FBSNG_ROOT>/logs - for internal FBSNG logs
<FBSNG_ROOT>/slog - for section log files
<FBSNG_ROOT>/archive - for compressed log archives
<FBSNG_ROOT>/jdb - for job database
6. Configure FBSNG (see FBSNG Configuration section below). Create FBSNG configuration files fbs.cfg and farm.cfg in <FBSNG_ROOT>/cfg. You can use fbs.cfg.example file in the same directory as an example. Note that interactive FBSNG configuration tools can not be used until BMGR process is running. Therefore, it is recommended to either
· Create initial configuration editing farm.cfg file with a text editor, or
· Rename farm.cfg.template file into farm.cfg, start BMGR (see Starting FBSNG section below), and then use FBSNG Configuration Utility or FBSNG GUI to create desired configuration.
7. Follow the instructions in the scripts in <FBSNG_ROOT>/bin to adjust them to your installation. In particular, comment out the lines related to the FUE environment, uncomment the lines for the non-FUE environment, and fill in the actual <FBSNG_DIR> and <FCSLIB_DIR> locations.
fbs_env.(sh|csh) will be 'sourced' by users in order to use FBSNG. Sourcing the fbs_env script adds the front-end 'fbs' command to the user's PATH and the FBSNG API to the user's PYTHONPATH. Make sure that <FCSLIB_DIR>/lib is included in PYTHONPATH.
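After either installation procedure, the result of the template copy can be checked mechanically. The following is a minimal sketch, not part of the FBSNG distribution: the function name `missing_subdirs` is hypothetical, and the directory names are taken from the listing in this guide (with `logs` as used by the log layout described later).

```python
import os

# Subdirectories this guide says the template copy creates under <FBSNG_ROOT>.
EXPECTED_SUBDIRS = ("cfg", "bin", "history", "logs", "slog", "archive", "jdb")


def missing_subdirs(fbsng_root):
    """Return the expected subdirectories that do not exist under fbsng_root."""
    return [name for name in EXPECTED_SUBDIRS
            if not os.path.isdir(os.path.join(fbsng_root, name))]


if __name__ == "__main__":
    # Assumption: FBSNG_ROOT is set in the environment; fall back to ".".
    root = os.environ.get("FBSNG_ROOT", ".")
    missing = missing_subdirs(root)
    print("missing:", ", ".join(missing) if missing else "none")
```

An empty result only means the directories exist; it says nothing about whether fbs.cfg and farm.cfg have been created yet.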
FBSNG has two configuration files, farm.cfg and fbs.cfg, located in $FBSNG_ROOT/cfg directory.
File fbs.cfg contains configuration parameters for FBSNG components such as TCP/IP addresses, directory locations, time-outs, security-related parameters, etc. This file should be edited by a text editor. For any modifications of this file to take effect, corresponding FBSNG component or components must be re-started.
File farm.cfg contains the description of the farm in terms of resources, process types and scheduling parameters. FBSNG comes with the FBSNG Configuration Utility, and the FBSNG monitor includes a GUI version of it. The configuration utility allows the administrator to perform most configuration operations without editing the farm.cfg file. However, in an emergency, or when the configuration utility does not provide the necessary functionality, manual editing of the farm.cfg file may be required. In this case, the BMGR daemon must be restarted after the editing in order for the modifications to take effect.
Manual editing of the farm.cfg file is required when the administrator needs to remove FBSNG objects such as queues, process types, node classes, etc. The one exception is removing a node from a node class, which can be done using the FBSNG Configuration Utility or GUI. Certain rules must be followed when manually editing the farm.cfg file:
If a non-empty queue is removed from the configuration, BMGR process will not re-start until the queue is returned back into the configuration.
If a process type is removed from FBSNG configuration, BMGR process will not re-start until the process type is returned back into the configuration.
If a node with batch processes running is removed, the processes will continue to run, but FBSNG will consider them terminated as if the node was shut down.
If a resource or pool is removed, all pending sections that require the resource or pool will remain pending indefinitely.
Both configuration files are described below.
FBSNG daemons configuration is specified in the $FBSNG_ROOT/cfg/fbs.cfg file. The file has the structure described in Appendix A: Format of FBSNG Configuration Files. This file must be located in the $FBSNG_ROOT/cfg directory. The following is a description of each set of the fbs.cfg file and its parameters.
See Appendix B for an example of fbs.cfg file.
This set defines parameters for the BMGR daemon. The following table describes the meaning of each parameter and its default value, if any:
| Parameter | Value | Default | Meaning of Value |
| api_port | Integer | (required) | TCP port number for API clients. |
| launcher_if_port | Integer | (required) | LauncherIF port that launchers connect to. Unless BMGR is running as root, this number should be 1024 or greater. |
| host | String | (required) | Host name where BMGR is running. |
| allow_submit | List of Strings | (allow submit from any node) | List of patterns that determines which computers job submission is allowed from. Each pattern can be in one of 3 forms. |
| deny_submit | List of Strings | (empty list) | List of patterns that determines which computers job submission is denied from. Patterns are in the same format as for the “allow_submit” parameter. |
| admin_list | List of Strings | (any user is an admin) | List of user names or numeric user ids. This is the list of users authorized to modify the FBSNG configuration. |
| section_log_dir | String | (required) | Directory where section logs are stored. |
| recovery_timeout | Integer | 10 | BMGR recovery time-out in seconds. During the initial recovery interval, BMGR waits for launchers on farm nodes to connect and ignores requests from API clients such as User Interface commands and the GUI. |
| log_ignore | String | "" | Log message types to suppress: ‘I’, ‘D’, ‘E’, ‘F’ or any combination of them. Each letter suppresses log messages of types Informational, Debug, Error and Fatal respectively. |
| job_retention_interval | Integer | 600 | Job retention interval in seconds. Jobs will be kept in memory for the specified time after completion. |
Parameters “allow_submit” and “deny_submit” define what computers job submission is allowed from. These computers are referred to as “trusted computers”. Submission is allowed from the node if:
and
Together with allow_submit and deny_submit lists, the admin_list parameter defines list of “trusted users” or “administrators”. The user is an administrator if:
and
Depending on username, UID and the node the user comes from, he/she can perform the following actions:
| Trusted node? | Trusted user? | Can modify configuration | Can submit jobs |
| Yes | Yes | Yes | Yes |
| Yes | No | No | Yes |
| No | Yes | No | No |
| No | No | No | No |
Any user on any computer is allowed to obtain run-time and configuration information.
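The access table above reduces to two boolean rules: job submission depends only on the node, and configuration changes require both a trusted node and a trusted user. The sketch below only restates the table; the function name `permissions` is hypothetical, and how FBSNG decides that a node or user is trusted is not modelled here.

```python
def permissions(trusted_node, trusted_user):
    """Return (can_modify_configuration, can_submit_jobs) per the access table."""
    can_submit = trusted_node
    can_modify = trusted_node and trusted_user
    return can_modify, can_submit


# Print all four rows of the table.
for node in (True, False):
    for user in (True, False):
        print(node, user, permissions(node, user))
```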
This set controls where FBSNG keeps its job database. Currently, it has only one parameter:
| Parameter | Value | Default | Meaning of Value |
| root | String | (required) | Path to the directory where the job database is located. The directory should not contain anything other than the database created by BMGR. |
This set defines parameters for the FBSNG logd daemon and the supplemental archiving, compression and purging scripts.
| Parameter | Value | Default | Meaning of Value |
| host | String | (required) | Log daemon host name |
| server_port | Int | (required) | Log daemon UDP port number |
| log_dir | String | (required) | Log directory |
| email_wait | Int | 1800 | Time interval between sending identical e-mail messages |
| email_command | String | "Mail" | Command to be used to send e-mail |
| arch_dir | String | (required) | Log archive storage directory |
| days_to_zip | Int | 1 | Age, in days, of log files to compress |
| days_to_archive | Int | 5 | Age, in days, of files to archive |
| ignor_codes | String | "" | Message codes to ignore. This is a combination of single-character message type codes. Any message whose code is present in this string will be discarded without being stored in the log files. |
| admin_address | String | None | E-mail address for error reports |
This set defines parameters for FBSNG job history database:
| Parameter | Value | Default | Meaning of Value |
| archive_period | Integer | 1 | Interval, in days, between moving the data from the history file to the archive file |
| hist_dir | String | (required) | Directory where history file(s) are stored |
| archive_file | String | archive.log | Archived history file name |
| hist_file | String | hist.log | Current history file name |
| compress_period | Integer | | Interval, in days, between compressing history information into .tar.Z files |
| days_to_store | Integer | | Time, in days, to keep information in the history database |
This set has the following parameters:
| Parameter | Value | Default | Meaning of Value |
| domain | String | (empty) | Common suffix for farm node names |
| mail_command | String | /bin/mail | External command to be used by FBSNG components to send e-mail messages |
| farm_name | String | (empty) | Name of the farm |
Value of “domain” parameter should be common IP domain name for all farm nodes. If specified, farm node name used by FBSNG will be determined by truncating the specified domain name from the “official” (obtained by gethostname() system function) IP node name.
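The truncation rule can be illustrated with a short sketch. It assumes that the separating dot is removed together with the domain and that a host name outside the configured domain is left unchanged; neither detail is stated in this guide, and `farm_node_name` is a hypothetical name, not an FBSNG function.

```python
def farm_node_name(official_name, domain=""):
    """Strip the configured domain suffix from an official host name."""
    if domain:
        # Assumption: the dot separating host and domain is dropped as well.
        suffix = "." + domain.lstrip(".")
        if official_name.endswith(suffix):
            return official_name[:-len(suffix)]
    return official_name


print(farm_node_name("fnpc101.fnal.gov", "fnal.gov"))
```

With `domain = fnal.gov`, as in the sample fbs.cfg, this yields the short names (fnpc101, fnpcb, ...) that appear in farm.cfg and in the “fbs nodes” output.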
Set of type “launcher” has only one integer parameter “stat_port”. Its value is UDP port number used by API clients to obtain run-time information about running processes.
This file contains definition of farm resources, farm nodes, their classes, process types, and queues. See FBSNG Resources and FBSNG Scheduler documents for more details on these concepts and relationships between them. This file must be located in $FBSNG_ROOT/cfg directory. The file protection mask must allow BMGR process to write into this file. See Appendix C for an example of farm.cfg file.
Normally, FBSNG Configuration Utility or GUI should be used to modify farm configuration defined in farm.cfg file. However, under certain circumstances it may be necessary to edit this file using a text editor. In such cases, BMGR daemon must be re-started in order for the modifications to take effect.
Farm resources are defined in set “resources” of farm.cfg. The set has two parameters:
| Parameter | Value | Default | Meaning of Value |
| local | List of Strings | (no local resources) | List of local resource names, including node attributes. |
| global | Dictionary String:Integer | (no global resources) | Dictionary with resource names as keys and corresponding integer resource capacities as values |
Resource pools are defined in set “resource_pools”. Each pool is represented with single set parameter with the pool name as parameter name and list of underlying resource names as value.
Each node class is represented as a separate set of type “node_class” and the class name as set id. Each set has two parameters:
| Parameter | Value | Default | Meaning of Value |
| resources | Dictionary String:Integer | (no local resources defined for the class) | Dictionary with local resource names as keys and optional integer capacities as values. If a value is not specified, the corresponding resource will be considered a node attribute. |
| local_disks | Dictionary String:String | (no local disks defined for the class) | This dictionary describes the correspondence between resource names used to represent local scratch disks and the actual locations of the scratch disk storage. For each key-value pair, the key must be the name of one of the local resources defined for the node class, and the value is the path to the corresponding directory. |
Local scratch disk areas are specified as their top directories. If the specified top directory does not exist, it will be created. FBSNG assumes that the specified directory will not contain any file or directory other than those created by FBSNG as temporary working directories for batch processes, and therefore, can delete any file or directory found there at any time.
Set “node_list” defines the node class each node belongs to. The set consists of
<node name> = <node class name>
lines, one line per node. Node name for a farm node is defined as its official IP node name with domain name (as specified in set “global” of fbs.cfg, if any) truncated.
Each process type is represented as a separate set of type “proc_type” with the process type name as the set id. Each set has 3 parameters:
| Parameter | Value | Default | Meaning of Value |
| proc_rsrc_defaults | Dictionary String:Integer | (no default resource requirements) | This dictionary defines default resource requirements for processes of this type. For each key-value pair, the key must be the name of a resource, and the value is the integer amount of the resource required. For node attributes, the value should be omitted. |
| sect_rsrc_defaults | Dictionary String:Integer | (no default resource requirements) | This dictionary defines default resource requirements for sections of this type. For each key-value pair, the key must be the name of one of the global resources, and the value is the integer amount of the resource required. |
| Resource_quota | Dictionary String:Integer | (no local disks defined for the class) | This dictionary defines resource allocation quotas for the process type. For each key-value pair, the key must be the name of one of the local or global resources, and the value is the allocation limit. If a resource is not mentioned in the dictionary, the process type can allocate any amount of the resource. |
FBSNG Scheduler is configured by defining set of Scheduler queues and their parameters. See FBSNG Scheduler document for more details on its algorithm and configuration parameters.
In farm.cfg file, each queue is represented by one set of type “queue” with the following parameters:
| Parameter | Value Type | Default | Meaning of Value |
| proc_type | String | | Default process type for the queue |
| s_prio_max | Int | 100 | Maximal section priority |
| s_prio_gap | Int | 1000 | Section priority gap |
| q_prio_min | Int | 0 | Minimal queue priority (= initial queue priority) |
| q_prio_max | Int | 100 | Maximal queue priority |
| q_prio_inc | Int | 1 | Queue priority increment |
| q_prio_dec | Int | 10 | Queue priority decrement |
| q_prio_gap | Int | 1000 | Queue priority gap |
| cputime | Int | (unlimited) | Process CPU time limit |
| realtime | Int | (unlimited) | Process run time limit |
The utility can work in 2 modes, interactive and single command. In interactive mode the utility prompts for next command after completion of previous. In single-command mode, it executes only one command and exits. Interactive mode is invoked by "fbs config" command. Command syntax is:
<command> <arguments> ...
Single-command mode is invoked by typing
fbs config <command> <arguments> ...
Both modes accept the same set of commands and arguments:
The following commands are used to create FBSNG objects:
create queue <queue name> <def. process type> [<parameters>]
See “set queue” command description for list of optional parameters. If no parameters are specified, the queue is created with default set of parameters and the specified process type associated with it. Queues are always created in locked state. In order to unlock a queue, use “unlock queue” command.
create nclass <node class name>
The command creates an empty (having no nodes associated to it) node class with no resources or local disks defined. Use "set nclass" command to configure local resources and local scratch disks for the node class. Use "add/remove node" commands to manipulate list of nodes of the class.
create ptype <proc type name> [<parameters>]
See "set ptype" command description for accepted list of optional parameters. If no parameters are specified, the new process type will have no quotas or default resource requirements defined.
create pool <pool name> <resource name> ...
The command creates new resource pool combining listed resources.
create gr <global resource name> <capacity>
The command creates new global resource with specified capacity.
create lr <local resource name>
The command creates new local resource.
The following commands can be used to modify parameters of existing objects:
set queue <queue name> <parameter>[:<value>] ...
The command accepts the following parameters: QPGap, QPInc, QPDec, MaxQPrio, MinQPrio, MaxSPrio, SPGap, Prio, DefProcType, RealTimeLimit, CPUTimeLimit.
Values for parameters RealTimeLimit and CPUTimeLimit can be omitted together with separating colons. In this case corresponding limits will be removed.
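When “set queue” is driven from a script in single-command mode, the argument list can be assembled programmatically. This is a minimal sketch: the helper `set_queue_command` and the queue name `fast` are hypothetical, and the sketch only builds the argument list from the syntax shown above without running the command.

```python
def set_queue_command(queue, params):
    """Build an argv list for 'fbs config set queue'.

    params maps parameter names to values. A value of None emits the bare
    parameter name (no colon), which this guide says removes the
    RealTimeLimit or CPUTimeLimit limit.
    """
    args = ["fbs", "config", "set", "queue", queue]
    for name, value in params.items():
        args.append(name if value is None else "%s:%s" % (name, value))
    return args


# Raise the priority increment and remove the CPU time limit of queue "fast".
print(" ".join(set_queue_command("fast", {"QPInc": 2, "CPUTimeLimit": None})))
```

The resulting list can be passed to `subprocess.run` on a host where the `fbs` command is on PATH.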
set nclass <node class> rsrcs <resource>:<capacity> ...
“rsrcs” keyword indicates that local resource capacities for the class should be set as specified by the following dictionary.
set nclass <node class> disks <disk resource name>:<root directory> ...
“disks” keyword indicates that the following is the mapping between resource names and associated local scratch disk locations. Local disk resources must already be defined as local resources for the node class.
set ptype <proc type name> <parameter> <value>
Parameters and their values are:
set gr <global rsrc name> <capacity>
set pool <pool name> <resource name> ...
The following commands should be used to add or remove nodes to or from an existing node class. Only nodes without any processes running can be removed from a node class. When a node is added to a node class, its status becomes "held".
add node <node class name> <node name> ...
remove node <node class name> <node name> ...
An administrator can lock a queue to disable submission of new sections to it in order to "drain" the queue down. When a new queue is created, it must be unlocked before it can be used. The commands are:
lock queue (all|<queue name>)
unlock queue (all|<queue name>)
If the keyword “all” is used instead of a queue name, the operation will be applied to all existing queues.
In order to see information about one or more objects of certain type, one can use "show" command:
show (queue|nclass|pool|ptype|rsrc) [<name> ...]
If no object name is specified, all objects of the type will be displayed. "rsrc" keyword is used to show information about all types of resources.
If a user enters an incomplete command, usage information is printed. A user can also issue the help command for a brief command summary.
FBSNG has the following permanently running processes (daemons):
Although not necessary, it is recommended to set up an unprivileged account for bmgr and logd and to create the <FBSNG_ROOT> directory in the NFS-exported home area of that account.
In order to start each daemon, the administrator should use the start_bmgr.sh, start_logd.sh and start_lch.sh scripts found in the <FBSNG_ROOT>/bin directory. Before they can be used, the scripts must be adjusted to the particular installation following the instructions found in the script files. The administrator must 'su' to the proper account before using the scripts. The scripts are designed to be called from system boot-time start-up scripts.
The FBSNG distribution includes the “sendcmd” script. This script, located in the $FBSNG_ROOT/bin directory, can be used to perform certain operations on a number of nodes using the “rsh” command. The “sendcmd” script can be used to start launchers on farm nodes in the following way:
cd $FBSNG_ROOT/bin
./sendcmd fnpc 101 120 /home/farms/fbsng_root/bin/start_lch.sh
This command will run the start_lch.sh script on nodes fnpc101 through fnpc120.
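The node range in the example can be read as a prefix followed by an inclusive pair of numbers. The sketch below reproduces only that expansion, as inferred from the example; it is not the sendcmd implementation, the function name `node_names` is hypothetical, and zero-padded node numbers are not handled.

```python
def node_names(prefix, first, last):
    """Expand a prefix and an inclusive numeric range into node names."""
    return ["%s%d" % (prefix, n) for n in range(first, last + 1)]


names = node_names("fnpc", 101, 120)
print(names[0], names[-1], len(names))
```

A list like this is handy for checking afterwards which of the targeted nodes have actually connected to BMGR.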
In cases when using the start_xxx.sh scripts is not acceptable due to site-specific policies or procedures, the FBSNG administrator can use lower-level commands to start FBSNG components. The following commands run the corresponding daemons:
fbs bmgr # run BMGR
fbs launcher # run Launcher
fbs logd # run Logd
A first quick check is to use the UNIX “ps” command. Here is what the ps output should look like for BMGR:
fnpcb> ps axw | grep bmgr
17177 pts/12 S 0:01 sh -f /home/farms/fbsng_v1_1/bin/fbs_run.sh bmgr
20163 pts/12 S 0:00 sh /fnal/ups/prd/fbsng/v1_1/Linux/bin/fbsng bmgr
20165 pts/12 S 0:09 python /fnal/ups/prd/fbsng/v1_1/Linux/bin/bmgr.py
for launcher:
fnpcb> ps axw | grep launcher
13660 ? S 0:00 sh -f /home/farms/fbsng_v1_1/bin/fbs_run.sh launcher
20488 ? S 0:00 sh /fnal/ups/prd/fbsng/v1_1/Linux/bin/fbsng launcher
20509 ? S 28:50 python /fnal/ups/prd/fbsng/v1_1/Linux/bin/launcher.py
and for logd:
fnpcb> ps axw | grep logd
18921 pts/12 S 0:00 sh -f /home/farms/fbsng_v1_1/bin/fbs_run.sh logd
19007 pts/12 S 0:00 sh /fnal/ups/prd/fbsng/v1_1/Linux/bin/fbsng logd
19008 pts/12 S 0:21 python /fnal/ups/prd/fbsng/v1_1/Linux/bin/logd.py
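The same check can be scripted by scanning ps output for the Python processes shown above. This is a rough sketch under the assumption that each daemon appears as `python .../<name>.py` on a single line, as in the listings; the function name `running_daemons` is hypothetical.

```python
import re

# Matches the Python process of a daemon, e.g. "python /path/bin/bmgr.py".
_DAEMON_PROCESS = re.compile(r"python\s+\S*/(bmgr|launcher|logd)\.py\b")


def running_daemons(ps_output):
    """Return the set of FBSNG daemons whose Python process appears in ps output."""
    found = set()
    for line in ps_output.splitlines():
        match = _DAEMON_PROCESS.search(line)
        if match:
            found.add(match.group(1))
    return found


sample = (
    "17177 pts/12 S 0:01 sh -f /home/farms/fbsng_v1_1/bin/fbs_run.sh bmgr\n"
    "20165 pts/12 S 0:09 python /fnal/ups/prd/fbsng/v1_1/Linux/bin/bmgr.py\n"
)
print(running_daemons(sample))
```

The wrapper shell processes (fbs_run.sh, fbsng) are deliberately not counted, since they can be present while the daemon itself is between restarts.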
Also, in order to see if FBSNG is working properly, the administrator can use the “fbs nodes” command. If the command prints the list of the farm nodes, it means that the important FBSNG components are configured properly and the BMGR process is running. Here is what normal “fbs nodes” output might look like:
fnpcb> fbs nodes
HOST     STATUS  CLASS   PROCESSES
----     ------  -----   --------
fnpcb    up      IO      []
fnpc104  down    Worker  []
fnpc105  down    Worker  []
fnpc106  down    Worker  []
fnpc107  down    Worker  []
fnpc101  down    Worker  []
fnpc102  down    Worker  []
fnpc103  down    Worker  []
fnpc108  down    Worker  []
It should be noted that, after starting, the BMGR process ignores all requests while it performs its initial recovery functions. The recovery time interval is defined in the fbs.cfg file (parameter recovery_timeout of set bmgr).
The “fbs nodes” command prints the status of all farm nodes. A “down” node status may indicate that the node is down, that the launcher process on the node cannot connect to the BMGR process, or that the launcher is not running there.
If started using the corresponding start-up scripts, FBSNG components will be re-started automatically in a minute after they exit. Therefore, to re-start an FBSNG component, it is sufficient to kill the corresponding process. The <FBSNG_ROOT>/bin directory contains the kill_lch.sh, kill_bmgr.sh and kill_logd.sh scripts. They can be used to kill the launcher, BMGR and logd processes respectively. Of course, they must be used on the node where the process is running, and they require the privileges necessary to kill the process.
In order to shut down FBSNG component so that it does not re-start afterwards, shutdown_lch.sh, shutdown_bmgr.sh and shutdown_logd.sh scripts found in <FBSNG_ROOT>/bin directory should be used.
FBSNG is designed so that it can recover from most of planned or unexpected component failures. However, when planning re-start of FBSNG components, for example after modifying FBSNG configuration, the administrator should follow certain rules:
Launcher on a farm node should never be re-started while any batch processes run on the farm node. Launcher termination is interpreted as shutting down the node it runs on, and any processes running there are considered to be terminated. If it is necessary to shut down or re-start a launcher, the administrator should hold the node using “fbs hold node” command, either kill all jobs or job sections running on the node, or wait until all processes running on the node exit, and only then re-start or shut down the launcher process.
In order to prevent FBSNG database corruption, the BMGR process must not be killed with SIGKILL (usually signal 9). It should be killed only with the SIGINT signal. It may take several seconds for BMGR to exit after receiving the signal, depending on what state it was in when the signal was received. It is recommended to always use kill_bmgr.sh to re-start the BMGR process or shutdown_bmgr.sh to shut it down permanently. As long as it is done properly, it is safe to shut down or re-start the BMGR process at any time. When the BMGR process re-starts, it will recover run-time information about pending and running batch jobs.
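Where a site script has to signal BMGR directly instead of calling kill_bmgr.sh, the signal choice can be made explicit. This is only an illustrative sketch: `interrupt_bmgr` is a hypothetical helper, the pid must be obtained separately (for example from the ps output shown earlier), and the `kill` argument exists only so the function can be exercised without a real process.

```python
import os
import signal


def interrupt_bmgr(pid, kill=os.kill):
    """Send SIGINT (never SIGKILL) to the BMGR process with the given pid."""
    kill(pid, signal.SIGINT)


# Dry run: record the call instead of signalling a real process.
calls = []
interrupt_bmgr(20165, kill=lambda pid, sig: calls.append((pid, sig)))
print(calls[0][1] == signal.SIGINT)
```

Since BMGR may take several seconds to exit, a caller should wait for the process to disappear rather than escalate to a stronger signal.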
Logd process can be killed virtually any time. If logd re-starts shortly, no log information will be lost. However, if it does not re-start soon enough, some old log information may be discarded.
Unless configured otherwise, FBSNG keeps its own log files in <FBSNG_ROOT> directory tree. They are organized as follows:
<FBSNG_ROOT>/logs/bmgr - BMGR log files
<FBSNG_ROOT>/logs/lch - Launcher log files
<FBSNG_ROOT>/slog - Section log files
BMGR and Launcher log file names include the date when the file was opened and the node where the process is running. For Launcher, the log file name also includes the UNIX process ID of the Launcher process. The log files are closed around midnight every day and a new log file with the new date encoded in its name is created. Examples of BMGR and Launcher log files are:
lch.fnpc105.941.20000628.log – this log file contains messages received from the Launcher running on the fnpc105 computer with process ID 941 during the day of 6/28/2000.
bmgr.fnpcb.20000705.log – contains messages received from BMGR running on fnpcb during the day of 7/5/2000.
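The naming scheme in the two examples above can be decoded with a regular expression, which is useful when selecting log files by node or date. The pattern is inferred from those two examples only and assumes that node names contain no dots (the domain having been truncated); `parse_log_name` is a hypothetical helper.

```python
import re
from datetime import datetime

# <daemon>.<node>[.<pid>].<YYYYMMDD>.log ; the pid field is present for lch only.
LOG_NAME = re.compile(
    r"^(?P<daemon>bmgr|lch)\.(?P<node>[^.]+)\."
    r"(?:(?P<pid>\d+)\.)?(?P<date>\d{8})\.log$"
)


def parse_log_name(filename):
    """Split a BMGR or Launcher log file name into its fields, or return None."""
    m = LOG_NAME.match(filename)
    if m is None:
        return None
    return {
        "daemon": m.group("daemon"),
        "node": m.group("node"),
        "pid": int(m.group("pid")) if m.group("pid") else None,
        "date": datetime.strptime(m.group("date"), "%Y%m%d").date(),
    }


print(parse_log_name("lch.fnpc105.941.20000628.log"))
print(parse_log_name("bmgr.fnpcb.20000705.log"))
```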
These log files are created by logd process, and therefore would not exist if the logd process was not running.
Section log files are created by the BMGR process itself and can be viewed using the “fbs slog <section id>” command.
Along with log files maintained by logd, BMGR and Launchers create stderr and stdout files in /tmp directories on the nodes where they are running. These files usually contain only Python stack crash dump information and should be consulted if corresponding log file is unavailable or does not contain necessary information.
FBSNG distribution package includes set of scripts to be used to maintain FBSNG logs and purge old log files. The scripts are designed to run periodically, e.g. daily, as cron tasks. The scripts are located in <FBSNG_DIR>/templates/bin and should be copied to <FBSNG_ROOT>/bin directory during installation. The scripts are:
logd.run.cron – this script performs the following functions:
The script parameters can be customized in fbs.cfg in set “logd”.
history.run.cron – this script performs history database maintenance functions. It is configured in “history” set of fbs.cfg file.
FBSNG configuration consists of two plain text files, fbs.cfg and farm.cfg. Both files have the same general structure. Each file consists of sets of parameters. Each set has set type and set identification (set id). There should be only one set with the same combination of set type and set id in the file. Each set begins with set header line:
%set <set type> [<set id>]
Set id is optional, and if not specified, the set is called type default set. There should be not more than one default set for each type. Values of all parameters specified in the type default set will be used as defaults for all other sets of the same type.
For example, the following configuration specification
%set car
wheels = 4
seats = 5
%set car van
seats = 8
%set car convertible
convertible_top = yes
is equivalent to:
%set car van
wheels = 4
seats = 8
%set car convertible
wheels = 4
seats = 5
convertible_top = yes
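The defaulting rule can also be expressed as a small merge step over already-parsed sets. This sketch is not the FBSNG parser: the data layout (a dictionary keyed by set type and set id, with `None` as the id of the type default set) and the name `apply_type_defaults` are assumptions made for illustration, and only named sets are returned.

```python
def apply_type_defaults(sets):
    """Merge each type default set into the named sets of the same type.

    sets maps (set_type, set_id) to a parameter dict; set_id is None for
    the type default set. Only named sets are returned.
    """
    resolved = {}
    for (set_type, set_id), params in sets.items():
        if set_id is None:
            continue
        merged = dict(sets.get((set_type, None), {}))  # start from defaults
        merged.update(params)                          # explicit values win
        resolved[(set_type, set_id)] = merged
    return resolved


config = {
    ("car", None): {"wheels": "4", "seats": "5"},
    ("car", "van"): {"seats": "8"},
    ("car", "convertible"): {"convertible_top": "yes"},
}
for key, params in apply_type_defaults(config).items():
    print(key, params)
```

Applied to the car example, the van keeps its own seats value while inheriting wheels, and the convertible inherits both defaults.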
Set header line is followed by one or more parameter definitions. Parameters can be specified in one of 3 formats:
<name> = <value>
<name> = <item> <item> …
<name> = <key>[:<value>] <key>[:<value>] …
First format is used to represent a parameter with single value, second – a parameter with a list of values and third – a dictionary. In case of dictionary type, some values may be omitted together with separating colons. For the keys without values, the application will use some default value.
In case when a list or a dictionary does not fit into single line, it can be continued on the next line. For example:
%set car
tires = left-front:34psi left-rear:30psi
right-front:34psi right-rear:30psi
spare:unknown
Words in parameter values in the first two formats can be enclosed in single or double quotes to allow white space inside a word. The pound sign (#) can be used as a comment character. Unless the pound sign is in the middle of a word, the rest of the line starting from the pound sign is ignored. For example:
%set parser # this set describes some
# sort of parser
special_chars = +-#*/ # Pound in the middle does NOT
# begin comment
error_format = 'ERROR: %t%d' # use quotes to enter spaces
#uncomment to allow a time-out
#time_out = 10
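The comment rule illustrated above can be approximated in a few lines. This sketch treats a pound sign as the start of a comment only at the beginning of the line or directly after white space; it does not handle a pound sign inside a quoted word, which is a simplifying assumption, and `strip_comment` is a hypothetical helper rather than part of FBSNG.

```python
import re

# A '#' begins a comment only at the start of a word: at the beginning of
# the line or right after white space. A '#' inside a word is left alone.
_COMMENT = re.compile(r"(^|\s)#.*$")


def strip_comment(line):
    """Remove a trailing comment and trailing white space from one line."""
    return _COMMENT.sub("", line).rstrip()


for text in ("special_chars = +-#*/   # Pound in the middle does NOT",
             "%set parser   # this set describes some",
             "#time_out = 10"):
    print(repr(strip_comment(text)))
```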
This sample fbs.cfg file is supplied with the FBSNG distribution package.
#
# fbs.cfg - FBSNG configuration file example
#
%set global
domain = fnal.gov # common domain name for farm nodes only
#farm_name = "CDF Farm" # name of the farm
#mail_command = /usr/lib/sendmail
# command to be used to send e-mail
#------------------------------------------------------------------
# Define FBS BMGR process parameters:
%set bmgr
api_port = 6667 # API server TCP port number
recovery_timeout = 10 # recovery time-out in seconds
launcher_if_port = 5557 # Launcher interface TCP port number
host = hppc.fnal.gov # IP address of BMGR node
# domain name from "global" does not apply
# here !
job_retention_interval = 20 # How long completed jobs are to be kept
# in memory (seconds)
section_log_dir = /home/farms/fbsng/slog
# where section logs are to be stored
#allow_submit = *.fnal.gov # where to accept job submission from
# 131.225.*
#deny_submit = fnpc[a-c].fnal.gov # where to reject submits from
#admin_list = root farms # list of users allowed to re-configure
# the farm
log_ignore = D # do not send debug messages to logd
#------------------------------------------------------------------
# Define Log Daemon parameters
%set logd
host = hppc.fnal.gov # IP address of the node where it is running
# domain name from "global" does not apply
# here !
server_port = 7654 # UDP server port number
log_dir = /home/farms/fbsng/logs
# Root directory for log storage
arch_dir = /home/farms/fbsng/archive
# Where log archive will be stored
days_to_zip = 2 # Compress files more than 2 days old
days_to_archive = 5 # Archive files more than 5 days old
admin_address = root # e-mail address to use when reporting
# problems
email_wait = 10 # do not send the same message more often
# than every 10 minutes
ignor_codes = DI # do not log messages of levels D (debug)
# or I (informational). Use Z to store
# all messages.
#------------------------------------------------------------------
# Define location for persistent job database and history database
%set jobdb
root = /home/farms/fbsng/jdb # Job data storage directory
%set history
hist_dir = /home/farms/fbsng/history
# History database location
hist_file = history.log # Current history file name
archive_file=archive.log # Archived history file name
archive_period=1 # Move history to archive every 1 day
#------------------------------------------------------------------
# Define Launcher parameters. Use "%set launcher" for default
# parameters and "%set launcher <node-name>" for node-specific
# parameters
%set launcher
stat_port = 3333 # Run-time status information server
# UDP port number
log_ignore = D # do not send debug messages to logd
# If necessary, define different parameter values for different nodes
%set launcher h1
stat_port = 5555
#--------------------------------------------------------------------------
#- File: farm.cfg
#- FBSNG farm configuration file.
#-
#- IMPORTANT NOTICE:
#- =================
#- This file should be edited directly only if absolutely necessary.
#- Use "fbs config" or "fbs monitor" to modify farm configuration.
#- If editing is necessary, preserve header lines starting with "#-"
#- and modification history lines starting with "##".
#-
#- farm.cfg set description:
#- -------------------------
#- %set resources
#-     local = <local resource name> ...
#-     global = <global resource name>:<integer capacity> ...
#-
#- %set resource_pools
#-     <pool name> = <resource name> ...
#-     ...
#-
#- %set node_class <class name>
#-     resources = <local resource name>:<capacity> ... <attribute> ...
#-     local_disks = <resource name>:<root directory> ...
#-
#- %set node_list
#-     <node name> = <node class name>
#-     ...
#-
#- %set proc_type <proc.type name>
#-     proc_rsrc_defaults = <resource name>:<amount> ... <attribute> ...
#-     sect_rsrc_defaults = <global resource name>:<amount> ...
#-     resource_quota = <resource name>:<max. amount> ...
#-     max_prio_inc = <integer max. allowed priority increment>
#-
#- %set queue <queue name>
#-     s_prio_gap = <section priority gap>
#-     proc_type = <default process type> (required field)
#-     q_prio_max = <max. queue priority>
#-     q_prio_gap = <queue priority gap>
#-     s_prio_max = <max. section priority>
#-     q_prio_min = <min. queue priority>
#-     q_prio_inc = <queue priority increment>
#-     q_prio_dec = <queue priority decrement>
#-
#---------------------------------------------------------------------------
## Fri Sep 15 11:38:33 2000: ivm: removed node fnpc111
##
## Fri Sep 15 11:38:33 2000: ivm: removed node fnpc112
##
%set node_list
fnpc104 = Worker
fnpc105 = Worker
fnpc106 = Worker
fnpc107 = Worker
fnpcb = IO
fnpc102 = Worker
fnpc103 = Worker
fnpc108 = Worker
fnpc101 = Worker
%set resources
local = IO cpu disk1 disk2
global = nfs_disk:30
%set proc_type IO
resource_quota = cpu:1000
proc_rsrc_defaults = IO cpu:50
%set proc_type Light
resource_quota = cpu:150
proc_rsrc_defaults = cpu:1
%set proc_type Worker
proc_rsrc_defaults = cpu:100
%set node_class IO
resources = disk2:20 IO cpu:200
local_disks = disk2:/tmp/fbs_scratch
%set node_class Worker
resources = cpu:100 disk1:10
local_disks = disk1:/tmp/fbs_scratch
%set resource_pools
disk = disk1 disk2
%set queue FastQ
q_prio_max = 100
s_prio_max = 100
q_prio_gap = 5
s_prio_gap = 1000
proc_type = Worker
q_prio_inc = 3
q_prio_min = 2
%set queue IOQ
s_prio_gap = 1000
proc_type = IO
q_prio_max = 1000
q_prio_gap = 1000
s_prio_max = 100
%set queue LongQ
s_prio_gap = 10
proc_type = Worker
q_prio_max = 100
q_prio_gap = 10
s_prio_max = 100
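Because the sets in farm.cfg refer to each other by name (each node names a node class, each queue names a process type), an administrator who does edit the file by hand may want to check that every reference resolves. The Python sketch below does this for the sample above. It is an illustration only: the helper find_dangling is hypothetical, the data is transcribed by hand instead of being parsed from the file, and nothing here describes how FBSNG itself validates the configuration.

```python
# Values transcribed from the sample farm.cfg above.
node_list = {
    "fnpc101": "Worker", "fnpc102": "Worker", "fnpc103": "Worker",
    "fnpc104": "Worker", "fnpc105": "Worker", "fnpc106": "Worker",
    "fnpc107": "Worker", "fnpc108": "Worker", "fnpcb": "IO",
}
node_classes = {"IO", "Worker"}             # from "%set node_class <name>"
proc_types = {"IO", "Light", "Worker"}      # from "%set proc_type <name>"
queue_proc_type = {"FastQ": "Worker", "IOQ": "IO", "LongQ": "Worker"}


def find_dangling(references, defined):
    """Return the entries of `references` whose target is not defined."""
    return {name: target for name, target in references.items()
            if target not in defined}


bad_nodes = find_dangling(node_list, node_classes)
bad_queues = find_dangling(queue_proc_type, proc_types)
print("nodes with unknown class:", bad_nodes)
print("queues with unknown process type:", bad_queues)
```

For the sample file both results are empty, since every node class and process type that is referenced is also defined.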