schedo Command

Purpose

Manages processor scheduler tunable parameters.

Syntax

schedo [ -p | -r ] [-y] { -o Tunable[=Newvalue] }

schedo [ -p | -r ] [-y] { -d Tunable }

schedo [ -p | -r ] [-y] -D

schedo [ -p | -r ] [ -F] -a

schedo -h [Tunable ]

schedo [-F] -L [Tunable ]

schedo [-F] -x [Tunable ]

Note: Multiple -o, -d, -x, and -L flags are allowed.

Description

Note: The schedo command can only be executed by root.

Use the schedo command to configure scheduler tuning parameters. This command sets or displays current or next boot values for all scheduler tuning parameters. This command can also make permanent changes or defer changes until the next reboot. Whether the command sets or displays a parameter is determined by the accompanying flag. The -o flag performs both actions. It can either display the value of a parameter or set a new value for a parameter.

Understanding the Effect of Changing Tunable Parameters

Misuse of this command can cause performance degradation or operating-system failure. Be sure that you have studied the appropriate tuning sections in the AIX® Version 7.1 Performance management before using schedo to change system parameters.

Before modifying any tunable parameter, you must first carefully read about all its characteristics in the Tunable Parameters section below, and follow any Refer To pointer, in order to fully understand its purpose.

You must then make sure that the Diagnosis and Tuning sections for this parameter truly apply to your situation and that changing the value of this parameter could help improve the performance of your system.

If the Diagnosis and Tuning sections both contain only "N/A", you must never change this parameter unless specifically directed by AIX development.

Priority-Calculation Parameters

The priority of most user processes varies with the amount of processor time the process has used recently. The processor scheduler's priority calculations are based on two parameters that are set with schedo, sched_R and sched_D. The sched_R and sched_D values are expressed in units of 1/32 (thirty-seconds); that is, the formula used by the scheduler to calculate the amount to be added to a process's priority value as a penalty for recent processor use is:
CPU penalty = (recently used CPU value of the process) * (r/32)
and the once-per-second recalculation of the recently used processor value of each process is:
new recently used CPU value = (old recently used CPU value of the process) * (d/32)

Both r (sched_R parameter) and d (sched_D parameter) have default values of 16. This maintains the processor scheduling behavior of previous versions of the operating system. Before experimenting with these values, you must be familiar with "Tuning the processor scheduler" in the Performance Management Guide.
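
The two formulas above can be sketched as follows. This is a minimal illustration only: the function names are hypothetical, and floating-point arithmetic is an assumption (the text does not say how the scheduler rounds).

```python
def cpu_penalty(recent_cpu, sched_r=16):
    """Penalty added to the priority value: recent CPU value * (r/32)."""
    return recent_cpu * sched_r / 32


def decay_recent_cpu(recent_cpu, sched_d=16):
    """Once-per-second recalculation: old recent CPU value * (d/32)."""
    return recent_cpu * sched_d / 32


if __name__ == "__main__":
    recent = 64  # hypothetical recently used CPU value
    # With the defaults (r = d = 16), half of the value is used in each case.
    print(cpu_penalty(recent))       # 32.0
    print(decay_recent_cpu(recent))  # 32.0
```

With a smaller sched_R, the same recent CPU value produces a smaller penalty; with a smaller sched_D, the recent CPU value decays faster each second.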

Memory-Load-Control Parameters

The operating system scheduler performs memory load control by suspending processes when memory is over committed. The system does not swap out processes; instead pages are stolen as they are needed to fulfill the current memory requirements. Typically, pages are stolen from suspended processes. Memory is considered over committed when the following condition is met:
p * h ≥ s

where:

p is the number of pages written to paging space in the last second
h is an integer specified by the v_repage_hi parameter
s is the number of page steals that have occurred in the last second

A process is suspended when memory is over committed and the following condition is met:
r * p ≥ f

where:

r is the number of repages that the process has accumulated in the last second
p is an integer specified by the v_repage_proc parameter
f is the number of page faults that the process has experienced in the last second

In addition, fixed-priority processes and kernel processes are exempt from being suspended.
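
The two conditions and the exemption above can be sketched as follows. The function and argument names are hypothetical, and the comparison is assumed to be "greater than or equal to"; treat this as an illustration of the rule, not as the kernel's implementation.

```python
def memory_overcommitted(pages_out, v_repage_hi, page_steals):
    """System-wide test: p * h compared with s (assumed to be >=)."""
    return pages_out * v_repage_hi >= page_steals


def should_suspend(repages, v_repage_proc, page_faults, overcommitted,
                   fixed_priority=False, kernel_process=False):
    """Per-process test: r * p compared with f (assumed to be >=)."""
    # Fixed-priority processes and kernel processes are exempt.
    if fixed_priority or kernel_process:
        return False
    return overcommitted and repages * v_repage_proc >= page_faults


if __name__ == "__main__":
    over = memory_overcommitted(pages_out=10, v_repage_hi=6, page_steals=50)
    print(over)
    print(should_suspend(repages=3, v_repage_proc=4, page_faults=10,
                         overcommitted=over))
```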

The term repages refers to the number of pages belonging to the process, which were reclaimed and are soon after referenced again by the process.

The user also can specify a minimum multiprogramming level with the v_min_process parameter. Doing so ensures that a minimum number of processes remain active throughout the process-suspension period. Active processes are those that are runnable and waiting for page I/O. Processes that are waiting for events and processes that are suspended are not considered active, nor is the wait process considered active.

Suspended processes can be added back into the mix when the system has stayed below the over committed threshold for n seconds, where n is specified by the v_sec_wait parameter. Processes are added back into the system based, first, on their priority and, second, on the length of their suspension period.

Before experimenting with these values, you must be thoroughly familiar with "VMM memory load control tuning with the schedo command" in the Performance Management Guide.

Time-Slice-Increment Parameter

The schedo command can also be used to change the amount of time the operating system allows a given process to run before the dispatcher is called to choose another process to run (the time slice). The default value for this interval is a single clock tick (10 milliseconds). The timeslice tuning parameter allows the user to specify the number of clock ticks by which the time slice length is to be increased.

In AIX Version 4, this parameter only applies to threads with the SCHED_RR scheduling policy. See Scheduling Policy for Threads.

fork() Retry Interval Parameter

If a fork() subroutine call fails because there is not enough paging space available to create a new process, the system retries the call after waiting for a specified period of time. That interval is set with the pacefork tuning parameter.

Special Terminology for Symmetric Multithreading

Multiple run queues are supported. Under this scheme each processor has its own run queue. POWER5 processors support symmetric multithreading, where each physical processor has two execution engines, called hardware threads. Each hardware thread is essentially equivalent to a single processor. Symmetric multithreading is enabled by default, but it can be disabled (or re-enabled) dynamically. When symmetric multithreading is enabled, each hardware thread services a separate run queue. For example, on a 4-way system when symmetric multithreading is disabled or not present, there are 4 run queues in addition to the global run queue. When symmetric multithreading is enabled, there are 8 run queues in addition to the global run queue.

The hardware threads belonging to the same physical processor are referred to as sibling threads. A primary sibling thread is the first hardware thread of the physical processor. A secondary sibling thread is the second hardware thread of the physical processor.

Virtual Processor Management

More virtual processors can be defined than are needed to handle the work in a partition. The overhead of dispatching virtual processors can be reduced by using fewer virtual processors without a decrease in overall processor usage or a lack of virtual processors. Virtual processors are not dynamically removed from the partition, but instead are not used and are used again only when additional work is available. Each virtual processor uses a maximum of one physical processor. The number of virtual processors needed is determined by rounding up the sum of the physical processor utilization and the vpm_xvcpus tunable:
number = ceiling( p_util + vpm_xvcpus)
Where number is the number of virtual processors that are needed, p_util is the physical processor utilization, and vpm_xvcpus is a tunable that specifies the number of additional virtual processors to enable. If number is less than the number of currently enabled virtual processors, a virtual processor will be disabled. If number is greater than the number of currently enabled virtual processors, a disabled virtual processor will be enabled. Threads that are attached to a disabled virtual processor are still allowed to run on the disabled virtual processor.
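
The calculation above can be sketched as follows. The function names are hypothetical, and p_util is assumed to be expressed in physical processors (for example, 2.3 means 2.3 processors' worth of utilization).

```python
import math


def virtual_processors_needed(p_util, vpm_xvcpus):
    """number = ceiling(p_util + vpm_xvcpus)."""
    return math.ceil(p_util + vpm_xvcpus)


def folding_action(needed, enabled):
    """Compare the needed count with the currently enabled count."""
    if needed < enabled:
        return "disable one"
    if needed > enabled:
        return "enable one"
    return "no change"


if __name__ == "__main__":
    needed = virtual_processors_needed(p_util=2.3, vpm_xvcpus=1)
    print(needed)                               # 4
    print(folding_action(needed, enabled=6))    # disable one
```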

Node Load

The node load, or simply load, is the average run queue depth across all run queues, including the global run queue, multiplied by 256, and is strongly smoothed over time. For example, a load of 256 means that if we have 16 processors (including symmetric multithreading processors), then we have had approximately 16 runnable jobs in the system for the last few milliseconds.

Flags

Item Description
-a Displays the current, reboot (when used in conjunction with -r) or permanent (when used in conjunction with -p) value for all tunable parameters, one per line in pairs Tunable = Value. For the permanent option, a value is only displayed for a parameter if its reboot and current values are equal. Otherwise NONE displays as the value.
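-d Tunable Resets Tunable to its default value. If a tunable needs to be changed (that is, it is currently not set to its default value) and is of type Bosboot or Reboot, or if it is of type Incremental and has been changed from its default value, and -r is not used in combination, it will not be changed but a warning is displayed.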
-D Resets all tunables to their default value. If tunables needing to be changed are of type Bosboot or Reboot, or are of type Incremental and have been changed from their default value, and -r is not used in combination, they will not be changed but a warning displays.
-F Forces the display of restricted tunable parameters when you specify the -a, -L or -x options on the command line, to list all of the tunables. If you do not specify the -F flag, restricted tunables are not included, unless they are specifically named in association with a display option.
-h [Tunable] Displays help about the Tunable parameter if one is specified. Otherwise, displays the schedo command usage statement.
-L [ Tunable ] Lists the characteristics of one or all tunables, one per line, using the following format:
NAME                      CUR    DEF    BOOT   MIN    MAX    UNIT           TYPE  
     DEPENDENCIES  
--------------------------------------------------------------------------------  
v_repage_hi               0      0      0      0      2047M                    D
--------------------------------------------------------------------------------
v_repage_proc             4      4      4      0      2047M                    D
--------------------------------------------------------------------------------
v_sec_wait                1      1      1      0      2047M  seconds           D
--------------------------------------------------------------------------------
...  
where:  
    CUR = current value  
    DEF = default value  
    BOOT = reboot value  
    MIN = minimal value  
    MAX = maximum value  
    UNIT = tunable unit of measure  
    TYPE = parameter type: D (for Dynamic), S (for Static), R (for Reboot),
           B (for Bosboot), M (for Mount), I (for Incremental), C (for Connect), and d (for Deprecated)  
    DEPENDENCIES = list of dependent tunable parameters, one per line
-o Tunable [=Newvalue] Displays the value or sets Tunable to Newvalue. If a tunable needs to be changed (the specified value is different from the current value), and is of type Bosboot or Reboot, or if it is of type Incremental and its current value is greater than the specified value, and -r is not used in combination, it will not be changed but a warning displays.

When -r is used in combination without a new value, the nextboot value for tunable is displayed. When -p is used in combination without a new value, a value displays only if the current and next boot values for tunable are the same. Otherwise NONE displays as the value.

-p Makes changes apply to both current and reboot values, when used in combination with -o, -d or -D, that is, turns on the updating of the /etc/tunables/nextboot file in addition to the updating of the current value. These combinations cannot be used on Reboot and Bosboot type parameters because their current value can't be changed.

When used with -a or -o without specifying a new value, values are displayed only if the current and next boot values for a parameter are the same. Otherwise NONE displays as the value.

-r Makes changes apply to reboot values when used in combination with -o, -d or -D, that is, turns on the updating of the /etc/tunables/nextboot file. If any parameter of type Bosboot is changed, the user will be prompted to run bosboot.

When used with -a or -o without specifying a new value, next boot values for tunables display instead of current values.

-x [Tunable] Lists characteristics of one or all tunables, one per line, using the following (spreadsheet) format:
tunable,current,default,reboot,min,max,unit,type,{dtunable }  

where:  
    current = current value  
    default = default value  
    reboot = reboot value  
    min = minimal value  
    max = maximum value  
    unit = tunable unit of measure  
    type = parameter type: D (for Dynamic), S (for Static), R (for Reboot),  
               B (for Bosboot), M (for Mount), I (for Incremental), 
               C (for Connect), and d (for Deprecated)  
    dtunable = space separated list of dependent tunable parameters
-y Suppresses the confirmation prompt before executing the bosboot command.
Note: Options -o, -d, and -D are not supported within a workload partition because they attempt to change the value of a scheduler tunable parameter.
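
The spreadsheet format produced by the -x flag can be split into named fields as sketched below. The sample line is hypothetical (built from the v_sec_wait row shown for -L), and the presence of a trailing empty dependency field is an assumption; check real output before relying on it.

```python
FIELDS = ["tunable", "current", "default", "reboot",
          "min", "max", "unit", "type"]


def parse_x_line(line):
    """Split one 'schedo -x' style line into a dictionary of fields."""
    parts = line.rstrip("\n").split(",")
    record = dict(zip(FIELDS, parts[:len(FIELDS)]))
    # The last field is a space-separated list of dependent tunables.
    deps = parts[len(FIELDS)] if len(parts) > len(FIELDS) else ""
    record["dependencies"] = deps.split()
    return record


if __name__ == "__main__":
    sample = "v_sec_wait,1,1,1,0,2047M,seconds,D,"  # hypothetical line
    print(parse_x_line(sample))
```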

If you make any change (with the -o, -d, or -D options) to a restricted tunable parameter, a warning message is displayed stating that a tunable parameter of the restricted-use type has been modified. If you also specified the -r or -p options on the command line, you will be prompted to confirm the change. In addition, at system reboot, restricted tunables that are listed in the /etc/tunables/nextboot file and that were modified to values different from their default values (using a command line specifying the -r or -p options) cause an error log entry that identifies the list of these modified tunables.

When modifying a tunable, you can specify the tunable value using abbreviations such as K, M, G, T, P, and E to indicate units. The following table shows the values that are associated with the number abbreviations:
Abbreviation Power of 2
K 1024
M 1 048 576
G 1 073 741 824
T 1 099 511 627 776
P 1 125 899 906 842 624
E 1 152 921 504 606 846 976
Thus, a tunable value of 1024 might be specified as 1K.
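
The abbreviation table above can be sketched as a small conversion helper. The function name is hypothetical, and only the uppercase abbreviations listed in the table are handled; whether schedo accepts other spellings is not stated here.

```python
MULTIPLIERS = {
    "K": 2 ** 10,
    "M": 2 ** 20,
    "G": 2 ** 30,
    "T": 2 ** 40,
    "P": 2 ** 50,
    "E": 2 ** 60,
}


def expand_value(text):
    """Convert a value such as '1K' or '2047M' to a plain integer."""
    text = text.strip()
    suffix = text[-1:]
    if suffix in MULTIPLIERS:
        return int(text[:-1]) * MULTIPLIERS[suffix]
    return int(text)


if __name__ == "__main__":
    print(expand_value("1K"))     # 1024
    print(expand_value("2047M"))  # 2146435072
```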

Any change (with -o, -d or -D) to a parameter of type Mount results in a message displaying to warn you that the change is only effective for future mountings.

Any change (with -o, -d or -D flags) to a parameter of type Connect will result in inetd being restarted, and in a message being displayed to warn the user that the change is only effective for future socket connections.

Any attempt to change (with -o, -d or -D) a parameter of type Bosboot or Reboot without -r, results in an error message.

Any attempt to change (with -o, -d or -D but without -r) the current value of a parameter of type Incremental with a new value smaller than the current value results in an error message.
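
The rules above for rejecting a change can be sketched as follows. The function name is hypothetical, only the cases described in this section are modeled, and pre 5.2 compatibility mode is not taken into account.

```python
def change_allowed(tunable_type, current, new, with_r=False):
    """Return True if a change with -o, -d or -D would be accepted."""
    if tunable_type == "Static":
        # Static parameters can never be changed.
        return False
    if with_r:
        # With -r the change applies to the reboot value only.
        return True
    if tunable_type in ("Bosboot", "Reboot"):
        # The current value of these types cannot be changed.
        return False
    if tunable_type == "Incremental" and new < current:
        # Incremental parameters can only be incremented.
        return False
    return True


if __name__ == "__main__":
    print(change_allowed("Dynamic", 1, 4))
    print(change_allowed("Reboot", 1, 4))
    print(change_allowed("Reboot", 1, 4, with_r=True))
```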

Tunable Parameters Type

All the tunable parameters manipulated by the tuning commands (no, nfso, vmo, ioo, raso, and schedo) have been classified into these categories:
Item Description
Dynamic If the parameter can be changed at any time
Static If the parameter can never be changed
Reboot If the parameter can only be changed during reboot
Bosboot If the parameter can only be changed by running bosboot and rebooting the machine
Mount If changes to the parameter are only effective for future file systems or directory mounts
Incremental If the parameter can only be incremented, except at boot time
Connect If changes to the parameter are only effective for future socket connections
Deprecated If changing this parameter is no longer supported by the current release of AIX.
For parameters of type Bosboot, whenever a change is performed, the tuning commands automatically prompt the user to ask if they want to execute the bosboot command. For parameters of type Connect, the tuning commands automatically restart the inetd daemon.

Note that the current set of parameters managed by the schedo command only includes Dynamic and Reboot types.

Compatibility Mode

When running in pre 5.2 compatibility mode (controlled by the pre520tune attribute of sys0, see AIX 5.2 compatibility mode in the AIX Version 7.1 Performance management), reboot values for parameters, except those of type Bosboot, are not really meaningful because in this mode they are not applied at boot time.

In pre 5.2 compatibility mode, setting reboot values to tuning parameters continues to be achieved by embedding calls to tuning commands in scripts called during the boot sequence. Parameters of type Reboot can therefore be set without the -r flag, so that existing scripts continue to work.

This mode is automatically turned ON when a machine is MIGRATED to AIX 5.2. For complete installations, it is turned OFF and the reboot values for parameters are set by applying the content of the /etc/tunables/nextboot file during the reboot sequence. Only in that mode are the -r and -p flags fully functional. See Kernel Tuning in the AIX Version 7.1 Performance Tools Guide and Reference for more information.

Tunable Parameters

For default values and range of values for tunables, refer to the schedo command help (-h <tunable_parameter_name>).
Item Description
affinity_lim
Purpose:
Sets the number of intervening dispatches after which the SCHED_FIFO2 policy no longer favors a thread.
Tuning:
Once a thread is running with SCHED_FIFO2 policy, tuning of this variable may or may not have an effect on the performance of the thread and workload. Ideal values must be determined by trial and error.
big_tick_size
Purpose:
Sets the physical tick interval and synchronizes ticks across CPUs.
Tuning:
The tick interval is the big_tick_size value multiplied by 10 ms, and the value must divide evenly into 100. Use of this parameter will make system statistics less accurate.
ded_cpu_donate_thresh
Purpose:
Specifies the utilization threshold for donation of a dedicated processor.
Tuning:
In a dedicated processor partition that is enabled for donation, idle processor capacity can be donated to the shared processor pool for use by shared processor partitions. If a dedicated processor's utilization is less than this threshold, the dedicated processor will be donated for use by other partitions when the processor is idle. If a dedicated processor's utilization is equal to or greater than this threshold, the dedicated processor will not be donated for use by other partitions when the dedicated processor is idle.
fixed_pri_global
Purpose:
Keep fixed priority threads on global run queue.
Tuning:
If 1, then fixed priority threads are placed on the global run queue.
force_grq
Purpose:
Keep non-MPI threads on the global run queue.
Tuning:
If 1, only MPI and bound threads will use local run queues.
maxspin
Purpose:
Sets the number of times to spin on a kernel lock before going to sleep.
Tuning:
Increasing the value on MP systems may reduce idle time; however, it might also waste CPU time in some situations. Increasing it on uniprocessor systems is not recommended.
pacefork
Purpose:
The number of clock ticks to wait before retrying a fork call that has failed for lack of paging space.
Tuning:
Use when the system is running out of paging space and a process cannot be forked. The system will retry a failed fork five times. For example, if a fork() subroutine call fails because there is not enough paging space available to create a new process, the system retries the call after waiting the specified number of clock ticks.
proc_disk_stats
Purpose:
A value of 1 enables and a value of 0 disables the process scope disk statistics. The default value is 1 and ranges from 0 to 1.
Tuning:
Disabling process scope disk statistics improves performance when the statistics are not wanted.
sched_D
Purpose:
Sets the short term CPU usage decay rate.
Tuning:
The default is to decay short-term CPU usage by 1/2 (16/32) every second. Decreasing this value enables foreground processes to avoid competition with background processes for a longer time.
sched_R
Purpose:
Sets the weighting factor for short-term CPU usage in priority calculations.
Tuning:
Run the command ps al. If you find that the PRI column has priority values for the foreground processes (those with NI values of 20) that are higher than the PRI values of some background processes (NI values > 20), you can reduce the r value. The default is to include 1/2 (16/32) of the short term CPU usage in the priority calculation. Decreasing this value makes it easier for foreground processes to compete.
tb_balance_S0
Purpose:
Controls SMT-cores busy balancing.
Tuning:
A value of 0 indicates that the balancing is disabled. A value of 1 indicates that the balancing is enabled only within MCMs (S2 groups). A value of 2 indicates fully enabled.
tb_balance_S1
Purpose:
Controls processor busy balancing.
Tuning:
A value of 0 indicates that the balancing is disabled. A value of 1 indicates that the balancing is enabled only within MCMs (S2 groups). A value of 2 indicates fully enabled.
tb_threshold
Purpose:
Number of ticks to consider a thread busy for the purposes of optimization for thread_busy load balancing.
Tuning:
A value of 100 corresponds to 1 second. The values 10 and 1000 correspond to 0.1 and 10 seconds, respectively.
timeslice
Purpose:
The number of clock ticks a thread can run before it is put back on the run queue.
Tuning:
Increasing the timeslice value can reduce overhead of dispatching threads. The value refers to the total number of clock ticks in a timeslice and only affects fixed-priority processes.
vpm_fold_policy
Purpose:
Controls the application of the virtual processor management feature of processor folding in a partition.
Tuning:
The virtual processor management feature of processor folding can be enabled or disabled based on whether a partition has shared or dedicated processors. In addition, when the partition is in static power saving mode, processor folding is automatically enabled for both shared and dedicated processor partitions.

When processor folding is enabled, the vpm_xvcpus tunable can be used to control processor folding.

There are 3 bits in vpm_fold_policy to control processor folding:
  • Bit 0 (0x1): When set to 1, this bit indicates processor folding is enabled if the partition is using shared processors.
  • Bit 1 (0x2): When set to 1, this bit indicates processor folding is enabled if the partition is using dedicated processors.
  • Bit 2 (0x4): When set to 1, this bit disables the automatic setting of processor folding when the partition is in static power saving mode.

You can perform an OR operation on the Bit 0, Bit 1, and Bit 2 values to form the desired value.

vpm_throughput_core_threshold
Purpose:
Specifies the number of cores that must be unfolded before the vpm_throughput_mode parameter is honored. Until then, the system behaves as if the vpm_throughput_mode parameter were set to 1.
vpm_throughput_mode
Purpose:
Specifies the desired level of SMT exploitation for scaled throughput mode.
Tuning:
A value of 0 gives default behavior (raw throughput mode). A value of 1, 2, or 4 selects the scaled throughput mode and the desired level of SMT exploitation.

vpm_xvcpus
Purpose:
Setting this tunable to a value greater than -1 will enable the scheduler to enable and disable virtual processors based on the partition's CPU utilization.
Tuning:
The value specified signifies the number of virtual processors to enable in addition to the virtual processors required to satisfy the workload.
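
The three vpm_fold_policy bits described in the list above can be combined and decoded as sketched below. The constant, function, and key names are hypothetical; only the bit values (0x1, 0x2, 0x4) come from the parameter description.

```python
FOLD_IF_SHARED = 0x1          # Bit 0: fold when using shared processors
FOLD_IF_DEDICATED = 0x2       # Bit 1: fold when using dedicated processors
NO_AUTO_FOLD_POWER_SAVE = 0x4  # Bit 2: disable automatic folding in
                               # static power saving mode


def describe_fold_policy(value):
    """Decode a vpm_fold_policy value into its three flags."""
    return {
        "fold_if_shared": bool(value & FOLD_IF_SHARED),
        "fold_if_dedicated": bool(value & FOLD_IF_DEDICATED),
        "no_auto_fold_in_power_save": bool(value & NO_AUTO_FOLD_POWER_SAVE),
    }


if __name__ == "__main__":
    # OR the bit values together to form the desired setting.
    policy = FOLD_IF_SHARED | NO_AUTO_FOLD_POWER_SAVE
    print(policy)                        # 5
    print(describe_fold_policy(policy))
```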

Examples

  1. To list the current and reboot value, range, unit, type and dependencies of all tunable parameters managed by the schedo command, enter:
    schedo -L  
  2. To list (spreadsheet format) the current and reboot value, range, unit, type, and dependencies of all tunable parameters managed by the schedo command, enter:
    schedo -x
  3. To reset v_sec_wait to default, enter:
    schedo -d v_sec_wait 
  4. To display help on sched_R, enter:
    schedo -h sched_R
  5. To set v_min_process to 4 after the next reboot, enter:
    schedo -r -o v_min_process=4 
  6. To permanently reset all schedo tunable parameters to default, enter:
    schedo -p -D 
  7. To list the reboot value for all schedo parameters, enter:
    schedo -r -a