Frequently Asked Questions

Accounts: How do I change my ACCRE account password?

Please follow these steps:

  1. Log on to the cluster (vmplogin.accre.vanderbilt.edu) using your existing password.
  2. At this point you are simply logged on to a cluster gateway. Any changes to your account will need to be propagated to all of the compute nodes.

  3. Issue the command:

    ssh vmpsched

    You will be prompted for your account password (the same one used to initially log on to the cluster).

    Commands issued now (e.g., the passwd command below) will take effect on all of the nodes.

  4. To change your password, type the command:

    passwd

    The passwd program will prompt you to enter a new password. Please choose a password that is not a dictionary word (e.g., a nonstandard word or a combination of letters and numbers).

    After you have changed your password, please allow approximately one hour for the change to be propagated to all of the nodes.

  5. To disconnect from vmpsched, type:

    exit

    You will still be logged into your account. You may either continue working on the cluster or log out.

See also our tutorials on getting started on the cluster.



Accounts: I've forgotten my password; what is the procedure to reset it?

Notify us by submitting a Request Tracker ticket.

The script we run to reset your password propagates the change out to the cluster and sends you an e-mail with your new password. This normally takes a few minutes, but we ask that you wait at least 15 minutes before logging on. As soon as you receive the e-mail with your new password, please follow the procedure above to change it to something of your choosing.



Connectivity: I cannot connect to the cluster, am experiencing intermittent connectivity to the cluster, or the system hangs upon log on. What should I do?

If you are normally able to connect and suddenly cannot, let us know via Request Tracker. Please provide as much information about the issue as possible, including any relevant output displayed on your screen.

Besides occasional network problems, there are a number of possible causes for sluggish to zero connectivity. Please read the following to help self-diagnose before submitting a help desk ticket so we are better able to assist you:

  • You may connect to the cluster only via a Secure Shell (SSH) client. For more information go to Logging on to the Cluster.

  • If you see the following error when trying to connect:

    ssh: connect to host vmplogin.accre.vanderbilt.edu port 22: Operation timed out

    it means the gateway you're trying to connect to is unreachable.

    The cluster has 20 x86 gateway machines (vmps01 - vmps20) and 12 PowerPC gateway machines (b1s - b12s). There are several reasons why we have multiple gateways of each architecture. For example, this distributes the user load across many login gateways. Another main reason is to protect against an unreachable gateway preventing a connection to the cluster.

    However, even with this backup system in place it is still possible when you ssh to either ppclogin.accre.vanderbilt.edu or vmplogin.accre.vanderbilt.edu to get an error similar to the above.

    vmplogin and ppclogin are only aliases that use round-robin DNS to select one of the actual gateway machines to connect to. If you get the above error, what has likely happened is that either the local DNS cache on your system or the DNS server you use has cached an alias to a gateway which is now unreachable for some reason. If this occurs, you should simply select one of the actual gateways at random and attempt to ssh directly to it. For example, ssh vmps13.accre.vanderbilt.edu or ssh b10s.accre.vanderbilt.edu.

  • If you can connect but the connection "hangs" before you receive a command-line prompt, your problem may be related to an error in one of your login files (e. g., .bashrc in your home directory). We can help diagnose this; please send us a message via Request Tracker.

  • If you can connect but your login "hangs", it is possible we are experiencing a problem with GPFS (the General Parallel File System designed by IBM). Other symptoms of this include logging on but not being able to see, for example, your home directory. Sometimes the file system problem is temporary, lasting only a moment. Larger file system problems normally occur when the system is overloaded, which can happen for various reasons. If the problem is found to be caused by a particular user account or set of jobs, we immediately work with the user to resolve it.

    Sometimes the issue is not related to user software or the way a user is submitting jobs and we have to work with IBM to determine the root of the file system problem. When we expect the issue cannot be resolved quickly we notify all users to expect intermittent cluster access.

    In most cases when the system is in this state you should still be able to accomplish your work, although you may find the system is occasionally sluggish and intermittently nonresponsive. Please be patient. You will find that upon repeated attempts you will be able to log on and submit jobs. If these jobs are not heavily dependent on disk I/O they should continue running.

In any case, so we can immediately begin resolving the issue, please notify us ASAP of any connectivity problems by submitting a Request Tracker ticket. Include details such as a "cut and paste" of the information in your login window if you are able.
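
The advice above about bypassing a stale DNS alias can be sketched as a small shell function. This is only an illustration under stated assumptions: the function name pick_gateway is hypothetical, and the host ranges (vmps01 to vmps20 and b1s to b12s) are taken from the description above.

```shell
# Print the name of one gateway chosen at random.
# "pick_gateway ppc" selects a PowerPC gateway (b1s .. b12s);
# any other argument selects an x86 gateway (vmps01 .. vmps20).
# pick_gateway is a hypothetical helper, not an ACCRE-provided command.
pick_gateway() {
    case "$1" in
        ppc) awk 'BEGIN { srand(); printf "b%ds.accre.vanderbilt.edu\n", int(rand() * 12) + 1 }' ;;
        *)   awk 'BEGIN { srand(); printf "vmps%02d.accre.vanderbilt.edu\n", int(rand() * 20) + 1 }' ;;
    esac
}
```

For example, ssh "$(pick_gateway x86)" would attempt a direct connection to one randomly chosen x86 gateway.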



Connectivity: How can I make a scheduled downtime work for me?

As a scheduled downtime for the cluster approaches, more time becomes available for shorter jobs. If you have applications that take a few days or less to run, you will be able to run more of these jobs as the downtime approaches, because applications requiring longer periods of time will not be running. It's an excellent time to take advantage of extra computing cycles that would ordinarily not be available!



Environment: How do I display graphics from the cluster to my local machine?

You need two things: (N. b., you should first check with your P. I. before installing the following software, since one or both of these may already be on your system, especially if you're using a computer in your lab which is already configured to run on the cluster)

  1. Get X server support on your local machine:
    The graphics environment on the cluster is X11; therefore, you must install and run an X server on your local machine.

  2. Configure SSH tunneling:
    You must tell SSH on your local machine to allow the display of graphics from software running on the cluster.

Windows users: ITS provides a free X server called Reflection X. This software has OpenSSH integrated within, so you do not require a separate SSH client.

  • Follow these instructions to install the X11 server or ask ITS for more help:
    • You can only access this software repository from a machine logged on from the campus network (the full list of available software can be found here).
    • Scroll down to Reflection Software Products for instructions to Map a Network Drive, which are repeated in more detail here:
      1. On your machine open My Computer, and under Tools select Map Network Drive. Or you may instead right-click on My Computer and select Map Network Drive.
      2. Select any unused letter for the temporary Drive name. Z: is fine.
      3. For Folder enter \\vusoftware\e-library.
      4. Unselect reconnect at logon.
      5. Click Finish.
      6. You will be prompted for your username and password.
      7. Your username is VANDERBILT\ followed by your VUnetID, i. e., VANDERBILT\yourVUnetID.
      8. Your password is your VUnetID password.
    • Or connect through Internet Explorer:
      1. In the URL window of Internet Explorer type: \\vusoftware\e-library.
      2. You will be prompted for your username and password.
      3. Your username is VANDERBILT\ followed by your VUnetID, i. e., VANDERBILT\yourVUnetID.
      4. Your password is your VUnetID password.
    • You will be presented with a long list of directories.
    • Click on the WRQ-Relection12 folder.
    • Copy (drag) the ReflectionX folder onto your Desktop.
    • Be patient; the download takes several minutes.
    • Open the ReflectionX folder on your desktop and run the install / WRQ Reflection Install Engine.
    • Select Install Reflection.
    • Select the first option, Reflection X.
    • Select Workstation Install.
    • Follow the Installation Wizard instructions:
      • You may simply skip the Customer Information page without entering anything.
      • Selecting Typical Installation is suitable for the current application.
    • Once installed we suggest you read the Reflection X manual available at Attachmate Support.
    • However, the following is one set-up option to run an SSH connection to the cluster from a local xterm:
      1. Launch the server for the first time. You may find it through your start menu or in Programs under WRQ Reflection → Reflection X. You might also find it convenient to create a Desktop shortcut. While the server is running, a taskbar button is visible***.
      2. Unblock the program if you're running a firewall (the firewall software should inform you).
      3. Choose to run the Performance Tuner and under First time connection options select Go directly to the WRQ Reflection X Manager.
      4. The Performance Tuner takes a few minutes and instructs you to not use your computer while it's running.
      5. When this completes, the Reflection X Manager launches, and you can configure and save a connection to the cluster.
      6. Select Client Templates → Client Startup → linux.
      7. To configure and save a Client File, fill in what you wish from the Client connection settings: give a unique Description: (something like vmplogin.accre will do) and select OPENSSH under Method:.
      8. Enter the connection Host name: vmplogin.accre.vanderbilt.edu.
      9. Enter your cluster User name.
      10. Let it use the default Command: xterm, which you may configure as you like.
      11. Save this configuration, i. e., Save Client File with the name vmplogin.accre, and it will now appear under Client Files → Client Startup.
      12. You may now also right click on it to make start menu and/or Desktop short cuts.
      13. Double-click the icon to begin the xterm running the ssh connection to the cluster. If you didn't fill in all the connection settings, you'll be prompted for most of them, but you must have selected OPENSSH.
      14. If ssh asks whether you are sure you want to connect, say Yes.
      15. When prompted enter your cluster password.
      16. You should now be logged on.
      17. Finally, see below for how to quickly check you can display remote graphics locally.
      18. ***The X server continues to run even after you quit all X clients, e. g., xterm sessions, just like your Desktop keeps running after quitting other applications, e. g., Firefox, etc. One way to cleanly exit the X server is to right-click on the taskbar button → Exit. In addition, after the initial setup you don't have to start the Manager every time. The shortcut you created starts the server first, then launches your connection and xterm.
  • If you want to use a different SSH client, launch the X server without an SSH connection.
  • Once an X server is running, proceed to configuring your SSH.
  • If you are using the SSH downloaded from ITS do the following:
    1. Click Edit → Settings (this brings up the connections settings window).
    2. On the left, select Profile Settings → Tunneling.
    3. At the bottom of this window, select Tunnel X11 connections.
    4. Click OK.
    5. From the main screen File → Save Settings.
    6. Exit SSH and relaunch.
    7. Finally, see below for how to quickly check you can display remote graphics locally.
  • Another popular SSH client for Windows is PuTTY.

N. b., as with any 3rd-party software, you need to be diligent about keeping up with software upgrades which fix bugs and security holes, since not all of these programs have automatic update reminders.

Mac OS X users: You can get a free X11 server from Apple. Mac OS X should already have SSH installed.

  1. Follow their directions to install and run the X11 server.
  2. Launch the X11 server.
  3. Run an xterm.
  4. When you log on to the cluster from the command line in the xterm, to activate SSH tunneling you can use the -X option, i. e., ssh -X user@vmplogin.accre.vanderbilt.edu.
  5. Finally, see below for how to quickly check you can display remote graphics locally.

Linux users: We assume you are already running an X server and have SSH installed.

  1. When you log on to the cluster, ssh -X will activate SSH tunneling, i. e., ssh -X user@vmplogin.accre.vanderbilt.edu.
  2. See below for how to quickly check you can display remote graphics locally.

To quickly check your X server is running and the SSH tunneling is enabled:

  1. Log on to the cluster.
  2. Type echo $DISPLAY
    • This should return something like localhost:13.0.
    • If so, X11 is being tunneled properly.
  3. As a final test, run xterm on the cluster and a terminal window will be displayed on your local machine (it may take a minute to send the graphics across the network).
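
The DISPLAY check described above can be sketched as a small shell function. This is a minimal illustration, not an ACCRE-provided tool: the function name x11_tunnel_status and the three labels it prints are hypothetical. It assumes, as in the example above, that a tunneled session has a DISPLAY of the form localhost:N.

```shell
# Classify a DISPLAY value; with no argument, inspect the current $DISPLAY.
# x11_tunnel_status is a hypothetical helper name.
x11_tunnel_status() {
    display="${1-${DISPLAY-}}"
    case "$display" in
        localhost:*) echo "tunneled" ;;   # e.g. localhost:13.0
        "")          echo "unset" ;;      # DISPLAY is empty or not set
        *)           echo "other" ;;      # set, but not in the localhost:N form
    esac
}
```

Running x11_tunnel_status with no argument after logging on reports on the current session; "unset" suggests that X11 tunneling was not enabled for that connection.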



Environment: I am running an X server, how do I fix X connection or .Xauthority file errors?

If you are getting error messages similar to these:

"/usr/X11R6/bin/xauth: error in locking authority file /home/user/.Xauthority"
"X11 connection rejected because of wrong authentication. X connection to localhost:11.0 broken (explicit kill or server shutdown)"

try removing the .Xauthority file in your home directory, then log out and back in. This file occasionally becomes corrupted. When you log back in and start X, it will recreate your .Xauthority file. Sometimes you have to do this a few times. If you continue to have problems, please submit a Request Tracker ticket.
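
As a cautious variation on the removal step above, the file can be renamed instead of deleted so that it can be restored if needed. This is only a sketch: the function name reset_xauthority and the .bak suffix are illustrative choices, not part of the procedure above.

```shell
# Move a possibly corrupted .Xauthority file out of the way.
# With no argument the current user's home directory is used.
# reset_xauthority and the .bak suffix are hypothetical choices.
reset_xauthority() {
    home_dir="${1:-$HOME}"
    if [ -f "$home_dir/.Xauthority" ]; then
        mv "$home_dir/.Xauthority" "$home_dir/.Xauthority.bak"
    fi
}
```

After running reset_xauthority, log out and back in as described above so that a fresh .Xauthority file is created.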



Jobs: What types of nodes are available?

The ACCRE Linux cluster comprises 42 dual dual-core x86 1.8 GHz Opteron nodes, 378 dual-processor x86 (2.0 GHz Xeon and Opteron) nodes, and 334 dual 2.2 GHz IBM PowerPC 970 J20 blades. The compute processors number just over 1500 and the compute capacity is roughly 6 TFLOPS. For more details, refer to the table in the description of the compute cluster.



Jobs: What are the ACCRE cluster defined attributes I can use in my PBS scripts corresponding to the available node properties?

The properties of our compute nodes can be specified with combinations of the following attributes (defined by us):

   ppc64, nomyrinet
   ppc64, myrinet
   x86, nomyrinet
   x86, nomyrinet, imagic
   x86, nomyrinet, imagic, bigmem
   x86, opteron, nomyrinet
   x86, opteron, nomyrinet, bigmem
   x86, opteron, nomyrinet, dualdual
   x86, opteron, myrinet
   x86, opteron, myrinet, twogig

E. g., in your PBS script you could specify:

   #PBS -l nodes=4:x86:myrinet

Find more examples in the FAQ entries that follow.
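
To show how the attributes combine into a node request, here is a minimal sketch of a shell function that joins a node count and any number of attributes with colons. The function name pbs_nodes_line is hypothetical; it only builds the text of the line and does not check that the combination exists on the cluster.

```shell
# Build a "#PBS -l nodes=..." line from a node count followed by attributes.
# pbs_nodes_line is a hypothetical helper; it does not validate the attributes.
pbs_nodes_line() {
    count="$1"
    shift
    spec="nodes=$count"
    for attr in "$@"; do
        spec="$spec:$attr"
    done
    echo "#PBS -l $spec"
}
```

For example, pbs_nodes_line 4 x86 myrinet prints the same line as the example above.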



Jobs: Can I run on the gateway machines?

When you log on via vmplogin.accre.vanderbilt.edu, you are logged onto a gateway machine. From here you submit your jobs which are sent to the compute nodes by the scheduler. However, we do allow you to run very short, <15 minute, test jobs on the gateway machines, as long as such jobs do not slow the gateway for other users. Anything longer than this should be submitted to the compute nodes using "qsub" (see How to Submit Basic Jobs).



Jobs: How do I make sure my application is using the type of processor required?

The "#PBS -l" statement is used not only to specify the number of nodes/processors, but also the type of processor that you require. We have a mix of two types of nodes within two architectures ('x86':Opterons; 'ppc64': PowerPC Blades).

Please note: To run on the 32-bit Intel architecture, your code will need to be compiled in that environment; to run on the 64-bit PowerPC architecture, your code will need to be compiled in that environment. We strongly encourage anyone interested in compiling their code for the PowerPCs to schedule an appointment with one of our technical staff, preferably during office hours from 4 to 5 p.m. To schedule an appointment, please use our online support request form.

Each of the examples below is requesting 4 nodes (2 CPUs each) for a total of 8 CPUs. The examples differ in the type of CPU that may actually be used to process your job.

If your job will run on x86 (Opteron) processors, the following statement will suffice:

   #PBS -l nodes=4:ppn=2:x86

This is the same as:

   #PBS -l nodes=4:ppn=2:opteron

If your job requires PowerPCs, use the following statement:

   #PBS -l nodes=4:ppn=2:ppc64

Of the x86 processors, half of the Opteron processors have Myrinet. Therefore, if your code is compiled for the x86 architecture and requires Myrinet, use the following statement:

   #PBS -l nodes=4:ppn=2:opteron:myrinet

If your job requires PowerPCs with Myrinet, use the following statement:

   #PBS -l nodes=4:ppn=2:ppc64:myrinet

Note: when you specify ppn=2 it may increase your wait time since any dual processor nodes or blades with one processor already in use cannot fulfill your request. If you want to use 8 processors but you do not care if the CPUs are scattered across multiple nodes, you would simply leave the "ppn=#" option out of your specifications, e. g.,

   #PBS -l nodes=8:ppc64:myrinet

For more complex node request options, please see the FAQ regarding efficiently matching the types of CPU with the needs of your application.
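
The arithmetic behind these requests (nodes multiplied by processors per node) can be sketched as follows. The function name requested_cpus is hypothetical, and the sketch assumes that a request without ppn means one processor per node, as in the nodes=8 example above.

```shell
# Print the number of CPUs implied by a node specification such as
# "nodes=4:ppn=2:x86". When ppn is omitted, one processor per node is assumed.
# requested_cpus is a hypothetical helper name.
requested_cpus() {
    echo "$1" | awk -F: '{
        nodes = 0; ppn = 1
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^nodes=/) nodes = substr($i, 7)   # text after "nodes="
            if ($i ~ /^ppn=/)   ppn   = substr($i, 5)   # text after "ppn="
        }
        print nodes * ppn
    }'
}
```

Both requested_cpus "nodes=4:ppn=2:x86" and requested_cpus "nodes=8:ppc64:myrinet" print 8, matching the examples above.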



Jobs: How can my job request either type of node (x86 or ppc64) to increase the chance that it is scheduled sooner?

If your application can run on both architectures, you may use the "FIRSTOF" option within the specification of a nodeset on a "#PBS" line. The first type of CPU listed will be given preference if both types of CPU are available.

Example

   #PBS -l nodes=4:ppn=2
   #PBS -l nodeset=FIRSTOF:FEATURE:x86:ppc64

In this example, x86 will be used if both PowerPCs and x86 are available.

If you have absolutely no preference for CPU (even if both are available), you may replace "FIRSTOF" with "ONEOF" in the example shown above.

Based on the nodes you are getting (e.g. using "arch" to determine the node architecture), you can run the proper applications.
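
Choosing the proper application from the node architecture could look like the sketch below. Everything named here is an assumption for illustration: the function binary_for_arch, the program names myprog.x86 and myprog.ppc64, and the exact strings that "arch" prints on each type of node.

```shell
# Map an architecture string (as printed by "arch") to a program name.
# The program names and the matched strings are placeholders.
binary_for_arch() {
    case "$1" in
        ppc64)       echo "./myprog.ppc64" ;;
        i?86|x86_64) echo "./myprog.x86" ;;
        *)           echo "unsupported architecture: $1" >&2; return 1 ;;
    esac
}
```

Inside a job script this might be used as prog=$(binary_for_arch "$(arch)") && "$prog", so that the job stops with a message instead of trying to run a binary built for the other architecture.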

Jobs: What happens if my job uses more resources than requested?

The job scheduler will automatically kill most jobs which exceed the resources requested in the PBS script. For example, if you specify a walltime of 4 hours and your job runs over that, the scheduler will kill the job. The reason for this is that running jobs which use more resources than requested may affect the scheduling and running of other jobs. This is because the scheduler relies on PBS specifications (among other parameters) to determine on which nodes to run jobs. Also read our job scheduler policies for more information on killing jobs which are interfering with other jobs or the system itself.

When testing code or running code you are unfamiliar with, you should monitor resource consumption more diligently so you can fine-tune your PBS request. Specifying much more of a resource (e. g., walltime, mem, or pmem) than your job requires may delay its start time if the requested resources are not immediately available. Therefore, you should start somewhat conservatively (that is, on the generous side), then reduce your resource specifications once you've determined what you are really using, always leaving a buffer to ensure the job is covered.
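
One way to turn a measured run time into a walltime request with a buffer is sketched below. The function name walltime_with_buffer and the idea of expressing the buffer as a percentage are assumptions made for illustration; choose a buffer that suits your own jobs.

```shell
# Convert a measured run time in seconds, plus a percentage buffer,
# into an hh:mm:ss walltime string.
# walltime_with_buffer is a hypothetical helper name.
walltime_with_buffer() {
    awk -v secs="$1" -v pct="$2" 'BEGIN {
        total = int(secs * (100 + pct) / 100)
        printf "%02d:%02d:%02d\n", int(total / 3600), int((total % 3600) / 60), total % 60
    }'
}
```

For a job measured at 3.5 hours (12600 seconds), walltime_with_buffer 12600 20 prints 04:12:00.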

Learn more about how to request resources and the PBS defaults when you submit a job.

Learn how to monitor and check the status of a submitted job.



Jobs: Why is my eligible job waiting so long in the Idle state?

There are several things you should check to understand your wait time in the Idle queue. See tips on checking the status of a submitted job.

  • Make sure you have requested an allowed set of resources. Check your PBS script against both the available nodes in the cluster and our job scheduler policies. You can also check on the resources requested with the command:

    checkjob -v job_number

  • Check the queue and current usage on the cluster. The particular resources your jobs need may be heavily utilized, even if the entire cluster is not. You can check the total usage of the cluster with the command:

    showq

    You can also see current and past utilization levels on this website.

  • You can see the current estimated start time for your job with the command:

    showstart job_number

    Newly submitted jobs, which need the same resources requested by your jobs, can change this estimate if those jobs have higher priority, possibly because they are submitted by accounts running under fairshare (see next point). If your job cannot run because you've requested resources that don't exist, this command will state it cannot determine a start time.

  • Your account or group account may be running over its fairshare. This means when the cluster is very busy, other jobs from accounts which are under fairshare may be assigned higher priority and may jump ahead of your job in the eligible queue. Use the mdiag -f command to check your recent usage.

If you still do not understand why your jobs are not starting more quickly, please submit an RT ticket.



Jobs: What does job status Deferred mean?

Paraphrased from the Moab documentation: Deferred jobs are those placed on hold because the scheduler has determined they cannot run. This can be because a job asks for resources which do not currently exist, does not have allocations to run, is rejected by the resource manager, repeatedly fails after start up, etc. A job will remain in the deferred state, unable to run, for a period of time specified by a parameter set for the scheduler software. Once this specified time has elapsed, the job is automatically released and the scheduler again attempts to schedule it.

That is, deferred jobs are not lost and will eventually run. Therefore, you will not gain anything by deleting and resubmitting jobs. However, you will need to take steps to get such jobs released.

You should first check that the resources requested in your PBS script exist on the cluster. Sometimes there is a problem with the node(s) allocated to run your job. Also, if the cluster network traffic reaches extreme levels, the scheduler software may communicate poorly with one or more nodes and the PBS MOM server running on them. On each node the MOM server is launched by the pbs_mom daemon. pbs_mom executes instructions to launch jobs, monitors the usage of jobs, and tells the server when jobs complete. For multi-node jobs, it also communicates with MOMs on other nodes.

If there are severe communication failures, the MOM server must occasionally be restarted on the affected nodes to re-establish the proper exchange of information between the nodes and the scheduler. Once we do this, deferred jobs will usually start.

The checkjob command will show you if there may be a MOM server problem. You will see reference to an RMFailure (where RM means "resource manager").

Whenever you encounter this error, please send us a Request Tracker ticket listing your job number(s) and the node(s). We (an administrator) can restart the MOM server on the affected node(s), thereby hopefully releasing your job(s) to run.
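
A quick way to look for the RMFailure reference is to filter the checkjob output. The sketch below is only illustrative: the function name has_rm_failure is hypothetical, and it simply searches whatever text it is given for the string RMFailure.

```shell
# Succeed (exit status 0) if the text on standard input mentions RMFailure.
# has_rm_failure is a hypothetical helper name.
has_rm_failure() {
    grep -q "RMFailure"
}
```

For example, checkjob -v job_number | has_rm_failure && echo "possible MOM server problem" prints the message only when the reference is present.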

We will research jobs held in a deferred state due to any other reason as well. An administrator will manually release the hold on jobs if it is determined the jobs should be eligible.



Jobs: Why is my job in the blocked queue?

There are several reasons why your job(s) may appear in the blocked portion of the queue. These reasons include:

  • You may already have the maximum of 10 Idle jobs waiting in the eligible queue. The checkjob command will tell you the blocked job "violates idle HARD MAXJOB limit of 10". As your Idle jobs become active (start running), your blocked jobs will move up to the eligible portion of the queue.

  • The scheduler may have placed your job into a Deferred state. See the FAQ on this subject.

  • There is a user hold or system hold on the job. Users can and do place holds on their own jobs for various reasons, such as gradual staging of job submissions. System holds, however, are instituted by and can only be released by an administrator. There is a long list of reasons why we may find it necessary to block certain jobs in this way. If we do so, we contact the affected user to explain why and, if necessary, work with that user to remedy the situation.

If you cannot determine why your job has been blocked, please submit a Request Tracker ticket.



Jobs: When I run qstat I have jobs listed with a status of 'C'. The jobs are not running and I cannot delete the jobs with 'qdel'. Why?

When running 'qstat', jobs that have completed will be listed with a status of 'C'. Because these jobs have already finished, there is nothing left for 'qdel' to delete.



Jobs: For how long will completed jobs continue to show up when the "qstat" command is used?

1 hour.



Jobs: When I submit my jobs to run on the Myrinet nodes, why do they continually fail to start?

For single-processor jobs, make certain never to specify "myrinet" in your node request. Myrinet can be especially useful when running some multi-processor/multi-node jobs which are able to benefit from the faster communication between nodes compared to Ethernet.

Because such multi-processor jobs can run only on the Myrinet nodes, we want to prevent filling up those nodes with jobs that do not need Myrinet (as long as other nodes are available to run on). Because single-processor jobs run on one machine, and therefore do not benefit from faster communication between nodes, specifying "myrinet" in your PBS script is simply not allowed.

If you are running code compiled with Myrinet, it should be parallel code. If not, you need to recompile your code without any Myrinet options.

N. b., even though for single-processor jobs you are not allowed to specifically state "myrinet" in your PBS script, the pool of x86 Myrinet nodes is still available to you as long as your code is compiled to run on the x86 architecture. That is, the scheduler will send your job to an unreserved Myrinet node if no Ethernet nodes are free.

Refer to our job scheduler policies for more on this and other restrictions we must place on job submissions.



Jobs: Why do only some of my Matlab jobs fail to run?

Matlab is only installed on the x86 machines. Therefore you must specify this in your PBS script.

The following line says you want 1 node with 2 processors without specifying the architecture. If the job gets scheduled on a PPC node, it will fail.

#PBS -l nodes=1:ppn=2

To prevent this, specify to run on the x86 nodes as follows:

#PBS -l nodes=1:ppn=2:x86



Jobs: I am receiving a "/usr/spool/PBS/mom_priv/..." error after I submit my job. What does this mean?

Because of the mixed 32-bit Intel and 64-bit PowerPC environment in the cluster, the type of architecture you request in your PBS node attribute must match the architecture on which your code was compiled. If you receive an error message to the effect:

/usr/spool/PBS/mom_priv/..../username/program: cannot execute binary file

most likely the type of processor requested in your PBS script does not match the type of processor on which the code was compiled.



Jobs: I am receiving a "no job control in this shell" error in my job output file. What does this mean?

The full error is:

Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.

This is a Unix shell error. When a script is executed on a node in the background, and is therefore not run in an interactive environment, this error is generated if your job is submitted as an interactive shell script (e. g., tcsh or bash). This will not affect running your script. Your script will continue to execute since you do not need the interactive function of the shell.



Jobs: What is the maximum number of jobs I can submit or have running at any one time?

"Eligible" and "Blocked" Queue Limits: We ask users not to exceed a maximum of 300 jobs in the "eligible" and "blocked" portions of the queue. This limit has proved to work well to keep the total number of jobs safely below 4,000, which is a scheduler software limitation. We will ask users to delete jobs from the queue if they exceed the 300-job limit.

"Active" Limits: There is a maximum of 300 CPUs in use at any one time per user. This number is summed from any combination of single and multi-processor jobs. Additional limits are placed as necessary for groups running either medium (defined as 4 to 7 days) or long (over 7 days) jobs on a regular basis if their usage is impacting the ability of other groups to use their full fairshare. Individual groups may also request upper limits on their users. New guest users have upper job limits until they have attended the Introduction to the Cluster and Job Scheduler classes.

Please refer to the job scheduler policies for additional important details of these limits.



Jobs: What is the maximum allowed "wall clock time" I may specify?

The maximum allowed walltime is 30 days, or in hh:mm:ss = 720:00:00. Your job will not start if you have specified a walltime greater than this. You may reduce the walltime of an already submitted job using "mjobctl" (Moab job control).
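
The 30-day limit (30 x 24 = 720 hours) can be checked before submitting a job. The following is a minimal sketch; the function name walltime_ok is hypothetical and it only understands the hh:mm:ss form shown above.

```shell
# Succeed if an hh:mm:ss walltime is within the 30-day (720:00:00) maximum.
# walltime_ok is a hypothetical helper name.
walltime_ok() {
    echo "$1" | awk -F: '{ exit !(($1 * 3600 + $2 * 60 + $3) <= 30 * 24 * 3600) }'
}
```

For example, walltime_ok 720:00:00 succeeds, while walltime_ok 720:00:01 fails.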

In addition, we ask that, except for a small number of test jobs, jobs run for at least 30 minutes; over an hour in length is preferable. Our job scheduler policies explain more on this subject.

Also see How to Submit Basic Jobs for other PBS specifications and how to deal with very short jobs.



Jobs: How do I use mjobctl to change the walltime of an already running job?

Only an administrator has permission to execute the command qalter to increase a job's walltime, although a user can use qalter to decrease a job's walltime. However, a user can use the mjobctl (Moab job control) command to edit (add to or subtract from) the walltime of a submitted job.

Keep in mind you should only rarely use mjobctl to increase the walltime of a job that is already running. This can adversely interfere with the scheduling of other jobs, and therefore is not good practice. If we find a user is continually doing this instead of modifying their submission scripts, we will find it necessary to place temporary restrictions on the account.

Example:

mjobctl -m reqawduration+=600 1234

will modify (-m) job 1234, adding (+) 10 minutes (600 seconds) to its walltime.
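
Since the adjustment is given in seconds, a small helper can do the conversion from minutes. This sketch only prints the command so it can be reviewed before it is run; the function name extend_walltime_cmd is hypothetical.

```shell
# Print the mjobctl command that adds the given number of minutes
# to a job's walltime. Nothing is executed.
# extend_walltime_cmd is a hypothetical helper name.
extend_walltime_cmd() {
    job_id="$1"
    minutes="$2"
    echo "mjobctl -m reqawduration+=$((minutes * 60)) $job_id"
}
```

For example, extend_walltime_cmd 1234 10 prints the command shown in the example above.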

Other options can be found by typing:

mjobctl --help

or going to the Moab web page on this subject.



Jobs: Where can I find detailed documentation on all other TORQUE and Moab commands?

Many of the TORQUE commands (in /usr/scheduler/torque/bin) and Moab commands (in /usr/scheduler/moab/bin) you will use have online manual pages. E. g., log onto the cluster and type:

man qsub

The Cluster Resources site has extensive Moab Workload Manager and TORQUE Resource Manager documentation, including the list of Moab commands and TORQUE commands.



Jobs: What if my job requires big memory?

There are 20 bigmem nodes (40 processors). These nodes are on the Opteron hardware and have 3,955 MB per node, i.e., approximately 4 GB for two processors or 2 GB each.

Since the OS on every node requires some RAM, which can range from 200 - 300 MB per node, you should always leave a buffer for OS-related memory usage in your job script. To use a node with big memory, specify the bigmem property. E.g.:

   #PBS -l nodes=1:ppn=2:x86:bigmem
   #PBS -l mem=1900mb

Top of Page


Disk space: Determining Disk Space Usage and Quotas

As noted in the cluster disk policies, you have both soft and hard limits on your home and scratch directories. To help keep the system running smoothly, please get in the habit of checking your usage level. Hard quota limits are absolute, and because of potential filesystem problems we may have to either kill jobs or place temporary limits on accounts that exceed their soft limit. Please read our cluster disk policies to understand disk space quotas, and see the FAQ entries on how to increase your available disk space by using scratch space, requesting a possible temporary quota increase, or purchasing more disk space.

To view your current usage and quota levels, type the command:

   /usr/lpp/mmfs/bin/mmlsquota

You will receive two sets of usage information: one for disk usage (in KB) and one for the number of files.

An example of the information received is shown below:

Block Limits
Filesystem  type       KB     quota      limit  in_doubt  grace
gpfs0       USR   1443096  10485760   20971520     66800   none
gpfs1       USR         8  10485760  104857600         0   none

File Limits
Filesystem  files   quota    limit  in_doubt  grace  Remarks
gpfs0         678  100000   200000       259   none
gpfs1           1  100000  1000000         0   none

The line under the filesystem column labeled "gpfs0" refers to quotas and usage on your home directory. The line labeled "gpfs1" summarizes information about your scratch disk usage.

The column labeled "quota" shows your soft quota values. The column labeled "limit" shows your hard quota limits.

The values in the "in_doubt" columns are estimates of disk and file usage that may or may not actually be in use at the moment you issued the "mmlsquota" request. You need to be concerned with those values only if the sum of your current usage and your "in_doubt" value exceeds your hard quota limit. At that point, you will not be able to write to disk until you have removed or compressed enough files to bring that sum below your hard limit.
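
The check described above (current usage plus in_doubt, compared against the soft quota and the hard limit) can be sketched with awk. This is an illustration only: the function name check_quota is hypothetical, it handles just the "Block Limits" rows, and it assumes each row has the seven columns shown in the example. Real mmlsquota output may be laid out differently, so adjust the field numbers to match what you see.

```shell
#!/bin/sh
# Sample "Block Limits" rows copied from the example output above.
sample_rows='gpfs0 USR 1443096 10485760 20971520 66800 none
gpfs1 USR 8 10485760 104857600 0 none'

# check_quota (hypothetical helper): read rows of
#   filesystem type KB quota limit in_doubt grace
# on stdin and report whether usage + in_doubt exceeds either limit.
check_quota() {
    awk '{
        total = $3 + $6                  # current usage plus in_doubt, in KB
        if (total > $5)      status = "OVER HARD LIMIT"
        else if (total > $4) status = "over soft quota"
        else                 status = "ok"
        printf "%s %d KB of %d KB hard limit: %s\n", $1, total, $5, status
    }'
}

printf '%s\n' "$sample_rows" | check_quota
```

For the sample rows, both filesystems are well under their soft quotas, so both lines report "ok".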

If you would like to see how much disk space your entire home directory currently fills, use the Unix command du, e.g.,

   du -sh /home/username

will return something like:

   2010.3M /home/username
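
If the total is close to your quota, it can be useful to see which entries take the most space. The sketch below uses only standard du, sort, and head; the function name largest_entries is hypothetical, and the glob skips hidden files (names beginning with a dot).

```shell
#!/bin/sh
# largest_entries DIR [N]  (hypothetical helper)
# List the N largest files and directories directly under DIR,
# largest first, with sizes in KB. N defaults to 10.
largest_entries() {
    dir=${1:-.}
    count=${2:-10}
    # du -sk prints one "size<TAB>path" line per entry; sort numerically, descending
    du -sk "$dir"/* 2>/dev/null | sort -rn | head -n "$count"
}

# Show the five largest entries in the current directory
largest_entries . 5
```

On the cluster you might call it as largest_entries /home/username 5, replacing username with your own.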

Top of Page


Disk space: Using scratch disk space

You have disk and file allocations available for your use on both your home directory (which is backed up) and on scratch disk space (which is not backed up). To take advantage of your scratch disk space, simply cd to that filesystem.

For example:

   cd /scratch

Before using the scratch filesystem, please create a top-level directory using either your username or your group's name.

   mkdir /scratch/username

or

   mkdir /scratch/groupname
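
In a login script or job script, the directory can be created in a way that is safe to repeat. This is a minimal sketch: make_scratch_dir is a hypothetical helper, and a temporary directory stands in for /scratch so the example can be run anywhere.

```shell
#!/bin/sh
# make_scratch_dir ROOT NAME  (hypothetical helper)
# Create ROOT/NAME if it does not exist and print its path.
make_scratch_dir() {
    root=$1
    name=$2
    dir="$root/$name"
    # -p means no error is raised if the directory already exists
    mkdir -p "$dir" && echo "$dir"
}

# On the cluster this would be: make_scratch_dir /scratch username
# Here a temporary directory is used in place of /scratch.
demo_root=$(mktemp -d)
make_scratch_dir "$demo_root" "${USER:-username}"
```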

Top of Page


Disk space: If I need more disk space than this, will you temporarily grant a quota increase?

We do not automatically relax quota restrictions. Whether we can depends entirely on the details of your request, and we can discuss your options. Please submit a Request Tracker ticket explaining why you wish a quota change, how much space you believe you require, and how long you expect to need it. If you need more disk space for an extended period of time, you may purchase it. Please see the details of our cluster disk policies, then send us your request.

Top of Page


Disk space: Will ACCRE restore deleted or lost data?

Yes. Please refer to our policy regarding restoring from backup.

Top of Page


Software: What research software packages are available on the cluster?

Please see this list for links to documentation for some of the more complex software packages we have installed on the cluster.

Top of Page


Software: I'd like to have some software installed on the cluster. How do I go about doing that?

As much as possible, ACCRE staff are glad to accommodate your needs for software. Of course, the software must be amenable to execution in the cluster environment and (if not open source) you are responsible for taking care of licensing arrangements prior to installation, as well as continued maintenance of the software license.

If you'd like to explore the possibility of adding some software to our cluster environment, please submit a Request Tracker ticket.

Top of Page


Software: I'd like to compile my code for use on the PowerPCs. Do you have some tips for me?

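We have 64-bit libraries compiled on the PowerPC machines. To determine whether a file is a 32-bit or 64-bit object, type the following command, replacing filename with the name of your object file, and check the "Class" field, which reads ELF32 or ELF64:

   readelf -h filename

Your linker will let you know if you try to link objects built for two different architectures.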
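
To check several files quickly, the Class field can be pulled out of the readelf header with awk. This is a sketch only: elf_class is a hypothetical helper, and the sample line below imitates the format readelf -h prints so the example runs without an object file.

```shell
#!/bin/sh
# elf_class (hypothetical helper): read `readelf -h` output on stdin
# and print the value of the Class field (ELF32 or ELF64).
elf_class() {
    awk '$1 == "Class:" { print $2 }'
}

# Sample header line in the format readelf -h prints.
# On the cluster you would run:  readelf -h filename | elf_class
printf '  Class:                             ELF64\n' | elf_class
```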

Because 64-bit is not the default in the SUSE distribution's toolchain, you will likely need to provide the following options to the compiler:

  • For gcc: Use -m64

  • For the IBM Fortran compiler: Use -q64

  • For xlf, f90, or f95: Type 'f90' with no options to see the help file

Top of Page


Software: Which Fortran compilers are available on the cluster?

For the x86 architecture:
   Absoft v8.0 (f90)
   Intel v9 (ifort)
   g77 v3.2.3 (g77)

For the PPC architecture:
   IBM v9.1 (xlf)
   g77 v3.3.3 (g77)

We suggest using the Absoft compiler on the x86 machines and the IBM compiler on the PPC machines. g77 is notoriously slow. Intel's compilers are good, but have problems with some F90/95 features.

For applications to link with ACCRE libraries when using the Absoft Fortran compiler, please add the following compiler options:

-YEXT_NAMES=LCS -YEXT_SFX="_"

These options "mangle" the names correctly for linking with all other ACCRE supported libraries.

You will need to modify your package list using setpkgs in your .cshrc or .bashrc file to add the compilers to your search path.

Please also see our software list for links to documentation of these packages.

Top of Page


Last modified: April 10 2008 15:07:31 CST.