--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
NVIDIA CUDA
Linux Release Notes
Version 2.3
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------

On some Linux releases, a GRUB bug in the handling of upper memory,
combined with a default vmalloc size that is too small on 32-bit systems,
may make it necessary to pass the following settings to the bootloader:

vmalloc=256MB, uppermem=524288

Example GRUB configuration entry:

title Red Hat Desktop (2.6.9-42.ELsmp)
root (hd0,0)
uppermem 524288
kernel /vmlinuz-2.6.9-42.ELsmp ro root=LABEL=/1 rhgb quiet vmalloc=256MB pci=nommconf
initrd /initrd-2.6.9-42.ELsmp.img

--------------------------------------------------------------------------------
New Features
--------------------------------------------------------------------------------

  Hardware Support
  o  See http://www.nvidia.com/object/cuda_learn_products.html

  Platform Support
  o  Continued OS support
     - RHEL 4.x, 5.x
     - Fedora 10
     - SLED 10 SP2
     - Ubuntu 8.10
  o  Additional OS support
     - Ubuntu 9.04
     - SUSE Linux 11.1
  o  Eliminated OS support
     - Fedora 9
     - Ubuntu 8.04
     - OpenSUSE Linux 11.0

  CUFFT Features
  o Performance enhancements
  o Double precision
     - CUFFT now supports double-precision transforms, with types and
       functions analogous to the existing single-precision versions.
       Similarly, the "cufftType" enumeration (used in calls like
       cufftPlan1d) has expanded to include double-precision identifiers:

       Precision:   Single          Double
       Type:        cufftReal       cufftDoubleReal
       Type:        cufftComplex    cufftDoubleComplex

       cufftType:   CUFFT_R2C       CUFFT_D2Z
       cufftType:   CUFFT_C2R       CUFFT_Z2D
       cufftType:   CUFFT_C2C       CUFFT_Z2Z

       Function:    cufftExecC2C    cufftExecZ2Z
       Function:    cufftExecR2C    cufftExecD2Z
       Function:    cufftExecC2R    cufftExecZ2D

     - The double-precision versions are invoked in the same manner as
       the single-precision ones, with the arguments changed from the
       single- to the double-precision types. See "cufft.h" for the exact
       definitions of the above.

  CUDA-GDB Features
  o Available now on all supported Linux platforms
  o Included in the toolkit installer

  Cross-Compilation Support
  o Supports compilation of 32-bit applications on 64-bit hosts.

  Double Handling by the Compiler
  o When a PTX file targeting an SM version prior to sm_13 contains
    double-precision instructions, ptxas now emits a warning that the
    double-precision instructions are demoted to single precision. The new
    ptxas option --suppress-double-demote-warning suppresses this warning.

--------------------------------------------------------------------------------
Major Bug Fixes
--------------------------------------------------------------------------------

  C++ Support for Device Emulation
  o Support is restored for using C++ code in device emulation mode

--------------------------------------------------------------------------------
Known Issues
--------------------------------------------------------------------------------

o GPU enumeration order on multi-GPU systems is non-deterministic and
  may change with this or future releases. Users should make sure to
  enumerate all CUDA-capable GPUs in the system and select the most
  appropriate one(s) to use.
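
  As a minimal sketch of that selection step, assume the application has
  already copied the relevant fields for each enumerated device into a
  hypothetical device_info array (for example, from cudaGetDeviceCount() and
  cudaGetDeviceProperties()). A helper can then choose a device by its
  capabilities rather than by its position in the enumeration. The struct,
  its field names, and pick_device() are illustrative and are not part of
  the CUDA API:

```c
/* Hypothetical summary of one enumerated device. In a real program these
   fields would be filled from the properties reported by the CUDA runtime. */
struct device_info {
    int major;            /* compute capability, major number */
    int minor;            /* compute capability, minor number */
    int multiprocessors;  /* number of multiprocessors on the device */
};

/* Nonzero if the device's compute capability is at least min_major.min_minor. */
static int meets_minimum(const struct device_info *d, int min_major, int min_minor)
{
    return d->major > min_major ||
           (d->major == min_major && d->minor >= min_minor);
}

/* Return the index of the qualifying device with the most multiprocessors,
   or -1 if no device meets the minimum compute capability. The index is the
   position in devs[], so it follows whatever order enumeration produced. */
int pick_device(const struct device_info *devs, int count,
                int min_major, int min_minor)
{
    int best = -1;
    int i;

    for (i = 0; i < count; ++i) {
        if (!meets_minimum(&devs[i], min_major, min_minor))
            continue;
        if (best < 0 || devs[i].multiprocessors > devs[best].multiprocessors)
            best = i;
    }
    return best;
}
```

  The returned index would then be passed to cudaSetDevice(). A result of -1
  means that no device met the requirement, for example compute capability
  1.3 when double precision is needed.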

o Individual GPU program launches are limited to a run time
  of less than 5 seconds on a GPU with a display attached.
  Exceeding this time limit causes a launch failure reported
  through the CUDA driver or the CUDA runtime. GPUs without
  a display attached are not subject to the 5-second run time
  restriction. For this reason, it is recommended that CUDA be
  run on a GPU that is NOT attached to an X display.

o In order to run CUDA applications, the CUDA module must be
  loaded and the entries in /dev created.  This may be achieved
  by initializing X Windows, or by creating a script to load the
  kernel module and create the entries.

  An example script (to be run at boot time):

  #!/bin/bash

  # Load the kernel module; exit with an error if that fails.
  if modprobe nvidia; then

    # Count the number of NVIDIA controllers found.
    N3D=`/sbin/lspci | grep -i NVIDIA | grep -c "3D controller"`
    NVGA=`/sbin/lspci | grep -i NVIDIA | grep -c "VGA compatible controller"`

    # Create one device node per controller (minor numbers 0..N).
    N=`expr $N3D + $NVGA - 1`
    for i in `seq 0 $N`; do
      mknod -m 666 /dev/nvidia$i c 195 $i
    done

    # Create the control device node.
    mknod -m 666 /dev/nvidiactl c 195 255

  else
    exit 1
  fi

o When compiling with GCC, special care must be taken for structs that
  contain 64-bit integers.  By default, GCC aligns long longs to a
  4-byte boundary, while NVCC aligns them to an 8-byte boundary.
  Thus, when using GCC to compile a file that has such a struct/union,
  users must give the
  -malign-double
  option to GCC.  When using NVCC, this option is automatically
  passed to GCC.
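
  One way to see the difference is to check the offset of a 64-bit member.
  The sketch below uses a hypothetical struct (sample) and hypothetical
  helper names, and assumes a 4-byte int. With a 4-byte boundary for long
  long the offset of value is 4; with an 8-byte boundary (for example, GCC
  with -malign-double) it is 8:

```c
#include <stddef.h>

/* Hypothetical struct, for illustration only. */
struct sample {
    int       tag;    /* assumed to be 4 bytes */
    long long value;  /* 64-bit integer whose alignment is in question */
};

/* Offset of the 64-bit member within the struct. */
size_t sample_value_offset(void)
{
    return offsetof(struct sample, value);
}

/* Nonzero if the member sits on an 8-byte boundary. */
int sample_is_8_byte_aligned(void)
{
    return sample_value_offset() % 8 == 0;
}
```

  Code compiled separately by GCC and by NVCC must agree on this offset for
  any struct shared between them. On 64-bit x86 hosts GCC typically already
  uses the 8-byte boundary, so the mismatch is mainly a 32-bit concern.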

o It is a known issue that cudaThreadExit() may not be called implicitly on
  host thread exit. Until this issue is resolved, developers should call
  cudaThreadExit() explicitly before the host thread exits.

o For maximum performance when using multiple byte sizes to access the
  same data, coalesce adjacent loads and stores when possible rather
  than using a union or individual byte accesses. Accessing the data via
  a union may result in the compiler reserving extra memory for the object,
  and accessing the data as individual bytes may result in non-coalesced
  accesses. This will be improved in a future compiler release.
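
  The idea of one wide access followed by extraction in registers can be
  sketched in plain C as below. The function names are hypothetical, and
  byte 0 is taken to be the least significant byte of the word (a
  little-endian layout is assumed). In device code the same pattern would
  apply to the 32-bit loads issued by each thread:

```c
#include <stddef.h>
#include <stdint.h>

/* Extract byte `index` (0 = least significant) from a word that has
   already been fetched with a single 32-bit load. */
static unsigned char byte_of(uint32_t word, unsigned index)
{
    return (unsigned char)((word >> (8u * index)) & 0xFFu);
}

/* Sum all bytes of an array of words. Each element is loaded once as a
   32-bit value; the four bytes are then separated with shifts instead of
   four single-byte loads or a union. */
uint32_t sum_bytes(const uint32_t *words, size_t count)
{
    uint32_t total = 0;
    size_t i;
    unsigned b;

    for (i = 0; i < count; ++i) {
        uint32_t w = words[i];  /* one 32-bit load */
        for (b = 0; b < 4; ++b)
            total += byte_of(w, b);
    }
    return total;
}
```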

o OpenGL interoperability
  - OpenGL cannot access a buffer that is currently
    *mapped*. If the buffer is registered but not mapped, OpenGL can do any
    requested operations on the buffer.
  - Deleting a buffer while it is mapped for CUDA results in undefined behavior.
  - Attempting to map or unmap a buffer while the bound context differs
    from the one that was current when the buffer was registered will
    generally result in a program error and should be avoided.
  - Interoperability will use a software path on SLI.
  - Interoperability will use a software path if monitors are attached to
    multiple GPUs and a single desktop spans more than one GPU
    (i.e. X11 Xinerama).

o Sending SIGINT (Ctrl-C) or SIGKILL to an application that is currently
  running a kernel on the GPU may not result in a clean shutdown of the
  process, as the kernel may continue running for a long time afterwards on
  the GPU. In such cases, a system restart may be necessary before running
  further CUDA or graphics applications.

--------------------------------------------------------------------------------
Open64 Sources
--------------------------------------------------------------------------------

The Open64 source files are distributed under the terms of the GPL.
Current and previously released versions are located via anonymous ftp at
download.nvidia.com in the CUDAOpen64 directory.


--------------------------------------------------------------------------------
Revision History
--------------------------------------------------------------------------------

  07/2009 - Version 2.3
  06/2009 - Version 2.3 Beta
  05/2009 - Version 2.2
  03/2009 - Version 2.2 Beta
  11/2008 - Version 2.1 Beta
  06/2008 - Version 2.0
  11/2007 - Version 1.1
  06/2007 - Version 1.0
  06/2007 - Version 0.9
  02/2007 - Version 0.8 - Initial public Beta


--------------------------------------------------------------------------------
More Information
--------------------------------------------------------------------------------

  For more information and help with CUDA, please visit
  http://www.nvidia.com/cuda