
.NET: Nvidia CUDA for CSharp (v1.0.1)

Access Nvidia CUDA functionality from the C# language (as well as VB.NET) on Windows 64-bit (x64) using this simple wrapper library.

This is a simple CUDA API wrapper to get started with. Also see the version for Linux.

Download these files, including the compiled assembly and the CUDA NVRTC DLL files: NVIDIACUDAFORCSHARP.zip.

The assembly file is compiled with .NET 6 for x64 Windows.

Workloads such as machine learning in C# could take great advantage of this Nvidia CUDA code.

Example Usage:



using CudaForCSharp;
using System;
using System.IO;

namespace TestApp
{
	class Program
	{
		static void VectorAdd(int deviceId)
		{
			/*
			 * VECTOR ADD EXAMPLE
			 * RUNS ON GPGPU
			 * THIS IS HOW A GPGPU SHINES!
			 */

			string vectorAdd_kernel_code = @"
/**
 * CUDA Kernel Device code
 *
 * Computes the vector addition of A and B into C. The 3 vectors have the same
 * number of elements numElements.
 */

extern ""C"" __global__ void vectorAdd(const float *A, const float *B, float *C, int numElements) {

	int i = blockDim.x * blockIdx.x + threadIdx.x;

	if (i < numElements) {
		C[i] = A[i] + B[i];
	}
}";
			ulong freeMemory, totalMemory;
			Cuda.MemoryInfo(deviceId, out freeMemory, out totalMemory);

			string kernel_file = Path.Combine(Directory.GetCurrentDirectory(), "vectorAdd_kernel.cu");
			File.WriteAllText(kernel_file, vectorAdd_kernel_code);
			byte[] cubin;
			ulong cubinSize;
			Cuda.CompileFileToCUBIN(deviceId, kernel_file, out cubin, out cubinSize);

			ulong module = Cuda.LoadCUBIN(deviceId, cubin);

			IntPtr kernel_addr;
			Cuda.GetModuleFunction(out kernel_addr, module, "vectorAdd");

			// Print the vector length to be used, and compute its size
			const int ALLOWED = 5242880; // 5MB room allowed for GPU to execute code; this may not always be enough!
			// Divide by 3: three equally-sized float vectors (A, B and C) must fit in free device memory;
			// clamp to int.MaxValue so the cast cannot overflow on GPUs with very large memory
			int numElements = (int)Math.Min((freeMemory - ALLOWED) / sizeof(float) / 3, (ulong)int.MaxValue);
			ulong size = (ulong)numElements * sizeof(float);

			Console.WriteLine($"[Initializing {numElements} elements]");

			// Allocate the host input vector A
			float[] h_A = new float[numElements];

			// Allocate the host input vector B
			float[] h_B = new float[numElements];

			// Allocate the host output vector C
			float[] h_C = new float[numElements];

			Random rand = new Random();

			// Initialize the host input vectors
			for (int i = 0; i < numElements; ++i)
			{
				h_A[i] = (float)rand.NextDouble();
				h_B[i] = (float)rand.NextDouble();
			}

			var start = DateTime.Now;
			Console.WriteLine($"[Vector addition of {numElements} elements]");
			Console.WriteLine("Using CPU...");
			for (int i = 0; i < numElements; ++i)
				h_C[i] = h_A[i] + h_B[i];
			Console.WriteLine((DateTime.Now - start).TotalMilliseconds + " milliseconds");

			// Allocate the device input vector A
			ulong d_A = Cuda.MemAlloc(size);

			// Allocate the device input vector B
			ulong d_B = Cuda.MemAlloc(size);

			// Allocate the device output vector C
			ulong d_C = Cuda.MemAlloc(size);

			// Copy the host input vectors A and B in host memory to the device input
			// vectors in device memory
			Cuda.MemcpyHostToDev(d_A, h_A, size);
			Cuda.MemcpyHostToDev(d_B, h_B, size);

			// Launch the Vector Add CUDA Kernel
			uint threadsPerBlock = (uint)Cuda.DeviceGetAttribute(CUdevice_attribute.CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK, deviceId);
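			// Round up so every element is covered, e.g. 1,000,000 elements at 1,024 threads/block -> 977 blocks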
			uint blocksPerGrid = ((uint)numElements + threadsPerBlock - 1) / threadsPerBlock;
			Console.WriteLine($"CUDA kernel launch with {blocksPerGrid} blocks of {threadsPerBlock} threads");

			var cudaBlockSize = new dim3(threadsPerBlock, 1, 1);
			var cudaGridSize = new dim3(blocksPerGrid, 1, 1);

			IntPtr[] extra = null;

			object[] parameters = new object[] { d_A, d_B, d_C, numElements };

			start = DateTime.Now;
			Console.WriteLine("Using GPU...");
			Cuda.LaunchKernel(kernel_addr, cudaGridSize.x, cudaGridSize.y, cudaGridSize.z, cudaBlockSize.x, cudaBlockSize.y, cudaBlockSize.z, 0, (IntPtr)0, parameters, extra);
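			// Kernel launches are asynchronous; synchronize so the elapsed time includes GPU execution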
			Cuda.ContextSynchronize();
			Console.WriteLine((DateTime.Now - start).TotalMilliseconds + " milliseconds");

			// Copy the device result vector in device memory to the host result vector
			// in host memory.
			Cuda.MemcpyDevToHost(h_C, d_C, size);

			// Verify that the result vector is correct
			for (int i = 0; i < numElements; ++i)
			{
				if (Math.Abs(h_A[i] + h_B[i] - h_C[i]) > 1e-5f) // compare with a tolerance; never test floats for exact equality
				{
					Console.WriteLine($"Result verification failed at element {i}!");
					goto free_memory;
				}
			}

			Console.WriteLine("Vector Add Test PASSED!\r\n");


			free_memory:

			// Free device global memory
			Cuda.MemFree(d_C);
			Cuda.MemFree(d_B);
			Cuda.MemFree(d_A);
		}

		static void SelectionSort(int deviceId)
		{
			/*
			 * SELECTION SORT EXAMPLE
			 * RUNS ON THE GPGPU
			 */
			string selection_sort_kernel_code = @"
/* SELECTION SORT RUNNING ON THE GPU */
extern ""C"" __global__ void selection_sort(int *data, const int size)
{
	int indexMin = 0, valueMin = 0;
	const int bottom = (size - 1);
	for(int top = 0; top < bottom; top++)
	{
		indexMin = top;
		valueMin = data[top];
		for(int seekMin = top + 1; seekMin < size; seekMin++)
		{
			if (data[seekMin] < valueMin)
			{
				valueMin = data[seekMin];
				indexMin = seekMin;
			}
		}
		data[indexMin] = data[top];
		data[top] = valueMin;
	}
}";
			byte[] cubin;
			ulong cubinSize;
			Cuda.CompileSourceCodeToCUBIN(deviceId, selection_sort_kernel_code, "selection_sort_program", out cubin, out cubinSize);
			ulong module = Cuda.LoadCUBIN(deviceId, cubin);

			IntPtr kernel_addr;
			Cuda.GetModuleFunction(out kernel_addr, module, "selection_sort");

			int numElements = 50;
			ulong size = (ulong)numElements * sizeof(int);
			Console.WriteLine($"[Selection sort of {numElements} elements]");

			int[] h_Array = new int[numElements];

			Random rand = new Random();

			for (int i = 0; i < numElements; ++i)
				h_Array[i] = rand.Next();

			ulong d_Array = Cuda.MemAlloc(size);

			Cuda.MemcpyHostToDev(d_Array, h_Array, size);

			object[] parameters = new object[] { d_Array, numElements };

			IntPtr[] extra = null;

			// Launch a single block containing a single thread; selection sort is
			// inherently sequential, so this demonstrates running arbitrary kernel
			// code on one GPU thread rather than exploiting parallelism
			Cuda.LaunchKernel(kernel_addr, 1, 1, 1, 1, 1, 1, 0, (IntPtr)0, parameters, extra);

			Cuda.ContextSynchronize();

			// Copy the device result vector in device memory to the host result vector
			// in host memory.
			Cuda.MemcpyDevToHost(h_Array, d_Array, size);

			Cuda.MemFree(d_Array);

			for (int i = 0; i < numElements; ++i)
				Console.WriteLine(h_Array[i]);
		}

		static void Main(string[] args)
		{
			Cuda.Initialize(0);

			var deviceId = Cuda.GetMaxGflopsDeviceId();

			VectorAdd(deviceId);

			Console.WriteLine("Hit ENTER to begin selection sort on the GPU.");
			Console.ReadLine();
			SelectionSort(deviceId);
		}
	}
}
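
Before running real work, the wrapper can also report basic device information. Here is a minimal sketch (using only methods that appear in the assembly listing below) that prints the chosen device's name, driver version and memory:

using CudaForCSharp;
using System;

class DeviceInfo
{
	static void Main()
	{
		Cuda.Initialize(0);

		// Pick the best-performing CUDA device
		int deviceId = Cuda.GetMaxGflopsDeviceId();

		Console.WriteLine($"Device: {Cuda.DeviceGetName(deviceId)}");
		Console.WriteLine($"Driver version: {Cuda.DriverGetVersion()}");

		// Query free/total device memory (creates and destroys a temporary context)
		Cuda.MemoryInfo(deviceId, out ulong freeMem, out ulong totalMem);
		Console.WriteLine($"Memory: {freeMem / (1024 * 1024)} MB free of {totalMem / (1024 * 1024)} MB total");
	}
}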

The CudaForCSharp Main Assembly code:


// Cuda.cs

using System;
using System.IO;
using System.IO.Compression;
using System.Runtime.InteropServices;

namespace CudaForCSharp
{
	[StructLayout(LayoutKind.Sequential)]
	public struct uint3
	{
		public uint x;
		public uint y;
		public uint z;
	}
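	// Mirrors CUDA's built-in dim3 type for specifying grid and block dimensions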
	[StructLayout(LayoutKind.Sequential)]
	public struct dim3
	{
		public uint x;
		public uint y;
		public uint z;
		public dim3(uint vx = 1, uint vy = 1, uint vz = 1) { x = vx; y = vy; z = vz; }
		public dim3(uint3 v) { x = v.x; y = v.y; z = v.z; }
	}

	public static class Cuda
	{
		public static void Initialize(uint flags)
		{
			var result = cuInit(flags);
			if (result != CUresult.CUDA_SUCCESS)
				throw new Exception("cuInit() failed.") { Source = "CUDA", HResult = (int)result };
		}

		// This function returns the best GPU device ID based on performance
		public static int GetMaxGflopsDeviceId()
		{
			int current_device;
			int max_perf_device = 0;
			int device_count;
			int sm_per_multiproc;
			ulong max_compute_perf = 0;
			int major;
			int minor;
			int multiProcessorCount;
			int clockRate;
			int devices_prohibited = 0;

			CUresult result;

			result = cuInit(0);
			if (result != CUresult.CUDA_SUCCESS)
				throw new Exception("cuInit() failed.") { Source = "CUDA", HResult = (int)result };

			result = cuDeviceGetCount(out device_count);
			if (result != CUresult.CUDA_SUCCESS)
				throw new Exception("cuDeviceGetCount() failed.") { Source = "CUDA", HResult = (int)result };

			if (device_count == 0)
				throw new Exception("No devices support CUDA.");

			// Find the best CUDA capable GPU device
			current_device = 0;

			while (current_device < device_count)
			{
				result = cuDeviceGetAttribute(out multiProcessorCount, CUdevice_attribute.CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT, current_device);
				if (result != CUresult.CUDA_SUCCESS)
					throw new Exception("cuDeviceGetAttribute() failed.") { Source = "CUDA", HResult = (int)result };
				result = cuDeviceGetAttribute(out clockRate, CUdevice_attribute.CU_DEVICE_ATTRIBUTE_CLOCK_RATE, current_device);
				if (result != CUresult.CUDA_SUCCESS)
					throw new Exception("cuDeviceGetAttribute() failed.") { Source = "CUDA", HResult = (int)result };
				result = cuDeviceGetAttribute(out major, CUdevice_attribute.CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR, current_device);
				if (result != CUresult.CUDA_SUCCESS)
					throw new Exception("cuDeviceGetAttribute() failed.") { Source = "CUDA", HResult = (int)result };
				result = cuDeviceGetAttribute(out minor, CUdevice_attribute.CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR, current_device);
				if (result != CUresult.CUDA_SUCCESS)
					throw new Exception("cuDeviceGetAttribute() failed.") { Source = "CUDA", HResult = (int)result };

				int computeMode;
				cuDeviceGetAttribute(out computeMode, CUdevice_attribute.CU_DEVICE_ATTRIBUTE_COMPUTE_MODE, current_device);

				if (computeMode != (int)CUcomputemode.CU_COMPUTEMODE_PROHIBITED)
				{
					if (major == 9999 && minor == 9999)
						sm_per_multiproc = 1;
					else
						sm_per_multiproc = ConvertSMVer2CoresDRV(major, minor);

					// Relative performance proxy: multiprocessors x cores-per-SM x clock rate (kHz)
					ulong compute_perf = (ulong)(multiProcessorCount * sm_per_multiproc * clockRate);

					if (compute_perf > max_compute_perf)
					{
						max_compute_perf = compute_perf;
						max_perf_device = current_device;
					}
				}
				else
					devices_prohibited++;

				++current_device;
			}

			if (devices_prohibited == device_count)
				throw new Exception("GetMaxGflopsDeviceId() error: all devices have compute mode prohibited.");

			return max_perf_device;
		}
		
		public static int DeviceGetCount()
		{
			int count;
			var result = cuDeviceGetCount(out count);
			if (result != CUresult.CUDA_SUCCESS)
				throw new Exception("cuDeviceGetCount() failed.") { Source = "CUDA", HResult = (int)result };
			return count;
		}
		
		public static int DeviceGetAttribute(CUdevice_attribute attrib, int deviceId)
		{
			int pi;
			var result = cuDeviceGetAttribute(out pi, attrib, deviceId);
			if (result != CUresult.CUDA_SUCCESS)
				throw new Exception("cuDeviceGetAttribute() failed.") { Source = "CUDA", HResult = (int)result };
			return pi;
		}
		
		public static string DeviceGetName(int deviceId)
		{
			byte[] buf = new byte[256];
			var result = cuDeviceGetName(buf, 256, deviceId);
			if (result != CUresult.CUDA_SUCCESS)
				throw new Exception("cuDeviceGetName() failed.") { Source = "CUDA", HResult = (int)result };
			int len = Array.IndexOf(buf, (byte)0); // the driver returns a NUL-terminated string; trim the padding
			return System.Text.Encoding.ASCII.GetString(buf, 0, len < 0 ? buf.Length : len);
		}

		public static int DriverGetVersion()
		{
			int ver;
			var result = cuDriverGetVersion(out ver);
			if (result != CUresult.CUDA_SUCCESS)
				throw new Exception("cuDriverGetVersion() failed.") { Source = "CUDA", HResult = (int)result };
			return ver;
		}

		public static ulong DeviceTotalMem(int deviceId)
		{
			ulong total;
			var result = cuDeviceTotalMem_v2(out total, deviceId);
			if (result != CUresult.CUDA_SUCCESS)
				throw new Exception("cuDeviceTotalMem_v2() failed.") { Source = "CUDA", HResult = (int)result };
			return total;
		}


		// deviceId may be null if a context has already been created
		public static void MemoryInfo(int? deviceId, out ulong freeBytes, out ulong totalBytes)
		{
			ulong ctxId = 0;
			if (deviceId != null)
				ctxId = ContextCreateId(0, deviceId ?? 0);
			var result = cuMemGetInfo(out freeBytes, out totalBytes);
			if (deviceId != null)
				ContextDestroy(ctxId);
			if (result != CUresult.CUDA_SUCCESS)
				throw new Exception("cuMemGetInfo() failed.") { Source = "CUDA", HResult = (int)result };
		}

		public static bool DeviceCanAccessPeer(int deviceId, int peerDeviceId)
		{
			int canAccess;
			var result = cuDeviceCanAccessPeer(out canAccess, deviceId, peerDeviceId);
			if (result != CUresult.CUDA_SUCCESS)
				throw new Exception("cuDeviceCanAccessPeer() failed.") { Source = "CUDA", HResult = (int)result };
			return canAccess != 0;
		}

		// gpuDeviceId can be null, in which case the best-performing GPU is chosen
		public static void CompileFileToCUBIN(int? gpuDeviceId, string filename, out byte[] cubinResult, out ulong cubinResultSize/*, int requiresCGheaders*/)
		{
			CompileSourceCodeToCUBIN(gpuDeviceId, File.ReadAllText(filename), filename, out cubinResult, out cubinResultSize /*, requiresCGheaders*/);
		}

		// gpuDeviceId can be null, in which case the best-performing GPU is chosen
		public static void CompileSourceCodeToCUBIN(int? gpuDeviceId, string sourceCode, string name, out byte[] cubinResult, out ulong cubinResultSize/*, int requiresCGheaders*/)
		{
			int numCompileOptions = 0;

			var compileParams = new string[2];

			// Picks the best CUDA device available
			int cuDevice = gpuDeviceId ?? GetMaxGflopsDeviceId();


			// Get the device's compute capability (major.minor)
			int major = DeviceGetAttribute(CUdevice_attribute.CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR, cuDevice);
			int minor = DeviceGetAttribute(CUdevice_attribute.CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR, cuDevice);

			// Compile the CUBIN for the GPU architecture on which the CUDA kernel will run
			compileParams[numCompileOptions] = $"--gpu-architecture=sm_{major}{minor}";


			numCompileOptions++;

			// NOTE: the original C++ helper could also add an --include-path option for
			// cooperative_groups.h when a kernel required it; that feature is omitted here.

			// NVRTC pipeline: create the program, compile it, query the CUBIN size, then fetch the CUBIN
			ulong prog = 0, a, b;
			nvrtcResult nvrtcRes = nvrtcResult.NVRTC_SUCCESS;

			bool dllsExtracted = false;
			while (true)
			{
				try
				{
					nvrtcRes = nvrtcCreateProgram(out prog, sourceCode, name, 0, out a, out b);
					break;
				}
				catch (DllNotFoundException) // the NVRTC native DLLs were not found on the search path
				{
					if (dllsExtracted)
						throw; // extraction was already attempted; give up
					// Extract the embedded NVRTC DLLs next to this assembly, then retry
					var dlls = new string[] { "nvrtc64_112_0.dll", "nvrtc-builtins64_115.dll" };
					foreach (var dll in dlls)
						SaveDll(dll);
					dllsExtracted = true;
				}
			}
			if (nvrtcRes != nvrtcResult.NVRTC_SUCCESS)
				throw new Exception("nvrtcCreateProgram() failed.") { Source = "NVRTC", HResult = (int)nvrtcRes };

			nvrtcRes = nvrtcCompileProgram(prog, numCompileOptions, compileParams);
			if (nvrtcRes != nvrtcResult.NVRTC_SUCCESS)
				throw new Exception("nvrtcCompileProgram() failed.") { Source = "NVRTC", HResult = (int)nvrtcRes };

			ulong codeSize = 0;
			nvrtcRes = nvrtcGetCUBINSize(prog, out codeSize);
			if (nvrtcRes != nvrtcResult.NVRTC_SUCCESS)
				throw new Exception("nvrtcGetCUBINSize() failed.") { Source = "NVRTC", HResult = (int)nvrtcRes };

			var code = new byte[codeSize];
			nvrtcRes = nvrtcGetCUBIN(prog, code);
			if (nvrtcRes != nvrtcResult.NVRTC_SUCCESS)
				throw new Exception("nvrtcGetCUBIN() failed.") { Source = "NVRTC", HResult = (int)nvrtcRes };

			cubinResult = code;
			cubinResultSize = codeSize;

		}

		// gpuDeviceId can be null
		public static ulong LoadCUBIN(int? gpuDeviceId, byte[] cubin)
		{
			ulong module, context;

			// Picks the best CUDA device available
			int cuDevice = gpuDeviceId ?? GetMaxGflopsDeviceId();

			CUresult result;

			result = cuInit(0);
			if (result != CUresult.CUDA_SUCCESS)
				throw new Exception("cuInit() failed.") { Source = "CUDA", HResult = (int)result };

			// cuCtxCreate() makes the new context current on the calling thread;
			// it is intentionally left alive for subsequent module and memory calls
			result = cuCtxCreate(out context, 0, cuDevice);
			if (result != CUresult.CUDA_SUCCESS)
				throw new Exception("cuCtxCreate() failed.") { Source = "CUDA", HResult = (int)result };

			result = cuModuleLoadData(out module, cubin);
			if (result != CUresult.CUDA_SUCCESS)
				throw new Exception("cuModuleLoadData() failed.") { Source = "CUDA", HResult = (int)result };

			return module;
		}

		public static void GetModuleFunction(out IntPtr hfunc, ulong module, string name)
		{
			CUresult result = cuModuleGetFunction(out hfunc, module, name);
			if (result != CUresult.CUDA_SUCCESS)
				throw new Exception("cuModuleGetFunction() failed.") { Source = "CUDA", HResult = (int)result };

		}

		public static ulong MemAlloc(ulong size)
		{
			ulong devicePtr;
			var result = cuMemAlloc(out devicePtr, size);
			if (result == CUresult.CUDA_ERROR_OUT_OF_MEMORY)
				throw new Exception("Out of device memory.");
			else if (result != CUresult.CUDA_SUCCESS)
				throw new Exception("cuMemAlloc() failed.") { Source = "CUDA", HResult = (int)result };
			return devicePtr;
		}

		public static void MemFree(ulong devicePtr)
		{
			var result = cuMemFree(devicePtr);
			if (result != CUresult.CUDA_SUCCESS)
				throw new Exception("cuMemFree() failed.") { Source = "CUDA", HResult = (int)result };
		}
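
		// The typed MemcpyHostToDev() and MemcpyDevToHost() overloads below all wrap
		// the same driver copies (cuMemcpyHtoD/cuMemcpyDtoH) for each primitive element type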

		public static void MemcpyHostToDev(ulong destDevice, byte[] srcHost, ulong size)
		{
			var result = cuMemcpyHtoD(destDevice, srcHost, size);
			if (result != CUresult.CUDA_SUCCESS)
				throw new Exception("cuMemcpyHtoD() failed.") { Source = "CUDA", HResult = (int)result };
		}

		public static void MemcpyHostToDev(ulong destDevice, ushort[] srcHost, ulong size)
		{
			var result = cuMemcpyHtoD(destDevice, srcHost, size);
			if (result != CUresult.CUDA_SUCCESS)
				throw new Exception("cuMemcpyHtoD() failed.") { Source = "CUDA", HResult = (int)result };
		}

		public static void MemcpyHostToDev(ulong destDevice, uint[] srcHost, ulong size)
		{
			var result = cuMemcpyHtoD(destDevice, srcHost, size);
			if (result != CUresult.CUDA_SUCCESS)
				throw new Exception("cuMemcpyHtoD() failed.") { Source = "CUDA", HResult = (int)result };
		}

		public static void MemcpyHostToDev(ulong destDevice, ulong[] srcHost, ulong size)
		{
			var result = cuMemcpyHtoD(destDevice, srcHost, size);
			if (result != CUresult.CUDA_SUCCESS)
				throw new Exception("cuMemcpyHtoD() failed.") { Source = "CUDA", HResult = (int)result };
		}

		public static void MemcpyHostToDev(ulong destDevice, short[] srcHost, ulong size)
		{
			var result = cuMemcpyHtoD(destDevice, srcHost, size);
			if (result != CUresult.CUDA_SUCCESS)
				throw new Exception("cuMemcpyHtoD() failed.") { Source = "CUDA", HResult = (int)result };
		}

		public static void MemcpyHostToDev(ulong destDevice, int[] srcHost, ulong size)
		{
			var result = cuMemcpyHtoD(destDevice, srcHost, size);
			if (result != CUresult.CUDA_SUCCESS)
				throw new Exception("cuMemcpyHtoD() failed.") { Source = "CUDA", HResult = (int)result };
		}

		public static void MemcpyHostToDev(ulong destDevice, long[] srcHost, ulong size)
		{
			var result = cuMemcpyHtoD(destDevice, srcHost, size);
			if (result != CUresult.CUDA_SUCCESS)
				throw new Exception("cuMemcpyHtoD() failed.") { Source = "CUDA", HResult = (int)result };
		}

		public static void MemcpyHostToDev(ulong destDevice, float[] srcHost, ulong size)
		{
			var result = cuMemcpyHtoD(destDevice, srcHost, size);
			if (result != CUresult.CUDA_SUCCESS)
				throw new Exception("cuMemcpyHtoD() failed.") { Source = "CUDA", HResult = (int)result };
		}

		public static void MemcpyHostToDev(ulong destDevice, double[] srcHost, ulong size)
		{
			var result = cuMemcpyHtoD(destDevice, srcHost, size);
			if (result != CUresult.CUDA_SUCCESS)
				throw new Exception("cuMemcpyHtoD() failed.") { Source = "CUDA", HResult = (int)result };
		}

		public static void MemcpyDevToHost(byte[] destHost, ulong srcDevice, ulong size)
		{
			var result = cuMemcpyDtoH(destHost, srcDevice, size);
			if (result != CUresult.CUDA_SUCCESS)
				throw new Exception("cuMemcpyDtoH() failed.") { Source = "CUDA", HResult = (int)result };
		}

		public static void MemcpyDevToHost(ushort[] destHost, ulong srcDevice, ulong size)
		{
			var result = cuMemcpyDtoH(destHost, srcDevice, size);
			if (result != CUresult.CUDA_SUCCESS)
				throw new Exception("cuMemcpyDtoH() failed.") { Source = "CUDA", HResult = (int)result };
		}

		public static void MemcpyDevToHost(uint[] destHost, ulong srcDevice, ulong size)
		{
			var result = cuMemcpyDtoH(destHost, srcDevice, size);
			if (result != CUresult.CUDA_SUCCESS)
				throw new Exception("cuMemcpyDtoH() failed.") { Source = "CUDA", HResult = (int)result };
		}

		public static void MemcpyDevToHost(ulong[] destHost, ulong srcDevice, ulong size)
		{
			var result = cuMemcpyDtoH(destHost, srcDevice, size);
			if (result != CUresult.CUDA_SUCCESS)
				throw new Exception("cuMemcpyDtoH() failed.") { Source = "CUDA", HResult = (int)result };
		}

		public static void MemcpyDevToHost(short[] destHost, ulong srcDevice, ulong size)
		{
			var result = cuMemcpyDtoH(destHost, srcDevice, size);
			if (result != CUresult.CUDA_SUCCESS)
				throw new Exception("cuMemcpyDtoH() failed.") { Source = "CUDA", HResult = (int)result };
		}

		public static void MemcpyDevToHost(int[] destHost, ulong srcDevice, ulong size)
		{
			var result = cuMemcpyDtoH(destHost, srcDevice, size);
			if (result != CUresult.CUDA_SUCCESS)
				throw new Exception("cuMemcpyDtoH() failed.") { Source = "CUDA", HResult = (int)result };
		}

		public static void MemcpyDevToHost(long[] destHost, ulong srcDevice, ulong size)
		{
			var result = cuMemcpyDtoH(destHost, srcDevice, size);
			if (result != CUresult.CUDA_SUCCESS)
				throw new Exception("cuMemcpyDtoH() failed.") { Source = "CUDA", HResult = (int)result };
		}

		public static void MemcpyDevToHost(float[] destHost, ulong srcDevice, ulong size)
		{
			var result = cuMemcpyDtoH(destHost, srcDevice, size);
			if (result != CUresult.CUDA_SUCCESS)
				throw new Exception("cuMemcpyDtoH() failed.") { Source = "CUDA", HResult = (int)result };
		}

		public static void MemcpyDevToHost(double[] destHost, ulong srcDevice, ulong size)
		{
			var result = cuMemcpyDtoH(destHost, srcDevice, size);
			if (result != CUresult.CUDA_SUCCESS)
				throw new Exception("cuMemcpyDtoH() failed.") { Source = "CUDA", HResult = (int)result };
		}

		public static void MemcpyDevToDev(ulong destDevice, ulong srcDevice, ulong size)
		{
			var result = cuMemcpyDtoD(destDevice, srcDevice, size);
			if (result != CUresult.CUDA_SUCCESS)
				throw new Exception("cuMemcpyDtoD() failed.") { Source = "CUDA", HResult = (int)result };
		}

		public static void LaunchKernel(IntPtr func, uint gridDimX, uint gridDimY, uint gridDimZ, uint blockDimX, uint blockDimY, uint blockDimZ, uint sharedMemBytes, IntPtr hStream, object[] kernelParams, IntPtr[] extra)
		{
			int count = kernelParams.Length;
			IntPtr[] arr = new IntPtr[count];
			GCHandle[] GCHandleList = new GCHandle[count];
			CUresult result;

			try
			{
				// Pin each kernel parameter and collect a pointer to it
				for (int i = 0; i < count; i++)
				{
					GCHandleList[i] = GCHandle.Alloc(kernelParams[i], GCHandleType.Pinned);
					arr[i] = GCHandleList[i].AddrOfPinnedObject();
				}

				result = cuLaunchKernel(func, gridDimX, gridDimY, gridDimZ, blockDimX, blockDimY, blockDimZ, sharedMemBytes, hStream, arr, extra);
			}
			finally
			{
				// Free the pinned handles; the driver copies the parameter values during the launch call
				for (int i = 0; i < count; i++)
					if (GCHandleList[i].IsAllocated)
						GCHandleList[i].Free();
			}

			if (result != CUresult.CUDA_SUCCESS)
				throw new Exception("cuLaunchKernel() failed.") { Source = "CUDA", HResult = (int)result };
		}

		public static int DeviceGet(int ordinal)
		{
			int dev;
			var result = cuDeviceGet(out dev, ordinal);
			if (result != CUresult.CUDA_SUCCESS)
				throw new Exception("cuDeviceGet() failed.") { Source = "CUDA", HResult = (int)result };
			return dev;
		}

		public static ulong ContextCreateId(uint flags, int deviceId)
		{
			ulong ctx;
			var result = cuCtxCreate(out ctx, flags, deviceId);
			if (result != CUresult.CUDA_SUCCESS)
				throw new Exception("cuCtxCreate() failed.") { Source = "CUDA", HResult = (int)result };
			return ctx;
		}

		public static void ContextSynchronize()
		{
			var result = cuCtxSynchronize();
			if (result != CUresult.CUDA_SUCCESS)
				throw new Exception("cuCtxSynchronize() failed.") { Source = "CUDA", HResult = (int)result };
		}

		public static void ContextDestroy(ulong contextId)
		{
			var result = cuCtxDestroy(contextId);
			if (result != CUresult.CUDA_SUCCESS)
				throw new Exception("cuCtxDestroy() failed.") { Source = "CUDA", HResult = (int)result };
		}

		// Beginning of GPU Architecture definitions
		internal struct sSMtoCores
		{
			public int SM;  // 0xMm (hexadecimal notation), M = SM Major version, and m = SM minor version
			public int Cores;
			public sSMtoCores(int SM, int Cores)
			{
				this.SM = SM;
				this.Cores = Cores;
			}
		}

		public static int ConvertSMVer2CoresDRV(int major, int minor)
		{
			// Defines for GPU architecture types (using the SM version to determine the number of cores per SM)
			sSMtoCores[] nGpuArchCoresPerSM = {
				new sSMtoCores(0x30, 192),
				new sSMtoCores(0x32, 192),
				new sSMtoCores(0x35, 192),
				new sSMtoCores(0x37, 192),
				new sSMtoCores(0x50, 128),
				new sSMtoCores(0x52, 128),
				new sSMtoCores(0x53, 128),
				new sSMtoCores(0x60,  64),
				new sSMtoCores(0x61, 128),
				new sSMtoCores(0x62, 128),
				new sSMtoCores(0x70,  64),
				new sSMtoCores(0x72,  64),
				new sSMtoCores(0x75,  64),
				new sSMtoCores(0x80,  64),
				new sSMtoCores(0x86, 128),
				new sSMtoCores(0x87, 128),
				new sSMtoCores(-1, -1)
			};
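			// e.g. compute capability 8.6 packs to (8 << 4) + 6 = 0x86, which maps to 128 cores per SM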

			int index = 0;

			while (nGpuArchCoresPerSM[index].SM != -1)
			{
				if (nGpuArchCoresPerSM[index].SM == ((major << 4) + minor))
					return nGpuArchCoresPerSM[index].Cores;
				index++;
			}

			// If the SM version is not in the table, fall back to the last known
			// entry so newer architectures can still run
			return nGpuArchCoresPerSM[index - 1].Cores;
		}
		// end of GPU Architecture definitions

		#region Private_Decl
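		// Extracts an embedded, gzip-compressed native DLL beside the managed assembly on disk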
		private static void SaveDll(string name)
		{
			var assembly = typeof(Cuda).Assembly;
			var compressedName = name + ".gz";
			foreach(var resourceName in assembly.GetManifestResourceNames())
			{
				if(resourceName.EndsWith(compressedName))
				{
					using var compressedStream = assembly.GetManifestResourceStream(resourceName);
					using var stream = new GZipStream(compressedStream!, CompressionMode.Decompress);
					using (var fs = File.Create(Path.Combine(Path.GetDirectoryName(assembly.Location)!, name)))
					{
						stream.CopyTo(fs);
					}
					break;
				}
			}
		}

		[DllImport("nvcuda", CallingConvention=CallingConvention.StdCall)]
		private static extern CUresult cuInit(uint Flags);

		[DllImport("nvcuda", CallingConvention=CallingConvention.StdCall)]
		private static extern CUresult cuMemGetInfo(out ulong free, out ulong total);
		
		[DllImport("nvcuda", CallingConvention=CallingConvention.StdCall)]
		private static extern CUresult cuDeviceGetCount(out int count);

		[DllImport("nvcuda", CallingConvention=CallingConvention.StdCall)]
		private static extern CUresult cuDeviceGetAttribute(out int pi, CUdevice_attribute attrib, int dev);

		[DllImport("nvcuda", CallingConvention=CallingConvention.StdCall)]
		private static extern CUresult cuDeviceGetName(byte[] name, int len, int dev);

		[DllImport("nvcuda", CallingConvention=CallingConvention.StdCall)]
		private static extern CUresult cuDriverGetVersion(out int driverVersion);

		[DllImport("nvcuda", CallingConvention=CallingConvention.StdCall)]
		private static extern CUresult cuDeviceTotalMem_v2(out ulong bytes, int dev);

		[DllImport("nvcuda", CallingConvention=CallingConvention.StdCall)]
		private static extern CUresult cuDeviceCanAccessPeer(out int canAccessPeer, int dev, int peerDev);

		[DllImport("nvcuda", CallingConvention=CallingConvention.StdCall)]
		private static extern CUresult cuMemAlloc(out ulong dptr, ulong bytesize);

		[DllImport("nvcuda", CallingConvention=CallingConvention.StdCall)]
		private static extern CUresult cuMemFree(ulong dptr);


		[DllImport("nvcuda", CallingConvention=CallingConvention.StdCall)]
		private static extern CUresult cuMemcpyHtoD(ulong dstDevice, byte[] srcHost, ulong ByteCount);

		[DllImport("nvcuda", CallingConvention=CallingConvention.StdCall)]
		private static extern CUresult cuMemcpyHtoD(ulong dstDevice, ushort[] srcHost, ulong ByteCount);

		[DllImport("nvcuda", CallingConvention=CallingConvention.StdCall)]
		private static extern CUresult cuMemcpyHtoD(ulong dstDevice, uint[] srcHost, ulong ByteCount);

		[DllImport("nvcuda", CallingConvention=CallingConvention.StdCall)]
		private static extern CUresult cuMemcpyHtoD(ulong dstDevice, ulong[] srcHost, ulong ByteCount);

		[DllImport("nvcuda", CallingConvention=CallingConvention.StdCall)]
		private static extern CUresult cuMemcpyHtoD(ulong dstDevice, short[] srcHost, ulong ByteCount);

		[DllImport("nvcuda", CallingConvention=CallingConvention.StdCall)]
		private static extern CUresult cuMemcpyHtoD(ulong dstDevice, int[] srcHost, ulong ByteCount);

		[DllImport("nvcuda", CallingConvention=CallingConvention.StdCall)]
		private static extern CUresult cuMemcpyHtoD(ulong dstDevice, long[] srcHost, ulong ByteCount);

		[DllImport("nvcuda", CallingConvention=CallingConvention.StdCall)]
		private static extern CUresult cuMemcpyHtoD(ulong dstDevice, float[] srcHost, ulong ByteCount);

		[DllImport("nvcuda", CallingConvention=CallingConvention.StdCall)]
		private static extern CUresult cuMemcpyHtoD(ulong dstDevice, double[] srcHost, ulong ByteCount);


		[DllImport("nvcuda", CallingConvention=CallingConvention.StdCall)]
		private static extern CUresult cuMemcpyDtoH(byte[] dstHost, ulong srcDevice, ulong ByteCount);

		[DllImport("nvcuda", CallingConvention=CallingConvention.StdCall)]
		private static extern CUresult cuMemcpyDtoH(ushort[] dstHost, ulong srcDevice, ulong ByteCount);

		[DllImport("nvcuda", CallingConvention=CallingConvention.StdCall)]
		private static extern CUresult cuMemcpyDtoH(uint[] dstHost, ulong srcDevice, ulong ByteCount);

		[DllImport("nvcuda", CallingConvention=CallingConvention.StdCall)]
		private static extern CUresult cuMemcpyDtoH(ulong[] dstHost, ulong srcDevice, ulong ByteCount);

		[DllImport("nvcuda", CallingConvention=CallingConvention.StdCall)]
		private static extern CUresult cuMemcpyDtoH(short[] dstHost, ulong srcDevice, ulong ByteCount);

		[DllImport("nvcuda", CallingConvention=CallingConvention.StdCall)]
		private static extern CUresult cuMemcpyDtoH(int[] dstHost, ulong srcDevice, ulong ByteCount);

		[DllImport("nvcuda", CallingConvention=CallingConvention.StdCall)]
		private static extern CUresult cuMemcpyDtoH(long[] dstHost, ulong srcDevice, ulong ByteCount);

		[DllImport("nvcuda", CallingConvention=CallingConvention.StdCall)]
		private static extern CUresult cuMemcpyDtoH(float[] dstHost, ulong srcDevice, ulong ByteCount);

		[DllImport("nvcuda", CallingConvention=CallingConvention.StdCall)]
		private static extern CUresult cuMemcpyDtoH(double[] dstHost, ulong srcDevice, ulong ByteCount);


		[DllImport("nvcuda", CallingConvention = CallingConvention.StdCall)]
		private static extern CUresult cuMemcpyDtoD(ulong dstDevice, ulong srcDevice, ulong ByteCount);


		[DllImport("nvrtc64_112_0", CallingConvention=CallingConvention.Cdecl)]
		private static extern nvrtcResult nvrtcCreateProgram(out ulong prog, string src, string name, int numHeaders, out ulong headers, out ulong includeNames);

		[DllImport("nvrtc64_112_0", CallingConvention=CallingConvention.Cdecl)]
		private static extern nvrtcResult nvrtcCompileProgram(ulong prog, int numOptions, string[] options);

		[DllImport("nvrtc64_112_0", CallingConvention=CallingConvention.Cdecl)]
		private static extern nvrtcResult nvrtcGetCUBINSize(ulong prog, out ulong cubinSizeRet);

		[DllImport("nvrtc64_112_0", CallingConvention=CallingConvention.Cdecl)]
		private static extern nvrtcResult nvrtcGetCUBIN(ulong prog, byte[] cubin);


		[DllImport("nvcuda", CallingConvention=CallingConvention.StdCall)]
		private static extern CUresult cuDeviceGet(out int device, int ordinal);

		[DllImport("nvcuda", CallingConvention=CallingConvention.StdCall)]
		private static extern CUresult cuCtxCreate(out ulong pctx, uint flags, int dev);

		[DllImport("nvcuda", CallingConvention = CallingConvention.StdCall)]
		private static extern CUresult cuCtxSynchronize();

		[DllImport("nvcuda", CallingConvention = CallingConvention.StdCall)]
		private static extern CUresult cuCtxDestroy(ulong ctx);

		[DllImport("nvcuda", CallingConvention=CallingConvention.StdCall)]
		private static extern CUresult cuLaunchKernel(IntPtr func, uint gridDimX, uint gridDimY, uint gridDimZ, uint blockDimX, uint blockDimY, uint blockDimZ, uint sharedMemBytes, IntPtr hStream, IntPtr[] kernelParams, IntPtr[] extra);

		[DllImport("nvcuda", CallingConvention=CallingConvention.StdCall)]
		private static extern CUresult cuModuleLoadData(out ulong module, byte[] image);

		[DllImport("nvcuda", CallingConvention = CallingConvention.StdCall)]
		private static extern CUresult cuModuleGetFunction(out IntPtr hfunc, ulong hmod, string name);

		#endregion
	}
}

// CudaEnum.cs

namespace CudaForCSharp
{
	enum CUresult
	{
		/**
		 * The API call returned with no errors. In the case of query calls, this
		 * also means that the operation being queried is complete (see
		 * ::cuEventQuery() and ::cuStreamQuery()).
		 */
		CUDA_SUCCESS = 0,

		/**
		 * This indicates that one or more of the parameters passed to the API call
		 * is not within an acceptable range of values.
		 */
		CUDA_ERROR_INVALID_VALUE = 1,

		/**
		 * The API call failed because it was unable to allocate enough memory to
		 * perform the requested operation.
		 */
		CUDA_ERROR_OUT_OF_MEMORY = 2,

		/**
		 * This indicates that the CUDA driver has not been initialized with
		 * ::cuInit() or that initialization has failed.
		 */
		CUDA_ERROR_NOT_INITIALIZED = 3,

		/**
		 * This indicates that the CUDA driver is in the process of shutting down.
		 */
		CUDA_ERROR_DEINITIALIZED = 4,

		/**
		 * This indicates profiler is not initialized for this run. This can
		 * happen when the application is running with external profiling tools
		 * like visual profiler.
		 */
		CUDA_ERROR_PROFILER_DISABLED = 5,

		/**
		 * \deprecated
		 * This error return is deprecated as of CUDA 5.0. It is no longer an error
		 * to attempt to enable/disable the profiling via ::cuProfilerStart or
		 * ::cuProfilerStop without initialization.
		 */
		CUDA_ERROR_PROFILER_NOT_INITIALIZED = 6,

		/**
		 * \deprecated
		 * This error return is deprecated as of CUDA 5.0. It is no longer an error
		 * to call cuProfilerStart() when profiling is already enabled.
		 */
		CUDA_ERROR_PROFILER_ALREADY_STARTED = 7,

		/**
		 * \deprecated
		 * This error return is deprecated as of CUDA 5.0. It is no longer an error
		 * to call cuProfilerStop() when profiling is already disabled.
		 */
		CUDA_ERROR_PROFILER_ALREADY_STOPPED = 8,

		/**
		 * This indicates that the CUDA driver that the application has loaded is a
		 * stub library. Applications that run with the stub rather than a real
		 * driver loaded will result in CUDA API returning this error.
		 */
		CUDA_ERROR_STUB_LIBRARY = 34,

		/**
		 * This indicates that no CUDA-capable devices were detected by the installed
		 * CUDA driver.
		 */
		CUDA_ERROR_NO_DEVICE = 100,

		/**
		 * This indicates that the device ordinal supplied by the user does not
		 * correspond to a valid CUDA device or that the action requested is
		 * invalid for the specified device.
		 */
		CUDA_ERROR_INVALID_DEVICE = 101,

		/**
		 * This error indicates that the Grid license is not applied.
		 */
		CUDA_ERROR_DEVICE_NOT_LICENSED = 102,

		/**
		 * This indicates that the device kernel image is invalid. This can also
		 * indicate an invalid CUDA module.
		 */
		CUDA_ERROR_INVALID_IMAGE = 200,

		/**
		 * This most frequently indicates that there is no context bound to the
		 * current thread. This can also be returned if the context passed to an
		 * API call is not a valid handle (such as a context that has had
		 * ::cuCtxDestroy() invoked on it). This can also be returned if a user
		 * mixes different API versions (i.e. 3010 context with 3020 API calls).
		 * See ::cuCtxGetApiVersion() for more details.
		 */
		CUDA_ERROR_INVALID_CONTEXT = 201,

		/**
		 * This indicated that the context being supplied as a parameter to the
		 * API call was already the active context.
		 * \deprecated
		 * This error return is deprecated as of CUDA 3.2. It is no longer an
		 * error to attempt to push the active context via ::cuCtxPushCurrent().
		 */
		CUDA_ERROR_CONTEXT_ALREADY_CURRENT = 202,

		/**
		 * This indicates that a map or register operation has failed.
		 */
		CUDA_ERROR_MAP_FAILED = 205,

		/**
		 * This indicates that an unmap or unregister operation has failed.
		 */
		CUDA_ERROR_UNMAP_FAILED = 206,

		/**
		 * This indicates that the specified array is currently mapped and thus
		 * cannot be destroyed.
		 */
		CUDA_ERROR_ARRAY_IS_MAPPED = 207,

		/**
		 * This indicates that the resource is already mapped.
		 */
		CUDA_ERROR_ALREADY_MAPPED = 208,

		/**
		 * This indicates that there is no kernel image available that is suitable
		 * for the device. This can occur when a user specifies code generation
		 * options for a particular CUDA source file that do not include the
		 * corresponding device configuration.
		 */
		CUDA_ERROR_NO_BINARY_FOR_GPU = 209,

		/**
		 * This indicates that a resource has already been acquired.
		 */
		CUDA_ERROR_ALREADY_ACQUIRED = 210,

		/**
		 * This indicates that a resource is not mapped.
		 */
		CUDA_ERROR_NOT_MAPPED = 211,

		/**
		 * This indicates that a mapped resource is not available for access as an
		 * array.
		 */
		CUDA_ERROR_NOT_MAPPED_AS_ARRAY = 212,

		/**
		 * This indicates that a mapped resource is not available for access as a
		 * pointer.
		 */
		CUDA_ERROR_NOT_MAPPED_AS_POINTER = 213,

		/**
		 * This indicates that an uncorrectable ECC error was detected during
		 * execution.
		 */
		CUDA_ERROR_ECC_UNCORRECTABLE = 214,

		/**
		 * This indicates that the ::CUlimit passed to the API call is not
		 * supported by the active device.
		 */
		CUDA_ERROR_UNSUPPORTED_LIMIT = 215,

		/**
		 * This indicates that the ::CUcontext passed to the API call can
		 * only be bound to a single CPU thread at a time but is already
		 * bound to a CPU thread.
		 */
		CUDA_ERROR_CONTEXT_ALREADY_IN_USE = 216,

		/**
		 * This indicates that peer access is not supported across the given
		 * devices.
		 */
		CUDA_ERROR_PEER_ACCESS_UNSUPPORTED = 217,

		/**
		 * This indicates that a PTX JIT compilation failed.
		 */
		CUDA_ERROR_INVALID_PTX = 218,

		/**
		 * This indicates an error with OpenGL or DirectX context.
		 */
		CUDA_ERROR_INVALID_GRAPHICS_CONTEXT = 219,

		/**
		* This indicates that an uncorrectable NVLink error was detected during the
		* execution.
		*/
		CUDA_ERROR_NVLINK_UNCORRECTABLE = 220,

		/**
		* This indicates that the PTX JIT compiler library was not found.
		*/
		CUDA_ERROR_JIT_COMPILER_NOT_FOUND = 221,

		/**
		 * This indicates that the provided PTX was compiled with an unsupported toolchain.
		 */

		CUDA_ERROR_UNSUPPORTED_PTX_VERSION = 222,

		/**
		 * This indicates that the PTX JIT compilation was disabled.
		 */
		CUDA_ERROR_JIT_COMPILATION_DISABLED = 223,

		/**
		 * This indicates that the ::CUexecAffinityType passed to the API call is not
		 * supported by the active device.
		 */
		CUDA_ERROR_UNSUPPORTED_EXEC_AFFINITY = 224,

		/**
		 * This indicates that the device kernel source is invalid. This includes
		 * compilation/linker errors encountered in device code or user error.
		 */
		CUDA_ERROR_INVALID_SOURCE = 300,

		/**
		 * This indicates that the file specified was not found.
		 */
		CUDA_ERROR_FILE_NOT_FOUND = 301,

		/**
		 * This indicates that a link to a shared object failed to resolve.
		 */
		CUDA_ERROR_SHARED_OBJECT_SYMBOL_NOT_FOUND = 302,

		/**
		 * This indicates that initialization of a shared object failed.
		 */
		CUDA_ERROR_SHARED_OBJECT_INIT_FAILED = 303,

		/**
		 * This indicates that an OS call failed.
		 */
		CUDA_ERROR_OPERATING_SYSTEM = 304,

		/**
		 * This indicates that a resource handle passed to the API call was not
		 * valid. Resource handles are opaque types like ::CUstream and ::CUevent.
		 */
		CUDA_ERROR_INVALID_HANDLE = 400,

		/**
		 * This indicates that a resource required by the API call is not in a
		 * valid state to perform the requested operation.
		 */
		CUDA_ERROR_ILLEGAL_STATE = 401,

		/**
		 * This indicates that a named symbol was not found. Examples of symbols
		 * are global/constant variable names, driver function names, texture names,
		 * and surface names.
		 */
		CUDA_ERROR_NOT_FOUND = 500,

		/**
		 * This indicates that asynchronous operations issued previously have not
		 * completed yet. This result is not actually an error, but must be indicated
		 * differently than ::CUDA_SUCCESS (which indicates completion). Calls that
		 * may return this value include ::cuEventQuery() and ::cuStreamQuery().
		 */
		CUDA_ERROR_NOT_READY = 600,

		/**
		 * While executing a kernel, the device encountered a
		 * load or store instruction on an invalid memory address.
		 * This leaves the process in an inconsistent state and any further CUDA work
		 * will return the same error. To continue using CUDA, the process must be terminated
		 * and relaunched.
		 */
		CUDA_ERROR_ILLEGAL_ADDRESS = 700,

		/**
		 * This indicates that a launch did not occur because it did not have
		 * appropriate resources. This error usually indicates that the user has
		 * attempted to pass too many arguments to the device kernel, or the
		 * kernel launch specifies too many threads for the kernel's register
		 * count. Passing arguments of the wrong size (i.e. a 64-bit pointer
		 * when a 32-bit int is expected) is equivalent to passing too many
		 * arguments and can also result in this error.
		 */
		CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES = 701,

		/**
		 * This indicates that the device kernel took too long to execute. This can
		 * only occur if timeouts are enabled - see the device attribute
		 * ::CU_DEVICE_ATTRIBUTE_KERNEL_EXEC_TIMEOUT for more information.
		 * This leaves the process in an inconsistent state and any further CUDA work
		 * will return the same error. To continue using CUDA, the process must be terminated
		 * and relaunched.
		 */
		CUDA_ERROR_LAUNCH_TIMEOUT = 702,

		/**
		 * This error indicates a kernel launch that uses an incompatible texturing
		 * mode.
		 */
		CUDA_ERROR_LAUNCH_INCOMPATIBLE_TEXTURING = 703,

		/**
		 * This error indicates that a call to ::cuCtxEnablePeerAccess() is
		 * trying to re-enable peer access to a context which has already
		 * had peer access to it enabled.
		 */
		CUDA_ERROR_PEER_ACCESS_ALREADY_ENABLED = 704,

		/**
		 * This error indicates that ::cuCtxDisablePeerAccess() is
		 * trying to disable peer access which has not been enabled yet
		 * via ::cuCtxEnablePeerAccess().
		 */
		CUDA_ERROR_PEER_ACCESS_NOT_ENABLED = 705,

		/**
		 * This error indicates that the primary context for the specified device
		 * has already been initialized.
		 */
		CUDA_ERROR_PRIMARY_CONTEXT_ACTIVE = 708,

		/**
		 * This error indicates that the context current to the calling thread
		 * has been destroyed using ::cuCtxDestroy, or is a primary context which
		 * has not yet been initialized.
		 */
		CUDA_ERROR_CONTEXT_IS_DESTROYED = 709,

		/**
		 * A device-side assert triggered during kernel execution. The context
		 * cannot be used anymore, and must be destroyed. All existing device
		 * memory allocations from this context are invalid and must be
		 * reconstructed if the program is to continue using CUDA.
		 */
		CUDA_ERROR_ASSERT = 710,

		/**
		 * This error indicates that the hardware resources required to enable
		 * peer access have been exhausted for one or more of the devices
		 * passed to ::cuCtxEnablePeerAccess().
		 */
		CUDA_ERROR_TOO_MANY_PEERS = 711,

		/**
		 * This error indicates that the memory range passed to ::cuMemHostRegister()
		 * has already been registered.
		 */
		CUDA_ERROR_HOST_MEMORY_ALREADY_REGISTERED = 712,

		/**
		 * This error indicates that the pointer passed to ::cuMemHostUnregister()
		 * does not correspond to any currently registered memory region.
		 */
		CUDA_ERROR_HOST_MEMORY_NOT_REGISTERED = 713,

		/**
		 * While executing a kernel, the device encountered a stack error.
		 * This can be due to stack corruption or exceeding the stack size limit.
		 * This leaves the process in an inconsistent state and any further CUDA work
		 * will return the same error. To continue using CUDA, the process must be terminated
		 * and relaunched.
		 */
		CUDA_ERROR_HARDWARE_STACK_ERROR = 714,

		/**
		 * While executing a kernel, the device encountered an illegal instruction.
		 * This leaves the process in an inconsistent state and any further CUDA work
		 * will return the same error. To continue using CUDA, the process must be terminated
		 * and relaunched.
		 */
		CUDA_ERROR_ILLEGAL_INSTRUCTION = 715,

		/**
		 * While executing a kernel, the device encountered a load or store instruction
		 * on a memory address which is not aligned.
		 * This leaves the process in an inconsistent state and any further CUDA work
		 * will return the same error. To continue using CUDA, the process must be terminated
		 * and relaunched.
		 */
		CUDA_ERROR_MISALIGNED_ADDRESS = 716,

		/**
		 * While executing a kernel, the device encountered an instruction
		 * which can only operate on memory locations in certain address spaces
		 * (global, shared, or local), but was supplied a memory address not
		 * belonging to an allowed address space.
		 * This leaves the process in an inconsistent state and any further CUDA work
		 * will return the same error. To continue using CUDA, the process must be terminated
		 * and relaunched.
		 */
		CUDA_ERROR_INVALID_ADDRESS_SPACE = 717,

		/**
		 * While executing a kernel, the device program counter wrapped its address space.
		 * This leaves the process in an inconsistent state and any further CUDA work
		 * will return the same error. To continue using CUDA, the process must be terminated
		 * and relaunched.
		 */
		CUDA_ERROR_INVALID_PC = 718,

		/**
		 * An exception occurred on the device while executing a kernel. Common
		 * causes include dereferencing an invalid device pointer and accessing
		 * out of bounds shared memory. Less common cases can be system specific - more
		 * information about these cases can be found in the system specific user guide.
		 * This leaves the process in an inconsistent state and any further CUDA work
		 * will return the same error. To continue using CUDA, the process must be terminated
		 * and relaunched.
		 */
		CUDA_ERROR_LAUNCH_FAILED = 719,

		/**
		 * This error indicates that the number of blocks launched per grid for a kernel that was
		 * launched via either ::cuLaunchCooperativeKernel or ::cuLaunchCooperativeKernelMultiDevice
		 * exceeds the maximum number of blocks as allowed by ::cuOccupancyMaxActiveBlocksPerMultiprocessor
		 * or ::cuOccupancyMaxActiveBlocksPerMultiprocessorWithFlags times the number of multiprocessors
		 * as specified by the device attribute ::CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT.
		 */
		CUDA_ERROR_COOPERATIVE_LAUNCH_TOO_LARGE = 720,

		/**
		 * This error indicates that the attempted operation is not permitted.
		 */
		CUDA_ERROR_NOT_PERMITTED = 800,

		/**
		 * This error indicates that the attempted operation is not supported
		 * on the current system or device.
		 */
		CUDA_ERROR_NOT_SUPPORTED = 801,

		/**
		 * This error indicates that the system is not yet ready to start any CUDA
		 * work.  To continue using CUDA, verify the system configuration is in a
		 * valid state and all required driver daemons are actively running.
		 * More information about this error can be found in the system specific
		 * user guide.
		 */
		CUDA_ERROR_SYSTEM_NOT_READY = 802,

		/**
		 * This error indicates that there is a mismatch between the versions of
		 * the display driver and the CUDA driver. Refer to the compatibility documentation
		 * for supported versions.
		 */
		CUDA_ERROR_SYSTEM_DRIVER_MISMATCH = 803,

		/**
		 * This error indicates that the system was upgraded to run with forward compatibility
		 * but the visible hardware detected by CUDA does not support this configuration.
		 * Refer to the compatibility documentation for the supported hardware matrix or ensure
		 * that only supported hardware is visible during initialization via the CUDA_VISIBLE_DEVICES
		 * environment variable.
		 */
		CUDA_ERROR_COMPAT_NOT_SUPPORTED_ON_DEVICE = 804,

		/**
		 * This error indicates that the MPS client failed to connect to the MPS control daemon or the MPS server.
		 */
		CUDA_ERROR_MPS_CONNECTION_FAILED = 805,

		/**
		 * This error indicates that the remote procedural call between the MPS server and the MPS client failed.
		 */
		CUDA_ERROR_MPS_RPC_FAILURE = 806,

		/**
		 * This error indicates that the MPS server is not ready to accept new MPS client requests.
		 * This error can be returned when the MPS server is in the process of recovering from a fatal failure.
		 */
		CUDA_ERROR_MPS_SERVER_NOT_READY = 807,

		/**
		 * This error indicates that the hardware resources required to create MPS client have been exhausted.
		 */
		CUDA_ERROR_MPS_MAX_CLIENTS_REACHED = 808,

		/**
		 * This error indicates that the hardware resources required to support device connections have been exhausted.
		 */
		CUDA_ERROR_MPS_MAX_CONNECTIONS_REACHED = 809,

		/**
		 * This error indicates that the operation is not permitted when
		 * the stream is capturing.
		 */
		CUDA_ERROR_STREAM_CAPTURE_UNSUPPORTED = 900,

		/**
		 * This error indicates that the current capture sequence on the stream
		 * has been invalidated due to a previous error.
		 */
		CUDA_ERROR_STREAM_CAPTURE_INVALIDATED = 901,

		/**
		 * This error indicates that the operation would have resulted in a merge
		 * of two independent capture sequences.
		 */
		CUDA_ERROR_STREAM_CAPTURE_MERGE = 902,

		/**
		 * This error indicates that the capture was not initiated in this stream.
		 */
		CUDA_ERROR_STREAM_CAPTURE_UNMATCHED = 903,

		/**
		 * This error indicates that the capture sequence contains a fork that was
		 * not joined to the primary stream.
		 */
		CUDA_ERROR_STREAM_CAPTURE_UNJOINED = 904,

		/**
		 * This error indicates that a dependency would have been created which
		 * crosses the capture sequence boundary. Only implicit in-stream ordering
		 * dependencies are allowed to cross the boundary.
		 */
		CUDA_ERROR_STREAM_CAPTURE_ISOLATION = 905,

		/**
		 * This error indicates a disallowed implicit dependency on a current capture
		 * sequence from cudaStreamLegacy.
		 */
		CUDA_ERROR_STREAM_CAPTURE_IMPLICIT = 906,

		/**
		 * This error indicates that the operation is not permitted on an event which
		 * was last recorded in a capturing stream.
		 */
		CUDA_ERROR_CAPTURED_EVENT = 907,

		/**
		 * A stream capture sequence not initiated with the ::CU_STREAM_CAPTURE_MODE_RELAXED
		 * argument to ::cuStreamBeginCapture was passed to ::cuStreamEndCapture in a
		 * different thread.
		 */
		CUDA_ERROR_STREAM_CAPTURE_WRONG_THREAD = 908,

		/**
		 * This error indicates that the timeout specified for the wait operation has lapsed.
		 */
		CUDA_ERROR_TIMEOUT = 909,

		/**
		 * This error indicates that the graph update was not performed because it included 
		 * changes which violated constraints specific to instantiated graph update.
		 */
		CUDA_ERROR_GRAPH_EXEC_UPDATE_FAILURE = 910,

		/**
		 * This indicates that an async error has occurred in a device outside of CUDA.
		 * If CUDA was waiting for an external device's signal before consuming shared data,
		 * the external device signaled an error indicating that the data is not valid for
		 * consumption. This leaves the process in an inconsistent state and any further CUDA
		 * work will return the same error. To continue using CUDA, the process must be
		 * terminated and relaunched.
		 */
		CUDA_ERROR_EXTERNAL_DEVICE = 911,

		/**
		 * This indicates that an unknown internal error has occurred.
		 */
		CUDA_ERROR_UNKNOWN = 999
	}

	public enum CUcomputemode
	{
		CU_COMPUTEMODE_DEFAULT = 0, /**< Default compute mode (Multiple contexts allowed per device) */
		CU_COMPUTEMODE_EXCLUSIVE = 1, /**< Compute-exclusive-thread mode (Only one context used by a single thread can be present on this device at a time). Deprecated. */
		CU_COMPUTEMODE_PROHIBITED = 2, /**< Compute-prohibited mode (No contexts can be created on this device at this time) */
		CU_COMPUTEMODE_EXCLUSIVE_PROCESS = 3  /**< Compute-exclusive-process mode (Only one context used by a single process can be present on this device at a time) */
	}

	public enum CUdevice_attribute
	{
		CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK = 1,						  /**< Maximum number of threads per block */
		CU_DEVICE_ATTRIBUTE_MAX_BLOCK_DIM_X = 2,								/**< Maximum block dimension X */
		CU_DEVICE_ATTRIBUTE_MAX_BLOCK_DIM_Y = 3,								/**< Maximum block dimension Y */
		CU_DEVICE_ATTRIBUTE_MAX_BLOCK_DIM_Z = 4,								/**< Maximum block dimension Z */
		CU_DEVICE_ATTRIBUTE_MAX_GRID_DIM_X = 5,								 /**< Maximum grid dimension X */
		CU_DEVICE_ATTRIBUTE_MAX_GRID_DIM_Y = 6,								 /**< Maximum grid dimension Y */
		CU_DEVICE_ATTRIBUTE_MAX_GRID_DIM_Z = 7,								 /**< Maximum grid dimension Z */
		CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_BLOCK = 8,					/**< Maximum shared memory available per block in bytes */
		CU_DEVICE_ATTRIBUTE_SHARED_MEMORY_PER_BLOCK = 8,						/**< Deprecated, use CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_BLOCK */
		CU_DEVICE_ATTRIBUTE_TOTAL_CONSTANT_MEMORY = 9,						  /**< Memory available on device for __constant__ variables in a CUDA C kernel in bytes */
		CU_DEVICE_ATTRIBUTE_WARP_SIZE = 10,									 /**< Warp size in threads */
		CU_DEVICE_ATTRIBUTE_MAX_PITCH = 11,									 /**< Maximum pitch in bytes allowed by memory copies */
		CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_BLOCK = 12,					   /**< Maximum number of 32-bit registers available per block */
		CU_DEVICE_ATTRIBUTE_REGISTERS_PER_BLOCK = 12,						   /**< Deprecated, use CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_BLOCK */
		CU_DEVICE_ATTRIBUTE_CLOCK_RATE = 13,									/**< Typical clock frequency in kilohertz */
		CU_DEVICE_ATTRIBUTE_TEXTURE_ALIGNMENT = 14,							 /**< Alignment requirement for textures */
		CU_DEVICE_ATTRIBUTE_GPU_OVERLAP = 15,								   /**< Device can possibly copy memory and execute a kernel concurrently. Deprecated. Use instead CU_DEVICE_ATTRIBUTE_ASYNC_ENGINE_COUNT. */
		CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT = 16,						  /**< Number of multiprocessors on device */
		CU_DEVICE_ATTRIBUTE_KERNEL_EXEC_TIMEOUT = 17,						   /**< Specifies whether there is a run time limit on kernels */
		CU_DEVICE_ATTRIBUTE_INTEGRATED = 18,									/**< Device is integrated with host memory */
		CU_DEVICE_ATTRIBUTE_CAN_MAP_HOST_MEMORY = 19,						   /**< Device can map host memory into CUDA address space */
		CU_DEVICE_ATTRIBUTE_COMPUTE_MODE = 20,								  /**< Compute mode (See ::CUcomputemode for details) */
		CU_DEVICE_ATTRIBUTE_MAXIMUM_TEXTURE1D_WIDTH = 21,					   /**< Maximum 1D texture width */
		CU_DEVICE_ATTRIBUTE_MAXIMUM_TEXTURE2D_WIDTH = 22,					   /**< Maximum 2D texture width */
		CU_DEVICE_ATTRIBUTE_MAXIMUM_TEXTURE2D_HEIGHT = 23,					  /**< Maximum 2D texture height */
		CU_DEVICE_ATTRIBUTE_MAXIMUM_TEXTURE3D_WIDTH = 24,					   /**< Maximum 3D texture width */
		CU_DEVICE_ATTRIBUTE_MAXIMUM_TEXTURE3D_HEIGHT = 25,					  /**< Maximum 3D texture height */
		CU_DEVICE_ATTRIBUTE_MAXIMUM_TEXTURE3D_DEPTH = 26,					   /**< Maximum 3D texture depth */
		CU_DEVICE_ATTRIBUTE_MAXIMUM_TEXTURE2D_LAYERED_WIDTH = 27,			   /**< Maximum 2D layered texture width */
		CU_DEVICE_ATTRIBUTE_MAXIMUM_TEXTURE2D_LAYERED_HEIGHT = 28,			  /**< Maximum 2D layered texture height */
		CU_DEVICE_ATTRIBUTE_MAXIMUM_TEXTURE2D_LAYERED_LAYERS = 29,			  /**< Maximum layers in a 2D layered texture */
		CU_DEVICE_ATTRIBUTE_MAXIMUM_TEXTURE2D_ARRAY_WIDTH = 27,				 /**< Deprecated, use CU_DEVICE_ATTRIBUTE_MAXIMUM_TEXTURE2D_LAYERED_WIDTH */
		CU_DEVICE_ATTRIBUTE_MAXIMUM_TEXTURE2D_ARRAY_HEIGHT = 28,				/**< Deprecated, use CU_DEVICE_ATTRIBUTE_MAXIMUM_TEXTURE2D_LAYERED_HEIGHT */
		CU_DEVICE_ATTRIBUTE_MAXIMUM_TEXTURE2D_ARRAY_NUMSLICES = 29,			 /**< Deprecated, use CU_DEVICE_ATTRIBUTE_MAXIMUM_TEXTURE2D_LAYERED_LAYERS */
		CU_DEVICE_ATTRIBUTE_SURFACE_ALIGNMENT = 30,							 /**< Alignment requirement for surfaces */
		CU_DEVICE_ATTRIBUTE_CONCURRENT_KERNELS = 31,							/**< Device can possibly execute multiple kernels concurrently */
		CU_DEVICE_ATTRIBUTE_ECC_ENABLED = 32,								   /**< Device has ECC support enabled */
		CU_DEVICE_ATTRIBUTE_PCI_BUS_ID = 33,									/**< PCI bus ID of the device */
		CU_DEVICE_ATTRIBUTE_PCI_DEVICE_ID = 34,								 /**< PCI device ID of the device */
		CU_DEVICE_ATTRIBUTE_TCC_DRIVER = 35,									/**< Device is using TCC driver model */
		CU_DEVICE_ATTRIBUTE_MEMORY_CLOCK_RATE = 36,							 /**< Peak memory clock frequency in kilohertz */
		CU_DEVICE_ATTRIBUTE_GLOBAL_MEMORY_BUS_WIDTH = 37,					   /**< Global memory bus width in bits */
		CU_DEVICE_ATTRIBUTE_L2_CACHE_SIZE = 38,								 /**< Size of L2 cache in bytes */
		CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_MULTIPROCESSOR = 39,				/**< Maximum resident threads per multiprocessor */
		CU_DEVICE_ATTRIBUTE_ASYNC_ENGINE_COUNT = 40,							/**< Number of asynchronous engines */
		CU_DEVICE_ATTRIBUTE_UNIFIED_ADDRESSING = 41,							/**< Device shares a unified address space with the host */
		CU_DEVICE_ATTRIBUTE_MAXIMUM_TEXTURE1D_LAYERED_WIDTH = 42,			   /**< Maximum 1D layered texture width */
		CU_DEVICE_ATTRIBUTE_MAXIMUM_TEXTURE1D_LAYERED_LAYERS = 43,			  /**< Maximum layers in a 1D layered texture */
		CU_DEVICE_ATTRIBUTE_CAN_TEX2D_GATHER = 44,							  /**< Deprecated, do not use. */
		CU_DEVICE_ATTRIBUTE_MAXIMUM_TEXTURE2D_GATHER_WIDTH = 45,				/**< Maximum 2D texture width if CUDA_ARRAY3D_TEXTURE_GATHER is set */
		CU_DEVICE_ATTRIBUTE_MAXIMUM_TEXTURE2D_GATHER_HEIGHT = 46,			   /**< Maximum 2D texture height if CUDA_ARRAY3D_TEXTURE_GATHER is set */
		CU_DEVICE_ATTRIBUTE_MAXIMUM_TEXTURE3D_WIDTH_ALTERNATE = 47,			 /**< Alternate maximum 3D texture width */
		CU_DEVICE_ATTRIBUTE_MAXIMUM_TEXTURE3D_HEIGHT_ALTERNATE = 48,			/**< Alternate maximum 3D texture height */
		CU_DEVICE_ATTRIBUTE_MAXIMUM_TEXTURE3D_DEPTH_ALTERNATE = 49,			 /**< Alternate maximum 3D texture depth */
		CU_DEVICE_ATTRIBUTE_PCI_DOMAIN_ID = 50,								 /**< PCI domain ID of the device */
		CU_DEVICE_ATTRIBUTE_TEXTURE_PITCH_ALIGNMENT = 51,					   /**< Pitch alignment requirement for textures */
		CU_DEVICE_ATTRIBUTE_MAXIMUM_TEXTURECUBEMAP_WIDTH = 52,				  /**< Maximum cubemap texture width/height */
		CU_DEVICE_ATTRIBUTE_MAXIMUM_TEXTURECUBEMAP_LAYERED_WIDTH = 53,		  /**< Maximum cubemap layered texture width/height */
		CU_DEVICE_ATTRIBUTE_MAXIMUM_TEXTURECUBEMAP_LAYERED_LAYERS = 54,		 /**< Maximum layers in a cubemap layered texture */
		CU_DEVICE_ATTRIBUTE_MAXIMUM_SURFACE1D_WIDTH = 55,					   /**< Maximum 1D surface width */
		CU_DEVICE_ATTRIBUTE_MAXIMUM_SURFACE2D_WIDTH = 56,					   /**< Maximum 2D surface width */
		CU_DEVICE_ATTRIBUTE_MAXIMUM_SURFACE2D_HEIGHT = 57,					  /**< Maximum 2D surface height */
		CU_DEVICE_ATTRIBUTE_MAXIMUM_SURFACE3D_WIDTH = 58,					   /**< Maximum 3D surface width */
		CU_DEVICE_ATTRIBUTE_MAXIMUM_SURFACE3D_HEIGHT = 59,					  /**< Maximum 3D surface height */
		CU_DEVICE_ATTRIBUTE_MAXIMUM_SURFACE3D_DEPTH = 60,					   /**< Maximum 3D surface depth */
		CU_DEVICE_ATTRIBUTE_MAXIMUM_SURFACE1D_LAYERED_WIDTH = 61,			   /**< Maximum 1D layered surface width */
		CU_DEVICE_ATTRIBUTE_MAXIMUM_SURFACE1D_LAYERED_LAYERS = 62,			  /**< Maximum layers in a 1D layered surface */
		CU_DEVICE_ATTRIBUTE_MAXIMUM_SURFACE2D_LAYERED_WIDTH = 63,			   /**< Maximum 2D layered surface width */
		CU_DEVICE_ATTRIBUTE_MAXIMUM_SURFACE2D_LAYERED_HEIGHT = 64,			  /**< Maximum 2D layered surface height */
		CU_DEVICE_ATTRIBUTE_MAXIMUM_SURFACE2D_LAYERED_LAYERS = 65,			  /**< Maximum layers in a 2D layered surface */
		CU_DEVICE_ATTRIBUTE_MAXIMUM_SURFACECUBEMAP_WIDTH = 66,				  /**< Maximum cubemap surface width */
		CU_DEVICE_ATTRIBUTE_MAXIMUM_SURFACECUBEMAP_LAYERED_WIDTH = 67,		  /**< Maximum cubemap layered surface width */
		CU_DEVICE_ATTRIBUTE_MAXIMUM_SURFACECUBEMAP_LAYERED_LAYERS = 68,		 /**< Maximum layers in a cubemap layered surface */
		CU_DEVICE_ATTRIBUTE_MAXIMUM_TEXTURE1D_LINEAR_WIDTH = 69,				/**< Deprecated, do not use. Use cudaDeviceGetTexture1DLinearMaxWidth() or cuDeviceGetTexture1DLinearMaxWidth() instead. */
		CU_DEVICE_ATTRIBUTE_MAXIMUM_TEXTURE2D_LINEAR_WIDTH = 70,				/**< Maximum 2D linear texture width */
		CU_DEVICE_ATTRIBUTE_MAXIMUM_TEXTURE2D_LINEAR_HEIGHT = 71,			   /**< Maximum 2D linear texture height */
		CU_DEVICE_ATTRIBUTE_MAXIMUM_TEXTURE2D_LINEAR_PITCH = 72,				/**< Maximum 2D linear texture pitch in bytes */
		CU_DEVICE_ATTRIBUTE_MAXIMUM_TEXTURE2D_MIPMAPPED_WIDTH = 73,			 /**< Maximum mipmapped 2D texture width */
		CU_DEVICE_ATTRIBUTE_MAXIMUM_TEXTURE2D_MIPMAPPED_HEIGHT = 74,			/**< Maximum mipmapped 2D texture height */
		CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR = 75,					  /**< Major compute capability version number */
		CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR = 76,					  /**< Minor compute capability version number */
		CU_DEVICE_ATTRIBUTE_MAXIMUM_TEXTURE1D_MIPMAPPED_WIDTH = 77,			 /**< Maximum mipmapped 1D texture width */
		CU_DEVICE_ATTRIBUTE_STREAM_PRIORITIES_SUPPORTED = 78,				   /**< Device supports stream priorities */
		CU_DEVICE_ATTRIBUTE_GLOBAL_L1_CACHE_SUPPORTED = 79,					 /**< Device supports caching globals in L1 */
		CU_DEVICE_ATTRIBUTE_LOCAL_L1_CACHE_SUPPORTED = 80,					  /**< Device supports caching locals in L1 */
		CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_MULTIPROCESSOR = 81,		  /**< Maximum shared memory available per multiprocessor in bytes */
		CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_MULTIPROCESSOR = 82,			  /**< Maximum number of 32-bit registers available per multiprocessor */
		CU_DEVICE_ATTRIBUTE_MANAGED_MEMORY = 83,								/**< Device can allocate managed memory on this system */
		CU_DEVICE_ATTRIBUTE_MULTI_GPU_BOARD = 84,							   /**< Device is on a multi-GPU board */
		CU_DEVICE_ATTRIBUTE_MULTI_GPU_BOARD_GROUP_ID = 85,					  /**< Unique id for a group of devices on the same multi-GPU board */
		CU_DEVICE_ATTRIBUTE_HOST_NATIVE_ATOMIC_SUPPORTED = 86,				  /**< Link between the device and the host supports native atomic operations (this is a placeholder attribute, and is not supported on any current hardware) */
		CU_DEVICE_ATTRIBUTE_SINGLE_TO_DOUBLE_PRECISION_PERF_RATIO = 87,		 /**< Ratio of single precision performance (in floating-point operations per second) to double precision performance */
		CU_DEVICE_ATTRIBUTE_PAGEABLE_MEMORY_ACCESS = 88,						/**< Device supports coherently accessing pageable memory without calling cudaHostRegister on it */
		CU_DEVICE_ATTRIBUTE_CONCURRENT_MANAGED_ACCESS = 89,					 /**< Device can coherently access managed memory concurrently with the CPU */
		CU_DEVICE_ATTRIBUTE_COMPUTE_PREEMPTION_SUPPORTED = 90,				  /**< Device supports compute preemption. */
		CU_DEVICE_ATTRIBUTE_CAN_USE_HOST_POINTER_FOR_REGISTERED_MEM = 91,	   /**< Device can access host registered memory at the same virtual address as the CPU */
		CU_DEVICE_ATTRIBUTE_CAN_USE_STREAM_MEM_OPS = 92,						/**< ::cuStreamBatchMemOp and related APIs are supported. */
		CU_DEVICE_ATTRIBUTE_CAN_USE_64_BIT_STREAM_MEM_OPS = 93,				 /**< 64-bit operations are supported in ::cuStreamBatchMemOp and related APIs. */
		CU_DEVICE_ATTRIBUTE_CAN_USE_STREAM_WAIT_VALUE_NOR = 94,				 /**< ::CU_STREAM_WAIT_VALUE_NOR is supported. */
		CU_DEVICE_ATTRIBUTE_COOPERATIVE_LAUNCH = 95,							/**< Device supports launching cooperative kernels via ::cuLaunchCooperativeKernel */
		CU_DEVICE_ATTRIBUTE_COOPERATIVE_MULTI_DEVICE_LAUNCH = 96,			   /**< Deprecated, ::cuLaunchCooperativeKernelMultiDevice is deprecated. */
		CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_BLOCK_OPTIN = 97,			 /**< Maximum optin shared memory per block */
		CU_DEVICE_ATTRIBUTE_CAN_FLUSH_REMOTE_WRITES = 98,					   /**< The ::CU_STREAM_WAIT_VALUE_FLUSH flag and the ::CU_STREAM_MEM_OP_FLUSH_REMOTE_WRITES MemOp are supported on the device. See \ref CUDA_MEMOP for additional details. */
		CU_DEVICE_ATTRIBUTE_HOST_REGISTER_SUPPORTED = 99,					   /**< Device supports host memory registration via ::cudaHostRegister. */
		CU_DEVICE_ATTRIBUTE_PAGEABLE_MEMORY_ACCESS_USES_HOST_PAGE_TABLES = 100, /**< Device accesses pageable memory via the host's page tables. */
		CU_DEVICE_ATTRIBUTE_DIRECT_MANAGED_MEM_ACCESS_FROM_HOST = 101,		  /**< The host can directly access managed memory on the device without migration. */
		CU_DEVICE_ATTRIBUTE_VIRTUAL_ADDRESS_MANAGEMENT_SUPPORTED = 102,		 /**< Deprecated, use CU_DEVICE_ATTRIBUTE_VIRTUAL_MEMORY_MANAGEMENT_SUPPORTED */
		CU_DEVICE_ATTRIBUTE_VIRTUAL_MEMORY_MANAGEMENT_SUPPORTED = 102,		 /**< Device supports virtual memory management APIs like ::cuMemAddressReserve, ::cuMemCreate, ::cuMemMap and related APIs */
		CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR_SUPPORTED = 103,  /**< Device supports exporting memory to a posix file descriptor with ::cuMemExportToShareableHandle, if requested via ::cuMemCreate */
		CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_WIN32_HANDLE_SUPPORTED = 104,		   /**< Device supports exporting memory to a Win32 NT handle with ::cuMemExportToShareableHandle, if requested via ::cuMemCreate */
		CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_WIN32_KMT_HANDLE_SUPPORTED = 105,	   /**< Device supports exporting memory to a Win32 KMT handle with ::cuMemExportToShareableHandle, if requested via ::cuMemCreate */
		CU_DEVICE_ATTRIBUTE_MAX_BLOCKS_PER_MULTIPROCESSOR = 106,				/**< Maximum number of blocks per multiprocessor */
		CU_DEVICE_ATTRIBUTE_GENERIC_COMPRESSION_SUPPORTED = 107,				/**< Device supports compression of memory */
		CU_DEVICE_ATTRIBUTE_MAX_PERSISTING_L2_CACHE_SIZE = 108,				 /**< Maximum L2 persisting lines capacity setting in bytes. */
		CU_DEVICE_ATTRIBUTE_MAX_ACCESS_POLICY_WINDOW_SIZE = 109,				/**< Maximum value of CUaccessPolicyWindow::num_bytes. */
		CU_DEVICE_ATTRIBUTE_GPU_DIRECT_RDMA_WITH_CUDA_VMM_SUPPORTED = 110,	  /**< Device supports specifying the GPUDirect RDMA flag with ::cuMemCreate */
		CU_DEVICE_ATTRIBUTE_RESERVED_SHARED_MEMORY_PER_BLOCK = 111,			 /**< Shared memory reserved by CUDA driver per block in bytes */
		CU_DEVICE_ATTRIBUTE_SPARSE_CUDA_ARRAY_SUPPORTED = 112,				  /**< Device supports sparse CUDA arrays and sparse CUDA mipmapped arrays */
		CU_DEVICE_ATTRIBUTE_READ_ONLY_HOST_REGISTER_SUPPORTED = 113,			/**< Device supports using the ::cuMemHostRegister flag ::CU_MEMHOSTREGISTER_READ_ONLY to register memory that must be mapped as read-only to the GPU */
		CU_DEVICE_ATTRIBUTE_TIMELINE_SEMAPHORE_INTEROP_SUPPORTED = 114,		 /**< External timeline semaphore interop is supported on the device */
		CU_DEVICE_ATTRIBUTE_MEMORY_POOLS_SUPPORTED = 115,					   /**< Device supports using the ::cuMemAllocAsync and ::cuMemPool family of APIs */
		CU_DEVICE_ATTRIBUTE_GPU_DIRECT_RDMA_SUPPORTED = 116,					/**< Device supports GPUDirect RDMA APIs, like nvidia_p2p_get_pages (see https://docs.nvidia.com/cuda/gpudirect-rdma for more information) */
		CU_DEVICE_ATTRIBUTE_GPU_DIRECT_RDMA_FLUSH_WRITES_OPTIONS = 117,		 /**< The returned attribute shall be interpreted as a bitmask, where the individual bits are described by the ::CUflushGPUDirectRDMAWritesOptions enum */
		CU_DEVICE_ATTRIBUTE_GPU_DIRECT_RDMA_WRITES_ORDERING = 118,			  /**< GPUDirect RDMA writes to the device do not need to be flushed for consumers within the scope indicated by the returned attribute. See ::CUGPUDirectRDMAWritesOrdering for the numerical values returned here. */
		CU_DEVICE_ATTRIBUTE_MEMPOOL_SUPPORTED_HANDLE_TYPES = 119,			   /**< Handle types supported with mempool based IPC */
		CU_DEVICE_ATTRIBUTE_MAX
	}

	// Status codes returned by the NVRTC runtime-compilation library; the values mirror nvrtcResult in nvrtc.h.
	enum nvrtcResult
	{
		NVRTC_SUCCESS = 0,
		NVRTC_ERROR_OUT_OF_MEMORY = 1,
		NVRTC_ERROR_PROGRAM_CREATION_FAILURE = 2,
		NVRTC_ERROR_INVALID_INPUT = 3,
		NVRTC_ERROR_INVALID_PROGRAM = 4,
		NVRTC_ERROR_INVALID_OPTION = 5,
		NVRTC_ERROR_COMPILATION = 6,
		NVRTC_ERROR_BUILTIN_OPERATION_FAILURE = 7,
		NVRTC_ERROR_NO_NAME_EXPRESSIONS_AFTER_COMPILATION = 8,
		NVRTC_ERROR_NO_LOWERED_NAMES_BEFORE_COMPILATION = 9,
		NVRTC_ERROR_NAME_EXPRESSION_NOT_VALID = 10,
		NVRTC_ERROR_INTERNAL_ERROR = 11
	}
}
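The CUdevice_attribute values above are exactly what Cuda.DeviceGetAttribute() accepts, just as the vector-add example uses CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK to size its kernel launch. Below is a minimal sketch of a device-capability report built on that same call; it assumes only the Cuda.DeviceGetAttribute(CUdevice_attribute, int) method shown in the example usage, and that CUcomputemode is exposed publicly like CUdevice_attribute.

using CudaForCSharp;
using System;

namespace TestApp
{
	class DeviceReport
	{
		// Minimal sketch: query a handful of device attributes through the wrapper.
		// Assumes Cuda.DeviceGetAttribute(CUdevice_attribute, int) returns the attribute
		// as a number, as in the vector-add example above.
		static void PrintAttributes(int deviceId)
		{
			int ccMajor = (int)Cuda.DeviceGetAttribute(CUdevice_attribute.CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR, deviceId);
			int ccMinor = (int)Cuda.DeviceGetAttribute(CUdevice_attribute.CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR, deviceId);
			int smCount = (int)Cuda.DeviceGetAttribute(CUdevice_attribute.CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT, deviceId);
			int warpSize = (int)Cuda.DeviceGetAttribute(CUdevice_attribute.CU_DEVICE_ATTRIBUTE_WARP_SIZE, deviceId);
			int maxThreads = (int)Cuda.DeviceGetAttribute(CUdevice_attribute.CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK, deviceId);

			Console.WriteLine($"Compute capability: {ccMajor}.{ccMinor}");
			Console.WriteLine($"Multiprocessors:    {smCount}");
			Console.WriteLine($"Warp size:          {warpSize}");
			Console.WriteLine($"Max threads/block:  {maxThreads}");

			// CU_DEVICE_ATTRIBUTE_COMPUTE_MODE returns an integer that maps onto the CUcomputemode enum.
			CUcomputemode mode = (CUcomputemode)Cuda.DeviceGetAttribute(CUdevice_attribute.CU_DEVICE_ATTRIBUTE_COMPUTE_MODE, deviceId);
			Console.WriteLine($"Compute mode:       {mode}");
		}
	}
}

A report like this is useful before launching work. Note that the deprecated and "alternate" attribute names in the enum exist only for source compatibility with cuda.h; new code should query the non-deprecated values as above.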
