Intel IA-32 Assembly Tutorial - A Guide to the Basics of x86 Assembly - Page 14

Language Elements (AVX/AVX2: Advanced Vector Extensions, Detecting with CPUID, AVX Compared to SSE, New AVX Instructions, Copy Memory Using 256-bit AVX Registers, Multiply Large Array with 256-bit AVX Registers, New AVX2 Instructions).

The assembly code on this page was compiled and tested with Visual Studio 2022. Disabling the option "Image Has Safe Exception Handlers" in Project Properties > Configuration Properties > Linker > All Options may be required to compile most of this code.

Be familiar with SSE/SSE2 before reading this part of the tutorial.

AVX/AVX2: Advanced Vector Extensions

AVX is the latest generation of single-instruction, multiple-data (SIMD) extensions added to the x86 CPU, and it is available only on newer CPUs, such as Intel's Sandy Bridge and later. This is a large extension to the x86 architecture and the biggest since SSE. AVX2 is also known as "Haswell New Instructions" because it first appeared in Intel's Haswell processors.

AVX requires operating system support because, when the operating system performs a context switch, it must save and restore the larger register file. Windows 7 requires Service Pack 1; Windows 8.1, 10 and 11 support AVX. Linux supports AVX as of kernel version 2.6.30.

Check for the CPUID instruction:

TITLE 'extern "C" int __cdecl isCPUID();'
.686P
.model FLAT
PUBLIC	_isCPUID
_TEXT	SEGMENT
_isCPUID PROC NEAR
	
	push ebx         ; save ebx for the caller
	pushfd           ; push eflags on the stack
	pop eax          ; pop them into eax
	mov ebx, eax     ; save to ebx for restoring afterwards
	xor eax, 200000h ; toggle bit 21
	push eax         ; push the toggled eflags
	popfd            ; pop them back into eflags
	pushfd           ; push eflags
	pop eax          ; pop them back into eax
	cmp eax, ebx     ; see if bit 21 was reset
	jz not_supported
	
	mov eax, 1
	jmp exit
	
not_supported:
	xor eax, eax

exit:
	pop ebx
	ret 0
_isCPUID ENDP
_TEXT	ENDS
END
Detecting AVX

TITLE 'extern "C" int __cdecl isAVX();'
.686P
.model FLAT
PUBLIC	_isAVX
_TEXT	SEGMENT
_isAVX PROC NEAR

	push ebx
	
	mov eax, 1
	cpuid
	shr ecx, 28 ; bit 28 of the ecx register
	and ecx, 1
	mov eax, ecx

	pop ebx

exit:
	ret 0
_isAVX ENDP
_TEXT	ENDS
END
Detecting AVX2

TITLE 'extern "C" int __cdecl isAVX2();'
.686P
.model FLAT
PUBLIC	_isAVX2
_TEXT	SEGMENT
_isAVX2 PROC NEAR

	push ebx
	
	mov eax, 7
	xor ecx, ecx
	cpuid
	shr ebx, 5 ; bit 5 of the ebx register
	and ebx, 1
	mov eax, ebx

	pop ebx

exit:
	ret 0
_isAVX2 ENDP
_TEXT	ENDS
END
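
Here is illustrative C/C++ driver code (not part of the original listings) that calls the three routines above. Note that a thorough runtime check should also confirm that the operating system has enabled saving the YMM state: CPUID leaf 1 must report OSXSAVE (ECX bit 27), and the XGETBV instruction must report that the XMM and YMM state bits (bits 1 and 2 of XCR0) are set. The sketch below does this with the Visual C++ intrinsics __cpuid and _xgetbv:

#include <intrin.h>     // __cpuid (Visual C++)
#include <immintrin.h>  // _xgetbv (Visual C++)
#include <iostream>
using namespace std;

extern "C" int __cdecl isCPUID();
extern "C" int __cdecl isAVX();
extern "C" int __cdecl isAVX2();

// Returns true when the OS has enabled saving the XMM and YMM state.
// CPUID leaf 1, ECX bit 27 (OSXSAVE) must be set before XGETBV may be used;
// XCR0 bits 1 and 2 indicate the XMM and YMM state are saved on context switches.
static bool osSavesYMM()
{
	int info[4]; // eax, ebx, ecx, edx
	__cpuid(info, 1);
	if (!(info[2] & (1 << 27))) // OSXSAVE
		return false;
	unsigned long long xcr0 = _xgetbv(0); // read XCR0
	return (xcr0 & 6) == 6;
}

int main()
{
	if (!isCPUID())
	{
		cout << "CPUID is not supported" << endl;
		return 0;
	}
	cout << "CPU reports AVX:      " << (isAVX() ? "yes" : "no") << endl;
	cout << "CPU reports AVX2:     " << (isAVX2() ? "yes" : "no") << endl;
	cout << "OS enabled YMM state: " << (osSavesYMM() ? "yes" : "no") << endl;
	return 0;
}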

AVX Compared to SSE

Now, the exciting changes in AVX as compared to SSE. The registers are 256 bits (32 bytes), twice as wide as SSE's. The AVX instructions work on eight packed singles (the C/C++ float data-type) or four packed doubles (the C/C++ double data-type). Many AVX instructions take three operands and are non-destructive, so the source operands are not necessarily overwritten as they are in SSE! This is a big departure from how most of the x86 architecture works. AVX introduced the YMMWORD, which is a 256-bit value. As with SSE, there are 16 registers (YMM0 to YMM15) in 64-bit mode; in 32-bit code only YMM0 to YMM7 are accessible. The lower 128 bits of each YMM register are aliased to the corresponding XMM (SSE) register.

A single YMM register holds:
1 ymmword
2 xmmwords
4 doubles
8 singles

Some SSE2 instructions deal with integers. The AVX instructions deal only with singles and doubles, but the AVX registers can of course be used to store many different data-types. AVX2 expands most SSE2 integer instructions to 256 bits (32 bytes) and introduces new instructions.

Most SSE instructions are available in AVX form. The mnemonics of AVX instructions have a V prefix, so the SSE instruction SUBPS becomes VSUBPS, for example. Instructions that operate on the MMX registers were not given V-prefixed forms. Most SSE2 integer instructions gain 256-bit forms with AVX2.

There are new broadcast (fill an AVX register with copies of a value), extraction/insertion, masked move (imitating SIMD branching), shuffle and zeroing instructions.

Consider the following code:


vsubps ymm0, ymm1, ymm2

This operation subtracts eight packed singles. It will subtract register ymm2 from register ymm1 and store the eight results in register ymm0.
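
The same operation can be tried from C/C++ with compiler intrinsics before writing assembly. This is only an illustrative sketch (the arrays and variable names are made up for the example, and it must be built for and run on an AVX-capable machine); _mm256_sub_ps compiles to a VSUBPS instruction:

#include <immintrin.h>
#include <iostream>
using namespace std;

int main()
{
	alignas(32) float a[8] = { 10, 20, 30, 40, 50, 60, 70, 80 };
	alignas(32) float b[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };
	__m256 ymm1 = _mm256_load_ps(a);         // eight packed singles
	__m256 ymm2 = _mm256_load_ps(b);
	__m256 ymm0 = _mm256_sub_ps(ymm1, ymm2); // VSUBPS: ymm0 = ymm1 - ymm2
	alignas(32) float result[8];
	_mm256_store_ps(result, ymm0);
	for (int i = 0; i < 8; i++)
		cout << result[i] << " ";            // prints 9 18 27 36 45 54 63 72
	cout << endl;
	return 0;
}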

When copying to and from memory, AVX is bottlenecked by the speed of the memory and as a result is only about as fast as SSE. The difference shows up in calculations, so if an application primarily moves a lot of data in and out of memory then SSE may be a better choice because of its broader CPU and OS support.

In this complete example, 32 bytes are copied from one location in memory to another:


TITLE 'extern "C" int __cdecl avx_copy_example(char dest[], char src[]);'
.686P
.xmm
.model FLAT
PUBLIC	_avx_copy_example
_TEXT	SEGMENT
_avx_copy_example PROC NEAR
	
	mov eax, [esp+(2+0)*4]             ; src - should be at least 32 bytes long
	vmovupd ymm0, YMMWORD PTR [eax]    ; copy src to ymm0

	mov eax, [esp+(1+0)*4]             ; dest - should be at least 32 bytes long
	vmovupd YMMWORD PTR [eax], ymm0    ; copy ymm0 to dest

	mov eax, 1

	ret 0
_avx_copy_example ENDP
_TEXT	ENDS
END

Here is the C/C++ driver code:


extern "C" int __cdecl avx_copy_example(char dest[], char src[]);

int main()
{
	char src[32], dest[32];
	for (int i = 0; i < 32; i++)
		src[i] = 'a' + i;
	avx_copy_example(dest, src);

	return 0;
}
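
For comparison, here is a minimal C/C++ sketch of the same 32-byte copy using the unaligned AVX load/store intrinsics (the function name avx_copy32 is just for illustration):

#include <immintrin.h>

// Copy exactly 32 bytes from src to dest using one 256-bit load and one 256-bit store.
// Sketch equivalent of avx_copy_example; both buffers must be at least 32 bytes long.
void avx_copy32(char* dest, const char* src)
{
	__m256i v = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(src));
	_mm256_storeu_si256(reinterpret_cast<__m256i*>(dest), v);
}

It could be called from the driver above in place of avx_copy_example(dest, src).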

Data Alignment

Just as aligned SSE instructions require data aligned to 16 bytes, data must be 32-byte aligned to use aligned AVX instructions such as VMOVAPD and VMOVAPS. If the data are not aligned on 32 bytes, these instructions will crash the program.
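
In C/C++, one way to guarantee 32-byte alignment is alignas(32) for automatic or static arrays, or an aligned heap allocator such as _aligned_malloc in Visual C++. A brief sketch (the buffer names are illustrative):

#include <malloc.h>   // _aligned_malloc, _aligned_free (Visual C++)

int main()
{
	alignas(32) float stackData[8] = {};   // 32-byte aligned automatic array

	// 1024 floats on the heap, aligned to 32 bytes, safe for VMOVAPS/VMOVAPD
	float* heapData = static_cast<float*>(_aligned_malloc(1024 * sizeof(float), 32));
	if (heapData)
	{
		heapData[0] = stackData[0];        // ... use the buffers with aligned AVX loads/stores ...
		_aligned_free(heapData);
	}
	return 0;
}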

New Instructions

AVX

As has already been stated, most of the SSE instructions are available with a V added to the beginning of the mnemonic.

These AVX instructions are completely new:

VBROADCASTSS, VBROADCASTSD, VBROADCASTF128: Copy a 32-bit, 64-bit or 128-bit memory operand to all elements of an XMM or YMM vector register.
VINSERTF128: Replaces either the lower half or the upper half of a 256-bit YMM register with the value of a 128-bit source operand. The other half of the destination is unchanged.
VEXTRACTF128: Extracts either the lower half or the upper half of a 256-bit YMM register and copies the value to a 128-bit destination operand.
VMASKMOVPS, VMASKMOVPD: Conditionally reads any number of elements from a SIMD vector memory operand into a destination register, leaving the remaining vector elements unread and setting the corresponding elements in the destination register to zero; or conditionally writes any number of elements from a SIMD vector register operand to a vector memory operand, leaving the remaining elements of the memory operand unchanged.
VPERMILPS, VPERMILPD: Permute in-lane. Shuffle the 32-bit or 64-bit vector elements of one input operand. These are in-lane 256-bit instructions, meaning that they operate on all 256 bits with two separate 128-bit shuffles, so they cannot shuffle across the 128-bit lanes.
VPERM2F128: Shuffle the four 128-bit vector elements of two 256-bit source operands into a 256-bit destination operand, with an immediate constant as selector. See the example below.
VTESTPS, VTESTPD: Packed bit test of the packed single-precision or double-precision floating-point sign bits, setting or clearing the ZF flag based on AND and the CF flag based on ANDN.
VZEROALL: Set all YMM registers to zero and tag them as unused. Used when switching between 128-bit use and 256-bit use.
VZEROUPPER: Set the upper half of all YMM registers to zero. Used when switching between 128-bit use and 256-bit use.
VPERM2F128 Instruction Example

Like the SSE shuffle instructions, this is a somewhat complex one, so the example is rather long to showcase some of what this instruction can do.

VPERM2F128 — parameters: [ymm], [ymm], [ymm/256-bits-memory], [imm8] — first two must be YMM registers


TITLE 'extern "C" int __cdecl avx_vperm2f128_example(double src1[], double src2[], double dest[], void (*print)(double d[], unsigned char imm8));'
.686P
.xmm
.model FLAT
PUBLIC	_avx_vperm2f128_example
_TEXT	SEGMENT
_avx_vperm2f128_example PROC NEAR
	
	push esi



	mov esi, [esp+(1+1)*4]          ; parameter: src1
	vmovupd ymm1, YMMWORD PTR [esi]
	mov esi, [esp+(2+1)*4]          ; parameter: src2
	vperm2f128 ymm0, ymm1, YMMWORD PTR [esi], 00b
	mov esi, [esp+(3+1)*4]          ; parameter: dest
	vmovupd YMMWORD PTR [esi], ymm0

	push 00b
	push esi                        ; parameter: dest
	mov eax,[esp+(4+3)*4]           ; parameter: print function
	call eax                        ; call print function
	add esp, 8


	mov esi, [esp+(1+1)*4]          ; parameter: src1
	vmovupd ymm1, YMMWORD PTR [esi]
	mov esi, [esp+(2+1)*4]          ; parameter: src2
	vperm2f128 ymm0, ymm1, YMMWORD PTR [esi], 01b
	mov esi, [esp+(3+1)*4]          ; parameter: dest
	vmovupd YMMWORD PTR [esi], ymm0

	push 01b
	push esi                        ; parameter: dest
	mov eax,[esp+(4+3)*4]           ; parameter: print function
	call eax                        ; call print function
	add esp, 8


	mov esi, [esp+(1+1)*4]          ; parameter: src1
	vmovupd ymm1, YMMWORD PTR [esi]
	mov esi, [esp+(2+1)*4]          ; parameter: src2
	vperm2f128 ymm0, ymm1, YMMWORD PTR [esi], 10b
	mov esi, [esp+(3+1)*4]          ; parameter: dest
	vmovupd YMMWORD PTR [esi], ymm0

	push 10b
	push esi                        ; parameter: dest
	mov eax,[esp+(4+3)*4]           ; parameter: print function
	call eax                        ; call print function
	add esp, 8


	mov esi, [esp+(1+1)*4]          ; parameter: src1
	vmovupd ymm1, YMMWORD PTR [esi]
	mov esi, [esp+(2+1)*4]          ; parameter: src2
	vperm2f128 ymm0, ymm1, YMMWORD PTR [esi], 11b
	mov esi, [esp+(3+1)*4]          ; parameter: dest
	vmovupd YMMWORD PTR [esi], ymm0

	push 11b
	push esi                        ; parameter: dest
	mov eax,[esp+(4+3)*4]           ; parameter: print function
	call eax                        ; call print function
	add esp, 8




	mov esi, [esp+(1+1)*4]          ; parameter: src1
	vmovupd ymm1, YMMWORD PTR [esi]
	mov esi, [esp+(2+1)*4]          ; parameter: src2
	vperm2f128 ymm0, ymm1, YMMWORD PTR [esi], 010000b
	mov esi, [esp+(3+1)*4]          ; parameter: dest
	vmovupd YMMWORD PTR [esi], ymm0

	push 010000b
	push esi                        ; parameter: dest
	mov eax,[esp+(4+3)*4]           ; parameter: print function
	call eax                        ; call print function
	add esp, 8


	mov esi, [esp+(1+1)*4]          ; parameter: src1
	vmovupd ymm1, YMMWORD PTR [esi]
	mov esi, [esp+(2+1)*4]          ; parameter: src2
	vperm2f128 ymm0, ymm1, YMMWORD PTR [esi], 100000b
	mov esi, [esp+(3+1)*4]          ; parameter: dest
	vmovupd YMMWORD PTR [esi], ymm0

	push 100000b
	push esi                        ; parameter: dest
	mov eax,[esp+(4+3)*4]           ; parameter: print function
	call eax                        ; call print function
	add esp, 8


	mov esi, [esp+(1+1)*4]          ; parameter: src1
	vmovupd ymm1, YMMWORD PTR [esi]
	mov esi, [esp+(2+1)*4]          ; parameter: src2
	vperm2f128 ymm0, ymm1, YMMWORD PTR [esi], 110000b
	mov esi, [esp+(3+1)*4]          ; parameter: dest
	vmovupd YMMWORD PTR [esi], ymm0

	push 110000b
	push esi                        ; parameter: dest
	mov eax,[esp+(4+3)*4]           ; parameter: print function
	call eax                        ; call print function
	add esp, 8




	mov esi, [esp+(1+1)*4]          ; parameter: src1
	vmovupd ymm1, YMMWORD PTR [esi]
	mov esi, [esp+(2+1)*4]          ; parameter: src2
	vperm2f128 ymm0, ymm1, YMMWORD PTR [esi], 010001b
	mov esi, [esp+(3+1)*4]          ; parameter: dest
	vmovupd YMMWORD PTR [esi], ymm0

	push 010001b
	push esi                        ; parameter: dest
	mov eax,[esp+(4+3)*4]           ; parameter: print function
	call eax                        ; call print function
	add esp, 8


	mov esi, [esp+(1+1)*4]          ; parameter: src1
	vmovupd ymm1, YMMWORD PTR [esi]
	mov esi, [esp+(2+1)*4]          ; parameter: src2
	vperm2f128 ymm0, ymm1, YMMWORD PTR [esi], 100010b
	mov esi, [esp+(3+1)*4]          ; parameter: dest
	vmovupd YMMWORD PTR [esi], ymm0

	push 100010b
	push esi                        ; parameter: dest
	mov eax,[esp+(4+3)*4]           ; parameter: print function
	call eax                        ; call print function
	add esp, 8


	mov esi, [esp+(1+1)*4]          ; parameter: src1
	vmovupd ymm1, YMMWORD PTR [esi]
	mov esi, [esp+(2+1)*4]          ; parameter: src2
	vperm2f128 ymm0, ymm1, YMMWORD PTR [esi], 110011b
	mov esi, [esp+(3+1)*4]          ; parameter: dest
	vmovupd YMMWORD PTR [esi], ymm0

	push 110011b
	push esi                        ; parameter: dest
	mov eax,[esp+(4+3)*4]           ; parameter: print function
	call eax                        ; call print function
	add esp, 8




	mov esi, [esp+(1+1)*4]          ; parameter: src1
	vmovupd ymm1, YMMWORD PTR [esi]
	mov esi, [esp+(2+1)*4]          ; parameter: src2
	vperm2f128 ymm0, ymm1, YMMWORD PTR [esi], 010010b
	mov esi, [esp+(3+1)*4]          ; parameter: dest
	vmovupd YMMWORD PTR [esi], ymm0

	push 010010b
	push esi                        ; parameter: dest
	mov eax,[esp+(4+3)*4]           ; parameter: print function
	call eax                        ; call print function
	add esp, 8


	mov esi, [esp+(1+1)*4]          ; parameter: src1
	vmovupd ymm1, YMMWORD PTR [esi]
	mov esi, [esp+(2+1)*4]          ; parameter: src2
	vperm2f128 ymm0, ymm1, YMMWORD PTR [esi], 110010b
	mov esi, [esp+(3+1)*4]          ; parameter: dest
	vmovupd YMMWORD PTR [esi], ymm0

	push 110010b
	push esi                        ; parameter: dest
	mov eax,[esp+(4+3)*4]           ; parameter: print function
	call eax                        ; call print function
	add esp, 8


	mov esi, [esp+(1+1)*4]          ; parameter: src1
	vmovupd ymm1, YMMWORD PTR [esi]
	mov esi, [esp+(2+1)*4]          ; parameter: src2
	vperm2f128 ymm0, ymm1, YMMWORD PTR [esi], 010011b
	mov esi, [esp+(3+1)*4]          ; parameter: dest
	vmovupd YMMWORD PTR [esi], ymm0

	push 010011b
	push esi                        ; parameter: dest
	mov eax,[esp+(4+3)*4]           ; parameter: print function
	call eax                        ; call print function
	add esp, 8




	mov esi, [esp+(1+1)*4]          ; parameter: src1
	vmovupd ymm1, YMMWORD PTR [esi]
	mov esi, [esp+(2+1)*4]          ; parameter: src2
	vperm2f128 ymm0, ymm1, YMMWORD PTR [esi], 00011000b
	mov esi, [esp+(3+1)*4]          ; parameter: dest
	vmovupd YMMWORD PTR [esi], ymm0

	push 00011000b
	push esi                        ; parameter: dest
	mov eax,[esp+(4+3)*4]           ; parameter: print function
	call eax                        ; call print function
	add esp, 8


	mov esi, [esp+(1+1)*4]          ; parameter: src1
	vmovupd ymm1, YMMWORD PTR [esi]
	mov esi, [esp+(2+1)*4]          ; parameter: src2
	vperm2f128 ymm0, ymm1, YMMWORD PTR [esi], 10000010b
	mov esi, [esp+(3+1)*4]          ; parameter: dest
	vmovupd YMMWORD PTR [esi], ymm0

	push 10000010b
	push esi                        ; parameter: dest
	mov eax,[esp+(4+3)*4]           ; parameter: print function
	call eax                        ; call print function
	add esp, 8


	mov esi, [esp+(1+1)*4]          ; parameter: src1
	vmovupd ymm1, YMMWORD PTR [esi]
	mov esi, [esp+(2+1)*4]          ; parameter: src2
	vperm2f128 ymm0, ymm1, YMMWORD PTR [esi], 10001000b
	mov esi, [esp+(3+1)*4]          ; parameter: dest
	vmovupd YMMWORD PTR [esi], ymm0

	push 10001000b
	push esi                        ; parameter: dest
	mov eax,[esp+(4+3)*4]           ; parameter: print function
	call eax                        ; call print function
	add esp, 8


	mov eax, 1

	pop esi
	
	ret 0
_avx_vperm2f128_example ENDP
_TEXT	ENDS
END

The driver code:


#include <iostream>
using namespace std;

extern "C" int __cdecl avx_vperm2f128_example(double src1[], double src2[], double dest[], void (*print)(double d[], unsigned char imm8));

void print(double d[], unsigned char imm8)
{
	unsigned char printbits = 0;
	for (int i = 0; i < 8; i++)
	{
		printbits <<= 1;
		printbits += (imm8 >> i & 1);
	}
	cout << endl << "imm8 = ";
	for (int i = 0; i < 8; i++)
	{
		cout << (printbits >> i & 1);
	}

	cout << "b" << " = " << (int)imm8 << endl << "dest = [";
	for (int i = 0; i < 4; i++)
	{
		if (i > 0)
			cout << ", ";
		cout << d[i];
	}
	cout << "]" << endl;
}

int main()
{
	double src1[4], src2[4], dest[4];
	for (int i = 0; i < 4; i++)
		src1[i] = (i + 1) * 1.1;
	for (int i = 0; i < 4; i++)
		src2[i] = (i + 1) * 10.1;

	cout << "src1 = [";
	for (int i = 0; i < 4; i++)
	{
		if (i > 0)
			cout << ", ";
		cout << src1[i];
	}
	cout << "]" << endl << "src2 = [";
	for (int i = 0; i < 4; i++)
	{
		if (i > 0)
			cout << ", ";
		cout << src2[i];
	}
	cout << "]" << endl;

	avx_vperm2f128_example(src1, src2, dest, print);

	cout << endl;

	cin.get();
	return 0;
}

The program output:

src1 = [1.1, 2.2, 3.3, 4.4]
src2 = [10.1, 20.2, 30.3, 40.4]

imm8 = 00000000b = 0
dest = [1.1, 2.2, 1.1, 2.2]

imm8 = 00000001b = 1
dest = [3.3, 4.4, 1.1, 2.2]

imm8 = 00000010b = 2
dest = [10.1, 20.2, 1.1, 2.2]

imm8 = 00000011b = 3
dest = [30.3, 40.4, 1.1, 2.2]

imm8 = 00010000b = 16
dest = [1.1, 2.2, 3.3, 4.4]

imm8 = 00100000b = 32
dest = [1.1, 2.2, 10.1, 20.2]

imm8 = 00110000b = 48
dest = [1.1, 2.2, 30.3, 40.4]

imm8 = 00010001b = 17
dest = [3.3, 4.4, 3.3, 4.4]

imm8 = 00100010b = 34
dest = [10.1, 20.2, 10.1, 20.2]

imm8 = 00110011b = 51
dest = [30.3, 40.4, 30.3, 40.4]

imm8 = 00010010b = 18
dest = [10.1, 20.2, 3.3, 4.4]

imm8 = 00110010b = 50
dest = [10.1, 20.2, 30.3, 40.4]

imm8 = 00010011b = 19
dest = [30.3, 40.4, 3.3, 4.4]

imm8 = 00011000b = 24
dest = [0, 0, 3.3, 4.4]

imm8 = 10000010b = 130
dest = [10.1, 20.2, 0, 0]

imm8 = 10001000b = 136
dest = [0, 0, 0, 0]

This figure shows how the destination is affected by the bitmask in [imm8]:

IMM8 BITS 0 and 1:
00000000: SRC1[127-0]   → DEST[127-0]
00000001: SRC1[255-128] → DEST[127-0]
00000010: SRC2[127-0]   → DEST[127-0]
00000011: SRC2[255-128] → DEST[127-0]

IMM8 BIT 2
00000100: reserved

IMM8 BIT 3
00001000: 0 → DEST[127-0]

IMM8 BITS 4 and 5:
00000000: SRC1[127-0]   → DEST[255-128]
00010000: SRC1[255-128] → DEST[255-128]
00100000: SRC2[127-0]   → DEST[255-128]
00110000: SRC2[255-128] → DEST[255-128]

IMM8 BIT 6
01000000: reserved

IMM8 BIT 7
10000000: 0 → DEST[255-128]
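
The same permutation is also available from C/C++ through the _mm256_permute2f128_pd intrinsic, where the immediate selector works exactly as in the figure above. A small illustrative sketch (not part of the original example) that reproduces the imm8 = 00100000b (32) case from the output above:

#include <immintrin.h>
#include <iostream>
using namespace std;

int main()
{
	double src1[4] = { 1.1, 2.2, 3.3, 4.4 };
	double src2[4] = { 10.1, 20.2, 30.3, 40.4 };
	__m256d a = _mm256_loadu_pd(src1);
	__m256d b = _mm256_loadu_pd(src2);

	// imm8 = 00100000b (0x20): lower 128 bits from the low half of src1,
	// upper 128 bits from the low half of src2
	__m256d r = _mm256_permute2f128_pd(a, b, 0x20);

	double dest[4];
	_mm256_storeu_pd(dest, r);
	for (int i = 0; i < 4; i++)
		cout << dest[i] << " ";  // prints 1.1 2.2 10.1 20.2
	cout << endl;
	return 0;
}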

Copy Memory Using 256-bit AVX Registers

This code will copy memory from one location to another using the 32-byte registers (it is not really faster than the SSE 128-bit copy example because memory speed is the bottleneck):


TITLE 'extern "C" char * memcopy256(char *dest, const char *src, unsigned int length);'
.686P
.xmm
.model FLAT
PUBLIC	_memcopy256
_TEXT	SEGMENT
_memcopy256 PROC NEAR

	push ebx
	push esi

	mov esi, DWORD PTR [esp+(3+2)*4]  ; move length to esi (two registers pushed, so parameters start at [esp+12])
	mov ebx, esi                      ; move length to ebx
	shr esi, 5                        ; divide by 32
	shl esi, 5                        ; multiply by 32 - esi is now length rounded down to a multiple of 32
	sub ebx, esi                      ; find the remainder (0 to 31 bytes) and store in ebx
	shr esi, 5                        ; divide by 32 - esi holds the quotient now

	mov ecx, esi                      ; move the quotient to ecx

	mov esi, DWORD PTR [esp+(2+2)*4]  ; move the src pointer to esi

	mov eax, DWORD PTR [esp+(1+2)*4]  ; move the dest pointer to eax

	cmp ecx, 0                        ; make sure there are at least 32 bytes to copy
	je compare_remainder

copy32bytes:
	vmovdqu ymm0, YMMWORD PTR [esi]    ; copy 32 bytes from src to ymm0
	vmovdqu YMMWORD PTR [eax], ymm0    ; copy 32 bytes from ymm0 to dest
	add eax, 32
	add esi, 32
	loop copy32bytes                  ; loop while ecx > 0 - this will automatically decrement ecx

compare_remainder:
	cmp ebx, 0                        ; if there is no remainder then finished
	je exit

	mov ecx, ebx                      ; move the remainder to ecx
	shr ecx, 1				          ; divide by 2
	cmp ecx, 0                        ; make sure there are at least 2 bytes to copy
	je check_odd_byte

copy2bytes:
	mov dx, WORD PTR [esi]            ; copy 2 bytes from src to dx
	mov WORD PTR [eax], dx            ; copy 2 bytes from dx to dest
	add eax, 2
	add esi, 2
	loop copy2bytes                   ; loop while ecx > 0 - this will automatically decrement ecx

check_odd_byte:
	test ebx, 1
	jz exit

	mov dl, BYTE PTR [esi]            ; copy 1 byte from src to dl
	mov BYTE PTR [eax], dl            ; copy 1 byte from dl to dest
	inc eax

exit:
	pop esi
	pop ebx
	ret 0

_memcopy256 ENDP
_TEXT	ENDS
END

The driver code:


#include <cstring>  // strlen
#include <iostream>
using namespace std;

extern "C" char* memcopy256(char* dest, const char* src, unsigned int length);

int main()
{
	char dest[1024];
	const char* src = "abcdefghijklmnopqrstuvwxyz? ";
	int len = strlen(src);
	char* end;
	end = memcopy256(dest, src, len);
	end = memcopy256(end, dest, len * 1);
	end = memcopy256(end, dest, len * 2);
	end = memcopy256(end, dest, len * 4);
	memcopy256(end, "", 1); // copy sentinel
	return 0;
}

Multiply Large Array with 256-bit AVX Registers

Math operations on large arrays are very fast when performed 32 bytes at a time. This is useful for processing images, for example, because the bitmap data are stored in a large array.

This code multiplies every element of a large array of floats by a single multiplier using the 32-byte (256-bit) registers; the length parameter is the number of floats:


TITLE 'extern "C" void avx_multiply_array(float product[], float multiplicand[], float multiplier, int length);'
.686P
.xmm
.model FLAT
PUBLIC	_avx_multiply_array
_TEXT	SEGMENT
_avx_multiply_array PROC NEAR

	push ebx
	push edi
	push esi

	mov eax, DWORD PTR [esp+(4+3)*4]            ; move length (number of floats) to eax (three registers pushed, so parameters start at [esp+16])

	mov edx, eax                                ; copy length to edx
	shr edx, 3                                  ; divide by 8 (eight floats per 256-bit register)
	shl edx, 3                                  ; multiply by 8 - edx is now length rounded down to a multiple of 8
	mov ebx, eax                                ; copy length to ebx
	sub ebx, edx                                ; find the remainder (0 to 7 floats) and store in ebx
	shr edx, 3                                  ; divide by 8 - edx holds the quotient now

	mov edi, DWORD PTR [esp+(2+3)*4]            ; move multiplicand array to edi
	mov esi, DWORD PTR [esp+(1+3)*4]            ; move product array (destination array) to esi

	cmp edx, 0                                  ; make sure there are at least 8 floats to multiply
	je compare_remainder

	mov ecx, edx                                ; move the quotient to ecx

	vbroadcastss ymm1, DWORD PTR [esp+(3+3)*4]  ; broadcast multiplier to all 8 elements of ymm1

multiply32bytes:
	vmulps ymm0, ymm1, YMMWORD PTR [edi]        ; multiply 8 packed singles by the multiplier
	vmovups YMMWORD PTR [esi], ymm0             ; move the products to the product array
	add edi, 32
	add esi, 32
	loop multiply32bytes                        ; loop while ecx > 0 - this will automatically decrement ecx

compare_remainder:
	cmp ebx, 0                                  ; if there is no remainder then finished
	je exit

	mov ecx, ebx                                ; move the remainder to ecx

multi4bytes:
	movss xmm0, DWORD PTR [esp+(3+3)*4]         ; move multiplier to xmm0
	mulss xmm0, DWORD PTR [edi]                 ; multiply one float from the multiplicand array
	movss DWORD PTR [esi], xmm0                 ; move the product to the product array
	add edi, 4
	add esi, 4
	loop multi4bytes                            ; loop while ecx > 0 - this will automatically decrement ecx

exit:
	pop esi
	pop edi
	pop ebx
	ret 0

_avx_multiply_array ENDP
_TEXT	ENDS
END

The driver code:


#include <cstdlib>  // srand, rand
#include <ctime>    // clock
#include <iostream>
using namespace std;

extern "C" void avx_multiply_array(float product[], float multiplicand[], float multiplier, int length);

const int length = 10000000; // very large array
float product[length], multiplicand[length], multiplier = 1.5f;

int main()
{
	srand(clock());
	for (int i = 0; i < length; i++)
		multiplicand[i] = rand() * (rand() % 2 ? -0.33333f : 0.33333f);
	avx_multiply_array(product, multiplicand, multiplier, length);
	return 0;
}
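
For reference, the core of this routine can also be written with intrinsics, which the compiler turns into VBROADCASTSS/VMULPS-style code. This sketch is not the tutorial's assembly; like the routine above, it assumes length is the number of floats:

#include <immintrin.h>

// Multiply every element of multiplicand[] by multiplier, storing into product[].
// Processes eight floats per iteration, then finishes the remainder one float at a time.
void avx_multiply_array_intrin(float product[], const float multiplicand[], float multiplier, int length)
{
	__m256 m = _mm256_set1_ps(multiplier);             // broadcast the multiplier to all 8 lanes
	int i = 0;
	for (; i + 8 <= length; i += 8)
	{
		__m256 v = _mm256_loadu_ps(&multiplicand[i]);  // load 8 singles
		_mm256_storeu_ps(&product[i], _mm256_mul_ps(m, v));
	}
	for (; i < length; i++)                            // remainder (0 to 7 floats)
		product[i] = multiplicand[i] * multiplier;
}

It produces the same results as avx_multiply_array when called from the driver above.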

AVX2

AVX2 adds vector shifts, DWORD-granularity and QWORD-granularity "any to any" permutes, the ability to load vector elements from non-contiguous memory locations (gather), and expands most vector integer SSE and AVX instructions to 256 bits. AVX2 first appeared in processors released in 2013, so support was somewhat limited at the time this section of the tutorial was written.

These AVX2 instructions are completely new:

VBROADCASTSS, VBROADCASTSD: Copy a 32-bit or 64-bit register operand to all elements of an XMM or YMM vector register. These are register versions of the same instructions in AVX. There is no 128-bit version, but the same effect can be simply achieved using VINSERTF128.
VPBROADCASTB, VPBROADCASTW, VPBROADCASTD, VPBROADCASTQ: Copy an 8-, 16-, 32- or 64-bit integer register or memory operand to all elements of an XMM or YMM vector register.
VBROADCASTI128: Copy a 128-bit memory operand to all elements of a YMM vector register.
VINSERTI128: Replaces either the lower half or the upper half of a 256-bit YMM register with the value of a 128-bit source operand. The other half of the destination is unchanged.
VEXTRACTI128: Extracts either the lower half or the upper half of a 256-bit YMM register and copies the value to a 128-bit destination operand.
VGATHERDPD, VGATHERQPD, VGATHERDPS, VGATHERQPS: Gather single- or double-precision floating-point values using either 32- or 64-bit indices and a scale.
VPGATHERDD, VPGATHERDQ, VPGATHERQD, VPGATHERQQ: Gather 32- or 64-bit integer values using either 32- or 64-bit indices and a scale.
VPMASKMOVD, VPMASKMOVQ: Conditionally reads any number of elements from a SIMD vector memory operand into a destination register, leaving the remaining vector elements unread and setting the corresponding elements in the destination register to zero. Alternatively, conditionally writes any number of elements from a SIMD vector register operand to a vector memory operand, leaving the remaining elements of the memory operand unchanged.
VPERMPS, VPERMD: Shuffle the eight 32-bit vector elements of one 256-bit source operand into a 256-bit destination operand, with a register or memory operand as selector.
VPERMPD, VPERMQ: Shuffle the four 64-bit vector elements of one 256-bit source operand into a 256-bit destination operand, with an immediate constant as selector.
VPERM2I128: Shuffle (two of) the four 128-bit vector elements of two 256-bit source operands into a 256-bit destination operand, with an immediate constant as selector.
VPBLENDD: Doubleword immediate version of the PBLEND instructions from SSE4.
VPSLLVD, VPSLLVQ: Shift left logical. Allows variable shifts where each element is shifted according to the packed input.
VPSRLVD, VPSRLVQ: Shift right logical. Allows variable shifts where each element is shifted according to the packed input.
VPSRAVD: Shift right arithmetically. Allows variable shifts where each element is shifted according to the packed input.
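
As a small taste of these AVX2 additions, the following illustrative sketch (not part of the original tutorial; it requires an AVX2-capable CPU) uses the variable-shift instruction VPSLLVD through its intrinsic to shift each 32-bit element by a per-element amount:

#include <immintrin.h>
#include <iostream>
using namespace std;

int main()
{
	// eight 32-bit integers and eight per-element shift counts
	__m256i values = _mm256_setr_epi32(1, 1, 1, 1, 2, 2, 2, 2);
	__m256i counts = _mm256_setr_epi32(0, 1, 2, 3, 0, 1, 2, 3);

	// VPSLLVD: shift each element of 'values' left by the matching element of 'counts'
	__m256i shifted = _mm256_sllv_epi32(values, counts);

	alignas(32) int out[8];
	_mm256_store_si256(reinterpret_cast<__m256i*>(out), shifted);
	for (int i = 0; i < 8; i++)
		cout << out[i] << " ";  // prints 1 2 4 8 2 4 8 16
	cout << endl;
	return 0;
}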
