Conclusion#
The conclusion comes first.
Declaring a loop variable inside the for statement does not cause repeated stack allocation, with either the AC6 (Arm Compiler 6) compiler or GCC; the loop counters simply reuse the same two stack offsets. With optimization enabled, the generated assembly is exactly the same as long as the two versions do not differ in logic.
In fact, declaring the loop variable in the for statement itself is good practice. Moving every local variable definition to the top of the function either makes the generated code slightly worse or makes no difference, depending on the compiler and the optimization level.
So if you are chasing maximum performance, declare each local variable only in the branch where it is used.
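As a minimal sketch of that last point, the function below declares and initializes its accumulator only inside the branch that needs it. The names `sum_if_enabled` and `sum_if_enabled_check` are invented for illustration and are not part of the tests in this article; whether the narrower scope actually saves instructions depends on the compiler and the optimization level.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: sums the buffer only when `enabled` is nonzero. */
int sum_if_enabled(const int *data, size_t len, int enabled)
{
    if (enabled) {
        /* `sum` and `k` are declared in this branch only, so their
           initializing stores are not executed when `enabled` is 0. */
        int sum = 0;
        for (size_t k = 0; k < len; k++) {
            sum += data[k];
        }
        return sum;
    }
    return 0;
}

/* Small self-check that can be called from any test harness. */
void sum_if_enabled_check(void)
{
    const int values[] = {1, 2, 3, 4};
    assert(sum_if_enabled(values, 4, 1) == 10);
    assert(sum_if_enabled(values, 4, 0) == 0);
}
```

Calling `sum_if_enabled_check()` from `main` is enough to exercise both paths.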
The following tests are compiled with an STM32H7 as the target.
Discussing stack operations of local variables with the following code#
for(int i = 0; i < 50; i++)
{
    for(int j = 0; j < 50; j++)
    {
        HAL_Delay(1);
    }
}
Intuitively, the local variable j is declared anew on every iteration of the outer loop. Does that lead to repeated stack allocation?
The disassembly for this part is as follows#
0x0000001e: LDR r0,[sp,#0]
0x00000020: STR r0,[sp,#8]
0x00000022: B {pc}+0x2 ; 0x24
0x00000024: LDR r0,[sp,#8]
0x00000026: CMP r0,#0x31
0x00000028: BGT {pc}+0x2c ; 0x54
0x0000002a: B {pc}+0x2 ; 0x2c
0x0000002c: MOVS r0,#0
0x0000002e: STR r0,[sp,#4]
0x00000030: B {pc}+0x2 ; 0x32
0x00000032: LDR r0,[sp,#4]
0x00000034: CMP r0,#0x31
0x00000036: BGT {pc}+0x14 ; 0x4a
0x00000038: B {pc}+0x2 ; 0x3a
0x0000003a: MOVS r0,#1
0x0000003c: BL HAL_Delay
0x00000040: B {pc}+0x2 ; 0x42
0x00000042: LDR r0,[sp,#4]
0x00000044: ADDS r0,#1
0x00000046: STR r0,[sp,#4]
0x00000048: B {pc}-0x16 ; 0x32
0x0000004a: B {pc}+0x2 ; 0x4c
0x0000004c: LDR r0,[sp,#8]
0x0000004e: ADDS r0,#1
0x00000050: STR r0,[sp,#8]
0x00000052: B {pc}-0x2e ; 0x24
Outer loop#
This is not the main point of discussion; it simply uses branches to run the inner loop 50 times.
0x0000001e: LDR r0,[sp,#0]
0x00000020: STR r0,[sp,#8]
0x00000022: B {pc}+0x2 ; 0x24
0x00000024: LDR r0,[sp,#8]
0x00000026: CMP r0,#0x31
0x00000028: BGT {pc}+0x2c ; 0x54
0x0000002a: B {pc}+0x2 ; 0x2c
; .....inner loop
0x0000004a: B {pc}+0x2 ; 0x4c
0x0000004c: LDR r0,[sp,#8]
0x0000004e: ADDS r0,#1
0x00000050: STR r0,[sp,#8]
0x00000052: B {pc}-0x2e ; 0x24
Inner loop#
0x0000002c: MOVS r0,#0
0x0000002e: STR r0,[sp,#4]
0x00000030: B {pc}+0x2 ; 0x32
0x00000032: LDR r0,[sp,#4]
0x00000034: CMP r0,#0x31
0x00000036: BGT {pc}+0x14 ; 0x4a
0x00000038: B {pc}+0x2 ; 0x3a
0x0000003a: MOVS r0,#1
0x0000003c: BL HAL_Delay
0x00000040: B {pc}+0x2 ; 0x42
0x00000042: LDR r0,[sp,#4]
0x00000044: ADDS r0,#1
0x00000046: STR r0,[sp,#4]
0x00000048: B {pc}-0x16 ; 0x32
The instructions at 0x2c and 0x2e store 0 to sp+4.
The increment and the branches then run the loop body 50 times.
So on every pass of the outer loop, this same sequence runs against the same slot at sp+4, while the outer loop counter always lives at sp+8. Nothing inside the loop adjusts sp, so no stack space is allocated or released there.
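The slot reuse can also be observed from C on a host machine. The sketch below records the address of `j` on every pass and counts how often it differs from the first one. `HAL_Delay` here is a host-side stub that only counts calls, and `count_slot_changes` and `slot_demo_check` are hypothetical names. The C standard does not strictly promise the same address across outer iterations, but typical compilers give each variable one fixed frame slot, so the count is normally 0.

```c
#include <assert.h>
#include <stdint.h>

static uint32_t delay_calls; /* number of calls made to the stub below */

/* Host-side stand-in for HAL_Delay: it only counts calls (assumption). */
static void HAL_Delay(uint32_t ms)
{
    (void)ms;
    delay_calls++;
}

/* Runs the nested loop and returns how many times `j` was seen at an
   address different from the one it had on the very first pass. */
int count_slot_changes(void)
{
    uintptr_t first = 0;
    int changes = 0;

    for (int i = 0; i < 50; i++) {
        for (int j = 0; j < 50; j++) {
            uintptr_t here = (uintptr_t)&j; /* taking &j forces j into memory */
            if (i == 0 && j == 0) {
                first = here;
            } else if (here != first) {
                changes++;
            }
            HAL_Delay(1);
        }
    }
    return changes;
}

/* Self-check: the loop body must run 50 * 50 times. */
void slot_demo_check(void)
{
    delay_calls = 0;
    (void)count_slot_changes();
    assert(delay_calls == 2500);
}
```

Taking the address of `j` keeps it in memory even with optimization enabled, which is why this mirrors the unoptimized listing rather than the register-only code shown later.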
What if local variables are defined in advance?#
Change the code to the following:
int i = 0;
int j = 0;
for(i = 0; i < 50; i++)
{
    for(j = 0; j < 50; j++)
    {
        HAL_Delay(1);
    }
}
The disassembly for this part is as follows#
0x0000001e: LDR r0,[sp,#0]
0x00000020: STR r0,[sp,#8]
0x00000022: STR r0,[sp,#4]
0x00000024: STR r0,[sp,#8]
0x00000026: B {pc}+0x2 ; 0x28
0x00000028: LDR r0,[sp,#8]
0x0000002a: CMP r0,#0x31
0x0000002c: BGT {pc}+0x2c ; 0x58
0x0000002e: B {pc}+0x2 ; 0x30
0x00000030: MOVS r0,#0
0x00000032: STR r0,[sp,#4]
0x00000034: B {pc}+0x2 ; 0x36
0x00000036: LDR r0,[sp,#4]
0x00000038: CMP r0,#0x31
0x0000003a: BGT {pc}+0x14 ; 0x4e
0x0000003c: B {pc}+0x2 ; 0x3e
0x0000003e: MOVS r0,#1
0x00000040: BL HAL_Delay
0x00000044: B {pc}+0x2 ; 0x46
0x00000046: LDR r0,[sp,#4]
0x00000048: ADDS r0,#1
0x0000004a: STR r0,[sp,#4]
0x0000004c: B {pc}-0x16 ; 0x36
0x0000004e: B {pc}+0x2 ; 0x50
0x00000050: LDR r0,[sp,#8]
0x00000052: ADDS r0,#1
0x00000054: STR r0,[sp,#8]
0x00000056: B {pc}-0x2e ; 0x28
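The loop itself (0x26-0x56) is identical to the previous version (0x22-0x52), but two extra STR instructions (at 0x22 and 0x24) now zero sp+4 and sp+8 before the loop starts. Both stores are redundant, because the for statements set the counters to 0 again anyway, so the result is slightly worse code.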
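One way to read the listing above is to map each store back to a source statement. The comments in the first function below follow the addresses in that listing. The second function, which drops the `= 0` initializers, is an untested variation added here for comparison; the expectation that it avoids the two extra stores is an assumption, not something measured in this article. `HAL_Delay` is again a host-side stub, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

static uint32_t delay_calls; /* number of calls made to the stub below */

/* Host-side stand-in for HAL_Delay: it only counts calls (assumption). */
static void HAL_Delay(uint32_t ms)
{
    (void)ms;
    delay_calls++;
}

/* Declarations hoisted WITH initializers: the variant measured above. */
uint32_t hoisted_with_init(void)
{
    delay_calls = 0;
    int i = 0;                      /* 0x20: STR r0,[sp,#8]              */
    int j = 0;                      /* 0x22: STR r0,[sp,#4]  (extra)     */
    for (i = 0; i < 50; i++) {      /* 0x24: STR r0,[sp,#8]  (extra)     */
        for (j = 0; j < 50; j++) {  /* 0x30-0x32: j = 0 on every pass    */
            HAL_Delay(1);
        }
    }
    return delay_calls;
}

/* Declarations hoisted WITHOUT initializers: untested variation.
   A plain `int i;` normally emits no store in unoptimized code. */
uint32_t hoisted_no_init(void)
{
    delay_calls = 0;
    int i;
    int j;
    for (i = 0; i < 50; i++) {
        for (j = 0; j < 50; j++) {
            HAL_Delay(1);
        }
    }
    return delay_calls;
}

/* Self-check: both variants must behave the same. */
void hoisted_check(void)
{
    assert(hoisted_with_init() == hoisted_no_init());
}
```

Both functions call the stub 2500 times, so the difference is confined to the stores executed before the loop.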
Will complicating the loop make a difference?#
The following, more complex code and its disassembly still show no extra stack operations on the loop counters at sp+8 and sp+12 (0xc).
int test = 0;
for(int i = 0; i < 50; i++)
{
    for(int j = 0; j < 50; j++)
    {
        if((test & 0x01) == 0)
            HAL_Delay(1);
        else
            HAL_Delay(2);
    }
    test++;
}
0x0000001e: 9801 .. LDR r0,[sp,#4]
0x00000020: 9004 .. STR r0,[sp,#0x10]
0x00000022: 9003 .. STR r0,[sp,#0xc]
0x00000024: e7ff .. B {pc}+0x2 ; 0x26
0x00000026: 9803 .. LDR r0,[sp,#0xc]
0x00000028: 2831 1( CMP r0,#0x31
0x0000002a: dc21 !. BGT {pc}+0x46 ; 0x70
0x0000002c: e7ff .. B {pc}+0x2 ; 0x2e
0x0000002e: 2000 . MOVS r0,#0
0x00000030: 9002 .. STR r0,[sp,#8]
0x00000032: e7ff .. B {pc}+0x2 ; 0x34
0x00000034: 9802 .. LDR r0,[sp,#8]
0x00000036: 2831 1( CMP r0,#0x31
0x00000038: dc12 .. BGT {pc}+0x28 ; 0x60
0x0000003a: e7ff .. B {pc}+0x2 ; 0x3c
0x0000003c: f89d0010 .... LDRB r0,[sp,#0x10]
0x00000040: 07c0 .. LSLS r0,r0,#31
0x00000042: b920 . CBNZ r0,{pc}+0xc ; 0x4e
0x00000044: e7ff .. B {pc}+0x2 ; 0x46
0x00000046: 2001 . MOVS r0,#1
0x00000048: f7fffffe .... BL HAL_Delay
0x0000004c: e003 .. B {pc}+0xa ; 0x56
0x0000004e: 2002 . MOVS r0,#2
0x00000050: f7fffffe .... BL HAL_Delay
0x00000054: e7ff .. B {pc}+0x2 ; 0x56
0x00000056: e7ff .. B {pc}+0x2 ; 0x58
0x00000058: 9802 .. LDR r0,[sp,#8]
0x0000005a: 3001 .0 ADDS r0,#1
0x0000005c: 9002 .. STR r0,[sp,#8]
0x0000005e: e7e9 .. B {pc}-0x2a ; 0x34
With the declarations moved to the top, the same code again comes out slightly worse: two extra STR instructions appear before the loop.
int test = 0;
int i = 0;
int j = 0;
for(i = 0; i < 50; i++)
{
    for(j = 0; j < 50; j++)
    {
        if((test & 0x01) == 0)
            HAL_Delay(1);
        else
            HAL_Delay(2);
    }
    test++;
}
0x0000001e: 9801 .. LDR r0,[sp,#4]
0x00000020: 9004 .. STR r0,[sp,#0x10]
0x00000022: 9003 .. STR r0,[sp,#0xc]
0x00000024: 9002 .. STR r0,[sp,#8]
0x00000026: 9003 .. STR r0,[sp,#0xc]
0x00000028: e7ff .. B {pc}+0x2 ; 0x2a
0x0000002a: 9803 .. LDR r0,[sp,#0xc]
0x0000002c: 2831 1( CMP r0,#0x31
0x0000002e: dc21 !. BGT {pc}+0x46 ; 0x74
0x00000030: e7ff .. B {pc}+0x2 ; 0x32
0x00000032: 2000 . MOVS r0,#0
0x00000034: 9002 .. STR r0,[sp,#8]
0x00000036: e7ff .. B {pc}+0x2 ; 0x38
0x00000038: 9802 .. LDR r0,[sp,#8]
0x0000003a: 2831 1( CMP r0,#0x31
0x0000003c: dc12 .. BGT {pc}+0x28 ; 0x64
0x0000003e: e7ff .. B {pc}+0x2 ; 0x40
0x00000040: f89d0010 .... LDRB r0,[sp,#0x10]
0x00000044: 07c0 .. LSLS r0,r0,#31
0x00000046: b920 . CBNZ r0,{pc}+0xc ; 0x52
0x00000048: e7ff .. B {pc}+0x2 ; 0x4a
0x0000004a: 2001 . MOVS r0,#1
0x0000004c: f7fffffe .... BL HAL_Delay
0x00000050: e003 .. B {pc}+0xa ; 0x5a
0x00000052: 2002 . MOVS r0,#2
0x00000054: f7fffffe .... BL HAL_Delay
0x00000058: e7ff .. B {pc}+0x2 ; 0x5a
0x0000005a: e7ff .. B {pc}+0x2 ; 0x5c
0x0000005c: 9802 .. LDR r0,[sp,#8]
0x0000005e: 3001 .0 ADDS r0,#1
0x00000060: 9002 .. STR r0,[sp,#8]
0x00000062: e7e9 .. B {pc}-0x2a ; 0x38
0x00000064: 9804 .. LDR r0,[sp,#0x10]
0x00000066: 3001 .0 ADDS r0,#1
0x00000068: 9004 .. STR r0,[sp,#0x10]
0x0000006a: e7ff .. B {pc}+0x2 ; 0x6c
0x0000006c: 9803 .. LDR r0,[sp,#0xc]
0x0000006e: 3001 .0 ADDS r0,#1
0x00000070: 9003 .. STR r0,[sp,#0xc]
0x00000072: e7da .. B {pc}-0x48 ; 0x2a
Using Optimization#
O1#
This uses the same complex loop as above.
Declared inside the for statement:
0x00000014: 2400 .$ MOVS r4,#0
0x00000016: bf00 .. NOP
0x00000018: f0040501 .... AND r5,r4,#1
0x0000001c: 2632 2& MOVS r6,#0x32
0x0000001e: bf00 .. NOP
0x00000020: 2002 . MOVS r0,#2
0x00000022: 2d00 .- CMP r5,#0
0x00000024: bf08 .. IT EQ
0x00000026: 2001 . MOVEQ r0,#1
0x00000028: f7fffffe .... BL HAL_Delay
0x0000002c: 3e01 .> SUBS r6,#1
0x0000002e: d1f7 .. BNE {pc}-0xe ; 0x20
0x00000030: 3401 .4 ADDS r4,#1
0x00000032: 2c32 2, CMP r4,#0x32
0x00000034: d1f0 .. BNE {pc}-0x1c ; 0x18
Declared in advance; the output is completely identical:
0x00000014: 2400 .$ MOVS r4,#0
0x00000016: bf00 .. NOP
0x00000018: f0040501 .... AND r5,r4,#1
0x0000001c: 2632 2& MOVS r6,#0x32
0x0000001e: bf00 .. NOP
0x00000020: 2002 . MOVS r0,#2
0x00000022: 2d00 .- CMP r5,#0
0x00000024: bf08 .. IT EQ
0x00000026: 2001 . MOVEQ r0,#1
0x00000028: f7fffffe .... BL HAL_Delay
0x0000002c: 3e01 .> SUBS r6,#1
0x0000002e: d1f7 .. BNE {pc}-0xe ; 0x20
0x00000030: 3401 .4 ADDS r4,#1
0x00000032: 2c32 2, CMP r4,#0x32
0x00000034: d1f0 .. BNE {pc}-0x1c ; 0x18
O2#
This uses the same complex loop as above.
Declared inside the for statement:
0x00000014: 2500 .% MOVS r5,#0
0x00000016: bf00 .. NOP
0x00000018: 2402 .$ MOVS r4,#2
0x0000001a: 2632 2& MOVS r6,#0x32
0x0000001c: 07e8 .. LSLS r0,r5,#31
0x0000001e: bf08 .. IT EQ
0x00000020: 2401 .$ MOVEQ r4,#1
0x00000022: bf00 .. NOP
0x00000024: 4620 F MOV r0,r4
0x00000026: f7fffffe .... BL HAL_Delay
0x0000002a: 3e01 .> SUBS r6,#1
0x0000002c: d1fa .. BNE {pc}-0x8 ; 0x24
0x0000002e: 3501 .5 ADDS r5,#1
0x00000030: 2d32 2- CMP r5,#0x32
0x00000032: d1f1 .. BNE {pc}-0x1a ; 0x18
Declared in advance; the output is completely identical:
0x00000014: 2500 .% MOVS r5,#0
0x00000016: bf00 .. NOP
0x00000018: 2402 .$ MOVS r4,#2
0x0000001a: 2632 2& MOVS r6,#0x32
0x0000001c: 07e8 .. LSLS r0,r5,#31
0x0000001e: bf08 .. IT EQ
0x00000020: 2401 .$ MOVEQ r4,#1
0x00000022: bf00 .. NOP
0x00000024: 4620 F MOV r0,r4
0x00000026: f7fffffe .... BL HAL_Delay
0x0000002a: 3e01 .> SUBS r6,#1
0x0000002c: d1fa .. BNE {pc}-0x8 ; 0x24
0x0000002e: 3501 .5 ADDS r5,#1
0x00000030: 2d32 2- CMP r5,#0x32
0x00000032: d1f1 .. BNE {pc}-0x1a ; 0x18
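To make the O1 and O2 listings above easier to follow, here is a rough source-level reading of them. In both, the loop variables live in registers (no stack traffic at all), `test` has been folded into the outer counter because the two always hold the same value, and the inner loop counts down from 50. At O1 the 1-or-2 choice is still made on every inner pass; at O2 it is made once per outer pass. This is an interpretation of the disassembly, not compiler output; `HAL_Delay` is a host-side stub that sums the requested delays, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

static uint32_t total_ms; /* sum of all delays requested from the stub */

/* Host-side stand-in for HAL_Delay: it only accumulates `ms` (assumption). */
static void HAL_Delay(uint32_t ms)
{
    total_ms += ms;
}

/* The loop as written in the article. */
uint32_t as_written(void)
{
    total_ms = 0;
    int test = 0;
    for (int i = 0; i < 50; i++) {
        for (int j = 0; j < 50; j++) {
            if ((test & 0x01) == 0)
                HAL_Delay(1);
            else
                HAL_Delay(2);
        }
        test++;
    }
    return total_ms;
}

/* Reading of the O1 listing: r4 = i (and test), r5 = i & 1, r6 = countdown. */
uint32_t o1_shape(void)
{
    total_ms = 0;
    for (int i = 0; i != 50; i++) {
        int odd = i & 1;                /* AND r5,r4,#1 */
        for (int n = 50; n != 0; n--) { /* SUBS r6,#1 / BNE */
            uint32_t ms = 2;            /* MOVS r0,#2 */
            if (odd == 0)               /* CMP r5,#0 / IT EQ */
                ms = 1;                 /* MOVEQ r0,#1 */
            HAL_Delay(ms);
        }
    }
    return total_ms;
}

/* Reading of the O2 listing: the delay value is chosen outside the inner loop. */
uint32_t o2_shape(void)
{
    total_ms = 0;
    for (int i = 0; i != 50; i++) {
        uint32_t ms = (i & 1) ? 2u : 1u; /* LSLS r0,r5,#31 / IT EQ / MOVEQ r4,#1 */
        for (int n = 50; n != 0; n--) {
            HAL_Delay(ms);               /* MOV r0,r4 / BL HAL_Delay */
        }
    }
    return total_ms;
}

/* Self-check: all three shapes must request the same total delay. */
void shapes_check(void)
{
    assert(as_written() == o1_shape());
    assert(as_written() == o2_shape());
}
```

Each version requests 25 * 50 * 1 + 25 * 50 * 2 = 3750 ms in total, which is a quick way to confirm the three shapes are equivalent.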
O3#
O3 is not worth discussing here, since it completely unrolls the loop.
Situation under GCC environment#
With GCC, defining the local variables in advance also makes the generated code worse.
Local variables defined inside the for statement: 20 instructions
Local variables defined in advance: 24 instructions
This article is updated by Mix Space to xLog. The original link is https://www.yono233.cn/posts/shoot/24_8_6_%E5%85%B3%E4%BA%8E%E5%B1%80%E9%83%A8%E5%8F%98%E9%87%8F%E7%9A%84%E6%A0%88%E8%A1%8C%E4%B8%BA%E2%80%94%E2%80%94%E7%94%B1%E5%BE%AA%E7%8E%AF%E8%AF%AD%E5%8F%A5%E5%86%85%E5%AE%9A%E4%B9%89%E5%BE%86%E7%8E%AF%E5%8F%98%E9%87%8F%E5%BC%95%E7%94%B3