How we found that the Linux nios2 memset() implementation had a bug!

NIOS II processorNiosII is a 32-bit RISC embedded processor architecture designed by Altera, for its family of FPGAs: Cyclone III, Cyclone IV, etc. Being a soft-core architecture, by using Altera’s Quartus Prime design software, you can adjust the CPU configuration to your needs and instantiate it into the FPGA. You can customize various parameters like the instruction or the data cache size, enable/disable the MMU, enable/disable an FPU, and so on. And for us embedded Linux engineers, a very interesting aspect is that both the Linux kernel and the U-Boot bootloader, in their official versions, support the NIOS II architecture.

Recently, one of our customers designed a custom NIOS II platform, and we are working on porting the mainline U-Boot bootloader and the mainline Linux kernel to this platform. The U-Boot porting went fine, and quickly allowed us to load and start a Linux kernel. However, the Linux kernel was crashing very early with:

[    0.000000] Linux version 4.5.0-00007-g1717be9-dirty (rperier@archy) (gcc version 4.9.2 (Altera 15.1 Build 185) ) #74 PREEMPT Fri Apr 22 17:43:22 CEST 2016
[    0.000000] bootconsole [early0] enabled
[    0.000000] early_console initialized at 0xe3080000
[    0.000000] BUG: failure at mm/bootmem.c:307/__free()!
[    0.000000] Kernel panic - not syncing: BUG!

This BUG() comes from the __free() function in mm/bootmem.c. The bootmem allocator is a simple page-based allocator used very early in the Linux kernel initialization for the very first allocations, even before the regular buddy page allocator and other allocators such as kmalloc are available. We were slightly surprised to hit a BUG in a generic part of the kernel, and immediately suspected some platform-specific issue, like an invalid load address for our kernel, or invalid link address, or other ideas like this. But we quickly came to the conclusion that everything was looking good on that side, and so we went on to actually understand what this BUG was all about.

The NIOS II memory initialization code in arch/nios2/kernel/setup.c does the following:

bootmap_size = init_bootmem_node(NODE_DATA(0),
                                 min_low_pfn, PFN_DOWN(PHYS_OFFSET),
                                 max_low_pfn);
[...]
free_bootmem(memory_start, memory_end - memory_start);

The first call init_bootmem_node() initializes the bootmem allocator, which primarily consists in allocating a bitmap, with one bit per page. The entire bootmem bitmap is set to 0xff via a memset() during this initialization:

static unsigned long __init init_bootmem_core(bootmem_data_t *bdata,
        unsigned long mapstart, unsigned long start, unsigned long end)
{
        [...]
        mapsize = bootmap_bytes(end - start);
        memset(bdata->node_bootmem_map, 0xff, mapsize);
        [...]
}

After doing the bootmem initialization, the NIOS II architecture code calls free_bootmem() to mark all the memory pages as available, except the ones that contain the kernel itself. To achieve this, the __free() function (which is the one triggering the BUG) clears the bits corresponding to the page to be marked as free. When clearing those bits, the function checks that the bit was previously set, and if it’s not the case, fires the BUG:

static void __init __free(bootmem_data_t *bdata,
                        unsigned long sidx, unsigned long eidx)
{
        [...]
        for (idx = sidx; idx < eidx; idx++)
                if (!test_and_clear_bit(idx, bdata->node_bootmem_map))
                        BUG();
}

So to summarize, we were in a situation where a bitmap is memset to 0xff, but almost immediately afterwards, a function that clears some bits finds that some of the bits are already cleared. Sounds odd, doesn’t it?

We started by double checking that the address of the bitmap was the same between the initialization function and the __free function, verifying that the code was not overwriting the bitmap, and other obvious issues. But everything looked alright. So we simply dumped the bitmap after it was initialized by memset to 0xff, and to our great surprise, we found that the bitmap was in fact initialized with the pattern 0xff00ff00 and not 0xffffffff. This obviously explained why we were hitting this BUG(): simply because the buffer was not properly initialized. At first, we really couldn’t believe this: how it is possible that something as essential as memset() in Linux was not doing its job properly?

On the NIOS II platform, memset() has an architecture-specific implementation, available in arch/nios2/lib/memset.c. For buffers smaller than 8 bytes, this memset implementation uses a simple naive loop, iterating byte by byte. For larger buffers, it uses a more optimized implementation, using inline assembly. This implementation copies data per blocks of 4-bytes rather than 1 byte to speed-up the memset.

We quickly tested a workaround that consisted in using the naive implementation for all buffer sizes, and it solved the problem: we had a booting kernel, all the way to the point where it mounts a root filesystem! So clearly, it’s the optimized implementation in assembly that had a bug.

After some investigation, we found out that the bug was in the very first instructions of the assembly code. The following piece of assembly is supposed to create a 4-byte value that repeats 4 times the 1-byte pattern passed as an argument to memset:

/* fill8 %3, %5 (c & 0xff) */
"       slli    %4, %5, 8\n"
"       or      %4, %4, %5\n"
"       slli    %3, %4, 16\n"
"       or      %3, %3, %4\n"

This code takes as input in %5 the one-byte pattern, and is supposed to return in %3 the 4-byte pattern. It goes through the following logic:

  • Stores in %4 the initial pattern shifted left by 8 bits. Provided an initial pattern of 0xff, %4 should now contain 0xff00
  • Does a logical or between %4 and %5, which leads to %4 containing 0xffff
  • Stores in %3 the 2-byte pattern shifted left by 16 bits. %3 should now contain 0xffff0000.
  • Does a logical or between code>%3 and %4, i.e between 0xffff0000 and 0xffff, which gives the expected 4-byte pattern 0xffffffff

When you look at the source code, it looks perfectly fine, so our source code review didn’t spot the problem. However, when looking at the actual compiled code disassembled, we got:

34:	280a923a 	slli	r5,r5,8
38:	294ab03a 	or	r5,r5,r5
3c:	2808943a 	slli	r4,r5,16
40:	2148b03a 	or	r4,r4,r5

Here r5 gets used for both %4 and %5. Due to this, the final pattern stored in r4 is 0xff00ff00 instead of the expected 0xffffffff.

Now, if we take a look at the output operands, %4 is defined with the "=r" constraint, i.e an output operand. How to prevent the compiler from re-using the corresponding register for another operand? As explained in this document, "=r" does not prevent gcc from using the same register for an output operand (%4) and input operand (%5). By adding the constrainst & (in addition to "=r"), we tell the compiler that the register associated with the given operand is an output-only register, and so, cannot be used with an input operand.

With this change, we get the following assembly output:

34:	2810923a 	slli	r8,r5,8
38:	4150b03a 	or	r8,r8,r5
3c:	400e943a 	slli	r7,r8,16
40:	3a0eb03a 	or	r7,r7,r8

Which is much better, and correctly produces the 0xffffffff pattern when 0xff is provided as the initial 1-byte pattern to memset.

In the end, the final patch only adds one character to adjust the inline assembly constraint and gets the proper behavior from gcc:

diff --git a/arch/nios2/lib/memset.c b/arch/nios2/lib/memset.c
index c2cfcb1..2fcefe7 100644
--- a/arch/nios2/lib/memset.c
+++ b/arch/nios2/lib/memset.c
@@ -68,7 +68,7 @@ void *memset(void *s, int c, size_t count)
 		  "=r" (charcnt),	/* %1  Output */
 		  "=r" (dwordcnt),	/* %2  Output */
 		  "=r" (fill8reg),	/* %3  Output */
-		  "=r" (wrkrega)	/* %4  Output */
+		  "=&r" (wrkrega)	/* %4  Output only */
 		: "r" (c),		/* %5  Input */
 		  "0" (s),		/* %0  Input/Output */
 		  "1" (count)		/* %1  Input/Output */

This patch was sent upstream to the NIOS II kernel maintainers:
[PATCH v2] nios2: memset: use the right constraint modifier for the %4 output operand, and has already been applied by the NIOS II maintainer.

We were quite surprised to find a bug in some common code for the NIOS II architecture: we were assuming it would have already been tested on enough platforms and with enough compilers/situations to not have such issues. But all in all, it was a fun debugging experience!

It is worth mentioning that in addition to this bug, we found another bug affecting NIOS II platforms, in the asm-generic implementation of the futex_atomic_cmpxchg_inatomic() function, which was causing some preemption imbalance warnings during the futex subsystem initialization. We also sent a patch for this problem, which has also been applied already.

About Romain Perier

Romain worked in 2016 at Free Electrons as a kernel and embedded Linux engineer.
This entry was posted in Technical and tagged , , . Bookmark the permalink.

6 Responses to How we found that the Linux nios2 memset() implementation had a bug!

  1. Atanas Palavrov says:

    Don’t see why this was written in assembly. GCC is good enough to produce exactly the same binary code with C i.e.

    uint32 fill16 = (fill << 8) | fill;
    uint32 fill32 = (fill16 << 16) | fill16;

    IMHO a it is a better to remove all that assembly code and replace it with C. This will allow GCC to allocate registers in a better way, will reduce the code complexity, etc.
    The only thing that need to be inline is just the tight fill loop and only in case that GCC is generating something not optimal.

    This is a rare good example for premature optimization.

  2. Alejandro Martinez says:

    Typo: s/8 bytes/8 bits/. Also agree with the comment above, the compiler should already produce correct assembly out of C code.

  3. Faz says:

    Typo… try s/shifted left by 8 bytes/shifted hinted left by 8 bits/

    Hth…

    • Faz says:

      Oops the ‘hinted’ is a typo in my fix for your typo.

      Auto correcting dictionary bites smartphone user again..
      Lol

  4. mwb says:

    GCC’s inline assembly syntax is almost as bad as regex in terms of being incomprehensible once written. It’s worse than regex in terms of writing it in the first place. I’m amazed at how much of it is written without a bug.

Leave a Reply to Atanas Palavrov Cancel reply

Your email address will not be published. Required fields are marked *