Boot time: choose your kernel loading address carefully

When the compressed and uncompressed kernel images overlap

At least on ARM32, there seems to be many working addresses where the compressed kernel can be loaded in RAM. For example, one can load the compressed kernel at offset 0x1000000 (16 MB) from the start of RAM, and the Device Tree Blog (DTB) at offset 0x2000000 (32 MB). Whatever this loading address, the kernel is then decompressed at offset 0x8000 from the start of RAM, as explained this the famous How the ARM32 Linux kernel decompresses article from Linus Walleij.

There is a potential issue with the loading address of the compressed kernel, as explained in the article too. If the compressed kernel is loaded too close to the beginning of RAM, where the kernel must be decompressed, there will be an overlap between the two. The decompressed kernel will overwrite the compressed one, potentially breaking the decompression process.

Overlapping compressed and decompressed kernel

As you see in the above diagram, when this happens, the bootstrap code in the compressed kernel will first copy the compressed image to a location that’s far enough to guarantee that the decompressed kernel won’t overlap it. However, this extra step in the boot process has a cost.

Measuring boot time impact

In the context of updating our materials for our upcoming Embedded Linux Boot Time Optimization course in June, we measured this additional time on the STM32MP157A-DK1 Discovery Kit from STMicroelectronics, with a dual-core ARM Cortex-A7 CPU running at 650 MHz.

Initially, in our Embedded Linux System Development course, we were booting the DK1 board as follows:

ext4load mmc 0:4 0xc0000000 zImage; ext4load mmc 0:4 0xc4000000 dtb; bootz 0xc0000000 - 0xc4000000

0xc0000000 is exactly the beginning of RAM! We are therefore in the overlap situation.

We used grabserial from Tim Bird to measure the time between Starting kernel in U-Boot and when the compressed kernel starts executing (Booting Linux on physical CPU 0x0):

...
[4.451996 0.000124] Starting kernel ...
[0.001838 0.001838] 
[2.439980 2.438142] [    0.000000] Booting Linux on physical CPU 0x0
...

On a series of 5 identical tests, we obtained an average time of 2,440 ms, with a standard deviation of 0.4 ms.

Then, we measured the optimum case, in which the compressed kernel is loaded far enough from the beginning of RAM so that no overlap is possible:

No overlap between compressed and decompressed kernel

Here we chose to load the kernel at 0xc2000000:

ext4load mmc 0:4 0xc2000000 zImage; ext4load mmc 0:4 0xc4000000 dtb; bootz 0xc2000000 - 0xc4000000

On a series of 5 identical tests, we obtained an average time of 2,333 ms, with a standard deviation of 0.7 ms.

The new average is 107 ms smaller, which you are likely to consider as a worthy reduction, if you have experience with boot time reduction projects.

What to remember

In your embedded projects, if you are using a compressed kernel, make sure it is loaded far enough from the beginning of RAM, leaving enough space for the decompressed kernel to fit in between. Otherwise, your system will still be able to boot, but depending on the speed of your CPU and storage, it will be slower, from a few tens to a few hundreds of milliseconds.

We checked the How to optimize the boot time page on the STM32 wiki, and it recommends optimum loading addresses: 0xc2000000 for the kernel and 0xc4000000 for the device tree. This way, the upper limit for the decompressed kernel is 32 MB, which is more than enough.

If you are directly using an uncompressed kernel, which is more rare, you should also make sure that it is loaded at an optimum location, so that there is no need to move it before starting it.

Embedded Linux Boot time optimization: training and webinar

Bootlin has been helping its customers optimize the boot time of embedded Linux systems for many years, ensuring that these systems meet their startup time requirements. Thanks to this, Bootlin has accumulated a significant experience in this field.

Next week, Bootlin is organizing or participating to two events related to Embedded Linux boot time optimization:

In addition, if you need help and support to optimize the boot time of your Embedded Linux systems, do not hesitate to as we offer engineering services, thanks to which we can take your existing Embedded Linux system, and significantly reduce its startup time to meet your requirements.

Slides and videos of Bootlin talks at Live Embedded Event #2

The second edition of Live Embedded Event took place on June 3rd, exactly 6 months after the first edition. Even though there were a few issues with the online platform, it was once again great to learn new things about embedded, and share some of the work we’ve been doing at Bootlin on various topics. For the next edition, we plan to switch to a different online platform, hopefully providing a better experience.

But in the mean time, all videos of the event have been posted on the Youtube Channel of the event. The talks from Bootlin have been posted on Bootlin’s Youtube Channel.

Indeed, in addition to being part of the organization committee, Bootlin prepared and delivered 5 talks as part of Live Embedded Event, covering different topics we have worked on in the recent months for our customers.

Understanding U-Boot Falcon Mode and adding support for new boards, Michael Opdenacker

Slides [PDF]

Introduction to RAUC, Kamel Bouhara

Slides [PDF]

Security vulnerability tracking tools in Buildroot, Thomas Petazzoni

Slides [PDF]

Secure boot in embedded Linux systems, Thomas Perrot

Slides [PDF]

Device Tree overlays and U-boot extension board management, Köry Maincent

Slides [PDF]

New training course: embedded Linux boot time optimization

For many embedded products, the issue of how much time it takes from power-on to the application being fully usable by the end-user is an important challenge. Bootlin has been providing its expertise and experience in this area to its customers for many years through numerous boot time optimization projects, and we have shared this knowledge through a number of talks at several conferences over the past years.

We are now happy to announce that we have a new training course Embedded Linux boot time optimization, open for public registration. This training course was already given to selected Bootlin customers and is now available for everyone.

Embedded Linux boot time optimization

The training course will be lead by Michael Opdenacker, Bootlin’s founder, and author of several publications on the topic of boot time optimization. The course is organized over 4 sessions of 4 hours, with a significant fraction of time spent on practical demonstrations showing on a real-life example the techniques to measure and reduce the boot time of an embedded Linux system.

As usual with Bootlin, the training materials are fully available: Agenda, Slides and Practical lab instructions.

Boot time optimization slide

Our first course open for public registration will take place from April 6th to April 9th, 2021, from 14:00 to 18:00 UTC+2 (Paris time) on each day. The session cost is 519 EUR if you take advantage of the early bird price available until March 9th. Otherwise, the regular rate is 619 EUR. You can register now for this course on Eventbrite.

Also, if you’re interested in organizing a dedicated session for your company, do not hesitate to contact us.

New training materials: boot time reduction workshop

We are happy to release new training materials that we have developed in 2013 with funding from Atmel Corporation.

The materials correspond to a 1-day embedded Linux boot time reduction workshop. In addition to boot time reduction theory, consolidating some of our experience from our embedded Linux boot time reduction projects, the workshop allows participants to practice with the most common techniques. This is done on SAMA5D3x Evaluation Kits from Atmel.

The system to optimize is a video demo from Atmel. We reduce the time to start a GStreamer based video player. During the practical labs, you will practice with techniques to:

  • Measure the various steps of the boot process
  • Analyze time spent starting system services, using bootchartd
  • Simplify your init scripts
  • Trace application startup with strace
  • Find kernel functions taking the most time during the boot process
  • Reduce kernel size and boot time
  • Replace U-Boot by the Barebox bootloader, and save a lot of time
    thanks to the activation of the data cache.

Creative commonsAs usual, our training materials are available under the terms of the Creative Commons Attribution-ShareAlike 3.0 license. This essentially means that you are free to download, distribute and even modify them, provided you mention us as the original authors and that you share these documents under the same conditions.

Special thanks to Atmel for allowing us to share these new materials under this license!

Here are the documents at last:

The first public session of this workshop will be announced in the next weeks.
Don’t hesitate to contact us if you are interested in organizing a session on your site.

Starting Linux directly from AT91bootstrap3

Here is an update for our previous article on booting linux directly from AT91bootstrap. On newer ATMEL platforms, you will have to use AT91bootstrap 3. It now has a convenient way to be configured to boot directly to Linux.

You can check it out from github:

git clone git://github.com/linux4sam/at91bootstrap.git

That version of AT91bootstrap is using the same configuration mechanism as the Linux kernel. You will find default configurations, named in the form:
<board_name><storage>_<boot_strategy>_defconfig

  • board_name can be: at91sam9260ek, at91sam9261ek, at91sam9263ek, at91sam9g10ek, at91sam9g20ek, at91sam9m10g45ek, at91sam9n12ek, at91sam9rlek, at91sam9x5ek, at91sam9xeek or at91sama5d3xek
  • storage can be:
    • df for DataFlash
    • nf for NAND flash
    • sd for SD card
  • our main interest will be in boot_strategy which can be:
    • uboot: start u-boot or any other bootloader
    • linux: boot Linux directly, passing a kernel command line
    • linux_dt: boot Linux directly, using a Device Tree
    • android: boot Linux directly, in an Android configuration

Let’s take for example the latest evaluation boards from ATMEL, the SAMA5D3x-EK. If you are booting from NAND flash:

make at91sama5d3xeknf_linux_dt_defconfig
make

You’ll end up with a file named at91sama5d3xek-nandflashboot-linux-dt-3.5.4.bin in the binaries/ folder. This is your first stage bootloader. It has the same storage layout as used in the u-boot strategy so you can flash it and it will work.

As a last note, I’ll had that less is not always faster. On our benchmarks, booting the SAMA5D31-EK using AT91bootstrap, then Barebox was faster than just using AT91bootstrap. The main reason is that barebox is actually enabling the caches and decompresses the kernel(see below, the kernel is also enaling the caches before decompressing itself) before booting.

Linux on ARM: xz kernel decompression benchmarks

I recently managed to find time to clean up and submit my patches for xz kernel compression support on ARM, which I started working on back in November, during my flight to Linaro Connect. However, it was too late as Russell King, the ARM Linux maintainer, alreadyaccepted a similar patch, about 3 weeks before my submission. The lesson I learned was that checking a git tree is not always sufficient. I should have checked the mailing list archives too.

The good news is that xz kernel compression support should be available in Linux 3.4 in a few months from now. xz is a compression format based on the LZMA2 compression algorithm. It can be considered as the successor of lzma, and achieves even better compression ratios!

Before submitting my patches, I ran a few benchmarks on my own implementation. As the decompressing code is the same, the results should be the same as if I had used the patches that are going upstream.

Benchmark methodology

For both boards I tested, I used the same pre 3.3 Linux kernel from Linus Torvalds’ mainline git tree. I also used the U-boot bootloader in both cases.

I used the very useful grabserial script from Tim Bird. This utility reads messages coming out of the serial line, and adds timestamps to each line it receives. This allow to measure time from the earliest power on stages, and doesn’t slow down the target system by adding instrumentation to it.

Our benchmarks just measure the time for the bootloader to copy the kernel to RAM, and then the time taken by the kernel to uncompress itself.

  • Loading time is measured between “reading uImage” and “OK” (right before “Starting kernel”) in the bootloader messages.
  • Compression time measured between “Uncompressing Linux” and “done”:
    ~/bin/grabserial -v -d /dev/ttyUSB0 -e 15 -t -m "Uncompressing Linux" -i "done," > booting-lzo.log

Benchmarks on OMAP4 Panda

The Panda board has a fast dual Cortex A9 CPU (OMAP 4430) running at 1 GHz. The standard way to boot this board is from an MMC/SD card. Unfortunately, the MMC/SD interface of the board is rather slow.

In this case, we have a fast CPU, but with rather slow storage. Therefore, the time taken to copy the kernel from storage to RAM is expected to have a significant impact on boot time.

This case typically represents todays multimedia and mobile devices such as phones, media players and tablets.

Compression Size Loading time Uncompressing time Total time
gzip 3355768 2.213376 0.501500 2.714876
lzma 2488144 1.647410 1.399552 3.046962
xz 2366192 1.566978 1.299516 2.866494
lzo 3697840 2.471497 0.160596 2.632093
None 6965644 4.626749 0 4.626749

Results on Calao Systems USB-A9263 (AT91)

The USB-A9263 board from Calao Systems has a cheaper and much slower AT91SAM9263 CPU running at 200 MHz.

Here we are booting from NAND flash, which is the fastest way to boot a kernel on this board. Note that we are using the nboot command from U-boot, which guarantees that we just copy the number of bytes specified in the uImage header.

In this case, we have a slow CPU with slow storage. Therefore, we expect both the kernel size and the decompression algorithm to have a major impact on boot time.

This case is a typical example of industrial systems (AT91SAM9263 is still very popular in such applications, as we can see from customer requests), booting from NAND storage operating with a 200 to 400 MHz CPU.

Compression Size Loading time Uncompressing time Total time
gzip 2386936 5.843289 0.935495 6.778784
lzma 1794344 4.465542 6.513644 10.979186
xz 1725360 4.308605 4.816191 9.124796
lzo 2608624 6.351539 0.447336 6.798875
None 4647908 11.080560 0 11.080560

Lessons learned

Here’s what we learned from these benchmarks:

  • lzo is still the best solution for minimum boot time. Remember, lzo kernel compression was merged by Bootlin.
  • xz is always better than lzma, both in terms of image size. Therefore, there’s no reason to stick to lzma compression if you used it.
  • Because of their heavy CPU usage, lzma and xz remain pretty bad in terms of boot time, on most types of storage devices. On systems with a fast CPU, and very slow storage though, xz should be the best solution
  • On systems with a fast CPU, like the Panda board, boot time with xz is actually pretty close to lzo, and therefore can be a very interesting compromise between kernel size and boot time.
  • Using a kernel image without compression is rarely a worthy solution, except in systems with a very slow CPU. This is the case of CPUs emulated on an FPGA (typically during chip development, before silicon is available). In this particular case, copying to memory is directly done by the emulator, and we just need CPU cycles to start the kernel.

Embedded Linux boot time reduction presentation for GENIVI

GENIVI LogoI was invited to speak at the GENIVI All Members Meeting that took place on May 3-6 in Dublin, Ireland. This was a very interesting opportunity to meet new people in the In Vehicle Infotainment (IVI) industry and community.

In addition to the friendly social event at the Guiness Brewery, there was also a very interesting technical showcase of products and software using the GENIVI stack. I could observe that Freescale and ARM chips in general dominate this market. I also wore my Linaro shirt and had interesting discussions with several people about partnership opportunities between GENIVI and Linaro.

I gave a presentation about reducing boot time in embedded Linux systems. The slides are available in PDF and ODF formats, and as usual, are released with a Creative Commons Attribution – Share Alike 3.0 license. Here is the description of the talk:

Cheap Linux boot time reduction techniques

By Michael Opdenacker, Bootlin

More and more feature rich Linux devices are put in the hands of consumers, and the average consumer shouldn’t even notice that they run Linux. To make the OS invisible, the system should boot in a flash.

Multiple boot time reduction techniques are now available, and can be used at the end of a development project, without incurring redesign costs. This presentation will guide embedded Linux system developers through the most effective ones. For each technique, we will detail how to use it and will report the exact savings achieved on a real embedded board.

Author’s biography

Michael Opdenacker is the founder of Bootlin (https://bootlin.com), a company offering development, consulting and training services to embedded Linux system developers worldwide. He is always looking for innovative techniques to share with customers and with the community.

Michael is also the Community Manager for Linaro (http://linaro.org), a not-for-profit engineering organization working on software foundations for Linux on ARM, to reduce fragmentation between ARM chip vendors, increase product performance and reduce time to market. Linaro currently employs more than 100 of the most active developers in the ARM and embedded Linux community.

I was pleased to have a good number of participants, and to get many questions during and after the talk.

Though GENIVI is about Free and Open Source Software, it is unfortunately not very open to the community yet. You have to become a member to access its specifications, wiki and other technical resources. While collecting membership fees makes sense to operate such an organization, and is acceptable for system makers, it makes it difficult for embedded Linux community developers to get involved. I hope that GENIVI will become more open to the wider embedded Linux community in the future.

Linux 2.6.33 features for embedded systems

Interesting features for embedded Linux system developers

Penguin workerLinux 2.6.33 was out on Feb. 24, 2010, and to incite you to try this new kernel in your embedded Linux products, here are features you could be interested in.

The first news is the availability of the LZO algorithm for kernel and initramfs compression. Linux 2.6.30 already introduced LZMA and BZIP2 compression options, which could significantly reduce the size of the kernel and initramfs images, but at the cost of much increased decompression time. LZO compression is a nice alternative. Though its compression rate is not as good as that of ZLIB (10 to 15% larger files), decompression time is much faster than with other algorithms. See our benchmarks. We reduced boot time by 200 ms on our at91 arm system, and the savings could even increase with bigger kernels.

This feature was implemented by my colleague Albin Tonnerre. It is currently available on x86 and arm (commit, commit, commit, commit), and according to Russell King, the arm maintainer, it should become the default compression option on this platform. This compressor can also be used on mips, thanks to Wu Zhangjin (commit).

For systems lacking RAM resources, a new useful feature is Compcache, which allows to swap application memory to a compressed cache in RAM. In practise, this technique increases the amount of RAM that applications can use. This could allow your embedded system or your netbook to run applications or environments it couldn’t execute before. This technique can also be a worthy alternative to on-disk swap in servers or desktops which do need a swap partition, as access performance is much improved. See this LWN.net article for details.

This new kernel also carries lots of improvements on embedded platforms, especially on the popular TI OMAP platform. In particular, we noticed early support to the IGEPv2 board, a very attractive platform based on the TI OMAP 3530 processor, much better than the Beagle Board for a very similar price. We have started to use it in customer projects, and we hope to contribute to its full support in the mainline kernel.

Another interesting feature of Linux 2.6.33 is the improvements in the capabilities of the perf tool. In particular, perf probe allows to insert Kprobes probes through the command line. Instead of SystemTap, which relied on kernel modules, perf probe now relies on a sysfs interface to pass probes to the kernel. This means that you no longer need a compiler and kernel headers to produce your probes. This made it difficult to port SystemTap to embedded platforms. The arm architecture doesn’t have performance counters in the mainline kernel yet (other architectures do), but patches are available. This carries the promise to be able to use probe tools like SystemTap at last on embedded architectures, all the more if SystemTap gets ported to this new infrastructure.

Other noticeable improvements in this release are the ability to mount ext3 and ext2 filesystems with just an ext4 driver, a lightweight RCU implementation, as well as the ability to change the default blinking cursor that is shown at boot time.

Unfortunately, each kernel release doesn’t only carry good news. Android patches got dropped from this release, because of a lack of interest from Google to maintain them. These are sad news and a threat for Android users who may end up without the ability to use newer kernel features and releases. Let’s hope that Google will once more realize the value of converging with the mainline Linux community. I hope that key contributors that this company employs (Andrew Morton in particular) will help to solve this issue.

As usual, this was just a selection. You will probably find many other interesting features on the Linux Changes page for Linux 2.6.33.

LZO kernel compression

As Michael stated in his review of the interesting features in Linux 2.6.30, new compression options have been recently added to the kernel. We therefore decided to have a look at those compression methods, from a compression ratio and decompression speed point of view.

This comparison will be based on “self-extractible kernels”, that is, kernel images containing bootstrap code allowing them to extract a compressed image. As underlined in the previous article, this approach is not used on all architectures. Blackfin, notably, chose a different path and compresses the whole kernel image, without including bootstrap code. While this has the clear advantage of making compression much simpler with respect to kernel code, it forces decompression out to the bootloader code.

Each of those methods has its advantages. Indeed, the Blackfin approach relies on the bootloader to provide the necessary functions, so that may be a problem to do things like bypassing u-boot to reduce the boot time. On the other hand, implementing it only once in the bootloader (as architecture-independent code) makes it unnecessary to write the low-level bootstrap code for each architecture in the kernel, which is surely interesting on virtually all architectures, the notable exceptions being x86 and ARM.

Gzip (also known as Zlib or inflate) has been the traditional (and, as a matter of fact, only) method used to compress kernels. Consequently, we’ll use it as the reference in the following tests. Our test environment is as follows:

ARM9 AT91SAM9263 CPU, 200MHz, using the mainline arch/arm/configs/usb-a9263.config.

This comparison includes figures for LZO, a new kernel image compression method that I have contributed to the Linux sources, and which hopefully will make its way into the mainline kernel. LZO support in the kernel is only new for kernel decompression, as it is already used by JFFS2 and UBIFS. LZO is a stream-oriented algorithm, and although its compression ratio is lower than that of gzip, decompression is lightning-fast, as we will soon find out.

So, here are the figures, average on 20 boots with each compression method:

Uncompressed 3.24Mo 200%
LZO 1.76Mo 0.552s 70% 109%
Gzip 1.62Mo 0.775s 100% 100%
LZMA 1.22Mo 5.011s 646% 75%

Bzip2 has not been tested here: the low-level bootstrap file, head.S, only allocates 64Kb for use by malloc() on ARM. Some quick tests showed that the kernel would not extract with less than 3.5Mib of malloc() space. That would require to modify head.S so that malloc can use more memory, which we will not do here. However, given that enough memory is usable on the system, one could well use bzip2. All the other algorithms performed the extraction using the standard 64Kib malloc space.

From the results, we can clearly see that LZMA is nearly unsuitable for our system, and should be considered only if the space constraints for storing the kernel are so tight that we can’t afford to use more space that was is strictly necessary.

LZO looks like a good candidate when it comes to speeding up the boot process, at the expense of some (almost neglectable) extra space. Gzip is close to LZO when it comes to size, although extraction is not as fast. That means that unless you’re hitting corner cases, like only having enough space for a Gzip compressed image but not for one made with LZO, choosing the latter is probably a safe bet.

Besides, the LZO-compressed kernel size is about 54% the size of the uncompressed kernel. As the kernel load time varies linearly with its size, load time for an uncompressed kernel doubles. While 0.55s are won because there’s no need to run a decompression algorithm, you spend twice as much time loading the kernel. This time is not negligible at all compared to the decompression time. Indeed, loading the uncompressed image takes roughly 0.8s. That means that at the cost of slowing down the boot process by 0.15s (compared to an uncompressed kernel), one gets a kernel image which is roughly twice as small. Rather nice, isn’t it?