E-commerce: WebP, AVIF or JPEG-XL?

Aug 29, 2021

Image optimization for e-commerce

We started optimizing images for e-commerce in 2017. We had to throw away our code and algorithms twice, and start again. Now we have had the third iteration online for three years, and it works. We optimize and serve images for e-commerce websites and ecommerce platforms. We optimize a few million distinct product images per month and serve them as WebP, AVIF or JPEG-XL, whichever is smallest, supported by the requesting browser, and of good quality.

What do we mean by image optimization for ecommerce? We don’t give any “artistic” or “director” feedback to the ecommerces that we serve. Our role is to simply take whatever image our customers put on their sites, and decrease its byte count as much as possible without anybody noticing. On a good day, we are simply invisible. And since our developers handle customer support – yours truly does a lot of that – we cherish our invisibility more than anything.

There are some particularities of ecommerce images worth mentioning. For example, the bulk of it are product images at different sizes. Those images are “bi-modal”, roughly half of their pixels are for background and thus low on important information, and the other half contains the product proper.

So, how exactly is an image optimized for web-serving? For you to understand our current recipe, we have to tell you how we got here.

WebP is almost always good

We were not much into image optimization back in 2017, instead, we were looking for ways to optimize HTTP resource delivery with HTTP/2. But what we gained in loading times with lots of HTTP/2 cleverness was dwarfed by what we could gain by converting a single image from JPEG to WebP. And Google Lighthouse already had that message inciting website owners to deploy WebP everywhere. What gives? As a nimble startup that we are, we decided to bite. How exactly is an image converted to WebP? Here is one way to go about it:

cwebp  <options> <input_image> -o <output_image>

Here, cwebp is a command-line program that leverages libwebp to do the actual encoding. We should say at this point that libwebp is a really good library, but as far as we know, the only WebP encoder there is. Using Python PIL or ImageMagick only changes the front-end to the library, the actual encoding algorithm remains the same. When one runs the program above, depending on the concrete options supplied, one gets a WebP image which is smaller than the input and has good quality. This is the most common outcome, anyway. The thing to understand about the WebP encoder is that even if the format would admit better compression than the JPEG input in 100% of the cases (we don’t know that), the encoder itself uses heuristics and shortcuts aimed at producing an output in a reasonable amount of time, and working well most times. It does both. But undesirable outcomes happen too in some cases. For example, the output image may look blurry, have more bytes than the input, or both.

An image showing different outcomes from running libwebp to do the encoding

The exact proportion for undesirable outcomes depends on the set of input images and the options provided to the encoder. It can be as low as 5%, but with some of our customers, it can be as high as 70%. This is particularly the case if the JPEG image we are using as input has already been optimized, or if there is lots of JPEG noise that the WebP encoder is trying to ship into its output. For the sake of argument, let’s say that undesirable outcomes happen 1% of the time. In a random page of one of our customers’ shops there are 790 images, so all it will take for anybody to notice an undesirable outcome is to open that page and scroll down. At that point, we cease being invisible and our game is up.

Worse, many of our prospective customers are aware of this issue with WebP from when they themselves tried to leverage the WebP encoder. When our sales guy approaches them, they immediately hand him the images that caused them to roll back WebP support in the past. If you can fix it for these, we will talk business, they say.

Adjusting the encoder options for each image

A fixed set of options for cwebp applied to all the images of a website will produce the worst results. We should know, we started there. What else can be done? In an ideal world, we would tweak the encoder itself to always produce good results and never produce undesirable outcomes. But let’s face it, if the maintainers of libwebp, backed by Google, didn’t achieve it, what chance would we have? And, for a fact, our investors don’t have the pockets to bankroll such an uncertain endeavor. The next thing we can do is to tweak the options we pass to the encoder, for each individual image. Never the ones to overcomplicate things, in our second iteration we used the following algorithm:

Input image: I, size in bytes of the input image: sI.
Start with q (the “quality” parameter passed to cwebp) equal to 1.0. This should produce more or less a lossless encoding of I.
Decrease q by 0.025
Encode the image and check the encoded sized sE. 
If sE < 0.9  sI (that is, if the result is less than 90% of the size of the input), stop, otherwise, go to step 3.

Okay, that’s a linear search, not a binary search, but at this point we were optimizing images for just two customers and the cost was not the most important issue. We rolled with this approach for several months, but this one too had its drawbacks. Sometimes, we would have to set q to 0.55 before we got under the target size. At this time, the quality of the output was already unacceptable. And oh boy, our customers noticed!

Initially, we just looked for remedial action. We needed a fuse that would tell us if the visual (“perceptual” is the term used in the specialized circles) quality of the image was too low:

Input image: I, size in bytes of the input image: sI.
Start with q (the “quality” parameter passed to cwebp) equal to 1.0. This should produce more or less a lossless encoding of I.
Decrease q by 0.025
Encode the image, (let’s call the encoded output “E”) and check the encoded sized sE. 
If sE < 0.9  sI (that is, if the result is less than 90% of the size of the input), go to step 6, otherwise, go to step 3.
Check the perceptual quality of the encoded image. If it is too low, bail out and say that the image can’t be converted to WebP

With step 6 we introduced a new moving cog, what we internally call “a quality metric”. Its full technical name is “perceptual quality metric”. The important thing to note is that with 6, we had a way to prevent our customers from poking us in the ribs with their steely knives because we were producing blurry images. We still had problems though. Step 6 would reject a lot of images. In many cases it was because those images were already well optimized by JPEG. In other cases it was because our quality metric was giving the same weight to a noisy background and to the product itself (see image below).

An image where the quality metric is giving the same weight to a noisy background and to the product itself

To ameliorate this problem we iterated on the quality metrics, and we currently have fourteen metrics.

Still, back in 2018, a significant fraction of our optimization operations bailed out by reaching step 6 above. We were refusing to replace a good JPEG with a worse WebP, on a case-by-case basis. This is a good way to reduce the total number of bytes in a web page, even by the computations that Google Lighthouse makes. However, Google Lighthouse suggests that new generation image formats should be used whenever it sees an old generation image format. The suggestion doesn’t weigh on the final score, but not all our customers understood that.

An image showing message of serving images in Next-Gen Formats

And finally, for those images that we did optimize, we were being too conservative, since we were targeting image size instead of perceptual quality.

Encoder meritocracy

Since we had introduced perceptual quality metrics, we could remake our encoding algorithm to take them into account from the start:

1. Input image: I, perceptual difference threshold between I and output E: H_min
2. Use binary search to find q (the quality parameter passed to cwebp) that produces an image with perceptual quality slightly above H_min

As you may suspect, our perceptual quality metrics are independent from the encoder. That’s a good thing, because now we can run the algorithm above for each image format/encoder:

We fix a perceptual quality measure, a threshold for the measure, and optimize in all the formats. Well, we don’t do JPEG2000 anymore because Safari supports WebP nowadays.

Now, you may ask, is it worth it? What’s the outcome in terms of byte sizes?

In the below table you can see an example dataset with 10 images from one of our customers. By picking the smallest format we beat the best performing format (WebP) with around 20%! But this is a small sample so let’s go into more details.

Image no	AVIF	JPEG	WEBP	Winner size
1	11729	2652	3474	2652
2	2027	2326	2754	2027
3	2611	2700	2992	2611
4	3948	2087	3272	2087
5	3096	2149	5294	2149
6	2182	2789	4986	2182
7	2713	1857	3486	1857
8	27069	39446	23364	23364
9	43183	23246	33736	23246
10	27761	43512	23548	23548
Total Size:	126319	122764	106906	85723

A word about perceptual quality metrics

There are many perceptual quality metrics out there. PSNR is a popular one, but it wasn’t good enough for our use cases. We also used and still support FSIM-C, which is quite good, but also expensive to compute and tune due to its need for Fourier transforms.

All the perceptual quality metrics we know about are, to a great extent, local: they operate in a relatively small set of neighboring pixels to obtain a value v, that is then aggregated over the entire image. As we said before, product images are usually bimodal; they often have uniform color backgrounds. Because compression formats are very good at reproducing big patches of uniform color, those areas tend to “boost” the resulting value of the metric, even if the important part is not so well reproduced. In many of our perceptual quality metrics we adjust for that, for example our “pearl” family of metrics completely skips uniform background areas.

These results below come from using such skips with a variation of SSIM. Our variation has its own weighting for gamma adjustments and it is multi-scale. It also includes some performance tuning. We created it guided by customer feedback, so that it produces the least amount of noticeable artifacts for the images we typically handle. For the results below, a different combination of metric and threshold will produce different results, but we expect the general pattern to persist.

The data

We got this data from ecommerce where we have already deployed WebP, AVIF and JPEG-XL.

Unfortunately, we can’t publish the images in our datasets because we don’t have the rights to do it. Furthermore, we have NDAs signed with most of our customers that forbid us from identifying them (but some consented being listed on our front-page, and see below for ways to reach out to us). However, we can publish some spreadsheets with numbers.

Dataset 1

This corresponds to a category page in one of our customers.

Generals:

Total JPEGs: 173 (the page also contains an abundance of PNGs, GIF and SVG images, but those don’t matter for the scope of this article)
Total size by using optimized MozJPEG: 5 739 198.

In this case we are also resizing (pixel-wise) the images, so the sum of the byte-size inputs can’t be compared to the sums of the optimized images.

Total size by using WebP: 5 451 736. So on average WebP images are smaller than optimized JPEG, by a little bit.
Total size by using AVIF: 2 616 427. On average, AVIF produces better compression than WebP (and JPEG)
Total size by using JPEG-XL: 2 185 430. The new kid in the block certainly does its job.
Total size by selecting the minimum size image: 1 738 779. That’s quite the jump, even from JPEG-XL. Note that there is a scale: the newest format generally compresses images better, but not always. As far as we know, this is to be expected.

Link to the dataset:

https://docs.google.com/spreadsheets/d/1HkkvErOiJf54-di_Zx6e2I1hymqvKUYFou79C8JAH3Q/edit?usp=sharing

Dataset 2

This corresponds to a category page in one of our customers.

Generals:

Total relevant images: 28 (the page also contains an abundance of PNGs, GIF and SVG images, but those don’t matter for the scope of this article)

In this case we are also resizing (pixel-wise) the images, so the sum of the byte-size inputs can’t be compared to the sums of the optimized images. Different from the previous dataset, this customer feeds us high quality input JPEGs.

Total size by using WebP: 1 989 786. Since this customer doesn’t let us serve optimized JPEGs, We don’t know how much of an improvement this would be over it.
Total size by using AVIF: 1 429 316. On average, AVIF produces better compression than WebP
Total size by using JPEG-XL: 527 676. That’s just amazing. Also the resulting images look very sharp.
Total size by selecting the minimum size image: 512 707. In this case, JPEG-XL images dominate the total.

Conclusions

New image formats are great! And the more image formats and encoders there are out there, the better the possibilities to reduce the byte payload of images in websites, and particularly in ecommerce websites.

Get in touch!

You can write to [email protected] for any questions you have. We would be delighted to set up a demo for your shop, or to answer any questions you have.

Share on Facebook Share on Twitter