The Curious Problem of Image Formats

My colleague has been building a side project that caused a bit of a brain worm for me.

This project is a kind of image proxy.

You know the kind of thing; many Hive front-ends have probably had one at some point. Instead of linking to an image directly from a web page, you go through a service that both speeds up delivery (like a CDN) and can serve a thumbnail version and so on.

Now comes the challenge.

What do you do when you have to deliver a perfectly valid web-format image that your image library borks on?

That's what happened when he was passed an SVG file.

My first response was to convert the SVG to a bitmap and then carry on.

I even gave him this code:

import sys
from pathlib import Path

import cairosvg

# Convert an SVG file to a PNG of the requested size using cairosvg
# (install it with: pip install cairosvg).
# Arguments: path to the SVG, output width, output height.


def convert_svg_to_png(input_svg_path, output_png_path, width, height):
    """Render the SVG to a PNG; return True on success, False on failure."""
    try:
        cairosvg.svg2png(
            url=input_svg_path,
            write_to=output_png_path,
            output_width=width,
            output_height=height,
        )
        return True
    except Exception as e:
        print(f"Conversion failed: {e}")
        return False


if __name__ == "__main__":
    source = Path(sys.argv[1])
    convert_svg_to_png(
        input_svg_path=str(source),
        # with_suffix swaps only the final extension, unlike
        # str.replace('.svg', '.png'), which rewrites '.svg' anywhere in the path.
        output_png_path=str(source.with_suffix('.png')),
        width=int(sys.argv[2]),
        height=int(sys.argv[3]),
    )

But the problem is a little more complex than I thought.

First, you might get a well-formed filename/path, or you might not.

Consider example.com/thumbnail?1234

That could be completely valid as a web-hosted image, as browsers and web servers can pass the correct MIME type and headers for it to just work.

So my colleague should check the MIME type, right? Well ... he is only given a URL, and he can't count on the original host to do anything correctly, up to and including actually sending a byte stream.

Assumptions can be deadly, as we learned from Reacher.
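One assumption worth dropping early is that the response body will be a sensible size, or that there will be a body at all. Here is a minimal sketch of reading a binary stream with a hard cap. The names read_capped and MAX_IMAGE_BYTES are mine, and the 10 MB limit is an arbitrary placeholder, not something my colleague's project uses.

```python
import io

MAX_IMAGE_BYTES = 10 * 1024 * 1024  # hypothetical cap; tune for your proxy


def read_capped(stream, max_bytes=MAX_IMAGE_BYTES, chunk_size=64 * 1024):
    """Read a binary stream fully, refusing bodies that are empty or too large."""
    chunks = []
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        total += len(chunk)
        if total > max_bytes:
            raise ValueError(f"body exceeds {max_bytes} bytes")
        chunks.append(chunk)
    if total == 0:
        raise ValueError("empty body")
    return b"".join(chunks)


# Stand-in for a network response; any object with a read(n) method works.
data = read_capped(io.BytesIO(b"\x89PNG\r\n\x1a\n"))
```

In a real proxy the stream would probably be the response object returned by urllib.request.urlopen(url, timeout=...), which also has a read(n) method. Whatever the Content-Type header claims can then be treated as a hint rather than the truth.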

Another thing to factor into our thoughts is that SVGs can (can) deliver bad guy payloads, because the format allows <script> tags.

Might be a good idea to reject or at least sanitise certain formats.
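As a rough illustration of the "reject" option, the sketch below flags an SVG that contains a script element, a foreignObject element, an on* event attribute, or a javascript: link. The function name svg_looks_risky is made up for this post. It is a conservative heuristic built on the standard library's XML parser, not a complete sanitiser, and for hostile input you would likely want a hardened parser underneath it.

```python
import xml.etree.ElementTree as ET


def svg_looks_risky(svg_bytes):
    """Return True if the SVG has script-like content or cannot be parsed."""
    try:
        root = ET.fromstring(svg_bytes)
    except ET.ParseError:
        return True  # unparseable input is treated as unsafe
    for element in root.iter():
        # Tags look like '{http://www.w3.org/2000/svg}script'; drop the namespace.
        local_name = element.tag.rsplit('}', 1)[-1].lower()
        if local_name in ('script', 'foreignobject'):
            return True
        for attr_name, value in element.attrib.items():
            attr_local = attr_name.rsplit('}', 1)[-1].lower()
            if attr_local.startswith('on'):  # onload, onclick, and friends
                return True
            if attr_local == 'href' and value.strip().lower().startswith('javascript:'):
                return True
    return False
```

Anything flagged here could simply be refused, or rasterised to a bitmap so the script never reaches a browser.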

So if we are having to sniff image formats anyway, we might as well only allow certain file types through.

How do we do that?

You can detect image formats by reading the first few bytes of the file, known as the 'magic numbers':

def quick_detect_format(filepath):
    with open(filepath, 'rb') as f:
        magic = f.read(2)

    if magic == b'BM':
        return 'bmp'
    elif magic == b'\xFF\xD8':
        return 'jpeg'
    elif magic == b'\x89P':
        return 'png'
    elif magic == b'GI':
        return 'gif'
    elif magic == b'RI':
        # Could be WebP (needs deeper inspection)
        return 'riff'
    else:
        return 'unknown'

GIF: you usually need the first 6 bytes to confirm whether it’s "GIF87a" or "GIF89a".
PNG: ideally check the first 8 bytes (0x89, "PNG", 0x0D 0x0A 0x1A 0x0A).
JPEG: always starts with 0xFF 0xD8.
BMP: starts with 0x42 0x4D (ASCII "BM" for bitmap).
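Putting those notes together, here is a slightly stricter sketch that works on bytes already in memory (which is what a proxy has after fetching a URL) and then applies an allowlist. The names detect_format, ALLOWED_FORMATS and is_allowed are my own, the allowlist contents are just an example, and the SVG check is a loose heuristic because SVG is text and has no magic number.

```python
def detect_format(data):
    """Identify an image format from its leading bytes; 'unknown' otherwise."""
    if data.startswith(b'\x89PNG\r\n\x1a\n'):  # full 8-byte PNG signature
        return 'png'
    if data.startswith(b'\xff\xd8\xff'):  # SOI marker plus the start of the next marker
        return 'jpeg'
    if data.startswith((b'GIF87a', b'GIF89a')):
        return 'gif'
    # WebP is a RIFF container: 'RIFF', 4 size bytes, then 'WEBP'.
    if data[:4] == b'RIFF' and data[8:12] == b'WEBP':
        return 'webp'
    if data.startswith(b'BM'):
        return 'bmp'
    # Heuristic: skip a UTF-8 BOM and whitespace, then look for an <svg tag early on.
    head = data[:1024].lstrip(b'\xef\xbb\xbf \t\r\n').lower()
    if head.startswith(b'<') and b'<svg' in head:
        return 'svg'
    return 'unknown'


ALLOWED_FORMATS = {'png', 'jpeg', 'gif', 'webp'}  # hypothetical allowlist


def is_allowed(data):
    """True only when the sniffed format is on the allowlist."""
    return detect_format(data) in ALLOWED_FORMATS
```

With this shape, an SVG is still recognised, so it can be routed to a converter or rejected on purpose, rather than falling into the 'unknown' bucket by accident.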