Things that need to be improved in the Structural SIMilarity (SSIM) image quality index

For image quality assessment, the Structural SIMilarity (SSIM) index can be considered a breakthrough technology that changed both academia and industry. Its 10K+ citations and its adoption in many public image/video projects are evidence of that, especially in the video compression community. x264, one of the most popular H.264 video encoders, seems to have been optimizing for SSIM over the past several years.

The inventors did a good job of showcasing the power of SSIM and its impact. The original paper is very well written, and a Matlab implementation is also available here:

A brief introduction can be found on Wiki here:

Basically, as long as your business is related to images or video, SSIM is a much better optimization goal than the conventionally used PSNR/MSE. Despite its simplicity, which is also a major reason it became popular so quickly, SSIM is considered to capture fundamental characteristics of the human visual system (HVS).
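
To make the PSNR/MSE comparison concrete, here is a minimal sketch in Python with NumPy. It is not the reference implementation: `global_ssim` is a hypothetical helper that evaluates the SSIM formula once over the whole image, whereas the original method averages it over local windows. The constants `k1=0.01` and `k2=0.03` follow the original paper; the test image and the two distortions are synthetic and chosen only for illustration.

```python
import numpy as np

def mse(x, y):
    """Mean squared error between two images."""
    return float(np.mean((x.astype(np.float64) - y.astype(np.float64)) ** 2))

def psnr(x, y, data_range=255.0):
    """Peak signal-to-noise ratio in dB, derived directly from the MSE."""
    err = mse(x, y)
    return float("inf") if err == 0 else float(10.0 * np.log10(data_range ** 2 / err))

def global_ssim(x, y, data_range=255.0, k1=0.01, k2=0.03):
    """SSIM formula evaluated once over the whole image (simplifying assumption)."""
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    c1 = (k1 * data_range) ** 2
    c2 = (k2 * data_range) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov = np.mean((x - mu_x) * (y - mu_y))
    num = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return float(num / den)

rng = np.random.default_rng(0)
ref = rng.uniform(50, 200, size=(64, 64))   # synthetic reference, kept as floats

# Distortion 1: a constant brightness offset (structure untouched).
shifted = ref + 10.0

# Distortion 2: random noise rescaled so its MSE matches the offset's MSE.
noise = rng.normal(size=ref.shape)
noise *= 10.0 / np.sqrt(np.mean(noise ** 2))
noisy = ref + noise

for name, img in [("offset", shifted), ("noise", noisy)]:
    print(name, "MSE:", mse(ref, img), "PSNR:", psnr(ref, img),
          "SSIM:", global_ssim(ref, img))
```

Both distortions have the same MSE, and therefore the same PSNR, yet the SSIM score separates them: the constant offset keeps the structure intact and scores higher than the noise.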

However, after years of using and studying SSIM, the following improvements are suggested to make it a more practical image quality assessment (IQA) tool.

  1. SSIM outputs a number that typically ranges from 0 to 1 (in theory from -1 to 1, since the index turns negative when the two images are negatively correlated), which looks nice. But there are two problems:
    1. The physical meaning of an SSIM value is unclear, and it has no unit. If I have an image with SSIM=0.95, what does that mean? If another image has SSIM=0.96, how much of an improvement is that in terms of perceptual quality? Is it noticeable? On the other hand, if one image improves from 0.7 to 0.75 and another from 0.9 to 0.95, are the two improvements the same?
    2. It is difficult to use for image enhancement applications, such as contrast enhancement and equalization, because SSIM treats the reference as the ideal image and scores any deviation from it, including a deliberate enhancement, as a loss.
  2. If the same video is played on different devices, such as a TV, a cellphone, and a desktop monitor, the perceptual quality can be drastically different. One can expect more distortion to be perceived on a TV because of its big screen, while the same distortion is hardly noticeable on a cellphone. But SSIM gives only one number regardless of the viewing device.
  3. SSIM requires the distorted video and the reference to be perfectly aligned, both spatially and temporally. In practice, especially in transcoding, this requirement is too strict to satisfy, because the video resolution often needs to be scaled to meet bandwidth limitations. CW-SSIM was proposed to deal with slight spatial misalignment, but it seems quite computationally intensive.
  4. Despite the wide deployment of SSIM, its accuracy in predicting subjective scores still needs to be improved: it is still easy to find cases where SSIM fails, even in the LIVE IQA database.
  5. Specifically for video encoding, developers are eager to use SSIM as an optimization goal. However, the principle of SSIM, which takes the neighborhood of a pixel into consideration when measuring quality, somewhat contradicts the property that encoder developers ask for. They want a measure that can be broken down to individual pixels, with as little dependency on other elements as possible, to help the objective function converge during optimization and to enable parallel processing. This may be the hardest point to change.
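
Three of the points above (the negative range, the alignment requirement, and the neighborhood dependency) can be illustrated with a small windowed sketch in Python with NumPy. Everything here is an assumption made for illustration: `ssim_map` is a hypothetical helper that uses uniform 8x8 windows instead of the Gaussian weighting of the reference implementation, and the test image is a synthetic grating rather than a natural image.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def ssim_map(x, y, win=8, data_range=255.0, k1=0.01, k2=0.03):
    """Local SSIM for every win x win window (uniform weights, stride 1)."""
    c1 = (k1 * data_range) ** 2
    c2 = (k2 * data_range) ** 2
    wx = sliding_window_view(x.astype(np.float64), (win, win))
    wy = sliding_window_view(y.astype(np.float64), (win, win))
    mu_x = wx.mean(axis=(-2, -1))
    mu_y = wy.mean(axis=(-2, -1))
    var_x = wx.var(axis=(-2, -1))
    var_y = wy.var(axis=(-2, -1))
    cov = (wx * wy).mean(axis=(-2, -1)) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return num / den

# Synthetic reference: a vertical-stripe grating with a period of 8 pixels.
cols = np.arange(64)
ref = np.tile(127.5 + 100.0 * np.sin(2 * np.pi * cols / 8), (64, 1))

# Point 1: an inverted image is negatively correlated with the reference,
# so the score drops below zero.
inverted = 255.0 - ref
print("inverted:", ssim_map(ref, inverted).mean())

# Point 3: the same content shifted by a single pixel is penalized,
# although nothing was lost.
shifted = np.roll(ref, 1, axis=1)
print("shifted by 1 px:", ssim_map(ref, shifted).mean())

# Point 5: changing one pixel changes the score of every window covering it.
poked = ref.copy()
poked[32, 34] -= 100.0
changed = int(np.count_nonzero(~np.isclose(ssim_map(ref, ref), ssim_map(ref, poked))))
print("windows affected by one pixel:", changed)
```

With a stride of 1, an interior pixel belongs to `win * win` overlapping windows, so a per-pixel decision in an encoder cannot be evaluated independently of its neighbors under this kind of measure.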

All in all, SSIM opened the gate to a new world, and people want to make use of it either for new applications or for better performance of human-centered systems. With the rise of video monetization, analytics and quality-of-experience monitoring are expected to bloom in the near future.

