Encoding Tech

Best Practices for Netflix’s VMAF Metric – Streaming Learning Center

Best Practices for Netflix’s VMAF Metric – Streaming Learning Center

Zhi Li, senior software program engineer at Netflix, lately co-wrote a paper entitled VMAF: The Journey Continues posted on the Netflix Know-how Weblog on Medium. After reviewing why VMAF was developed, the way it’s been adopted within the business, and a few current enhancements, the paper discusses some greatest practices for utilizing VMAF. Particularly, this final part will probably be worthwhile to anybody who makes use of this high quality metric.

As detailed within the weblog submit, VMAF combines human imaginative and prescient modeling with machine studying and was developed in collaboration with a number of college professors. In June 2016, Netflix open sourced the know-how and revealed it on Github.

Since 2016, VMAF has been built-in into a number of video high quality measurement instruments, together with merchandise from Moscow State College and Elecard, in addition to FFmpeg, although it’s solely been examined for Mac and Linux distributions, not Home windows. Netflix makes use of VMAF to guage codecs, to energy encoding selections all through their manufacturing pipeline, and for A/B experimentation.

Since launching VMAF, Netflix has optimized working velocity and launched a frame-skipping function that permits the metric to be computed on each certainly one of N frames. Because the article states, “that is the primary time that VMAF might be computed in actual time, even in 4K, albeit with a slight accuracy loss.” Netflix has additionally improved VMAF’s accuracy by enhancing the elementary metrics and machine studying mannequin, and by broadening the coaching set.

One deficit within the first VMAF launch was a one-size-fits-all strategy that assumed that each one viewers watched a 1080p show in a dwelling room-like surroundings. Since then, Netflix launched a telephone mannequin and extra lately a 4K mannequin. These fashions might allow organizations to make knowledgeable selections to save lots of bandwidth with out considerably impacting perceived high quality and QoE. For instance, cellular viewers understand much less of a distinction between 720p and 1080p movies, so in some instances, it’d make sense to restrict the adaptive group distributed to cell phones to 720p.

The determine above is the Outcome Plot from the Moscow State College Video High quality Measurement Device displaying the default (inexperienced) and telephone (pink) VMAF rating for a similar 720p video. For 1080p viewing in a front room, the default rating of 88.6 could be thought-about too low, suggesting that a 1080p file would enhance QoE. Nevertheless, the typical rating of 99.28 for the 720p file within the telephone mannequin suggests that a 1080p model of the file doubtless wouldn’t enhance perceived high quality on a cell phone. On this occasion, it’d make sense to take away the 1080p file from the manifest file retrieved by cell phones.

With this as background, let’s talk about probably the most vital greatest practices mentioned within the Netflix article.

VMAF Greatest Practices

First, the article discusses the right way to interpret VMAF scores, or extra particularly learn how to map them to predicted subjective scores from actual viewers. To elucidate, whereas we all know that a rating of 70 signifies greater high quality than a rating of 60 (larger is all the time higher), how would an precise viewer price that video with a rating of 70? The article explains:

Viewers voted the video high quality on the size of “dangerous,” “poor,” “truthful,” “good,” and “wonderful,” and roughly talking, “dangerous” is mapped to the VMAF scale 20 and “wonderful” to 100. Thus, a VMAF rating of 70 might be interpreted as a vote between “good” and “truthful” by a mean viewer underneath the 1080p and 3H situation. [Editor’s Note: 3H specifies a distance from the screen that is 3 times the height of the screen.]

Usually, compressionists in search of to make an encoding determination care extra about how an actual viewer will fee the video than a single goal quantity. Because of this, the power to tie VMAF scores to predicted subjective scores provides a variety of utility.

Subsequent, Netflix tackled the technical problem of how you can measure decrease than full-resolution movies. To elucidate, like most metrics, you’ll be able to solely calculate VMAF when the supply and encoded movies share the identical decision. So, when you’re computing the rating for a 480p video encoded from 1080p supply, do you have to scale the 480p video again to 1080p, or the 1080p supply to 480p? The fast reply is the previous; you need to scale the encoded video to 1080p and measure towards the unique supply. Most practitioners use FFmpeg for this operation.

In fact, FFmpeg helps a number of upscaling methods, together with bilinear, bicubic, Lanczos, and others. Right here, the article recommends utilizing bicubic upsampling in the event you don’t know the algorithm utilized by the precise show gadget. I’ve all the time used and advisable Lanczos as a result of in accordance with a white paper right herethat’s the method utilized by NVIDIA GPUs, which personal the lion’s share of the market. That stated, the scoring distinction between totally different upscaling methods is minimal, and as long as you’re constant, it in all probability doesn’t matter which method you employ.

Metrics for A/B Experimentation

Like most high quality metrics, VMAF produces a single rating per body, and most instruments that I’ve seen common the scores for the person frames right into a single general rating. Although Netflix notes that there are different averaging methods obtainable, the article states that “easy arithmetic imply (AM) is one of the simplest ways of averaging, in that it yields the very best correlation with subjective scores.” This can be a good verification of the method utilized by the MSU device, and I’m assuming most others.

That stated, the article additionally acknowledges that utilizing a single rating to characterize the standard of an extended file includes sure dangers, amongst them lacking high quality drops which may degrade QoE however not considerably scale back the general rating. To counteract this, the article recommends assessing combination high quality, begin play high quality, and variability, or “the typical VMAF over all the session, common VMAF over the primary N seconds, and the variety of occasions the VMAF worth drops under a sure threshold relative to the earlier values.”

In my very own follow, I discover monitoring high quality drops completely important. For instance, the Outcome Plot under exhibits the VMAF scores for a file encoded utilizing 2-pass VBR (pink) and 1-pass CBR (inexperienced). The general rating isn’t too totally different, because the VBR file scored 96.24 and the CBR 94.5, each nicely into the superb vary.

Nevertheless, the CBR file has a number of areas that drop properly under the 2-pass VBR file, suggesting a a lot decrease high quality of expertise than the composite rating suggests. Observe that the End result Plot additionally makes it easy to show the frames (unique, file 1, file 2) at any location on the graph, which simplifies subjectively verifying that the decrease scores correlate with visibly decrease high quality.

Past the Outcome Plot, the MSU software enables you to save “dangerous frames,” or extract N decrease high quality frames with their rating and body quantity, one other option to monitor high quality variability over the period of the file. It can save you dangerous frames when working this system from the command line, which is beneficial as a result of the Outcome Plot is just out there by way of the GUI.

The underside line is that a single VMAF rating, or the only rating from any high quality metric for that matter, supplies an incomplete image of the QoE that file will ship. It’s a must to discover the decrease high quality areas of the file, together with subjectively verifying that the visible high quality of the body is as poor because the rating suggests. You additionally should examine that the standard deficit is seen throughout real-time playback; in any other case, it might be irrelevant. For instance, on the acute proper of the Outcome Plot above is a drop in high quality for a single body, which most viewers wouldn’t truly understand.

The Journey Continues

Within the last part of the article, Netflix lists areas the place they hope to enhance VMAF sooner or later. Whereas some have been simple (including fashions that higher measure temporal perceptual results) some raised important questions on how one can use VMAF, which, thankfully, Netflix agreed to reply.

For instance, the article states, “VMAF doesn’t absolutely seize the advantages of perceptual optimization choices discovered in lots of codecs, though it’s shifting in the best course in comparison with PSNR.” This raises the query of whether or not these optimizations must be enabled or disabled for information measured with VMAF.

The weblog publish states that “VMAF doesn’t absolutely seize the advantages of perceptual optimization choices discovered in lots of codecs.”  Does this imply that you must flip them off (tune PSNR or SSIM) when encoding for evaluation with VMAF, or does that matter?

Netflix responded, “Since VMAF partially captures the good thing about perceptual optimization, and if on the finish of the day you may be encoding with these settings on, we nonetheless advocate turning them on.”

The article additionally states, “The VMAF mannequin works one of the best with movies of some seconds. It doesn’t seize long-term results akin to recency and primacy, in addition to rebuffering occasions.”

We requested Netflix to make clear, they usually responded, “The primary purpose we made that assertion is as a result of VMAF has been educated utilizing brief clips of some seconds within the subjective checks. We consider VMAF averaged over the complete video file nonetheless properly captures the first-order results of perceptual video high quality. There are second-order results, nevertheless, akin to recency (i.e. topic tends to weigh extra on video parts in the direction of the top) and primacy (i.e. topic tends to weigh extra on the start portion of a video), that aren’t captured by a short-term mannequin like VMAF, however these are much less essential.”

The place to Go from Right here

General, in each my writing and consulting follow, I’ve discovered VMAF a really helpful metric for assessing video high quality, notably for measuring the standard of various decision rungs on an encoding ladder, the place I’ve discovered PSNR insufficient. It’s nice that Netflix selected to open-source VMAF so the remainder of us might use it, and has continued to spend money on VMAF’s improvement.

Zhi Li spoke at Streaming Media West 2018 and I’ve embedded his speak under. Right here’s an outline.

VMAF (Video Multi-Evaluation Fusion) is a top quality metric that mixes human imaginative and prescient modeling with machine studying. It demonstrates a excessive correlation to human notion and provides a rating that’s constant throughout content material. VMAF was launched on Github in 2016 and has had appreciable updates since that point. This speak focuses on the newest VMAF enhancements and enrichment, similar to velocity optimization, correct fashions to foretell cellular and 4K TV viewing circumstances, and including a confidence interval to quantify the extent of confidence within the high quality prediction. As well as, we talk about VMAF use instances and take a look at the VMAF street map for the close to future.

About the author