Today, we'll "briefly" look at quality assessment of media (audio and video) and how we can use it to measure the quality of the media being transmitted to our users.
This blog draws on content from multiple ITU documents and related studies.
Quality assessment is important for several reasons:
- It lets us measure the quality of the content being provided to our users
- It lets us benchmark the quality of multiple products
- It serves as a guideline for comparing improvements or regressions in the product being developed
By developing good quality assessment metrics and systems, we can work on providing a better product to our users while verifying that it performs as intended.
Quality assessment aims to predict the quality of a given piece of media using a set of metrics and algorithms. The result is a MOS (Mean Opinion Score), a value on a 5-level scale that predicts how real human subjects would rate the media. There are many different types of MOS, each designed with different qualities in mind.
As shown in the image above, the original video passes through a distortion channel (such as WebRTC, a streaming service, or some other form of network transmission) and is then fed into the quality assessment system, which generates a quality score.
Some quality assessment methodologies use both the original signal and the distorted signal to produce a quality score.
Type of Quality Assessments
Because there are many different types of media, such as real-time streaming, conversational media, and video-on-demand services, different quality assessment methods are used.
Perceptual vs. Error sensitivity
Perceptual and error-sensitivity quality assessment models differ in how they predict the degradation in the given media.
A perceptual model uses properties of human audio-visual perception to predict the quality of the given media the way an actual human subject would observe it.
In contrast, an error-sensitivity model uses measured error signals in the media to "linearly" predict its quality. Examples of error-sensitivity metrics include signal-to-noise ratio, packet loss, and pixel distortion.
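To make the error-sensitivity idea concrete, here is a minimal sketch of two classic metrics of that family, MSE and PSNR, computed over flat pixel sequences. The function names and toy values are my own; real implementations operate on full frames.

```python
import math

def mse(reference, distorted):
    """Mean squared error between two equal-length pixel sequences."""
    return sum((r - d) ** 2 for r, d in zip(reference, distorted)) / len(reference)

def psnr(reference, distorted, peak=255.0):
    """Peak signal-to-noise ratio in dB; higher means less distortion."""
    err = mse(reference, distorted)
    if err == 0:
        return float("inf")  # identical signals
    return 10 * math.log10(peak ** 2 / err)

ref = [100, 120, 130, 140]
dist = [98, 121, 128, 143]
print(round(mse(ref, dist), 2))    # 4.5
print(round(psnr(ref, dist), 1))   # 41.6
```

Note that these metrics treat every pixel difference identically, regardless of where in the image it occurs; that is exactly the "linearity" the perceptual models try to move beyond.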
Because perceptual quality assessment models use human attributes to model the perceived quality of the media, they are often more accurate than error-sensitivity models.
In the image above, the error-sensitivity model reports the same quality level for all three rows of images, while the perceptual model reports quality levels in ascending order, which matches our actual perception of the images.
Subjective vs. Objective
Another distinction in quality assessment is between subjective and objective quality assessment.
Subjective quality assessment is an assessment of the given media performed by real people applying their own subjective judgment. Because everyone has different standards, subjective quality assessments can produce different outputs for the same input.
Objective quality assessment predicts the quality of the given media using a standardized model such as an algorithm or a machine learning model. Unlike subjective quality assessment, it produces the same output for the same input.
Subjective quality assessment is costly and time-consuming because it requires human effort, so it is usually used when building and validating a quality assessment model. Objective quality assessment is simpler and easier to run, so it is often used for iterative testing and in production environments.
Full reference vs. No reference vs. Reduced reference
Quality assessment is done using one of three methods:
- Full reference: Uses both the original signal and distorted signal to produce a quality score. Because it can directly compare the original and distorted signals, it can use more accurate and detailed models. However, due to privacy and performance issues, it usually isn't used in production environments.
- No reference: Uses only the distorted signal to produce a quality score. Because it has no original signal to compare against, it can only use simpler and potentially less accurate models. No-reference methodologies are still being developed and currently lack the accuracy to be readily used.
- Reduced reference: Uses the distorted signal and some parts of the original signal to produce a quality score. Because it only uses parts of the original signal, it can avoid the privacy and performance issues seen in full-reference models. For example, a model could extract only a few details about each frame from the original media and use them to improve the accuracy of the quality score. Like no-reference models, reduced-reference models are still maturing.
Most standardized models are full-reference models.
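The three methods above differ mainly in what inputs they receive. This sketch contrasts their signatures; the scoring logic inside each function is deliberately crude and hypothetical (mean-difference penalties and a variance proxy), chosen only to keep the example self-contained.

```python
def full_reference_score(original, distorted):
    # Full reference: both signals available, compared sample by sample.
    # (Hypothetical metric: 5 minus the mean absolute difference.)
    diff = sum(abs(o - d) for o, d in zip(original, distorted)) / len(original)
    return max(1.0, 5.0 - diff)

def no_reference_score(distorted):
    # No reference: only the distorted signal is available, so we must
    # rely on features of it alone (here, a crude variance-based proxy).
    mean = sum(distorted) / len(distorted)
    var = sum((x - mean) ** 2 for x in distorted) / len(distorted)
    return max(1.0, min(5.0, 1.0 + var))

def reduced_reference_score(distorted, original_features):
    # Reduced reference: receives only a few features extracted from the
    # original (e.g. its mean level), not the full original signal.
    mean = sum(distorted) / len(distorted)
    return max(1.0, 5.0 - abs(mean - original_features["mean"]))
```

The reduced-reference signature shows why this approach sidesteps the privacy issue: only a small feature dictionary, not the original media itself, ever leaves the sender.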
Mean Opinion Score
Mean Opinion Score is a standardized scale used to predict the quality of a piece of media. It ranges from 1 to 5, with 1 being the worst and 5 being the best. On the impairment scale, the levels are:

| Score | Impairment |
|---|---|
| 1 | Very annoying |
| 2 | Annoying |
| 3 | Slightly annoying |
| 4 | Perceptible, but not annoying |
| 5 | Imperceptible |

MOS has many variants depending on the type of media and the use case being assessed.
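As the name says, a MOS from a subjective test is simply the arithmetic mean of the individual opinion scores. A minimal sketch (the function name and sample ratings are my own):

```python
def mean_opinion_score(ratings):
    """MOS: the arithmetic mean of individual 1-5 opinion scores."""
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must be on the 1-5 scale")
    return sum(ratings) / len(ratings)

# five subjects rate the same clip:
print(mean_opinion_score([4, 5, 3, 4, 4]))  # 4.0
```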
Now that we looked at the basics of quality assessment and different types of it, let's look at Audio Quality Assessment in specific.
When conducting an audio quality assessment, the methodology varies depending on the type of audio being assessed.
- Speech audio: covers frequencies up to 8,000 Hz, the range in which speech normally takes place.
- General audio: covers the full audible range (up to about 20,000 Hz), typically sampled at 48,000 Hz.
How we perceive audio cannot be described on a simple linear scale. Frequencies also cannot be compared directly, since external factors like noise can distort the signals.
So, to model how people actually perceive audio, we must model the human auditory system as a psychoacoustic model. There are many ways to achieve this; FFT (Fast Fourier Transform) based and filter-bank based ear models are two examples.
The following are a few of the strategies that such models use:
- applying a different response to each frequency
- band-pass filtering and smoothing
- modeling reactions that happen inside an actual ear
There are many ways to model the human auditory system, and many other techniques are used to improve such models.
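The first of those strategies, frequency-dependent response, can be sketched in a few lines: transform a short signal to the frequency domain and weight each bin's energy differently. Everything here is a toy, the naive DFT stands in for a real FFT library, and the weighting function is made up, not an actual psychoacoustic curve.

```python
import cmath
import math

def dft_magnitudes(samples):
    """Naive DFT magnitude spectrum (fine for a short illustrative signal)."""
    n = len(samples)
    return [abs(sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))) / n
            for k in range(n // 2)]

def weighted_energy(samples, sample_rate, weight):
    """Sum spectral energy after applying a per-frequency weight, mimicking
    the ear's uneven sensitivity. weight() is a stand-in, not a real curve."""
    n = len(samples)
    mags = dft_magnitudes(samples)
    return sum(weight(k * sample_rate / n) * m ** 2
               for k, m in enumerate(mags))

# toy example: a 1 kHz tone sampled at 8 kHz, 64 samples
rate = 8000
tone = [math.sin(2 * math.pi * 1000 * t / rate) for t in range(64)]
# hypothetical weighting that de-emphasizes low frequencies:
energy = weighted_energy(tone, rate, lambda f: 0.1 if f < 500 else 1.0)
```

Real models replace the toy weighting with measured loudness curves and add the band-pass, smoothing, and inner-ear stages listed above.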
This topic is too broad to cover in a blog post. For an implementation of such methods, take a look at Google's ViSQOL.
Now, let's look at video quality assessment.
Similar to audio, video quality assessment is done either by modeling the human visual system or by checking for literal errors in the pixels.
While images used to be compared as a whole, current methodologies rely more on how humans perceive objects structurally. Instead of comparing every pixel, the more important parts of the image (such as the center of focus) are weighted more heavily.
- Perceptual: models the human visual system
- Error sensitivity: uses basic signals such as mean squared error or signal-to-noise ratio
The most significant difference between the two is that a perceptual model checks for relative errors, while an error-sensitivity model checks for absolute errors. Unlike absolute errors, relative errors take extra features such as the image's focus and the amount of light into account.
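The relative-versus-absolute distinction can be shown with a toy Weber-style comparison: the same pixel difference is more visible against a dark background than a bright one. The normalization below is illustrative, not a real visual-masking model.

```python
def absolute_error(ref, dist):
    """Error-sensitivity view: every pixel difference counts the same."""
    return sum(abs(r - d) for r, d in zip(ref, dist)) / len(ref)

def relative_error(ref, dist):
    """Perceptual (Weber-like) view: a difference matters in proportion to
    the local brightness it sits on, so errors in bright regions are
    discounted. The +1 avoids division by zero; the scaling is made up."""
    return sum(abs(r - d) / (r + 1) for r, d in zip(ref, dist)) / len(ref)

dark = [10, 10, 10, 10]
bright = [200, 200, 200, 200]
# the same +5 distortion applied everywhere:
print(absolute_error(dark, [15] * 4), absolute_error(bright, [205] * 4))   # 5.0 5.0
print(relative_error(dark, [15] * 4) > relative_error(bright, [205] * 4))  # True
```

The absolute metric cannot tell the two cases apart, while the relative one correctly flags the distortion in the dark region as more noticeable.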
As media goes through compression, decompression, and network transmission, a number of side effects may appear, such as jitter, timing issues, temporal effects, masking, and compression artifacts. A model must correctly account for these factors and for how they affect a human's ability to perceive the image.
A few examples of existing video quality assessment methodologies are PSNR, SSIM, and VMAF.
Unlike static images, different types of videos must be tested differently. Depending on the purpose of the video (gaming, conferencing, etc.), people focus on different parts of the picture, and the same factors have different effects on the perceived quality. For example, while a smooth frame rate is not necessarily important for a conference video, it is very important for gaming videos.
Additionally, because video is a continuous flow of images, its flow and naturalness must be tested as well. But due to this complexity, there is not yet a fully standardized methodology for objective video quality assessment.
Because end users listen to audio and watch video at the same time, we must also be able to predict the audiovisual quality of the media. An audiovisual quality score can be produced either by directly combining the audio and video scores or by measuring them together.
However, there is no standardized method for objectively predicting the audiovisual quality of a piece of media content. One of the difficulties is the need to correctly synchronize the audio and video.
Objective quality assessment methodologies need to be validated against corresponding subjective quality assessments, checking whether their results correspond with each other.
Subjective quality assessments can show biased results depending on the subject's mood, audiovisual abilities, preference for the media, surrounding environment, and so on. To overcome these biases, methods such as Single Stimulus Continuous Quality Evaluation and Double Stimulus Continuous Quality Scale are used.
- Single Stimulus Continuous Quality Evaluation: plays one stimulus (one piece of media) continuously and allows the subject to freely adjust the quality rating throughout the lifetime of the content. This yields more accurate data within each time interval.
- Double Stimulus Continuous Quality Scale: plays two stimuli (two pieces of media) in a mixed sequence. Presenting the original and the distorted version in a mixed sequence makes the subject forget which media was played previously, allowing a less biased opinion of both the original and the distorted signals.
Even with such efforts, the quality score may still vary depending on the participants and their viewing conditions. Nevertheless, these methods are valid efforts to improve the subjective data points.
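The double-stimulus idea of canceling each rater's personal scale can be sketched numerically: analyze per-subject differences between the reference and test ratings instead of the raw scores. The function name and sample ratings are my own.

```python
def dscqs_difference_scores(reference_ratings, test_ratings):
    """Per-subject (reference - test) differences. A harsh rater is harsh
    on both clips, so subtracting cancels that individual bias."""
    return [r - t for r, t in zip(reference_ratings, test_ratings)]

# two raters with very different personal scales perceive the same
# degradation (DSCQS typically uses a 0-100 continuous scale):
refs = [90, 60]
tests = [70, 40]
print(dscqs_difference_scores(refs, tests))  # [20, 20]
```

Both raters report a difference of 20 even though their absolute numbers disagree wildly, which is exactly the bias cancellation the double-stimulus design is after.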
This was a general look at media quality assessment. If you wish to learn more about it, you can look at ITU documents and studies about quality assessment.
- ITU-T P.800: Methods for subjective determination of transmission quality
- ITU-T P.805: Subjective evaluation of conversational quality
- ITU-T P.834: Methodology for the derivation of equipment impairment factors from instrumental models
- ITU-T P.835: Subjective test methodology for evaluating speech communication systems that include noise suppression algorithm
- ITU-T P.862: Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs
- ITU-T P.863: Perceptual objective listening quality prediction
- ITU-T P.913: Methods for the subjective assessment of video quality, audio quality and audiovisual quality of Internet video and distribution quality television in any environment
- ITU-T P.910: Subjective video quality assessment methods for multimedia applications
- ITU-T P.911: Subjective audiovisual quality assessment methods for multimedia applications
- ITU-T P.912: Subjective video quality assessment methods for recognition tasks
- ITU-T P.1204: Video quality assessment of streaming services over reliable transport for resolutions up to 4K
- ITU-R BS.1284-2: General methods for the subjective assessment of sound quality
- ITU-R BS.1387: Method for objective measurements of perceived audio quality
- Assessment of QoE for Video and Audio in WebRTC Applications Using Full-Reference Models
- Objective Video Quality Assessment Methods: A Classification, Review, and Performance Comparison
- Objective Video Quality Assessment
- Audio Quality Assessment Techniques – A Review, and Recent Developments
- ViSQOL v3: An Open Source Production Ready Objective Speech and Audio Metric
- Toward A Practical Perceptual Video Quality Metric
For more ITU-T Recommendations, please see https://www.itu.int/rec/T-REC-P/en