Artificial intelligence (AI), with machine learning (ML) and deep learning (DL) technologies, promises to transform every aspect of businesses, devices, and their usage. In some cases it enhances the current process, and in others it replaces it.
Video streaming devices (including TVs and STBs) vie to be the AI assistant in the home, and there are several touchpoints in regular streaming devices and video processing that are ripe for transformation with AI advancements.
The quality of experience is expected to increase with visually enhanced models and greater ease of use. Video streaming devices with these features will surely outperform those without.
This blog discusses different AI/ML models used in video streaming devices to enhance the TV user experience.
Super Resolution with Neural Networks
Super-resolution is a technique that upscales a given low-resolution image or video to a higher resolution. People naturally gravitate toward high-quality, visually appealing images and video.
Therefore, the super-resolution feature can be found today in many video streaming display devices, producing resolutions 2-4 times higher than the pixel count of the source content.
Neural-network-based scaling performs much better than traditional bilinear scaling, as can be seen in the example pictures below.
As the pictures scaled with NN models show, the details are preserved or even enhanced.
Among the neural networks used for scaling, there are two types of models –
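To make the comparison concrete, here is a minimal NumPy sketch of the traditional bilinear scaling that NN-based methods are measured against. The `bilinear_upscale` helper is illustrative, not from any particular library:

```python
import numpy as np

def bilinear_upscale(img, scale):
    """Upscale a 2-D grayscale image by `scale` using bilinear interpolation."""
    h, w = img.shape
    out_h, out_w = h * scale, w * scale
    # Map each output pixel back to a (fractional) source coordinate.
    ys = np.linspace(0, h - 1, out_h)
    xs = np.linspace(0, w - 1, out_w)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None]   # vertical blend weights
    wx = (xs - x0)[None, :]   # horizontal blend weights
    top = img[np.ix_(y0, x0)] * (1 - wx) + img[np.ix_(y0, x1)] * wx
    bot = img[np.ix_(y1, x0)] * (1 - wx) + img[np.ix_(y1, x1)] * wx
    return top * (1 - wy) + bot * wy

lowres = np.array([[0.0, 1.0],
                   [1.0, 0.0]])
hires = bilinear_upscale(lowres, 2)
```

Because every output pixel is just a weighted average of its four nearest source pixels, sharp edges inevitably smear — which is the softness the NN models above avoid.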
• Convolution/Deconvolution (Deconv) based
• Generative Adversarial Network (GAN) based.
Deconv-based models are simpler and generalize better, whereas the GAN-based ones seem to perform very well for the scaling function. A Deconv model has an initial section, similar to familiar networks such as ResNet, that reduces the spatial size of the image while increasing the number of features; the latter part is an expansion network that uses deconvolution operations to raise the resolution from those features. The residual connections in the network seem to help it generalize better.
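As an illustration of the expansion step, the sketch below implements a single transposed-convolution ("deconvolution") layer in plain NumPy: zero-insertion upsampling followed by a convolution. The bilinear kernel used here is only a stand-in for weights a real network would learn:

```python
import numpy as np

def deconv2d(x, kernel, stride=2):
    """Transposed convolution: insert zeros between input pixels,
    then convolve, increasing spatial resolution by `stride`."""
    h, w = x.shape
    kh, kw = kernel.shape
    # Zero-insertion upsampling: place inputs on a stride-spaced grid.
    up = np.zeros((h * stride, w * stride))
    up[::stride, ::stride] = x
    # Pad and correlate with the (normally learned) kernel.
    pad_h, pad_w = kh // 2, kw // 2
    padded = np.pad(up, ((pad_h, pad_h), (pad_w, pad_w)))
    out = np.zeros_like(up)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out

# With a bilinear kernel, the deconv reproduces bilinear upsampling;
# a trained network learns sharper, content-adaptive weights instead.
k = np.outer([0.5, 1.0, 0.5], [0.5, 1.0, 0.5])
x = np.arange(9, dtype=float).reshape(3, 3)
y = deconv2d(x, k)
```

A real expansion network stacks several such layers over many feature channels; the point here is only the mechanics of how resolution grows.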
Generative Adversarial Networks (GAN) consist of a system of two neural networks — the Generator and the Discriminator — dueling each other. Given a set of target samples, the Generator tries to produce samples that can fool the Discriminator into believing they are real. The Discriminator tries to resolve real (target) samples from fake (generated) samples.
Using this iterative training approach, we eventually end up with a Generator that is really good at generating samples similar to the target samples.
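The adversarial training loop can be sketched with a toy one-dimensional example: a linear generator tries to match samples from a Gaussian target while a logistic discriminator tries to tell real from fake. All parameters, sizes, and learning rates here are illustrative, not from any production model:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

# Generator: z -> a*z + b.  Discriminator: x -> sigmoid(w*x + c).
a, b = 1.0, 0.0          # generator parameters
w, c = 0.1, 0.0          # discriminator parameters
lr = 0.01

for step in range(2000):
    real = rng.normal(4.0, 0.5, size=64)   # target distribution
    z = rng.normal(size=64)
    fake = a * z + b

    # --- Discriminator update: push D(real) -> 1, D(fake) -> 0 ---
    d_real, d_fake = sigmoid(w * real + c), sigmoid(w * fake + c)
    grad_s_real = d_real - 1.0             # dL/ds on real samples
    grad_s_fake = d_fake                   # dL/ds on fake samples
    w -= lr * np.mean(grad_s_real * real + grad_s_fake * fake)
    c -= lr * np.mean(grad_s_real + grad_s_fake)

    # --- Generator update (non-saturating loss): push D(fake) -> 1 ---
    z = rng.normal(size=64)
    fake = a * z + b
    d_fake = sigmoid(w * fake + c)
    grad_x = (d_fake - 1.0) * w            # gradient through D
    a -= lr * np.mean(grad_x * z)
    b -= lr * np.mean(grad_x)

samples = a * rng.normal(size=256) + b
```

In image super-resolution the generator is the upscaling network and the "noise" input is the low-resolution frame, but the alternating update pattern is exactly this.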
InPainting with Neural Networks
There are several cases where part of an image is damaged, distorted, or obscured by a previous watermark. Neural networks can repair such blemishes quite well by inferring the missing parts of the image. As with super-resolution, this can be accomplished with Deconv networks and GANs.
However, Deconv networks seem to perform better owing to their better generalization. A single network can be trained to perform super-resolution and inpainting (and other noise reduction) together.
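For intuition about the task itself, here is a classical (non-neural) inpainting baseline: iteratively diffusing neighboring pixel values into the masked region. A trained Deconv network or GAN replaces this whole loop with a single forward pass; the function name and mask convention are my own:

```python
import numpy as np

def inpaint_diffusion(img, mask, iters=200):
    """Fill pixels where mask==True by repeatedly averaging their
    4-neighbours -- a crude stand-in for what a trained network
    does in one forward pass."""
    out = img.copy()
    out[mask] = out[~mask].mean()          # rough initial guess
    for _ in range(iters):
        up    = np.roll(out,  1, axis=0)
        down  = np.roll(out, -1, axis=0)
        left  = np.roll(out,  1, axis=1)
        right = np.roll(out, -1, axis=1)
        avg = (up + down + left + right) / 4.0
        out[mask] = avg[mask]              # only damaged pixels change
    return out

img = np.full((8, 8), 0.5)
img[2, 2] = np.nan                         # a "damaged" pixel
mask = np.isnan(img)
img[mask] = 0.0
fixed = inpaint_diffusion(img, mask)
```

Diffusion can only propagate smooth gradients, which is why it fails on textures and structure — the cases where the learned networks shine.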
Frame Rate Conversion
Interpolation and upscaling for animated graphics is hard: animation has sharper edges, and simple interpolation tends to blur them. The resulting soft edges are not visually pleasing at high resolution. Again, neural networks seem to outperform traditional methods in frame rate conversion (and up-sampling). The paper linked below shows how Variational Autoencoders, GANs, and deep convolution networks can be used to perform this function; the deep convolution networks seem to have performed better than the others and than traditional upsampling.
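The blur problem is easy to demonstrate: naive frame-rate doubling by blending adjacent frames turns a moving edge into two faint ghost edges instead of one sharp edge at the in-between position — exactly the artifact motion-aware neural interpolation avoids. A minimal sketch:

```python
import numpy as np

def blend_midframe(f0, f1):
    """Naive frame-rate doubling: the in-between frame is a 50/50 blend.
    Moving features appear twice at half brightness (ghosting)."""
    return 0.5 * f0 + 0.5 * f1

# A bright pixel moving two positions to the right between frames.
f0 = np.zeros((1, 8)); f0[0, 2] = 1.0
f1 = np.zeros((1, 8)); f1[0, 4] = 1.0
mid = blend_midframe(f0, f1)
```

A motion-compensated interpolator would instead place a single full-brightness pixel at the midpoint (column 3); blending leaves half-brightness ghosts at columns 2 and 4.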
Integrated Home Assistant in TV
AI-integrated video streaming devices in the home can recognize which family members are watching and keep track of each individual's preferences, viewing habits, and specific programming.
With several applications providing content, it takes quite some time to tune to specific programming, switching between applications and browsing through menus.
As per the latest Nielsen report, the average viewer takes 7 minutes to pick what to watch. Just one-third of them report browsing the menu of streaming services to find content to watch, with 21% saying they simply give up watching if they are not able to make up their minds.
A background application that knows the specific user can perform this task. Video viewing through TV-connected devices has increased by 8 minutes daily, largely due to the penetration of those devices and the decline of older, less nimble setups.
Seven in 10 homes now have an SVOD service and 72% use video streaming-capable TV devices. Also, the report asserts that customers are not in a comfortable groove when it comes to choosing what to watch.
A background application running on TV can perform the task of recommending the specific content as per individuals’ preferences and viewing habits.
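A minimal sketch of such a recommender, assuming a hypothetical per-genre watch-time vector for the user and genre tags for the catalog, could rank content by cosine similarity:

```python
import numpy as np

def recommend(user_history, catalog, top_k=2):
    """Rank catalog items by cosine similarity between the user's
    per-genre watch-time vector and each item's genre vector."""
    def cos(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
    scores = {title: cos(user_history, vec) for title, vec in catalog.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Genre axes: [drama, sports, kids] -- all data below is hypothetical.
history = np.array([5.0, 1.0, 0.0])        # this viewer mostly watches drama
catalog = {
    "courtroom-drama": np.array([1.0, 0.0, 0.0]),
    "football-recap":  np.array([0.0, 1.0, 0.0]),
    "cartoon-hour":    np.array([0.1, 0.0, 1.0]),
}
picks = recommend(history, catalog)
```

A production system would learn embeddings per user and per title rather than hand-built genre axes, but the ranking step looks much the same.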
Anonymized Person Detection
All of the above functions can be performed without intruding on users' privacy and without recording a specific user's video, face, etc.
Several new AI techniques are also available to anonymize the incoming video stream at the source and perform person recognition at the ball-and-stick (skeleton) level. The NN can then identify the user by the person's gait, without intruding on privacy.
Post-processing with Neural Networks
Post-processing can remove noise and improve PSNR degraded by encoder artifacts such as blocking, blurring, ringing, and quantization noise.
Encoder-introduced impairments are removed today with specific filters such as a deblocking filter. Using neural networks can improve noise-reduction performance, and all of these functions can be performed by one algorithm.
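As a concrete example of one such filter, here is a simple deblocking sketch in NumPy that low-passes the pixels on either side of each 8-pixel block boundary; a trained network would handle deblocking, deringing, and denoising together:

```python
import numpy as np

def deblock(img, block=8):
    """Classical deblocking: soften the two pixels straddling each
    block boundary with a fixed low-pass blend."""
    out = img.astype(float).copy()
    h, w = out.shape
    for x in range(block, w, block):        # vertical block edges
        l, r = out[:, x - 1].copy(), out[:, x].copy()
        out[:, x - 1] = 0.75 * l + 0.25 * r
        out[:, x]     = 0.25 * l + 0.75 * r
    for y in range(block, h, block):        # horizontal block edges
        t, b = out[y - 1, :].copy(), out[y, :].copy()
        out[y - 1, :] = 0.75 * t + 0.25 * b
        out[y, :]     = 0.25 * t + 0.75 * b
    return out

# Two flat 8-pixel blocks with a hard step at the boundary.
img = np.zeros((8, 16)); img[:, 8:] = 100.0
smoothed = deblock(img)
```

The fixed blend weights here are illustrative; real codec filters adapt their strength to the quantization level, and a learned filter adapts to content as well.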
TVs also offer post-processing options to perform MEMC (Motion Estimation and Motion Compensation), which, although it has staunch critics, can be enhanced with neural networks.
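The motion-estimation half of MEMC is classically done by block matching. The sketch below (exhaustive search with a sum-of-absolute-differences cost; names and sizes are illustrative) finds the displacement of a block between two frames:

```python
import numpy as np

def estimate_motion(prev_block, cur_frame, y, x, search=2):
    """Exhaustive block matching: find the displacement (dy, dx) that
    minimises the sum of absolute differences within a search window."""
    bh, bw = prev_block.shape
    best, best_dy, best_dx = np.inf, 0, 0
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            yy, xx = y + dy, x + dx
            if (yy < 0 or xx < 0 or
                    yy + bh > cur_frame.shape[0] or
                    xx + bw > cur_frame.shape[1]):
                continue                     # candidate falls off the frame
            sad = np.abs(cur_frame[yy:yy + bh, xx:xx + bw] - prev_block).sum()
            if sad < best:
                best, best_dy, best_dx = sad, dy, dx
    return best_dy, best_dx

# A 2x2 bright patch moves down 1 pixel and right 2 pixels.
prev = np.zeros((8, 8)); prev[2:4, 2:4] = 1.0
cur  = np.zeros((8, 8)); cur[3:5, 4:6] = 1.0
dy, dx = estimate_motion(prev[2:4, 2:4], cur, 2, 2)
```

The resulting vectors drive the compensation step that synthesizes in-between frames; learned approaches replace both the matching cost and the synthesis with networks.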
Automatic Scene Detection
There are several applications that can use information about the running image or scene. Video analytics that detect specific actions can be used to rate specific sections of the video. Similarly, audio can be analyzed for language rating (for kid-proofing).
Every frame can be classified to detect specific objects or locations, recognize characters' faces, and automatically flag nudity.
These insights can be used for auto-captions and regulatory warnings, such as smoking/drinking advisories.
Voice Commands in TV
Very similar to home AI assistants such as Amazon Alexa and Google Home, video streaming devices in the home are expected to respond to voice commands. Today, many remote controls are already voice-activated.
For a device with high sound output to also act as an input device is quite complex, since it must cancel its own output audio.
In addition, there will be a variable delay in the captured signal if the audio is enhanced with external speakers. Wake-word detection and Automatic Speech Recognition (ASR) are implemented using complex neural networks today, and for TVs the ambient noise must be compensated for the commands to be heard.
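One sub-problem here can be sketched simply: estimating the playback-to-microphone delay by cross-correlating the signal the TV played with the signal its microphone captured, so an echo canceller can subtract the delayed copy before ASR runs. The numbers below are purely illustrative:

```python
import numpy as np

def estimate_delay(played, captured):
    """Estimate (in samples) how much later the TV's own audio shows up
    in the microphone capture, via cross-correlation."""
    corr = np.correlate(captured, played, mode="full")
    # Peak position relative to zero lag gives the delay.
    return int(np.argmax(corr) - (len(played) - 1))

rng = np.random.default_rng(1)
played = rng.normal(size=500)              # what the TV sent to the speakers
delay = 37                                 # unknown in practice
captured = np.concatenate([np.zeros(delay), played])[:500]
captured = captured + 0.01 * rng.normal(size=500)  # mic noise
est = estimate_delay(played, captured)
```

Real echo cancellers also adapt a filter for room reverberation and speaker response; delay estimation is just the first, and simplest, piece.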
With proper neural network acceleration hardware in TVs/STBs, all or some of the above functions can be implemented for a better user experience than current post-processing provides.
In addition, this processing can allow chip designers to do away with today's expensive filters and post-processing hardware in favor of a unified hardware element that performs neural network processing.
One of the problems in AI algorithm development is dataset availability and annotation. For most of the AI problems mentioned in this blog, however, it is relatively easy to develop datasets, since raw, higher-resolution, error-free data is available.
Gyrus provides AI Models as a Service, working with customers to clean up, normalize, and enrich their data for data science. Gyrus develops custom models based on the cleaned customer data and available datasets/models.
Above all, Gyrus has worked with several customers on developing similar ML/AI algorithms for vision processing.