Most of the software systems built so far are not intelligent. They exhibit static behavior as a result of a static implementation. No matter how complex the logic is, such a system cannot really be considered intelligent, since its behavior, however elaborate, is still static and limited. The system doesn’t learn.
Artificial intelligence can make an application shine and engage users at a level that hasn’t been seen before. The main problem is that it is hard to implement: it takes rare skills and months, if not years, of work just to build basic capabilities such as understanding user intent from free-form speech or text, distinguishing users by their voice, or estimating a user’s gender and age through the device camera.
Microsoft Cognitive Services is an umbrella term that Microsoft now uses for a set of artificially intelligent components it has been building over the past few years as part of what was previously called Project Oxford. These components are artificially intelligent in the sense that they were built using machine learning and deep learning techniques, were trained on huge data sets, and are continuously learning and getting better at what they do.
The goal of these services is to bridge the gap between traditional software development and artificial intelligence by making AI accessible. In the “as a service” world that we currently live in, Microsoft is offering AI as a service – and it’s pretty cheap too. The APIs are offered as separate resources under an Azure subscription. All of them have a free pricing tier (a limited number of calls per month, with the limit differing between APIs) as well as paid plans priced per number of calls.
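To give a concrete idea of what “AI as a service” looks like in practice, here is a minimal sketch in Python (using the requests library) of calling the Computer Vision API with an image URL. The subscription key and the region in the endpoint URL are placeholders for the values shown in the Azure portal for your own resource, and the v1.0 route reflects the API as it existed at the time of writing – treat the details as illustrative rather than definitive.

```python
import requests

# Placeholder values: copy the real key and region-specific endpoint
# from your Cognitive Services resource in the Azure portal.
SUBSCRIPTION_KEY = "your-computer-vision-key"
ENDPOINT = "https://westus.api.cognitive.microsoft.com/vision/v1.0/analyze"

def analyze_image(image_url):
    """Ask the Computer Vision API for tags and a description of an image."""
    response = requests.post(
        ENDPOINT,
        params={"visualFeatures": "Tags,Description"},
        headers={
            "Ocp-Apim-Subscription-Key": SUBSCRIPTION_KEY,
            "Content-Type": "application/json",
        },
        json={"url": image_url},
    )
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    result = analyze_image("https://example.com/some-photo.jpg")
    print([tag["name"] for tag in result.get("tags", [])])
```

The pattern is the same across the services: create the resource in Azure, grab its key, and send plain HTTPS requests with the key in the Ocp-Apim-Subscription-Key header.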
At the time of this writing there are 24 APIs under the “Cognitive Services” umbrella; here’s a short summary of some of the coolest:
Computer Vision API: Takes an image as input and returns JSON data about the image including automatic tags (e.g. “water”, “sport”, “swimming” etc.) or text found in an image (OCR)
Emotions API: Takes an image as input and returns JSON data containing, for each human face identified in the image, a score for each recognized emotion (e.g. "scores": {"anger": 9.075572e-13, "contempt": 7.048959e-9, "disgust": 1.02152783e-11, "fear": 1.778957e-14, "happiness": 0.9999999, "neutral": 1.31694478e-7, "sadness": 6.04054263e-12, "surprise": 3.92249462e-11}) – see the sketch after this list
Face API: Takes an image as input and returns JSON data about the faces identified in the image such as gender, age estimation, glasses and facial hair.
Speech API: Simple text to speech and speech to text.
Language Understanding Intelligent Service (LUIS): LUIS can understand user intent from free-form text. It lets you define custom models, each with an intent and associated parameters, then matches free-form text from the user against these models and returns the best match.
Text Analytics API: Does key phrase extraction and sentiment analysis on text. It takes a piece of text as input and returns JSON data including the key talking points and an overall sentiment score for the text.
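As a concrete example of the Emotions API response mentioned above, the sketch below sends an image URL to the recognize endpoint and picks the strongest emotion for each detected face. As before, the key and the region-specific endpoint are placeholders, and the v1.0 route and response shape reflect the API as documented at the time of writing.

```python
import requests

EMOTION_KEY = "your-emotion-api-key"  # placeholder
EMOTION_ENDPOINT = "https://westus.api.cognitive.microsoft.com/emotion/v1.0/recognize"

def recognize_emotions(image_url):
    """Return the list of detected faces with per-emotion scores for an image URL."""
    response = requests.post(
        EMOTION_ENDPOINT,
        headers={
            "Ocp-Apim-Subscription-Key": EMOTION_KEY,
            "Content-Type": "application/json",
        },
        json={"url": image_url},
    )
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    for face in recognize_emotions("https://example.com/group-photo.jpg"):
        scores = face["scores"]  # e.g. {"anger": ..., "happiness": ..., ...}
        dominant = max(scores, key=scores.get)
        print(f"Face at {face['faceRectangle']}: {dominant} ({scores[dominant]:.2f})")
```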
Now what can you actually do with these APIs? Well, imagination is the limit! Some of these APIs combine naturally in the same application to create a cohesive and outstanding user experience. Cortana, for example, the personal assistant built into Windows 10, is already using some of these services (the Speech API and LUIS).
Imagine an application that acts like your virtual friend. The application can take snapshots of you using the device camera and send them to the Computer Vision API and the Emotions API; with these two APIs it can understand what you’re doing and how you’re feeling. When it wants to talk to you about your emotions or about what you’re doing, it can generate text, send it to the Speech API and get it converted to speech in near real time. When you talk back, it can send the speech captured through the device microphone to the Speech API and get it converted to text. It can then use that text to understand your intent by sending it to LUIS, or run sentiment analysis to understand how you’re feeling, and then talk back again, and so on. Pretty cool, right? A rough sketch of how such a loop could be wired together is shown below.
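The following sketch only illustrates the overall flow. The capture_snapshot, speak, listen and get_luis_intent helpers are hypothetical stand-ins for the device camera, the Speech API (text to speech and speech to text) and a LUIS app, and analyze_image / recognize_emotions are the functions sketched earlier in this article.

```python
import time

# --- Hypothetical stand-ins for device and Speech API / LUIS plumbing. ---
# In a real app these would capture a camera frame, call the Speech API for
# text-to-speech / speech-to-text, and query a LUIS app respectively.
def capture_snapshot():
    return "https://example.com/latest-webcam-frame.jpg"

def speak(text):
    print("APP SAYS:", text)

def listen():
    return input("YOU SAY: ")

def get_luis_intent(text):
    return "SmallTalk"

def virtual_friend_loop(check_interval=30):
    """Illustrative loop for a 'virtual friend' app, wiring the APIs together."""
    while True:
        image_url = capture_snapshot()

        # analyze_image and recognize_emotions are the earlier sketches:
        # what is the user doing, and how do they feel?
        tags = [t["name"] for t in analyze_image(image_url).get("tags", [])]
        faces = recognize_emotions(image_url)

        if faces:
            scores = faces[0]["scores"]
            mood = max(scores, key=scores.get)
            doing = f" while {tags[0]}" if tags else ""
            speak(f"You seem to be feeling {mood}{doing}. Want to talk about it?")

            reply = listen()
            intent = get_luis_intent(reply)  # understand what the user wants
            speak(f"Okay, I understand you want: {intent}.")

        time.sleep(check_interval)  # check in again after a while
```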
Other possible uses of these APIs include automated customer support, such as dynamic replies generated from the key phrases and sentiment identified in a customer’s email, or serving dynamic content to the user based on the emotions detected while the user is using the application. A sketch of the email scenario follows.
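For the customer-support idea, a minimal Text Analytics sketch might look like the following. The key is a placeholder and the v2.0 sentiment and keyPhrases routes reflect the API as documented at the time of writing; the escalation threshold is an arbitrary example, not a recommendation.

```python
import requests

TEXT_KEY = "your-text-analytics-key"  # placeholder
TEXT_BASE = "https://westus.api.cognitive.microsoft.com/text/analytics/v2.0"

def _call(route, text):
    """Send a single English document to a Text Analytics route and return its result."""
    response = requests.post(
        f"{TEXT_BASE}/{route}",
        headers={"Ocp-Apim-Subscription-Key": TEXT_KEY},
        json={"documents": [{"id": "1", "language": "en", "text": text}]},
    )
    response.raise_for_status()
    return response.json()["documents"][0]

def triage_email(email_body):
    """Classify a customer email by sentiment and pull out its key phrases."""
    sentiment = _call("sentiment", email_body)["score"]  # 0.0 (negative) .. 1.0 (positive)
    phrases = _call("keyPhrases", email_body)["keyPhrases"]
    if sentiment < 0.3:
        return f"Escalate to a human agent (topics: {', '.join(phrases)})"
    return f"Send automatic reply about: {', '.join(phrases)}"

if __name__ == "__main__":
    print(triage_email("My order arrived two weeks late and the package was damaged."))
```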