What’s a bot OS?

Apr 20, 2016

A few weeks ago I wrote down a few thoughts on what chat meant as an interface. It was gratifyingly popular, and led to a huge number of discussions with companies (both public and stealthy) and entrepreneurs about what it might mean. It also got me thinking more about what a bot-based future might look like.

I start with some examples of bots and bot platforms, to show just how diverse they are. Then I propose some criteria by which we might segment types of bot. And then I look at what functions a bot “platform” needs to have to manage many bots that might interact with a user.

First, some examples of bots

WeChat is a chat platform from Chinese tech titan Tencent. It has hundreds of millions of users, and people rely on it for banking, transportation, and more. Because of its popularity, many chat pundits say, “WeChat is years ahead of us because they have bots.” But if you bother to install and use it, those “bots” turn out to be micro-applications.

Most Westerners haven’t seen WeChat behind the scenes (get someone to send you a red packet of money, and you’ll see all sorts of features). But you don’t need to: Connie Chan disabused people of this misconception recently, offering a glimpse of what WeChat’s McDonald’s interface looks like:

Nope, that chat looks like your old-fashioned app. Sorry.

Bots are appearing in all kinds of chat-based platforms these days. Microsoft recently added bots to Skype. You add them just as you’d add contacts:

Invite a bot to your next call and confuse everyone!

And then you ask them to do something. Here’s me asking Project Murphy, “what if Yoda were three.” I think it did pretty well.

Disney’s still trying to sort out the copyright on this.

While Zuckerberg talked about his plans for M, Facebook’s personal agent, you can already do bot-like things inside Messenger. Here’s me playing chess with Ben Yoskovitz:

Twitter’s had bots for years, because there are zero barriers to entry. Here’s what the interactive Twitter chatbot @ultrahal told me when I asked about the current bot hype:

This bot makes as much sense as most tech pundits.

Why the bot hype?

The chatbot hype machine is going full force, and that means a lot of confusion and inflated valuations. But I think there are two big reasons chat is gaining traction: it bypasses the barriers to entry and “app exhaustion” that limit the growth of traditional apps; and it lets developers test and push code constantly.

Bots bypass garden walls

That barrier to entry is important. All of these host environments—Skype chat, Facebook Messenger, WeChat, WhatsApp, Slack, Twitter, and so on—have different capabilities. A Twitter bot can only interact within the constraints of a default tweet (text and URLs), but that also means the barrier to entry is tiny. Anyone can use the Twitter API to make a chatbot.

There are entire startups, like Magic and Helloshopper, predicated on using SMS to interact with assistants, in part because the barrier to entry for SMS is zero.

In fact, I’m willing to bet a big part of the current chat hype came from conversations like this:

VC: “Deploy in the app store!”
Founder: “Nobody’s downloading apps any more, and getting to the top of an app store means dealing with a seedy world of grey-hat marketers that make me want to wash my eyes with soap.”
VC: “Well, get to market somehow!”
Founder: “I can make something for chat. Chat is cool, right? And I can update the code myself because it’s centralized, and learn from every user interaction. Let’s do chat! Everyone has SMS so my total addressable market must be huge!”
VC: “Cool, we have a chat startup.”

Bots can be updated constantly

In the software industry, the old-fashioned approach of installing software on a desktop has largely been replaced by hosted Software-as-a-Service, paid for by the month or by the seat. Microsoft Office and the Adobe Suite are now sold as online services.

One of the reasons for this is that companies love recurring revenue—you don’t have to sell the software over and over again. But there’s another important aspect to SaaS that means vendors who’ve embraced it are winning: you can always update it. You can be constantly running experiments on users, testing new features. You’re publishing now. And now. And now.

Another swing of the pendulum

There’s a pendulum in computing: Mainframes centralized it; client-server computing pushed it to the edge; the web centralized it; the app pushed it to the edge; and now bots are centralizing it once again.

With these things in mind, how should we think about the diverse bot landscape?

A taxonomy of bots

Clearly, there are different types of bots, and a conversation about “bots” that doesn’t recognize this is too generalized to be useful:

If you want to display a rich UI like the one in the WeChat example, everyone in the chatroom needs to support it.
If one person’s client can’t, it has to fall back to some default mode.
I can only play chess with Ben because Messenger supports images.
If you’re speaking out loud while driving a car, you can’t read and tap a touchscreen.

Here’s a slightly more reasoned way to think about the bot biome:

What senses you use, what the UX is like, and how it instantiates.

If the chat interface is designed for typing and vision, then it can include rich micro-apps, assuming the host supports them (as WeChat does). But if the interface is designed for hands-free operation (as Jibo, Siri, and Amazon’s Echo are), then it can only use spoken words.

In WeChat, you launch a payment app by going to find it within the menu system.

Here’s how you launch an app that winds up in chat (like Lucky Money.) Doesn’t look very chatty.

In Skype, to use the Summarize bot, you send a command (“Summarize http://medium.com”) and wait for an answer:

That seems like an awfully limited summary.

New iPhone users will say, “Hey, Siri” to activate the phone’s chat agent. And in more advanced cases, bots will try to guess what you’re after and offer help (Slackbot does this when it thinks you’re trying to do something, or when you use a command prefaced by a /):

Chat platform becomes operating system

That means the chat host environment starts to look a lot like an operating system, launching bots, managing what they can do, killing them if they get out of hand, and allowing context switching between them. Which brings me to the point of all this: What does a Chat OS look like?

On a traditional mobile or desktop OS, you have plenty of things lying dormant: buttons and screens and drivers that you seldom use; copy and paste functions; adaptation layers; networking stacks. One of the jobs of the OS is to activate these things when needed.

Similarly, on a chat OS, you’ll have bots waiting for their cue. As we’ve seen, today this is pretty rudimentary—usually you have to tell the bot to do something, either by tapping a menu item or typing a specific string. But smarter agents are going to interrupt gracefully when they think they can help. So what is initially a summoned request (“get me some dinner”) will one day be an inferred one (“gee, I’m hungry, everyone.”)

The first thing the OS will have to figure out is which bot to activate. To do this, it’ll look at the request (is this for food, or transportation?) Then it’ll look at a blacklist for all the chatroom members (I have blocked the Donuts On Demand chatbot, for obvious reasons.)
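Here’s a minimal sketch of that first routing step, under some loud assumptions: the keyword table, the Bot and Participant classes, and the function names are all invented for illustration, and a real host would use a proper intent model rather than substring matching.

```python
from dataclasses import dataclass, field

# Hypothetical keyword table; a real host would use a trained intent model.
INTENT_KEYWORDS = {
    "food": ("hungry", "dinner", "lunch", "pizza"),
    "transportation": ("let's go to", "pick me up", "need a ride"),
}


@dataclass
class Bot:
    name: str
    intent: str  # the single kind of request this bot handles


@dataclass
class Participant:
    name: str
    blocked: set = field(default_factory=set)  # bot names this person has blocked


def detect_intent(message):
    """Return the first intent whose keyword appears in the message, else None."""
    # Normalize case and curly apostrophes so "Let’s" matches "let's".
    text = message.lower().replace("’", "'")
    for intent, keywords in INTENT_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return intent
    return None


def candidate_bots(message, bots, participants):
    """Bots matching the message's intent that nobody in the room has blocked."""
    intent = detect_intent(message)
    if intent is None:
        return []
    # One person's block removes the bot for the whole room.
    blocked = set().union(*(p.blocked for p in participants))
    return [b for b in bots if b.intent == intent and b.name not in blocked]
```

Treating a single participant’s block as binding on the whole room is a design choice on my part; a host could just as easily hide the bot only from the person who blocked it.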

Next, the OS will have to decide who gets the task. If I have said, “Domino’s, I’m hungry,” this is easy enough. But if it’s inferring that I want food, then the OS does something akin to choosing a browser or search engine on a desktop, following a heuristic such as:

If I have a favorite food bot (say a personal life coach — sidenote, life coach bots are going to be a killer app, IMHO) then it will get the request.
If I’m using an ad-backed model, bots will bid for the right to suggest some food (“how about Thai?”) Yes, I can hear many of you cringing.
In a social model there will be some negotiation between the participants in a group chat (“does everyone have Uber? No? What about Lyft?”) Nothing like peer pressure to encourage mass installs.
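The three rules above could be sketched roughly like this. Everything here is hypothetical: the Candidate class, the choose_bot function, and the idea that “negotiation” reduces to counting installs are my simplifications, not how any real platform works.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    bid: float = 0.0  # what the bot offers for the right to suggest (ad-backed model)


def choose_bot(candidates, favorite=None, ad_backed=False, installed_by=None):
    """Pick one bot name from candidates, or None if no rule applies.

    installed_by maps each participant to the set of bot names they have installed.
    """
    names = [c.name for c in candidates]
    # 1. An explicit favorite always wins.
    if favorite in names:
        return favorite
    # 2. In an ad-backed model, the highest bidder makes the suggestion.
    if ad_backed and candidates:
        return max(candidates, key=lambda c: c.bid).name
    # 3. In a social model, prefer the bot the most participants already have.
    if installed_by:
        counts = {n: sum(n in bots for bots in installed_by.values()) for n in names}
        best = max(names, key=counts.get, default=None)
        return best if best is not None and counts[best] > 0 else None
    return None
```

The ordering of the rules matters: a favorite trumps a bid, which is probably what users want and probably not what advertisers want.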

Now the OS “launches” the chosen bot, giving it permission to chime in actively within the conversation. It also gives the bot whatever data it has on participants and context to help it complete its task. The bot has a “goal” of acquiring the information it needs—such as knowing what you want for dinner, how much to pay the dog walker, whether to restart the server, which bank account you want a balance for, or whatever.

The OS has a role to play here too:

What data does the OS allow the bot to have? The bot might want to know things like the user’s location, the permissions of others in the chat room, everyone’s names for a reservation, or payment information. This is like a transient OAuth, federating permissions between the host and the bot for the purpose and duration of the interaction.
What format can the bot use? If it’s a chatroom that permits rich HTML5 micro-apps then use that. But there may be constraints: Perhaps not everyone has the most recent version; or maybe someone is participating while driving, using voice. So the bot may have to fall back to a less-engaging, less-efficient, lowest-common-denominator level of interaction for some or all users.
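To make those two questions concrete, here’s a rough sketch. It is not real OAuth: the Grant class, issue_grant, pick_format, and the three format names are all made up for illustration, and I’m assuming every client can at least handle plain text (a voice client would read it aloud).

```python
import time
from dataclasses import dataclass

# Richest first. Illustrative names only, not any platform's real capability list.
FORMATS = ["micro_app", "image", "plain_text"]


@dataclass
class Grant:
    bot: str
    data: dict          # only the fields the host agreed to share
    expires_at: float   # seconds since the epoch; the grant is void afterwards

    def is_valid(self, now=None):
        return (time.time() if now is None else now) < self.expires_at


def issue_grant(bot, requested, allowed, context, ttl_seconds=600, now=None):
    """Share only fields that the bot requested AND the user allowed."""
    now = time.time() if now is None else now
    shared = {k: context[k] for k in requested if k in allowed and k in context}
    return Grant(bot=bot, data=shared, expires_at=now + ttl_seconds)


def pick_format(client_formats):
    """Richest format every client supports; plain text if nothing else is shared."""
    if not client_formats:
        return "plain_text"
    common = set.intersection(*(set(f) for f in client_formats))
    for fmt in FORMATS:
        if fmt in common:
            return fmt
    return "plain_text"
```

This version degrades the whole room to the lowest common denominator. Serving a rich interface to some participants and plain text to others would need per-client rendering, which I’ve left out.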

At this point, the bot is launched, and will then work toward acquiring all the information needed to complete its task.

A concrete example

Here’s a hypothetical use case of this in action.

I add three transportation bots to my host OS: Uber, Lyft, and a fictional one called Rideshare. When I do so, I grant these bots permission to offer to help with transportation. They’re then registered with the host OS (the chat platform), which will notify them of relevant messages to allow them to infer what’s going on and interrupt when they think it’s useful to do so. To do their jobs, the various ridesharing bots need to know the pickup location, the number of passengers, and who will be paying. It might also be nice to know the destination.
A few days later, I’m at work, chatting with two other people, all at the same location, and someone says, “let’s go to the party at Mike’s.”
The OS recognizes “let’s go to” as a transportation construct.
The OS notifies the registered transportation bots—Rideshare, Lyft, and Uber, in my case—and manages some kind of bidding process for the “best” bot, based on factors like price, past use, distance to be travelled, climate (bike, walk, or car) and so on. This is the equivalent of paid ads in search results for chat, and I would bet good money on it being a competitive part of the bot ecosystem, with affiliate payments for services subsidizing personal “agents.”
The selected bot looks at what information it already has: It knows my location from location services, shared by the host OS; and the number of passengers it can assume from the people in the chat thread.
The selected bot also looks at what else it needs to know. This may require disambiguation: it knows that everyone in the chatroom has a shared contact named Mike, but also that there is a bar called Mike’s that several people in the group have been to before.
The bot might use a more advanced visual interface if all users can support it, to confirm the information. We haven’t really explored multi-user social interfaces like this yet—Rideshare might show a map and let everyone touch where they want to go, and hold some kind of voting mini-game like “where should we go next?”
If it must use plaintext, then it will start a conversation to acquire the information it needs or to disambiguate and confirm things. This is where conversational nuance comes in, with the bot chiming in: “Hey, everyone, this is Alistair’s Rideshare bot. When you say Mike’s did you mean Mike Smith, or the bar Mike’s on Main Street?”
Once it has all of the information needed, it will take action, possibly with a confirmation step and a payment step.
After the transaction takes place, it may provide additional information (“The ride is here!”; “The driver wants to know where you are!”; and even “Alistair’s Rideshare rating has now dropped by one star.”)
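The information-gathering part of the steps above is essentially slot filling. Here’s a hedged sketch of how the fictional Rideshare bot might do it; the slot names, plan_ride, and resolve_destination are all hypothetical, and the list of known places would come from whatever the host chose to share.

```python
REQUIRED_SLOTS = ("pickup", "passengers", "payer", "destination")


def resolve_destination(mention, known_places):
    """Return (destination, question); exactly one of the two is None."""
    matches = known_places.get(mention, [])
    if len(matches) == 1:
        return matches[0], None
    if len(matches) > 1:
        return None, f"When you say {mention}, did you mean {' or '.join(matches)}?"
    return None, f"Where is {mention}?"


def plan_ride(context, mention, known_places):
    """Fill slots from shared context, then list what still has to be asked."""
    # Start with whatever the host OS already shared (location, head count, ...).
    slots = {s: context[s] for s in REQUIRED_SLOTS if s in context}
    questions = []
    if "destination" not in slots:
        destination, question = resolve_destination(mention, known_places)
        if destination is not None:
            slots["destination"] = destination
        else:
            questions.append(question)
    # Anything else still missing becomes a follow-up question in the chat.
    for slot in REQUIRED_SLOTS:
        if slot not in slots and slot != "destination":
            questions.append(f"I still need to know: {slot}")
    return slots, questions
```

When the list of questions comes back empty, the bot has everything it needs and can move on to confirmation and payment.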

The emergence of a chat OS

Today, the bot world is the Wild West. The Facebook chess bot looks like an old MS-DOS game, complete with text commands. There isn’t even the equivalent of top-level domains and DNS for bots; instead, we’ve got dozens of directories reminiscent of early-day Yahoo directories.

Every chat has a rudimentary API by definition—chatting. That’s why developers have been able to create interactive Twitter accounts since the service launched. For more sophisticated interactions, bots need an API, and platforms like Slack, Echo, and Facebook are racing to define them.

There are a ton of chat OS “functions” needed to make this more sophisticated (from metadata, to consensus, to degrading to the lowest common denominator, to shared bans) that aren’t a function of the individual bot, but of the OS.

There are also services like payment that require additional confirmation and security. And of course, advertising models (like the one imagined above, where each ridesharing vendor decides what kind of offer it’s willing to make to get the ride.)

Finally, there needs to be standardization. In every industry, things start out proprietary; then someone builds a gateway or universal abstraction atop them (see: JavaScript frameworks, Trillian); then the industry standardizes around a few core protocols (see: TCP/IP, Sabre). This will happen in micro-app interfaces, chat commands, and other conventions, just as @name and hashtags became standards in modern social networks.

I fully expect M, Alexa, Siri, Cortana, Slack, and other bot hosts to evolve in this way, with the usual standards wars, posturing, and erection of walled gardens. Am I being facetious by referring to these chat platforms as operating systems? Absolutely. Will the next few years of chat look like a new wave of OS wars? I’m willing to bet on it.