# Stop Thinking, Just Do!

Sung-Soo Kim's Blog

# Universe

We’re releasing Universe, a software platform for measuring and training an AI’s general intelligence across the world’s supply of games, websites and other applications.

A sample of Universe game environments played by human demonstrators.

Universe allows an AI agent to use a computer like a human does: by looking at screen pixels and operating a virtual keyboard and mouse. We must train AI systems on the full range of tasks we expect them to solve, and Universe lets us train a single agent on any task a human can complete with a computer.

In April, we launched Gym, a toolkit for developing and comparing reinforcement learning (RL) algorithms. With Universe, any program can be turned into a Gym environment. Universe works by automatically launching the program behind a VNC remote desktop — it doesn’t need special access to program internals, source code, or bot APIs.

Today’s release consists of a thousand environments including Flash games, browser tasks, and games like slither.io and GTA V. Hundreds of these are ready for reinforcement learning, and almost all can be freely run with the universe Python library as follows:

    import gym
import universe # register Universe environments into Gym

env = gym.make('flashgames.DuskDrive-v0') # any Universe environment ID here
observation_n = env.reset()

while True:
# agent which presses the Up arrow 60 times per second
action_n = [[('KeyEvent', 'ArrowUp', True)] for _ in observation_n]
observation_n, reward_n, done_n, info = env.step(action_n)
env.render()


{width=”100%”}

The sample code above will start your AI playing the Dusk Drive Flash game. Your AI will be given frames like the above 60 times per second. You’ll need to have Docker and universe installed.

Our goal is to develop a single AI agent that can flexibly apply its past experience on Universe environments to quickly master unfamiliar, difficult environments, which would be a major step towards general intelligence. There are many ways to help: giving us permission on your games, training agents across Universe tasks, (soon) integrating new games, or (soon) playing the games.

With support from EA, Microsoft Studios, Valve, Wolfram, and many others, we’ve already secured permission for Universe AI agents to freely access games and applications such as Portal, Fable Anniversary, World of Goo, RimWorld, Slime Rancher, Shovel Knight, SpaceChem, Wing Commander III, Command & Conquer: Red Alert 2, Syndicate, Magic Carpet, Mirror’s Edge, Sid Meier’s Alpha Centauri, and Wolfram Mathematica. We look forward to integrating these and many more.

{.universe-logos width=”749” height=”259”}

## Background

The area of artificial intelligence has seen rapid progress over the last few years. Computers can now see, hear, and translate languages with unprecedented accuracies. They are also learning to generate images, sound, and text. A reinforcement learning system, AlphaGo, defeated the world champion at Go. However, despite all of these advances, the systems we’re building still fall into the category of “Narrow AI” — they can achieve super-human performance in a specific domain, but lack the ability to do anything sensible outside of it. For instance, AlphaGo can easily defeat you at Go, but you can’t explain the rules of a different board game to it and expect it to play with you.

Systems with general problem solving ability — something akin to human common sense, allowing an agent to rapidly solve a new hard task — remain out of reach. One apparent challenge is that our agents don’t carry their experience along with them to new tasks. In a standard training regime, we initialize agents from scratch and let them twitch randomly through tens of millions of trials as they learn to repeat actions that happen to lead to rewarding outcomes. If we are to make progress towards generally intelligent agents, we must allow them to experience a wide repertoire of tasks so they can develop world knowledge and problem solving strategies that can be efficiently reused in a new task.

{width=”100%”}

The Atari 2600 game “Montezuma’s Revenge,” which is notoriously difficult to learn with reinforcement learning. A human player can immediately see that they control the person, that the skull is probably bad to touch, or that it is probably a good idea to collect the key. An AI agent starting from scratch and without any transfer from past experience is forced to discover the solution through a trial and error approach that may require millions of attempts.

## Universe Infrastructure

Universe exposes a wide range of environments through a common interface: the agent operates a remote desktop by observing pixels of a screen and producing keyboard and mouse commands. The environment exposes a VNC server and the universe library turns the agent into a VNC client.

Our design goal for universe was to support a single Python process driving 20 environments in parallel at 60 frames per second. Each screen buffer is 1024x768, so naively reading each frame from an external process would take 3GB/s of memory bandwidth. We wrote a batch-oriented VNC client in Go, which is loaded as a shared library in Python and incrementally updates a pair of buffers for each environment. After experimenting with many combinations of VNC servers, encodings, and undocumented protocol options, we now routinely drive dozens of environments at 60 frames per second with 100ms latency — almost all due to server-side encoding.

Here are some important properties of our current implementation:

General. An agent can use this interface (which was originally designed for humans) to interact with any existing computer program without requiring an emulator or access to the program’s internals. For instance, it can play any computer game, interact with a terminal, browse the web, design buildings in CAD software, operate a photo editing program, or edit a spreadsheet.

Familiar to humans. Since people are already well versed with the interface of pixels/keyboard/mouse, humans can easily operate any of our environments. We can use human performance as a meaningful baseline, and record human demonstrations by simply saving VNC traffic. We’ve found demonstrations to be extremely useful in initializing agents with sensible policies with behavioral cloning (i.e. use supervised learning to mimic what the human does), before switching to RL to optimize for the given reward function.

VNC as a standard. Many implementations of VNC are available online and some are packaged by default into the most common operating systems, including OSX. There are even VNC implementations in JavaScript, which allow humans to provide demonstrations without installing any new software — important for services like Amazon Mechanical Turk.

Easy to debug. We can observe our agent while it is training or being evaluated — we just attach a VNC client to the environment’s (shared) VNC desktop. We can also save the VNC traffic for future analysis.

We were all quite surprised that we could make VNC work so well. As we scale to larger games, there’s a decent chance we’ll start using additional backend technologies. But preliminary signs indicate we can push the existing implementation far: with the right settings, our client can coax GTA V to run at 20 frames per second over the public internet.

## Environments

We have already integrated a large number of environments into Universe, and view these as just the start. Each environment is packaged as a Docker image and hosts two servers that communicate with the outside world: the VNC server which sends pixels and receives keyboard/mouse commands, and a WebSocket server which sends the reward signal for reinforcement learning tasks (as well as any auxiliary information such as text, or diagnostics) and accepts control messages (such as the specific environment ID to run).

## Atari games

Universe includes the Atari 2600 games from the Arcade Learning Environment. These environments now run asynchronously inside the quay.io/openai/universe.gym-core Docker image and allow the agent to connect over the network, which means the agent must handle lag and low frame rates. Running over a local network in the cloud, we usually see 60 frames per second, observation lags of 20ms, and action lags of 10ms; over the public internet this drops to 20 frames per second, 80ms observation lags, and 30ms action lags.

Human demonstrators playing Atari games over VNC.

## Flash games

We turned to Flash games as a starting point for scaling Universe — they are pervasive on the Internet, generally feature richer graphics than Atari, but are still individually simple. We’ve sifted through over 30,000 so far, and estimate there’s an order of magnitude more.

Our initial Universe release includes 1,000 Flash games (100 with reward functions), which we distribute in the quay.io/openai/universe.flashgames Docker image with consent from the rightsholders. This image starts a TigerVNC server and boots a Python control server, which uses Selenium to open a Chrome browser to an in-container page with the desired game, and automatically clicks through any menus needed to start the game.

Human demonstrators playing Flash games games over VNC.

Extracting rewards. While environments without reward functions can be used for unsupervised learning or to generate human demonstrations, RL needs a reward function. Unlike with the Atari games, we can’t simply read out success criteria from the process memory, as there is too much variation in how each game stores this information. Fortunately, many games have an on-screen score which we can use as a reward function, as long as we can parse it. While off-the-shelf OCR such as Tesseract performs great on standard fonts with clean backgrounds, it struggles with the diverse fonts, moving backgrounds, flashy animations, or occluding objects common in many games. We developed a convolutional neural network-based OCR model that runs inside the Docker container’s Python controller, parses the score (from a screen buffer maintained via a VNC self-loop), and communicates it over the WebSocket channel to the agent.

Our score OCR model in action. A human integrator has provided the bounding box for the score. The OCR model parses the score at 60 frames per second.

## Browser tasks

Humanity has collectively built the Internet into an immense treasure trove of information, designed for visual consumption by humans. Universe includes browser-based environments which require AI agents to read, navigate, and use the web just like people — using pixels, keyboard, and mouse.

Today, our agents are mostly learning to interact with common user interface elements like buttons, lists and sliders, but in the future they could complete complex tasks, such as looking up things they don’t know on the internet, managing your email or calendar, completing Khan Academy lessons, or working on Amazon Mechanical Turk and CrowdFlower tasks.

Mini World of Bits. We first set out to create a new benchmark that captures the salient challenges of browser interactions in a simple setting. We call this benchmark Mini World of Bits. We think of it as an analogue to MNIST, and believe that mastering these environments provides valuable signal towards models and training techniques that will perform well on full websites and more complex tasks. Our initial Mini World of Bits benchmark consists of 80 environments that range from simple (e.g. click a specific button) to difficult (e.g. reply to a contact in a simulated email client).

Human demonstrators completing Mini World of Bits tasks.

Real-world browser tasks. We’ve begun work on more realistic browser tasks. The agent takes an instruction, and performs a sequence of actions on a website. One such environment hands the agent details of a desired flight booking, and then requires it to manipulate a user interface to search for the flight. (We use cached recordings of these sites to avoid spamming them, or booking lots of real flights.)

Human demonstrators completing a flight booking task on various websites given the instruction: “San Francisco to Boston, departing on 12/5 and returning on 12/10”.

## Future integrations

This infrastructure is general-purpose: we can integrate any game, website, or application which can run in a Docker container (most convenient) or a Windows virtual machine (less convenient). We’d like the community’s help to continue extending the breadth of Universe environments, including completing the integrations of our partners’ games, Android apps (emulators can run inside of Docker), fold.it, Unity games, HTML5 games, online educational games, and really anything else people think of.

Microsoft’s Project Malmo team will be integrating with Universe, and we look forward to supporting other AI frameworks as well.

Human demonstrators playing some games and applications from our partners.

## Running an environment

Despite the huge variety, running Universe environments requires minimal setup. You’ll need only to install Docker and universe:

$git clone https://github.com/openai/universe && pip install -e universe  We package each collection of similar environments into a “runtime”, which is a server exposing two ports: 5900 (used for the VNC protocol to exchange pixels/keyboard/mouse) and 15900 (used for a WebSocket control protocol). For example, the quay.io/openai/universe.flashgames Docker image is a runtime that can serve many different Flash game environments. Starting a runtime. You can boot your first runtime from the console as follows: # -p 5900:5900 and -p 15900:15900 expose the VNC and WebSocket ports # --privileged/--cap-add/--ipc=host needed to make Selenium work$ docker run --privileged --cap-add=SYS_ADMIN --ipc=host \
-p 5900:5900 -p 15900:15900 quay.io/openai/universe.flashgames


This will download and run the Flash games Docker container. You can view and control the remote desktop by connecting your own VNC viewer to port 5900, such as via TurboVNC or the browser-based VNC client served via the webserver on port 15900. The default password is openai. OSX also has a native VNC viewer which can be accessed by running open vnc://localhost:5900 in Terminal. (Unfortunately, the OSX viewer doesn’t implement Tight encoding, which is the best option for bigger games.)

Writing your own agent. You can write your own agent quite easily, using your favorite framework such as TensorFlow or Theano. (We’ve provided a starter TensorFlow agent.) At each time step, the agent’s observation includes a NumPy pixel array, and the agent must emit a list of VNC events (mouse/keyboard actions). For example, the following agent will activate Dusk Drive and press forward constantly:

    import gym
import universe # register Universe environments into Gym

env = gym.make('flashgames.DuskDrive-v0') # any Universe environment ID here
# If using docker-machine, replace "localhost" with your Docker IP
env.configure(remotes="vnc://localhost:5900+15900")
observation_n = env.reset()

while True:
# agent which presses the Up arrow 60 times per second
action_n = [[('KeyEvent', 'ArrowUp', True)] for _ in observation_n]
observation_n, reward_n, done_n, info = env.step(action_n)
env.render()


You can keep your own VNC connection open, and watch the agent play, or even use the keyboard and mouse alongside the agent in a human/agent co-op mode.

Environment management. Because environments run as server processes, they can run on remote machines, possibly within a cluster or even over the public internet. We’ve documented a few ways to manage remote runtimes. At OpenAI, we use an “allocator” HTTP service, which provisions runtimes across a Kubernetes cluster on demand, and which we can use to connect a single agent process to hundreds of simultaneous environments.

## Validating the Universe infrastructure

Universe agents must deal with real-world griminess that traditional RL agents are shielded from: agents must run in real-time and account for fluctuating action and observation lag. While the full complexity of Universe is designed to be out of reach of current techniques, we also have ensured it’s possible to make progress today.

Universe Pong. Our first goal was solving gym-core.PongDeterministic-v3. Pong is one of the easiest Atari games, but it had the potential to be intractable as a Universe task, since the agent has to learn to perform very precise maneuvers at 4x realtime (as the environment uses a standard frameskip of 4). We used this environment to validate that Universe’s variable latencies still allowed for learning precise and rapid reactions. Today’s release includes universe-starter-agent, which takes an hour to train to a Pong score of +17 out of 21. Humans playing the same version of Pong were only able reach a score of -11 on a scale between -21 and 21, due to the game’s high speed.

{width=”100%”}

Trained agent playing gym-core.PongDeterministic-v3. The video plays at real time.

Additional experiments. We applied RL to several racing Flash games, which worked after applying some standard tricks such as reward normalization. Some browser tasks where we tried RL had difficult exploration problems, but were solvable with behavioral cloning from human demonstration data.

Some of our successful agents are shown below. While solving Universe will require an agent far outside the reach of current techniques, these videos show that many interesting Universe environments can be fruitfully approached with today’s algorithms.

Three trained agents playing several racing Flash games. Each of these agents uses the same code and hyperparameters.

An agent that was trained with behavior cloning on approximately 2 hours of human demonstrations (left), and two agents that were trained with reinforcement learning (middle, right) to recognize and click specific buttons and to play Tic Tac Toe. Each of these agents uses the same code and hyperparameters as the Flash game agents.

Remote training. We’ve also experimented with slither.io agents on our physical infrastructure (which has access to Titan X GPUs) and environments in the cloud. Generally, the agent will control 32 simultaneous environments at 5 frames per second — observations are available much more frequently, but lower framerates help today’s RL algorithms, which struggle with dependencies over many timesteps.

{width=”100%”}

Our typical setup for training an agent over the public internet, with 32 environments running in the cloud provisioned by our allocator instance.

Our agent’s “reaction time” averages around 150ms over the public internet: 110ms for an observation to arrive, 10ms to compute the action, and 30ms for the action to take effect. (For comparison, human reaction time averages around 250ms.) Reaction times drop to 80ms over a local network, and 40ms within a single machine.

An agent playing slither.io (left), using just screen pixels and mouse. The goal is to eat the fruit and avoid hitting other snakes. We’ve used JavaScript to simplify the colors. We also show the activations of each layer of the neural network’s which controls the snake (right).

## Looking forward

Research progress requires meaningful performance measurement. In upcoming weeks, we’ll release a transfer learning benchmark, allowing researchers to determine if they are making progress on general problem solving ability.

Universe draws inspiration from the history of the ImageNet dataset in the Computer Vision community. Fei-Fei Li and her collaborators deliberately designed the ImageNet benchmark to be nearly impossible, but error rates have dropped rapidly from 28% in 2010 to 3% in 2016, which reaches (or in some cases even surpasses) human-level performance.

If the AI community does the same with Universe, then we will have made real progress towards systems with broad, general intelligence.

## Help us improve Universe

Universe will only succeed with the community’s help. There are many ways to contribute (and one particularly great way is to join us):

## Grant us permission to use your game, program, website, or app.

If your program would yield good training tasks for an AI then we’d love your permission to package it in Universe. Good candidates have an on-screen number (such as a game score) which can be parsed as a reward, or well-defined objectives, either natively or definable by the user.

## Train agents on Universe tasks.

AI advances require the entire community to collaborate, and we welcome the community’s help in training agents across these tasks. We’ve released a starter agent which should be a helpful starting point for building your own agents. In upcoming weeks, we’ll release the sub-benchmarks we think are the right places to start.

## Integrate new environments. coming soon

We have many more environments waiting to be integrated than we can handle on our own. In upcoming weeks, we’ll release our environment integration tools, so anyone can contribute new environment integrations. In the meanwhile, we’ll be running a beta for environment integrators.

## Contribute demonstrations. coming soon

We’re compiling a large dataset of human demonstrations on Universe environments, which will be released publicly. If you’d like to play games for the good of science, please sign up for our beta.

## Credits

• Acquisition & partnerships: Erin Pettigrew, Jack Clark, Jeff Arnold
• Core infrastructure: Greg Brockman, Catherine Olsson, Alex Ray
• Demonstrations: Tom Brown, Jeremy Schlatter, Marie La, Catherine Olsson
• Distributed training infrastructure: Vicki Cheung, Greg Brockman, Jonas Schneider
• Documentation & communications: Jack Clark, Andrej Karpathy, Catherine Olsson
• Environment integrations: Alec Radford, Jonathan Gray, Tom Brown, Greg Brockman, Alex Ray, Catherine Olsson, Trevor Blackwell, Tambet Matiisen, Craig Quiter
• Initial agent results: Rafal Jozefowicz, Dario Amodei, Ilya Sutskever, Jonathan Ho, Trevor Blackwell, Avital Oliver, Yaroslav Bulatov
• Remote environment management: Vicki Cheung, Greg Brockman, Catherine Olsson, Jie Tang
• RL baselines: Dario Amodei, Harri Edwards
• Website: Ludwig Petterson, Jie Tang, Tom Brown, Alec Radford, Jonas Schneider, Szymon Sidor
• World of Bits: Andrej Karpathy, Tianlin (Tim) Shi, Linxi (Jim) Fan, Jonathan Hernandez, Percy Liang

The following partners have been key to creating Universe: EA, Valve, Microsoft, NVIDIA, Kongregate, Newgrounds, Yacht Club Games, Zachtronics, Ludeon Studios, Monomi Park, 2D Boy, Adam Reagle, Alvin Team, Rockspro, Anubhav Sharma, Arkadium, Beast Games, Char Studio, Droqen, Percy Pea, deeperbeige, Denny Menato, Dig Your Own Grave, Free World Group, Gamesheep, Hamumu Software, Hemisphere Games, Icy Lime, Insane Hero, inRegular Games, JackSmack, Nocanwin, Joe Willmott, Johnny Two Shoes, The Gamest Studio, László Cziglédszky, Madalin Games, Martian Games, Mateusz Skutnik, Mikalay Radchuk, Neutronized, Nitrome, ooPixel, PacoGames, Pixelante, Plemsoft, Rob Donkin, robotJam, Rumble Sushi 3D, SFB Games, Simian Logic, Smiley Gamer, Sosker, tequibo, kometbomb, ThePodge, Vasco Freitas, Vitality Games, Wolve Games, Xform Games, XGen Studios

comments powered by Disqus