Enjoy the Go with AlphaGo

How AlphaGo Zero beat AlphaGo

Bryan Lee
7 min read · Nov 20, 2019

Go has been around for more than 2,500 years, with humans leading the pack in terms of skill. That was true until AlphaGo came along and laid down a new battlefield for humans vs. computers. Now, humans aren’t even in the picture as the world’s best players (and literally not in the picture). Here’s how AlphaGo took the world of Go by storm ⛈️.

(See no humans)

October 2015 marked the first time an AI beat a Go professional: the three-time European champion, Mr Fan Hui, was defeated 5–0 by the first version, named AlphaGo Fan. In March 2016, the legendary Lee Sedol, 18-time world title winner and 9-dan Go player, was defeated by AlphaGo 4–1. They called that version AlphaGo Lee, as it had defeated Lee Sedol, creative right 😉. In January 2017, DeepMind released a version called AlphaGo Master. Playing online, it won 60 straight games against top international players. By now it’s clear that Google’s AI company, DeepMind, the creator of the many versions of AlphaGo, loves the game of Go. But the team couldn’t stop there. They had to make a newer and better version called AlphaGo Zero. This new version beat AlphaGo Lee 100–0 and is, without debate, the best Go player in the entire world 😲.

The GoGo Gadgets of AlphaGo

Now, what makes AlphaGo so special and so good at the game of Go? Well, the gadgets behind AlphaGo are the different machine learning techniques used by Google DeepMind. This article will focus on AlphaGo Lee in comparison to AlphaGo Zero.

For starters, Go is a game of perfect information, meaning each player can see every move and has all the information that will ever be available in the game. So, theoretically, no matter how far you are into the game, it is possible to correctly determine who will win or lose (assuming both players play “perfectly” from that point onwards). This concept is hard to wrap your head around, but imagine you are in a half-marathon. You and your friend are running together, but from the start of the race, you are 2 steps in front of him. Imagine that you both run at the same pace throughout the whole race. Obviously, you will win because you are ahead, and if you knew you were both going to run at the same pace for the whole race, you could predict this outcome at any stage of the race. Hopefully, that helps you understand the concept of a perfect information game.
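
Go’s game tree is astronomically too large to search exhaustively, but a toy game makes the “you could compute the winner from any position” idea concrete. Here’s a minimal Python sketch for a tiny Nim-style game (nothing to do with AlphaGo’s actual method, just the perfect-information principle):

```python
# Toy perfect-information game: players alternately take 1 or 2 stones,
# and whoever takes the last stone wins. Because both players see
# everything, the winner from any position is exactly computable.
from functools import lru_cache

@lru_cache(maxsize=None)
def current_player_wins(stones: int) -> bool:
    if stones == 0:
        return False  # the previous player just took the last stone and won
    # You win if at least one move leaves your opponent in a losing position.
    return any(not current_player_wins(stones - take)
               for take in (1, 2) if take <= stones)

print(current_player_wins(7))  # True: 7 stones is a win for the player to move
print(current_player_wins(6))  # False: any multiple of 3 is a loss
```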

The AI engineers at Google understood this property of Go and built 2 different neural networks into the first versions of AlphaGo. The policy neural network would suggest the most sensible and most likely moves that could be played on a particular board, and the value neural network would estimate how advantageous a board position was for a player, or in other words, how likely they were to win based on the board.
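
To make this concrete, here’s a minimal sketch of what those two networks could look like in PyTorch. The layer sizes and the 17 input planes are illustrative assumptions; the real AlphaGo networks were much deeper convolutional networks with many more feature planes:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

BOARD = 19  # Go is played on a 19x19 grid

class PolicyNet(nn.Module):
    """The policy network: a probability for each of the 361 points."""
    def __init__(self, planes=17):
        super().__init__()
        self.conv1 = nn.Conv2d(planes, 64, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(64, 64, kernel_size=3, padding=1)
        self.head = nn.Linear(64 * BOARD * BOARD, BOARD * BOARD)

    def forward(self, board):               # board: (N, planes, 19, 19)
        x = F.relu(self.conv1(board))
        x = F.relu(self.conv2(x))
        logits = self.head(x.flatten(1))    # one score per board point
        return F.softmax(logits, dim=1)     # move probabilities

class ValueNet(nn.Module):
    """The value network: one score from -1 (losing) to +1 (winning)."""
    def __init__(self, planes=17):
        super().__init__()
        self.conv = nn.Conv2d(planes, 32, kernel_size=3, padding=1)
        self.fc = nn.Linear(32 * BOARD * BOARD, 1)

    def forward(self, board):
        x = F.relu(self.conv(board))
        return torch.tanh(self.fc(x.flatten(1)))  # estimated outcome
```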

This is where the learning came into play. Google trained these neural networks on millions of human-played games; using supervised learning, the AI was able to mimic a human player to a certain degree. Next came the reinforcement learning: the AI played against itself millions of times, giving it more practice and time to explore which moves are the best. Finally, the last little gadget AlphaGo has is the Monte Carlo Tree Search (MCTS) algorithm. Pair all of these gadgets together and you get an unstoppable Go machine that can defeat any human it plays.
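
A rough sketch of that supervised warm-up step might look like this. `human_games` is a hypothetical iterator over (board tensor, expert move) pairs from game records; everything else reuses the `PolicyNet` sketch above:

```python
import torch
import torch.nn.functional as F

# Teach the policy network to imitate human moves (supervised learning).
policy = PolicyNet()
opt = torch.optim.SGD(policy.parameters(), lr=0.01)

for board, expert_move in human_games:   # hypothetical data source
    probs = policy(board)                # (N, 361) move probabilities
    # Maximize the probability assigned to the move the human played.
    loss = F.nll_loss(torch.log(probs + 1e-8), expert_move)
    opt.zero_grad()
    loss.backward()
    opt.step()
```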

How it works

One thing we haven’t covered is how all of these pieces work together. Think of the policy neural network and the value neural network as your partners in crime, Poly and Val. Poly will tell you what your possible moves are and how likely a Go master would be to make each of them; the better moves get higher probabilities than the bad ones. This helps quite a bit, but your second partner in crime, Val, can tell you whether you are in a winning position or a losing one based on the pieces on the board. At this point you are a powerhouse of a player, but that’s not all: you still have the MCTS algorithm as well.
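
In code, consulting Poly and Val about a single position might look like this (continuing the sketch above, with an illustrative board encoding):

```python
import torch

# A hypothetical encoded position; real feature planes encode stones,
# turn, history, and so on.
board = torch.zeros(1, 17, 19, 19)
move_probs = PolicyNet()(board)    # Poly: a probability for all 361 points
win_estimate = ValueNet()(board)   # Val: roughly -1 (losing) to +1 (winning)
print(move_probs.shape, win_estimate.item())
```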

MCTS is similar to your brain in the sense that it lets you look a few moves into the future and “roll out” different scenarios based on your move this turn. When Poly, Val, and MCTS work together, MCTS rolls out different scenarios, with Poly pointing out the most promising moves in each scenario and Val telling you what your chances of winning are after each possible move in the rollout. Now you are unstoppable, which is exactly what the AlphaGo program is. Poly and Val start with only the basics from their supervised learning, but they slowly get better with reinforcement learning. Eventually, with practice and time, Poly and Val become Go masters, able to read each scenario in a matter of minutes.
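
Here’s a compact sketch of that select-expand-evaluate-backup loop. The `state.play`, `policy_fn`, and `value_fn` hooks are hypothetical stand-ins for the game rules and the two networks, and the exploration formula is a simplified version of the one AlphaGo actually uses:

```python
import math

class Node:
    def __init__(self, prior):
        self.prior = prior        # Poly's probability for this move
        self.visits = 0
        self.value_sum = 0.0      # running total of Val's estimates
        self.children = {}        # move -> Node

    def q(self):                  # average predicted outcome so far
        return self.value_sum / self.visits if self.visits else 0.0

def select_child(node, c_puct=1.5):
    """Balance 'has looked good so far' (Q) against 'Poly likes it
    but we have barely tried it' (the prior / visit-count term)."""
    total = sum(child.visits for child in node.children.values())
    def score(child):
        return child.q() + c_puct * child.prior * math.sqrt(total + 1) / (1 + child.visits)
    return max(node.children.items(), key=lambda item: score(item[1]))

def simulate(root, state, policy_fn, value_fn):
    """One MCTS simulation: walk down the tree, expand the leaf with
    Poly's move priors, score it with Val, and back the result up."""
    node, path = root, [root]
    while node.children:                      # 1. selection
        move, node = select_child(node)
        state = state.play(move)
        path.append(node)
    for move, prior in policy_fn(state):      # 2. expansion with Poly
        node.children[move] = Node(prior)
    value = value_fn(state)                   # 3. evaluation with Val
    for n in reversed(path):                  # 4. backpropagation
        n.visits += 1
        n.value_sum += value
        value = -value    # flip perspective at each ply up the tree
```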

Is Zero Better Than One?

If that’s how AlphaGo works, how on earth 🌎 did AlphaGo Zero beat AlphaGo 🤔? Well, there are 4 main differences between AlphaGo and its Zero counterpart.

  1. The only things input into the AI were the black and white stones and the rules of the game.
  2. AlphaGo Zero uses only reinforcement learning.
  3. AlphaGo Zero uses only 1 neural network.
  4. It uses a simpler tree search mechanism.

AlphaGo Zero has some very similar features to AlphaGo Lee, but its distinct differences are what make the new version so dominant. For starters, the newest version of AlphaGo was told nothing but the rules of Go and the fact that there were black and white stones. Without any other instructions, it just started playing, with the hope of figuring out new strategies and learning how to play.

One of the biggest differences between the two versions is that only reinforcement learning was used on AlphaGo Zero. Because there was no supervised learning, the AI had to be creative and come up with its own strategies and techniques. AlphaGo Zero played against itself millions of times, and with time and its own creativity, it managed to rediscover the techniques used by the masters of Go. But not long after, it discarded many of those techniques. The AI used its creativity and reinforcement learning to come up with completely new strategies that no human had discovered in the past 2,500 years 🤯. Now, the padawan has become the master. Instead of AlphaGo learning from us humans, eager Go players are reviewing hours of AlphaGo Zero games to try and learn these new techniques.

Also, AlphaGo Zero uses one single neural network instead of two. It still contains the policy network and the value network, but it combines them into one. So imagine Poly and Val as one superhuman: faster and more efficient, since they don’t have to communicate externally anymore and can do everything in their head. Furthermore, during training, the AlphaGo team had Poly and Val play out 1,600 search simulations for each board state. In doing so, the neural network could figure out which move in each board state had the highest chance of winning, making the AI stronger and more confident than ever. This process was repeated for every board state encountered, so that the best move could be chosen each time. It teaches the neural network which moves are strong and which aren’t, since it didn’t know originally, as it had no data to learn from.
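
Sketched in the same style as before, the combined network is just one shared trunk with two heads, and the training loss pushes Poly toward the search’s visit counts and Val toward the actual game result. The layer sizes are illustrative assumptions; the real AlphaGo Zero network is a much deeper residual network:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PolyVal(nn.Module):
    """One network, two heads: Poly and Val share the same 'brain'."""
    def __init__(self, planes=17, board=19):
        super().__init__()
        self.trunk = nn.Sequential(                  # shared layers
            nn.Conv2d(planes, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
        )
        self.policy_head = nn.Linear(64 * board * board, board * board)
        self.value_head = nn.Linear(64 * board * board, 1)

    def forward(self, x):
        h = self.trunk(x).flatten(1)
        return (F.softmax(self.policy_head(h), dim=1),  # Poly's answer
                torch.tanh(self.value_head(h)))         # Val's answer

def zero_loss(net, board, pi, z):
    """Per-position targets: `pi` is the visit-count distribution from
    the 1,600-simulation search, `z` is who actually won the game."""
    p, v = net(board)
    policy_loss = -(pi * torch.log(p + 1e-8)).sum(dim=1).mean()
    value_loss = F.mse_loss(v.squeeze(1), z)
    return policy_loss + value_loss
```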

Finally, AlphaGo Zero uses a better and more efficient search mechanism. Because 1,600 simulations were conducted for each board state, the MCTS algorithm becomes extremely smart and can help predict and choose the next best move. This allowed the MCTS to play out scenarios extremely quickly and consistently choose the most promising path. This is how AlphaGo Zero beat AlphaGo Lee 100–0. With its better search algorithm and its newly invented moves, AlphaGo Lee stood zero 😉 chance.
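
Once the simulations finish, picking the move is almost trivial; reusing the `Node` class from the MCTS sketch above, it could be as simple as:

```python
def best_move(root):
    """The most-visited child after the search is the move the tree
    trusts most; visit counts also become the training target `pi`."""
    return max(root.children.items(), key=lambda item: item[1].visits)[0]
```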

The search tree after testing 1,600 possible moves.

Thanks to the work DeepMind put into developing AlphaGo Zero, we have found that AI is no longer constrained by the limits of human knowledge. We know that AI can be creative and find new ways of improving just by playing against itself. This is a huge leap forward for AI: we now know how much power reinforcement learning has, and we can apply it to other complex problems that humans are facing.

Key Takeaways

  • Limits can always be pushed, even when it comes to playing a game.
  • Reinforcement learning is extremely powerful.
  • Even in a game humans have studied for thousands of years, AI can come up with solutions no one had found before.

For more information on AlphaGo and AlphaGo Zero, check out this link to the Google DeepMind website: https://deepmind.com/research/case-studies/alphago-the-story-so-far.
