<h1>VSCode is dead! Long Live Vim!</h1>
<p><a href="https://svbtleusercontent.com/vZTAZmwDHkYLByGmhSQ5Mz0xspap.gif" rel="nofollow"><img src="https://svbtleusercontent.com/vZTAZmwDHkYLByGmhSQ5Mz0xspap_small.gif" alt="VS.gif"></a></p>
<p><strong>Well, really Long Live NeoVim, but that just isn’t as catchy is it? For more about topics like this or to read the original article, click <a href="https://jfpettit.github.io/posts/vscode-is-dead-long-live-vim/" rel="nofollow">here</a>.</strong></p>
<p>A quick disclaimer to the emacs folks: I’m not trying to pick a fight here. I just like vim and never really got into using emacs, so I’m writing from my perspective.</p>
<h1 id="some-context_1">Some context <a class="head_anchor" href="#some-context_1" rel="nofollow">#</a>
</h1>
<p>I’ve spent the last year plus using VSCode and generally had a great experience. I started off with default keybindings, until a coworker told me I should try vim keybindings.<br>
After spending a little time learning to use the vim keys, I was editing code quicker and my mind was blown. 🤯</p>
<p>This lasted until I started working on bigger projects and VSCode got kind of slow. I just suffered and kept using it until I saw a <a href="https://youtu.be/gnupOrSEikQ" rel="nofollow">YouTube video</a> talking about how to configure vim like VSCode. I got inspired and had to try it for myself. A few hours of initial setup and a couple of days’ worth of fine-tuning later, I had a setup that felt perfect. Now, I can’t imagine going back anytime soon, and I like using my vim setup so much that I decided I had to write this blog post about it!</p>
<h1 id="the-long-long-history-of-vim_1">The long, long history of vim <a class="head_anchor" href="#the-long-long-history-of-vim_1" rel="nofollow">#</a>
</h1>
<p>vim (Vi IMproved) was released in 1991 and started as a clone of the <a href="https://en.wikipedia.org/wiki/Vi" rel="nofollow">vi text editor</a>. Vi itself was originally written as the visual mode of a line-editor called <a href="https://en.wikipedia.org/wiki/Ex_(text_editor)" rel="nofollow">ex</a> in 1976. <br>
ex was written by <a href="https://en.wikipedia.org/wiki/Bill_Joy" rel="nofollow">Bill Joy</a> and Chuck Haley, and <br>
version 1.1 was part of the first edition of the Berkeley Software Distribution (BSD) Unix, released in March of 1978. <br>
With version 2 of ex, shipped as part of the Second BSD in May of 1979, the editor took on the name vi.</p>
<p>vim had another forerunner, called Stevie (ST Editor for Vi Enthusiasts), which was created by Tim Thompson in 1987. </p>
<p>Finally, in 1988, <a href="https://en.wikipedia.org/wiki/Bram_Moolenaar" rel="nofollow">Bram Moolenaar</a> started his work on vim. The first public release was in 1991.<br>
For years, folks continued to use vim, and then NeoVim was released in 2014.</p>
<p>So, vim is old and we shouldn’t care about it, right? Wrong.</p>
<h1 id="i-swear-vim-is-still-relevant_1">I swear, vim is still relevant <a class="head_anchor" href="#i-swear-vim-is-still-relevant_1" rel="nofollow">#</a>
</h1>
<p><a href="https://insights.stackoverflow.com/survey/2019#development-environments-and-tools" rel="nofollow">Stack Overflow’s 2019 Developer Survey</a> shows that vim is still in the top 5 editors used by developers!</p>
<p>You may be wondering why a text editor born 30 years ago (as of 2021) is still in use today. Hasn’t technology gotten better? And haven’t text editors gotten better alongside the rest of technology?</p>
<p>While technology in general has certainly gotten better, my answer for text editors is: Kind of, but not really. </p>
<p>I think what’s mostly happened is that text editors have gotten good GUIs (Graphical User Interfaces), and many people gravitate towards those. <br>
It makes sense; after all, it does feel more natural to use your mouse to navigate around a computer, especially when you’re starting out.</p>
<p>However, all that convenience does come at a price. It is vastly less efficient to rely on a mouse for pointing, selecting, scrolling, and navigating around files than it is to use the keyboard to do these things. <br>
Vim sidesteps these problems, since it relies on keybindings to do all of this. <br>
Plus, vim is <a href="https://blog.onebar.io/hjkl-or-how-to-feel-less-tired-after-a-day-of-coding-48f975ba4091" rel="nofollow">widely</a> <a href="https://news.ycombinator.com/item?id=7828717" rel="nofollow">considered</a> <a href="https://www.ibrahimirfan.com/why-you-should-learn-vim/" rel="nofollow">one</a> of the most ergonomically efficient editors. <br>
When you take the time to really learn vim, you can move around files at what feels like the speed of thought. And your hands and wrists will thank you after spending hours coding every day.</p>
<h1 id="who-should-learn-vim_1">Who should learn vim? <a class="head_anchor" href="#who-should-learn-vim_1" rel="nofollow">#</a>
</h1>
<p>Anyone who is starting to feel slowed down by their text editor of choice should learn vim. <br>
Often, it can be really easy to pick up vim since many popular text editors have a vim plugin. <br>
These plugins let you use vim commands and keybindings without leaving the editor you’re used to. <br>
vim is also very configurable, and while this is nice, it can also be intimidating when first starting out. Using a plugin in an editor you’re already used to helps soften the learning curve for using vim. As a related sidenote: if you want a nice introduction to learning vim commands, check out this <a href="https://vim-adventures.com" rel="nofollow">online game</a>.</p>
<p>If you’re new to programming, like maybe you’re in your first CS class or you’ve just started your first project, you probably don’t need to learn vim. Programming already has quite a learning curve; you don’t need to compound it by learning vim at the same time.</p>
<p>On the other hand, if you’re comfortable programming and you’ve done at least a few CS classes or you’ve got a couple of projects under your belt, you’ll likely benefit from vim. <br>
When you first start learning vim, you’ll move slower and might be a little less productive, but once you’ve put in the time to get familiar with it, your productivity will increase.</p>
<h1 id="that39s-all-cool-but-i-still-love-vscode_1">That’s all cool, but I still love VSCode… <a class="head_anchor" href="#that39s-all-cool-but-i-still-love-vscode_1" rel="nofollow">#</a>
</h1>
<p>And don’t get me wrong, VSCode is great. It was the number one text editor on <a href="https://insights.stackoverflow.com/survey/2019#development-environments-and-tools" rel="nofollow">Stack Overflow’s 2019 Developer Survey</a>. <br>
I mentioned above that I used VSCode for a bit before turning to vim, and I’ve gotta say, I’ve been able to replicate every feature I cared about from VSCode and it’s all quicker in vim. <br>
Part of my reason for switching was that I’d check the activity manager on my machine and it would show that VSCode was sometimes using half a gigabyte of RAM! That felt a little silly to me, since all I was doing was editing text. Of course, part of that RAM usage is from language servers, but it still felt like too much. I think there are also things that VSCode does that I just never used. With vim, I can configure exactly what features I’m going to use and there’s much less bloat, since I had to install those features directly.</p>
<p>If you’re a VSCode user and you’re happy, then great! I’m only kind of trying to convince you to use vim. But if you’re a VSCode user and you wish you could keep only certain parts of the VSCode experience because the whole thing is too much for you… Then you should definitely check out vim.</p>
<h1 id="what39s-the-point_1">What’s the point? <a class="head_anchor" href="#what39s-the-point_1" rel="nofollow">#</a>
</h1>
<p>The point was for me to talk about how much I love vim and how I like my new setup, and to try to convince you to try it out too.</p>
<p>While you’re here, I’ll share my <code class="prettyprint">init.vim</code> file with you. <br>
Credit to Ben Awad for providing a helpful <a href="https://gist.github.com/benawad/b768f5a5bbd92c8baabd363b7e79786f" rel="nofollow">starter <code class="prettyprint">init.vim</code></a>.<br>
I started my file from his and added/modified what I wanted to change.</p>
<p>So here it is:</p>
<pre><code class="prettyprint lang-vim">set nocompatible
set number
set autoindent
set smarttab
set shortmess+=I
set relativenumber
set laststatus=2
set backspace=indent,eol,start
set hidden
set ignorecase
set smartcase
set incsearch
" 'Q' in normal mode enters Ex mode. You almost never want this.
" (Comment moved to its own line: vim treats trailing text after :map as part of the mapping.)
nmap Q <Nop>
set noerrorbells visualbell t_vb=
set mouse+=a
" Below enables fuzzy file finding. In NORMAL mode, type :find <file>. Can
" autocomplete with tab.
set path+=**
set wildmenu
let mapleader = ","
" Keep cursor in middle of screen.
set so=999
" Setup plugin manager
call plug#begin('~/.vim/plugged')
" Main leanguage server, should configure similar to VScode
Plug 'neoclide/coc.nvim', {'branch': 'release'}
Plug 'preservim/nerdtree' " Filetree viewer
" View git status in nerdtree, shows stars by dirty files.
Plug 'Xuyuanp/nerdtree-git-plugin'
Plug 'ryanoasis/vim-devicons' " Cool icons for each filetype
" Navigate between open panes using Ctrl + h/j/k/l
Plug 'christoomey/vim-tmux-navigator'
Plug 'airblade/vim-gitgutter' " View git changes in the file gutter.
Plug 'preservim/nerdcommenter' " Extra commenting powers.
" Syntax higlighting support for a truly ridiculous number of languages.
Plug 'sheerun/vim-polyglot'
Plug 'joshdick/onedark.vim' " Onedark theme.
Plug 'kaicataldo/material.vim', { 'branch': 'main' } " Material theme
Plug 'vim-airline/vim-airline' " Airline bar on the bottom
Plug 'jiangmiao/auto-pairs' " Autocomplete parentheses, braces, etc.
Plug 'psliwka/vim-smoothie' " Smoother scrolling in vim
" Preview markdown files in the internet browser.
Plug 'iamcco/markdown-preview.nvim', { 'do': { -> mkdp#util#install() }, 'for': ['markdown', 'vim-plug']}
call plug#end()
" NERDTree configurations.
autocmd VimEnter * NERDTree | wincmd p
autocmd FileType nerdtree setlocal nolist
autocmd BufEnter * if tabpagenr('$') == 1 && winnr('$') == 1 && exists('b:NERDTree') && b:NERDTree.isTabTree() |
\ quit | endif
autocmd BufWinEnter * silent NERDTreeMirror
" Remap commands.
nnoremap <leader>n :NERDTreeFocus<CR>
" Ctrl-n toggles nerdtree
nmap <C-n> :NERDTreeToggle<CR>
" Ctrl-m runs Markdown preview. Will only work in markdown files
nmap <C-m> <Plug>MarkdownPreview
" Ctrl-s stops Markdown preview. Only does stuff if you're already running a
" markdown preview.
nmap <C-s> <Plug>MarkdownPreviewStop
vmap ++ <plug>NERDCommenterToggle
nmap ++ <plug>NERDCommenterToggle
let g:NERDTreeGitStatusWithFlags=1
" Set color theme styles.
let g:material_theme_style = 'darker'
colorscheme material
if (has('termguicolors'))
set termguicolors
endif
let g:mkdp_auto_start = 0
" Remap tab to code autocompletion.
inoremap <silent><expr> <TAB>
\ pumvisible() ? "\<C-n>" :
\ <SID>check_back_space() ? "\<TAB>" :
\ coc#refresh()
inoremap <expr><S-TAB> pumvisible() ? "\<C-p>" : "\<C-h>"
function! s:check_back_space() abort
let col = col('.') - 1
return !col || getline('.')[col - 1] =~# '\s'
endfunction
" Use <c-space> to trigger completion.
if has('nvim')
inoremap <silent><expr> <c-space> coc#refresh()
else
inoremap <silent><expr> <c-@> coc#refresh()
endif
</code></pre>
<p><em>In theory</em> this should let you get the exact setup that I have, but of course often these sorts of things require a little trial-and-error to get working perfectly.</p>
<p>Well, I hope you enjoyed my ramblings about vim, dear reader. Hopefully you think vim is worth checking out! </p>
<p>Thanks for reading. ✌️</p>
<h1 id="references_1">References <a class="head_anchor" href="#references_1" rel="nofollow">#</a>
</h1>
<ul>
<li>
<p>For the history of vim: </p>
<ul>
<li><a href="https://en.wikipedia.org/wiki/Vim_(text_editor)" rel="nofollow">Vim widipedia</a></li>
<li><a href="https://en.wikipedia.org/wiki/Vi" rel="nofollow">Vi wikipedia</a></li>
<li><a href="https://en.wikipedia.org/wiki/Ex_(text_editor)" rel="nofollow">Ex wikipedia</a></li>
<li><a href="https://en.wikipedia.org/wiki/Bill_Joy" rel="nofollow">Bill Joy wikipedia</a></li>
<li><a href="https://en.wikipedia.org/wiki/Bram_Moolenaar" rel="nofollow">Bram Moolenaar wikipedia</a></li>
</ul>
</li>
<li>
<p>For vim’s relevance:</p>
<ul>
<li><a href="https://insights.stackoverflow.com/survey/2019#development-environments-and-tools" rel="nofollow">Stackoverflow survey</a></li>
<li><a href="https://blog.onebar.io/hjkl-or-how-to-feel-less-tired-after-a-day-of-coding-48f975ba4091" rel="nofollow">Blog on vim ergonomics</a></li>
<li><a href="https://www.ibrahimirfan.com/why-you-should-learn-vim/" rel="nofollow">Another blog on vim ergonomics</a></li>
<li><a href="https://news.ycombinator.com/item?id=7828717" rel="nofollow">Hackernews thread about vim</a></li>
</ul>
</li>
<li>
<p>For my setup:</p>
<ul>
<li><a href="https://neovim.io" rel="nofollow">Neovim</a></li>
<li><a href="https://github.com/portante/pycscope" rel="nofollow">Python jump to definition/reference/etc</a></li>
<li><a href="https://github.com/vim-airline/vim-airline" rel="nofollow">Status bar</a></li>
<li><a href="https://github.com/kaicataldo/material.vim" rel="nofollow">Material theme</a></li>
<li><a href="https://github.com/iamcco/markdown-preview.nvim" rel="nofollow">Markdown preview</a></li>
<li><a href="https://github.com/preservim/nerdcommenter" rel="nofollow">Nerdcommenter</a></li>
<li><a href="https://github.com/fannheyward/coc-pyright" rel="nofollow">Python language server</a></li>
<li><a href="https://github.com/preservim/nerdtree" rel="nofollow">Nerdtree</a></li>
<li><a href="https://github.com/junegunn/vim-plug" rel="nofollow">Plugin manager</a></li>
<li><a href="https://github.com/neoclide/coc.nvim" rel="nofollow">Coc code autocompletion package</a></li>
<li><a href="https://github.com/Xuyuanp/nerdtree-git-plugin" rel="nofollow">Nerdtree git plugin</a></li>
<li><a href="https://github.com/ryanoasis/vim-devicons" rel="nofollow">Vim devicons</a></li>
<li><a href="https://github.com/christoomey/vim-tmux-navigator" rel="nofollow">Vim Tmux navigator</a></li>
<li><a href="https://github.com/airblade/vim-gitgutter" rel="nofollow">Vim gitgutter</a></li>
<li><a href="https://github.com/sheerun/vim-polyglot" rel="nofollow">Vim polyglot</a></li>
<li><a href="https://github.com/psliwka/vim-smoothie" rel="nofollow">Vim smoothie</a></li>
<li><a href="https://github.com/kaicataldo/material.vim" rel="nofollow">Material theme</a></li>
<li><a href="https://github.com/vim-airline/vim-airline" rel="nofollow">Vim airline</a></li>
<li><a href="https://github.com/jiangmiao/auto-pairs" rel="nofollow">Auto pairs</a></li>
</ul>
</li>
<li>
<p>Ben Awad’s stuff:</p>
<ul>
<li><a href="https://youtu.be/gnupOrSEikQ" rel="nofollow">Video</a></li>
<li><a href="https://gist.github.com/benawad/b768f5a5bbd92c8baabd363b7e79786f" rel="nofollow"><code class="prettyprint">init.vim</code></a></li>
</ul>
</li>
</ul>
<h1>Why I think RL tooling matters</h1>
<p><strong>The tools we use influence the research we do, and while there are many good RL tools out there, there are still areas where tools need to be built. For more about topics like this or to read the original article, <a href="https://jfpettit.github.io/blog/2020/09/11/why-rl-tooling-matters" rel="nofollow">click here</a>.</strong></p>
<h2 id="the-rl-tools-everyone-has_2">The RL Tools Everyone Has <a class="head_anchor" href="#the-rl-tools-everyone-has_2" rel="nofollow">#</a>
</h2>
<p>Maybe I’m wrong, but I think every RL researcher has some tools they’ve built and that they use across their projects. Since they’ve built them, these tools are the perfect fit for them. But their tools might also be useful for someone else. However, we rarely see code for RL tools get packaged up and open-sourced. I’m guilty of it too. I’ve got the same NN code that I copy across projects, and my method of reuse of some core utility functions is criminal. Suffice to say, any software engineer would cringe at my process. But it works for me. I run experiments, try new things, and get results.</p>
<p>The thing that sucks, though, is when I want to let someone else run my code. Suddenly, they’re looking at this mess that I’ve been hacking at for months (and it’s really a mess). I doubt this is a problem that only I have. So lately, I’ve been trying to be better about writing clear, maintainable code. I’ve made an effort to make my few NNs and utilities easier to reuse across projects (as in, I’ve stopped just cp-ing files into whatever project directory I’m currently in). I think this has made a big difference in my work. Coworkers say my code is “pretty” and “readable”. Imagine that! I’m basking in my own glow a bit over here, but my point is that a little work up front goes a long way towards enabling future you (and your coworkers) to understand and reuse your code.</p>
<p>I’m not talking about building some massive, engineered, impressive library just for you and whatever coworkers happen to find it useful (I mean, if you open-source it and get lucky, maybe everyone will love it), but rather taking a bit of extra time at the start of a new project to properly set it up as a repository and as an editable Python package, making it easier for future you to reuse your code. If you like Jupyter Notebooks, I think nbdev is an awesome resource for this kind of stuff.</p>
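<p>To make that concrete, here’s a minimal sketch of the kind of setup I mean. The package name and dependencies below are illustrative placeholders, not my actual files:</p>
<pre><code class="prettyprint lang-python"># setup.py -- the handful of lines that turn a project directory into an
# editable package. "my_rl_utils" is a placeholder name.
from setuptools import setup, find_packages

setup(
    name="my_rl_utils",
    version="0.1.0",
    packages=find_packages(),
    install_requires=["numpy", "torch"],  # whatever your utilities depend on
)
</code></pre>
<p>After a one-time <code class="prettyprint">pip install -e .</code> in the project root, your utilities import from any other project on the machine, and edits to the source take effect without reinstalling.</p>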
<h2 id="the-rl-tools-we-need_2">The RL Tools We Need <a class="head_anchor" href="#the-rl-tools-we-need_2" rel="nofollow">#</a>
</h2>
<p>Again, I’m going out on a limb here and extrapolating from a few experiences (mine and coworkers), but it seems like there are some RL tools that don’t really exist yet and might be generally useful.</p>
<p>I think a widely used and trusted collection of environment wrappers would be excellent. Things like frame stacking, state and reward normalization, and meta-env wrappers would be extremely helpful. Some libraries include a few of these wrappers, but I’ve yet to see anyone put them front-and-center in their package. I think there is too much focus on implementing algorithms for people to run, and not enough focus on providing distinct tools that researchers can use when building new ideas. There really only need to be a few high-quality RL algorithm implementations (ideally a set of scalable implementations for each DL framework) for benchmarking purposes and production work (if you can productionize RL, serious props). Beyond that, I suspect it’s more useful to provide a framework that lets researchers implement only the novel parts of their work and use pre-built components everywhere else.</p>
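<p>Here’s a sketch of the kind of standalone wrapper I have in mind: a running reward normalizer built on Gym’s wrapper API. The Welford-style update and the names here are mine, written for illustration, not lifted from any existing package:</p>
<pre><code class="prettyprint lang-python">import gym
import numpy as np

class NormalizeReward(gym.Wrapper):
    """Scale rewards by a running estimate of their standard deviation.
    A sketch, not a battle-tested implementation."""

    def __init__(self, env, eps=1e-8):
        super().__init__(env)
        self.eps = eps
        self.count, self.mean, self.m2 = 0, 0.0, 0.0

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        # Welford's online update for the running mean and variance.
        self.count += 1
        delta = reward - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (reward - self.mean)
        std = np.sqrt(self.m2 / self.count) if self.count > 1 else 1.0
        return obs, reward / (std + self.eps), done, info
</code></pre>
<p>Usage is a one-liner, <code class="prettyprint">env = NormalizeReward(gym.make("CartPole-v0"))</code>, and that composability is exactly why a trusted collection of these would be so useful.</p>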
<p>This has been a bit of a rant, but this problem has been bugging me a bunch lately. Unfortunately, it seems like the modular RL frameworks that I think look easy and pleasant to use are all made by DeepMind, which means that they work with JAX and TensorFlow, but not PyTorch. As a PyTorch user and lover, this makes me sad. I mean, JAX seems awesome too but on top of working full time, it’s just hard to set aside the time to learn a new framework.</p>
<p>Recently, <a href="https://pfrl.readthedocs.io/en/latest/" rel="nofollow">PFRL</a> came out, and I think it helps with this problem some, but of course there is still progress to be made. And I still haven’t found a package offering a set of high-quality, modular environment wrappers!</p>
<p>If someone wrote a set of wrappers that worked with NumPy arrays (because that’s what Gym takes as arguments and what it returns), then it would be easy for people to convert those NumPy arrays to the array format of their chosen framework. Maybe I’ll work on this myself; we’ll see.</p>
<h2 id="the-tools-i-love_2">The Tools I Love <a class="head_anchor" href="#the-tools-i-love_2" rel="nofollow">#</a>
</h2>
<p>If I were to talk only about the RL tools I truly love, this would be a short list. We’d talk about OpenAI’s Gym and about the PyBullet environments and that would be it. So instead I’ll also talk about a couple of ML tools that I use a ton in RL that I love.</p>
<h3 id="a-hrefhttpswwwpytorchorgpytorcha_3">
<a href="https://www.pytorch.org" rel="nofollow">PyTorch</a> <a class="head_anchor" href="#a-hrefhttpswwwpytorchorgpytorcha_3" rel="nofollow">#</a>
</h3>
<p>I use PyTorch for all of the general NN and math stuff it’s built for.</p>
<h3 id="a-hrefhttpswwwpytorchlightningaipytorchlightn_3">
<a href="https://www.pytorchlightning.ai" rel="nofollow">PyTorch-Lightning</a> <a class="head_anchor" href="#a-hrefhttpswwwpytorchlightningaipytorchlightn_3" rel="nofollow">#</a>
</h3>
<p>PyTorch-Lightning is awesome. It can feel kind of weird and hacky to use it for RL, but it imposes a common structure across code that is just fantastic. Plus, it comes with a ton of free stuff, like GPU/multi-node training, automatic experiment logging, and so on. I won’t do a full sales pitch for PyTorch-Lightning, but check out their documentation if you’re a PyTorch user.</p>
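<p>As a taste of the structure it imposes, here’s a bare-bones sketch; nothing RL-specific, and the tiny model is just a stand-in:</p>
<pre><code class="prettyprint lang-python">import torch
import pytorch_lightning as pl

class TinyModel(pl.LightningModule):
    # Lightning's contract: the model in __init__, the loss computation in
    # training_step, the optimizer in configure_optimizers.
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Linear(8, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.mse_loss(self.net(x), y)
        self.log("train_loss", loss)  # experiment logging comes for free
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)
</code></pre>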
<h3 id="a-hrefhttpsgymopenaicomopenais-gyma_3">
<a href="https://gym.openai.com" rel="nofollow">OpenAI’s Gym</a> <a class="head_anchor" href="#a-hrefhttpsgymopenaicomopenais-gyma_3" rel="nofollow">#</a>
</h3>
<p>The Gym package provides a standard set of RL environments and gives an API for implementing new environments. It’s been flexible and helpful in RL research for a while.</p>
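<p>The API surface is small: declare your observation and action spaces, then implement <code class="prettyprint">reset</code> and <code class="prettyprint">step</code>. A toy environment invented purely for illustration:</p>
<pre><code class="prettyprint lang-python">import gym
import numpy as np
from gym import spaces

class CoinGuessEnv(gym.Env):
    """One-step toy env: guess whether the observation is above 0.5."""

    def __init__(self):
        self.observation_space = spaces.Box(0.0, 1.0, shape=(1,), dtype=np.float32)
        self.action_space = spaces.Discrete(2)

    def reset(self):
        self.state = np.random.rand(1).astype(np.float32)
        return self.state

    def step(self, action):
        reward = float(action == int(self.state[0] > 0.5))
        return self.state, reward, True, {}  # single-step episodes
</code></pre>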
<h3 id="a-hrefhttpspybulletorgwordpresspybulleta_3">
<a href="https://pybullet.org/wordpress" rel="nofollow">PyBullet</a> <a class="head_anchor" href="#a-hrefhttpspybulletorgwordpresspybulleta_3" rel="nofollow">#</a>
</h3>
<p>I have yet to build a custom environment with PyBullet, but I love that it’s open-source and that they provide alternatives to the MuJoCo environments. It works well, is easy to install, and runs quickly on my laptop (a MacBook Pro). Love it.</p>
<h3 id="a-hrefhttpswwwwandbcomweights-and-biasesa_3">
<a href="https://www.wandb.com" rel="nofollow">Weights and Biases</a> <a class="head_anchor" href="#a-hrefhttpswwwwandbcomweights-and-biasesa_3" rel="nofollow">#</a>
</h3>
<p>This is the best experiment logger I’ve used. If you’re doing RL, and you wrap your environment in a Monitor, then Weights and Biases can automatically log videos of your agent in the environment to their dashboard.</p>
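<p>Roughly, the setup looks like the sketch below; the project name is a placeholder, and it’s worth checking the Weights and Biases docs for the current incantation:</p>
<pre><code class="prettyprint lang-python">import gym
import wandb

wandb.init(project="my-rl-project", monitor_gym=True)  # pick up recorded videos
env = gym.wrappers.Monitor(gym.make("LunarLander-v2"), "videos", force=True)

obs, done = env.reset(), False
while not done:
    obs, reward, done, info = env.step(env.action_space.sample())
    wandb.log({"reward": reward})
env.close()
</code></pre>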
<h3 id="a-hrefhttpsnbdevfastainbdeva_3">
<a href="https://nbdev.fast.ai" rel="nofollow">nbdev</a> <a class="head_anchor" href="#a-hrefhttpsnbdevfastainbdeva_3" rel="nofollow">#</a>
</h3>
<p>If you like Jupyter Notebooks then I think nbdev is a great thing to look at for building your code up as a package. All the stuff that you don’t export into your package becomes tests for your code, so the tests are pretty much built-in from your notebook! I like to experiment with plain Python files a bunch too, and I’ve found nbdev just as useful when cleaning up and packaging code you’ve already written.</p>
<h2 id="wrapping-it-up_2">Wrapping it up <a class="head_anchor" href="#wrapping-it-up_2" rel="nofollow">#</a>
</h2>
<p>Sorry the structure of this post doesn’t really make much sense. It’s basically been a big rant about how I think RL tools aren’t good enough yet. I’m experimenting with writing less-polished, more-frequent posts instead of agonizing over trying to make a post as perfect as possible. So this is much more stream-of-consciousness than earlier posts.</p>
<p>But otherwise, yeah, this is a pretty big problem. A library or set of libraries that provide common utilities to RL folks will probably help a ton with the reproducibility crisis in deep RL right now. If we could design algorithms using trusted component implementations and then only custom-write what’s new to our algorithm, that would likely help eliminate bugs in research code. It would even help people re-implementing your algorithm later on. Plus, I (and I bet other researchers) would be able to iterate a lot quicker if I could use and swap out or custom-write whatever algorithm components I need during the experimentation process.</p>
<p>I suppose this is all just food for thought.</p>
<h1>Beginner friendly reinforcement learning with rlpack</h1>
<p><a href="https://svbtleusercontent.com/rdpwqp4pMMiqGnWVvWopfr0xspap.gif" rel="nofollow"><img src="https://svbtleusercontent.com/rdpwqp4pMMiqGnWVvWopfr0xspap_small.gif" alt="nice_job_lunarlander.gif"></a><br>
<em>Trained PPO agent playing LunarLander</em></p>
<p>Lately, I’ve been working on learning more about deep reinforcement learning and decided to start writing my own RL framework as a way to get really familiar with some of the algorithms. In the process, I also thought it could be cool to make my framework a resource for beginners to easily get started with reinforcement learning. With that in mind, the goal wasn’t state of the art performance. However, in the future, I might decide that a high standard of performance is something I want to prioritize over beginner-friendliness. </p>
<p>This framework isn’t finished, but this post indicates the first version of it that I’m happy with being done. Up until now, it’s been the main project I’ve focused on, but moving forward I’ll be putting more time into exploring different things and will more passively work on expanding rlpack. Find the repository <a href="https://github.com/jfpettit/rl-pack" rel="nofollow">here</a> or by clicking on the title of this post.</p>
<h2 id="algorithms-implemented_2">Algorithms Implemented <a class="head_anchor" href="#algorithms-implemented_2" rel="nofollow">#</a>
</h2>
<p>So far, I’ve implemented and tested a couple of policy gradient algorithms.</p>
<ul>
<li><a href="https://lilianweng.github.io/lil-log/2018/04/08/policy-gradient-algorithms.html#reinforce" rel="nofollow">REINFORCE</a></li>
<li><a href="https://openai.com/blog/baselines-acktr-a2c/" rel="nofollow">Advantage Actor Critic (A2C)</a></li>
<li><a href="https://openai.com/blog/openai-baselines-ppo/" rel="nofollow">Proximal Policy Optimization (PPO)</a></li>
</ul>
<p>I haven’t gone to the trouble of benchmarking these algorithms directly against existing packages; instead, I’ve used performance relative to the <a href="https://github.com/openai/gym/wiki/Leaderboard" rel="nofollow">Gym leaderboards</a> as a metric of success. </p>
<p>The versions of REINFORCE and A2C implemented are just your vanilla, basic implementations. PPO is implemented with the clipped objective function, AKA PPO-clip.</p>
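<p>For reference, the clipped surrogate objective at the heart of PPO-clip boils down to a few lines. This is a generic PyTorch sketch, not rlpack’s exact code:</p>
<pre><code class="prettyprint lang-python">import torch

def ppo_clip_loss(logp_new, logp_old, advantages, clip_ratio=0.2):
    # Probability ratio between the updated policy and the one that
    # collected the data.
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1 - clip_ratio, 1 + clip_ratio) * advantages
    # Elementwise min of the unclipped and clipped objectives, negated
    # because optimizers minimize.
    return -torch.min(ratio * advantages, clipped).mean()
</code></pre>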
<h2 id="performance_2">Performance <a class="head_anchor" href="#performance_2" rel="nofollow">#</a>
</h2>
<p>Each of the algorithms was tested on CartPole-v0, Acrobot-v1, and LunarLander-v2 from OpenAI’s Gym library of RL environments. Training was stopped when the average reward over the last 100 episodes passed a set threshold, taken from the <a href="https://github.com/openai/gym/wiki/Leaderboard" rel="nofollow">Gym leaderboards</a>: CartPole-v0 is considered solved when the last-100-episode average reward is greater than 195, LunarLander-v2 when it is greater than 200, and Acrobot-v1 has no threshold at which it is considered solved. Here are the reward plots over training.</p>
<p><a href="https://svbtleusercontent.com/m3JVCLduNmvxsxyTQ9Yvn50xspap.png" rel="nofollow"><img src="https://svbtleusercontent.com/m3JVCLduNmvxsxyTQ9Yvn50xspap_small.png" alt="CartPole-v0performance.png"></a></p>
<p><a href="https://svbtleusercontent.com/9iKfmVLLvEg8m74WDbrBdf0xspap.png" rel="nofollow"><img src="https://svbtleusercontent.com/9iKfmVLLvEg8m74WDbrBdf0xspap_small.png" alt="Acrobot-v1performance.png"></a></p>
<p><a href="https://svbtleusercontent.com/pbkkhHoDMkRz7bWimWQPGo0xspap.png" rel="nofollow"><img src="https://svbtleusercontent.com/pbkkhHoDMkRz7bWimWQPGo0xspap_small.png" alt="LunarLander-v2performance.png"></a></p>
<p>I’ve been pretty impressed by PPO’s performance. For fun, look at the video of PPO playing LunarLander below. The objective here is to control the ship to land between the flags as efficiently as possible. The terrain is also randomly generated. Here, the agent tries to land but slides out from between the flags, then manages a second attempt to land before the episode ends!</p>
<p><a href="https://svbtleusercontent.com/bpEPXbPULZj7JTvoAanCx50xspap.gif" rel="nofollow"><img src="https://svbtleusercontent.com/bpEPXbPULZj7JTvoAanCx50xspap_small.gif" alt="really_good.gif"></a></p>
<p>PPO is a very versatile algorithm and is able to obtain good performance on a wide variety of environments. Below is a video of it trying to accomplish the challenging <a href="http://www.cs.huji.ac.il/%7Eai/projects/2017/learning/Cart-pole-swing-up/" rel="nofollow">CartPole swingup</a> task, where it aims to swing a pole from beneath a cart and balance it on top of the cart. You can find my implementation of CartPole SwingUp as a Gym environment <a href="https://github.com/jfpettit/cartpole-swingup-envs" rel="nofollow">here</a>. </p>
<p><a href="https://svbtleusercontent.com/tyoVLn3ifreZwA42sgLPDD0xspap.gif" rel="nofollow"><img src="https://svbtleusercontent.com/tyoVLn3ifreZwA42sgLPDD0xspap_small.gif" alt="decent_swingup.gif"></a></p>
<p>It does a mediocre job there. Here’s a video of it not doing well at all:</p>
<p><a href="https://svbtleusercontent.com/pVYspQHavr5wfDbfaYrQGt0xspap.gif" rel="nofollow"><img src="https://svbtleusercontent.com/pVYspQHavr5wfDbfaYrQGt0xspap_small.gif" alt="crappy_swingup.gif"></a></p>
<p>I’ve also got GIFs of REINFORCE playing CartPole and A2C playing Acrobot; have a look below.</p>
<p><a href="https://svbtleusercontent.com/mtjXNZwLXR28Nq3213QSJ70xspap.gif" rel="nofollow"><img src="https://svbtleusercontent.com/mtjXNZwLXR28Nq3213QSJ70xspap_small.gif" alt="good_cartpole.gif"></a><br>
<em>REINFORCE playing CartPole-v0. The objective here is to balance the pole on the cart for as long as possible.</em></p>
<p><a href="https://svbtleusercontent.com/93Tx9HUAsokfh8i3ce8wCY0xspap.gif" rel="nofollow"><img src="https://svbtleusercontent.com/93Tx9HUAsokfh8i3ce8wCY0xspap_small.gif" alt="acrobot_play.gif"></a><br>
<em>A2C playing Acrobot-v1. Here, the goal is to swing the arm up over the line as quickly as possible.</em></p>
<h2 id="installation-and-usage_2">Installation and Usage <a class="head_anchor" href="#installation-and-usage_2" rel="nofollow">#</a>
</h2>
<p>Installation is simple. Head over to my <a href="https://github.com/jfpettit/rl-pack" rel="nofollow">repository on GitHub</a>, open a terminal on your computer, and clone it.</p>
<pre><code class="prettyprint">git clone https://github.com/jfpettit/rl-pack.git
pip install -e rl-pack
</code></pre>
<p>And you’ve got it installed! If you want to quickly run some code to test it, you can navigate to the <code class="prettyprint">/examples/</code> directory within the <code class="prettyprint">rl-pack</code> folder and run the following command:</p>
<pre><code class="prettyprint">python [FILE_TO_RUN] --watch --plot --save_mv
</code></pre>
<p>The <code class="prettyprint">--watch</code>, <code class="prettyprint">--plot</code>, and <code class="prettyprint">--save_mv</code> tags are optional and indicate whether you’d like to watch your trained agent play in the environment, whether you’d like to see a plot of the agent’s reward over training, and whether you’d like to save an MP4 file of your agent playing in the environment. The <code class="prettyprint">[FILE_TO_RUN]</code> is one of <code class="prettyprint">reinforce_cartpole.py</code>, <code class="prettyprint">a2c_acrobot.py</code>, <code class="prettyprint">ppo_lunarlander.py</code> and <code class="prettyprint">ppo_swingup.py</code>. If you’d like to run an algorithm on a different environment, you can open one of these prewritten files and modify the environment or you can get more involved and write your own file. Please note, at present this package only supports discrete actions. It is on my to-do list to extend these algorithms to include continuous actions too. </p>
<p>Alternatively, if you’d like to depart from the example files and write your own, they’re still an excellent guide to how I intended this package to be used. The idea here was to abstract away the reinforcement learning algorithm and leave it up to the user to define the policy network and the environment. These algorithms should work with any PyTorch neural network and any environment that uses the OpenAI Gym API. However, I haven’t tested recurrent networks, so those may or may not work properly with this code. </p>
<p>rlpack requires that your actor-critic networks output both action probabilities and a value estimate. You can implement this by sharing all of the layers except for the output layers of the network or by implementing one actor-critic class that effectively contains two neural networks, one for the policy and one for the value function. It’s all the same to rlpack as long as your <code class="prettyprint">forward</code> function returns both action probabilities and a value estimate. Finally, to use your network with rlpack you need to implement an <code class="prettyprint">evaluate</code> function within your network class. This function should take in a batch of states and actions and return action log probabilities, value estimates, and distribution entropy for each state and action in the batch. An example of how I’ve done this is in the <a href="https://github.com/jfpettit/rl-pack/blob/master/rlpack/neural_nets.py" rel="nofollow"><code class="prettyprint">neural_nets.py</code> file</a> in the repository. </p>
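<p>For a quick picture of that interface, here’s a minimal sketch written from the description above; rlpack’s actual <code class="prettyprint">neural_nets.py</code> may differ in detail:</p>
<pre><code class="prettyprint lang-python">import torch
import torch.nn as nn
from torch.distributions import Categorical

class ActorCritic(nn.Module):
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        self.policy_head = nn.Linear(hidden, n_actions)
        self.value_head = nn.Linear(hidden, 1)

    def forward(self, obs):
        # Shared body, two heads: action probabilities and a value estimate.
        h = self.body(obs)
        return torch.softmax(self.policy_head(h), dim=-1), self.value_head(h)

    def evaluate(self, states, actions):
        # Batch of states and actions in; log probs, values, entropy out.
        probs, values = self.forward(states)
        dist = Categorical(probs)
        return dist.log_prob(actions), values.squeeze(-1), dist.entropy()
</code></pre>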
<h2 id="future-work_2">Future work <a class="head_anchor" href="#future-work_2" rel="nofollow">#</a>
</h2>
<p>For now, I’m going to put less time on rlpack and focus on some other things I find interesting. I will, however, still passively work on things like extending to continuous actions and implementing a couple more algorithms. If you’d like to contribute, feel free to submit a pull request on the repository. </p>
<h2 id="additional-reading_2">Additional reading. <a class="head_anchor" href="#additional-reading_2" rel="nofollow">#</a>
</h2>
<p>If you’re totally brand new to RL and would like a couple of resources to read up on some of the theory behind these things, check out the few links below:</p>
<ul>
<li><a href="https://spinningup.openai.com/en/latest/" rel="nofollow">OpenAI’s SpinningUp</a></li>
<li><a href="http://incompleteideas.net/book/the-book-2nd.html" rel="nofollow">Sutton and Barto’s Reinforcement Learning: An Introduction</a></li>
<li>SpinningUp also has a gigantic list of resources that I’m still using to learn new things too, so I’d recommend checking that out.</li>
</ul>
<h2 id="wrapping-up_2">Wrapping Up <a class="head_anchor" href="#wrapping-up_2" rel="nofollow">#</a>
</h2>
<p>If you stuck around this long, thanks. I hope that rlpack is helpful if you choose to use it. If you have any questions you can <a href="mailto:jfpettit@gmail.com" rel="nofollow">email me</a> or find me on <a href="https://twitter.com/jacobpettit18" rel="nofollow">twitter</a>.</p>
<p>Should someone find this useful in academic work, please cite it using the following:</p>
<pre><code class="prettyprint">@misc{Pettitrlpack,
Author = {Pettit, Jacob},
Title = {rlpack},
Year = {2019},
}
</code></pre>
<h1>Making it easier to play my Tic-Tac-Toe agent</h1>
<h3 id="background_3">Background <a class="head_anchor" href="#background_3" rel="nofollow">#</a>
</h3>
<p>As part of my senior project in undergrad, I made a Tic-tac-toe playing RL agent. It used the simple temporal difference (TD) update rule from Chapter 1 of Sutton and Barto’s RL book. In fact, that chapter covers everything needed to build the agent: it does a case study of making a Tic-Tac-Toe playing algorithm, from what’s needed from the environment to the update rule for the agent. Definitely worth a read, especially if you want to implement one yourself.</p>
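<p>For the curious, the update itself is tiny: after each move, nudge the value of the previous board state toward the value of the state that followed it. A minimal tabular sketch (the variable names are mine, not the project’s):</p>
<pre><code class="prettyprint lang-python">def td_update(values, state, next_state, alpha=0.1):
    """V(s) gets moved toward V(s') by step size alpha; unseen states
    default to a neutral value of 0.5."""
    v_s = values.get(state, 0.5)
    v_next = values.get(next_state, 0.5)
    values[state] = v_s + alpha * (v_next - v_s)
</code></pre>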
<h3 id="the-update_3">The Update <a class="head_anchor" href="#the-update_3" rel="nofollow">#</a>
</h3>
<p>A friend asked me how he could play my trained agent, so I went ahead and wrote a simple script to make it easy for anyone (with a tiny bit of terminal knowledge) to play against it. Here’s how to do it:</p>
<p>Head over to <a href="https://github.com/jfpettit/senior-practicum" rel="nofollow">my GitHub repository</a> and clone it:</p>
<pre><code class="prettyprint">git clone https://github.com/jfpettit/senior-practicum.git
</code></pre>
<p>Once you’ve cloned it, go ahead and cd into the repository and into the Tic-tac-toe folder:</p>
<pre><code class="prettyprint">cd senior-practicum/TD_tictactoe/
</code></pre>
<p>At last, you can run the game with:</p>
<pre><code class="prettyprint">python tictactoe_runner.py
</code></pre>
<p>Here’s a sample of a game I played with it so you know what kind of output should show up in your terminal:</p>
<pre><code class="prettyprint">Jacobs-MacBook-Pro:TD_TicTacToe jacobpettit$ python tictactoe_runner.py
Select piece to play as: input X or O:x
[['-' '-' '-']
['-' '-' '-']
['-' '-' '-']]
Input your move coordinates, separated by a comma: 1,1
[['-' '-' 'O']
['-' 'X' '-']
['-' '-' '-']]
Input your move coordinates, separated by a comma: 2,0
[['-' '-' 'O']
['-' 'X' '-']
['X' '-' 'O']]
Input your move coordinates, separated by a comma: 1,2
[['-' '-' 'O']
['O' 'X' 'X']
['X' '-' 'O']]
Input your move coordinates, separated by a comma: 2,1
[['-' 'O' 'O']
['O' 'X' 'X']
['X' 'X' 'O']]
Input your move coordinates, separated by a comma: 0,0
[['X' 'O' 'O']
['O' 'X' 'X']
['X' 'X' 'O']]
</code></pre>
<p>So, in this case, nobody won. The code doesn’t print out the winner of the game, so don’t expect any output after that last move.</p>
<h1>Introducing gym-snake-rl</h1>
<p><a href="https://github.com/jfpettit/gym-snake-rl" rel="nofollow"><code class="prettyprint">github_repo</code></a></p>
<h2 id="motivation_2">Motivation <a class="head_anchor" href="#motivation_2" rel="nofollow">#</a>
</h2>
<p>Although there are existing implementations of the classic Snake game (play the game <a href="https://www.coolmathgames.com/0-snake" rel="nofollow">here</a>), I wanted to create my own implementation for a few reasons:</p>
<ul>
<li>Opportunity to learn more about environment design, including designing an observation space and reward signals.</li>
<li>Write an implementation with random map generation, so that this code could be used to work on generalization in RL. <a href="https://openai.com/blog/quantifying-generalization-in-reinforcement-learning/" rel="nofollow">See OpenAI’s blog post on this topic</a>.</li>
<li>Create snake as a multi-agent system, and create versions of the environment where there are fewer units of food than there are snakes, so that we can investigate what competitive behavior evolves.</li>
<li>Implement a vectorized observation space for the snake game, in an attempt to require less computational power than games that only provide the screen images as observations. For example, CoinRun, OpenAI’s procedurally generated environment for working on generalization in RL, only gives screen images as input. It was also fickle and difficult to set up when I tried to use it. </li>
</ul>
<h2 id="getting-up-and-running_2">Getting up and running <a class="head_anchor" href="#getting-up-and-running_2" rel="nofollow">#</a>
</h2>
<p>Installing and using my code is very simple. Start by cloning my GitHub repository:</p>
<pre><code class="prettyprint">git clone https://github.com/jfpettit/gym-snake-rl.git
</code></pre>
<p>Then, run the following:</p>
<pre><code class="prettyprint">pip install -e gym-snake-rl
</code></pre>
<p>Now you’re able to use the environment in your Python code like any other Gym environment:</p>
<pre><code class="prettyprint">import gym
import gym_snake_rl
env = gym.make('BasicSnake-small_vector-16-v0')
</code></pre>
<h3 id="training-our-first-agent_3">Training our first agent <a class="head_anchor" href="#training-our-first-agent_3" rel="nofollow">#</a>
</h3>
<p>This step is optional, but if you’d like, you can use OpenAI’s <a href="https://github.com/openai/baselines" rel="nofollow">Baselines</a> to train from the command line. You can do this with the following:</p>
<p>Make sure you’ve installed gym-snake-rl, then install Baselines</p>
<pre><code class="prettyprint">git clone https://github.com/openai/baselines.git
cd baselines
</code></pre>
<p>Now, ensure you have Tensorflow installed</p>
<pre><code class="prettyprint">pip install tensorflow
</code></pre>
<p>or </p>
<pre><code class="prettyprint">pip install tensorflow-gpu # if you have a GPU that works with CUDA and have it set up with the appropriate drivers
</code></pre>
<p>Finally; </p>
<pre><code class="prettyprint">pip install -e .
</code></pre>
<p>That should set you up with Baselines installed. If you have an error or need more clarification on installation, look on OpenAI’s repository, which I’ve linked just above.</p>
<p>Now we can train an agent in our snake environment with the following command:</p>
<pre><code class="prettyprint">python -m baselines.run --alg=ppo2 --env=gym_snake_rl:BasicSnake-small_vector-16-v0 --num_timesteps=5e5 --save_path=~/basicsnake_smallvec_ppo2
</code></pre>
<p>A readout of training diagnostics should appear and will update periodically throughout training. It takes about 15 minutes to train on my computer, a Mid-2012 MacBook Pro.</p>
<p>Once training has completed, you can watch your agent playing the game with:</p>
<pre><code class="prettyprint">python -m baselines.run --alg=ppo2 --env=gym_snake_rl:BasicSnake-small_vector-16-v0 --num_timesteps=0 --load_path=~/basicsnake_smallvec_ppo2 --play
</code></pre>
<p>Now that you’re up and running, let’s look at some of the finer points of the code I wrote.</p>
<h2 id="multiagent-snake_2">Multi-agent Snake <a class="head_anchor" href="#multiagent-snake_2" rel="nofollow">#</a>
</h2>
<p>First off, disclaimer: multi-agent snake is still very much in beta. I guess the whole project is in beta, but multi-agent snake is particularly so. </p>
<p>When you are using the multi-agent snake environment, observations will be output as a tuple of tuples. So, you’ll need to make sure to feed each of your agents the appropriate tuple of observations (see the sketch below). There are variants of multi-agent snake written with 2, 3, and 10 snakes already. You’ll be able to either use Gym to make one of those pre-built environment configurations, or you can take a look at my code and directly set up your own environment configuration. Each of the 2, 3, and 10 snake variants also has sub-variants: one with one food per snake (so 3 snakes, 3 foods) and one with one food total (so 10 snakes, 1 food). Hopefully the setups with fewer foods than snakes will enable some research into the evolution of competitive behavior between agents when they are placed in every-agent-for-itself type situations.</p>
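<p>The loop for a multi-agent variant ends up looking roughly like this. Note that the environment id below is a hypothetical stand-in; grab a real one from the package’s <code class="prettyprint">__init__.py</code>:</p>
<pre><code class="prettyprint lang-python">import gym
import gym_snake_rl

env = gym.make("MultiSnake-3-small_vector-16-v0")  # hypothetical id

observations = env.reset()  # a tuple of per-snake observation tuples
done = False
while not done:
    # One action per snake; random actions stand in for your agents here.
    actions = [env.action_space.sample() for _ in observations]
    observations, rewards, done, info = env.step(actions)
</code></pre>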
<h2 id="random-map-generation_2">Random map generation <a class="head_anchor" href="#random-map-generation_2" rel="nofollow">#</a>
</h2>
<p>An agent can be trained on randomly generated maps using the exact same command as above, but instead of setting <code class="prettyprint">--env=gym_snake_rl:BasicSnake-small_vector-16-v0</code>, set <code class="prettyprint">--env=gym_snake_rl:ObstacleBasicSnake-small_vector-16-v0</code>. Mine didn’t learn much besides to be afraid of the walls.</p>
<p>The code to randomly generate maps isn’t very sophisticated. As such, I don’t recommend requesting more obstacles than a third of the length of your longest map side. So, if you’re in a 16 by 16 grid, you probably shouldn’t request more than 5 or 6 obstacles to be placed on the map. But you only need to worry about this if you’re not using one of the prebuilt configurations. Each of the prebuilts is conservative in the number of obstacles selected, in order to avoid any issues popping up.</p>
<h3 id="what-issues_3">What issues? <a class="head_anchor" href="#what-issues_3" rel="nofollow">#</a>
</h3>
<p>Well, when setting a really large number of obstacles, I’d end up with pockets of open map space completely encircled by walls and obstacles, so the snake couldn’t get to them. This is an issue because the food placement algorithm randomly places food anywhere that isn’t a wall or a snake, so the food could get placed somewhere the snake can’t reach. Here’s an example of what I’m talking about:</p>
<p><a href="https://svbtleusercontent.com/9nxFXcpLSez2eiF1f93v3v0xspap.png" rel="nofollow"><img src="https://svbtleusercontent.com/9nxFXcpLSez2eiF1f93v3v0xspap_small.png" alt="snakerl-badmap.png"></a></p>
<p>This picture was produced by using a 25 by 25 map and placing 50 obstacles on it. You can actually see here that the snake (green) is in a completely different pocket of the map than the food (red) and so will never be able to reach the food. Luckily, the map regenerates each episode, but it’s probably still best to avoid this kind of situation during training. It’s still possible for this issue to occur with a smaller number of obstacles; however, I observed that my code was really robust as long as the number of obstacles requested stayed under about a third of the map’s largest side. So in this case, that would be <code class="prettyprint">num_obstacles=8</code>. </p>
<h4 id="sidenote_4">Sidenote <a class="head_anchor" href="#sidenote_4" rel="nofollow">#</a>
</h4>
<p>If you’re using the <code class="prettyprint">small_vector</code> observation setup, it is necessary to use a map with walls. We’ll get into this more now, as we cover different observation types.</p>
<h2 id="observation-types_2">Observation types <a class="head_anchor" href="#observation-types_2" rel="nofollow">#</a>
</h2>
<p>This environment has 5 different observation types. </p>
<ul>
<li>
<code class="prettyprint">raw</code>: This observation type is the raw numpy array your agent is operating in. One channel, colors range from zero to 255.</li>
<li>
<code class="prettyprint">rgb</code>: This is an RGB image of your map. So three color channels, and colors in typical RGB range.</li>
<li>
<code class="prettyprint">rgb5</code>: This is the RGB image, but zoomed in a few pixels.</li>
<li>
<code class="prettyprint">small_vector</code>: Our first vectorized state representation, this includes a flattened vector of the 8 squares surrounding the snake’s head and the square the snake’s head is on (so it would look something like this (0, 0, 0, 0, 100, 0, 0, 0, 0), where 100 is where the snake’s head is). Included in our full observation vector is the length of the snake (an integer) and Euclidean distance from each food on the map. </li>
<li>
<code class="prettyprint">big_vector</code>: This is simply the raw numpy array of the map, flattened. On the end of our observation vector is the length of the snake and Euclidean distance from each piece of food on the map.</li>
</ul>
<p>It is necessary to use a map with walls when you’re using the <code class="prettyprint">small_vector</code> observation type because the observation is gathered by indexing into the array at each of the 8 squares around the snake’s head. If you do this when there are no walls, you are all but guaranteed to hit an index-out-of-bounds error when the head sits at the edge of the map. I didn’t write a better fix for this because it didn’t seem very important.</p>
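<p>Concretely, the gather is just a 3-by-3 slice around the head; with a ring of walls, the head can never sit on the border, so the slice is always in bounds. An illustrative sketch (these names aren’t the package’s internals):</p>
<pre><code class="prettyprint lang-python">import numpy as np

def head_patch(grid, head_r, head_c):
    # The 8 neighbors plus the head square itself, flattened to 9 values.
    return grid[head_r - 1:head_r + 2, head_c - 1:head_c + 2].flatten()

grid = np.zeros((16, 16))
grid[[0, -1], :] = 255  # top and bottom walls
grid[:, [0, -1]] = 255  # left and right walls
grid[7, 7] = 100        # snake head
print(head_patch(grid, 7, 7))  # fine; with no walls, a head on the border breaks the gather
</code></pre>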
<h2 id="reward-functions_2">Reward functions <a class="head_anchor" href="#reward-functions_2" rel="nofollow">#</a>
</h2>
<p>My bit of googling around about reward functions in snake indicates that the typical reward function is +1 for eating food, -1 for dying, and 0 for everything else. This code contains that reward function as the default. However, there is a second option: the second reward function implements Potential-Based Reward Shaping (PBRS). You can read more about it <a href="http://www-users.cs.york.ac.uk/%7Edevlin/presentations/pbrs-tut.pdf" rel="nofollow">here</a>, or a quick Google search will turn up relevant information. </p>
<p>The current PBRS function is very much in beta and is an area of this code that needs to be improved. I’ve run some simple experiments with the current function and haven’t had much luck with learning anything. It’s an area I’ll actively be experimenting with, and, if someone wants to contribute to this code, this would be a simple way to get started. Altering the existing potential based reward function doesn’t require much understanding of how the code works. You’d just need to navigate to <code class="prettyprint">gym-snake-rl/gym_snake_rl/core/core_snake.py</code> and edit lines 263 through 269. The current reward function formulation is:</p>
<pre><code class="prettyprint">if step_closer_to_food:
reward = 0.01 * abs((gamma * current_dist) - old_dist)
elif step_away_from_food:
reward = -0.01 * abs((gamma * current_dist) - old_dist)
</code></pre>
<p>Gamma is a discount factor and is set to 0.99 by default.</p>
<p>In this environment, it is important to be aware that applying reward standardization may yield NaNs and cause errors. It is possible to have an episode or set of episodes where the return is all zeros or all ones. Of course, a vector containing only one repeated value has a standard deviation of 0, so when you do <code class="prettyprint">(returns - mean(returns)) / standard_deviation(returns)</code> that division by zero will produce the NaNs. If you’re using PBRS, this is no longer a concern, as you will not have a return vector made up entirely of the same number.</p>
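<p>The usual guard is a tiny epsilon in the denominator, something like:</p>
<pre><code class="prettyprint lang-python">import numpy as np

def standardize(returns, eps=1e-8):
    # The epsilon keeps a constant return vector (std of 0) from producing NaNs.
    returns = np.asarray(returns, dtype=np.float64)
    return (returns - returns.mean()) / (returns.std() + eps)

print(standardize([1.0, 1.0, 1.0]))  # [0. 0. 0.] instead of [nan nan nan]
</code></pre>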
<p>I anticipate that when the PBRS function in this code is fixed, it’ll prove helpful in training agents like the one trained on randomly generated maps above to actually go towards the food, instead of getting stuck in any weird behavior loops.</p>
<h2 id="environment-names_2">Environment names <a class="head_anchor" href="#environment-names_2" rel="nofollow">#</a>
</h2>
<p>There are a bunch of environments contained in this code. I’m going to cover the naming convention and list a good chunk of the names here.</p>
<h3 id="naming-convention_3">Naming convention <a class="head_anchor" href="#naming-convention_3" rel="nofollow">#</a>
</h3>
<p>Environment names follow this structure:</p>
<pre><code class="prettyprint">Name-observation_type-size-v0
</code></pre>
<p>All of the environments here are currently v0, and the prebuilt sizes are 16, 32, and 64. </p>
<h4 id="example-list-of-names_4">Example list of names. <a class="head_anchor" href="#example-list-of-names_4" rel="nofollow">#</a>
</h4>
<p>Now for the fun part.</p>
<h4 id="partial-list_4">Partial List <a class="head_anchor" href="#partial-list_4" rel="nofollow">#</a>
</h4>
<pre><code class="prettyprint">BasicSnake-raw-16-v0
BasicSnake-raw-32-v0
BasicSnake-raw-64-v0
BasicSnake-rgb-16-v0
BasicSnake-rgb-32-v0
BasicSnake-rgb-64-v0
BasicSnake-rgb5-16-v0
BasicSnake-rgb5-32-v0
BasicSnake-rgb5-64-v0
BasicSnake-small_vector-16-v0
BasicSnake-small_vector-32-v0
BasicSnake-small_vector-64-v0
BasicSnake-big_vector-16-v0
BasicSnake-big_vector-32-v0
BasicSnake-big_vector-64-v0
BasicSnakePBRS-raw-16-v0
BasicSnakePBRS-raw-32-v0
BasicSnakePBRS-raw-64-v0
BasicSnakePBRS-rgb-16-v0
BasicSnakePBRS-rgb-32-v0
BasicSnakePBRS-rgb-64-v0
BasicSnakePBRS-rgb5-16-v0
BasicSnakePBRS-rgb5-32-v0
BasicSnakePBRS-rgb5-64-v0
BasicSnakePBRS-small_vector-16-v0
BasicSnakePBRS-small_vector-32-v0
BasicSnakePBRS-small_vector-64-v0
BasicSnakePBRS-big_vector-16-v0
BasicSnakePBRS-big_vector-32-v0
BasicSnakePBRS-big_vector-64-v0
HungrySnake-raw-16-v0
HungrySnake-raw-32-v0
HungrySnake-raw-64-v0
HungrySnake-rgb-16-v0
HungrySnake-rgb-32-v0
HungrySnake-rgb-64-v0
HungrySnake-rgb5-16-v0
HungrySnake-rgb5-32-v0
HungrySnake-rgb5-64-v0
HungrySnake-small_vector-16-v0
HungrySnake-small_vector-32-v0
HungrySnake-small_vector-64-v0
HungrySnake-big_vector-16-v0
HungrySnake-big_vector-32-v0
HungrySnake-big_vector-64-v0
HungrySnakePBRS-raw-16-v0
HungrySnakePBRS-raw-32-v0
HungrySnakePBRS-raw-64-v0
HungrySnakePBRS-rgb-16-v0
HungrySnakePBRS-rgb-32-v0
HungrySnakePBRS-rgb-64-v0
HungrySnakePBRS-rgb5-16-v0
HungrySnakePBRS-rgb5-32-v0
HungrySnakePBRS-rgb5-64-v0
HungrySnakePBRS-small_vector-16-v0
HungrySnakePBRS-small_vector-32-v0
HungrySnakePBRS-small_vector-64-v0
HungrySnakePBRS-big_vector-16-v0
HungrySnakePBRS-big_vector-32-v0
HungrySnakePBRS-big_vector-64-v0
BabySnake-raw-16-v0
BabySnake-raw-32-v0
BabySnake-raw-64-v0
BabySnake-rgb-16-v0
BabySnake-rgb-32-v0
BabySnake-rgb-64-v0
BabySnake-rgb5-16-v0
BabySnake-rgb5-32-v0
BabySnake-rgb5-64-v0
BabySnake-small_vector-16-v0
BabySnake-small_vector-32-v0
BabySnake-small_vector-64-v0
BabySnake-big_vector-16-v0
BabySnake-big_vector-32-v0
BabySnake-big_vector-64-v0
BabySnakePBRS-raw-16-v0
BabySnakePBRS-raw-32-v0
BabySnakePBRS-raw-64-v0
BabySnakePBRS-rgb-16-v0
BabySnakePBRS-rgb-32-v0
BabySnakePBRS-rgb-64-v0
BabySnakePBRS-rgb5-16-v0
BabySnakePBRS-rgb5-32-v0
BabySnakePBRS-rgb5-64-v0
BabySnakePBRS-small_vector-16-v0
BabySnakePBRS-small_vector-32-v0
BabySnakePBRS-small_vector-64-v0
BabySnakePBRS-big_vector-16-v0
BabySnakePBRS-big_vector-32-v0
BabySnakePBRS-big_vector-64-v0
</code></pre>
<p>… and so on. </p>
<p>For a full list of names, see the <code class="prettyprint">__init__.py</code> file in <code class="prettyprint">gym-snake-rl/gym_snake_rl/</code>. The full list is there, and all of the names follow the above convention, so it’s pretty simple to get the environment you want: pick the base name from the list, then sub in the observation type, size, and v0 tags.</p>
<h4 id="hungrysnake-and-babysnake-variants_4">HungrySnake and BabySnake variants. <a class="head_anchor" href="#hungrysnake-and-babysnake-variants_4" rel="nofollow">#</a>
</h4>
<p>Both of these game variants are from <a href="https://github.com/nicomon24/Sneks" rel="nofollow">here</a>.</p>
<p>The BabySnake variant is very simple: every time the snake eats a piece of food, the episode ends. The HungrySnake variant has a limit on how many steps it can take after eating; this is done to simulate increasing hunger. The limit resets each time the snake eats.</p>
<h2 id="references_2">References <a class="head_anchor" href="#references_2" rel="nofollow">#</a>
</h2>
<p>I referenced OpenAI’s Baselines for their setup and usage instructions. Much of my code was built on top of <a href="https://github.com/nicomon24/Sneks" rel="nofollow">nicomon24’s code</a>. In fact, my project started off as a clone of his project; I built the features I wanted on top of his code, and in some cases rewrote it to work with what I wanted to do. </p>
<p>My creation of this environment was inspired by OpenAI’s work titled <a href="https://openai.com/blog/quantifying-generalization-in-reinforcement-learning/" rel="nofollow">Quantifying Generalization in Reinforcement Learning</a>, and I wanted to enable people without lots of computational power, like myself, to work on things similar to what they did. </p>
<h2 id="closing-out_2">Closing out <a class="head_anchor" href="#closing-out_2" rel="nofollow">#</a>
</h2>
<p>If you’ve read for this long, thank you.</p>
<p>Any issues with the code should be opened at the GitHub repository linked towards the start of the blog, and if you write an addition to the code, you should submit a pull request on GitHub. If you have further questions, you can <a href="https://twitter.com/jacobpettit18" rel="nofollow">tweet me</a> or email me (<a href="mailto:jfpettit@gmail.com" rel="nofollow">jfpettit@gmail.com</a>).</p>
<h5 id="citation_5">Citation <a class="head_anchor" href="#citation_5" rel="nofollow">#</a>
</h5>
<pre><code class="prettyprint">@article{gym-snake-rl2019,
author={Jacob Pettit},
title={Introducing gym-snake-rl},
url={https://jfpettit.github.io/Introducing-gym-snake-rl/},
year={2019}
}
</code></pre>
<h5 id="update-history_5">Update History <a class="head_anchor" href="#update-history_5" rel="nofollow">#</a>
</h5>
<p><em>Update 1 on July 12, 2019: Renamed package from gym-snakerl to gym-snake-rl.</em></p>
<p><em>Update 2 on July 23, 2019: Fixed stray gym-snakerl’s to gym-snake-rl.</em></p>