Redis is a name that’s been popping up a lot in the office recently; it seems a fair chunk of our projects are going to start using it. It’s an in-memory key-value store, which is what makes access so fast. It’s a developer’s dream, but from a SysOps point of view it’s currently a little limited when it comes to deployment in a high-availability situation.
I discovered this when trying to deploy some dedicated Redis servers in Windows Azure. Building single node instances was easy enough, but when trying to build a redundant pair, I ran into the issue where Redis doesn’t currently support a Master-Master type deployment, though they are planning to implement such features in the future.
After a bit of research, I found that Redis does support a Master-Slave deployment, and that an extension to Redis (Redis Sentinel) can be used to allow a slave to take on the role of the master should the original master fail. The final trick was to ensure that when the user or application calls the service, they’ll always get through to the master node. Sadly, the only documentation I could find was incomplete, had errors, or was too vague. I hope this guide will be a more complete solution.
Essentially, we’re going to build something similar to this:
The diagram above shows the link between the different applications rather than the servers themselves.
Preparing the environment
Firstly, we’re going to build a new Cloud Service in Windows Azure, and in there we’re going to build three CentOS Linux virtual machines. Two of these will be medium sized (you can make these whatever size you think you’ll need); these will host our Redis servers, Redis Sentinel, and HAProxy. The third only needs to be an extra small server (because, let’s face it, hosted VMs are expensive!). It will be used purely as an additional Redis Sentinel server, essentially a quorum server. On the two main servers, you’ll need to set up a load balanced endpoint on port 6379.
For this guide, let’s say the servers have the following IPs:
Redis 01: 10.0.0.3
Redis 02: 10.0.0.4
Once Azure has finished provisioning the servers, log in and configure them for your environment.
Installing the software
Now we’re going to install Redis on all three boxes.
For Redis Sentinel to function properly, the Redis project recommends using Redis version 2.8. This is not included in the CentOS repositories, but it can be found in the Remi repo. Do the following to install it.
rpm -Uvh remi-release-6*.rpm epel-release-6*.rpm
After adding the Remi repo, it needs to be enabled when running yum, or else you’ll receive an old version of Redis.
yum --enablerepo=remi install redis -y
To permanently enable the Remi repo, edit the /etc/yum.repos.d/remi.repo file and change the line that reads “enabled=0” to “enabled=1”
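If you’d rather script that edit than make it by hand, something like the following Python sketch would do it. It assumes the section you want is named [remi] and that the flag appears as a plain enabled=0 line; the function name enable_repo is mine and has nothing to do with yum itself.

```python
def enable_repo(repo_text: str, section: str = "remi") -> str:
    """Return repo_text with enabled=0 switched to enabled=1 inside [section] only."""
    out = []
    in_section = False
    for line in repo_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("[") and stripped.endswith("]"):
            # New section header: remember whether it is the one we want.
            in_section = stripped[1:-1] == section
        elif in_section and stripped.replace(" ", "") == "enabled=0":
            line = "enabled=1"
        out.append(line)
    return "\n".join(out) + "\n"
```

To use it, read /etc/yum.repos.d/remi.repo as root, pass the contents through enable_repo, and write the result back. Other sections in the file are left alone, so only the one named repo is switched on.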
When Redis has successfully been installed, you’ll also want to install HAProxy onto the two main boxes. We need to use features of HAProxy that are only available in version 1.5dev20 or above. This, as far as I’m aware, isn’t available in any repository, so I’ve compiled version 1.5dev22 (the latest version at the time of writing) and provided the packages here:
You may want to use an up-to-date version rather than my precompiled packages. If so, use the following guide:
Install the RPM file(s) you’ve downloaded:
rpm -Uhv haproxy-1.5-dev22*.rpm
Now we’re ready to configure Redis and HAProxy.
Firstly, on both main boxes, we need to edit the /etc/redis.conf file.
Change the port Redis listens on from the default (6379) to 6380:
port 6380
Also, remove or comment out the line that binds the service to the local IP:
# bind 127.0.0.1
On the second Redis box, you’ll also want to make sure it’s slaving off the first.
slaveof 10.0.0.3 6380
Save the changes, then start Redis on both boxes and set it to start at boot.
service redis start
chkconfig redis on
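Now check that you can connect to the Redis service locally (do this on both boxes), for example by running redis-cli -p 6380 ping, which should reply PONG.
Next, configure Redis Sentinel on all three boxes. Its configuration should contain the following: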
sentinel monitor mymaster 10.0.0.3 6380 2
sentinel down-after-milliseconds mymaster 10000
sentinel failover-timeout mymaster 60000
sentinel parallel-syncs mymaster 1
sentinel monitor resque 10.0.0.4 6380 2
sentinel down-after-milliseconds resque 10000
sentinel failover-timeout resque 60000
sentinel parallel-syncs resque 2
Save this and, just like the Redis server, start Sentinel and set it to run at boot:
service redis-sentinel start
chkconfig redis-sentinel on
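To see why the third, tiny Sentinel box matters, here’s a rough Python sketch of the voting arithmetic. It assumes Sentinel’s documented behaviour: the quorum (the trailing 2 on the sentinel monitor line) is how many Sentinels must agree that the master is down, and a failover additionally needs a majority of all known Sentinels. The function name is made up for illustration.

```python
def can_fail_over(total_sentinels: int, quorum: int, reachable: int) -> bool:
    """True if `reachable` Sentinels can both detect a dead master and promote a slave."""
    majority = total_sentinels // 2 + 1
    # Detecting the failure needs `quorum` agreeing Sentinels; authorising
    # the failover needs a majority of all Sentinels, so both must be met.
    return reachable >= max(quorum, majority)


# Three Sentinels, quorum 2: losing the master's box still leaves two voters.
print(can_fail_over(total_sentinels=3, quorum=2, reachable=2))  # True
# Only the two main boxes running Sentinel: losing one leaves a single voter.
print(can_fail_over(total_sentinels=2, quorum=2, reachable=1))  # False
```

With just two Sentinels, the surviving box could never win the vote on its own, which is exactly the situation the extra small server is there to avoid.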
Finally, we’re going to configure HAProxy to listen on Redis’s default port (6379), determine which server is currently the master, and forward requests to port 6380 on that box.
On both of the main boxes, edit the /etc/haproxy/haproxy.conf file.
defaults
    # Redis speaks plain TCP, not HTTP
    mode tcp
    timeout connect 4s
    timeout server 15s
    timeout client 15s

# The frontend name is arbitrary
frontend ft_redis
    bind *:6379 name redis
    default_backend bk_redis

backend bk_redis
    option tcp-check
    tcp-check send PING\r\n
    tcp-check expect string +PONG
    tcp-check send info\ replication\r\n
    tcp-check expect string role:master
    tcp-check send QUIT\r\n
    tcp-check expect string +OK
    server R1 10.0.0.3:6380 check inter 1s
    server R2 10.0.0.4:6380 check inter 1s
A quick overview of the above.
The first section tells HAProxy that the connection will be TCP, to abort attempting to connect if it takes longer than 4 seconds, and to disconnect if the server or the client haven’t communicated in 15 seconds. This may need adjusting in the future.
The next section makes HAProxy listen on port 6379, which is Redis’s default port. From the outside, it will look like a normal Redis server. It is then told to use the backend named “bk_redis” to deal with the requests.
The final section is where the magic comes into play. We’ll be using the new features (from HAProxy 1.5dev20 and above) in the tcp-check command to determine which server is the master. Firstly we tell HAProxy to use the tcp-check option, and to attempt to connect with tcp. Next, we send a simple PING to the Redis server to see if it’s alive. The next line is to tell HAProxy what the expected reply is, in this case “+PONG”. Next, it sends the following string “info replication” and listens for the line “role:master”. Finally it sends the “QUIT” command to ensure the connection is closed properly. It will run this against both servers, and whichever one responds with “role:master” will be the one chosen.
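If you want to see what that health check is doing without HAProxy in the way, the same conversation can be replayed by hand. The Python sketch below is only an approximation of the check described above, not HAProxy’s actual implementation; the function names are mine, and the default port of 6380 is the one used in this guide.

```python
import socket


def looks_like_master(host: str, port: int = 6380, timeout: float = 4.0) -> bool:
    """Replay the PING / info replication / QUIT sequence against one Redis node."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            reader = conn.makefile("rb")

            conn.sendall(b"PING\r\n")
            if b"+PONG" not in reader.readline():
                return False

            conn.sendall(b"info replication\r\n")
            header = reader.readline()  # bulk reply header, e.g. b"$123\r\n"
            if not header.startswith(b"$"):
                return False
            # Read the payload plus its trailing CRLF.
            body = reader.read(int(header[1:].strip()) + 2)

            conn.sendall(b"QUIT\r\n")
            reader.readline()  # the +OK acknowledgement
            return b"role:master" in body
    except (OSError, ValueError):
        # Refused, timed out, or an unexpected reply: treat the node as down.
        return False


def pick_master(nodes):
    """Return the first (host, port) pair that reports role:master, or None."""
    for host, port in nodes:
        if looks_like_master(host, port):
            return (host, port)
    return None
```

Calling pick_master([("10.0.0.3", 6380), ("10.0.0.4", 6380)]) from one of the boxes should return whichever node is currently the master, which is the same decision HAProxy makes every second.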
Save the file and start HAProxy on both servers and set them to start at boot:
service haproxy start
chkconfig haproxy on
At this point, we should be able to test this externally – make sure the load balanced endpoint is configured in Azure for the two main boxes.
The easiest way of doing this is to use Telnet. Run the following:
telnet your-redis-service.cloudapp.net 6379
You should get something like the following:
Connected to your-redis-service.cloudapp.net.
Escape character is '^]'.
If it appears to hang here, this is a good sign. Let’s see if it responds. Start by typing PING followed by enter. You should get a response from Redis saying +PONG.
Now for the moment of truth: type info replication followed by enter. This should return some information about how the server is replicated. The line we’re interested in is role:master; if this is present, then we’re hitting the correct server. Now type QUIT followed by enter to close the connection. Repeat this a few times to make sure you’re always connecting to the master.
Now that Redis appears to be functioning properly, let’s break it! If this guide has been followed correctly, the first box should be the master and the second the slave. Shut down the first box (yes, a full “shutdown -h now”). This simulates what happens when Microsoft update their platform and shut down parts of your cloud services whilst maintaining uptime.
After about a minute, try telnetting into the Redis server again. This time when you run info replication it will still say role:master but there will be 0 slaves connected. This tells us that there is indeed only one Redis server running, and that the slave has successfully been promoted by the two remaining Redis Sentinel services.
Now from the Azure portal, start the first box again. After a couple of minutes try telnetting into the service again. This time when you run info replication it should still say role:master but this time, it will say that the machine 10.0.0.3 is its slave. This tells us that the second server has retained its master status, and now the original master is our failover machine.
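Reading the info replication output by eye gets tedious if you repeat the test a few times, so here is a small Python sketch that pulls out the interesting bits. It assumes the Redis 2.8 layout, where each slave is reported as a slaveN:ip=...,port=... line; the function name and the sample text are mine, and the sample is illustrative rather than captured output.

```python
def summarise_replication(info_text: str) -> dict:
    """Pull the role and connected slave addresses out of `info replication` output."""
    fields = {}
    for line in info_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the "# Replication" header
        key, sep, value = line.partition(":")
        if sep:
            fields[key] = value

    slaves = []
    for key, value in fields.items():
        # Redis 2.8 reports each slave as slave0:ip=...,port=...,state=...
        if key.startswith("slave") and key[5:].isdigit():
            attrs = dict(part.split("=", 1) for part in value.split(",") if "=" in part)
            slaves.append((attrs.get("ip"), attrs.get("port")))

    return {
        "role": fields.get("role"),
        "connected_slaves": int(fields.get("connected_slaves", 0)),
        "slaves": slaves,
    }


# Illustrative text in the shape Redis 2.8 returns, not captured output.
sample = (
    "# Replication\r\n"
    "role:master\r\n"
    "connected_slaves:1\r\n"
    "slave0:ip=10.0.0.3,port=6380,state=online,offset=1234,lag=0\r\n"
)
print(summarise_replication(sample))
```

After the failover test, a healthy result is a role of master with 10.0.0.3 listed among the slaves, matching what you’d otherwise confirm over telnet.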
Locking it down
OK, now that we’ve got Redis working with failover, we’re probably not going to want to leave it open to the world, so we’re going to lock it down with iptables. Log into your boxes and open the /etc/sysconfig/iptables file. Something like this will do the job.
# Firewall configuration written by system-config-firewall
# Manual customization of this file is not recommended.
*filter
:INPUT ACCEPT [0:0]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
-A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
-A INPUT -p icmp -j ACCEPT
-A INPUT -i lo -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp --dport 22 -j ACCEPT
### Azure Load Balancer - allow all for port checking
-A INPUT -s 168.63.129.16/32 -j ACCEPT
# Allow the machines in the cloudapp
-A INPUT -m iprange --src-range 10.0.0.2-10.0.0.4 -j ACCEPT
# Allow the office and application servers (substitute your own addresses)
-A INPUT -s 22.214.171.124/32 -m tcp -p tcp --dport 6379 -j ACCEPT
-A INPUT -s 126.96.36.199/32 -m tcp -p tcp --dport 6379 -j ACCEPT
-A INPUT -j REJECT --reject-with icmp-host-prohibited
-A FORWARD -j REJECT --reject-with icmp-host-prohibited
COMMIT
A quick run through. We allow anything to connect to port 22 so we can administer the server over SSH. The Azure load balancer on 168.63.129.16 needs access or it’ll report that all nodes are down. Next, we allow the three boxes in the cloudapp to talk to each other. We also want our office, and any predefined servers that will be using Redis, to be able to reach port 6379. After that, we block everything else.
There are many good tutorials on iptables out there; this is just a starting point.
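If it helps to reason about who gets through, the rules can be modelled as a first-match-wins list. The Python sketch below is a simplification: it ignores the loopback, ICMP, and established-connection rules, assumes the Azure load balancer probes from 168.63.129.16, and uses a made-up trusted_clients parameter to stand in for the office and application server addresses.

```python
import ipaddress

LOAD_BALANCER = ipaddress.ip_address("168.63.129.16")  # Azure probe address (assumed)
CLOUD_FIRST = ipaddress.ip_address("10.0.0.2")
CLOUD_LAST = ipaddress.ip_address("10.0.0.4")


def is_allowed(source: str, dport: int, trusted_clients=()) -> bool:
    """Approximate the INPUT chain for a new TCP connection: first matching rule wins."""
    src = ipaddress.ip_address(source)
    trusted = {ipaddress.ip_address(client) for client in trusted_clients}
    if dport == 22:
        return True  # SSH is open to everyone
    if src == LOAD_BALANCER:
        return True  # health probes, any port
    if CLOUD_FIRST <= src <= CLOUD_LAST:
        return True  # the three boxes in the cloud service
    if dport == 6379 and src in trusted:
        return True  # office and application servers, HAProxy port only
    return False  # falls through to the final REJECT rule
```

For example, is_allowed("203.0.113.10", 6379, trusted_clients=["203.0.113.10"]) is allowed, while the same client on port 6380 is rejected, so outside clients can only ever reach Redis through HAProxy.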
Of course, once you’ve configured iptables, start it and set it to run at boot:
service iptables start
chkconfig iptables on
So, hopefully this will serve as a definitive guide to deploying a high-availability Redis service. As you can see, due to the lack of Master-Master support, other workarounds were needed, but so far this seems to work. Hopefully this will help anyone in a similar situation.