Sunday, 4 September 2016

Multi master write/write MySQL cluster troubleshooting

Multi-master write/write MySQL Cluster

Background

So I have 4 payments servers located around the world:
London, Paris, Netherlands and Frankfurt. My customers are based around Europe, so these locations are best suited for us.

I operate automatic failover via Amazon DNS; using its health checks is very cost effective. If a server is non-responsive, AWS takes it out of rotation and routes traffic to another payments server.

Communication between the servers has to be encrypted, and eventually I will want to pass more traffic between them (e.g. MySQL + WDDX, API calls), so I decided to skip SSL connections between the MySQL processes.

Instead I decided to establish what is effectively a site-to-site VPN using OpenVPN.
Because this relies on a 'central' node, as in a star topology, that central node going down would be an issue. So I decided to run at least 2 central nodes that act as 'servers' for the client VPNs to connect to.

With this method, I only have to maintain 2 sets of VPN connections, rather than the 4 (one for each payments server) that would make it truly resilient. It would also take 50% of the payments servers going down before a split-brain situation could occur.

For MySQL, I decided to use Galera Cluster for master/master (active/active) replication. This has been working very well, and transferring ~600 MB of state data (on initial node setup) takes very little time. If you find that initial startup takes longer than 30 minutes, I would suspect a networking issue (rate limiting etc.). This happened to me with the Paris payments server, and sadly it took around 6 hrs to sync the node before it could take on incrementals.

Configuration

They are all connected to each other via VPN tunnels in a mesh network, to allow secure communications (it was easier than messing around with SSL certs), and are running Galera.
This works very well, as any change on one server is replicated out to the others within a couple of seconds.
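
As a rough illustration of the 50% figure from the Background section, here is a minimal sketch of the majority rule that decides whether a group of nodes may keep accepting writes. It is a simplification: it assumes every node carries equal weight and that the missing nodes vanished rather than leaving gracefully. The helper name has_quorum is made up for this example and is not part of Galera.

```python
def has_quorum(reachable_nodes, cluster_size):
    """Return True if the reachable nodes form a strict majority."""
    return 2 * reachable_nodes > cluster_size


cluster_size = 4  # London, Paris, Netherlands, Frankfurt
for reachable in range(cluster_size, 0, -1):
    state = "primary" if has_quorum(reachable, cluster_size) else "non-primary"
    print("{} of {} nodes reachable: {}".format(reachable, cluster_size, state))
```

With 4 nodes, losing 1 leaves the other 3 writable, while an even 2/2 split leaves neither half with a majority.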

Troubleshooting

Node re-syncing

If a node in the cluster drops off, you can tell with the query:
SHOW STATUS LIKE 'wsrep_cluster_size';
which in my case, shows 4.

If for any reason something does drop off, this will show less than 4.
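
If you want to automate that check, something along these lines could work. It is only a sketch: it assumes the mysql command-line client is installed and can log in without prompting (for example via ~/.my.cnf), and the names EXPECTED_SIZE, parse_cluster_size and current_cluster_size are my own inventions for illustration.

```python
import subprocess

EXPECTED_SIZE = 4  # number of payments servers in this setup


def parse_cluster_size(output):
    """Extract the integer from 'wsrep_cluster_size<TAB>4' style output."""
    for line in output.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0] == "wsrep_cluster_size":
            return int(fields[1])
    raise ValueError("wsrep_cluster_size not found in output")


def current_cluster_size():
    """Ask the local server via the mysql client (-N: no header, -B: batch)."""
    output = subprocess.check_output(
        ["mysql", "-N", "-B", "-e", "SHOW STATUS LIKE 'wsrep_cluster_size'"],
        universal_newlines=True,
    )
    return parse_cluster_size(output)


def describe(size, expected=EXPECTED_SIZE):
    """Turn a cluster size into a one-line status message."""
    if size < expected:
        return "WARNING: only {} of {} nodes in the cluster".format(size, expected)
    return "OK: cluster size is {}".format(size)
```

Calling print(describe(current_cluster_size())) from cron or a monitoring check would then flag a node dropping off.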

To resync the node that's disconnected, you can simply restart mysql, or if the donor is too slow, you can select another donor by using the command:
/etc/init.d/mysql restart --wsrep_sst_donor=payments3

Identifying the most advanced node

This is done by checking the Global Transaction ID in the grastate.dat file (usually in /var/lib/mysql/) on each node.
Check the latest sequence number (seqno); the node with the highest value is the most advanced.
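
To compare several nodes without eyeballing each file, the contents of grastate.dat can be parsed and the highest seqno picked out. The sketch below assumes the usual "key: value" layout of that file; the sample contents, the node name payments1 and the helper names are made up for illustration.

```python
def parse_grastate(text):
    """Parse the 'key: value' lines of a grastate.dat file into a dict."""
    state = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and the '# GALERA saved state' comment
        key, _, value = line.partition(":")
        state[key.strip()] = value.strip()
    return state


def most_advanced(states_by_node):
    """Return the node name whose saved seqno is highest."""
    return max(states_by_node, key=lambda node: int(states_by_node[node]["seqno"]))


# Made-up sample contents, for illustration only
SAMPLE_A = """# GALERA saved state
version: 2.1
uuid:    00000000-0000-0000-0000-000000000001
seqno:   1200
"""
SAMPLE_B = SAMPLE_A.replace("1200", "1187")

states = {
    "payments1": parse_grastate(SAMPLE_A),
    "payments3": parse_grastate(SAMPLE_B),
}
print(most_advanced(states))
```

Bear in mind that a seqno of -1 means the node is still running or did not shut down cleanly, so the file alone cannot tell you how far that node got.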

Tuesday, 31 May 2016

OpenVPN clients, allowing access between clients

So I need access between clients, and also to give each of them a static IP address.


This is achieved by setting up a client configuration directory on your server.
First get the CN from the certificate you created for each client.

 ./build-key client1
Country Name (2 letter code) [UK]:
State or Province Name (full name) [LDN]:
Locality Name (eg, city) [London]:
Organization Name (eg, company) [CNetwork]:
Organizational Unit Name (eg, section) [IT]:
Common Name (eg, your name or your server's hostname) [p1]:
Name [EasyRSA]:pclient1



So the name of your client is pclient1

Create a directory for client configuration e.g.

mkdir /etc/openvpn/ccd

Then set your server config

openvpn.cnf
client-to-client
client-config-dir ccd
push "route 1.2.3.4 255.255.255.0"
route 1.2.3.4 255.255.255.0
 

Then create a file in /etc/openvpn/ccd called pclient1
The contents should be as follows:
iroute 1.2.3.4 255.255.255.0
ifconfig-push 10.30.30.30 255.255.255.0


The above pushes the route for pclient1 (1.2.3.4) into the route table of the kernel and openvpn, and the pclient1 file allocates 10.30.30.30 as the static IP address.
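
If you have more than a couple of clients, the ccd files can be generated rather than typed. This is a small sketch under my own assumptions: render_ccd is a hypothetical helper, 192.0.2.0/24 is a documentation-only network standing in for the real LAN behind the client, and it needs Python 3 for the ipaddress module.

```python
import ipaddress


def render_ccd(client_lan, vpn_ip, vpn_netmask):
    """Build the text of a ccd file for one client.

    client_lan  -- network behind the client, e.g. "192.0.2.0/24"
    vpn_ip      -- static tunnel address to hand to the client
    vpn_netmask -- netmask pushed alongside the tunnel address
    """
    # ip_network() raises ValueError if host bits are set (e.g. "192.0.2.4/24")
    lan = ipaddress.ip_network(client_lan)
    # Validate the address and mask without changing them
    ipaddress.ip_address(vpn_ip)
    ipaddress.ip_address(vpn_netmask)
    return "iroute {} {}\nifconfig-push {} {}\n".format(
        lan.network_address, lan.netmask, vpn_ip, vpn_netmask
    )


print(render_ccd("192.0.2.0/24", "10.30.30.30", "255.255.255.0"))
```

The output would be written to /etc/openvpn/ccd/pclient1, where the file name has to match the client's certificate name.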

Sunday, 10 April 2016

Hacking the Amazon Dash for the UK

 Intro

So you want to use the Amazon Dash button but live in the UK? The first issue is actually getting them. They are only available in the US and you need a US address. Once you've obtained one, the next bit is pretty easy...

Configuration - Android

Well you have to setup the dash button through the amazon.com app....

The easiest thing to do is to press and hold the dash button until it flashes blue. Once it is in this mode, it creates an access point called 'Amazon ConfigureMe'. I used the excellent tool iStumbler to find this:


Once connected to it (I was given 192.168.0.2), I used Firefox to browse to http://192.168.0.1

There I entered the SSID and password of my local wireless LAN and bam, it gave me the success screen.


This, unfortunately, did not work on its own, because the Amazon app has a cert which automagically activates the dash, and part of that process requires you to have a US IP address. So I fired up OpenVPN, connected to an endpoint in the US, then did the setup via the Amazon app.

Each time the dash connects to my wireless LAN, it sends a gratuitous ARP and then shuts down. Note that the configuration MAC address (6c:0b:84:34:ce:ed) is different to the actual MAC address it uses once set up.

Programming

I am using my current favourite language, Python, and we're going to use the scapy library.
Firstly, we need to install the library:
sudo pip install scapy
Then run this script to find out what the MAC address is:
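from scapy.all import *


def arp_display(pkt):
  # Print the hardware address of any device sending an ARP probe
  if pkt[ARP].op == 1: #who-has (request)
    if pkt[ARP].psrc == '0.0.0.0': # ARP Probe
      print("ARP Probe from: " + pkt[ARP].hwsrc)

# Sniff 10 ARP packets without storing them, calling arp_display for each
print(sniff(prn=arp_display, filter="arp", store=0, count=10))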
 
So now that I know this, I can plug it into a proper Python program, which sends me an email when my 9-month-old baby poos.
from scapy.all import *
import smtplib
server = smtplib.SMTP('192.168.2.254', 25)


def arp_display(pkt):
  if pkt[ARP].op == 1: #who-has (request)
    if pkt[ARP].psrc == '0.0.0.0': # ARP Probe
      if pkt[ARP].hwsrc == 'a0:02:dc:88:94:ea': # digestive disadvantage
        print("Pushed Poo")
        msg = "\nScarlett pooed" # The \n separates the message from the headers
        server.sendmail("dash@amz.org", "phil.spencer@gmail.com", msg)
      elif pkt[ARP].hwsrc == '10:ae:60:b1:97:73': # Depends
        print("Pushed Depends button")
      else:
        print("ARP Probe from unknown device: " + pkt[ARP].hwsrc)

# Keep listening forever, 10 ARP packets at a time
while True:
  print(sniff(prn=arp_display, filter="arp", store=0, count=10))
 
Next part I will show how to log this data into an Excel sheet for later analysis.


Sunday, 20 March 2016

Dual internet connections at home (primary/backup) with martians pt1

So I work a lot from home, approx 99% of the time, and soon my wife will be joining me in working from home, one day a week.

Preamble

One line is fibre from Virgin Media (150 Mbit/s+), the other is an ADSL line from BOnline (7 Mbit/s+), which uses the Tiscali network (AS9105).

I kept them separate, as the missus isn't technical, so automatic route failover issues might not be diagnosable for her, and it's easy for her to swap between the primary and backup lines by just changing wifi SSIDs.

One of the things provided by BOnline was a Technicolor TG582n router (not the best, but it'll do); it is, after all, a backup line to get internet access.
Current setup: the switches are HP ProCurve managed switches with 4 Gb trunked, port-channelled connections between them.

Internet1 ---> [eth2] Firewall1 (vlan1) +----- wifi1 (vlan1)
                                        |
                                        +----- switch1 (vlan 1,2)
                                               | | | |
Internet2 ---> Technicolor (vlan2)  ----+----- switch2 (vlan 1,2)
                                        |
                                        +----- wifi2 (vlan2)

Requirements

For traffic on the backup line to be able to access the internal LAN (192.168.2.0/24)
For traffic on the internal LAN to be able to access anything on the backup line LAN (192.168.1.0/24)
For both lines to be able to access the internet independently of each other.
To be able to VPN/SSH into the firewall from either the primary line or the backup line.

Steps

So one of the first things to do is get it connected to my main LAN.
Steps needed

* Add a VLAN for the backup line
* Ensure DHCP scopes do not conflict
* Add static routing to the Technicolor
* Add routing to the firewall

Setup

So I added a VLAN to the HP ProCurve switch port (connected to eth1), and untagged it to force all traffic on that port onto the backup VLAN, excluding all other VLANs from eth1.


Internet ---> [eth2] Firewall +----- [eth0] LAN 192.168.2.254/24
                              |
                              +----- [eth1] Backup 192.168.1.200/24
                              |
                              +----- [eth3] DMZ 10.40.0.0/24
                              |
                              |
    


I allocated eth1 to the new LAN and assigned 192.168.1.200 to it (set in /etc/network/interfaces). I needed to add a static route on the Technicolor so that everything on the backup line knew how to reach the main LAN (192.168.2.0/24). You can't do this via the web interface, as it doesn't list anything that advanced.
The Technicolor has telnet access, so after setting myself up an account and telnetting in, I issued:

ip rtadd dst=192.168.2.0/24 gateway=192.168.1.200
ip saveall

Don't forget the saveall, otherwise the route only exists in the running config and will not be applied on the next boot.

Then I port-forwarded a port from the Technicolor WAN for SSH access to my firewall [eth1/192.168.1.200].
Testing this, I tried SSHing to my backup line and got this in the firewall logs (if you've enabled martian logging, your syslog will have entries similar to this):

Mar 18 15:56:31 aibo2 kernel: [586653.881530] IPv4: martian source 192.168.1.200 from 77.96.x.x, on dev eth1

This looked funny to me, as 77.96.x.x is my primary line (Virgin Media). My backup line was 79.78.x.x. W00t was going on?

Linux is not expecting a packet for that destination with that source address on that interface, i.e. it's not expecting traffic for an internal address in that subnet to come from an external IP address. The external IP address actually belongs to the Virgin Media interface, as that is my default route.

So we need to change the routing so that all packets from the backup line stay associated with the backup interface and do not get routed through my default route.

Routing 

Pre-req: iproute2 (this should be installed by default)

So we need Linux to understand that packets from eth1 stay with eth1 and are not routed via the default, eth2.
So edit /etc/iproute2/rt_tables.
I added a table for beonline:
#
# reserved values
#
255     local
254     main
253     default
0       unspec
#
# local
#
#1      inr.ruhep
1 beonline

Then I added routing to say that the 192.168.1.0/24 network on eth1, with a source address of 192.168.1.200, belongs in table beonline.
Then that the default route for table beonline is the default gateway of the BOnline line (192.168.1.254).
Then a rule that anything with a source address of 192.168.1.200 is looked up in table beonline.

ip route add 192.168.1.0/24 dev eth1 src 192.168.1.200 table beonline
ip route add default via 192.168.1.254 table beonline
ip rule add from 192.168.1.200 table beonline


This works for me, with my routing table looking like so:
root@aibo2:/etc# ip route
default via 77.96.x.x dev eth2  metric 100
77.96.x.0/22 dev eth2  proto kernel  scope link  src 77.96.x.x
192.168.1.0/24 dev eth1  proto kernel  scope link  src 192.168.1.200
192.168.2.0/24 dev eth0  proto kernel  scope link  src 192.168.2.254
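
To make the rule-then-table lookup a bit more concrete, here is a toy model of it in Python. It is only a sketch of the idea, not how the kernel implements it: the rule priorities are what I would expect ip rule to assign by default, the local table is left out, and the RULES, TABLES and lookup names are made up for this example.

```python
import ipaddress

# (priority, source prefix, table name); lower priority numbers are tried first
RULES = [
    (32765, ipaddress.ip_network("192.168.1.200/32"), "beonline"),
    (32766, ipaddress.ip_network("0.0.0.0/0"), "main"),
]

# table name -> list of (destination prefix, device, gateway or None)
TABLES = {
    "beonline": [
        (ipaddress.ip_network("192.168.1.0/24"), "eth1", None),
        (ipaddress.ip_network("0.0.0.0/0"), "eth1", "192.168.1.254"),
    ],
    "main": [
        (ipaddress.ip_network("192.168.1.0/24"), "eth1", None),
        (ipaddress.ip_network("192.168.2.0/24"), "eth0", None),
        (ipaddress.ip_network("0.0.0.0/0"), "eth2", "77.96.x.x"),
    ],
}


def lookup(src, dst):
    """Return (table, device, gateway) for a packet from src to dst."""
    src_ip = ipaddress.ip_address(src)
    dst_ip = ipaddress.ip_address(dst)
    for _priority, src_net, table in sorted(RULES, key=lambda rule: rule[0]):
        if src_ip not in src_net:
            continue  # rule does not apply to this source address
        matches = [route for route in TABLES[table] if dst_ip in route[0]]
        if matches:
            # Longest prefix wins within a table
            _net, dev, gateway = max(matches, key=lambda route: route[0].prefixlen)
            return table, dev, gateway
    return None


print(lookup("192.168.1.200", "8.8.8.8"))
print(lookup("192.168.2.10", "8.8.8.8"))
```

The first lookup lands in table beonline and leaves via eth1; the second falls through to main and leaves via the Virgin Media default route on eth2.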

NAT/Masquerade

As an extra step, I added an iptables rule to masquerade all traffic leaving eth1 towards the Technicolor router. I wasn't sure if this was necessary, but added it anyway (thinking about it, probably not, since I added a static route on the Technicolor).

To test this is all working, you can use ping or better, traceroute from your firewall:

via virgin media
root@aibo2:/etc# traceroute 8.8.8.8
traceroute to 8.8.8.8 (8.8.8.8), 30 hops max, 60 byte packets
 1  10.89.x.x (10.89.x.x)  6.719 ms  6.675 ms  6.640 ms
 2  croy-core-2a-ae6-648.network.virginmedia.net (81.96.228.201)  9.544 ms  9.576 ms  9.528 ms
 3  * * *
 4  * * *
 5  * * *
 6  72.14.198.97 (72.14.198.97)  18.411 ms  18.529 ms  18.611 ms
 7  72.14.233.247 (72.14.233.247)  23.630 ms 209.85.253.95 (209.85.253.95)  27.590 ms  27.279 ms
 8  209.85.245.187 (209.85.245.187)  32.126 ms 209.85.242.123 (209.85.242.123)  32.164 ms 209.85.142.177 (209.85.142.177)  32.024 ms
 9  google-public-dns-a.google.com (8.8.8.8)  31.025 ms  21.973 ms  11.934 ms


via beonline
root@aibo2:/etc# traceroute -s 192.168.1.200 8.8.8.8
traceroute to 8.8.8.8 (8.8.8.8), 30 hops max, 60 byte packets
 1  192.168.1.254 (192.168.1.254)  0.729 ms  0.946 ms  1.408 ms
 2  host-62-24-254-203.as13285.net (62.24.254.203)  27.596 ms  29.215 ms  28.727 ms
 3  host-78-151-228-57.as13285.net (78.151.228.57)  29.979 ms  31.896 ms  31.348 ms
 4  host-78-151-228-78.as13285.net (78.151.228.78)  33.601 ms host-78-151-228-72.as13285.net (78.151.228.72)  34.557 ms host-78-151-228-70.as13285.net (78.151.228.70)  34.967 ms
 5  host-78-144-11-223.as13285.net (78.144.11.223)  36.972 ms host-78-144-11-117.as13285.net (78.144.11.117)  37.424 ms host-78-144-9-81.as13285.net (78.144.9.81)  38.420 ms
 6  72.14.214.222 (72.14.214.222)  39.440 ms  25.707 ms  27.260 ms
 7  216.239.56.67 (216.239.56.67)  27.685 ms 216.239.56.203 (216.239.56.203)  26.193 ms 216.239.56.67 (216.239.56.67)  26.010 ms
 8  216.239.57.131 (216.239.57.131)  31.134 ms 216.239.57.153 (216.239.57.153)  29.699 ms 216.239.57.131 (216.239.57.131)  32.021 ms
 9  google-public-dns-a.google.com (8.8.8.8)  34.251 ms  32.332 ms  34.870 ms


Note that traffic routes over my primary line (Virgin Media) by default, so no source address is needed. With beonline, I have to specify the source address so that it knows to push the traffic via eth1 (backup line).

In part 2, I will discuss how to route specific traffic over one connection or the other,
e.g. you want FTP traffic going over your primary line, but Skype traffic going over your backup line.

Monday, 15 February 2016

NodeMCU + DHT22 (ESP8266) wifi thermometer/humidity sensors pt2 with OLED

Wifi temperature / humidity sensor for under £20 with an OLED display

So I wanted a display output for the wifi temp modules, so I bought a few of these: http://www.ebay.co.uk/itm/111474769297?_trksid=p2057872.m2749.l2649&ssPageName=STRK%3AMEBIDX%3AIT
which are 128x64 displays with an I2C interface. I2C is a multi-master protocol that uses 2 lines: SDA (serial data) and SCL (serial clock).

I chose these because they use less power than a traditional LCD, have persistence when powering them, and are brighter than LCDs.


Code wise, I separated it into the main code and an init loader (init being the first code loaded when the ESP boots).
This basically 
Loader:
gpio.mode(1,gpio.OUTPUT)
gpio.write(1,gpio.HIGH)
wifi.setmode(wifi.STATION)
wifi.sta.config("wifi name","wifi passwd")
print (wifi.sta.getip())
gpio.write(1,gpio.LOW)
dofile("main.lua")


Main.lua
print("Running main")
gpio.mode(2,gpio.OUTPUT)
BLUE=1
gpio.mode(BLUE,gpio.OUTPUT)
gpio.write(BLUE,gpio.HIGH)
-- Set the LED to flash
pwm.stop(2)
pwm.setup(2,1,10)
pwm.start(2)
-- Load library

sda_pin = 5
scl_pin = 6
dht_pin = 4
oled_addr = 0x3c
-- Counter for heartbeat
cnt = 1
state = 0
-- global for display/read dht
humidtwo = 0
temptwo = 0
-- Heap limit
heap_limit= 22000

function init_OLED(sda,scl)
     sla = 0x3c
     i2c.setup(0, sda, scl, i2c.SLOW)
     disp = u8g.ssd1306_128x64_i2c(sla)
     disp:setFont(u8g.font_6x12)
     disp:setFontRefHeightExtendedText()
     disp:setDefaultForegroundColor()
     disp:setFontPosTop()
end
function read_sensor_values()
  local varh,vart
  dht22.read(dht_pin)
  varh = dht22.getHumidity()
  vart = dht22.getTemperature()
  if varh ~= nil then
    humidtwo = (varh/10).."."..(varh%10)
  else
    print ("Previous H : " ..humidtwo)
  end
  if vart ~= nil then
    temptwo = (vart/10).."."..(vart%10)
  else
    print ("Previous T : " ..temptwo)
  end
end

function display_sensor_values(vvar)
  disp:firstPage()
  disp:setFont(u8g.font_6x10)
  disp:setFontRefHeightExtendedText()
  disp:setDefaultForegroundColor()
  disp:setFontPosTop()
  local x,y,ip,nm,st,deg,result
  if state==1 then
    st=string.char(176)
  else
    st=" "
  end
  deg=string.char(176)
  repeat
    disp:drawRFrame(0, 0, 128-1, 64-1, 1)
    x=4
    y=8
    disp:drawStr(x, y, 'T:' .. (temptwo) .. deg ..'C   H:' .. (humidtwo) .. '%RH' )
    y = y+14
    disp:drawStr(x, y,  st .. ' H:' .. (node.heap()) .. ' C:' .. (cnt) )
    y = y+14
    if wifi.sta.status()==0 then result='STA_IDLE' end
    if wifi.sta.status()==1 then node.restart() end
    if wifi.sta.status()==2 then result='STA_WRONG PASSWD' end
    if wifi.sta.status()==3 then result='STA_NO AP FOUND' end
    if wifi.sta.status()==4 then result='STA_CONNECT FAILED' end
    if wifi.sta.status()==5 then result='STA_GOT_IP' end
    if vvar==1 then
      ssid,password,bssid_set,bssid=wifi.sta.getconfig()
      result= ssid
      ssid,password,bssid_set,bssid=nil,nil,nil,nil
    end
    disp:drawStr(x, y, result  )
    y = y+14
    ip,nm=wifi.sta.getip()
    if ip ~= nil then
        disp:drawStr(x, y, 'IP:' .. (ip)  )
    else
        disp:drawStr(x, y, 'IP: Cannot get IP')
        node.restart()
    end

  until disp:nextPage() == false

  cnt = cnt + 1
  if cnt > 999 then
     cnt=1
  end
end

init_OLED(5,6)

i2c.setup(0, sda_pin, scl_pin, i2c.SLOW)
disp = u8g.ssd1306_128x64_i2c(oled_addr)

  sv=net.createServer(net.TCP, 2)
  sv:listen(80,function(c)
      c:on("receive", function(c, pl)
         print(pl)
         if pl=="1" then
            print ("gpio1 low")
            gpio.write(1,gpio.LOW)
         end
         if pl=="2" then
            print ("gpio1 high")
            gpio.write(1,gpio.HIGH)
         end
      end)
        dht22 = require("dht22")
        gpio.mode(BLUE,gpio.OUTPUT)
        gpio.write(BLUE,gpio.HIGH)
        read_sensor_values()
        display_sensor_values()
        dht22=nil
        print("Humidity:    "..humidtwo.." %")
        print("Temperature: "..temptwo.." deg C")
        c:send("H:"..humidtwo.." ; T:"..temptwo.."\r\n")
        c:close()
        gpio.write(BLUE,gpio.LOW)
      
        hp=node.heap()
        if hp<5000 then
            node.restart()
        end
  
        collectgarbage()
       end)
     
tmr.alarm( 2, 1000, 1, function()
  if state == 0 then
    display_sensor_values(1)
  end
  if state == 1 then
    display_sensor_values(0)
  end
  state = (state + 1) %2
  collectgarbage()
end)
The code above requires dht22.lc to be loaded (from here), which should be compiled on the ESP8266 once uploaded as a Lua file.

Doing a list should show something like this:
dht22.lc        : 1308 bytes
init.lua        : 199 bytes
main.lc         : 4176 bytes
main.lua        : 3521 bytes
The OLED display should be like so:
T: 25.3°C   H: 26.9%RH
* H:22824  C:288
STA_GOT_IP
IP: 10.10.0.164

So with this it will display the (T:) temperature in the top left, with the humidity level (H:) in the top right.
The next line shows a heartbeat icon, followed by heap size in bytes (H:).
I've programmed this so that if the heap space gets below 5000 bytes, it will restart itself.
C: is the count, and when it reaches 999, it will go back to 1. This is purely to show it's doing some processing.
The next line alternates between the wifi status codes and the actual SSID it's connected to. This alternates every second.
The last line (IP:) shows the ip currently assigned to the device.
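
The TCP listener in main.lua replies to each connection with a line of the form "H:<humidity> ; T:<temperature>" and then closes it, so a collector on another machine only has to connect and read. Below is a rough sketch of such a poller; parse_reading and poll_sensor are names I made up, and how you feed the numbers into RRD is left out.

```python
import re
import socket

# Matches the "H:26.9 ; T:25.3" reply built by c:send() in main.lua
READING = re.compile(r"H:\s*([-\d.]+)\s*;\s*T:\s*([-\d.]+)")


def parse_reading(line):
    """Turn 'H:26.9 ; T:25.3' into (humidity, temperature) floats."""
    match = READING.search(line)
    if match is None:
        raise ValueError("unexpected reply: {!r}".format(line))
    return float(match.group(1)), float(match.group(2))


def poll_sensor(host, port=80, timeout=5.0):
    """Connect to the ESP8266, read its one-line reply and parse it."""
    sock = socket.create_connection((host, port), timeout=timeout)
    try:
        chunks = []
        while True:
            data = sock.recv(1024)
            if not data:  # the module closes the connection after replying
                break
            chunks.append(data)
    finally:
        sock.close()
    return parse_reading(b"".join(chunks).decode("ascii", "replace"))
```

For the module shown on the display above, humidity, temperature = poll_sensor("10.10.0.164") would fetch one sample; a timeout or connection error there lines up with the wifi drop outs.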

My output to RRD graph is something like this:



The drop outs are caused by the wifi signal (or lack of).
Part 1 here

Friday, 12 February 2016

Configuring exim4 for AWS/Amazon/SES


Exim4 and AWS SES


So if you want to use Amazon SES to send out emails, you'll need to verify your email address under Identity management -> Email addresses, which sends you an email to confirm that you own the address.

Create SES credentials 

Under Email settings -> SMTP settings.
Don't forget them!

Configure Exim


 dpkg-reconfigure exim4-config
 
  • mail sent by smarthost, received via SMTP or fetchmail
  • your fully qualified domain name (e.g. example.com)
  • 127.0.0.1   for listen address
  • your fully qualified domain name (e.g. example.com) for final destination
  • no relay servers
  • your AWS SES smtp server as outgoing smarthost.  Importantly, don’t use the default port of 25, as 25 is unencrypted, and exim4 won’t send passwords over unencrypted connections without messing around.  So, for example, you might have “email-smtp.us-east-1.amazonaws.com::587”
  • Accept defaults for everything else

Create a file /etc/exim4/passwd.client.  This will give exim4 the logon credentials.  Importantly, amazon will resolve to a different server name each time (via the load balancer), so you can’t just put your smtp server name in here.  Your format should be something like:

# password file used when the local exim is authenticating to a remote
# host as a client.
#
# see exim4_passwd_client(5) for more documentation
#
# Example:
### target.mail.server.example:login:password
*.amazonaws.com:<smtp_username>:<smtp_password>
 
Also check your aliases file in /etc/aliases
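
If mail does not flow, it can help to test the SES credentials on their own, outside exim. The following is a sketch using Python's standard smtplib over STARTTLS on port 587; build_message and send_via_ses are hypothetical helper names, and the addresses in the usage note are placeholders you would swap for your verified ones.

```python
import smtplib
from email.mime.text import MIMEText


def build_message(sender, recipient, subject, body):
    """Create a plain-text email with the minimum headers set."""
    msg = MIMEText(body)
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    return msg


def send_via_ses(host, username, password, msg, port=587):
    """Send msg through the SES SMTP endpoint using STARTTLS."""
    server = smtplib.SMTP(host, port, timeout=30)
    try:
        server.starttls()  # upgrade to TLS before sending the password
        server.login(username, password)
        server.sendmail(msg["From"], [msg["To"]], msg.as_string())
    finally:
        server.quit()
```

For example, send_via_ses("email-smtp.us-east-1.amazonaws.com", "<smtp_username>", "<smtp_password>", build_message("you@example.com", "you@example.com", "SES test", "hello")) should either deliver a test mail or raise an authentication error that tells you the credentials are the problem rather than exim.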

Tuesday, 26 January 2016

Do not buy your service from Vultr!

Poor Customer service from Vultr.

So I have a side business which uses VPSs around the globe, and one of the providers I use is Vultr.
Initially I thought they were great, but now, just after a year of using them, I find they are unbelievably bad.

So one of the incidents was they emailed me to say they were rebooting one of my VPSs (last year around xmas time) - fine. Except they rebooted it 9 hrs ahead of schedule!! This was not good, as this particular server needed me to manually type in a password for an SSL cert for Apache to run.

Luckily, nagios and site 24x7 caught the reboot and emailed me. I asked for an explanation on why it was rebooted ahead of schedule, and nothing happened. No reply, no acknowledgement and no ticket was raised.

I raised a ticket to complain, and within an hour, someone had trashed my VPS.

I had to re-install, and then I find out that port 25 is blocked (telnetted to several providers e.g. yahoo, gmail)
2016-01-26 01:23:24 1aNsLs-00006F-1a alt1.gmail-smtp-in.l.google.com [2a00:1450:4010:c08::1a] Network is unreachable
2016-01-26 01:25:31 1aNsLs-00006F-1a alt1.gmail-smtp-in.l.google.com [64.233.165.27] Connection timed out
2016-01-26 01:25:31 1aNsLs-00006F-1a alt2.gmail-smtp-in.l.google.com [2404:6800:4003:c02::1b] Network is unreachable
2016-01-26 01:27:38 1aNsLs-00006F-1a alt2.gmail-smtp-in.l.google.com [74.125.68.27] Connection timed out
2016-01-26 01:27:38 1aNsLs-00006F-1a alt3.gmail-smtp-in.l.google.com [2404:6800:4008:c07::1a] Network is unreachable
2016-01-26 01:29:45 1aNsLs-00006F-1a alt4.gmail-smtp-in.l.google.com [173.194.72.27] Connection timed out

I emailed them and they said yup - we have port 25 blocked... Since FUCKING WHEN!? They didn't even inform me. The server that was trashed was a payments server which emailed new users their account data. If it didn't email the users, they would want refunds.

This was costing me sales.

I was not happy.

So they then replied to the support ticket about port 25 being blocked, with some crap about verifying my identity with a credit card and govt issued ID.
I replied: Why was I not informed when you blocked me? and why was this initially blocked?

I don't expect any reply from them; I expect the ticket to just be closed without a response. I am FUMING!

My advice is, if you are running a business, don't use Vultr..... their customer support will lose you business!