After he added another node to the cluster and removed it again, Rabbit suddenly refused to start on the cluster nodes. It kept writing crash dumps and filling up the error log, so he asked me for help. I don't work with RabbitMQ myself, I have only played with it a little, but I have done quite a bit of Erlang (http://www.amazon.de/dp/3941841459).
The crash dump didn't help much, and neither did Google. A brief glance at the error log revealed a message like:
** FATAL ** Failed to merge schema: Bad cookie in table definition
To me, it looked like the Mnesia database backing Rabbit had become inconsistent at some point, apparently through adding and removing that node. Whatever the cookie problem was, it didn't seem to be the Erlang cookie, since all nodes in the cluster and the node in question shared the same one.
So my idea was to "connect" directly to Mnesia and clean up the schema. When Rabbit comes up again, it would surely recreate the schema. Otherwise Mnesia would keep tripping over its inconsistency and crashing the VM. Of course, this only works when you don't care about the messages in the queues. If you do, you would also need to back up the Mnesia tables Rabbit uses. I'm sure the Rabbit documentation mentions them somewhere.
So, back to how it worked. You need to find out where Mnesia stores its data for your Rabbit user. In my case, it was in /var/lib/rabbit/mnesia. Then you bring up an Erlang node configured basically like the Rabbit node and delete the Mnesia schema. After that, Rabbit should be able to start and create a fresh Mnesia schema, and you can do rabbitmqctl stop_app etc. to reconfigure your cluster.
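Since deleting the schema cannot be undone, it may be worth copying the Mnesia directory aside first. The following is only a minimal sketch: the function name backup_mnesia_dir and the ".bak.<timestamp>" naming are made up for illustration, and it assumes Rabbit is not running on the node while the copy is taken.

```shell
#!/bin/sh
# Hypothetical helper (not part of RabbitMQ): copy a Mnesia directory aside
# before touching the schema. Prints the path of the copy on success.
# Usage: backup_mnesia_dir SRC_DIR [DEST_PARENT]
backup_mnesia_dir() {
    src=${1:-}
    if [ -z "$src" ] || [ ! -d "$src" ]; then
        echo "no such directory: $src" >&2
        return 1
    fi
    # Default: put the copy next to the original directory.
    dest_parent=${2:-$(dirname "$src")}
    dest="$dest_parent/$(basename "$src").bak.$(date +%Y%m%d%H%M%S)"
    # -R copies recursively, -p keeps ownership, modes and timestamps.
    cp -Rp "$src" "$dest" || return 1
    echo "$dest"
}
```

With the path from this post, the call would be backup_mnesia_dir /var/lib/rabbit/mnesia, run as a user that is allowed to read that directory.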
To "connect" to the Mnesia store, do something like this on every node you run Rabbit on. First, fire up the prepared Erlang shell:
$ erl -sname "rabbit@node01" -mnesia dir '"/var/lib/rabbit/mnesia/rabbitmq"'
Erlang R14B04 (erts-5.8.5) [source] [64-bit] [smp:8:8] [rq:8] [async-threads:0] [hipe] [kernel-poll:false]
Eshell V5.8.5 (abort with ^G)
> mnesia:info().
===> System info in version ".....", debug level = none <===
opt_disc. Directory "/var/lib/rabbit/mnesia/rabbitmq/Mnesia.rabbit@node01" is used.
...
That's all you need to see: the configuration you provided is correct. If it says "NOT used", something is wrong and you need to check the erl parameters. Then, still in the shell, run:
> mnesia:delete_schema(['rabbit@node01']).
ok
>
There you go. The Rabbit schema is gone. Now you can bring up Rabbit and use stop_app etc. to set up your cluster. As easy as that. You could try listing all your cluster nodes in the call to mnesia:delete_schema/1, but I didn't and it worked.
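For the "stop_app etc." part, the usual rabbitmqctl sequence might look roughly like the sketch below. Treat it as an assumption-laden outline rather than the exact commands used here: the function name rejoin_cluster is made up, the seed node name is a placeholder, and the clustering command depends on the Rabbit version (newer releases use join_cluster, older 2.x releases used "cluster" followed by a node list). Run it only on the node that should join, never on the node that holds the data you want to keep, because reset wipes the local state.

```shell
#!/bin/sh
# Sketch: rejoin a freshly started node to an existing cluster.
# RABBITMQCTL can be overridden (for example with "echo") for a dry run.
RABBITMQCTL=${RABBITMQCTL:-rabbitmqctl}

# Hypothetical helper.
# Usage: rejoin_cluster SEED_NODE
# SEED_NODE is a node that is already a cluster member, e.g. rabbit@node01.
rejoin_cluster() {
    seed=${1:-}
    if [ -z "$seed" ]; then
        echo "usage: rejoin_cluster rabbit@seednode" >&2
        return 1
    fi
    # Stop the broker application but keep the Erlang VM running.
    $RABBITMQCTL stop_app || return 1
    # Return this node to a blank state.
    $RABBITMQCTL reset || return 1
    # Assumption: a release that has join_cluster; older 2.x releases
    # used "rabbitmqctl cluster <node> ..." instead.
    $RABBITMQCTL join_cluster "$seed" || return 1
    $RABBITMQCTL start_app || return 1
    # Show the resulting cluster membership.
    $RABBITMQCTL cluster_status
}
```

Setting RABBITMQCTL=echo before calling rejoin_cluster rabbit@node01 prints the commands instead of running them, which is a cheap way to check the sequence first.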
You're welcome. Feedback is very much appreciated. What I also didn't try is simply deleting the Mnesia files. This might work as well, but it was too harsh for my taste. You might need to start Mnesia on the node and back up some tables first, so I wouldn't suggest just deleting the files.
And I'm pretty sure there is a magic Rabbit switch that does all of this in one go. But I couldn't find one :)