Tuesday, December 23, 2008

MySQL Cluster 6.4 - BETA

The Configurator and the Sandbox (only the Custom part so far) have been updated to support MySQL Cluster 6.4!
MySQL Cluster 6.4 is a fantastic piece of software with new features such as:
  • Multi-threaded support. A new configuration parameter, MaxNoOfExecutionThreads=[2,4,8], lets you set how many cores the data node should utilize (see the example after this list):
    MaxNoOfExecutionThreads=2 for two core machines
    MaxNoOfExecutionThreads=4 for quad core machines
    MaxNoOfExecutionThreads=8 for eight core machines
    With 8 you will get 4 query handling threads! Can you beat Jonas' 1M reads on a single box?
  • Online add node - allows you to scale online! I will publish a procedure for how to do this very soon. Currently the Configurator and the Sandbox do not support it.
  • Windows support (I have not tried this and am not 100% sure of the quality, nor how to build it, but it should be there).
  • Snapshot backup - START BACKUP [SNAPSHOTSTART | SNAPSHOTEND]
    SNAPSHOTEND - the backup contains the database as of backup end time (default, and the same behavior as in previous releases)
    SNAPSHOTSTART - the backup contains the database as of backup start time
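
Here is a minimal sketch of how the two features above could be used; the parameter goes in config.ini and the exact value depends on your hardware (eight cores is just an assumption for this example):

# config.ini - assuming data node hosts with eight cores
[ndbd default]
MaxNoOfExecutionThreads=8

# snapshot backup from the management client
ndb_mgm -e "START BACKUP SNAPSHOTSTART"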
And remember, it is a beta, so be prepared for bugs and other oddities until it has stabilized.

Friday, December 19, 2008

MySQL Cluster Sandbox - test cluster to cluster replication!

If you have 15 minutes to spare and want to try out MySQL Replication (geo redundancy) between two Clusters on your localhost (only tested on Linux) or on VMware, then this Sandbox is for you.

Here is what you have to do:
  1. Go to www.severalnines.com/sandbox
  2. Create a master cluster:
    I have used port 1186 for the management server, and 3306 and 3307 for the mysql servers.
  3. Enter your email address and a set of scripts to install the cluster will be sent to you.
  4. Create a slave cluster:
    Make sure you use different ports for the slave cluster. I have used port 1187 for the management server, and 3310 and 3311 for the mysql servers.

Master Cluster:

tar xvfz mysqlcluster-63-master.tar.gz
cd mysqlcluster-63-master
cd scripts
If you have the gnu tool chain (g++/gcc/make) you can build from source:
sh download-and-compile.sh
sh dist-src-build.sh
or if you prefer pre-built binaries (you must choose the "non-rpm" package, i.e. the .tar.gz, and the mysql-...tar.gz must not be in /tmp!):
sh dist-tgz mysql-cluster-gpl-6.3.17-linux-i686-glibc23.tar.gz
Then we have to set up the mysql servers and start the cluster:
sh bootstrap.sh
sh start-cluster-initial.sh
Connect the mysql client to the mysql server:
sh mysqlclient-1.sh
GRANT REPLICATION SLAVE ON *.* TO 'repl'@'localhost' IDENTIFIED BY 'password';
We don't want the GRANT to be replicated, so we reset the binary log (this also clears out old stuff in the binlogs that we might have):
RESET MASTER;
and once more on the second mysql server (the GRANT must be done on both mysql servers, since grants are local to each mysql server):
sh mysqlclient-2.sh
GRANT REPLICATION SLAVE ON *.* TO 'repl'@'localhost' IDENTIFIED BY 'password';
RESET MASTER;
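
Optionally, you can verify on each of the two master mysql servers that the binary log really has been reset (this is just a standard sanity check, not part of the generated scripts):
SHOW MASTER STATUS;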

Slave Cluster:

tar xvfz mysqlcluster-63-slave.tar.gz
cd mysqlcluster-63-slave
cd scripts
If you have the gnu tool chain (g++/gcc/make) you can build from source:
sh download-and-compile.sh
sh dist-src-build.sh
or if you prefer pre-built binaries (you must choose the "non-rpm" package, i.e. the .tar.gz, and the mysql-...tar.gz must not be in /tmp!):
sh dist-tgz mysql-cluster-gpl-6.3.17-linux-i686-glibc23.tar.gz
Then we have to set up the mysql servers and start the cluster:
sh bootstrap.sh
sh start-cluster-initial.sh
Connect the mysql client to the mysql server:
sh mysqlclient-1.sh
CHANGE MASTER TO master_host='localhost', master_port=3306, master_user='repl', master_password='password';
START SLAVE;
SHOW SLAVE STATUS\G
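
In the SHOW SLAVE STATUS output, the important thing is that both replication threads are running, i.e. you should see something like:
Slave_IO_Running: Yes
Slave_SQL_Running: Yes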

Create a table on the master:
sh mysqlclient-1.sh
use test
CREATE TABLE t1 (a INTEGER PRIMARY KEY) ENGINE=NDB;
INSERT INTO t1(a) VALUES (1);
Go to the slave:
sh mysqlclient-1.sh
use test
SELECT * FROM t1;
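If replication works, the row inserted on the master comes back on the slave:
+---+
| a |
+---+
| 1 |
+---+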

Read more on replication and learn how to do link failovers etc.

Friday, December 12, 2008

cmon 0.12 - with diskdata support and a lot of fixes

cmon 0.12 is now released and can be downloaded from www.severalnines.com/cmon.

Release notes:
  • connection handling - cmon will now retry connects forever. It will also recover and automatically reconnect to the cluster after a cluster failure, or if the connection to the mysql server where cmon logs information is lost. This means you have to terminate cmon explicitly, using e.g. 'killall cmon' or /etc/init.d/cmon stop.
  • disk data - cmon now gathers information about the amount of table space used and this is presented in the web interface
  • init.d scripts - fixed bugs in creating the init.d scripts during make install-extra
  • fixed compiler warnings - now 0 compiler warnings on Linux systems
  • mysql_statistics table - now explicitly using engine=myisam
  • web interface - changed the printout shown if cmon cannot connect to the cmon mysql server (it suggests changing $mysql_hostname in {wwwroot}/cmon/cmon_db_conf.inc.php)
  • web interface - changed "Memory" to "Storage" which now contains table space information
  • web interface - print hostname instead of ipaddresses if possible
  • cmon has been tested on SUSE 10 (SLES 10).
  • documentation bug - wwwroot for SUSE is /srv/www/htdocs/
Upgrading from cmon 0.11
  • cmon 0.12 will add a new table called 'diskdata'. This will be added automatically when you start cmon-0.12. If you prefer, you can download the sql script and create it manually.
  • don't forget to do ./configure --with-wwwroot={documentroot of www server} if you want to easily install the www files for cmon using 'make install-extra'
  • make install-extra will also install the init.d scripts.

External integration
cmon was successfully integrated at a customer who was using Solarwinds IpMonitor as the central monitor for all their IT services.
By using SQL to select the cluster state, node state, etc. from the cmon database, and raising alarms and notifications when needed, the customer's MySQL Cluster solution could be monitored from one central point.
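
To give an idea (purely hypothetical - the table and column names below are made up for illustration; check the actual cmon schema for the real ones), such an integration boils down to queries like:
SELECT nodeid, status FROM cmon.node_state WHERE status <> 'STARTED';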

Todo
  • brush up web interface (make it look nicer)
  • add replication link monitoring
  • add smtp/snmp notifications

Wednesday, December 10, 2008

Disk data - a summary

Here is a summary of Disk Data tables in MySQL Cluster.

Data node configuration:
  • SharedGlobalMemory=384M
    RAM from the shared global memory is used for the UNDO_BUFFER when you create the log file group.
    In the configuration generated by severalnines.com/config you have to uncomment SharedGlobalMemory in mysqlcluster-XYZ/cluster/config/config.ini before you start the cluster.
  • DiskPageBufferMemory=3072MB
    If you rely a lot on the disk data, we recommend setting this as large as possible.
    In the configuration generated by severalnines.com/config you have to uncomment DiskPageBufferMemory in mysqlcluster-63/cluster/config/config.ini before you start the cluster.
    The DiskPageBufferMemory should be set to:
    DiskPageBufferMemory = TOTAL_RAM - 1GB (OS) - 1200MB (approx. size used for buffers etc. in the data nodes) - DataMemory - IndexMemory
    Expect to do some trial and error before getting this right (see the worked example after this list).
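
A worked example of the formula above, purely for illustration (the numbers are made up, use your own): assume a data node host with 16384M of RAM, DataMemory=6144M and IndexMemory=1024M. Then DiskPageBufferMemory = 16384M - 1024M (OS) - 1200M (buffers) - 6144M - 1024M = 6992M, and the relevant part of config.ini would be:

[ndbd default]
SharedGlobalMemory=384M
DataMemory=6144M
IndexMemory=1024M
DiskPageBufferMemory=6992M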
Creating the storage:
  • LOGFILE GROUP
    CREATE LOGFILE GROUP lg ADD UNDOFILE 'undo1.dat' INITIAL_SIZE=20G UNDO_BUFFER_SIZE=128M ENGINE=NDB;

    You can only have one LOGFILE GROUP. A 20GB undo log is just an example, but don't be shy with it if you have the space. Theoretically, you could need as much UNDO as you have TABLE SPACE. If you have a 10GB table space, then start with an UNDO log that is 10% of that, and add more UNDO LOG FILES if you need to, i.e. if you run into problems. UNDO_BUFFER_SIZE=128M should be sufficient for most applications (but then you must have set SharedGlobalMemory=384M). This buffer cannot grow dynamically or be extended later :(
  • UNDO LOG FILES
    You can add more undo files (online) by doing:
    ALTER LOGFILE GROUP lg ADD UNDOFILE 'undo_2.dat' INITIAL_SIZE 256M ENGINE NDB;
  • TABLESPACE
    CREATE TABLESPACE ts_1 ADD DATAFILE 'data1.dat' USE LOGFILE GROUP lg INITIAL_SIZE=256M ENGINE=NDB;
  • DATAFILES
    In 7.0 you can use DiskIOThreadPool, which makes it possible to have X threads operating on one file, so you don't have to create many data files. Of course, for some applications it may still be good to have more files if you want to spread the load over several physical disks.
    In 6.3 it is recommended to have many, but smaller, data files. Here is a script that generates SQL (ALTER TABLESPACE ... ADD DATAFILE ...) to create a number of datafiles (a sketch of such a script is shown after this list). Run it as:
    sh ts-gen.sh 100 > alter_add_ts.sql
    and you will get a file with 99 SQL statements that add data files.
    From the mysql client you can then do:
    mysql> source alter_add_ts.sql
    and a lot of datafiles will be added to the ts_1 tablespace.
    Please note that you can change the file size and path if you wish.
  • Having more than 50-60 data files does not really give anything. The reason for having several data files instead of one is that more internal file system threads can operate "in parallel" on the data files, instead of being synchronized on one.
  • You grow the table space by adding data files.
  • Adding data files is an online operation if you wish to extend your table space.
  • There is no auto-extend yet of table spaces.
  • You can create many tablespaces, each having many data files.
  • Leave EXTENT_SIZE as it is. We have found no reason yet to change it.
  • Example CREATE TABLE:
    CREATE TABLE t1 ( a INTEGER PRIMARY KEY, b VARBINARY(2048)) engine=ndb TABLESPACE ts_1 STORAGE DISK;
    The column 'b' will be stored on disk. The column 'a' will be stored in DataMemory together with the Ordered Index on 'a' (which comes from the PRIMARY KEY). The PRIMARY KEY hash index on 'a' will use IndexMemory.
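
Here is a minimal sketch of what a generator script like ts-gen.sh could look like (assuming the ts_1 tablespace and 256M data files; the actual script from the site may differ):

#!/bin/sh
# usage: sh ts-gen.sh <count>  - prints <count>-1 ALTER TABLESPACE statements
i=1
while [ $i -lt $1 ]; do
  echo "ALTER TABLESPACE ts_1 ADD DATAFILE 'data$i.dat' INITIAL_SIZE=256M ENGINE=NDB;"
  i=`expr $i + 1`
done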

Disk space and disk utilization:
  • Indexed attributes are stored in RAM (DataMemory). However, for the non-indexed attributes you can specify whether they should be in RAM or on disk when you create the table (see the reference manual about the column definitions STORAGE DISK and STORAGE MEMORY).
  • In MySQL Cluster 6.2, 6.3, and 7.0 disk data attributes are always stored as fixed size (e.g. a VARCHAR(256) will always use 256B on disk, even if it is NULL - this will change in 7.1, hopefully).
  • BLOB/TEXTs are special and stored in segments. The first 256B are stored in RAM, the rest is stored in segments (the segment size varies depending on whether it is a MEDIUMBLOB/BLOB/LONGBLOB, from 2048B up to 8192B iirc).
  • NULLs stored in a BLOB/TEXT will occupy 0B on disk, but 2B overhead of DataMemory.
  • The data files are divided into extents. Each extent is 1024K (1MB). There is no reason to change the extent size. Many records may reside in one extent.
  • If you write something to a disk data table, the data will be written to the table space on as many nodes as you have NoOfReplicas set to. Thus, typically you will have two copies of the data (one on each node in a node group).
  • The information_schema contains information about how much of the table space and undo files is used, e.g.:
    select free_extents, total_extents from information_schema.files where file_type='datafile';
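    For example, to get the overall free percentage of the ts_1 tablespace, you can do something like:
    select sum(free_extents)/sum(total_extents)*100 as pct_free from information_schema.files where file_type='datafile' and tablespace_name='ts_1';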

Short on disk data internals:

Thanks Jonas for your input on this!
  • The DiskPageBufferMemory is a LIRS cache (variant of the LRU cache)
  • Dirty pages are flushed to the table space during an LCP (local checkpoint), but also if they have been inactive for some time (I have no info yet on the inactive time).
  • Changes to disk data are redo logged (the same way as for in-memory tables).
  • Changes to disk data are written to a circular UNDO log (using WAL). This is so that we can undo all the way back to the start of the last completed LCP. This way, we can get to a state where memory and disk are in sync (i.e. references between disk data and in-memory storage are correct) before we start running the REDO log.
  • If a record exists in the DiskPageBufferMemory, it will be returned from there to the application.
  • If a record does not exist in the DiskPageBufferMemory, it will have to be fetched from disk. In case there are no empty pages in the DiskPageBufferMemory, the LRU algorithm will sync dirty pages to disk in order to free up pages.
Performance

What to use disk data for?:
  • Storing e.g. inactive subscribers in a subscriber database (when they are activated you can move them into RAM). If an activated user has been inactive for too long, it can be moved back to disk data storage. This is something you have to implement in the application logic.
  • Storing e.g SMS/MMS
  • Logging data
  • Large feeds coming in that you need to persist, but not necessarily access in real-time (i.e. < 3 ms).
  • Probably more..
Suitable access patterns:
  • You should not use data stored in disk data tables for real-time applications. If you do random I/O, your application will only be as fast as the disk.
  • Use simple access patterns - avoid joins on disk data. See it as an extension to RAM where you store data that you don't need that frequent access to.
  • If you want frequent access to the data you should have a BIG DiskPageBufferMemory (just as in innodb, where the innodb_buffer_pool is very important to performance).
  • Range searches will be expensive, as disk data is not stored according to the order of a particular index. It is more or less random where data is stored. I will get back with numbers on this.
  • If you use SSDs for disk data, then the above might be more or less of an issue. Disk data has not been tested with SSDs yet.
Caveats:
  • Check my earlier blog post for caveats regarding deallocation/allocation of extents.
  • If you restart the cluster with --initial you have to manually delete all the data files and undo log files that were created; they are not cleared by an --initial start.

Disk data counters (more)

In a previous blog post, "disk data counters", I did a couple of measurements on how to configure the disk data and what performance we can expect.

Now I have redone the insert test to put as much load as possible on the disk sub-system.

1. Insert 5M records a' 4096B (1 thread, batches of five):
6.3.19: 286 sec (3721.12 QPS)

2. Insert 5M records a' 4096B (5 threads, batches of five):
6.3.19: 286 sec (8315.36 QPS)

At 2) the I/O sub-system is quite loaded (98.2%) and I periodically get the following (when the DiskPageBufferMemory is checkpointed):

"REDO buffers overloaded, consult online manual (increase RedoBuffer) - code=1221" (this does not matter, as i retry my transactions in the application).
I could increase the RedoBuffer (it is set to 32M), but there is no point as if I increase the load, then the disks would not be able to keep up any way. Hence, to get better perf now I need to increase the number of disks I can write to.
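
For reference, RedoBuffer is a data node parameter in config.ini; increasing it would look like this (32M is what I use now, 64M is just an illustration):
[ndbd default]
RedoBuffer=64M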

Here are some numbers for Read and Writes of 1024B records:

3. Insert 5M records a' 1024B (5 threads, batches of five):
6.3.19: 375 sec (13335 QPS)

4. Insert 10M records a' 1024B (10 threads, batches of five):
6.3.19: 506 sec (19737.2 QPS)

5. Insert 20M records a' 1024B (5 threads, batches of five):
6.3.19: 1100 sec (17035.8 QPS)

6. Insert 20M records a' 1024B (10 threads, batches of five):
6.3.19: 945 sec (21140.7 QPS)

7. Insert 20M records a' 1024B (20 threads, batches of five):
6.3.19: 675 sec (29578.9 QPS)

8. Read 1M random records (1024B) out of 20M (10 threads):
(Each thread does 100K reads)
6.3.19: 816 sec (1230.01 QPS)

9. Read 2M random records (1024B) out of 20M (20 threads):
(Each thread does 100K reads)
6.3.19: 683 sec (2911.67 QPS)

10. Read 3M random records (1024B) out of 20M (30 threads):
(Each thread does 100K reads)
6.3.19: 1273 sec (2350.62 QPS)

The setup was the same as in the disk data counters post.
Next I am going to do some tests on ordered index scans (i.e. range scans on an in-memory index) that read data stored on disk.

Use these numbers only as an indication of what you can achieve, since results may differ for your application.

Tuesday, December 09, 2008

Cluster - session management tips and tricks

Many users use MySQL Cluster for session management. In MySQL Cluster tables are checkpointed and redo logged to disk.

But for session data, do we really care?
If the Cluster is history, then the sessions are most likely obsolete by the time the cluster recovers.

And do we want to use disk space and disk bandwidth for the session data?

If you answer "no" on the questions above, then you can create a table that is not checkpointed nor redo logged, by setting the session variable ndb_table_no_logging:


mysql> use test;
Database changed

#enable no logging
mysql> set ndb_table_no_logging=1;
Query OK, 0 rows affected (0.02 sec)

# a grossly simplified table structure; it will not be checkpointed
# or redo logged, because ndb_table_no_logging=1 was set above
mysql> create table session(id bigint primary key, data varbinary(1024), expiry_time timestamp) engine=ndb;
Query OK, 0 rows affected (1.21 sec)

# disable creation of tables with no logging (i.e, enable logging again, default behavior):
mysql> set ndb_table_no_logging=0;
Query OK, 0 rows affected (0.00 sec)



To verify that the table is not logged (look for Temporary table):

johan@stingray:~$ ndb_desc -d test session
-- session --
Version: 1
Fragment type: 5
K Value: 6
Min load factor: 78
Max load factor: 80
Temporary table: yes
Number of attributes: 3
Number of primary keys: 1
Length of frm data: 289
Row Checksum: 1
Row GCI: 1
SingleUserMode: 0
ForceVarPart: 1
TableStatus: Retrieved
-- Attributes --
id Bigint PRIMARY KEY DISTRIBUTION KEY AT=FIXED ST=MEMORY
data Longvarbinary(1024) NULL AT=MEDIUM_VAR ST=MEMORY
expiry_time Timestamp NOT NULL AT=FIXED ST=MEMORY

-- Indexes --
PRIMARY KEY(id) - UniqueHashIndex
PRIMARY(id) - OrderedIndex

NDBT_ProgramExit: 0 - OK


If the Cluster crashes and later recovers, the session table will not contain any data - it will be completely empty.
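
As a side note, since the table has an expiry_time column, stale sessions can be cleaned out with a simple periodic job, for example (hypothetical, just to illustrate):
DELETE FROM session WHERE expiry_time < NOW();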

I don't think this works with Disk data tables, but I haven't tried, so I don't know what error you will get.

This is a technique that we have implemented with quite a few customers using Cluster for session management.