Muhammad Abdullah's Blog - technology as i C it

I completed my Masters in Computer Science from the University of Texas at Dallas in 2011. My concentration was Intelligent Systems. Generally i am very much interested in anything related to technology but software architectures, data stores, machine learning and operating systems are my favorites. When i am free, i listen to music, play sports, watch soccer or movies. I will share my experiences, my knowledge and my findings on this blog.

Sunday, July 17, 2011

MySQL Replication: Statement based Replication vs Row based Replication

Before discussing the replication formats in MySQL it is necessary to discuss how replication works in MySQL. Why is replication required? The answer lies in my earlier post in which I described the process of scaling MySQL for reads (Scale-Out MySQL). Lets proceed by discussing the replication process. Replication process consists of the following steps:

MySQL Master writes any changes that occur to the 'binlog'. binlog is a log that contains any updates/inserts/deletes made on the master MySQL.
Slave's I/O thread reads the binlog from the master and writes the events in its 'relaylog'.
The MySQL thread reads the events from the relaylog and applies those events to the slave.
Steps 1 till 3 are repeated so the slave is synchronized with the master all the time.

Below is the diagram that will aid you in understanding the process of replication in MySQL.

Now that you are aware to some extent about the process of replication, lets dive deep into the replication formats. By replication format I mean the format in which events are recorded in master's binlog. There are three types of replication formats:

SBR or Statement based Replication
In this format of replication, the MySQL master, records the events as SQL statements into the binlog. The statements are picked up by the MySQL slaves and replayed in the same way as they are played at master
RBR or Row based Replication
In this format of replication, the MySQL master, records the events as actual rows that indicate how the rows are changed at the master.
Mixed Mode Replication
This format of replication is a mix of RBR and SBR. MySQL switches the format in real-time depending on the type of event.

Of course there are pros and cons of each format, I will primarily address the RBR and SBR. I will discuss the pros and cons of each replication format type.

Statement based Replication

Pros:

As in case of statement based replication the events are logged as SQL statements and not in form of row changes, hence the log files are much much smaller and utilize less storage space.
The log file contains all the SQL statements and hence can be utilized for analysis, audit or restoring from backup.
In terms of bandwidth, statement based replication is much more efficient because its the queries that are transferred to slave and not the actual row updates.

Cons:

Statements that possess non-deterministic properties are difficult to replicate using this format.
Statements that use stored procedures or user defined functions (UDF) are considered to be nondeterministic statements and hence are difficult to replicate using SBR.
DELETE and UPDATE statements that use LIMIT without using ORDER BY are considered to be nondeterministic and hence cannot be replicated using SBR.
UPDATE statements with a WHERE clause that doesn't utilize the index require more locks in SBR then in RBR.
INSERT statements that use auto-increment, block other non-conflicting INSERT statements. This is applicable to InnoDB engine. Its a very important point to consider as this point effects concurrancy.
More locks are required on the slave for INSERT/UPDATE/DELETE
Complex statements that are evaluated and executed on the master, need to go through the same process on the slave before applying changes to the slave.
Table definations should be identical on the master and slave.

Row based Replication

Pros:

Its the safest form of replication though not as old as statement based replication.
Fewer row locks are required for INSERT statements with auto-increment.
Fewer locks are required for INSERT/UPDATE/DELETE on the slave when compared to statement based replication.
Fewer locks are required when UPDATE and DELETE statement's WHERE clause does not make use of the index.

Cons:

Contrary to statement based replication, row based replication records changes to each row. The most probable disadvantage from this behavior is the size of the binlog. If a SQL statement effects 100 rows then in SBR just one query is logged whereas in RBR changes to 100 rows are recorded. This behaviour makes binlogs in RBR much much larger then SBR.
As in case of SBR, the SQL queries are logged and hence they can be reused and read for audit. In RBR, there is no way to figure out what statements were executed on the master and recieved on the slave. However what rows has been changed or inserted, can be decoded.
There is no mechanism available in RBR to ensure that the binlog in MySQL master was processed without any problem at the slave.

After comparison of both the formats, I personally believe that Row based Replication in MySQL is the way to go because:

Its the safest form of replication
It is safer when it comes to replicating triggers and stored procedures.
It requires fewer locks and hence its much faster and achieves high concurrancy.
Regarding the drawback of having large binlog, I would say that not all applications may issue updates that effect thousands of rows. Comparing to the other advantages that row based replication provides and an informed guess I would say that this disadvantage may be ignored.

The choice of which format to choose depends on the application and requirements. I have not touched mixed format. My guess is that mixed format may bring in the best of both world. I will be discussing mixed format replication in detail, in my upcoming post.

Sunday, July 10, 2011

Memcached and MySQL (Part II – Memcached + MySQL)

In part one of this blog (you can see it here) I gave a detailed overview of what Memcached really is. In this part I will address the usage of Memcached to alleviate the load from MySQL. In a scalable + high traffic application, database will prove to be a bottleneck due to the limitations of disk read/write rate, which is slow as compared to reading from memory. As Memcached is a memory-based distributed object storage cache, we can utilize the power of Memcached. Will avoiding database a good optimization? Yes it is. This is where Memcached will come into play, avoiding the requests to hit MySQL. Well not in all cases but yes to a considerable extent. Having Memcache installation on the same server as MySQL and using it will help the data source to perform much better. When I say Memcache it means that I am refering to only one instance of Memcache and not a distributed version (in the latter case its Memcached). Having Memcache on the same server as MySQL means that memory will be distributed between Memcache and MySQL which is not what I prefer. I would prefer a separate server dedicated to Memcache so MySQL can utilize the memory and definitely get more of it. Below is the architecture that I am considering:

As you can see that we have a MySQL server and a Memcache Server. We generally have queries that are either fetching data from database or adding/deleting/updating data in the database. For each type we can follow a particular sequence:

Read Queries:

Check Memcached if the data set is available
If its available then congratulations you just avoided a hit to the database
If its not, too bad. Get the result set from the database
You want to avoid hitting the database again, so set the data set in Memcache

Write Queries:

Write data to the database
If the write was a success, then either drop the data set from the Memcache or update the Memcache with current information

Of course this is a very naive concept when it comes to the practical implementation. Practically things are very much different and a bit complicated when it comes to the usage of Memcached. Additionally there can be different levels of caching in an application which are constructed overtime in the life cycle of an application and are based on the needs. Also we can increase the Memcache servers to develop a Memcached cluster to cater the needs of ever increasing data. Now we can clearly see that the performance of the application will increase greatly by reducing the load on the database, using an in-memory key value storage.

Sunday, July 3, 2011

Memcached and MySQL (Part I - Memcached)

We all know that disk I/O is expensive then memory and we also know that data in memory is volatile but on disk is non-volatile. Talking about relational database storage, data is stored on disk which means that it is non-volatile but the retrieval and storage is slower then memory. On the contrary, if we store data in memory the retrieval and storage is super fast but the data is volatile and thus prone to loss. The question is whether we can get the best of both worlds. The question will be anwered ahead. Lets first see what Memcached is. Below is a point-wise explanation of Memcached (will try to cover as much as I can)

Memcached in a nutshell

Memcached is a distributed in-memory object caching system. It's distributed, which means that Memcached does not represent a single server but can span hundreds of computers. Its in-memory, which means that all the objects are stored in memory (RAM). Memcached is a distributed version of Memcache.
It's an in-memory key/value store, where data/object can be stored using the key as an identifier of the data/object.
Data is in-memory and is therefore volatile. Its good as a cache but not good for data that needs to be persisted and the loss of which might not be good for the application or the users.
All operations in the Memcached take constant time and hence their complexity is O(1). The basic operations of Memcached are add, set, get, multi-get, delete, replace (Note: a set after a set on the same key is considered to be an update).
All items in Memcached have expiration time. An expiration time of zero '0' means that the item will never expire (here never means 30 days of expiration time). If the expiry time is greater then 30, it will be treated as a UNIX timestamp.
Memcached does not have a garbage collection mechanism. You need to either explicitely delete the item, get an item that is already expired, or wait for Memcached to run out of alloted memory. In short, Memcached memory reclaiming is lazy, which is logical keeping in mind the complexity and processing involved in garbage collection.
Memcached reclaims the memory using the following mechanism:
1. If an item is requested, Memcached checks it's expiry time. If the item is expired, it returns a negative response and reclaims the memory by freeing its memory.
2. If Memcached is unable to accommodate any new items, it starts to free the memory of LRU (least recently used) items in order to accommodate new items.
Memcached servers are isolated, which means that one server is unaware of the presence of another server. Where to route the request is the responsibility of the Memcached client library.
Generally you do not need authentication mechanism for Memcached and previously it was not even supported. Now if the client supports, SASL authentication can be used. Generally Memcached infrastructure is in a closed internal network and hence having authentication and other security measures may complicate and introduce unwanted latency to an otherwise simple concept.
Memcached has a client part and the server part. The client part is responsible for routing the request to an appropriate Memcached server in the Memcached server cluster, managing connection and handling failures. The server part is responsible for request processing and reclaiming memory.
You can cache objects, queries, data-set and anything sensible in the Memcached. Just remember its a cache and not a persistent storage.
There is no replication or a fail-over mechanism in Memcached.
Compression and Serielization of cache objects should be investigated when selecting an appropriate client for Memcached. Also connection handling mechanism should be carefully read in order to avoid connection leakages which will render the Memcached server useless.
Hashing Algorithm depend on the clients. Generally 'Consistent Hashing' algorithm is implemented by the clients. This algorithm devises a strategy to distribute the keys across several Memcached servers evenly but the biggest advantage comes in when new servers are added to the Memcached cluster. This algorithm minimizes the number of re-hashed keys whenever a new server is added in comparison to the normal hashing algorithms where re-hashing is considerable.

Below is a diagram that shows the client part and server part of Memcached (in a Memcached cluster of two Memcached servers):

The next part will discuss how we can use Memcached to allieviate the load from the database server, which was the actual motive of this post.