Sunday, June 26, 2011

MySQL DATETIME vs TIMESTAMP

A well-designed application has many traits. Good schema design is one of them. Choosing the right data type serves as a contributor towards a good schema design. The selection of a data type depends on various questions like what is the range of values that the column will hold?, the precision required by the column?, and if any special behavior is required. Choosing an optimal data type may save space on disk and memory, and may save CPU cycles. Generally such consideration are trivial when we talk about smaller applications but are very important for large scale applications. MySQL offers two alternatives for storing date time values namely DATETIME and TIMESTAMP. Which one is the best is not a good question but which one to use and what are the characteristics of each one is a better question and knowing the answer to the latter question will help us selecting the best option.

Storage Format

When using DATETIME, MySQL stores date time information as an integer in the format of YYYYMMDDHHMMSS. When displaying this data MySQL formats this storage representation into an ANSI standard representation which is YYYY-MM-DD HH:MM:DD. On the other hand, MySQL just stores the number of seconds elapsed since January 1, 1970 12:00 AM when it comes to storing date time information as TIMESTAMP.

Range

DATETIME can hold date time information from the year 1001 to the year 9999 whereas TIMESTAMP can hold date time information from the year 1970 to the year 2038.

Storage

DATETIME uses 8 bytes of storage whereas TIMESTAMP uses 4 bytes of storage, almost half of what DATETIME uses. So if TIMESTAMP is feasible for usage, you shouldn't mind using it.

Timezone Dependence

Storage and display of date time information when using DATETIME is independent of the timezone whereas in the case of TIMESTAMP, the date time information displayed is dependent on the time zone.

Special Characteristics of TIMESTAMP

TIMESTAMP type has some characteristics that are absent in DATETIME type. Below are the characteristics:

  • The display of date time when using TIMESTAMP type is dependent on the time zone. This means that MySQL will return the date time information after converting it into an appropriate time zone. By default the time zone is the server's time and can be changed on a per connection basis.

  • By default, if you insert a row in the table without specifying the value of the timestamp column, MySQL will set the time of that column to be the current time. Similar is the case when a row is updated, MySQL will update the TIMESTAMP column with the current time if the date is not specified in the update query. Obviously we can change the default behaviour of the TIMESTAMP data type.

Hopefully this information will help us distinguish the two data types and to choose the right data type for the right purpose.

Wednesday, June 22, 2011

Scale-Out MySQL

As a web applications grows, they face high peak times and a burst of requests that they have to handle. On an application stack, software components can be scaled with ease by just adding new application servers. The hardest part is to scale the data layer. One question that comes into our mind is to why scaling the database is necessary and what matters most? The database or the application. The answer to this is that the stand alone database will at some point become a bottle neck for the application. If the application is capable of servicing 800 requests per second then because of the database, the application will be able to serve far lesser requests then it is capable of. In order to explain the bottleneck lets analyse the architecture of MySQL:

Figure taken from:
http://dev.mysql.com/doc/refman/5.1/en/pluggable-storage-overview.html

If you analyse the lowest level of this architecture, you will see a bottleneck. Yes you are right, its the file system or the disk on which the data is eventually stored. No matter how fast and efficient the data retrieval and storing algorithms are, there will always be a limitation of the disk in short the requests are I/O bound.

Now that we are aware of the bottleneck that effects the application's performance, we have to come out with different solutions in order to improve the response/serving time of the data layer. There are different architectures that provide additional performance. Lets first look at the traditional application architecture:


The above architectures are traditional architectures, that almost every application may have, or at-least most. As we can see in figure 2, the application layer is scaled to two app servers which improves the performance a little bit, but the problem is the data layer. Of course, scaling the application layer will definitely have some issues but those issues are out of scope for this blog. So lets start by talking about scaling the database. Yes by scaling I mean increasing the number of database servers to distribute the load between them.

Concepts:
  1. A query is either a read or a write query. I call them passive queries or active queries. Queries that change the state of the database are considered to be active queries (insert, update, & delete) whereas the queries that do not change the state of the database are passive queries (select).
  2. Master Server: A server that serves as the primary data source for the application. This server has the most updated state of the data.
  3. Slave Server: A server that serves as a backup data source for the master server and shares the load with the master server. The state of the data in this server is 'eventually consistent'.

Improvement 1:
This improvement introduces one or more slaves for a master in the architecture. The active queries are executed at the master and the updates are reflected on the slaves after a slight delay. The passive queries may be executed on any of the servers including the master server. This approach reduces the load on master server as now the load is shared between more then one server. Below is the architecture:


As we can clearly see that the slaves share the burden of the reads on the master. We can add some more slaves in order to improve performance further.


Improvement 2:
So in order to increase the performance we keep on increasing the slaves and wait for the magic. But soon we observe that the spell proves to be ineffective and the master server is lagging. The reason for this lag are the slaves. MySQL master slave configuration requires master to record active queries in a log. The slaves, in order to update the data connect to master, access the log, get all the new updates in it and apply the updates to the slave data. This way the slaves are synchronized with the master. So many slaves trying to synchronize themselves with the master server by making connection with the master and accessing the log to get the updates, makes the master lag. Having this problem, we can utilize the 'blackhole' engine to solve the master lag issue.
So instead of Slaves connecting directly to the master, we introduce a dummy master that acts as a slave to master, getting all the changes in the binlog of the master and recording it in its binlog. The only difference between master and the dummy master is the table type. All the table in the dummy master use black hole engine. It means that the actual data is not stored and the queries are not performed on the dummy master but rather just logged. The slaves can connect to the dummy master just like in Improvement 1 and get synchronized. Below is how the architecture should look like:


As we can now see that all the slaves do not connect to master but rather they connect to the dummy master and the dummy master connects to the master. We can add some more dummy master in order to accommodate more slaves.

Few Notes:
  1. In this post I only talked about Scaling Reads and not Scaling Writes.
  2. There is a single point of failure and yes that's the master. You need to have a replicated secondary master so in case the primary master fails, the secondary master takes its role.
  3. You cannot add up infinite slaves to the master or black hole master.
  4. There is always a replication lag in slaves and that should be considered.
  5. If the master is write-heavy, expect a bit more delay in replication.

In this blog, I went through the conventional application architecture, then devised two scaleable architectures for MySQL that we can use for high performance demanding applications. My next posts will be:

  • Statement-based and Row-based Replication
  • MySQL + MemcacheD
  • Setting up MySQL Master/Slave replication
  • Setting up MySQL Master/Master replication