Tuesday, December 27, 2011

Architecture for Scalable Resource Discovery (Part I)

About the Post
This blog post grew out of my project for "Distributed Computing", a course I took in Fall 2011, my final semester of graduate studies. It was a group project, so it is only fair to credit the group members involved. Alongside me, my group mates were:
  • Justin Roberts
  • Darshan Lathia

The problem we were trying to solve was open ended. We had to design a distributed architecture that supports a resource discovery system handling millions of discovery requests. For example, suppose a user wants to access a particular resource on the cloud. This resource is available at different locations on the internet, so before using it, the user first has to locate it. The purpose of this project was to design a system that allows a user to locate a resource. The system should be able to handle huge traffic coming from all over the world, should be scalable and robust enough to handle the load, and should be fault tolerant so it can keep serving requests when something goes wrong or crashes.

In this paper, we explore the inner workings of our architecture. This architecture is a complex layering of protocols which individually solve various problems, but all work together to serve the common purpose of resource discovery in a massively distributed system. We intend to analyze our algorithms by providing a small example of how they function, and showing that each one is a robust solution for accomplishing its intended task. Finally, we will conclude by explaining our simulation strategy, and how we would accomplish a simulation if we were to fully develop our distributed system concept.

Overview of Architecture
Figure 1 - Overall Architecture
In Figure 1, we present the top-level overview of our regions. Whenever a client requests any hardware or software resource, the request first lands at the DNS (the request is made to a domain name). The DNS server has a list of IP addresses corresponding to the domain name specified in the request, and routes the request to one of those addresses (in a random or round-robin fashion). The IP addresses represent routers. On receiving a request, a router forwards it to an appropriate regional server based on the location of the client, identified from the client's IP address. A client's request is routed to the region closest to the client's location.
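The round-robin routing described above can be sketched in a few lines of Python (a toy illustration; the router IP addresses are invented):

```python
from itertools import cycle

# Hypothetical router IPs returned by DNS for the service's domain name.
router_ips = ["203.0.113.1", "203.0.113.2", "203.0.113.3"]

# cycle() yields the addresses in order and wraps around forever,
# which is exactly the round-robin behavior described above.
round_robin = cycle(router_ips)

def next_router():
    """Return the router that should receive the next request."""
    return next(round_robin)
```

Each successive request is handed to the next router in the list, so the load spreads evenly across all of them.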
Figure 2 - Region Architecture
A region consists of regional servers and clusters. The number of regional servers depends on the request load from clients in that particular region (they are auto-scalable). There is one resource cluster for each of the available resource types (for example, one cluster could be devoted entirely to Microsoft resources). The regional server looks at the Resource Type in a client's request and forwards it to one of the Cluster Servers belonging to that Resource Type's cluster, again in a round-robin fashion. All the regional servers of a particular region are connected to every cluster in that region, so if one of the regional servers fails the system continues to function thanks to the redundant links.

Figure 3 - Cluster Architecture

Each cluster is made up of a number of cluster servers (auto-scalable) and replication server rings. The cluster servers receive client requests from the regional servers and forward each request to any replication server in the appropriate ring, based on the ResourceID. The replication servers are arranged in multiple rings, each of which can be accessed by any of the cluster servers. Each ring in a cluster represents a shard (sharding is done on the resource ID), and each replication server contains resource locations. The replication servers inside a ring are mirrors, that is, they contain the same information as all the other replication servers in the same ring (eventually consistent). Replication servers contain all the resource discovery data required by clients. Clients can read from or write to any of the replication servers. Whenever a client performs an update on one of the replication servers, the update is synchronized within that ring and then sent to the cluster server, which replicates the change to the corresponding rings in the clusters of other regions. As the load on a particular server increases, the number of cluster servers can be increased to handle more requests. If one of the cluster servers fails, the other cluster servers handle the incoming client requests. So this architecture is fault tolerant as well as scalable.

Figure 4 - Ring Synchronization

Figure 4 shows that when there is an update on any of the replication servers, the other servers must be updated accordingly. Whenever one of the servers receives an update, it acts as the leader of the bidirectional ring and sends the update to its neighboring servers. In this way all the servers in the ring remain synchronized.

Figure 5 - Cluster Synchronization between Regions

As already mentioned above, each region consists of clusters for every Resource Type. The other regions also have an identical set of clusters. The clusters of the same Resource type in different regions are synchronized so all the data updated in one cluster is passed on to every other cluster of the same resource type. Since all clusters are synched, if a cluster of a particular region fails or even if the whole region is down, client requests can be redirected to the next closest region. More details on cluster synchronization will be provided at a later time.

Figure 6 - Region Interaction Topology
The clusters within each region will sync with only a small subset of neighboring regions, creating a spanning tree of regions as shown in Figure 6. With this approach, data in all regions remains synchronized while each region only has to update a small subset of the other regions.

For this deliverable, we are assuming that the resources being requested are read-only. Since this paper only describes the resource discovery protocol, there is no need for end users to modify data at this moment. The resource modification techniques will be presented in a future deliverable. We also assume that we have a client application, where the user will browse to find the desired resource type, and then select the resource to discover. The client application will then make a request consisting of a number of fields (User IP, ResourceType, ResourceID) which will be used for request routing.

Overview of Internal Algorithms
In order to make our designed solution work as intended, there are a few different algorithms which will run at the different layers of our architecture, or which will control the interaction between layers.
The first example is what is referred to as the Cluster Sync algorithm. This algorithm is responsible for maintaining consistent clusters across regions. If a resource id is added or removed from a particular cluster, the Cluster Sync algorithm takes care of synchronizing these updates across clusters.

Another example is the algorithm that creates the spanning tree of regions. Once the regional spanning tree is created, the clusters within the regions will be aware of which other regions they are responsible for updating to ensure data consistency across all regions.

A third example is the Ring Sync algorithm, which replicates updates among the data servers that make up a ring. Each ring represents a shard (horizontal partition) of the resource records, and the data in a ring is replicated among all the data servers that are part of that ring; the full step-by-step process is covered in the analysis section below.
Algorithm Analysis
There are four main algorithms which control the sharing and distribution of resources: the Regional Minimum Spanning Tree, Cluster Synchronization, Ring Synchronization, and finally the Resource Discovery algorithm which allows a user to search for and use a particular resource. The following is our analysis of how each of these algorithms functions, and why we consider each to be robust.

The algorithm that connects the various regions across the world is based on a minimum spanning tree, which lets the clusters within each region know which other regions they are responsible for updating to ensure data consistency across all regions. At the region level, the most efficient way to create the minimum spanning tree is the well-known Asynchronous GHS (AsyncGHS) algorithm. The process of creating the regional spanning tree is as follows:
  1. A single region can contain multiple regional servers, so one server from each region is chosen to be a leader for that particular region. Leader election within a region is assumed to be trivial. Since there will only be a small number of region servers within a region, a leader could be chosen manually.
  2. The leaders of each region will begin an instance of the AsyncGHS algorithm. All regions are assumed to be connected in a complete graph topology, so core edges will initially be determined by least-cost paths between regions.
  3. Once the AsyncGHS algorithm has finished the merging and absorption process, each region's leader will know which other regions it is directly connected to in the resulting minimum spanning tree.
  4. The leader of each region will propagate the resulting spanning tree to the other regional servers within its region.
Creating a minimum spanning tree to connect the regions together will minimize the number of neighboring clusters that each region is responsible for updating when changes are made. In addition, we assume that larger distances mean greater link cost (either in latency or monetary), so regions far from each other will have a greater cost and hence be less likely to become neighbors. Therefore, geographic separation between regions will ensure that no single region will have too many neighbors.
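AsyncGHS itself is an involved distributed algorithm, but the tree it converges on is just the minimum spanning tree of the region graph. As a centralized stand-in for illustration only, here is a sketch using Prim's algorithm; the region names and link costs are invented:

```python
import heapq

def region_mst(costs):
    """Prim's algorithm over a complete graph of regions.

    costs: dict mapping frozenset({a, b}) -> link cost (latency/monetary).
    Returns the set of tree edges, each as a frozenset of two regions.
    """
    regions = {r for edge in costs for r in edge}
    start = next(iter(regions))
    visited = {start}
    # heap of (cost, edge) candidates leaving the visited set
    heap = [(c, tuple(sorted(e))) for e, c in costs.items() if start in e]
    heapq.heapify(heap)
    tree = set()
    while heap and len(visited) < len(regions):
        c, (a, b) = heapq.heappop(heap)
        new = b if a in visited else a
        if new in visited:
            continue  # both endpoints already in the tree
        visited.add(new)
        tree.add(frozenset((a, b)))
        for e, cost in costs.items():
            if new in e and not e <= visited:
                heapq.heappush(heap, (cost, tuple(sorted(e))))
    return tree

# Toy link costs between four regions (complete graph); higher cost
# stands in for greater geographic distance.
costs = {
    frozenset({"US", "EU"}): 80,
    frozenset({"US", "ASIA"}): 150,
    frozenset({"US", "SA"}): 90,
    frozenset({"EU", "ASIA"}): 100,
    frozenset({"EU", "SA"}): 120,
    frozenset({"ASIA", "SA"}): 200,
}
```

With these costs the tree keeps the three cheapest links that still connect everything (US-EU, US-SA, EU-ASIA), so the distant ASIA-SA pair never become neighbors, matching the geographic argument above.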

The Cluster Sync algorithm is responsible for maintaining consistent clusters across regions. If a resource id is added to or removed from a particular cluster, Cluster Sync takes care of synchronizing the update across clusters. Once an MST is created to connect the regions, each region knows which other regions to synchronize. Below are the steps for cluster synchronization:
  1. When an update is made to any region (suppose Region1), that region should synchronize its neighbors (suppose Region2 and Region3).
  2. In order to synchronize the neighbor regions, the cluster servers (one or more of a specific resourceType) connect to cluster servers (one or more of the same resourceType) of Region2 and Region3.
  3. The cluster servers of Region1 will now send the updates to its neighboring cluster servers (of Region2 and Region3).
  4. The receiving cluster servers then apply the update to the appropriate shard, using the same consistent hashing strategy on the resource ID that they use to locate a shard when serving a client request.
  5. A similar process is performed at Region2 and Region3 to synchronize their own neighbors.
Using this process, all of the regions are synchronized. One reason to synchronize all of the regions is to allow the whole system to function properly in case an entire region goes down. Given this situation, we can route the traffic (of the failed region) to neighbor regions or any region we want.
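The flooding of an update along the spanning tree can be sketched as follows (a simplified single-process simulation; the region names, tree shape, and update payload are invented for illustration):

```python
def propagate_update(region, update, tree_neighbors, applied, sender=None):
    """Flood an update along the regional spanning tree.

    tree_neighbors: dict region -> set of neighbor regions in the MST.
    applied: dict region -> list of updates applied so far (mutated).
    Each region applies the update once, then forwards it to every
    tree neighbor except the one it came from, so no echoes occur.
    """
    applied[region].append(update)
    for nbr in tree_neighbors[region]:
        if nbr != sender:
            propagate_update(nbr, update, tree_neighbors, applied, sender=region)

# Toy spanning tree: US - EU - ASIA, with SA hanging off US.
tree = {
    "US": {"EU", "SA"},
    "EU": {"US", "ASIA"},
    "ASIA": {"EU"},
    "SA": {"US"},
}
applied = {r: [] for r in tree}
propagate_update("US", ("add", "resource-42", "10.0.0.5"), tree, applied)
```

Because the topology is a tree, every region receives the update exactly once, with no duplicate-suppression machinery needed at this level.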

Each ring represents a shard (horizontal partition) of the resource records, and the data servers are repositories of those records. The data in a ring is replicated among all the data servers which are part of the ring. The process is as follows:
  1. One data server receives an update (add/edit/delete of a resource location). Call this data server an active data server.
  2. The active data server propagates this update to its neighbor (two neighbors in the ring). Update includes an identification of the active server and the timestamp.
  3. A neighbor on receiving an update will apply the update locally, if there was no update with the same timestamp and identification received from another neighbor.
  4. The neighbor then propagates the same update to its neighbor.
In this way the data servers in the rings are synchronized to carry the current data. Of course there will be a synchronization delay but the data servers in the rings would be "eventually consistent".
We will use record-based synchronization to synchronize ring servers. In the record-based sync technique, the actual record is transferred from one server to its neighbors. This technique increases network traffic, but it is a good, efficient choice when we have non-deterministic query functions.
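Under these assumptions, the ring synchronization steps above can be sketched as a toy single-process model (the server names and the update are illustrative; real servers would exchange these messages over the network):

```python
class RingServer:
    """A data server in a bidirectional replication ring."""
    def __init__(self, name):
        self.name = name
        self.records = {}
        self.seen = set()        # (origin, timestamp) pairs already applied
        self.left = self.right = None

    def apply_update(self, origin, timestamp, key, value):
        """Apply an update locally, then forward it both ways around
        the ring. The (origin, timestamp) id stops duplicate application
        when the two directions of propagation meet."""
        if (origin, timestamp) in self.seen:
            return
        self.seen.add((origin, timestamp))
        self.records[key] = value
        self.left.apply_update(origin, timestamp, key, value)
        self.right.apply_update(origin, timestamp, key, value)

def build_ring(names):
    """Wire servers into a bidirectional ring."""
    servers = [RingServer(n) for n in names]
    for i, s in enumerate(servers):
        s.left = servers[i - 1]
        s.right = servers[(i + 1) % len(servers)]
    return servers

ring = build_ring(["r1", "r2", "r3", "r4"])
# r2 receives a client write and becomes the active data server.
ring[1].apply_update("r2", 1700000000, "resource-42", "10.0.0.5")
```

After propagation finishes, every server in the ring holds the same record, which is the eventual consistency described above.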

Resource Discovery is used to find an appropriate resource for a client based on its request. The algorithm operates as follows:
Whenever a client wants to use a particular resource, it formulates a request containing the Source IP Address, Resource Type, and Resource ID. The source IP address is used by the DNS server to determine the closest regional server for that client. The resource type helps in selecting a particular resource type cluster from all the clusters in that region (one cluster per resource type in every region). The resource ID helps in selecting a ring of replication servers among all the rings connected to the cluster server; this selection is done using consistent hashing on the Resource ID. The complete process, from initiating a request to allocating a resource for a client, is as follows:
  1. The client in need of a resource sends a request containing its source IP address, the resource type required, and the resource ID.
  2. The DNS server receives this request and, based on the client's IP address, selects the region closest to the client. It can also redirect the client to the next-closest region if the closest one is not functional for some reason.
  3. A region contains many regional servers. One of the regional servers is selected in a round-robin fashion, which helps with load balancing.
  4. Based on the resource type requested by the client, the regional server selects a resource type cluster. There is exactly one resource type cluster for every resource type a client may want to use in that region, so cluster selection is a simple mapping from the requested resource type to the clusters in the region.
  5. Each resource type cluster has several cluster servers. One of them is selected in a round-robin scheme, which again load-balances the requests coming from clients.
  6. The cluster servers are connected to the rings of replication servers. The number of rings depends on how frequently clients request that resource type. Consistent hashing on the resource ID is used to select a ring. Consistent hashing has properties, such as spread, load, smoothness, monotonicity, and balance, that make it a better fit here than standard hashing schemes.
  7. Clients can read from or write to any of these replication servers, as each server is a mirror of the others, i.e., all the servers contain the same data and are updated whenever another replication server changes.
The fact that the Resource Discovery algorithm partitions the system into regions based on a user's geographic location, together with the round-robin approach to server load balancing, ensures that the algorithm performs at an acceptable speed. Combined with the consistent hashing technique for locating a particular resource, this makes our resource discovery protocol both fast and scalable.
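The consistent-hashing step that maps a ResourceID to a replication ring can be sketched as a minimal hash ring (no virtual nodes, and the ring names are invented; a production implementation would add virtual nodes to smooth the load):

```python
import bisect
import hashlib

def _hash(key):
    # Stable position on a 32-bit circle.
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2 ** 32)

class HashRing:
    """Minimal consistent hash ring: each replication-server ring owns
    the arc of the circle that follows its hash position, so adding or
    removing a ring only remaps the keys on one arc (monotonicity)."""
    def __init__(self, rings):
        self.points = sorted((_hash(r), r) for r in rings)

    def ring_for(self, resource_id):
        keys = [p for p, _ in self.points]
        i = bisect.bisect(keys, _hash(resource_id)) % len(self.points)
        return self.points[i][1]

hr = HashRing(["ring-A", "ring-B", "ring-C"])
```

A given ResourceID always hashes to the same arc, so every cluster server independently picks the same ring for it, with no coordination needed.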

Example Resource Discovery
The resource discovery protocol which finds and returns resources is currently the focus of our attention. In order to retrieve a particular resource, the algorithm will operate as follows:
  1. A user will formulate a request for a particular resource type. This request will consist of multiple parts, including the source IP address (IP), resource type (RTYPE), resource id (RID), and the action being performed (DISCOVER, UPDATE, etc.).
  2. The user will attempt a connection, which through the local DNS servers will route the user to the nearest region servers. Since there can be multiple servers operating at the region level, the individual region server picked within the chosen region will be based on a round-robin scheme to load balance all incoming requests into the region.
  3. The region server will receive the request and analyze the RTYPE and action. The region server will know that this is a DISCOVER message (meaning the requesting node is attempting to locate a specific resource), so based on the value of RTYPE, the region server will forward the request to the appropriate resource cluster which services the requested resource type. In a similar manner to the region server operation above, requests will be routed to an individual cluster server within the chosen resource cluster in a round-robin fashion to account for load balancing within the cluster.
  4. The chosen cluster server will receive the request and analyze the RID. Based on this resource id, the cluster server will know which resource ring to forward the incoming request on to.
  5. The chosen server in the resource ring will receive the request, and reply to the incoming user directly with the given resource.
That's all for this post. In the next post I will give a brief overview of how we thought we could simulate this architecture: we decided to use Amazon EC2 to develop the architecture and topologies, and we also worked out a way to test a resource discovery request.

  • Amazon Web Services – Elastic Compute Cloud (EC2) [http://aws.amazon.com/ec2]
  • DeCandia, G., Hastorun, D., Jampani, M., Kakulapati, G., Lakshman, A., Pilchin, A., Sivasubramanian, S., Vosshall, P., and Vogels, W. 2007. Dynamo: Amazon’s Highly Available Key-value Store. (Stevenson, Washington, United States, October 14-17, 2007). SOSP’07. ACM Press, New York, NY, 205-220.

Note: There may be issues with the approach and architecture we presented, and there is certainly room for additions and improvements. Comments will be appreciated.

Monday, August 15, 2011

Installing Xdebug and webgrind on Ubuntu to debug and profile PHP

Xdebug is a PHP extension that allows us to debug and profile PHP scripts. It's a very handy tool that lets you profile PHP scripts to find performance bottlenecks, or trace and debug them. Below is the procedure to install Xdebug and use the information it generates for profiling and debugging.

In this post I will cover the basic installation of Xdebug on Ubuntu and then use the 'webgrind' front end to view the profiling data Xdebug generates. You can install Xdebug in two ways. The first works on older versions of Ubuntu (below 8.04), and the other works on Ubuntu 8.04 and above. The older method installs Xdebug through PEAR/PECL; if you have Ubuntu 8.04+, you can still install Xdebug through PEAR/PECL if you prefer.

Installing Xdebug through PEAR/PECL:
First install the 'php5-dev' package. This package provides php5 source files, needed to compile additional modules.
sudo apt-get install php5-dev

Next install the 'pear' package.
sudo apt-get install php-pear

Now install Xdebug.
sudo pecl install xdebug

Now find where Xdebug extension is, so the path could be included in the 'php.ini' file as a zend extension.
sudo find / -name 'xdebug.so'
Output: /usr/lib/php5/20090626+lfs/xdebug.so

Now open the 'php.ini' file (on Ubuntu with Apache this is typically at '/etc/php5/apache2/php.ini').

In 'php.ini', add the path to 'xdebug.so' as a zend extension. Below is a minimal configuration to get you started; a complete list of settings can be found at: http://xdebug.org/docs/all_settings

zend_extension = /usr/lib/php5/20090626+lfs/xdebug.so
xdebug.profiler_enable = 1
xdebug.profiler_output_dir = "/tmp"
xdebug.profiler_output_name = "cachegrind.out.%t.%p"

Now restart the 'Apache' server.
sudo service apache2 restart

New installation method:
First install Xdebug.
sudo apt-get install php5-xdebug

Now find where Xdebug extension is, so the path could be included in the 'xdebug.ini' file as a zend extension.
sudo find / -name 'xdebug.so'
Output: /usr/lib/php5/20090626+lfs/xdebug.so

Now open the 'xdebug.ini' file (typically at '/etc/php5/conf.d/xdebug.ini' on Ubuntu).

In 'xdebug.ini', add the path to 'xdebug.so' as a zend extension. Below is a minimal configuration to get you started; a complete list of settings can be found at: http://xdebug.org/docs/all_settings

zend_extension = /usr/lib/php5/20090626+lfs/xdebug.so
xdebug.profiler_enable = 1
xdebug.profiler_output_dir = "/tmp"
xdebug.profiler_output_name = "cachegrind.out.%t.%p"

Note: These are the same configuration settings as mentioned in installation method above. For explanation of each configuration refer to the previous method.

Now restart the 'Apache' server.
sudo service apache2 restart

After installation:
Now that you have successfully installed Xdebug, it's time to test the installation. Execute any PHP script you have, or write and execute a simple one. After execution, you should see an Xdebug profile file in the "/tmp" directory, which confirms the installation was successful.

Now that the profiles are being generated, let's switch to 'webgrind', a web-based tool that presents the profile information in a meaningful fashion.

First download 'webgrind' (the package is available from the project's homepage).
Extract the compressed files to "/var/www/webgrind", or any other location served by Apache.
Now open the webgrind config file, 'config.php', in the directory where you extracted the package (e.g. '/var/www/webgrind/config.php').

In 'config.php', change the following variables:
static $storageDir = '';  // DIRECTORY WHERE YOU WANT TO STORE webgrind FILES
static $profilerDir = '/tmp'; // DIRECTORY WHERE THE Xdebug PROFILE FILES ARE

Now you can run 'webgrind' by browsing to its 'index.php', e.g. at 'http://localhost/webgrind/index.php', or wherever you extracted the 'webgrind' package.

Sunday, July 17, 2011

MySQL Replication: Statement based Replication vs Row based Replication

Before discussing the replication formats in MySQL, it is necessary to discuss how replication works in MySQL. Why is replication required? The answer lies in my earlier post, in which I described the process of scaling MySQL for reads (Scale-Out MySQL). Let's proceed by discussing the replication process, which consists of the following steps:
  1. The MySQL master writes any changes that occur to the 'binlog', a log that contains every update/insert/delete made on the master.
  2. The slave's I/O thread reads the binlog from the master and writes the events to its 'relaylog'.
  3. The slave's SQL thread reads the events from the relaylog and applies them to the slave.
  4. Steps 1 to 3 are repeated so the slave stays synchronized with the master at all times.
Below is the diagram that will aid you in understanding the process of replication in MySQL.
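The replication steps above can be sketched as a toy, single-process simulation (the Master/Slave classes and the event tuples are my own simplified stand-ins for the real binlog and relay-log machinery):

```python
class Master:
    def __init__(self):
        self.data = {}
        self.binlog = []          # step 1: every change is appended here

    def execute(self, key, value):
        self.data[key] = value
        self.binlog.append(("set", key, value))

class Slave:
    def __init__(self):
        self.data = {}
        self.relaylog = []
        self.pos = 0              # how far into the master's binlog we've read

    def io_thread(self, master):
        # step 2: copy new binlog events into the relay log
        self.relaylog.extend(master.binlog[self.pos:])
        self.pos = len(master.binlog)

    def sql_thread(self):
        # step 3: replay relay-log events against the slave's data
        while self.relaylog:
            op, key, value = self.relaylog.pop(0)
            if op == "set":
                self.data[key] = value

master, slave = Master(), Slave()
master.execute("user:1", "alice")
master.execute("user:2", "bob")
slave.io_thread(master)   # step 2
slave.sql_thread()        # step 3
# step 4: repeating io_thread + sql_thread keeps the slave in sync
```

After one io/sql cycle the slave's data matches the master's, and rerunning the two threads picks up any further changes.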

Now that you are aware, to some extent, of the replication process, let's dive into the replication formats. By replication format I mean the format in which events are recorded in the master's binlog. There are three replication formats:
  1. SBR or Statement based Replication
    In this format the MySQL master records the events as SQL statements in the binlog. The statements are picked up by the MySQL slaves and replayed in the same way as they were executed on the master.
  2. RBR or Row based Replication
    In this format the MySQL master records the events as the actual rows, indicating how the rows changed on the master.
  3. Mixed Mode Replication
    This format is a mix of RBR and SBR. MySQL switches the format in real time depending on the type of event.
Of course there are pros and cons to each format; I will primarily address RBR and SBR, and discuss the pros and cons of each.

Statement based Replication
Advantages:
  1. Since the events are logged as SQL statements and not as row changes, the log files are much smaller and use less storage space.
  2. The log file contains all the SQL statements, so it can be used for analysis, audit, or restoring from backup.
  3. In terms of bandwidth, statement based replication is much more efficient, because it is the queries that are transferred to the slave and not the actual row updates.
Disadvantages:
  1. Statements with non-deterministic behavior are difficult to replicate in this format.
  2. Statements that use stored procedures or user defined functions (UDFs) are considered nondeterministic and hence are difficult to replicate using SBR.
  3. DELETE and UPDATE statements that use LIMIT without an ORDER BY are considered nondeterministic and hence cannot be replicated using SBR.
  4. UPDATE statements with a WHERE clause that doesn't use an index require more locks in SBR than in RBR.
  5. INSERT statements that use auto-increment block other non-conflicting INSERT statements. This applies to the InnoDB engine and is very important to consider, as it affects concurrency.
  6. More locks are required on the slave for INSERT/UPDATE/DELETE.
  7. Complex statements that are evaluated and executed on the master must go through the same process on the slave before the changes are applied.
  8. Table definitions must be identical on the master and the slave.
Row based Replication
Advantages:
  1. It's the safest form of replication, though not as old as statement based replication.
  2. Fewer row locks are required for INSERT statements with auto-increment.
  3. Fewer locks are required for INSERT/UPDATE/DELETE on the slave when compared to statement based replication.
  4. Fewer locks are required when an UPDATE or DELETE statement's WHERE clause does not make use of an index.
Disadvantages:
  1. Contrary to statement based replication, row based replication records changes to each row. The most significant consequence is the size of the binlog: if a SQL statement affects 100 rows, then in SBR just one query is logged, whereas in RBR changes to all 100 rows are recorded. This makes binlogs in RBR much larger than in SBR.
  2. In SBR the SQL queries are logged, so they can be reread and audited. In RBR there is no way to figure out which statements were executed on the master and received on the slave; however, which rows were changed or inserted can be decoded.
  3. There is no mechanism in RBR to verify that the binlog from the MySQL master was processed without problems on the slave.
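The binlog-size difference described above can be made concrete with a toy sketch of what gets logged for one UPDATE touching many rows (the table and the event formats are invented for illustration):

```python
# A toy table: id -> status, with 100 matching rows.
table = {i: "pending" for i in range(100)}

# The change: UPDATE table SET status='done' WHERE status='pending'
matched = [i for i, s in table.items() if s == "pending"]

# SBR logs the statement itself, exactly once.
sbr_binlog = ["UPDATE table SET status='done' WHERE status='pending'"]

# RBR logs a (row id, before, after) image for every affected row.
rbr_binlog = [(i, "pending", "done") for i in matched]

# Apply the change to the toy table.
for i in matched:
    table[i] = "done"

print(len(sbr_binlog), len(rbr_binlog))
```

One statement versus one hundred row events: multiplied across a busy workload, that is why RBR binlogs grow so much faster than SBR binlogs.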
After comparing both formats, I personally believe that row based replication in MySQL is the way to go, because:
  • It's the safest form of replication.
  • It is safer when it comes to replicating triggers and stored procedures.
  • It requires fewer locks, so it is much faster and achieves higher concurrency.
  • Regarding the drawback of large binlogs: not all applications issue updates that affect thousands of rows. Weighed against the other advantages that row based replication provides, I would say this disadvantage can often be ignored.
The choice of format ultimately depends on the application and its requirements. I have not touched on the mixed format; my guess is that it may bring in the best of both worlds. I will discuss mixed format replication in detail in an upcoming post.

Sunday, July 10, 2011

Memcached and MySQL (Part II – Memcached + MySQL)

In part one of this blog (you can see it here) I gave a detailed overview of what Memcached really is. In this part I will address using Memcached to take load off MySQL. In a scalable, high-traffic application, the database will prove to be a bottleneck, because disk reads and writes are slow compared to reading from memory. As Memcached is a memory-based distributed object cache, we can use it to our advantage. Is avoiding the database a good optimization? Yes it is, and this is where Memcached comes into play, preventing requests from hitting MySQL; not in all cases, but to a considerable extent. Note that when I say Memcache I am referring to a single instance, not the distributed version (which is Memcached). Having a Memcache installation on the same server as MySQL would mean the memory is split between Memcache and MySQL, which is not what I prefer. I would prefer a separate server dedicated to Memcache, so MySQL can keep more memory for itself. Below is the architecture I am considering:

As you can see, we have a MySQL server and a Memcache server. Queries generally either fetch data from the database or add/delete/update data in it. For each type we can follow a particular sequence:

Read Queries:
  1. Check Memcached if the data set is available

  2. If it's available then congratulations, you just avoided a hit to the database

  3. If it's not, too bad. Get the result set from the database

  4. You want to avoid hitting the database again, so set the data set in Memcache

Write Queries:

  1. Write data to the database

  2. If the write was a success, then either drop the data set from the Memcache or update the Memcache with current information

Of course this is a very naive description compared to a practical implementation; in practice things are rather different and a bit more complicated. There can also be different levels of caching in an application, constructed over time during the application's life cycle based on its needs. We can also add Memcache servers to grow a Memcached cluster that caters to the needs of ever-increasing data. Now we can clearly see that the performance of the application will increase greatly by reducing the load on the database, using an in-memory key-value store.
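The read and write sequences above are the classic cache-aside pattern. Here is a sketch in Python, using a plain dict as a stand-in for the Memcache client and another dict as a stand-in for MySQL:

```python
cache = {}                      # stand-in for the Memcache server
database = {"user:1": "alice"}  # stand-in for MySQL

def read(key):
    # steps 1-2: check the cache first; a hit avoids the database entirely
    if key in cache:
        return cache[key]
    # step 3: cache miss, so fall through to the database
    value = database.get(key)
    # step 4: populate the cache so the next read is a hit
    cache[key] = value
    return value

def write(key, value):
    # step 1: write to the database first
    database[key] = value
    # step 2: drop the cached copy (invalidation) so readers never
    # see stale data; updating the cache in place is the alternative
    cache.pop(key, None)
```

Invalidate-on-write is the simpler of the two options from step 2 of the write sequence; updating the cache in place saves the next read a database trip but risks writing a value that is never read again.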

Sunday, July 3, 2011

Memcached and MySQL (Part I - Memcached)

We all know that disk I/O is more expensive than memory access, and that data in memory is volatile while data on disk is non-volatile. With relational database storage, data lives on disk, which means it is non-volatile but slower to store and retrieve than from memory. On the contrary, if we store data in memory, retrieval and storage are super fast but the data is volatile and thus prone to loss. The question is whether we can get the best of both worlds; it will be answered ahead. Let's first see what Memcached is. Below is a point-wise explanation (I will try to cover as much as I can).

Memcached in a nutshell

  1. Memcached is a distributed in-memory object caching system. It's distributed, which means that Memcached is not confined to a single server but can span hundreds of machines. It's in-memory, which means that all objects are stored in memory (RAM). (The shorter name "Memcache" usually refers to a client library, such as the PHP Memcache extension, rather than the server itself.)

  2. It's an in-memory key/value store, where data/object can be stored using the key as an identifier of the data/object.

  3. Data is in-memory and is therefore volatile. It's good as a cache, but not for data that needs to be persisted and whose loss would hurt the application or its users.

  4. All operations in Memcached take constant time, hence their complexity is O(1). The basic operations are add, set, get, multi-get, delete, and replace (note: a set after a set on the same key is treated as an update).

  5. All items in Memcached have an expiration time. An expiration time of zero '0' means the item never expires. A non-zero value up to 30 days (2,592,000 seconds) is treated as an offset from the current time; anything larger is treated as an absolute UNIX timestamp.

  6. Memcached does not have a garbage collection mechanism. An item's memory is reclaimed only when you explicitly delete it, when you request an item that has already expired, or when Memcached runs out of its allotted memory. In short, memory reclamation in Memcached is lazy, which is logical keeping in mind the complexity and processing a garbage collector would involve.

  7. Memcached reclaims memory using the following mechanism:

    1. If an item is requested, Memcached checks its expiry time. If the item has expired, it returns a negative response and frees the item's memory.

    2. If Memcached is unable to accommodate a new item, it starts evicting the least recently used (LRU) items in order to make room.

  8. Memcached servers are isolated: one server is unaware of the presence of another. Routing each request to the right server is the responsibility of the Memcached client library.

  9. Generally you do not need an authentication mechanism for Memcached, and previously it was not even supported. Now, if the client supports it, SASL authentication can be used. A Memcached infrastructure usually sits inside a closed internal network, so authentication and other security measures may complicate, and add unwanted latency to, an otherwise simple system.

  10. Memcached has a client part and a server part. The client is responsible for routing each request to the appropriate server in the Memcached cluster, managing connections, and handling failures. The server is responsible for processing requests and reclaiming memory.

  11. You can cache objects, queries, data sets, and anything sensible in Memcached. Just remember it's a cache, not persistent storage.

  12. There is no replication or a fail-over mechanism in Memcached.

  13. Compression and serialization of cached objects should be investigated when selecting a Memcached client. The client's connection-handling mechanism should also be read carefully in order to avoid connection leaks, which will render the Memcached server useless.

  14. The hashing algorithm depends on the client. Generally, clients implement 'consistent hashing'. This algorithm distributes keys evenly across several Memcached servers, but its biggest advantage shows when new servers are added to the cluster: it minimizes the number of keys that must be re-hashed, whereas with normal (modulo-style) hashing the re-hashing is considerable.
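To see why consistent hashing minimizes re-hashing, here is a toy hash ring of the kind many Memcached clients implement. The server names, the virtual-node count, and the use of MD5 are illustrative assumptions, not any particular client's implementation.

```python
import hashlib

def _hash(value):
    # Map a string to a point on the ring (a large integer).
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

def build_ring(servers, vnodes=100):
    # Each server owns many points on the ring; more virtual nodes
    # give a more even key distribution.
    ring = []
    for server in servers:
        for i in range(vnodes):
            ring.append((_hash("%s#%d" % (server, i)), server))
    ring.sort()
    return ring

def locate(ring, key):
    # Walk clockwise to the first server point at or after the key's hash.
    h = _hash(key)
    for point, server in ring:
        if point >= h:
            return server
    return ring[0][1]  # wrapped around the ring

ring = build_ring(["cache1", "cache2", "cache3"])
bigger = build_ring(["cache1", "cache2", "cache3", "cache4"])

keys = ["user:%d" % i for i in range(1000)]
moved = sum(1 for k in keys if locate(ring, k) != locate(bigger, k))
# Only roughly a quarter of the keys change servers when a fourth
# server is added, and every key that moves lands on the new server.
# With modulo hashing, roughly three quarters would be remapped.
print(moved)
```

The key property: when a server joins, the only keys that move are the ones it takes over, so the rest of the cache stays warm.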

Below is a diagram that shows the client part and server part of Memcached (in a Memcached cluster of two Memcached servers):

The next part will discuss how we can use Memcached to alleviate the load on the database server, which was the actual motive of this post.

Sunday, June 26, 2011


A well-designed application has many traits. Good schema design is one of them, and choosing the right data type is a contributor towards a good schema design. The selection of a data type depends on various questions: what range of values will the column hold? What precision does the column require? Is any special behavior needed? Choosing an optimal data type may save disk space and memory, and may save CPU cycles. Such considerations are generally trivial for smaller applications but very important for large-scale ones. MySQL offers two alternatives for storing date-time values: DATETIME and TIMESTAMP. "Which one is best?" is not a good question; "which one should I use, and what are the characteristics of each?" is a better one, and knowing the answer to the latter will help us select the best option.

Storage Format

When using DATETIME, MySQL stores date-time information as an integer in the format YYYYMMDDHHMMSS. When displaying this data, MySQL formats this storage representation into the ANSI standard representation, YYYY-MM-DD HH:MM:SS. On the other hand, when storing date-time information as TIMESTAMP, MySQL just stores the number of seconds elapsed since 00:00:00 on January 1, 1970 (the UNIX epoch).
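The two representations can be computed side by side. This is a small Python sketch of the storage formats described above, not how MySQL lays out bytes internally; the sample date is arbitrary.

```python
from datetime import datetime, timezone

dt = datetime(2011, 7, 3, 14, 30, 45, tzinfo=timezone.utc)

# DATETIME: the date and time packed into one integer, YYYYMMDDHHMMSS.
as_datetime = int(dt.strftime("%Y%m%d%H%M%S"))

# TIMESTAMP: seconds elapsed since 1970-01-01 00:00:00 UTC (the epoch).
as_timestamp = int(dt.timestamp())

print(as_datetime)   # 20110703143045
print(as_timestamp)  # 1309703445
```

The epoch-seconds form is what gives TIMESTAMP its compactness, and also its limited range, as the next sections show.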


Supported Range

DATETIME can hold date-time information from the year 1000 to the year 9999, whereas TIMESTAMP can hold date-time information from the year 1970 up to the year 2038.


Storage Requirements

DATETIME uses 8 bytes of storage whereas TIMESTAMP uses 4 bytes, half of what DATETIME uses. So if TIMESTAMP is feasible for your use case, you shouldn't mind using it.

Timezone Dependence

Storage and display of date time information when using DATETIME is independent of the timezone whereas in the case of TIMESTAMP, the date time information displayed is dependent on the time zone.

Special Characteristics of TIMESTAMP

TIMESTAMP type has some characteristics that are absent in DATETIME type. Below are the characteristics:

  • The display of date-time values of TIMESTAMP type is dependent on the time zone. This means that MySQL will return the date-time information after converting it into the appropriate time zone. By default this is the server's time zone, and it can be changed on a per-connection basis.

  • By default, if you insert a row without specifying a value for the TIMESTAMP column, MySQL will set that column to the current time. Similarly, when a row is updated, MySQL will update the TIMESTAMP column to the current time if the update query does not specify it. Obviously, we can change this default behaviour of the TIMESTAMP data type.

Hopefully this information will help us distinguish the two data types and to choose the right data type for the right purpose.

Wednesday, June 22, 2011

Scale-Out MySQL

As a web application grows, it faces peak times and bursts of requests that it has to handle. On the application stack, software components can be scaled with ease by just adding new application servers. The hardest part is scaling the data layer. One question that comes to mind is why scaling the database is necessary, and which matters more: the database or the application? The answer is that a stand-alone database will at some point become a bottleneck for the application. If the application is capable of servicing 800 requests per second, then because of the database it will serve far fewer requests than it is capable of. In order to explain the bottleneck, let's analyse the architecture of MySQL:

Figure taken from:

If you analyse the lowest level of this architecture, you will see the bottleneck. Yes, you are right: it's the file system, the disk on which the data is eventually stored. No matter how fast and efficient the data retrieval and storage algorithms are, there will always be the limitation of the disk; in short, the requests are I/O bound.

Now that we are aware of the bottleneck that affects the application's performance, we have to come up with different solutions to improve the response/serving time of the data layer. There are different architectures that provide additional performance. Let's first look at the traditional application architecture:

The above architectures are traditional ones that almost every application has, or at least most do. As we can see in figure 2, the application layer is scaled to two app servers, which improves performance a little, but the problem is still the data layer. Of course, scaling the application layer has its own issues, but those are out of scope for this blog. So let's start talking about scaling the database; yes, by scaling I mean increasing the number of database servers to distribute the load between them. First, some terminology:

  1. A query is either a read or a write. I call these passive and active queries: queries that change the state of the database are active (insert, update, and delete), whereas queries that do not change the state of the database are passive (select).
  2. Master Server: A server that serves as the primary data source for the application. This server has the most updated state of the data.
  3. Slave Server: A server that serves as a backup data source for the master server and shares the load with the master server. The state of the data in this server is 'eventually consistent'.

Improvement 1:
This improvement introduces one or more slaves for the master. Active queries are executed on the master and the updates are reflected on the slaves after a slight delay. Passive queries may be executed on any of the servers, including the master. This approach reduces the load on the master, as the load is now shared between more than one server. Below is the architecture:

As we can clearly see, the slaves share the burden of reads with the master. We can add more slaves to improve performance further.
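A minimal query router captures the idea of Improvement 1: active queries go to the master, passive queries are spread round-robin across the slaves. The server names and the `route` helper are made up for illustration; a real router would hold database connections, not strings.

```python
import itertools

MASTER = "master-db"
SLAVES = ["slave-db-1", "slave-db-2"]
_next_slave = itertools.cycle(SLAVES)   # round-robin over the slaves

ACTIVE_VERBS = ("insert", "update", "delete")

def route(query):
    # Classify the query by its leading verb, then pick a server.
    verb = query.lstrip().split(None, 1)[0].lower()
    if verb in ACTIVE_VERBS:
        return MASTER                   # state-changing: master only
    return next(_next_slave)            # read-only: any slave will do

print(route("SELECT * FROM users"))          # slave-db-1
print(route("UPDATE users SET name = 'x'"))  # master-db
print(route("SELECT COUNT(*) FROM users"))   # slave-db-2
```

One caveat this sketch ignores: because of replication lag, a read routed to a slave immediately after a write may not see that write yet, which is the "eventual consistency" noted earlier.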

Improvement 2:
So in order to increase performance we keep adding slaves and wait for the magic. But soon we observe that the spell proves ineffective and the master server is lagging. The reason for this lag is the slaves themselves. MySQL's master/slave configuration requires the master to record active queries in a log. To update their data, the slaves connect to the master, read the log, fetch all the new updates, and apply them locally; this is how the slaves stay synchronized with the master. With many slaves making connections to the master and reading its log for updates, the master starts to lag. To solve this master-lag issue, we can utilize the 'blackhole' engine.
So instead of the slaves connecting directly to the master, we introduce a dummy master that acts as a slave to the real master, receiving all the changes from the master's binlog and recording them in its own binlog. The only difference between the master and the dummy master is the table type: all tables on the dummy master use the blackhole engine, which means the actual data is never stored and the queries are never really executed on the dummy master; they are just logged. The slaves connect to the dummy master exactly as in Improvement 1 and get synchronized. Below is how the architecture should look:

As we can now see, the slaves no longer connect to the master; they connect to the dummy master, and the dummy master connects to the master. We can add more dummy masters in order to accommodate more slaves.
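The relay chain of Improvement 2 can be modelled in a few lines. This is a toy simulation of the data flow, not MySQL replication itself: the master applies writes and logs them, the blackhole "dummy master" only relays the log and stores no row data, and the slaves replay the relayed log to catch up. All class and key names are invented for the sketch.

```python
class Master:
    def __init__(self):
        self.data = {}          # real tables: the rows are stored
        self.binlog = []        # and every active query is logged
    def write(self, key, value):
        self.data[key] = value
        self.binlog.append((key, value))

class BlackholeRelay:
    def __init__(self, master):
        self.master = master
        self.binlog = []        # blackhole tables: log only, no data
    def sync(self):
        # Pull the master's binlog; statements are never applied here.
        self.binlog = list(self.master.binlog)

class Slave:
    def __init__(self, relay):
        self.relay = relay
        self.data = {}
        self.applied = 0        # position in the relay's binlog
    def sync(self):
        for key, value in self.relay.binlog[self.applied:]:
            self.data[key] = value
        self.applied = len(self.relay.binlog)

master = Master()
relay = BlackholeRelay(master)
slaves = [Slave(relay) for _ in range(3)]  # only the relay touches the master

master.write("user:1", "Alice")
relay.sync()
for s in slaves:
    s.sync()
# every slave now has the row, while the relay itself stored no data
```

The point of the model: however many slaves we attach, the master serves exactly one log reader (the relay), which is what removes the master lag.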

Few Notes:
  1. In this post I only talked about scaling reads, not scaling writes.
  2. There is a single point of failure, and yes, that's the master. You need a replicated secondary master so that if the primary master fails, the secondary takes over its role.
  3. You cannot attach an unlimited number of slaves to the master or to the blackhole master.
  4. There is always a replication lag on the slaves, and that should be considered.
  5. If the master is write-heavy, expect a bit more replication delay.

In this blog, I went through the conventional application architecture and then devised two scalable architectures for MySQL that we can use for applications demanding high performance. My next posts will be:

  • Statement-based and Row-based Replication
  • MySQL + MemcacheD
  • Setting up MySQL Master/Slave replication
  • Setting up MySQL Master/Master replication