Sunday, January 22, 2012

MySQL InnoDB Indexes

MySQL Indexing – InnoDB Indexes
A few days back I was reading about MySQL indexes, more specifically InnoDB indexes, to better understand query performance and optimization, so I thought I would share some of what I learned. Indexes are structures that help the database engine find (retrieve) records faster. The opposite of an index lookup is a full scan: think of a full table scan as going through every row in a table to find the right one.
A common example used to explain indexes is a book's index. To look up a topic in a book, you either find it in the index or scan the whole book page by page. If the book has few pages, going page by page is viable, but if it has a decent number of pages, then using the index is the smarter and more efficient approach. The same is true of database indexes.
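The index-lookup versus full-scan distinction can be seen directly in a query plan. A minimal sketch, using Python's built-in sqlite3 module as a stand-in for MySQL (the plan output differs in detail, but the concept is the same); the table and column names are illustrative:

```python
import sqlite3

# In-memory database with a table of 1,000 rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (id INTEGER PRIMARY KEY, topic TEXT)")
conn.executemany("INSERT INTO books (topic) VALUES (?)",
                 [("topic%d" % i,) for i in range(1000)])

# Without an index on `topic`, the engine must scan every row.
plan_scan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM books WHERE topic = 'topic500'"
).fetchall()
print(plan_scan)  # the plan detail mentions a full SCAN of books

# After adding an index, the same query becomes an index lookup.
conn.execute("CREATE INDEX idx_topic ON books (topic)")
plan_index = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM books WHERE topic = 'topic500'"
).fetchall()
print(plan_index)  # the plan detail mentions SEARCH ... USING INDEX idx_topic
```

In MySQL, `EXPLAIN SELECT ...` gives the analogous information (the `type` and `key` columns show whether an index is used).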

InnoDB Indexes
MySQL's InnoDB storage engine uses a B+ Tree structure to store its data. How B+ Trees work is a topic of its own; I will be writing a separate post on how data in a B+ Tree is organized and how inserts, updates and deletes affect the tree.

Clustered Index
A Clustered Index is an approach to storing data. Think of a Clustered Index as a tree structure (index) with the data rows as leaves and the primary key values in the nodes above the leaves. InnoDB clusters the data by primary key. Below are some points to remember regarding the Clustered Index:
  • As stated earlier, InnoDB clusters the data by primary key. If the table has a primary key, MySQL will use it as the Clustered Index.
  • If the table does not have a primary key, MySQL selects the first UNIQUE index whose columns are all NOT NULL as the Clustered Index.
  • If neither of the above applies, MySQL generates a hidden 6-byte field containing a row ID and clusters the data on it (this hidden index is named GEN_CLUST_INDEX).
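The three rules above can be sketched as a small selection function. This is a toy model of the decision, not InnoDB code; the index names passed in are illustrative:

```python
def choose_clustered_key(primary_key, unique_indexes):
    """Toy model of how InnoDB picks its clustered index key.

    `unique_indexes` is a list of (name, all_columns_not_null) pairs,
    in the order the indexes were defined on the table.
    """
    if primary_key is not None:
        return primary_key              # 1. explicit PRIMARY KEY wins
    for name, all_not_null in unique_indexes:
        if all_not_null:
            return name                 # 2. first UNIQUE index with all NOT NULL columns
    return "GEN_CLUST_INDEX"            # 3. hidden 6-byte row ID index

print(choose_clustered_key("id", []))                              # id
print(choose_clustered_key(None, [("u1", False), ("u2", True)]))   # u2
print(choose_clustered_key(None, []))                              # GEN_CLUST_INDEX
```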
Because the Clustered Index holds both the data and the primary key on the same page, row access is fast: no additional disk I/O is needed. With MyISAM, by contrast, an additional disk I/O is needed because the index and the data are not on the same page. Below are some points worth mentioning:
  • Insertion speed depends on the order in which data is inserted into the table. Insertions are fast if rows are inserted in primary key order; inserting rows with random keys is a bad approach, while sequential keys (like AUTO_INCREMENT) are a good one.
  • Updating the primary key is usually a bad idea, as it forces each updated row to be moved to a different location. Moving rows can lead to page splits, which cause the table to use more disk space.
  • Though Clustered Indexes are efficient for retrieval, defining a clustered key with many columns can be a disadvantage. Why it is a disadvantage will become clear once you read about secondary indexes.
Secondary Indexes
Secondary Indexes are also called non-clustered indexes. Unlike the clustered index, they do not store the row data in their leaves; they store the primary key (clustered index key) instead. This is why it is advised to keep the primary key short: the size of the primary key affects the size of every secondary index. As far as lookups are concerned, a secondary index lookup requires two steps: first, find the primary keys matching the secondary key value; then fetch the actual rows from the clustered index using those primary keys.
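The two-step lookup can be sketched with plain dictionaries. This is a toy model of the structure, not how InnoDB stores pages, and the table data is made up for illustration:

```python
# Clustered index: primary key -> full row (rows live in the leaves).
clustered = {
    1: {"id": 1, "email": "a@example.com", "name": "Alice"},
    2: {"id": 2, "email": "b@example.com", "name": "Bob"},
}

# Secondary index on `email`: indexed value -> list of primary keys.
# Note the leaves hold only primary keys, not row data.
secondary_email = {
    "a@example.com": [1],
    "b@example.com": [2],
}

def lookup_by_email(email):
    # Step 1: the secondary index yields the matching primary keys.
    pks = secondary_email.get(email, [])
    # Step 2: the clustered index yields the actual rows.
    return [clustered[pk] for pk in pks]

print(lookup_by_email("b@example.com"))  # [{'id': 2, ...}]
```

Because every secondary-index leaf stores a copy of the primary key, a long multi-column primary key inflates every secondary index on the table, which is the disadvantage mentioned above.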

Tuesday, January 3, 2012

Architecture for Scalable Resource Discovery (Part II)

This is the second part of this post. In Part I, I explained the problem, the analysis, and the architecture and algorithms devised to solve it. In this part I present a strategy for testing the architecture on Amazon EC2: implementing the architecture on EC2, then building testing clusters that simulate thousands of requests per second against it.

Simulation Strategy
Region Design
In order to simulate our algorithm, we chose Amazon's Elastic Compute Cloud (EC2) as the base framework for simulating regional data replication and distribution. Amazon EC2 is already divided into seven geographic regions (each containing multiple Availability Zones), which allows us to partially test our geographically distributed Regional Minimum Spanning Tree building algorithm. Distributing data accesses across multiple servers within a cluster or ring can be handled by Elastic Load Balancing (ELB), which also detects unhealthy instances and reroutes traffic to healthy ones until the unhealthy instances can be restored. CloudWatch allows us to monitor how efficiently our design performs. The following diagram shows our planned usage of the Amazon EC2 platform to simulate our regional architecture:

When a client generates a request for a resource, it is routed to an appropriate Resource Region (hosted in one Amazon EC2 Region). Amazon automatically routes the request to the region closest to the client's location. The request first lands on the Elastic Load Balancer (ELB) of the Resource Region, which forwards it to one of the region servers according to the ELB's request distribution algorithm. The region servers are auto-scaled: the number of servers grows or shrinks as the volume of resource requests changes, based on criteria such as bandwidth, requests per second, or idle CPU time.

Once the request reaches a region server, the server determines which cluster the request should be forwarded to (based on the resource type) and forwards it to that cluster's ELB, which in turn routes it to one of the cluster servers. The cluster servers are auto-scaled just like the region servers. The cluster server then uses consistent hashing to determine which ring cluster should handle the request and forwards it to that ring cluster's ELB, which passes it to any MySQL ring server. Each MySQL ring server hosts a web service that takes the resource ID, looks the resource up in the database, and returns the appropriate response (resourceLocation or notFound). Unlike the region and cluster servers, the ring servers are not auto-scaled.

For ease of deployment and reuse, we will use server templates for the Region Server, Cluster Server and Ring Server. The templates contain all of the necessary server configuration, so adding a pre-configured server is just a matter of spawning another instance of a template. We can also clone a Ring Cluster (Farm Cloning) to create a new Ring Cluster when more rings are needed, for example when new resources are added to the system.
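The consistent-hashing step on the cluster servers can be sketched as follows. This is a minimal illustration of the general technique, not our production code; the ring-cluster names and replica count are made up for the example:

```python
import bisect
import hashlib

def _hash(key):
    # Map any string key to a position on the hash ring.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, nodes, replicas=100):
        # Each node gets `replicas` virtual points on the ring,
        # which smooths out the distribution of keys across nodes.
        self._keys = []   # sorted hash positions
        self._ring = {}   # hash position -> node name
        for node in nodes:
            for i in range(replicas):
                h = _hash("%s#%d" % (node, i))
                self._ring[h] = node
                bisect.insort(self._keys, h)

    def node_for(self, resource_id):
        # Walk clockwise from the key's position to the next virtual point.
        h = _hash(resource_id)
        idx = bisect.bisect(self._keys, h) % len(self._keys)
        return self._ring[self._keys[idx]]

ring = ConsistentHashRing(["ring-cluster-1", "ring-cluster-2", "ring-cluster-3"])
print(ring.node_for("resource-42"))  # the same ID always maps to the same cluster
```

The appeal of consistent hashing here is that adding or removing a ring cluster (e.g. via Farm Cloning) remaps only a small fraction of the resource IDs, rather than reshuffling everything.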

All of the servers will be closed to public access. The only exception is region synchronization: the cluster servers will be allowed to connect to the cluster servers of the neighboring regions.

Testing Clusters
Now that the regional architecture is set up, we also need to simulate millions of users accessing the data concurrently across all seven EC2 regions. The following diagram shows how we plan to test the region architecture described above:

We will create a separate set of EC2 instances, known as Tester instances, which will send requests to our regional architecture at a predefined rate. The Tester instances will take advantage of predefined tools and use our Tester template to spawn multiple tester instances for each region, allowing us to scale up the number of requests and analyze the network traffic and bandwidth used. This approach lets us generate as many or as few requests per second as we need for testing.
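The core of a Tester instance's "predefined rate" loop can be sketched as below. This is an assumed design, not the actual tooling: `send` stands in for the real HTTP call, and the clock/sleep parameters are injected so the loop can be exercised without real time passing:

```python
import time

def run_tester(send, rate_per_sec, duration_sec,
               clock=time.monotonic, sleep=time.sleep):
    """Fire `send` at a fixed target rate for `duration_sec` seconds."""
    interval = 1.0 / rate_per_sec
    sent = 0
    start = clock()
    next_fire = start
    while clock() - start < duration_sec:
        send(sent)            # stand-in for the real resource request
        sent += 1
        next_fire += interval # pace by absolute deadlines to avoid drift
        delay = next_fire - clock()
        if delay > 0:
            sleep(delay)
    return sent

# A fake clock makes the loop deterministic for a quick sanity check.
class FakeClock:
    def __init__(self):
        self.now = 0.0
    def __call__(self):
        return self.now
    def sleep(self, seconds):
        self.now += seconds

clock = FakeClock()
sent = run_tester(lambda i: None, rate_per_sec=8, duration_sec=1,
                  clock=clock, sleep=clock.sleep)
print(sent)  # 8 requests in one simulated second
```

Scaling the request volume is then just a matter of spawning more Tester instances per region, each running this loop with its own target rate.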