Friday, May 15, 2009

libketama - a consistent hashing algo for memcache clients


We wrote ketama to replace how our memcached clients mapped keys to servers. Previously, clients mapped keys->servers like this:

server = serverlist[hash(key)%serverlist.length];

This meant that whenever we added or removed servers from the pool, everything hashed to different servers, which effectively wiped the entire cache. We add (and sometimes remove) servers from the memcached pool often enough to warrant writing this - if your memcached pool never changes, you can probably stop reading now :)

Ketama is an implementation of a consistent hashing algorithm, meaning you can add or remove servers from the memcached pool without causing a complete remap of all keys.

Here's how it works:

* Take your list of servers (e.g. a set of host:port strings for your memcached pool)
* Hash each server string to several (100-200) unsigned ints
* Conceptually, these numbers are placed on a circle called the continuum. (imagine a clock face that goes from 0 to 2^32)
* Each number links to the server it was hashed from, so servers appear at several points on the continuum, by each of the numbers they hashed to.
* To map a key->server, hash your key to a single unsigned int, and find the next biggest number on the continuum. The server linked to that number is the correct server for that key.
* If you hash your key to a value near 2^32 and there are no points on the continuum greater than your hash, return the first server in the continuum.

If you then add or remove a server from the list, only a small proportion of keys end up mapping to different servers.
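The steps above can be sketched in a few lines of Python. This is a hedged sketch, not libketama itself: real ketama hashes each server string with MD5 and derives several continuum points per digest, while here one point per hash is enough to show the idea.

```python
import bisect
import hashlib

def _hash(s):
    # First 4 bytes of an MD5 digest as an unsigned int: a point on
    # the 0..2^32 continuum. (Real ketama derives several points per
    # digest; one per hash keeps the sketch short.)
    return int.from_bytes(hashlib.md5(s.encode()).digest()[:4], "big")

class Continuum:
    def __init__(self, servers, points_per_server=150):
        # Place each server at many points on the circle, so keys
        # spread evenly and removals only affect nearby arcs.
        self.ring = sorted(
            (_hash(f"{server}-{i}"), server)
            for server in servers
            for i in range(points_per_server)
        )
        self.keys = [h for h, _ in self.ring]

    def get_server(self, key):
        # Find the next point clockwise from the key's hash;
        # wrap to the first point if we run off the end.
        idx = bisect.bisect(self.keys, _hash(key)) % len(self.ring)
        return self.ring[idx][1]
```

Adding a fourth server to a three-server continuum remaps only roughly a quarter of the keys, instead of nearly all of them as with the modulo scheme.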

The majority of the code is a C library (libketama) and a php4 extension that wraps it. I've also included a class from our Java client. (Java Collections makes it rather easy). We use a single-server memcache client wrapped with a native php class to make it multi-server capable, so we just replaced the hashing method with a ketama_find_server call. (should be easy enough to plug this into libmemcache if need be)

We've been using this in production for all our php installs and java services for around 10 days now. We deployed it just in time to smooth over moving loads of webservers between datacenters.



DRBD® refers to block devices designed as a building block to form high availability (HA) clusters. This is done by mirroring a whole block device via an assigned network. DRBD can be understood as a network-based RAID-1.

In the illustration above, the two orange boxes represent two servers that form an HA cluster. The boxes contain the usual components of a Linux™ kernel: file system, buffer cache, disk scheduler, disk drivers, TCP/IP stack and network interface card (NIC) driver. The black arrows illustrate the flow of data between these components.

The orange arrows show the flow of data, as DRBD mirrors the data of a high availably service from the active node of the HA cluster to the standby node of the HA cluster.



DRBD is a block device designed for building high availability clusters. This is done by mirroring a whole block device via a (dedicated) network; you could see it as a network RAID-1.

DRBD takes over the data, writes it to the local disk, and sends it to the other host, where it is written to the disk there.

The other components needed are a cluster membership service, which is supposed to be heartbeat, and some kind of application that works on top of a block device.

Each device (DRBD provides more than one of these devices) has a state, which can be 'primary' or 'secondary'. On the node with the primary device the application is supposed to run and to access the device (/dev/drbdX; used to be /dev/nbX). Every write is sent to the local 'lower level block device' and to the node with the device in 'secondary' state. The secondary device simply writes the data to its lower level block device. Reads are always carried out locally.

If the primary node fails, heartbeat switches the secondary device into primary state and starts the application there. (If you are using it with a non-journaling FS this involves running fsck.)

If the failed node comes up again, it is a new secondary node and has to synchronise its content to the primary. This, of course, happens in the background without interruption of service.

And, of course, only those parts of the device that have actually been changed are resynchronized. DRBD has always done intelligent resynchronization when possible. Starting with the DRBD-0.7 series, you can define an "active set" of a certain size. This makes it possible to have a total resync time of 1--3 min, regardless of device size (currently up to 4TB), even after a hard crash of an active node.

The ChangeLogs and further details can be found on the DRBD homepage.
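The write path described above can be modeled in a few lines. This is a toy sketch, not real DRBD: it only shows the data flow in which the primary writes locally, ships each write to the secondary, and serves all reads locally.

```python
class ToyDRBD:
    """Toy model of DRBD's write path. One instance per node; the
    dict stands in for the lower-level block device."""

    def __init__(self, peer=None):
        self.disk = {}        # block number -> data (the "local disk")
        self.peer = peer      # the node in 'secondary' state, if we are primary
        self.dirty = set()    # blocks written but not yet acknowledged by the peer

    def write(self, block, data):
        # Every write goes to the local lower-level device AND to the
        # node holding the device in 'secondary' state.
        self.disk[block] = data
        if self.peer is not None:
            self.dirty.add(block)
            self.peer.replicate(block, data)
            self.dirty.discard(block)    # peer acknowledged the write

    def replicate(self, block, data):
        # On the secondary: simply write to the lower-level device.
        self.disk[block] = data

    def read(self, block):
        # Reads are always carried out locally.
        return self.disk.get(block)
```

The `dirty` set loosely mirrors the "active set" idea: after a crash, only blocks that were in flight would need resynchronising, not the whole device.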

Thursday, May 14, 2009


About Nagios
Home > About

Get proactive.
Save time, money, and your sanity.

Nagios is the industry standard in enterprise-class monitoring for good reason. It allows you to gain insight into your network and fix problems before customers know they even exist. It's stable, scalable, supported, and extensible. Most importantly, it works.

What does Nagios provide?

Comprehensive Network Monitoring

  • Windows
  • Linux/Unix
  • Routers, Switches, Firewalls
  • Printers
  • Services
  • Applications

Immediate Awareness and Insight

  • Receive immediate notifications of problems via email, pager and cellphone
  • Multi-user notification escalation capabilities
  • See detailed status information through the Nagios web interface

Problem Remediation

  • Acknowledge problems through the web interface
  • Automatically restart failed applications, services and hosts with event handlers

Proactive Planning

  • Schedule downtime for anticipated host, service, and network upgrades
  • Capacity planning capabilities through usage monitoring

Reporting Options

  • SLA availability reports
  • Alert and notification history reports
  • Trending reports through integration with Cacti and RRD-based addons

Multi-Tenant/Multi-User Capabilities

  • Multiple users can access the web interface
  • Each user can have their own unique, restricted view

Integration With Your Existing Applications

  • Trouble ticket systems
  • Wikis

Easily Extendable Architecture

  • Over 200 community addons are available to enhance Nagios

Stable, Reliable, and Respected Platform

  • 10 years in development
  • Scales to monitor 100,000+ nodes
  • Failover protection capabilities
  • Winner of multiple awards
  • Constant media coverage

Huge Community

  • 250,000+ users worldwide
  • Active mailing lists
  • Extensive community website network

Customizable Code

  • Open Source Software
  • Full access to source code
  • Released under the GPL license
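The checks and event handlers above rest on a simple plugin contract: a check is any program that prints one line of status (optionally with performance data after a `|`) and exits 0, 1, 2, or 3 for OK, WARNING, CRITICAL, or UNKNOWN. A minimal sketch follows; the load-average source and thresholds are illustrative, not taken from any stock plugin.

```python
import os

# Nagios plugin exit codes
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def check_load(warn=4.0, crit=8.0):
    """Return (exit_code, status_line) in the Nagios plugin convention."""
    try:
        load1, _, _ = os.getloadavg()
    except OSError:
        return UNKNOWN, "LOAD UNKNOWN - cannot read load average"
    status = OK if load1 < warn else WARNING if load1 < crit else CRITICAL
    label = ["OK", "WARNING", "CRITICAL"][status]
    # perfdata after '|' is what RRD-based addons graph
    return status, f"LOAD {label} - load1={load1:.2f}|load1={load1:.2f};{warn};{crit}"

# A real plugin would finish with:
#   code, line = check_load()
#   print(line)
#   sys.exit(code)   # Nagios interprets this exit status
```

Because the contract is just "one line plus an exit code", plugins can be written in any language, which is much of why the addon ecosystem is so large.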


What is Ganglia?

Ganglia is a scalable distributed monitoring system for high-performance computing systems such as clusters and Grids. It is based on a hierarchical design targeted at federations of clusters. It leverages widely used technologies such as XML for data representation, XDR for compact, portable data transport, and RRDtool for data storage and visualization. It uses carefully engineered data structures and algorithms to achieve very low per-node overheads and high concurrency. The implementation is robust, has been ported to an extensive set of operating systems and processor architectures, and is currently in use on thousands of clusters around the world. It has been used to link clusters across university campuses and around the world and can scale to handle clusters with 2000 nodes.

Ganglia is an open-source project that grew out of the University of California, Berkeley Millennium Project which was initially funded in large part by the National Partnership for Advanced Computational Infrastructure (NPACI) and National Science Foundation RI Award EIA-9802069. NPACI is funded by the National Science Foundation and strives to advance science by creating a ubiquitous, continuous, and pervasive national computational infrastructure: the Grid. Current support comes from Planet Lab: an open platform for developing, deploying, and accessing planetary-scale services.

Ganglia takes a stab at answering all of these questions. Ganglia comprises several components: gmond, gmetad, and gmetric (pronounced gee-mon-dee and so forth), which perform the roles of sending data, aggregating data, and injecting custom data, respectively. gmond, the Ganglia monitoring daemon, runs on each host we want to monitor. It collects core statistics like processor, memory, and network usage and sends them to the spokesperson for the server cluster. The spokesperson for a cluster collects all of the data for the machines in that cluster (such as web servers or database servers) and then sends it on to the central logging host. gmetad is used for aggregating cluster data at intermediate nodes. gmetric allows you to inject extra metric data into the Ganglia multicast channel, sending your statistics to the central repository.

One of the really useful abilities of the Ganglia system is how easy it is to extend the monitoring with custom metrics. We can write a program or script that runs periodically via cron to check on our statistic and then feed it into the Ganglia aggregation and recording framework by using the gmetric program. Our custom scripts can collect the data using whatever means necessary and then inject a single value (for each sampling run) along with the metric they're collecting for, its format, and units:

/usr/bin/gmetric -tuint32 -nmemcached_hitratio -v92 -u% 

This example feeds data into Ganglia for a statistic called memcached_hitratio. If Ganglia hasn't seen this statistic before, it will create a new RRDtool database for storing it, using the data type specified by the -t flag (unsigned 32-bit integer in this example). The value 92 is then stored for this sampling period. Once two sampling periods have been completed, we can see a graph of the memcached_hitratio statistic on the Ganglia status page for that node. The final -u flag lets us tell Ganglia what units the statistic is being gathered in, which are then displayed on the output graphs, as shown in Figure 10-8.

Figure 10-8. Output graph with units
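A cron-driven script in that style might compute the memcached hit ratio and hand it to gmetric. In this sketch the counter names cmd_get and get_hits are memcached's real statistics fields, but the gmetric path and the hard-coded sample values are assumptions for illustration.

```python
import subprocess

def hit_ratio(stats):
    """Percentage of gets served from cache, from memcached's counters."""
    gets = int(stats.get("cmd_get", 0))
    hits = int(stats.get("get_hits", 0))
    return 0 if gets == 0 else round(100 * hits / gets)

def report(ratio):
    # Shell out to gmetric exactly as in the command-line example above.
    subprocess.check_call([
        "/usr/bin/gmetric",
        "-t", "uint32",            # data type for the new RRD database
        "-n", "memcached_hitratio",
        "-v", str(ratio),
        "-u", "%",                 # units shown on the output graph
    ])

if __name__ == "__main__":
    # In cron you would fetch real counters (e.g. via a STATS call)
    # and run: report(hit_ratio(stats))
    print(hit_ratio({"cmd_get": "1250", "get_hits": "1150"}))
```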


What RRDtool does

RRDtool is the OpenSource industry standard, high performance data logging and graphing system for time series data. Use it to write your custom monitoring shell scripts or create whole applications using its Perl, Python, Ruby, TCL or PHP bindings.

Tobi Oetiker's Toolbox


I work as an IT-Specialist for OETIKER+PARTNER AG a consultancy company based in Olten, Switzerland.

We use OpenSource Software extensively in all of our projects. So I made it a habit to open source code we write whenever possible. Below is a list of notable projects I have created or been involved with over the years.



An industry-standard database for logging and graphing time-series data.


The first widely used, long term network monitoring tool on the Internet. I created it in 1995 and still update it regularly as people send in patches.


Monitor latency on your network. By default it uses ping to track the latency on your network links.


For running a large storage server: this highly configurable tool lets you keep track of disk usage per directory. It comes with an Ajax front end and needs classic du for input.

Perl Win32::Monitoring Modules

A set of Perl modules that help with writing user-interactivity monitoring systems.

Database Development

Gedafe the Generic Database Frontend

Developed with David Schweikert, this neat tool lets you write database-centric web applications by simply setting up a PostgreSQL database and following a few simple design and naming rules.

System Management

O+P Insights

A website where my colleagues and I document our insights on all things system management.

Windows System Management: Real Men Don't Click

A website about Windows system management that I helped set up while working for ETH Zurich.


A software deployment system for large Unix systems. It deals with multi-architecture issues and supports multiple versions of the same package installed in parallel.

ISG Toolchest

A system for keeping track of all the little scripts you write while sysadmining. The page also hosts a large collection of tools developed at ETH Zurich using the system.

SamFS Support for Samba

This patch teaches Samba to handle SamFS offline files.


The not so Short Introduction to LaTeX2e

LaTeX is a system for writing documents with perfect looks while fully concentrating on the content. This document shows you how to do this.


Red Hat Content Accelerator Manuals


Red Hat Content Accelerator is a kernel-based, multi-threaded, extremely high performance HTTP server with the ability to serve both static and dynamic data. TUX is also an architecture for kernel-accelerated network services.

Red Hat Content Accelerator 2.2

Reference Manual
PDF (208 KB)    |    HTML Tarball (28 KB)

Red Hat Content Accelerator 2.1

Reference Manual
PDF (144 KB)   |   HTML Tarball (24 KB)

Red Hat Content Accelerator 2.0

Reference Manual
PDF (76 KB)   |   HTML Tarball (22 KB)


About MogileFS

MogileFS is our open source distributed filesystem. Its properties and features include:

  • Application level -- no special kernel modules required.
  • No single point of failure -- all three components of a MogileFS setup (storage nodes, trackers, and the tracker's database(s)) can be run on multiple machines, so there's no single point of failure. (you can run trackers on the same machines as storage nodes, too, so you don't need 4 machines...) A minimum of 2 machines is recommended.
  • Automatic file replication -- files, based on their "class", are automatically replicated between enough different storage nodes as to satisfy the minimum replica count as requested by their class. For instance, for a photo hosting site you can make original JPEGs have a minimum replica count of 3, but thumbnails and scaled versions only have a replica count of 1 or 2. If you lose the only copy of a thumbnail, the application can just rebuild it. In this way, MogileFS (without RAID) can save money on disks that would otherwise be storing multiple copies of data unnecessarily.
  • "Better than RAID" -- in a non-SAN RAID setup, the disks are redundant, but the host isn't. If you lose the entire machine, the files are inaccessible. MogileFS replicates the files between devices which are on different hosts, so files are always available.
  • Flat Namespace -- Files are identified by named keys in a flat, global namespace. You can create as many namespaces as you'd like, so multiple applications with potentially conflicting keys can run on the same MogileFS installation.
  • Shared-Nothing -- MogileFS doesn't depend on a pricey SAN with shared disks. Every machine maintains its own local disks.
  • No RAID required -- Local disks on MogileFS storage nodes can be in a RAID, or not. It's cheaper not to, as RAID doesn't buy you any safety that MogileFS doesn't already provide.
  • Local filesystem agnostic -- Local disks on MogileFS storage nodes can be formatted with your filesystem of choice (ext3, XFS, etc..). MogileFS does its own internal directory hashing so it doesn't hit filesystem limits such as "max files per directory" or "max directories per directory". Use what you're comfortable with.

MogileFS is not:

  • POSIX Compliant -- you don't run regular Unix applications or databases against MogileFS. It's meant for archiving write-once files and doing only sequential reads. (though you can modify a file by way of overwriting it with a new version) Notes:
    • Yes, this means your application has to specifically use a MogileFS client library to store and retrieve files. The steps, in general, are: 1) ask a tracker where you can put or get a file, then 2) read or write to one of the locations it returns, using HTTP GET/PUT (the tracker picks storage node(s) for you as part of its load balancing).
    • We've prototyped a FUSE binding, so you could use MogileFS without application support, but it's not production-ready.
  • Completely portable ... yet -- we have some Linux-isms in our code, at least in the HTTP transport code. Our plan is to scrap that and make it portable, though.
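The two-step store path (ask a tracker, then HTTP PUT to the location it returns) can be sketched as below. The reply format, field names, and values here are hypothetical stand-ins, not the exact MogileFS tracker protocol; a real application would use a MogileFS client library.

```python
from urllib.parse import parse_qs

def parse_tracker_reply(line):
    """Parse an 'OK <url-encoded args>' style tracker reply into a dict.
    (Hypothetical wire format, for illustration only.)"""
    status, _, args = line.partition(" ")
    if status != "OK":
        raise RuntimeError(f"tracker error: {line}")
    return {k: v[0] for k, v in parse_qs(args).items()}

def store_plan(reply_line):
    """Turn the tracker's answer into the HTTP PUT the client does next."""
    fields = parse_tracker_reply(reply_line)
    return {"method": "PUT", "url": fields["path"], "fid": fields["fid"]}

# A reply the tracker might send when the client asks to store a key
# (hypothetical values):
reply = "OK path=http://10.0.0.2:7500/dev1/0/000/000/0000000123.fid&fid=123"
plan = store_plan(reply)
```

The actual file bytes would then go over a plain HTTP PUT to `plan["url"]` on the chosen storage node; retrieval is the mirror image with GET.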

Please see the project Wiki pages for more information.


Squid: Optimising Web Delivery

Squid is a caching proxy for the Web supporting HTTP, HTTPS, FTP, and more. It reduces bandwidth and improves response times by caching and reusing frequently-requested web pages. Squid has extensive access controls and makes a great server accelerator. It runs on most available operating systems, including Windows, and is licensed under the GNU GPL.

Making the most of your Internet Connection

Squid is used by hundreds of Internet Providers world-wide to provide their users with the best possible web access. Squid optimises the data flow between client and server to improve performance and caches frequently-used content to save bandwidth. Squid can also route content requests to servers in a wide variety of ways to build cache server hierarchies which optimise network throughput.

Website Content Acceleration and Distribution

Thousands of web-sites around the Internet use Squid to drastically increase their content delivery. Squid can reduce your server load and improve delivery speeds to clients. Squid can also be used to deliver content from around the world - copying only the content being used, rather than inefficiently copying everything. Finally, Squid's advanced content routing configuration allows you to build content clusters to route and load balance requests via a variety of web servers.
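As a rough illustration of the accelerator role, a squid.conf fronting a single origin server might look like the fragment below. Hostnames, IPs, and the peer name are placeholders; check the directives against the documentation for your Squid version.

```conf
# Accept client traffic on port 80 in accelerator (reverse proxy) mode
http_port 80 accel defaultsite=www.example.com

# The origin web server Squid fetches misses from
cache_peer 10.0.0.10 parent 80 0 no-query originserver name=origin

# Only accelerate requests for our own site
acl our_site dstdomain www.example.com
http_access allow our_site
http_access deny all
```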

 [The Squid systems] are currently running at a hit-rate of approximately 75%, effectively quadrupling the capacity of the Apache servers behind them. This is particularly noticeable when a large surge of traffic arrives directed to a particular page via a web link from another site, as the caching efficiency for that page will be nearly 100%.  - Wikimedia Deployment Information.

Want to learn more?

The Squid project provides a number of resources to assist users design, implement and support Squid installations. Please browse the Documentation and Support sections for more information.

The Spread Toolkit

Welcome to the Spread Toolkit!

Spread is an open source toolkit that provides a high performance messaging service that is resilient to faults across local and wide area networks. Spread functions as a unified message bus for distributed applications, and provides highly tuned application-level multicast, group communication, and point to point support. Spread services range from reliable messaging to fully ordered messages with delivery guarantees.

Spread can be used in many distributed applications that require high reliability, high performance, and robust communication among various subsets of members. The toolkit is designed to encapsulate the challenging aspects of asynchronous networks and enable the construction of reliable and scalable distributed applications.

Spread consists of a library that user applications are linked with, a binary daemon which runs on each computer that is part of the processor group, and various utility and demonstration programs.

Some of the services and benefits provided by Spread:

  • Reliable and scalable messaging and group communication.
  • A very powerful but simple API simplifies the construction of distributed architectures.
  • Easy to use, deploy and maintain.
  • Highly scalable from one local area network to complex wide area networks.
  • Supports thousands of groups with different sets of members.
  • Enables message reliability in the presence of machine failures, process crashes and recoveries, and network partitions and merges.
  • Provides a range of reliability, ordering and stability guarantees for messages.
  • Emphasis on robustness and high performance.
  • Completely distributed algorithms with no central point of failure.
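The "fully ordered messages" guarantee above can be pictured with a toy model: every member of a group receives the same messages in the same global order. Spread achieves this with completely distributed algorithms and no central point of failure; the central sequencer below is only a shortcut to illustrate the delivery semantics, not how Spread actually works.

```python
class ToyGroupBus:
    """Toy total-order multicast: one global sequence number stamps
    every message, so all group members see an identical stream."""

    def __init__(self):
        self.groups = {}   # group name -> list of member inboxes
        self.seq = 0       # global sequence number (the total order)

    def join(self, group, inbox):
        self.groups.setdefault(group, []).append(inbox)

    def multicast(self, group, sender, payload):
        # Stamp the message, then deliver to every member of the group.
        self.seq += 1
        msg = (self.seq, sender, payload)
        for inbox in self.groups.get(group, []):
            inbox.append(msg)

bus = ToyGroupBus()
a, b = [], []
bus.join("chat", a)
bus.join("chat", b)
bus.multicast("chat", "alice", "hello")
bus.multicast("chat", "bob", "world")
```

The hard part that Spread solves, and this toy ignores, is preserving that single agreed order across machine failures, process crashes, and network partitions and merges.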