Tech Choices - Best Configurations for the new Dell R420
Sharing Files & Assets between Web Servers
Backups should be off-site
User Password Protection
Tech Item - Backups should be off-site
Tech Issue - Using SSH keys vs. passwords
Tech Items - Load balancing (HAProxy vs. Nginx)
Tech Items - Linux NIC Interrupts Overloading Single CPUs
Tech Items - More NUMA Fun in Virtual Machines
MySQL Backups Done Right
Tech Items - MySQL Sub-Slaves to Which Master?
Tech Items - MySQL Slaves are not good Backup
Amazon's EC2 Cloud System
Tech DBs - Why use only InnoDB on MySQL
Goldman Sachs 10 Rules to Follow
Always Do What is Best for the Customer
Tech Troubleshooting - HAProxy Performance and Load
Tech Annoyances - Linux NUMA Issues
Tech Annoyances - Linux Swapping Fun
Having Everyone's Problems
Welcome to the ChinaNetCloud Blog
Posted on August 29, 2012 by Steve Mushero, Co-Founder and CEO
Posted on September 26, 2012 by Steve Mushero, Co-Founder and CEO
Tech Choices - Why we use Centos instead of Debian / Ubuntu
Posted on September 25, 2012 by Steve Mushero, Co-Founder and CEO
Posted on August 20, 2012 by Steve Mushero, Co-Founder and CEO
Posted on August 17, 2012 by Steve Mushero, Co-Founder and CEO
Posted on August 8, 2012 by Steve Mushero, Co-Founder and CEO
Posted on August 1, 2012 by Steve Mushero, Co-Founder and CEO
Posted on July 26, 2012 by Steve Mushero, Co-Founder and CEO
Yihaodian was the most careless, apparently storing passwords in plain text in its system. This is a despicable practice that should never be used, ever, for anything. The reasons are obvious: anyone with access to the DB, including hackers, programmers, and many others, can see, use, and sell the passwords. Worse, many people re-use passwords across many systems, so if you know the password Bob uses on one system, you can try it on all the others too, including corporate, code, finance, health, and other sensitive systems.
LinkedIn was much stronger, hashing passwords with MD5 or SHA-1, which was long the standard method for password protection but is no longer good enough. The reason, known for decades, is that unsalted hashes leak passwords in two key ways. First, identical passwords hash to identical values, so if my password's hash is "34AH8CD" and yours is the same, we know our passwords are the same and I can log into your account. Second, common hashes can be precomputed in something called a Rainbow Table - md5("a"), md5("b"), md5("test"), and so on for millions of common passwords - then all the hacker has to do is compare the hashes in the password DB against the Rainbow Table. This is a common attack and is partly how the LinkedIn passwords were cracked.
How to prevent this? In a word, salt. Not like salt and pepper, but adding extra random data to the password before hashing, so that two identical passwords produce different hashes. Unix has had this in /etc/passwd forever with a 12-bit salt, but now 48-128 bits are common, often in combination with key-stretching using multi-round hashing.
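For illustration, here is what a salted, stretched hash looks like from the shell, using OpenSSL's MD5-crypt mode (the salt and password are made-up examples; modern systems should prefer stronger schemes such as SHA-512 crypt or bcrypt):

openssl passwd -1 -salt xK3p9Qz2 'MySecretPass'
# -> $1$xK3p9Qz2$...  (the same password with a different salt yields a different hash)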
All new and upgraded systems should use proper salting and hashing to create secure passwords, to protect their users, systems, and data. You should also follow good practices for non-password data such as mobile phone numbers, email addresses, and transaction histories (all also stolen from Yihaodian) by limiting server access and purging such data from DBs used for development and testing. Plus, always, always encrypt your backups before they leave the server.
Posted on July 25, 2012 by Steve Mushero, Co-Founder and CEO
But most backups are only stored locally, not off-site, which creates a huge risk for the data and thus the business. All your important data should also be stored somewhere else, or a fire, flood, or business dispute may remove access to it. In China, various regulatory and government issues may also limit access to your systems (or all systems in a particular IDC). So you need copies of your data elsewhere.
This creates two questions: where and how to transfer and store the data? These are difficult questions that depend greatly on your situation and especially your data size. It also depends on where you are; outside of China we often use Amazon S3 for storage, but this type of storage is not yet available in China (hopefully it will be by the end of 2012).
For small customers, we recommend simple transfer of the backup files to your office for storage and maybe dev/test. For larger customers, we try to find an off-site location such as other servers, Amazon, etc. We also offer backup services which move data off your servers and usually to other data centers around China or Asia, though this system is limited in size to a few GB.
Moving small amounts of data is easy, via sftp, rsync, and other simple methods, but larger data is a problem. Up to tens of GB can be moved with these methods, with more care, or with backup systems like Bacula. Larger data sets pose special problems and often require hardware like tape systems, data sync methods like incremental rsync (good for video, images, etc., but not DBs), and other approaches. Some commercial tools can also be useful in this area. All require serious discussion and planning, which we do regularly for our customers.
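As a sketch of the simple case, a nightly rsync over SSH covers small-to-medium data sets (the host and paths here are examples):

rsync -az --delete /var/backups/ backup@offsite.example.com:/data/backups/web01/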
Best practice is of course to do a test restore on a dev/test system every week or month, and then run some simple data integrity checks, which makes sure the whole process is working well.
Note that if you push production data to dev/test or office use, best practice is to scrub the data to remove sensitive information like passwords, email addresses, phone numbers, etc. so the data can't be stolen and sold by developers or others. This is often done as part of a formal dev/test data load and scrub operation to both reduce the data volume and remove sensitive data before being used.
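A minimal sketch of such a scrub step on the dev/test copy, with a hypothetical users table (adapt the names to your schema):

mysql devdb -e "UPDATE users SET email = CONCAT('user', id, '@example.com'), phone = NULL, password_hash = 'DISABLED';"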
Also, a key part of off-site backups is security. There are many stories of data getting lost or stolen while off-site, either on tapes/disks or as files on a remote file server (or laptop). For this reason, we always encrypt any backup file before it leaves the server. Be careful not to lose the encryption key, and test a restore.
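One simple approach, as a sketch, is symmetric encryption with OpenSSL before the file is transferred (the key file path is an example; keep the key somewhere other than the backup destination):

openssl enc -aes-256-cbc -salt -in db-backup.tar.gz -out db-backup.tar.gz.enc -pass file:/root/.backup-key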
In summary, off-site backups are very important to your business, especially for surviving disasters of various types. Any backup is better than none, of course, but best practice is daily off-site backups that are encrypted, tested, and scrubbed before any dev/test use.
Posted on July 20, 2012 by Steve Mushero, Co-Founder and CEO
Posted on July 12, 2012 by Steve Mushero, Co-Founder and CEO
Most people have heard of HAProxy, and some know it has the same architecture as nginx, being a single-threaded event-driven system that can scale to 100-200,000 simultaneous connections and 100,000 requests/second on big systems. More importantly, HAProxy is very powerful and flexible, with a wide variety of front-end, back-end, and standby pools, very flexible re-write rules and checking, and more.
It also has very powerful and flexible logging, including how every connection or request was started and ended, at what HTTP phase, and by whom - this can really help troubleshooting. In addition, the real-time API allows engineers to dynamically add/remove servers from the pools, which is needed for maintenance, testing, etc. (though we have built special tools to make this easier).
For us, one of the most useful parts of HAProxy is its very sophisticated monitoring, including a very nice GUI that we can access in a browser. This lets us see the status and statistics of all pools and servers, including errors, connection, request rates, check info, and much more. We can use this directly for real-time monitoring, and also pull the data via an API to feed our monitoring system.
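Enabling that stats GUI takes only a few lines in haproxy.cfg; this is a minimal sketch with example address and credentials:

listen stats
    bind 0.0.0.0:8080
    mode http
    stats enable
    stats uri /haproxy-stats
    stats auth admin:changeme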
On the other hand, nginx has almost none of these features and is very simplistic, especially in monitoring and control. There is no way to know which servers are healthy, and there are no stats on connection rates or other info that make a system useful for troubleshooting, monitoring, or control. Nginx is simple and works, but it is not well-suited for large or complex systems.
The one thing HAProxy cannot do well is SSL, which is not directly supported. The easiest way around this is to use nginx to handle SSL connections on port 443 and then forward the un-encrypted connections to port 80 (or a different port if to be balanced separately). This is a bit complex, but not too bad and works well, though some work is needed to get the client IP passed all the way through the system to the real application servers.
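A sketch of the nginx side of that setup (certificate paths and addresses are examples, with HAProxy listening on port 80 behind it):

server {
    listen 443 ssl;
    ssl_certificate     /etc/nginx/ssl/site.crt;
    ssl_certificate_key /etc/nginx/ssl/site.key;
    location / {
        # forward decrypted traffic to HAProxy and preserve the client IP
        proxy_pass http://127.0.0.1:80;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header Host $host;
    }
}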
Overall, HAProxy is the best choice for large-scale load balancing of real systems, especially when they change often, have many pools and complex needs, and require good monitoring and control. Nginx is not a bad choice, but HAProxy is much better.
Posted on June 6, 2012 by Steve Mushero, Co-Founder and CEO
The Linux kernel has come a long way in terms of performance in the last few years and especially in the 2.6/3.x kernel line. However, at very high IO rates, especially for the network, interrupt handling can become a problem.
We've seen this on high-performance systems saturating one or more 1Gbps NICs, and also in VMs with lots of small packets, with a recent overload at about 10,000 packets per second.
The reasons are clear: in the simplest modes, the kernel processes each packet via a hardware interrupt from the NIC. But as the packet rate rises, these interrupts overload the SINGLE CPU that handles them. This single-CPU concept is very important and poorly understood by sysadmins. On a common 4-16 core system, an overloaded core is hard to see, since overall CPU utilization is only 6-25% and the server looks normal. But the system will run very poorly, dropping packets with no warning, no dmesg log entries, and nothing appearing to be wrong.
But if you look at top in multi-CPU mode (run top and press 1) at the %si column (time spent servicing software interrupts), or at the %irq and %soft columns in mpstat (mpstat -P ALL 1), you can see this - on a very busy system it's clear that interrupts are high, and with advanced mpstat usage you can see which CPU and driver is the problem.
You need a newer version of mpstat that supports -I mode. Then, to see the IRQ load, run this:
mpstat -I SUM -P ALL 1
Anything over 5,000/second is a lot, and 10,000-20,000/second is extremely high.
To find out what driver/item is creating the load, run mpstat -I CPU -P ALL 1
This output is hard to read but you need to trace the right column to see which IRQ is causing the load, such as 15, 19, 995, etc. You can also specify just the CPU you want to make the display simpler, such as "mpstat -I CPU -P 3 1" for CPU #3 - note that top, htop, and mpstat may number CPUs differently (starting at 0 or 1; both top and mpstat use 0, 1, 2 but htop uses 1, 2, 3).
Once you know the IRQ number, then look at the interrupt table, via "cat /proc/interrupts" and find the number from mpstat's load - then you can see the driver using that IRQ. That file will also show you the # of interrupts so you can see which is loading the system.
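For example, if mpstat points at IRQ 19, mapping it to its driver is one grep (the IRQ number is just an example):

grep '^ *19:' /proc/interrupts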
Okay, so now it's probably the NIC - what to do?
First, make sure you are running irqbalance, which is a nice daemon that will automatically spread your IRQs across CPUs. This is very important on a busy system, especially with two NIC cards, since by default CPU 0 will handle all interrupts, and can obviously get easily overloaded. irqbalance spreads these around to lower the load. For maximum performance, you can manually balance these to spread across sockets and hyperthread-shared cores, but this is usually not worth the trouble.
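On RHEL/Centos-style systems, checking and enabling it looks like this:

service irqbalance status
chkconfig irqbalance on      # start automatically at boot
service irqbalance start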
But even after spreading the IRQs around, a single NIC can overload a single CPU core. So then what? This depends on your NIC and driver, but generally there are two helpful options.
The first is multiple NIC queues, which some Intel NICs have. If a NIC has four queues, these can have different interrupts and thus be handled by four CPU cores, spreading the load. Usually the driver does this automatically, and you can check via mpstat.
The other, often more important, driver option is interrupt coalescing. This powerful function allows the NIC to buffer several packets before raising an interrupt, saving a massive amount of time and load on the system. For example, if the NIC buffers 10 packets, the interrupt load drops by almost 90%. This is usually controlled by the ethtool utility, using the -c/-C options, but some drivers need it set at driver load time; see your documentation. Some cards, like the Intel NIC we worked with, have automatic modes that try to do the best thing based on load.
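With ethtool, -c shows and -C sets the coalescing parameters; the values below are examples to tune against your own load, and driver support varies:

ethtool -c eth0                             # show current coalescing settings
ethtool -C eth0 rx-usecs 100 rx-frames 32   # wait up to 100us or 32 packets per interrupt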
Finally, some drivers, such as the ones we saw on our VMs, just don't support multiple queues or coalescing. In this case, once the CPU is saturated, that's the limit of your performance until you can change NICs or drivers.
This is a complex area that's not well-known, but a few good techniques can really improve performance on very busy systems. Also, a little extra monitoring can help find and diagnose these hard-to-see problems.
Posted on April 25, 2012 by Steve Mushero, Co-Founder and CEO
Turns out that using NUMA for large-memory processes is not so easy, as we discovered when we tried to update our MySQL init scripts to world-class levels. As we've written before, current best practice on NUMA systems is to use numactl to set full interleaved mode for large-memory processes such as MySQL, MongoDB, Memcached, and Java systems. This prevents individual RAM nodes from mysteriously running low and swapping even when there is plenty of RAM free on other nodes.
But we found that in at least some versions of Centos 5.x for Xen VMs, NUMA is not compiled in and thus not supported. This seems to be due to bugs that can prevent the kernel from booting in certain circumstances, though these are unclear. As a result, there is no NUMA support and thus numactl is useless, but also there should be no swapping problems, even on NUMA hardware.
We are looking at enabling this for VMs with a different kernel, though it may not make sense, since we would then use numactl to set interleaved mode, which basically turns NUMA off (at least for the big process; the kernel and other processes probably still benefit from NUMA-aware locality).
Another option we have not yet considered is to turn off NUMA support at boot time, which in theory would also solve the problem and run in a pseudo-interleaved mode. Interesting idea.
How and why does this happen? Many ways, so let's look at the good and bad ways to back up MySQL databases.
Slaves - Many people think that backing up the Slave DB is a good and sensible idea. Even if they backed up the Slave correctly (which they usually don't, see below), this is still a bad idea. Why? Because the slave data is often not the same as the Master DB. Why? Many reasons, including MySQL bugs, but also because some statements are not replicated correctly (ever see those warnings in your logs?). Or replication stopped due to another error like a deadlock or timeout and was incorrectly restarted or skipped, or there are poorly understood skip/do replication filters somewhere in the system's history (maybe years ago) that cause data de-synchronization.
Basically, you should not trust that the slave is correct and thus avoid backing up the slave unless the Master DB is very busy or for other reasons cannot handle the performance or locking issues related to backup. Even in this case, you should use a tool like Percona's replication checker to verify replication to know the slave's data is probably accurate.
Locking - Most customers come to us using MyISAM tables, which are very hard to back up (and are a bad idea, see other blogs). The only good way to back up MyISAM is to fully lock the database for the entire backup period, often many minutes or hours, effectively killing the website for a long time. This can be fixed by using InnoDB, but some customers try another way: using mysqlbackup in non-locking mode, or backing up the raw files with or without a snapshot, etc. These are all useless, as the data will be corrupt or inconsistent.
Mixed MyISAM/InnoDB - Many customers have both MyISAM and InnoDB tables but back up using standard single-transaction methods meant for InnoDB, which produce bad MyISAM data (and they don't find out until they test or do a restore). Even worse, many customers don't even know they have MyISAM tables, but get them when developers create tables without checking the engine; in MySQL 5.1 and earlier, the default is MyISAM. We have special monitoring to detect this situation, but most customers don't know about it and are backing up inconsistent data.
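A quick check for stray MyISAM tables:

mysql -e "SELECT table_schema, table_name FROM information_schema.tables WHERE engine = 'MyISAM' AND table_schema NOT IN ('mysql', 'information_schema');"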
Right Way - The right way to do it is with careful planning and selection of options, mostly based on whether or not there are MyISAM tables. First, always backup the master if possible. Second, if all InnoDB, use single transaction mode or LVM snapshots and check to make sure there are no MyISAM tables. Third, if there are MyISAM tables, use Percona tools to validate a slave and back that up if needed.
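As a sketch of the all-InnoDB case, a consistent dump from the master without long locks might look like this (add your own connection options; --master-data=2 records the binlog position as a comment, handy for building slaves):

mysqldump --single-transaction --master-data=2 --all-databases > /var/backups/full-$(date +%F).sql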
Backing up MySQL is not easy to do perfectly, as it takes good knowledge, tools, and monitoring. We work on this problem every day with our customers to ensure their data is safe and reliable.
Posted on April 23, 2012 by Steve Mushero, Co-Founder and CEO
Interesting customer question this week on the best Master for the 2nd, 3rd, etc. Slaves in a system. We've always used the primary Master as the Master for all Slaves, as the simple best practice, but now I'm wondering. Making the 2nd and later slaves replicate from the main Slave would make any Slave promotion to Master much easier, since the secondary slaves would not have to be re-pointed, which is messy and difficult, especially with various failure modes at 3am, even with MMM or other tools.
On one hand, these slaves would keep all their CHANGE MASTER TO settings, their log positions, everything. This greatly simplifies the very stressful Master failover situation, so that all that has to happen is a simple read-write and write-IP promotion, after what is often lots of troubleshooting of the real Master. In addition, this would really simplify any automated HA DB failover upon Master failure.
On the other hand, this introduces a secondary Single Point of Failure (SPoF): the main Slave. In a typical multi-slave system, any Slave can be promoted to Master by simply declaring it so, making it read-write, and pointing all other slaves at it, and any slave failure is easily handled by just dropping the dead slave from the application DB pool. But if the main Slave is the sub-slaves' Master and it fails, then all other slaves would have to be re-pointed to the Master or another Slave. In this scenario, essentially all Slaves would fail together, since all updates flow through the main Slave. Most reads would continue and the app would stay up (unlike a Master failure scenario), but the lack of updates would become a severe problem rather quickly.
I'm not sure which is better or worse, but it's an interesting question to ponder. For now we'll keep our simple single Master to all Slaves structure, but keep thinking about how to improve and simplify that.
Another related issue is virtual IP addresses in any multi-DB system. We have not been ideal at this, in that we use real IPs for all DBs, even though this makes failovers and promotions, especially of R/W split systems, much more difficult. But using real IPs also makes the system much simpler to setup and manage, and given that failovers are very rare, has worked quite well. Yet another issue to think about as we work to design and build the world's best and fastest systems.
Posted on April 10, 2012 by Steve Mushero, Co-Founder and CEO
We have many customers who back up their Slave databases and feel that this is enough. We disagree, as do the experts, because Slaves often do not match their Masters. How can this happen? Many different ways, and while MySQL has gotten better over the years, Slaves are still very often quite different from their Masters, often with corrupted or inaccurate data, so you should not rely on them for good backups (though of course any backup is much better than none!).
MySQL Replication is a wonder of simplicity and general reliability, but this simplicity, along with MySQL's rather loose definitions of data integrity, creates many situations where the Master and Slaves can differ. We even see customers where the Slaves differ from the Master and from each other, sometimes every week. There are many ways this can go wrong, from duplicate keys to non-deterministic procedure execution, common LIMIT issues, and more. This is especially true if you have replication warnings in your DB logs.
So, we usually recommend always backing up the Master, so you get good data, always. We use advanced methods, often tied to InnoDB so there is no locking, and try to limit the performance impact by using high-performance hardware and DB configurations. This guarantees good backups, accurate data, and good performance.
There are a few good tools for this that we are starting to use for our customers to fix the sync issues, including and especially the old Maatkit system, now maintained and re-branded by Percona, the world's top MySQL consulting company. We use these tools to scan Slaves to find differences and are testing additional tools that can fix and re-sync the data, especially on large systems where a Slave re-sync is not practical, or where Replication has errors so often that we need to fix them every week.
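A sketch of such a consistency check with the Percona tool (connection details are examples):

pt-table-checksum --replicate=percona.checksums h=master.example.com,u=checksum,p=secret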
With these tools, customers can in theory use the Slaves for backups, though we are not yet doing this, as we continue to test and evaluate these complex and powerful tools. By the summer of 2012 we should have these tools running on most systems with periodic sync reports and enhanced DB backup options that include the Slaves and good backup guarantees (though for financial and other critical data, we'll usually recommend the Master).
Posted on April 5, 2012 by Steve Mushero, Co-Founder and CEO
EC2 - The core offering is cloud servers, which are extremely popular and useful, especially since Amazon recently both lowered prices and added long-needed new instance types. We are especially excited about the 64-bit Small and Medium instances, which give our customers a whole new way and price point for using Amazon. Previously, the only 64-bit system smaller than the US$250/month Large instance was the Micro, which was cheap but nearly useless for real work. The old Small instance was 32-bit and thus not easily compatible with scale-up or the mainstream 64-bit systems and tools. But the new Small and Medium instances are both 64-bit and very well-priced, at about $64 and $125/month for non-reserved pricing.
S3 - The oldest and easiest-to-use AWS feature is S3, the Simple Storage Service, for storing images, backups, and most anything. S3 is pretty good, though it can be a bit slow and expensive for large data. Best practice for customers is to use it for shared uploads or static objects, especially things like uploaded images on SNS sites. Generally, you take the image on your server, re-size it, and push it to S3, storing the URL in your DB; your page then links to the S3 image. This is great because your several web servers do not need NFS, rsync, or other complex file/image sync systems, none of which scale well. And S3 is great for backups: in most cases we push our customers' encrypted backups there, for unlimited storage and retention as needed.
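The upload step can be a single s3cmd call from the web server (the bucket and paths are examples):

s3cmd put --acl-public /tmp/resized/photo123.jpg s3://my-site-assets/images/photo123.jpg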
RDS - The Relational Database Service, or MySQL in the cloud. This is an interesting and simple-to-use service, but we don't recommend it for high-performance use due to reports of problems and scaling issues. In particular, reports indicate the system does not cache data, instead using EBS I/O for all queries. Also, it's not really tunable for various situations, nor is replication easy to use. For now, large-scale customers should run their own MySQL instances on standard EC2 instances with EBS storage.
CloudFront CDN - This is Amazon's CDN which seems to work reasonably well across the US and EU, with some service in Asia, mostly from HK. However, service inside China is limited and probably cannot be relied upon for consistent service. For this reason, we recommend that customers with ICP licenses use a PRC domestic CDN like ChinaNetCenter for best performance inside China.
EBS - The Elastic Block Store or the basic iSCSI disk storage system for EC2, and broadly the best cloud storage system available. Even though performance is not always consistent, it's extremely flexible and easy to use with dynamic mounting, fairly easy re-sizing, snapshots and more. EBS is simply very nice and easy to use and generally quite fast, though one has to watch it.
ELB - For load balancing there is the Elastic Load Balancer. The ELB is a very simple load balancer, and we use it mostly to handle multi-availability-zone failovers, since there is no other good and fast way to do this (other than ugly and slow API EIP transfers). As a load balancer, ELB is not very good nor consistent, but it is necessary for real HA and also for SSL termination. We then usually run HAProxy on LB instances behind it for the real load balancing, where we handle rewrites, filtering, logging, etc. So any real system includes ELB plus two HAProxy instances.
We'll write more about other features such as Route53, Email Sending, and security in future blogs or newsletters.
MySQL has two common storage/DB engines: MyISAM and InnoDB. Up to and including version 5.1, MyISAM is the default and the most common, but it is also the oldest engine and is no longer being updated or improved. It is also slow in many cases and has the bad habit of corrupting data on a system crash, plus it has no transactions, no referential integrity (rarely used on the Internet), and none of the other advanced features.
This is mostly because MyISAM only has table-level locking, not row-level like all other modern systems. This means when a user / client does something, the whole table is locked (could be millions of rows) and everyone else waits. As you can imagine, this does not scale very well to large systems. More recently there are exceptions for SQL INSERTs and a few other things, but performance in real systems is still quite poor on MyISAM.
In addition, it has no transaction log / journal, so it just writes data to the Linux file cache and hopes it eventually gets on disk. If the system ever crashes and loses some of that data, MyISAM will often not start or complain you need to fix the tables; it has limited methods to recover data and often loses things. Also, MyISAM is difficult to backup correctly, usually requiring a full system lock of all data during the backup, which often means the website is down or not usable for 15, 30, 60 or more minutes per day.
By contrast, InnoDB is a much more modern system and is under heavy development, with several options in the InnoDB family, from the base engine to plug-in versions to advanced enhanced versions such as XtraDB and others. Everyone is working on scaling InnoDB for better CPU and IO performance, better backups and locking, improved statistics and debugging, and more. It is where all the action is.
Additionally, as a system, InnoDB supports several critical features, most importantly a transaction log and row-level locking. The log allows for real DB transactions but more importantly for data crash recovery and roll-back. This allows for much better data protection while maintaining higher performance due to the way InnoDB does IO. And row-level locking provides much higher concurrent performance in most cases, since users only lock data they are writing, and reads almost never block at all. Recent performance tests show massively better performance for InnoDB over MyISAM, especially under heavy load.
Generally, Innodb is much faster, much safer, much more powerful, and continually getting better. You should always use it for all your systems unless there is a specific reason not to.
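Converting an existing table is a single statement, though check first for MyISAM-only features such as full-text indexes (the table name is an example):

mysql -e "ALTER TABLE mydb.orders ENGINE=InnoDB;"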
What are those specific reasons you might not use InnoDB? There are at least two, the first being count(*) behavior. On MyISAM, a SELECT count(*) with no WHERE clause is very fast; InnoDB must actually count the rows and will be slow. Of course, count(*) is usually not a very good way to program (use an indexed column, like count(id), to be faster), and it is rarely a good idea without a WHERE clause.
Second, MyISAM allows full-text search on regular columns, which InnoDB does not. Normally real search is done outside of MySQL in systems like Lucene or Solr (or the SphinxSE engine available for newer MySQL versions), but some smaller sites do search in MySQL; in that case you probably need MyISAM for the text table.
Overall, and to put it simply, use InnoDB. Every day. For every table. Only minor exceptions that rarely apply.
Very good reference article from Tag1: http://tag1consulting.com/MySQL_Engines_MyISAM_vs_InnoDB
1. Don’t waste your time going after business you don’t really want.
2. The boss usually decides - not the assistant treasurer. Do you know the boss?
3. It is just as easy to get a first-rate piece of business as a second-rate one.
4. You never learn anything when you’re talking.
5. The client’s objective is more important than yours.
6. The respect of one person is worth more than an acquaintance with 100 people.
7. When there’s business to be found, go out and get it!
8. Important people like to deal with other important people. Are you one?
9. There’s nothing worse than an unhappy client.
10. If you get the business, it’s up to you to see that it’s well-handled.
Today an executive of Goldman Sachs, the world's top investment bank, quit because the company had changed over his 12 years there: it no longer puts the customer first, and it does things purely for the bank's profit, often things that are bad for the customer. Below is his article from the NY Times, read today by most of the world's top businessmen and women, about why he's leaving a company that has helped customers for 143 years.
There are good lessons here, about always serving the customer. For us, this means choosing the right servers, services, etc. for the customer, even if it means less money for us. Always do what is right for the customer. Even if that is fewer servers or no servers at all.
We are always thinking of how we can do the right things for them, especially when we are choosing servers or services and spending their money. Remember that we are usually spending the customers' money and that the #1 thing we sell is trust. Trust that we will do the right things, securely, quickly and professionally, but also that our advice is correct and worthy of their trust, not just best for us. This is where the guy below feels that Goldman Sachs has major problems. Our teams must work together to keep this in mind and always do what is right for the customer.
Interesting troubleshooting session with HAProxy (http://haproxy.1wt.eu/) yesterday on a high-load customer. HAProxy is a very high-performance system, handling up to 100,000 requests/second and over 100,000 connections, and a very flexible one that we use everywhere (and much, much better than nginx or LVS for this). Running systems at this performance level, at hundreds of millions of requests/day and such massive connection counts, is very hard and requires specialized Linux kernel tuning along with careful monitoring and management.
The problem: at 150-200,000 concurrent connections but fewer than 5,000 requests/second, the system was becoming very slow, taking several seconds to respond. Standard troubleshooting found nothing strange after checking overall CPU, memory, sockets, kernel messages, iptables connection-tracker limits, and TCP memory and pressure. Since this was a VM, we also checked the underlying Xen Dom0 system, since CPU and other limits there can also affect VM performance.
But a look at per-process CPU showed that HAProxy was using 95%+ of one CPU, a clear sign of process CPU overload. HAProxy is normally a single-process, event-driven system, which is how it (and Nginx, which has the same architecture) achieves such high performance. But if that single process has more load than a single CPU can handle, we are dead, or at least very slow - I'm actually amazed that at 100% CPU the system could function at all, still serving 3-5,000 requests/second and 200,000 connections.
Why the CPU load? We don't know for sure, since the actual request rate was not that high. We think it's due to the number of connections and the overhead of managing that list, even though in theory the kernel should be doing that via epoll(); the system part of CPU was quite high, which is probably all the socket-selection work.
One question is why there were so many connections, and the answer is the long HTTP keep-alive time, which defaulted to two minutes. So 2,000 new connections per second times over 100 seconds of average connection lifetime gets you 200,000 connections, simple as that. We are shortening the timeout, which normally degrades the user experience, but this application is not that sensitive, so we feel safe reducing it to 60 seconds or less, down to 15 seconds, and in extreme cases just turning keep-alive off (since this customer sees only about 1.2 requests/connection, much lower than typical websites).
The overall solution for now was to enable multiple processes for HAProxy, which is a difficult feature to use since the statistics and monitoring are then per process and randomly switched between them, so debugging and monitoring data are much less useful. But the per-process CPU dropped to 50% and the server was suddenly very happy at 120,000 connections. We will run this way for a while and see how it goes and also work to get back to single process mode for better monitoring unless we can find a better per-process method.
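For reference, both of the changes discussed here are haproxy.cfg settings; a minimal sketch with example values:

global
    nbproc 2                        # run two HAProxy processes to spread load across cores

defaults
    timeout http-keep-alive 15s     # close idle keep-alive connections sooner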
Overall, this is all in a day's work for managing large sites and advanced high-performance load-balancing systems running at the edge of what is possible. And we learn something new every day, though fortunately we can then share that knowledge with you and across our customer base.
Linux knows all of this and tracks RAM use by CPU and application. It tries to schedule each process to run on the CPU that owns most of the RAM that process is using, which should improve performance. All of this is invisible to the user and sysadmin, and in fact, most engineers have never even heard of this technology, even though all new servers use it.
So what is the problem? Swapping. For years there have been reports of mysterious swapping by sysadmins, especially with large-RAM monolithic processes like databases and Java servers. Recent discussions and tools have shown that NUMA is a key culprit. Why? Because newly created processes default to allocating all or most of their RAM on a single CPU's node, then using some from other CPUs. Thus on a 16GB machine with 8GB per CPU and a 12GB MySQL process, we'll find MySQL uses all 8GB of the first CPU and 4GB from the second.
Why is this a problem? Because the kernel needs RAM too, and it does not balance well across the NUMA system, especially when one CPU's RAM is full. Instead, it will swap out some RAM on the first CPU, even when there are many GB of free RAM on the second CPU. And of course this swapping is evil, freezing the whole DB system while it swaps and killing the website.
How to fix this? For now, the only way is to start the big-RAM process using a special command, numactl, in interleaved mode. This splits the RAM use evenly between the CPUs and avoids all of these problems, though in theory the system may be a little slower due to cross-CPU RAM access. It is still a good and recommended method. But it's also annoying, because you must use numactl every time you start such processes, which means modifying init scripts and otherwise changing how things start (plus making sure the numactl package is installed, which is not so common).
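The init-script change amounts to a one-line wrapper, roughly like this (paths are examples):

numactl --interleave=all /usr/bin/mysqld_safe --defaults-file=/etc/my.cnf &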
What Linux really needs is a default NUMA policy for this. There are sophisticated policies for controlling RAM use, binding, etc., but these are all per-process, and there is no way to set a default. The kernel should really have a sysctl for this, allowing sysadmins to set the default policy for new processes, perhaps with filters such as for large-RAM processes. This would go a long way toward avoiding unneeded swap and related problems, since we could just set the policy to interleaved and run normally.
Linux is a great operating system. Period. Probably one of the best ever built, and still the fastest-evolving, especially given that it already runs everything from cell phones to supercomputers, with a new version released about every 8 weeks, sometimes with major internal changes. But it still has annoyances and issues that make life difficult for Internet Operations, where we live. We'll occasionally blog about these issues.
One is swapping. Not swapping as a concept, which everyone supports, but how it's handled, monitored, and managed on Linux.
First is swappiness. This is the kernel sysctl that determines how the system should trade application RAM for file cache. The default on most distributions is 60, which is absurdly high. We always set this to zero as servers should never swap, period. Swapping is evil on many levels, but in practical terms, any swapping process is frozen during swap, which is deadly to modern multi-threaded systems like MySQL or Java, which are simply dead for 1, 10, 100 seconds while we swap. If there is really no RAM free, then there is no choice, but the swappiness of 60 makes this happen far earlier than it should, often with gigabytes of RAM still free.
Maybe this is useful on a desktop to push out old apps and free up cache, though even there we'd suggest a setting of 20-30; servers should be zero, pure and simple. One might argue that some swappiness is okay to push out unused code, but such code is not that common on real servers and is often quite small, especially with today's RAM sizes (we always use 2-64GB).
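Setting it is a one-liner, plus an /etc/sysctl.conf entry to survive reboots:

sysctl -w vm.swappiness=0
echo 'vm.swappiness = 0' >> /etc/sysctl.conf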
Second is swap monitoring. We sometimes have low-RAM issues that force some swap, but it's then very hard to tell what is using the swap. Why do we care? Partly because we just want to know. But more importantly, even when RAM use drops, things stay in swap, often a lot of things, and this creates alerts and worry if RAM runs low again (in theory swapcache tells us how much real danger there is, but that's an advanced topic for most engineers). We can restart the offending services during a maintenance window to clear the swap, if we know which services - obvious on single-use machines like DB servers, but many customers run many services on one bigger machine. A tool to tell us which apps are really using the swap would be useful.
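On newer kernels (2.6.34 and later) the per-process VmSwap field in /proc gives a rough answer; a small sketch that lists the biggest swap users first (values are in kB):

for f in /proc/[0-9]*/status; do
    awk '/^Name:/{n=$2} /^VmSwap:/{print $2, n}' "$f"
done | sort -rn | head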
Third is swap release. As noted above, we get some cases of past swap use, which could be days, weeks, or months ago when we used the swap, but it's not needed now. Some or most may even be in the swap cache and can be instantly released, but there is no method to do that. So the first useful function would be to immediately drop the swap cache, just as we can drop the file caches. The second would be to force things to be swapped in to reduce swap use and make us all feel more comfortable.
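The blunt workaround, when there is enough free RAM to absorb everything, is to cycle swap off and on, which forces a full swap-in:

swapoff -a && swapon -a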
Fourth is random mysterious swapping. Linux is very sophisticated, but it still swaps for no obvious reason, and no one seems to know why. Recently some of this has been traced to NUMA issues (see future blog), but other cases are still not well understood, especially on older kernels (before 2.6.3x, certainly including RH/Centos 5.x, which are 2.6.18). Even today we always allocate some swap because the kernel just seems unhappy without it, and in some cases it goes crazy, with kswapd and other things taking 100% CPU for no reason; we see this on larger-RAM servers on EC2, for example, especially under load. Newer kernels are better, but more progress is needed.
We run large-scale Internet systems. All large-scale Internet systems have problems, so any big Internet company has a team that deals with the problems in their own system. Many problems are common, but many of these problems are different for different companies, different technologies, and different industries, e.g. video systems are very different from mobile game systems.
But we run systems in every industry, which means we have everyone's problems. That is good in many ways since many of the problems are the same, such as hardware, IDCs, Linux, MySQL, PHP, Apache, HA, security, performance, etc. So we can share our deep knowledge and lessons learned from one customer with the others, and everyone benefits. We bring world-class best practices to all of these areas.
But it's also difficult because many of our customers' problems are not the same, and we have to handle every type of problem on every type of system, including uncommon languages (Python, Ruby, Perl), search engines (Solr, Sphinx), caches/queues (Redis, MQ), NoSQL (MongoDB and others), video encoding, sharding, hardware, firewalls, replication, batch systems, continuous integration and automation (Hudson, Puppet), many custom issues such as MMO game engines, and much more.
For these issues we have to learn about the systems, the problems, and how to monitor, manage, troubleshoot, document, and tune all these things. This makes us the global experts in everything about the Internet, but also makes life interesting at 3am when one of these things breaks. In the end, we have to know everything about everything . . .
Hi! This is the shiny new ChinaNetCloud Blog, partly about the things we do, but more importantly about Operations, Clouds, technology, Linux, customers, sales, service, and other random topics we find interesting and want to share with you. Later we may split this into a few blogs, such as on business, tech and tools, service, etc.
Steve Mushero, our co-founder and CEO, will initially write most of the blogs, but other guest-writers will also help from time to time, including team engineers, managers, sales, and others who have interesting ideas to contribute.
As a reminder, our main business is architecting, designing, building, and especially operating large-scale Internet servers and systems. We outsource and operate customers' backend servers and databases, taking care of everything: running the systems 24x7, with deep monitoring, troubleshooting, backups, and more, focused on performance, reliability, and saving the customer money. This is what we do, here, there, and everywhere.
We also partner with and recommend a variety of 3rd-party services such as IDCs, CDNs, hardware, web monitoring, email delivery, and more. And our business ecosystem includes a wide variety of others, such as investors, web development, digital advertising, data analytics, content, accounting, legal, and business setup - we are always happy to connect people together. Let us know how we can help connect you...
Today we mostly serve the Chinese market, and mostly for Chinese Internet companies, though we also have lots of global customers who are coming into China and ask us to run their systems. We also increasingly are taking Chinese companies out into the world, running systems that serve Asian or American users, often using Amazon EC2 in Singapore, Japan, or California.
Later we will expand to take over the world, since our model, service, and business scale to fit the needs of any Internet company, anywhere. We will probably open an office in Singapore later this year to serve SE Asia, India, and beyond. Then on to the Middle East, Latin America, the U.S., Europe, and eventually Africa - our vision is to Run All the World's Internet Servers. There are about 10 million, so we have a lot of work to do!
Stay tuned for interesting blogs in this space.
Posted on January 17, 2012 by ChinaNetCloud Team