SBC Solutions Better Living through Terminal Services


GSLB Sync Problems? Check your RPC nodes.

Came across this issue recently when migrating to new appliances. When initiating a GSLB configuration sync, you may see the following error:

Getting Config: Skipping - ERROR: Invalid username or password.

Generally speaking, this sounds fairly straightforward. Check the RPC node configuration under Network -> RPC. The RPC secret must be the same on both nodes, and if using secure RPC (which you should), make sure that is enabled on both sides by clicking the "Secure" checkbox.

However, I tried again and still had the same problem. I verified that both sides are using the same nsroot password, and verified that management access was configured on the GSLB site IPs. Still no dice.

Here's where it gets interesting. Open GSLB -> Sites and look at the status of your sites. In my case, MEP was up. Huh? Logically thinking, that tells us that the RPC node configuration has to be the same, or else MEP would be down too!

After watching a tcpdump to verify communication back and forth, I got to thinking a bit more. GSLB config sync uses port 3008 when security is enabled. What else uses port 3008? High availability configuration sync. In this environment, we aren't using HA, so those standalone RPC nodes weren't configured.
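Before digging into RPC node settings, it can help to confirm that the port is reachable at all. The following is a minimal sketch, not anything NetScaler-specific: it just attempts a TCP connection. The function name is my own, and port 3008 is the only value taken from the scenario above.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection performs the full TCP handshake, then we close.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


# Example call (hypothetical address; use the remote node's own IP):
#   tcp_port_open("192.0.2.10", 3008)
```

A successful connection only proves the port is listening and not firewalled. As the rest of this post shows, the RPC secret and the secure setting can still be wrong.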

I configured the same RPC secret and enabled security, then tried to run another sync. Success! This aspect of configuration isn't at all apparent when planning your environment, so this can be a bit of a curveball. However, taking a step back and thinking logically about what you're attempting to do and what's going on in the background is always important.


High Availability – A Narrative

After getting bitten by another firmware bug, I feel I must address something to anyone doing research on NetScaler devices, whether to build a new environment or to refine an existing deployment. If you don't buy your devices in HA pairs, you are putting your environment in danger. Whether or not ten minutes of downtime is critical to you is something to be considered, but in enterprise environments this can seriously jeopardize the client perception of the services you provide.

Our most recent tangle with a firmware bug has to do with LDAP group extraction, something every NetScaler configured for LDAP authentication will do automatically. If a client's extracted group string is longer than 25 kilobytes, a buffer overflow will happen. This makes NetScaler angry. The packet processing engine (nsppe) that happens to process this LDAP response will die. Pitboss, the HA daemon, will notice this and declare this NetScaler device to be unhealthy.
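If you want a rough idea of whether any of your users are near that threshold, you can estimate the size of their group list yourself. This is only a sketch under assumptions: I don't know exactly how the NetScaler assembles the extracted group string, so the code simply joins the group names with a separator and measures the result. The function names and the comma separator are mine; the 25 KB figure is the one quoted above.

```python
LIMIT_BYTES = 25 * 1024  # threshold quoted in the post; treat it as approximate


def group_string_size(groups, separator=","):
    """Size in bytes of the group names joined into one string.

    Assumption: the appliance builds something like a separator-joined
    list. The real internal format may differ.
    """
    return len(separator.join(groups).encode("utf-8"))


def exceeds_limit(groups, limit=LIMIT_BYTES):
    """True if the joined group string is larger than the limit."""
    return group_string_size(groups) > limit
```

Feed it the memberOf values for a user (however you export them from AD) and treat anything close to the limit as a warning sign, not a precise prediction.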

If you don't have HA configured and running in your environment, this is when you get a few thousand angry people beating down on your poor helpdesk staff. For up to ten agonizing minutes, your environment vanishes into thin air. The NetScaler dutifully creates a coredump and then reboots the device, meaning that all traffic destined for that device is now dead in the water. Anyone accessing any load balanced VIPs? Gone. Any app passing SQL data through a DataStream VIP? Dead. Any users connected to your XenApp and XenDesktop farms? Disconnected, with no chance of automatic reconnection. Anyone connected through VPN? Cut off from the network. Unsaved config changes? Brother, they never happened. Generally speaking, any time you can count angry users by the thousands is a bad day in the life of an IT worker. This should be avoided at all costs.

Now let's look at that same scenario with one major change - this time we have a shiny pair of NetScaler devices running in HA. We'll call them NS1 and NS2.

NS1 is the primary node, accepting all client connections and doing all of the heavy lifting while NS2 is just waiting in the wings, making sure NS1 is feeling OK. Suddenly, John from the desktop team decides he wants to access his XenDesktop, so he hits the Access Gateway VIP that's hosted on NS1 and logs in. Now, little Johnny is a curious person. He knows that the GPO responsible for installing a specific printer is targeted to a specific AD group. What happens if he adds himself to all 100 groups? Even better, how would XenDesktop handle it? He calls up those jerk Citrix admins, and they just tell him not to do it. But he'll show them! He adds himself to all the groups, waits for replication, then heads to the Access Gateway VIP like a freight train at full speed.

Poor old NS1 has no idea what's about to happen. The PPE is processing traffic with aplomb when it gets a request to authenticate John. It passes the username and password to Active Directory and awaits the response. When it gets back the entire text of War and Peace, it gets confused. Then it gets angry. To retaliate against this jerk of a domain controller, it stops dead in its tracks and refuses to go further.

Pitboss, meanwhile, has been sitting on top of a hill like a sheepdog, watching the PPEs mill about doing their thing. Suddenly, it perks up. Something isn't right - there were five processes but now there are only four! It counts again carefully, confirming there's a problem. Pitboss on NS1 hurriedly calls pitboss on NS2. "Something's wrong!", he shouts, "tell /dev/wife I love her!" He then writes a coredump and begins the reboot process.

Hearing this, pitboss on NS2 springs into action. NS2 becomes the active node, and starts letting the network know that all the traffic on NS1 should now come to NS2 until further notice. This means that LB VIPs stay up. DataStream VIPs stay up. Access Gateway user sessions stay connected. Only the slightest of blips would be noticed during this time; all of which can be handled by lower-layer protocols. Johnny wouldn't get granted access, but your thousands of other users won't be unceremoniously pulled out of the matrix without warning. Your NetScalers would immediately let your engineering team know that Something Bad(tm) happened, and that they need to start grepping log files like the CLI-ninjas they are. If they're on v10, they'll even open the ticket and submit a support file for you!

Yes, it is more expensive. Yes, it requires extra ports on switches, SFPs, and a bit more planning. It's also worth its weight in gold. Bugs happen. With HA, they don't have to affect your users.


SSL and NetScaler: Not Just a Checkbox

The whole point of a NetScaler device is to get into the middle of all communication between a client and network resources. Once it's there, we can do all sorts of Really Cool Stuff(TM), such as SSL Offloading, compression, connection multiplexing, and AppFW just to name a few. The one thing that the above functions depend on is that they must be able to read all traffic between the two sides of the conversation. It must be unencrypted and uncompressed.

SSL offloading lets us shift the burden of cryptographic operations and certificate management from backend servers to a single point - your NetScaler infrastructure. Generally speaking, you'll have backend services using HTTP on port 80 communicating with your NetScaler, and the frontend VIP that clients access using SSL on port 443.

This is the simplest SSL offload configuration, which is sufficient for most users. In this case, there is only one sequence of encryption and decryption for packet flow, and it is all taking place on the Cavium SSL offload card. All of the NetScaler advanced protection and optimization features are available to this data flow. You can troubleshoot application issues by running an NSTrace and decoding HTTP traffic going to the relevant backend services.

The second example is SSL all the way through. In this scenario, we want to ensure backend traffic between servers and the NetScaler is protected using SSL. This is also fairly common, and there are a few special considerations to note. First, your backend services *do not* have to have externally valid certificates. They can be self-signed as far as the NetScaler is concerned - as long as they're functional things will work as expected. Remember that the client isn't ever directly touching the backend servers. This doesn't require any special configuration on the NetScaler side other than creating SSL services rather than HTTP.

One major thing to understand is that this process is doubling* your SSL operations. Remember what we established at the beginning: for all the fancy features to work, the NetScaler kernel must be fed *unencrypted* and *uncompressed* traffic. If you're using a hardware appliance, the backend SSL termination will still be performed on a Cavium SSL Offload card. If on a VPX, this will be performed by the PPE (Packet Processing Engine) which uses standard CPU time. This is much less efficient and makes CPU monitoring even more critical. Packets will come off the wire, be fed through the Cavium card to be decrypted, sent through the NetScaler kernel to work its magic, then shipped back through the Cavium card to be encrypted and back on the wire to be shipped to the client device.

In an SSL all the way through configuration, you're still able to utilize all of the fancy features. There is a performance penalty due to the double-encryption, but that won't impact the performance of the other features.

The third SSL method is SSL Bridging. Notice I didn't call this an SSL Offloading method. In this scenario, you're directly exposing a backend service to the client. Let's take a minute to let that sink in. The client is directly talking to the resource. That means you don't get the benefit of SSL offloading - your servers are responsible for all SSL session establishment, key exchange, and encryption overhead. You don't get the benefit of TCP multiplexing - client connections and server connections will have a 1:1 ratio. You will not be able to use AppFW, since the traffic can't be decrypted by the NetScaler. The same problem kills all advanced functionality of the NetScaler for this resource.

At this point, the NetScaler is essentially a session-layer router. An SSL Bridge resource can do load balancing, and that's about it. Almost every function of the NetScaler platform revolves around getting in the middle of the traffic flow, and you simply can't do that if you can't decrypt it. You also can't perform AAA since the NetScaler can't insert a redirect to the AAA-TM page.

Exposing an SSL Bridge vserver to your DMZ is suicide. You're directly exposing backend resources to the Internet without any protection, authentication, auditing or other controls.

You may be wondering, "Why would they even leave that functionality in?" That is a valid question, and generally speaking, it was left in for very specific use cases. For example, if you need to load balance a web application that requires client SSL certificates, your only option is to use an SSL Bridge. SSL certificates and keys can be non-intuitive to someone not familiar with the NetScaler management utility, and using an SSL Bridge can be a tempting option to sidestep that configuration to the untrained eye. I hope this article has helped shed some light on the subject.


* This isn't exactly doubling all SSL operations, as connection multiplexing saves a lot of overhead with regard to session setup, initial key exchange and teardown. All of the actual data, however, will be encrypted twice.
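To put rough numbers on that footnote, here is a back-of-the-envelope model in Python. It is a sketch under my own assumptions, not a measurement: every client connection costs one frontend handshake, backend connections are shared at a fixed multiplexing ratio (the `mux_ratio` parameter is hypothetical), and every byte of payload is processed once per encrypted leg.

```python
import math
from dataclasses import dataclass


@dataclass
class CryptoLoad:
    handshakes: int   # SSL session setups the NetScaler has to perform
    bulk_bytes: int   # payload bytes pushed through encrypt/decrypt


def offload_load(client_conns: int, bytes_per_conn: int) -> CryptoLoad:
    """SSL offload: only the client-facing leg is encrypted."""
    return CryptoLoad(client_conns, client_conns * bytes_per_conn)


def end_to_end_load(client_conns: int, bytes_per_conn: int,
                    mux_ratio: int) -> CryptoLoad:
    """SSL all the way through: both legs are encrypted.

    mux_ratio is an assumed number of client connections sharing one
    backend connection thanks to connection multiplexing.
    """
    backend_conns = math.ceil(client_conns / mux_ratio)
    return CryptoLoad(client_conns + backend_conns,
                      2 * client_conns * bytes_per_conn)
```

With 1,000 client connections of 50,000 bytes each and an assumed ratio of 20:1, the model gives 1,000 handshakes for plain offload and 1,050 for SSL all the way through, while the bulk data handled goes from 50 MB to 100 MB. The handshake count barely moves and the bulk encryption doubles, which is the point the footnote makes.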


Troubleshooting and FIPS: A Few Words

As you may know, Citrix can ship NetScaler devices that are FIPS 140-2 compliant, which means they can be used in the public sector. Those of you that work with FIPS-enabled devices know that it tends to be the eternal monkey wrench in your troubleshooting efforts.

The main point of having a FIPS-enabled appliance is that your private keys are protected. Instead of files on a disk, they are stored in memory on a special expansion card commonly referred to as an HSM, or Hardware Security Module. The actual hardware is protected in such a way that if any attempt to get at the memory or other inner workings of the device is made, the memory is wiped in order to protect the integrity of encrypted data.

While this is fantastic for security, you must also realize the direct implications of that tradeoff. You can't copy SSL certs and keys from one box to another the way you can with a standard, non-FIPS appliance. Transferring keys is possible, but requires a much more involved process called Secure Information Management which will be further explored in a forthcoming post. You can't simply download the private key and decode SSL sessions using Wireshark. The only devices that can decode the traffic of a given FIPS tunnel are the source and the destination devices.

This means that any time you have to call vendor support, you're not going to be able to easily provide any kind of packet trace. The only way to decode the traffic is to actually perform a targeted MITM attack using a software product similar to Fiddler. Even then, that comes with a whole host of pitfalls. If you call Citrix for help on an issue and they request traces, Citrix Support won't be able to decode an NS trace. (They'll still ask for your private keys, though, which is incredibly frightening in its own right.)

Another thing I constantly run into is lack of compatibility between different FIPS models. NetScaler 9010 FIPS models have a different HSM than the MPX FIPS models, which means that configuration requirements can be just different enough to cause problems. For instance, I found today that the SOPin and User PINs on the MPX FIPS models are capped at 14 characters, while the 9010 FIPS models can support longer passwords. The error message is not helpful, and the problem is not obvious. I had to get escalated to the Senior Engineering level before this little gem was discovered. FIPS is not commonly talked about, and only a small fraction of the user base ever has to deal with it, which means that your average support rep, while as helpful as possible, is not always properly equipped to handle many questions, including basic ones.

I ramble about this not to scare any newly-minted FIPS admins, but as a way to spread the things I've had to learn along the way. FIPS requires you to think differently, especially with regard to troubleshooting, and it's important to understand that in order to manage your environment in the most effective way possible.


Long Passwords and Access Gateway

I came across an incident earlier where a client was not able to log in to the Access Gateway site, even though he swore up and down he knew he was using the correct credentials. We confirmed that he was using the right credentials by having him log in to another password-protected intranet site with success. After the usual housekeeping of making sure his account was in good standing and in the proper access groups, I fired up PuTTY and started poking around.

In FreeBSD (which is what the NetScaler is built on), there exists a special type of object called a "named pipe," or "pipe" for short. In this instance, it appears as a file called "/tmp/aaad.debug". It isn't a normal text file, as it doesn't actually store the data. It serves a similar function to the system's event log. As authentication events happen, debug events are streamed to this file. If somebody's watching, they'll see them. If not, they go directly to the bit bucket.
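If named pipes are new to you, the behavior is easy to reproduce on any POSIX box. The sketch below is generic Python, not NetScaler code: the file name and function are made up for illustration, and I don't know how aaad itself opens its pipe. It shows the two cases described above: with no reader attached, a non-blocking writer can't even open the pipe, and with a reader attached, lines stream straight through without being stored.

```python
import errno
import os
import stat
import tempfile
import threading


def fifo_demo(messages):
    """Push messages through a named pipe; return (dropped, received, is_fifo)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.debug")  # hypothetical name
        os.mkfifo(path)

        # 1. Nobody is reading yet: a non-blocking open for writing fails
        #    with ENXIO, so a writer that ignores the error drops its output.
        try:
            fd = os.open(path, os.O_WRONLY | os.O_NONBLOCK)
            os.close(fd)
            dropped = False
        except OSError as exc:
            dropped = exc.errno == errno.ENXIO

        # 2. With a reader attached, the same lines stream straight through.
        received = []

        def reader():
            with open(path, "r") as fifo:
                for line in fifo:
                    received.append(line.rstrip("\n"))

        watcher = threading.Thread(target=reader)
        watcher.start()
        with open(path, "w") as fifo:  # blocks until the reader has opened
            for message in messages:
                fifo.write(message + "\n")
        watcher.join()

        # The object on disk is a FIFO, not a regular file.
        is_fifo = stat.S_ISFIFO(os.stat(path).st_mode)
        return dropped, received, is_fifo
```

On the appliance itself, the usual way to be the "somebody watching" is to drop to the shell and run `cat /tmp/aaad.debug` while the user attempts to log in.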

One of the most important things this file will let us do is find out more information on why a login process is failing. If you are using two-factor authentication, it'll tell you which method is rejecting the login. If it's a bad password or username, or the user is not a member of the correct group, it will tell you. If your LDAP bind account isn't working correctly, you'll see that as well.

Normally this works wonders, but in this scenario we had an additional wrinkle. I watch the debug messages and tell the client to log in and it fails, but I don't see any record of his attempt. I try logging in to the same page myself and it works just fine, so it's not the authentication method. When I remote to his PC, I notice the credentials are getting rejected immediately. I try having him log in with his smartcard, still no dice! I start a network trace while he inputs his password and notice that when he tries to log in, the page doesn't try to post anything.

Yep, we're looking at a coding issue. At this time, I start getting some details on his password and he tells me it was changed recently. No crazy high-ASCII symbols or anything are in there, but it's long - very long. I started doing some tests and found that I can replicate the problem if I stuck more than 31 characters into the password field. I rang up a technical contact at Citrix and found that this is a known limitation, and it is on the roadmap to be fixed in the first part of this year. I'll follow up again and see if I can get my hands on a pre-release version, but I'm sure it's been rolled up into v10.


The three important takeaways are that:

  • Passwords longer than 31 characters won't work on the Access Gateway.
  • Attempts to log in with a longer password will silently fail on the NS side, and present the standard "Bad Credentials" message to the user.
  • If you are using smartcards, this restriction applies whether using LDAP or smartcard login.

One more note: I've also been told that AGEE doesn't like passwords containing either slash character (/ or \). I haven't tested this yet, but will in an upcoming post.
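If you want to catch these cases before the helpdesk does (in a password-change portal, for example), the checks are trivial to script. This is a minimal sketch with a made-up function name. The 31-character limit is the one I reproduced above, while the slash warning is second-hand and untested, so it is labeled that way in the output.

```python
MAX_AG_PASSWORD_LEN = 31  # limit reproduced in testing, per the post


def password_warnings(password: str):
    """Return a list of reasons this password may fail on Access Gateway."""
    warnings = []
    if len(password) > MAX_AG_PASSWORD_LEN:
        warnings.append("longer than 31 characters")
    if "/" in password or "\\" in password:
        # Reported to me but not verified; treat as a hint only.
        warnings.append("contains a slash character (reported, untested)")
    return warnings
```

An empty list means neither condition was hit. It says nothing about whether the credentials are actually valid.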


Authentication Policies, Binding and You

Just a quick note regarding authentication policies.

Don't bind them globally.

Now, for those of you that might be angry about my previous statement, I'll add a bit at the end.

Don't bind them globally UNLESS YOU HAVE A REALLY GOOD REASON.

Binding an authentication policy globally means that device management logins will also be handled by the authentication provider you set up. That means if you bound an LDAP authentication policy globally, nsroot logins will be sent there first for evaluation. This will not only make your AD team angry, but will also most assuredly come back to bite you. I ran into a situation where one device had a globally bound TACACS authentication policy which was providing RSA authentication for management interface logins. The TACACS server went down one morning, which meant nobody could access an Access Gateway vServer which was using the same authentication policy. No problem, I thought, I'll just edit that policy to talk to another TACACS server.

Imagine my surprise when I found that I couldn't log in as nsroot! That's right. The device was up and talking to a misbehaving TACACS server, and instead of passing the credentials to the local password store it simply stopped.

The moral of the story is that you can still create globally bound authentication policies if absolutely necessary, but you must also bind a local policy with a lower-numbered priority so that it is evaluated first. Remember, don't use zero - that will make it the last policy to get processed. If you change the local policy's expression to match only connections coming from a management network, for example, you avoid sending every authentication request to the local store first and then on to the remote server.
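The ordering idea can be sketched in a few lines of Python. To be clear, this is a toy model of "lowest priority number is evaluated first, and the expression decides whether the policy applies", not how the NetScaler policy engine is implemented. The policy names, the priorities and the 10.0.0.0/24 management network are all hypothetical, and neither the special handling of priority zero nor any fall-through on failure is modeled.

```python
import ipaddress
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Policy:
    name: str
    priority: int                    # lower number is evaluated first
    matches: Callable[[str], bool]   # stand-in for the policy expression


MGMT_NET = ipaddress.ip_network("10.0.0.0/24")  # hypothetical mgmt network


def from_mgmt(source_ip: str) -> bool:
    """Expression: does the request come from the management network?"""
    return ipaddress.ip_address(source_ip) in MGMT_NET


def first_matching(policies: List[Policy], source_ip: str) -> Optional[str]:
    """Name of the first policy, in priority order, whose expression matches."""
    for policy in sorted(policies, key=lambda p: p.priority):
        if policy.matches(source_ip):
            return policy.name
    return None


POLICIES = [
    Policy("tacacs_global", 100, lambda ip: True),  # matches everything
    Policy("local_mgmt", 10, from_mgmt),            # management logins only
]
```

In this model a login from 10.0.0.5 is handled by `local_mgmt`, so nsroot still works when the TACACS server is down, while a login from anywhere else goes straight to `tacacs_global` without touching the local store.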



SSL Problems on New NetScaler Devices

So, you import the NetScaler VPX image into your hypervisor of choice, give it an NSIP and gateway, then reboot it. So far so good, right? You hit the web-based setup page and are greeted by the red warning text seen below:

[Screenshot: SSL Warning]

"Accessing configuration utility over HTTP is insecure. Click here to use Secure HTTP (HTTPS)." Being a responsible admin, you click on the HTTPS link before you login only to be greeted with disaster!

That's odd. Perhaps another browser will behave better?

No such luck, but we've got some good information at the bottom of that error message. "Error 113 (net::ERR_SSL_VERSION_OR_CIPHER_MISMATCH): Unknown Error." Let's hop back to the administrative GUI via HTTP and check out the SSL configuration. To do this, simply access your device via HTTP, login, then select the "SSL" feature. If this device is brand new, you'll notice the blue icon on the feature's folder which denotes that it is disabled. Simply right-click on the feature and choose "Enable SSL Feature" to enable SSL.

That wasn't the only problem, however. You can access the NetScaler Admin Console via HTTPS even if the SSL feature is disabled. If we poke around a bit further in the SSL parameters, however, we can notice something awry in the Ciphers page.

[Screenshot: List of default ciphers]

What happened to all of the SSL ciphers the real world uses? Well, that answer is simple, yet not obvious at first glance. Remember that all cryptography over a certain strength is governed by export controls. Modern SSL ciphers are no different. If you were paying close attention when you downloaded the VPX, you would have noticed that you agreed to Citrix's EULA, but *not* a crypto non-export affirmation. It turns out that this is actually controlled by licensing. Your VPX already contains the ciphers, but they are inaccessible unless the device is properly licensed. Like many other features of the NetScaler, this is logical but not immediately apparent. For instance, if you check the feature list of a non-licensed VPX, the SSL Offload feature is marked as licensed even though all ciphers but the export ones are unavailable.
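The browser error makes more sense once you think of the handshake as a set intersection: the client offers a list of ciphers, the server has its own list, and if nothing overlaps the connection dies. Here is a small Python sketch of that idea. The helper names are mine, the appliance cipher names used in the example are hypothetical stand-ins written in OpenSSL style (export suites carry an `EXP` token), and `client_cipher_names()` only reports what your local Python/OpenSSL build would offer, not what any particular browser does.

```python
import ssl


def client_cipher_names():
    """Cipher names a default Python TLS client on this machine would offer."""
    context = ssl.create_default_context()
    return [cipher["name"] for cipher in context.get_ciphers()]


def is_export_grade(name: str) -> bool:
    """True for OpenSSL-style export cipher names such as EXP-RC4-MD5."""
    return any(token.startswith("EXP") for token in name.upper().split("-"))


def shared_ciphers(client, server):
    """Ciphers both sides support, in the client's order of preference."""
    return [name for name in client if name in server]
```

For example, `shared_ciphers(client_cipher_names(), ["EXP-RC4-MD5", "EXP-DES-CBC-SHA"])` should come back empty on any modern client, which is the situation behind the cipher mismatch error above.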

[Screenshot: Feature list of a NetScaler with no license file applied]

A VPX Standard license enables Load Balancing and Content Switching, but most importantly a proper license file will yield the following list of ciphers:


[Screenshot: List of available SSL ciphers on a properly licensed NetScaler]

Much better! You can now not only access the administrative console via HTTPS, but you can also begin to use your NetScaler to do SSL offloading for all industry-standard (non-FIPS) ciphers.