Thursday, August 21, 2014

Multi-Core, Languages and the Dreaded GIL

For a very long time[1], I have always thought that the only way to deal with the coming multi-core explosion is to move to a language that deals with concurrency as a core feature in the language. This is hardly a novel idea. It has always been obvious that threaded code in standard languages is next to impossible to write correctly. This implies that the only way to deal with threads is to hide them in an abstraction in which you can devote enough resources to ensure a reasonably robust implementation. Writing thread safe code is extremely difficult and requires a huge payback to justify the effort. It's exactly the same reasoning behind allowing the kernel to do memory and process management, rather than having every application write it's own OS.

In the face of this I have made a serious effort to learn Erlang, Go and dip my toe in functional languages in an attempt to be ready for adapting to this multi-core world. As much as I love Ruby,
it was obvious that it simply would not be a reasonable solution for a machine with 32 cores and above. It and many of the other "scripting" languages, depend on having a lock on the dreaded GIL[2]. This is generally not a problem for code that runs websites due to the fact that they are largely I/O bound[3]. However, for any other workload that requires CPU these languages would be relegated to the "fancy shell script" problem domain. You simply wouldn't write anything serious in these languages, since they can't effectively use 80-90% of the available resources on a high end server.

Recently, I have begun to rethink this assumption.  The reality is that most 32 core machines are running VM's in one form or another. Docker[4], in particular, provides a highly efficient means of simulating a single core OS for a single core process. A common theme of the "multi-core" ready languages is to avoid the problems of threading by switching the model from shared memory to very fast messaging. If it is possible to create an interprocess message system between Docker images that can compete with speed of Go channels or Erlang messaging, then it should be possible for languages like Ruby to be a reasonable alternative in this brave new world.

There was an attempt to allow this kind of distributed processing in Ruby, DRb. It never really gained wide usage primarily due to the latency and AAA[5] issues. However, if there was an API that allowed communication between Docker instances that could be both fast and trusted, it would be possible to revive Drb and similar "remote object" implementations. I think this is the only workable way forward for the "interpreted" languages. Essentially you have to convert the message sending primitive to sending a message to remote[6] objects. This isn't as hard as it looks at first blush, Docker images are simply processes running in the same kernel on the same machine. They just have different ideas about their various interfaces to the rest of the world. It should be possible to "short-circuit" these interfaces to allow them to communicate at something close to the speeds available to more concurrency aware languages.







[1]- Since about 1997 or so (i.e. when I first attempted to write pthreaded code ).

[2]- Global Interpreter Lock https://en.wikipedia.org/wiki/Global_Interpreter_Lock

[3]- i.e. Turning database entries into web pages and web forms into database entries.

[4] - https://www.docker.com/

[5]- Authentication, Authorization  and Accounting

[6]- If they are in the same memory accessed by the same cpu, they really aren't "remote"

Friday, November 8, 2013

If you really think your employee's performance is on a bell curve, you should fire your HR department.

A recent kerfuffle over in Yahoo Land has got me thinking about the mis-application of statistics.

http://allthingsd.com/20131108/because-marissa-said-so-yahoos-bristle-at-mayers-new-qpr-ranking-system-and-silent-layoffs/

The assumption is that employee performance follows a bell curve. But you only get a bell curve in your sample of a larger population if your sample is random and the larger population is also a bell curve.

What this means is that if your employee's performance truly follows a bell curve, then you should fire your HR department and completely replace all those expensive people with a random process that just picks a resume out of the hat. Frankly, I suspect you'd be hard pressed to tell the difference in most organizations.

Thursday, October 10, 2013

Person Capital and the Government Shutdown


One of the mantras of the information age is

"People are Capital"

In other words, the experience and knowledge of your workforce is a significant capital investment. It is the only capital investment that can walk out the door whenever it get's pissed off enough and right now all the best people employed by either directly or indirectly by the federal government are polishing their resumes and checking just exactly who they know on LinkedIn. I know because that's exactly what I've been doing.

The great lie that has been put out there is that government jobs are cushy and easy and only incompetent people would take them. Well, one way or another I've been employed by the Federal government indirectly for the every job I've ever had since I stopped waiting tables, cooking or washing dishes in 1987. The reality is that most people in the government work long hours for relatively low wages given their expertise and education. Despite every attempt to belittle it on Fox News, Public service is alive and well. Many of the people not getting paychecks are hard working competent people that
could make more money in the private sector, but choose to work for the federal paycheck because they believe in public service. For me, it matters that what I do advances the knowledge of mankind in some way. 

I remember a conversion in high school (1978) about the future and I said my "dream job" would be to work at a "national lab doing physics research". One of my friends that had more exposure to the larger world said "Those are terrible jobs". 

It's really odd that we were both right. My job is my dream job, but it is also terrible in that I never get the resources to do it right.  The current drama in Washington is the last straw for many of us; the best people currently employed by the government are all looking around for stable paychecks.

Monday, September 16, 2013

Chef Encrypted Data Bags Are A Code Smell.


Chef databags are a centralized key/value store for json based data in the Chef environment. Encrypted databags are exactly what is advertised on the box, the data is encrypted with an external key before uploading to the data store.

Encrypted databags are meant to help solve the problem of dealing with identifier information that you want automatically installed, but that you need to keep private. Examples are database login passwords, the private key of a public key pair and other tokens that allow a server access to additional services.

Encrypted databags provide protection against two kinds of access:

Implicit Access:

This is defined to mean access outside the chef protocol to the underlying data store on your chef server. If you're using a service like Hosted Chef or just practicing general good data hygiene, DB access to the server should only expose "public" data if at all possible.

Explicit Access:

Explicit access is via the standard chef protocol were using the ssl keypair created as part of the chef client bootstrap to access data in the chef datastore or via a knife command using an admin key pair.

The problem with encrypted databags in both use cases is that they only appear to solve the underlying problem by moving it out of the chef workflow. In order to use either protection effectively, you need to create an "out of band" system to manage the shared secret required to access the databag.

In the implicit access protection case, this is likely a worthwhile and manageable cost since the key only needs to be secret from the chef server. However, this only protects against read-only implicit access. If the bad guy has write implicit access to your chef datastore, the game is over on your next client chef run.

The explicit access case is where the real problems arise. In this case, the intention is to prevent some admins/hosts from access to private data. This however creates an unfortunate side effect in that access to the shared secret becomes an "invisible access control list". Using an encryption key as an authorization object creates problems since you destroy the chain of identity. All you ever know is that "somebody with the key" accessed the data. You need to create a separate tracking and access control system to provide an audit-able trail. Encrypted databags don't solve a problem in this case, they just create a whole new set of problems.

Chef Vault is an attempt to get around this problem by creating access control lists using the available private key on each chef client. Without the strong ACL system that either Private or Hosted Chef provide, this is probably the simplest workable solution. Any other solution should implement all of the features of Chef Vault. The most general solution in the explicit access case is ACL's based on the ssl identity of the client. There are many other objects in Chef on which the ACL system of the Opscode Chef would provide useful security boundaries.

It's important to remember that more keys is not better security. Encryption keys should be used to provide data integrity, privacy and authentication. (i.e. they answer the who questions, who are you? who am I? who sent that message? ). They should never be used to answer the what questions. ( what can I read? what can I write?, etc ).

Providing secrets in a scalable and secure fashion is still a largely unsolved problem. But any solution that attempts to use only shared secret encryption will not scale. Public key encryption and rings of shared access seem to be the only workable way forward and every chef client and admin has key pair already available. Any scalable solution should be using this existing identity. If you are using encrypted databags to control explicit access, then you are building in scaling, access and audit problems for the future.


Thursday, May 23, 2013

There Are Only Two Ways to Save Money with Software

To paraphrase a civil war general:

 "The only good code is deleted code."

 I have recently been in several meetings where the phrase

"We'll save resources by doing things with new software",

has been used a justification for change.

This is one of the great lies that drives Silicon Valley. Every line of software is a cost liability.

The only way to save money with software is to delete the number of lines of software you run or to use software to make the problem somebody else's problem. Every advance in computer software engineering has followed these two principals.

Compiled languages follow this principal by both reducing the number of lines of code required to solve a problem and also by handing the problem of turning those lines of code into executable instructions to the people that write compilers.

Operating systems follow the same principal; instead of having to manage access to millions of blocks in both memory and disk, you write enough software to make it the user's problem.

The only way to save resources with new software is to delete old software, or to make the things that the old software was responsible for somebody else's problem.

New software can make doing new things cheaper, but that is a completely different problem.

Wednesday, May 22, 2013

How to build ruby 1.9.3 with libyaml in a funky place

The only solution I've found is to hack the location of libyaml library and include files
into the extconf.rb file in ext/psych


header_dir =  '/afs/slac/package/ruby/@sys/include'
library_dir = '/afs/slac/package/ruby/@sys/lib'

dir_config  'libyaml', header_dir, library_dir

Friday, October 12, 2012

Why Monitoring Sucks

It doesn't matter what software you use, the fundamental model behind all current monitoring software is broken.

The  idea the originally prompted me to create this blog still holds true, I never published it. But the basic tenent of that message still holds true.

"Push Considered Harmful"

The original idea I had was in the context of grid computing in the early 2000's. Early versions of the LHCG were based on the idea that a central scheduler could monitor the entire state of the "grid" and
assign jobs to the hosts that had both the data and the available cycles. This failed badly, and the concept of "pilot jobs" invaded LHCG. In essense, pilot jobs flip the paradigm on it's head and turn "push" architectures into "pull" ones. Instead of a node passive waiting for the central service to ask
it for state, the central service should be waiting for nodes to notify it for state.

There have been no large scale push architectures that have actually ever worked outside a single organization. This is an important lesson for monitoring.

The reason "monitoring sucks" is that it is entirely based on a push model. All current monitoring software assumes that you have a model of your infrastructure that can be mapped onto existing
hardware either virtual or real. This is completely backwards.

There will be no improvement in monitoring software until this model is thrown out and a "pull" or
peer to peer model is created. There is a reason Nagios still exists, it's because everything else that
has tried to replace it uses the same "push" model.

What is needed is a subscribe model in which a node requests a monitoring service to "pull" service requests with a finite time limit. Similarly there needs to be a subscribe service for alerts. Monitoring configuration needs to be a dynamic service; any software that reads a static file for configuration is
doomed to suck.

The other fundamental problem in monitoring is the poor separation between alert and diagnostic services. For diagnostics, you want to monitor "everything", for alerts you should only be monitoring
things that you can actually control. If it sends an alert, there needs to be procedure to handle the alert.

Until monitoring services look a lot more like DHCP than DNS, they will continue to suck, regardless of the software used to implement them.