3. Systems Design

Computer Science is the study of what can be automated.

Donald Knuth

Designing a system is fundamentally a different task from designing or implementing a single function. Developing a concise function can, at best, be a linear process. You consider the inputs, the expected outputs, and the possible errors, and if you’re very lucky, you can stub it out and write it in a relatively linear fashion, working from the inputs to the outputs. While this might seem an oversimplification of the process of writing code, it’s not too far from the truth for most koders, and it is definitely not the process most people would describe when designing any system of significant scale.

Well-designed systems depend heavily on one of the tenets described at the start of Chapter 1, which is composability. Just as good code is composable, a good system is made up of parts, be they functions, classes, modules, or whole programs that are themselves composable. The early developers of Unix put forth a systems design philosophy that has been variously quoted, but KV prefers Peter H. Salus’s description.1

1. Write programs that do one thing and do it well.

2. Write programs to work together.

3. Write programs to handle text streams, because that is a universal interface.

1. Peter H. Salus: A Quarter Century of Unix.

The term programs comes from the fact that Unix programmers were often interested in building up many programs that could cooperate in processing streams of text, but we can apply the same advice to modules, classes, methods, and functions. The third suggestion has fallen by the wayside as systems have become more distributed and tend to communicate more than just plain ASCII text, which was the norm when Unix was first built. Now it ought to be stated as, “Write programs with well-defined interfaces so that the output of one can easily become the input of another.”
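The pipe-and-filter idea translates directly into code in any language, not just the shell. Here is a minimal Python sketch, with invented stage names, in which every stage speaks the same interface, an iterable of lines, so the output of one stage is naturally the input of the next:

```python
# Each stage does one small thing and consumes/produces an iterable of
# lines, so stages compose like a shell pipeline. Names are illustrative.

def grep(pattern, lines):
    """Keep only the lines containing pattern."""
    return (line for line in lines if pattern in line)

def count(lines):
    """Count the lines that flow through."""
    return sum(1 for _ in lines)

log = ["GET /index", "POST /login", "GET /about"]
matches = count(grep("GET", log))  # stages compose: grep | count
```

Because every stage agrees on the interface, adding a new stage to the pipeline never requires changing an old one.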

When designing a system, it is important to try to collect together all of the component parts into a consistent whole, one that describes, on different levels, the interconnections and dependencies among these parts. Two approaches to systems design have been promulgated in software development, top down and bottom up, and both have their proponents and detractors, much as the Lilliputians argued about which end of their hard-boiled eggs to break. Very briefly, top-down design emphasizes having near total knowledge of the final system before work begins, while bottom-up emphasizes building up larger, complex systems from pre-existing parts. Modern systems design nearly always comprises both approaches, as very little software is written from scratch but is most often built up out of extant code, libraries, and other systems. The number of green-field software projects built every year probably numbers fewer than 100, perhaps fewer than 10. For modern systems design we usually take quite a bit for granted, including tools like compilers, and platforms like the operating system and its supporting libraries, as well as a whole host of open source and proprietary components that need to be successfully sewn, some would say cobbled, together to make a coherent working whole.

We started out with our nose to the kode in Chapter 1 and pulled back a bit to look at Koding Konundrums in Chapter 2; now we try to pull back a bit further, just high enough above the code to be able to see how all the parts interconnect, in order to make a, hopefully, coherent system.

3.1 Abstractions

The art of programming is the art of organizing complexity, of mastering multitude and avoiding its bastard chaos as effectively as possible.

E. W. Dijkstra

Proper abstractions are the key to good systems design because if the abstractions are wrong, it’s very likely that composability will suffer, and then the whole system will either not come together, or if it does, it will have warts, and pieces that seem like they don’t fit together naturally. Failure to get the abstractions right leads to misunderstandings between components, which leads to bugs, which usually leads to system failure. When systems fail because the abstractions have not been thought through properly, the blame likely falls to the designer rather than the koders who had to implement the component parts.

Nearly every attempt to solve the software crisis, which was first identified in the 1960s and which goes on to this very day, has some relationship to the idea that if we can just come up with the right abstraction for what we want the system to do, all of our work will fall into place, and everything will be just fine. The fact that these abstractions continue to elude us after nearly 60 years does not seem to have caused a pause in the fads that periodically overtake the software industry. The plain old boring fact is that some level of abstraction is necessary and helpful in designing complex systems, but that too much of a good thing is counterproductive.

A good abstraction gives us a way to encapsulate an algorithm, in whole or in part, in a way that is usable, testable, and maintainable. Seems simple enough that it can be stated in a three-word list, and yet the particulars of each remain a challenge. Over the years we’ve developed the idea of functions, then libraries of functions, then modules, then came objects and object-oriented programming, all of which are abstractions that are supposed to help with reuse of the code we’re already working on. Lists, tables, and trees of various sorts are also abstractions targeted at data, rather than code, and these too have proliferated like rabbits in Australia, with too much food and too few predators.

The central issue with the various types of abstraction is not whether collecting related functions or data together is good or bad, nor even whether, as in the case of object-oriented programming, keeping functions with the data they manipulate is a good idea, as it probably is. The issue arises when the drive towards abstraction results in a Zeno's paradox of carving whatever we're working on into smaller and smaller bits until any particular bit, on its own, is relatively useless. Systems built only out of the smallest, understandable codelettes may be emotionally satisfying to a certain type of koder, but they generally result in an overly complex morass of code where it's nearly impossible to understand which part is doing productive work and which part is just there to glue the other tiny bits into the incohesive whole. The goal of using abstractions should be to reduce, and not increase, complexity. A system with too many tiny functions or methods effectively pushes the program logic into the connections between the functions; the call graph defines most of the logic rather than the code in any particular function, which is classic spaghetti code. Such systems also have very high framework overheads, which waste memory and CPU in deference to someone's personal concept of elegance.

When we’re looking at any abstraction, code or data, we must answer the three questions laid out above: Can this abstraction be utilized by a programmer as is? Can this abstraction be tested on its own? When it comes time to perform maintenance, how many knock-on effects do we have to worry about in the code that consumes this abstraction?

The utility of an abstraction can be gauged in two ways, proliferation of use and simplicity. A simple abstraction does not mean that it has a single operation, for example the plus operator, but that the operations that the abstraction provides are easy for the programmer to keep in mind when using it. Take, for example, the traditional file operations in Unix systems: open(), close(), read(), write(), seek(), and ioctl(). The file operations are a clear example of a good abstraction because they are related and easy to understand for the consumer, and they have, over time, been used successfully in millions of systems.
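The same handful of calls is visible from Python's os module, which wraps the Unix file operations directly. A minimal round trip, sketched below, exercises nearly the whole abstraction (ioctl() is omitted, and the temporary file is only for illustration):

```python
import os
import tempfile

# Open, write, seek, read, close: the whole file abstraction in five calls.
fd, path = tempfile.mkstemp()
os.write(fd, b"hello, file abstraction\n")
os.lseek(fd, 0, os.SEEK_SET)        # seek back to the start
data = os.read(fd, 1024)            # read back what we just wrote
os.close(fd)
os.unlink(path)                     # tidy up the temporary file
```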

Our second measure, testability, can be related to simplicity, since it’s far easier to test a piece of code, or a data structure, that has a small operating surface. A module with 10 operations is far easier to test than one with 100, no matter how good your test framework is, for it’s not the framework that’s being taxed, but the programmer’s mind.

Maintenance is our final measure of an abstraction’s quality. If I fix a bug in the code or change the layout of the underlying data structure, will that significantly change what happens with the existing code? How much re-testing is necessary when the abstraction is changed? A common problem is improving the speed of an abstraction, and who doesn’t like more speed, and having that improvement break the assumptions of the code that consumes it. A change like this shows not that the consuming code was wrong, necessarily, but that there are assumptions between the consumer and the abstraction that were poorly understood, and, if this kind of problem recurs every time the abstraction is updated, then there is actually a problem in the abstraction, or the interface that it is providing to its consuming code.
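As a hypothetical illustration of such a poorly understood assumption, consider a routine that happens to return its results in first-seen order, even though its contract never promised any order. A "faster" rewrite that is equally correct by the documented contract can still break a consumer that silently relied on the ordering:

```python
from collections import Counter

def find_duplicates_v1(items):
    """Original: happens to report duplicates in first-seen order,
    though the order was never part of the documented contract."""
    seen, dups = set(), []
    for item in items:
        if item in seen and item not in dups:
            dups.append(item)
        seen.add(item)
    return dups

def find_duplicates_v2(items):
    """'Faster' rewrite: same duplicates, but no guaranteed order."""
    return [item for item, n in Counter(items).items() if n > 1]

# Both are correct, yet a consumer that assumed v1's ordering may break
# when the abstraction underneath it is "improved" to v2.
```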

Given how much of our time we spend dealing with abstraction, it is not surprising that there are many letters to KV about this very topic.

Dear KV

I have an office-mate who writes methods that are a thousand lines long and claims they are easier to understand than if they were broken down into a set of smaller methods. How can we convince this person that his code is a maintenance nightmare?

Fond-of-Abstractions

Dear FoA,

The short answer to your question is to make your office-mate maintain his own code forever, as that should be punishment enough. At some point they will realize that what they wrote three months ago is unreadable and begin to change their ways. Unfortunately, people with annoying habits, like talking loudly on cell phones, driving poorly, or giving unwanted advice, rarely see the errors of their ways. This is why there must always be crusaders for the good and right, such as ourselves.

I note that you used the word “methods” in your letter, rather than, “functions,” which indicates to me that you are using some form of object-oriented language. Before we go further let me point out that everything I say in this letter applies both to methods in object-oriented languages and to functions in non-object-oriented languages. The base problem is that there is too much functionality crammed into one place. There are several arguments you can make to explain why such over-long methods are problematic.

The first argument that comes to mind is code reuse. One of the reasons to have methods or functions in a program is to capture the essence of a single idea or algorithm so that it may easily be reused by others. When a method grows to 1000 lines, it usually becomes highly specialized to one job, and a job that is probably not needed that often. It is far better to break down the larger problem into smaller ones, some of which may be reused by other parts of the software. Another added advantage of smaller, reusable methods is that they can be used in the next project you work on. Reusable methods are a benefit to the koder, who now does not have to write as much code, and to the company they work for because they can now finish a project faster. This kind of work avoidance is one of my favorite reasons for anything I do. Why should I work harder than I have to?
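To make the reuse argument concrete, here is a tiny, invented Python example: one single-purpose function serving two different callers, which is exactly what a 1,000-line method welded to one job can never do.

```python
# A small function capturing a single idea can be reused by multiple
# callers. All names here are invented for illustration.

def normalize_name(name):
    """One idea, captured once: canonicalize a user-entered name."""
    return " ".join(name.split()).title()

def greet(name):
    return "Hello, " + normalize_name(name)

def address_label(name, city):
    return normalize_name(name) + ", " + city
```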

Another argument is that over-long methods are just plain hard to read and understand. A method that is 1,000 lines long, when viewed in a window that displays 50 lines at a time, works out to 20 pages of code to work through. Now, I don’t know about anyone else, but 20 pages of anything, a book, magazine, or code, is hard to digest and keep all in my brain at any one time. Understanding anything requires context, and that context ought to be local. Jumping from page 18 back to page 2 because that’s where the variable fibble was last modified often causes me to lose my place. I wind up staring at page 2 thinking, “Now, why was I here?” I get all glassy-eyed, stare into space, and occasionally begin to drool, which makes my co-workers very nervous.

Finally we come to your excellent and well-justified point about code maintenance. Clearly, if something is hard to understand, as we just established in the last paragraph, then it will also be hard to maintain. A thousand-line method is clearly doing too many things at once. How do you find the bug in a method when it is doing the equivalent of balancing your checkbook, whistling the “Ode to Joy,” and juggling chain saws, all at once? Compounding the problem of just understanding such a method, there is the issue that the number of possible side effects in a piece of code goes up quite a bit with every line you add. Perhaps not exponentially, but certainly more than linearly.

There are a few ways to set your office-mate on the righteous path to klean kode, if not clean living. One way is to make this set of arguments to them and see how they respond. Sometimes people can actually be shown the error of their ways. Using a neutral third party’s code as an example is a good way to avoid the “I’m a better programmer than you” pissing match, which rarely wins anyone over to your side. If rational argument fails, you can try to use the software specification as a way to get reasonably sized chunks of functionality out of this person. You do have a specification for your software, right? If the specification clearly states the amount of work that is to be done for each method, then it will be pretty clear when this person violates the spec, and you can then come down on them as hard as you like at that point.

Of course, sometimes reasonable argumentation, i.e., the carrot, and direct control, i.e., the stick, fail. At that point I recommend something lingering, with boiling oil or melted lead. Just don’t tell Amnesty International; though, if they had to maintain your office-mate’s code, they might understand.

Fondly, KV

3.2 Driven

Data dominates. If you’ve chosen the right data structures and organized things well, the algorithms will almost always be self-evident. Data structures, not algorithms, are central to programming.

Rob Pike

Object-oriented systems in particular seem to suffer from an overwhelming amount of abstraction, which is unsurprising since a key selling point of object-oriented systems is that they allow abstractions to be made more obvious. Prior to object-oriented programming, abstractions had to be hand built, which could be done in any language, including such a low-level language as C.

Dear KV,

I’ve been working on a program in C++ to handle some simple data analysis at my company. The program should be a small project, but every time I start specifying the objects and methods it seems to grow to a huge size, both in the number of lines and in the size of the final program. I think the problem is that there are just too many things in the system to analyze, and each one needs a special case, which requires just a bit more code, or another sub-class to be created.

Help!

Driven to Abstraction

Dear DA,

One of the biggest problems when people use an object-oriented language is that when they realize how easy it is to create yet another class, they do. Instead of figuring out where the rubber meets the road, they instead find where the rubber meets the sky.

Of course, without looking at your code, and given your description, I really don’t want to look at your code, I can’t give you a pat answer. I also charge heavily for pat answers.

When I find someone I work with spending days specifying class after class without writing any implementation code, I tend to first take a long walk around the building. My therapist says that screaming at people helps no one. I don’t agree with him, but for now, I am trying to play along.

I have a few pieces of advice when you find something you think should be simple starts to take up a huge amount of time and space. The first suggestion is to switch over from a compiled language, like C++, to something interpreted, like BASIC. Oh, wait, sorry, not BASIC. I meant Python, my current scripting language of choice. The reason I suggest Python is because it, too, is object-oriented, and it’s easier to move an idea built in one OO language to another. You may even find that Python suits your needs perfectly and you won’t have to move to a compiled language, but that decision is further off.

That brings me to my second piece of advice: try to solve a smaller part of the problem you’re working on. Programmers and engineers often try to bite off more than they can chew. We’re a strangely optimistic lot, unless we’re talking to a marketing person, in which case solving an equation such as 2 + 2 seems to require millions of dollars in investment, a colo full of machines, high-speed network links to everyone’s house, and six weeks of paid vacation in Barcelona if you come up with a correct result. OK, maybe you don’t handle your marketing department that way, but I can’t suggest it strongly enough.

With a scripting language you can take smaller bites of the problem and play with them. If you can solve a segment of the problem and get some output to work with, you can then probably figure out the five or six next things to do and do them and so on.

The nice thing about working in smaller chunks is that you wind up with a result a lot sooner, and that’s a lot more satisfying than having reams of UML diagrams and hand waving and a promise of a brave new world when you’re done, which, at the rate you’re going, you probably never will be.

So, get your tires out of the clouds, put them on the road, and implement a few things, instead of trying to solve everything at once.

KV

3.3 Driven Revisited

A program is never less than 90 percent complete, and never more than 95 percent complete.

Terry Baker

Abstraction also proved to be a popular topic with KV’s readership as the previous letter and response brought in three new responses, all of which are addressed in this section.

Dear KV,

I like your column on Koding in ACM Queue, mostly because at the end it says that you are an “avid bicyclist.” Just like me. Bicycling in California, where you live, must be so much fun compared to Germany, where I live, because of the constantly nice weather you enjoy in California. Too much sun, however, can sometimes make you write columns that seem strange for us Old Europeans. In your column in Vol.3, No.2 (“KV Reloaded”), your advice to poor reader “Driven to Abstraction” is “to start with smaller parts of the problem” and to play with them using a scripting language.

Not only does California give you plenty of sun, it also has employers that give you plenty of time to play around with the “smaller problems” that you like in some programming language that is irrelevant for the later implementation. I always thought that in system design it may be convenient, but it is actually very dangerous, to solve your favorite problem first and add the difficult, not-liked stuff later. Examples: Security—can it be added later? No! Read the pleas of guilty in “Patching the Enterprise”! Performance—can it be added later? No! Look into the faces of all the frustrated software re-engineers. Even more frustrating than the lack of sunshine in my country is the absence of employers that give you time to play around with scripting languages when project deadlines impend and budgets are already overspent. We are forced to make good overall designs first and improve them iteratively. Oh, California—the land of milk and honey and bicycling in the sun!

Koder-User-Rider-Teacher (KURT) ;-)

Dear KV,

Regarding the comment to switch from C++ to Python for the gentleman doing the “simple” data analysis. There are two problems with the question/answer scenario presented.

1) He doesn’t understand the problem he’s assigned, as he stated confusingly at the end of the paragraph that “there are just too many things in the system and each one needs a special case.” Thus, it’s not simple. He has a poor understanding of the problem assigned him. Statements like that should have come before he wrote any code at all. Had he spent more time mapping out the problem he could have saved himself some embarrassment.

2) You, however, did an equal disservice by blindly suggesting he switch to another language. How can you answer like that without knowing the details of the department? The solution, if nothing else, is to teach them how to think and design. This guy sounds like he was working in the boiler room with no one else. I remind you that you are writing for a prestigious international association, an association of professional “thinkers.”

The first suggestion is for the developer to do the upfront thought and design needed so he understands the problem. An interpretive language is no better if he does not understand what the problem is, namely, the characteristics of the data involved and how it needs to be analyzed. Any engineer who “bites off more than he can chew” didn’t do any engineering; he hacked. And you feed the crowd by suggesting nothing more than to hack with a different language. The real problem is not about what language to use but the lack of thought prior to writing code. But if my guess is correct, teams and managers don’t have time for that. You simply suggested code-as-you-go in the second-to-last paragraph.

I suggest the IEEE/EIA Standard 12207.0, known as the Development Process.

Sincerely, Karl Henning

My Dear Olde Europeans,

I want to thank you both for writing to KV here in the New World where we enjoy plenty of sunshine and benevolent employers. I was actually just enjoying a rubdown from my private masseur here at the office when your mails arrived, but I sent Jacques away so that I could concentrate fully on answering your letters.

What I believe you, unfortunately, missed in my original response was my suggestion that “Driven To Abstraction” break the problem down into smaller, possibly bite-size chunks. Although it is nice to think that you could specify all aspects of a program up front, that is only the case if all of the parts of the problem are understood before you start designing. I’m sure that each of you has been confronted with a system that you did not completely understand, and that, to get a handle on the issues, you had to work with smaller models and prototypes to get your mind around the problem and finally solve it.

It is as easy to waste huge amounts of time over-specifying a system as it is to waste time playing with a scripted prototype of a subproblem, and each is equally dangerous to the success of a project. What is most dangerous, though, is simply staring at the same blank screen, page of your notebook, or white board, day after day without making any real progress. Telling your boss, “Well, I spent the last week thinking about the problem” is not acceptable, even here in the Land of Milk and Honey.

My suggestion was meant to break the mental stalemate that Driven to Abstraction had gotten into, the Zen equivalent of a tweak on the nose or a tap with a stick. I figure telling people to break down larger problems into smaller problems is more acceptable in the workplace than wandering around tweaking their noses or hitting them with sticks.

KV

Dear KV,

While I agree it’s a good idea to not bite off more than you can chew, I’m concerned with your response to Driven to Abstraction’s question. My fear is that your faithful readers will think it’s OK to not create classes. I’ve seen too many supposedly-OO programs composed of a single class. I realize you don’t suggest not adding classes, but you also do not directly address Driven to Abstraction’s fear of classes.

Regards, Afraid of Those Afraid of Classes

Dear ATAC,

It’s nice to see another take on the problem with my response to “Driven To Abstraction,” and with this one I can wholeheartedly agree. Just two days ago I was reviewing some code that was clearly written by someone either afraid of or totally ignorant of classes. Actually, they seemed to also be ignorant of the concept of modularity, as everything was in a single, 4,000-line file. The kode was very clever, but in its current state totally unreusable. Unfortunately, this is a common problem I suspect we all face. Either due to time pressure or lack of training, someone decides to not only bite off more than they can chew, but more than anyone else can swallow.

As with many things in the world there is a spectrum between too big and too small. Too often people kode only for themselves without realizing that everything they create must be read and debugged by others. If we could only mend our selfish ways, perhaps we could all just get along.

KV

3.4 Changative Changes

Nothing is so painful to the human mind as a great and sudden change.

Mary Wollstonecraft Shelley,
Frankenstein

As software systems grow and proliferate, it becomes ever more likely that an update to one library or component will break another, and since many components are loaded at runtime, both in operating system kernels and in applications, it becomes less likely that these issues will be found by the compiler, linker, or anywhere else in the build system. Attempts to solve this problem are usually encoded into package systems, which attempt to resolve these conflicts when a package is updated by keeping track of all of the dependencies and forcing an update of all of the relevant components to a, hopefully, compatible version. Current package systems do this not by understanding the components at the API level but at the level of the overall version, which is insufficiently granular and also prone to errors, because the dependencies are expressed by human beings marking versions of libraries as compatible or incompatible. One proposal is to be stricter about what the version numbers mean: the major number increases only when a completely incompatible change is made, the minor number increases when the code is no longer backwards compatible, and the patch number, the last digit, increases for each patch or minor change. Giving a more concrete meaning to version numbers won’t solve the problem of depending on humans, but having such a standard would make it easier for koders to know if they’re about to eat a change that might give their program indigestion.
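Under such a stricter scheme a package tool could decide compatibility mechanically. A sketch in Python, assuming simple major.minor.patch version strings:

```python
def parse(version):
    """Split a 'major.minor.patch' string into three integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch

def safe_upgrade(installed, candidate):
    """Under the strict rule sketched above, only a patch-level change
    is guaranteed compatible: major and minor must match exactly."""
    return parse(installed)[:2] == parse(candidate)[:2]
```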

Dependency analysis is an area ripe for automation since, for compiled languages, a signature could be generated for every function entry point, based upon its name as well as the names and types of its arguments and return value. A change to a function’s name, as we see in the following letter, is easy to catch, but changing the types of arguments or the return value is often missed by languages that are looser in their interpretation of types. Compilers already record plenty of data about function entry points, as these are necessary for the debugger, and therefore an extension of this mechanism to aid packaging and dynamic loading systems would be most welcome: not one that throws an oblique error about incompatibility, but one that explains just which thing is incompatible, down to the specific entry point.
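A sketch of what such automation might look like: hash an entry point's name together with its argument and return types, so that any change to the interface, not just its name, produces a different signature. The scheme and all names here are invented for illustration.

```python
import hashlib

def entry_point_signature(name, args, returns):
    """Hash a function's full interface; args is a sequence of
    (argument_name, type_name) pairs."""
    text = name + "(" + ",".join(f"{a}:{t}" for a, t in args) + ")->" + returns
    return hashlib.sha256(text.encode()).hexdigest()[:16]

old = entry_point_signature(
    "read", [("fd", "int"), ("buf", "void *"), ("nbytes", "size_t")], "ssize_t")
new = entry_point_signature(
    "read", [("fd", "int"), ("buf", "void *"), ("nbytes", "int")], "ssize_t")
# old != new: the changed argument type is caught even though the
# function name is unchanged, which is exactly the case the linker misses.
```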

Here we see how disastrously this can go wrong, and, unfortunately, this is a very common failure still seen in software systems.

Dear KV,

For the last two years I’ve been working on a software team that produces an end-user application on several different operating system platforms. I started out as the build engineer, setting up the build system, then the nightly test scripts, and now I work on several of the components themselves as well as maintaining the build system. The biggest problem I’ve seen in building software is the seeming lack of API stability. It’s OK when new APIs are added; you can ignore those if you like. When APIs are removed, you know because the build breaks. The biggest problem is when someone changes an API; this isn’t discovered until some test code, or worse, a user, executes the code and it blows up. How do you deal with constantly changing APIs?

Changes

Dear Changes,

The best way to deal with change is to bury your head in the sand and ignore it. After all, we can all learn from the great management traditions of the past, and engineers are no exception to this. Hmm, perhaps not.

What you point out is one of the biggest challenges in building large and complex systems. Software is amazingly malleable, and that makes it possible, and unfortunately quite probable, that someone will make a change. What many engineers and programmers don’t realize is that when they’re building a library, or really any component that others are supposed to depend on, the API becomes the contract between their code and everyone who uses it.

As you point out in your letter, there are really three ways in which this happens. The first, adding an API, won’t affect your system, because with no one to call it the new API can’t really cause much damage. The second case, removing an API, results in an immediate error when your program is linked, either at compilation or run time, so at least you notice this before trying to really use the code. The last case is the one that will give you fits and nightmares, because there are very few automated ways of finding an API that looks the same, but isn’t. At one place I worked we dubbed this “changative change” for want of a better phrase, or, it would seem, a technical writer.

On one particular system about 80% of our problems were related to trying to re-integrate different subsystems with each other. The problem, as you can imagine, grows quite quickly with the number of components involved. Two subsystems that depend on each other have at least one dependency, whereas 4 subsystems have 6 dependencies, and 8 subsystems have 28, and so on. Building up any sort of coherent system from a set of modules, all of which are changing, turns out to be very hard, but there are some solutions.
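Those numbers are just the count of possible pairs, n(n - 1)/2, which a one-line function makes plain:

```python
from math import comb

def max_dependencies(n):
    """The worst case: every pair of subsystems depends on the other."""
    return comb(n, 2)  # equivalent to n * (n - 1) // 2

# 2 subsystems -> 1, 4 -> 6, 8 -> 28: the pain grows quadratically.
```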

Operating systems people have long known about this problem, and so APIs that programs depend upon tend to change only very slowly, or not at all. The basic open(), close(), read(), write() system calls in Unix and Unix-like operating systems have taken the same arguments and returned the same types of values for nigh on to 20-plus years at this point. When new subsystems were added, such as networking, new function calls were added as needed; hence, to open a socket, you don’t call open(), because that would have required changing its arguments and therefore all the code that already used it. Instead, you have the socket() system call, which takes different arguments but returns a value that is usable by read() and write(). System programmers also tend to narrowly define the set of functions they will provide, because they know the nightmare of maintaining an arbitrarily wide set of APIs. FreeBSD, for example, has around 400 available system calls, that is, APIs that user programs are allowed to call to get the OS to do something for them, like read a file or find out the time. Although that number is not small, it is trackable and maintainable, whereas the number of APIs in the full set of POSIX libraries, or Microsoft Foundation Classes, is far, far larger.

Another trick that can be adopted from the systems programming world is that of the ioctl(), or I/O control. Device driver writers can do most of the necessary work using the simple open(), close(), read(), and write() semantics, because what most people want from a device is to open or use it, read data from and write data to it, and then put it away, or close it. The problem here is that it is often necessary to have device-specific controls that can be easily exported upwards to the operating system, for example, to set a network device into promiscuous listening mode, or to set its various address parameters. It is these special cases where ioctl() is used. The ioctl() call has been used, and verily, abused, over the years, but the basic design principle is a sound one. Always leave yourself an escape route.
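The pattern carries over to ordinary application code: keep the common interface narrow, and leave one generic control() escape hatch for the oddball settings, rather than widening the interface for each new quirk. A sketch, with all names invented:

```python
class Device:
    """A toy device: a narrow read/write interface plus an escape hatch."""

    def __init__(self):
        self._options = {"promiscuous": False}
        self._buffer = b""

    def write(self, data):
        self._buffer += data

    def read(self, nbytes):
        data, self._buffer = self._buffer[:nbytes], self._buffer[nbytes:]
        return data

    def control(self, option, value):
        """The ioctl-style escape route for device-specific settings."""
        if option not in self._options:
            raise ValueError("unknown option: " + option)
        self._options[option] = value
```

New device quirks become new control() options, and read() and write() never have to change their signatures.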

Lastly, there is discipline, which some people like, but this is not that kind of magazine. What I actually mean is that there has to be a decision made about how changes get made in a system. Changing things fast seems to be in vogue at the moment; the so-called extreme programming methodology has led to a lot of this. Many engineers simply decide that at some point an API is set in stone and that it has too many callers to change, and so any changes require new APIs.

Unfortunately, I doubt I’ve solved your real problem, because unless you and your team write everything from scratch you will be at the mercy of people who can, and will, make mistakes. My only other advice is that your team use the smallest number of external APIs possible and not use too many new or advanced features, as those are the ones that are most likely to change.

KV

3.5 Threading the Needle

Why Threads Are A Bad Idea (for most purposes)

John Ousterhout

Sometimes a good quip will last for, and influence, a generation. In the area of programming, John Ousterhout’s commentary on threaded programming, quoted above, is quite well-known and so it was unsurprising when a letter came to KV to ask for thoughts on this topic. The topic has come up more than once as it’s also addressed in Section 3.6. Given the realities of modern hardware, which achieves performance by giving the programmer many cores on which to execute their code, it is impossible to not consider threaded programming in solving many significant software problems, which means everyone must now learn and understand how to write and debug threaded programs.

Interested readers can find the original presentation from Ousterhout here:

https://web.stanford.edu/~ouster/cgi-bin/papers/threads.pdf

Dear KV,

When I was in school, I read a paper on how threads were considered dangerous, but that was before most CPUs were multicore. Now it seems that threads are required to get improved performance. I haven’t seen anything that indicates that threaded programming is any less dangerous than it used to be, so would you still consider threaded programming to be dangerous?

Hanging by a Thread

Dear Threaded,

You might just as well have asked me if guns are still dangerous, because the answer is closely related: only if the gun is loaded, and definitely if the business end is pointed at you.

Threads and threaded programming are dangerous for the same reasons they always were: because most people do not properly comprehend asynchronous behavior, nor do they do a good job of thinking about systems in which two or more processes work independently.

The most dangerous people are those who think that simply by taking a single-threaded program and making it multithreaded, the program will somehow, as if by magic, get faster. Like all charlatans, these people should be put in a sack and hit with a stick (an idea I got from the comedian Darragh O’Brien, who wants to use that method for psychics, astrologers, and priests). I’m just adding one more group to his list.

Probably my favorite example of not thinking clearly about threaded programming was a group that wanted to speed up a system they had developed that included a client and a server component. The system was already deployed, but when it was scaled up to handle more clients, the server, which could handle only one request at a time, couldn’t serve as many clients as was called for. The solution, of course, was to multithread the server, which the team dutifully did. A thread pool was created, and each thread handled a single request and sent back an answer to a client. The new server was deployed, and more clients could now be served.

Just one thing was left out when the new server was multithreaded: the concept of a transaction identifier. In the original deployment, all of the requests were handled in a single-threaded manner, which meant that a reply to request N could not be processed before request N-1. Once the system was multithreaded, however, it was possible for a single client to issue multiple requests and for the replies to return out of order. A transaction ID would have allowed the client to match its requests to the replies, but this was not considered; and when the server was not under peak load, no problems occurred. The testing of the system did not expose the server to a peak load, so the problem was not noticed until the system had been completely deployed.
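What the team left out can be sketched in a few lines; the message format below is invented for illustration, but the idea, a transaction ID carried from request to reply, is the whole fix:

```python
import itertools

# Tag every request with a transaction ID so replies can be matched
# to requests even when they arrive out of order.
txids = itertools.count(1)
pending = {}  # txid -> original request payload

def send_request(payload):
    tid = next(txids)
    pending[tid] = payload
    return {"txid": tid, "payload": payload}

def handle_reply(reply):
    # pop the matching request; an unknown txid raises KeyError
    return pending.pop(reply["txid"]), reply["result"]

r1 = send_request("balance for alice")
r2 = send_request("balance for bob")

# the server answers the second request first
late = handle_reply({"txid": r2["txid"], "result": "$10"})
early = handle_reply({"txid": r1["txid"], "result": "$99"})
print(late, early)
```

With the ID in place, the ordering guarantee that the single-threaded server provided implicitly is carried explicitly in the protocol, which is where it belonged all along.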

Unhappily, the system in question was serving banking information, which meant that a small but nonzero number of users wound up seeing not their own account information but that of other customers, resulting in not just the embarrassment of the development team, but the shutting down of their project, and in several cases, firings. Alas, the firings were not out of cannons, which I always felt was a pity.

What you ought to notice about this story is that it has nothing to do with inter-thread locking, which is what most people think of when they’re told that a piece of code is multithreaded. There is no magic method to make a large and complex system work, threaded or not. The system must be understood in total, and the side effects of possible error states must be well-understood. Threaded programs and multicore processors don’t make things more dangerous per se; they just increase the damage when you get it wrong.

KV

3.6 Threads Still Unsafe?

Unsafe at any speed.

Ralph Nader

The fight over threaded code seems as if it will never end. Threads are a key abstraction used in decomposing a system into cooperating parts, but our tools for understanding them remain woefully inadequate, to the point where newer computer languages, such as Go, have tried to make their use both explicit and comprehensible. The problem with threads isn’t just with our tools, but also with our minds. Experience with software design has shown that the number of people who can comprehend how to build a system out of cooperating, relatively uncoordinated, independent tasks is small.

Dear KV,

Due to performance needs, my team is re-working some old code to run multithreaded so that it can get some advantages from the new multicore CPUs that are now shipping in high-end servers. We’re estimating that it will take at least six months to break down our software such that it will be granular enough to run as multiple threads and to implement all the proper locking and critical sections. I happened to come across an old paper online when looking up other information on threading, “Threads Considered Harmful,” and was wondering what you thought of it. The paper was written long before we had multicore CPUs, and at the time there were few commercial SMP machines, so perhaps it didn’t make sense to go to all the bother of writing threaded code then, but now, things are different. Have you heard of this paper? Do you think it is still valid?

Hanging by a...

Dear Hanging,

John Ousterhout’s warning is as important today as it was when it was written, not because times and technology haven’t changed, but because, alas, people haven’t. Most people seem to decide to create multithreaded code for the reasons you state here, that is, because of wanting to get a supposed performance boost from it. These same people never seem to bother to measure their code or to see if it is even practical to run it in multiple threads; they just start slicing away at the code in the vain hope that if they have enough threads, then suddenly, as if by magic, their code will run faster.

Longtime readers of KV will know that I do not believe in magic bullets. Waving a wand labeled “threads” over your code is about as likely to make it run faster as is sacrificing a chicken. At least you can eat the chicken when you’re done with it, which is more than I can say for your code. In actual fact threading your code may make it run slower, because poorly written threaded code is often slower than poorly written nonthreaded code. The locking primitives required to get locking right are nontrivial and can, if improperly used, slow your code down as it all blocks on the same lock, or, even worse, introduce subtle bugs.

Another problem with threaded code is that the tools used to debug it remain primitive. Although most debuggers now claim to handle threads properly, this is in fact not always the case, and you really don’t want to be debugging your debugger while debugging your code. Race conditions are as hard to debug now as they were 20 years ago, and they don’t seem to be getting any easier to find, let alone fix.

One last thing that a lot of people miss in their rush to thread their code is the support they get from libraries that they link against. If your program requires that the libraries it uses be multithreaded as well, then you may be in for a shock when you realize that some of them are not thread safe. Using non-thread-safe libraries in your thread-safe program is going to cause you no end of trouble.

Given all of this, should you still continue to go about threading your code? Maybe. First you need to understand the trade-offs and see if the job the code does is amenable to being multithreaded. If the code has several components that can operate completely independently, then, yes, multiple threads can be a boon. If, on the other hand, the components all need to access a small shared section of data all the time, then threads will get you nowhere. Your program will spend most of its time acquiring, freeing, and waiting on the locks that protect the shared data.
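A toy Python sketch of the favorable case, components that operate completely independently: each worker below keeps private state and needs no lock at all, while the same loop bumping one shared, locked counter would simply serialize on that lock and gain nothing from the extra threads.

```python
import threading

# Each worker touches only its own slot while running; results are
# combined only after join(). No shared state, so no lock is needed.
N_THREADS, ITERS = 4, 10_000
partials = [0] * N_THREADS

def worker(i):
    local = 0
    for _ in range(ITERS):
        local += 1        # private state: no contention
    partials[i] = local   # publish once, at the end

threads = [threading.Thread(target=worker, args=(i,))
           for i in range(N_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
total = sum(partials)
print(total)
```

The numbers here are invented, but the structure is the test to apply to your own code: if you cannot partition the work this cleanly, the threads will spend their lives queued up behind your locks.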

So, unless it’s really a win and you and your team have thought about it a lot, I’d try not to get hung up in threads.

KV

3.7 Authentication vs. Encryption

Security is a state of mind.

NSA Security Manual

One would think that after 20+ years of having a public network on which people buy and sell various goods, most people who work with technology would understand the difference between authentication and encryption, but as this letter shows, that knowledge is not as pervasive as one might prefer. The online world might be a better place if these concepts were not only well understood, but also applied intelligently and consistently, to pretty much all software systems, and yet...

Dear KV,

We’re building out a new web service where our users will be able to store and retrieve music in their web account so that they can listen to it anywhere they like, without having to buy a portable music player. They can listen to the music at home with a computer hooked to the Internet or on the road on their laptop. If they want, they can download music, but if they lose it through a problem with their computer, they can always get it back. Pretty neat, huh?

Now to my question. In the design meeting about this I suggested we just encrypt all the connections from the users to the web service because that would give us and them the most protection. One of the more senior folks just gave me this disgusted look, and I thought she was really going to lay into me. She said I should look up the difference between authentication and encryption. Then a couple of other folks in the meeting laughed, and we moved on to other parts of the system. I’m not building the security framework for the system, but I still want to know why she said this? All the security protocols I’ve looked at have authentication and encryption, so what’s the big deal?

Sincere and Authentic

Dear Authentic,

Well, I’m glad she laughed, screaming hurts my ears when it’s not me doing the screaming. I’m not sure what you’ve been reading about cryptography, but I bet it’s some complex math book used in graduate classes on analysis of algorithms. Fascinating as NP completeness is, and it is fascinating, these sorts of books often spend too much time on the abstract math and not on the concrete realities of applying the theories in creating a secure service.

In short, authentication is the ability to verify that an entity, such as a person, a computer, or a program, is who they claim to be. When you write a check, the bank cashes it because you’ve signed the check. The signature is the mark of authenticity on that piece of paper. If there is a question later as to whether you actually wrote me a check for $1,000,000, let’s say if I decide to deposit it in my bank account, then the bank will check the signature.

Encryption is the use of algorithms, whether they’re implemented in a computer program or not, to take a message and scramble it so that only someone with the correct key to unlock the message can retrieve the original.

It’s pretty clear from your description that authentication is more important to your web service than encryption at the moment. Why is this? Well, what you care most about in your situation is that users can only listen to the music they’ve purchased or stored on the server. The music does not need to be kept secret because it is unlikely that someone is going to steal the music by sniffing it from the network. What is more likely is that someone will try to log into someone else’s account to listen to their music. In order for a user to prove who they are, they will authenticate themselves to your service, most likely via a username and password pair. When the user wants to listen to their latest purchase, they present the username and password to the system in order to get access to their music. There are many different ways to implement this, but the basic idea, that the user has to present some piece of information that identifies them to the system to get service, is what makes this authentication and not encryption.

The password need not be encrypted, only hashed, before being sent to the server. A hash is a one-way function that takes a set of data and transforms it into another piece of data from which the original cannot be retrieved by anyone, including the author of the hash function. It is important that the hash function produce unique data for each input, as collisions make it possible for two different passwords to produce the same hashed data, and that would make it harder to differentiate users.
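A minimal sketch of that flow in Python, with an invented salt and a bare SHA-256 for brevity; modern practice favors a purpose-built password-hashing scheme such as bcrypt, scrypt, or argon2 over a plain cryptographic hash, but the one-way idea is the same:

```python
import hashlib
import hmac

# One-way: the server stores only the hash, never the cleartext.
def hash_password(password, salt):
    return hashlib.sha256(salt + password.encode()).hexdigest()

salt = b"per-user-salt"            # stored alongside the hash
stored = hash_password("foo", salt)

# Verification repeats the computation and compares hashes;
# compare_digest avoids leaking information through timing.
ok = hmac.compare_digest(stored, hash_password("foo", salt))
print(ok)
```

The salt must be stored with the hash so the same computation can be repeated at verification time; its job is to keep two users with the same password from having the same stored hash.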

There are plenty of books and papers on this sort of stuff, but try to avoid the pie in the sky stuff unless you’re researching new algorithms, because you really don’t need it, and it’ll just make your head hurt.

KV

3.8 Authentication Revisited

The trouble with quotes on the Internet is that they’re very hard to authenticate.

Abraham Lincoln

Oftentimes the responses to KV’s writing are even more interesting, and impassioned, than the letter that started things off. The next letter and response bring up an interesting side of the issue discussed in Section 3.7.

Hello dear KV,

Supposing I’m a customer of Sincere-and-Authentic’s and suppose the sysadmin at my ISP is an unscrupulous, albeit music-loving, geek. He figured out that I have an account with Sincere-and-Authentic. He put in a filter in the access router to log all packets belonging to a session between me and S-and-A. He’d later mine the logs and retrieve the music. Without paying for it.

I know this is a far-fetched scenario, but if S-and-A want their business secured watertight, shouldn’t they be contemplating addressing it too? Yes, of course, S-and-A will have to weigh the risk against the cost of mitigating it, and they may well decide to live with the risk, but I see your correspondent’s suggestion as at least worth a summary debate, not something that should draw disgusting looks! There’s in fact another advantage to encrypting the payload, assuming IPSec isn’t being used: decryption will require special clients, and that will protect S-and-A that much more against the theft of their merchandise.

Balancing is the Best Defense

Dear Balancing,

Thank you for reading my column in the April 2005 issue of Queue. It’s nice to know that someone is paying attention. Of course, if you had been paying closer attention, you’d have noticed that S&A had said, “In the design meeting about this I suggested we just encrypt all the connections from the users to the web service because that would give us and them the most protection.” That phrase “just encrypt all the connections” is where the problem lies.

Your scenario is not so far-fetched, but S&A’s suggestion of “encrypting all the connections” would not address the problem. Once the user had gotten the music without their evil ISP sniffing it, they would still be able to redistribute the music themselves. Or, the evil network admin could sign up for the service themselves and simply split the cost with, say, 10 of their music-loving friends, thereby getting the goods at a hefty discount. So, what S&A really needs is what is now called Digital Rights Management. It’s called this because for some reason we let the lawyers and the marketing people into the industry instead of doing with them what was suggested in Shakespeare’s Henry VI.

What S&A failed to realize was that the biggest risk of loss of revenue was not in the network, where only a small percentage of people can play tricks as your ISP network administrator can, but at the distribution and reception points of the music. Someone who works for you walking off with your valuable information is far more likely than someone trying to sniff packets from the network. Since computers can make perfect copies of data (after all, that’s how we designed these things in the first place), it is the data itself that must be protected, from one end of the system to the other, in order to keep from losing revenue.

All too often people do not consider the end-to-end design of their systems, and instead “just” try to fix one part.

KV

3.9 Authentication by Example

We should treat personal electronic data with the same care and respect as weapons-grade plutonium; it is dangerous, long-lasting, and once it has leaked, there’s no getting it back.

Cory Doctorow

Now that we all know the difference between authentication and encryption we can turn our attention to the proper use of authentication. I’d like to believe that this letter is old enough that no one in their right mind would ever think of doing what the original author did to create their first pass at an authentication system, but, at this point, I think we all know better.

At a somewhat higher level there remain many issues in building and fielding an authentication system, beyond those that are covered in the letter. One of the key questions is longevity: how long should an authenticated session be allowed to last, and should continued use extend the time of a particular session?

The answers to this question run the gamut from terminating a session after a few minutes of idle time, in the case of banking applications, to seemingly forever for chat systems like Slack, or cat watching web sites like Facebook. The banking world has a default deny policy, which is not meant to protect their customers but to cover the bank’s...potential losses, showing once again that the only way to get things done is to start a lawsuit. If you’re not a banking application, how do you decide a proper session timeout? Is there some algorithm we can apply to make this decision easy? Yes, these questions are rhetorical.

Choosing a session timeout comes down to judging the downside risk of an attacker being able to acquire the same rights as a valid user of the system. If the only risk is that the attacker can also see pictures of cats, read-only access to free, public, nonthreatening data, then the session timeout can be quite long.

Consider the front page of an online newspaper. Most newspapers now have paywalls, with varying degrees of success, but they all leave their front pages open, or else how would they attract readers to their site? Such a system can of course have a long timeout, or no timeout at all, so long as the only capability is the one of reading the free pages of the news. Once past the paywall, of course, the session needs to have some timeout, or else an attacker can get an infinite token with which to bypass, and therefore not pay for, the paywall. An infinite token could (and would) be posted to a system like reddit, or 8chan, for use by everyone who did a search for it.

Now that we know we need a timeout, how long should it be? How often do people check the news? Probably for a news site the timeout should be a few hours, as this is how long we might expect someone to go between checks of the daily news, and maybe we extend it to nearly a day, so that the user has to log in again the next day, as if they were paying for a daily paper. The decision will, of course, have to be worked out with other groups, such as marketing, but I’ll not go into how to deal with people like that here.

One might be tempted to think that the shortest timeout is the best, but this is actually not always the case, especially if a session timeout causes the user to type their password again. The more often a user has to type their password, the more likely they are to write it on a sticky note and stick it to their desk. While I wish this were a joke, it’s not. Human beings have terrible memories for things like passwords or passphrases, and not even the advice of Randall Munroe seems to be able to help them; see https://xkcd.com/936/.

With the advent of fingerprint and face recognition systems for phones, tablets, and some laptops, the password problem has been reduced, but not eliminated. Switching to having something that you are, rather than something that you know, can change the session timeout calculus, but I know of no web site, banking or otherwise, that does not use a password, even if biometrics are an option.

So now let’s return to the letter, which isn’t about session timeouts, but once you’ve read it and the response, you’ll see that the session timeout question follows not long after.

Dear Kode Vicious,

I am a new webmaster of a (rather new) website in the company’s intranet. Recently I noticed that although I have implemented some user authentication (a start *.asp page linked to an SQL server, having usernames and passwords), some of the users found out that it is also possible to enter a rather longer URL to a specific web page within that web (instead of entering the web’s homepage), and they go directly to that page without being authenticated (and without their login being recorded in the SQL database). It makes me wonder what solution you could advise me to implement in order to ensure any and all web accesses are checked and recorded by the web server.

New Web Master

Dear NWM,

You have my deepest sympathies; users, as I have noted before, are the bane of our existence. Those sneaky bastards will go around your systems every time just to get the data they want, without paying your login system any heed. Now, there are many ways to deal with recalcitrant users, but unfortunately I can only discuss the ones that are legal here; the others, trust me, are far more enjoyable.

Not to be too cruel, but it seems from your description that you have created a superfluous authentication system that doesn’t provide much for the users or for you. Users can and do go around your login page, and therefore you’ve just created more code without any real value, and valueless code is a real shame. You need to change your way of thinking about this problem before you can solve it.

Right now using your authentication system is voluntary, and therefore easily bypassed, because you are not enforcing the authentication on each page that your users can see. In your letter you don’t say that your system has any of the necessary features of an authentication system, such as:

1. What the authentication gives the user the right to do. For example, can they read, modify, or create pages?

2. How the user proves to the system that they were authenticated.

3. How the user and the system agree that the user is who they say they are.

You have a system of web pages, which you believe contain valuable information since you claim you want to protect them. Yet those pages have no protection from anyone who can work out or guess what the name of the link is?! That’s just plain wrong. If you have information that must be protected, then you should be protecting it. One way to protect it is to implement the features shown in the list, namely, that the user must prove who they are to your login system and then must prove they were authenticated and have a right to see the information whenever they want to read a page.

How does the user prove they have these rights? The user must first talk to the login system to prove who they are. In your case you implemented a web page that fronts a database of usernames and passwords.

As a quick aside, I hope you are storing only the hash of the password and not the raw text. A hash uniquely destroys the password so that if the password was foo, the resulting hashed value might be the number 5. Given the number 5 someone cannot get back to foo, but given foo and the same hash function, you will always get 5. When verifying a password, you’re really comparing the hashed values; the original string foo is never stored. Keeping a database of raw username and password pairs is a serious security hole, the kind of bug that makes KV want to make a big pot of programmer hash.

So, now the user can prove who they are by logging in with their username and password, but how can you satisfy feature 2? They submit a username and password to your system, but, well, so what? The real problem with your pages is that the pages themselves are not protected. If you want to force users to authenticate themselves, then each request to your web server must include some piece of information that proves the user has been authenticated. If the server does not check requests to see if they come from authenticated users, then there is no point in having an authentication system at all. What should the user have to present to the system?

In the web world the most common piece of information exchanged between a user, or more exactly the user’s web browser and a server, is a cookie. A cookie is just a chunk of data that can be set by a server into a user’s web browser. When the user is browsing within a particular domain, the server can look at the cookie and get information from it. In an authentication system the server should set a cookie into the user’s browser that is then checked on every subsequent access to the system to validate that the user has the right to use the system.

What should go into the cookie? Well, that’s mostly up to the data that you, as the system maintainer, wish to track, but two things are absolutely necessary to prevent the system from being abused. The first is that the cookie must have a digital signature. A digital signature can prevent the cookie from being manipulated by a user to gain access when they should not. If someone were to find out the format of your cookies and your cookies were not signed, then that person could just craft their own cookies and present them to the server, bypassing your authentication system. The second necessary part is that the cookie should contain a timeout, past which the cookie is no longer good and must be replaced. Infinite timeouts on authentication cookies make those cookies very valuable to steal, because once acquired they can never be revoked. Picking the timeout, though, is a balancing act. Your users will want their authentication tokens to last as long as possible, perhaps for months, while to maintain control of your systems you would prefer something much shorter, such as one hour. Finding the compromise between the permissive and the fascist timeout is beyond the scope of this article and depends on your users and how much management backup you have to force them to behave in the way that you’d like. I find reminding management of how much money they’ll lose if users are able to leak and leach information from the system to be very effective in getting shorter timeouts implemented. Management hates losing money like KV hates losing the keys to his liquor cabinet.
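Those two requirements, a signature and a timeout, can be sketched in a few lines of Python. The key, the field names, and the encoding here are all invented for illustration, and a production system should reach for a vetted session library rather than rolling its own:

```python
import base64
import hashlib
import hmac
import json
import time

SECRET = b"server-side-secret"   # hypothetical key, never sent to clients

def make_cookie(username, lifetime_seconds):
    body = json.dumps({"user": username,
                       "expires": int(time.time()) + lifetime_seconds})
    sig = hmac.new(SECRET, body.encode(), hashlib.sha256).hexdigest()
    return base64.b64encode(body.encode()).decode() + "." + sig

def check_cookie(cookie):
    encoded, sig = cookie.rsplit(".", 1)
    body = base64.b64decode(encoded).decode()
    want = hmac.new(SECRET, body.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, want):
        return None              # tampered: signature does not match
    claims = json.loads(body)
    if claims["expires"] < time.time():
        return None              # expired: past the timeout
    return claims["user"]

c = make_cookie("alice", 3600)
print(check_cookie(c))
```

A forged or edited cookie fails the signature check, and a stolen one stops working once its timeout passes, which is exactly the pair of properties the text above demands.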

So, now you have a model of an authentication system, instead of just a system that happens to record users logging in when they feel like it. Users log in, they get a cookie that is digitally signed to prevent tampering and that has a limited lifetime, and they must then use that cookie to read any of the other pages. There are many ways to implement this, but that’s the general outline, and there are plenty of examples of how to do this on the net, so get back in there and fix this!

KV

3.10 Cross-Site Scripting

There is no such thing as perfect security, only varying levels of insecurity.

Salman Rushdie

One of the crosses I have had to bear in preparing this volume is realizing how many times the same topic has arisen, and even though the advice given is simple and direct, the problem continues, nearly unabated. The following response on cross-site scripting was written more than a decade ago, and yet if I were to type those words into a search bar, or for even more fun, Mitre’s Common Vulnerabilities and Exposures search (https://cve.mitre.org/cve/search_cve_list.html), I would find that not only does the problem continue unabated but that there are 341 total search results with 4 of them occurring in the first few months of this year.

The reason for nearly all cross-site scripting (XSS) vulnerabilities is a near complete disregard for properly validating input, which is the over-arching problem, while XSS is just a specific instance. I wish I could say that it’s amazing to me that after 20+ years of people developing code that faces the global Internet, koders still fail to validate user input, but then most people still don’t wash their hands after they use the toilet, so maybe people just never learn.

The input validation meta issue brings us to three important points in system design when handling user input:

• Never pass user input to anything that will treat the input as something to execute.

• Try to match any user input against any known good pattern.

• Use the built-in sanitation routines in your chosen language.

Every language, for better or for worse, usually for worse, has a way to call out to the system on which it’s running to get some work done, such as deleting a file, changing permissions, or starting another program. These catchall idioms usually look like the system() routine found in the C library, and nearly every language, Python, PHP, Go, Rust, and on and on, has one. The best course of action is to absolutely never use this idiom. The second best course of action is to never allow input from a user to find its way to that idiom. The problems caused by this idiom are so common that nearly all static analyzers will search your code and put big screaming warnings up when they find such a thing.
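A small Python sketch of the difference, with an invented input; the dangerous version is left as a comment so nothing here actually hands the string to a shell:

```python
import subprocess

# An attacker-supplied "filename" that smuggles in a command
user_input = "notes.txt; echo pwned"

# Dangerous idiom (shown only as a comment, never run):
#   os.system("ls " + user_input)   # the shell executes "echo pwned"

# Safer: pass arguments as a list, so no shell ever parses the input
# and the whole string stays one literal argument.
result = subprocess.run(["echo", user_input],
                        capture_output=True, text=True)
print(result.stdout.strip())
```

The list form keeps the semicolon inert: it is data handed to one program, not syntax handed to a shell.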

Now that we’re past the point of handing user input to the system() routine, let’s think about how we might properly accept input from the user. While it is not always possible to anticipate every user input, there are many cases in which we are only interested in a few types of responses, such as a predetermined set of answers to a question. If we are lucky enough to be in this situation, then we can build that predetermined list into our system and disallow any input not part of the list. Reversing Postel’s early Internet programming wisdom, we would be conservative in what we accept.
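As a sketch of that approach, assuming a question with a fixed set of valid answers (the set here is invented for illustration):

```python
# Known-good pattern matching: only members of the predetermined
# set are accepted; everything else is rejected outright.
ALLOWED_ANSWERS = {"yes", "no", "maybe"}

def validate(answer):
    a = answer.strip().lower()
    if a not in ALLOWED_ANSWERS:
        # conservative in what we accept: reject anything off the list
        raise ValueError("input not in the allowed set")
    return a

print(validate("Yes"))
```

Note that this is an allowlist, not a blocklist: we name what is acceptable rather than trying to enumerate every possible attack, which is a game the attacker always wins.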

Finally, for point three, we come to the case where we have to deal with nearly arbitrary input from the user, and this is the case that is covered in the following letter and response.

Dear KV,

I know you usually spend all your time deep in the bowels of systems with C and C++, at least that’s what I gather from reading your columns so far, but I was wondering if you could help me with a problem in a language a little further removed from low-level bits and bytes: PHP. Most of the systems where I work are written in PHP and, as I bet you’ve already worked out, those systems are web sites. My most recent project is a merchant site that will also support user comments. Users will be able to submit reviews of products and merchants to the site. One of the things that our QA team keeps complaining about is possible XSS attacks, cross-site scripting. Our testers seem to have a special ability to find these, and so I wanted to ask you about this. First, why is cross-site scripting such a big deal to them; second, how can I avoid having such bugs in my code; and finally, why is cross-site scripting abbreviated XSS instead of CSS?

Cross with Scripted Sites

Dear CSS,

First, let’s get something straight, I may spend a lot of time with C and C++, but I object to the use of the word “bowels” in this context. My job is bad enough without having to now have this image of literally working in the bowels of anything.

Let me answer your last question first, since it’s the easiest. The reason that cross-site scripting is abbreviated XSS is the same as the reason that I spell code as kode. Programmers and engineers think they’re clever and like to put their mark on things by changing the language, in particular turning every possible term they coin into some acronym that only they know. It is one of the side effects of specialization that we will leave alone, just now, before my more literate friends come after me with torches and pitchforks.

Now, back to what I think we can both agree are your more serious queries about cross-site scripting, that is, the ability to inject JavaScript into a site and then have the site send that scripting code on to the user. There are actually many risks involved in cross-site scripting attacks because the JavaScript code can do many different malicious things. For example, the code can completely rewrite the displayed HTML, which in your case means that someone else would be able to completely overwrite the reviews that a user submitted, probably not something you’d like others to be able to do. Another example is that the malicious code can steal the user’s cookies, and cookies are often used in web applications to identify the user. If the user’s cookies get stolen, then the attacker can become the user and perhaps take over their account. If your site uses cookies in this way, this is a pretty big risk. So, you can see why QA gets their knickers in a twist, and to be honest I’m surprised they never bothered to explain just why this was a risk, or maybe they just assumed you knew better.

Winding up with a cross-site scripting bug is almost always the result of not doing proper input validation. Since you say that you’ve read earlier columns, you must know that I don’t trust users, and neither should you. When designing a web site, you have to accept that with millions of potential users, some percentage of the people who use your site will attack it. It’s the way the world is; some people are just jerks. This means we have to design not only for the regular users but also for the jerks.

In the case of working with user reviews, I’m sure that some marketing type has demanded that users be able to not only upload plain text like, “Wow, this merchant is great, I got all my stuff in just 24 hours, I’d buy from them again!” but also to be able to use HTML like <b><font color="red">Wow!</font></b>, which is full of bold and red, and, if they could get away with it, dancing GIFs, because marketing people seem to get paid based on the number of incredibly stupid features they add to a project. I direct this comment not at all marketing people, just those who think that an interface with 20 buttons is far better than one with 10. I believe Dante wrote about such people and that there was a special level of hell for them. The problem before you is how to let some subset of HTML through, at least the bold, underline, and perhaps colors, and to not allow anything else. The approach you’re looking for is a white list, and in pseudo-code a function to clean up a string to allow only these tags looks something like the following:

//
// Function: string_clean
// Input: an untreated string
// Output: a string that contains only upper- and lowercase letters,
//         numbers, simple punctuation (. , ! ?), and three types of HTML
//         tags: bold, italic, and underline.
//
string string_clean(string dirty_string)
{
    string return_string = "";

    array html_white_list = ['<b>',  // bold
                             '<i>',  // italic
                             '<u>']; // underline

    array punctuation_white_list = ['.', ',', '!', '?'];

    for (i = 0; i < len(dirty_string); i++) {
        if (isalpha(dirty_string[i])) {
            return_string += dirty_string[i];
            continue;
        } else if (isnumber(dirty_string[i])) {
            return_string += dirty_string[i];
            continue;
        } else if (dirty_string[i] in punctuation_white_list) {
            return_string += dirty_string[i];
            continue;
        } else if (dirty_string[i] == '<') {
            tag = substring(dirty_string, i, i + 2);
            if (tag in html_white_list) {
                return_string += tag;
            } else {
                return_string += ' ';
            }
            i += 2;       // skip past the rest of the three-character tag
            continue;
        }
        return_string += ' ';  // default: replace the character with a space
    }

    return return_string;
}

The string_clean function has several features I’d like to point out. The first is that it is very strict, probably stricter than you’ll be able to get away with when dealing with marketing, but I wish you luck. The allowed characters are all the upper- and lowercase roman alphabetic characters, all ten digits, and four types of punctuation: periods, commas, question marks, and exclamation points. No parentheses and no braces are allowed, which protects against the case of ?{ getting through. In terms of HTML, only three tags are allowed: bold (<b>), italic (<i>), and underline (<u>). The function is implemented as a white list, which means that only the allowed characters are appended to the returned string. Many string-cleaning routines are implemented as black lists, which is to say they list what is not allowed. The problem with black lists and white lists was treated in the letter from Input Invalid, so I won’t go over the details again here. For those interested in efficiency, note that we check for the most common case first, a letter; then the next most common, a number; and then the least common cases, which are punctuation and finally the allowable tags. I picked this order so that the code would append the character and go round the loop most quickly in the most common cases, which hopefully gives us the best possible performance. You should also note that the default action is to ignore the input character and simply append a space to the return string. We append a space so that it is possible to see where there might have been illegal text. Simply removing the offending character makes it too easy to miss where the attack may have been.
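
For readers who want to run the idea, here is a rough Python transliteration of the pseudocode above; it keeps the same behavior, including blanking everything that isn’t on the lists. (Note that, like the pseudocode, it whitelists only opening tags, so closing tags such as </b> get blanked; a real filter would need to handle those as well.)

```python
HTML_WHITE_LIST = {"<b>", "<i>", "<u>"}        # bold, italic, underline
PUNCTUATION_WHITE_LIST = {".", ",", "!", "?"}

def string_clean(dirty_string):
    """Whitelist filter: pass ASCII letters, digits, simple punctuation,
    and three HTML tags; replace everything else with a space."""
    out = []
    i = 0
    while i < len(dirty_string):
        c = dirty_string[i]
        if c.isascii() and (c.isalpha() or c.isdigit()):
            out.append(c)
        elif c in PUNCTUATION_WHITE_LIST:
            out.append(c)
        elif c == "<":
            tag = dirty_string[i:i + 3]        # allowed tags are 3 chars
            out.append(tag if tag in HTML_WHITE_LIST else " ")
            i += 2                             # skip the rest of the tag
        else:
            out.append(" ")                    # default: blank it out
        i += 1
    return "".join(out)
```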

Of course, this is a simple first pass at a filtering function, and it would have to be tailored to your environment, but I hope it gives you a shove in the right direction. In order to protect against such attacks you must not only code such a function, but you and everyone on your team must use it for each and every case of user input. I cannot count the number of times that a suitable filtering function existed in a library and yet, for some perverse reason, the engineers working on the product decided to simply ignore it or to go around it because they felt they were better at treating the input themselves. I have one piece of advice for such people, don’t do it. If you require some special abilities with a particular piece of input, then either extend the function or create a new one, one that can also be used consistently. It will save you a lot of time and headaches in the long run.

KV

3.11 Phishing and Infections

Using encryption on the Internet is the equivalent of arranging an armored car to deliver credit card information from someone living in a cardboard box to someone living on a park bench.

Gene Spafford

Phishing is perhaps the least technical of the attacks that we have to protect against using technical means, for phishing is, in reality, more like a human to human con job; rather than being an attack on code, it is an attack on the system component between the keyboard and the chair.

People have been running con jobs on each other since there were two people and one wanted to get something from the other via less than honest means. Phishing is just the ability to trick people using computer systems connected to a global network, which broadens the attack and reduces the individual risk of getting caught, since the attacker and the attacked do not have to meet in the real world.

Many attempts have been made to make phishing harder via technical means, such as validating cryptographic signatures on email, having web browsers block known “bad” sites, changing the color of their address bar, displaying a lock to show a properly authenticated and encrypted session, and many other things, none of which have really made a dent in the phisher’s ability to fool some of the people some of the time.

Since phishing is such a human endeavor it’s actually the humans that we need to fix, although as the letter and response in this section show, there are some technical things we can do to make the phisher’s life more difficult.

It might seem obvious that after more than a decade of stories about phishing in the news, and, for those of us in the corporate world, too many awful, animated “trainings” about phishing, people would have become naturally more suspicious, but this has not been KV’s experience. It often seems as if there are people who are naturally suspicious and those who are not. I have often told the story of my mother, who was not someone who worked with computers, and how she handled her email:

kv: I was thinking of getting you a new computer.

mom: I like my machine, it’s fine.

kv: Sure, but it must be full of viruses by now.

mom: No, no problems.

kv: How can that be?

mom: I don’t browse random sites on the Internet, and when I get a mail from someone I don’t know, I immediately delete it, without opening it, and then empty the trash to make sure it’s gone.

kv: ...

I think back to this conversation a lot because I wish that half the people I’ve met in technology positions did the same, and because it reminds me that part of the proper mindset to avoid phishing has nothing to do with technical know-how; it is simply that suspicious way of thinking that I was immersed in from childhood and that my entire family shares. How can one pass this sort of thinking on to others, short of bringing them up in a family of admitted paranoids? Perhaps we can use the immortal words of General Jack D. Ripper:

I want to impress upon you the need for extreme watchfulness. The enemy may come individually, or he may come in strength. He may even come in the uniform of our own troops. But however he comes, we must stop him. Now, I’m going to give you three simple rules: First, trust no one, whatever his uniform or rank, unless he is known to you personally; Second, anyone or anything that approaches within 200 yards of the perimeter is to be fired upon; Third, if in doubt, shoot first then ask questions afterward. I would sooner accept a few casualties through accidents rather than lose the entire base and its personnel through carelessness.

which for the case of phishing means we teach people two important things:

Trust no one, even if they’re personally known to you. If you get a request to do something, verify that request via a different channel; e.g., if you get an email, call them.

Shoot first and ask questions afterward. For email communications this means to delete an email you think to be phishing. If it’s really important, and actually valid, the person will try to get ahold of you again, and at that point if you’re still worried, you can contact them via an alternate means.

One topic I didn’t cover in my response, which I should mention here, is the idiocy of most password recovery questions, as this comes into play once someone knows they’ve been phished. It boggles KV’s already addled mind that in 2020 there are still systems that ask questions that are easy to glean from public records, or online searches, such as mother’s maiden name and locations where you’ve lived. One colleague in the security world treats whatever question is asked as if it asked, “What is your philosophy of life?” and fills in something witty and memorable, only to him, into the field, so that if an attacker ever has to talk to a customer service agent to steal an account, it would be quite difficult. Of course, given the number of people who use “password” as their password, which is depressingly large, this won’t help everyone, but it would be useful to those of us who actually care about our online security.

If we can somehow get these ideas to stick in people’s heads, the problem of phishing will be greatly reduced, but until that day I guess we’ll have to color our URLs and throw up large warnings and hope for the best.

Dear KV,

I noticed you covered cross-site scripting a few issues back, and I’m wondering if you have any advice on another web problem, phishing. I work at a large financial institution, and every time we roll out a new service, the security teams come down on us because either the login page looks different or they claim that it’s easy to phish information from our users using one of our forms. It’s not like we want our users to be phished, we actually take this quite seriously, but it’s also, I don’t think, a technical problem; our users are just stupid and give away their information to anyone who seems willing to put up a reasonable fake of one of our pages. I mean come on, doesn’t the URL give away enough information?

Phrustrated

Dear Phrustrated,

Ah, yes, your users are stupid: they just sit around waiting for someone to pop up a login screen or a page full of inputs for personal information and fill them out; they’re doing it just to get you. It’s very comfortable to think along these lines because it leaves you feeling superior and means you don’t have to do any work to fix the problem; instead you think you should fix the users. Unfortunately, as I have learned from long experience, beating stupid people doesn’t make them any smarter.

Now, I don’t like users any more than you do; they’re demanding and want things simple and just ruin my fun at playing with what really are “my toys.” Alas, we’re both paid to actually make these toys work well in the service of the users, and so we have to give them some small consideration. So, what is phishing? Well, first of all it’s a very annoying misspelling, far more annoying than “kode” for code. Go ahead, call me a hypocrite, I’ll wait.

More to the point, phishing is the ability of an attacker to get someone to give away important or useful information. At the highest level this is one of the oldest tricks in the book and likely dates back as far as the oldest profession. It’s a con job, your users are the marks, and unless the marks are paying attention, they are going to be conned out of their username and password, or Social Security number, phone number, birthday, etc. The Internet has amplified the abilities of the old-time con man (there must also be con women, but that word is not listed in my dictionary) because now there is a tremendous amount of important information stored on computers and because the Internet reaches into hundreds of millions of homes from anywhere on Earth.

Although there is no sure technical fix to the phishing problem, there are ways to evaluate possible solutions, and these ought to be kept in mind for those moments when someone in a meeting says, “Well, if we just...” I have an allergic reaction to that particular phrase because it is usually followed by a specious or poorly thought out suggestion that sounds good until you try to follow it to a logical conclusion.

Obviously, there are a lot of smart people thinking about this, but the best advice I’ve seen on evaluating anti-phishing technologies comes from Rusty Shackleford. Rusty’s rules can best be summed up in this way:

• Anything the attacker can see, the attacker can spoof.

• Anything the user knows, the user can, and will, disclose.

Corollary: Anything the user’s browser knows, the user’s browser can and will (quite willingly) disclose.

• Your solution is only as good as its first step. That is, your solution is only as good as what your users have to do when they find themselves in unfamiliar surroundings.

Let’s take these one by one. “Anything the attacker can see, the attacker can spoof.” It may seem simple, but many people miss this point. Often people go to a lot of trouble to put visual cues into pages so that the user “knows” they’re logging into the right page. The problem is that anything you show to all your users, all the “bad guys” can see as well, and pretty easily re-create, no matter how complex they are. Eventually all that complexity just gets lost to the user anyway, so don’t bother; they’re not going to notice. Now, if you can come up with something that personalizes the page, perhaps an image or a picture, and not one chosen from a list of 10, or 100, then that might begin to provide some protection. Sounds are another way to personalize a page, though an annoying one for those of us who hate noise in public places like cafes or cubicles.

A harder problem to tackle is the second rule, “Anything the user knows, the user can, and will, disclose,” with its corollary about the user’s browser being tricked in place of the user themselves. Let’s face it, this is the root of the problem; the users are being tricked, and if an anti-phishing system just depends on a different set of data being collected from the user, for instance the question “What is the meaning of life?” instead of “What is your mother’s maiden name?” then you’re just moving the problem around, like shuffling the deck chairs on the Titanic. One of the goals of a good anti-phishing system should be, if possible, to do away with collecting any “personal and secret” information from the user, because that information isn’t really personal or secret and they’ll willingly type it into a cleverly built phishing page.

Perhaps the hardest of the rules to understand is, “You’re only as good as your first step.” What Rusty was saying here is that no matter how good and tightly constructed the later phases of your system, the whole thing will unravel if the first step is susceptible to any of the issues pointed out in the previous two rules. Confused users are the easiest ones to phish. So, for example, if your account recovery page requires the user to type in a lot of complex “personal and confidential” information that the user knows, it is likely that the phisher is going to take advantage of this, and instead of hosting a fake login page, they’ll happily host a fake account recovery page. It’s just as easy to steal an account with the account recovery information as it is with the login and password.

Alright, I admit it, I was unable to solve phishing in 1,200 words or less, but I hope Rusty’s advice is a help to you and makes you a little less frustrated. It’s certainly allowed me to quash some of the more questionable anti-phishing ideas I’ve heard of, and some days, that’s half the battle.

KV

3.12 UI Design

A common mistake that people make when trying to design something completely foolproof was to underestimate the ingenuity of complete fools.

Douglas Adams

What can a low-level systems person really know about UI design? Well, I may not know much about UIs, but I know what I like, which is to say I know a lot about what I don’t like. The following letter and response aren’t so much about how to design a UI but how to insulate the design of the UI from the design of the overall system. Many modern user interfaces are built on the Model-View-Controller and Model-View-Presenter paradigms, which do an adequate job of insulating how the system looks from how it does the job it needs to do. The model is meant to contain the data, and the controller the logic for manipulating that data, while the view or presenter presents that data to the user in whatever way the UI designer thinks is most appropriate. Keeping the UI designer in a box where their choices cannot negatively affect the overall logic of the system is often best; sometimes I even drill air holes into that box for them, but sometimes I do not.

The following letter and response do not discuss a formal paradigm such as MVC or MVP but do touch upon the important point, especially for those doing large systems design, of trying, as much as possible, to create a clean interface between how the system looks to the user and what it does for the user, for this type of fence makes for the best type of neighbors.

Dear KV,

I’ve been reading your column occasionally in Queue, and I haven’t seen you address anything related to user interface design and how it can completely torque a piece of software. I happen to work as a programmer on a project for a company that sells point-of-sale software, which is a nice way of saying cash registers. The goals of the marketing people and user interface designers (we have several for our different product lines) always seem to twist the software in directions that only make it more fragile. These people ask for features that might make the user interface seem easier to customize or use to a naive user, but that any of the programmers on the project can plainly see will have a negative impact on code size, clarity, or some other nasty side effect. Several times our releases have been delayed because mid-stream we were asked for a feature that proved to have such a horrible side effect that it had to be removed right at the end. Sometimes these “features” are really just visual changes, but our system is so easy to change visually that it seems to invite the marketing and design folks to change it just for the fun of it, as if the color of the buttons ought to be red one day and then blue the next. Is there any way to make these pests go away?

Torqued

Dear Torqued,

What’s this that you “...only read me occasionally!?” Well, if you had been reading all the time, you’d know that I haven’t actually addressed your issue, but I’ll let you in on a little secret. Long before my lucky escape from the presentation layer I thought I too would like to work on user interfaces. After all, they’re the first thing you see when someone uses your software, and done right they are the first thing that people praise. Very few people come up to you and say, “Hey, nice protocol, I really like the way you made use of that spare bit in the flags field.” It was actually quite a big kick the first time I could say to my mom (yes, I have a mom and was not hatched, as so many people seem to claim), “Hey, see that? I did that!” and could explain in simple terms what the thing I did was doing. Two things took me away from UI work: the first was a deep interest in the lower levels of operating systems such as device drivers and networking, and the second was the same marketing and design people you mention. I remember one, who seemed to almost delight in asking for changes, as if the accumulation of modifications were his personal contribution to the system we were building, which it wasn’t; it was just a drain on resources. Alas, this particular person wasn’t fired out of a cannon, as I often asked management to do, nor even fired; he just took another job, I suspect in order to find someone else to try and torture, or because I kept slashing his tires. I still wonder if he ever figured out who it was who did that.

Now, before I go on, let me say that there are some wonderful user interface designers who really get that it takes more than a few seconds to make pervasive changes in software and who, therefore, choose their requests carefully and work with the programmers to get things into a suitable state. There are five of them, and no, I will not give out their email addresses.

Over the years I have developed a few strategies for dealing with the more interfering types of designers, who aren’t really designers at all; they’re just quacks who can talk smoothly about the use of color and are fine for helping you decorate your house but useless at software. Oh, and none of these strategies involves violence or will get you arrested, I think, so I feel that I can share them.

One of the most important things a programmer or software architect can do in the early stages of building a system, to protect themselves from being yanked around, is to separate the way in which information is presented to the user interface from how it is stored and manipulated in the real system. I know that such a rule must seem obvious, but I have seen people do the exact opposite time and again. It is a fine idea to look at what data the system needs to store and manipulate as a set of inputs, but how it is stored and manipulated must be separate from how it is input or you are already lost. Dante seems to have left out this particular region of hell in The Inferno, likely because such sins were less prevalent before computers and user interfaces became ubiquitous, but it is there nonetheless. I know, I’ve been. I got the scars and the company T-shirt to cover them with.

OK, now that we’ve got the presentation and manipulation systems apart, the next thing to do is to design a presentation layer that is easy to change without breaking the rest of the system. If your designer wants to change the color, text, font, font size, and character set used throughout the user interface on a weekly, daily, or hourly basis, let them! It gives you more time to code real features if they’re off playing with Pantone color wheels, fonts, and button borders. Make sure that changing the user interface doesn’t require any knowledge of building the software or using any complex tools like text editors. The last thing you want is to babysit, as I have, the designer as they complain that every time they build the code, it breaks. It should be completely unnecessary to compile anything to change the user interface.

Making sure that the user interface can be simulated in the absence of a real system is often a big win as well. You’ll be much happier if the designers can play with the UI while you’re building the rest of the system instead of dealing with the constant maintenance of the unfinished system because the designers need some feature right now that you know you don’t need until much later.

With all this decoupling going on, the biggest internal danger to watch out for in your system is layer proliferation. A favorite quote of mine, attributed to Van Jacobson, a networking researcher, is, “Layers are a good way to think about network protocols, but a poor way to implement them.” The same is true for most software. How many layers do you need? Enough to get the job done and not so many that they impact performance. Don’t like squishy answers like that? Tough, suck it up. It’s your system, and you’ll have to deal with this on your own, but if you don’t think about it at all, you will really hate yourself later. Either you will wind up with too few layers, and therefore nasty tendrils of code violating the layers and causing problems, or you’ll wind up on the other end of the spectrum where you might as well move the data on paper because it will be faster than waiting for it to crawl its way up from the murky depths of secondary storage only to see the light of day minutes later. Neither is a system you want to work on, trust me.

So, there you have it, KV’s quick list of how to avoid getting your chain yanked by the color selection brigade, separate the API from the UI, build a presentation layer that can be changed without any need to be a programmer, simulate the back end as soon as you can, and make sure you have the right number of layers.

KV

3.13 Secure Logging

The most secure code in the world is code which is never written.

Colin Percival

Logging systems are often the Achilles heel of security. A very common interaction when reviewing a system can be paraphrased in the following way:

kv: What kind of data does the system store?

suspect: Some personal information.

kv: [Raises eyebrows, lowers voice, speaks more slowly] Such as...?

suspect: Oh, you know, name, address, phone number, email...

kv: And where is this stored?

suspect: We store it in our database.

kv: [Lowers voice even further, to a dangerous whisper] Encrypted?

suspect: [Jumps as if bitten] Of course!

kv: [Voice returning to normal] Good, good. What about payment information?

suspect: We store that in a separate database, also encrypted.

kv: Great! Now, does the system log transactions for debugging and tracing down problems?

suspect: Of course!

kv: Please tell me what data is logged for each transaction.

suspect: We usually run with a high logging level so it’s easy to track down problems. At that level we log all the information for each transaction.

kv: Including the personal information?

suspect: [Starts to look worried.] Yes...

kv: [Asks in an off-hand manner] And the credit card info?

suspect: [Looks at shoes] Yes...

kv: [Takes off glasses, rubs bald head] In plain text.

suspect: Uh, yes.

This is not an exact transcript, because an exact transcript would have a lot more colorful metaphors right at the end.

The point of the dialogue, of course, is one we often come to when people think they are securing a system: they lock only the front door, but they leave the windows, cellar, and back doors wide open. Building a secure system doesn’t just mean following a set of guidelines or a run book; it means thinking about all the places where data is accessed and how that access is controlled. Logging systems are just a very common source of leaked data, but there are plenty more, for instance debugging interfaces, which are another common way to access a system and are frequently left open in shipping systems.

With the advent of cheap, easy-to-use cryptographic filesystems, the storage of such log data should now be less of a problem, but it is still rare to see these used in practice.

Building a secure logging system is only one part of building a secure system, but as logging systems are so common, let’s start there.

Dear KV,

I’ve been stuck with writing the logging system for a new payment processing system at work. As you might imagine, this requires logging a lot of data because we have to be able to reconcile the data in our logs with our customers and other users, such as credit card companies, at the end of each billing cycle, and also if there is any argument over the bill itself. I’ve been given the job for two reasons, which are that I’m the newest person in the group and because no one thinks writing yet another logging system is very interesting. I’ve also not gotten a lot of help from the other people on the team who claim to have, “written far more logging systems in their time than they want to think about.” Any advice on doing up a proper logging system?

Logged Out

Dear Logged Out,

If so many of your teammates have written logging systems before, how come you’re not using those? Perhaps your teammates are lying to you and have never written a single line of a logging system, or perhaps, and I suspect this is probably more likely, they tried and their systems sucked rocks. Or perhaps I’m just cynical.

It turns out that writing a good logging system, like writing any good piece of software, is both difficult and rare. Many of your decisions are going to depend on the requirements put on the data you’re logging, and since you’re logging financial transactions, you have a lot of requirements, which must include the ability to keep the data private, to audit the log for errors, and to verify that the data contained in the log has not been tampered with.

Data privacy is now a big deal in our industry. It’s too bad that it wasn’t a big enough deal to the companies made famous in the last few years for breaching private data, such as ChoicePoint, Bank of America, Wells Fargo, and Ernst & Young, but they’re all smarting for it now. Personal data breaches are now such a big problem that several governments have enacted strong legislation to punish offenders, and I think you’d like to avoid such punishment; I know I would.

The best way to keep data private is not to store it at all. Storing data makes it possible to breach it, which seems obvious, but then again every time I think something is obvious I wind up reading a news item that tells me, no, not obvious enough. Only keep whatever data you need to back up whatever claim you need to make, and don’t keep data for too long. Most financial institutions have limits on how long they’ll keep data; follow the relevant ones for your product to the letter, and don’t keep anything a second longer than you need it.

Once you’ve winnowed down the list of things you actually need to keep in the log, decide which ones can be blinded, which ones must be encrypted, and which can be left in the open. Blinding data means destroying the original value, but in a way that preserves its uniqueness. A hash function is a great way to do this. Given any input, a good hash function produces effectively unique, seemingly random output. Consider the following example using the md5 program on my Mac:

? md5 -s "1234 5678 9012 3456"
MD5 ("1234 5678 9012 3456") = d135e2aaf43ba5f98c2378236b8d01d8
? md5 -s "1234 5678 9012 3457"
MD5 ("1234 5678 9012 3457") = 0c617735776f122a95e88b49f170f5bf

Given two strings that look like fake credit card numbers and differ in only a single digit, the md5 program produces what look like two unrelated random numbers. If you can find a pattern in these, please contact your local member of MI6 or equivalent, as they have a job for you in their signals department.

Not only are these two numbers seemingly random, but they are also, for all practical purposes, unique, which means they make a fine primary key for use in your data logging. Each log entry with these numbers uniquely identifies the credit card, but someone reading the log cannot figure out the original credit card number from the hash. Blinding can be used on all kinds of data, but it’s definitely good to use it on anything that, if stolen or compromised, could be used by others.
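A minimal Python sketch of the same idea. One caveat worth coding around: credit card numbers have so little entropy that a bare hash can be brute-forced by trying every possible number, so this sketch uses a keyed hash (HMAC) with a secret "pepper." The pepper value and the blind helper are my own illustration, not part of any real system:

```python
import hmac
import hashlib

# Secret "pepper" known only to the logging system. In practice this
# would come from a key store, never from the source (illustrative value).
PEPPER = b"replace-with-a-real-secret-key"

def blind(value: str) -> str:
    """Blind a sensitive value: the same input always yields the same
    output, so it still works as a primary key, but the original cannot
    be recovered or brute-forced without the pepper."""
    return hmac.new(PEPPER, value.encode(), hashlib.sha256).hexdigest()

# Two card-like strings differing in one digit blind to unrelated
# values, yet each blinds consistently.
a = blind("1234 5678 9012 3456")
b = blind("1234 5678 9012 3457")
assert a != b
assert a == blind("1234 5678 9012 3456")
```

Because the output is deterministic for a given pepper, every log entry for the same card carries the same blinded key, which is exactly what reconciliation needs.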

If there is data that you absolutely must be able to use again in its original form, that is, it cannot be blinded, then it’s time to start encrypting, at least if that data is valuable in any way. I am amazed at the number of people who go to great lengths to properly encrypt data in their databases and live systems but then just chuck it all, unceremoniously, in plain form, into the logs. I guess I should stop being amazed, but it’s preferable to banging my head on the desk, wall, floor, or engineer in question.

What kind of data might need to be kept secret in your logs? An exhaustive list isn’t possible, but certainly personal details like the person’s full name, address, phone number, mobile number, and email address are a good start. While you’re at it, the amounts paid, locations of payment, and other payment specifics should also be kept secret as they make your logs a juicy target for people trying to dig up financial data on your company. You might be asking, “Well, what’s left?” and I have to say that in a financial system, probably not a lot, but I’m sure there is data around for debugging purposes that might be OK to go into the log in plain form. For instance, the time the entry was made is probably not going to be secret.

Now that you’ve eliminated all extraneous data, blinded what you could, and likely encrypted most of the rest, you have to make sure that the log itself is secure from tampering. You will need to do two things to prevent tampering: sign the entries and sign the log, each with a different key. The entries are signed to ensure their validity, and the whole log is signed to make sure that no one has added or removed entries by hand. The reason two different keys are used is that two different people should hold those keys, thereby requiring collusion to violate the security of the system. It’s also a good idea to change your keys regularly, so that if a key is stolen, the amount of data exposed is minimized.
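A sketch of the two-key scheme in Python, using HMACs to stand in for real signatures. A production system would use proper asymmetric signatures and a key store; the key values and entry format here are assumptions for illustration only:

```python
import hmac
import hashlib

# Two independent keys held by two different people (illustrative
# values): one signs individual entries, the other signs the whole log.
ENTRY_KEY = b"entry-signing-key"
LOG_KEY = b"log-signing-key"

def sign(key: bytes, data: bytes) -> str:
    """Compute a MAC over the data with the given key."""
    return hmac.new(key, data, hashlib.sha256).hexdigest()

def append_entry(log, message):
    """Append a message along with its own signature."""
    log.append(f"{message}|{sign(ENTRY_KEY, message.encode())}")

def sign_log(log):
    """Signature over all entries; detects added or removed lines."""
    return sign(LOG_KEY, "\n".join(log).encode())

log = []
append_entry(log, "2024-01-01T00:00:00Z txn=d135e2aa amount=<enc>")
append_entry(log, "2024-01-01T00:00:05Z txn=0c617735 amount=<enc>")
log_sig = sign_log(log)

# Tampering with any entry, or dropping one, invalidates log_sig.
assert sign_log(log) == log_sig
assert sign_log(log[:1]) != log_sig
```

Altering a single entry fails the per-entry check, while deleting or reordering entries fails the whole-log check, and each failure points at a different key holder.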

There are many other things to touch upon with a logging system, such as where the data is stored, how it may or may not be moved across a network, when the logs need rotating, and how to write tools to analyze and read the log, but what I’ve given here are the basics of making a logging system that I hope doesn’t suck and doesn’t make it trivial to violate the privacy of your users and land your company on the front page of the news. Oh, and one last piece of advice, don’t leave the logs on a laptop in your car. Obvious? Sure, it’s obvious.

KV

3.14 Java

If Java had true garbage collection, most programs would delete themselves upon execution.

Robert Sewell

I have been lucky enough in the years between receiving this letter and writing the response to have, mostly, avoided the scourge that is Java. You will find that in these pages I often extol the idea that every language has a domain for which it is best suited, but that liberal ideology doesn’t mean I don’t have opinions on languages when I see the damage they have wrought. It is nearly laughable to me that the original impetus for Java was embedded systems, where it was thought to be “Write once, run anywhere,” but as someone who works with embedded systems daily, I can say that I very rarely come across Java. Having escaped its creators’ original intentions, Java went everywhere...servers, browsers, and, currently, applications running on the Android system in phones. The use of Java in Android now means that there is a generation of koders who, in order to write their applications, have been forced to use this language, one that often results in people writing code as described in the following letter and response.

The only thing worse than the fact that Java is part of Android is that during the late 1990s and early 2000s many undergraduates were taught Java as their introduction to programming, a mistake of truly mind-bending proportions, for while Java contained all that was supposed to be good in software engineering at that time (classes, objects, methods, etc., etc.), it did not have a way to gently introduce people to these concepts. The moment you touched anything in a Java system, you got all the concepts rained down upon you like hellfire. As teaching languages go, Java is the last one I would foist upon new students, but many departments felt that industry wanted programmers trained in the latest language, rather than having them learn to program intelligently so that they could program in any language.

As you can see from my comments my love of Java has only grown since KV was first asked, and replied, about this grand creation that hopefully someday will go the way of the dodo. People remember that the dodo went extinct, but they don’t always remember that that extinction was brought about by conscious action by the hand of man.

Dear KV,

You mention Java in passing.2 I’ve done a day’s intro and got a book on it but never had to produce any serious code in it.

2. George V. Neville-Neil: Gettin’ Your Kode On, in: ACM Queue, Feb. 2006.

In an admin role, I’ve been up close and personal with a number of Java server projects, and they seem to share a number of problems:

• Performance. About 10 times slower than C.

• Complexity. Seems very large and obtuse.

• Slow to code. Always running behind.

My observation is that Java requires a GUI/IDE to code effectively, because ordinary mortals can’t keep all the classes and their APIs in their close cognitive span (at their fingertips). Compare this to Perl, which doesn’t have a decent IDE available. I’d suggest this is not because people of talent or interest aren’t available, but because there is no driving need for one. Perl is simple to get into and efficient to code with a text editor. And it’s not a language that’s suitable for large projects. Maybe Perl 6 will be. ;-)

Is there any data that says that Java projects are any more or less successful than older languages? It’s got heavy commercial support and lots of press and has noble aims of helping programmers and reducing certain classes of faults... But as professional programmers, we use sharp tools, and they are dangerous for exactly the reasons they are useful. Trying to protect everyone from “level 1” programmer errors seems very limited to me.

I keep seeing projects to replace “legacy” apps start amid fanfare and hoopla—and significant budgets—using “the most modern techniques” that end up being cancelled or only partially implemented.

Am I missing something??

Run Down With Java

Dear Run Down,

Having taken a course on Java and read a book on it, you’re actually ahead of old KV on the Java wave. I’m still hacking C, Python, and bits of PHP for the most part. Given your comments, perhaps I’m lucky, but somehow I doubt that; I’m rarely lucky.

In a way I could almost reprint your letter without comment, but I think there are larger issues that you raise, and I really can’t let these things go without commenting, or, perhaps, screaming and tearing my hair out. It turns out that shaving my head has helped with all those bald patches I got from tearing my hair out.

As a reader of KV you’ve probably already realized that I rarely bash languages or make comparisons between them, and I’m going to stick to my guns on that, even in this response. I don’t believe the majority of the problems you’re seeing come from Java itself, but how it is used, and also the way in which the software industry works at this point in time.

The closest I’ve come to Java was to work on a project to build some lower-level code in C that would be managed by a Java application. There were two teams, one that wrote the systems in C, which could operate independently of the Java management application, and one that wrote in Java. Now, you would expect that the Java team and the C team would have met on a regular basis and that they would have exchanged data and design documents so that the most effective set of APIs could be built in order to efficiently manage the lower-level code. Well, you would be wrong. The teams worked nearly independently, and most of the interactions were disastrous. There were many reasons for this, some of which were traditional management problems, but the real reason for this “failure to communicate” was that the two teams were on two different worlds, and no one wanted to string a phone line between them.

The Java team members were all into abstraction; their APIs were beautiful creations of sugar and syntax that scintillated in the sunshine, moving everyone to gaze in wonder. The problem was that they didn’t understand the underlying code they were interacting with, other than to know what the data types and structure layouts were. They did not have a deep appreciation of what their management application, so called, was supposed to manage. They made grand assumptions, often wrong, and when they ran their code, it was slow, buggy, and crashed a lot.

The C team members weren’t all perfect either. There was a certain level of arrogance, shocking I know, toward the Java team, and although information wasn’t hidden, it was certainly the case that if a C engineer thought one of the Java engineers didn’t “get it,” they would just throw up their hands and walk away. The C team did produce working code that shipped, worked well, and didn’t fall over. The problem was that the goal of the company was to build an integrated set of products that could be managed by a single application. Although the C team won the battle, the company lost the war.

Someone looking at the code as it was delivered might have thought, “Well, the Java programmers just weren’t up to the task, next time hire better programmers, or get better tools or...” The fact is that that’s not the real problem here. The problem wasn’t Java; it was the fact that the people building the system could produce a lot of lines of code but didn’t understand what they were building.

I have seen this problem many times. It often seems that projects are planned like some line from an old Judy Garland/Mickey Rooney musical. One character says to the other, “Hey, kids, let’s put on a show!” It always works in the movies, but as a project plan it rarely leads to people living happily ever after.

In order to build something complex you have to understand what you’re building. The legacy applications you mention are another great example. Ever seen a company convert a legacy app? I hope not; it’s not much fun. The way legacy conversion goes is that you have a program that works. It does something. You may have the source code or you may not. No laughing now, I’ve seen this. When the legacy program runs, it does what it should, most of the time. Then the team comes in, tries to dissect what the program does, and then tries to reproduce it, with bug-for-bug compatibility, only to find that their modern techniques don’t reproduce the same bugs in the right way. They get to a point where the program sort of works, or sort of doesn’t, and then usually give up and re-implement whatever it was from scratch.

One of the reasons such travesties can continue to occur is that, unlike in any engineering discipline in the real world (think aeronautics or civil engineering), failure just means a loss of money. Now, when I say “just,” that can be a big just. The overhaul of the US Internal Revenue Service computer systems cost millions in overruns, as did the system developed for the Department of Motor Vehicles in California. There is a laundry list of such failed projects to choose from. These may make headlines for a while, but they’re not quite on the level of a bridge failing, like the Tacoma Narrows, or the Space Shuttle exploding, twice. People generally remember where they were when the space shuttle Challenger blew up, but they don’t remember where they were when they heard about an IRS computer cost overrun.

With more and more computers and software being put into mission-critical systems, perhaps this attitude will change with time. Unfortunately, it’s going to require a few more spectacular failures, likely with a human instead of monetary cost attached, before people put more time into planning what they do and figuring out what their code is actually meant to be doing. Once we do that, then the fact that we’re using Java or Perl or the language du jour will have a lot less effect and probably be discussed a lot less as well.

KV

3.15 Secure P2P

When you pirate MP3s, you’re downloading COMMUNISM.

Anon

The following piece truly takes me down memory lane, back to when peer-to-peer (P2P) systems were looked upon the same way blockchain is today. Those were the days of Napster and the exchange of material, copyrighted and otherwise, from one computer to another without much reliance on central servers. Today there are plenty of systems that operate in a peer-to-peer fashion, and their legality and legitimacy are unquestioned. The fact is that peer-to-peer systems continue to be used to exchange data that annoys lawyers with mouse-ear hats, but those uses have now been overtaken by other applications that are considered more respectable, or at least are backed by VC money.

As is often said by koders when we get, inevitably, stuck talking about topics that stray into law, IANAL (I Am Not A Lawyer), and the response here quickly morphs into a discussion of secure file sharing, which might even be the start of some not unreasonable ideas on securing distributed systems, since that’s all, from a technical standpoint, a P2P system is. These are systems that do not, as much as possible, depend on a centralized infrastructure to achieve their goal but instead spread the work among a set of clients, where work is done by each according to its ability, and service is provided to each according to its needs.

Dear KV,

I’ve just started on a project working with P2P software, and I have a few questions. Now, I know what you’re thinking, and no, this isn’t some copyright-violating piece of cowboy code, it’s a respectable corporate application for people to use to exchange data like documents, drawings, and work-related information.

My biggest issue with this project is security, like accidentally exposing our users’ data or leaving them open to viruses. There must be more things to worry about, but those are my top two.

So, I want to ask, “What would KV do?”

Unclear Peer

Dear UP,

KV would run, not walk, to the nearest bar and find a lawyer. You can always find lawyers in bars, or at least I do; they’re the only ones drinking faster than me.

OK, so let’s assume your company has lawyers to protect it from the usual charges of providing a system whereby people can exchange material that certain other people, who also have lawyers, consider wrong to exchange. What else is there to worry about? Plenty.

At the crux of all file sharing systems, whether they are peer to peer, client/server, or what have you, is the publish/subscribe paradigm they follow, because the publish/subscribe model defines how users share data.

The models follow a spectrum from low risk to high risk. An example of a high-risk model is one in which the application attempts to share as much data as possible, such as sharing all the data on your disk with everyone as the default setting. Laugh if you like, but you’ll cry when you find out that lots of companies have built just such systems, or systems close to being that permissive. Here are some suggestions for building a low-risk peer-to-peer file sharing system.

First of all, the default mode of all such software should be to deny access. Immediately after installing the software, no files should be available to anyone. There are several cases of software that did not obey this simple rule, and so when a nefarious person wanted to steal data from someone, they would trick the victim into downloading and installing the file sharing software; this is often referred to as a “drive-by install.” The attacker would then have free access to the person’s computer, or at least to their MyDocuments or similar folder.

Second, the person sharing the files, that is, the sharer, should have the most control over the data. The person connecting to the sharer’s computer should only be able to see and copy the files that the sharer wishes them to see and copy. In a reasonably low-risk system, the sharing of data would have a timeout such that unless the requester retrieved the data within some period, say, 24 hours, it would no longer be available. Such timeouts can be implemented by having the sharer’s computer generate a one-time-use token containing a timeout that the requester’s computer must present to get a particular file.
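One way to sketch such an expiring token in Python, using an HMAC so the requester cannot forge a token or extend its deadline. The key, filenames, and token format are assumptions for illustration; true one-time-use semantics would additionally require the sharer to record tokens already redeemed:

```python
import hmac
import hashlib
import time

SHARER_KEY = b"sharer-secret-key"  # known only to the sharer's machine

def make_token(filename: str, lifetime_s: int = 24 * 3600) -> str:
    """Issue a token granting access to one file until the deadline."""
    deadline = int(time.time()) + lifetime_s
    msg = f"{filename}|{deadline}".encode()
    mac = hmac.new(SHARER_KEY, msg, hashlib.sha256).hexdigest()
    return f"{filename}|{deadline}|{mac}"

def check_token(token: str, filename: str) -> bool:
    """Accept only if the token names this file, is unexpired, and
    carries a valid MAC, so the requester cannot alter any field."""
    try:
        name, deadline, mac = token.rsplit("|", 2)
    except ValueError:
        return False
    msg = f"{name}|{deadline}".encode()
    good = hmac.new(SHARER_KEY, msg, hashlib.sha256).hexdigest()
    return (hmac.compare_digest(mac, good)
            and name == filename
            and time.time() < int(deadline))

tok = make_token("drawings/floorplan.pdf")
assert check_token(tok, "drawings/floorplan.pdf")
assert not check_token(tok, "secrets/payroll.xls")
```

Because the deadline is bound into the MAC, expiry needs no bookkeeping on the sharer’s side beyond keeping the key secret and rotating it periodically.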

Third, the system should be slow to open up access. Although we don’t want the user to have to say “OK” to everything, because eventually they will just click OK without thinking, you do want a system that requires user intervention to give more access.

Fourth, files should not be stored in a known default location or one that is easily guessable. Sharing a well-known folder, like MyDocuments, has gotten plenty of people into trouble. The best way to store downloaded files and shared files is to have the file sharing application create and track randomly named folders beneath a well-known location in the filesystem. Choosing a reasonably sized random string of letters and digits as a directory name is a good practice. Why bother with hard-to-guess folder names? Because it makes it harder for virus and malware writers to know where to go to steal important information. If the filename and path are well known, it’s far easier for someone to steal data from the machine.

Fifth, and last for this particular letter, the sharing should be one to one, and not one to many. There are many systems that share data one to many, including most file swapping applications, such that anyone who can find your machine can get at the data you are willing to share. Global sharing should be the last option a user has, and not the first. The first option should be to a single person, the second to a group of people, and the last, global.
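The fourth suggestion, randomly named folders beneath a well-known location, takes only a few lines of Python to sketch; the base-directory layout here is an assumption for illustration:

```python
import secrets
import tempfile
from pathlib import Path

def make_share_folder(base: Path) -> Path:
    """Create a randomly named folder under a well-known base so the
    full path cannot be guessed by malware scanning for shared files."""
    name = secrets.token_urlsafe(16)  # roughly 128 bits of randomness
    folder = base / name
    folder.mkdir(parents=True, exist_ok=False)
    return folder

# Demonstration under a throwaway base directory; a real application
# would record the mapping from shared files to these folders itself.
with tempfile.TemporaryDirectory() as base:
    a = make_share_folder(Path(base))
    b = make_share_folder(Path(base))
    assert a != b and a.exists() and b.exists()
```

The `secrets` module, rather than `random`, matters here: the whole point is that the names be unpredictable to an attacker, not merely varied.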

You may note that a lot of this advice is in direct conflict with some of the more famous file sharing, peer-to-peer systems created in the last few years. The reason is that I have been trying to show you a system that allows for data protection while data is being shared. If you want to create an application that is as open, and as dangerous, as Napster or its errant children were and are, then that’s a different story, but it sounded from your letter as if that was not what you wanted.

Other things you will have to worry about include the security of the application itself. A program designed to take files from other computers is a perfect vector for attacks by virus writers. It would be unwise, well, actually, it would be incredibly stupid, to write such a program so that it executes or displays files immediately after transfer without asking the user first. I have to admit that answering “Yes” to “Would you like to run this .exe file?” on Windows is about the same as saying, “Would you like me to pull the trigger?” in a game of Russian roulette.

Another open research area, er, I mean, big headache, which I’ll not get into here, is the authentication system itself. Outside of all the other advice I just gave this problem is itself quite thorny. How do I know that you are you? How do you know that I am me? Perhaps I’m the Walrus...

KV
