Code's Worst Enemy
I'm a programmer, and I'm on vacation today. Guess what I'm doing? As much as I'd love to tell you I'm sipping Mai Tais in the Bahamas, what I'm actually doing on my vacation is programming.
So it's a "vacation" only in the HR sense – I'm taking official time off work, to give myself some free time to get my computer game back online. It's a game I started writing about ten years ago, and spent about seven years developing. It's been offline for a while and I need to bring it back up, in part so the players will stop stalking me. It's going to take me at least a week of all-day days, so I had to take a vacation from work to make it happen.
Why did my game go offline? Not for want of popularity. It's a pretty successful game for a mostly part-time effort from one person. I've had over a quarter million individuals try it out (at least getting as far as creating a character), and tens of thousands of people who've spent countless hours playing it over the years. It's won awards and been featured in magazines; it's attracted the attention of game portals, potential investors, and whole schools full of kids.
Yup, kids. It was supposed to be a game for college students, but it's been surprisingly popular with teenagers and even pre-teens, who you'd think would be off playing some 3D console game or other. But I wrote it for myself, and apparently there are sufficient people who like the same kinds of games I do to create a sustainable community.
I took the game down for all sorts of mundane reasons - it needed some upgrades, work got busy, I didn't have lots of time at nights, etc. But the mundane reasons all really boil down to just one rather deeper problem: the code base is too big for one person to manage.
I've spent nearly ten years of my life building something that's too big.
I've done a lot of thinking about this — more than you would probably guess. It's occupied a large part of my technical thinking for the past four or five years, and has helped shaped everything I've written in that time, both in blogs and in code.
For the rest of this little rant, I'm going to assume that you're a young, intelligent, college age or even high school age student interested in becoming a better programmer, perhaps even a great programmer.
(Please – don't think I'm implying that I'm a great programmer. Far from it. I'm a programmer who's committed decades of terrible coding atrocities, and in the process I've learned some lessons that I'm passing along to you in the hopes that it'll help you in your quest to become a great programmer.)
I have to make the assumption that you're young in order to make my point, because if I assume I'm talking to "experienced" programmers, my blood pressure will rise and I will not be able to focus for long enough to finish my rant. You'll see why in a bit.
Fortunately for me, you're young and eager to learn, so I can tell you how things really are. Just keep your eyes open for the next few years, and watch to see if I'm right.
I happen to hold a hard-won minority opinion about code bases. In particular I believe, quite staunchly I might add, that the worst thing that can happen to a code base is size.
I say "size" as a placeholder for a reasonably well-formed thought for which I seem to have no better word in my vocabulary. I'll have to talk around it until you can see what I mean, and perhaps provide me with a better word for it. The word "bloat" might be more accurate, since everyone knows that "bloat" is bad, but unfortunately most so-called experienced programmers do not know how to detect bloat, and they'll point at severely bloated code bases and claim they're skinny as a rail.
Good thing we're not talking to them, eh?
I say my opinion is hard-won because people don't really talk much about code base size; it's not widely recognized as a problem. In fact it's widely recognized as a non-problem. This means that anyone sharing my minority opinion is considered a borderline lunatic, since what rational person would rant against a non-problem?
People in the industry are very excited about various ideas that nominally help you deal with large code bases, such as IDEs that can manipulate code as "algebraic structures", and search indexes, and so on. These people tend to view code bases much the way construction workers view dirt: they want great big machines that can move the dirt this way and that. There's conservation of dirt at work: you can't compress dirt, not much, so their solution set consists of various ways of shoveling the dirt around. There are even programming interview questions, surely metaphorical, about how you might go about moving an entire mountain of dirt, one truck at a time.
Industry programmers are excited about solutions to a big non-problem. It's just a mountain of dirt, and you just need big tools to move it around. The tools are exciting but the dirt is not.
My minority opinion is that a mountain of code is the worst thing that can befall a person, a team, a company. I believe that code weight wrecks projects and companies, that it forces rewrites after a certain size, and that smart teams will do everything in their power to keep their code base from becoming a mountain. Tools or no tools. That's what I believe.
It turns out you have to have something bad happen to you before you can hold my minority opinion. The bad thing that happened to me is that I wrote a beautiful game in an ugly language, and the result was lovely on the outside and quite horrific internally. The average industry programmer today would not find much wrong with my code base, aside from the missing unit tests (which I now regret) that would, alas, double the size of my game's already massive 500,000-line code base. So the main thing they would find wrong with it is, viewed in a certain way, that it's not big enough. If I'd done things perfectly, according to today's fashions, I'd be even worse off than I am now.
Some people will surely miss my point, so I'll clarify: I think unit testing is great. In fact I think it's critical, and I vastly regret not having unit tests for my game. My point is that I wrote the game the way most experienced programmers would tell you to write that kind of system, and it's now an appallingly unmanageable code base. If I'd done the "right thing" with unit tests, it would be twice appalling! The apparent paradox here is crucial to understanding why I hold my minority belief about code base size.
Most programmers never have anything truly bad happen to them. In the rare cases when something bad happens, they usually don't notice it as a problem, any more than a construction worker notices dirt as a problem. There's just a certain amount of dirt at every site, and you have to deal with it: it's not "bad"; it's just a tactical challenge.
Many companies are faced with multiple million lines of code, and they view it as a simple tools issue, nothing more: lots of dirt that needs to be moved around occasionally.
Most people have never had to maintain a half-million line code base singlehandedly, so their view of things will probably be different from mine. Hopefully you, being the young, eager-to-learn individual that you are, will realize that the only people truly qualified to express opinions on this matter are those who have lived in (and helped create) truly massive code bases.
You may hear some howling in response to my little rant today, and a lot of hand-wavy "he just doesn't understand" dismissiveness. But I posit that the folks making these assertions have simply never been held accountable for the messes they've made.
When you write your own half-million-line code base, you can't dodge accountability. I have nobody to blame but myself, and it's given me a perspective that puts me in the minority.
It's not just from my game, either. That alone might not have taught me the lesson. In my twenty years in the industry, I have hurled myself forcibly against some of the biggest code bases you've ever imagined, and in doing so I've learned a few things that most people never learn, not in their whole career. I'm not asking you to make up your mind on the matter today. I just hope you'll keep your eyes and ears open as you code for the next few years.
I'm going to try to define bloat here. I know in advance that I'll fail, but hopefully just sketching out the problem will etch out some patterns for you.
There are some things that can go wrong with code bases that have a nice intuitive appeal to them, inasmuch as it's not difficult for most people to agree that they're "bad".
One such thing is complexity. Nobody likes a complex code base. One measure of complexity that people sometimes use is "cyclomatic complexity", which estimates the possible runtime paths through a given function using a simple static analysis of the code structure.
I'm pretty sure that I don't like complex code bases, but I'm not convinced that cyclomatic complexity measurements have helped. To get a good cyclomatic complexity score, you just need to break your code up into smaller functions. Breaking your code into smaller functions has been a staple of "good design" for at least ten years now, in no small part due to the book Refactoring by Martin Fowler.
The problem with Refactoring as applied to languages like Java, and this is really quite central to my thesis today, is that Refactoring makes the code base larger. I'd estimate that fewer than 5% of the standard refactorings supported by IDEs today make the code smaller. Refactoring is like cleaning your closet without being allowed to throw anything away. If you get a bigger closet, and put everything into nice labeled boxes, then your closet will unquestionably be more organized. But programmers tend to overlook the fact that spring cleaning works best when you're willing to throw away stuff you don't need.
This brings us to the second obviously-bad thing that can go wrong with code bases: copy and paste. It doesn't take very long for programmers to learn this lesson the hard way. It's not so much a rule you have to memorize as a scar you're going to get whether you like it or not. Computers make copy-and-paste really easy, so every programmer falls into the trap once in a while. The lesson you eventually learn is that code always changes, always always always, and as soon as you have to change the same thing in N places, where N is more than 1, you'll have earned your scar.
However, copy-and-paste is far more insidious than most scarred industry programmers ever suspect. The core problem is duplication, and unfortunately there are patterns of duplication that cannot be eradicated from Java code. These duplication patterns are everywhere in Java; they're ubiquitous, but Java programmers quickly lose the ability to see them at all.
Java programmers often wonder why Martin Fowler "left" Java to go to Ruby. Although I don't know Martin, I think it's safe to speculate that "something bad" happened to him while using Java. Amusingly (for everyone except perhaps Martin himself), I think that his "something bad" may well have been the act of creating the book Refactoring, which showed Java programmers how to make their closets bigger and more organized, while showing Martin that he really wanted more stuff in a nice, comfortable, closet-sized closet.
Martin, am I wrong?
As I predicted would happen, I haven't yet defined bloat except in the vaguest terms. Why is my game code base half a million lines of code? What is all that code doing?
The other seminal industry book in software design was Design Patterns, which left a mark the width of a two-by-four on the faces of every programmer in the world, assuming the world contains only Java and C++ programmers, which they often do.
Design Patterns was a mid-1990s book that provided twenty-three fancy new boxes for organizing your closet, plus an extensibility mechanism for defining new types of boxes. It was really great for those of us who were trying to organize jam-packed closets with almost no boxes, bags, shelves or drawers. All we had to do was remodel our houses to make the closets four times bigger, and suddenly we could make them as clean as a Nordstrom merchandise rack.
Interestingly, sales people didn't get excited about Design Patterns. Nor did PMs, nor marketing folks, nor even engineering managers. The only people who routinely get excited about Design Patterns are programmers, and only programmers who use certain languages. Perl programmers were, by and large, not very impressed with Design Patterns. However, Java programmers misattributed this; they concluded that Perl programmers must be slovenly, no good bachelors who pile laundry in their closests up to the ceiling.
It's obvious now, though, isn't it? A design pattern isn't a feature. A Factory isn't a feature, nor is a Delegate nor a Proxy nor a Bridge. They "enable" features in a very loose sense, by providing nice boxes to hold the features in. But boxes and bags and shelves take space. And design patterns – at least most of the patterns in the "Gang of Four" book – make code bases get bigger. Tragically, the only GoF pattern that can help code get smaller (Interpreter) is utterly ignored by programmers who otherwise have the names of Design Patterns tatooed on their various body parts.
Dependency Injection is an example of a popular new Java design pattern that programmers using Ruby, Python, Perl and JavaScript have probably never heard of. And if they've heard of it, they've probably (correctly) concluded that they don't need it. Dependency Injection is an amazingly elaborate infrastructure for making Java more dynamic in certain ways that are intrinsic to higher-level languages. And – you guessed it – DI makes your Java code base bigger.
Bigger is just something you have to live with in Java. Growth is a fact of life. Java is like a variant of the game of Tetris in which none of the pieces can fill gaps created by the other pieces, so all you can do is pile them up endlessly.
I recently had the opportunity to watch a self-professed Java programmer give a presentation in which one slide listed Problems (with his current Java system) and the next slide listed Requirements (for the wonderful new vaporware system). The #1 problem he listed was code size: his system has millions of lines of code.
Wow! I've sure seen that before, and I could really empathize with him. Geoworks had well over ten million lines of assembly code, and I'm of the opinion that this helped bankrupt them (although that also appears to be a minority opinion – those industry programmers just never learn!) And I worked at Amazon for seven years; they have well over a hundred million lines of code in various languages, and "complexity" is frequently cited internally as their worst technical problem.
So I was really glad to see that this guy had listed code size as his #1 problem.
Then I got my surprise. He went on to his Requirements slide, on which he listed "must scale to millions of lines of code" as a requirement. Everyone in the room except me just nodded and accepted this requirement. I was floored.
Why on earth would you list your #1 problem as a requirement for the new system? I mean, when you're spelling out requirements, generally you try to solve problems rather than assume they're going to be created again. So I stopped the speaker and asked him what the heck he was thinking.
His answer was: well, his system has lots of features, and more features means more code, so millions of lines are Simply Inevitable. "It's not that Java is verbose!" he added – which is pretty funny, all things considered, since I hadn't said anything about Java or verbosity in my question.
The thing is, if you're just staring in shock at this story and thinking "how could that Java guy be so blind", you are officially a minority in the programming world. An unwelcome one, at that.
Most programmers have successfully compartmentalized their beliefs about code base size. Java programmers are unusually severe offenders but are by no means the only ones. In one compartment, they know big code bases are bad. It only takes grade-school arithmetic to appreciate just how bad they can be. If you have a million lines of code, at 50 lines per "page", that's 20,000 pages of code. How long would it take you to read a 20,000-page instruction manual? The effort to simply browse the code base and try to discern its overall structure could take weeks or even months, depending on its density. Significant architectural changes could take months or even years.
In the other compartment, they think their IDE makes the code size a non-issue. We'll get to that shortly.
And a million lines is nothing, really. Most companies would love to have merely a million lines of code. Often a single team can wind up with that much after a couple years of hacking. Big companies these days are pushing tens to hundreds of millions of lines around.
I'll give you the capsule synopsis, the one-sentence summary of the learnings I had from the Bad Thing that happened to me while writing my game in Java: if you begin with the assumption that you need to shrink your code base, you will eventually be forced to conclude that you cannot continue to use Java. Conversely, if you begin with the assumption that you must use Java, then you will eventually be forced to conclude that you will have millions of lines of code.
Is it worth the trade-off? Java programmers will tell you Yes, it's worth it. By doing so they're tacitly nodding to their little compartment that realizes big code bases are bad, so you've at least won that battle.
But you should take anything a "Java programmer" tells you with a hefty grain of salt, because an "X programmer", for any value of X, is a weak player. You have to cross-train to be a decent athlete these days. Programmers need to be fluent in multiple languages with fundamentally different "character" before they can make truly informed design decisions.
Recently I've been finding that Java is an especially bad value for X. If you absolutely must hire an X programmer, make sure it's Y.
I didn't really set out to focus this rant on Java (and Java clones like C#, which despite now being a "better" language still has Java's fundamental character, making it only a marginal win at best.) To be sure, my minority opinion applies to any code base in any language. Bloat is bad.
But I find myself focusing on Java because I have this enormous elephant of a code base that I'm trying to revive this week. Can you blame me? Hopefully someone with a pet C++ elephant can come along and jump on the minority bandwagon with me. For now, though, I'll try to finish my explanation of bloat as a bona-fide problem using Java for context.
The Java community believes, with near 100% Belief Compliance, that modern IDEs make code base size a non-issue. End of story.
There are several problems with this perspective. One is simple arithmetic again: given enough code, you eventually run out of machine resources for managing the code. Imagine a project with a billion lines of code, and then imagine trying to use Eclipse or IntelliJ on that project. The machines – CPU, memory, disk, network – would simply give up. We know this because twenty-million line code bases are already moving beyond the grasp of modern IDEs on modern machines.
Heck, I've never managed to get Eclipse to pull in and index even my 500,000-line code base, and I've spent weeks trying. It just falls over, paralyzed. It literally hangs forever (I can leave it overnight and it makes no progress.) Twenty million lines? Forget about it.
It may be possible to mitigate the problem by moving the code base management off the local machine and onto server clusters. But the core problem is really more cultural than technical: as long as IDE users refuse to admit there is a problem, it's not going to get solved.
Going back to our crazed Tetris game, imagine that you have a tool that lets you manage huge Tetris screens that are hundreds of stories high. In this scenario, stacking the pieces isn't a problem, so there's no need to be able to eliminate pieces. This is the cultural problem: they don't realize they're not actually playing the right game anymore.
The second difficulty with the IDE perspective is that Java-style IDEs intrinsically create a circular problem. The circularity stems from the nature of programming languages: the "game piece" shapes are determined by the language's static type system. Java's game pieces don't permit code elimination because Java's static type system doesn't have any compression facilities – no macros, no lambdas, no declarative data structures, no templates, nothing that would permit the removal of the copy-and-paste duplication patterns that Java programmers think of as "inevitable boilerplate", but which are in fact easily factored out in dynamic languages.
Completing the circle, dynamic features make it more difficult for IDEs to work their static code-base-management magic. IDEs don't work as well with dynamic code features, so IDEs are responsible for encouraging the use of languages that require... IDEs. Ouch.
Java programmers understand this at some level; for instance, Java's popular reflection facility, which allows you to construct method names on the fly and invoke those methods by name, defeats an IDE's ability to perform basic refactorings such as Rename Method. But because of successful compartmentalization, Java folks point at dynamic languages and howl that (some) automated refactorings aren't possible, when in fact they're just as possible in these languages as they are in Java – which is to say, they're partly possible. The refactorings will "miss" to the extent that you're using dynamic facilities, whether you're writing in Java or any other language. Refactorings are essentially never 100% effective, especially as the code base is shipped offsite with public APIs: this is precisely why Java has a deprecation facility. You can't rename a method on everyone's machine in the world. But Java folks continue spouting the provably false belief that automated refactorings work on "all" their code.
I'll bet that by now you're just as glad as I am that we're not talking to Java programmers right now! Now that I've demonstrated one way (of many) in which they're utterly irrational, it should be pretty clear that their response isn't likely to be a rational one.
The rational response would be to take a very big step back, put all development on hold, and ask a difficult question: "what should I be using instead of Java?"
I did that about four years ago. That's when I stopped working on my game, putting it into maintenance mode. I wanted to rewrite it down to, say, 100,000 to 150,000 lines, somewhere in that vicinity, with the exact same functionality.
It took me six months to realize it can't be done with Java, not even with the stuff they added to Java 5, and not even with the stuff they're planning for Java 7 (even if they add the cool stuff, like non-broken closures, that the Java community is resisting tooth and nail.)
It can't be done with Java. But I do have a big investment in the Java virtual machine, for basically the same reason that Microsoft now has a big investment in .NET. Virtual machines make sense to me now. I mean, they "made sense" at some superficial level when I read the marketing brochures, but now that I've written a few interpreters and have dug into native-code compilers, they make a lot more sense. It's another rant as to why, unfortunately.
So taking for granted today that VMs are "good", and acknowledging that my game is pretty heavily tied to the JVM – not just for the extensive libraries and monitoring tools, but also for more subtle architectural decisions like the threading and memory models – the rational answer to code bloat is to use another JVM language.
One nice thing about JVM languages is that Java programmers can learn them pretty fast, because you get all the libraries, monitoring tools and architectural decisions for free. The downside is that most Java programmers are X programmers, and, as I said, you don't want X programmers on your team.
But since you're not one of those people who've decided to wear bell-bottom polyester pants until the day you die, even should you live unto five hundred years, you're open to language suggestions. Good for you!
Three years ago, I set out to figure out which JVM language would be the best code-compressing successor to Java. That took a lot longer than I expected, and the answer was far less satisfactory than I'd anticipated. Even now, three years later, the answer is still a year or two away from being really compelling.
I'm patient now, though, so after all the dust settles, I know I've got approximately a two-year window during which today's die-hard Java programmers are writing their next multi-million line disaster. Right about the time they're putting together their next Problems/Requirements slide, I think I'll actually have an answer for them.
In the meantime, I'm hoping that I'll have found time to rewrite my game in this language, down from 500,000 lines to 150,000 lines with the exact same functionality (plus at least another 50k+ for unit tests.)
So what JVM language is going to be the Next Java?
Well, if you're going for pure code compression, you really want a Lisp dialect: Common Lisp or Scheme. And there are some very good JVM implementations out there. I've used them. Unfortunately, a JVM language has to be a drop-in replacement for Java (otherwise a port is going to be a real logistics problem), and none of the Lisp/Scheme implementors seems to have that very high on their priority list.
Plus everyone will spit on you. People who don't habitually spit will expectorate up to thirty feet, like zoo camels, in order to bespittle you if you even suggest the possibility of using a Lisp or Scheme dialect at your company.
So it's not gonna be Lisp or Scheme. We'll have to sacrifice some compression for something a bit more syntactically mainstream.
It could theoretically be Perl 6, provided the Parrot folks ever actually get their stuff working, but they're even more patient than I am, if you get my drift. Perl 6 really is a pretty nice language design, for the record – I was really infatuated with it back in 2001. The love affair died about five years ago, though. And Perl 6 probably won't ever run on the JVM. It's too dependent on powerful Parrot features that the JVM will never support. (I'd venture that Parrot probably won't ever support them either, but that would be mean.)
Most likely New Java is going to be an already reasonably popular language with a very good port to the JVM. It'll be a language with a dedicated development team and a great marketing department.
That narrows the field from 200+ languages down to maybe three or four: JRuby, Groovy, Rhino (JavaScript), and maybe Jython if it comes out of its coma.
Each of these languages (as does Perl 6) provides mechanisms that would permit compression of a well-engineered 500,000-line Java code base by 50% to 75%. Exactly where the dart lands (between 50% and 75%) remains to be seen, but I'm going to try it myself.
I personally tried Groovy and found it to be an ugly language with a couple of decent ideas. It wants to be Ruby but lacks Ruby's elegance (or Python's for that matter). It's been around a long time and does not seem to be gaining any momentum, so I've ruled it out for my own work. (And I mean permanently – I will not look at it again. Groovy's implementation bugs have really burned me.)
I like Ruby and Python a lot, but neither JVM version was up to snuff when I did my evaluation three years ago. JRuby has had a lot of work done to it in the meantime. If the people I work with weren't so dead-set against Ruby, I'd probably go with that, and hope like hell that the implementation is eventually "fast enough" relative to Java.
As it happens, though, I've settled on Rhino. I'll be working with the Rhino dev team to help bring it up to spec with EcmaScript Edition 4. I believe that ES4 brings JavaScript to rough parity with Ruby and Python in terms of (a) expressiveness and (b) the ability to structure and manage larger code bases. Anything it lacks in sugar, it more than makes up for with its optional type annotations. And I think JavaScript (especially on ES4 steroids) is an easier sell than Ruby or Python to people who like curly braces, which is anyone currently using C++, Java, C#, JavaScript or Perl. That's a whooole lot of curly brace lovers. I'm nothing if not practical these days.
I don't expect today's little rant to convince anyone to share my minority opinion about code base size. I know a that few key folks (Bill Gates, for instance, as well as Dave Thomas, Martin Fowler and James Duncan Davidson) have independently reached the same conclusion: namely, that bloat is the worst thing that can happen to code. But they all got there via painful things happening to them.
I can't exactly wish for something painful to happen to Java developers, since hey, it's already happening; they've already taught themselves to pretend it's not hurting them.
But as for you, the eager young high school or college student who wants to become a great programmer someday, hopefully I've given you an extra dimension to observe as your tend your code gardens for the next few years.
When you're ready to make the switch, well, Mozilla Rhino will be ready for you. It works great today and will be absolutely outstanding a year from now. And I sincerely hope that JRuby, Jython and friends will also be viable Java alternatives for you as well. You might even try them out now and see how it goes.
Your code base will thank you for it.
So it's a "vacation" only in the HR sense – I'm taking official time off work, to give myself some free time to get my computer game back online. It's a game I started writing about ten years ago, and spent about seven years developing. It's been offline for a while and I need to bring it back up, in part so the players will stop stalking me. It's going to take me at least a week of all-day days, so I had to take a vacation from work to make it happen.
Why did my game go offline? Not for want of popularity. It's a pretty successful game for a mostly part-time effort from one person. I've had over a quarter million individuals try it out (at least getting as far as creating a character), and tens of thousands of people who've spent countless hours playing it over the years. It's won awards and been featured in magazines; it's attracted the attention of game portals, potential investors, and whole schools full of kids.
Yup, kids. It was supposed to be a game for college students, but it's been surprisingly popular with teenagers and even pre-teens, who you'd think would be off playing some 3D console game or other. But I wrote it for myself, and apparently there are sufficient people who like the same kinds of games I do to create a sustainable community.
I took the game down for all sorts of mundane reasons – it needed some upgrades, work got busy, I didn't have much time at night, and so on. But the mundane reasons all really boil down to a single, deeper problem: the code base is too big for one person to manage.
I've spent nearly ten years of my life building something that's too big.
I've done a lot of thinking about this – more than you would probably guess. It's occupied a large part of my technical thinking for the past four or five years, and it has helped shape everything I've written in that time, both in blogs and in code.
For the rest of this little rant, I'm going to assume that you're a young, intelligent, college age or even high school age student interested in becoming a better programmer, perhaps even a great programmer.
(Please – don't think I'm implying that I'm a great programmer. Far from it. I'm a programmer who's committed decades of terrible coding atrocities, and in the process I've learned some lessons that I'm passing along to you in the hopes that it'll help you in your quest to become a great programmer.)
I have to make the assumption that you're young in order to make my point, because if I assume I'm talking to "experienced" programmers, my blood pressure will rise and I will not be able to focus for long enough to finish my rant. You'll see why in a bit.
Fortunately for me, you're young and eager to learn, so I can tell you how things really are. Just keep your eyes open for the next few years, and watch to see if I'm right.
Minority View
I happen to hold a hard-won minority opinion about code bases. In particular I believe, quite staunchly I might add, that the worst thing that can happen to a code base is size.
I say "size" as a placeholder for a reasonably well-formed thought for which I seem to have no better word in my vocabulary. I'll have to talk around it until you can see what I mean, and perhaps provide me with a better word for it. The word "bloat" might be more accurate, since everyone knows that "bloat" is bad, but unfortunately most so-called experienced programmers do not know how to detect bloat, and they'll point at severely bloated code bases and claim they're skinny as a rail.
Good thing we're not talking to them, eh?
I say my opinion is hard-won because people don't really talk much about code base size; it's not widely recognized as a problem. In fact it's widely recognized as a non-problem. This means that anyone sharing my minority opinion is considered a borderline lunatic, since what rational person would rant against a non-problem?
People in the industry are very excited about various ideas that nominally help you deal with large code bases, such as IDEs that can manipulate code as "algebraic structures", and search indexes, and so on. These people tend to view code bases much the way construction workers view dirt: they want great big machines that can move the dirt this way and that. There's conservation of dirt at work: you can't compress dirt, not much, so their solution set consists of various ways of shoveling the dirt around. There are even programming interview questions, surely metaphorical, about how you might go about moving an entire mountain of dirt, one truck at a time.
Industry programmers are excited about solutions to a big non-problem. It's just a mountain of dirt, and you just need big tools to move it around. The tools are exciting but the dirt is not.
My minority opinion is that a mountain of code is the worst thing that can befall a person, a team, a company. I believe that code weight wrecks projects and companies, that it forces rewrites after a certain size, and that smart teams will do everything in their power to keep their code base from becoming a mountain. Tools or no tools. That's what I believe.
It turns out you have to have something bad happen to you before you can hold my minority opinion. The bad thing that happened to me is that I wrote a beautiful game in an ugly language, and the result was lovely on the outside and quite horrific internally. The average industry programmer today would not find much wrong with my code base, aside from the missing unit tests (which I now regret) that would, alas, double the size of my game's already massive 500,000-line code base. So the main thing they would find wrong with it is, viewed in a certain way, that it's not big enough. If I'd done things perfectly, according to today's fashions, I'd be even worse off than I am now.
Some people will surely miss my point, so I'll clarify: I think unit testing is great. In fact I think it's critical, and I vastly regret not having unit tests for my game. My point is that I wrote the game the way most experienced programmers would tell you to write that kind of system, and it's now an appallingly unmanageable code base. If I'd done the "right thing" with unit tests, it would be twice appalling! The apparent paradox here is crucial to understanding why I hold my minority belief about code base size.
Most programmers never have anything truly bad happen to them. In the rare cases when something bad happens, they usually don't notice it as a problem, any more than a construction worker notices dirt as a problem. There's just a certain amount of dirt at every site, and you have to deal with it: it's not "bad"; it's just a tactical challenge.
Many companies are faced with multiple million lines of code, and they view it as a simple tools issue, nothing more: lots of dirt that needs to be moved around occasionally.
Most people have never had to maintain a half-million line code base singlehandedly, so their view of things will probably be different from mine. Hopefully you, being the young, eager-to-learn individual that you are, will realize that the only people truly qualified to express opinions on this matter are those who have lived in (and helped create) truly massive code bases.
You may hear some howling in response to my little rant today, and a lot of hand-wavy "he just doesn't understand" dismissiveness. But I posit that the folks making these assertions have simply never been held accountable for the messes they've made.
When you write your own half-million-line code base, you can't dodge accountability. I have nobody to blame but myself, and it's given me a perspective that puts me in the minority.
It's not just from my game, either. That alone might not have taught me the lesson. In my twenty years in the industry, I have hurled myself forcibly against some of the biggest code bases you've ever imagined, and in doing so I've learned a few things that most people never learn, not in their whole career. I'm not asking you to make up your mind on the matter today. I just hope you'll keep your eyes and ears open as you code for the next few years.
Invisible Bloat
I'm going to try to define bloat here. I know in advance that I'll fail, but hopefully just sketching out the problem will etch out some patterns for you.
There are some things that can go wrong with code bases that have a nice intuitive appeal to them, inasmuch as it's not difficult for most people to agree that they're "bad".
One such thing is complexity. Nobody likes a complex code base. One measure of complexity that people sometimes use is "cyclomatic complexity", which estimates the possible runtime paths through a given function using a simple static analysis of the code structure.
I'm pretty sure that I don't like complex code bases, but I'm not convinced that cyclomatic complexity measurements have helped. To get a good cyclomatic complexity score, you just need to break your code up into smaller functions. Breaking your code into smaller functions has been a staple of "good design" for at least ten years now, in no small part due to the book Refactoring by Martin Fowler.
The problem with Refactoring as applied to languages like Java, and this is really quite central to my thesis today, is that Refactoring makes the code base larger. I'd estimate that fewer than 5% of the standard refactorings supported by IDEs today make the code smaller. Refactoring is like cleaning your closet without being allowed to throw anything away. If you get a bigger closet, and put everything into nice labeled boxes, then your closet will unquestionably be more organized. But programmers tend to overlook the fact that spring cleaning works best when you're willing to throw away stuff you don't need.
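To make the closet metaphor concrete, here's a minimal sketch of what a single round of Extract Method typically does to a Java class. The Item type and the method names are invented for illustration, not taken from any real code base: each resulting method is simpler, and scores better on cyclomatic complexity, but the class ends up with more lines than it started with.

```java
import java.util.List;

// A hypothetical Item type, just so the example compiles.
interface Item {
    int price();
}

class OrderBefore {
    // One method, read in one sitting: about ten lines.
    int totalPrice(List<Item> items) {
        int total = 0;
        for (Item item : items) {
            total += item.price();
        }
        if (total > 100) {
            total -= 10; // bulk discount
        }
        return total;
    }
}

class OrderAfter {
    // The same logic after two Extract Method passes. Each method is
    // simpler, but the class has grown by several lines of signatures,
    // braces and call plumbing. Nothing was thrown away.
    int totalPrice(List<Item> items) {
        return applyDiscount(sumPrices(items));
    }

    private int sumPrices(List<Item> items) {
        int total = 0;
        for (Item item : items) {
            total += item.price();
        }
        return total;
    }

    private int applyDiscount(int total) {
        return total > 100 ? total - 10 : total;
    }
}
```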
This brings us to the second obviously-bad thing that can go wrong with code bases: copy and paste. It doesn't take very long for programmers to learn this lesson the hard way. It's not so much a rule you have to memorize as a scar you're going to get whether you like it or not. Computers make copy-and-paste really easy, so every programmer falls into the trap once in a while. The lesson you eventually learn is that code always changes, always always always, and as soon as you have to change the same thing in N places, where N is more than 1, you'll have earned your scar.
However, copy-and-paste is far more insidious than most scarred industry programmers ever suspect. The core problem is duplication, and unfortunately there are patterns of duplication that cannot be eradicated from Java code. These duplication patterns are everywhere in Java; they're ubiquitous, but Java programmers quickly lose the ability to see them at all.
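Here's one hedged illustration of the sort of duplication I mean; the class is hypothetical, but the shape should look familiar. It's the property boilerplate that every Java "bean" repeats: the same get/set pattern stamped out once per field, with nothing changing but the names, and no language facility for factoring the pattern itself out.

```java
// A hypothetical game class. The same accessor pattern is repeated for
// every field; after a while Java programmers stop seeing it as
// duplication at all and just call it "the code".
public class PlayerStats {
    private int health;
    private int mana;
    private int gold;

    public int getHealth() { return health; }
    public void setHealth(int health) { this.health = health; }

    public int getMana() { return mana; }
    public void setMana(int mana) { this.mana = mana; }

    public int getGold() { return gold; }
    public void setGold(int gold) { this.gold = gold; }
}
```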
Java programmers often wonder why Martin Fowler "left" Java to go to Ruby. Although I don't know Martin, I think it's safe to speculate that "something bad" happened to him while using Java. Amusingly (for everyone except perhaps Martin himself), I think that his "something bad" may well have been the act of creating the book Refactoring, which showed Java programmers how to make their closets bigger and more organized, while showing Martin that he really wanted more stuff in a nice, comfortable, closet-sized closet.
Martin, am I wrong?
As I predicted would happen, I haven't yet defined bloat except in the vaguest terms. Why is my game code base half a million lines of code? What is all that code doing?
Design Patterns Are Not Features
The other seminal industry book in software design was Design Patterns, which left a mark the width of a two-by-four on the faces of every programmer in the world, assuming the world contains only Java and C++ programmers, which, for practical purposes, it often does.
Design Patterns was a mid-1990s book that provided twenty-three fancy new boxes for organizing your closet, plus an extensibility mechanism for defining new types of boxes. It was really great for those of us who were trying to organize jam-packed closets with almost no boxes, bags, shelves or drawers. All we had to do was remodel our houses to make the closets four times bigger, and suddenly we could make them as clean as a Nordstrom merchandise rack.
Interestingly, sales people didn't get excited about Design Patterns. Nor did PMs, nor marketing folks, nor even engineering managers. The only people who routinely get excited about Design Patterns are programmers, and only programmers who use certain languages. Perl programmers were, by and large, not very impressed with Design Patterns. However, Java programmers misread this; they concluded that Perl programmers must be slovenly, no-good bachelors who pile laundry in their closets up to the ceiling.
It's obvious now, though, isn't it? A design pattern isn't a feature. A Factory isn't a feature, nor is a Delegate nor a Proxy nor a Bridge. They "enable" features in a very loose sense, by providing nice boxes to hold the features in. But boxes and bags and shelves take space. And design patterns – at least most of the patterns in the "Gang of Four" book – make code bases get bigger. Tragically, the only GoF pattern that can help code get smaller (Interpreter) is utterly ignored by programmers who otherwise have the names of Design Patterns tattooed on their various body parts.
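To see what I mean by boxes taking space, here's a minimal Factory sketch; the Widget names are made up. None of these lines is a feature. They exist so that, somewhere else, a single new expression can be replaced by a method call.

```java
// The feature is "create a widget". The pattern wraps that one feature
// in an interface, a concrete class, and a factory class.
interface Widget {
    void render();
}

class ButtonWidget implements Widget {
    public void render() {
        System.out.println("drawing a button");
    }
}

class WidgetFactory {
    public Widget create(String kind) {
        if ("button".equals(kind)) {
            return new ButtonWidget();
        }
        throw new IllegalArgumentException("unknown widget: " + kind);
    }
}

// Usage: Widget w = new WidgetFactory().create("button");
```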
Dependency Injection is an example of a popular new Java design pattern that programmers using Ruby, Python, Perl and JavaScript have probably never heard of. And if they've heard of it, they've probably (correctly) concluded that they don't need it. Dependency Injection is an amazingly elaborate infrastructure for making Java more dynamic in certain ways that are intrinsic to higher-level languages. And – you guessed it – DI makes your Java code base bigger.
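Here's a minimal, hand-rolled sketch of what constructor-style dependency injection adds in Java; the names are hypothetical, and a real DI framework layers XML or annotation configuration on top of this. You get an interface, an implementation, and wiring code, where a dynamic language would typically just pass in (or swap out) the object at runtime.

```java
// The "feature" is sending a welcome mail. DI adds an interface so the
// dependency can be swapped, plus wiring code that assembles the graph.
interface MailService {
    void send(String to, String body);
}

class SmtpMailService implements MailService {
    public void send(String to, String body) {
        System.out.println("SMTP send to " + to);
    }
}

class WelcomeNotifier {
    private final MailService mail;

    // The dependency is handed in rather than constructed here, so a
    // test can inject a fake MailService.
    WelcomeNotifier(MailService mail) {
        this.mail = mail;
    }

    void welcome(String address) {
        mail.send(address, "Welcome aboard!");
    }
}

class Wiring {
    public static void main(String[] args) {
        // In a framework this assembly lives in XML or annotations,
        // which is more code still.
        WelcomeNotifier notifier = new WelcomeNotifier(new SmtpMailService());
        notifier.welcome("player@example.com");
    }
}
```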
Bigger is just something you have to live with in Java. Growth is a fact of life. Java is like a variant of the game of Tetris in which none of the pieces can fill gaps created by the other pieces, so all you can do is pile them up endlessly.
Millions of Lines of Code
I recently had the opportunity to watch a self-professed Java programmer give a presentation in which one slide listed Problems (with his current Java system) and the next slide listed Requirements (for the wonderful new vaporware system). The #1 problem he listed was code size: his system has millions of lines of code.
Wow! I've sure seen that before, and I could really empathize with him. Geoworks had well over ten million lines of assembly code, and I'm of the opinion that this helped bankrupt them (although that also appears to be a minority opinion – those industry programmers just never learn!) And I worked at Amazon for seven years; they have well over a hundred million lines of code in various languages, and "complexity" is frequently cited internally as their worst technical problem.
So I was really glad to see that this guy had listed code size as his #1 problem.
Then I got my surprise. He went on to his Requirements slide, on which he listed "must scale to millions of lines of code" as a requirement. Everyone in the room except me just nodded and accepted this requirement. I was floored.
Why on earth would you list your #1 problem as a requirement for the new system? I mean, when you're spelling out requirements, generally you try to solve problems rather than assume they're going to be created again. So I stopped the speaker and asked him what the heck he was thinking.
His answer was: well, his system has lots of features, and more features means more code, so millions of lines are Simply Inevitable. "It's not that Java is verbose!" he added – which is pretty funny, all things considered, since I hadn't said anything about Java or verbosity in my question.
The thing is, if you're just staring in shock at this story and thinking "how could that Java guy be so blind", you are officially a minority in the programming world. An unwelcome one, at that.
Most programmers have successfully compartmentalized their beliefs about code base size. Java programmers are unusually severe offenders but are by no means the only ones. In one compartment, they know big code bases are bad. It only takes grade-school arithmetic to appreciate just how bad they can be. If you have a million lines of code, at 50 lines per "page", that's 20,000 pages of code. How long would it take you to read a 20,000-page instruction manual? The effort to simply browse the code base and try to discern its overall structure could take weeks or even months, depending on its density. Significant architectural changes could take months or even years.
In the other compartment, they think their IDE makes the code size a non-issue. We'll get to that shortly.
And a million lines is nothing, really. Most companies would love to have merely a million lines of code. Often a single team can wind up with that much after a couple years of hacking. Big companies these days are pushing tens to hundreds of millions of lines around.
I'll give you the capsule synopsis, the one-sentence summary of the learnings I had from the Bad Thing that happened to me while writing my game in Java: if you begin with the assumption that you need to shrink your code base, you will eventually be forced to conclude that you cannot continue to use Java. Conversely, if you begin with the assumption that you must use Java, then you will eventually be forced to conclude that you will have millions of lines of code.
Is it worth the trade-off? Java programmers will tell you Yes, it's worth it. By doing so they're tacitly nodding to their little compartment that realizes big code bases are bad, so you've at least won that battle.
But you should take anything a "Java programmer" tells you with a hefty grain of salt, because an "X programmer", for any value of X, is a weak player. You have to cross-train to be a decent athlete these days. Programmers need to be fluent in multiple languages with fundamentally different "character" before they can make truly informed design decisions.
Recently I've been finding that Java is an especially bad value for X. If you absolutely must hire an X programmer, make sure it's Y.
I didn't really set out to focus this rant on Java (and Java clones like C#, which despite now being a "better" language still has Java's fundamental character, making it only a marginal win at best.) To be sure, my minority opinion applies to any code base in any language. Bloat is bad.
But I find myself focusing on Java because I have this enormous elephant of a code base that I'm trying to revive this week. Can you blame me? Hopefully someone with a pet C++ elephant can come along and jump on the minority bandwagon with me. For now, though, I'll try to finish my explanation of bloat as a bona-fide problem using Java for context.
Can IDEs Save You?
The Java community believes, with near 100% Belief Compliance, that modern IDEs make code base size a non-issue. End of story.
There are several problems with this perspective. One is simple arithmetic again: given enough code, you eventually run out of machine resources for managing the code. Imagine a project with a billion lines of code, and then imagine trying to use Eclipse or IntelliJ on that project. The machines – CPU, memory, disk, network – would simply give up. We know this because twenty-million line code bases are already moving beyond the grasp of modern IDEs on modern machines.
Heck, I've never managed to get Eclipse to pull in and index even my 500,000-line code base, and I've spent weeks trying. It just falls over, paralyzed. It literally hangs forever (I can leave it overnight and it makes no progress.) Twenty million lines? Forget about it.
It may be possible to mitigate the problem by moving the code base management off the local machine and onto server clusters. But the core problem is really more cultural than technical: as long as IDE users refuse to admit there is a problem, it's not going to get solved.
Going back to our crazed Tetris game, imagine that you have a tool that lets you manage huge Tetris screens that are hundreds of stories high. In this scenario, stacking the pieces isn't a problem, so there's no need to be able to eliminate pieces. This is the cultural problem: they don't realize they're not actually playing the right game anymore.
The second difficulty with the IDE perspective is that Java-style IDEs intrinsically create a circular problem. The circularity stems from the nature of programming languages: the "game piece" shapes are determined by the language's static type system. Java's game pieces don't permit code elimination because Java's static type system doesn't have any compression facilities – no macros, no lambdas, no declarative data structures, no templates, nothing that would permit the removal of the copy-and-paste duplication patterns that Java programmers think of as "inevitable boilerplate", but which are in fact easily factored out in dynamic languages.
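As a hedged example of that "inevitable boilerplate" (the Player type and method names are invented), here is what passing a one-line comparison to a sort looks like in a language with no lambdas. The idea is one expression; everything else is packaging, and you get to retype it every time you need a slightly different callback.

```java
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// A hypothetical Player type, just so the example compiles.
class Player {
    private final int score;
    Player(int score) { this.score = score; }
    int getScore() { return score; }
}

class SortExample {
    // Sort players by score. The actual idea is the single line in the
    // middle; the anonymous inner class around it is the duplication
    // pattern that cannot be factored out of pre-closure Java.
    static void sortByScore(List<Player> players) {
        Collections.sort(players, new Comparator<Player>() {
            public int compare(Player a, Player b) {
                return a.getScore() - b.getScore();
            }
        });
    }
}
```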
Completing the circle, dynamic features make it more difficult for IDEs to work their static code-base-management magic. IDEs don't work as well with dynamic code features, so IDEs are responsible for encouraging the use of languages that require... IDEs. Ouch.
Java programmers understand this at some level; for instance, Java's popular reflection facility, which allows you to construct method names on the fly and invoke those methods by name, defeats an IDE's ability to perform basic refactorings such as Rename Method. But because of successful compartmentalization, Java folks point at dynamic languages and howl that (some) automated refactorings aren't possible, when in fact they're just as possible in these languages as they are in Java – which is to say, they're partly possible. The refactorings will "miss" to the extent that you're using dynamic facilities, whether you're writing in Java or any other language. Refactorings are essentially never 100% effective, especially once a code base has shipped offsite and other people are calling its public APIs: this is precisely why Java has a deprecation facility. You can't rename a method on everyone's machine in the world. But Java folks continue spouting the provably false belief that automated refactorings work on "all" their code.
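A minimal sketch of the reflection case, with invented class and method names: the method name lives in a string, so an automated Rename Method on heal() can update every static call site in the project and still silently miss this one. It fails at runtime instead.

```java
import java.lang.reflect.Method;

class Potion {
    public void heal() {
        System.out.println("healed!");
    }
}

class ReflectiveCall {
    public static void main(String[] args) throws Exception {
        Potion potion = new Potion();

        // The IDE sees only a string here. Rename heal() to restore()
        // and this line still compiles; it just blows up when invoked.
        Method m = Potion.class.getMethod("heal");
        m.invoke(potion);
    }
}
```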
I'll bet that by now you're just as glad as I am that we're not talking to Java programmers right now! Now that I've demonstrated one way (of many) in which they're utterly irrational, it should be pretty clear that their response isn't likely to be a rational one.
Rational Code Size
The rational response would be to take a very big step back, put all development on hold, and ask a difficult question: "what should I be using instead of Java?"
I did that about four years ago. That's when I stopped working on my game, putting it into maintenance mode. I wanted to rewrite it down to, say, 100,000 to 150,000 lines, somewhere in that vicinity, with the exact same functionality.
It took me six months to realize it can't be done with Java, not even with the stuff they added to Java 5, and not even with the stuff they're planning for Java 7 (even if they add the cool stuff, like non-broken closures, that the Java community is resisting tooth and nail.)
It can't be done with Java. But I do have a big investment in the Java virtual machine, for basically the same reason that Microsoft now has a big investment in .NET. Virtual machines make sense to me now. I mean, they "made sense" at some superficial level when I read the marketing brochures, but now that I've written a few interpreters and have dug into native-code compilers, they make a lot more sense. Exactly why is, unfortunately, a whole other rant.
So taking for granted today that VMs are "good", and acknowledging that my game is pretty heavily tied to the JVM – not just for the extensive libraries and monitoring tools, but also for more subtle architectural decisions like the threading and memory models – the rational answer to code bloat is to use another JVM language.
One nice thing about JVM languages is that Java programmers can learn them pretty fast, because you get all the libraries, monitoring tools and architectural decisions for free. The downside is that most Java programmers are X programmers, and, as I said, you don't want X programmers on your team.
But since you're not one of those people who've decided to wear bell-bottom polyester pants until the day you die, even should you live unto five hundred years, you're open to language suggestions. Good for you!
Three years ago, I set out to figure out which JVM language would be the best code-compressing successor to Java. That took a lot longer than I expected, and the answer was far less satisfactory than I'd anticipated. Even now, three years later, the answer is still a year or two away from being really compelling.
I'm patient now, though, so after all the dust settles, I know I've got approximately a two-year window during which today's die-hard Java programmers are writing their next multi-million line disaster. Right about the time they're putting together their next Problems/Requirements slide, I think I'll actually have an answer for them.
In the meantime, I'm hoping that I'll have found time to rewrite my game in this language, down from 500,000 lines to 150,000 lines with the exact same functionality (plus at least another 50k+ for unit tests.)
The Next Java
So what JVM language is going to be the Next Java?
Well, if you're going for pure code compression, you really want a Lisp dialect: Common Lisp or Scheme. And there are some very good JVM implementations out there. I've used them. Unfortunately, a JVM language has to be a drop-in replacement for Java (otherwise a port is going to be a real logistics problem), and none of the Lisp/Scheme implementors seems to have that very high on their priority list.
Plus everyone will spit on you. People who don't habitually spit will expectorate up to thirty feet, like zoo camels, in order to bespittle you if you even suggest the possibility of using a Lisp or Scheme dialect at your company.
So it's not gonna be Lisp or Scheme. We'll have to sacrifice some compression for something a bit more syntactically mainstream.
It could theoretically be Perl 6, provided the Parrot folks ever actually get their stuff working, but they're even more patient than I am, if you get my drift. Perl 6 really is a pretty nice language design, for the record – I was really infatuated with it back in 2001. The love affair died about five years ago, though. And Perl 6 probably won't ever run on the JVM. It's too dependent on powerful Parrot features that the JVM will never support. (I'd venture that Parrot probably won't ever support them either, but that would be mean.)
Most likely New Java is going to be an already reasonably popular language with a very good port to the JVM. It'll be a language with a dedicated development team and a great marketing department.
That narrows the field from 200+ languages down to maybe three or four: JRuby, Groovy, Rhino (JavaScript), and maybe Jython if it comes out of its coma.
Each of these languages (as does Perl 6) provides mechanisms that would permit compression of a well-engineered 500,000-line Java code base by 50% to 75%. Exactly where the dart lands (between 50% and 75%) remains to be seen, but I'm going to try it myself.
I personally tried Groovy and found it to be an ugly language with a couple of decent ideas. It wants to be Ruby but lacks Ruby's elegance (or Python's for that matter). It's been around a long time and does not seem to be gaining any momentum, so I've ruled it out for my own work. (And I mean permanently – I will not look at it again. Groovy's implementation bugs have really burned me.)
I like Ruby and Python a lot, but neither JVM version was up to snuff when I did my evaluation three years ago. JRuby has had a lot of work done to it in the meantime. If the people I work with weren't so dead-set against Ruby, I'd probably go with that, and hope like hell that the implementation is eventually "fast enough" relative to Java.
As it happens, though, I've settled on Rhino. I'll be working with the Rhino dev team to help bring it up to spec with EcmaScript Edition 4. I believe that ES4 brings JavaScript to rough parity with Ruby and Python in terms of (a) expressiveness and (b) the ability to structure and manage larger code bases. Anything it lacks in sugar, it more than makes up for with its optional type annotations. And I think JavaScript (especially on ES4 steroids) is an easier sell than Ruby or Python to people who like curly braces, which is anyone currently using C++, Java, C#, JavaScript or Perl. That's a whooole lot of curly brace lovers. I'm nothing if not practical these days.
I don't expect today's little rant to convince anyone to share my minority opinion about code base size. I know that a few key folks (Bill Gates, for instance, as well as Dave Thomas, Martin Fowler and James Duncan Davidson) have independently reached the same conclusion: namely, that bloat is the worst thing that can happen to code. But they all got there via painful things happening to them.
I can't exactly wish for something painful to happen to Java developers, since hey, it's already happening; they've already taught themselves to pretend it's not hurting them.
But as for you, the eager young high school or college student who wants to become a great programmer someday, hopefully I've given you an extra dimension to observe as you tend your code gardens for the next few years.
When you're ready to make the switch, well, Mozilla Rhino will be ready for you. It works great today and will be absolutely outstanding a year from now. And I sincerely hope that JRuby, Jython and friends will also be viable Java alternatives for you as well. You might even try them out now and see how it goes.
Your code base will thank you for it.
185 Comments:
If you could snap your fingers and bring Perl 6 on Parrot's implementation up to the level of Rhino's, would you still choose JavaScript?
I wish he could snap his fingers to get rid of his anti-Ruby collaborators. Ha. :-)
But, I'm training for decent athlete status myself.
And I'm hoping for a quick release of that Emacs + JavaScript IDE / byte-code compiler. :-)
I think you make a lot of good points in your post. I also, however, think you're confusing the problem of codebase size with code duplication. I'm all for removing duplication from my code, and keeping it as small as possible in the meantime. But there's a point where making code smaller by using macros, terse languages and the like actually harms the maintainability of the code as much as code bloat does.
Just so you know where I'm coming from, I've worked professionally only in C++ and Java. I think they're fine tools for a lot of jobs, but they do leave a lot to be desired as you've pointed out. My personal projects, however, range from Perl, Python, Java, C++, and Assembly all the way up to my most recent favorite language, which is Ruby. Ruby offers a lot of extra power, and I've got a personal project I'm working on right now that's using JRuby. (I agree wholeheartedly with you about VMs, by the way.)
But that doesn't mean that I like Ruby just for its terseness. As a good example of what horrors you can inflict on people in the name of reducing code size, go take a look at this MUD written in Ruby in only 15 lines of code. Would YOU want to try and add a feature to that? Granted, it's an extreme example, but I think it illustrates the problem with your thesis. Yes, codebase size can be a problem. But reducing size by decreasing readability is just as big of a problem.
Anyone dumb enough to have curly braces as a requirement is never going to consider Javascript to be a legitimate language for developing "real" apps.
Did you consider Scala at all? Ramming JRuby down people's throats still sounds like a good option too ;)
I tried to "learn programming" via Java 1.1 in high school back in the day, self-taught. My plan was to build a MUD server and client and use XML for data storage. It was... fun. Years later, in college, I did it again in Java and then in C#, always getting to the point of no longer liking what I was doing at pretty much the same stage of functionality.
Recently, as a way of evaluating the true ninja-like qualities of Ruby, I decided to re-write it using Ruby and YAML.
My personal experience, to your point, was that the Ruby version came out 1) massively smaller in code size, 2) far easier to debug and pick up after long periods of awayness, 3) in at least 1/3 of the time it took previously and 4) had ass-loads more functionality in that time frame.
Now, you could argue that my first go was as a green programmer, but the college versions were written while I was earning a degree in software engineering and (WORSE) the Ruby version was written after having worked in Marketing for a few years. Marketing!
I used Smultron. No real scalability concerns there. It turned out that the benefits of an IDE were only valuable because what I was doing was needlessly complex enough that an IDE became relevant. IDEs are not features of a language, in my opinion, but necessities of a language due to a lack of other features - at least at some scale.
Good stuff. As a poet, I loved the camel metaphor.
Can you say a bit more about why, exactly, you're up at 500,000 lines of code? I can guess, but the idea of one programmer working alone and getting to that amount of code is a bit startling.
Good! Let's go with JavaScript, also known (to some) as Lisp/Scheme in sheep's clothing.
And you have the right word to sell it: "Interpreter Pattern". You cannot sell anything to the Java community without first saying a spell from the GoF book.
What about Scala?
I find Scala to be a pretty easy language to get excited about. Great expressiveness and integration into the JVM.
What about something like JPype:
http://jpype.sourceforge.net/
For some stuff you can use Python when convenient (like string processing) and for other stuff you can use Java (like library support).
@Alec,
> Yes, codebase size can be a problem. But reducing size by decreasing readability is just as big of a problem.
Lots of people assume software's length and clarity are linearly correlated. But consider this:
"I think what is wanted with programming languages and programs intended for human consumption is something along the lines of "the greatest meaning-based compression" (or "greatest semantic compression", to be more succinct). If greatest possible compression were the goal, then all that would be needed to write succinct programs would be to run the final program through a compression program like gzip."
That comes from this post, succinctness = power. Since I originally found it from Steve Yegge's Ten Challenges, I think that's probably what he's going for.
Just tossing out an idea - besides reducing the size of the dirt, you could also ship it to someone else. I mean someone who would actually welcome it.
Hmm, I think I'll stop being metaphorical now - how about turning parts of the program into libraries and publishing them under some Open Source license?
Sir C.A.R. Hoare agrees: "The price for reliability is the pursuit of the utmost simplicity. It is a price which the very rich find most hard to pay." (cribbed from Dijkstra's EWD1304)
Count me in the minority, too. Code size is a huge problem, and like you suggest, programmers who aren't familiar with templates, generics, lambdas, and tuples (I'd say macros, too, but I'll leave that for lispers) will be missing useful tools for compressing their code.
OTOH, today you don't get checks for writing with languages that have strong support for those approaches. That is, unless you start your own company and sell it for a hefty sum because you got in before anyone else had a chance to finish ;)
Amazon has 100 million lines of code? Are you sure? Unbelievable!!! Windows is around 50 million, Solaris is ~ 35 million.
I agree that size can be a big problem (I'm working on a very large Java project) but I'm not convinced that 200,000 lines of Ruby code really is more maintainable than 500,000 lines of Java. We programmers have a lot of experience in dealing with large Java/C/C++ codebases but has anyone got 200,000 lines of Ruby?
Jeff Atwood recently discussed the problems with Bugzilla and its large Perl codebase. Sometimes the LoC is not the problem but the functional complexity of the system.
@Jamie,
> ...I'm not convinced that 200,000 lines of Ruby code really is more maintainable than 500,000 lines of Java...Sometimes the LoC is not the problem but the functional complexity of the system.
Since ruby allows for greater semantic compression, I think a ruby system done well would take much longer to reach 200 kloc, than a Java system would to reach 500 kloc.
The trick is to maximize semantic compression, not just LoC.
Most programmers are Javascript users these days. That should not be taken to imply any fondness for curly braces and semicolons. I love the dynamism and the object model. Unfortunately, most people proposing changes to the language want to keep the syntax and scrap the object model.
Just taken a look at Rhino. NBL is shaping up nicely.
Before ruling out Scheme, try SISC. http://sisc-scheme.org/
Maybe you did this already, but I never saw you mentioning it in your blog. It is a *great* JVM Scheme implementation, with good documentation. I also like the JVM, and this combination of Scheme and the JVM is working very well for me.
SISCWeb (http://siscweb.sourceforge.net/) is also very nice.
That's a lot of words just to say "code is a liability". :-)
As opposed to everyone else here commenting on code size vs functional complexity vs blah, I really want to ask one question.
Can I join you in your venture of rewriting this game, or is this a solo effort?
Agreed that code size is a problem, and the language of choice can definitely lead to bloat.
That being said, I've worked with many smart folks - "good" programmers, who just don't raise their level of abstraction even when the language allows it. And that holds doubly so when time is "of the essence" - as it often feels when on a deadline.
Another vote for Scala. I like Javascript and Rhino sounds interesting, but Scala is a much more powerful, expressive language.
Suppose you structured your code in such a way that you only had to look at under 1000 LoC at a time in order to do useful work. Would the 500KLoC be as big a problem?
First of all, I would agree that code-base size is as big a problem as you say. However, I would posit that the size of the problem is more a function of the structure/design of the code than of just the number of LoC.
The latest system I put into production had roughly 500KLoC. Developers would have only a couple thousand LoC open in their IDE at a time. This came about as a result of much architectural and design work.
[You can find some of the patterns and frameworks we used on my blog: http://udidahan.weblogs.us]
Thoughts?
Also, I have a different take on the 'millions of lines of code' problem:
If I leverage 'millions of lines of code' from Apache projects, Eclipse, and a few other third-party modules in a small app with a Maven POM, is that the same thing?
How about if I have a larger app, like a game, but the game engine, logic, etc. are modularized in a similar fashion?
Maybe millions of lines of code are inevitable. But as long as they're well-defined to the point you don't have to pull them all into your head at once, there's no problem.
I agree that reorganizing into new containers and shelves won't solve anything. It's all about defining good modules and keeping those modules current and relevant in the face of unanticipated changes that would otherwise cause them to grow strange appendages.
Why can't that be done in Java? Apache and Eclipse seem to pull it off well enough...
@Casey
I get the feeling that's why he chose the JVM. He had existing code as well as the multitude of code that makes up the JVM supporting libraries.
Steve, I'm stealing this quote: "Java is like a variant of the game of Tetris in which none of the pieces can fill gaps created by the other pieces, so all you can do is pile them up endlessly."
If you ever see it again (which is unlikely with my modest blog) you're welcome to retaliate.
I assume when measuring the size of your game, you didn't include the JDK. That suggests another way to reduce lines of code: write code for a more featureful platform. The lines don't disappear, but they become someone else's problem.
You know that the problem of bloat applies equally well to documents in natural language, too?
Rhino is a Java implementation of an ECMAScript interpreter, right?
How does it do speed-wise?
Are you envisioning a Java program that's just a thin wrapper around Rhino, with the "real" program written in ECMAScript?
Have you looked at Lua at all?
How about LISP on JVM? Have you looked at clojure?
http://clojure.sourceforge.net/
+1 Scala.
"Its a poor workman who blames his tools." -- Anonymous
Highly structured languages make complexity possible. Scripting languages fall over before you get anywhere near the same level of complexity. If you consider that a feature, I suppose that's your business.
(And Yes, I've done a lot of Java and a lot of PERL in my day. And C, C++, Modula2... )
I think you probably missed the most important tool you could have used on this project, though now maybe it's a bit late...
Effective Java
It seems to me that you've experienced the power of Java and it's allowed you to create something much bigger than you imagined you could. It's probably bigger than you can manage to support. So, this is the point where you have to decide: is it a multi-developer project or is it still just your little pet?
There was a lot of text and I skimmed a lot. Did you tell us what the name of your game is and where to find it when you get it back on line?
Wow, that was long...I read only the first couple of paragraphs. But I was pleased that I found the word "garden" when I searched the rest of the text, as I believe it is a useful metaphor: If you only "plant" and never "weed", you won't have a pleasing result.
"The Java community believes, with near 100% Belief Compliance, that modern IDEs make code base size a non-issue. End of story.
"
Actually, we believe that IDE's make us much more productive when we have to manage code bases of varying sizes. Big difference than making "code base size a non-issue". Code base size is always an issue in probably any language you write.
Finally, I don't think your 500,000 line Java game is a good example. Mainly because you are the only one maintaining it and you're the only one who has written the code. Try dealing with an OSS project where code contributions come from volunteers who may or may not stick around to maintain their code. Coming from that perspective an IDE is invaluable to data mine, debug, and refactor code for the OSS project leads. Reliable refactoring and code analysis are crucial.
Personally, I feel most of Java's problems lie not with the language itself but with the libraries and frameworks within the JDK and outside the JDK. Rails is, IMO, the only reason Ruby has gotten any traction. It's filling the productivity void that EE has generally failed to fill.
you people seem to have missed the coolest (and probably the fastest) JVM language of them all...
CLOJURE
google it.
functional programming, macros, software transactional memory, etc.
you can't lose.
+1 Scala
Right there with you. I have a seemingly smaller problem than you -- 250K lines of C# and a team of people -- but the code base is just not as malleable as I'd like it to be.
Over the last 6 months, I've started attacking it from multiple directions.
For one thing, I'm integrating an IronPython interpreter. Initially, this is just for application scriptability, but long-term, I'm hoping we can replace enormous chunks of C# with it in the upper layers. I picked IronPython because it's the closest thing to a mature language out there that runs on the .Net CLR.
On the flipside, I've started integrating F# (Microsoft's OCaml variant for .Net) in the lowest layers. Initially it's as the back end for some DSLs that we're using to simplify some messy and complex code, but I'd like to see it start flatly replacing C# in a lot of places. So far, the only real problem with F# is that writing F# libraries to be used from C#, while better than using C# itself, is more cumbersome than writing F# that only has to be called from itself.
The early returns are that this is helping. It seems like a multi-language strategy is at least helping to keep our code's growth rate contained. If it gets proven, I may be able to make a push for using the new languages to actually shrink the code base, which would be a nice win.
So what's the game??? I want to play.
I see where you're coming from, but I think you're seeing a symptom as the problem. Large code is monstrously hard to manage, but I don't know that the fact that it is large code is the issue.
Can you give more information about why exactly you think this is the problem with your code? How did you reach this conclusion exactly?
Let me prescribe you a better solution:
Java programmers will stop writing bloated code if Author Stevey stops writing bloated articles about bloated Java programs without giving any factual or statistical analysis of the cause (and of why these other P* languages will solve the problem).
Oh... let's not blame the author. Let's just blame English for being a bloated and bulky language.
The code size of your rant is too big and bloated.
I tried to load it into my brain, but it just hung, indefinitely, until I force quit it.
I believe the game in question is Wyvern, which was hosted at http://www.cabochon.com/
It's sort of a MUD but with a graphical interface as well.
I think most technologies suffer in their success - as more people participate in the community, more competing (and complementary) thoughts/patterns/practices vie for our attention and we understand more about what problems the technology didn't solve well. We then focus on that and lose sight of what problems the technology did solve well.
Good software designers need to carefully weigh both sides (instead of getting caught up in the 'throw it out' mindset) and calculate how to bridge the gap. On that very tactical point, I totally agree that selective application of dynamic languages in an existing JVM-based application is absolutely the right way to go and achieve "semantic compression" without sacrificing too much existing investment. Further, such a decision needs to weigh factors like talent availability - the code most likely will be maintained by a third party.
I'm a Rails novice (and thus, in the 'honeymoon' period), but thoroughly expect the community to go through the same contortions in 5(+/-) years. Look at Perl. Look at JavaScript. Extrinsic factors drastically changed the fortunes of both languages in the preceding 5 years.
Semi-related: http://icedjava.blogspot.com/2007/12/dear-java-thanks-for-complexity-of.html
Steve, you sure discounted Lisp fairly quickly, considering that you have made no mention that other people will be using your code base. Are you unwilling to say that you yourself dislike Lisp?
I am writing a pretty heavyweight game in Lisp, using real graphics middleware like Gamebryo. I fail to see what you lose from the lack of a VM, and I wonder whether the speedup from using SBCL or Corman Common Lisp would outweigh the immediate performance hit from your application's dependence on the JVM threading model.
I think the core premise (your code is 500k lines; unit tests would double it; there is no hope!) is wrong.
Proper modularization, which really requires a good set of unit tests, would fix your problem by making one 500k project into, say, ten 50k projects, each of which acts somewhat independently. At that point, working on any of the areas of the application is working on a single 50k application with 9 additional libraries being used (in addition to the many standard and third-party libraries you are already using).
There is a certain level of verbosity in Java, and much of the required syntax doesn't do a good job of "fading into the woodwork" leaving just your functional code. Likewise, the lack of defined language constructs makes cut-and-paste (which is what patterns are) necessary, and you are right that that is bad. So, point taken there. I just don't see that as a 2-5x differential in apparent code complexity. I'd gauge that more as a 10% differential, of course varying depending on the particulars of the project.
If dynamic typing is desired in your application, hacking that into Java isn't the best use of your time. Dead stop there. If you want/need dynamic types, move off Java.
Still, a 500k application need not be overwhelming, if it was designed "well" in the first place.
So, was there something keeping this from being modularly designed?
I agree that code size or bloat is an issue, but what is being discussed is really compression. Think of a picture or a movie: each has some discrete values that are displayed. You could create a file for each, describing all the attributes of each element. This then properly captures the metadata, such as shapes and faces. What people have found over the years is that the size of these files is large. In order to reduce the size, you can do one of two things: 1) Reduce the number of attributes saved (resolution, depth, etc). 2) Enable some compression method. Compression methods typically look for patterns, but there are other types as well. Sometimes you do both things at once. Can you take three different pictures and reduce all to a single bit and still tell the differences between the original images? Nope, not possible.
You can think of lines of code as the same thing. Yes, using macros or another language might reduce the lines of code, but it probably does not reduce the complexity. Not unless you are also reducing the functionality (as many scripty/webby languages tend to do).
Suppose you have two languages X and W. X has 20 tags and W has 100 tags (instead of tags, I could say functions or operations or syntax or…). You can presume that something written in W will be smaller than something written in X, particularly if W has all the tags of X. Does this make the program less complex? Not really. Instead of the developer having to learn 20 things, she now has to learn 100, and that does not include knowing anything about the program itself, i.e. what it is supposed to do.
If you say the previous example does not apply, then let's try this. Suppose you have languages X and Y. X has 20 tags and Y has 20 tags but lets you use macros. You can presume that something written in Y is smaller than something in X. But is it less complex to understand? Once again, macros are just adding more tags to language Y.
To make matters worse, when you start to combine multiple languages, especially if you count XML as a language (it is really a meta language), then not only does the developer have to learn all the used parts of each language used, but has to learn how to use one over another and the complexities of moving from one to another.
Am I Java Developer? Definitely. Is Java a perfect language for every situation? Definitely Not!! Not even in Most Situations. The same could be said for any language. Any language is going to have issues (proponents, detractors, snobs, and blind followers).
So I said I agreed that code bloat is an issue, yet I went and proved that solving it by using a different language (or yet another language) does not solve the problem. I think the real problem is complexity. The more functionality you have, the more complex it becomes. If you have 500K lines that are easy to read (which your game's apparently are not), then cutting them to 150K lines probably does not make the system any less complex. Not unless you copied/pasted each line of code 3 times.
Lastly, developers should stop looking for the “silver bullet” of development. There is no such thing. The next language is not going to solve your problems, nor is the next design book (GoF = GOOFS?), nor the next management methodology (agile…make me walk instead of run).
Writing and maintaining a multi-million-line Java program is easy enough, if you manage to split it up into a few hundred sub-projects (libraries) that each have a very small "surface area" (public interfaces) that hardly ever changes.
Of course, defining these interfaces is hard, and keeping them non-"leaky" is even harder. (In a sense, all the easy and obvious interface abstractions have already been defined and are now part of the runtime environment, or even the operating system -- hardly anybody has to implement their own network stack, file system, or process scheduler anymore.)
There are many factors in why a specific language is picked for a software project; my interpretation of what you're saying is that the size of the resulting code base is not given enough priority in the decision-making process.
I understand that the blog is a rant and so is more about emotion than rational analysis, but I do think it would be nice to have a follow-up where you look at the specifics of how Java bloat is eliminated by languages with different features.
When the company that I work for did this we actually ended up creating our own language with the feature list we wanted. Similar to you, one of our desires was to have it work on the JVM.
This language (called Cal) has now been open sourced and can be found here:
http://openquark.org/
I've hit a few painful things over the years, but more importantly I saw the light ages ago when I got to work on an incredibly elegant bit of code. The language has some effect, but more often it is using brute force to pound out your solution that produces fat and ugly code.
Bloat and artificial complexity don't have to be problems. They often stem from the project's organization structure and/or its processes. Even a single developer can cause a problem. If you had been more disciplined, I bet your 500,000 lines could have easily been implemented in half, if not a quarter of the size. I've been ranting about that for years, but as you say nobody takes size seriously as an issue.
That said, I do find Java to be much fatter than languages like Perl or C. Some days it feels so close to VB that it makes me shiver.
Paul.
Boo. OK, it's CLR, not JVM, but at least you get JVM support with IKVM. And Boo should be tried once, just for its logo if not for anything else.
(Oh, completely agree on bloat. Sure.)
Dare I mention Fortress?
I'm a mobile developer so I feel your pain. Code base size is a real problem for us... as every byte we put out in code is a byte less that we can use for heap memory. Systems are getting easier, but it used to be a real problem. Now just imagine having to write a Java program in the least number of lines possible... and you start to see the problem with mobile dev. :-)
You know, Steve,
you kinda fail to explain why this is a problem in the larger scheme of things and how it fits into that scheme. For example, in the old days, car mechanics could take apart and fix an entire car by themselves. Same for television amateurs. But these days, the amount of computer technology and integration in such machines makes them into mere bystanders. You sound a bit like an old television amateur who bemoans the amount of complexity present day's devices represent, while not talking about the greater picture and at the same time quite enjoying the extra functionality or performance his device offers him.
The only thing that can beat complexity is simplicity; does expressiveness elicit simplicity or is it just a way to shrink code? You do not make it clear. If you can shrink your code down to even 20%, just wait a couple of years, and you'll be back to where you started. There's a basic tendency in human culture that bespeaks of this motion, and it is the desire for more and better. But in principle, size needs to be no enemy; the only requirement is that the amount of information you have to deal with at any level, is kept at a certain minimum. That is why OS X is so praised, and why Windows will for a long time be more popular than desktop Linux: the amount of options open to you at any point in time is so very limited and that is what makes it easy to use. The same holds for software. Compare the man page for 'rpm' with that of 'dpkg' and you'll see what I mean. Abstraction, small interfaces, limited options, modularisation, hierarchy, layering, you name it.
I am not speaking as a senior programmer, but as one of target audience, not yet familiar with scripting languages. But I also have a bit of a mental block regarding complexity: I cannot hold as many pieces of information 'in vision' and have a bit of a harder time digesting information than most people. Because of that, I'm perhaps a bit better qualified to speak of such things. And I will tell you this: most people - most programmers - do not appreciate the difficulty people can have with the digestion and navigation of information, and this applies to everyone. Introducing simplicity at every step and every boundary and every interface, and limited, self-contained areas of relevance for any given viewpoint, is the most important thing in making complex systems simple and manageable.
If you have a system of 7 layers, each of which has seven more layers, and so on, into infinity, will it be hard to manage? No. You can easily understand the working of the system at any level from any viewpoint, and the levels/areas you cannot see are irrelevant to the understanding of that part of the system you can see at any time. You will need more people to do the management of course, and there may be more to manage than if you had not been so vigorous in modularizing, but it will be a lot easier to do.
Maybe my lack of experience with large projects is speaking through, but I just want to make a general point.
I said as much here recently:
http://beust.com/weblog/archives/000462.html
i.e., big code bases have a lot more to do with maintenance than with your pet language's typing features
But now I see Bob Lee responding to my non-existent comment, so looks like it got pulled. Oh well.
@Udi and Tom:
I'd conjecture that in a way, modularity/limited visibility can make the code "size" problem worse, for a very simple reason: You don't know what's in there that you aren't looking at.
It's a lot easier to make a small code base "modular", in the sense of orthogonality. If you suddenly realize that you should be abstracting out, say, your text-formatting routines into a text-formatting object or library, you can do that. If the project is small enough, you can probably just scan through the thing visually and find all the text-formatting code. Or you can go tell the three other guys who work on the code to start using your new library.
When a project has millions of lines of code - you just can't do that. Nobody knows where to find all the routines that use text formatting. Sure, search capabilities are helpful, but they're not all-powerful.
This is a bigger problem with large dev teams, where there are developers you probably don't even know exist who are, at this very minute, writing something that formats text. But even one-man teams can suffer from it, because you're not always going to remember something you wrote a few years back.
I had the recent joy of looking at code I wrote 15 years ago. The first thing I noticed was that I made extensive use of linked lists, but I repeated a lot of the code. At the time, it was intentional; function calls were too expensive. But now we have fast computers, so I resolved to write myself a linked-list helper library.
A little further into the project, I was looking through some other code, and - what do you know! - found a linked-list helper library. It seemed like just what I needed. And that's when I realized - it ought to, because *I* wrote that. I just hadn't looked at it in 15 years, and neither had anyone else. So if I can't even remember to use my own abstractions, how can I expect the rest of the company to?
There's an anti-pattern I keep seeing, but haven't found a great name for yet. I call it "stratification".
Basically, the more you write high-level libraries that wrap lower-level ones, the less you use the low-level libraries. Thus, the less you understand them. At some point, you forget they even exist. At that point, you will - inevitably - write an even-higher-level library that recreates the lower-level functionality ON TOP OF the high-level library.
And if you've ever scanned in a document so you could use fax software to transmit it to a remote bank of faxmodems which will make a voice call over an IP-based, packet-switched telephone network only to reach another faxmodem in the same data center which will take the fax, convert it to a PDF and send it via e-mail to your recipient, who reads it via a web-based mail interface that uses JavaScript to implement offline mail reading functionality... you know what I'm talking about.
Seriously, am I the only one who wants to have a go at this game, bugs or no bugs?
Come on, stick it back up! Post a link. Post a linnnnnk.
It's in Java, it's not like it's going to wipe my hard drive. Well, maybe, but that'd be damn impressive.
I don't see how you prove your main point, that code size is an issue. You spend a lot of time saying it is, but offer up little in the way of proof or examples. (EG: what was the bad thing that happened to you?) It's not hard to say "code size is a problem". It's hard to say why it's a problem and give examples of 1) code that is too big/bloated and 2) code that does the same thing more succinctly.
Simple code-size reduction is not a silver-bullet. C and C++ are capable of very compact code that is almost as hard to read as Perl code. (see the ioccc - http://www.ioccc.org/ )
What I think you're getting at is code size reduction while maintaining readability. An altogether different beast and one that requires good design as much as language choice.
None of that should be taken as disagreement. Huge code-bases are beasts and have their problems. I like the idea of what you're attempting to show. I just think you've failed to adequately prove your point.
Tangentially, I wonder how much reduction is possible. This drive to write smaller code sounds very similar to the ideas behind the infamous programmer-induced rewrite. Namely, the "I can do it better!" syndrome. A noble idea, and not always misplaced. But rewriting code because the original code is ugly/too big isn't a panacea either. Crufty, old code contains a lot of knowledge about the problem being solved, earned over many years (through many nasty bugs and hours of debugging them) and encoded in usually inelegant ways. Tossing that and rewriting from scratch means you have to go through that learning process again.
There's a balance between good and good enough and simply assuming any code base over some number of lines of code is bad is near-sighted to say the least. Large problems sometimes need large solutions.
The simplistic solution, breaking the program into smaller programs that work together to solve the larger problem, only masks the problem and pushes it to someone else's plate (the poor schmuck that has to get the programs to work together correctly).
+1 Scala. Steve, could you please comment on your thoughts about Scala compared to Mozilla Rhino?
Thanks.
Amen! As a (currently) Java programmer, this frustration is what's driving me towards my own JVM language. Luckily, it has no due dates or constituency, so I can be very, very "patient" with it.
It boggles my mind that bigger is seen as better -- maybe what we need is APL for the JVM? :-)
+1 Scala. My guess is that Stevey just prefers dynamic typing. But there's not much you'll need to do in practice that Scala's type system can't express. BTW, The Scala book is now out in PDF.
Great post, I agree in principle with the code bloat argument. However, I notice Steve doesn't count the lines of code comprising the code base of the OS, JVM, or the Java libraries he uses as part of the code bloat.
I believe this is because a stable and well-designed (or well ... designed in the case of the Java libraries) and well-tested component doesn't require the mental overhead of a poorly-designed component.
Now, it is true that these things tend to be services and infrastructure, not domain elements, and services tend to have much firmer interfaces over time than domain classes, which change as your understanding of the problem domain and customer changes. However, I suspect that in any application a good portion of the code are value objects and services, not domain code. Anything to make that portion smaller is almost certainly a good idea.
Perhaps that is the future: systems and infrastructure programming in C# or Java surrounding domain programming in your favorite scripting language, which is as mutable as domains tend to be.
So when are your tips for tech interviews coming up? I have one coming and could use some advice!
Could you explain why you think VMs are important and/or useful? I am curious about this.
I'm also going through a VM scripting language evaluation, but coming out with a different conclusion.
Coming from a Java background, Groovy is attractive in that it employs the same type system, scoping rules, library, etc. as Java. All my Java knowledge is portable into Groovy. It's a far more symbiotic fit with the VM than the other scripting alternatives. One can migrate from Java to Groovy over time, and mixing and matching Java and Groovy is no problem.
Not sure when you gave Groovy a whirl. If it was before 2007, I suggest you give it another look. There is now a well-run team that has produced 2 major releases this year. The latest Groovy 1.5.1 brings a lot of functional and performance improvements.
Arghh!
Have any of you making all of your wonderful observations ever bothered to learn any of the dynamic languages mentioned? Do you think it's possible there are things happening in those languages that you can't do in Java and therefore can't know and make a clear argument about?
.............................
I am surprised no one has mentioned the central tension: typing or no typing. I am leery of writing a very large system in a language with no typing (e.g., Ruby, Lisp, etc). A compiler can help catch so many simple mistakes, which are often the ones we make. Two questions, then:
1- Do you think strong typing is useful for code reliability (like I do)?
2- Do you think strong typing is inherently an order of magnitude more verbose?
'Cause I agree Java is verbose and I'm not happy about that.
Dan
@df
Surely you know of Stevey's opinions on static type systems!
Wyvern was a blast to play and I've missed it. Use whatever language or approach you have to, just hurry up and get it back out there!
... pretty please?
If curly braces are all it takes, couldn't you just do a readtable hack in Common Lisp that changes (car x) to {car x}?
Sigh.
+1 Scala
even if it's statically typed, the compression rate is very high (see the sketch after this list):
- Type inference => eliminates verbosity
- Pattern matching => eliminates the need for the Visitor DP and lets you write decent recursive data structures
- closures => no more Kingdoms of Nouns
- implicits => an "equivalent" of Haskell type classes
- operators => just an example in the pi-calculus library: spawn < p | q | r >
this is object oriented code, spawn is a method that returns an object, < is a method, p is a process, | is a method, q is a process, r is a process, > is a method.
- support for currying
- support for mixins
- it also supports syntactic "tricks" that help you write DSLs (like a.m(b) === a m b, à la Smalltalk).
- There's no code generation like in Ruby last time I checked (even if there's an obscure package where you can build syntactic terms at runtime)
- No loss in performance
- a good type checker
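To make that list concrete, here is a minimal, purely illustrative sketch (the Shape types are invented, not from any real code base) of three of those features -- type inference, pattern matching, and closures -- doing work that would take a Visitor hierarchy plus explicit loops in Java:

sealed trait Shape
case class Circle(radius: Double) extends Shape
case class Rect(w: Double, h: Double) extends Shape

object ShapeSketch {
  // Pattern matching replaces a Visitor hierarchy: no casts, no instanceof.
  def area(s: Shape): Double = s match {
    case Circle(r)  => math.Pi * r * r
    case Rect(w, h) => w * h
  }

  def main(args: Array[String]): Unit = {
    // Type inference: no declared types on the locals.
    val shapes = List(Circle(1.0), Rect(2.0, 3.0))
    // Closures: the pre-closures Java equivalent is an anonymous
    // inner class or an explicit accumulator loop.
    val total = shapes.map(area).sum
    println("total area: " + total)
  }
}

A dozen or so lines where the straight Java version needs an interface, two classes, a visitor, and a loop.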
I don't think that your opinion about code base size is necessarily a minority one. As an experienced Java programmer, I've watched over the years as developers have come along with their non-emacs fancy graphical IDEs and they've cheerfully generated all sorts of convoluted and unnecessary constructs in a few button clicks, with the result that functionally mundane components are thousands of lines long, and virtually incomprehensible.
Having said that, I still think that at its heart, if you can remove all of the more idiotic and misjudged frameworks and libraries, Java is still a good language and it is possible to produce elegant software using it. But, similarly to C++, it requires super-human discipline to do so.
I haven't used Java now for a long while: I simply don't require the functional richness and syntactic rigour that it provides. Whenever I want to bash out a few web pages or interact with a database table, I go straight to a scripting language, and it's done within a few minutes.
This is an interesting take on the subject:
http://dspace.mit.edu/bitstream/1721.1/6064/2/AIM-986.pdf
I'm not sure if Lisp still has any chance of becoming relevant again, but I think that the idea of having many layers of software, operating at different levels of abstraction and possible to understand separately, is very interesting. TeX is structured this way, and it certainly is big and complex.
Try Scala. It's compatible enough with Java that you wouldn't even have to start from scratch. You could just pick a section to rewrite in Scala and slowly convert.
"Surely you know of Stevey's opinions on static type systems!"
No, I don't. Not everyone who reads a blog post has read the entire blog or knows the person.
Anyway, no matter what his opinions, it seems centrally tied to this issue -- Java, code bloat -- although I'm not sure what the right position is.
It seems to me static typing helps large projects .. and causes code bloat.
Dan
Man, the people you work with like Javascript and not Ruby? Not that i'm some Javascript hater, but i didn't expect that!
Funny story. I "got religion" about code base size after a game i wrote grew to unmanageable levels, too! (Much smaller, but it was written in Hypercard(!) rather than Java, so there weren't as many tools to manage the size. It was one of my first projects, too!)
Ever since that point i sort of liked the dynamically typed, elegant, and powerful languages--Ruby is pretty much my spec language. For a long time the "real" language (Java/C/etc) people would sort of turn their noses up at that sort of thing (including Javascript, which i also liked--though not very much) and i just sort of nodded along and shrugged it off. I thought "Well, okay then. Maybe i'm just not interested in 'real' languages." Interestingly, i couldn't stand Perl either.
Then i found this blog and was like "Oh, so i'm not the only one who doesn't like 'real' languages!"
This post, here, made me think of my old game and the like.
Incidentally:
"I mean, they 'made sense' at some superficial level when I read the marketing brochures, but now that I've written a few interpreters and have dug into native-code compilers, they make a lot more sense. It's another rant as to why, unfortunately."
I hope you'll write that some day! I'd be interested--i'm much more on the "made sense at some superficial level" understanding, so getting a shortcut to a more complete understanding would be nice...
I can't believe I'm the _11th_ person to come along and suggest considering Scala. Now I'm beginning to worry that I'm part of a fad. Either that, or a group of people who understand language design and are seriously impressed with the design of Scala. For what it's worth, I am not jumping in just because everyone else is: I knew I was going to add a comment suggesting Scala even before I finished reading the post (unless Scala was what you picked).
One important suggestion. Regardless of what language you pick, make use of its Java interoperability features to _convert_ your code base little by little rather than rewriting from scratch. Sure, you can refactor large chunks to make use of the more powerful features of the new language -- that's the whole point of the exercise. But I'm sure you've read Joel's article on rewriting from scratch (http://www.joelonsoftware.com/articles/fog0000000069.html) and it's one opinion of his I think is borne out by the facts.
I'm not young, but I am new to OOP and have experienced cycles of frustration and confusion with eclipse (and its complexity) as well as a handful of other IDEs, Visual Studio notwithstanding.
The one question that wasn't answered by this blog is, "What does language/framework/IDE-specific bloat look like (examples?) and how do the solutions you propose solve this problem?" But to give you credit, I believe you forewarned readers that this question wouldn't be answered.
Wow - using Spring is considered bloat? I would say that you haven't really grokked Spring. IMHO, managing large code bases becomes really challenging when code is not 1) testable 2) decoupled. Spring is a fantastic technology for enabling both of these. If you start without Spring, adding it in usually means writing negative lines of code.
Thanks for the rant. It is a timely reminder that people who do research on programming languages and those who construct them in 'the wild' must revisit the 'size' issue, and do so time and again.
I have put 'size' in quotes because I don't think it is sheer size. If it were, we'd be programming in APL. (Yes, I am old enough to know that it existed.) I conjecture you mean some form of 'complexity' as in 'N pieces are related to N pieces in my code base, and I can't recall how they are related when I maintain my code base.' And so on. I wish you took some more time nailing down what it really is about 'size' that bothers you. In other words, blog again and describe the symptoms, describe what is bad about a big code base.
Two comments: We (PLT Scheme) are confronted with a similar growth of our code base (mzscheme, drscheme, various tools, user contributions) but somehow sheer size hasn't done 'it' to us. We have no desire to rewrite it in JavaScript ;-)
I am one of the few serious mainstream PL researchers who has stood up to 'type research' (as in types are all there is to a PL) for two decades now. Having said that, I still think that a knee-jerk reaction to static type systems is improper. When you maintain untyped code, it does become difficult to tell what the three arguments to your method at line 123,765 represent -- closures, objects, integers -- and what 'invariants' they satisfy. I have watched a programmer struggle with just that issue, and it is a programmer whom I would trust with my life (as far as programming is concerned). If the types are written down (and checked and sound), you can save yourself some 15 minutes with every method you touch. Sure it adds a few token pieces of type info to the method header, but what's that compared to 20,000 methods touched times 1 - 15 minutes of 'manual type recovery.' And don't tell me about type inference for 'scripting languages'; I have also spent four dissertations and 20 years on this, plus a couple of years of trying to analyze Python, to know better. (It's kind of acceptable if the language is designed for it from scratch.)
-- Matthias Felleisen
I used to think there was something wrong with me, because I never, ever, ever managed to use any IDE without giving up in disgust hours later. I plain could not stand them. To me they seemed like barriers to thinking rather than assistants.
Good post. Jeff Atwood posted a "sorta" conflicting view on the same day no less! Both make for interesting reads.
http://www.codinghorror.com/blog/archives/001022.html
"Yes, codebase size can be a problem. But reducing size by decreasing readability is just as big of a problem."
Yes, but if he wanted to shorten his code and make it less readable he could just use two letter names for classes, methods and variables.
Language choice is very important. Java forces you to insert so much visual noise into your code that you have to read everything twice before you understand what it does.
It's been said already, but I'll have to say it again.
Scala FTW
If you want to massively reduce the size of your Java code base and massively increase its maintainability then you need a language that supports functional programming. There are other benefits as well, e.g. concurrency is much easier.
As long as you're stuck with the JVM, your only technically-decent option is Scala but it has few users and little support.
If you can afford to ditch the JVM then you might like to look at more popular languages than Scala, like OCaml, Haskell, Erlang and F#.
I agree with your opinion about the code size.
But there are two types of code. First the code you write by hand. And then the code you can generate. I'm speaking of power plant simulators. You have a subset of well tested functions and build the simulator with data from the real plant.
Oh, and in my work I still use tcl/tk. Programs read and useable even after ten years.
Have a look at tclblend or jacl.
Regards
rene
Have you read http://www.equi4.com/moam/fourth ?
Groovy for me. I guess it has had a few rough spots in its past but now seems to have more momentum than the other JVM languages - at least in my customer base which is historically Java focused and predominantly pro-agile.
Have you looked at QBASIC?
I have been a professional QB programmer for many years, and can assure you that you have overlooked this gem.
Its use of SUBs is really quite fantastic.
Obligatory link: http://en.wikipedia.org/wiki/QBasic
Reminds me of Paul Graham's http://www.paulgraham.com/power.html. Might be something to this code size problem after all :). Enjoyed your article.
Steve, I need no convincing here -- Java has its moments, but terseness is not one of them, while ES4 is shaping up to be pretty nice.
That said, I think your argument would probably be more compelling to the general audience (ie, people who haven't been exposed to ES4, or other languages with similar features) if you provided some examples of the sort of code compression that's possible.
I first met Ruby after reading your NBL post. On cue I recoiled at the lack of curly braces (perhaps subconsciously primed by your post), at the def keyword, which doesn't have an equivalent in Java, at variables starting with @ symbols, and at the pipes used to delimit block parameters because they're not a nice symmetrical pair. From where I sat on my high horse, I thought the lack of regularity (nested curlies) would forever leave a bad taste in the mouth.
Boy was I wrong.
A few days into playing with Ruby and I forgot all about curly braces. Now I wince whenever I see those redundant parentheses tacked on the end of every parameterless Java method call, not to mention the receding tails of closing curlies at the end of every method and class definition.
So you're right, the populace has been conditioned with curly braces, but I don't think it's a show-stopper for adoption of sensible languages. The trance might be real, but it doesn't take much to snap out of it.
I walked a similar path, found Scala, and have been in love with it for more than a year.
I've written a number of 100K+ LoC code bases including the Mesa spreadsheet (for NextStep, OS/2, and OS X) and the Integer spreadsheet (written in Java.) I understand the code bloat problem and have experienced it first hand. There's an order-of-magnitude productivity decrease when the 1 or 2 key developers in a project can't keep the code base in their head anymore. Your head must be darned large to keep a 500K code base in it.
In 1999-2001, I built a Java-based web framework that sidestepped the whole code bloat issue by allowing developers to define common web development tasks using much higher level semantics, and the high level semantics were compiled down into Java code. A 3,000 line object description file would typically bloat to 200,000 lines of actual Java code. The nice thing is that the developer didn't need to maintain the Java code... it was byte-code for most purposes.
In 2004, I started doing Ruby/Rails work. I wrote a bunch of commercial applications in Rails. I found that, even with reasonable test coverage, changing up code bases of 10K LoC became challenging. Additionally, because of the weak performance aspects of Ruby and lack of good Unicode support, it was not useful for my projects.
In 2006, I needed a better answer than Ruby/Rails was offering. I cast about and asked many of the same questions that you've asked in your post.
I discovered Scala. It was love at first sight, and my passion for Scala the language and the Scala community has intensified over the last year.
I started working on a web framework for Scala: lift. lift's functionality is similar to Rails (a little more here and a little less there) and lift's code size is about half the code size of Rails. More interestingly, the lift feature set has grown significantly over the last 6 months, but the code size has remained fairly constant.
You may ask how that bit of paradox has happened. It's like this... as my mind thinks more and more "functional", I am able to reduce code size. As I go through and work on older parts of the code that look and feel like Java programs, I'm able to re-implement pieces more efficiently such that the core meaning of the code is the most prominent part of the given code block rather than the "setup" of control structures (for/if).
While Scala does not have macros, with its combination of an advanced type system, traits (interfaces with code attached), and implicits, one can define amazing blocks of composition. I find that I don't copy/paste Scala code.
A number of other folks have suggested looking at Scala. I'll one-up them... I'll happily spend a couple of days with you teaching you Scala and working on how to approach migrating your game to Scala. Just ping me.
Thanks,
David
I wrote a fairly large game in python.
To keep the code size small, every couple of months I would try to make it smaller.
I would bring it down from 12,000 or so lines to 7,000. Then I'd keep working, and repeat the process.
DRY is a good one to follow.
Also data driven programming. Putting complex things into data files, kind of like DSLs.
Separating out reusable libraries can help. Especially if you can add the functionality to a library someone else maintains.
Putting things into separate libraries also forces you to decouple things. Making the complexity smaller, because you can look at smaller pieces separately.
Look at your code and figure out how to make it smaller, and simpler.
Remove crappy features that take up space.
Stop using silly things like getters, and setters, and creating classes for all types of data.
Use data structures like lists and dictionaries instead of classes. This can reduce code size amazingly (see the sketch after these tips).
How you structure events in games can mean that you get emergent behavior. This can save you *lots* of code. Think about the complexity of interactions and figure out ways to get that to scale with minimal code.
Keeping code small is a process, and a skill. If you keep practicing to keep your code simple you learn techniques to do it.
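To illustrate the data-over-classes tip above -- sketched here in Scala rather than the Python used in this comment, just to keep the examples in this thread in one language, and with invented monster data -- adding a new entity becomes a data change rather than a new class:

object DataDrivenSketch {
  // In a real game this table would be loaded from a data file,
  // not written in source code.
  val monsters: Map[String, Map[String, Int]] = Map(
    "goblin" -> Map("hp" -> 7,  "attack" -> 2),
    "dragon" -> Map("hp" -> 90, "attack" -> 15)
  )

  def describe(name: String): String = monsters.get(name) match {
    case Some(stats) => name + " has " + stats("hp") + " hp and " +
                        stats("attack") + " attack"
    case None        => "unknown monster: " + name
  }

  def main(args: Array[String]): Unit = {
    println(describe("goblin"))   // goblin has 7 hp and 2 attack
    println(describe("balrog"))   // unknown monster: balrog
  }
}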
-- ps. check out pygame.org, the simple game library for Python based on the Simple DirectMedia Layer (SDL). That's what I work on, and have fun with. Making games should be about fun.
Steve, guys, I am looking for a piece of advice on a language to learn. I am torn between Haskell and OCaml, no surprise here. Doesn't necessarily need to be practical, it's more about deep culture and good taste. I don't care how steep the learning curve is, just about the ultimate intellectual payout. (I am a professional programmer, not a newbie).
I hope Stevey is not reading the comments. It can't be good for his blood pressure.
Oh, I'm reading them. Good stuff.
I'll comment on a few themes I've noticed.
Scala folks and Groovy folks: you're not big enough yet. For something as big as my game, I want a proven mainstream language. I picked Rhino as a complicated multidimensional compromise; the actual reasons are a full blog post. But the short answer is "you're not big enough." Sorry.
A slightly different answer for Scala, OCaml, Haskell and other static languages: my game is a dynamic system. I was astonished to find, circa 1999, that I had effectively implemented JavaScript's prototype system in Java (with no syntax). For dynamic systems you need dynamic languages. If I were writing a compiler, I'd use an H-M language (probably Scala). This game is a different beast; it's one of the most "living" systems in the world (notwithstanding its 1-year hibernation). You could do it in an H-M system, but it would be nearly as big as the Java system.
Folks wondering if dynamic typing can work in "big" systems: the answer is that they don't get all that big, because of the dynamic typing. Unit testing becomes crucial, but you should do that anyway, regardless of language. Some good discussion of the reasons for dynamic languages being so compressible can be found in http://www.amazon.com/o/asin/0262220695
Folks who wonder about how I could (re)write such a giant system without any "static" type annotations: http://tinyurl.com/26n2zq
Java folks who simply disagree with me: to paraphrase Nikita Khrushchev, "You will bury you."
Folks begging for Wyvern to come back: it'll happen before the end of this month, promise.
Folks advocating JRuby: you're really twisting my arm. I miss Ruby, and will always be jealous of JRuby, even after ES4 comes around. *sigh*
+ 1 scala
I discovered it about 3 months ago and the more I know it, the more I love it.
If you come from a Java world you tend to start coding a la Java. Then you start to have some deeper understanding of the language, like thinking more functionally, and you naturally tend to write shorter programs. And not only are your programs shorter, they are also more elegant and more readable.
With some other languages, making your programs shorter can make them more cryptic. That's not my experience with Scala.
I have a Java background, have done a lot of Python, some C and Ruby, and played happily with Lisp. Scala combines the nourishment for a joyful mind with the tools to get the job done.
I don't want to start a flame-war (famous first words)... but (famous conjunction)...
I did a fair amount of meta-programming in Ruby. I even gave some presentations on Domain Specific Languages and have implemented a few in my time.
Scala's traits and type system give me the same kind of power that I have with Ruby's meta-programming facilities, except that the composition of the classes happens at compile time rather than run time. But the ability to use a trait the same way that a Rails developer uses acts_as_xxx is there in Scala.
The ability to dynamically compose dispatch tables exists in Scala as well. Scala's Partial Functions (blocks of pattern matching) can be arbitrarily composed at compile time or run time. Scala's syntax is very flexible such that:
myobj ! ('my_message, param1, param2)
is a legal construct and would dispatch a message to myobj. You can use the Scala Actors library for asynchronous, Erlang style, dynamically dispatched components (yes, you can change the dispatch table on a call-by-call basis). If you want lighter weight synchronous components, you can build those as well.
Put another way, you can have dynamic, prototype-style components in Scala without sacrificing type safety.
I can't speak to your criteria of a language being big enough, but I do posit that Scala has the appropriate technical resources behind it. I agree with your assessment of Groovy. I do think, however, that you may not have a complete picture of what's possible with Scala.
You may also be interested in the JVM Languages group.
Thanks (again),
David
Lua is big in the gaming world, and I see that an interface exists to Java:
Lua Java
I am always amazed at how much more productive a good dynamic language makes a programmer... still, the jury is out on large systems where static type info is supposed to give you the edge. But these are exactly the dogs that need more dynamic code ;)
Very interesting post. The idea that a person can maintain a 500KLOC program single-handedly is pretty far-fetched. I am currently working on a code base that size, and our team consists of something like forty top-rank software engineers. Note: that's lines of Common Lisp, which is of course a lot less verbose than Java, so we are not comparing apples to apples exactly. And we use Lisp macros, which also helps save lines of code. Why so big? It's an airline reservation system and there are just a whole lot of things that it has to do.
The Lisp machine software (what Symbolics called Genera) could be looked at as one huge code base. Of course it was split up into pieces, which were split up into pieces, so it's hard to draw a line around some set of functions and methods and call it a "program". There must have been many millions of lines of code. We had a large team of awesome hackers developing and maintaining it.
I think having a Common Lisp running on the JVM, or the CLR/DLR in .Net, would make a lot of sense. What we have now is Armed Bear Common Lisp, based on the JVM. (See my survey paper at http://common-lisp.net/~dlw/LispSurvey.html.) I've never used it. I have seen some benchmarks that make it look rather slow, although that might not matter if you're spending all your computrons in the libraries, which is very possible. On the CLR/DLR front, there was an attempt called IronLisp, but they backed off because it was too hard and wrote IronScheme. Again, I know very little about it; you might want to track it down. I see the comment from "Igor" about sisc-scheme; I had not known about that. I have heard about Clojure but don't know much about it as I have been focusing on Common Lisp.
I don't think refactoring makes the size of the code greater. Done properly, it has the opposite effect: you do something once, rather than over and over.
Regarding not being able to fit your program into Eclipse: have you tried IntelliJ? It scales better. At BEA, where I used to work on WebLogic Server (a large Java program), everyone used IntelliJ. Attempts to use Eclipse seemed to founder on the size of the WebLogic Server code base. (Sorry, I don't know the number of lines of code.)
-- Dan Weinreb, co-founder of Symbolics
Why did you obfuscate your code from yourself?
The fundamental thing you are missing is readability. Martin would suggest breaking up a large method into smaller ones because it makes the code more readable.
Consider this:
String uc = "blah" or
String userComment = "blah".
Which is easier to read? Which is longer? Your programming style sacrifices readability to every other concern, including size. So you obfuscated yourself. Size doesn't matter. What matters is how long it takes you to understand a hunk of code. Usually, more lines means longer to understand, except for self-obfuscators such as yourself.
You are a self-deprecating and self-obfuscating programmer. You are in a bad funk. I would recommend health care as your next career. Challenging but not too dynamic.
Steve wrote that "folks who wonder about how I could (re)write such a giant system without any "static" type annotations: http://tinyurl.com/26n2zq".
As a co-creator of gradual transformations from dynamically typed to statically typed languages (see DLS 2006), I am perfectly aware of this idea. The problem is that Ecmascript's implementation of the idea is broken, unsound. So whatever advantages you would get from a sound system, such as ML's or even Java's, you won't get from ES 4. (The most you get is C's notion of types.)
Nobody has tested this notion on a big scale. At PLT, we have ported some 10,000 lines to a sound system. It works like a charm, and I firmly believe that soundness was important. When you get a type bug at run time, it's not in the typed part of your code. With Ecmascript, this is not true. You will need to search the whole thing.
;; ---
One more comment on size: my experience in teaching a programming course with free choice of language is that Ruby, Perl, Python beat Java and C# by a factor of 3, C and C++ projects usually fail (due to the lack of experience), and Scheme projects are always shorter and more complete than all others. This may or may not reflect the personal bias of the instructor, though.
-- Matthias Felleisen
"... I'm a bit skeptical about building a large production system in a weakly-typed language, for lots of reasons: refactoring tool support..."
"If you feel I'm wrong, please don't mail me with a long rant; publish your thoughts as a blog somewhere!"
Or 'somewhen' apparently.
I just think that language flexibility has more impact on code bloat than static typing does.
OK, typed languages carry some verbosity, but C# 3.0 and F#, while still static, have very aggressive type inference mechanisms.
Comparing C#, especially C# 3.0, with Java and putting them in the same group is just not fair. The new C# 3.0 is closer to Ruby in verbosity, and it is a complete language (you can write DX games, use pointers, build Windows applications, write performant math libraries...).
About language simplicity vs code simplicity, I absolutely go for code simplicity. Having a complex language (like C#) brings huge benefits to code simplicity, and you have to learn a language only once, whereas simplifying the language means you will deal with cumbersome techniques and libraries that differ on each project.
Thanks for writing another blog on a very important issue. Actually, I don't think it's about code base at all.
It is a language issue. There is a line drawn in the sand. You're trying, post after post, to find good arguments as to why scripting languages in general - and js in particular - are better suited for most, if not all, projects.
The size angle is very good, and was a reasonably good choice to get people thinking.
The problem seems to be that the things I take for granted when you bring up this kind of discussion (that you have modularized your code as well as you possibly can, and that you have the ability to actually compare the experience of coding very large projects in Java and JavaScript) are completely lost on most people. (Not the Scala people, though. :) I'll definitely try it out over Agnosticmas.)
What I would like to see more of is people commenting on language issues who have at least two years of experience with each language and a couple of large projects under their belt.
I'm mostly referring to having experience with JavaScript here, btw :)
The 400-pound gorilla in the room is that it is actually kind of hard to describe why it is better to code *anything* in js (or, really, any modern scripting language at all) instead of Java/C#.
The codebase issue is good, but as many people pointed out we need firmer evidence, smoking guns, silver bullets and a lot of patience :)
It is a big issue. I'm not in the honeymoon phase any longer (as far as I know), but I do my laps of pain in Eclipse for customers every month, and as soon as I can get off to doing basically the same thing, in just Kate, or vi, in (incidentally) fewer lines of code, I'm as happy as a (grease)monkey.
We need a plan.
Measuring programming progress by lines of code is like measuring aircraft building progress by weight.
- Bill Gates
If your use of design patterns makes your code base bigger, it's because you're misusing the pattern, or you didn't need it in the first place. I refactored a bunch of cut-and-paste code a couple of years ago using the template method pattern (without previously knowing of this pattern), and the resultant code base was absolutely smaller.
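For readers who haven't met it: here is a minimal, hypothetical sketch in Java of the kind of refactoring described above. Several copy-pasted routines that differ in only one step collapse into a template method plus small subclasses; all names are made up for illustration.

// The template method: the shared skeleton lives in exactly one place.
abstract class ReportGenerator {
    public final String generate(String data) {
        StringBuilder out = new StringBuilder();
        out.append(header());
        out.append(formatBody(data));   // the only step that varies
        out.append(footer());
        return out.toString();
    }

    protected String header() { return "=== report ===\n"; }
    protected String footer() { return "=== end ===\n"; }

    protected abstract String formatBody(String data);
}

class PlainReport extends ReportGenerator {
    protected String formatBody(String data) { return data + "\n"; }
}

class ShoutingReport extends ReportGenerator {
    protected String formatBody(String data) { return data.toUpperCase() + "\n"; }
}

public class TemplateMethodSketch {
    public static void main(String[] args) {
        System.out.print(new PlainReport().generate("quarterly numbers"));
        System.out.print(new ShoutingReport().generate("quarterly numbers"));
    }
}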
My favorite language is Forth, and it addresses the problem of code bloat quite aggressively. Forth is not only terse, allowing you to compress code in every way you can imagine, but the essence of this is that the compiler itself is extensible. Forth was strong in early MUD game development, so it fits that purpose.
BTW: I noted that one of the requirements you list for the Java replacement is "must scale for large programs". Hey, NOT! It must NOT scale for larger programs (large in the number of LoCs), because that will keep you disciplined, and you will experience the point where your program falls apart much sooner.
There was one comment which I can only second strongly: keep your program around 10kLoC. That should be some sort of upper limit - less is better, but more is awful. Think about the way your inner program logic works, and restart from scratch if you find too many verbose tasks.
My Forth system contains a GUI library, similar (I'd say superior) in functionality to e.g. Motif, but with just about 10kLoC of code. It's built directly on Xlib and uses mostly GDI on Windows (and OpenGL for the 3D canvas), and that's the next "anti-pattern": avoid unnecessary abstractions. If you interface with some external component, do it at the lowest level there, and do it right for what your task actually is. Interfaces impose some logic onto your program, and the higher-level the interface is, the more likely that logic is the wrong choice.
The downside of Forth is that unless your system of choice comes with the libraries you need, you either have to write them yourself, or you have to interface with foreign linkable libraries (e.g. written in C) that don't fit well with the logic of the main language. There's no abundance of public abstraction stuff as with Java, partly because Forth isn't that widely used, and partly because these abstractions are shunned anyway.
The other disadvantage of Forth is its strange syntax ;-).
Very nice post. You write well and you seem to like writing, and I suppose you code well and like coding. But it's this creative energy that is ultimately the source of your problems. Write less!
But writing less requires a basic change in your behavior, which is not as easy as loading the next silver bullet into the chamber.
Perhaps there will be benefits from a language change and other refactorings, but "features" must be considered as the primary cause of source code. Rip some of them out!
As good as this post of yours was, the followup on the results of your rewrite will be even more educational for us. Thanks.
I think this blog entry is quite possible an elaborate excuse to make a joke about a spitting zoo camel.
Here's the problem with Mr. Yegge's thesis. He identifies lines of code as a cause of complexity rather than what it really is -- an effect, a symptom.
The solution to complexity is not reducing the number of lines of code. Neither is taking an ice bath a solution to high fever. Nor does printing Finnegans Wake in a smaller font make that work a more manageable read.
After all, if lines of code were a *cause* -- not an effect -- of complexity, then Mr. Yegge should counsel everybody to rid their code base of all comments, make all variables public (eliminating accessors), and make heavy use of the ternary operator.
It's amazing how many people don't seem to understand what was said in the main entry.
To those who keep saying: "You need static typing to manage a large code base." The point is the other way around: "Static typing makes you try to manage a large code base."
I haven't been in the industry anything like long enough to be able to take an authoritative stand on this issue but from what I've seen of Ruby, JS, and other dynamic languages this seems to be the way to go. From my own experience with a small-medium sized C# code base at work the points about code size and static typing ring true. I've found myself several times wishing I could call a method based just on its name without resorting to cumbersome reflection apis. I've got a few days off here around Christmas and I plan to play around with Ruby and other dynamically typed languages. Maybe I'll even start a blog and post my experiences.
Oh, one more thing. The people making condescending comments about how he just needs more modular code and several sub-projects are simply making his point about code size stronger (more modular just moves the dirt around).
Brad:
> make all variables public (eliminating accessors),
Good point actually. Why isn't there an @accessible modifier on a member that automatically creates the accessors? It would save a lot of code without losing any functionality (a sketch of the idea follows at the end of this comment). Java is in dire need of a decent preprocessor.
> and make heavy use of the ternary operator.
Java semantics suffers from the distinction between expressions and statements; see the crazy contortions needed just to run the initializer of a final variable within a try.
Code size is a symptom, one of using a language with insufficient abstraction capabilities. This is just what the article says. (For an elaborate distinction between abstraction and mere compression, see the recent stuff on weblog.raganwald.com.)
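Here is a sketch of the @accessible idea mentioned above, as a user-defined annotation in Java. This is purely hypothetical: nothing in stock Java acts on the annotation, so generating the getter/setter pair would still require an annotation processor or preprocessor, which is exactly the gap being complained about.

import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// A hypothetical marker; some tool would have to generate the accessors.
@Retention(RetentionPolicy.SOURCE)
@Target(ElementType.FIELD)
@interface Accessible {}

public class AccessibleSketch {
    // What one would like to write...
    @Accessible private String userComment;

    // ...instead of hand-writing (or IDE-generating) this pair for every field.
    public String getUserComment() { return userComment; }
    public void setUserComment(String userComment) { this.userComment = userComment; }

    public static void main(String[] args) {
        AccessibleSketch s = new AccessibleSketch();
        s.setUserComment("blah");
        System.out.println(s.getUserComment());
    }
}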
Steve, the problem isn't the language, it's the programmer.
I cannot even begin to imagine how you got anywhere near 500,000 lines of code for any single-person project. For the past couple of years I've been working on a system in Java for a government agency here in the UK which does pretty sophisticated rules processing on XML documents. The codebase hit about 12,000 lines of code in the first release, including the hand-coded rules classes. Since then we have expanded the functionality, added a system which compiles rules from Schematron into Java classes and dynamically loads them and the codebase is up to around 15,000 lines of code.
The only way you can get to 500,000 lines of code is by simply not paying attention along the way. The way you dismiss patterns and other useful ways of thinking about programming gives me some idea of why you've hit this problem.
Get off your high horse about Java and take a good look at how you can develop your programming skills to keep you from getting into this mess on future projects.
But hey, I'm just an 'experienced programmer' and your post wasn't aimed at people like me. After all, god forbid that some of the experience I've gained along the way should have any bearing on writing software.
I almost always agree with you, which made today’s post doubly interesting. Thank you.
For the sake of argument, let’s assume that you have to write an unreasonable amount of Java code to get anything done. Let’s further assume that Rhino is exactly five times more expressive than Java is – that is, 20 lines of Rhino do exactly the same amount of work as 100 lines of Java.
Imagine that all your money is sitting in a big pile in your apartment. In pennies. Sucks, doesn’t it? Jars all over the house, full of pennies. Want to go buy fifty bucks worth of groceries? I hope you like lugging 28 pounds of pennies through the store. Everything you do with your money is slow and awful and stupid, and you will probably never even know how much money you actually have.
So you’re looking at this big mess of pennies, trying not to stand near it because you’re afraid the floor might give out, and you say, “God, what a mistake this was. I’ve got to convert all this to nickels!”
You’re right that nickels would make your life five times easier, but switching from pennies to nickels doesn’t actually attack the real problem, which is that all your money is sitting in a big pile in your apartment.
Switching gears: You say that you wrote 500,000 lines of code for your game, and that that’s way too many to maintain properly. So let’s say you switch to Rhino, and you knock it down to 125,000. Is the code really simpler now, or is it just shorter? It’s still expressing the same functionality and the same number of ideas, right?
Your game has way more than 500,000 lines of code. And those 500,000 lines of Java code are completely worthless without the code for the Java library and the code for the JVM and the code for the operating system that the JVM is running on and the code for all the god-knows-what that’s going on in your computer’s hardware. Oh, and your game is web-based, right? So let’s go ahead and count all the code that my browser uses to render it, as well as all the code for my operating system, and the code for both our TCP/IP stacks, and the code for all the networking hardware between here and there. I’m willing to bet that there are tens of millions of lines of code in play when your game runs, and that those same tens of millions of lines (minus the relatively few that you wrote) are brought to bear on “Hello World”. It just really seems like the size of your codebase can’t be your real problem.
It actually seems like you might be the problem. I’m willing to bet that if you had access to the JVM code or to the browser code, you’d be hacking away at that stuff, too, because that’s just your nature. God knows it’s mine.
Maybe if you want to make that codebase maintainable, the real answer is to break off some of the back-end pieces and stuff them down into a black box somewhere you won’t be tempted to tweak them. Get those pennies into the bank.
If you can reduce the number of lines of code from 500,000 to 125,000, I'm not sure whether it should be called "simpler" but I'm pretty confident that it would be easier to understand and therefore to maintain, assuming that it was nice clean code (and not one of those silly examples written with single-letter variables and no whitespace).
In your analogy, you talked about the problem of lugging 28 pounds of coins around. Well, in nickels, that's a lot fewer pounds. I'm not sure why you mentioned the 28 pounds at all if the "real problem" is that your money is in a big pile in your apartment. In nickels, it's a much smaller pile.
I'm not sure I see your point in counting up the "millions of lines of code" in the operating system and so forth. The question is how many lines you have to maintain, don't you think?
@Daniel - Yes, intelligently reducing the number of lines will make the code easier to understand, in exactly the same sense that switching from pennies to nickels will make $50 less painful to lug around.
The Big Point I was trying to make is that if you are willing to compromise with yourself and not be a perfectionist about everything, you can reduce the number of lines of code that you maintain without reducing the number of lines of code total. You just tear off a big chunk of working low-level code and say, "I am not going in here again." It feels horribly wrong when you do it, but only because it's a choice. (In reality, we do it all the time, unknowingly forcing tens of millions of lines of other people's code into horrible contortions, mostly because we just don't know what's down there.)
Seems like there should be a limit not dissimilar to the Dunbar number; once a codebase hits it, it should be broken up or rebuilt. My personal limit seems to be around 25K. Once a code base reaches this number, I start to notice a few things, like engineers specializing in portions of the codebase, refactorings taking longer to apply, and a general tendency not to do anything drastic.
Ass that I am, I replied to the comments without noticing that Matthias Felleisen had visited my uncultured backwater of a tech blog: an event equivalent to Gregory Chaitin stepping in and correcting the math in an XKCD comic.
Matthias, I hope you'll forgive me for having grossly oversimplified my case, and for having wasted so much space for the sake of rhetoric.
I understand, of course, that ES4's type system cannot be sound. My decision to use ES4 (and back it, and promote it) is deeply pragmatic. The resistance to Scheme in the industry is truly astounding. The ES4 designers (Schemers to the last, if the rumors are correct) are doing the best they can with JavaScript. I believe, admittedly without any concrete practical experience to back me up, that ES4's type system may still prove quite useful. Of course it won't have the guarantees of a sound type system, but I think it can still help catch errors, clarify code and offer some purchase for optimization. Time will tell.
As far as blogging with more concrete detail about code 'size' as a proxy for bigger problems, so to speak, well - it's something I've attempted many times over the years, and I've never succeeded. The problem is that small examples fail to convince, and large examples are too big to follow. I've resigned myself to talking around the edges of the problem.
And thank you for your wonderful books and your awesome work on PLT Scheme.
By the way: http://www.amazon.com/o/asin/0262561158 - that was a great joke. I laughed from the first page to the last. :)
Well, I think your "minority" opinion is shared by thousands of Microsoft developers who are saddled with a 90MLOC Windows code base.
90% of what these devs do is maintaining it, day to day, year to year. A statistic: an average Windows developer writes 1500 LOC (yes, that's LOC, in singles) per year, because most of their time is spent fixing bugs in what they already have.
So the code size has a nature-imposed maximum :-) - at some point the capabilities of the dev org top out just maintaining this maximum code base.
Anyway, I am not sure if fighting the code base size problem by replacing a less expressive language with more expressive one would work. The problem is that in a more expressive language there is higher potential to make a bug per line, and the bugs are worse.
I can write (prototype) a solution in JavaScript faster than I can do it in Java, and I can do it in Java faster than I can do it in C++ (let's say on Windows, Linux C++ development is stuck in a stone age, so it's unfair to compare Java and C++ there. I have a post on it: http://1-800-magic.blogspot.com/2007/12/c-development-on-linux.html).
But when it comes to shippable code, when you actually have to get the software to have reasonable startup time, performance, responsiveness, memory consumption, etc, etc, etc, you're suddenly hit by a bunch of problems that are extremely hard, sometimes impossible, to debug in a higher-level language.
So then you start to introduce kludges, and your beautiful algorithm written so clearly gets its own memory management (the very thing you were going to avoid in Java), and parts of it migrate into native code for speed... etc, etc, etc. And what used to be 2000 lines in a very expressive language now looks more like 5000 in a mix of languages, and it's as unreadable and hard to maintain as it was before :-(...
I wonder which language Steve will use after ES4 to rewrite the game. My guess is that he will write a JMud DSL )))
Thanks for the post Steve.
The complexity problem you describe is an area where service orientation would shine. It can simplify the coupling between subsystems and provide huge gains in maintainability and reusability.
The issue of duplicated code and coding patterns can be solved via a domain specific language. DSLs can eliminate the bad code created by copy & paste style coding. Just surface repetitive elements of the architecture into a meta language oriented toward the problem space.
I believe that is the best way to combat code bloat & complexity.
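To make the DSL idea above concrete, here is a minimal, hypothetical sketch of what "surfacing repetitive elements into a meta language" can look like as an internal DSL in plain Java. The domain (quest definitions) and all names are invented for illustration; the point is that one small fluent vocabulary replaces N copy-pasted blocks of construction code.

import java.util.ArrayList;
import java.util.List;

public class QuestDsl {

    static class Quest {
        final String name;
        final List<String> steps = new ArrayList<String>();
        final int reward;
        Quest(String name, List<String> steps, int reward) {
            this.name = name;
            this.steps.addAll(steps);
            this.reward = reward;
        }
        public String toString() {
            return name + " " + steps + " -> " + reward + " gold";
        }
    }

    // The "DSL": a small fluent builder, written once, used declaratively everywhere.
    static class QuestBuilder {
        private final String name;
        private final List<String> steps = new ArrayList<String>();
        private int reward;

        static QuestBuilder quest(String name) { return new QuestBuilder(name); }
        private QuestBuilder(String name) { this.name = name; }

        QuestBuilder step(String description) { steps.add(description); return this; }
        QuestBuilder reward(int gold) { this.reward = gold; return this; }
        Quest build() { return new Quest(name, steps, reward); }
    }

    public static void main(String[] args) {
        Quest q = QuestBuilder.quest("The Spitting Camel")
                .step("find the zoo")
                .step("dodge the camel")
                .reward(50)
                .build();
        System.out.println(q);
    }
}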
I can't say as I'd call myself a professional programmer - though I have had occasion to write tools for work.
I've railed for years against software bloat.
Xen - the issue isn't complexity. The complexity of the application isn't going to change significantly no matter what language you chose to use.
The issue is maintenance and resource consumption.
I could have used Java for my tools - in one aspect, it would have been better than Perl or Python... but I ultimately opted against it because it would take me 30 lines of code to get something done that would take only 10 in Python.
As a result, my tool was ugly to look at in operation (I should care about this ... why?), but was a darn sight faster than an almost identical tool written by our Java guru.*
When we opted to add features .. er.. additional tasks .. to the tool, adding a thousand lines of code to my tool was a lot easier than adding 4 thousand to the Java tool.
Bug hunting or making planned changes is simply a lot easier when the code is more concise. This leads directly to less time involved, and, therefore, to less cost of maintenance.
* why was there a second, identical tool? because the powers that be wanted others to have the tool, but didn't want one written in a "non-standard language like Python" running around the company. sigh
Complexity:
Fred Brooks distinguishes "inherent" and "accidental" complexity of a problem (in "The Mythical Man Month" - a must read for any serious software architect). The inherent complexity is the one you can't get rid of, even when you have a perfect solution. The accidental complexity, however, comes from the environment and how you approach the problem.
So if you are good and use whatever trick is necessary to solve the problem, the complexity should be more or less an invariant over different languages.
However, as David Pollak has already mentioned, domain specific languages are one key element in reducing accidental complexity, and meta-programming is important for creating a domain specific language.
So choose your complexity-avoiding tool by how well it supports creating domain specific languages. And actually try to create domain specific languages where appropriate.
Hi steve,
I just read one of your blog posts:
Code's Worst Enemy, 2007 December 19, at
http://steve-yegge.blogspot.com/2007/12/codes-worst-enemy.html
Jesus, thru intricacies and excursions, from vacation preamble to the enticement of a mysterious game to code size theory and IDEs, with over 5k words, the bottom line is really just how java sucks and you were looking for another lang on JVM for your game?
Five thousand, two hundred, and forty-six words. I counted.
If you had said at the outset that the Java lang bloats, I'd have just wholeheartedly agreed and been done with it.
Though, I gotta say it's quite effective writing: sucking in your readers and giving them a clout on the head before they can get away.
Lol! ^_^
My prediction is that low-level langs (or langs with such qualities, such as being statically typed or compiled), such as Java, C++, C#, will continue to lose users until, maybe by 2017, they reach as little recognition as assembly has today.
This is an inevitability due to hardware's exponential growth. It has already been particularly noticeable in the past decade. Before 1995, people were still talking about C, Pascal, memory allocation, linked lists, “data structures” in the sense of C's structs... these days, I bet that the majority of professional programmers don't know what these are. They all work in PHP and JavaScript or Flash.
(majority = greater than 50%. “Professional programmer” is defined as those who write code as their primary income source. This includes HTML, Visual Basic, or any specialized script in different areas, say, 3D modeling...)
High level langs, such as Javascript (and lots more based on it such as Flash etc), Python, Ruby etc will become more and more popular.
Lisps will probably just hang on, maybe become a bit more popular. I don't see Lisp ever becoming mainstream. The parens are a problem, and there are others due to age, but if I can say just one thing about Lisp that I deem unsalvageable, it is the cons business. (I am aware many Lisp experts are here, and that the nature and brevity with which I've stated my opinions about Lisp's problems are likely to make them think I'm an idiot. To express my opinion about parens and cons would require a page or more... and this isn't the place. (The parens, however, I've already expounded on here: http://xahlee.org/UnixResource_dir/writ/notations.html ) )
Not sure about Haskell... on one hand it's high-level, but then it has compiled-lang qualities, with the types thingy, which necessarily bring it down.
Of all the langs I know, I think Mathematica beats them all by far.
----------------------
Some comments on your essay...
You wrote at length, and it is very skilled writing. But I think the words-to-content ratio is not very high, if there can be any criticism at all.
For example, you discuss your vacation and your game, but these are just long leads to your main content, about code size and looking for a replacement for Java. And even the code size theory, which you say is the problem, I don't see much support or evidence for. Code size is proportional to the complexity of the software in general. No matter what lang you use, it's going to grow as your software grows. Games are among the more complex kinds of software, especially networked, graphical ones. Saying code size is a problem is like saying being too rich is a problem, or being too beautiful is a problem, or being big and powerful is a problem.
You say large code size cannot be solved by IDEs. But of course, given a huge code size (and presuming the code is Lisp/Mathematica/Haskell written by top experts), really nothing can solve it.
So if I understand your essay's intention correctly, then your writing about code size, design patterns, IDEs... is just a side effect of getting to your main point: how Java is verbose and you are looking for a Java replacement for your game.
... the above are just some random thoughts. Of course you didn't intend the essay to be a formal exposition... etc.
Xah
xah@xahlee.org
∑ http://xahlee.org/
From the dirt point of view, languages with strong typing imply structured dirt, like houses and bridges, while languages without a type system imply just piles of dirt.
Hi, Xah. I feel compelled to reply to your comments about Lisp.
Will it ever be "mainstream"? Probably not, but I hope we can make it more popular and acceptable than it is right now.
I read your paper on problems with parens. I find it quite interesting that in order to demonstrate the problem with nesting, you resort to showing ugly examples in OTHER languages than Lisp! That's not fair. What, you could not come up with a Lisp example?
Your third point is that Lisp syntax "seriously discourages frequent or advanced use of inline function sequencing on the fly." I completely don't see the point of your argument. The Lisp functional notation looks perfectly clear to me, and much more uniform with the rest of the language.
So what do you mean by "the cons business"?
P.S. Your web site is awesome!
hi Daniel,
It is my honor to reply to you!
I started to write a reply about my thoughts about lisp's list problem (the cons business) but in my rambling nature it turned into 900 words.
I am not sure I should spam Steve's blog, since it's slightly off on a tangent (and badly written). I've put it here for now: http://xahlee.org/emacs/lisp_list_problem.html
Happy holidays,
Xah Lee
Hey,
I did actually have a legitimate question - how can we track the progress of the rhino side of things without pinging you directly?
I've not been very successful thus far in turning anything up that I can directly attribute to being ecma4 by looking through the mozilla repositories. (unless there is a sekret key sequence that shows how to navigate to http://lxr.mozilla.org/mozilla/source/js/tests/ecma_4 )
=( Not fair. Are you just going to dump a magical ecma4 rhino patch on mozilla at some point in the future when it's "perfect" or ...?
Steve, your post was fun to read and very long. What is the essence of your message? I keep hearing you say that if you are going to start a big project, you had better choose a language that lets you provide the best abstractions for the problem you are solving. You really threw a twist into your post when you started talking about JVM languages and code generation and built it all the way up to… choosing a language to make everyone else happy. Wow, I didn’t see that coming. That said, why don’t you just stick with Java but compile down to it in the places that give you the most bang for your buck? Your post seems to imply that you are already familiar with these concepts.
ilume, DRY is a good idea but the challenge folks face is not only to DRY, but to provide abstractions that make the problem easier to manage. Predominantly refactoring focuses on structure, not abstractions.
daniel, re "Will it ever be "mainstream"? Probably not, but I hope we can make it more popular and acceptable than it is right now."
Fame is fickle. The important thing is not that people use Lisp at work, rather that Lisp removes barriers to folks learning and mastering techniques of abstraction that make our jobs easier. That way they can benefit from whatever language is currently in vogue.
I'll start by disclaiming that I didn't read the entire post--damn it is verbose. Was that some sort of intentional ironic joke considering your position on LOC? I'll also speculate that you can probably code circles around me, as can most of the other commenters. And I never bothered to learn Java (the horror).
I was a C++ programmer for about 7 years (back in the 90's), spent some years away from coding (with a very short stint dabbling in Objective C), and now a Ruby programmer for just under 2 years.
All that said, I'm not sure how you can make such a blanket statement that large LOC are bad. Some projects are bigger than others, and perhaps yours is not meant to be maintained by a single person! I like small lean teams--but doing it all alone isn't always realistic.
Also, refactoring almost always reduces my LOC (I can't think of a time when it didn't). It leads to my favorite style of coding, which I call the "Select and Delete" style of coding. I find nothing more satisfying than deleting a bunch of code and having everything still work as designed.
To Xah Lee: Oh, the problem that lists are an emergent phenomenon instead of a collection object. Yup, no argument from me about that. It's all extremely historical.
I had a similar realization recently. I thought "Java is just awesome" and that I should implement stuff in Java because I know it and it's easy.
Then I was introduced to Ruby On Rails... Then to Jython (Python)... And well... I don't like java anymore.
I noticed that a task which seemed "too much" for just me (writing a web site that uses some pretty complex and funky stuff, and is pretty dynamic while remaining HTML-based) is not so difficult for one person to build in his spare time, on some evenings, for maybe 6 hrs a week, if done in the right language.
I wrote functionality at the same speed as I would in java, except that I knew NOTHING about Ruby or RoR when I started and I was pretty proficient in java. That means that with experience I can write programs faster and smaller in Ruby (and easier to understand) than I could in Java. The only difference is that not every programmer on the street can work in Ruby (though I don't know why)
Working in the company that Stevey worked in, dealing first-hand with one of the biggest code bases, I've concluded that code bloat has little to do with the language; rather, it's impossible to kill useless features. If I had a nickel for every time I heard "I can't get rid of this legacy code because there is a legacy client still in PROD.... I think".
Even if Google / Amazon / Microsoft moved everything to Ruby, it would still be millions of lines of Ruby.
I believe developers have to tackle large code bases for the following reasons:
1. Bad modularization - usually due to legacy code preventing any refactoring short of complete rewrites
2. Buggy code - if you can't trust the modules you depend on, you gotta look at the modules. Usually, I never think about the millions of lines of operating system code because I know it "just works".
3. Hacks upon hacks upon hacks - nuff said
The number of lines of the original blog posting smells like bloat to me.
Think you should try the "play" programming language. Its semantics are quite rich: the program "Play;" will run a game with all the features of your 500k game.
Ok, I'm just kidding, but you have some very valid points: trying to factor (I'm not saying refactor) your code, and reusing instead of duplicating, is a mandatory thing.
What I'm not buying is your focus on the "LoC" metric. Your 500k lines of code break Eclipse, and this is a good thing. For me it means that, the way the code is architected, Eclipse must load all the code to figure it out (I mean compile / index it): exactly as if the code were a single 500k-line script, where line 500,000 depends on line 1.
Divide to win
Let's think about it for a few seconds: what do you have in a game? The AI, the graphics engine, the physics engine, the audio engine, lots of data files, and some "scripting" to glue everything together. It is pretty clear that your AI does not care about the colors of the textures on your polygons.
Now, would you consider a technology like OSGi / Eclipse plugins: you focus on writing the smallest possible piece of code. Every time you're willing to add a new feature, you ask yourself: "Can I add an extension point here and program my new feature as an external plugin?"
Now my feature is a separate project Y in my IDE. I create a new project Ytests, which depends on Y, and unit test the code in Y. When Y performs correctly, I package it (right click / export / export as deployable plugin & fragment), push the code to git/svn/you name it, and close the project Y in Eclipse.
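For a flavor of the extension-point idea without pulling in OSGi itself, here is a minimal sketch in plain Java using java.util.ServiceLoader. The interface and plugin names are hypothetical, and a real setup would also need a META-INF/services/GamePlugin file listing the implementations; without it the loop below simply finds nothing.

import java.util.ServiceLoader;

// The extension point: the host knows only this interface.
interface GamePlugin {
    String name();
    void install();
}

// A feature packaged as its own plugin, living in its own project/jar.
class WeatherPlugin implements GamePlugin {
    public String name() { return "weather"; }
    public void install() { System.out.println("weather system installed"); }
}

public class PluginHost {
    public static void main(String[] args) {
        // New features are discovered at runtime; the host never changes.
        for (GamePlugin plugin : ServiceLoader.load(GamePlugin.class)) {
            System.out.println("loading " + plugin.name());
            plugin.install();
        }
    }
}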
Another point is, when I start writing a program, do I use plugin / modular technologies from the start? Can I know in advance that my little project will be successful, and will grow to X lines of code?
I agree that bloat is bad, but just one small point: I work on an IDE that _is_ a 20 million line codebase, Oracle's JDeveloper 11. IDEs can scale if they must.
Your article is long on dissatisfaction and short on satisfaction. What is it about JavaScript that will bring about the 3X code reduction? Why will this (only) 2/3 shrunken code be more maintainable than the code it replaces?
all those words about Java and not one mention (or even a hint) of the term "object oriented" .. and *you* are trying to tutor the youngsters - shame!
object oriented languages result in more code because they present a different approach to solving the problem. But the code is not all in one place .. it's encapsulated in classes and components and frameworks (or, at least it is *supposed to be*). However, the longer view shows much less code to write by leveraging those classes and components and frameworks. It also means less complexity through the inherent breaking down of the problem (complexity is, ultimately, what you are talking about right? .. I hope you at least understand that).
You see, I only had to arrive at your "500,000 lines of code" game to see you don't understand. I am getting tired, and very discouraged when I read these sorts of posts .. As Grady Booch once wrote: "it's going to take a lot of software [to build everything]", and I personally am losing faith that there exists enough talent to pull it off. I am sorry you are blaming Java for your 500,000 line program, but I agree with your conclusion to move on ... have a good life
You are wrong. The size of a code base is not the problem.
Consider the case of a database system. Extremely complex, used by millions of users and required to be blindingly fast when accessing millions of records.
These systems are very large but maintainable as the design is good. Whether it's written in Lisp, Java or assembly language is not the deciding factor - the design's the thing.
You are the perennial wandering programmer looking for the perennial silver bullet to solve all your problems. And you are in the initial phase of ecstasy at how good your next system is going to be when you start writing it with your new toy.
Give us a call when it's finished.
Folks (and Jeff Atwood in particular) -- looking at these comments, are you convinced yet that my opinion is, in fact, a minority one? We're starting to hear more from typical programmers, and the polls are definitely swinging in favor of "Steve's full of crap."
Well, you know how some people like to say: "I'm not the kind of person who says 'I told you so'?" Well, I'm not one of those people. :-)
I'm leaving comments on for now (until genuine spam kicks in) because I want people to understand just what I'm up against. This is why I continue blogging about this stuff, and why I often repeat myself, albeit hopefully in different ways. I'm throwing buckets of sand into a tide of ignorance. Will any of it stick? Time will tell.
Putting stuff in boxes is often a first step to figuring out what you can throw away and what you can't. My main complaint about Eclipse isn't that it can't handle half-million-line codebases (I'm looking at over a million here), it's that it doesn't include IDEA's "Safe Delete" refactoring.
That said, I wish the people who've worked on half-million to million-line codebases and say size isn't a problem would share their secrets with the rest of us. (And no, "architect it right in the first place" doesn't count. That's not a secret, it's a wish for a magic pony.)
I'm a little confused by the slam against design patterns, since the purpose of design patterns is to improve maintainability. Properly applied, they either reduce code size (by refactoring operations into a common framework) or simplify maintenance by reducing dependency between modules (in which case the absolute number of lines of code might go up slightly, but the overall complexity of the system goes down.)
I suppose part of the thesis is that, on balance, applying a pattern that increases code size is a net loss. I would counter that patterns are often needed to supply predictable semantics across a software system. Without patterns you might be able to get away with fewer absolute lines of code, but the code is likely more brittle.
The fundamental problem with Java specifically seems to be the lack of metaprogramming facilities like those found in virtually every other modern programming language. This makes Java a sort of "lowest common denominator" language for imperative programming, but offers virtually no direct support for other programming styles.
You do approve of the "interpreter" pattern as a way of shrinking code size. I would add code generation as a similar metaprogramming approach, where you compile your domain specific language into the underlying general purpose language. After reflection, this is probably the most common metaprogramming technique used in Java development. It is a very heavyweight solution, but so is writing an interpreter.
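To illustrate the interpreter approach mentioned above, here is a minimal, hypothetical sketch in Java: instead of hand-writing Java for every rule, a tiny evaluator walks a data structure describing it. The expression language is deliberately toy-sized and all names are made up.

// The abstraction the interpreter walks.
interface Expr {
    int eval();
}

class Num implements Expr {
    private final int value;
    Num(int value) { this.value = value; }
    public int eval() { return value; }
}

class Add implements Expr {
    private final Expr left, right;
    Add(Expr left, Expr right) { this.left = left; this.right = right; }
    public int eval() { return left.eval() + right.eval(); }
}

class Mul implements Expr {
    private final Expr left, right;
    Mul(Expr left, Expr right) { this.left = left; this.right = right; }
    public int eval() { return left.eval() * right.eval(); }
}

public class InterpreterSketch {
    public static void main(String[] args) {
        // (2 + 3) * 4 -- in a real system this tree would be built from a
        // domain-specific description rather than written out by hand.
        Expr expr = new Mul(new Add(new Num(2), new Num(3)), new Num(4));
        System.out.println(expr.eval());   // prints 20
    }
}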
Thanks a lot.
I'm a college Computer Science graduate, and I'm looking for the definition of "good programming". I'm in love with Lisp now, and after reading your post I know one more point in its favor.
Thank you for sharing your experience.
Reading your blog gives a nice warm feeling, I am not alone!
But my point of view is even more radical. New tools will not solve this problem. Code lines are not the only measure of evil. Complexity is just as much of a problem.
I think the future will prove you right, and necessity will force us to change our attitudes, away from the technology and towards the problem at hand.
William of Ockham could teach modern-day programmers a thing or two.
Anyways, thanks! you made my day.
I wouldn't say it's Lines of Code, but rather the lack of anonymous types, closures, tuples, first-order functions, etc. that is the problem with Java.
Also working with collections is pretty common so having built-in dynamic sized arrays and maps in scripting languages is big plus.
Aggregation is a lot of boilerplate in java compared to Ruby's mixin approach.
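As a rough illustration of that aggregation boilerplate, here is a minimal sketch in Java; where a Ruby mixin would add the behavior in roughly one line, the Java wrapper has to forward every method by hand. Class and method names are hypothetical.

import java.util.ArrayList;
import java.util.List;

// The behavior we want to reuse.
class Inventory {
    private final List<String> items = new ArrayList<String>();
    public void add(String item) { items.add(item); }
    public boolean contains(String item) { return items.contains(item); }
    public int size() { return items.size(); }
}

public class Player {
    // Aggregate the behavior...
    private final Inventory inventory = new Inventory();

    // ...then write one forwarding method per operation you want to expose.
    public void pickUp(String item) { inventory.add(item); }
    public boolean has(String item) { return inventory.contains(item); }
    public int itemCount() { return inventory.size(); }

    public static void main(String[] args) {
        Player p = new Player();
        p.pickUp("rusty sword");
        System.out.println(p.has("rusty sword") + ", " + p.itemCount() + " item(s)");
    }
}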
Regarding you saying Scala and co are not big enough: why does that matter when it's a single-developer project? And ECMAScript 4 is not even out yet. I would have thought something like Scala is a lot more proven than the next generation of ECMAScript.
Can you also explain in more detail why you need dynamic typing?
"The average industry programmer today would not find much wrong with my code base, aside from the missing unit tests (which I now regret) that would, alas, double the size of my game's already massive 500,000-line code base. ... If I'd done things perfectly, according to today's fashions, I'd be even worse off than I am now."
I've yet to see a 500,000 line code base that can't be shrunk considerably. Having comprehensive unit tests means you'd be able to apply more cleanup more frequently.
The last similar experience I had was with a code base that developers thought was clean. It shrunk from 200,000 lines to 60,000 over the course of 10 months, all the while incorporating 10 months worth of new functionality. Lines of refactored code + lines of unit tests < lines of prior mess.
I'd be fascinated to see what half a million lines of ultra clean code looks like. Care to post the code, or even just sections of it?
Hopefully someone with a pet C++ elephant can come along and jump on the minority bandwagon with me.
Better late than never. About seven years ago I worked on a large clearing and settlement engine for a European bank. The prime contractor was Andersen Consulting. All code was in C++.
AC, in their infinite wisdom, didn't trust their developers to actually write code, so they forced them to use a tool called Design/1. This consisted of a large and growing number of C++ macros and an interface in Microsoft Word (I so wish I was making this up). Even main() was wrapped in a macro that expanded to a macro that . . . you get the idea.
After approximately a year we hit the million lines of code mark. AC threw a party to celebrate the event. Two or three other sane people and I sat in the corner and treated the event appropriately -- as a wake.
I completely disagree with the codesize/complexity argument (and the subsequent java bashing)
Large codebases need *not* be complex!! In fact, I have had the pleasure of working on a 40GB Java codebase (the codebase for the world's most popular auction site) as a junior developer, and my experience there completely removed my fears of ever diving into large-scale systems. It was a pleasure (with the occasional warts) to work on that codebase. Even as a person new to the system, it didn't take much time to figure out where a piece of functionality lay, how to change it, what effects the changes would have, what tests to write, etc.
Subsequent to that, I have also had the unfortunate experience of working on a system probably a 100th of that size but with crazy complexity due to bad engineering (a commercial ESB). In fact, I had to rewrite many components, which eventually made the code size BIGGER (by 2x) but made the code EASIER - easier to read, easier to maintain.
I shudder to think of that sort of code being in Ruby/Python/Perl!
I am an erstwhile Perl programmer and I am quite certain your 500,000 LOC could be shortened to 5,000 LOC... of totally UNREADABLE code.
Java code is easy for a human to parse. Probably *longer* to write and a *pain* to write sometimes, but *much* easier and *faster* to read and very, very easy to debug. That last point is crucially important, especially when you are hunting for the root cause of a bug that needs to be fixed right NOW.
There is a very high comfort factor in maintaining well-designed Java systems. I have worked in C++ and I have always prayed before a bug fix (very true). I have worked in large (old) Perl systems too, and there it was more of a "let's change it and guess what happens now" style of fixing.
But, if you are going to be the only person maintaining your game, I think it makes sense to do it in the programming language you like. :)
That is 4.0GB and not 40GB. Typo. :) 40GB would definitely need Superman. :)
I agree that the mere fact that a code base is that big does not, in and of itself, necessarily mean that it is too big. I've been in the industry now since 1977, and only recently have I started working at a job that has a really, really big hunk o' software. We're building an airline reservation system. It has loads o' features, which the customer really needs (about 1/10 of what the customer has asked for!), and the airline industry is just plain ridiculously complex. I've never worked on anything nearly this big; it's quite an experience. It really does have to be that long. A big piece is in relatively succinct languages like Common Lisp and Python; the GUI layer is in Java.
I've got a unique perspective on code base size. Basically, I believe that a big code base at a microISV is a symptom of a microISV that's slow and inflexible. MicroISVs need to keep their code base size down because this forces them to think agile and be more flexible. I've blogged more about this: A huge code base WILL kill your microISV
Hi Steve,
your article is an interesting one, and I feel that at least some of the points are valid. Nevertheless, I find that I must disagree with the underlying premise. Granted, some languages may be more verbose than others. But in all instances you should never need to deal with all the code at once.
One of the most important tools in software engineering (and in many other domains) is "compartmentalisation". Gone are the days when we need to reimplement a linked list class for every new project. If we take the example of Java, a good programmer will leverage the base libraries provided in the JRE. If functionality is still missing, they will turn to open source or commercially available libraries. The only code which should need to be maintained is the code specific to the business problem at hand. Even for this code, much of it can be factored out into libraries or frameworks which can be maintained separately from the main project.
Computers are incredibly complex devices, yet nobody suggests that we should throw them away and replace them with something simpler. We can create these complex devices because we build on layers of complexity. An ASIC engineer uses existing libraries for implementing sections of a chip. A hardware engineer doesn't look at the internals of a chip but merely uses its published specifications to build a device which is more powerful than the sum of the individual parts. Object oriented languages allow similar compartmentalisation to be applied to the domain of software engineering.
Search not for a new language to solve issues with code size but rather break the problem up into manageable and distinct chunks.
Regards,
Dominic.
I'm really surprised that Eclipse indexing never finished with only 500,000 lines of code. That's my usual set-up in my workspace and I've never had such an issue.
As I work on this part of the Eclipse code, I also often verify that there's no problem getting indexing finished in a reasonable amount of time (approx. 1 min) with heavier setups:
1) All the source files of Eclipse, which is now (as of 3.3) over 3 million lines of code
2) All the source files of Europa, which is more than 17 million lines of code...
First of all, I would agree that code-base size is as big a problem as you say. However, I would posit that the size of the problem is more a function of the structure/design of the code than of just the number of LoC.
Sigh.. javascript? scala? doom I say, doom..
Give Nemerle a try. It is what Groovy and Scala should have been.
Count me in that minority, too. But a half-million lines? You have my sympathy.
I've built GUIs and databases and virtual machines, and worked on one of the world's biggest codebases, and have never had to deal personally with that much. But years of MS-DOS coding (meaning tiny space and slow chips to run your software, and primitive tools to write it) and an allergy to needless complexity have taught me that tight code rules, loose code drools.
I've never found choice of language to be an obstacle. I've coded professionally in FORTRAN, Pascal, C, C++, Java, Scheme, and more. It doesn't take that much sometimes to increase the amount of code you don't have to write. C macros can generate a lot of clean boilerplate if used well - look at what Boost.org does with them. You can build a nice object system with C macros. C++ templates are even more magical code generators, and Scheme and other Lisps are metamagical.
So sometimes I like to suffer a few lines of whatever I'm stuck in to write a Scheme interpreter and bust on out. From there I can architect my structures and logic in Scheme and customize my Scheme dialect to fit my architecture. And I can write my runtime in whatever I'm stuck in, so as to leverage its runtime and provide interfaces to my colleagues left behind in Whateverland. I did that on a paid job once. I would have been fired if I hadn't already quit.
Another way to try to get fired, and one with some advantages over an interpreter, might be to use an available Scheme environment to write Scheme code that takes Scheme descriptions of your structures and logic and translates them to code in Whateverland. But we didn't get fired for that; we got lots of rapidly-appreciating stock options for having successfully integrated a Whateverland virtual machine into one of the world's largest codebases. Thus enabling its continued metastatic growth.
In response to Snow:
Consider the case of a database system. Extremely complex, used by millions of users and required to be blindingly fast when accessing millions of records.
These systems are very large but maintainable as the design is good. Whether it's written in Lisp, Java or assembly language is not the deciding factor - the design's the thing.
I once worked on one of those beasts, and indeed it was huge and complex. And written in C. I never did grok enough of the code to judge the design, but it had held up to years of maintenance by thousands of programmers, so it must have gotten something right.
But it seemed the main reason it was still working was that there were so many of us programmers that none of us had to deal with more than a small piece of the beast, and every group had a very senior architect who made sure his group's piece fit into the big picture.
Also, the daily build was always followed by the daily regression tests, and breaking the build or the tests was a big no-no.
So again, choice of programming language is a small part of the picture.
When you write your own half-million-line code base, you can't dodge accountability. I have nobody to blame but myself, and it's given me a perspective that puts me in the minority.
500,000 LOC = 65 years of coding for one person!
A quick look at the matrix in Steve McConnell's book Rapid Development says that, according to the data he collected from industry, a 500,000 LOC business project would require about 780 man-months of effort - that is, 65 years for a single person.
The data is collected from:
1) Software Engineering Economics (Boehm 1981)
2) An Empirical Validation of Software Cost Estimation Models (Kemerer 1987)
3) Applied Software Measurement (Jones 1991)
4) Measures for Excellence (Putnam and Myers 1992)
5) Assessment and Control of Software Risks (Jones 1994)
Either all the reports above are wrong, or your code base is stuffed with duplicated or auto-generated bloat. I suggest you post facts about how long it took to create that 500,000-LOC codebase and how many lines are duplicated, measured with a duplicate-finder tool.
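For illustration, the conversion above is just arithmetic on the cited figures; a rough Java sanity check (the class name is made up, and it assumes 12 man-months per man-year):

public class ScheduleCheck {
    public static void main(String[] args) {
        int loc = 500_000;       // size of the codebase in question
        int manMonths = 780;     // effort figure quoted from McConnell's matrix
        System.out.println(manMonths / 12.0);          // 65.0 person-years for one person
        System.out.println(loc / (double) manMonths);  // ~641 LOC per person-month implied
    }
}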
It's weird, but you don't have a choice except Java (or C#?) in the commercial programming world. I have played around with Ruby, Haskell, OCaml, etc. Back in the day I used to be into C with some basic mixed assembly. There has been so much invested in Java over the past 10 years by firms that even if one wants to take a hard look at the code and say enough is enough, you can't do it. You simply can't.
I hate to say this to all the Ruby people, but I don't like it at all. Scala... now that's something.
Also, there is a difference between what kind of code you write, how you write it, and how you maintain it. So not all code written in all companies is bloated (leave out some of the financial firms where they have to do quick fixes ;)). And honestly, I don't really care what anyone says, nor am I zealous about it. I liked Java back in the day because of its simplicity, and because I had to do a lot of C with pointer arithmetic.
Even James Gosling said (and rightly so, despite the fact that he tried to play down his statement, or worse, recant it) something like "Sometimes you have to take a hard look at a language and then start again with a clean slate" (I don't quite recall the exact words right now, especially since it's 12:33 at night :)).
Regards
Vyas, Anirudh
Nice post, very enlightening. Just one question: why are you looking for stuff that works on the JVM? If there are languages out there that reduce the code bloat, why not run them "au naturel", without the "taint" of the JVM? I say taint in the nicest possible way, as I don't really hate Java.
+1 Scala
It seems to me that OO languages (in particular) make refactoring expensive. So much so that it is often put off for too long. It isn't just changing one function any more, it's changing a whole object tree of dependent functions (I'm assuming here the changes are required because the design has problems - if it doesn't have problems, why are you changing it?). Because of that cost, most fixes don't take the holistic view and end up polluting whatever design purity the code may have started with.
Good OO design is just harder than decent procedural design. It's easier to make more limiting mistakes.
But FWIW, I find most of my refactoring (as opposed to 'rewriting') efforts involve deleting lots of code and moving bits of it to shared areas. In fact I wouldn't bother if they didn't - what would be the point?
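For what it's worth, here is a minimal, made-up Java illustration of that kind of move-to-shared-code refactoring (the class and method names are invented):

import java.util.List;

class ReportFormatter {
    // Before the refactoring there were two near-duplicate loops, differing
    // only in the separator; afterwards the shared logic lives in one place
    // and the old entry points shrink to one-liners.
    String joinWithCommas(List<String> items) { return join(items, ", "); }
    String joinWithPipes(List<String> items)  { return join(items, " | "); }

    private String join(List<String> items, String sep) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < items.size(); i++) {
            if (i > 0) sb.append(sep);
            sb.append(items.get(i));
        }
        return sb.toString();
    }
}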
Hi
Code's Worst Enemy makes a pretty sensible claim that his half-million-line game is too big. I think that it is down to the language. He also indicates that he wants to keep working on the code's worst enemy.
================
henry
I don't know if I agree (or get your point).
As far as I'm concerned, zby had it right -- what about abstracting the majority of the code away in libraries? Isn't that what software development is all about anyway? You're already working on a codebase of millions of lines as soon as you start a new Java project. The only reason you don't consider these lines part of your application is that they are nicely encapsulated and abstracted away.
So at the end of the day, abstraction is always the solution to your code size problems.
But, I'm training for decent athlete status myself.
I don't think you're going to read this, but if by chance you do, please drop me a line.
I started reading the article and liking what I was reading; it talked about something that I, a young student eager to learn and be a better programmer, have always fought against: too much code. Not that I was taught to see it as a problem, but nevertheless I felt it was one I could try to eliminate.
But I was disappointed to find that the article is nothing more than a bashing of the Java language and Java programmers, and the fact that your game is made in Java doesn't really justify it. If it were written in C++, would you have ended up bashing the C++ language and C++ programmers?
I really felt like you ranted away from the point and made this just another "I don't like Java and Java programmers are dumb" post.
I would like to see a post on code size that could talk about the problem without a specific language in mind, a post that talked about the problem and did not hide it in the language X or Y features, or the lack of them.
Rhino may shrink the code base but the run-time will still suffer from Java's memory bloat. Here is an example from a 2003 blog post:
"First, consider Java's String class. Everytime you create a new String class, you are paying:
* 12 bytes for the String's object overhead
* 12 bytes for the String's member variables
* 12 bytes for the character array's object overhead
* 2 bytes for each character
So "hello world" consumes 58 bytes of memory, not counting the reference to the String object itself.
Now consider Java's HashMap class. For every entry in the map, you pay:
* 5 bytes for the reference to the Entry (taking into account the empty entries in the Map, given the default load factor of 75%)
* 12 bytes for the Entry's object overhead
* 16 bytes for the Entry's member variables
Now what do you think happens if you create a HashMap whose keys and values are both Strings? A 5 megabyte file will bloat into 50 megabytes of RAM!"
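A small Java sketch that simply reproduces the quoted arithmetic under the same assumptions (12-byte object headers, 2 bytes per character, the default 0.75 load factor); it measures nothing on a real JVM, and the class name is made up:

public class StringFootprint {
    // Per-String cost under the quoted assumptions.
    static long stringBytes(String s) {
        return 12                // String object overhead
             + 12                // String member variables
             + 12                // char[] object overhead
             + 2L * s.length();  // two bytes per character
    }

    // Per-entry cost for a HashMap<String, String> under the same assumptions.
    static long entryBytes(String key, String value) {
        return 5                 // amortized table slot (0.75 default load factor)
             + 12                // Entry object overhead
             + 16                // Entry member variables
             + stringBytes(key)
             + stringBytes(value);
    }

    public static void main(String[] args) {
        System.out.println(stringBytes("hello world"));   // 58, as in the quote
        System.out.println(entryBytes("hello", "world")); // 125 for one small entry
    }
}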
I think Stevey is missing an important option... You are sometimes better off making a program that *generates* source code for the end program... You are then not tied to the weaknesses of the programming language, because *you* are creating the programming language. So for example you could add #define and #if, and even stuff C++ doesn't have, like #while or #for.
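As a toy illustration of that generate-the-source idea (the directive syntax, class name and method names here are entirely invented), a tiny preprocessor that expands a made-up #for directive into plain Java lines before compilation might look like this:

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ToyPreprocessor {
    // Expands lines of the form "#for i 0 3: <template containing $i>" into
    // one copy of the template per value of i; all other lines pass through.
    static List<String> expand(List<String> in) {
        List<String> out = new ArrayList<>();
        Pattern p = Pattern.compile("#for (\\w+) (\\d+) (\\d+): (.*)");
        for (String line : in) {
            Matcher m = p.matcher(line.trim());
            if (m.matches()) {
                String var = m.group(1), body = m.group(4);
                int lo = Integer.parseInt(m.group(2)), hi = Integer.parseInt(m.group(3));
                for (int i = lo; i < hi; i++) {
                    out.add(body.replace("$" + var, Integer.toString(i)));
                }
            } else {
                out.add(line);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> source = List.of("#for i 0 3: System.out.println(\"field\" + $i);");
        expand(source).forEach(System.out::println);  // emits three println lines
    }
}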
You can also create your own GUI for developing code: for example a split screen, where the left side is a custom form that changes appearance depending on the function on the current line of code, and the right side is an ordinary text view... so for example when you wish to code a command to "open" a file, you get a browse button, combo boxes and full documentation to optionally help you write the code.
Interesting that you mention that you did a game on your own.
Nowadays popular games are often created by huge teams, and those enormous costs demand that games be mainstream.
So some people miss the old days, when one person could make a game.
I don't understand. You started with a bloated language and said you made massive engineering mistakes in the past. You wrote the game 10 years ago, so I reckon you must have made serious mistakes in your code for it to get that big - and that was 10 years ago. Things have improved now. I think you should stop blaming the tools. If you use the wrong tool for the job, don't blame the tool.
Have you looked at REBOL? There is nothing else like it when it comes to reducing bloat. Take a look at http://re-bol.com and http://rebol.com .
Steve's right.
Personally I think C# is a large improvement over Java; it's not as concise as some languages, but LINQ, the var keyword, extension methods, lambdas and a non-broken generics implementation all help. Fundamentally, the C# culture doesn't think that concision is a dirty word.
Trying to master the C# type system, I took an interest in Scala, but I came back disappointed. Type erasure is a "broken window" that effaces all of the good things in the Scala type system. C# taught me what I could do with a decent generics system, and time and time again I'd discover that type erasure would get in my way. Other 'clever' ideas in Scala, like the way constructors work, just seem naive to me.
My feeling is that any large system is going to have certain sections that need to be built with efficiency in mind, and in those places, static typing has major benefits. Large systems are always surrounded by an aura of things that are scripty, and would benefit from dynamic typing. A combination of Java and something like Rhino seems like a good idea.
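A minimal sketch of that static-core-plus-scripty-edges split, using the standard javax.script API (the engine bundled with Java 6 was Rhino-based); the variable name and script text are invented for illustration:

import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;
import javax.script.ScriptException;

public class ScriptedEdge {
    public static void main(String[] args) throws ScriptException {
        // The statically typed core owns the data...
        ScriptEngine js = new ScriptEngineManager().getEngineByName("JavaScript");
        js.put("playerName", "someName");
        // ...while the scripty edge can change without recompiling anything.
        Object greeting = js.eval("'Welcome back, ' + playerName");
        System.out.println(greeting);
    }
}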
In the big picture, I don't see the problem as being as much about languages as about programming paradigms. Object-oriented programming has been useful, and will continue to be useful, but we're still finding that it's too expensive to develop software. If we want to build systems that are cheaper and do more, we need to move towards approaches that are declarative and ontology driven. Although I'll never accept Lisp's BLUBiness (the fact that it rejects 1960's developments in parser theory) I'm personally mining the 1970's and 1980's AI literature for ideas.
A little irrelevant but: The entire time I was reading your post, from design patterns on, I was hearing "Little Boxes" (The theme song to the excellent showtime show "Weeds").
Rhino sounds great - when will it be finished? I downloaded the latest version and the file I/O is in the "examples" folder, with a comment that says "This is what file I/O might look like if we ever get around to it."
I have a question about understanding large code bases. I seem to have no difficulty reading and understanding code that is, say, up to about 10,000 lines. Beyond that, I seem to lose track of things. I have tried various approaches to understand what's going on: keeping notes on the call stack as I read through the code, looking carefully at the documentation to get the general flow of things, trying to understand the code by first breaking it down into small chunks, understanding each part properly, and then seeing how it fits with the rest of the code.
This works in some cases - for example, the PHP scripts that run MediaWiki (there are about 200 of them). But once I get to something that's larger, it is very hard to understand what's going on - I tried to understand the MySQL code and the Apache code, and it was hard to figure out what was happening.
Would you have some general advice on how to read a particular file and quickly get a general idea of what it does? It seems that I am always working with large codebases, and not being able to deal with the scale is slowing down my work.
Thanks!
Why do people who have emotionally abused me tell me that I am my own worst enemy?
That was a really good article.
Being a Computer Science student at a university that teaches Java, I thought it a bit odd when one of my lecturers said "if you're just one guy, you want to use C++, and if you're thousands of guys, you want to use Java, and if you're in between, you want to use Python. So there's almost never a reason to use anything but Python."
But I guess what you're saying not only makes sense, but encourages me to make sure my code bases (when I, you know, actually get any) remain small.
Given that I only really have *any* experience with Java, if one were to try to keep a codebase small, what techniques are you suggesting? Ignoring the fact that I wouldn't be using Java, that is :P
Is it a case of being sure to generalize everything so there's no duplication? Just planning the darn thing better and writing the code by hand?
That was one thing I thought odd about the difference between DrJava (a notepad with colours, compilation and a command-line runtime built in) and NetBeans. It seemed I could learn Java and write it myself, or I could learn to use NetBeans. Do things like NetBeans not detract from the ability of a programmer, as opposed to a computer user?
Thanks
PAz
> Unfortunately, a JVM language has to be a drop-in replacement for Java
JVM Lisps are mostly just Lisps that happen to run on the JVM. Clojure, on the other hand, is great for Java interop. It doesn't shun standard Java classes, and I would recommend it for your next project (you appear to have chosen Rhino for this one).
Great Post. It's rare I read through the long ones, but I really enjoyed it.
Nice article. I know the pain... You might also want to check this one from Michael Feathers:
http://michaelfeathers.typepad.com/michael_feathers_blog/2011/05/the-carrying-cost-of-code-taking-lean-seriously.html
In LEAN terms: code is inventory, and should be minimized!
Well, it's been over two and a half years since you wrote this post. How did things work out?