Search This Blog

Finding out if I've Improved as a Programmer: Part 1

Do you ever wonder how far you've come as a programmer? Well, I sometimes do, and today, I've decided to take a look and see if I've made any progress. I think that I have. I certainly have spent a lot of time studying and practicing to be a better programmer, not to mention the years of projects I've worked on in my career, making a living of this fascinating endeavor of instructing computers to perform automated tasks. One hopes that all of that effort hasn't gone to waste.

If I look back at one of my early projects, I should be able to easily see ways to improve it to make it clearer and cleaner. If I can't, well, what have I been doing all of these years? I decided to pluck a project from my operating systems course in college, good old CS537 from UW-Madison. It's a simple introductory project that implements a basic shell in C. The shell can execute commands either in an interactive mode from a prompt, or read commands from a file given to the shell as an argument when it starts. It's a small enough project that I can evaluate it in a couple blog posts, but not so small as to be trivial, like FizzBuzz or the Sieve of Eratosthenes.

I also picked this shell project because I remember having trouble with it. I didn't have it working when I turned it in because of a bug that was causing a segmentation fault. The professor was nice enough to debug it with me during the demo (we had to demo our projects and discuss what went right and what went wrong with the professor), and I was thoroughly impressed with how quickly he grokked my code and proceeded to fix it. I remember asking him in astonishment how he could move around someone else's code so fast. He grinned at me and said, "It just takes practice, and besides, your code's not that bad." Um, yeah, right. We'll see about that.

Since this is my own code from days gone by, I can say what I want about it and tear it apart as I see fit. No restraints necessary. I'm going to be brutally honest as I walk through it and refactor it into something closer to what I think is good code today. I'll follow roughly the same process that I used for my last few posts, when I refactored some code that I found on the Internet for a rainflow counting algorithm. To kick off that process, I'll add a new GitHub repo where you can follow along with the changes that I make. I was totally ignorant of version control at the point in my college career when I did this project, so we only get to see the (mostly) working version of the code that I had after debugging it with my professor, not the full development of the project. Anyway, version control is important, so the program goes into git first.

Next, let's scan through the code and get an idea of the issues that we want to fix:
  • It doesn't compile
  • Minimal tests
  • Inconsistent indenting and formatting
  • One long function in main()
  • Poor structure and organization
  • Major memory management issues
Wait, what? It doesn't compile?! Yes, the environment has changed a bit since I did this in college, so we'll have to do a little work to get it running at all. It figures, I guess. I was also surprised at how poorly it's formatted, although it was likely due to using an editor that inserted tabs instead of spaces. That will be easily fixed. Let's get started with getting it up and running so we can poke and prod it a bit.

Getting shell.cpp to Compile


The first issue is that the program includes <iostream.h>, but it should be simply <iostream>. After doing that change, it uncovered another problem in that the lines that print to stdout using cout and endl do not have the name space std:: attached to them. I could either attach the name space or add a using namespace std line after the includes to fix the problem, but there's a better option. Since I only used <iostream> for a few lines of printing, and I even used printf() in other places, I can just replace the stream output lines with printf() or puts() and remove the #include <iostream>. I have no idea why I used both print methods originally, but now I'll stick to C functions since the rest of the program is in C.

Finally, this line:
  static char *PROMPT="koblensk> "; //Prompt to appear for user
Throws a warning in the compiler that the conversion from string constant to char* is deprecated. That's easily fixed by changing the declaration to a char[]:
  static char PROMPT[]="koblensk> "; //Prompt to appear for user
Now the program compiles, so let's play around with it. I ran shell and tried to run ls through it with this resulting output:
koblensk> ls
Command Line : ls
   0 : ls
  0 arg ls
ls
ls
Doing execv ls
  0 ls

Command Not Found.
Please check your path and filename.
Note: programs needing user input cannot be executed with this shell.
Ew, yuck. A bunch of ugly debug statements are showing. We'll fix that up later as we walk through the code. For now, why didn't ls work? It appears I have to enter the full path because shell doesn't have an environment with a set of paths for executables. We can fix that later, too, as an enhancement. For now, let's try the full path to ls:
koblensk> /bin/ls
Command Line : /bin/ls
   0 : /bin/ls
  0 arg /bin/ls
/bin/ls
/bin/ls
Doing execv /bin/ls
  0 /bin/ls
batch.txt  shell  shell.cpp
koblensk> 
Command Line : 
Segmentation fault (core dumped)
Hmm, the call to /bin/ls worked, but when I entered a blank line I got a segmentation fault. It looks like it happens fairly early in the debug printouts, too. Let's see if we can fix that right away. After searching for the 'Command line :' debug statement, I found the relevant lines:
    checkEOF=fgets(commandLine, LENGTH, inputSrc);
    printf("Command Line : %s", commandLine);
    commandLine[strlen(commandLine) - 1] = '\0';
The problem is that I do no input checking. The code will happily take in an empty line and continue on its merry way until it crashes and burns. I can do a minimal check and continue with the next iteration of the infinite while loop that I'm in, and at least I won't crash on an empty line:
    checkEOF=fgets(commandLine, LENGTH, inputSrc);
    printf("Command Line : %s", commandLine);
    if (strlen(commandLine) <= 1) continue;
    commandLine[strlen(commandLine) - 1] = '\0';

Playing Whack-a-Mole with Bugs


The empty line bug is fixed, but the program still crashes on a line with only spaces in it. To fix that, let's make a little function to check for empty lines with only whitespace:
bool is_empty(const char *str) {
  while (*str != '\0') {
    if (!isspace(*str)) return false;
    str++;
  }
  return true;
}
Then we can change the check to continue on any empty line:
    checkEOF=fgets(commandLine, LENGTH, inputSrc);
    printf("Command Line : %s", commandLine);
    if (is_empty(commandLine)) continue;
    commandLine[strlen(commandLine) - 1] = '\0';
This change makes the shell much more tolerant of input errors. I could keep hammering on different unexpected inputs, but I saw something else interesting in the code just above the lines I fixed—a check for a SHELLPATH environment variable. Maybe this shell does support a little bit of environment. Let's try setting the SHELLPATH and see if we can call ls without a full path. Here's the output:
$ export SHELLPATH="/usr/bin:/bin"
$ ./shell
koblensk> ls
Command Line : ls
   0 : ls
  0 arg ls
ls
ls
Doing execv ls
  0 ls
Doing execv /usr/bin/ls
  0 ls
Doing execv /bin/ls
  0 ls
batch.txt  shell  shell.cpp
koblensk> 
Sweet! It worked. Okay, the next thing we want to check is the batch mode because that's how we're going to implement some tests before digging into refactoring. But first, let's not get too far ahead of ourselves. We should do a quick commit before going any further. Next, let's try out the batch.txt file with the contents:
/bin/ls -a
/usr/bin/gvim
exit
And we get this output:
...
koblensk> Command Line : koblensk> Command Line : koblensk> Co
mmand Line : koblensk> Command Line : koblensk> Command Line :
 koblensk> Command Line : koblensk> Command Line : koblensk> C
ommand Line : koblensk> Command Line : koblensk> Command Line 
: koblensk> Command Line : koblensk> Command Line : koblensk> 
Command Line : koblensk> Command Line : koblensk> Command Line
 : koblensk> Command Line : koblensk> Command Line : koblensk>
 Command Line : koblensk> Command Line : koblensk> Command Lin
e : koblensk> Command Line : koblensk> Command Line : koblensk
> Command Line : koblensk> Command Line : koblensk> Command Li
ne : koblensk> Command Line : koblensk> Command Line : koblens
k> Command Line : koblensk> Command Line : koblensk> Command L
ine : koblensk> Command Line : koblensk> Command Line : koblen
sk> Command Line : koblensk> Command Line : koblensk> Command 
Line : koblensk> Command Line : koblensk> Command Line : kob^C
$
Oh boy, that's not right. I must have introduced a bug in my last change by ignoring empty lines. Let's check the code further down a bit (ignoring the weird indenting for now):
    if (checkEOF == NULL){
      printf("\n");
      exit(0);

    } else if (strlen(commandLine) >= (unsigned)LENGTH-1) {
      puts("\nInput exceeds valid command length.");
      puts("Input must be at most 512 characters.");
 while(strlen(fgets(commandLine, LENGTH, inputSrc)) >= (unsigned)LENGTH-1);
 commandLine[0] = '\n';
    } //else if
Sure enough, the EOF check no longer happens because the empty line check happens first, and an EOF is seen as an empty line. If we move the empty line check after the EOF check, everything should be happy:
    if (checkEOF == NULL){
      printf("\n");
      exit(0);

    }
    if (is_empty(commandLine)) continue;
    if (strlen(commandLine) >= (unsigned)LENGTH-1) {
      puts("\nInput exceeds valid command length.");
      puts("Input must be at most 512 characters.");
 while(strlen(fgets(commandLine, LENGTH, inputSrc)) >= (unsigned)LENGTH-1);
 commandLine[0] = '\n';
    } //else if
Notice that changing the else-if to just an else and sticking another if statement in between doesn't change the flow of the program because if the execution path enters the first if block, it's going to exit the program. Now the interactive mode still ignores empty lines, and the batch mode stops at the end of the file. However, there's another problem with batch mode. It doesn't execute the commands correctly. Take a look at this output:
koblensk> Command Line : /bin/ls -a
   0 : /bin/ls
   1 : -a
  0 arg /bin/ls
  1 arg -a
/bin/ls
/bin/ls
Doing execv /bin/ls
  0 /bin/ls
  1 -a
'bin/ls: invalid option -- '
Try '/bin/ls --help' for more information.
koblensk> Command Line : /usr/bin/gvim
   0 : /usr/bin/gvim
  0 arg /usr/bin/gvim
/usr/bin/gvim
/usr/bin/gvim
Doing execv /usr/bin/gvim
  0 /usr/bin/gvim
Doing execv /usr/bin//usr/bin/gvim
  0 /usr/bin/gvim

Command Not Found.
Please check your path and filename.
Note: programs needing user input cannot be executed with this shell.

koblensk> Command Line : exit
   0 : exit
  0 arg exit
exit
exit
Doing execv exit
  0 exit
Doing execv /usr/bin/exit
  0 exit

Command Not Found.
Please check your path and filename.
Note: programs needing user input cannot be executed with this shell.

koblensk> Command Line : exit
   0 : exit
  0 arg exit
exit
exit
Doing execv exit
  0 exit
Doing execv /usr/bin/exit
  0 exit

Command Not Found.
Please check your path and filename.
Note: programs needing user input cannot be executed with this shell.

koblensk> Command Line : 
It took a while to figure out what was going on here because interactive mode works, but the exact same commands in batch mode do not work, failing with "Command Not Found." It turns out that the batch file had a carriage return and a line feed at the end of each line, instead of the expected new line character. The code that removes the new line character was leaving the carriage return, causing execv() to fail on this unexpected trailing character. Really, all trailing whitespace should be stripped from the command line string, so I made another little function to do that:
void rstrip(char *str) {
    char *end = str + strlen(str) - 1;
    while (end > str && isspace(*end)) {
      *end = '\0';
      end--;
    }
}
After replacing the old line of code that removes the new line character with a call to this function, we have this:
    checkEOF=fgets(commandLine, LENGTH, inputSrc);
    printf("Command Line : %s", commandLine);
    rstrip(commandLine);
That makes the later check for an empty line a little redundant because is_empty() no longer has to check for spaces. They'll have already been removed! I'm going to leave the function alone, though. It's more generic the way it is, and I might use it again. I don't want to make a mistake and assume the function is doing more than it is because I changed it.

Tests With Batch Mode


Now that we have a working batch mode, we should make a reasonable batch test file. The original file didn't test very much, so we want to add some things that flex the other parts of the program, like commands without the path specified, blank lines, and commands that are too long for the shell. Here's my attempt at a more complete batch test file:
/bin/ls -a
cat batch.txt
this is an invalid command


echo 1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890
exit invalid
exit
Shocking as it may be, this test file uncovered another bug in my program. The problem happens at the execution of the second command. It should be a valid command, but the shell doesn't find it in batch mode when the SHELLPATH is set to "/usr/bin:/bin". It should find cat in /bin, but it fails to do so because the SHELLPATH parsing happens inside the loop that executes all of the commands. Why should this be a problem, other than it's inefficient? Well, the parsing is done with strtok(), and strtok() modifies the input string as it tokenizes, so the second time through the loop for the second command, the SHELLPATH string is different, and only the first path is picked up. The command is not in the first path; it's in the second path, so it's not found. VoilĂ , we have our bug. To fix it, I moved the block of offending code up to before the loop. I'm not going to show it here because it's a lot of crappy code that we're going to work through next time. It has a bunch of other problems that we're not going to get into, yet. Suffice it to say, the shell path parsing was moved, and now the batch file works as expected.

There's a problem with using this batch file for a test case. Normally, I would capture the output from the batch mode run and use it as a golden sample to compare against future runs after doing some refactoring. In this case, the output isn't going to be completely stable, though. It contains a bunch of debug statements that we're going to strip out as we refactor the code, and that will purposefully change the output. We can still compare the previous output with the new output in these cases, but we can't rely on a simple diff with no changes to show that we haven't messed something up. We're going to have to pay attention to the changes in the diff output to make sure that only the changes that we want to happen are happening. To simplify these comparisons, we should only ever change the debug output or refactor other parts of the program, but never both at the same time. My commits will reflect this process.

Why don't I use a testing framework here? Like the last program I refactored, this is a small, self-contained program that's hard to test with a framework because everything happens in main(). It would be much more work to use a testing framework than to do the simple comparisons I'm doing, and for little gained. The point of this exercise is to take a look back at my programming history and see if I've improved at all with experience. (It already seems that I have.) Refactoring to show how I would do things better today—while using as simple a way as possible to make sure I don't screw anything up—is the goal. If this were a production system, my goals would be different.

Wrapping Up


With the batch test file done and a couple more bugs fixed, it's time to commit again. After that, I want to quickly fix up the formatting because I've had about enough of looking at all of the messed up indenting. I'll just run it through clang-format and be done with it:
$ clang-format -style=llvm -i shell.cpp
Ah, much better. That huge formatting change requires another commit, and then we're done for today. We've already accomplished quite a bit with fixing a few bugs, adding extra tests, and getting the formatting to be readable. We still have a lot to do, though. Looking through this code, I saw extremely poor program structure, disorganization, magic numbers, abysmal memory management complete with memory leaks, and even some bad naming. That's a lot of opportunities packed into a 240-line program. We'll tackle those opportunities next time.

No comments:

Post a Comment