C-Style Strings
C++ has two different "string" types; the string class from the standard library makes string manipulation easy, but is complex, since it uses dynamic memory. The original "string" type, inherited from the C language, is much simpler.
Though simpler, older C-strings are more difficult to work with. Sometimes more efficient, they are also more error prone, even somewhat dangerous. However, as C++ programmers, you can't ignore them.
Why should you dedicate any time to studying C-strings? There are several reasons:
- Efficiency. Library string objects use dynamic memory and the heap. C-strings are built into the language, so you don't need to link library code.
- Legacy Code. You need C-strings to interoperate with pure C code or with older C++ code that predates the C++ string type.
- Library Implementation. You may want to implement your own string type. Knowing how to manipulate C-strings can greatly simplify this task.
- Embedded Programming. Programs written for embedded devices, like those in your automobile or toaster, frequently use C-strings.
- Platform O/S Programming. For native Linux or Windows programming, you will need to use C-strings.
We will encounter many of these cases in the remainder of this course.
C-String Basics
The library string type works as if it were built into the C++ language. It uses C++ features to allow a string to act as a built-in type. C-strings are more primitive:
- C-strings are char arrays with a sentinel terminator, the NUL character '\0'.
- C-strings can be passed to functions without overhead.
- "String literals" automatically include the terminating NUL.
The literal "Hello CS 150" contains 13 characters: 12 for the meaningful characters plus one extra for the terminating NUL. The compiler generates:
| H | e | l | l | o |   | C | S |   | 1 | 5 | 0 | \0 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
C-string functions all assume that this NUL exists; some insert it for you. Without a NUL, functions don't know where the string stops, so they either return garbage or crash. The length of a C-style string is not stored explicitly; the NUL serves as a sentinel, and your program loops through the characters, counting them, whenever it needs to find the size.
Don't confuse '0' with '\0'. One has the ASCII value 48 and the other 0.
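The sentinel loop described above can be sketched as follows. This is only an illustration of the idea, not how any particular library implements strlen(); the name countChars is hypothetical.

```cpp
#include <cstddef>

// Hypothetical helper (not a library function): counts the characters
// that come before the terminating NUL, much as strlen() does.
// Assumes str really is NUL-terminated; otherwise the loop runs off the end.
std::size_t countChars(const char *str)
{
    std::size_t len = 0;
    while (str[len] != '\0')    // stop at the sentinel, not at '0'
    {
        ++len;
    }
    return len;
}
```

For example, countChars("abc") returns 3, while the array that holds "abc" occupies 4 bytes because of the NUL.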
Array-based C-Strings
How you create a C-string determines where the characters are stored in memory. To copy characters into user memory where they can be modified, write this:
char s1[] = "String #1";
The C-string s1 contains exactly 10 characters: the 9 that appear in "String #1" and the terminating NUL character. Space for these characters is allocated on the stack or in the static storage area. The actual characters are copied into this "user space". This declaration is shorthand for:
char s1[] = {'S','t','r','i','n','g',' ','#','1','\0'};
Because the characters have been copied into memory that you control, you can change them if you like using the normal array subscripting operations.
s1[0] = 'C'; // OK; all characters are read-write
const size_t kLen = 1024; // small strings
char s2[kLen] = "String #2";
The declaration for s2 is slightly different. While the effective size of the string is also 9 characters, its allocated size is set by kLen, or 1024 in this case. Use s2 if you want to add information to the end of the string, much as you would with a partially-filled array.
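To make the difference between allocated size and effective size concrete, here is a small sketch. The helper name roomLeft is hypothetical, and it assumes the array already holds a NUL-terminated string.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper: reports how many more characters could be added
// to a char array while still leaving room for the terminating NUL.
// N is the allocated size, deduced from the array type itself.
template <std::size_t N>
std::size_t roomLeft(const char (&buf)[N])
{
    return N - 1 - std::strlen(buf);   // allocated - NUL - effective size
}
```

With the declarations above, roomLeft(s1) is 0, because s1 was sized to fit its initializer exactly, while roomLeft(s2) is 1014 (1024 allocated, minus 9 characters, minus the NUL).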
Pointer-based C-Strings
Pointers to NUL-terminated character array literals can be used as C-strings, provided you don't attempt to modify them:
const char *s3 = "String #3";
The character array itself is not copied into your user space. The characters are stored in the static storage area when the program loads. Attempting to change a character in s3 is a compiler error, because of the const.
In C and in some older C++ code, you may see this declaration:
char *s4 = "String #4";
The declaration for s4 is not valid in modern C++ (the implicit conversion from a string literal to char * was removed in C++11), but it may be found in older code and is legal in C. Many compilers will still compile it with a warning, but your program will probably crash if you attempt to modify the string in any way. The portion of the static storage where string literals are stored is effectively read-only.
C++ strings vs C-Strings
C-strings are not first-class types like the C++ string type. They do not work like the built-in types. Look at this example, which tries to assign, compare and concatenate two strings:
string str1 = "Hello", str2 = "World";
char cstr1[] = "Hello", cstr2[] = "World";
str1 = "Goodbye"; // assignment OK
cstr1 = "Goodbye"; // ILLEGAL
if (str1 < str2) ... // comparison OK
if (cstr1 < cstr2) ... // INCORRECT
str1 += ", "; // OK
cstr1 += ", "; // ILLEGAL
For the C++ string class, assignment, comparison and concatenation work in the same manner as the built-in types. Use the assignment operator, the relational operators, and the +=. Not so for C-strings, where you must use functions from the <cstring> header to perform the same functionality.
- strcpy(dest, src) is used instead of assignment
- strcat(dest, src) is used instead of +=
- strcmp(cstr1, cstr2) is used instead of the relational operators
In addition, in place of the member function size(), you use the strlen(cstr) function which counts the number of characters before the '\0'.
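As a rough sketch of how these functions stand in for the operators, the function below builds the same text with both string types and checks that the results agree. The name sameResult and the buffer size of 32 are illustrative assumptions; the size was simply chosen to be large enough for this text.

```cpp
#include <cstring>
#include <string>

// Sketch: builds "Goodbye, World" once with std::string operators and
// once with <cstring> functions, then reports whether the two agree.
bool sameResult()
{
    std::string str1 = "Hello", str2 = "World";
    str1 = "Goodbye";                  // assignment
    str1 += ", ";                      // concatenation
    str1 += str2;

    char cstr1[32] = "Hello";          // 32 is an assumed, generous size
    char cstr2[] = "World";
    std::strcpy(cstr1, "Goodbye");     // instead of =
    std::strcat(cstr1, ", ");          // instead of +=
    std::strcat(cstr1, cstr2);

    // strcmp() instead of ==, strlen() instead of size()
    return std::strcmp(cstr1, str1.c_str()) == 0
        && std::strlen(cstr1) == str1.size();
}
```

This is only safe because the 14 characters of the result, plus the NUL, fit comfortably in the 32-byte array; none of these functions checks that for you.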
C-String Assignment
Assignment means "copy the thing on the right into the storage on the left". Instead of the assignment operator, used by the built-in types, C-strings use the strcpy() function, from the standard library header <cstring>, as shown below:
const size_t kMaxLen = 4096;
char dest [kMaxLen];
// Assume src is a C-style string
strcpy(dest, src);
Both src and dest are C-strings. (src is a common abbreviation for source, where the characters are copied from, while dest stands for destination, where the characters are copied to). strcpy(dest, src) copies the characters, one by one, from src into dest, stopping after the '\0' is copied. However:
- You don't know whether the actual size of the C-string source is at most 4095 characters (+1 for the null character). Thus this code contains a security flaw.
- You normally won't need anywhere near 4096 characters allocated for destination, so the code is inefficient.
It is up to you to ensure that there is enough space in dest to hold a copy of src. This warning does not mean that the code above is necessarily buggy; instead, it means that the function itself is intrinsically dangerous, and should be handled like anything that carries a poison label.
The library function itself makes no attempt to check whether the destination has enough room to hold a copy of the source string. Even if there is not enough memory, the function keeps copying, possibly overwriting other data; this is called a buffer overflow.
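One way to take on that responsibility is to measure the source before copying. The sketch below is one possible approach, not a standard facility; copyIfFits and its destSize parameter are hypothetical names.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper: copies src into dest only when the whole string,
// including its terminating NUL, fits. destSize is the allocated size
// of dest. Returns false and leaves dest untouched if src is too long.
bool copyIfFits(char *dest, std::size_t destSize, const char *src)
{
    if (std::strlen(src) >= destSize)
    {
        return false;            // no room for the characters plus the NUL
    }
    std::strcpy(dest, src);      // safe here: the length was checked above
    return true;
}
```

The cost is an extra pass over src to count its characters, and the check is only as good as the destSize you pass in.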
The strncpy Function
The possibly safer strncpy() function copies only a specified number of characters from src to dest. Here's how it works:
- Call strncpy() with a dest, a src, and a count of characters.
- If the '\0' in src is found before the specified number of characters have been copied, then strncpy() will fill the remainder with '\0'.
- If the NUL character is not found in src before the number of characters have been copied, then you must manually append a terminating NUL.
Here is a semi-safe copy, based on the previous example, that avoids overflow (although it doesn't ensure that all of src was actually copied; for that you need a loop).
dest[kMaxLen - 1] = '\0'; // pre-terminate
strncpy(dest, src, kMaxLen - 1);
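The same pattern can be wrapped in a function that also tells the caller whether anything was cut off. This is a sketch under assumptions: boundedCopy and destSize are hypothetical names, and checking the length with strlen() is just one way to detect truncation.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper: copies at most destSize - 1 characters from src
// and always terminates dest. Returns true only if all of src was copied.
bool boundedCopy(char *dest, std::size_t destSize, const char *src)
{
    if (destSize == 0)
    {
        return false;                        // nowhere to put even the NUL
    }
    std::strncpy(dest, src, destSize - 1);   // may stop before src's NUL
    dest[destSize - 1] = '\0';               // so terminate manually
    return std::strlen(src) < destSize;      // false means src was truncated
}
```

Copying "Hello, world" into a 5-byte array this way leaves "Hell" in the array and returns false.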
If I've seemed somewhat equivocal about using strncpy(), you should know that it's widely regarded as a function that is still unsafe.
The strcat Functions
Concatenation is the province of the strcat() (completely unsafe) and strncat() (marginally safer) functions. Here is a (buggy) example using strcat():
const size_t kLen = 10;
char cstr[kLen] = "Goodbye";
strcat(cstr, " cruel world!"); // OOPS
cout << strlen(cstr) << " " << cstr << endl;
When you run, you'll likely see:
20 Goodbye cruel world!
The C-string cstr has room for 9 characters, but you appear to have stuffed 21 characters (including the NUL), into that smaller space. Not really, of course: this is a buffer overflow and the actual results are undefined.
The strncat() function is marginally safer, if fairly tricky to use correctly. If used incorrectly, it overflows just like strcat(). Here is the prototype:
char * strncat(char *dest, const char *src, size_t count);
The tricky part is that count is not the maximum size of the result, but the maximum number of characters to be copied; you must first calculate the correct combined maximum, before calling the function.
const size_t kLen = 39; // max total characters
char cstr[kLen + 1] = "This is the initial string";
const char *str2 = "Extra text to add to the string";
strncat(cstr, str2, kLen - strlen(cstr));
This isn't efficient (since you need to count the characters in cstr first), but it does stop copying when the destination string is full.
Security Note: strncat() does not check for sufficient space in dest; it is therefore a potential cause of buffer overruns. Keep in mind that count limits the number of characters appended; it is not a limit on the size of dest.
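The calculation of count can be packaged so that it is done the same way every time. The following is a minimal sketch, assuming dest already holds a NUL-terminated string; boundedAppend and destSize are hypothetical names.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper: appends as much of src as fits in dest.
// destSize is the allocated size of dest, which must already contain a
// NUL-terminated string. Returns true only if all of src was appended.
bool boundedAppend(char *dest, std::size_t destSize, const char *src)
{
    std::size_t used = std::strlen(dest);
    if (used >= destSize)
    {
        return false;                       // dest is not a valid C-string
    }
    std::size_t room = destSize - 1 - used; // characters left before the NUL
    std::strncat(dest, src, room);          // strncat() writes the NUL itself
    return std::strlen(src) <= room;
}
```

Appending "Extra text to add to the string" to a 40-byte array holding "This is the initial string" leaves room for only 13 more characters, so the result ends in "Extra text to" and the function returns false.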
Comparing C-Strings
Do not use the relational operators (<, ==, etc.) to compare C-strings. Instead, use the library function strcmp(), which compares s1 and s2 lexicographically and returns an integer indicating their relationship:
- Zero if the two strings are equal.
- Negative if the first string lexicographically precedes the second string. ("Lexicographically" simply means "in dictionary order".)
- Positive if the first string lexicographically follows the second string.
To use strcmp() correctly:
- Call the function and save the int it returns.
- Use the returned value with a relational operator.
- Don't treat the return value from strcmp() as a Boolean expression.
- Don't repeatedly call strcmp() on the same strings (inefficient).
Here's a quick example. The C-strings s1 and s2 are initialized elsewhere. Since we don't need to modify either argument, we can use "pointer-style" C-strings.
const char *s1 = ..., *s2 = ...;
int result = strcmp(s1, s2);
if (result == 0) ... // equal
else if (result < 0) ... // s1 < s2
else ... // s1 > s2
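The same rules apply when strcmp() is used inside a loop. As a sketch, the function below finds the string that comes first in an array of C-strings; firstInOrder is a hypothetical name, and the function assumes the array holds at least one string.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper: returns the C-string that comes first in
// strcmp() order. Assumes count > 0 and every entry is NUL-terminated.
const char *firstInOrder(const char *const words[], std::size_t count)
{
    const char *best = words[0];
    for (std::size_t i = 1; i < count; ++i)
    {
        int result = std::strcmp(words[i], best);  // call once, save the int
        if (result < 0)                            // words[i] precedes best
        {
            best = words[i];
        }
    }
    return best;
}
```

Keep in mind that strcmp() compares character codes, so on an ASCII system every uppercase letter sorts before every lowercase one: "Zebra" precedes "apple".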