Enhance Text Handling Capabilities
In this post, we’ll examine the text handling capabilities of the .NET Framework. We’ll review three major objectives of 70-536:
- StringBuilder class
- Regular Expressions
- Encoding and Decoding text
Strings in the .NET Framework are immutable. Memorize it: immutable. They cannot be altered. Instead, every time you modify a string, a new String instance is created with the new value, and your variable is pointed at that new instance. So when you need to build strings dynamically (for example, appending line by line), you should use the StringBuilder class.
StringBuilder lives in System.Text, and you can initialize it by passing a desired capacity as an integer, or a starting string. Note that the StringWriter class also uses StringBuilder as a container.
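As a rough illustration of building a string line by line, here is a minimal sketch. The class name, the BuildReport helper and the capacity of 256 are made up for this example; only the StringBuilder calls come from the Framework.

```csharp
using System;
using System.Text;

static class StringBuilderDemo
{
    // Hypothetical helper: builds a numbered report in a single buffer
    // instead of creating a new string for every concatenation.
    public static string BuildReport(string title, string[] lines)
    {
        // Starting string plus a desired capacity (256 is an arbitrary guess;
        // the buffer grows automatically if it turns out to be too small).
        StringBuilder sb = new StringBuilder(title, 256);
        sb.Append('\n');
        for (int i = 0; i < lines.Length; i++)
        {
            sb.Append(i + 1).Append(": ").Append(lines[i]).Append('\n');
        }
        return sb.ToString();
    }

    static void Main()
    {
        Console.Write(BuildReport("Report", new[] { "first", "second" }));
    }
}
```

Each Append call returns the same StringBuilder, which is why the calls can be chained.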
StringBuilder has the following methods:
|The Methods of StringBuilder|
|Append|Appends the string representation of the specified value (has overloads for most value types) or object to the StringBuilder.|
|AppendFormat|Appends a formatted string to the StringBuilder.|
|AppendLine|Appends a line terminator (optionally preceded by a given string) to the StringBuilder.|
|CopyTo|Copies part of the StringBuilder's content into a char array.|
|EnsureCapacity|Guarantees that the capacity of the StringBuilder instance is at least the specified value.|
|Insert|Inserts a value or an object at the specified position.|
|Remove|Removes the specified range of characters from the StringBuilder.|
|Replace|Replaces all occurrences of a char or a string with the specified value.|
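To see a few of these methods side by side, here is a small sketch. The class name, the Edit helper and the sample text are invented for illustration; the comments show the buffer content after each call.

```csharp
using System;
using System.Text;

static class StringBuilderMethodsDemo
{
    // Hypothetical helper that applies several StringBuilder methods in a row.
    public static string Edit()
    {
        StringBuilder sb = new StringBuilder("Hello world");
        sb.EnsureCapacity(64);                 // capacity is now at least 64
        sb.Insert(5, ",");                     // "Hello, world"
        sb.Replace("world", "exam");           // "Hello, exam"
        sb.Remove(0, 7);                       // "exam"
        sb.AppendFormat(" {0}-{1}", 70, 536);  // "exam 70-536"
        return sb.ToString();
    }

    // Copies the first 'count' characters of the buffer into a new char array.
    public static char[] FirstChars(StringBuilder sb, int count)
    {
        char[] target = new char[count];
        sb.CopyTo(0, target, 0, count);
        return target;
    }

    static void Main()
    {
        Console.WriteLine(Edit());
        Console.WriteLine(new string(FirstChars(new StringBuilder("exam 70-536"), 4)));
    }
}
```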
That’s it for StringBuilder. Use it, love it. Now on to Regular Expressions.
First I must confess that I don’t know a damn thing about how to write a regular expression, and I have never used them. For those who are still reading: I’ll cover the classes which work with regular expressions. The ability to write a regular expression (from here on, Regex) is certainly good to have when you are taking the exam, but it isn’t an objective. When dealing with Regex on the exam, try to spot string errors in the patterns. For example, Regex patterns use the escape character “\” a great deal; when there’s no @ character before the pattern string, each backslash must be doubled.
The classes of Regex live in the System.Text.RegularExpressions namespace. The main class here is (what a coincidence!) Regex. It has static and instance methods; let’s take a glance at both of them.
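The backslash rule is easier to see in code. In this sketch the phone-number-like pattern and the names are made up; the point is only that the two constants hold the same pattern text.

```csharp
using System;
using System.Text.RegularExpressions;

static class EscapeDemo
{
    // Regular string literal: every backslash has to be doubled.
    public const string Escaped = "\\d{3}-\\d{4}";

    // Verbatim string literal (note the @): backslashes are written once.
    public const string Verbatim = @"\d{3}-\d{4}";

    // Writing "\d{3}-\d{4}" without the @ does not compile,
    // because \d is not a valid C# escape sequence.

    static void Main()
    {
        Console.WriteLine(Escaped == Verbatim);                 // True
        Console.WriteLine(Regex.IsMatch("555-1234", Verbatim)); // True
    }
}
```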
|Static Methods of Regex|
|IsMatch(string input, string pattern)|Returns a Boolean value indicating whether the input string contains a match for the specified Regex pattern.|
|Match|Returns the first occurrence found in a given string as an instance of the Match class.|
|Matches|Returns all occurrences as a MatchCollection, a strongly-typed collection of Match objects.|
|Replace|Replaces each occurrence of the pattern in a string with the specified value.|
|Split|Splits the string into an array of strings at the positions where the pattern matches.|
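A quick sketch of the five static methods on one input string. The patterns are deliberately simple (remember, I am no Regex expert), and the helper names and sample input are invented for this example.

```csharp
using System;
using System.Text.RegularExpressions;

static class RegexStaticDemo
{
    const string Word = @"\w+";           // one or more word characters
    const string Separators = @"[,;]\s*"; // a comma or semicolon plus trailing whitespace

    public static bool HasDog(string input) { return Regex.IsMatch(input, "dog"); }
    public static string FirstWord(string input) { return Regex.Match(input, Word).Value; }
    public static int CountWords(string input) { return Regex.Matches(input, Word).Count; }
    public static string Normalize(string input) { return Regex.Replace(input, Separators, " | "); }
    public static string[] Parts(string input) { return Regex.Split(input, Separators); }

    static void Main()
    {
        string input = "cat, dog; bird";
        Console.WriteLine(HasDog(input));                  // True
        Console.WriteLine(FirstWord(input));               // cat
        Console.WriteLine(CountWords(input));              // 3
        Console.WriteLine(Normalize(input));               // cat | dog | bird
        Console.WriteLine(string.Join("/", Parts(input))); // cat/dog/bird
    }
}
```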
The instance methods differ from the static in the following ways:
- You don’t have to pass the pattern to each method, because you pass it to the constructor.
- You don’t have to pass RegexOptions values to each method either; pass them to the constructor as well.
That’s all, I think. Now some members of the RegexOptions enumeration: CultureInvariant, IgnoreCase, Multiline, Singleline. Of course, there are more of them. Refer to MSDN for the full list.
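Here is a minimal sketch of the instance style, with the pattern and options given once in the constructor. The log text, the pattern and the class name are assumptions made up for the example.

```csharp
using System;
using System.Text.RegularExpressions;

static class RegexInstanceDemo
{
    // Pattern and options are supplied once, in the constructor.
    // IgnoreCase: "ERROR", "Error" and "error" all match.
    // Multiline: ^ matches at the start of every line, not just the start of the input.
    static readonly Regex ErrorLine =
        new Regex(@"^error:", RegexOptions.IgnoreCase | RegexOptions.Multiline);

    public static int CountErrors(string log)
    {
        // The instance method needs only the input.
        return ErrorLine.Matches(log).Count;
    }

    static void Main()
    {
        string log = "ok\nERROR: disk\nError: net";
        Console.WriteLine(CountErrors(log)); // 2
    }
}
```

Reusing one Regex instance like this also avoids restating the pattern and options at every call site.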
The last objective is encoding and decoding text. The most important encoding formats the .NET Framework supports are as follows:
- ASCII encoding: encodes the characters of the basic Latin alphabet as 7-bit ASCII characters. Anything outside that range cannot be represented. You should use this class only for backward compatibility.
- UTF-8 encoding: uses one to four bytes (8, 16, 24 or 32 bits) per character. It is backward compatible with ASCII and can represent every Unicode character.
- UTF-16 encoding: uses one or two 16-bit integers to represent each Unicode character.
- UTF-32 encoding: uses one 32-bit integer per character.
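To make the size differences concrete, this sketch counts the bytes each encoding needs for the same text. The sample text (the letter A followed by the euro sign, written as \u20AC) and the helper names are made up for the example.

```csharp
using System;
using System.Text;

static class EncodingSizesDemo
{
    // Byte counts for the same text in ASCII, UTF-8, UTF-16 and UTF-32.
    public static int[] Sizes(string text)
    {
        return new int[]
        {
            Encoding.ASCII.GetByteCount(text),
            Encoding.UTF8.GetByteCount(text),
            Encoding.Unicode.GetByteCount(text), // Encoding.Unicode is UTF-16
            Encoding.UTF32.GetByteCount(text)
        };
    }

    // Encodes the text and decodes it again with the same encoding.
    public static string RoundTrip(Encoding encoding, string text)
    {
        return encoding.GetString(encoding.GetBytes(text));
    }

    static void Main()
    {
        string text = "A\u20AC";
        Console.WriteLine(string.Join(", ", Sizes(text)));        // 2, 4, 4, 8
        Console.WriteLine(RoundTrip(Encoding.UTF8, text) == text); // True
        Console.WriteLine(RoundTrip(Encoding.ASCII, text));        // A? (the euro sign is lost)
    }
}
```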
The methods of the Encoding classes are good when you’re dealing with a relatively small amount of information that is available in one piece. When this is not the case (for example, when the data arrives in blocks from a stream), you should call their GetEncoder or GetDecoder methods, which return the appropriate Encoder or Decoder object. These classes are only capable of encoding or decoding, respectively, but they keep their state between calls, so a character that is split across two blocks is still converted correctly. Between them, these classes offer the following methods:
- GetBytes (Encoding and Encoder): encodes a set of characters into bytes in the given format.
- GetByteCount (Encoding and Encoder): returns the number of bytes needed to encode a given text.
- GetChars (Encoding and Decoder): decodes a set of bytes into characters.
- GetCharCount (Encoding and Decoder): returns the number of characters that decoding a given byte array produces.
- GetString (Encoding-only): returns the content of a byte array in string format.
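The following sketch shows why a Decoder matters when bytes arrive in pieces. It assumes UTF-8 input and a euro sign (bytes E2 82 AC) that happens to be cut in the middle; the class name and the DecodeChunks helper are invented for the example.

```csharp
using System;
using System.Text;

static class DecoderDemo
{
    // Hypothetical helper: decodes UTF-8 bytes that arrive as separate chunks.
    public static string DecodeChunks(byte[][] chunks)
    {
        // One Decoder for the whole sequence, so state carries over between chunks.
        Decoder decoder = Encoding.UTF8.GetDecoder();
        StringBuilder sb = new StringBuilder();
        foreach (byte[] chunk in chunks)
        {
            // How many chars this chunk will yield, given the bytes already buffered.
            int needed = decoder.GetCharCount(chunk, 0, chunk.Length);
            char[] chars = new char[needed];
            int written = decoder.GetChars(chunk, 0, chunk.Length, chars, 0);
            sb.Append(chars, 0, written);
        }
        return sb.ToString();
    }

    static void Main()
    {
        // The euro sign split across two chunks.
        byte[][] chunks = { new byte[] { 0xE2, 0x82 }, new byte[] { 0xAC } };
        Console.WriteLine(DecodeChunks(chunks) == "\u20AC"); // True

        // Decoding the first chunk alone with Encoding.GetString cannot
        // recover the character, because the third byte is missing.
        Console.WriteLine(Encoding.UTF8.GetString(chunks[0]) == "\u20AC"); // False
    }
}
```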