Introduction
This article will attempt to help you better understand the System.String type and how it works behind the scenes.
A string is an immutable sequence of characters. Yes, it's true: when we make any change to a string, we don't get back an altered version of the string we started with, but rather a new string that includes our changes.
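To make the immutability point concrete, here is a minimal sketch. The class name ImmutabilityDemo and the helper Shout are made up for illustration; only ToUpperInvariant and object.ReferenceEquals are real framework calls.

```csharp
using System;

public static class ImmutabilityDemo
{
    // Hypothetical helper: returns an upper-cased copy of the argument.
    // The argument itself is never modified.
    public static string Shout(string text)
    {
        return text.ToUpperInvariant();
    }

    public static void Main()
    {
        string original = "hello";
        string changed = Shout(original);

        Console.WriteLine(original); // hello (the original is untouched)
        Console.WriteLine(changed);  // HELLO (a different string object)
        Console.WriteLine(object.ReferenceEquals(original, changed)); // False
    }
}
```

The "modifying" call hands back a second object, while the variable original still refers to the unchanged text.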
A string is an Object; that is, string is a reference type, not a value type. But when we compare strings with the operators == or !=, we are actually comparing content. The reason for this behavior is that the operators == and != are overloaded for string and call the Equals function to do their work. It's important to remember that the overload is chosen at compile time: if either operand is typed as object, == falls back to comparing references. Note also that the operators > and >= are not defined for strings at all; to order two strings, use the Compare function.
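The difference between comparing content and comparing references can be seen in a short sketch. The class EqualityDemo and its two helpers are illustrative names, not part of the framework; the second string is built from a char array so that it is a separate object with the same content.

```csharp
using System;

public static class EqualityDemo
{
    // Uses the overloaded == operator of string, which compares content.
    public static bool SameContent(string a, string b)
    {
        return a == b;
    }

    // Compares object identity, ignoring content.
    public static bool SameReference(string a, string b)
    {
        return object.ReferenceEquals(a, b);
    }

    public static void Main()
    {
        string a = "hello";
        string b = new string(new[] { 'h', 'e', 'l', 'l', 'o' });

        Console.WriteLine(SameContent(a, b));      // True
        Console.WriteLine(SameReference(a, b));    // False
        Console.WriteLine((object)a == (object)b); // False: object == compares references

        // Ordering goes through a Compare method rather than an operator.
        Console.WriteLine(string.CompareOrdinal("apple", "banana") < 0); // True
    }
}
```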
In the .NET Framework, the string type has 8 public constructors. Those that take pointers are marked as not CLS-compliant, which is important for interoperability with other languages. The remaining constructors hold few surprises on closer analysis.
Now, a little bit more about the String.Compare function. The culture-sensitive overloads of Compare are based on the CultureInfo class. (Please don't confuse this with the AssemblyCulture attribute, which is used to distinguish between main and satellite assemblies.) Comparison results for the same pair of strings may differ depending on the selected culture.
For example, this is how CultureInfo is used when we ask for a case-sensitive comparison:
return culture.CompareInfo.Compare(strA, strB, CompareOptions.None);
and for a case-insensitive one:
return culture.CompareInfo.Compare(strA, strB, CompareOptions.IgnoreCase);
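The two calls above can be tried out with a small wrapper. This is only a sketch: CultureCompareDemo and its Compare helper are hypothetical names, and the invariant culture is used here simply because it is always available. Passing another CultureInfo may give different results for the same strings.

```csharp
using System;
using System.Globalization;

public static class CultureCompareDemo
{
    // Hypothetical wrapper around CompareInfo.Compare that picks the
    // CompareOptions value from a simple flag.
    public static int Compare(string strA, string strB, CultureInfo culture, bool ignoreCase)
    {
        CompareOptions options = ignoreCase ? CompareOptions.IgnoreCase : CompareOptions.None;
        return culture.CompareInfo.Compare(strA, strB, options);
    }

    public static void Main()
    {
        CultureInfo culture = CultureInfo.InvariantCulture;

        // Case-sensitive: the two strings are different.
        Console.WriteLine(Compare("string", "STRING", culture, false) == 0); // False

        // Case-insensitive: the two strings compare as equal.
        Console.WriteLine(Compare("string", "STRING", culture, true) == 0);  // True
    }
}
```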
Another interesting point is how strings are compared without regard to culture or language. For a case-insensitive comparison, it uses the CaseInsensitiveComHelper function (written in C++). If a string includes characters greater than 0x80, the function always returns false.
It is interesting how the characters are brought to the same case: lower-case and upper-case letters differ only in the 0x20 bit. So, when an XOR operation shows that a character is not lower case, a bitwise OR brings it to lower case, and only afterwards is the comparison performed. The comparison itself is a trivial walk that increments pointers into the two character arrays. If any of the characters is greater than 0x7F, we get an ArgumentException.
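The 0x20 trick can be sketched in managed code. This is not the runtime's C++ implementation: the class and method names are invented, a range check stands in for the XOR test, array indexing replaces pointer arithmetic, and a non-ASCII character is reported by returning false rather than by throwing.

```csharp
using System;

public static class AsciiCaseFoldSketch
{
    // The single bit that separates 'A' (0x41) from 'a' (0x61).
    private const int CaseBit = 0x20;

    // Returns false when a non-ASCII character is met at a compared position,
    // meaning this fast path does not apply. Otherwise returns true and sets
    // result to a negative, zero, or positive value.
    public static bool TryCompareIgnoreCaseAscii(string a, string b, out int result)
    {
        result = 0;
        int shared = Math.Min(a.Length, b.Length);

        for (int i = 0; i < shared; i++)
        {
            int ca = a[i];
            int cb = b[i];

            if (ca > 0x7F || cb > 0x7F)
            {
                return false;
            }

            // Only upper-case letters are folded; setting the bit on other
            // characters would corrupt them (for example '@' would become '`').
            if (ca >= 'A' && ca <= 'Z') ca |= CaseBit;
            if (cb >= 'A' && cb <= 'Z') cb |= CaseBit;

            if (ca != cb)
            {
                result = ca - cb;
                return true;
            }
        }

        // All shared positions match, so the shorter string sorts first.
        result = a.Length - b.Length;
        return true;
    }

    public static void Main()
    {
        int result;
        bool handled = TryCompareIgnoreCaseAscii("Hello", "hELLO", out result);
        Console.WriteLine(handled + " " + result); // True 0
    }
}
```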
A case-sensitive comparison comes down to a loop that compares the strings character by character. If the compared strings are not of equal length, the number of iterations is bounded by the length of the shorter one.
In C#, string concatenation is implemented in a more sophisticated way than concatenation in Visual Basic 6. The first step is to allocate memory for a character array whose length equals the sum of the lengths of the strings being concatenated. Then the result array is filled with the strings' contents.
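The two steps can be mimicked in managed code. This is a rough sketch under the assumption that a plain char array is an acceptable stand-in for the buffer the runtime fills natively; ConcatSketch is an illustrative name, and the real implementation avoids the extra copy made by new string(buffer).

```csharp
using System;

public static class ConcatSketch
{
    public static string Concat(string first, string second)
    {
        // Treat null as empty, as the built-in concatenation does.
        first = first ?? string.Empty;
        second = second ?? string.Empty;

        // Step 1: one allocation sized to hold both inputs.
        char[] buffer = new char[first.Length + second.Length];

        // Step 2: fill the buffer with the contents of each string in turn.
        first.CopyTo(0, buffer, 0, first.Length);
        second.CopyTo(0, buffer, first.Length, second.Length);

        return new string(buffer);
    }

    public static void Main()
    {
        Console.WriteLine(Concat("foo", "bar")); // foobar
    }
}
```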
The last thing I'd like to discuss is the Replace function, or more exactly the implementation of Replace(string, string). The first step is some error handling: the string being searched for must not be null or empty, and an empty search string raises an ArgumentException. If the search string does not occur at all, the function returns the original string without any further action.
The next step is to build an index of all the positions where a replacement is needed and store it in an integer array. Then we simply walk through that index and copy characters into the result array until we reach an indexed location. There the new value is inserted, the counter is incremented, and the iteration continues. Of course, the whole thing is performed at a low level, with explicit memory allocation.
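The index-then-copy approach can be sketched as follows. This is an approximation in managed code, not the runtime's source: ReplaceSketch is a made-up name, a List<int> stands in for the integer array, and matching is ordinal.

```csharp
using System;
using System.Collections.Generic;

public static class ReplaceSketch
{
    public static string Replace(string source, string oldValue, string newValue)
    {
        if (source == null) throw new ArgumentNullException("source");
        if (oldValue == null) throw new ArgumentNullException("oldValue");
        if (oldValue.Length == 0) throw new ArgumentException("Search string cannot be empty.", "oldValue");
        if (newValue == null) newValue = string.Empty;

        // Pass 1: record the start index of every non-overlapping match.
        List<int> hits = new List<int>();
        int pos = 0;
        while ((pos = source.IndexOf(oldValue, pos, StringComparison.Ordinal)) >= 0)
        {
            hits.Add(pos);
            pos += oldValue.Length;
        }
        if (hits.Count == 0) return source; // nothing to replace

        // Pass 2: allocate the exact result size, then copy runs and insertions.
        char[] result = new char[source.Length + hits.Count * (newValue.Length - oldValue.Length)];
        int src = 0;
        int dst = 0;
        foreach (int hit in hits)
        {
            int run = hit - src; // unchanged characters before this match
            source.CopyTo(src, result, dst, run);
            dst += run;
            newValue.CopyTo(0, result, dst, newValue.Length);
            dst += newValue.Length;
            src = hit + oldValue.Length;
        }
        source.CopyTo(src, result, dst, source.Length - src); // the tail after the last match
        return new string(result);
    }

    public static void Main()
    {
        Console.WriteLine(Replace("a-b-c", "-", "+")); // a+b+c
    }
}
```

Building the index first means the result buffer can be sized exactly once, instead of growing while the copy is in progress.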
A little more about replace: when your job needs a simple and frequently repeated replace operation, consider using regular expressions instead of the Replace function, but choose the pattern with care. Performance differences can be tremendous. Here are some simple code examples that may help you see this:
- Using the Replace function of String (C#):
DateTime t1 = System.DateTime.Now;
for (int i = 0; i < 100; i++)
{
    String digitregex = "9";
    String before = new String('9', 65000);
    String after = before.Replace(digitregex, "");
}
DateTime t2 = System.DateTime.Now;
MessageBox.Show(Convert.ToString(t2 - t1));
This code ran in 0.38 seconds on average.
- Using a dumb regular expression (C#):
DateTime t1 = System.DateTime.Now;
for (int i = 0; i < 100; i++)
{
    Regex digitregex = new Regex("(?<digit>[9])");
    String before = new String('9', 65000);
    String after = digitregex.Replace(before, "");
}
DateTime t2 = System.DateTime.Now;
MessageBox.Show(Convert.ToString(t2 - t1));
This code ran in 17.5 seconds on average. Conclusion: don't use a regular expression like this one in cases of this type.
Now, a little improvement reduces the time to 0.38 seconds:
Regex digitregex = new Regex("(?<digit>[9])*");
And one last improvement brings it to 0.24 seconds:
Regex digitregex = new Regex("(?<digit>[9])+");
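To repeat these measurements on your own machine, a small console harness along the following lines may help. It is a sketch, not the code used for the figures above: ReplaceTiming and Time are invented names, Stopwatch replaces the DateTime arithmetic, and the Regex objects and the input string are created once outside the timed loop, so the numbers will not match the ones quoted here.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

public static class ReplaceTiming
{
    // Runs the given replacement the requested number of times
    // and returns the total elapsed time.
    public static TimeSpan Time(Func<string, string> replace, string input, int iterations)
    {
        Stopwatch watch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            replace(input);
        }
        watch.Stop();
        return watch.Elapsed;
    }

    public static void Main()
    {
        string before = new string('9', 65000);
        Regex single = new Regex("(?<digit>[9])");
        Regex star = new Regex("(?<digit>[9])*");
        Regex plus = new Regex("(?<digit>[9])+");

        Console.WriteLine("String.Replace:   " + Time(s => s.Replace("9", ""), before, 100));
        Console.WriteLine("(?<digit>[9]):    " + Time(s => single.Replace(s, ""), before, 100));
        Console.WriteLine("(?<digit>[9])*:   " + Time(s => star.Replace(s, ""), before, 100));
        Console.WriteLine("(?<digit>[9])+:   " + Time(s => plus.Replace(s, ""), before, 100));
    }
}
```

The first pattern matches one character at a time, so a run of 65,000 nines produces 65,000 separate matches, while the + version consumes the whole run in a single match.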