So I am trying to extract all the words between <title> and </title> in a string, and it gives this error with this code:

string ImpureTitleText = "I was very <title> proud of my nickname throughout high school. but today I couldn’t be .any ¡ different to what my </title> nickname was. ";// insert text file here

int Tstart = LesserImputerText.IndexOf("<title>") + "<title>".Length;
int TEnd = LesserImputerText.IndexOf("</title>");


//
string  PureTitleText= LesserImputerText.Substring(Tstart, TEnd - Tstart);


and with that last line of code it gives the error
Length cannot be less than zero.

and I have no idea how to fix it

also, is there a way to compare 2 large chunks of text to each other to see if they have the same words in them?

What I have tried:

crying, having a rest and then crying some more
Updated 8-Aug-19 3:15am
Comments
phil.o 8-Aug-19 0:44am    
You should postpone the crying part to after having performed a basic debug session.

Simplest way is to use a regex:
C#
private static Regex regex = new Regex("(?<=\\<title\\>).*?(?=\\</title\\>)",
    RegexOptions.Singleline |
    RegexOptions.CultureInvariant |
    RegexOptions.Compiled);
...
    string ImpureTitleText = "I was very <title> proud of my nickname throughout high school. but today I couldn’t be .any ¡ different to what my </title> nickname was. ";
    Match m = regex.Match(ImpureTitleText);
    if (m.Success)
        {
        string inside = m.Value;
        ...
        }
 
Comments
Patrice T 8-Aug-19 6:20am    
Hi Og, I guess this RegEx "(?<=\\<title\\>)(.*?)(?=\\)" would be better.
OriginalGriff 8-Aug-19 6:27am    
Only if you don't want to capture anything. :laugh:
There is no "\" after the "title" - it's a "/" - and even if you fix that, it leaves the closing "less than" in the result.
Patrice T 8-Aug-19 6:40am    
Grrr, the end got killed in copy/paste:
"(?<=\\<title\\>)(.*?)(?=\\</title\\>)"
basically, I just added a capture group
OriginalGriff 8-Aug-19 6:45am    
It happens!
You don't need to group the text in the middle, since you are using the "excluded prefix" and "excluded suffix" match groups - they aren't included in the output text, so adding a group just adds an extra layer of indirection that you don't strictly need.
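To make the point about groups concrete, here is a minimal sketch comparing the lookaround pattern with a plain capture group. The class and method names (`TitleRegexDemo`, `ViaLookaround`, `ViaGroup`) are made up for illustration, and the sample sentence is shortened from the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TitleRegexDemo
{
    // Lookbehind/lookahead: the tags are asserted but not consumed,
    // so the whole match (m.Value) is already the title text.
    private static readonly Regex Lookaround =
        new Regex(@"(?<=<title>).*?(?=</title>)", RegexOptions.Singleline);

    // Capture group: the tags are part of the match,
    // and the title text has to be read from Groups[1].
    private static readonly Regex Grouped =
        new Regex(@"<title>(.*?)</title>", RegexOptions.Singleline);

    public static string ViaLookaround(string text)
    {
        Match m = Lookaround.Match(text);
        return m.Success ? m.Value : null; // null means "no title found"
    }

    public static string ViaGroup(string text)
    {
        Match m = Grouped.Match(text);
        return m.Success ? m.Groups[1].Value : null;
    }

    public static void Main()
    {
        string text = "I was very <title> proud of my nickname </title> nickname was. ";
        Console.WriteLine($"'{ViaLookaround(text)}'"); // ' proud of my nickname '
        Console.WriteLine($"'{ViaGroup(text)}'");      // ' proud of my nickname '
    }
}
```

Both return the same text; the difference is only where you read it from.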
HamzaMcBob 8-Aug-19 9:19am    
So I figured it out: the code I showed above was missing a line that was in the original. It was a regex expression, and that seemed to be mucking it up. It looked like this before:



string ImpureTitleText = "I was very <title> proud of my nickname throughout high school. but today I couldn’t be .any ¡ different to what my </title> nickname was. ";// insert text file here
string LesserImputerText = Regex.Replace(ImpureTitleText, @"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ").ToString(); //Gets rid of non ASCII values (from http://luisquintanilla.me/2018/01/18/real-time-sentiment-analysis-csharp/) but does seem to do anything when putting non ASCII character such as upsidedown exclamation mark.
int Tstart = ImpureTitleText.IndexOf("<title>") + "<title>".Length; //From here to //this regex breaks it
int TEnd = ImpureTitleText.IndexOf("</title>");
string PureTitleText = ImpureTitleText.Substring(Tstart, TEnd - Tstart);
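For anyone hitting the same error, here is a small sketch of why searching the cleaned string fails. The second alternative of that regex, `[^0-9A-Za-z \t]`, replaces every character that is not a letter, digit, space or tab, and that includes `<`, `>` and `/`. After the replace, the tags no longer exist in the cleaned text, so `IndexOf` returns -1 for both. The class name `CleanedTextDemo` and the shortened sentence are only for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CleanedTextDemo
{
    // Same pattern as in the comment above: anything that is not a letter,
    // digit, space or tab is replaced with a space (this includes < > and /).
    public static string Clean(string text) =>
        Regex.Replace(text, @"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ");

    public static void Main()
    {
        string impure = "I was very <title> proud of my nickname </title> nickname was. ";
        string cleaned = Clean(impure);

        // Neither tag survives the cleaning, so both searches return -1.
        int start = cleaned.IndexOf("<title>") + "<title>".Length; // -1 + 7 = 6
        int end = cleaned.IndexOf("</title>");                     // -1

        Console.WriteLine(cleaned.Contains("<title>")); // False
        Console.WriteLine(end - start);                 // -7

        // cleaned.Substring(start, end - start) would therefore throw
        // ArgumentOutOfRangeException: "Length cannot be less than zero."
    }
}
```

Extracting the title from the original string first, and cleaning only the extracted title afterwards, avoids the problem.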
Is it reasonable?
C#
// you are storing the sentence in one variable
string ImpureTitleText = "I was very <title> proud of my nickname throughout high school. but today I couldn’t be .any ¡ different to what my </title> nickname was. ";// insert text file here

// but you search for the keywords in another
int Tstart = LesserImputerText.IndexOf("<title>") + "<title>".Length;
int TEnd = LesserImputerText.IndexOf("</title>");
// checking that both keywords were found is a good idea too.
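As a rough illustration of that check, the sketch below tests both `IndexOf` results before calling `Substring`. `TitleExtractor.TryGetTitle` is a hypothetical helper, not something from the question, and it assumes you want the first `<title>` and the first `</title>` that follows it.

```csharp
using System;

public static class TitleExtractor
{
    // Hypothetical helper: returns false instead of throwing when a tag is missing.
    public static bool TryGetTitle(string text, out string title)
    {
        const string open = "<title>";
        const string close = "</title>";
        title = string.Empty;

        int openAt = text.IndexOf(open, StringComparison.Ordinal);
        if (openAt < 0) return false;            // no opening tag

        int start = openAt + open.Length;
        // Search for the closing tag only after the opening tag.
        int end = text.IndexOf(close, start, StringComparison.Ordinal);
        if (end < 0) return false;               // no closing tag after it

        title = text.Substring(start, end - start);
        return true;
    }

    public static void Main()
    {
        string text = "I was very <title> proud of my nickname </title> nickname was. ";
        if (TryGetTitle(text, out string title))
            Console.WriteLine($"'{title}'");     // ' proud of my nickname '
        else
            Console.WriteLine("No complete <title>...</title> pair found.");
    }
}
```

Because the length passed to `Substring` is only computed once both positions are known to be valid, it can never be negative.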


Your code does not behave the way you expect, and you don't understand why!

There is an almost universal solution: run your code in the debugger step by step and inspect the variables.
The debugger is here to show you what your code is doing; your task is to compare that with what it should do.
There is no magic in the debugger. It doesn't know what your code is supposed to do and it doesn't find bugs; it just helps you by showing you what is going on. When the code doesn't do what is expected, you are close to a bug.
To see what your code is doing, set a breakpoint and watch your code run. The debugger lets you execute lines one by one and inspect variables as it executes.

Debugger - Wikipedia, the free encyclopedia

Mastering Debugging in Visual Studio 2010 - A Beginner's Guide
Basic Debugging with Visual Studio 2010 - YouTube

Debugging C# Code in Visual Studio - YouTube

 
Comments
Patrice T 8-Aug-19 6:22am    
To the anonymous down-voter: I am curious to know the reason for the down vote.
What is wrong with the answer?
Here is another solution that uses a string split on <title> and </title> and finds the pure title in between.

The idea is that we first check whether the string contains both <title> and </title>. This check can be relaxed if needed.

We add a pad to the front and end to handle an edge case. If the text starts with <title> and ends with </title>, the segments before and after the title are empty and are dropped by RemoveEmptyEntries, so the title would no longer be the second element. Padding the text with '.' at both ends keeps three segments in every case.

We then split the string into three segments.

Once the string has been split, we can remove the pad. By design, the pad characters are the first and last characters of the padded text. The removal may not be needed if the text is a temporary variable and not needed later on.

Finally, the second element of the array contains the pure title.

We can refine the logic to handle the case where the closing </title> is missing. In that case, certain assumptions must be made.

C#
//Define a separator
string[] separatingStrings = { "<title>", "</title>"};

string text = "I was very <title> proud of my nickname throughout high school. but today I couldn’t be .any ¡ different to what my </title> nickname was. ";// insert text file here;
System.Console.WriteLine($"Original text: '{text}'");

//check if the string contains "<title> and </title>"
var hasStart = text.Contains(separatingStrings[0]); 
var hasEnd = text.Contains(separatingStrings[1]);

if (hasStart && hasEnd) //if we have valid title
{
	//add pads
	text = "." + text + ".";
	//now split the text into three segments, i.e., before <title>, between <title> and </title>, and after </title>. The second element in the array will contain the valid title.
	//split the text
	string[] splitText = text.Split(separatingStrings, System.StringSplitOptions.RemoveEmptyEntries);

	//remove the pad from the original text (the first and last segments in splitText still carry it)
	text = text.Remove(text.Length - 1, 1).Remove(0, 1);

	System.Console.WriteLine($"{splitText.Length} substrings in text:");

	foreach (var split in splitText)
	{
		System.Console.WriteLine($"	{split}");
	}

	//The title is always the second element
	System.Console.WriteLine($"Final Title: \n\t{splitText[1]}");
}
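As a possible refinement for the missing-ending case, here is a sketch that makes one explicit assumption: if </title> is absent, the title runs to the end of the text. It also uses `StringSplitOptions.None` with a maximum of two parts, so no padding is needed. `SplitTitleDemo.GetTitleOrRest` is an illustrative name, not part of the solution above.

```csharp
using System;

public static class SplitTitleDemo
{
    // Assumption: when </title> is missing, everything after <title> is the title.
    // Returns null when the opening <title> is missing.
    public static string GetTitleOrRest(string text)
    {
        const string open = "<title>";
        const string close = "</title>";

        // Split once on the opening tag: [before, everything after].
        string[] afterOpen = text.Split(new[] { open }, 2, StringSplitOptions.None);
        if (afterOpen.Length < 2) return null;

        // Split the remainder once on the closing tag. If the closing tag is
        // missing, element 0 is simply the whole remainder.
        return afterOpen[1].Split(new[] { close }, 2, StringSplitOptions.None)[0];
    }

    public static void Main()
    {
        Console.WriteLine($"'{GetTitleOrRest("very <title> proud </title> was.")}'"); // ' proud '
        Console.WriteLine($"'{GetTitleOrRest("very <title> proud of it")}'");         // ' proud of it'
    }
}
```

Whether "run to the end of the text" is the right fallback depends on your data; returning null for that case as well would be an equally valid choice.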
 

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


