I am stuck on this code: I am trying to extract a string within a text string in C#

So I am trying to extract all the words between <title> and </title> in a string, and this code gives an error:

string ImpureTitleText = "I was very <title> proud of my nickname throughout high school. but today I couldn’t be .any ¡ different to what my </title> nickname was. ";// insert text file here

int Tstart = LesserImputerText.IndexOf("<title>") + "<title>".Length;
int TEnd = LesserImputerText.IndexOf("</title>");


//
string  PureTitleText= LesserImputerText.Substring(Tstart, TEnd - Tstart);

With that last line of code it gives the error

Length cannot be less than zero.

and I have no idea how to fix it.

Also, is there a way to compare two large blocks of text to each other to see if they have the same words in them?

What I have tried:

crying, having a rest and then crying some more

Posted 7-Aug-19 12:10pm

HamzaMcBob

Updated 8-Aug-19 3:15am

Comments

phil.o 8-Aug-19 0:44am

You should postpone the crying part to after having performed a basic debug session.

3 solutions

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

OriginalGriff · Answer 1 · 2019-08-07T19:13:00

Solution 2

Simplest way is to use a regex:

C#

private static Regex regex = new Regex("(?<=\\<title\\>).*?(?=\\</title\\>)",
    RegexOptions.Singleline |
    RegexOptions.CultureInvariant |
    RegexOptions.Compiled);
...
    string ImpureTitleText = "I was very <title> proud of my nickname throughout high school. but today I couldn’t be .any ¡ different to what my </title> nickname was. ";
    Match m = regex.Match(ImpureTitleText);
    if (m.Success)
        {
        string inside = m.Value;
        ...
        }

Posted 7-Aug-19 19:13pm

OriginalGriff

Comments

Patrice T 8-Aug-19 6:20am

Hi Og, I guess this RegEx "(?<=\\<title\\>)(.*?)(?=\\)" would be better.

OriginalGriff 8-Aug-19 6:27am

Only if you don't want to capture anything. :laugh:
There is no "\" after the "title" - it's a "/" - and even if you fix that, it leaves the closing "less than" in the result.

Patrice T 8-Aug-19 6:40am

Grrr, the end got killed in copy/paste:
"(?<=\\<title\\>)(.*?)(?=\\</title\\>)"
basically, I just added a capture group

OriginalGriff 8-Aug-19 6:45am

It happens!
You don't need to group the text in the middle, since you are using the "excluded prefix" and "excluded suffix" match groups - they aren't included in the output text, so adding a group just adds an extra layer of indirection that you don't strictly need.
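
As a rough illustration of the point about lookarounds versus a capture group, the sketch below runs both styles on a shortened sample string. The class and method names are made up for the example, and the second pattern is a common capture-group form rather than the exact one from the comments above; under those assumptions both calls should return the same text.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaroundDemo
{
    // Lookbehind/lookahead: the tags are checked but are not part of the match,
    // so m.Value is already the inner text.
    public static string WithLookarounds(string text) =>
        Regex.Match(text, @"(?<=<title>).*?(?=</title>)", RegexOptions.Singleline).Value;

    // Capture group: the match includes the tags, so the inner text
    // has to be read from Groups[1] instead.
    public static string WithCaptureGroup(string text) =>
        Regex.Match(text, @"<title>(.*?)</title>", RegexOptions.Singleline).Groups[1].Value;

    public static void Main()
    {
        string text = "I was very <title> proud of my nickname </title> back then.";
        Console.WriteLine($"'{WithLookarounds(text)}'");
        Console.WriteLine($"'{WithCaptureGroup(text)}'");
    }
}
```

Both approaches return an empty string when no complete tag pair is present, so checking Match.Success first (as in the solution above) is still worthwhile.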

HamzaMcBob 8-Aug-19 9:19am

So I figured it out: the code I showed above was missing a line that was in the original. It was a regex replacement, and that seemed to be mucking it up. It looked like this before:

string ImpureTitleText = "I was very <title> proud of my nickname throughout high school. but today I couldn’t be .any ¡ different to what my </title> nickname was.";// insert text file here
string LesserImputerText = Regex.Replace(ImpureTitleText, @"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ").ToString(); //Gets rid of non-ASCII values (from http://luisquintanilla.me/2018/01/18/real-time-sentiment-analysis-csharp/) but doesn't seem to do anything when given a non-ASCII character such as an upside-down exclamation mark.
int Tstart = ImpureTitleText.IndexOf("<title>") + "<title>".Length; //From here to //this regex breaks it
int TEnd = ImpureTitleText.IndexOf("</title>");
string PureTitleText = ImpureTitleText.Substring(Tstart, TEnd - Tstart);
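
To see why searching the cleaned string leads to "Length cannot be less than zero", it may help to trace the numbers. The sketch below reuses the cleaning pattern from the comment above on a shortened sample string; the class and method names are only illustrative.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CleaningDemo
{
    // Same pattern as in the comment above: every character that is not
    // a letter, digit, space or tab is replaced with a space.
    public static string Clean(string text) =>
        Regex.Replace(text, @"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ");

    public static void Main()
    {
        string raw = "I was very <title> proud </title> of it.";
        string cleaned = Clean(raw);

        // '<', '>' and '/' were replaced, so neither tag exists any more.
        int open = cleaned.IndexOf("<title>");   // -1 (not found)
        int close = cleaned.IndexOf("</title>"); // -1 (not found)

        int start = open + "<title>".Length;     // -1 + 7 = 6
        int length = close - start;              // -1 - 6 = -7

        Console.WriteLine($"open={open}, close={close}, length={length}");
        // cleaned.Substring(start, length) would throw
        // ArgumentOutOfRangeException here because length is negative.
    }
}
```

Searching the original string for the tags, or extracting the title before cleaning it, avoids the negative length.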

Patrice T · Answer 2 · 2019-08-07T12:36:00

Is it reasonable?

C#

// you are storing the sentence in a variable
string ImpureTitleText = "I was very <title> proud of my nickname throughout high school. but today I couldn’t be .any ¡ different to what my </title> nickname was. ";// insert text file here

// and you search the Keywords in another
int Tstart = LesserImputerText.IndexOf("<title>") + "<title>".Length;
int TEnd = LesserImputerText.IndexOf("</title>");
// checking if both keywords were found is a good idea too.
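
One possible way to check that both keywords were found is sketched below. TryExtractTitle is a hypothetical helper written for this example, not part of the original code; it assumes the tags appear literally in the text and that the closing tag should only be looked for after the opening one.

```csharp
using System;

public static class TitleExtractor
{
    // Hypothetical helper: returns false instead of throwing when a tag is missing.
    public static bool TryExtractTitle(string text, out string title)
    {
        const string openTag = "<title>";
        const string closeTag = "</title>";
        title = null;

        int open = text.IndexOf(openTag, StringComparison.Ordinal);
        if (open < 0) return false;          // no opening tag

        int start = open + openTag.Length;
        // Search for the closing tag only after the opening tag,
        // so end - start can never be negative.
        int end = text.IndexOf(closeTag, start, StringComparison.Ordinal);
        if (end < 0) return false;           // no closing tag after the opening tag

        title = text.Substring(start, end - start);
        return true;
    }

    public static void Main()
    {
        string text = "I was very <title> proud of my nickname </title> back then.";
        if (TryExtractTitle(text, out string title))
            Console.WriteLine($"Title: '{title.Trim()}'");
        else
            Console.WriteLine("No complete <title>...</title> pair found.");
    }
}
```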

Your code does not behave the way you expect, or you don't understand why!

There is an almost universal solution: run your code in the debugger step by step and inspect the variables.
The debugger is there to show you what your code is doing, and your task is to compare that with what it should do.
There is no magic in the debugger: it doesn't know what your code is supposed to do and it doesn't find bugs, it just helps you by showing you what is going on. When the code doesn't do what is expected, you are close to a bug.
To see what your code is doing, just set a breakpoint and watch your code run; the debugger allows you to execute lines one by one and to inspect variables as it executes.

Debugger - Wikipedia, the free encyclopedia

Mastering Debugging in Visual Studio 2010 - A Beginner's Guide
Basic Debugging with Visual Studio 2010 - YouTube

Debugging C# Code in Visual Studio - YouTube

The debugger is here to only show you what your code is doing and your task is to compare with what it should do.

Benktesh Sharma · Answer 3 · 2019-08-08T03:15:00

Here is another solution that splits the string on <title> and </title> and finds the pure title in between.

The idea is that we first check whether the string contains both <title> and </title>. This can be relaxed if needed.

We add a pad to the front and end to handle the edge case. If <title> and </title> sit at the very start and end of the text, the pad of '.' at both ends means there is still a segment before and after the title. This is needed to identify the title when the text contains only the valid title.

We then split the string into three segments.

Once the string has been split, we can remove the pad. By design, the pad character is the first and last character of the padded text. The removal may not be needed if the text is a temporary variable and not needed later on.

Finally, the second element of the array contains the title that is pure.

We can refine the logic to handle cases where the closing </title> is missing. In that case, certain assumptions must be made.

C#

//Define a separator
string[] separatingStrings = { "<title>", "</title>"};

string text = "I was very <title> proud of my nickname throughout high school. but today I couldn’t be .any ¡ different to what my </title> nickname was. ";// insert text file here;
System.Console.WriteLine($"Original text: '{text}'");

//check if the string contains "<title> and </title>"
var hasStart = text.Contains(separatingStrings[0]); 
var hasEnd = text.Contains(separatingStrings[1]);

if (hasStart && hasEnd) //if we have valid title
{
	//add pads
	text = "." + text + ".";
	//now split the text into three segments, i.e., before <title>, between <title> and </title>, and after </title>. The second element in the array will contain the valid title.
	//split the text
	string[] splitText = text.Split(separatingStrings, System.StringSplitOptions.RemoveEmptyEntries);

	//remove the pad
	text = text.Remove(text.Length - 1, 1).Remove(0, 1);

	//report the number of segments produced by the split
	System.Console.WriteLine($"{splitText.Length} substrings in text:");

	foreach (var split in splitText)
	{
		System.Console.WriteLine($"	{split}");
	}

	//We will always have title in the second
	System.Console.WriteLine($"Final Title: \n\t{splitText[1]}");
}
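
The refinement for a missing closing tag could look roughly like the sketch below. It is only one option and rests on a stated assumption: when </title> is absent, everything after <title> is treated as the title. ExtractOrRest is a hypothetical name, and the sketch uses StringSplitOptions.None with a segment limit of 2 instead of the padding approach above.

```csharp
using System;

public static class SplitTitleDemo
{
    // Hypothetical helper: returns null when there is no opening tag.
    public static string ExtractOrRest(string text)
    {
        // Split once on the opening tag: [before, everything after].
        string[] afterOpen = text.Split(new[] { "<title>" }, 2, StringSplitOptions.None);
        if (afterOpen.Length < 2) return null;

        // Split the remainder once on the closing tag and keep the first part.
        // Assumption: if the closing tag is missing, the whole remainder is the title.
        return afterOpen[1].Split(new[] { "</title>" }, 2, StringSplitOptions.None)[0];
    }

    public static void Main()
    {
        Console.WriteLine($"'{ExtractOrRest("I was <title> proud </title> once.")}'");
        Console.WriteLine($"'{ExtractOrRest("I was <title> proud once.")}'");
    }
}
```

Because empty entries are kept, an empty title such as "<title></title>" comes back as an empty string rather than shifting the array indexes.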
