I have 26 text files and I want to remove certain [special groups of words] from all of them. For now, there is one specific group of text I want to remove.
I'm open to solutions that don't use regex, but I would prefer a regex-based one if possible.
---------------------------------------
sample:
< I>(î áëþäàõ â ðåñòîðàíå)< /I>
< I>(÷åã,î-ë. — of)< /I>
< I>n< /I>
< I>áèáë.< /I>
---------------------------------------

I am thinking of using regular expressions, but I need a pattern that finds < I>, matches whatever is inside, and stops after finding < /I>.
I know I can use @"< I>\w*", but I can't work out how to continue from there...

C#
// note: in the real text there is no space between < and I>;
// I added one here because the tag interferes with this HTML page.
                     if (line[1].Contains("< I>"))
                     {
                        string[] segment = Regex.Split(line[1], "< I>");
                     }

(PS: English is not my native language, and my C# is not very advanced either. Thank you for understanding.)

---continued:
I found a nice regex snippet that looks promising:
"[^"]*"  [solution to match any string within double quotes]

Right now I am delving into regex, and it will take some time until I am familiar with it. Until then, this question will unfortunately remain open; I will close it in the end. If you find something useful in the meantime, I will look it over. Thanks.
Updated 30-Jan-12 1:36am
Comments
Sergey Alexandrovich Kryukov 29-Jan-12 0:36am    
Something I always wanted to know but was afraid to ask:
Where does this Russian text in the archaic Cyrillic Windows-1251 encoding come from? Yes, I know it was the proprietary Windows encoding before Unicode and NT. Where does it come from these days? :-)

Thank you,
--SA
Amir Mahfoozi 29-Jan-12 3:04am    
Are they ASP.NET files ?
_Q12_ 29-Jan-12 6:36am    
SAKryukov - I want to make a personal translator (en-ru/ru-en) to help me learn Russian a little faster; that is why these characters appear in my samples. I have been struggling to build this mini dictionary for some months now. The problem is not the code itself or the tools I use; the BIG problem (and the time-consuming one) is narrowing it down as closely as possible to MY needs. At the same time I write the words down with a pencil to learn them, besides programming. The final result must stay in my head after all, but I imagine that the software can help a bit (I'm not 100% sure that's true). I have made 5 projects for this [big project] alone, in different variants, and I'm learning from mistakes, because mistakes are all I have made so far (and it's very frustrating, believe me). Unfortunately I did not learn all the basics of programming, so I just cope with the gaps and press on; in the end something will give and I will get what I need from it (I hope).
The response was a bit long because it's hard to sum up in a few words.
Sorry about the boring explanation.
The words and grammar (as you have probably found out by now) I learn by myself, with help from specialized linguistic sites; again, only for my own pleasure and curiosity. I use this forum solely for programming, not for linguistics. I talk a lot, don't I? :-)
BTW, I personally don't see the Cyrillic words you see there at all... That's why I posted them in this form, thinking that nobody would notice their origin. But now I'm discouraged by how little I know.
Returning to my original problem: how do I write that regex? I sincerely don't know how to write it (I'm at a medium level in programming), and hand on heart, I am after nothing but knowledge (believe me).
Amir Mahfoozi 29-Jan-12 6:42am    
Dear _q12_. Are you telling this to me or to SA? If to SA, then you have mistakenly replied to my comment. BTW, I think your problem can be solved by using XML loaders :)
_Q12_ 29-Jan-12 6:51am    
Amir - sorry for not responding sooner. The files are *.txt, so they cannot be used with "XML loaders". They contain a lot of text that has been modified so heavily that it is no longer in its original, easy-to-manipulate format.
The text is heavily scrambled.
I need to clean the remaining "garbage" out of it and leave it in an accessible format for future use.
So I need basic string manipulation for it.

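I hope this gives you the general idea for doing the job:
C#
// requires: using System; using System.Text; using System.Text.RegularExpressions;

// generic HTML-tag pattern; it matches both opening and closing tags
string pattern = @"</?\w+((\s+\w+(\s*=\s*(?:"".*?""|'.*?'|[^'"">\s]+))?)+\s*|\s*)/?>";
Regex regex = new Regex(pattern, RegexOptions.Multiline);

StringBuilder sb = new StringBuilder();
sb.Append(@"<I>abc</I>");
sb.Append(@"<I>def</I>");
sb.Append(@"<I>gfi</I>");
sb.Append(@"<I>jkl</I>");
var input = sb.ToString();

var matches = regex.Matches(input);
// tags come in pairs: matches[i] is an opening tag, matches[i + 1] its closing tag
for (int i = 0; i < matches.Count - 1; i += 2)
{
    int start = matches[i].Index + matches[i].Length; // first character after the opening tag
    int length = matches[i + 1].Index - start;        // number of characters before the closing tag
    Console.WriteLine(input.Substring(start, length));
}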


Hope It helps.
 
 
Comments
_Q12_ 29-Jan-12 7:37am    
Hmmm... it looks very complicated. I am very new to regex and have only learned to use \w \* \s so far... I was imagining a simpler solution than this, but I see now that it's a very complicated business.
Do you know another, much simpler way?
I will try it in the end and see what I can come up with, but it's brain-squeezing. I will accept your answer anyway, but honestly, right now I understand maybe 10% of what you wrote there.
Amir Mahfoozi 29-Jan-12 7:45am    
I didn't invent it ;) I just copied and pasted the pattern from Stack Overflow. But the rest of the code is mine :)
This person has described it to some extent: http://haacked.com/archive/2004/10/25/usingregularexpressionstomatchhtml.aspx
BTW, I feel that HTML Agility Pack will solve your problem: http://htmlagilitypack.codeplex.com/
Give it a try when you have time.
Sergey Alexandrovich Kryukov 30-Jan-12 1:40am    
_q12_,

Sorry, you really need to review your attitude. We already had an unpleasant discussion when I answered your other question, but believe me, I do it only to help you.

You keep saying: "too complex", "very complicated". Getting anything good, anything at all, is never easy. There are many easy things, but they usually have very little value.

--SA
_Q12_ 29-Jan-12 7:47am    
In pseudocode I was thinking like this:
search for < I>; when found, record its index.
search for < /I>; when found, record its index too.
everything between those 2 indexes becomes "".
finally, remove the text between those indexes.
Right now I don't care whether it is done with or without regex; I want it simple, not complicated.
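The pseudocode above can be sketched without any regex, using only IndexOf. This is a minimal sketch, not a definitive implementation: the names TagStripper and RemoveTagged are hypothetical, and it assumes the tags are written exactly as <I> and </I> (no space, same letter case) and are never nested.

```csharp
using System;
using System.Text;

public static class TagStripper
{
    // Removes every "<I>...</I>" span, tags included, and keeps everything else.
    // TagStripper and RemoveTagged are hypothetical names used only for this sketch.
    public static string RemoveTagged(string text, string open = "<I>", string close = "</I>")
    {
        var result = new StringBuilder();
        int pos = 0; // start of the text that has not been copied yet

        while (true)
        {
            // index of the next opening tag
            int start = text.IndexOf(open, pos, StringComparison.Ordinal);
            if (start < 0) break;

            // index of the closing tag that follows it
            int end = text.IndexOf(close, start + open.Length, StringComparison.Ordinal);
            if (end < 0) break; // unmatched opening tag: leave the rest untouched

            // keep the text before the opening tag, then skip past the closing tag
            result.Append(text, pos, start - pos);
            pos = end + close.Length;
        }

        // keep whatever follows the last closing tag
        result.Append(text, pos, text.Length - pos);
        return result.ToString();
    }
}
```

Calling TagStripper.RemoveTagged(line) on each line (or on the whole file content) returns the text with the tagged parts cut out. An opening tag without a closing tag is left in place, so nothing is deleted by accident.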
Amir Mahfoozi 29-Jan-12 7:54am    
You can do it with either regex or HTML Agility Pack. The code above does what you describe here. If you suspect there may be syntactically incorrect occurrences, it will need some modification; if not, I think it solves your problem.
Hello _q12_,

Your question is a bit ambiguous. Assuming that you have a predefined list of redundant entries, regex does not help much. Nonetheless, the following might help:


C#
static void Main(string[] args)
{
    List<string> redundant = new List<string>()
    {
        "abc",
        "xyz",
        "...",
    };
    string file = "datafileX.txt";

    string data = File.ReadAllText(file);
    data = ReplaceRedundantContent(data, redundant);
    File.WriteAllText(file, data);

}

private static string ReplaceRedundantContent(string data, List<string> redundant)
{
    string result = data;
    foreach (string remove in redundant)
    {
        // all characters to be taken literally
        string pattern = Regex.Escape("<I>"+remove+"</I>");
        result = Regex.Replace(result, pattern, "");
    }
    return result;
}


If you want to match any text between <I> and </I>, you may use the following pattern:
C#
"<I>.*?</I>"

This matches the text between the tags while taking as little as possible, as indicated by the question mark. Without the question mark, the match would be "greedy", meaning that as much as possible is taken.
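To make the difference concrete, here is a small sketch that applies both variants with Regex.Replace. The class and method names (ItalicRemover, RemoveLazy, RemoveGreedy) are hypothetical, and RegexOptions.Singleline is my own assumption, added so that "." also matches a line break in case a tagged entry spans two lines.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ItalicRemover
{
    // Lazy: ".*?" stops at the first "</I>" it can reach.
    private static readonly Regex Lazy = new Regex("<I>.*?</I>", RegexOptions.Singleline);

    // Greedy: ".*" runs on to the last "</I>" in the input.
    private static readonly Regex Greedy = new Regex("<I>.*</I>", RegexOptions.Singleline);

    // Removes each tagged entry separately and keeps the text between entries.
    public static string RemoveLazy(string text)
    {
        return Lazy.Replace(text, "");
    }

    // Removes everything from the first "<I>" to the last "</I>",
    // including any untagged text in between.
    public static string RemoveGreedy(string text)
    {
        return Greedy.Replace(text, "");
    }
}
```

For the input a<I>n</I>b<I>(of)</I>c, the lazy version leaves abc, while the greedy version leaves only ac, because it also swallows the b between the two tagged entries. For this cleanup job the lazy pattern is the one that fits.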

Cheers

Andi
 
 
_q12_ wrote: "Right now I dont care if it can be done with or without regex, i want it simple - not complicated."

Okay, now that you've opened that door :) : Try this:
C#
private string testString = @"
    < I>(î áëþäàõ â ðåñòîðàíå)< /I>
    < I>(÷åã,î-ë. — of)< /I>
    < I>n< /I>
    < I>áèáë.< /I>";

// split on both the opening and the closing tag, so that neither survives in the output
private string[] stringSeparators = new string[] { "< I>", "< /I>" };

private List<string> cleanStrings = new List<string>();

// assumes you have a Button on a Form named 'button1'
// with this Click EventHandler "wired-up"
private void button1_Click(object sender, EventArgs e)
{
    string[] splitTestString = testString.Split(stringSeparators, StringSplitOptions.RemoveEmptyEntries);

    foreach (string theStr in splitTestString)
    {
        string trimmed = theStr.Trim();

        // skip the whitespace-only pieces left between a closing tag and the next opening tag
        if (trimmed.Length == 0) continue;

        cleanStrings.Add(trimmed);

        // seeing is believing
        Console.WriteLine(trimmed);
    }
}
p.s. I have no doubt one of our "virtuosos" here will simplify this even further !
 
 

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


