How to Parse a text file

Question

Hello,

The data below the line of asterisks is the kind of data I am working with.
I have a lot of files like that. Initially, I don't know what the files look like.
It is the user of the application who supplies the files.
The user first gives example data to the system. For example, the user could give:
Name: Jhon Smith
Address: 36, abcd avenue, Paris

My application has to work out how to parse the file so that it can put the data in a database.
As you can see, using simple delimiters (e.g. space, comma, ...) or regular expressions will not work.

Does anyone have an idea how to approach this problem?

I am open to any type of suggestion.

Regards,

Herve
********************************************

name	id	beer_style	first_brewed	alcohol_content	original_gravity	final_gravity	ibu_scale	country	brewery_brand	color_srm	from_region	containers
Bramling	/m/0cttpqn			4.0					Buntingford Brewery			
Dark Star Hophead Extra	/m/0dl8hjd			5.8					Dark Star			
Brewers Gold	/m/0cttps0	Bitter		4.0					Crouch Vale Brewery			
Wem Brewing Company Cascade Bitter	/m/0dlfhn2								Wem Brewing Company			
Friedrich Dull Krautheimer Urtyp Dunkel	/m/04dqd7m			5.4					Friedrich Dull		Germany	/m/04dr2qr
Nethergate Umbel Ale Coriander Beer	/m/04dqf7b			3.8					Nethergate brewery		United Kingdom	/m/04dr00w
Skinner's Cornish Gold	/m/04dqmzy			5.1					Skinner's Brewery		United Kingdom	/m/04dr1kh
Brouwerij Martens Damburger Export	/m/04dqrz0			5.1					Brouwerij Martens		Belgium	/m/04dr6cb
Concord Brewers Rapscallion Premier	/m/04dqt4r			6.75					Concord Brewers		United States of America	/m/04dr2fm
Federation High Level Strong Brown Ale	/m/04dqhp1			4.5					Federation		United Kingdom	/m/04dr2n_
Chiltern Brewery Glad Tidings Spiced Milk Stout	/m/04dqv4g			4.6					Chiltern Brewery		United Kingdom	/m/04dr57q
Huisbrouwerij Klein Duimpje Hillegoms Tarwe Bier	/m/04dqhd9			5.0					Huisbrouwerij Klein Duimpje		Netherlands	/m/04dqztr
Wickwar Infernal Brew	/m/04dqfy6			4.8					Wickwar		United Kingdom	/m/04dq_9r
Schöfferhofer Hefeweizen	/m/04dqtxc			5.0					Schöfferhofer		Germany	/m/04dr0zh
Woodforde's Nelson's Revenge	/m/04dqqg3			4.5					Woodforde’s Brewery		United Kingdom	/m/04dr6c6
Ridgeway Santa's Butt Winter Porter	/m/04dqlpv			6.0					Ridgeway		United Kingdom	/m/04dqzdh
De Proefbrouwerij Kapel van Viven blond	/m/04dqjhh			6.8					De Proefbrouwerij		Belgium	/m/04dr28r
Ventnor Wight Spirit	/m/04dqkb9			5.0					Ventnor		United Kingdom	/m/04dr6h2
Wye Valley Brewery O'er The Sticks	/m/04dqkxg			4.5					Wye Valley Brewery		United Kingdom	/m/04dr6w2
Cannery Blackberry Porter	/m/04dqbfw			8.0					Cannery		Canada	/m/04dr4s5
Maclay Thistle MacKinnon's Curse (Asda)	/m/04dqfzr			4.1					Maclay Thistle		United Kingdom	/m/04dr58z
Alcazar (Sherwood Forest Brewery Co) Maiden's Magic	/m/04dq9h7			5.0					Alcazar (Sherwood Forest Brewing Co)		United Kingdom	/m/04dq_hm
Molson Stock Ale	/m/04dqqkm			5.0					Molson		Canada	
Hirter Privat Pils	/m/04dql0d			5.2					Hirter		Austria	/m/04dr2tr
Lodzkie (subsidiary of Kaltenberg) Glob Premium	/m/04dqd05			5.5					Lodzkie (subsidiary of

Posted 27-May-11 2:20am

The_Real_Chubaka

Updated 20-Jul-11 0:19am

Nagy Vilmos

v2

5 solutions

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

YvesDaoust · Accepted Answer · 2011-07-19T11:17:00

If I am not mistaken, the rows in your sample file can be described by the following grammar:

(word) /m/ word (word) number . number (word) [/m/ word]

where

word stands for a non-empty sequence of letters,
number stands for a non-empty sequence of digits,
(item) stands for a possibly empty sequence of items,
[item] stands for zero or one item,

and other characters are taken verbatim.

This can be handled by a parser for a context-free grammar; in this particular case, I suspect that a regular expression would even be enough.

By careful analysis of the data rows, you can identify the fixed parts and the optional parts and turn the sample input into a simple grammar rule.

A general treatment may require a bit of compiler theory...
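
A minimal sketch of how the grammar above might be transcribed into a java.util.regex pattern. This is not code from the answer: the class name GrammarRowMatcher and the method parse are hypothetical, tokens are assumed to be separated by whitespace (tabs in the sample), and word is widened to "any run of characters" because the sample names and identifiers contain digits, apostrophes and parentheses. Rows without a decimal number, such as the Wem Brewing Company row, do not fit the grammar as written and are rejected by this sketch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GrammarRowMatcher {
    // Grammar: (word) /m/ word (word) number . number (word) [/m/ word]
    // Assumption: "word" is widened to any run of characters.
    static final Pattern ROW = Pattern.compile(
        "^(.+?)\\s+"             // (word): leading words, here the beer name
      + "(/m/\\S+)\\s+"          // /m/ word: the identifier
      + "(.*?)\\s*"              // (word): optional words, e.g. a beer style
      + "(\\d+\\.\\d+)\\s+"      // number . number: the decimal value
      + "(.*?)"                  // (word): remaining words, may contain tabs
      + "(?:\\s+(/m/\\S+))?$");  // [/m/ word]: optional trailing identifier

    // Returns the six captured parts, or null if the row does not fit the grammar.
    // An absent optional trailing identifier is returned as null in the last slot.
    static String[] parse(String row) {
        Matcher m = ROW.matcher(row);
        if (!m.matches()) {
            return null;
        }
        String[] parts = new String[m.groupCount()];
        for (int i = 0; i < parts.length; i++) {
            parts[i] = m.group(i + 1);
        }
        return parts;
    }

    public static void main(String[] args) {
        String row = "Friedrich Dull Krautheimer Urtyp Dunkel\t/m/04dqd7m\t\t\t5.4"
                   + "\t\t\t\t\tFriedrich Dull\t\tGermany\t/m/04dr2qr";
        String[] parts = parse(row);
        System.out.println(parts == null ? "no match" : String.join(" | ", parts));
    }
}
```

Note that the grammar lumps the brewery and the region into one trailing (word) group, so the fifth part still has to be split further if separate columns are needed.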

TorstenH. · Accepted Answer · 2011-05-27T02:31:00

Solution 1

The delimiter must be defined, or it has to be entered by the user (which is not the best solution...).
If there is no fixed delimiter between the parts of each line, you are pretty much out of luck.

You might reconsider how the user passes such files to your app. It might be better to get the data directly from the source (an MS Excel sheet, perhaps?).

regards
Torsten

Posted 27-May-11 2:31am

TorstenH.

Comments

The_Real_Chubaka 27-May-11 8:45am

I was thinking of using a machine learning approach:

1) Use simple delimiters to parse the file
2) Use a decision-tree classifier (or any other classifier) to correct what was not parsed correctly.

But all this is still a bit vague for now.

TorstenH. 27-May-11 8:49am

Hmm, but in that case you need to define what counts as correct parser output and what counts as wrong output. How do you decide that? Is it really necessary to use such text files? It looks like some kind of export from another app to me.

The_Real_Chubaka 27-May-11 9:57am

Yes, it is the output of part one of this application.
What part one of the app does is:
1) Takes HTML pages and converts them to XML
2) Parses those XML files
3) Produces the output you see in my post.

Part two of the application is what I am trying to do.
The next step will be to put everything in a database.

TorstenH. 27-May-11 15:41pm

So you have an XML file in the step before? Perfect! Use that one; it can't get any more convenient than that.
If you do need to convert that XML into some other format, at least keep it semantic and don't trash it like the output above.

Yvan Rodrigues · Accepted Answer · 2011-05-27T02:32:00

I'm not sure I understand just how different the incoming data might be. From your description there could be any number of fields, in any order, with any meaning, and you don't know any of this beforehand. Even the delimiter could vary.

This is the kind of job that computers aren't particularly good at. Since you are open to suggestions, I would recommend Amazon Mechanical Turk.

It has an API that you can use to post your data on Amazon's Mechanical Turk website. You would probably group your data into batches of 10 or 25 records. At the other end, humans will follow your instructions to decipher the data and determine the delimiters, field names, or whatever you want. You will pay them for this, perhaps $0.01 per record.

You will get your data processed efficiently by real humans, and hungry people in third world countries (and a few closer to home) will make a couple of dollars an hour and be able to feed their families.

Nagy Vilmos · Accepted Answer · 2011-07-20T00:21:00

Solution 5

Further to other answers, it looks like the file is tab delimited and you can use StringTokenizer to break it up.
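
A minimal sketch of the tab-delimited approach, assuming the fields really are separated by single tab characters and that the first line of the file is the header. The class and method names (TabRowSplitter, splitFields, tokenize, toRecord) are hypothetical. One caveat: StringTokenizer treats a run of consecutive delimiters as a single separator, so it silently drops the empty fields that are common in this file. String.split with a negative limit keeps them, so the sketch uses split for the column mapping and shows StringTokenizer only for comparison.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

class TabRowSplitter {
    // Splits on every tab; the negative limit keeps trailing empty fields.
    static String[] splitFields(String line) {
        return line.split("\t", -1);
    }

    // For comparison only: empty fields between consecutive tabs are skipped,
    // so the remaining tokens can no longer be aligned with the header.
    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(line, "\t");
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    // Pairs each header name with the field at the same position.
    static Map<String, String> toRecord(String[] header, String[] fields) {
        Map<String, String> record = new LinkedHashMap<>();
        int n = Math.min(header.length, fields.length);
        for (int i = 0; i < n; i++) {
            record.put(header[i], fields[i]);
        }
        return record;
    }

    public static void main(String[] args) {
        String[] header = splitFields("name\tid\tbeer_style\tfirst_brewed\talcohol_content"
            + "\toriginal_gravity\tfinal_gravity\tibu_scale\tcountry\tbrewery_brand"
            + "\tcolor_srm\tfrom_region\tcontainers");
        String row = "Bramling\t/m/0cttpqn\t\t\t4.0\t\t\t\t\tBuntingford Brewery\t\t\t";
        System.out.println(toRecord(header, splitFields(row)));
        System.out.println(tokenize(row));
    }
}
```

Each resulting map can then be written to the database as one row, with empty strings stored as NULL or left empty as the schema requires.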

Posted 20-Jul-11 0:21am