Hello,

What you see below the line of stars is the type of data I am working with.
I have a lot of files like that. Initially, I don't know what the files look like.
It is the user of the application who has to supply the files.
The user first gives example data to the system, i.e. the user can give as an example:
Name: Jhon Smith
Address: 36, abcd avenue, Paris

My application has to learn how to parse the file so that it can put the data in a database.
As you can see, using simple delimiters (e.g. space, comma, ...) or regular expressions will not work.

Does anyone have an idea how to approach this problem?

I am open to any type of suggestion.

Regards,

Herve
********************************************
name	id	beer_style	first_brewed	alcohol_content	original_gravity	final_gravity	ibu_scale	country	brewery_brand	color_srm	from_region	containers
Bramling	/m/0cttpqn			4.0					Buntingford Brewery			
Dark Star Hophead Extra	/m/0dl8hjd			5.8					Dark Star			
Brewers Gold	/m/0cttps0	Bitter		4.0					Crouch Vale Brewery			
Wem Brewing Company Cascade Bitter	/m/0dlfhn2								Wem Brewing Company			
Friedrich Dull Krautheimer Urtyp Dunkel	/m/04dqd7m			5.4					Friedrich Dull		Germany	/m/04dr2qr
Nethergate Umbel Ale Coriander Beer	/m/04dqf7b			3.8					Nethergate brewery		United Kingdom	/m/04dr00w
Skinner's Cornish Gold	/m/04dqmzy			5.1					Skinner's Brewery		United Kingdom	/m/04dr1kh
Brouwerij Martens Damburger Export	/m/04dqrz0			5.1					Brouwerij Martens		Belgium	/m/04dr6cb
Concord Brewers Rapscallion Premier	/m/04dqt4r			6.75					Concord Brewers		United States of America	/m/04dr2fm
Federation High Level Strong Brown Ale	/m/04dqhp1			4.5					Federation		United Kingdom	/m/04dr2n_
Chiltern Brewery Glad Tidings Spiced Milk Stout	/m/04dqv4g			4.6					Chiltern Brewery		United Kingdom	/m/04dr57q
Huisbrouwerij Klein Duimpje Hillegoms Tarwe Bier	/m/04dqhd9			5.0					Huisbrouwerij Klein Duimpje		Netherlands	/m/04dqztr
Wickwar Infernal Brew	/m/04dqfy6			4.8					Wickwar		United Kingdom	/m/04dq_9r
Schöfferhofer Hefeweizen	/m/04dqtxc			5.0					Schöfferhofer		Germany	/m/04dr0zh
Woodforde's Nelson's Revenge	/m/04dqqg3			4.5					Woodforde’s Brewery		United Kingdom	/m/04dr6c6
Ridgeway Santa's Butt Winter Porter	/m/04dqlpv			6.0					Ridgeway		United Kingdom	/m/04dqzdh
De Proefbrouwerij Kapel van Viven blond	/m/04dqjhh			6.8					De Proefbrouwerij		Belgium	/m/04dr28r
Ventnor Wight Spirit	/m/04dqkb9			5.0					Ventnor		United Kingdom	/m/04dr6h2
Wye Valley Brewery O'er The Sticks	/m/04dqkxg			4.5					Wye Valley Brewery		United Kingdom	/m/04dr6w2
Cannery Blackberry Porter	/m/04dqbfw			8.0					Cannery		Canada	/m/04dr4s5
Maclay Thistle MacKinnon's Curse (Asda)	/m/04dqfzr			4.1					Maclay Thistle		United Kingdom	/m/04dr58z
Alcazar (Sherwood Forest Brewery Co) Maiden's Magic	/m/04dq9h7			5.0					Alcazar (Sherwood Forest Brewing Co)		United Kingdom	/m/04dq_hm
Molson Stock Ale	/m/04dqqkm			5.0					Molson		Canada	
Hirter Privat Pils	/m/04dql0d			5.2					Hirter		Austria	/m/04dr2tr
Lodzkie (subsidiary of Kaltenberg) Glob Premium	/m/04dqd05			5.5					Lodzkie (subsidiary of
Updated 20-Jul-11 0:19am

If I am not mistaken, the rows in your sample file can be described by the following grammar:

(word) /m/ word (word) number . number (word) [/m/ word]

Where

word stands for a non-empty sequence of letters
number a non-empty sequence of digits
(item) a sequence of item (possibly empty)
[item] zero or one item

other characters taken verbatim.

This can be parsed by a context-free parser; in this particular case, I would even guess that a regular expression will do.

By careful analysis of the data rows, you can identify the fixed parts and the optional parts and turn the sample input into a simple grammar rule.

A general treatment can require a bit of compiler theory...
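
As a rough illustration of the regular-expression idea above, here is a minimal Java sketch. The class name `BeerRowGrammar` and the group names are hypothetical. It also loosens the grammar in two ways that are assumptions, not part of the answer: an id may contain digits and underscores (as `/m/04dr2n_` does), and any run of whitespace, tabs included, separates the items.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BeerRowGrammar {
    // (word) /m/ word (word) number . number (word) [/m/ word]
    // Assumption: "word" after /m/ is widened to \w+ (letters, digits, underscore).
    private static final Pattern ROW = Pattern.compile(
        "^(?<name>.+?)\\s+(?<id>/m/\\w+)\\s+"      // leading words, then the id
        + "(?:(?<style>.+?)\\s+)?"                 // optional words before the number
        + "(?<abv>\\d+\\.\\d+)\\s+"                // number . number
        + "(?<rest>.+?)"                           // trailing words (brand, region, ...)
        + "(?:\\s+(?<container>/m/\\w+))?\\s*$");  // optional trailing id

    // Returns the successful matcher, or empty if the row does not fit the grammar.
    static Optional<Matcher> match(String row) {
        Matcher m = ROW.matcher(row);
        return m.matches() ? Optional.of(m) : Optional.empty();
    }
}
```

Rows that contain no `number . number` part (the "Wem Brewing Company Cascade Bitter" row, for example) do not match, which is faithful to the grammar as written; a real rule would probably have to make that part optional as well. Note also that the `rest` group lumps brand and region together, tabs included, because the grammar does not distinguish them.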
 
The delimiter must be defined, or it has to be entered by the user (which is not the best solution...).
If you do not have a definite delimiter for the parts of the given lines, you're pretty much f**ked.

You might rethink how the user passes such files to your app. It might be better to get the data directly (from some MS Excel datasheet?).

regards
Torsten
 
Comments
The_Real_Chubaka 27-May-11 8:45am    
I was thinking of using a machine learning approach:

1) Use simple delimiters to parse the file
2) Use a decision-tree classifier (or any other classifier) to correct what was not parsed correctly.

But all this is still a bit vague for now
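
Step 1 of this idea could start from a simple delimiter guess. The Java sketch below is only a heuristic under stated assumptions, not a machine-learning method: the class name `DelimiterGuesser` and the candidate list are hypothetical, and it assumes a delimiter is plausible when it occurs the same non-zero number of times on every non-empty line. The classifier of step 2 is not shown.

```java
import java.util.List;
import java.util.Optional;

final class DelimiterGuesser {
    // Assumption: these are the only delimiters worth trying.
    private static final char[] CANDIDATES = {'\t', ',', ';', '|'};

    // Picks the candidate with the highest per-line count among those
    // whose count is identical on every non-empty line.
    static Optional<Character> guess(List<String> lines) {
        Character best = null;
        long bestCount = 0;
        for (char c : CANDIDATES) {
            long first = -1;
            boolean consistent = true;
            for (String line : lines) {
                if (line.isEmpty()) continue;
                long n = line.chars().filter(ch -> ch == c).count();
                if (first < 0) {
                    first = n;
                } else if (n != first) {
                    consistent = false;
                    break;
                }
            }
            if (consistent && first > bestCount) {
                best = c;
                bestCount = first;
            }
        }
        return Optional.ofNullable(best);
    }
}
```

Rows on which the guessed delimiter yields an unexpected number of fields would then be the natural candidates to hand to a later correction step.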
TorstenH. 27-May-11 8:49am    
Hmm, but in that case you need to define what counts as correct parser output and what counts as wrong output. How do you decide on this? Is it really necessary to use such txt files? It looks like some kind of export from another app to me.
The_Real_Chubaka 27-May-11 9:57am    
Yes, it is the output of part one of this application.
What part one of the app does is:
1) Takes HTML pages and converts them to XML
2) Parses those XML files
3) The output is what you see in my post.

Part two of the application is what I am trying to do.
The next step will be to put everything in a database.
TorstenH. 27-May-11 15:41pm    
So you have an XML file in the step before? Perfect! Use that one, it can't get any more comfortable than that.
If there is a need to convert that XML file into some other format, at least keep it semantic and don't trash it like the output above.

I'm not sure I understand just how different the incoming data might be. From your description there could be any number of fields, in any order, with any meaning, and you don't know any of this beforehand. Even the delimiter could vary.

This is the kind of job that computers aren't particularly good at. Since you are open to suggestions I would recommend Amazon Mechanical Turk.

It is an API that you can use to post your data on Amazon's turk website. You would probably group your data into 10 or 25 records. At the other end, humans will follow your instructions to decipher the data and determine the delimiters, field names, whatever you want. You will pay them for this. Perhaps $0.01 per record.

You will get your data processed efficiently by real humans and hungry people in third world countries (and a few closer to home) will make a couple of dollars an hour and will be able to feed their families.
 
Further to other answers, it looks like the file is tab delimited and you can use StringTokenizer to break it up.
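
One caveat if you go the tab-delimited route: `java.util.StringTokenizer` treats a run of consecutive delimiters as a single delimiter, so the many empty columns in this data would be skipped and the remaining values would shift to the wrong fields. A minimal sketch that keeps the empty columns uses `String.split` with a negative limit instead. The class name `TabRowParser` is hypothetical, and the sketch assumes the first line of the file is the header row, as in the sample.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class TabRowParser {
    // Maps each header name to the value in the same column.
    // A negative limit makes split() keep empty and trailing fields.
    static Map<String, String> parse(String headerLine, String dataLine) {
        String[] names = headerLine.split("\t", -1);
        String[] values = dataLine.split("\t", -1);
        Map<String, String> record = new LinkedHashMap<>();
        for (int i = 0; i < names.length; i++) {
            // Short (for example truncated) rows get empty strings for missing columns.
            record.put(names[i], i < values.length ? values[i] : "");
        }
        return record;
    }
}
```

Each resulting map has one entry per header column, in header order, which is a convenient shape for a later database insert.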
 

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


