Using Python, I want to separate names (along with initials) and titles from text such as this:

-Kuhn, R. Molekulare Asymmetrie in Stereochemie, 1933, 803.
-Miyashita, A.; Yasuda, A.; Takaya, H.; Toriumi, K.; Ito, T.; Souchi, T.; Noyori, R. Synthesis of 2,2'-bis(diphenylphosphino)-1,1'-binaphthyl (BINAP), an atropisomeric chiral bis(triaryl)phosphine, and its use in the rhodium(I)-catalyzed asymmetric hydrogenation of α-(acylamino)acrylic acids. J. Am. Chem. Soc. 1980, 102, 7932-7934.

What I have tried:

I have worked through various tutorials on machine learning and scikit-learn for more than ten days, and also visited various websites. Most of the material was either theoretical or focused on numerical data. I don't want to explore topics that might not be related to my work (I'm a beginner). I was unable to find a proper starting point for solving this problem.

There are two problems here:

  1. Identify the format.
  2. Define the parsing method.

The main problem is the first one, because there are different formats in which some parts may be absent and there may be multiple authors. To learn about commonly used formats, search the web for "scientific publication reference format".

Once the format has been identified, regular expression operations can be used to split the string.
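
As a rough illustration of the regular expression step, the sketch below separates the author list from the rest of an entry. It assumes the "Surname, I.; Surname, I." author style of the two sample entries and single-word surnames; the names `AUTHOR`, `REFERENCE` and `split_reference` are made up for this example and other styles would need other patterns.

```python
import re

# One author: a capitalised surname, a comma, then one or more initials ("Kuhn, R.").
# Assumption: surnames are single words; multi-word surnames need a wider pattern.
AUTHOR = r"[A-Z][^\s,;.]+, [A-Z]\.(?:[ -]?[A-Z]\.)*"

# Authors are separated by "; " and followed by a space and the rest of the entry.
REFERENCE = re.compile(rf"^-?\s*(?P<authors>{AUTHOR}(?:; {AUTHOR})*) (?P<rest>.+)$")


def split_reference(line):
    """Return (list of authors, remaining text), or None if the line does not match."""
    match = REFERENCE.match(line.strip())
    if match is None:
        return None
    return match.group("authors").split("; "), match.group("rest")


authors, rest = split_reference("-Kuhn, R. Molekulare Asymmetrie in Stereochemie, 1933, 803.")
print(authors)  # ['Kuhn, R.']
print(rest)     # Molekulare Asymmetrie in Stereochemie, 1933, 803.
```

Returning `None` for a non-matching line keeps the "format not recognised" case explicit, so such lines can be collected and inspected.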

Using some kind of self-learning here might be a difficult task. A more practical solution would use predefined formats and check whether the input string matches one of them. When matching fails, the input can be reported and analysed so that a new format can be added.

To automate the task, the format checkers can use some kind of tokens for specific elements that are then translated to corresponding regular expressions.
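
One way to picture the token idea is sketched below: each format is written as a readable template, and every `{token}` is replaced by a regular expression fragment. The token names, the two templates and the format labels are hypothetical and were written only to fit the two sample entries from the question.

```python
import re

AUTHOR = r"[A-Z][^\s,;.]+, [A-Z]\.(?:[ -]?[A-Z]\.)*"

# Hypothetical token vocabulary: token name -> regular expression fragment.
TOKENS = {
    "authors": rf"(?P<authors>{AUTHOR}(?:; {AUTHOR})*)",
    "title": r"(?P<title>.+?)",
    "journal": r"(?P<journal>[A-Z][a-z]*\.(?: [A-Z][a-z]*\.)*)",
    "year": r"(?P<year>\d{4})",
    "volume": r"(?P<volume>\d+)",
    "pages": r"(?P<pages>\d+(?:-\d+)?)",
}

# Hypothetical format templates; a new format is added by writing one more line.
FORMATS = {
    "journal_article": "{authors} {title}. {journal} {year}, {volume}, {pages}.",
    "book_chapter": "{authors} {title}, {year}, {pages}.",
}


def compile_format(template):
    """Translate a template into a compiled regex: tokens are looked up, literals escaped."""
    parts = re.split(r"\{(\w+)\}", template)  # odd indexes hold the token names
    pattern = "".join(
        TOKENS[part] if index % 2 else re.escape(part)
        for index, part in enumerate(parts)
    )
    return re.compile(f"^{pattern}$")


COMPILED = {name: compile_format(template) for name, template in FORMATS.items()}


def parse_reference(line):
    """Try each known format; return (format name, fields) or None when nothing matches."""
    text = line.strip().lstrip("-").strip()
    for name, pattern in COMPILED.items():
        match = pattern.match(text)
        if match:
            return name, match.groupdict()
    return None


name, fields = parse_reference("-Kuhn, R. Molekulare Asymmetrie in Stereochemie, 1933, 803.")
print(name, fields["title"])  # book_chapter Molekulare Asymmetrie in Stereochemie
```

A `None` result marks an entry whose format is not yet known, which is the point where a new template would be added.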
 
 
Comments
Sascha Lefèvre 17-Feb-16 9:12am    
+5
Member 10188486 26-Feb-16 0:52am    
Thank you for your suggestion. I have researched this and found that there are around 8700+ formats, and the number may increase in the future. Isn't there a way, in terms of machine learning (supervised/unsupervised), to segregate the punctuation marks and get the data from them?
Jochen Arndt 26-Feb-16 3:04am    
I don't think that there is an existing solution for this specific problem.
So you would have to implement it yourself.

A possible method:
The input must be split into parts using different rules (e.g. a period may be a delimiter, or it may end an abbreviation such as an initial or a shortened word). Each part is then weighted for its probable content, such as name, title, or source, and the highest weight determines the detected content. Upon successful detection, the pattern and weights are added to a database for further use. When the weights are low, detection fails and the pattern can be stored for manual analysis. Once the algorithm is implemented, it must be trained on a set of patterns whose results are known (usually matching patterns plus wrong ones that should not produce a match).

But you must be aware that you will still get false positives and miss patterns.
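
A very small sketch of the split-and-weigh method described above might look like this. The splitting rule, the weight formulas and the 0.8 threshold are all assumptions chosen for illustration; the database and the learning step are left out.

```python
import re

NAME = re.compile(r"[A-Z][^\s,;.]+, [A-Z]\.(?:[ -]?[A-Z]\.)*")
ABBREVIATION = re.compile(r"[A-Z][a-z]{0,4}\.")
THRESHOLD = 0.8  # hypothetical cut-off; parts below it go to manual analysis


def split_parts(entry):
    """Split at '; ' and at any period followed by whitespace and a capital letter."""
    parts = re.split(r";\s+|(?<=\.)\s+(?=[A-Z])", entry.strip().lstrip("- "))
    return [part for part in parts if part]


def weigh(part):
    """Return a weight between 0 and 1 for each candidate content type."""
    tokens = part.split()
    # Tokens containing digits or looking like short abbreviations hint at a source.
    source_like = sum(
        1 for t in tokens if any(c.isdigit() for c in t) or ABBREVIATION.fullmatch(t)
    )
    source = source_like / len(tokens)
    return {
        "name": 1.0 if NAME.fullmatch(part) else 0.0,
        "title": 1.0 - source,
        "source": source,
    }


def label_parts(entry):
    """Label each part with its highest-weighted type, merging adjacent non-name parts."""
    labelled = []
    for part in split_parts(entry):
        weights = weigh(part)
        label = max(weights, key=weights.get)
        if weights[label] < THRESHOLD:
            label = "uncertain"
        if labelled and label != "name" and labelled[-1][0] == label:
            labelled[-1] = (label, labelled[-1][1] + " " + part)
        else:
            labelled.append((label, part))
    return labelled


for label, text in label_parts("-Kuhn, R. Molekulare Asymmetrie in Stereochemie, 1933, 803."):
    print(label, "|", text)
```

For the Kuhn entry the second part mixes title words with the year and page, so no weight reaches the threshold and the part is labelled `uncertain`, which is the kind of case that would be stored for manual analysis.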

I don't have a complete solution for you, but I have two suggestions that I think could be helpful:

1) As Jochen suggested, you could check different formats. You could implement those checks in a way that they don't produce a plain true/false but a certainty of a match (e.g. a float value between 0 and 1). That way, even if there is no 100% match, you could still choose the format that yielded the highest certainty.

2) One element of determining the certainty of a format-match could be automated googling of the determined title. If you find the title in the search results surrounded by different characters than in your input it would increase the certainty of a match.
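
The first suggestion could be sketched roughly as follows: each format is described by a few independent clues, and the certainty is the fraction of clues found in the input. The clue patterns and format names are hypothetical and fitted to the two sample entries; the web search from the second suggestion is not included because it would need network access.

```python
import re

AUTHORS = r"[A-Z][^\s,;.]+, [A-Z]\.(?:; [A-Z][^\s,;.]+, [A-Z]\.)*"

# Hypothetical clues per format; each clue is a regex that a matching entry should contain.
FORMAT_CLUES = {
    "journal_article": [
        rf"^-?{AUTHORS} ",                 # author list at the start
        r"\. (?:[A-Z][a-z]*\. )+\d{4}",    # abbreviated journal name before the year
        r"\d{4}, \d+, \d+(?:-\d+)?\.$",    # year, volume, pages at the end
    ],
    "book": [
        rf"^-?{AUTHORS} ",                 # author list at the start
        r"[a-z], \d{4}, \d+\.$",           # title text, then year and a single page
    ],
}


def certainty(entry, clues):
    """Fraction of clues found in the entry: 0.0 means none, 1.0 means all."""
    return sum(1 for clue in clues if re.search(clue, entry)) / len(clues)


def best_format(entry):
    """Return (format name, certainty) for the format with the highest certainty."""
    scores = {name: certainty(entry.strip(), clues) for name, clues in FORMAT_CLUES.items()}
    name = max(scores, key=scores.get)
    return name, scores[name]


print(best_format("-Kuhn, R. Molekulare Asymmetrie in Stereochemie, 1933, 803."))
print(best_format("Kuhn, R. Molekulare Asymmetrie in Stereochemie 1933"))
```

The second call shows an entry that satisfies no format completely but still gets a best guess with a lower certainty.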
 
 
Comments
Jochen Arndt 17-Feb-16 9:20am    
+5 for you too.
Good idea to search for the title used in a different context.
Member 10188486 26-Feb-16 0:55am    
If I limit the number of formats, then my application can only cater to a certain number of formats. But I want to universally accept any kind of format and segregate the data. Only then will my application be more usable. Thank you for your suggestion.
Sascha Lefèvre 26-Feb-16 8:40am    
Seeing Jochen's recent reply to you, I can't really add much to it. With 8700 different formats, no automated solution will yield perfect results. If it's that important to you, you will have to try different approaches and compare their effectiveness.

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


