How do I separate text (like names and topics) from punctuation marks?

Question

Using Python, I want to separate names (along with initials) and titles from text such as this:

-Kuhn, R. Molekulare Asymmetrie in Stereochemie, 1933, 803.
-Miyashita, A.; Yasuda, A.; Takaya, H.; Toriumi, K.; Ito, T.; Souchi, T.; Noyori, R. Synthesis of 2,2'-bis(diphenylphosphino)-1,1'-binaphthyl (BINAP), an atropisomeric chiral bis(triaryl)phosphine, and its use in the rhodium(I)-catalyzed asymmetric hydrogenation of α-(acylamino)acrylic acids. J. Am. Chem. Soc. 1980, 102, 7932-7934.

What I have tried:

I have worked through various tutorials on machine learning and scikit-learn for more than ten days, and also visited various websites. Most of the material was either theoretical or focused on numerical data. I don't want to explore topics that might not be related to my work (I'm a beginner). I was unable to find a proper starting point for solving this problem.

Posted 17-Feb-16 2:02am

Member 10188486

2 solutions

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Jochen Arndt · Answer 1 · 2016-02-17T02:51:00

Solution 1

There are two problems here:

1) Identify the format.
2) Define the parsing method.

The main problem is the first one, because there are many different formats, some parts may be absent, and there may be multiple authors. To learn about commonly used formats, search the web for "scientific publication reference format".

Once the format has been identified, regular expression operations (Python's re module) can be used to split the string.
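As a rough illustration of the regular expression step, here is a minimal sketch that assumes the author layout of the two references in the question ("Surname, I." entries separated by "; "). The names AUTHOR, REFERENCE and split_reference are made up for this example, and everything after the authors is returned unsplit, because separating title from source depends on the format.

```python
import re

# Assumed author layout: "Surname, I." or "Surname, I. J." (hypothetical pattern).
SURNAME = r"[^\W\d_]+(?:[-'][^\W\d_]+)*"
AUTHOR = SURNAME + r", [A-Z]\.(?: ?[A-Z]\.)*"

# Optional leading dash, one or more authors separated by "; ", then the rest.
REFERENCE = re.compile(
    rf"^-?\s*(?P<authors>{AUTHOR}(?:; {AUTHOR})*)\s+(?P<rest>.+)$"
)


def split_reference(line):
    """Return (list of authors, remaining text), or None if the line does not match."""
    match = REFERENCE.match(line.strip())
    if match is None:
        return None
    return match.group("authors").split("; "), match.group("rest")


authors, rest = split_reference(
    "-Kuhn, R. Molekulare Asymmetrie in Stereochemie, 1933, 803."
)
print(authors)  # ['Kuhn, R.']
print(rest)     # Molekulare Asymmetrie in Stereochemie, 1933, 803.
```

A line that does not start with at least one author in this layout returns None, which is the point where a different format would have to be tried.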

Using some kind of self-learning here might be a difficult task. A more practical solution would use predefined formats and check whether the input string matches one of them. When matching fails, the input can be reported and analysed so that a new format can be added.

To automate the task, the format checkers can use some kind of tokens for specific elements that are then translated to corresponding regular expressions.
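The token idea could look roughly like the sketch below. The token names, their regular expressions and the compile_format helper are all assumptions made for illustration; a real token set would have to be tuned against the formats that actually occur.

```python
import re

# Hypothetical token vocabulary: each token maps to a named regex group.
TOKENS = {
    "AUTHORS": r"(?P<authors>[^\W\d_]+, [A-Z]\.(?:; [^\W\d_]+, [A-Z]\.)*)",
    "TITLE": r"(?P<title>.+?)",
    "JOURNAL": r"(?P<journal>[A-Z][\w. ]*?)",
    "YEAR": r"(?P<year>(?:1[89]|20)\d\d)",
    "VOLUME": r"(?P<volume>\d+)",
    "PAGES": r"(?P<pages>\d+(?:-\d+)?)",
}


def compile_format(template):
    """Translate a template such as '{AUTHORS} {TITLE}.' into a compiled regex."""
    parts = re.split(r"\{(\w+)\}", template)
    # Even indexes hold literal text, odd indexes hold token names.
    pattern = "".join(
        TOKENS[part] if index % 2 else re.escape(part)
        for index, part in enumerate(parts)
    )
    return re.compile("^" + pattern + "$")


journal_format = compile_format(
    "{AUTHORS} {TITLE}. {JOURNAL} {YEAR}, {VOLUME}, {PAGES}."
)

match = journal_format.match(
    "Miyashita, A.; Noyori, R. Synthesis of BINAP. "
    "J. Am. Chem. Soc. 1980, 102, 7932-7934."
)
if match:
    print(match.groupdict())
else:
    print("No match: report the input so a new format can be added")
```

The first reference from the question ("Kuhn, R. Molekulare Asymmetrie in Stereochemie, 1933, 803.") does not match this template, so it would be reported and need a template of its own.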

Posted 17-Feb-16 2:51am

Jochen Arndt

Comments

Sascha Lefèvre 17-Feb-16 9:12am

+5

Member 10188486 26-Feb-16 0:52am

Thank you for your suggestion. I have researched this and found that there are around 8700+ formats, and the number might increase in the future. Isn't there a way, in terms of machine learning (supervised/unsupervised), to segregate the punctuation marks and get the data from the text?

Jochen Arndt 26-Feb-16 3:04am

I don't think that there is an existing solution for this specific problem.
So you would have to implement it yourself.

A possible method:
The input must be split into parts using different rules (e.g. a period may be a delimiter or may mark an abbreviation of a name or word). Then each part is weighted for probable content such as name, title, or source. The highest weight defines the detected content. Upon successful detection, the pattern and weights are added to a database for further use. With low weights, detection fails and the pattern can be stored for manual analysis. Once the algorithm is implemented, it must learn using a set of patterns where the result is known (usually matching patterns and wrong ones that should not generate a match).
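A very small sketch of the split-and-weight part of this method follows. The splitting rules, the three regular expressions, the weights and the 0.8 threshold are all hand-picked assumptions; the database and the learning step are left out.

```python
import re

# Hypothetical patterns used for weighting.
NAME = re.compile(r"[^\W\d_]+(?:[-'][^\W\d_]+)*, (?:[A-Z]\. ?)+;?")
NUMBER = re.compile(r"[\d-]+[.,;]?")       # e.g. "1980," or "7932-7934."
ABBREV = re.compile(r"[A-Z][a-z]{0,4}\.")  # e.g. "J." or "Chem."


def split_parts(reference):
    """Split at ';' and at periods that look like delimiters, not abbreviations."""
    tokens = reference.lstrip("- ").split()
    parts, current = [], []
    for i, tok in enumerate(tokens):
        current.append(tok)
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        word = tok[:-1]
        if nxt is None or tok.endswith(";"):
            end = True
        elif not tok.endswith("."):
            end = False
        elif len(word) == 1 and word.isupper():
            # An initial closes the part unless another initial/abbreviation follows.
            end = not nxt.rstrip(";").endswith(".")
        else:
            # A longer all-lowercase word before the period ends a sentence.
            end = len(word) > 2 and word.islower()
        if end:
            parts.append(" ".join(current))
            current = []
    return parts


def weigh(part):
    """Return a weight between 0 and 1 for each probable content type."""
    tokens = part.split()
    numbers = sum(bool(NUMBER.fullmatch(t)) for t in tokens)
    abbrevs = sum(bool(ABBREV.fullmatch(t)) for t in tokens)
    return {
        "name": 1.0 if NAME.fullmatch(part) else 0.0,
        "source": (numbers + abbrevs) / len(tokens),
        "title": (len(tokens) - numbers - abbrevs) / len(tokens),
    }


def classify(reference, threshold=0.8):
    """Label each part; low weights give 'unknown' for manual analysis."""
    result = []
    for part in split_parts(reference):
        weights = weigh(part)
        label = max(weights, key=weights.get)
        result.append((part, label if weights[label] >= threshold else "unknown"))
    return result


for part, label in classify("-Kuhn, R. Molekulare Asymmetrie in Stereochemie, 1933, 803."):
    print(label, "|", part)
```

With these rules the second part of the Kuhn reference mixes title words and numbers, so its best weight stays below the threshold and it is reported as "unknown". That is the case where the pattern would be stored for manual analysis.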

But you must be aware that you will still get false positives and miss patterns.

Sascha Lefèvre · Answer 2 · 2016-02-17T03:10:00

Solution 2

I don't have a complete solution for you, but two suggestions which, I think, could be helpful:

1) As Jochen suggested, you could check different formats. You could implement those checks in a way that they don't produce a plain true/false but a certainty of a match (e.g. a float value between 0 and 1). That way, even if there is no 100% match, you could still choose the format that yielded the highest certainty.

2) One element of determining the certainty of a format-match could be automated googling of the determined title. If you find the title in the search results surrounded by different characters than in your input it would increase the certainty of a match.
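Suggestion 1 could be sketched like this: every format is an ordered list of field patterns, and the certainty is simply the fraction of fields that match in sequence. The two formats ("journal" and "book"), their patterns and the scoring rule are assumptions for illustration, not a tested design.

```python
import re

# Hypothetical field patterns; each one must match right after the previous one.
AUTHORS = r"(?:[^\W\d_]+, (?:[A-Z]\.)+(?:; | ))+"
FORMATS = {
    "journal": [
        AUTHORS,
        r".+?\. ",               # title, up to the first ". "
        r"(?:[A-Z][a-z]*\. )+",  # abbreviated journal name
        r"\d{4}, ",              # year
        r"\d+, ",                # volume
        r"\d+(?:-\d+)?\.$",      # pages
    ],
    "book": [
        AUTHORS,
        r".+?, ",                # title, up to the first ", "
        r"\d{4}, ",              # year
        r"\d+(?:-\d+)?\.$",      # page
    ],
}


def certainty(text, fields):
    """Fraction of fields (0.0 to 1.0) that match consecutively from the start."""
    pos = matched = 0
    for field in fields:
        match = re.compile(field).match(text, pos)
        if match is None:
            break
        pos, matched = match.end(), matched + 1
    return matched / len(fields)


def best_format(reference):
    """Return the format with the highest certainty and all scores."""
    text = reference.lstrip("- ")
    scores = {name: certainty(text, fields) for name, fields in FORMATS.items()}
    return max(scores, key=scores.get), scores


name, scores = best_format("-Kuhn, R. Molekulare Asymmetrie in Stereochemie, 1933, 803.")
print(name, scores)
```

Even when no format reaches 1.0, the format with the highest score can still be chosen, or the input can be flagged when the best score is too low.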

Posted 17-Feb-16 3:10am

Sascha Lefèvre

Comments

Jochen Arndt 17-Feb-16 9:20am

+5 for you too.
Good idea to search for the title used in a different context.

Member 10188486 26-Feb-16 0:55am

If I limit the number of formats, then my application can only cater to a certain number of formats. But I want to universally accept any kind of format and segregate the data. Only then will my application be more usable. Thank you for your suggestion.

Sascha Lefèvre 26-Feb-16 8:40am

Seeing Jochen's recent reply to you, I can't really add much to it. With 8700 different formats, no automated solution will yield perfect results. If it's that important to you, you will have to try different approaches and compare their effectiveness.