There are two problems here:
- Identify the format.
- Define the parsing method.
The first is the harder problem: there are many different formats, some parts may be absent, and there may be multiple authors. To learn about commonly used formats, search the web for "scientific publication reference format".
Once the format has been identified, regular expression operations can be used to split the string.
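As a rough illustration of the splitting step, here is a minimal Python sketch. The reference layout "Authors (Year). Title. Journal, Volume, Pages.", the ";" author separator, and the names `REFERENCE_PATTERN` and `parse_reference` are all assumptions made for the example, not a standard format.

```python
import re

# Assumed layout: "Authors (Year). Title. Journal, Volume, Pages."
REFERENCE_PATTERN = re.compile(
    r"^(?P<authors>.+?)\s+\((?P<year>\d{4})\)\.\s+"
    r"(?P<title>.+?)\.\s+"
    r"(?P<journal>[^,]+),\s+"
    r"(?P<volume>\d+),\s+"
    r"(?P<pages>\d+-\d+)\.$"
)


def parse_reference(text):
    """Return a dict of fields, or None if the text does not match the layout."""
    match = REFERENCE_PATTERN.match(text.strip())
    if match is None:
        return None
    fields = match.groupdict()
    # Multiple authors are assumed to be separated by ";".
    fields["authors"] = [name.strip() for name in fields["authors"].split(";")]
    return fields


reference = "Smith, J.; Doe, A. (2020). A study of things. Journal of Examples, 12, 34-56."
print(parse_reference(reference))
```

Named groups keep the result readable, and returning `None` on a mismatch makes it easy to detect strings that belong to a different format.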
Using some kind of self-learning here would likely be difficult. A more practical solution is to use predefined formats and check whether the input string matches one of them. When matching fails, the input can be reported and analysed so that a new format can be added.
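The predefined-format approach could look roughly like the following Python sketch. The two formats in `KNOWN_FORMATS`, the `unmatched` list, and the function name `identify_format` are hypothetical; real patterns would be derived from the reference styles actually encountered.

```python
import re

# Hypothetical predefined formats, tried in order.
KNOWN_FORMATS = {
    "author-year": re.compile(
        r'^(?P<authors>.+?) \((?P<year>\d{4})\)\. (?P<title>.+)\.$'
    ),
    "numbered": re.compile(
        r'^\[(?P<number>\d+)\] (?P<authors>.+?), "(?P<title>.+?)," (?P<year>\d{4})\.$'
    ),
}

# Inputs that matched no format, kept for later analysis.
unmatched = []


def identify_format(text):
    """Return (format_name, fields) for the first matching format, else None."""
    for name, pattern in KNOWN_FORMATS.items():
        match = pattern.match(text)
        if match:
            return name, match.groupdict()
    unmatched.append(text)  # report the failure so a new format can be added
    return None


print(identify_format("Doe, A. (2019). Some title."))
print(identify_format('[3] J. Smith, "Parsing references," 2021.'))
print(identify_format("something else entirely"))
print(unmatched)
```

Collecting the failures in one place gives a concrete list of strings to inspect when deciding which format to add next.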
To automate the task, the format checkers can use tokens for specific elements (such as authors, year, or title) that are then translated into the corresponding regular expressions.
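One possible way to realise the token idea in Python is sketched below. The token names, the `{TOKEN}` template syntax, and the regex fragment chosen for each token are assumptions for illustration only.

```python
import re

# Hypothetical token vocabulary: each token maps to a regex fragment.
TOKEN_PATTERNS = {
    "AUTHORS": r".+?",
    "YEAR": r"\d{4}",
    "TITLE": r".+?",
    "JOURNAL": r"[^,]+",
    "VOLUME": r"\d+",
    "PAGES": r"\d+-\d+",
}


def compile_template(template):
    """Translate a template like '{AUTHORS} ({YEAR}). {TITLE}.' into a compiled regex."""
    # Splitting on a capturing group yields literal text at even indices
    # and token names at odd indices.
    parts = re.split(r"\{(\w+)\}", template)
    regex = ""
    for index, part in enumerate(parts):
        if index % 2:
            regex += f"(?P<{part.lower()}>{TOKEN_PATTERNS[part]})"
        else:
            regex += re.escape(part)  # literal punctuation must match exactly
    return re.compile(f"^{regex}$")


pattern = compile_template("{AUTHORS} ({YEAR}). {TITLE}. {JOURNAL}, {VOLUME}, {PAGES}.")
match = pattern.match("Smith, J. (2020). A study of things. Journal of Examples, 12, 34-56.")
print(match.groupdict() if match else "no match")
```

With this design, adding a new format only requires writing a new template string rather than a new hand-written regular expression.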