NLP algorithm for clustering dictionary words into different categories and classifying new words into those clusters. What could be the code?

Question

1.00/5 (2 votes)

See more:

, +

I have to write a NLP algorithm for grouping words from dictionary into different clusters eg. technology, food, color etc. and then able to classify new words into these clusters. what could be possible code for this simple algorithm? I am new to NLP please help me out.

What I have tried:

I tried K-means clustering but was unable to do so as i am new to NLP.

Posted 20-May-21 3:24am

Member 14844003

Updated 20-May-21 3:41am

Add a Solution

Comments

Dave Kreskowiak 20-May-21 10:37am

If you think someone is going to just hand over code, you are sorely mistaken.

The only code you're going to get is the code YOU write. In order to do that, you have to understand NLP. That's why you got the link from Richard that you did.

Member 14844003 20-May-21 11:02am

obviously i know no one is gonna give me the code i just wanted to know the approach as to weather i have to prepare training data to the different clusters or is there any direct approach.

Jarek Szczegielniak 20-May-21 11:33am

You have a few options here, but in any case you will need to start with converting the words into numeric vectors (and one-hot encoding will not work in this case).
Some approaches you can try are (from the simplest one):
1. Rely on some pre-calculated embeddings (GloVe, word2vec, Bert, etc.), taking advantage of the fact, that words with similar meaning should be relatively "close" together (in some dimensions at least). Anyway, having these embeddings, you can try some classic unsupervised (e.g. clustering) or supervised (e.g. classification) algorithms on such vectors (depending if you have labels in your dataset).
2. Alternatively, you can attempt to obtain a definition of each word (e.g. from Wikipedia / online dictionary), and then use a pertained model (e.g. Bert) to calculate sentence embeddings for each definition. With such vectors, the rest is the same as in option 1 above.
3. Finally, if you have a labeled dataset, you can try to train end-to-end classification model using a pre-trained model (e.g. Bert) on it.

If you are completely new to NLP, I recommend to use a library like SpaCy (https://spacy.io) or SparkNLP, which let you work with multiple models, including GloVe, word-2-vec and transformers networks (such as Bert). Note, that while transformers networks are really good, they are very, very slow (unless you have a GPU/TPU cluster at your disposal of course ;-)).

Member 14844003 20-May-21 11:47am

Thank you so much!!!

1 solution

Add a Solution

Add your solution here

Treat my content as plain text, not as HTML

Preview 0

…

Existing Members

Sign in to your account

...or Join us

Download, Vote, Comment, Publish.

Your Email
Password
Forgot your password?

Your Email
This email is in use. Do you need your password?
Optional Password

I have read and agree to the Terms of Service and Privacy Policy
Please subscribe me to the CodeProject newsletters

When answering a question please:

Read the question carefully.
Understand that English isn't everyone's first language so be lenient of bad spelling and grammar.
If a question is poorly phrased then either ask for clarification, ignore it, or edit the question and fix the problem. Insults are not welcome.
Don't tell someone to read the manual. Chances are they have and don't get it. Provide an answer or move on to the next question.

Let's work to help developers, not make them feel stupid.

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Richard MacCutchan · Accepted Answer · 2021-05-20T03:41:00

Solution 1

NLP - Google Search[^]

Posted 20-May-21 3:41am

Richard MacCutchan