
Comments by Jarek Szczegielniak

Jarek Szczegielniak 20-May-21 11:33am
You have a few options here, but in any case you will need to start by converting the words into numeric vectors (one-hot encoding will not work here, because such vectors carry no information about word meaning).
Some approaches you can try, starting from the simplest:
1. Rely on pre-calculated embeddings (GloVe, word2vec, BERT, etc.), taking advantage of the fact that words with similar meaning should be relatively "close" together (in at least some dimensions). With these embeddings, you can apply classic unsupervised (e.g. clustering) or supervised (e.g. classification) algorithms to the vectors, depending on whether your dataset has labels; see the first sketch after this list.
2. Alternatively, you can obtain a definition of each word (e.g. from Wikipedia or an online dictionary) and then use a pre-trained model (e.g. BERT) to calculate a sentence embedding for each definition. With such vectors, the rest is the same as in option 1 above; see the second sketch after this list.
3. Finally, if you have a labeled dataset, you can train an end-to-end classification model on it by fine-tuning a pre-trained model (e.g. BERT); see the third sketch after this list.
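
A minimal sketch of option 1, assuming spaCy's medium English model (en_core_web_md, which ships GloVe-style vectors) and scikit-learn; the word list and cluster count are just a toy illustration:

import numpy as np
import spacy
from sklearn.cluster import KMeans

# Requires: python -m spacy download en_core_web_md
nlp = spacy.load("en_core_web_md")

words = ["apple", "banana", "cherry", "car", "truck", "bicycle"]
# Look up each word's pre-calculated vector (300 dimensions in this model).
vectors = np.array([nlp.vocab[w].vector for w in words])

# Unsupervised route: group the words into 2 clusters by vector similarity.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vectors)
for word, label in zip(words, kmeans.labels_):
    print(word, "-> cluster", label)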
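
A sketch of option 2, assuming the sentence-transformers library (my choice of toolkit; the model name "all-MiniLM-L6-v2" and the hard-coded definitions are assumptions for illustration, in practice you would fetch real definitions):

from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

definitions = {
    "apple": "A round fruit with red or green skin and crisp flesh.",
    "banana": "A long curved fruit with a soft yellow skin.",
    "car": "A road vehicle with an engine and four wheels.",
    "truck": "A large road vehicle used for carrying goods.",
}

# Encode each definition into a single sentence embedding.
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(list(definitions.values()))

# From here on, proceed exactly as in option 1, e.g. clustering:
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)
print(dict(zip(definitions, labels)))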
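
And a sketch of option 3 using Hugging Face transformers (again my choice of toolkit, not prescribed above); the tiny word/label dataset and hyperparameters are placeholders:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

# Toy labeled dataset: word -> class (0 = fruit, 1 = vehicle).
words, labels = ["apple", "banana", "car", "truck"], [0, 0, 1, 1]
batch = tokenizer(words, padding=True, return_tensors="pt")
targets = torch.tensor(labels)

# Fine-tune the whole network end-to-end for a few passes.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for _ in range(3):
    out = model(**batch, labels=targets)  # returns loss + logits
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()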

If you are completely new to NLP, I recommend using a library like spaCy (https://spacy.io) or Spark NLP, which lets you work with multiple models, including GloVe, word2vec, and transformer networks (such as BERT). Note that while transformer networks are really good, they are very, very slow (unless you have a GPU/TPU cluster at your disposal, of course ;-)).
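
For a quick taste of spaCy in particular (assuming the en_core_web_md model is installed):

import spacy

nlp = spacy.load("en_core_web_md")
# Words with similar meaning land "close" together in vector space.
print(nlp("king").similarity(nlp("queen")))   # relatively high
print(nlp("king").similarity(nlp("banana")))  # relatively low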