Comments by Jarek Szczegielniak (Top 1 by date)
Jarek Szczegielniak
20-May-21 11:33am
You have a few options here, but in any case you will need to start by converting the words into numeric vectors (one-hot encoding will not work in this case).
Some approaches you can try, starting with the simplest:
1. Rely on pre-calculated embeddings (GloVe, word2vec, BERT, etc.), taking advantage of the fact that words with similar meanings should be relatively "close" together (in at least some dimensions). Once you have these embeddings, you can apply classic unsupervised (e.g. clustering) or supervised (e.g. classification) algorithms to the vectors, depending on whether your dataset has labels (a short sketch follows this list).
2. Alternatively, you can obtain a definition of each word (e.g. from Wikipedia or an online dictionary), then use a pre-trained model (e.g. BERT) to calculate sentence embeddings for each definition. With such vectors, the rest is the same as in option 1 above (a second sketch at the end of this answer illustrates this).
3. Finally, if you have a labeled dataset, you can fine-tune a pre-trained model (e.g. BERT) to build an end-to-end classification model on it.
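As a rough illustration of option 1, here is a minimal sketch (assuming spaCy's en_core_web_md model, which bundles GloVe-style word vectors, and scikit-learn for the clustering; the word list and cluster count are just placeholders for your own data):

import spacy
from sklearn.cluster import KMeans

# The medium English model ships static word vectors (the small one does not).
nlp = spacy.load("en_core_web_md")

words = ["apple", "banana", "cherry", "car", "truck", "bicycle"]
vectors = [nlp(word).vector for word in words]   # one ~300-dimensional vector per word

# Unsupervised route: group the words by vector similarity.
kmeans = KMeans(n_clusters=2, random_state=0).fit(vectors)
for word, label in zip(words, kmeans.labels_):
    print(word, "-> cluster", label)

If your dataset is labeled, you would swap the KMeans step for any classifier (e.g. LogisticRegression from scikit-learn) trained on the same vectors.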
If you are completely new to NLP, I recommend using a library like spaCy (https://spacy.io) or SparkNLP, which let you work with multiple models, including GloVe, word2vec and transformer networks (such as BERT). Note that while transformer networks are really good, they are very, very slow (unless you have a GPU/TPU cluster at your disposal, of course ;-)).
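And a similarly hedged sketch of option 2, using the sentence-transformers package as one possible way to get BERT-style sentence embeddings for the definitions (the definitions below are made up; in practice you would fetch real ones from a dictionary or Wikipedia API):

from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# Hypothetical definitions -- replace with ones fetched for your own words.
definitions = {
    "apple":   "a round fruit with firm white flesh and a green or red skin",
    "banana":  "a long curved fruit with a soft inside and a yellow skin",
    "bicycle": "a vehicle with two wheels that you ride by pushing pedals",
}

model = SentenceTransformer("all-MiniLM-L6-v2")        # small BERT-style model
embeddings = model.encode(list(definitions.values()))  # one vector per definition

labels = KMeans(n_clusters=2, random_state=0).fit_predict(embeddings)
for word, label in zip(definitions, labels):
    print(word, "-> cluster", label)

As with option 1, a supervised classifier can replace the clustering step if you have labels.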