Click here to Skip to main content
15,884,906 members
Please Sign up or sign in to vote.
0.00/5 (No votes)
See more:
I want the similar item names to be returned when I use Soundex function. For example if I input "2 inch tape" then it should return values like
2" tape
2 inch tape
2 in tape


from matching existing records. Following is my current query which gives hundreds of unnecessary records along with the required records. Or is there any other way like C# library which can help in achieving this as well so that would be great.

What I have tried:

SQL
select * from InvItem where SOUNDEX(Name) = SOUNDEX('2 inch')

https://i.postimg.cc/PqWpJ1xS/001png.png[^]
Posted
Updated 6-Apr-21 0:01am
v4
Comments
Richard MacCutchan 6-Apr-21 6:35am    
You should not use SOUNDEX for such a query, but LIKE or a regex. SOUNDEX converts a string to something that represents how the characters sound in English, which will not be selective enough for your needs.
Hamza Jawed 2021 6-Apr-21 6:59am    
I have learned that Soundex is good fit with the alphabetic string but if you have a special character such as comma or numeric value then the encoding of soundex would become 0000 and it will bring all the data with numeric value or special character because encoded index would match, that's why I am looking for the better alternatives to Soundex function. If you could help with that.
Richard MacCutchan 6-Apr-21 7:01am    
Exactly why I suggested that SOUNDEX is not a good choice.
Hamza Jawed 2021 6-Apr-21 9:35am    
what do you suggest in this scenario for me to implement?

1 solution

If you are only using Soundex then you will get all the records that have the same Soundex - it doesn't matter what language you use.

Without showing us any of the "unnecessary" values that are being returned it is difficult to advise any specific improvements.

However, you might want to apply a Levenshtein distance algorithm to the soundex results to further filter them down e.g. Levenshtein Distance Algorithm – SQLServerCentral[^]
 
Share this answer
 
Comments
Hamza Jawed 2021 6-Apr-21 5:19am    
@CHill60 I have updated the question and added a link to the image result of query, please look into it. I do not prefer to use Levenshtein Distance Algorithm as it contains multiple loops and I have tens of thousands of records in my table to check, it will make it slow process.
CHill60 6-Apr-21 6:09am    
Unfortunately because you have used an image rather than just pasting the data into your question I am unable to help further to understand why you are getting seemingly unrelated data in your results. I do not have the time nor the inclination to transpose that information into a temporary table to trial any solutions. Be aware that there is a much faster implementation of the LD algorithm at 35x Improved T-SQL LevenShtein Distance Algorithm...at a cost[^]

Instead of trying to "fix" the problem whilst querying data have you considered alternative approaches ...

If you are getting tens of thousands of records returned that all have the same soundex then consider including additional criteria e.g. CategoryId.

Another approach to attempt would be to limit the variety of input - i.e. validate the data on entry so that it always reads "2 inch tape" or whatever. This would of course require a one-off data cleansing exercise.

If you have 10's of thousands of records with that text in, then perhaps further normalisation of your database would help.

I also note that you example includes the word "tape" whereas your query does not. That could also be a factor
CHill60 6-Apr-21 6:19am    
Something else I have just discovered is that the SOUNDEX for both '2 inch tape' and '80 tilapia' is being returned as '0000' on my instance here. I don't know why and don't have time just now to investigate. If this is happening for you, you could try implementing your own version of a soundex function OR extract the non-numeric parts of the value using regex then reapply the first n numerics on the compare
Hamza Jawed 2021 6-Apr-21 6:59am    
I have learned that Soundex is good fit with the alphabetic string but if you have a special character such as comma or numeric value then the encoding of soundex would become 0000 and it will bring all the data with numeric value or special character because encoded index would match, that's why I am looking for the better alternatives to Soundex function. If you could help with that.
CHill60 6-Apr-21 8:15am    
Sounds like you really do need to implement your own Soundex-alike function - one which removes non-alpha characters, soundex the rest, then put back any numeric that "starts" the string.
Or convince your client to tidy up their data.

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)



CodeProject, 20 Bay Street, 11th Floor Toronto, Ontario, Canada M5J 2N8 +1 (416) 849-8900