I have some words which are synonyms that I would like to consider similar, but not identical, to the original word. For instance, the words "restaurant" and "bar" are considered synonyms in this example. Currently, I am getting those synonyms from a database.

I implemented cosine similarity in the following way:

```python
from collections import Counter
import math

def cosine_similarity(vec1, vec2):
    intersection = set(vec1.keys()) & set(vec2.keys())
    numerator = sum([vec1[x] * vec2[x] for x in intersection])
    sum1 = sum([vec1[x] ** 2 for x in list(vec1.keys())])
    sum2 = sum([vec2[x] ** 2 for x in list(vec2.keys())])
    denominator = math.sqrt(sum1) * math.sqrt(sum2)
    # Guard against empty vectors before dividing.
    return numerator / denominator if denominator else 0.0
```

In this scenario, I have to compare the original `v1` against `v2`. To apply cosine similarity under this scenario, I decided to keep the same word in both vectors, but if a word is only a synonym I subtract a "penalty" from its count. Then, I have the following:

```python
v1 = Counter({'restaurant': 1})
v2 = Counter({'restaurant': 0.65})  # 0.65 because the word "restaurant" is a synonym, so a penalty is subtracted
```

However, if I apply this strategy, I end up with a similarity of 1.0 (0.65/0.65). I need to get a similarity below 1.0, because "restaurant" is not considered the same word, it is a synonym. How can I make synonyms contribute to the similarity while keeping control over which words are considered synonyms?
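One way to get a similarity below 1.0 while keeping explicit control over the synonym list is to leave the counts untouched and instead let synonym pairs contribute a discounted cross-term (a soft cosine measure). Below is a minimal sketch of that idea; the `SYNONYM_WEIGHT` table and the `word_sim` helper are hypothetical stand-ins for the database lookup, and the 0.65 weight is taken from the question above.

```python
from collections import Counter
import math

# Hypothetical synonym weights; in the question these pairs come from a database.
# A weight of 1.0 would mean "same word"; lower means "similar but not identical".
SYNONYM_WEIGHT = {frozenset(('restaurant', 'bar')): 0.65}

def word_sim(w1, w2):
    # 1.0 for identical words, the table weight for synonym pairs, 0 otherwise.
    if w1 == w2:
        return 1.0
    return SYNONYM_WEIGHT.get(frozenset((w1, w2)), 0.0)

def soft_cosine(vec1, vec2):
    # Soft cosine: synonym cross-terms count at a discount, so synonymous
    # but different words yield a similarity below 1.0.
    def dot(a, b):
        return sum(a[x] * b[y] * word_sim(x, y) for x in a for y in b)
    denominator = math.sqrt(dot(vec1, vec1)) * math.sqrt(dot(vec2, vec2))
    return dot(vec1, vec2) / denominator if denominator else 0.0

v1 = Counter({'restaurant': 1})
v2 = Counter({'bar': 1})
print(soft_cosine(v1, v2))  # 0.65: similar, but below 1.0
```

Keeping the discount inside the similarity function rather than inside the vectors means the same counters can be reused against many comparison targets without rebuilding them.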
I recommend not using BERT here, because you are interested in word-level information, while BERT only offers subword-level representations. BERT is subword-level, not word-level: before going through the network there is a tokenization process that splits words into word pieces, so you obtain representations of pieces of words, not of the words themselves. For the word "difficult", for example, you may obtain a tokenization like "diff", "i", "cult", and there is no direct way of obtaining a combined representation from the individual subword representations. Also keep in mind that plain word embeddings (e.g. word2vec) only reflect co-occurrence statistics, so the similarity between two embedded vectors may be only loosely related to their semantics (the representations for country names like "france" and "italy" may be close), and there may even be negative correlation (antonyms may be very close). I recommend you look into ELMo instead, which offers word-level contextual representations.

Alternatively, gensim has built-in functionality to find similar words, using Word2vec. You can train a Word2Vec model like this (the gensim documentation covers training a model from scratch):

```python
model = Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)
```

You can then make use of the `most_similar` function to find the top-n similar words. It allows you to input a list of positive and negative words, to tackle the 'good' vs. 'bad' problem:

```python
model.most_similar(positive=[...], negative=[...], topn=10, restrict_vocab=None)
```

An example, provided in the documentation:

```python
model.most_similar(positive=['woman', 'king'], negative=['man'])
# [('queen', 0.50882536), ...]
```

`topn` = the number of nearest neighbors you want for the given combination of positive and negative words.

`restrict_vocab` = an optional integer which limits the range of vectors searched for most-similar values. For example, `restrict_vocab=10000` would only check the first 10000 word vectors in the vocabulary order. (This may be meaningful if you've sorted the vocabulary by descending frequency.)

You can also look at some other functions that come with it which allow you to find similar words from a single vector, such as `self.wv.similar_by_vector()`; these are described in the gensim documentation as well.
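Putting the pieces together, here is a small end-to-end sketch. The toy corpus and query word are invented for illustration, and the call signature assumes gensim 3.x as in the answer above (gensim 4 renamed `size` to `vector_size` and dropped `model.most_similar` in favor of `model.wv.most_similar`):

```python
from gensim.models import Word2Vec

# Toy tokenized corpus, invented for illustration; real models need far more text.
sentences = [
    ['we', 'ate', 'dinner', 'at', 'the', 'restaurant'],
    ['we', 'had', 'drinks', 'at', 'the', 'bar'],
    ['the', 'restaurant', 'was', 'crowded'],
    ['the', 'bar', 'was', 'crowded'],
] * 50  # repeated so every word clears min_count=5

model = Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)

# Ten nearest neighbors of "restaurant"; add negative=[...] to steer away from words.
print(model.wv.most_similar(positive=['restaurant'], topn=10))
```

On such a tiny corpus the neighbors are not meaningful; the point is only the shape of the workflow.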
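Continuing with the hypothetical `model` from the sketch above, `similar_by_vector` makes the same kind of query from a raw vector instead of a word, which is useful once you have combined or averaged embeddings yourself:

```python
# Midpoint of the "restaurant" and "bar" vectors; nearby words should relate to both.
midpoint = (model.wv['restaurant'] + model.wv['bar']) / 2
print(model.wv.similar_by_vector(midpoint, topn=5))
```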