Finding synonyms

A synonym is a word or phrase that means exactly or very nearly the same as another word. From a purely literary perspective this definition is correct, but in a broader sense, words that have a very close relationship within a given context are also said to be synonymous in that context. For example, Roger Federer is synonymous with tennis. Finding this kind of contextual synonym is a very common requirement in entity recognition, machine translation, and so on. The Word2Vec algorithm computes a distributed vector representation of the words in a given document or collection of words. In this vector space, words that are similar or synonymous lie close to each other.
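
Closeness in this vector space is typically measured with cosine similarity, which is also what the similarity column returned by findSynonyms later in this section contains. The following is a minimal sketch of that measure, using two made-up 3-dimensional word vectors:

	  scala> // Two hypothetical word vectors, made up purely for illustration
	  scala> val v1 = Array(0.21, -0.43, 0.88)
	  scala> val v2 = Array(0.19, -0.40, 0.91)
	  scala> // Cosine similarity: dot product divided by the product of the norms
	  scala> val dot = v1.zip(v2).map { case (a, b) => a * b }.sum
	  scala> def norm(v: Array[Double]): Double = math.sqrt(v.map(x => x * x).sum)
	  scala> val cosineSimilarity = dot / (norm(v1) * norm(v2))

A value close to 1.0 indicates that the two words occur in very similar contexts and are therefore treated as near-synonyms.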

The University of California, Irvine Machine Learning Repository (http://archive.ics.uci.edu/ml/index.html) provides many datasets as a service to those interested in learning machine learning. The Twenty Newsgroups Dataset (http://archive.ics.uci.edu/ml/datasets/Twenty+Newsgroups) is used here to find synonyms of words in context. It consists of 20,000 messages taken from 20 newsgroups.

Note

The Twenty Newsgroups Dataset download link lets you download the dataset discussed here. The file 20_newsgroups.tar.gz has to be downloaded and unzipped. The data directory used in the following code snippets should point to the directory where the unzipped data is available. If the Spark driver gives an out-of-memory error because of the huge size of the data, remove the newsgroups that are not of interest and experiment with a subset of the data. Here, only the following newsgroup data is used to train the model: talk.politics.guns, talk.politics.mideast, talk.politics.misc, and talk.religion.misc.
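
Alternatively, instead of deleting directories, the input path can be narrowed with a glob pattern. sc.wholeTextFiles accepts Hadoop-style globs, so a sketch such as the following (the base path is a placeholder to be adjusted) reads only those four newsgroups:

	  scala> // Hadoop-style glob that picks up only the four newsgroups used here;
	  scala> // adjust the base path to wherever the archive was unzipped
	  scala> val dataDir = "/path/to/20_newsgroups/{talk.politics.guns,talk.politics.mideast,talk.politics.misc,talk.religion.misc}/*"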

At the Scala REPL prompt, try the following statements:

	  
	  scala> import org.apache.spark.ml.feature.{HashingTF, Tokenizer, RegexTokenizer, Word2Vec, StopWordsRemover}
	  
      import org.apache.spark.ml.feature.{HashingTF, Tokenizer, RegexTokenizer, Word2Vec, StopWordsRemover}
    
	scala> // TODO - Change this directory to the right location where the data is stored
	scala> val dataDir = "/Users/RajT/Downloads/20_newsgroups/*"
	
      dataDir: String = /Users/RajT/Downloads/20_newsgroups/*
    
	scala> //Read the entire text into a DataFrame
	scala> // Only the following directories under the data directory have been considered for running this program: talk.politics.guns, talk.politics.mideast, talk.politics.misc, talk.religion.misc. All other directories were removed before running this program. There is no harm in retaining all the data; the only difference will be in the output.
	scala>  val textDF = sc.wholeTextFiles(dataDir).map{case(file, text) => text}.map(Tuple1.apply).toDF("sentence")
	
      textDF: org.apache.spark.sql.DataFrame = [sentence: string]
    
	scala>  // Tokenize the sentences to words
	scala>  val regexTokenizer = new RegexTokenizer().setInputCol("sentence").setOutputCol("words").setPattern("\\w+").setGaps(false)
	
      regexTokenizer: org.apache.spark.ml.feature.RegexTokenizer = regexTok_ba7ce8ec2333
    
	scala> val tokenizedDF = regexTokenizer.transform(textDF)
	
      tokenizedDF: org.apache.spark.sql.DataFrame = [sentence: string, words: array<string>]
    
	scala>  // Remove stop words such as a, an, the, I, and so on, which don't have any specific relevance to the synonyms
	scala> val remover = new StopWordsRemover().setInputCol("words").setOutputCol("filtered")
	
      remover: org.apache.spark.ml.feature.StopWordsRemover = stopWords_775db995b8e8
    
	scala> //Remove the stop words from the text
	scala> val filteredDF = remover.transform(tokenizedDF)
	
      filteredDF: org.apache.spark.sql.DataFrame = [sentence: string, words: array<string> ... 1 more field]
    
	scala> //Prepare the Estimator
	scala> // setVectorSize sets the dimension of the word vectors, and setMinCount sets the minimum number of times a token must appear to be included in the Word2Vec model's vocabulary.
	scala> val word2Vec = new Word2Vec().setInputCol("filtered").setOutputCol("result").setVectorSize(3).setMinCount(0)
	
      word2Vec: org.apache.spark.ml.feature.Word2Vec = w2v_bb03091c4439
    
	scala> //Train the model
	scala> val model = word2Vec.fit(filteredDF)
	
      model: org.apache.spark.ml.feature.Word2VecModel = w2v_bb03091c4439   
    
	scala> //Find 10 synonyms of a given word
	scala> val synonyms1 = model.findSynonyms("gun", 10)
	
      synonyms1: org.apache.spark.sql.DataFrame = [word: string, similarity: double]
    
	scala> synonyms1.show()
	
      +---------+------------------+
    
      |     word|        similarity|
    
      +---------+------------------+
    
      |      twa|0.9999976163843671|
    
      |cigarette|0.9999943935045497|
    
      |    sorts|0.9999885527530025|
    
      |       jj|0.9999827967650881|
    
      |presently|0.9999792188771406|
    
      |    laden|0.9999775888361028|
    
      |   notion|0.9999775296680583|
    
      | settlers|0.9999746245431419|
    
      |motivated|0.9999694932468436|
    
      |qualified|0.9999678135106314|
    
      +---------+------------------+
    
	scala> //Find 10 synonyms of a different word
	scala> val synonyms2 = model.findSynonyms("crime", 10)
	
      synonyms2: org.apache.spark.sql.DataFrame = [word: string, similarity: double]
    
	scala> synonyms2.show()
	
      +-----------+------------------+
    
      |       word|        similarity|
    
      +-----------+------------------+
    
      | abominable|0.9999997331058447|
    
      |authorities|0.9999946968941679|
    
      |cooperation|0.9999892536435327|
    
      |  mortazavi| 0.999986396931714|
    
      |herzegovina|0.9999861828226779|
    
      |  important|0.9999853354260315|
    
      |      1950s|0.9999832312575262|
    
      |    analogy|0.9999828272311249|
    
      |       bits|0.9999820987679822|
    
      |technically|0.9999808208936487|
    
      +-----------+------------------+

The preceding code snippet packs in a lot of functionality. The dataset is read from the filesystem into a DataFrame, with the entire text of each file treated as one sentence. Tokenization is then performed using a regular expression, converting each sentence into words and removing the gaps. From those words, the stop words are removed so that only relevant words remain. Finally, using the Word2Vec estimator, a model is trained on the prepared data, and synonyms are determined from the trained model. Note that the vector size here is only 3, which is why the similarity scores are all so close to 1.0; with a more typical vector size (the default is 100), the neighbours become more discriminative.
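
The same three stages can also be chained into a single ML Pipeline, so that the whole sequence is fitted in one step. The following is a minimal sketch, reusing the regexTokenizer, remover, and word2Vec objects defined above:

	  scala> import org.apache.spark.ml.Pipeline
	  scala> // Chain tokenization, stop word removal, and Word2Vec into one estimator
	  scala> val pipeline = new Pipeline().setStages(Array(regexTokenizer, remover, word2Vec))
	  scala> // Fitting the pipeline runs all three stages on the raw sentences
	  scala> val pipelineModel = pipeline.fit(textDF)
	  scala> // The trained Word2VecModel is the last stage of the fitted pipeline
	  scala> val w2vModel = pipelineModel.stages.last.asInstanceOf[org.apache.spark.ml.feature.Word2VecModel]
	  scala> w2vModel.findSynonyms("gun", 10).show()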

The following code demonstrates the same use case using Python. At the Python REPL prompt, try the following statements:


	  >>> from pyspark.ml.feature import Word2Vec
	  >>> from pyspark.ml.feature import RegexTokenizer
	  >>> from pyspark.sql import Row
	  >>> # TODO - Change this directory to the right location where the data is stored
	  >>> dataDir = "/Users/RajT/Downloads/20_newsgroups/*"
	  >>> # Read the entire text into a DataFrame. Only the following directories under the data directory have been considered for running this program: talk.politics.guns, talk.politics.mideast, talk.politics.misc, talk.religion.misc. All other directories were removed before running this program. There is no harm in retaining all the data; the only difference will be in the output.
	  >>> textRDD = sc.wholeTextFiles(dataDir).map(lambda recs: Row(sentence=recs[1]))
	  >>> textDF = spark.createDataFrame(textRDD)
	  >>> # Tokenize the sentences to words
	  >>> regexTokenizer = RegexTokenizer(inputCol="sentence", outputCol="words", gaps=False, pattern="\\w+")
	  >>> tokenizedDF = regexTokenizer.transform(textDF)
	  >>> # Prepare the Estimator
	  >>> # vectorSize sets the dimension of the word vectors, and minCount sets the minimum number of times a token must appear to be included in the Word2Vec model's vocabulary.
	  >>> word2Vec = Word2Vec(vectorSize=3, minCount=0, inputCol="words", outputCol="result")
	  >>> # Train the model
	  >>> model = word2Vec.fit(tokenizedDF)
	  >>> # Find 10 synonyms of a given word
	  >>> synonyms1 = model.findSynonyms("gun", 10)
	  >>> synonyms1.show()
	  
      +---------+------------------+
    
      |     word|        similarity|
    
      +---------+------------------+
    
      | strapped|0.9999918504219028|
    
      |    bingo|0.9999909957939888|
    
      |collected|0.9999907658056393|
    
      |  kingdom|0.9999896797527402|
    
      | presumed|0.9999806586578037|
    
      | patients|0.9999778970248504|
    
      |    azats|0.9999718388241235|
    
      |  opening| 0.999969723774294|
    
      |  holdout|0.9999685636131942|
    
      | contrast|0.9999677676714386|
    
      +---------+------------------+
    
	>>> # Find 10 synonyms of a different word
	>>> synonyms2 = model.findSynonyms("crime", 10)
	>>> synonyms2.show()
	
      +-----------+------------------+
    
      |       word|        similarity|
    
      +-----------+------------------+
    
      |   peaceful|0.9999983523475047|
    
      |  democracy|0.9999964568156694|
    
      |      areas| 0.999994036518118|
    
      |  miniscule|0.9999920828755365|
    
      |       lame|0.9999877327660102|
    
      |    strikes|0.9999877253180771|
    
      |terminology|0.9999839393584438|
    
      |      wrath|0.9999829348358952|
    
      |    divided| 0.999982619125983|
    
      |    hillary|0.9999795817857984|
    
      +-----------+------------------+

The major difference between the Scala implementation and the Python implementation is that in the Python implementation shown here, the stop words have not been removed. Because of this difference, the lists of synonyms generated by the Scala program and the Python program are different.
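
That said, PySpark does ship a StopWordsRemover in pyspark.ml.feature, so the stop word removal step can be mirrored in Python as well. The following is a minimal sketch, reusing the tokenizedDF DataFrame built above:

	  >>> from pyspark.ml.feature import StopWordsRemover
	  >>> # Remove the stop words from the tokenized words, as in the Scala version
	  >>> remover = StopWordsRemover(inputCol="words", outputCol="filtered")
	  >>> filteredDF = remover.transform(tokenizedDF)
	  >>> # Train Word2Vec on the filtered words instead of the raw tokens
	  >>> word2Vec = Word2Vec(vectorSize=3, minCount=0, inputCol="filtered", outputCol="result")
	  >>> model = word2Vec.fit(filteredDF)
	  >>> model.findSynonyms("gun", 10).show()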
