Identifying the gender

Identifying the gender of a name is an interesting task in NLP. We will use the heuristic that the last few characters of a name are its defining characteristic. For example, a name that ends with "la" is most likely a female name, such as "Angela" or "Layla". On the other hand, a name that ends with "im" is most likely a male name, such as "Tim" or "Jim". As we are not sure of the exact number of characters to use, we will experiment with this. Let's see how to do it.
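To make the heuristic concrete, here is a minimal sketch of what "the last few characters" looks like for different lengths. The helper name last_letters is hypothetical and used only for illustration; it relies on plain Python slicing, so a name shorter than the requested length is simply returned whole.

```python
def last_letters(name, num_letters):
    """Hypothetical helper: return the last `num_letters` characters, lowercased.

    Slicing with a negative start never raises an error; if the name is
    shorter than `num_letters`, the whole name is returned.
    """
    return name[-num_letters:].lower()


if __name__ == '__main__':
    # Names taken from the examples in the text above
    for name in ['Angela', 'Layla', 'Tim', 'Jim']:
        suffixes = [last_letters(name, n) for n in range(1, 5)]
        print(name, suffixes)
```

Note that for a three-letter name such as "Tim", asking for four letters yields the same feature as asking for three, so longer suffix lengths add no information for short names.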

How to do it…

  1. Create a new Python file, and import the following packages:
    import random
    from nltk.corpus import names
    from nltk import NaiveBayesClassifier
    from nltk.classify import accuracy as nltk_accuracy
  2. We need to define a function to extract features from input words:
    # Extract features from the input word
    def gender_features(word, num_letters=2):
        return {'feature': word[-num_letters:].lower()}
  3. Let's define the main function. We need some labeled training data:
    if __name__ == '__main__':
        # Extract labeled names
        labeled_names = ([(name, 'male') for name in names.words('male.txt')] +
                         [(name, 'female') for name in names.words('female.txt')])
  4. Seed the random number generator, and shuffle the training data:
        random.seed(7)
        random.shuffle(labeled_names)
  5. Define some input names to play with:
        input_names = ['Leonardo', 'Amy', 'Sam']
  6. As we don't know how many ending characters we need to consider, we will sweep the parameter space from 1 to 4 (range(1, 5) excludes its upper bound). Each time, we will extract the features, as follows:
        # Sweeping the parameter space
        for i in range(1, 5):
            print('\nNumber of letters: %d' % i)
            featuresets = [(gender_features(n, i), gender) for (n, gender) in labeled_names]
  7. Divide this into train and test datasets:
            train_set, test_set = featuresets[500:], featuresets[:500]
  8. We will use the Naive Bayes classifier to do this:
            classifier = NaiveBayesClassifier.train(train_set)
  9. Evaluate the classifier for each value in the parameter space:
            # Print classifier accuracy
            print('Accuracy ==> ' + str(100 * nltk_accuracy(classifier, test_set)) + '%')

            # Predict outputs for new inputs
            for name in input_names:
                print(name + ' ==> ' + classifier.classify(gender_features(name, i)))
  10. The full code is in the gender_identification.py file. If you run this code, it prints, for each number of letters, the classifier accuracy followed by the predicted gender for each input name on your Terminal.
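The sweep in the steps above can also be sketched without NLTK, which may help to see what the classifier is working with. The following is only an illustration under stated assumptions: it replaces NaiveBayesClassifier with a simple stand-in that picks the label seen most often with each suffix, and it uses a tiny made-up list of names rather than the NLTK names corpus. With a single feature, this is close in spirit to Naive Bayes, but NLTK applies smoothing, so its predictions can differ. All function names here are hypothetical.

```python
from collections import Counter, defaultdict


def suffix_feature(word, num_letters=2):
    """Return the last `num_letters` characters of the word, lowercased."""
    return word[-num_letters:].lower()


def train_suffix_counts(labeled_names, num_letters):
    """Count how often each label occurs with each suffix."""
    counts = defaultdict(Counter)
    for name, label in labeled_names:
        counts[suffix_feature(name, num_letters)][label] += 1
    return counts


def classify(counts, name, num_letters, default='unknown'):
    """Pick the label seen most often with this name's suffix.

    This is a stand-in for the Naive Bayes classifier, not NLTK's method.
    Suffixes never seen in training fall back to `default`.
    """
    suffix = suffix_feature(name, num_letters)
    if suffix not in counts:
        return default
    return counts[suffix].most_common(1)[0][0]


# Tiny made-up sample for illustration only (not the NLTK names corpus)
toy_names = [
    ('Angela', 'female'), ('Layla', 'female'), ('Anna', 'female'),
    ('Tim', 'male'), ('Jim', 'male'), ('Tom', 'male'),
]

if __name__ == '__main__':
    # Sweep the number of ending letters, as in the recipe
    for n in range(1, 3):
        counts = train_suffix_counts(toy_names, n)
        print('\nNumber of letters: %d' % n)
        for name in ['Kim', 'Carla', 'Bob']:
            print(name + ' ==> ' + classify(counts, name, n))
```

The sketch also shows a trade-off of the sweep: longer suffixes are more specific, but they are more likely to be unseen in the training data, as happens with "Bob" here.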