Python Single Word / Proper Name Extraction

by Chet Hosmer

When examining ASCII text data during a forensic investigation, it is often useful to extract proper names and then rank them by number of occurrences.  Python has built-in capabilities that perform this extraction swiftly and easily.  To demonstrate, I created a single Python script that does just that.  You can run the script on your favorite platform (Windows, Linux or Mac) using Python 2.7.x.

So first, what is a proper name, and how does it differ from a common noun?  And why is this useful during cybercrime investigations?  Linguists draw the distinction by classifying the noun.  Common nouns such as mountain or river are valid nouns, but they are not very useful to the forensic examiner.  Single-word proper names such as Everest or Mississippi are more interesting.  Even more thought-provoking are proper names such as Robert, Jonathan, Kevin, Austin, Texas, Bourbon Street, the Pentagon, or the Chrysler Building.  In normal text these proper names are usually capitalized, which makes them quite easy to strip, identify, count and sort.

So how is this done?

The code excerpt found below is a function that I created to extract possible proper names from an input string. The function (ignoring comments) is just over 20 lines of code.

For a more in-depth look into searching, indexing and natural language examination using Python, check out Chapter 4 and Chapter 7 of my current book, Python Forensics.
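Before the function itself, it may help to see how the capitalization test it relies on behaves. The sketch below applies Python's built-in str.istitle() together with the 4 to 20 character length limits to a few sample words. The helper name is_candidate and the sample list are my own illustrations, not part of the script, and the snippet is written to run under Python 2.7 or 3.

```python
# Illustrative helper (hypothetical name): the same test the function applies
# to each word, i.e. title case plus a length between 4 and 20 characters.
def is_candidate(word, min_len=4, max_len=20):
    return word.istitle() and min_len <= len(word) <= max_len

samples = ['Everest', 'mountain', 'NASA', 'McDonald', 'The']
results = {word: is_candidate(word) for word in samples}

for word in samples:
    print('%-10s istitle=%-5s candidate=%s' % (word, word.istitle(), results[word]))
```

Only Everest passes. Acronyms such as NASA and mixed-case names such as McDonald are not title case, so this simple test misses them, and short capitalized words such as The are excluded by the length limit.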

####################
# Function
# Name: ExtractProperNames
# Purpose: Extract possible proper names from the passed string
# Input: string
# Return: Dictionary of possible Proper Names along with the number
#         of occurrences as a key, value pair
# Usage: theDictionary = ExtractProperNames('John is from Alaska')
####################

def ExtractProperNames(theString):

    # Prepare the string (strip formatting and special characters)
    # You can extend the set of allowed characters by adding to the string
    # Note this example assumes ASCII characters, not Unicode

    allowedCharacters = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

    finalString = ''

    # Notice that you can write Python like English if you choose your 
    #    words carefully

    # Process each character in theString passed to the function
    for eachCharacter in theString:
        # Check to see if the character is in the allowedCharacters string
        if eachCharacter in allowedCharacters:
            # Yes, then add the character to the finalString
            finalString = finalString + eachCharacter
        else:
            # otherwise replace the not allowed character 
            #    with a space
            finalString = finalString + ' '

    # Now that we only have allowed characters or spaces in finalString
    #     we can use the built in Python string.split() method
    # This one line will create a list of words contained in the finalString

    wordList = finalString.split()

    # Now, let's determine which words are possible proper names
    #     and create a list of them.

    # We start by declaring an empty list

    properNameList = []

    # For this example we will assume words are possible proper names
    #    if they are in title case and they meet certain length requirements
    # We will use a Min Length of 4 and a Max Length of 20  

    # To do this, we loop through each word in the word list
    #    and if the word is in title case and the word meets
    #    our minimum/maximum size limits we add the word to the properNameList
    # We utilize the Python built in string method string.istitle() 

    for eachWord in wordList:

        if eachWord.istitle() and 4 <= len(eachWord) <= 20:
            # if the word meets the specified conditions we add it
            #    to the properNameList
            properNameList.append(eachWord)
        else:
            # otherwise we loop to the next word
            continue

    # Note this list will likely contain duplicates. To deal with this,
    #    and to determine the number of times a proper name is used,
    #    we will create a Python Dictionary

    # The Dictionary will contain a key, value pair.
    # The key will be the proper name and value is the number of occurrences
    #     found in the text

    # Create an empty dictionary
    properNamesDictionary = {}

    # Next we loop through the properNameList
    for eachName in properNameList:

        # if the name is already in the dictionary
        # the name has been processed so continue

        if eachName in properNamesDictionary:
            continue
        else:
            # otherwise we count the number of occurrences. 
            # We do this by using the List method list.count()
            cnt = properNameList.count(eachName)
            # then add the new entry to the dictionary
            # key   = eachName
            # value = count
            properNamesDictionary[eachName] = cnt

    # the function returns the created properNamesDictionary

    return properNamesDictionary

# End ExtractProperNames()
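For comparison, the same filter-and-count steps can be written more compactly with the standard library's re and collections.Counter modules. This is a sketch under my own assumptions, not the version used in the program listing: the name extract_proper_names and the sample sentence are hypothetical, and the snippet is written to run under Python 2.7 or 3.

```python
import re
from collections import Counter

MIN_WORD_SIZE = 4      # Minimum length of a proper name
MAX_WORD_SIZE = 20     # Maximum length of a proper name

def extract_proper_names(text, min_len=MIN_WORD_SIZE, max_len=MAX_WORD_SIZE):
    # Splitting on runs of non-letters has the same effect as replacing
    # every disallowed character with a space and then calling split()
    words = re.split(r'[^A-Za-z]+', text)
    # Keep title-case words within the length limits
    candidates = [w for w in words
                  if w.istitle() and min_len <= len(w) <= max_len]
    # Counter is a dictionary subclass that tallies each distinct word
    return Counter(candidates)

counts = extract_proper_names(
    'John is from Alaska. John likes Alaska, and John knows Mary.')

# most_common() returns (name, count) pairs, highest count first
ranked = counts.most_common()
print(ranked)
```

Because Counter tallies each word in a single pass, it avoids calling list.count() once per distinct name, which may matter on very large texts.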

The complete program listing below uses this function to process a file and print out, in sorted order (highest occurrences first), each possible proper name found.  After the complete program listing, I apply the program to a well-known book.  The first 5 people who can determine the name of the book and its author by examining the extracted proper names will receive a free Python Forensic Stylus Pen.  Send your guess for the name of the text and author to cdh@python-forensics.org.

Full Program Listing

 

'''
Copyright (c) 2014 Chet Hosmer

Permission is hereby granted, free of charge, to any person obtaining a copy of this software
and associated documentation files (the "Software"), to deal in the Software without restriction, 
including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, 
and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, 
subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial 
portions of the Software.

usage: python ExtractProperNames.py -i [full pathname of text file]

'''
# import the Operating System module, which handles file system I/O operations and definitions
import os

# import the Standard Library module for handling program arguments
import argparse

# PSEUDO CONSTANTS
MIN_WORD_SIZE = 4      # Minimum length of a proper name
MAX_WORD_SIZE = 20     # Maximum length of a proper name

# Name: ParseCommandLine() Function
#
# Desc: Process and validate the command line arguments
#           using the Python Standard Library module argparse
#
# Input: none
#
# Actions:
#              Uses the standard library argparse to process the command line
#
# For this program we expect one required argument
# -i which defines the full path and file name of the input text file
#
# returns theArgs (the parsed arguments)

def ParseCommandLine():

    parser = argparse.ArgumentParser('Proper Names Extractor')
    parser.add_argument('-i', '--inputFile', type=ValidateFileRead, required=True, help="Input filename to extract from")

    theArgs = parser.parse_args()           

    return theArgs

# End ParseCommandLine()

#
# Name: ValidateFileRead Function
#
# Desc: Function that will validate that a file exists and is readable
#
# Input: A file name with full path
#  
# Actions: 
#              if valid will return path
#
#              if invalid it will raise an ArgumentTypeError within argparse
#              which will in turn be reported by argparse to the user
#

def ValidateFileRead(theFile):

    # Validate that the path exists
    if not os.path.exists(theFile):
        raise argparse.ArgumentTypeError('File does not exist')

    # Validate the path is readable
    if os.access(theFile, os.R_OK):
        return theFile
    else:
        raise argparse.ArgumentTypeError('File is not readable')

# End ValidateFileRead()

####################
# Function
# Name: ExtractProperNames
# Purpose: Extract possible proper names from the passed string
# Input: string
# Return: Dictionary of possible Proper Names along with the number
#         of occurrences as a key, value pair
# Usage: theDictionary = ExtractProperNames('John is from Alaska')
####################

def ExtractProperNames(theString):

    # Prepare the string (strip formatting and special characters)
    # You can extend the set of allowed characters by adding to the string
    # Note this example assumes ASCII characters, not Unicode

    allowedCharacters = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

    finalString = ''

    # Notice that you can write Python like English if you choose your 
    #    words carefully

    # Process each character in theString passed to the function
    for eachCharacter in theString:
        # Check to see if the character is in the allowedCharacters string
        if eachCharacter in allowedCharacters:
            # Yes, then add the character to the finalString
            finalString = finalString + eachCharacter
        else:
            # otherwise replace the not allowed character 
            #    with a space
            finalString = finalString + ' '

    # Now that we only have allowed characters or spaces in finalString
    #     we can use the built in Python string.split() method
    # This one line will create a list of words contained in the finalString

    wordList = finalString.split()

    # Now, let's determine which words are possible proper names
    #     and create a list of them.

    # We start by declaring an empty list

    properNameList = []

    # For this example we will assume words are possible proper names
    #    if they are in title case and they meet certain length requirements
    # We will use a Min Length of 4 and a Max Length of 20  

    # To do this, we loop through each word in the word list
    #    and if the word is in title case and the word meets
    #    our minimum/maximum size limits we add the word to the properNameList
    # We utilize the Python built in string method string.istitle() 

    for eachWord in wordList:

        if eachWord.istitle() and MIN_WORD_SIZE <= len(eachWord) <= MAX_WORD_SIZE:
            # if the word meets the specified conditions we add it
            #    to the properNameList
            properNameList.append(eachWord)
        else:
            # otherwise we loop to the next word
            continue

    # Note this list will likely contain duplicates. To deal with this,
    #    and to determine the number of times a proper name is used,
    #    we will create a Python Dictionary

    # The Dictionary will contain a key, value pair.
    # The key will be the proper name and value is the number of occurrences
    #     found in the text

    # Create an empty dictionary
    properNamesDictionary = {}

    # Next we loop through the properNameList
    for eachName in properNameList:

        # if the name is already in the dictionary
        # the name has been processed so continue

        if eachName in properNamesDictionary:
            continue
        else:
            # otherwise we count the number of occurrences. 
            # We do this by using the List method list.count()
            cnt = properNameList.count(eachName)
            # then add the new entry to the dictionary
            # key   = eachName
            # value = count
            properNamesDictionary[eachName] = cnt

    # the function returns the created properNamesDictionary

    return properNamesDictionary

# End Extract Proper Names Function

#######################
# Main program for Extract Proper Names
# 
# Input:
#       inputFile:   full path and filename of the input text file
######################

def main(inputFile):

    try:
        # Note this method assumes the file can be completely
        # read into memory (for very large files a buffered approach
        # will be necessary)

        # Open the file and read the contents; the with statement
        # closes the file even if the read fails
        with open(inputFile, 'rb') as fp:
            fileContents = fp.read()
    except IOError:
        print 'File Handling Exception Reported'
        # Exit with a non-zero status to signal the failure
        raise SystemExit(1)

    # Call the ExtractProperNames function which returns
    #     a Python dictionary of possible proper names along with
    #     the number of occurrences of that name

    properNamesDictionary = ExtractProperNames(fileContents)

    # Now let's print out the dictionary sorted by value
    #     the value is the number of occurrences of the proper name

    # This approach will print out the possible proper names with
    #     the highest number of occurrences first

    for eachName in sorted(properNamesDictionary, key=properNamesDictionary.get, reverse=True):
        print eachName, properNamesDictionary[eachName],
        print " : ",

# End Main Function

#=================================================================
# Entry Point
#=================================================================

# Processes the user supplied arguments
#     and if successful calls the main function
#     with the appropriate argument

if __name__ == "__main__":

    args = ParseCommandLine()

    # Call main passing the user defined argument
    #    in this case the input file name

    main(args.inputFile)
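The comment in main() notes that very large files call for a buffered approach. One possible way to do that, sketched here under my own assumptions, is to process the file one line at a time and keep a running tally, since a newline is never part of a word. The function name count_proper_names_in_file is hypothetical, io.open is used so that the snippet runs under Python 2.7 or 3, and non-ASCII bytes are replaced rather than raising an error.

```python
import io
import re
from collections import Counter

# Any run of characters other than ASCII letters separates two words
WORD_SPLITTER = re.compile(r'[^A-Za-z]+')

def count_proper_names_in_file(path, min_len=4, max_len=20):
    counts = Counter()
    # Iterating over the file object yields one line at a time, so only
    # the current line and the running tally are held in memory
    with io.open(path, 'r', encoding='ascii', errors='replace') as handle:
        for line in handle:
            for word in WORD_SPLITTER.split(line):
                if word.istitle() and min_len <= len(word) <= max_len:
                    counts[word] += 1
    return counts
```

The returned Counter can be ranked with its most_common() method, or with the same sorted() call that main() uses.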

Program Execution and Output

C:\TESTPROPER>python ExtractProperNames.py -i sample.txt

Ahab 501  :  Whale 285  :  Stubb 255  :  Queequeg 252  :  Captain 215  :  Starbuck 196  :  What 175  :  Pequod 172  :  There 149  :  Sperm 135  :  This 113  :  Flask 104  :  White 89  :  Dick 86  :  Moby 86  :  That 85  :  Nantucket 85  :  Jonah 84  :  Project 84  :  Gutenberg 84  :  They 79  :  Bildad 76  :  Peleg 74  :  Leviathan 64  :  With 59  :  Then 57  :  Indian 57  :  Well 56  :  Tashtego 54  :  When 52  :  Right 52  :  Look 51  :  Here 49  :  Though 49  :  English 47  :  Lord 44  :  Steelkilt 40  :  Thou 39  :  Some 39  :  Cape 38  :  King 37  :  Greenland 35  :  Such 34  :  American 34  :  Daggoo 34  :  Fish 33  :  Pacific 32  :  While 31  :  Besides 30  :  Where 29  :  Come 29  :  Whales 29  :  Parsee 28  :  Fedallah 27  :  Nevertheless 27  :  From 26  :  Upon 26  :  Thus 26  :  First 24  :  Dutch 24  :  Lakeman 24  :  Stand 24  :  Foundation 24  :  Good 23  :  Like 23  :  Nantucketer 22  :  Radney 22  :  These 21  :  South 21  :  Gabriel 20  :  England 20  :  Atlantic 19  :  Take 19  :  Ishmael 19  :  Because 19  :  Meantime 19  :  Horn 18  :  Christian 18  :  Perth 18  :  Death 18  :  French 18  :  Bedford 18  :  Yojo 17  :  Soon 17  :  After 17  :  Hussey 16  :  Dough 16  :  Fast 16  :  Line 16  :  Meanwhile 16  :  However 16  :  German 15  :  Give 15  :  Only 15  :  Elijah 15  :  Town 15  :  Roman 15  :  Whether 15  :  States 15  :  Queen 14  :  Loose 14  :  Coffin 14  :  Still 13  :  Bunger 13  :  True 13  :  Archive 13  :  Post 13  :  London 13  :  Avast 13  :  Cook 13  :  Nothing 13  :  Literary 13  :  Great 13  :  Will 12  :  Manxman 12  :  Down 12  :  United 12  :  Derick 12

… shortened for brevity

Crammer 1  :  Macy 1  :  Canaris 1  :  Floundered 1  :  Dusk 1  :  Reckon 1  :  Spell 1  :  Mahomet 1  :  Societies 1  :  Petrified 1  :  Cooks 1  :  Science 1  :  Mesopotamian 1  :  Refund 1  :  Sicilian 1  :  Gold 1  :  Philologically 1  :  Alive 1  :  Britain 1  :  Borean 1  :  Scorpio 1  :  Archipelagoes 1  :  Executive 1  :  Passed 1  :  Berkshire 1  :  Cattegat 1  :  Monstrous 1  :  Shiver 1  :  Dunkirk 1  :  Kick 1  :  Fata 1  :  Turning 1  :  Manhattoes 1  :  Whitehall 1  :  Hollanders 1  :  Mazeppa 1  :  Buckets 1  :  Hygiene 1  :  Five 1  :  Know 1  :  Improving 1  :  Jimmini 1  :  February 1  :  Vitus 1  :  Pillar 1  :  Morquan 1  :  Future 1  :  Lose 1  :  Baltimore 1  :  Dover 1  :  Ehrenbreitstein 1  :  Dericks 1  :  Icebergs 1  :  Sleep 1  :  Actium 1  :  Shooting 1  :
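The output mixes real names with ordinary capitalized words such as What, There and This, because any title-case word of the right length qualifies, including the first word of a sentence. One possible refinement, sketched below, is to drop words found in a list of common words before ranking. The list here is hypothetical and deliberately tiny, the function name drop_common_words is my own, and the counts are taken from the output above.

```python
# Hypothetical, deliberately short list; a practical list would be far longer
COMMON_WORDS = set(['What', 'There', 'This', 'That', 'They',
                    'With', 'Then', 'When', 'Here', 'Though'])

def drop_common_words(nameCounts, ignored=COMMON_WORDS):
    # Keep only the entries whose key is not in the ignored set
    return dict((name, count) for name, count in nameCounts.items()
                if name not in ignored)

sample = {'Ahab': 501, 'What': 175, 'Pequod': 172, 'There': 149}
filtered = drop_common_words(sample)

# Rank the surviving names, highest count first
ranked = sorted(filtered, key=filtered.get, reverse=True)
print(ranked)
```

An examiner would still need to review the result, since a fixed word list cannot tell a name such as Will from the ordinary word.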

_______________________________________________

Chet Hosmer is an author, educator and researcher.  Chet is a co-founder of WetStone Technologies, Inc., a Visiting Professor at Utica College in the Cybersecurity graduate program, and an Adjunct Professor at Champlain College where he teaches in the Digital Forensics Graduate program.  He resides with his two-legged and four-legged family near Myrtle Beach, South Carolina.  He is the author of the popular Syngress titles Python Forensics and Data Hiding, with more to come!
