9.4 Write a program to read through the mbox-short.txt and figure out who has sent the greatest number of mail messages. The program looks for 'From ' lines and takes the second word of those lines as the person who sent the mail. The program creates a Python dictionary that maps the sender's mail address to a count of the number of times they appear in the file. After the dictionary is produced, the program reads through the dictionary using a maximum loop to find the most prolific committer.

Python Program to Find the Most Frequent Email Address in a File

This Python program reads a text file containing email log data and finds which email address appears the most times in lines starting with “From ”. It also counts how many times that email occurs. This type of program is commonly used in log file analysis, email data processing, and data mining.

Let us understand the program step by step.


1. Taking File Name as Input

name = input("Enter file:")

The program first asks the user to enter the name of the file. The input() function reads a string from the user and stores it in the variable name.

For example, the user might enter:

mbox-short.txt

This allows the program to work with different files instead of using a fixed file name.


2. Setting a Default File

if len(name) < 1 : name = "mbox-short.txt"

This line checks whether the user entered a file name or just pressed Enter.

  • len(name) finds the length of the string entered.

  • If the length is less than 1, it means the user did not type anything.

In that case, the program automatically uses the default file "mbox-short.txt".

This makes the program easier to test because the user does not always need to type the file name.
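
The same default can also be written with the or operator, because an empty string counts as false in Python. This is only a sketch: the helper name resolve_filename is hypothetical and is not part of the original program.

```python
def resolve_filename(raw, default="mbox-short.txt"):
    """Return the typed file name, or the default when nothing was typed."""
    # An empty string is falsy, so `or` falls through to the default.
    return raw or default


print(resolve_filename(""))          # the user just pressed Enter
print(resolve_filename("mbox.txt"))  # the user typed a name
```

In the real program the argument would be the value returned by input("Enter file:").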


3. Opening the File

fh = open(name)

The open() function opens the file specified by the variable name.

The file handle is stored in the variable fh. A file handle allows Python to read the contents of the file line by line.

If the file does not exist, Python raises a FileNotFoundError and the program stops.
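
One possible way to handle a missing file is to catch the error and print a message. The sketch below assumes a hypothetical helper called open_or_none, which the original program does not contain, and a file name chosen so that it will not exist.

```python
def open_or_none(path):
    """Open path for reading; return None if the file does not exist."""
    try:
        return open(path)
    except FileNotFoundError:
        print("File cannot be opened:", path)
        return None


# This file name is made up for the example and should not exist.
fh = open_or_none("no-such-file-for-this-example.txt")
print(fh)
```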


4. Creating Data Structures

from_lines = []
emails = {}

Two variables are created here:

from_lines = []

This is an empty list. In this program, it is actually not used later, so it could be removed without affecting the program.

emails = {}

This is an empty dictionary.

A dictionary stores data in key-value pairs. In this case:

  • Key → email address

  • Value → number of times the email appears

Example dictionary after processing:

{
'stephen.marquard@uct.ac.za': 2,
'louis@media.berkeley.edu': 3,
'zqian@umich.edu': 1
}

5. Reading the File Line by Line

for line in fh:

This loop reads the file one line at a time.

Each iteration stores a single line from the file in the variable line.

This is efficient because Python does not load the entire file into memory.


6. Removing Trailing Whitespace

line = line.rstrip()

The rstrip() method removes whitespace characters (such as the newline character \n) from the end of the line. Whitespace at the start of the line is left alone.

Without this step, the last word of every line would still carry its newline character.
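
A small illustration of what rstrip() does to one line. The trailing \n is added by hand here to imitate a line read from a file.

```python
line = "From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008\n"
stripped = line.rstrip()

print(repr(line[-5:]))      # the raw line still ends with a newline
print(repr(stripped[-4:]))  # after rstrip() the newline is gone

# Only the right-hand side is trimmed; leading spaces are kept.
print(repr("  indented \n".rstrip()))
```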


7. Checking for Lines Starting with “From ”

if line.find('From ') == 0:

The find() method returns the index of the first occurrence of a substring in the line, or -1 if the substring is not found.

  • If 'From ' appears at position 0, it means the line starts with “From ”.

Example matching line:

From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008

Only these lines are processed further.
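
The same check is often written with the startswith() method, which reads more directly. The sketch below compares both tests on three sample lines; the second and third lines are made up to show what gets rejected.

```python
lines = [
    "From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008",
    "From: stephen.marquard@uct.ac.za",
    "Subject: message From someone",
]

results = []
for line in lines:
    # Both tests should agree: only a line beginning with 'From '
    # (including the space) is accepted.
    results.append((line.find("From ") == 0, line.startswith("From ")))

print(results)
```

Note that "From:" (with a colon) does not match, because the search string ends with a space.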


8. Splitting the Line

line = line.split(' ')

The split(' ') call breaks the line into a list of words, using a single space as the separator.

Example:

From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008

becomes

['From', 'stephen.marquard@uct.ac.za', 'Sat', 'Jan', '5', '09:14:16', '2008']

Each element can be accessed using an index.
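
One detail worth knowing: split(' ') treats every single space as a separator, so two spaces in a row produce an empty string in the list. Calling split() with no argument splits on any run of whitespace instead. The sample line below is written with two spaces before the day number as an assumption, to show the difference.

```python
# Two spaces between 'Jan' and '5' (assumed for this example).
line = "From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008"

by_space = line.split(" ")    # splits at every single space
by_whitespace = line.split()  # splits on runs of whitespace

print(by_space)
print(by_whitespace)

# The email address is at index 1 either way.
print(by_space[1] == by_whitespace[1])
```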


9. Extracting the Email Address

email = line[1]

The second element of the list (index 1) contains the email address.

Example:

stephen.marquard@uct.ac.za

This email address is stored in the variable email.


10. Counting Email Occurrences

if email not in emails:
    emails[email] = 1
else:
    emails[email] += 1

This part updates the dictionary.

Case 1: Email not in dictionary

If the email appears for the first time, it is added to the dictionary with a value of 1.

Example:

emails['louis@media.berkeley.edu'] = 1

Case 2: Email already exists

If the email is already in the dictionary, the count is increased by 1.

Example:

emails['louis@media.berkeley.edu'] += 1

This keeps track of how many times each email appears.
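
The two cases can also be folded into one line with the dictionary get() method, which returns a default value when the key is missing. This is an alternative to the if/else above, not what the original program does; the list senders is sample data made up for the example.

```python
senders = [
    "louis@media.berkeley.edu",
    "zqian@umich.edu",
    "louis@media.berkeley.edu",
]

emails = {}
for email in senders:
    # get() returns 0 for an address not seen before,
    # so no if/else is needed.
    emails[email] = emails.get(email, 0) + 1

print(emails)
```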


11. Finding the Email with Maximum Count

email = ''
count = 0

Two variables are created:

  • email → to store the most frequent email

  • count → to store the highest count


12. Checking Each Email

for key in emails:

This loop goes through every email in the dictionary.


13. Comparing Counts

if emails[key] > count:
    count = emails[key]
    email = key

If the count of the current email is greater than the stored maximum:

  • Update count

  • Store the email in the variable email

By the end of the loop, we will have the email with the highest frequency.
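
Steps 11 to 13 can be tried on their own with a small dictionary. The sketch below reuses the example dictionary from step 4 and also shows the built-in max() function as an alternative; the variable name top is hypothetical.

```python
emails = {
    "stephen.marquard@uct.ac.za": 2,
    "louis@media.berkeley.edu": 3,
    "zqian@umich.edu": 1,
}

# Maximum loop, as in the walkthrough.
email = ""
count = 0
for key in emails:
    if emails[key] > count:
        count = emails[key]
        email = key

# Built-in alternative: the key whose value is the largest.
top = max(emails, key=emails.get)

print(email, count)
print(top, emails[top])
```

If two addresses share the highest count, both approaches keep the one that comes first in the dictionary.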


14. Printing the Result

print(email, str(count))

Finally, the program prints:

  • The email address that appeared the most

  • The number of times it appeared

Example output:

cwen@iupui.edu 5

This means the email cwen@iupui.edu appeared 5 times in lines starting with “From ”.
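
To see how the steps fit together, here is a sketch that puts the loop and the maximum search into one function. It is not the original program: the function name most_frequent_sender is hypothetical, it uses startswith(), split() and get() in place of find(), split(' ') and the if/else, and the sample lines are made up rather than taken from mbox-short.txt.

```python
def most_frequent_sender(lines):
    """Return (address, count) for the most common sender in 'From ' lines."""
    emails = {}
    for line in lines:
        line = line.rstrip()
        if not line.startswith("From "):
            continue  # skip 'From:' headers and everything else
        words = line.split()
        email = words[1]  # second word is the sender's address
        emails[email] = emails.get(email, 0) + 1

    # Maximum loop over the finished dictionary.
    best_email = ""
    best_count = 0
    for key in emails:
        if emails[key] > best_count:
            best_count = emails[key]
            best_email = key
    return best_email, best_count


# Made-up sample data standing in for a file.
sample = [
    "From a@example.com Sat Jan 5 09:14:16 2008\n",
    "From: a@example.com\n",
    "From b@example.com Sat Jan 5 09:20:00 2008\n",
    "From a@example.com Sat Jan 5 10:00:00 2008\n",
]

print(*most_frequent_sender(sample))
```

Because a file handle is also a sequence of lines, the same function should work when called with the handle returned by open(name).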


Conclusion

This program demonstrates several important Python concepts:

  • File handling

  • String processing

  • Lists

  • Dictionaries

  • Loops

  • Conditional statements

It reads an email log file, extracts sender addresses, counts their occurrences, and identifies the most frequent sender. Such programs are very useful in data analysis, log monitoring, and email processing systems.

