Python Program to Find the Most Frequent Email Address in a File
This Python program reads a text file containing email log data and finds which email address appears the most times in lines starting with “From ”. It also counts how many times that email occurs. This type of program is commonly used in log file analysis, email data processing, and data mining.
Let us understand the program step by step.
1. Taking File Name as Input
name = input("Enter file:")
The program first asks the user to enter the name of the file. The input() function reads a string from the user and stores it in the variable name.
For example, the user might enter:
mbox-short.txt
This allows the program to work with different files instead of using a fixed file name.
2. Setting a Default File
if len(name) < 1 : name = "mbox-short.txt"
This line checks whether the user entered a file name or just pressed Enter.
len(name) finds the length of the string entered. If the length is less than 1, it means the user did not type anything.
In that case, the program automatically uses the default file "mbox-short.txt".
This makes the program easier to test because the user does not always need to type the file name.
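The same check can be tried out without typing anything by wrapping it in a small helper. This is only a minimal sketch: the function name choose_file_name, its default parameter, and the file name "mbox.txt" are illustrative assumptions and are not part of the original program.

```python
def choose_file_name(entered, default="mbox-short.txt"):
    """Return the entered name, or the default when nothing was typed."""
    # An empty string has length 0, so this mirrors the len(name) < 1 check.
    if len(entered) < 1:
        return default
    return entered


print(choose_file_name(""))          # the user just pressed Enter
print(choose_file_name("mbox.txt"))  # the user typed a (hypothetical) name
```

In the real program, the value passed in would come from input("Enter file:").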
3. Opening the File
fh = open(name)
The open() function opens the file specified by the variable name.
The file handle is stored in the variable fh. A file handle allows Python to read the contents of the file line by line.
If the file does not exist, Python raises a FileNotFoundError and the program stops.
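If a short message is preferred over a traceback, the open() call can be wrapped in try/except. The sketch below shows one possible way to do this. It is not part of the original program, and the helper name open_or_report and the file name used in the demo are hypothetical.

```python
def open_or_report(name):
    """Try to open the file; return the handle, or None if it is missing."""
    try:
        return open(name)
    except FileNotFoundError:
        # open() raises FileNotFoundError when the path does not exist.
        print("File cannot be opened:", name)
        return None


# A name that is assumed not to exist, used only to trigger the message.
fh = open_or_report("no-such-file-for-this-demo.txt")
print(fh)
```

The caller can then check whether the result is None before starting to read.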
4. Creating Data Structures
from_lines = []
emails = {}
Two variables are created here:
from_lines = []
This is an empty list. In this program, it is actually not used later, so it could be removed without affecting the program.
emails = {}
This is an empty dictionary.
A dictionary stores data in key-value pairs. In this case:
Key → email address
Value → number of times the email appears
Example dictionary after processing:
{
    'stephen.marquard@uct.ac.za': 2,
    'louis@media.berkeley.edu': 3,
    'zqian@umich.edu': 1
}
5. Reading the File Line by Line
for line in fh:
This loop reads the file one line at a time.
Each iteration stores a single line from the file in the variable line.
This is efficient because Python does not load the entire file into memory.
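To see what each iteration receives, the loop can be run on an in-memory stand-in for a file. This is a small sketch, assuming io.StringIO as a substitute for a real file handle; the three sample lines are made up.

```python
import io

# io.StringIO behaves like an open text file, so the example runs anywhere.
fh = io.StringIO("first line\nsecond line\nthird line\n")

seen = []
for line in fh:
    seen.append(line)
    print(repr(line))  # repr() makes the trailing \n visible
```

Each line still carries its newline character at the end, which is why the next step strips it.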
6. Removing Extra Spaces
line = line.rstrip()
The rstrip() method removes trailing whitespace (such as the newline character \n) from the end of the line.
This helps ensure the text is processed correctly.
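The effect is easiest to see by printing the line before and after the call. A minimal sketch, using the sample line from this article with a newline added at the end:

```python
line = "From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008\n"

stripped = line.rstrip()

print(repr(line))      # the trailing \n is still there
print(repr(stripped))  # the trailing \n has been removed
```

Only the right-hand end of the string is affected; any whitespace at the start of the line stays as it is.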
7. Checking for Lines Starting with “From ”
if line.find('From ') == 0:
The find() method searches for a substring inside the line and returns the index of its first occurrence, or -1 if the substring is not found.
If 'From ' appears at position 0, it means the line starts with “From ”.
Example matching line:
From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008
Only these lines are processed further.
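Python strings also have a startswith() method that expresses the same test more directly. The sketch below compares the two on a few sample lines; the second and third lines are made up for illustration, and using startswith() is an alternative, not what the original program does.

```python
lines = [
    "From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008",
    "From: stephen.marquard@uct.ac.za",  # 'From:' is not 'From ' (no space)
    "Subject: hello From me",            # 'From ' appears, but not at index 0
]

results = []
for line in lines:
    with_find = line.find('From ') == 0
    with_startswith = line.startswith('From ')
    results.append((with_find, with_startswith))
    print(with_find, with_startswith, line)
```

Both checks agree on every line, and only the first line is accepted.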
8. Splitting the Line
line = line.split(' ')
The split() function breaks the line into individual words using spaces as separators.
Example:
From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008
becomes
['From', 'stephen.marquard@uct.ac.za', 'Sat', 'Jan', '5', '09:14:16', '2008']
Each element can be accessed using an index.
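One detail worth knowing: split(' ') cuts at every single space, so two spaces in a row produce an empty string in the list. Calling split() with no argument treats any run of whitespace as one separator. The sketch below assumes an input line with a double space before the day number, which is an assumption for illustration and not something shown in the example above.

```python
# Note the two spaces between 'Jan' and '5'.
line = "From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008"

by_single_space = line.split(' ')  # what the program uses
by_whitespace = line.split()       # alternative: any run of whitespace

print(by_single_space)
print(by_whitespace)
```

In both cases the email address is still at index 1, so the program's result is unaffected; only the positions after the double space shift.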
9. Extracting the Email Address
email = line[1]
The second element of the list (index 1) contains the email address.
Example:
stephen.marquard@uct.ac.za
This email address is stored in the variable email.
10. Counting Email Occurrences
if email not in emails:
    emails[email] = 1
else:
    emails[email] += 1
This part updates the dictionary.
Case 1: Email not in dictionary
If the email appears for the first time, it is added to the dictionary with a value of 1.
Example:
emails['louis@media.berkeley.edu'] = 1
Case 2: Email already exists
If the email is already in the dictionary, the count is increased by 1.
Example:
emails['louis@media.berkeley.edu'] += 1
This keeps track of how many times each email appears.
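The two cases can also be folded into one line with the dictionary's get() method, which returns a default value when the key is missing. This is a sketch of that alternative idiom, not the code the program uses; the list of senders is made up.

```python
emails = {}
senders = [
    "louis@media.berkeley.edu",
    "zqian@umich.edu",
    "louis@media.berkeley.edu",
]

for email in senders:
    # get() returns 0 the first time an address is seen, then the current count.
    emails[email] = emails.get(email, 0) + 1

print(emails)
```

The resulting dictionary is the same as the one the if/else version builds.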
11. Finding the Email with Maximum Count
email = ''
count = 0
Two variables are created:
email → to store the most frequent email
count → to store the highest count
12. Checking Each Email
for key in emails:
This loop goes through every email in the dictionary.
13. Comparing Counts
if emails[key] > count:
    count = emails[key]
    email = key
If the count of the current email is greater than the stored maximum:
Update count
Store the email in the variable email
By the end of the loop, we will have the email with the highest frequency.
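The same search can be written with the built-in max() function by telling it to compare the keys by their counts. This is a sketch of an alternative, not the approach the program takes; it reuses the example dictionary from the "Creating Data Structures" step.

```python
emails = {
    'stephen.marquard@uct.ac.za': 2,
    'louis@media.berkeley.edu': 3,
    'zqian@umich.edu': 1,
}

# key=emails.get makes max() compare the counts instead of the address strings.
email = max(emails, key=emails.get)
count = emails[email]

print(email, count)
```

One difference to keep in mind: max() raises a ValueError when the dictionary is empty, while the loop version simply leaves email as '' and count as 0.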
14. Printing the Result
print(email, str(count))
Finally, the program prints:
The email address that appeared the most
The number of times it appeared
Example output:
cwen@iupui.edu 5
This means the email cwen@iupui.edu appeared 5 times in lines starting with “From ”.
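Putting the steps together, the whole program can be sketched as one function that accepts any sequence of lines. The function name most_frequent_sender is hypothetical, and the sample lines (including their dates and times) are made up so that the example runs without a file.

```python
def most_frequent_sender(lines):
    """Return (email, count) for the most common sender in 'From ' lines."""
    emails = {}
    for line in lines:
        line = line.rstrip()
        if line.find('From ') == 0:
            words = line.split(' ')
            email = words[1]
            if email not in emails:
                emails[email] = 1
            else:
                emails[email] += 1

    # Scan the dictionary for the highest count.
    email = ''
    count = 0
    for key in emails:
        if emails[key] > count:
            count = emails[key]
            email = key
    return email, count


# Made-up sample lines standing in for the contents of a real file.
sample = [
    "From cwen@iupui.edu Sat Jan 5 09:14:16 2008\n",
    "From: cwen@iupui.edu\n",
    "From zqian@umich.edu Sat Jan 5 09:20:00 2008\n",
    "From cwen@iupui.edu Sat Jan 5 10:00:00 2008\n",
]

email, count = most_frequent_sender(sample)
print(email, count)
```

Because a file handle yields one line per iteration, an opened file could be passed to the function in place of the sample list.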
Conclusion
This program demonstrates several important Python concepts:
File handling
String processing
Lists
Dictionaries
Loops
Conditional statements
It reads an email log file, extracts sender addresses, counts their occurrences, and identifies the most frequent sender. Such programs are very useful in data analysis, log monitoring, and email processing systems.