Mar 15, 2013

How to remove duplicate lines from a file in Python

Today, we will discuss removing duplicate lines from a file in Python.

Let's look at two approaches:
1) Remove duplicate lines and write the remaining lines in their original order (when order is important)
        - Using the normal way (a list)
2) Remove duplicate lines and write the remaining lines in any order (when order is NOT important)
        - Using a set. A Python set does not preserve the order of its elements.
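
To make the difference between the two approaches concrete, here is a minimal sketch that works on an in-memory list instead of a file. The helper names dedupe_ordered and dedupe_unordered are made up for this illustration and are not part of the scripts in this post.

```python
def dedupe_ordered(lines):
    """Return the unique lines in the order they first appear."""
    seen = set()      # fast membership test
    result = []       # keeps first-seen order
    for line in lines:
        if line not in seen:
            seen.add(line)
            result.append(line)
    return result


def dedupe_unordered(lines):
    """Return the unique lines as a set; iteration order is not guaranteed."""
    return set(lines)


names = ["Mother Teresa", "Winston Churchill", "Abraham Lincoln",
         "Mahatma Gandhi", "Winston Churchill", "Mother Teresa"]

print(dedupe_ordered(names))    # always in first-seen order
print(dedupe_unordered(names))  # same four names, order may vary
```

Both calls return the same four names; only the first one promises a particular order.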

Note:
Both scripts read the input, including its duplicate lines, from file_with_duplicates.txt,
remove the duplicate lines, and finally
write the result to file_without_duplicates.txt.

Input: file_with_duplicates.txt

Mother Teresa
Winston Churchill
Abraham Lincoln
Mahatma Gandhi
Winston Churchill
Mother Teresa
Abraham Lincoln  

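1) Short version - a set remembers the lines already seen, so the original order is kept
lines_seen = set()  # lines already written to the output file
with open('file_with_duplicates.txt', 'r') as infile, \
     open('file_without_duplicates.txt', 'w') as outfile:
    for line in infile:
        if line not in lines_seen:
            outfile.write(line)
            lines_seen.add(line)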


1) remove_duplicate_lines_from_file_with_order.py
- Using the normal way - it takes care of the order of the lines

#!/usr/bin/env python3

import sys

try:
    input_file = open("file_with_duplicates.txt", "r")
    output_file = open("file_without_duplicates.txt", "w")

    # Collect each line the first time it appears, preserving input order
    unique = []
    for line in input_file:
        line = line.strip()
        if line not in unique:
            unique.append(line)
    input_file.close()

    # Put a newline between lines (none after the last one)
    output_file.write("\n".join(unique))
    output_file.close()

except FileNotFoundError:
    print('\n File NOT Found Error')
    sys.exit(1)
except IOError:
    print('\n IO Error')
    sys.exit(1)


Output: file_without_duplicates.txt

Mother Teresa
Winston Churchill
Abraham Lincoln
Mahatma Gandhi  
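
As an alternative, here is a minimal sketch that relies on the fact that dictionaries keep insertion order on Python 3.7 and later, so dict.fromkeys can drop duplicates while keeping the first occurrence of each line. The function name remove_duplicates_keep_order is made up for this example, and skipping blank lines is my own assumption rather than something the script above does.

```python
def remove_duplicates_keep_order(in_path, out_path):
    """Write the unique lines of in_path to out_path in first-seen order."""
    with open(in_path, "r") as infile:
        stripped = (line.strip() for line in infile)
        # dict keys are unique and keep insertion order (Python 3.7+)
        unique = list(dict.fromkeys(line for line in stripped if line))
    with open(out_path, "w") as outfile:
        outfile.writelines(line + "\n" for line in unique)
    return unique


# Example call, assuming the input file from this post exists:
# remove_duplicates_keep_order("file_with_duplicates.txt",
#                              "file_without_duplicates.txt")
```

Unlike the list-based script, membership checks here do not slow down as the number of unique lines grows, which may matter for large files.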


2) remove_duplicate_lines_from_file_without_order.py
- Using a set - it does not preserve the order of the lines

#!/usr/bin/env python3

import sys

try:
    input_file = open("file_with_duplicates.txt", "r")
    output_file = open("file_without_duplicates.txt", "w")

    # The main drawback of using a set is that the order of the lines may not be the same as in the input file.
    # splitlines() avoids the empty string that split("\n") produces after a trailing newline.
    unique_lines = set(input_file.read().splitlines())
    output_file.write("".join([line + "\n" for line in unique_lines]))

    input_file.close()
    output_file.close()

except FileNotFoundError:
    print('\n File NOT Found Error')
    sys.exit(1)
except IOError:
    print('\n IO Error')
    sys.exit(1)

Output: file_without_duplicates.txt (the order of the lines may vary between runs)

Abraham Lincoln
Winston Churchill
Mother Teresa
Mahatma Gandhi  
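
If the original order does not matter but you still want the same output on every run, one option is to sort the set before writing it. This is only a sketch: the function name remove_duplicates_sorted is made up for this example, and stripping whitespace and skipping blank lines are my own assumptions.

```python
def remove_duplicates_sorted(in_path, out_path):
    """Write the unique lines of in_path to out_path in sorted order."""
    with open(in_path, "r") as infile:
        # A set comprehension drops duplicates; blank lines are skipped
        unique = {line.strip() for line in infile if line.strip()}
    ordered = sorted(unique)  # alphabetical order, identical on every run
    with open(out_path, "w") as outfile:
        outfile.writelines(line + "\n" for line in ordered)
    return ordered


# Example call, assuming the input file from this post exists:
# remove_duplicates_sorted("file_with_duplicates.txt",
#                          "file_without_duplicates.txt")
```

For the sample input this would write Abraham Lincoln, Mahatma Gandhi, Mother Teresa and Winston Churchill, in that order.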
