How to remove duplicates from a list in Python

Removing duplicate elements from Python lists is a common data cleaning task that developers encounter regularly. Python provides multiple built-in methods and techniques to efficiently handle duplicate values while maintaining list order and data integrity.

This guide covers proven techniques for duplicate removal, with practical examples and performance tips. All code examples were created with Claude, an AI assistant built by Anthropic.

Using set() to remove duplicates

numbers = [1, 2, 3, 2, 1, 4, 5, 4]
unique_numbers = list(set(numbers))
print(unique_numbers)
[1, 2, 3, 4, 5]

The set() function provides the fastest way to remove duplicates from a Python list. Sets store only unique values by design, automatically discarding duplicates during conversion. Converting the list to a set and back to a list creates a new sequence containing just the unique elements.

This approach offers key advantages for data cleaning:

  • Maintains O(n) time complexity even with large lists
  • Handles any hashable data type including numbers and strings
  • Requires minimal code compared to manual filtering methods

One important consideration: set() does not preserve the original order of elements. If maintaining sequence order matters for your use case, you'll need to explore alternative methods.

Basic techniques for removing duplicates

While set() excels at speed, Python offers several order-preserving methods to remove duplicates—from basic for loops to elegant dict.fromkeys() solutions.

Using a for loop to preserve order

numbers = [1, 2, 3, 2, 1, 4, 5, 4]
unique_numbers = []
for num in numbers:
    if num not in unique_numbers:
        unique_numbers.append(num)
print(unique_numbers)
[1, 2, 3, 4, 5]

This straightforward approach uses a for loop to iterate through the original list while building a new list of unique elements. The not in operator checks if each number already exists in unique_numbers before adding it.

  • Preserves the original order of elements, unlike the set() method
  • Works with both hashable and unhashable data types
  • Simple to understand and modify for custom filtering logic

While this method requires more code than using set(), it offers better control over the deduplication process. The trade-off is performance: the not in check must scan the entire unique_numbers list for each element, so the overall approach slows to O(n²) time on large lists.
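
If you want to see the gap yourself, the standard library's timeit module can time both approaches. This is a rough sketch; the list size and repeat count are arbitrary:

import timeit

numbers = list(range(1000)) * 2  # 2,000 elements, 1,000 unique values

def dedupe_with_loop(items):
    unique = []
    for item in items:
        if item not in unique:  # scans the entire unique list on every iteration
            unique.append(item)
    return unique

print(timeit.timeit(lambda: dedupe_with_loop(numbers), number=10))
print(timeit.timeit(lambda: list(set(numbers)), number=10))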

Using list comprehension with a tracking set

numbers = [1, 2, 3, 2, 1, 4, 5, 4]
seen = set()
unique_numbers = [x for x in numbers if not (x in seen or seen.add(x))]
print(unique_numbers)
[1, 2, 3, 4, 5]

This elegant solution combines list comprehension with a tracking set to maintain element order while achieving better performance than a basic loop. The seen set efficiently tracks encountered elements, while the list comprehension creates the final unique list.

The clever part lies in the condition not (x in seen or seen.add(x)). It leverages Python's short-circuit evaluation and the fact that add() returns None. Here's how it works:

  • When an element is first encountered, it's not in seen. The add() method adds it and returns None
  • For duplicates, the in check returns True immediately, skipping the element
  • This approach preserves order while maintaining good performance for larger lists

The result combines the speed benefits of sets with the order preservation of loops, making it an excellent choice for most deduplication tasks.

Using the dict.fromkeys() method

numbers = [1, 2, 3, 2, 1, 4, 5, 4]
unique_numbers = list(dict.fromkeys(numbers))
print(unique_numbers)
[1, 2, 3, 4, 5]

The dict.fromkeys() method creates a dictionary using list elements as keys. Since dictionary keys must be unique, this automatically removes duplicates. Converting back to a list with list() produces the final deduplicated sequence.

  • Preserves the original order of elements in Python 3.7+ due to dictionary insertion order guarantees
  • Offers a clean one-line solution that's more readable than complex loops
  • Performs efficiently by leveraging dictionary's O(1) lookup time

This approach strikes an excellent balance between code simplicity and performance. It works particularly well for basic data types like numbers and strings that can serve as dictionary keys.

Advanced techniques for removing duplicates

Beyond Python's built-in methods, the standard library's collections module and third-party libraries like pandas and NumPy offer powerful tools for handling duplicate values in complex data structures.

Using OrderedDict from collections

from collections import OrderedDict
numbers = [1, 2, 3, 2, 1, 4, 5, 4]
unique_numbers = list(OrderedDict.fromkeys(numbers))
print(unique_numbers)
[1, 2, 3, 4, 5]

The OrderedDict approach combines the benefits of dictionaries with guaranteed order preservation. Similar to dict.fromkeys(), it creates a dictionary using list elements as keys while maintaining their original sequence.

  • The fromkeys() method automatically discards duplicates since dictionary keys must be unique
  • Converting back to a list with list() produces the final deduplicated sequence
  • This method works reliably across all modern Python versions, unlike regular dictionaries, which only guarantee insertion order from Python 3.7 onward

While OrderedDict requires importing from the collections module, it provides a dependable solution when maintaining element order is crucial. The slight performance overhead compared to regular dictionaries rarely impacts real-world applications.

Using pandas drop_duplicates() method

import pandas as pd
data = [(1, 'a'), (2, 'b'), (1, 'a'), (3, 'c')]
df = pd.DataFrame(data, columns=['num', 'letter'])
unique_rows = df.drop_duplicates().values.tolist()
print(unique_rows)
[[1, 'a'], [2, 'b'], [3, 'c']]

Pandas offers a powerful solution for removing duplicates from complex data structures. The drop_duplicates() method efficiently handles duplicate rows in a DataFrame, considering all columns by default when determining uniqueness.

  • The example creates a DataFrame with paired number-letter data, where one pair (1, 'a') appears twice
  • Converting the data to a DataFrame enables pandas' optimized duplicate detection
  • The values.tolist() chain converts the deduplicated DataFrame back into a familiar Python list format

This approach particularly shines when working with tabular data or when you need to remove duplicates based on multiple columns. Pandas handles the heavy lifting of comparing complex data structures while maintaining excellent performance.
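
drop_duplicates() also accepts subset and keep parameters for cases where uniqueness should depend on only some columns. This sketch uses made-up column names:

import pandas as pd

data = [(1, 'a', 'x'), (1, 'b', 'y'), (2, 'c', 'z')]
df = pd.DataFrame(data, columns=['num', 'letter', 'extra'])

# Rows count as duplicates when the 'num' column matches; keep the last occurrence
unique_rows = df.drop_duplicates(subset=['num'], keep='last').values.tolist()
print(unique_rows)  # [[1, 'b', 'y'], [2, 'c', 'z']]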

Using NumPy's unique() function with index tracking

import numpy as np
numbers = [4, 1, 3, 2, 1, 4, 5, 3]
unique_indices = np.unique(numbers, return_index=True)[1]
unique_in_order = [numbers[i] for i in sorted(unique_indices)]
print(unique_in_order)
[4, 1, 3, 2, 5]

NumPy's unique() function with return_index=True returns both unique values and their first occurrence positions in the original array. The code leverages these indices to maintain the original order while removing duplicates.

  • The unique_indices variable captures the positions where each unique number first appears in the list
  • Sorting these indices with sorted() ensures elements appear in their original sequence
  • The list comprehension [numbers[i] for i in sorted(unique_indices)] rebuilds the list using only the first occurrences

This approach combines NumPy's efficient array operations with Python's built-in sorting capabilities. It works particularly well for numerical data where maintaining the original order matters.

Finding unique words in a text

Text processing often requires extracting unique words while preserving their original sequence, and Python's list comprehension with a tracking set delivers an elegant solution for this common task.

text = "The quick brown fox jumps over the lazy dog. The dog was not amused."
words = text.lower().replace('.', '').split()
seen = set()
unique_words = [word for word in words if not (word in seen or seen.add(word))]
print(unique_words)
['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'lazy', 'dog', 'was', 'not', 'amused']

This code efficiently extracts unique words from a text string while preserving their original order. The process starts by converting the text to lowercase with lower() and removing periods with replace(). The split() function then creates a list of individual words.

  • The seen set tracks encountered words
  • The list comprehension creates a new list containing only first occurrences
  • The condition not (word in seen or seen.add(word)) cleverly combines checking and tracking in one step

This technique proves especially useful when processing natural language text where maintaining the original word sequence matters. The solution balances readability with efficient memory usage.
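
The single replace() call above only strips periods. If the text contains other punctuation, a regular expression can handle tokenization instead; this sketch assumes you only care about runs of letters:

import re

text = "The quick brown fox! The dog, however, was not amused..."
words = re.findall(r"[a-z]+", text.lower())  # keep only runs of letters, dropping punctuation

seen = set()
unique_words = [word for word in words if not (word in seen or seen.add(word))]
print(unique_words)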

Removing duplicate users while keeping most recent data

When managing user data, a common challenge is retaining only the most recent record for each unique user ID while discarding outdated entries. This example demonstrates an elegant dictionary-based solution for deduplicating and updating user profiles.

user_records = [
    {"id": 101, "name": "Alice", "timestamp": "2023-01-15"},
    {"id": 102, "name": "Bob", "timestamp": "2023-01-16"},
    {"id": 101, "name": "Alice Smith", "timestamp": "2023-02-20"},
    {"id": 102, "name": "Robert", "timestamp": "2023-02-25"}
]

latest_records = {}
for record in user_records:
    user_id = record["id"]
    if user_id not in latest_records or record["timestamp"] > latest_records[user_id]["timestamp"]:
        latest_records[user_id] = record

unique_users = list(latest_records.values())
print([f"{user['id']}: {user['name']}" for user in unique_users])

This code efficiently handles user profile updates by maintaining only the most recent record for each unique user ID. The latest_records dictionary stores user records with their IDs as keys, automatically overwriting older entries when newer timestamps appear.

The core logic lies in the if condition. It checks two scenarios: either the user ID doesn't exist yet in latest_records, or the current record has a more recent timestamp than the stored one. When either condition is true, the code updates the dictionary with the current record.

  • Uses dictionary's O(1) lookup time for efficient duplicate checking
  • Preserves the most recent data by comparing ISO-format timestamp strings, which sort correctly as plain text (see the sketch below for other formats)
  • Outputs a clean list of unique users with their latest information
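
String comparison only works because ISO dates sort correctly as text. If your timestamps use another format, a sketch like the following parses them with datetime before comparing; the day/month/year format and the record values here are made up for illustration:

from datetime import datetime

user_records = [
    {"id": 101, "name": "Alice", "timestamp": "15/01/2023"},
    {"id": 101, "name": "Alice Smith", "timestamp": "20/02/2023"},
    {"id": 102, "name": "Bob", "timestamp": "16/01/2023"}
]

latest_records = {}
for record in user_records:
    parsed = datetime.strptime(record["timestamp"], "%d/%m/%Y")  # parse before comparing
    user_id = record["id"]
    if user_id not in latest_records or parsed > latest_records[user_id][0]:
        latest_records[user_id] = (parsed, record)

print([f"{rec['id']}: {rec['name']}" for _, rec in latest_records.values()])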

Common errors and challenges

Python developers often encounter three key challenges when removing duplicates: handling unhashable types, preserving element order, and managing case sensitivity.

Dealing with unhashable types like list when removing duplicates

Python's set() function cannot directly handle lists as elements because lists are mutable. When you try to convert nested lists into a set, Python raises a TypeError. The code below demonstrates this common pitfall.

data = [[1, 2], [3, 4], [1, 2], [5, 6]]
unique_data = list(set(data))
print(unique_data)

The code fails because Python can't hash lists as elements within a set(). Lists can change after creation, making them incompatible with Python's hash-based data structures. Let's examine a working solution in the code below.

data = [[1, 2], [3, 4], [1, 2], [5, 6]]
unique_data = []
for item in data:
    if item not in unique_data:
        unique_data.append(item)
print(unique_data)

The solution uses a simple for loop with not in checks to handle unhashable types like lists. This approach works because it compares list elements directly instead of trying to hash them. While slightly slower than set(), it reliably removes duplicates while preserving the original order.

  • Watch for this issue when working with nested data structures like lists of lists or dictionaries
  • Consider converting unhashable elements to tuples so they become hashable; the sketch below shows one way to do this while keeping order
  • Remember that modifying list elements after deduplication could create unintended duplicates

This pattern becomes especially important when processing complex data structures from APIs or file imports that contain nested arrays or objects.
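
If the inner lists contain only hashable values, another option is to convert each one to a tuple first. This sketch uses dict.fromkeys() so the original order also survives, then restores the list type at the end:

data = [[1, 2], [3, 4], [1, 2], [5, 6]]

# Tuples are hashable, so dict.fromkeys() can deduplicate them while keeping order
unique_data = [list(t) for t in dict.fromkeys(tuple(item) for item in data)]
print(unique_data)  # [[1, 2], [3, 4], [5, 6]]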

Maintaining order when using set() to remove duplicates

The set() function efficiently removes duplicates but discards the original sequence of elements. This behavior can create unexpected results when order matters. The output below demonstrates how converting to a set rearranges the original sequence.

numbers = [10, 5, 3, 5, 10, 8]
unique_numbers = list(set(numbers))
print(unique_numbers)

The set() operation discards the original sequence, returning elements in an arbitrary, implementation-dependent order. The code below demonstrates a reliable solution that maintains the initial ordering.

from collections import OrderedDict
numbers = [10, 5, 3, 5, 10, 8]
unique_numbers = list(OrderedDict.fromkeys(numbers))
print(unique_numbers)

The OrderedDict solution elegantly preserves element sequence while removing duplicates. Unlike set(), which makes no ordering guarantee, OrderedDict.fromkeys() maintains the original position of each element in the list. This approach works consistently across all modern Python versions.

  • Watch for order preservation when deduplicating sorted data or sequences where element position carries meaning
  • Consider using this method when processing user inputs, log files, or time-series data where sequence matters
  • The slight performance overhead rarely impacts real-world applications

Removing duplicates in a case-insensitive manner

Python's set() function treats strings with different letter cases as distinct elements. When deduplicating text data, this default case-sensitive behavior often produces unexpected results by keeping both uppercase and lowercase versions of the same word.

words = ["apple", "Apple", "banana", "orange"]
unique_words = list(set(words))
print(unique_words)

The set() function treats "apple" and "Apple" as completely different strings. This creates duplicate entries in the final output when we only want one version of each word. Let's examine a solution that handles case differences properly.

words = ["apple", "Apple", "banana", "orange"]
seen = set()
unique_words = []
for word in words:
    if word.lower() not in seen:
        seen.add(word.lower())
        unique_words.append(word)
print(unique_words)

The solution uses a tracking set to store lowercase versions of words while maintaining the original case in the output list. The seen set checks for duplicates by converting each word to lowercase with word.lower(). When a new word appears, both the lowercase version enters the tracking set and the original word joins the output list.

  • Watch for case sensitivity when processing user input or text data from multiple sources
  • Consider this approach for search functionality where "Apple" and "apple" should match
  • Remember that the first occurrence of a word preserves its original capitalization

This pattern proves especially useful when cleaning data from user forms, processing search queries, or standardizing text datasets where case variations shouldn't create duplicates.
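
For text that may contain non-ASCII characters, str.casefold() is a stricter alternative to lower(). This sketch uses a dictionary's setdefault() to keep the first spelling of each word:

words = ["Straße", "STRASSE", "Apple", "apple"]

unique = {}
for word in words:
    # casefold() applies Unicode case rules ("ß" folds to "ss"),
    # and setdefault() keeps the first spelling seen for each key
    unique.setdefault(word.casefold(), word)

print(list(unique.values()))  # ['Straße', 'Apple']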

FAQs

How do you remove duplicates while preserving the original order of elements?

To remove duplicates while preserving order, you can use a set to track seen elements and build a new list with unique values. The set provides constant-time lookups to check if we've encountered an element before.

  • Create an empty set to store seen values
  • Iterate through the original list in order
  • For each element, check if it exists in the set
  • If not in the set, add it to both our result list and the set

This approach maintains the original sequence because we process elements in their initial order. It achieves O(n) time complexity since set operations are constant time.
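
Those steps translate directly into a small helper function; the name dedupe_preserve_order is just illustrative:

def dedupe_preserve_order(items):
    seen = set()  # values already added to the result
    result = []
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result

print(dedupe_preserve_order([3, 1, 3, 2, 1]))  # [3, 1, 2]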

What's the difference between using set() and dict.fromkeys() for duplicate removal?

While set() and dict.fromkeys() both remove duplicates, they serve different purposes. set() creates an unordered collection of unique elements, perfect for simple deduplication. dict.fromkeys() generates a dictionary where the input elements become keys—all mapped to the same default value.

The key distinction lies in data preservation. set() discards duplicates entirely. dict.fromkeys() maintains the original items as dictionary keys while adding the capability to associate values with them.
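
A quick illustration of the difference:

items = [1, 2, 2, 3]
print(set(items))               # {1, 2, 3}: unique values, no order guarantee
print(dict.fromkeys(items))     # {1: None, 2: None, 3: None}: keys in original order
print(dict.fromkeys(items, 0))  # {1: 0, 2: 0, 3: 0}: an optional shared default value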

Can you remove duplicates from a list containing unhashable types like dictionaries?

You can't directly remove duplicates from a list of dictionaries using set() since dictionaries aren't hashable. However, you can convert dictionaries to immutable tuples first. Transform each dictionary into a tuple of sorted items using tuple(sorted(d.items())).

This approach works because tuples of hashable elements are themselves hashable. After converting to a set of tuples to remove duplicates, transform the unique tuples back into dictionaries.
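
A short sketch of that round trip, assuming every value inside the dictionaries is itself hashable; it swaps dict.fromkeys() in for the set so the original order also survives:

records = [{"a": 1, "b": 2}, {"b": 2, "a": 1}, {"a": 3}]

# Sorting the items makes key order irrelevant; dict.fromkeys() keeps first-seen order
unique_tuples = dict.fromkeys(tuple(sorted(d.items())) for d in records)
unique_dicts = [dict(t) for t in unique_tuples]
print(unique_dicts)  # [{'a': 1, 'b': 2}, {'a': 3}]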

How do you remove duplicates based on a specific condition or key?

Python's dict.fromkeys() efficiently removes duplicates while preserving order. For more control, combine a dictionary comprehension with a key function that specifies your deduplication criteria. This approach lets you define custom logic for determining what makes items unique.

  • Create a key function that returns the values to compare
  • Use a dictionary comprehension with your key function to build a mapping
  • Extract the filtered values from the resulting dictionary

This method works because dictionaries inherently maintain unique keys. The key function transforms your data into a form that captures the essence of what makes each item distinct.
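
Here's one way that might look in practice, with an invented list of user records and an email-based key function:

users = [
    {"email": "A@example.com", "name": "Alice"},
    {"email": "a@example.com", "name": "Alicia"},
    {"email": "b@example.com", "name": "Bob"}
]

# Key function: two records count as duplicates when their emails match, ignoring case
def key(user):
    return user["email"].lower()

# The comprehension keeps the last record seen for each key;
# loop with setdefault() instead if the first occurrence should win
unique_users = list({key(u): u for u in users}.values())
print(unique_users)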

What happens to the original list when you use methods like set() to remove duplicates?

The set() method creates an entirely new object without modifying your original list. When you convert a list to a set, Python generates a new data structure in memory that contains only unique elements. The original list remains unchanged unless you explicitly reassign it.

To permanently remove duplicates from your list, you'll need to convert the set back to a list and reassign it—a process that creates a third object in memory. This explains why list(set()) is a common pattern for deduplication.
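
A quick check makes this concrete:

numbers = [1, 2, 2, 3]
unique_numbers = list(set(numbers))
print(numbers)         # [1, 2, 2, 3]: the original list still contains the duplicate
print(unique_numbers)  # the new list holds only the unique values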
