How to create a dataframe in Python

Creating a DataFrame in Python empowers you to organize and analyze data effectively. The pandas library provides multiple methods to construct DataFrames, transforming raw data into structured tables for seamless data manipulation and analysis.

This guide covers essential techniques, practical tips, and real-world applications for DataFrame creation, with code examples developed using Claude, an AI assistant built by Anthropic.

Basic dataframe creation with pandas

import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]})
print(df)
      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35

The pd.DataFrame() constructor transforms a Python dictionary into a structured table, where dictionary keys become column headers and values become the data rows. This approach provides a clean, intuitive way to create DataFrames from scratch, especially when working with small datasets or prototyping.

The dictionary format offers several advantages for DataFrame creation:

  • Column names automatically map to their corresponding data values
  • Python lists within the dictionary naturally convert to DataFrame columns
  • Pandas generates a sequential row index automatically, starting from 0

This method excels at quick data organization but becomes less practical for larger datasets. For those cases, you'll want to explore methods like CSV imports or database connections.
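
For instance, pd.read_csv() loads a file straight into a DataFrame in one call. A minimal sketch, assuming a hypothetical scores.csv file with the same columns as above:

import pandas as pd

# 'scores.csv' is a hypothetical file path; point this at your own data
df = pd.read_csv('scores.csv')
print(df.head())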

Common ways to create dataframes

Building on the basic dictionary method, Python offers several powerful approaches to construct DataFrames—from structured dictionaries to arrays that handle complex numerical data.

Create from a dictionary of lists

import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Score': [85, 92, 78]}
df = pd.DataFrame(data)
print(df)
      Name  Score
0    Alice     85
1      Bob     92
2  Charlie     78

This method creates a DataFrame by passing a dictionary where each key represents a column name and its corresponding value is a list of data. The dictionary structure naturally maps to the DataFrame's tabular format, with Name and Score becoming column headers while their list values form the rows.

  • Each list in the dictionary must have the same length to maintain data alignment
  • Pandas generates a sequential row index automatically, starting from 0
  • Column order in the DataFrame matches the dictionary's key order in Python 3.7+

The pd.DataFrame() constructor handles the conversion seamlessly. It transforms the dictionary of lists into a structured table that's ready for data analysis and manipulation.

Create from a list of dictionaries

import pandas as pd

data = [
    {'Name': 'Alice', 'Score': 85},
    {'Name': 'Bob', 'Score': 92},
    {'Name': 'Charlie', 'Score': 78}
]
df = pd.DataFrame(data)
print(df)
      Name  Score
0    Alice     85
1      Bob     92
2  Charlie     78

This approach transforms a list of dictionaries into a DataFrame, where each dictionary represents a row of data. The keys become column names while their values populate the cells. pd.DataFrame() automatically aligns matching keys across dictionaries to create a consistent table structure.

  • Each dictionary in the list can contain different keys. Pandas will create columns for all unique keys and fill missing values with NaN
  • The row order in the DataFrame matches the order of dictionaries in the list
  • This method works well when your data naturally fits a row-oriented format or when processing JSON responses from APIs

The resulting DataFrame maintains the same structure as other creation methods but offers more flexibility in handling varying data shapes and sources.
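
To see the NaN-filling from the first bullet in action, here is a minimal sketch where one dictionary omits the Score key:

import pandas as pd

data = [
    {'Name': 'Alice', 'Score': 85},
    {'Name': 'Bob'}  # no 'Score' key
]
df = pd.DataFrame(data)
print(df)
    Name  Score
0  Alice   85.0
1    Bob    NaN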

Create from NumPy arrays

import pandas as pd
import numpy as np

data = np.array([['Alice', 85], ['Bob', 92], ['Charlie', 78]])
df = pd.DataFrame(data, columns=['Name', 'Score'])
print(df)
      Name Score
0    Alice    85
1      Bob    92
2  Charlie    78

NumPy arrays provide a powerful foundation for DataFrame creation, especially when working with numerical data. The pd.DataFrame() constructor transforms a 2D array into a structured table, with each inner array becoming a row in the DataFrame.

  • The columns parameter explicitly names your DataFrame columns, making the data more meaningful and easier to reference
  • A NumPy array stores a single data type for all of its elements, so mixing strings and numbers here coerces every value to a string; convert numeric columns afterward, as the sketch below shows
  • The array's shape determines the DataFrame dimensions. In this case, a 3x2 array creates 3 rows and 2 columns

This method particularly shines when performing numerical computations or working with scientific data, as NumPy arrays offer superior performance for mathematical operations.
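
One caveat: the mixed array above stores every value as a string, so the Score column needs an explicit conversion before any math. A quick sketch, continuing from the df just created:

# Scores arrived as strings; convert them before numeric work
df['Score'] = df['Score'].astype(int)
print(df['Score'] * 2)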

Advanced dataframe creation techniques

Building on the foundational methods, pandas offers sophisticated DataFrame creation techniques that unlock custom indexing, hierarchical data structures, and seamless integration with Series objects for more nuanced data organization.

Create with custom index and column labels

import pandas as pd

data = [[85, 90], [92, 88], [78, 85]]
df = pd.DataFrame(data, 
                 index=['Alice', 'Bob', 'Charlie'],
                 columns=['Math', 'Science'])
print(df)
        Math  Science
Alice     85       90
Bob       92       88
Charlie   78       85

Custom indexing transforms how you reference and organize DataFrame data. The index parameter replaces default numeric indices with meaningful labels, while columns assigns names to each data column. This creates a more intuitive way to access specific values.

  • The data parameter accepts a nested list where each inner list represents a row
  • Index labels must be unique and match the number of rows in your data
  • Column labels must match the number of values in each row

This approach enables natural data access using descriptive labels. You can retrieve Bob's Math score with df.loc['Bob', 'Math'] instead of relying on numeric positions.
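
Continuing with the df above, a short sketch of label-based access:

# Retrieve a single cell by row and column labels
print(df.loc['Bob', 'Math'])

# Retrieve an entire row as a Series
print(df.loc['Charlie'])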

Create with multi-level indices

import pandas as pd
import numpy as np

index = pd.MultiIndex.from_tuples([('A', 1), ('A', 2), ('B', 1)], 
                                 names=['Letter', 'Number'])
df = pd.DataFrame({'Value': [0.1, 0.2, 0.3]}, index=index)
print(df)
              Value
Letter Number      
A      1        0.1
       2        0.2
B      1        0.3

Multi-level indices create hierarchical organization in your DataFrame, enabling more complex data relationships. The MultiIndex.from_tuples() function transforms tuple pairs into a two-tier index structure, with Letter as the primary level and Number as the secondary level.

  • Each tuple in the index represents a unique combination of values ('A' pairs with both 1 and 2)
  • The names parameter assigns labels to each index level for clearer data access
  • Values in the DataFrame align with their corresponding index pairs

This structure proves invaluable when your data has natural hierarchies or requires grouping across multiple categories. You can access data using both levels, similar to how you'd navigate through nested dictionaries in Python.
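
Continuing from the df above, a brief sketch of both access patterns:

# Select every row under the first index level
print(df.loc['A'])

# Drill down to a single row with a full (Letter, Number) tuple
print(df.loc[('A', 1)])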

Create from a Series object

import pandas as pd

s = pd.Series([85, 92, 78], index=['Alice', 'Bob', 'Charlie'], name='Score')
df = pd.DataFrame(s)
print(df)
        Score
Alice      85
Bob        92
Charlie    78

Converting a pandas Series to a DataFrame transforms a one-dimensional array into a structured table. The Series name becomes the column header while its index values form the row labels.

  • The name parameter in the Series constructor defines the column name ('Score' in this case)
  • The index parameter creates meaningful row labels instead of default numeric indices
  • The resulting DataFrame maintains the original Series data organization but adds the flexibility of a two-dimensional structure

This method works particularly well when you need to expand a single column of data into a more complex table structure. The DataFrame format opens up additional capabilities for data manipulation and analysis that aren't available with Series objects.
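
Two closely related conversions are worth knowing; a short sketch, continuing from the Series above:

# to_frame() performs the same one-column conversion
df = s.to_frame()

# reset_index() demotes the labels to a regular column instead
df = s.reset_index()
print(df)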

Reading and analyzing sales data with groupby

The groupby operation transforms raw sales records into actionable insights by aggregating data based on shared characteristics—in this case, calculating total sales for each product category.

import pandas as pd

sales_data = {'Product': ['A', 'B', 'A', 'C', 'B', 'A'], 
              'Amount': [100, 200, 150, 300, 250, 175]}
sales_df = pd.DataFrame(sales_data)
product_sales = sales_df.groupby('Product')['Amount'].sum()
print(product_sales)
Product
A    425
B    450
C    300
Name: Amount, dtype: int64

This code demonstrates data aggregation using pandas' powerful grouping capabilities. The dictionary sales_data contains two lists: product identifiers (A, B, C) and their corresponding sales amounts. After converting this data into a DataFrame, groupby('Product') organizes the data by unique product values. The sum() function then calculates the total sales for each product.

  • Product A's total combines 100 + 150 + 175
  • Product B's total adds 200 + 250
  • Product C represents a single sale of 300

The final output displays each product's aggregated sales amount in a clean, indexed format.

Merging datasets for customer segment analysis

Combining customer profiles with transaction data through DataFrame merging enables precise analysis of purchasing patterns across different customer segments, revealing valuable insights about Premium and Standard tier behaviors.

import pandas as pd

customers = pd.DataFrame({
    'CustomerID': [1, 2, 3, 4],
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Segment': ['Premium', 'Standard', 'Premium', 'Standard']
})
orders = pd.DataFrame({
    'OrderID': [101, 102, 103, 104, 105],
    'CustomerID': [1, 3, 3, 2, 1],
    'Amount': [150, 200, 300, 50, 100]
})
merged_data = pd.merge(orders, customers, on='CustomerID')
segment_analysis = merged_data.groupby('Segment')['Amount'].agg(['sum', 'mean'])
print(segment_analysis)
          sum   mean
Segment             
Premium   750  187.5
Standard   50   50.0

This code demonstrates data merging and aggregation in pandas. The first DataFrame stores customer profiles with their IDs, names, and segments. The second DataFrame contains order records with order IDs, customer IDs, and purchase amounts.

The pd.merge() function combines these DataFrames using CustomerID as the common key, creating a complete view of orders with customer details. Finally, groupby('Segment') organizes the data by customer segments while agg(['sum', 'mean']) calculates both total and average purchase amounts for each segment.

  • Premium customers' total spending and average order value become clear
  • Standard tier purchasing patterns emerge from the aggregated data
  • The merged dataset connects individual transactions to customer segments

Common errors and challenges

Creating DataFrames in Python requires careful attention to data types, column access methods, and handling missing values—mastering these challenges unlocks pandas' full potential.

Debugging syntax errors when accessing columns with []

When DataFrame column names contain spaces or special characters, the dot notation (df.column) breaks down. The code below demonstrates this common pitfall: attempting to access 'First Name' with dot notation fails because Python interprets the space as a syntax error.

import pandas as pd

df = pd.DataFrame({'First Name': ['Alice', 'Bob', 'Charlie'], 
                  'Age': [25, 30, 35]})
# This will fail with a SyntaxError
first_names = df.First Name
print(first_names)

The dot notation fails because Python's syntax rules don't allow spaces in attribute names. When you try to access df.First Name, Python reads it as two separate terms instead of a single column identifier. The code below demonstrates the correct approach.

import pandas as pd

df = pd.DataFrame({'First Name': ['Alice', 'Bob', 'Charlie'], 
                  'Age': [25, 30, 35]})
# Use bracket notation for column names with spaces
first_names = df['First Name']
print(first_names)

The bracket notation df['First Name'] provides a reliable way to access DataFrame columns containing spaces or special characters. While dot notation df.column_name works for simple column names, it fails when column names include spaces, hyphens, or other special characters.

  • Always use brackets when column names contain spaces or special characters
  • Watch for this issue when working with imported data, especially CSV files where column headers often include spaces
  • The bracket notation accepts any valid string as a column name. This flexibility makes it the safer choice for column access

Consider using consistent, Python-friendly column names in your data structures to avoid these access issues entirely. Underscores work well as space replacements.
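
One way to normalize headers up front; a minimal sketch using the same sample data:

import pandas as pd

df = pd.DataFrame({'First Name': ['Alice', 'Bob', 'Charlie'],
                  'Age': [25, 30, 35]})
# Replace spaces so every column works with dot notation
df.columns = df.columns.str.replace(' ', '_')
print(df.First_Name)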

Fixing column data type issues with astype()

Numeric calculations in pandas require proper data type handling. When DataFrame columns contain strings that look like numbers, mathematical operations produce unexpected results. The code below demonstrates this common issue where multiplying string values leads to unintended behavior.

import pandas as pd

data = {'ID': ['1', '2', '3'], 'Value': ['100', '200', '300']}
df = pd.DataFrame(data)
# This won't give the expected result because 'Value' is string type
result = df['Value'] * 2
print(result)

When multiplying df['Value'] by 2, pandas repeats each string ('100' becomes '100100') instead of performing mathematical multiplication. The string data type prevents numeric operations. The following code demonstrates the proper solution.

import pandas as pd

data = {'ID': ['1', '2', '3'], 'Value': ['100', '200', '300']}
df = pd.DataFrame(data)
# Convert 'Value' column to integer before multiplication
df['Value'] = df['Value'].astype(int)
result = df['Value'] * 2
print(result)

The astype() function converts DataFrame columns to the correct data type for calculations. In the example, converting string values to integers with astype(int) enables proper multiplication instead of string concatenation.

  • Watch for this issue when importing data from external sources like CSV files
  • Common data types include int, float, str, and bool
  • Always verify column data types before performing mathematical operations

This type conversion becomes especially important when working with financial data or performing calculations across multiple columns. Pandas automatically infers data types during DataFrame creation. However, explicit conversion often proves necessary for precise numerical operations.
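
When an imported column mixes numbers with stray text, astype(int) raises a ValueError. The pd.to_numeric() function offers a forgiving alternative; a minimal sketch, assuming a hypothetical 'n/a' entry:

import pandas as pd

df = pd.DataFrame({'Value': ['100', '200', 'n/a']})
# errors='coerce' turns unparseable entries into NaN instead of raising
df['Value'] = pd.to_numeric(df['Value'], errors='coerce')
print(df['Value'])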

Resolving empty or NaN results when merging with mismatched keys

Merging DataFrames with mismatched keys produces unexpected results: the default inner join silently drops unmatched rows, while outer joins fill the gaps with NaN. When key columns contain case differences or inconsistent formatting, pandas fails to match records correctly. Let's examine a common scenario where case sensitivity disrupts the merge operation.

import pandas as pd

df1 = pd.DataFrame({'key': ['A', 'B', 'C'], 'value': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['a', 'b', 'c'], 'value': [4, 5, 6]})
# This returns an empty DataFrame because no keys match
merged = pd.merge(df1, df2, on='key')
print(merged)

The merge fails because pandas treats uppercase and lowercase letters as distinct values. When df1 contains uppercase keys and df2 has lowercase keys, the inner join finds no matches and returns an empty DataFrame. The following code demonstrates the proper solution.

import pandas as pd

df1 = pd.DataFrame({'key': ['A', 'B', 'C'], 'value': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['a', 'b', 'c'], 'value': [4, 5, 6]})
# Convert keys to the same case before merging
df1['key'] = df1['key'].str.lower()
df2['key'] = df2['key'].str.lower()
merged = pd.merge(df1, df2, on='key', suffixes=('_1', '_2'))
print(merged)

Converting keys to lowercase with str.lower() before merging ensures pandas can match records correctly. The suffixes parameter adds distinct identifiers to duplicate column names, preventing confusion in the merged DataFrame. This solution maintains data integrity while combining information from both sources.

Watch for case sensitivity issues when merging data from different sources. Common scenarios include:

  • Importing data from multiple CSV files with inconsistent formatting
  • Combining user input data with database records
  • Working with APIs that return mixed-case responses

Consider standardizing key columns early in your data pipeline to prevent these matching problems. String methods like upper(), lower(), or strip() help maintain consistent formatting.
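
For example, chaining those methods cleans both stray whitespace and case in one pass; a short sketch, reusing df1 and df2 from above:

# Trim whitespace, then lowercase, before merging
df1['key'] = df1['key'].str.strip().str.lower()
df2['key'] = df2['key'].str.strip().str.lower()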

FAQs

What is the difference between a DataFrame and a Series in pandas?

A DataFrame organizes data in a two-dimensional table with labeled rows and columns, similar to a spreadsheet. It excels at handling complex datasets where you need to analyze relationships between multiple variables.

In contrast, a Series is a one-dimensional array that holds a single column of data with an index. Think of it as one column from a DataFrame. You'll often use a Series when working with time-series data or performing calculations on a single variable.

How do you check the data types of columns in a DataFrame?

The dtypes attribute provides a quick overview of data types for all columns in your DataFrame. For more detailed insights, use info() to see both data types and non-null counts—this helps identify potential data quality issues.

  • Check individual columns with df['column'].dtype
  • Use select_dtypes() to filter columns by their data types
  • Convert types using astype() when needed for analysis or memory optimization

Understanding data types helps prevent calculation errors and ensures efficient memory usage in your data analysis workflow.
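
A short sketch pulling these checks together on a small sample DataFrame:

import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Score': [85, 92]})

print(df.dtypes)                   # type of every column
print(df['Score'].dtype)           # a single column's type
print(df.select_dtypes('number'))  # keep only numeric columns
df.info()                          # types plus non-null counts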

Can you create a DataFrame from a dictionary with lists of different lengths?

Not with plain lists: passing a dictionary of unequal-length lists to pd.DataFrame() raises ValueError: All arrays must be of the same length. The workaround is to wrap each list in a pandas Series. Pandas then aligns the Series on their shared index and fills the missing positions with NaN to ensure a rectangular data structure.

The Series-based approach keeps pandas flexible for real-world data scenarios where columns vary in completeness. The DataFrame maintains data integrity by explicitly marking absent values with NaN instead of silently truncating your data.
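
A minimal sketch of the Series-based workaround:

import pandas as pd

# Plain lists of these lengths would raise ValueError; Series align by index
data = {'Name': pd.Series(['Alice', 'Bob', 'Charlie']),
        'Score': pd.Series([85, 92])}
df = pd.DataFrame(data)
print(df)
      Name  Score
0    Alice   85.0
1      Bob   92.0
2  Charlie    NaN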

What happens if you don't specify column names when creating a DataFrame?

When you skip column names in DataFrame creation, pandas automatically labels the columns with sequential integers starting from 0. This default behavior keeps the data structured but can create challenges when you need to reference specific columns later.

The auto-generated names serve as temporary placeholders, following the pattern 0, 1, 2, and so on. While this works for quick data exploration, it's generally better to provide meaningful column names that describe your data—especially when sharing code or building data pipelines that others will maintain.

How do you add a new column to an existing DataFrame after creation?

Adding a new column to a DataFrame gives you flexibility to expand your data structure after creation. The simplest approach uses bracket notation with df['new_column'] to define the column name, followed by an equals sign and your desired values. For more complex scenarios, you can use assign() to create columns based on other columns' values.

  • Bracket notation works well for straightforward assignments and maintains readable code
  • The assign() method creates a new DataFrame copy—ideal when you need to preserve the original
  • Both methods seamlessly integrate with pandas' vectorized operations for efficient data manipulation
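
A brief sketch of both approaches, using hypothetical Passed and Bonus columns:

import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Score': [85, 92]})

# Bracket notation adds the column in place
df['Passed'] = df['Score'] >= 90

# assign() returns a new copy, leaving df itself unchanged
df2 = df.assign(Bonus=df['Score'] * 0.1)
print(df2)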
