
Working with Large Datasets using Pandas and JSON in Python

November 26, 2024 · 14 min read · By JSON Formatter Team

Working with large JSON datasets can be overwhelming, especially when they're too large to fit into memory. Learn how to use Python and Pandas to efficiently explore and analyze even massive datasets.

Real-World Data Analysis: Combining Python with tools like Pandas can make exploring and analyzing even massive datasets manageable. We'll work through a real-world example using traffic violation data.

Understanding Structured vs Unstructured Data

When data is stored in SQL databases, it typically follows a rigid, tabular structure where each row represents a record and each column represents a field. This structured format makes it easy to query specific facts.

However, as the amount of data we collect grows, we often encounter unstructured data—data that doesn't follow a consistent or predefined format. JSON is particularly well-suited for this type of data, especially when it includes nested fields and varying structures.

JSON Data Example

{
  "event_type": "started-lesson",
  "keen": {
    "created_at": "2015-06-12T23:09:03.966Z",
    "id": "557b668fd2eaaa2e7c5e916b",
    "timestamp": "2015-06-12T23:09:07.971Z"
  },
  "sequence": 1
}

In this example, the `keen` field is nested within the main object. This flexible structure makes JSON a popular format for storing unstructured data, as it can easily represent complex relationships.
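
Once parsed into Python (covered in the next section), an object like this behaves as ordinary nested dictionaries, so nested fields are reached by chaining keys. A quick sketch using the example above (the variable name `event` is just for illustration):

# The parsed object is a dictionary with nested dictionaries inside
event = {
    "event_type": "started-lesson",
    "keen": {
        "created_at": "2015-06-12T23:09:03.966Z",
        "id": "557b668fd2eaaa2e7c5e916b",
        "timestamp": "2015-06-12T23:09:07.971Z"
    },
    "sequence": 1
}

# Reach nested fields by chaining keys
print(event["keen"]["created_at"])  # 2015-06-12T23:09:03.966Z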

Python's JSON Module

Python makes working with JSON data straightforward thanks to its built-in `json` library. With it, you can easily convert JSON strings into Python lists or dictionaries, and vice versa.

Basic JSON Operations

import json

# Reading JSON from a file
with open('data.json', 'r') as f:
    data = json.load(f)

# Converting Python dictionary to JSON string
json_string = json.dumps(data)

# Parsing JSON string
data = json.loads(json_string)
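
The module also offers `json.dump`, the file-writing counterpart of `dumps`, for serializing straight to an open file:

import json

data = {"event_type": "started-lesson", "sequence": 1}

# Write JSON to a file; indent=2 makes the output human-readable
with open('output.json', 'w') as f:
    json.dump(data, f, indent=2)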

Exploring JSON Data from Command Line

Before importing data into Python, it's helpful to explore the structure using command line tools. This gives you a sense of the data's organization without loading it into memory.

Viewing First Few Lines

# View first few lines of JSON file
head data.json

# View last few lines
tail data.json

# Check file size
ls -lh data.json

Finding Top-Level Keys

You can use `grep` to identify top-level keys in a pretty-printed JSON file, where the root object's keys are indented by two spaces:

# Find top-level keys (indented by 2 spaces)
grep -E '^ {2}"' data.json

Streaming Large JSON Files

For large JSON files that don't fit into memory, streaming parsers like `ijson` allow you to process data incrementally.

Using ijson for Streaming

import ijson

# Stream parse a large JSON file: 'data.item' yields each element
# of the array stored under the top-level "data" key
with open('large_file.json', 'rb') as f:
    parser = ijson.items(f, 'data.item')

    for item in parser:
        # Process each item without loading the entire file
        print(item)

Working with Pandas

Once you have the JSON data in Python, Pandas provides powerful tools for analysis and manipulation. You can convert JSON data into DataFrames for easier analysis.

Loading JSON into Pandas

import pandas as pd

# Load JSON data into DataFrame
df = pd.read_json('data.json')

# Load JSON from URL
df = pd.read_json('https://api.example.com/data.json')

# Load with specific orientation
df = pd.read_json('data.json', orient='records')
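
Note that `pd.read_json` works best on flat, tabular JSON. For nested objects like the earlier `keen` example, `pd.json_normalize` flattens nested fields into dotted column names. A minimal sketch with an illustrative record:

import pandas as pd

# Flatten nested records into a DataFrame with dotted column names
records = [
    {
        "event_type": "started-lesson",
        "keen": {"created_at": "2015-06-12T23:09:03.966Z", "id": "557b668fd2eaaa2e7c5e916b"},
        "sequence": 1
    }
]

df = pd.json_normalize(records)
# Columns include 'event_type', 'sequence', 'keen.created_at', 'keen.id'
print(df.columns.tolist())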

Exploring Data Structure

# View first few rows
print(df.head())

# Get data info
print(df.info())

# Get statistics
print(df.describe())

# Check for missing values
print(df.isnull().sum())

Analyzing Traffic Violation Data

Let's work through a real example using traffic violation data. This dataset contains information about location, vehicle type, demographics, and violation details.

Loading and Exploring the Data

import pandas as pd
import json

# Load JSON data
with open('traffic_violations.json', 'r') as f:
    data = json.load(f)

# Convert to DataFrame
df = pd.DataFrame(data['data'])

# Set column names if needed
column_names = ['id', 'guid', 'timestamp', 'date', 'description', 
                'location', 'state', 'violation_type', 'article', 
                'statute', 'article_text']
df.columns = column_names

# Explore the data
print(df.head())
print(df.shape)
print(df.info())

Data Cleaning and Transformation

JSON data often requires cleaning and transformation before analysis. Here are common operations:

Converting Date Strings

# Convert date column to datetime
df['date'] = pd.to_datetime(df['date'])

# Extract date components
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df['hour'] = df['date'].dt.hour

Parsing Location Data

If location data is stored as strings with coordinates, you can parse and separate it:

import re

# Parse location string with coordinates
def parse_coordinates(location_str):
    match = re.search(r'\((-?\d+\.\d+),\s*(-?\d+\.\d+)\)', location_str)
    if match:
        return float(match.group(1)), float(match.group(2))
    return None, None

df['latitude'], df['longitude'] = zip(*df['location'].apply(parse_coordinates))

Data Analysis and Visualization

Once your data is cleaned, you can perform various analyses and create visualizations.

Filtering and Aggregating Data

# Filter data for specific conditions
morning_traffic = df[
    (df['hour'] >= 6) & 
    (df['hour'] <= 9) &
    (df['date'].dt.weekday < 5)  # Weekdays only
]

# Count violations by type
violations_by_type = df['violation_type'].value_counts()

# Group by hour
hourly_counts = df.groupby('hour').size()

print(violations_by_type)
print(hourly_counts)

Creating Visualizations

import matplotlib.pyplot as plt

# Create histogram of violations by hour
plt.figure(figsize=(12, 6))
df['hour'].hist(bins=24, edgecolor='black')
plt.xlabel('Hour of Day')
plt.ylabel('Number of Violations')
plt.title('Traffic Violations by Hour of Day')
plt.show()

# Create bar chart of top violation types
top_violations = violations_by_type.head(10)
top_violations.plot(kind='barh', figsize=(10, 6))
plt.xlabel('Count')
plt.title('Top 10 Traffic Violations')
plt.show()

Mapping Geographic Data

For datasets with location information, you can create interactive maps using libraries like folium.

import folium

# Create base map
m = folium.Map(location=[39.0836, -77.1483], zoom_start=11)

# Add markers for the first 100 stops that have parsed coordinates
for index, row in df.dropna(subset=['latitude', 'longitude']).head(100).iterrows():
    folium.Marker(
        location=[row['latitude'], row['longitude']],
        popup=row['description']
    ).add_to(m)

# Save map
m.save('traffic_map.html')

Best Practices for Large JSON Datasets

Use Streaming for Large Files

For files larger than available memory, use streaming parsers like ijson instead of loading everything at once.
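
Streaming also makes it possible to compute aggregates in a single pass without ever materializing the file. A sketch assuming the same `data.item` layout used earlier, with rows as lists in the column order shown above (treating index 7 as `violation_type` is an assumption based on that column list):

from collections import Counter

import ijson

# One streaming pass: count violation types without loading the file
counts = Counter()
with open('traffic_violations.json', 'rb') as f:
    for row in ijson.items(f, 'data.item'):
        counts[row[7]] += 1  # index 7 = violation_type, per the column list above

print(counts.most_common(10))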

Explore Data Structure First

Use command line tools to understand the data structure before loading into Python to save time and memory.

Clean and Transform Efficiently

Apply data transformations early and filter data to reduce dataset size before intensive analysis.
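
In practice this means keeping only the columns you need, choosing memory-efficient dtypes, and filtering before heavy operations. A sketch using the traffic DataFrame from earlier (the year cutoff is illustrative):

# Keep only the columns the analysis actually uses
df = df[['date', 'violation_type', 'latitude', 'longitude']]

# Low-cardinality strings take far less memory as categoricals
df['violation_type'] = df['violation_type'].astype('category')

# Filter early so later groupbys and plots touch less data
df = df[df['date'].dt.year >= 2012]

print(df.memory_usage(deep=True))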

Conclusion

Working with large JSON datasets in Python using Pandas opens up powerful possibilities for data analysis. By combining command line exploration, streaming parsers, and Pandas DataFrames, you can efficiently explore and analyze even massive datasets.

The key is to understand your data structure, use appropriate tools for the data size, and apply transformations strategically to keep your analysis efficient and manageable.

Validate Your JSON Before Analysis

Use our free JSON formatter to validate and format your JSON data before importing it into Python. Ensure your data is properly structured for analysis.
