How to Create Publication-Quality UpSet Plots (R & Python Guide)

Master the art of UpSet plots. Complete guide with R (ComplexHeatmap) and Python (upsetplot) code, color palettes, and interpretation guides for complex dataset intersections.

Dr. Sarah Chen

Handling complex set intersections is a common challenge in data science and bioinformatics. When you have more than three sets, Venn diagrams transform from helpful visualizations into "hairballs" of illegible overlapping shapes.

The solution used by top researchers in Nature and Cell? The UpSet plot.

In this guide, we'll deconstruct publication-quality UpSet plots to understand why they work, and then provide the exact R and Python code you need to create them for your own research.

The Problem: The "Hairball" Effect

Venn diagrams rely on overlapping shapes, usually circles, to represent sets. This works well for A vs. B, and it's manageable for A vs. B vs. C. But add a fourth set and the geometry breaks down: four circles cannot show all 15 possible intersections, so you are forced into ellipses or irregular shapes whose areas are hard to compare.
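To make the scaling problem concrete, here is a minimal sketch that counts how many regions a diagram needs in order to show every possible combination of n sets. The function name is ours, chosen for illustration. The count is 2^n - 1, so it roughly doubles with each added set.

```python
def possible_intersections(n_sets: int) -> int:
    """Number of non-empty membership combinations for n_sets sets (2^n - 1)."""
    if n_sets < 0:
        raise ValueError("n_sets must be non-negative")
    return 2 ** n_sets - 1


# Print the region count for 2 to 7 sets
for n in range(2, 8):
    print(f"{n} sets -> {possible_intersections(n)} possible intersections")
```

At seven sets, the size of the multi-omics example later in this guide, a Venn-style figure would need 127 distinct regions.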

Researchers often try to force it, resulting in figures where:

  • Intersection sizes are impossible to judge visually.
  • Labels become cluttered and unreadable.
  • The most important biological patterns (e.g., "genes shared by all treatment conditions") get lost in the center.

The Solution: The Matrix Layout

UpSet plots (Lex et al., 2014) solve this by treating intersections as a matrix.

  1. Columns represent intersections.
  2. Rows represent the sets.
  3. Dots and Lines indicate which sets are part of an intersection.
  4. Bar Charts on top show the size of that specific intersection.

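This approach scales far better than overlapping shapes. Each new set adds just one row, and only the intersections you choose to show (typically the non-empty ones) need a column. The number of possible intersections still grows as 2^n, so with many sets you will want to filter, but the layout itself stays crisp and readable whether you have 4 sets or 40.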
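The numbers behind the bars can be sketched in a few lines of plain Python. This is an illustrative example, not how any particular library implements it: the function name and the toy data are our own assumptions. Each element is assigned to the exact combination of sets it belongs to, which is the exclusive (sometimes called "distinct") counting that UpSet plots conventionally display.

```python
from collections import Counter


def exclusive_intersections(sets: dict[str, set[str]]) -> Counter:
    """Count elements by the exact combination of sets they belong to."""
    signature_counts: Counter = Counter()
    all_elements = set().union(*sets.values())
    for element in all_elements:
        # The signature is one column of the matrix: which rows get a filled dot
        signature = tuple(name for name, members in sets.items() if element in members)
        signature_counts[signature] += 1
    return signature_counts


# Hypothetical toy data, for illustration only
toy_sets = {
    "A": {"g1", "g2", "g3", "g4"},
    "B": {"g1", "g2", "g5"},
    "C": {"g1", "g6"},
}

counts = exclusive_intersections(toy_sets)
for signature, size in counts.most_common():
    print(" & ".join(signature), size)
```

Each key of the result corresponds to one column (the dots), and each value to the height of the bar above it. Note that "g1" is counted only in the A & B & C column, not again under A & B.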

Real-World Examples from Top Journals

Let's look at how top journals use UpSet plots to communicate complex multi-omics and clinical data.

1. Multi-Omics Integration (Nature Medicine)

Nature Medicine: Feasibility of multiomics tumor profiling

Why it works:

  • Sorting: The intersections are sorted by size (descending), immediately highlighting the most common data availability scenarios.
  • Clarity: It visualizes the overlap between 7 different technology platforms (scDNA, scRNA, etc.). A 7-way Venn diagram would be impossible.
  • Context: The side bars show the total count for each technology, providing essential context for the intersection sizes.

2. Spatial Transcriptomics Benchmarking (Genome Biology)

Genome Biology: Benchmarking spatial transcriptomics technologies

Why it works:

  • Comparison: Effectively compares gene detection across different sample preparations (FFPE vs. OCT).
  • Sparsity: Shows that many genes are unique to specific conditions (the single-dot columns), a finding that might be obscured in a crowded Venn diagram.

3. Chromatin Loop Dynamics (BMC Biology)

BMC Biology: CTCF-anchored chromatin loop dynamics

Why it works:

  • Cell State Analysis: Tracks chromatin loops across 5 distinct cell populations in spermatogenesis.
  • Intersection Logic: Clearly separates "constitutive" loops (shared by all) from "stage-specific" loops (unique to one).

4. Pathway Enrichment Analysis (Cell Reports)

Cell Reports: Insulin hypersecretion pathway analysis

Why it works:

  • Integration: Combines UpSet-style logic with enrichment statistics, showing how different gene modules (WGCNA) overlap with defined clusters.
  • Color Coding: Uses color to map statistical significance or other metadata onto the bars, adding another layer of information.

How to Create UpSet Plots: Code Guide

Here is how you can recreate these publication-quality figures using industry-standard tools.

R: The Gold Standard (ComplexHeatmap)

In the R ecosystem, while UpSetR is popular, ComplexHeatmap is the superior choice for publication figures because it allows for extensive annotation and integration with other heatmaps.

library(ComplexHeatmap)

# 1. Prepare your data (List of sets)
# Example: Genes found significantly mutated in different cancer types
genes_list = list(
    Breast = c("TP53", "PIK3CA", "GATA3", "MAP3K1", "KMT2C"),
    Lung   = c("TP53", "KRAS", "EGFR", "STK11", "KEAP1"),
    Colon  = c("APC", "TP53", "KRAS", "PIK3CA", "BRAF"),
    Kidney = c("VHL", "PBRM1", "SETD2", "BAP1", "KDM5C"),
    Ovary  = c("TP53", "BRCA1", "BRCA2", "NF1", "CDK12")
)

# 2. Convert to Combination Matrix
m = make_comb_mat(genes_list)

# 3. Create the Plot with Publication Styling
UpSet(m,
    # Sort by intersection size to emphasize main patterns
    comb_order = order(comb_size(m), decreasing = TRUE),
    
    # Customize standard aesthetics
    pt_size = unit(3, "mm"), 
    lwd = 2,
    
    # Add annotations and styling
    top_annotation = HeatmapAnnotation(
        "Intersection Size" = anno_barplot(
            comb_size(m), 
            gp = gpar(fill = "#2C3E50"), # Dark blue-grey (Nature style)
            height = unit(3, "cm")
        ), 
        annotation_name_side = "left"
    ),
    
    right_annotation = rowAnnotation(
        "Set Size" = anno_barplot(
            set_size(m),
            gp = gpar(fill = "#E74C3C"), # Muted Red
            width = unit(3, "cm")
        )
    )
)

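Pro Tip: Use comb_col to color the dots and connecting lines of specific intersections (e.g., to highlight the "all shared" intersection). To color the matching bars as well, pass a vector of fill colors to gpar(fill = ...) in anno_barplot.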

Python: The Versatile Option (upsetplot)

Python's upsetplot library allows for easy integration with pandas and matplotlib.

import matplotlib.pyplot as plt
from upsetplot import UpSet, from_contents

# 1. Prepare data (Dictionary of sets)
# Example: Shared users across different platforms
data_contents = {
    'Platform A': ['u1', 'u2', 'u3', 'u4', 'u5', 'u6'],
    'Platform B': ['u1', 'u2', 'u7', 'u8', 'u9', 'u10'],
    'Platform C': ['u1', 'u3', 'u7', 'u11', 'u12', 'u13'],
    'Platform D': ['u1', 'u4', 'u8', 'u12', 'u14', 'u15']
}

# 2. Transform to multi-index series
data = from_contents(data_contents)

# 3. Create the plot
fig = plt.figure(figsize=(10, 6))
upset = UpSet(data, 
    subset_size='count', 
    show_counts=True, 
    sort_by='cardinality',
    # Aesthetic tweaks
    facecolor='#2C3E50',  # default color for bars and dots
    element_size=40,
    intersection_plot_elements=3
)

# Highlight the intersection shared by all four platforms
upset.style_subsets(present=['Platform A', 'Platform B', 'Platform C', 'Platform D'], 
                   facecolor='#E74C3C', label="Shared by All")

upset.plot(fig=fig)
# suptitle labels the whole figure; plt.title would only label the current subplot
fig.suptitle("User Overlap Analysis", fontsize=16)
plt.show()

Styling & Color Palettes

For a professional look, avoid the default bright colors.

Nature / Science Style:

  • Primary Bars: Dark Grey (#333333) or Navy Blue (#00468B)
  • Set-Size Bars: Deep Red (#ED0000) or Forest Green (#009900).
  • Background: Clean white; remove grid lines from the bar charts if they add clutter.

Colorblind Safe:

  • Palette: Okabe-Ito or Viridis.
  • Contrast: Ensure high contrast between the dots and the background matrix lines.
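As a rough way to check contrast before committing to a palette, the sketch below lists the Okabe-Ito colors and computes each one's contrast ratio against a white background, using the relative-luminance formula from the WCAG accessibility guidelines. The helper names are our own, and WCAG thresholds were written for text rather than plot marks, so treat the numbers as a guide rather than a rule.

```python
# Okabe-Ito colorblind-safe palette
OKABE_ITO = {
    "black": "#000000",
    "orange": "#E69F00",
    "sky blue": "#56B4E9",
    "bluish green": "#009E73",
    "yellow": "#F0E442",
    "blue": "#0072B2",
    "vermillion": "#D55E00",
    "reddish purple": "#CC79A7",
}


def relative_luminance(hex_color: str) -> float:
    """WCAG relative luminance of an sRGB color given as '#RRGGBB'."""
    hex_color = hex_color.lstrip("#")
    channels = [int(hex_color[i:i + 2], 16) / 255 for i in (0, 2, 4)]
    # Undo the sRGB gamma encoding for each channel
    r, g, b = [
        c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4
        for c in channels
    ]
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(color_a: str, color_b: str) -> float:
    """WCAG contrast ratio, from 1 (identical) to 21 (black on white)."""
    lum_a, lum_b = relative_luminance(color_a), relative_luminance(color_b)
    lighter, darker = max(lum_a, lum_b), min(lum_a, lum_b)
    return (lighter + 0.05) / (darker + 0.05)


for name, hex_color in OKABE_ITO.items():
    ratio = contrast_ratio(hex_color, "#FFFFFF")
    print(f"{name:15s} {hex_color}  contrast vs. white: {ratio:.2f}")
```

Colors with a low ratio against white, yellow in particular, are better kept for fills with a dark outline than for small dots on a white matrix.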

Decision Framework: When to Use UpSet?

| Feature                | Venn Diagram                 | UpSet Plot                     |
|------------------------|------------------------------|--------------------------------|
| Number of Sets         | 2-3 (Perfect)                | 4+ (Essential)                 |
| Quantitative Precision | Low (Area is hard to judge)  | High (Bar charts are precise)  |
| Empty Intersections    | Hard to show "zero" overlap  | Explicitly shows emptiness     |
| Space Efficiency       | Compact                      | Requires more width            |
| Audience               | General Public               | Technical / Scientific         |

Common Mistakes to Avoid

  1. Sorting by Degree: Don't just sort by the number of sets (1-way, 2-way...). Usually, sorting by intersection size (cardinality) tells the most interesting data story.
  2. Too Many Intersections: If you have 50 sparsely overlapping sets, you might get hundreds of columns. Filter out negligible intersections (e.g., n < 5): in upsetplot, pass min_subset_size or min_degree to UpSet; in ComplexHeatmap, subset the combination matrix, e.g. m[comb_size(m) >= 5].
  3. Ignoring Empty Intersections: Sometimes the fact that Set A and Set B share no elements is the most important finding. Ensure your plot settings don't hide empty intersections when they are relevant.
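The first two points can be sketched in plain Python to show what the sorting and filtering options do to the columns. This is an illustration under our own assumptions: the function, its parameter names, and the counts are hypothetical and only loosely mirror the options the plotting libraries expose.

```python
def select_intersections(counts, min_size=1, min_degree=1, sort_by="cardinality"):
    """Filter intersections by size and degree, then order the columns.

    counts maps a tuple of set names (one column) to its intersection size.
    """
    kept = [
        (combo, size)
        for combo, size in counts.items()
        if size >= min_size and len(combo) >= min_degree
    ]
    if sort_by == "cardinality":
        # Largest intersections first
        kept.sort(key=lambda item: (-item[1], len(item[0]), item[0]))
    elif sort_by == "degree":
        # 1-way first, then 2-way, and so on
        kept.sort(key=lambda item: (len(item[0]), -item[1], item[0]))
    else:
        raise ValueError("sort_by must be 'cardinality' or 'degree'")
    return kept


# Hypothetical intersection sizes, for illustration only
example_counts = {
    ("A",): 120,
    ("B",): 3,
    ("C",): 2,
    ("A", "B"): 45,
    ("B", "C"): 4,
    ("A", "B", "C"): 60,
}

by_size = select_intersections(example_counts, min_size=5)
by_degree = select_intersections(example_counts, min_size=5, sort_by="degree")
print([combo for combo, _ in by_size])
print([combo for combo, _ in by_degree])
```

With these numbers, sorting by cardinality places the three-way intersection right after the largest single set, while sorting by degree pushes it to the far right of the plot. To keep an empty intersection visible (the third point), include it in counts with size 0 and set min_size=0.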

Ready to upgrade your figures? Start by exploring our curated collection of UpSet Plots to find design inspiration for your next publication.
