Wikipedia Knowledge Graph Visualization

A three-dimensional graph-based visualization system for exploring semantic relationships in Wikipedia articles using neural embeddings, dimensionality reduction, and clustering algorithms.

Production note: the current implementation uses sentence-transformer embeddings, Louvain clustering, and either UMAP or force-directed 3D layout generation. Older references to OpenAI-only embeddings or HDBSCAN reflect earlier research notes.

Abstract

This research project presents a novel approach to visualizing the semantic structure of Wikipedia articles in three-dimensional space. By leveraging state-of-the-art natural language processing techniques, specifically transformer-based neural embeddings, we transform textual content into high-dimensional vector representations that capture semantic meaning. These representations are then reduced to three dimensions using Uniform Manifold Approximation and Projection (UMAP), enabling spatial visualization where semantic similarity corresponds to geometric proximity.

Articles are grouped into thematic clusters using graph community detection. In production, Louvain clustering is computed from the article link graph, while 3D coordinates are generated with either UMAP over semantic embeddings or a force-directed layout over the graph itself. The resulting visualization represents a knowledge graph where nodes are articles, edges represent hyperlink relationships, and spatial positioning reflects semantic or structural proximity.

Technical Architecture

1. Data Acquisition and Processing

Wikipedia articles are extracted using the MediaWiki API, capturing both content and metadata. The raw text undergoes preprocessing to remove markup and extract clean, semantically meaningful content. Article metadata, including titles, categories, and link structures, is preserved for subsequent graph construction.
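As a rough illustration of this step, the sketch below builds the parameters for a single MediaWiki `action=query` request and flattens the JSON response into one record. The endpoint (English Wikipedia), the helper names `build_query_params` and `parse_page`, and the choice of `prop` values are assumptions made for the example, not the project's actual extraction code. Continuation handling for long link lists is omitted.

```python
from typing import Any

API_URL = "https://en.wikipedia.org/w/api.php"  # assumed endpoint (English Wikipedia)


def build_query_params(title: str) -> dict[str, str]:
    """Parameters for one MediaWiki `action=query` request (first batch only)."""
    return {
        "action": "query",
        "format": "json",
        "formatversion": "2",      # pages come back as a list
        "titles": title,
        "prop": "extracts|links|categories",
        "explaintext": "1",        # plain text instead of HTML
        "plnamespace": "0",        # only links to main-namespace articles
        "pllimit": "max",
        "cllimit": "max",
    }


def parse_page(response: dict[str, Any]) -> dict[str, Any]:
    """Flatten the first page of a formatversion=2 response into a record."""
    page = response["query"]["pages"][0]
    return {
        "title": page["title"],
        "text": " ".join(page.get("extract", "").split()),  # collapse whitespace
        "links": [link["title"] for link in page.get("links", [])],
        "categories": [
            cat["title"].removeprefix("Category:")
            for cat in page.get("categories", [])
        ],
    }
```

The parameters can be passed to any HTTP client as the query string of a GET request to `API_URL`; the parsed record keeps the title, categories, and outgoing links that the later graph construction step relies on.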

2. Neural Embeddings

Each article is transformed into a dense vector representation using a sentence-transformer encoder. The default production model is all-MiniLM-L6-v2, which produces 384-dimensional vectors that capture semantic relationships between articles. The embeddings are computed using the following process:

e(a) = Embedding_Model(text(a))

where:

- e(a) ∈ ℝ^384 is the embedding vector

- text(a) is the preprocessed article text

- Embedding_Model is the neural encoder

The resulting embeddings capture semantic relationships such that articles with similar content have vectors with high cosine similarity. This property is fundamental to the subsequent dimensionality reduction and clustering steps.
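A minimal sketch of the embedding step and the similarity measure it relies on follows. `embed_articles` assumes the `sentence-transformers` package is installed and imports it lazily; `cosine_similarity` is a plain-Python helper written for this example, not a function taken from the project.

```python
import math
from typing import Sequence


def embed_articles(texts: Sequence[str]):
    """Encode article texts with all-MiniLM-L6-v2 (384 dimensions per text)."""
    # Imported lazily so the similarity helper below works without the package.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")
    return model.encode(list(texts))  # array of shape (len(texts), 384)


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """cos(u, v) = (u . v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

Two articles on closely related topics should yield a similarity near 1, while unrelated articles should score noticeably lower; the exact values depend on the model and the preprocessed text.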

3. Dimensionality Reduction via UMAP

Uniform Manifold Approximation and Projection (UMAP) reduces the 384-dimensional embeddings to 3D coordinates suitable for visualization. The production pipeline also supports a force-directed alternative that derives positions directly from the hyperlink graph when structural layout is preferred.

UMAP algorithm parameters:

- n_neighbors: 15

- min_dist: 0.1

- n_components: 3

- metric: cosine
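These parameters map directly onto the constructor of `umap.UMAP` from the umap-learn package. The sketch below shows one way the reduction could be wired up; the function name `reduce_to_3d` and the fixed seed of 42 are assumptions for illustration (the source states a seed only for Louvain).

```python
UMAP_PARAMS = {
    "n_neighbors": 15,
    "min_dist": 0.1,
    "n_components": 3,
    "metric": "cosine",
}


def reduce_to_3d(embeddings, seed: int = 42):
    """Project (n_articles, 384) embeddings to (n_articles, 3) coordinates."""
    import umap  # provided by the umap-learn package; imported lazily

    # random_state is an assumed choice to make layouts reproducible.
    reducer = umap.UMAP(random_state=seed, **UMAP_PARAMS)
    return reducer.fit_transform(embeddings)
```

Larger `n_neighbors` values emphasize broader structure at the cost of fine local detail, and smaller `min_dist` values pack similar articles more tightly.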

The algorithm constructs a fuzzy topological representation of the high-dimensional data using a k-nearest-neighbors graph, then optimizes a low-dimensional layout by minimizing the cross-entropy between the high- and low-dimensional edge probability distributions:

minimize: CE = Σ(v_ij * log(v_ij / w_ij) + (1-v_ij) * log((1-v_ij) / (1-w_ij)))

where:

- v_ij: high-dimensional edge probability

- w_ij: low-dimensional edge probability
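To make the objective concrete, the following sketch evaluates the cross-entropy above for small lists of edge probabilities. It is a direct transcription of the formula, not UMAP's actual optimizer (which uses stochastic gradient descent with negative sampling); the function name and the clipping constant `eps` are assumptions added to keep the logarithms finite.

```python
import math


def fuzzy_cross_entropy(v, w, eps=1e-12):
    """Sum of per-edge cross-entropy terms between probabilities v and w."""
    total = 0.0
    for v_ij, w_ij in zip(v, w):
        # Clip to (0, 1) so the logarithms stay finite.
        v_c = min(max(v_ij, eps), 1.0 - eps)
        w_c = min(max(w_ij, eps), 1.0 - eps)
        total += v_c * math.log(v_c / w_c)
        total += (1.0 - v_c) * math.log((1.0 - v_c) / (1.0 - w_c))
    return total
```

The value is zero when the low-dimensional probabilities match the high-dimensional ones exactly and grows as the layout places strongly connected articles far apart or weakly connected ones close together.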

This approach preserves local neighborhood structure and tends to retain more of the global data topology than methods like t-SNE, which makes it well suited to visualization tasks.

4. Graph Community Detection

The production pipeline uses Louvain community detection on the filtered Wikipedia link graph. This groups articles according to graph connectivity rather than requiring a fixed cluster count in advance.

Louvain implementation details:

- graph input: filtered article link graph

- optimization target: modularity

- deterministic seed: 42

This keeps cluster coloring aligned with network structure while allowing the 3D layout itself to remain semantic or structural.
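The sketch below shows how such a partition could be computed with python-louvain and how the modularity score it optimizes is defined. Treating the hyperlink graph as undirected is an assumption made here because `best_partition` only accepts undirected graphs; the helper names are hypothetical, and `modularity` is a plain-Python version for unweighted edges written for illustration.

```python
from collections import defaultdict


def detect_communities(edges, seed=42):
    """Louvain partition {node: community_id} of an undirected link graph."""
    # Lazy imports: networkx and python-louvain (imported as `community`).
    import networkx as nx
    import community as community_louvain

    graph = nx.Graph()  # assumed: links are collapsed to undirected edges
    graph.add_edges_from(edges)
    return community_louvain.best_partition(graph, random_state=seed)


def modularity(edges, partition):
    """Q = sum over communities c of L_c / m - (d_c / 2m)^2 (unweighted)."""
    m = len(edges)
    internal = defaultdict(int)  # L_c: edges with both endpoints in c
    degree = defaultdict(int)    # d_c: total degree of the nodes in c
    for a, b in edges:
        degree[partition[a]] += 1
        degree[partition[b]] += 1
        if partition[a] == partition[b]:
            internal[partition[a]] += 1
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)
```

For two triangles joined by a single bridge edge, splitting the nodes into the two triangles gives Q = 5/14, while placing every node in one community gives Q = 0, which is why the algorithm prefers the split.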

Earlier HDBSCAN experiments informed the project direction but are not the clustering method used by the current production code.

5. Graph Construction and Link Analysis

The knowledge graph is constructed by analyzing Wikipedia's hyperlink structure. Each article corresponds to a node positioned at its pipeline-derived 3D coordinates. Edges are created based on hyperlinks between articles, and the explorer can optionally render the global network as a 3D edge overlay:

G = (V, E) where:

V = set of article nodes

E = set of directed edges (a_i → a_j)

Edge weight w_ij:

- w_ij = 1 if the link is unidirectional

- w_ij = 2 if the link is bidirectional

Bidirectional links (where both articles reference each other) indicate stronger semantic relationships and are visually distinguished with increased opacity and line thickness in the rendering.
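The weighting rule can be sketched as follows. The function name `weighted_edges`, the representation of links as `(source, target)` title pairs, and the decision to skip self-links are assumptions for this example.

```python
def weighted_edges(links):
    """Map each linked pair to weight 1 (one-way) or 2 (mutual)."""
    link_set = set(links)
    weights = {}
    for src, dst in link_set:
        if src == dst:
            continue  # assumed: self-links carry no information
        pair = tuple(sorted((src, dst)))  # one key per unordered pair
        weights[pair] = 2 if (dst, src) in link_set else 1
    return weights
```

Given the links A → B, B → A, and A → C, the pair (A, B) receives weight 2 and the pair (A, C) receives weight 1.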

6. Rendering and Interaction

The visualization is rendered using WebGL through the Three.js library and React Three Fiber framework. Each article is represented as a sphere positioned at its stored (x, y, z) coordinates. Visual encoding includes:

Color: Determined by cluster membership, with each cluster assigned a distinct hue
Size: Base radius of 0.14 units, scaled to 0.22 when selected
Opacity: Full opacity (1.0) for all nodes to ensure visibility
Edges: Rendered as lines with opacity and thickness varying by link weight
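One simple way to give each cluster a distinct hue is to space hues evenly around the color wheel. The sketch below is a hypothetical helper, shown in Python for consistency with the backend examples; the saturation and value constants are assumptions, and the project's actual palette may differ.

```python
import colorsys


def cluster_colors(cluster_ids):
    """Assign each cluster an evenly spaced hue, returned as '#rrggbb'."""
    unique = sorted(set(cluster_ids))
    colors = {}
    for index, cluster in enumerate(unique):
        hue = index / len(unique)  # spread hues around the wheel
        r, g, b = colorsys.hsv_to_rgb(hue, 0.65, 0.95)  # assumed saturation/value
        colors[cluster] = "#{:02x}{:02x}{:02x}".format(
            round(r * 255), round(g * 255), round(b * 255)
        )
    return colors
```

The resulting mapping can be stored alongside each node's coordinates so that the renderer only needs to look up a color by cluster ID.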

User interaction employs orbital camera controls allowing rotation, panning, and zooming. Article selection triggers detail panel display and camera interpolation to the selected node's position, while the global network toggle can display the broader edge structure of the UMAP/layout space.

Performance Considerations

Rendering large graphs (10,000+ nodes) requires careful performance optimization. Current and planned optimizations include:

Frustum culling: Three.js automatically skips objects outside the camera's view frustum, reducing draw calls.
Instanced rendering (planned): Future implementations may draw all node spheres as instances of a single shared geometry, so that massive node counts are submitted in a single draw call.
Edge filtering: Edges can be filtered by weight or distance to reduce visual clutter while preserving important connections.
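Edge filtering by weight or distance could look like the sketch below. The function name `filter_edges`, the `(a, b, weight)` edge tuples, and the thresholds are assumptions for illustration; distance is measured as Euclidean distance between the nodes' stored 3D coordinates.

```python
import math


def filter_edges(edges, positions, min_weight=1, max_distance=None):
    """Keep edges (a, b, weight) that pass the weight and distance thresholds."""
    kept = []
    for a, b, weight in edges:
        if weight < min_weight:
            continue  # drop weak (e.g. unidirectional) links
        if max_distance is not None:
            if math.dist(positions[a], positions[b]) > max_distance:
                continue  # drop links that span too much of the layout
        kept.append((a, b, weight))
    return kept
```

Setting `min_weight=2` keeps only bidirectional links, and adding `max_distance` further restricts the overlay to connections between nearby nodes.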

Results and Discussion

The resulting visualization groups semantically related articles into spatial clusters. Articles about similar topics (e.g., science, history, geography) cluster together in 3D space, with cluster assignments emerging from community detection on the link graph rather than from arbitrary geometric divisions.

The link structure reveals interesting patterns in how Wikipedia articles reference each other. Dense connection patterns within clusters indicate self-contained topic areas, while inter-cluster edges reveal conceptual bridges between domains. Bidirectional links often correspond to strongly related concepts or complementary topics.

This approach to knowledge graph visualization provides an intuitive interface for exploring large corpora of interlinked documents, potentially applicable to domains beyond Wikipedia including academic literature, legal documents, and enterprise knowledge bases.

Technical Stack

Backend

FastAPI (Python 3.11+)
PostgreSQL with pgvector extension
Sentence Transformers (all-MiniLM-L6-v2)
UMAP-learn (dimensionality reduction)
python-louvain (community detection)
NumPy, scikit-learn

Frontend

Next.js 14 (React 18)
TypeScript
Three.js (WebGL rendering)
React Three Fiber
Tailwind CSS

Documentation

Comprehensive technical documentation covering all aspects of the system:

Reading the Graph

General Usage

Node Generation

Edge Calculation

Clustering Algorithm

Semantic Search

System Architecture

Performance & Snapshots