A three-dimensional graph-based visualization system for exploring semantic relationships in Wikipedia articles using neural embeddings, dimensionality reduction, and clustering algorithms.
This research project presents an approach to visualizing the semantic structure of Wikipedia articles in three-dimensional space. Transformer-based sentence embeddings transform textual content into high-dimensional vector representations that capture semantic meaning. These representations are then reduced to three dimensions using Uniform Manifold Approximation and Projection (UMAP), enabling spatial visualization where semantic similarity corresponds to geometric proximity.
Articles are grouped into thematic clusters using graph community detection. In production, Louvain clustering is computed from the article link graph, while 3D coordinates are generated with either UMAP over semantic embeddings or a force-directed layout over the graph itself. The resulting visualization represents a knowledge graph where nodes are articles, edges represent hyperlink relationships, and spatial positioning reflects semantic or structural proximity.
Wikipedia articles are extracted using the MediaWiki API, capturing both content and metadata. The raw text is preprocessed to remove markup and extract clean, semantically meaningful content. Article metadata, including titles, categories, and link structure, is preserved for subsequent graph construction.
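The extraction step can be sketched as follows. This is a minimal illustration, not the project's actual extraction code: the endpoint, the helper names `build_query_url` and `parse_page`, and the hand-written sample response are all assumptions. The query parameters are standard MediaWiki Action API options (`explaintext` comes from the TextExtracts extension that Wikipedia runs).

```python
import json
from urllib.parse import urlencode

# Assumed endpoint: the English Wikipedia instance of the MediaWiki Action API.
API_URL = "https://en.wikipedia.org/w/api.php"


def build_query_url(title):
    """Build a query URL requesting plain text, links, and categories."""
    params = {
        "action": "query",
        "format": "json",
        "formatversion": "2",       # pages are returned as a list
        "prop": "extracts|links|categories",
        "explaintext": "1",         # plain text instead of HTML
        "plnamespace": "0",         # only links to article-namespace pages
        "pllimit": "max",
        "cllimit": "max",
        "titles": title,
    }
    return API_URL + "?" + urlencode(params)


def parse_page(response_text):
    """Pull title, text, outgoing links, and categories from one response."""
    page = json.loads(response_text)["query"]["pages"][0]
    return {
        "title": page["title"],
        "text": page.get("extract", ""),
        "links": [link["title"] for link in page.get("links", [])],
        "categories": [cat["title"] for cat in page.get("categories", [])],
    }


# Hand-written response in the formatversion=2 shape, for illustration only.
SAMPLE_RESPONSE = json.dumps({
    "query": {"pages": [{
        "title": "Graph theory",
        "extract": "Graph theory is the study of graphs.",
        "links": [{"ns": 0, "title": "Vertex (graph theory)"}],
        "categories": [{"ns": 14, "title": "Category:Graph theory"}],
    }]}
})

article = parse_page(SAMPLE_RESPONSE)
print(build_query_url("Graph theory"))
print(article["title"], article["links"])
```

A real crawler would also need to follow the API's continuation tokens, since link and category lists for long articles arrive in several batches.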
Each article is transformed into a dense vector representation using a sentence-transformer encoder. The default production model is all-MiniLM-L6-v2, which produces 384-dimensional vectors that capture semantic relationships between articles. The embeddings are computed using the following process:
e(a) = Embedding_Model(text(a))
where:
- e(a) ∈ ℝ^384 is the embedding vector
- text(a) is the preprocessed article text
- Embedding_Model is the neural encoder
The resulting embeddings capture semantic relationships such that articles with similar content have vectors with high cosine similarity. This property is fundamental to the subsequent dimensionality reduction and clustering steps.
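To make the cosine similarity property concrete, here is a small standard-library sketch. The 4-dimensional vectors are made-up stand-ins for real 384-dimensional embeddings, and the function name `cosine_similarity` is illustrative rather than taken from the project code.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


# Toy vectors (assumed values, not real model output).
physics = [0.9, 0.1, 0.0, 0.2]
chemistry = [0.8, 0.2, 0.1, 0.3]
poetry = [0.0, 0.1, 0.9, 0.1]

print(cosine_similarity(physics, chemistry))  # close to 1: similar topics
print(cosine_similarity(physics, poetry))     # close to 0: unrelated topics
```

Because cosine similarity ignores vector length, it compares the direction of two embeddings only, which matches the cosine metric used later in the UMAP configuration.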
Uniform Manifold Approximation and Projection (UMAP) reduces the 384-dimensional embeddings to 3D coordinates suitable for visualization. The production pipeline also supports a force-directed alternative that derives positions directly from the hyperlink graph when structural layout is preferred.
UMAP algorithm parameters:
- n_neighbors: 15
- min_dist: 0.1
- n_components: 3
- metric: cosine
The algorithm constructs a fuzzy topological representation of the high-dimensional data using a k-nearest-neighbors graph, then optimizes a low-dimensional layout by minimizing cross-entropy between high and low-dimensional probability distributions:
minimize: CE = Σ(v_ij * log(v_ij / w_ij) + (1-v_ij) * log((1-v_ij) / (1-w_ij)))
where:
- v_ij: high-dimensional edge probability
- w_ij: low-dimensional edge probability
This approach preserves local neighborhood structure while retaining much of the global data topology, which is why UMAP is often preferred over methods like t-SNE for visualizing large embedding sets.
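The cross-entropy objective above can be evaluated directly for a handful of edges. The sketch below is illustrative only: it computes the loss for given probabilities and does not perform UMAP's neighbor search or gradient-based optimization. The function name and the clamping constant `EPS` are assumptions added to avoid taking the logarithm of zero.

```python
import math

EPS = 1e-12  # assumed clamp so log() never receives 0


def _clamp(p):
    return min(max(p, EPS), 1.0 - EPS)


def fuzzy_cross_entropy(v, w):
    """Sum the per-edge cross-entropy between high-dim (v) and low-dim (w)
    edge probabilities, following the formula in the text."""
    total = 0.0
    for v_ij, w_ij in zip(v, w):
        v_ij, w_ij = _clamp(v_ij), _clamp(w_ij)
        # Attractive term: penalizes placing true neighbors far apart.
        attract = v_ij * math.log(v_ij / w_ij)
        # Repulsive term: penalizes placing non-neighbors close together.
        repel = (1.0 - v_ij) * math.log((1.0 - v_ij) / (1.0 - w_ij))
        total += attract + repel
    return total


high_dim = [0.9, 0.2, 0.05]    # toy edge probabilities (assumed values)
good_layout = [0.9, 0.2, 0.05]
bad_layout = [0.1, 0.8, 0.9]

print(fuzzy_cross_entropy(high_dim, good_layout))  # essentially zero
print(fuzzy_cross_entropy(high_dim, bad_layout))   # clearly positive
```

The loss is zero when the low-dimensional probabilities match the high-dimensional ones and grows as the layout separates true neighbors or crowds unrelated points together.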
The production pipeline uses Louvain community detection on the filtered Wikipedia link graph. This groups articles according to graph connectivity rather than requiring a fixed cluster count in advance.
Louvain implementation details:
- graph input: filtered article link graph
- optimization target: modularity
- deterministic seed: 42
This keeps cluster coloring aligned with network structure while allowing the 3D layout itself to remain semantic or structural.
Earlier HDBSCAN experiments informed the project direction but are not the clustering method used by the current production code.
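The modularity score that Louvain optimizes can be computed by hand for a small graph. The sketch below is not the production clustering code: it only scores a given partition, treats the link graph as undirected and unweighted (an assumption), and uses a made-up toy graph of two triangles joined by one bridge edge.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Modularity Q of a partition of an undirected, unweighted graph.

    Q = sum over communities c of  L_c / m  -  (d_c / 2m)^2
    where m is the edge count, L_c the number of edges inside c,
    and d_c the total degree of the nodes in c.
    """
    m = len(edges)
    if m == 0:
        return 0.0
    internal = defaultdict(int)
    degree_sum = defaultdict(int)
    for u, v in edges:
        degree_sum[community_of[u]] += 1
        degree_sum[community_of[v]] += 1
        if community_of[u] == community_of[v]:
            internal[community_of[u]] += 1
    return sum(
        internal[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum
    )


# Toy graph: two triangles connected by a single bridge edge.
edges = [
    ("a1", "a2"), ("a2", "a3"), ("a1", "a3"),
    ("b1", "b2"), ("b2", "b3"), ("b1", "b3"),
    ("a3", "b1"),
]
two_groups = {"a1": 0, "a2": 0, "a3": 0, "b1": 1, "b2": 1, "b3": 1}
one_group = {node: 0 for node in two_groups}

print(modularity(edges, two_groups))  # 5/14, about 0.357
print(modularity(edges, one_group))   # 0.0
```

Louvain searches for the partition with the highest such score by repeatedly moving nodes between communities and merging communities. Because the order in which nodes are visited is randomized, fixing the seed (42 in the production settings) makes the resulting clusters reproducible.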
The knowledge graph is constructed by analyzing Wikipedia's hyperlink structure. Each article corresponds to a node positioned at its pipeline-derived 3D coordinates. Edges are created based on hyperlinks between articles, and the explorer can optionally render the global network as a 3D edge overlay:
G = (V, E) where:
V = set of article nodes
E = set of directed edges (a_i → a_j)
Edge weight w_ij:
- w_ij = 1 if unidirectional link
- w_ij = 2 if bidirectional link
Bidirectional links (where both articles reference each other) indicate stronger semantic relationships and are visually distinguished with increased opacity and line thickness in the rendering.
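The weighting rule can be sketched in a few lines. The function name `build_weighted_edges` and the optional `articles` filter are illustrative assumptions; in particular, restricting edges to articles in the corpus is only one plausible reading of the "filtered" link graph mentioned earlier.

```python
def build_weighted_edges(links, articles=None):
    """Map each directed edge (source, target) to weight 1 or 2.

    links    : iterable of (source, target) hyperlink pairs
    articles : optional set of known article titles; edges touching
               unknown articles are dropped (assumed filtering rule)
    """
    directed = set()
    for source, target in links:
        if source == target:
            continue  # ignore self-links
        if articles is not None and (
            source not in articles or target not in articles
        ):
            continue
        directed.add((source, target))

    # Weight 2 when the reverse edge also exists, otherwise 1.
    return {
        (s, t): 2 if (t, s) in directed else 1
        for (s, t) in directed
    }


links = [
    ("Physics", "Energy"),
    ("Energy", "Physics"),      # reverse link exists: bidirectional
    ("Physics", "Isaac Newton"),
    ("Physics", "Physics"),     # self-link, ignored
]
weights = build_weighted_edges(links)
print(weights[("Physics", "Energy")])        # 2
print(weights[("Physics", "Isaac Newton")])  # 1
```

Keeping both directions of a bidirectional pair in the dictionary preserves the directed edge set E, while a renderer can draw each such pair once with higher opacity and thickness.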
The visualization is rendered using WebGL through the Three.js library and React Three Fiber framework. Each article is represented as a sphere positioned at its stored (x, y, z) coordinates. Visual encoding includes:
User interaction uses orbital camera controls that allow rotation, panning, and zooming. Selecting an article opens the detail panel and interpolates the camera to the selected node's position, while the global network toggle displays the full edge structure over whichever layout (UMAP or force-directed) is active.
Rendering large graphs (10,000+ nodes) requires careful performance optimization. Current optimizations include:
The resulting visualization groups semantically related articles into spatial neighborhoods. Articles about similar topics (e.g., science, history, geography) sit close together in 3D space, and cluster boundaries emerge from Louvain community detection on the link graph rather than from arbitrary geometric divisions.
The link structure reveals interesting patterns in how Wikipedia articles reference each other. Dense connection patterns within clusters indicate self-contained topic areas, while inter-cluster edges reveal conceptual bridges between domains. Bidirectional links often correspond to strongly related concepts or complementary topics.
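One simple way to quantify these patterns is to count how many edges stay inside a cluster versus how many cross between clusters. The sketch below uses a made-up toy graph and hypothetical names (`edge_cluster_breakdown`, `cluster_of`); it is not drawn from the project's analysis code and reports no real results.

```python
def edge_cluster_breakdown(edges, cluster_of):
    """Return (intra_cluster_count, inter_cluster_edges) for a link graph."""
    intra = 0
    bridges = []
    for source, target in edges:
        if cluster_of[source] == cluster_of[target]:
            intra += 1
        else:
            bridges.append((source, target))
    return intra, bridges


# Toy data: cluster labels as they might come from community detection.
cluster_of = {
    "Physics": "science", "Energy": "science",
    "Rome": "history", "Julius Caesar": "history",
}
edges = [
    ("Physics", "Energy"),
    ("Rome", "Julius Caesar"),
    ("Julius Caesar", "Rome"),
    ("Rome", "Physics"),  # crosses from history to science
]

intra, bridges = edge_cluster_breakdown(edges, cluster_of)
print(intra, bridges)
```

A high share of intra-cluster edges suggests a self-contained topic area, and the list of bridging edges points at the articles that connect domains.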
This approach to knowledge graph visualization provides an intuitive interface for exploring large corpora of interlinked documents, potentially applicable to domains beyond Wikipedia including academic literature, legal documents, and enterprise knowledge bases.
Comprehensive technical documentation covering all aspects of the system:
A visual legend, interpretation guide, and explanation of what each algorithm contributes to the final view
Interface controls, navigation, interpretation guidelines, and analytical applications
Embedding generation, UMAP algorithm, dimensionality reduction, and positioning mathematics
Link extraction, directionality detection, weight calculation, and visual encoding
Louvain community detection, graph modularity, and cluster interpretation
Embedding similarity, vector search, query processing, and exploratory discovery
Frontend structure, backend API, AI layer integration, database schema, and deployment
Generated graph projections and metrics for discussing layout quality, cluster balance, and optimization tradeoffs