Show HN: Using LLMs and >1k 4090s to visualize 100k scientific research articles (twitter.com)
I'm curious about the technical side: how are you handling the dimensionality reduction and visualization? Also noticed you mentioned "custom-trained LLMs" in the tweet - how large are those models, and what motivated using custom ones instead of existing open models?
At the core of this project is a structured-extraction task using a custom Qwen 14B model, which we distilled from larger closed-source models. We needed a model we could run at scale on https://devnet.inference.net, which consists mostly of idle consumer-grade NVIDIA GPUs.
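The extraction call itself is an ordinary structured-output request against an OpenAI-compatible endpoint. A minimal sketch - the base_url, model id, and JSON fields below are placeholders, not our production config:

    from openai import OpenAI
    import json

    # Placeholder endpoint and API key; any OpenAI-compatible server is called the same way.
    client = OpenAI(base_url="https://example-inference-endpoint/v1", api_key="...")

    def extract_fields(paper_text: str) -> dict:
        """Ask the model for a fixed JSON structure describing one paper."""
        resp = client.chat.completions.create(
            model="qwen-14b-distill",  # hypothetical name for the distilled Qwen 14B
            messages=[
                {"role": "system", "content": "Return JSON with keys: title, executive_summary, research_context, key_takeaways."},
                {"role": "user", "content": paper_text},
            ],
            response_format={"type": "json_object"},
            temperature=0,
        )
        return json.loads(resp.choices[0].message.content)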
Embeddings were generated using SPECTER2, a transformer model from AllenAI specifically designed for scientific documents. The model processes each paper's title, executive summary, and research context to generate 768-dimensional embeddings optimized for semantic search over scientific literature.
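Generating the embeddings follows the usage shown on the SPECTER2 model card (HF adapters library, fields joined with the tokenizer's sep token). Treat this as a sketch rather than our exact pipeline - field names and batching differ:

    import torch
    from transformers import AutoTokenizer
    from adapters import AutoAdapterModel

    tokenizer = AutoTokenizer.from_pretrained("allenai/specter2_base")
    model = AutoAdapterModel.from_pretrained("allenai/specter2_base")
    # The proximity adapter is the one intended for embedding / retrieval tasks.
    model.load_adapter("allenai/specter2", source="hf", load_as="proximity", set_active=True)

    def embed(papers: list[dict]) -> torch.Tensor:
        """papers: dicts with 'title', 'summary', 'context' strings (illustrative field names)."""
        texts = [
            p["title"] + tokenizer.sep_token + p["summary"] + tokenizer.sep_token + p["context"]
            for p in papers
        ]
        inputs = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")
        with torch.no_grad():
            out = model(**inputs)
        return out.last_hidden_state[:, 0, :]  # CLS token -> one 768-d vector per paper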
The visualization uses UMAP to reduce the 768D embeddings to 3D coordinates, preserving local and global structure. K-Means clustering groups papers into ~100 clusters based on semantic similarity in the embedding space. Cluster labels are automatically generated using TF-IDF analysis of paper fields and key takeaways, identifying the most distinctive terms for each cluster.
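The layout and clustering step is plain umap-learn plus scikit-learn. A sketch under the same assumptions (embeddings and texts carried over from the step above; the UMAP and K-Means parameters here are illustrative, not our tuned values):

    import numpy as np
    import umap
    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer

    # embeddings: (n_papers, 768) numpy array, e.g. embed(papers).numpy()
    # texts: one string per paper built from its fields and key takeaways
    coords_3d = umap.UMAP(n_components=3, metric="cosine").fit_transform(embeddings)

    n_clusters = 100
    labels = KMeans(n_clusters=n_clusters, random_state=0, n_init="auto").fit_predict(embeddings)

    # Label each cluster by its most distinctive TF-IDF terms.
    cluster_docs = [" ".join(texts[i] for i in np.where(labels == c)[0]) for c in range(n_clusters)]
    vec = TfidfVectorizer(stop_words="english", max_features=20000)
    tfidf = vec.fit_transform(cluster_docs)
    terms = np.array(vec.get_feature_names_out())
    cluster_names = [
        ", ".join(terms[np.argsort(tfidf[c].toarray().ravel())[::-1][:3]])
        for c in range(n_clusters)
    ]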