OpenTSLM: Language Models That Understand Time Series
Key topics
Repo: https://github.com/StanfordBDHG/OpenTSLM
Foundation models excel at text, images, audio, and video, but lack temporal reasoning capabilities over time-series data streams that run the real world: vitals, prices, telemetry, grid loads, clickstreams, machine logs, business processes.
Time Series Language Models (TSLMs) are open foundation models that support time series as a native modality alongside text, letting users ask questions and get explanations and recommendations in natural language.
The OpenTSLM White Paper released today demonstrates state-of-the-art temporal reasoning performance. Unlike prior approaches, its cross-attention architecture remains viable on long time series.
The results:
- Sleep staging: 4.4× the accuracy with a model 200× smaller (~880× efficiency)
- Activity recognition: ~6× the accuracy with a model 200× smaller (~1,000× efficiency)
- ECG interpretation: ~2× the accuracy with a model 200× smaller (~400× efficiency); the first model to process 12-lead ECG signals and text simultaneously, with chain-of-thought reasoning validated by cardiologists
For the first time, foundation models can handle multiple time-series streams of varying lengths concurrently, integrate them with textual context, and produce interpretable explanations verified by domain experts such as clinicians.
This work is the result of a growing collaboration between researchers from Stanford, ETH Zurich, UIUC, University of St. Gallen, University of Washington, Google, and Amazon.
It points to the next foundation model frontier: temporal intelligence that unlocks proactive healthcare, adaptive robotics, resilient infrastructure, and new forms of human-AI collaboration.
The OpenTSLM project introduces a language model that understands time-series data, letting users ask questions and get explanations in natural language; the discussion centers on its potential applications, limitations, and comparisons with existing approaches.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
- First comment: 46m after posting
- Peak period: 68 comments in the 0-12h window
- Average per period: 19.8 comments
Based on 79 loaded comments
Key moments
1. Story posted: Oct 1, 2025 at 1:25 PM EDT (3 months ago)
2. First comment: Oct 1, 2025 at 2:11 PM EDT (46m after posting)
3. Peak activity: 68 comments in the 0-12h window, the hottest part of the conversation
4. Latest activity: Oct 6, 2025 at 2:59 PM EDT (3 months ago)
> The Claude Agent SDK excels at code generation—and for good reason. Code is precise, composable, and infinitely reusable, making it an ideal output for agents that need to perform complex operations reliably.
> When building agents, consider: which tasks would benefit from being expressed as code? Often, the answer unlocks significant capabilities.
https://www.anthropic.com/engineering/building-agents-with-t...
In medical AI, IMO, the most exciting work is detecting disease signals too subtle for humans: for example, estimating ejection fraction from an ECG (which cardiologists can’t do, but algorithms can and have been tested in RCTs: https://www.nature.com/articles/s41591-021-01335-4 ).
Since OpenTSLM tokenizes time-series into an LLM embedding space, would that process prevent capturing such subtle signals? Or could the approach be extended to handle that use case?
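To make the tokenization question concrete: one common way (not necessarily OpenTSLM's exact encoder) to map a raw series into an LLM's embedding space is to cut it into fixed-length patches and project each patch with a learned linear layer. A minimal PyTorch sketch, with all sizes and names purely illustrative:

```python
import torch
import torch.nn as nn

class PatchTimeSeriesTokenizer(nn.Module):
    """Toy patch-based tokenizer: raw series -> sequence of LLM-sized embeddings.
    Illustrative sketch only, not the OpenTSLM encoder."""
    def __init__(self, patch_len: int = 32, d_model: int = 768):
        super().__init__()
        self.patch_len = patch_len
        self.proj = nn.Linear(patch_len, d_model)   # learned projection per patch

    def forward(self, series: torch.Tensor) -> torch.Tensor:
        # series: (batch, length); pad so length is a multiple of patch_len
        b, n = series.shape
        pad = (-n) % self.patch_len
        series = nn.functional.pad(series, (0, pad))
        patches = series.unfold(1, self.patch_len, self.patch_len)  # (b, n_patches, patch_len)
        return self.proj(patches)                                   # (b, n_patches, d_model)

tok = PatchTimeSeriesTokenizer()
emb = tok(torch.randn(1, 1000))   # e.g. 1,000 samples of a single ECG lead
print(emb.shape)                  # torch.Size([1, 32, 768])
```

Whether subtle morphology (like an ejection-fraction signature) survives such a projection depends on the patch length, the encoder's capacity, and what the training objective rewards, so it reads as an empirical question rather than a hard limit of the approach.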
But obviously ML is an empirical field, so if you found that a constrained architecture worked well in practice, that's an interesting result in its own right.
> A universal TSLM will power proactive healthcare, adaptive robotics, resilient infrastructure, and new forms of human-AI collaboration.
> scientists, engineers, and builders from ETH, Stanford, Harvard, Cambridge, TUM, CDTM, Google, Meta, AWS, and beyond
What's with all this fuss? Why not just upload your paper to arxiv? Time series models are interesting enough, but from the abstract it's not even clear whether they are using transformers or a recurrent architecture like xLSTM - arguably a more intuitive choice for time series - or something else. This website is barely distinguishable from a crypto/DeFi pitch.
https://news.ycombinator.com/newsguidelines.html
PyTorch is no secret either yet…
The point I’m making is that there are models, based on database stream data, that you’ll never get access to even if you had $100m.
Especially when you throw noisy subjective context at it.
You can't tell a numbers only model "ok, with this data, but now you know all the tomatoes in the world have gone rotten and the market doesn't know it yet, what's the best move?" You can use an LLM model like that, however, and with RL, which allows you to branch and layer strategies dependent on dynamic conditions and private data, for arbitrary outcomes. Deploy such a model at scale and run tens of thousands of simulations, iterating through different scenarios, and you can start to apply confidence metrics and complex multiple-degree-of-separation strategies to exploit arbitrage opportunities.
Any one of the big labs could do something like this, including modeling people, demographic samples, distributions of psychological profiles, cultural and current events, and they'd have a manipulation engine to tell them exactly who, when, and where to invest, candidates to support, messages to push and publish.
The fundamental measures of intelligence are how far into the future a system can predict across which domains. The broader the domains and farther into the future, the more intelligence, and things like this push the boundaries.
We should probably get around to doing a digital bill of rights, but I suspect it's too late already anyway, and we're full steam ahead to snow crash territory.
The actual algorithms for predicting price movement were fairly simplistic, most work was around strategies for dealing with overfitting and how to execute the trades. Accuracy was around 51-55% (a bit better than coin toss) so it was a big challenge to actually execute the trades and still make a profit after fees and other nonsense. Finding alpha is what ML is used for but that’s just the first step.
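As a back-of-the-envelope illustration of why a 51-55% hit rate is hard to monetize (the win/loss size and cost figures below are invented, not from the comment):

```python
# Assume symmetric wins/losses of 10 bps per trade and 4 bps round-trip costs.
p_win = 0.53
gross_edge_bps = p_win * 10 - (1 - p_win) * 10   # expected gross edge per trade
net_edge_bps = gross_edge_bps - 4                # costs can wipe out the edge
print(round(gross_edge_bps, 2), round(net_edge_bps, 2))  # 0.6 -3.4
```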
Numerous studies, INCLUDING the OpenTSLM paper, have PROVEN they are NOT able to do this out of the box. Did you even check out the results at all? They literally compare OpenTSLM against standard text-only baselines: Gemma3-270M performs better than GPT-4o using tokenized time series alone. So I guess you guys are being ironic.
If this is the level of one of the contributors to the OpenTSLM paper (which you very obviously are), no wonder due diligence wasn't done properly.
I don't know if this is your work or not, but I appreciate your wanting to defend it...we just need you to do that in a way that doesn't attack others, no matter how wrong they are or you feel they are. Easier said than done of course, but we're all working on it together.
Think of 100-200K worth of tokens formatted like this:
<Entity1>-<Entity2> <Dimension> <ISO 8601 time> <value>
<Entity1>-<Entity2> <Dimension> <ISO 8601 time +1> <value>
<Entity1>-<Entity2> <Dimension> <ISO 8601 time +2> <value>
<Entity1>-<Entity2> <Dimension2> <ISO 8601 time> <value>
<Entity1>-<Entity2> <Dimension2> <ISO 8601 time +1> <value>
The only pre-filtering we do is eliminate "obviously non relevant" data, such as series where the value is completely flat the whole time, but this was done to add more data to the context, not because Claude struggled with it (it doesn't).
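For readers who want to see what that serialization looks like in practice, here is a rough sketch of producing such lines and applying the flat-series filter; entity and dimension names are invented:

```python
from datetime import datetime, timedelta, timezone

def serialize_series(entity1, entity2, dimension, start, step_s, values):
    """Render one series as '<Entity1>-<Entity2> <Dimension> <ISO 8601 time> <value>' lines."""
    return [
        f"{entity1}-{entity2} {dimension} {(start + timedelta(seconds=i * step_s)).isoformat()} {v}"
        for i, v in enumerate(values)
    ]

def is_informative(values, eps=1e-9):
    """Pre-filter 'obviously non relevant' series, e.g. ones that are completely flat."""
    return max(values) - min(values) > eps

start = datetime(2025, 10, 1, tzinfo=timezone.utc)
series = {
    ("svc-a", "svc-b", "latency_ms"): [12.1, 12.4, 180.0, 12.2],
    ("svc-a", "svc-b", "error_rate"): [0.0, 0.0, 0.0, 0.0],   # flat, gets filtered out
}

prompt_lines = []
for (e1, e2, dim), vals in series.items():
    if is_informative(vals):
        prompt_lines += serialize_series(e1, e2, dim, start, 60, vals)
print("\n".join(prompt_lines))
```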
For example, you ask an off-the-shelf LLM to analyze your ECG data. The LLM uses a tool to call out to your ECG ts analysis library. The library iterates over the data and finds stats & ECG events. It returns something like "Average heart rate: 60bpm, AFib detected at <time>, etc...". The LLM has all the info it needs to give an accurate analysis at a fraction of computational cost.
On top of that, this requires a large annotated dataset and a pre-trained model. And correct me if I'm wrong, but I don't think it's possible to have a "general" model that could handle arbitrary time series data. I.e. a model that is trained on ECG data would not be compatible with stock market data. And there isn't a way to have a model that understands both stock market data and ECG data.
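A minimal sketch of the tool-call flow described two paragraphs up; the `analyze_ecg` helper and the tool description are hypothetical stand-ins for whatever ECG analysis library would actually be called:

```python
# Hypothetical stand-in for a real ECG analysis library (names invented).
def analyze_ecg(samples: list[float], sampling_hz: int) -> dict:
    """Return summary statistics and detected events instead of raw samples."""
    # A real implementation would do R-peak detection, rhythm classification, etc.
    return {"avg_heart_rate_bpm": 60, "events": [{"type": "AFib", "t": "00:12:31"}]}

# A generic tool description the LLM could be given (the schema shape is typical of
# tool-calling APIs but not copied from any specific vendor's documentation).
ECG_TOOL = {
    "name": "analyze_ecg",
    "description": "Compute heart-rate statistics and rhythm events from raw ECG samples.",
    "parameters": {
        "samples": "array of floats (raw ECG samples)",
        "sampling_hz": "integer sampling rate",
    },
}

# The LLM never ingests the raw series; it reasons over the tool's compact output.
summary = analyze_ecg(samples=[0.0] * 5000, sampling_hz=500)
prompt = f"Interpret this ECG summary for a clinician: {summary}"
print(prompt)
```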
Ok bro.
From the paper itself...
Imagine you asked ChatGPT a question but it could only give you answers from a single blog.
The point is to run it reliably on the edge; nobody sane would want their heart rate monitor to run via the cloud, with the uptime and reliability that come with any remote service, plus the extra challenges of LLM inference.
The goal would be to run on the edge alongside the standard rules-based detection these machines already have, adding the advanced pattern detection LLMs can provide to reduce alert fatigue and to detect new classes of complex patterns that these sensors typically don't catch.
This sounds great and all, but it's wishful thinking. There isn't anything in this supporting that it's able to find any meaningful patterns beyond existing solutions (i.e. standard rules based detection/machine learning as mentioned above).
What they've essentially done is taken a dataset in which each report was "annotated with a report string (generated by cardiologist or automatic interpretation by ECG-device)" [1] and used it with a series of templates (i.e. questions to ask the llm) from the ECG-QA paper [2] to fine-tune a model to achieve 65% accuracy with solely pattern recognition and 85% accuracy with pattern+clinical context (i.e. patient history).
The 42 template questions they used (as mentioned in 4.1 in the paper) can each be evaluated deterministically via code and retrieved via a tool call for any llm to parse. And I argue that the results would be the same, if not better, for a fraction of the cost. Doing calculations like this on time series data is very very quick. A couple ms at most. I don't see why this couldn't be run on the edge.
Plus, Table 9 shows this thing takes a minimum of 7GB of ram usage with a 270m parameter model and ~15-20GB for a 1B model. I don't see how this could be run on the edge considering most phones have 6-8GB of ram.
[1]: https://physionet.org/content/ptb-xl/1.0.3/ [2]: https://arxiv.org/pdf/2306.15681
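As an illustration of the "evaluate deterministically via code" claim: a template question like average heart rate reduces to R-peak spacing. A rough sketch using scipy; the peak-detection thresholds are invented and not clinically tuned:

```python
import numpy as np
from scipy.signal import find_peaks

def average_heart_rate_bpm(ecg: np.ndarray, fs: float) -> float:
    """Estimate mean heart rate from R-peak spacing in a single ECG lead."""
    # Crude R-peak detection: peaks at least 0.4 s apart, above a relative height.
    peaks, _ = find_peaks(ecg, distance=int(0.4 * fs), height=np.percentile(ecg, 95))
    rr_s = np.diff(peaks) / fs          # R-R intervals in seconds
    return float(60.0 / rr_s.mean())    # beats per minute

fs = 500.0                              # PTB-XL records are sampled at 500 Hz
t = np.arange(0, 10, 1 / fs)
synthetic = np.sin(2 * np.pi * 1.0 * t) ** 63   # sharp 1 Hz "beats" for the demo
print(round(average_heart_rate_bpm(synthetic, fs)))  # 60
```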
I just wanted to show what would be the motivation in this line of research of building fine tuned light-weight foundation models like this , I didn’t mean to imply this paper already achieves those goals.
The tech and hardware is not yet ready as you point out both in terms of performance and what it can actually do currently , but the key thing to be excited about is that gap is within the realm of possibility to close in next few years with the right funding.
I've been totally blown away by Opus, except that on a project I'm working on I discovered a few unexpected weaknesses that have cost quite a bit of time.
https://huggingface.co/OpenTSLM
I mean, sure, but why would you need a study for that? There's plenty of prior work using cross-attention to integrate time series dynamics into non-LLM transformer models, right? Or maybe I'm assuming that integrating a time series embedding with an LLM is easier than it is.
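For context, the basic mechanism in Flamingo-style multimodal models is cross-attention from the text hidden states to the other modality's embeddings, gated so the pretrained LM is initially unchanged. A stripped-down PyTorch sketch, with arbitrary dimensions and not the OpenTSLM implementation:

```python
import torch
import torch.nn as nn

class TimeSeriesCrossAttentionBlock(nn.Module):
    """Text tokens (queries) attend to time-series embeddings (keys/values),
    with a zero-initialized gate so the pretrained LM starts out unchanged."""
    def __init__(self, d_model: int = 768, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))   # tanh-gated residual connection
        self.norm = nn.LayerNorm(d_model)

    def forward(self, text_h: torch.Tensor, ts_h: torch.Tensor) -> torch.Tensor:
        # text_h: (batch, n_text, d); ts_h: (batch, n_ts, d) from a series encoder
        attended, _ = self.attn(self.norm(text_h), ts_h, ts_h)
        return text_h + torch.tanh(self.gate) * attended

block = TimeSeriesCrossAttentionBlock()
fused = block(torch.randn(2, 16, 768), torch.randn(2, 128, 768))
print(fused.shape)   # torch.Size([2, 16, 768])
```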
Looking at the repo, the training data seems extremely health-focused. I guess I would have to tune the model with my own datasets if I want it to answer questions about multi-source sensor data?
In my opinion we need a multi-modal model that is great at both tabular datasets and text analysis. Most analytical work in economics, policy, public health, medicine, etc. requires cross-checking between both. Current-gen LLMs are not good enough at generating novel insights by looking at tables and text at the same time. I also haven’t found any data on this, so please serve it to me on a plate if I’m wrong.
(The web site is too cute. Applying a left to right gradient on text is a bit much.)
[1] https://arxiv.org/pdf/2204.14198
Unlike most commercial & medical applications where signals are stationary with white (uncorrelated) noise, the NSA & Rentec mostly deal with non-stationary signals with regime changes and correlated noise, which can't be denoised without loss of information.
The idea is not so much to predict the next stock price tick or to decipher an intercepted signal (most likely encrypted anyways), but rather to detect "regime changes", ie quickest detection of a change of pattern in non-stationary signals. Then the detected pattern is matched to known trading patterns for a particular stock or to the expected spy activities.
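A toy version of that quickest-detection idea is a two-sided CUSUM test against a slowly adapting baseline; the drift and threshold values below are arbitrary:

```python
import numpy as np

def cusum_change_point(x, drift=0.5, threshold=10.0):
    """Two-sided CUSUM against a slowly adapting baseline: return the first
    index where the accumulated deviation exceeds the threshold, else None."""
    baseline = x[0]
    g_pos = g_neg = 0.0
    for i, v in enumerate(x):
        baseline += 0.01 * (v - baseline)             # slow EWMA of the local mean
        g_pos = max(0.0, g_pos + (v - baseline) - drift)
        g_neg = max(0.0, g_neg - (v - baseline) - drift)
        if g_pos > threshold or g_neg > threshold:
            return i                                  # alarm: possible regime change
    return None

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 500), rng.normal(1.5, 1, 500)])  # mean shift at t=500
print(cusum_change_point(x))   # typically alarms within a dozen samples of index 500
```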
I work with a large number of audio time series data (not words and all have subtle variation). It would be interesting to see how it compares to traditional statistical methods.