Sharp, an Approach to Photorealistic View Synthesis From a Single Image
Key topics
The unveiling of SHARP, a technique for photorealistic view synthesis from a single image, has sparked a lively discussion about its potential applications and implications. Some commenters are abuzz about the technology's connection to features like Cinematic mode, while others are exploring its potential uses in simulation and 3D programming. A debate is brewing between those who see the value in AI-generated visuals, like accurrent, who notes its potential to aid in simulation, and skeptics like calvinmorrison, who questions the practicality of investing in such technology. As the conversation unfolds, it becomes clear that this innovation is stirring up both excitement and unease.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
First comment: 59m after posting
Peak period: 60 comments in 0-6h
Average per period: 13.8
Based on 110 loaded comments
Key moments
- Story posted: Dec 15, 2025 at 11:06 PM EST (18 days ago)
- First comment: Dec 16, 2025 at 12:05 AM EST (59m after posting)
- Peak activity: 60 comments in 0-6h (the hottest window of the conversation)
- Latest activity: Dec 20, 2025 at 12:25 AM EST (14 days ago)
I doubt this will be useful for robotics or industrial automation, where you need an actual spatial or functional understanding of the object/environment.
I have worked on simulation and do a lot of it in my day job. While physics is often hard and expensive, you only need to write the code once.
Assets? You need to commission 3D artists and then spend hours wrangling file formats. It's extremely tedious. If we could take a photo and extract meshes, I'm sure we'd have a much easier time.
[1] https://trianglesplatting.github.io/
Is there a similar flow to transform a video/photo/NeRF of a scene into a tighter, minimal-polygon approximation of it? The reason I ask is that it would make some things really cool. To make my baby monitor mount I had to knock out the calipers and measure the pins and this and that, but if I could take a couple of photos and iterate in software, that would be sick.
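One possible route, sketched under assumptions: if you can get a point cloud of the scene out of photogrammetry or a NeRF/splat export, Open3D can reconstruct a watertight mesh and then decimate it down to a low-poly approximation. The file names and parameter values below are illustrative only.

    # Sketch: point cloud -> watertight mesh -> low-poly approximation.
    # Assumes scene.ply came from photogrammetry or a NeRF/splat export;
    # the depth and triangle-count values are guesses to tune per scan.
    import open3d as o3d

    pcd = o3d.io.read_point_cloud("scene.ply")
    pcd.estimate_normals()  # Poisson reconstruction needs oriented normals

    mesh, _densities = o3d.geometry.TriangleMesh.create_from_point_cloud_poisson(
        pcd, depth=9
    )

    # Collapse to the "tighter, minimal-polygon approximation" asked about above.
    low_poly = mesh.simplify_quadric_decimation(target_number_of_triangles=5000)
    low_poly.compute_vertex_normals()
    o3d.io.write_triangle_mesh("scene_lowpoly.stl", low_poly)

From there the STL can be measured and iterated on in CAD instead of with calipers.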
[0]https://www.spaitial.ai/
Why no landscape or underwater scenes or something in space, etc.?
I believe this company is doing image (or text) -> off-the-shelf image model to generate more views -> some variant of Gaussian splatting.
So they aren't really "generating" the world as one might imagine.
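For what that guess would look like as a flow, here is a deliberately skeletal sketch; every helper in it is hypothetical and stands in for whichever off-the-shelf models the company might actually use.

    # Hypothetical pipeline matching the comment above: single image ->
    # off-the-shelf model hallucinates extra views -> fit Gaussian splats.
    # Nothing here is the company's published code; the helpers are stubs.
    from dataclasses import dataclass

    @dataclass
    class Gaussian:
        position: tuple   # 3D mean
        scale: tuple      # per-axis extent
        rotation: tuple   # quaternion
        opacity: float
        color: tuple

    def generate_novel_views(image, n_views):
        """Hypothetical: call an image/video diffusion model to hallucinate
        n_views of the same scene from new camera poses."""
        raise NotImplementedError

    def fit_splats(views):
        """Hypothetical: run a standard 3D Gaussian-splatting optimizer on
        the generated views as if they were real photos."""
        raise NotImplementedError

    def image_to_world(image):
        views = [image] + generate_novel_views(image, n_views=8)
        return fit_splats(views)  # the "world" is just a bag of splats

The point of the sketch is the comment above it: the geometry is only as real as the hallucinated views it was fit to.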
It’s a website that collects people’s email addresses
https://m.youtube.com/watch?v=DgPaCWJL7XI&t=1s&pp=2AEBkAIB0g...
https://www.youtube.com/watch?v=X0oSKFUnEXc
It will evolve into people hooked into entertainment suits most of the day, where no one has real relationships or does anything real of consequence, like some sad mashup of Wall-E and Ready Player One.
If we’re lucky, some will want to “real world” with augmented reality.
Maybe we’ll get really nice holovisions, where we can chat with virtual celebrities.
Who needs that? We’re already having to shoot up weight-loss drugs because we binge watch streaming all the time, because we’ve all given up, assuming AI will do everything. What good will come from having better technology when technology is already doing harm?
https://en.wikipedia.org/wiki/Great_Filter
like there are people who avoid alcohol, opioids, heroin, all other wireheading-style drugs and experiences that exist already, and people who do exercise and stay thin in a world of fast food and cars.
A great filter needs to apply to every civilisation imaginable, no exceptions, nerfing billions of species before they get to a higher Kardashev scale, not just be the latest “Dunning-Kruger” mic drop to spam into every thread all the time.
Maybe when NASA, ESA, SpaceX, RosCOSMOS, CNSA, IRSA all collapse because of this effect… look how many countries have a space agency! https://en.wikipedia.org/wiki/List_of_government_space_agenc...
1. Sky looks janky.
2. Blurry/warped behind the horse.
3. The head seems to move a lot more than the body. You could argue that this one is desirable.
4. A bit of warping and ghosting around the edges of the flowers, particularly noticeable towards the top of the image.
5. Very minor, but the flowers move as if they aren't attached to the wall.
https://github.com/apple/ml-sharp#rendering-trajectories-cud...
(I am oversimplifying).
I just want to emphasize that this is not a NeRF, where the model magically produces an image from an angle and then, when you ask "ok but how did you get this?", it throws up its hands and says "I dunno, I ran some math and I got this image" :D.
Gaussian splatting is pretty awesome.
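A toy illustration of why splats are easier to interrogate than a NeRF: the scene is an explicit array of primitives, so you can point at exactly which splats contributed to a pixel. The numbers below are made up and the footprint is an isotropic simplification of the real projected covariance.

    # Toy: an explicit splat "scene" is just arrays you can print and inspect.
    import numpy as np

    splats = {
        "mean":    np.array([[0.0, 0.0, 2.0], [0.3, 0.1, 2.5]]),   # 3D centers
        "scale":   np.array([0.05, 0.10]),                          # toy radii
        "opacity": np.array([0.9, 0.6]),
        "color":   np.array([[1.0, 0.2, 0.2], [0.2, 0.2, 1.0]]),
    }

    def footprint(p_xy, mean_xy, sigma):
        # Isotropic stand-in for the projected 2D Gaussian footprint.
        d2 = np.sum((p_xy - mean_xy) ** 2)
        return np.exp(-0.5 * d2 / sigma ** 2)

    pixel = np.array([0.01, 0.0])
    for i in range(len(splats["opacity"])):
        w = splats["opacity"][i] * footprint(pixel, splats["mean"][i, :2],
                                             splats["scale"][i])
        print(f"splat {i}: weight {w:.3f}, color {splats['color'][i]}")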
Imagine history documentaries where they take an old photo, free objects from the background, and then move them around to give the illusion of parallax.
Even using commas, if you leave the ambiguous “free” I suggest you prefix “objects” with “the” or “any”.
Already you sometimes see where they manually cut out a foreground person from the background, enlarge them a little, and create a multi-layer 3D effect, but it's super primitive and I find it gimmicky.
Bringing actual 3D to old photographs as the camera slowly pans or rotates slightly feels like it could be done really tastefully and well.
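A minimal sketch of that effect, assuming you already have a per-pixel depth map for the old photo (e.g. from a monocular depth model): shift each pixel horizontally in proportion to inverse depth and the foreground pans faster than the background. This is a naive forward warp that leaves occlusion holes black; real tooling inpaints them.

    # Naive parallax pan: pixels move by inverse depth for a small camera shift.
    # Assumes image.png and a matching HxW depth.npy exist (illustrative names).
    import numpy as np
    from PIL import Image

    img = np.array(Image.open("image.png").convert("RGB"))
    depth = np.load("depth.npy").astype(np.float32)
    disparity = 1.0 / np.clip(depth, 1e-3, None)
    disparity /= disparity.max()                 # normalize to [0, 1]

    def parallax_frame(shift_px):
        h, w, _ = img.shape
        out = np.zeros_like(img)
        xs = np.arange(w)
        for y in range(h):
            # Near pixels (large disparity) move farther than distant ones.
            new_x = np.clip(xs + (shift_px * disparity[y]).astype(int), 0, w - 1)
            out[y, new_x] = img[y, xs]           # occlusion holes stay black
        return out

    Image.fromarray(parallax_frame(12.0)).save("pan_frame.png")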
I guess this is what they use for the portrait mode effects.
Photoshop's content-aware fill could do this equally well or better many years ago.
My experience with all these solutions to date (including whatever Apple is currently using) is that when viewed stereoscopically the people end up looking like 2D cutouts against the background.
I haven't seen this particular model in use stereoscopically so I can't comment as to its effectiveness, but the lack of a human face in the example set is likely a bit of a tell.
https://github.com/apple/ml-depth-pro
https://learnopencv.com/depth-pro-monocular-metric-depth/
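For reference, using the monocular depth model looks roughly like the following (paraphrased from memory of the ml-depth-pro README; the exact function names and output keys may have changed, so treat this as a sketch and check the repo).

    # Rough usage of apple/ml-depth-pro, from memory of its README;
    # verify names against the current repo before relying on them.
    import depth_pro

    model, transform = depth_pro.create_model_and_transforms()
    model.eval()

    image, _, f_px = depth_pro.load_rgb("photo.jpg")   # image + focal estimate
    prediction = model.infer(transform(image), f_px=f_px)

    depth_m = prediction["depth"]             # metric depth, in meters
    focal_px = prediction["focallength_px"]   # focal length in pixels
    print(depth_m.shape, float(focal_px))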
https://github.com/rcarmo/ml-sharp (has a little demo GIF)
I am looking at ways to approximate Gaussian splats without having to reinvent the wheel, but I'm a bit out of my depth since I haven't been paying a lot of attention to them in general.
Keep in mind that this is not Gaussian splat rendering but just a hacked approximation--on my NVIDIA machine that looks way smoother.
https://github.com/sparkjsdev/spark
What's weird is we're getting better at faking 3D from 2D than we are at just... capturing actual 3D data. Like we have LiDAR in phones already, but it's easier to neural-net your way around it than deal with the sensor data properly.
Five years from now we'll probably look back at this as the moment spatial computing stopped being about hardware and became mostly inference. Not sure if that's good or bad tbh.
Will include this one in my https://hackernewsai.com/ newsletter.
We have two eyes that give us depth by default.
This is really interesting to me because the model would have to encode the reflection as both the depth of the reflecting surface (for texture, scattering etc) as well as the "real depth" of the reflected object. The examples in Figure 11 and 12 already look amazing.
Long tail problems indeed.
Not only do many VR and AR systems acquire stereo, we have historical collections of stereo views in many libraries and museums.
Without that, it's hard to tell how cherry-picked the NVS video samples are.
[0] https://news.ycombinator.com/item?id=46252114