The Physical AI Foundation Race is On

What is Physical AI?
Over the past several years, AI has taken the world by storm. The advent of modern LLMs in the late 2010s, followed by OpenAI’s public release of ChatGPT in November 2022, triggered an AI revolution that is fundamentally transforming how humans live and work.
The rise of AI thus far has been largely constrained to the digital world. The predominant AI foundation models, infrastructure, and applications are designed to complete tasks digitally.
However, throughout 2024, the next wave of AI emerged: Physical AI. Physical AI combines software with real-world sensor data to enable autonomous machines to interact with and understand their physical environment. Through this technology, machines can perceive, interpret, and act upon the physical world in real time, allowing them to learn, adapt, and perform complex tasks much as humans do. Essentially, physical AI technology extends autonomous capabilities beyond the digital realm and into the real world.
Physical AI has the same layers as digital AI (foundation models, infrastructure, and applications), but adds hardware as an additional layer.
Similar to how LLMs are trained on vast amounts of text and image data, physical AI foundation models are trained on data that captures spatial relationships and physical rules of the real world. The models are taught how to comprehend and react to physical stimuli.
At the infrastructure level of physical AI, synthetic data is essential. Ideally, a physical AI model enables machines to adapt to variation in real-world scenarios, but collecting real data across a large variety of physical scenarios is extremely costly and often infeasible. Thus, physical AI model developers generate synthetic data from 3D computer simulations of real-world scenarios. These synthetic datasets typically take the form of digital twins: virtual replicas of physical environments that produce data mimicking what real-world sensors would capture.
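To make the pipeline concrete, here is a minimal, illustrative sketch in Python. The DigitalTwinScene class, its parameters, and generate_synthetic_dataset are hypothetical stand-ins for a real physics-based simulator; the point is only to show the randomize-render-label loop that makes synthetic data cheap, since ground-truth labels come directly from the simulator's own state.

```python
import random
from dataclasses import dataclass

import numpy as np


@dataclass
class DigitalTwinScene:
    """Hypothetical stand-in for one randomized scene in a 3D simulator."""
    lighting: float       # 0.0 (dark) to 1.0 (bright)
    object_count: int     # number of objects placed in the scene
    camera_jitter: float  # random camera offset, in meters

    def render(self) -> np.ndarray:
        # A real digital twin would render the scene with physics and ray tracing;
        # here we fake a 64x64 RGB frame so the sketch runs on its own.
        return np.random.rand(64, 64, 3) * self.lighting


def generate_synthetic_dataset(num_samples: int) -> list:
    """Domain-randomize scene parameters so the model sees wide variation."""
    samples = []
    for _ in range(num_samples):
        scene = DigitalTwinScene(
            lighting=random.uniform(0.2, 1.0),
            object_count=random.randint(1, 10),
            camera_jitter=random.uniform(0.0, 0.05),
        )
        samples.append({
            "image": scene.render(),
            # Ground-truth labels come for free from the simulator's own state,
            # which is what makes synthetic data so much cheaper than real data.
            "labels": {"object_count": scene.object_count},
        })
    return samples


dataset = generate_synthetic_dataset(1_000)
print(len(dataset), dataset[0]["image"].shape)
```

Purpose-built simulators do this at far higher fidelity and with real physics, but the basic randomize-render-label loop is the same.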
The most well-known examples of physical AI applications are robotics and self-driving vehicles. These technologies use sensors to perceive their surroundings and operate on physical AI models that instruct the hardware to react to sensor data accordingly. Within the next decade, we may see physical AI automation in the vast majority of jobs that involve manual and physical labor—either fully automating certain functions or acting in tandem with a level of human operational control.
Vision-Language-Action Models
Currently, the heart of the physical AI foundation layer is the Vision-Language-Action Model, or VLA model. Particularly relevant for robotics, a VLA model is a unified architecture that combines vision (what the robot sees), language (what the robot is told), and action (what the robot does) capabilities. Essentially, VLA enables machines to see, understand, and act.
The first component of the VLA model is vision processing. This technology turns the raw feeds from the robot’s cameras and sensors into useful information. Vision processing includes object recognition and spatial reasoning.
The second component of the model is language grounding. This software connects words to actions and context. The algorithm dissects natural language commands into step-by-step motions, and situates those steps in the context of the robot’s physical surroundings.
The third component of the VLA model is action generation. This turns decisions into physical actions—i.e. the robot moving its arms, legs, or grippers to accomplish tasks in real life.
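Before a real-world example, here is a deliberately simplified sketch of how the three components chain into a single control loop. Everything in it (the ToyVLA class, the hashed text features, the 7-joint action format) is hypothetical and numerically meaningless; a production VLA model replaces each stage with large learned networks. The sketch only shows the data flow: camera frame and instruction in, low-level action out.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Action:
    """A toy action: joint velocities for a 7-joint arm plus a gripper command."""
    joint_velocities: np.ndarray  # shape (7,)
    gripper_open: bool


class ToyVLA:
    """Illustrative three-stage VLA pipeline: vision -> language grounding -> action."""

    def encode_vision(self, image: np.ndarray) -> np.ndarray:
        # Stage 1 (vision): a real model runs a vision transformer here;
        # this sketch just pools the image into a tiny feature vector.
        return image.mean(axis=(0, 1))

    def ground_language(self, instruction: str, vision_features: np.ndarray) -> np.ndarray:
        # Stage 2 (language grounding): fuse the instruction with what the robot
        # sees. A real model embeds the text with a language model; this sketch
        # hashes words into numbers purely to keep the example self-contained.
        text_features = np.array([(hash(w) % 100) / 100.0 for w in instruction.split()])
        text_features = np.resize(text_features, vision_features.shape)
        return np.concatenate([vision_features, text_features])

    def generate_action(self, fused_features: np.ndarray) -> Action:
        # Stage 3 (action generation): decode the fused features into motor commands.
        joint_velocities = np.tanh(np.resize(fused_features, 7))
        return Action(
            joint_velocities=joint_velocities,
            gripper_open=bool(fused_features.mean() > 0.5),
        )

    def step(self, image: np.ndarray, instruction: str) -> Action:
        vision = self.encode_vision(image)
        fused = self.ground_language(instruction, vision)
        return self.generate_action(fused)


# One control step: camera frame and instruction in, low-level action out.
robot = ToyVLA()
frame = np.random.rand(224, 224, 3)  # stand-in for a camera image
action = robot.step(frame, "pick up the red cup")
print(action.joint_velocities.shape, action.gripper_open)
```

The design point worth noticing is that a single step() call maps perception and instruction directly to an action, rather than handing off between separate perception, planning, and control modules.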
Putting it all together, here is a practical example. Let’s say you tell a VLA-powered robot to “safely drive to the grocery store”. The robot will observe traffic lights, road signs, and pedestrians through its cameras (vision). The robot will understand that the command means following traffic rules and GPS directions (language). And the robot will act by steering, accelerating, and braking smoothly without hitting other cars (action).
Overall, VLAs work somewhat like a human brain: they process vision, language, and action simultaneously. This is a major advance over traditional robots, which work in slow, disconnected steps—they only think after they see, and they only act after they think. VLA-powered robots, by contrast, can see, think, and act at the same time, making them significantly more capable of tackling complex situations. Specific applications of VLA-powered robots being developed include robot maids that do house chores safely around children, and rescue robots that navigate post-disaster rubble to find survivors.
The Race Has Begun
The major players in the digital AI foundation model race are now set in stone: OpenAI, Anthropic, Meta, Google, AWS, etc. These companies have achieved “escape velocity”—they are past the point at which competitors can realistically catch up. Their vast resources and progress put them far out of reach of other companies in this space.
However, the physical AI foundation race has just begun. The infancy of this race means that clear leaders have not yet emerged—there is plenty of room for competition.
In recent months, several tech companies have released their own new VLA models. The table below lists the most prominent VLA models in existence as of April 2025.

As depicted in the table, most of the major VLA models were introduced in H1 2025, meaning that the physical AI foundation model race is just beginning to pick up steam.
You may be wondering why this race has not hit escape velocity if heavy hitters like Google and NVIDIA are releasing VLA models. Although these companies have the most resources by far, they are not leading in physical AI the way they lead the digital AI foundation model race. Many other researchers and tech startups are experimenting with VLA models and gaining just as much traction as the tech giants, if not more.
For instance, OpenVLA is a free, open-source VLA model developed by robotics researchers at Stanford, UC Berkeley, and MIT. OpenVLA is a 7B parameter model that enables small teams to build robots without expensive software. For example, an engineer can use OpenVLA to create a robot that helps farmers harvest crops. OpenVLA is already powering robots in labs and small factories, where they handle tasks like sorting parts or inspecting products.
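To give a sense of how accessible this is, the sketch below loosely follows the quickstart published with the OpenVLA release on Hugging Face. The model ID, the predict_action helper, and the unnorm_key argument are taken from that quickstart as I recall it; treat the exact interface as an assumption and check the project’s documentation before relying on it.

```python
# Illustrative only: loosely based on the OpenVLA quickstart; the exact
# interface (predict_action, unnorm_key) may differ from the current release.
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

# Load the open-source 7B model and its processor from Hugging Face.
processor = AutoProcessor.from_pretrained("openvla/openvla-7b", trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).to("cuda:0")

# A stand-in for a real camera frame; in practice this comes from the robot.
image = Image.new("RGB", (224, 224))
prompt = "In: What action should the robot take to {pick up the ripe tomato}?\nOut:"

# The model returns a low-level action vector (e.g., end-effector deltas plus
# a gripper command) that the robot's controller then executes.
inputs = processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)
action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)
print(action)
```

The appeal for small teams is exactly this: a few lines of code, open weights, and no dependence on a proprietary vendor.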
Furthermore, throughout the past year, multiple physical AI foundation model startups raised huge VC rounds to create breakthroughs in robot learning. For example, in July 2024, Skild AI raised $300 million in Series A funding, and in November 2024, Physical Intelligence raised $400 million. Most recently, in April 2025, South Korean startup RLWRLD raised a $14.8 million seed round. RLWRLD promises to develop a new model that will enable robots to make quick, agile movements and perform some logical reasoning.
Evidently, the physical AI foundation model space is extremely ripe for disruption.
At the macro level, the excitement surrounding physical AI is illustrated by the robust robotics funding landscape. In 2024, robotics startups raised $7.5 billion, up from $6.9 billion in 2023. But this funding is becoming increasingly concentrated—there were only 473 funding rounds in 2024, down from 671 in 2023. That works out to an average round of roughly $16 million in 2024, up from roughly $10 million in 2023: larger checks are going to fewer companies—hence the massive round sizes for startups like Skild AI and Physical Intelligence.
I predict that round sizes will continue to grow larger as robotics funding becomes even more concentrated. Physical AI is a very risky space to invest in. It is highly capital-intensive, and there are immense market adoption uncertainties given the novelty and integrative complexity of robots. Thus, investors will be more selective about which physical AI startups they invest in, and the majority of funding will flow into a small number of highly promising startups.
Where the Race is Headed
We are still in the very early stages of this race, and only time will tell exactly who emerges as the winners. However, the overall direction of the race is clear: the best physical AI foundation models will be the ones that enable robots to behave as similarly to humans as possible. In that sense, VLA is likely just the start—we will eventually develop new types of models that enable even more human-like robot behavior.
Current progress illustrates this direction. Take adaptation, for example. Existing robots excel at repetitive tasks—think of industrial robots working on assembly lines in factories. However, current robots struggle to adjust to changing conditions—they lack humans’ adaptive nature. Hence, one of the largest strides being made in physical AI is enabling robots to adapt to new circumstances within the broader task they are trained to do. In the industrial robot example, the goal is to enable robots to handle odd-shaped parts without reprogramming.
Another major stride is the creation of general-purpose humanoid robots. Traditionally, robots specialize in a specific category of tasks, whether it be manufacturing, packing, or harvesting, and remain confined to the single category they are trained for. Companies like Skild AI and Figure AI are developing models that power general-purpose robots capable of completing a large variety of tasks across multiple categories, reflecting humans’ multifaceted skillsets.
While the latest humanoid robots are increasingly adaptive and general-purpose, there remains a major hindrance: these robots won’t act unless they are given specific natural-language commands. Although existing robots can automate the tasks they are prompted to perform, they cannot create their own agendas.
Therefore, I predict that the most significant breakthrough yet to be made in physical AI is the development of agentic models for general-purpose humanoid robots. Emphasis on agentic. Similar to the current boom in digital AI agents that autonomously create and execute their own agendas in pursuit of certain goals, the future of physical AI is humanoid robots that can form goals related to an overarching objective, determine how to achieve them, and execute their plans adaptively. In other words, robots that can lead themselves, like humans can. A foundation model that enables such agentic robot behavior could unlock trillions of dollars in economic value.
Ultimately, while much of today’s work lies in the digital realm, the vast majority of labor is still physical. The future of AI is the automation of physical labor—whether in farming, manufacturing, or the household. The race to dominate the physical AI foundation model market is unfolding quickly. I believe that whoever innovates an agentic foundation model, enabling humanoid robots to conduct physical labor without any human supervision, will win this race.
We were thrilled to have Matthew Zhang intern with Exceptional Capital for the Spring 2025 semester. Grateful for his support, engagement, and curiosity!