LeCun's latest interview: Why will the physical world ultimately become the "Achilles' heel" of LLMs?

WBOY

Mar 11, 2024 pm 12:52 PM

open source, AGI, robotics technology

In artificial intelligence, few scholars remain as active on social media at age 65 as Yann LeCun.

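Yann LeCun is known as an outspoken critic within the AI field. He is an active advocate of the open source spirit and led the Meta team that launched the popular Llama 2 model, making it a leader among open source large models. While many people are anxious about the future of AI and worry about possible doomsday scenarios, LeCun holds a different view: he firmly believes that the development of AI, and the emergence of superintelligence in particular, will benefit society.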

Recently, LeCun returned to Lex Fridman's podcast for a conversation of nearly three hours on the importance of open source, the limitations of LLMs, why AI doomsayers are wrong, and the road to AGI.

Watch page: https://youtu.be/5t1vTLU7s40?feature=shared

Here are some valuable points from this podcast:

Limitations of LLMs

Lex Fridman: You have said that autoregressive LLMs are not the way we will make progress toward superhuman intelligence. Why can't they take us all the way?

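Yann LeCun: For many reasons. First, intelligent behavior has a number of characteristics: the ability to understand the world, to understand the physical world; the ability to remember and retrieve things, that is, persistent memory; and the ability to reason and to plan. These are four essential characteristics of intelligent systems or entities, humans and animals. LLMs can do none of these, or can do them only in a very primitive way. They do not really understand the physical world, they have no persistent memory, they cannot really reason, and they certainly cannot plan. So if you expect a system to be intelligent without being able to do these things, you are making a mistake. That is not to say that autoregressive LLMs are not useful, that they are not interesting, or that a whole ecosystem of applications cannot be built around them. They are certainly useful. But as a path toward human-level intelligence, they are missing essential components.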

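We take in far more information through sensory input than through language. Despite our intuition, most of what we learn and know comes from observing and interacting with the real world, not from language. Everything we learn in the first few years of life, and certainly everything animals learn, has nothing to do with language.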

Lex Fridman: So you are saying that LLMs lack an understanding of the physical world? Intuitive physics, common-sense reasoning about physical space and physical reality: is that, to you, a giant leap that LLMs cannot make?

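Yann LeCun: The LLMs we use today cannot do this, for a number of reasons, but the main one is how they are trained. You take a piece of text, remove some of its words, mask them by replacing them with blank tokens, and train a giant neural network to predict the missing words. If you build this network in a particular way, so that it can only look at the words to the left of the one it is trying to predict, you get a system that basically tries to predict the next word in a text. So you can feed it a text or a prompt and have it predict the next word. It can never predict the next word exactly.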

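So what it does is produce a probability distribution over all possible words in the dictionary. In fact, it does not predict words. It predicts tokens, which are sub-word units, and because the dictionary contains only a finite number of possible entries, handling the uncertainty in the prediction is easy: you just compute a distribution over them. The system then picks a word from that distribution. Of course, words with higher probability are more likely to be picked. So you sample from that distribution to actually produce a word, then shift that word into the input so that the system can now predict the second word.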

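This is called autoregressive prediction, which is why these LLMs should be called "autoregressive LLMs", but we just call them LLMs. This kind of process differs from the process that takes place before you produce a word when you talk.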
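The loop described above (produce a distribution over the vocabulary, sample one token, shift it into the input, repeat) can be sketched in a few lines. This is a minimal illustration only: the tiny vocabulary and the `toy_logits` scoring function are hypothetical stand-ins for a trained network, not anything from the interview.

```python
import math
import random

# Hypothetical six-token vocabulary; real models use tens of thousands of sub-word tokens.
VOCAB = ["the", "cat", "sat", "on", "mat", "."]

def toy_logits(context):
    """Hypothetical stand-in for a trained network: one score per vocabulary entry."""
    last = context[-1] if context else None
    # Slightly discourage repeating the previous token; otherwise uniform.
    return [0.0 if tok == last else 1.0 for tok in VOCAB]

def softmax(scores):
    """Turn raw scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def generate(prompt, n_tokens, rng):
    tokens = list(prompt)
    for _ in range(n_tokens):
        probs = softmax(toy_logits(tokens))                   # distribution over the vocabulary
        next_tok = rng.choices(VOCAB, weights=probs, k=1)[0]  # sample one token
        tokens.append(next_tok)                               # shift it into the input
    return tokens

out = generate(["the"], 5, random.Random(0))
print(out)
```

Nothing in the loop plans ahead: each token is chosen only from the distribution conditioned on what has already been emitted, which is the property LeCun is pointing at.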

When you and I talk, and we are both bilingual, we think about what we are going to say, and that is relatively independent of the language we are going to say. When we talk about a mathematical concept, the thinking we do and the answer we intend to give have nothing to do with whether we express it in French, Russian or English.

Lex Fridman: Chomsky rolled his eyes, but I get it, so you're saying there's a larger abstraction that exists before language and maps to it?

Yann LeCun: For a lot of the thinking we do, yes.

Lex Fridman: Is your humor abstract? When you tweet, and your tweets are sometimes a little spicy, do you have an abstract representation in your brain before the tweet is mapped to English?

Yann LeCun: There is indeed an abstract representation involved in imagining the reader's reaction to a text. But thinking about a mathematical concept, or imagining something you want to build out of wood, has absolutely nothing to do with language. You are not having an inner monologue in a specific language. You are imagining a mental model of things. If I ask you to imagine what this water bottle would look like if I rotated it 90 degrees, that has nothing to do with language. It is clear that most of our thinking occurs at a more abstract level of representation. If the output is language, as opposed to muscle movements, we plan what we are going to say before we produce the answer.

An LLM doesn't do that; it just instinctively produces one word after another. It is a bit like a subconscious action, where someone asks you a question and you answer it. You have no time to think about the answer, but the answer is easy, so you don't need to pay attention and you respond automatically. This is what an LLM does. It doesn't really think about its answers. Because it has accumulated a lot of knowledge, it can retrieve some things, but it will just spit out token after token without planning the answer.

Lex Fridman: Generating token by token is necessarily simplistic, but if the world model is sophisticated enough, the most likely sequence of tokens it generates will be something deeply profound.

Yann LeCun: But this is based on the assumption that these systems actually have an internal model of the world.

Video Prediction

Lex Fridman: So the real question is... Can we build a model that has a deep understanding of the world?

Yann LeCun: Can you build it through prediction? The answer is probably yes. But can you build it by predicting words? The answer is most likely no, because language is very poor in terms of bandwidth: it is weak, low-bandwidth, and simply does not carry enough information. So building a model of the world means observing the world and understanding why it evolves the way it does, and an additional component of a world model is being able to predict how the world will evolve as a consequence of actions you might take.

So a true model is: here is my idea of the state of the world at time T, and here is an action I might take. What is the predicted state of the world at time T+1? Now, the state of the world does not need to represent everything about the world. It just needs to represent enough information relevant to planning the action, not necessarily all the details.
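In symbols, one hedged way to write this down is the following, where the names for the state, the action, and the model are notation assumed here rather than taken from the interview:

```latex
\hat{s}_{t+1} = f_\theta(s_t, a_t)
```

Here $s_t$ is the (abstract) state of the world at time $t$, $a_t$ is a candidate action, $f_\theta$ is the learned world model, and $\hat{s}_{t+1}$ is the predicted next state. The state only has to carry the information relevant to planning, not every detail of the world.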

Now, here is the problem. You cannot do this with generative models. Take a generative model trained on video, which we have been trying to do for 10 years: you take a video, show the system a piece of it, and ask it to predict the remainder of the video, basically predicting what is going to happen.

You can make large video models if you want. The idea has been around for a long time; at FAIR, some of my colleagues and I have been trying to do this for 10 years. But you cannot really do the same trick as with LLMs. With an LLM, as I said, you cannot predict exactly which word will follow a sequence of words, but you can predict a distribution over words. With video, you would have to predict a distribution over all possible frames, and we do not know how to do that properly.

We do not know how to represent distributions over high-dimensional continuous spaces in a useful way. That is the main problem, and it arises because the world is much more complicated and information-rich than text. Text is discrete, while video is high-dimensional and continuous, with a lot of detail. If I take a video of this room and the camera pans around, I simply cannot predict everything that will be in the room as it pans, and neither can the system. Maybe it predicts that this is a room, that there is a light in it, a wall, and that sort of thing. It cannot predict what a painting on the wall looks like or what the texture of the sofa looks like, and certainly not the texture of the carpet. So there is no way to predict all those details.

So one possible way to deal with this, which we have been studying, is to build a model with so-called latent variables. The latent variable is fed into the neural network and is supposed to represent all the information about the world that you have not yet perceived. You need it to augment the predictive power of the system so that it can predict pixels well, including the fine texture of the carpet, the sofa, and the paintings on the wall.

We have tried straight neural networks, we have tried GANs, we have tried VAEs, we have tried all kinds of regularized autoencoders. We also tried using these methods to learn good representations of images or video that could then be used as input to, for example, an image classification system. That basically failed.

All systems that try to predict missing parts from a corrupted version of an image or video basically do this: take the image or video, corrupt it or transform it in some way, then try to reconstruct the full video or image from the corrupted version, and hope that a good image representation develops inside the system that can be used for object recognition, segmentation, whatever. This approach is basically a complete failure, whereas it works extremely well when it comes to text. This is the principle used for LLMs.

Lex Fridman: Where does the failure come from? Is it that it is hard to form a good representation of an image, one that embeds all the important information? Is it the frame-to-frame consistency that makes up video? What would it look like if we made a compilation of all the ways you failed?

Yann LeCun: First, I have to tell you what doesn’t work, because there are other things that do. So, what doesn't work is training the system to learn representations of images, training it to reconstruct good images from corrupted images.

We have a whole set of techniques for this, which are all variants of denoising autoencoders. Some of my colleagues at FAIR developed something called MAE, or Masked Autoencoder. It is basically like an LLM, where you train the system by corrupting text, except that you corrupt images: you remove patches and then train a giant neural network to reconstruct them. The features you get are not good, and you know they are not good because if you train the same architecture supervised, with labeled data, text descriptions of the images, and so on, you do get good representations, and performance on recognition tasks is much better than with this kind of self-supervised pretraining.

The architecture is good, and the architecture of the encoder is good, but training the system to reconstruct images does not lead it to produce good generic features of images. So what is the alternative? Another approach is joint embedding.

JEPA (Joint Embedding Prediction Architecture)

Lex Fridman: What is the fundamental difference between a joint embedding architecture and an LLM? Can JEPA take us to AGI?

Yann LeCun: First, how does it differ from generative architectures like LLM? An LLM or a vision system trained through reconstruction generates the input. The raw input they generate is uncorrupted, untransformed, so you have to predict all the pixels, and it takes a lot of resources for the system to actually predict all the pixels and all the detail. In JEPA, you don't need to predict all pixels, you only need to predict an abstract representation of the input. This is much easier in many ways. Therefore, what the JEPA system has to do when training is to extract as much information as possible from the input, but only extract information that is relatively easy to predict. Therefore, there are many things in the world that we cannot predict. For example, if you have a self-driving car driving down the street or on the road, there might be trees around the road, and it might be a windy day. So the leaves on the tree move in a semi-chaotic, random way that you can't predict, and you don't care, and you don't want to predict. So you want the encoder to basically remove all of these details. It will tell you that the leaves are moving, but it won't tell you exactly what's going on. So when you predict in representation space, you don't have to predict every pixel of every leaf. Not only is this much simpler, but it also allows the system to essentially learn an abstract representation of the world, where what can be modeled and predicted is retained, and the rest is treated as noise by the encoder and eliminated.
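The contrast between predicting every pixel and predicting only an abstract representation can be illustrated with a toy example. Everything here is a hypothetical sketch: inputs are short vectors whose first two coordinates are predictable "signal" and whose remaining coordinates are unpredictable noise (the moving leaves), and the encoder is a hand-written slice rather than a learned network. A real JEPA learns its encoder and predictor and needs extra machinery to avoid collapse.

```python
import random

def encode(x, k=2):
    """Hypothetical encoder: keep only the first k coordinates (the predictable part).
    A real JEPA encoder is a learned network, not a fixed slice."""
    return x[:k]

def predict(rep):
    """Hypothetical predictor in representation space: identity for this toy."""
    return list(rep)

def best_pixel_guess(x, k=2):
    """Input-space prediction: copy the predictable part, guess the mean (0) for noise."""
    return x[:k] + [0.0] * (len(x) - k)

def sq_error(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def make_pair(rng, noise_dims=6):
    signal = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
    # Context x and target y share the signal; the other coordinates are independent noise.
    x = signal + [rng.gauss(0, 1) for _ in range(noise_dims)]
    y = signal + [rng.gauss(0, 1) for _ in range(noise_dims)]
    return x, y

rng = random.Random(0)
pairs = [make_pair(rng) for _ in range(1000)]

# Generative-style objective: error over every coordinate of the target.
pixel_loss = sum(sq_error(best_pixel_guess(x), y) for x, y in pairs) / len(pairs)
# JEPA-style objective: error only between representations.
jepa_loss = sum(sq_error(predict(encode(x)), encode(y)) for x, y in pairs) / len(pairs)

print(f"input-space loss: {pixel_loss:.2f}, representation-space loss: {jepa_loss:.2f}")
```

Even with the best possible guess for the noise coordinates, the input-space loss stays large because it is dominated by detail that cannot be predicted, while the representation-space loss only measures what the encoder chose to keep.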

Therefore, it increases the level of abstraction of the representation. If you think about it, this is definitely something we've been doing. Whenever we describe a phenomenon, we do so at a specific level of abstraction. We don't always use quantum field theory to describe every natural phenomenon. That is impossible. So we have multiple levels of abstraction to describe what's going on in the world, from quantum field theory to atomic theory, molecules, chemistry, materials, all the way to concrete objects in the real world and so on. So we can't just simulate everything at the lowest level. And this is exactly the idea behind JEPA, learning abstract representations in a self-supervised manner, and also learning them hierarchically. So I think that's an important part of smart systems. In terms of language, we don't have to do this, because language is already abstract to a certain extent and has eliminated a lot of unpredictable information. Therefore, we can directly predict words without doing joint embeddings or increasing the level of abstraction.

Lex Fridman: You're saying that with language we have been lazy, because we were handed abstract representations for free, and now we have to zoom out and really think about intelligent systems in general. We have to deal with physical reality, which is a mess. And you really do have to make that jump from full, rich, detailed reality to an abstract representation of reality that you can reason about.

Yann LeCun: That’s right. Self-supervised algorithms that learn by prediction, even in representation space, learn more concepts if the input data is more redundant. The more redundant the data, the better they capture the internal structure of the data. Therefore, in sensory input such as perceptual input and vision, there are much more redundant structures than in text. Language may actually represent more information because it has been compressed. You're right, but that also means it has less redundancy, so the self-supervision won't be as good.

Lex Fridman: Is it possible to combine self-supervised training on visual data with self-supervised training on language data? Even though you are talking about 10^13 tokens, there is a ton of knowledge in them. Those 10^13 tokens represent everything we humans have figured out, including the crap on Reddit, the contents of all the books and articles, and everything the human intellect has ever created.

Yann LeCun: Well, ultimately yes. But I think if we do it too early, we risk being induced to cheat. And in fact, this is exactly what people are currently doing with visual language models. We are basically cheating, using language as a crutch to help our deficient visual systems learn good representations from images and videos.

The problem with this is that we can improve language models by feeding them images, but we will not even reach the level of intelligence or understanding of the world that a cat or a dog has, and they have no language. They have no language, yet they understand the world far better than any LLM. They can plan very complex actions and imagine the consequences of a sequence of actions. How do we get machines to learn this before combining it with language? Obviously, if we combine this with language we will get results, but until then we have to focus on how to get systems to learn how the world works.

In fact, the techniques we use are non-contrastive. So not only is the architecture non-generative, the learning procedures we use are also non-contrastive. We have two sets of techniques. One set is based on distillation, and there are many methods that use this principle. DeepMind has one called BYOL, and there are several from FAIR: one is called VICReg and another is called I-JEPA. It should be said that VICReg is not a distillation method, but I-JEPA and BYOL certainly are. There is also one called DINO, also from FAIR. The idea behind these methods is that you run the complete input, say an image, through an encoder to produce a representation; then you corrupt or transform the input and run it through what is essentially the same encoder, with some minor differences; and then you train a predictor.

Sometimes the predictor is very simple and sometimes it does not exist, but where there is one, it is trained to predict the representation of the first, uncorrupted input from the corrupted input. But you only train the second branch, the part of the network that takes the corrupted input. The other network does not need training, but since the two share the same weights, modifying one also modifies the other. With various tricks you can prevent the system from collapsing, that is, from basically ignoring the input. This method works very well. Two techniques we developed at FAIR, DINO and I-JEPA, work very well in this respect.

Our latest version is called V-JEPA. It is basically the same idea as I-JEPA, just applied to video. You take an entire video and mask out a chunk of it. What we mask is actually a temporal tube: the same region of every frame, over the entire video.
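To make the masking pattern concrete, here is a small sketch of a tube-shaped mask over a grid of video patches. The function name, the grid sizes, and the block position are all illustrative assumptions; the interview does not specify how the masked regions are chosen.

```python
def tube_mask(num_frames, height, width, top, left, block_h, block_w):
    """Return mask[t][i][j] == True where the patch at row i, column j of frame t is hidden.
    The same spatial block is hidden in every frame, forming a tube through time."""
    return [
        [
            [top <= i < top + block_h and left <= j < left + block_w for j in range(width)]
            for i in range(height)
        ]
        for _ in range(num_frames)
    ]

# Hypothetical 4-frame clip split into a 6x6 grid of patches, hiding a 3x2 block.
mask = tube_mask(num_frames=4, height=6, width=6, top=1, left=2, block_h=3, block_w=2)
hidden_per_frame = [sum(sum(row) for row in frame) for frame in mask]
print(hidden_per_frame)  # [6, 6, 6, 6]
```

Because the hidden block is identical in every frame, the system cannot recover the missing content by copying it from a neighboring frame; it has to predict the representation of the hidden region from the visible surroundings.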

This is the first system we have that learns good representations of video, so that when you feed those representations to a supervised classifier head, it can tell you with fairly high accuracy what action is taking place in the video. So this is the first time we are getting something of this quality.

The results seem to indicate that our system can use its representation to tell whether a video is physically possible or completely impossible, because some object disappears, suddenly jumps from one location to another, or changes shape.

Lex Fridman: Does this allow us to build a model of the world that understands it well enough to be able to drive a car?

Yann LeCun: It may take a while to get there. There are already some robotic systems based on this idea. What you need is a slightly modified version. Imagine that you have a complete video, and what you do with this video is time-shift it into the future. Therefore, you can only see the beginning of the video but not the second half of the original video, or only the second half of the video is blocked. You can then train a JEPA system or a system like the one I described to predict the complete representation of the occluded video. However, you also need to provide the predictor with an action. For example, the wheel turns 10 degrees to the right or something, right?

So if this is a camera on a car and you know the angle of the steering wheel, then to some extent you should be able to predict how what you see will change. Obviously you cannot predict all the details of the objects that appear in the view, but at the level of abstract representation you may be able to predict what will happen. So now you have an internal model that says, "Here is my idea of the state of the world at time T, here is the action I am taking, and here is a prediction of the state of the world at time T plus 1, T plus delta T, T plus 2 seconds," whatever it is. If you have such a model, you can use it for planning. So now you can do what LLMs cannot do, which is plan what you are going to do so that you arrive at a particular outcome or satisfy a particular objective.

So you can have many objectives. I can predict that if I hold an object like this and open my hand, it will fall. If I push it across the table with a certain force, it will move. If I push the table itself with the same force, it probably won't move. We have an internal model of the world in our minds, which allows us to plan a sequence of actions to achieve a particular goal. Now, if you have this world model, you can imagine a sequence of actions, predict the outcome of that sequence, measure how well the final state satisfies a particular objective, such as moving the bottle to the left of the table, and then plan a sequence of actions that minimizes this objective at run time.

We are not talking about learning here, we are talking about inference time, so this is really planning. In optimal control, this is a very classical thing called model predictive control. You have a model of the system you want to control that predicts the sequence of states corresponding to a sequence of commands. And you plan a sequence of commands so that, according to your model, the end state of the system satisfies the objective you set. Rocket trajectories have been planned this way since the advent of computers, in the early 1960s.
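As a rough sketch of planning with a model, the snippet below imagines many candidate command sequences, rolls each one through a model, and keeps the sequence whose predicted end state best satisfies the objective. The one-dimensional dynamics, the cost function, and the random-search planner are all assumptions chosen for brevity; practical controllers use learned or physics-based models and proper optimizers.

```python
import random

def world_model(state, action):
    """Hypothetical hand-written dynamics: a push moves the object by `action`."""
    return state + action

def cost(state, goal):
    """How far the final state is from satisfying the goal (lower is better)."""
    return (state - goal) ** 2

def rollout(state, actions):
    """Predict the end state reached by applying a sequence of actions."""
    for a in actions:
        state = world_model(state, a)
    return state

def plan(state, goal, horizon, n_candidates, rng):
    """Random-search planner: imagine many action sequences and keep the best one."""
    best_actions, best_cost = None, float("inf")
    for _ in range(n_candidates):
        actions = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        c = cost(rollout(state, actions), goal)
        if c < best_cost:
            best_actions, best_cost = actions, c
    return best_actions, best_cost

rng = random.Random(0)
start, goal = 0.0, 1.0
actions, predicted_cost = plan(start, goal, horizon=3, n_candidates=200, rng=rng)
final_state = rollout(start, actions)
print(f"planned actions: {[round(a, 2) for a in actions]}, predicted end state: {final_state:.2f}")
```

No learning happens in this loop; the model is fixed and only the action sequence is searched. In model predictive control it is common to execute just the first planned action, observe the new state, and plan again.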

Reinforcement Learning

Lex Fridman: So your recommendation is to abandon generative models in favor of joint embedding architectures? It sounds like court testimony: abandon probabilistic models in favor of the energy-based models we talked about, abandon contrastive methods in favor of regularized methods. And you have been a critic of reinforcement learning for some time.

Yann LeCun: I don't think it should be abandoned completely, but I think its use should be minimized because it is very inefficient in terms of samples. So the proper way to train a system is to first have it learn good representations of the world and a world model, mostly from observation (and maybe a little bit of interaction).

Lex Fridman: Why does RLHF work so well?

Open Source

Yann LeCun: The only way to have an AI industry, and to have AI systems that are not uniquely biased, is to have open source platforms on which any group can build specialized systems. The inevitable direction of history is that the vast majority of AI systems will be built on top of open source platforms.

Meta revolves around a business model where you provide a service that is funded either by advertising or commercial clients.

For example, if you have an LLM that can help a pizza shop by talking to customers through WhatsApp, the customer simply orders a pizza and the system asks them, "What toppings do you want?" or "What size do you want?" and so on. The business will pay for that, and that is the model.

Otherwise, if it is a more classic service system, it can be supported by advertising or have several modes. But the thing is, if you have a large enough potential customer base that you need to build the system for them anyway, there's no harm in releasing it into open source.

Lex Fridman: So Meta's bet is: we will do a better job of it?

Yann LeCun: No. We already have a huge user base and customer base.

It doesn't hurt us to provide open source systems or base models for others to build applications on. If those applications turn out to be useful to our customers, we can just buy them from the developers. They may also improve the platform. In fact, we have seen this happen: LLaMA 2 has been downloaded millions of times, and thousands of people have contributed ideas on how to improve the platform. So this clearly speeds up the process of making the system available to a wide range of users, and thousands of businesses are building applications with it. Meta's ability to generate revenue from this technology is therefore not impaired by open source distribution of the base models.

Llama 3

Lex Fridman: What are you most excited about with LLaMA 3?

Yann LeCun: There will be various versions of LLaMA that improve on the previous ones: bigger, better, multimodal, that sort of thing. And then, in future generations, there will be systems capable of planning that actually understand how the world works, probably trained on video, so they will have some model of the world that might let them do the type of reasoning and planning I talked about earlier.

How long will this take? When will research in this direction make its way into the product line? I don't know and I can't tell you. We basically have to go through a few breakthroughs before we get there, but people can monitor our progress because we publish our research openly. For instance, last week we published the V-JEPA work, a first step toward systems trained on video.

The next step will be to train world models based on this kind of video training. DeepMind has similar work, and UC Berkeley has work on world models and video. Many people are working on this, and I think a lot of good ideas are emerging. My bet is that these systems will be JEPA-like systems rather than generative models, and we will see what the future holds.

More than 30 years ago we were working on convolutional networks and early neural networks. Today I see a path toward human-level intelligence, toward systems that can understand the world, remember, plan, and reason. There is a set of ideas to move forward with that might have a chance of working, and I am really excited about that.

What I like is that we are somehow moving in a good direction and maybe succeeding before my brain turns to white sauce or before I need to retire.

Lex Fridman: Most of your excitement is still in the theoretical aspect, that is, the software aspect?

Yann LeCun: I used to be a hardware guy many years ago. Scale is necessary, but not sufficient. We may get there within the next ten years or so, but there is still some distance to cover. Certainly, the further we go in terms of energy efficiency, the more progress we make in hardware. We have to reduce power consumption. Today, a GPU consumes between half a kilowatt and a kilowatt. A human brain draws about 25 watts, and a GPU has far less computing power than a human brain. You would need something like 100,000 or a million GPUs to match it, so we are pretty far off.
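A quick back-of-the-envelope calculation shows the size of the gap these figures imply. It assumes that the "100,000 or 1 million" figure refers to the number of GPUs needed to match a brain, which is an interpretation of the passage rather than something it states outright; the constant names are illustrative.

```python
BRAIN_WATTS = 25                        # figure quoted for a human brain
GPU_WATTS_RANGE = (500, 1000)           # "half a kilowatt to a kilowatt" per GPU
GPU_COUNT_RANGE = (100_000, 1_000_000)  # assumption: GPUs needed to match a brain

low_total = GPU_COUNT_RANGE[0] * GPU_WATTS_RANGE[0]    # 50,000,000 W = 50 MW
high_total = GPU_COUNT_RANGE[1] * GPU_WATTS_RANGE[1]   # 1,000,000,000 W = 1 GW

low_ratio = low_total / BRAIN_WATTS    # 2,000,000x the brain's power draw
high_ratio = high_total / BRAIN_WATTS  # 40,000,000x the brain's power draw

print(f"{low_total / 1e6:.0f} MW to {high_total / 1e9:.0f} GW")
print(f"{low_ratio:,.0f}x to {high_ratio:,.0f}x the power of a brain")
```

Under these assumptions the hardware would draw between about two million and forty million times the power of a brain, which is the sense in which the two are "pretty far" apart.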

AGI

Lex Fridman: You often say that AGI is not coming anytime soon. What is the underlying intuition behind that?

Yann LeCun: The idea, popularized by science fiction and Hollywood, that somebody will discover the secret of AGI or human-level AI or AMI (whatever you want to call it), then turn on a machine, and we will have AGI, is not going to happen.

This will be a gradual process. Will we have systems that can understand how the world works from videos and learn good representations? It will take quite some time before we reach the scale and performance we observe in humans, not just a day or two.

Will we have systems with large amounts of associative memory so they can remember things? Yes, but that won't happen tomorrow either. We need to develop some basic techniques. We have a lot of these techniques, but getting them to work together in a complete system is another story.

Will we have systems that can reason and plan, perhaps like the goal-driven AI architectures I described earlier? Yes, but it will take a while to get them working properly. It is going to be at least a decade or more before we get all these things working together, before we get systems based on this that learn hierarchical planning and hierarchical representations and that can be configured for the different situations at hand the way a human brain can, because there are a lot of problems that we do not see yet, that we have not encountered yet, so we do not know whether there are simple solutions within this framework.

For the past decade or so, I’ve heard people claim that AGI is just around the corner, and they’re all wrong.

IQ can measure something about humans, but only because humans are relatively uniform in form. Even then, it measures only one kind of ability, which may be relevant to some tasks but not others. If you are talking about other intelligent entities, for which the basic things that are easy to do are completely different, then it does not make any sense. Intelligence is a collection of skills and the ability to acquire new skills efficiently. The set of skills that a particular intelligent entity possesses or can learn quickly differs from the set of skills of another. Because the skill set is a high-dimensional space, you cannot measure it with a single number, and you cannot compare two entities to see whether one is smarter than the other.

Lex Fridman: You often speak out against so-called AI doomers. Explain their views and why you think they are wrong.

Yann LeCun: AI doomers imagine all kinds of catastrophe scenarios in which AI could escape or take control and basically kill all of us. This relies on a whole bunch of assumptions, most of which are wrong.

The first assumption is that the emergence of superintelligence will be an event: at some point we will discover the secret and turn on a superintelligent machine, and because we have never done this before, it will take over the world and kill us all. This is wrong. It will not be an event.

We will have systems that are as smart as cats, they have all the characteristics of human intelligence, but their level of intelligence may be like a cat or a parrot or something. Then, we gradually improve their intelligence. While making them smarter, we also need to set up some guardrails on them and learn how to set up guardrails to make them behave more normally.

In nature, it seems that the more intelligent species ends up dominating the others, and sometimes even driving them extinct, sometimes by design and sometimes just by mistake.

So the thinking goes, "Well, if AI systems are smarter than us, they are definitely going to wipe us out, if not on purpose, then simply because they do not care about us." That is absurd. The first reason is that they will not be a species that competes with us, and they will not have the desire to dominate, because the desire to dominate has to be hardwired into an intelligent system. It is hardwired in humans, baboons, chimpanzees, and wolves, but not in orangutans. This desire to dominate, submit, or otherwise gain status is specific to social species. Non-social species like orangutans have no such desire and are just as smart as we are.

Humanoid Robots

Lex Fridman: Do you think there will be millions of humanoid robots walking around soon?

Yann LeCun: It won’t be soon, but it will happen.

Over the next ten years, I think the robotics industry is going to be very interesting. Its rise has been 10 or 20 years in the making without really materializing. The main issue remains Moravec's paradox: how do we get these systems to understand how the world works and plan actions? For now, we can only do this for really specialized tasks. What Boston Dynamics did was basically a lot of hand-crafted dynamic models and careful planning in advance, which is very classical robotics with a lot of innovation and a little bit of perception, but it is still not enough, and they cannot make a home robot.

Additionally, we are still some distance away from fully autonomous L5 driving, such as a system that can train itself like a 17-year-old through 20 hours of driving.

So we won’t make significant progress in robotics until we have a model of the world, a system that can train itself to understand how the world works.
