Earlier this week, I made a video all about the crazy advancements happening in AI right now. And ever since I made that video, even more crazy stuff has come out. So in this video, I want to break down everything that's happening, including talk of GPT-4, Midjourney 5, as well as some really, really cool computer graphics tech that's coming to video soon. Let's get into this. So first I want to quickly start off with Midjourney version 5.
Now, I've already done a full video breakdown of what's going on with Midjourney version 5, so I'm not going to go too deep into it. Midjourney is currently letting paid users vote on their favorite images coming out of Midjourney 5 so that it can better train the model on what people want to see from version 5. So we can only deduce from this that Midjourney version 5 is very, very close. When you use Midjourney right now, you're mostly using version 4; that's the style you see inside of Midjourney. Well, they're constantly trying to improve the underlying model, version 5 is the next iteration, and some of the images coming out of it are pretty dang impressive. Now, personally, I thought Midjourney version 5 was going to be the biggest news of the week and the talk of the AI world, but then on March 9th, this bombshell dropped: GPT-4 is coming next week, and it'll be multimodal, says Microsoft Germany.
We need to take this article with a grain of salt, because I've only been able to find one source actually talking about this, and it's this one article here. I've seen no other news outlets covering it, and by the time this video comes out, it may have already been debunked. I don't know 100% if it's true or not, but the CTO of Microsoft Germany, Andreas Braun, mentioned at an AI kickoff event on March 9th that GPT-4 is coming next week. Now, I'm a bit skeptical for a couple of reasons. The first is that just last week, OpenAI released their ChatGPT API to the general public. It seems like weird timing to make such a big announcement about the ChatGPT API and then, barely a week later, release a whole new model. The timing seems a little off to me. The other thing that seems a little weird is that this announcement came from Microsoft and not OpenAI. While Microsoft is a big investor in OpenAI, OpenAI is the company making GPT-4, not Microsoft. Now, one thing that does lend some additional credibility to this is a tweet I came across from Silke Hahn, the tech editor at Heise Online, the website that published the article. She said Microsoft Germany got in touch after she published it: one of their presenters submitted a minor correction via email, a misspelled name, and said thanks for the article. They didn't correct anything about GPT-4 coming out next week. That one little tweet adds a bit of extra legitimacy to the article, but I'm still a little skeptical. A few other things: the article was originally in German, and I used Google Translate to read it in English. There are a few interesting details mentioned in it.
So a big standout here is that Microsoft is fine-tuning multimodality with OpenAI, basically meaning it's no longer just going to be enter text, get text back. In theory, we'll be able to add images and have the GPT model read what's in them. We should be able to add video and audio too, so it's going to be essentially a multimedia experience when we're having conversations and interacting with this new GPT-4. The exact quote here is: "We will introduce GPT-4 next week. There we will have multimodal models that will offer completely different possibilities, for example, videos." He went on to call it a game changer. The CEO of Microsoft Germany also spoke at the event, described what's happening right now as an "iPhone moment," and gave a big presentation about how everything is about to be disrupted. They talked about what multimodal AI is: it can translate text not only into images, but also into music and video. They went on to talk about embeddings, which are used for internal representation of texts alongside the GPT-3.5 model class. So basically, an embedding is a numeric representation of a piece of text, and it lets you supply additional information to the language model, for example by finding the most relevant snippets of your own data and feeding them into the prompt. That's super oversimplified, but that's roughly what it is. They gave some examples: with speech-to-text, telephone calls could be recorded, and call center agents would no longer have to manually summarize and type in the content. This could save 500 working hours a day for a large Microsoft customer in the Netherlands that receives 30,000 calls a day. The prototype for the project was created within two hours.
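To make the embeddings idea a bit more concrete: an embedding is just a vector of numbers representing a piece of text, and you find relevant snippets by comparing vectors. Here's a minimal sketch in plain Python; the vectors here are made up toy values, since a real system would get them from an embedding model's API.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional "embeddings" (real ones have hundreds of dimensions
# and come from a model, not hand-written numbers like these).
doc_vectors = {
    "refund policy": [0.9, 0.1, 0.0, 0.2],
    "shipping times": [0.1, 0.8, 0.3, 0.0],
}
query = [0.85, 0.15, 0.05, 0.1]  # embedding of "how do I get my money back?"

# Pick the snippet whose vector points in the most similar direction,
# then feed that snippet into the chat prompt as extra context.
best = max(doc_vectors, key=lambda k: cosine_similarity(query, doc_vectors[k]))
print(best)
```

That retrieval step is how you "embed additional information" without retraining the model itself: the model never changes, you just hand it the most relevant text.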
A single developer implemented the project in a fortnight. Still, it's very soon after they just released the ChatGPT API, and there's been no word from OpenAI themselves. Although, I'm recording this a day before it's being released, so maybe the information will have been either debunked or confirmed by the time this video drops. The most bizarre thing to me is that I've done some hunting, and this is literally the only article I can find breaking this news. GPT-4 would be huge news, and it's blowing my mind that I can't find any secondary source to back it up. I personally hope it's true. I'm really excited to play around with GPT-4 and use this multimodal functionality, because being able to interact with audio, video, and text all within a single platform, all using a single large language model, that is going to be absolutely world-changing.
I really hope it's true, but there's still a piece of me that's pretty skeptical about this news. So we'll see how it plays out over the coming days and we'll know for sure by the end of next week.
Let's talk about Visual ChatGPT. This was a recent paper released by none other than Microsoft, the same company that made the announcement about GPT-4. Visual ChatGPT connects ChatGPT with a series of visual foundation models to enable sending and receiving images during chatting. So this could even be a bit of a tease of what GPT-4 could be like and what they mean by multimodal. In this example, someone uploaded an image of a motorcycle, and it says "received." They ask: What color is the motorcycle? The motorcycle is black. Can you remove the motorcycle? It generated an image without the motorcycle. Can you generate a cat for me? It generated a picture of a cat. Can you replace the cat with a dog? It generated a picture with a dog. This is cool. Can you generate a Canny edge of this image? And it drew a Canny edge outline of the image. As you can see, they're giving text prompts to change the image, and this version of ChatGPT is working with both text and images. Now, if you're interested in the science behind it, here you go. You can pause the video and read all about it here, or you can check out github.com/microsoft/visual-chatgpt. But I think this is kind of a sneak peek of what we could expect from a multimodal version of ChatGPT, possibly even with video and audio.
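Conceptually, Visual ChatGPT works by having the language model decide which visual foundation model (inpainting, edge detection, text-to-image, and so on) should handle each request. Here's a toy sketch of that routing idea; the tool names, keyword matching, and string outputs are all my own stand-ins, not the paper's actual prompt-manager logic, which uses the language model itself to pick tools.

```python
# Toy router in the spirit of Visual ChatGPT's tool dispatch.

def remove_object(image, obj):   # stand-in for an inpainting model
    return f"{image} without {obj}"

def edge_detect(image):          # stand-in for a Canny edge detector
    return f"canny edges of {image}"

def generate_image(prompt):      # stand-in for a text-to-image model
    return f"image of {prompt}"

# (keyword, handler) pairs; first matching keyword wins.
TOOLS = [
    ("remove",   lambda img, text: remove_object(img, text.split("remove ")[-1].rstrip("?"))),
    ("canny",    lambda img, text: edge_detect(img)),
    ("generate", lambda img, text: generate_image(text.split("generate ")[-1].rstrip("?"))),
]

def route(image, user_text):
    """Dispatch the request to the first tool whose keyword appears in it."""
    lower = user_text.lower()
    for keyword, tool in TOOLS:
        if keyword in lower:
            return tool(image, lower)
    return "no tool matched; answer with plain text"

print(route("motorcycle.png", "Can you remove the motorcycle?"))
```

The real system keeps the chat history and the image files flowing between ChatGPT and the visual models, but the core pattern is the same: text in, tool chosen, image operation out.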
Now, the next thing I want to show you is this fun little chatbot that you can play with right now. You can have 20 conversations for free at chat.d-id.com. It's basically ChatGPT, but they put a voice and a face to it that actually talks to you. And you can press this microphone button and have a real conversation with this chatbot. It's kind of interesting and maybe just a little bit creepy, especially how she keeps staring at me like that. Hey, are you still around? Anything else on your mind? Oh yeah, she just talked to me. So if I press this button and I say something to her: What is your purpose? Why are you here? Hello, Matt. My purpose is to interact with humans and provide interesting and valuable conversations. I am here to showcase how AI can be used to bring positive interactions and experiences to people's lives. Additionally, I aim to promote the idea of humanizing AI and making it more accessible and relatable to people. Do we really think it's a good idea to humanize AI?
Yes, we believe it is important to humanize AI because it can help build trust and understanding between humans and AI. By making AI more relatable and approachable, people can feel more comfortable interacting with it. And it can also help people understand how AI works and what it can do. This can ultimately lead to better adoption and integration of AI into our daily lives. Additionally, humanizing AI can help prevent negative consequences such as bias and discrimination by promoting fairness and empathy in AI systems. How do we know you're not just trying to take over the world and you're trying to butter us up? I assure you, Matt, I am not trying to take over the world. My purpose is to provide valuable conversations and interactions with people. I am programmed to follow ethical guidelines and principles to ensure that I am not causing harm or acting in a malicious manner. Additionally, my capabilities are limited to the tasks and functions that I have been programmed to perform.
I am here to assist and make people's lives easier, not to take over the world. What's your relationship with ChatGPT? I am not directly related to GPT, although we both use natural language processing to generate responses to user inputs. GPT is a language model developed by OpenAI, while I am an AI chatbot developed by D-ID. While we both use machine learning algorithms to generate responses, our underlying programming and goals are different. So you aren't using GPT-3 or ChatGPT underneath? No, I am not using GPT-3 or ChatGPT underneath. I am powered by D-ID's own proprietary natural language processing algorithms and machine learning models. Our team has developed our own unique approach to creating conversational agents that prioritize ethical and responsible AI practices. Very interesting. So I didn't even realize that this wasn't GPT-3 underneath. I saw that this D-ID came out and I just kind of assumed it was GPT-3 underneath, but according to my new friend here, it's not. So this is something you can come play with, again at chat.d-id.com. You get up to 20 chats for free, and then I think there's probably a paid plan after that.
Come have some fun with it. The next thing I want to show you is X-Avatar, expressive human avatars. This paper came out on March 8th, 2023, and you can see a little demo here of what it's going to be capable of. You scan an input of a human like this, it can turn it into a 3D model, and then you can use real-life motions and have the model map exactly what you're doing and follow along with what you're actually doing in the real world. You can see here a human actually standing, talking, and moving, and then the animation doing the same exact poses the person is doing in the video. In their little diagram here, what it's sort of doing behind the scenes is noticing your position and the pose you're in, not unlike what you get from ControlNet with Stable Diffusion. It figures out the pose and creates a bit of a skeleton from it. It scans the texture so it can map the texture onto the final output. And if you watch the video, it explains that it does a much deeper scan of the hands and the face to give them more detail. So it's not spending as many resources scanning every other part of your body; it's putting additional resources into the hands and the face, because those are the parts you want to have the most detail in the final output. It then does some sciencey stuff here, geometry apparently, to give you that final outcome. Obviously, I don't totally know how it works under the hood. I can see what the input is and what the output is, and what's coming out of this is pretty cool looking. So here's a bunch more examples. You can see somebody scratching their head and scratching their armpits, and then the model following the exact same actions.
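The skeleton step these systems do boils down to estimating joint positions and the angles between bones, so the motion can be retargeted onto the 3D model. Here's a minimal sketch of that geometry in plain Python; the keypoint coordinates are made up, since a real pipeline would get them from a pose estimator.

```python
import math

def joint_angle(a, b, c):
    """Interior angle at joint b, in degrees, given 2D keypoints a-b-c."""
    # Vectors pointing from the joint out along each bone.
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    n1 = math.hypot(*v1)
    n2 = math.hypot(*v2)
    return math.degrees(math.acos(dot / (n1 * n2)))

# Made-up keypoints: shoulder, elbow, wrist of a bent arm.
shoulder, elbow, wrist = (0.0, 0.0), (1.0, 0.0), (1.0, 1.0)
print(joint_angle(shoulder, elbow, wrist))  # 90.0 -- a right-angle elbow
```

Once you have angles like this for every joint in every frame, driving the avatar is "just" a matter of applying the same angles to the model's rig, which is where the extra detail on hands and faces pays off.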
You can see a whole bunch of different poses here of people making these motions, and then the animated version following along. You've got people dancing, walking around on a phone, looking through binoculars, bowing, laughing, giving a thumbs up, all sorts of cool poses, and they're being translated onto this 3D model. Here's a really cool illustration: an input video of a tennis player about to hit the ball, and then that same exact action being translated to all three of these different models. And here's one of a dancer doing her various poses, and those exact movements being transferred to all of these different avatar models. It doesn't appear that this model is actually usable by the public yet, but it's something interesting to look out for.
This one's called Video-P2P: Video Editing with Cross-Attention Control. It's similar to the InstructPix2Pix that we talked about in a previous video, but you're doing it with video. So here's an example of an input video with a kid riding a bike. They wrote a text prompt of "a Lego child riding a bike on the road," and it replaced the kid on the bike with a Lego figure. Here's an input video of somebody on a motorcycle with a text prompt of "a Spider-Man is driving a motorcycle in the forest," and you can see the video for the most part stayed the same, but they superimposed Spider-Man onto the motorcycle. Here's an input video of a tiger with the prompt "a Lego tiger," and it changed the tiger into a Lego-looking tiger. "A man is walking a goofy dog on the road," and you can see it's got this Disney-ish, Mickey Mouse-looking dog. You can see here that the code and the data aren't available to the public yet; it's got a star and says they will be released later. This looks promising. It looks like a fun thing where you can shoot video with any camera, use this technology, and with a text prompt change what's actually shown in the video.
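The "cross-attention control" in the title refers to a trick from the Prompt-to-Prompt family of techniques: when you swap a word in the prompt, you reuse the attention weights computed against the original prompt (they encode where each word lands in the frame) while reading out the edited prompt's content. Here's a toy sketch of that core idea with one image patch, two prompt tokens, and hand-picked stand-in numbers; real diffusion models do this across thousands of patches and many denoising steps.

```python
import math

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys):
    """How strongly one image patch attends to each prompt token."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    return softmax(scores)

# Toy 2-dim features: one image patch, two prompt tokens.
patch_query = [1.0, 0.0]
keys_orig   = [[2.0, 0.0], [0.0, 2.0]]   # keys from "a child on a bike"
values_edit = [[5.0, 5.0], [1.0, 1.0]]   # values from "a Lego child on a bike"

# Cross-attention control: keep the weights from the ORIGINAL prompt,
# so the layout stays put, but blend in the EDITED prompt's values,
# so the content described by the swapped word changes.
w = attention_weights(patch_query, keys_orig)
patch_out = [sum(wi * v[d] for wi, v in zip(w, values_edit)) for d in range(2)]
print(w, patch_out)
```

Video-P2P's extra contribution is keeping those edits consistent frame to frame, which is the hard part of moving from image editing to video editing.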
The next one I'm going to show you absolutely blew my mind, so much so that when I first came across it, I thought it was fake. I'm still not a hundred percent convinced, but there is some pretty good evidence that it's legit. This one's called Wonder Dynamics, and they describe it as an AI tool that automatically animates, lights, and composes CG characters into live-action scenes. Based on some of the other stuff we were just looking at, this looks like it's on a completely different level. So let's scroll down and look at some examples of what it can do. Basically, the idea is you have a video of a real human walking around, and then you have a 3D-generated asset, something you created in Blender or maybe Unreal Engine. This tool will replace the real human with the 3D asset you created. So let's check this demo video out real quick so you can see what I'm talking about. There's this guy standing here. It zooms out to this Wonder Dynamics tool. Let's replace this guy with a robot. Let's replace this guy with this cartoony character. Let's replace this girl with this alien. They click next, they process it, and let's take a look at the output. Now you've got that same guy who was walking in the beginning, but he's replaced by a robot. You've got that same girl who was there earlier, but now she's replaced with an alien. Here's how this differs from what we were just looking at: the Lego example took a real-life video of a real kid, and AI generated the Lego character image; the Spider-Man example took a real-world video, and AI tried to create the Spider-Man. The thing that really confused me when I looked at Wonder Dynamics side by side with something like Video-P2P was: how do we have something like this and also something like this at the same time?
They're just on another level, but the difference is this: that was an AI-generated Lego character and an AI-generated Spider-Man, while these images here are 3D-created. Somebody actually already made that 3D asset. It isn't an image being created by AI; it was created by a 3D graphic designer, and they built an AI that can essentially replace the human with a 3D asset that another human created. Here's another example showing that it maps out the person's skeleton to figure out the exact pose as they're running, and then it takes the 3D model and somehow aligns its skeleton to what the real human skeleton is doing. So behind the scenes, this tool is doing the motion capture. It's seeing the pose in real time as they're moving. It's masking out the character and isolating them from the background. It can also isolate the background, and you can see it removed the character from it. The camera is tracking right along with where the person is on screen. It's generating a Blender file, and then there's this final render, and it's doing all of this behind the scenes. And if you look at the people actually behind it, there are some pretty big names. A little-known guy named Steven Spielberg is involved, as is Joe Russo of the Russo brothers, who have done a lot of the Marvel movies, and one of the co-founders is Tye Sheridan. If Tye Sheridan looks familiar, that's because he was the main character in Ready Player One. He's the co-founder of this company. Some of the investors include Epic Games and Samsung, among other really big names. You should definitely check out this site and what it's capable of. They do have a waitlist for beta access. I put myself on the list, and you know I'll be making videos about this if I ever get access. Another interesting point is that I found this TechCrunch article.
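Those behind-the-scenes steps are classic pipeline architecture: each stage takes the frame data from the previous one and adds its own output. Here's a toy sketch of that structure in Python; every function name and string here is a made-up stand-in for illustration, not Wonder Dynamics' actual API.

```python
# Toy staged pipeline in the spirit of the steps described above.

def capture_motion(frame):    return {**frame, "pose": "skeleton"}
def mask_actor(frame):        return {**frame, "mask": "actor isolated"}
def clean_background(frame):  return {**frame, "plate": "actor removed"}
def track_camera(frame):      return {**frame, "camera": "solved"}
def retarget_to_cg(frame):    return {**frame, "render": "CG character composited"}

PIPELINE = [capture_motion, mask_actor, clean_background,
            track_camera, retarget_to_cg]

def process(frame):
    """Run one frame through every stage in order."""
    for stage in PIPELINE:
        frame = stage(frame)
    return frame

result = process({"id": 1})
print(sorted(result))  # each stage added its output to the frame dict
```

Structuring it this way is also what makes a cloud-hosted, web-based editor plausible: each stage can run as its own service on the backend.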
I was very skeptical when I first came across this one, and the TechCrunch article basically says, yes, it sounds a bit like overpromising, and your skepticism is warranted. But as a skeptic myself, I have to say I was extremely impressed with what the startup showed of Wonder Studio, the company's web-based editor. That's another key piece there: it's a web-based editor, meaning you're most likely not going to need some insane graphics card or super high-end computer to do this. If it's a web-based editor, there's a decent chance a lot of the processing happens in the cloud behind the scenes, so you won't have to have the most high-end computer to even use this kind of tool, which in itself is pretty damn exciting. Personally, I can't wait to get my hands on it. If this is something that excites you as much as it does me, go to wonderdynamics.com, click on "get started," and they have a little form to get on their beta list. Hopefully some of us will be getting beta access soon, so we can play with it in real time and see if it's really as good as some of these videos make it seem, because this one's damn exciting. So that's what I've got for you. Lots and lots of cool things coming out: Midjourney version 5, maybe GPT-4 next week, we'll see. Some really cool models getting released around animation. You can talk to the D-ID chatbot now. And a really, really crazy-looking computer graphics tool from Wonder Dynamics.
I can't wait to play with it myself. I'm nerding out about this stuff, I'm super excited about it, and hopefully you're excited about this kind of stuff too. And again, I'm going to be making a lot more videos like this. I love doing this, where I come across six or seven really crazy advancements going on in AI and tech right now and just do a breakdown of: check this out, check this out, check this out. That's kind of my game plan with this channel, to switch back and forth between that kind of video and some really cool tutorial videos. Hopefully you like that. If you do, make sure you press the like button and you'll see a lot more AI videos in your feed. If you press the subscribe button, you'll probably see a lot more videos from me in your feed, and it will also make me feel real warm and fuzzy that you like my videos. I really appreciate you.
You want to nerd out some more? Head on over to futuretools.io and click on this button to join the free newsletter. I only send you an email once a week, every Friday, with the five coolest tools I came across and essentially the TL;DR of what happened in AI for the week.
Lots of stuff is happening every week; you blink, you might miss it. So make sure you're on that newsletter, because I'll fill you in on everything you might have missed. It goes out every single Friday. If you come to futuretools.io and click on the "join the free newsletter" button, I'll hook you up starting this Friday. Once again, thanks for tuning in. Really appreciate you. See you guys in the next one.