Two years ago, Microsoft announced Florence, an AI system that it pitched as a “complete rethinking” of modern computer vision models. Unlike most vision models at the time, Florence was both “unified” and “multimodal,” meaning it could (1) understand language as well as images and (2) handle a range of tasks rather than being limited to specific applications, like generating captions.

Now, as a part of Microsoft’s broader, ongoing effort to commercialize its AI research, Florence is arriving as a part of an update to the Vision APIs in Azure Cognitive Services. The Florence-powered Microsoft Vision Services launches today in preview for existing Azure customers, with capabilities ranging from automatic captioning, background removal and video summarization to image retrieval.

“Florence is trained on billions of image-text pairs. As a result, it’s incredibly versatile,” John Montgomery, CVP of Azure AI, told TechCrunch in an email interview. “Ask Florence to find a particular frame in a video, and it can do that; ask it to tell the difference between a Cosmic Crisp apple and a Honeycrisp apple, and it can do that.”

The AI research community, which includes tech giants like Microsoft, has increasingly coalesced around the idea that multimodal models are the best path forward to more capable AI systems. Naturally, multimodal models - models that, once again, understand multiple modalities, such as language and images or videos and audio - are able to perform tasks in one shot that unimodal models simply cannot (e.g.

Why not string several “unimodal” models together to achieve the same end, like a model that understands only images and another that understands exclusively language? A few reasons, the first being that multimodal models in some cases perform better at the same task than their unimodal counterparts thanks to the contextual information from the additional modalities. For example, an AI assistant that understands images, pricing data and purchasing history is likely to offer better-personalized product suggestions than one that only understands pricing data.

The second reason is that multimodal models tend to be more efficient from a computational standpoint, leading to speedups in processing and (presumably) cost reductions on the backend. Microsoft being the profit-driven business that it is, that is, no doubt, a plus.

So what about Florence? Well, because it understands images, video and language and the relationships between those modalities, it can do things like measure the similarity between images and text or segment objects in a photo and paste them onto another background.

We’ll have to take the company’s word for it.

Montgomery says that Reddit will use the new Florence-powered APIs to generate captions for images on its platform, creating “alt text” so users with vision challenges can better follow along in threads.

“Florence’s ability to generate up to 10,000 tags per image will give Reddit much more control over how many objects in a picture they can identify and help generate much better captions,” Montgomery said. “Reddit will also use the captioning to help all users improve article ranking for searching for posts.”

Microsoft is also using Florence across a swath of its own platforms, products and services.

On LinkedIn, as on Reddit, Florence-powered services will generate captions to edit and support alt text image descriptions. In Microsoft Teams, Florence is driving video segmentation capabilities. PowerPoint, Outlook and Word are leveraging Florence’s image captioning abilities for automatic alt text generation. And Designer and OneDrive, courtesy of Florence, have gained better image tagging, image search and background generation.

Montgomery sees Florence being used by customers for much more down the line, like detecting defects in manufacturing and enabling self-checkout in retail stores. None of those use cases require a multimodal vision model, I’d note. But Montgomery asserts that multimodality adds something valuable to the equation.

“Florence is a complete re-thinking of vision models,” Montgomery said. “Once there’s easy and high-quality translation between images and text, a world of possibilities opens up.
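For readers curious what "measuring the similarity between images and text" can look like in practice, here is a minimal sketch. It assumes, purely for illustration, that a multimodal model has already mapped an image and several candidate captions into the same vector space; the embeddings below are hand-written toy values, and the function names are hypothetical rather than part of Florence or any Azure API. Cosine similarity is one common way to compare such vectors, not necessarily the method Microsoft uses.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def rank_captions(image_embedding, caption_embeddings):
    """Return (caption, score) pairs sorted from best to worst match."""
    scored = [
        (caption, cosine_similarity(image_embedding, vector))
        for caption, vector in caption_embeddings.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical, hand-written embeddings (assumption: a real model would
# produce vectors with hundreds of dimensions, not three).
image_embedding = [0.9, 0.1, 0.3]
caption_embeddings = {
    "a red apple on a table": [0.8, 0.2, 0.4],
    "a dog running on a beach": [0.1, 0.9, 0.2],
}

ranking = rank_captions(image_embedding, caption_embeddings)
```

With these toy values, the apple caption scores higher than the dog caption because its vector points in nearly the same direction as the image vector. A retrieval or captioning service could use a ranking like this to pick the best-matching text for an image.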