Building defect recognition focuses on analyzing images of building surfaces in order to identify and classify existing defects [1]. These defects vary in type, including cracks, breakage, mildew, discoloration and more, and may affect both the structural safety and the aesthetic quality of buildings [2]. Although manual inspection can identify defects to a certain extent, its low efficiency, high cost and susceptibility to subjective error pose considerable challenges in large-scale recognition tasks and real-time monitoring scenarios [3], [4]. Automated defect detection is therefore important not only for the timely identification and repair of building problems, but also for reducing potential safety risks and minimizing maintenance costs [5]. In addition, it contributes to the intelligent and sustainable management of building infrastructure.
Recently, automated methods for recognizing building defects have mainly been based on convolutional neural networks (CNNs) and their variants [6], [7], [8]. With continuous technological advances, these models have improved their ability to recognize various defects in buildings. In practical applications, however, managers do not attend to all defects indiscriminately; they concentrate on certain types of defects, such as severe structural defects that pose safety risks or surface damage that affects aesthetics. This targeted detection enables more efficient resource allocation and better informed maintenance planning. For example, as shown in Fig. 1, managers may be particularly concerned about the crack on the balcony (shown in the yellow box), since this defect could pose a safety risk to pedestrians and thus requires immediate attention. To meet this need, we define a new task called specific defect recognition. In contrast to the existing general recognition task, which recognizes all defects indiscriminately, our task aims to precisely identify specific defects in buildings based on management requirements. To reflect maintenance requirements, the task uses text descriptions to convey them. Text descriptions are clear and intuitive, which makes maintenance problems easier to express and understand. As shown in Fig. 1, the text guides a detection model to precisely locate the required specific defect. Research on this new task not only provides technical support for improving the efficiency of repair operations, but also optimizes resource allocation by ensuring that critical defects are prioritized. As a result, it can improve the accuracy and reliability of maintenance processes and contribute to more effective and intelligent building management.
Existing methods for recognizing building defects [7], [9], [10] show two main limitations when applied to our specific defect recognition task: (i) the lack of adequate training and validation datasets, which makes it difficult for models to learn effectively and generalize well; and (ii) the lack of mechanisms for processing text descriptions, which means that the models cannot use precise textual information to locate specific defects. To address the first limitation, we construct a new dataset that pairs building defects with their corresponding text descriptions and offers a valuable benchmark for future research in this area. As shown in Fig. 1, a data sample contains an image of building defects, in which several defects may be present. However, only one specific defect is selected as the target and annotated with a corresponding text description. In the text description, the visual features of the defect, e.g. ...
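As a minimal sketch of what one such data sample might look like in code, the structure below pairs an image with a single target defect and its text description. The class name `DefectSample`, the field names, the box convention (pixel corner coordinates) and the example values are all hypothetical placeholders for illustration; the actual annotation format of the dataset is not specified here.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class DefectSample:
    """One hypothetical sample: an image, one target defect, one description."""
    image_path: str
    # Assumed convention: (x_min, y_min, x_max, y_max) in pixels.
    target_box: Tuple[float, float, float, float]
    description: str


def is_valid(sample: DefectSample) -> bool:
    """Check that the box has positive extent and the description is non-empty."""
    x_min, y_min, x_max, y_max = sample.target_box
    has_extent = x_max > x_min and y_max > y_min
    has_text = bool(sample.description.strip())
    return has_extent and has_text


# Placeholder values only; the image may show several defects,
# but exactly one is annotated as the target.
sample = DefectSample(
    image_path="images/example_0001.jpg",
    target_box=(120.0, 40.0, 220.0, 90.0),
    description="the crack on the balcony",
)
```

Keeping exactly one target box per sample mirrors the task definition: other defects may be visible in the image, but only the described one counts as the correct prediction.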
To address the second limitation, we propose a model called the Multi-Modal Query Transformer (MMQ Transformer), which consists of three modules: the feature extractor, the multimodal query generator and the multimodal fusion module. The MMQ Transformer introduces a mechanism for processing textual information and aligns multimodal data (i.e. text descriptions and images) in order to precisely identify specific defects in images. First, the feature extractor processes the raw image and text inputs, extracts visual and semantic features, and maps them into a unified shared vector space to establish an initial correlation between image and text. Next, the multimodal query generator integrates object, image and text queries to generate a set of query vectors that carry both visual and semantic information. These vectors serve as input to the multimodal fusion module. Given the capabilities of large models on related tasks, we adopt a pre-trained MDETR model [11] as the basis of our model design. Although the pre-training of large models mainly focuses on certain objects, the learned shape, color and other features remain valuable for our task. In addition to the object feature query vectors, we integrate visual and semantic information to better adapt the model to the specific defect recognition task. We treat the object feature query vectors as prior knowledge and use them as the basis for guiding the model to identify and localize defects more effectively. Finally, the multimodal fusion module enables interactions between these queries and the extracted multimodal features, predicting the bounding box and aligning it with the text description of the specific defect, thereby achieving high-precision defect localization and matching.
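To make the query generation step more concrete, the following is a minimal sketch of one way object queries could be enriched with image and text information. It is not the model's actual implementation: the function names, the use of mean pooling to obtain one image query and one text query, and the additive combination are all assumptions made for illustration, and plain Python lists stand in for learned tensors.

```python
from typing import List

Vector = List[float]


def mean_pool(features: List[Vector]) -> Vector:
    """Average equally sized feature vectors into a single vector."""
    dim = len(features[0])
    return [sum(f[i] for f in features) / len(features) for i in range(dim)]


def generate_multimodal_queries(
    object_queries: List[Vector],
    image_features: List[Vector],
    text_features: List[Vector],
) -> List[Vector]:
    """Add pooled image and text context to every object query.

    Assumption: all vectors already live in the same shared space,
    as produced by a feature extractor, and fusion is a simple sum.
    """
    image_query = mean_pool(image_features)  # one vector summarizing the image
    text_query = mean_pool(text_features)    # one vector summarizing the text
    return [
        [q + i + t for q, i, t in zip(query, image_query, text_query)]
        for query in object_queries
    ]


# Toy example with 2-dimensional features.
object_queries = [[0.0, 1.0], [1.0, 0.0]]   # stand-ins for pre-trained queries
image_features = [[1.0, 1.0], [3.0, 1.0]]   # e.g. two image patches
text_features = [[0.5, 0.5]]                # e.g. one text token
queries = generate_multimodal_queries(object_queries, image_features, text_features)
print(queries)  # [[2.5, 2.5], [3.5, 1.5]]
```

In this sketch the object queries act as the prior knowledge that is kept intact, while the pooled image and text vectors shift every query toward the current input. A real model would more likely use learned projections or attention than a plain sum.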
Experimental results show that the MMQ Transformer significantly outperforms existing models across the evaluation metrics and offers a more precise and intelligent solution for the maintenance and management of buildings.
In summary, our contributions are as follows:
We introduce a new task called specific defect recognition, which aims to precisely locate target defects in buildings based on text descriptions. In contrast to existing methods that recognize all defects, this task emphasizes the role of textual information in defect recognition and enables more targeted and precise identification. This approach not only improves the accuracy of building maintenance, but also optimizes resource allocation and improves overall maintenance efficiency.
We construct a specialized dataset that covers various defect types in different building scenarios to support research on our specific defect recognition task. This dataset offers a solid basis for carrying out the task and evaluating models, laying an important data-driven foundation for future studies.
We propose the MMQ Transformer, a recognition model that enhances multimodal queries by integrating image and text features into pre-trained object queries, enabling more precise and context-aware defect identification. The MMQ Transformer strengthens the multimodal interaction capabilities of the model. Experimental results show significant performance improvements on the specific defect recognition task and offer a valuable framework and direction for future research.