Showing content from https://patents.google.com/patent/CN103617262A/en below:

CN103617262A - Picture content attribute identification method and system

Publication number
CN103617262A
Authority
CN
China
Prior art keywords
picture
pictures
reprints
cluster
same
Prior art date
2013-12-02
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201310632676.8A
Other languages
Chinese (zh)
Other versions
CN103617262B (en)
Inventor
陶哲
白明
韩玉刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qihoo Technology Co Ltd
Original Assignee
Beijing Qihoo Technology Co Ltd
Qizhi Software Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2013-12-02
Filing date
2013-12-02
Publication date
2014-03-05
2013-12-02 Application filed by Beijing Qihoo Technology Co Ltd, Qizhi Software Beijing Co Ltd
2013-12-02 Priority to CN201310632676.8A
2014-03-05 Publication of CN103617262A
2014-09-22 Priority to PCT/CN2014/087109 (WO2015081748A1)
2017-03-08 Application granted
2017-03-08 Publication of CN103617262B
Status: Active
2033-12-02 Anticipated expiration
Abstract (translated from Chinese)

The present invention provides a picture content attribute identification method and system. The method comprises: calculating the relative reprint counts of multiple same-source picture clusters with respect to a specific resource site; training a filter model from the multiple same-source picture clusters and their corresponding relative reprint counts; and identifying the picture content attribute of a target picture cluster using the trained filter model. The advantage of the invention is that the content attribute of a picture can be identified from data on how the picture is reprinted or spread across the network, and in particular the method can be used to judge whether a picture is an advertisement picture.

Description (translated from Chinese)

Picture content attribute identification method and system

Technical field

The present invention relates to the field of image recognition, and in particular to a method and system for identifying picture content attributes.

Background

Advertisement pictures appear on many types of resource sites on the Internet. They come in many varieties, including advertisements for all kinds of goods (for example, milk powder or clothing), advertisements for physical stores, and other types of advertisements.

These advertisement pictures appear not only on merchants' own sites but also on the pages of other resource sites. For example, in communities that allow users to upload pictures (forums, picture sites, and so on), some users upload advertisement pictures. Large numbers of advertisement pictures tend to disturb users, and advertisement pictures unrelated to what the user wants can even appear in picture search results.

In terms of image content, different advertisement pictures do not have many points of similarity, so current image recognition techniques have difficulty identifying the content attribute of a picture, that is, determining which pictures are advertisement pictures. Advertisement pictures therefore cannot be handled in a targeted way, and the user experience inevitably suffers.

Summary of the invention

In view of the above problems, the present invention is proposed to provide a picture content attribute identification method and system that overcome, or at least partially solve, those problems.

According to one aspect of the present invention, a picture content attribute identification method is provided, comprising: calculating the relative reprint counts of multiple same-source picture clusters with respect to a specific resource site; training a filter model from the multiple same-source picture clusters and their corresponding relative reprint counts; and identifying the picture content attribute of a target picture cluster using the trained filter model.

Optionally, the step of calculating the relative reprint counts of multiple same-source picture clusters with respect to a specific resource site comprises: for one same-source picture cluster among the multiple clusters, comparing the number of reprints of the cluster's pictures on the specific resource site with their number of reprints on multiple resource sites, to obtain the cluster's relative reprint count with respect to the specific resource site, where the multiple resource sites include the specific resource site.

Optionally, the step of comparing the number of reprints of the cluster's pictures on the specific resource site with their number of reprints on multiple resource sites comprises: calculating a first average reprint count of pictures on the specific resource site; calculating a second average reprint count of pictures on the multiple resource sites; taking a first difference between the number of reprints of the cluster's pictures on the specific resource site and the first average reprint count; taking a second difference between the number of reprints of the cluster's pictures on the multiple resource sites and the second average reprint count; and comparing the first difference with the second difference to obtain the cluster's relative reprint count with respect to the specific resource site.

Optionally, the step of calculating the first average reprint count of pictures on the specific resource site comprises: taking, from the pictures of the multiple same-source picture clusters, those located on the specific resource site, and comparing the number of those pictures with the number of same-source picture clusters they belong to, to obtain the first average reprint count.

Optionally, the step of calculating the second average reprint count of pictures on the multiple resource sites comprises: comparing the number of pictures in the multiple same-source picture clusters with the number of those clusters, to obtain the second average reprint count.

Optionally, before the step of comparing the number of reprints of the cluster's pictures on the specific resource site with their number of reprints on multiple resource sites, the method further comprises: crawling picture links that appear on the multiple resource sites; detecting whether a picture link is identical to the link of a picture in the same-source picture cluster, and/or detecting whether the checksum information of the picture at the link is identical to that of a picture in the cluster, and/or detecting whether the picture at the link shares one or more image features with a picture in the cluster; and, based on the detection result, determining whether the picture link is a reprint of a picture in the cluster and counting the number of reprints of the cluster's pictures.

Optionally, the specific resource site is, for each same-source picture cluster among the multiple clusters, the resource site that reprints that cluster's pictures the most.

Optionally, the pictures of each same-source picture cluster correspond to the same source picture, and each picture in the cluster shares one or more image features with that source picture.

Optionally, the method further comprises: extracting format features and/or link features of the pictures contained in the same-source picture clusters, and training the filter model from the multiple same-source picture clusters, their corresponding relative reprint counts, and the format features of the pictures they contain; and, using the trained filter model, identifying the picture content attribute of the target picture cluster based on the relative reprint count and on the format features and/or link features of the pictures contained in the target picture cluster.

Optionally, the format features of a picture include, but are not limited to, one or a combination of the following: the picture's length/width, the picture's size, and the picture's clarity.

Optionally, the link features of a picture include, but are not limited to, one or a combination of the following: whether the picture link is on the same site as the web page, and whether the picture's jump link leads off-site.

According to another aspect of the present invention, a picture content attribute identification system is provided, comprising: a relative reprint count calculation module, configured to calculate the relative reprint counts of multiple same-source picture clusters with respect to a specific resource site; a training module, configured to input the multiple same-source picture clusters and their corresponding relative reprint counts into a filter to train a filter model; a filter, adapted to obtain the trained filter model from the training module and to screen a target picture cluster according to the model; and an identification module, configured to identify the picture content attribute of the target picture cluster based on the filter's screening of that cluster.

Optionally, for one same-source picture cluster among the multiple clusters, the relative reprint count calculation module compares the number of reprints of the cluster's pictures on the specific resource site with their number of reprints on multiple resource sites, to obtain the cluster's relative reprint count with respect to the specific resource site, where the multiple resource sites include the specific resource site.

Optionally, the system further comprises: a first average reprint count calculation module, configured to calculate a first average reprint count of pictures on the specific resource site; and a second average reprint count calculation module, configured to calculate a second average reprint count of pictures on the multiple resource sites. The relative reprint count calculation module takes a first difference between the number of reprints of the cluster's pictures on the specific resource site and the first average reprint count, takes a second difference between the number of reprints of the cluster's pictures on the multiple resource sites and the second average reprint count, and compares the first difference with the second difference to obtain the cluster's relative reprint count with respect to the specific resource site.

Optionally, the first average reprint count calculation module takes, from the pictures of the multiple same-source picture clusters, those located on the specific resource site, and compares the number of those pictures with the number of same-source picture clusters they belong to, to obtain the first average reprint count.

Optionally, the second average reprint count calculation module compares the number of pictures in the multiple same-source picture clusters with the number of those clusters, to obtain the second average reprint count.

Optionally, the system further comprises: a picture link crawling module, configured to crawl picture links that appear on the multiple resource sites; a picture link detection module, configured to detect whether a picture link is identical to the link of a picture in the same-source picture cluster, and/or whether the checksum information of the picture at the link is identical to that of a picture in the cluster, and/or whether the picture at the link shares one or more image features with a picture in the cluster; and a picture reprint counting module, configured to determine, based on the detection result, whether the picture link is a reprint of a picture in the cluster, and to count the number of reprints of the cluster's pictures.

Optionally, the specific resource site is, for each same-source picture cluster among the multiple clusters, the resource site that reprints that cluster's pictures the most.

Optionally, the pictures of each same-source picture cluster correspond to the same source picture, and each picture in the cluster shares one or more image features with that source picture.

The picture content attribute identification method and system of the present invention use the relative reprint count of a same-source picture cluster with respect to a specific resource site as training data for the filter model. The relative reprint count reflects the ratio of on-site to off-site reprints of a picture for the specific resource site. A key characteristic of advertisement pictures is that they are reprinted very many times on one particular resource site and markedly fewer times on other resource sites across the Internet. The magnitude of the relative reprint count can therefore be used to distinguish whether a picture is being spread as an advertisement, and a filter model trained on relative reprint counts can identify the content attribute of a picture on its own and accurately determine whether the picture is an advertisement picture.

The above is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented according to the description, and so that the above and other objects, features, and advantages of the invention are easier to understand, specific embodiments of the invention are set out below.

Description of the drawings

Various other advantages and benefits will become clear to those of ordinary skill in the art on reading the detailed description of the preferred embodiments below. The drawings serve only to illustrate the preferred embodiments and are not to be considered as limiting the invention. Throughout the drawings, the same reference symbols denote the same components. In the drawings:

Fig. 1 shows a flowchart of a picture content identification method according to one embodiment of the present invention;

Fig. 2 shows a partial flowchart of a picture content identification method according to one embodiment of the present invention;

Fig. 3 shows a flowchart of a picture content identification method according to one embodiment of the present invention;

Fig. 4 shows a block diagram of a picture content identification system according to one embodiment of the present invention;

Fig. 5 shows a block diagram of a picture content identification system according to one embodiment of the present invention;

Fig. 6 shows a block diagram of a picture content identification system according to one embodiment of the present invention.

Detailed description

Exemplary embodiments of the present disclosure are described in more detail below with reference to the drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure can be understood more thoroughly and its scope conveyed fully to those skilled in the art.

As shown in Fig. 1, one embodiment of the present invention provides a picture content attribute identification method comprising the following steps.

Step 110: calculate the relative reprint counts of multiple same-source picture clusters with respect to a specific resource site. Each picture cluster is an aggregation of a group of pictures, for example a group of pictures with high mutual similarity. The relative reprint count is a figure that reflects the ratio of on-site to off-site reprints of a cluster's pictures for the specific resource site. It can be calculated in many ways, and this embodiment does not restrict the calculation method.

Step 120: train a filter model from the multiple same-source picture clusters and their corresponding relative reprint counts. Study of advertisement pictures shows the following characteristics. Advertisement pictures are costly to produce: many are made by merchants at considerable expense of money and time. Because of that cost, a merchant will spread a single advertisement picture many times. However, essentially only merchants spread these pictures; other users rarely do. This difference in how advertisement pictures spread shows up in reprint counts on resource sites: a picture is reprinted very many times on a particular resource site (where the merchant deliberately spreads it) and far fewer times on other Internet sites (where other users do not spread it). In other words, the ratio of on-site to off-site reprints for the specific resource site is relatively high for advertisement pictures, so the relative reprint count can serve as data for distinguishing advertisement pictures from non-advertisement pictures. Tools for training the filter model include, but are not limited to, the open-source LIBSVM.

Step 130: identify the picture content attribute of a target picture cluster using the trained filter model, that is, identify whether the pictures in the target picture cluster are advertisement pictures. This makes it possible to filter or otherwise process advertisement pictures so that they do not affect the user experience. For example, if the target picture cluster is a group of pictures matching a picture search request, the technical solution of this embodiment can identify and filter out the advertisement pictures among them and return the non-advertisement pictures as search results, preserving the user experience.
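As a rough illustration of steps 110 to 130, the sketch below trains a one-feature threshold classifier (a decision stump) on labeled relative reprint counts and applies it to target clusters. The stump is a stand-in chosen to keep the example dependency-free; the description names LIBSVM as one possible training tool and does not prescribe a model. The function names, labels, and sample values are all hypothetical.

```python
from typing import List, Tuple


def train_threshold(samples: List[Tuple[float, bool]]) -> float:
    """Pick the cut on the relative reprint count that misclassifies the fewest samples.

    Each sample is (relative_reprint_count, is_advertisement).
    A cluster is predicted to be an advertisement when its count is >= the cut.
    """
    values = sorted({value for value, _ in samples})
    # Candidate cuts: midpoints between neighbouring distinct values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])] or values

    def errors(cut: float) -> int:
        return sum((value >= cut) != is_ad for value, is_ad in samples)

    return min(candidates, key=errors)


def is_advertisement(relative_reprint_count: float, cut: float) -> bool:
    return relative_reprint_count >= cut


# Hypothetical training data: (relative reprint count, labeled as advertisement?)
training = [(0.95, True), (0.86, True), (0.80, True),
            (0.30, False), (0.22, False), (0.10, False)]
cut = train_threshold(training)

# Hypothetical target clusters, e.g. candidates for a picture search result.
targets = {"cluster_x": 0.90, "cluster_y": 0.15}
search_results = [name for name, value in targets.items()
                  if not is_advertisement(value, cut)]
```

Here only `cluster_y` survives the filter, since its relative reprint count falls below the learned cut. A real filter would likely use more features and a stronger model.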

In practical applications, other features are considered alongside the relative reprint count proposed by the present invention, such as the picture's length/width, the picture's size, the picture's clarity, whether the picture link is on the same site as the web page, and whether the picture's jump link leads off-site. When the filter is trained, it first learns from the relative reprint count of each same-source picture cluster together with one or more of these features of the pictures in the cluster. When a target picture cluster is identified, one or more of the same features are likewise consulted to screen the cluster and determine whether its pictures are advertisement pictures.
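One way such a combined feature vector might be assembled is sketched below. The record fields, the reading of "length/width" as an aspect ratio, and the use of host-name equality as the "same site" test are all assumptions made for illustration; clarity is left out because measuring it would require decoding the image.

```python
from dataclasses import dataclass
from typing import List
from urllib.parse import urlparse


@dataclass
class PictureRecord:
    # All field names are hypothetical; the text only lists the kinds of feature.
    width: int
    height: int
    size_bytes: int
    picture_url: str   # where the picture file is hosted
    page_url: str      # page on which the picture appears
    jump_url: str      # link followed when the picture is clicked ("" if none)


def host(url: str) -> str:
    """Lower-cased host name of a URL, or "" if it has none."""
    return (urlparse(url).hostname or "").lower()


def feature_vector(relative_reprint_count: float, pic: PictureRecord) -> List[float]:
    same_site = host(pic.picture_url) == host(pic.page_url)
    off_site_jump = bool(pic.jump_url) and host(pic.jump_url) != host(pic.page_url)
    return [
        relative_reprint_count,
        pic.width / pic.height,          # assumed reading of "length/width"
        float(pic.size_bytes),
        1.0 if same_site else 0.0,       # picture link on the same site as the page?
        1.0 if off_site_jump else 0.0,   # jump link leads off-site?
    ]


pic = PictureRecord(800, 400, 120000,
                    "http://img.example.com/a.jpg",
                    "http://bbs.example.org/t/1",
                    "http://shop.example.net/item")
vector = feature_vector(0.86, pic)
```

A vector of this shape could be passed to whatever training tool is used for the filter model.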

Another embodiment of the present invention provides a picture content attribute identification method. Compared with the embodiment above, step 110 of this method may comprise: for one same-source picture cluster among the multiple clusters, comparing the number of reprints of the cluster's pictures on the specific resource site (for example, 30 reprints on picture site A) with their number of reprints on multiple resource sites (for example, 35 reprints in total across 10 picture sites, including picture site A) to obtain the cluster's relative reprint count with respect to the specific resource site. The multiple resource sites include the specific resource site. This embodiment provides a feasible way of calculating the relative reprint count and does not restrict the specific comparison; for example, either 30/35 or 30/(35-30) may be used as the relative reprint count.
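The two example comparisons, 30/35 and 30/(35-30), can be written as small helper functions. This is a minimal sketch; the function names and the handling of edge cases (such as a cluster with no off-site reprints) are assumptions, not part of the description.

```python
def relative_reprint_ratio(on_specific_site: int, on_all_sites: int) -> float:
    """Share of a cluster's reprints found on the specific site.

    on_all_sites counts reprints across every monitored site,
    including the specific site itself.
    """
    if on_all_sites <= 0:
        raise ValueError("on_all_sites must be positive")
    if not 0 <= on_specific_site <= on_all_sites:
        raise ValueError("on_specific_site must lie between 0 and on_all_sites")
    return on_specific_site / on_all_sites


def inside_outside_ratio(on_specific_site: int, on_all_sites: int) -> float:
    """Reprints on the specific site divided by reprints on all other sites."""
    elsewhere = on_all_sites - on_specific_site
    if elsewhere <= 0:
        return float("inf")  # assumption: no off-site reprints maps to infinity
    return on_specific_site / elsewhere


share = relative_reprint_ratio(30, 35)   # the 30/35 variant
ratio = inside_outside_ratio(30, 35)     # the 30/(35-30) variant
```

Both variants grow as reprints concentrate on the specific site; the first is bounded by 1, the second is unbounded.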

As shown in Fig. 2, another embodiment of the present invention provides a picture content attribute identification method. Compared with the embodiment above, step 110 of this method comprises the following steps.

Step 111: calculate a first average reprint count of pictures on the specific resource site; for example, suppose the first average reprint count for picture site A is 5.

Step 112: calculate a second average reprint count of pictures on the multiple resource sites; for example, suppose the second average reprint count for the 10 picture sites (including picture site A) is 20.

Step 113: take a first difference between the number of reprints of the cluster's pictures on the specific resource site and the first average reprint count. The first difference reflects how the reprinting of the cluster's pictures differs from that of other pictures on the specific resource site; the larger it is, the more likely the cluster consists of advertisement pictures. With the figures from the preceding embodiment, the first difference is 30-5=25. Also take a second difference between the number of reprints of the cluster's pictures on the multiple resource sites and the second average reprint count. The second difference reflects how the reprinting of the cluster's pictures differs from that of other pictures across the multiple resource sites; the larger it is, the less likely the cluster consists of advertisement pictures. With the figures from the preceding embodiment, the second difference is 35-20=15. Comparing the first difference with the second difference yields the cluster's relative reprint count with respect to the specific resource site.

This embodiment provides another way of calculating the relative reprint count. Because it takes into account how the reprinting of the cluster's pictures differs from that of other pictures, the relative reprint count better reflects whether a picture is an advertisement picture. This embodiment does not restrict how the first and second differences are compared; for example, 25/15 or (25±a)/(15±b) may be used, where a and b are constants.
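The arithmetic of step 113 can be checked with a short function. This sketch uses the plain quotient of the two differences (the 25/15 variant); the function name and the decision to raise an error when the second difference is zero are assumptions.

```python
def difference_relative_reprints(
    reprints_on_site: float,
    reprints_all_sites: float,
    first_average: float,
    second_average: float,
) -> float:
    """Quotient of the first difference and the second difference (step 113)."""
    first_difference = reprints_on_site - first_average       # e.g. 30 - 5 = 25
    second_difference = reprints_all_sites - second_average   # e.g. 35 - 20 = 15
    if second_difference == 0:
        # Assumption: the text does not say how to treat a zero denominator.
        raise ZeroDivisionError("second difference is zero")
    return first_difference / second_difference


# Figures taken from the worked example in the text.
value = difference_relative_reprints(30, 35, first_average=5, second_average=20)
```

The constants a and b mentioned in the text could be added as optional offsets to the numerator and denominator; they are omitted here to keep the example minimal.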

Another embodiment of the present invention provides a picture content attribute identification method. Compared with the embodiment above, step 111 of this method comprises: taking, from the pictures of the multiple same-source picture clusters, those located on the specific resource site, and comparing the number of those pictures with the number of same-source picture clusters they belong to, to obtain the first average reprint count. For example, if picture site A holds 100 pictures that fall into 20 picture clusters, the first average reprint count is 100/20=5. The technical solution of this embodiment provides a quick and efficient way of obtaining the average reprint count.

Another embodiment of the present invention provides a picture content attribute identification method. Compared with the embodiment above, step 112 of this method comprises: comparing the number of pictures in the multiple same-source picture clusters with the number of those clusters, to obtain the second average reprint count. For example, if 10 picture sites (including picture site A) hold 1000 pictures that can be clustered into 50 picture clusters, the second average reprint count is 1000/50=20. The technical solution of this embodiment provides a quick and efficient way of obtaining the average reprint count.
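Both averages are a count of pictures divided by a count of distinct clusters, so a single helper can cover steps 111 and 112. The sketch below assumes each crawled picture occurrence is recorded as a hypothetical `(site, cluster_id)` pair; the synthetic data is shaped only to reproduce the figures 100/20 and 1000/50.

```python
from typing import Iterable, Optional, Tuple

Record = Tuple[str, str]  # (site, cluster_id), one record per picture occurrence


def average_reprints(records: Iterable[Record], site: Optional[str] = None) -> float:
    """Pictures per distinct cluster, optionally restricted to one site."""
    selected = [r for r in records if site is None or r[0] == site]
    clusters = {cluster_id for _, cluster_id in selected}
    if not clusters:
        raise ValueError("no pictures found for the requested site")
    return len(selected) / len(clusters)


# Site A: 100 pictures spread over 20 clusters.
records = [("A", f"c{i % 20}") for i in range(100)]
# Nine other sites: 900 pictures, so that all 10 sites together cover 50 clusters.
records += [(f"S{i % 9}", f"c{i % 50}") for i in range(900)]

first_average = average_reprints(records, site="A")   # 100 / 20
second_average = average_reprints(records)            # 1000 / 50
```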

As shown in Fig. 3, another embodiment of the present invention provides a picture content attribute identification method. Compared with the embodiment above, this method further comprises the following steps before step 110.

Step 101: crawl the picture links (URLs) that appear on the multiple resource sites.

Step 102: detect whether a picture link is identical to the link of a picture in the same-source picture cluster, which shows whether one picture is being reprinted via different URLs; and/or detect whether the checksum information (including but not limited to an MD5 value) of the picture at the link is identical to that of a picture in the cluster, which shows whether multiple identical copies of a picture exist; and/or detect whether the picture at the link shares one or more image features with a picture in the cluster, which shows whether multiple pictures are identical or were derived by modifying the same picture. The image features in this embodiment include, but are not limited to, contour features, color features, and histogram features.

Step 103: based on the detection result, determine whether the picture link is a reprint of a picture in the same-source picture cluster, and count the number of reprints of the cluster's pictures.

This embodiment thus provides a technical solution for counting picture reprints comprehensively.
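A minimal sketch of steps 102 and 103 follows, covering only the URL and MD5 checks. The image-feature comparison is omitted because it would need an image-processing library, and the crawler of step 101 is replaced by a hard-coded list. The function names and the `(site, url, bytes)` record layout are assumptions.

```python
import hashlib
from typing import Dict, Iterable, Set, Tuple


def md5_hex(data: bytes) -> str:
    """MD5 checksum of a picture's raw bytes, as a hex string."""
    return hashlib.md5(data).hexdigest()


def count_reprints(
    crawled: Iterable[Tuple[str, str, bytes]],
    cluster_urls: Set[str],
    cluster_checksums: Set[str],
) -> Dict[str, int]:
    """Count, per site, crawled pictures that match the cluster by URL or checksum.

    crawled yields (site, picture_url, picture_bytes) triples.
    """
    counts: Dict[str, int] = {}
    for site, url, data in crawled:
        if url in cluster_urls or md5_hex(data) in cluster_checksums:
            counts[site] = counts.get(site, 0) + 1
    return counts


source = b"ad-picture-bytes"  # stand-in for the source picture's file contents
crawled = [
    ("A", "http://a.example/1.jpg", source),          # same bytes, new URL
    ("A", "http://a.example/2.jpg", source),          # same bytes, new URL
    ("B", "http://cdn.example/ad.jpg", b"other"),     # known URL of the cluster
    ("B", "http://b.example/x.jpg", b"unrelated"),    # no match
]
counts = count_reprints(crawled, {"http://cdn.example/ad.jpg"}, {md5_hex(source)})
```

The resulting per-site counts are the raw reprint numbers that the relative reprint count is built from.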

Another embodiment of the present invention provides a picture content attribute identification method. Compared with the above embodiment, in the method of this embodiment the specific resource site is, for each same-source picture cluster among the multiple clusters, the resource site that reprints the most pictures of that cluster. The site that reprints a picture the most times is likely to be the site through which the merchant behind an advertisement picture spreads it, so the reprint count on that site most effectively reflects whether the picture is an advertisement picture.

Another embodiment of the present invention provides a picture content attribute identification method. Compared with the above embodiment, in the method of this embodiment the pictures of each same-source picture cluster correspond to the same source picture, and each picture in the cluster shares one or more image features with that source picture. In the technical solution of this embodiment, the pictures of each same-source picture cluster are therefore either identical or obtainable by modifying the same picture. The image features in this embodiment include, but are not limited to, contour features, color features, and histogram features.

As shown in FIG. 4, an embodiment of the present invention provides a picture content attribute identification system, which includes the following modules. A relative reprint count calculation module 210 calculates the relative reprint counts of multiple same-source picture clusters with respect to a specific resource site. Each picture cluster is an aggregation of a group of pictures, for example a group of highly similar pictures, and the relative reprint count is a value that reflects the ratio between reprints of a cluster's pictures inside and outside the specific resource site. There are many ways to calculate the relative reprint count, and this embodiment does not limit the calculation method. A training module 220 inputs the multiple same-source picture clusters and their corresponding relative reprint counts into the filter to train the filter model. Research on advertisement pictures shows that they have the following characteristics. Advertisement pictures are expensive to produce: merchants spend money and time making them, so a merchant will spread one advertisement picture many times. However, essentially only the merchant spreads these pictures, while other users rarely do. This difference in how advertisement pictures spread ultimately shows up in the reprint counts on resource sites: the picture is reprinted very many times on a specific resource site (the merchant spreads it deliberately) and far fewer times on other Internet sites (other users do not spread it). In other words, the ratio of reprints inside the specific resource site to reprints outside it is relatively high for advertisement pictures, so the relative reprint count can serve as data for distinguishing advertisement pictures from non-advertisement pictures. A filter 230 is adapted to obtain the trained filter model from the training module and to screen the target picture cluster according to that model; the filter used in this embodiment includes, but is not limited to, the open-source LIBSVM. An identification module 240 identifies the picture content attribute of the target picture cluster according to the filter's screening, that is, identifies whether the pictures in the target picture cluster are advertisement pictures.

In addition, in practical applications the system further includes a picture format feature module 310 and/or a picture link feature module 320. The picture format feature module 310 is adapted to extract the format features of the pictures contained in the same-source picture clusters and in the target picture cluster. The picture link feature module 320 is adapted to extract the link features of the pictures contained in the same-source picture clusters and in the target picture cluster. The training module 220 is further adapted to input the multiple same-source picture clusters, the corresponding relative reprint counts, and the corresponding picture format features and/or picture link features together into the filter to train the filter model. The filter 230 is further adapted to screen the target picture cluster according to the trained model, in combination with the relative reprint count and the picture format features and/or picture link features of the target picture cluster. The identification module 240 is further used to identify the picture content attribute of the target picture cluster according to the filter's screening, which is based on the relative reprint count and the picture format features and/or picture link features of the target picture cluster.

This facilitates filtering and other processing of advertisement pictures and prevents them from degrading the user experience. Suppose the target picture cluster is a group of pictures that matches a picture search request. According to the technical solution of this embodiment, the advertisement pictures in that group can be identified and filtered out, so that only non-advertisement pictures are returned to the user as search results, which safeguards the user experience.

In practical applications, features other than the relative reprint count proposed by the present invention are also considered, such as the height/width of the picture, the size of the picture, the sharpness of the picture, whether the picture link is on the same site as the web page, and whether the picture's click-through link points off-site. These features are likewise first learned and trained by the classifier. When the target picture cluster is identified, one or more of these other features are also taken into account to screen the cluster and identify whether it contains advertisement pictures.

Another embodiment of the present invention provides a picture content attribute identification system. Compared with the above embodiment, in the system of this embodiment the relative reprint count calculation module 210, for one same-source picture cluster among the multiple clusters, compares the number of reprints of the cluster's pictures on the specific resource site (for example, 30 reprints on picture site A) with the number of reprints on the multiple resource sites (for example, 35 reprints in total on 10 picture sites, including picture site A) to obtain the relative reprint count of the cluster with respect to the specific resource site. The multiple resource sites include the specific resource site. This embodiment provides a feasible way to calculate the relative reprint count and does not limit the specific comparison; for example, either 30/35 or 30/(35-30) may be taken as the relative reprint count.
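The two example comparisons named above, 30/35 and 30/(35-30), can be written out as a short sketch. The function names are hypothetical, and returning infinity when a cluster has no off-site reprints is an assumption of this example rather than something the patent specifies.

```python
import math


def relative_as_share(on_site: int, on_all_sites: int) -> float:
    """Reprints on the specific site as a share of all reprints (30/35)."""
    if on_all_sites <= 0:
        raise ValueError("the total reprint count must be positive")
    return on_site / on_all_sites


def relative_as_inside_outside(on_site: int, on_all_sites: int) -> float:
    """Reprints inside the specific site over reprints outside it (30/(35-30))."""
    off_site = on_all_sites - on_site
    # Assumption: a cluster with no off-site reprints is treated as an infinite ratio.
    return math.inf if off_site == 0 else on_site / off_site


print(round(relative_as_share(30, 35), 3))   # prints 0.857
print(relative_as_inside_outside(30, 35))    # prints 6.0
```

Both forms grow as the reprints concentrate on the specific site; the second form is unbounded, which may need scaling before it is fed to a classifier.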

As shown in FIG. 5, another embodiment of the present invention provides a picture content attribute identification system. Compared with the above embodiment, the system of this embodiment further includes: a first average reprint count calculation module 250, used to calculate the first average reprint count of the pictures on the specific resource site, for example assuming that the first average reprint count of picture site A is 5; and a second average reprint count calculation module 260, used to calculate the second average reprint count of the pictures on the multiple resource sites, for example assuming that the second average reprint count of the 10 picture sites (including picture site A) is 20. The relative reprint count calculation module 210 takes the first difference between the number of reprints of the cluster's pictures on the specific resource site and the first average reprint count. The first difference reflects how the reprinting of the cluster's pictures differs from that of other pictures on the specific resource site: the larger the difference, the more likely the same-source picture cluster is an advertisement picture. With the figures from the preceding embodiments, the first difference is 30-5 = 25. The module also takes the second difference between the number of reprints of the cluster's pictures on the multiple resource sites and the second average reprint count. The second difference reflects how the reprinting of the cluster's pictures differs from that of other pictures across the multiple resource sites: the larger the difference, the less likely the same-source picture cluster is an advertisement picture. With the figures from the preceding embodiments, the second difference is 35-20 = 15. Comparing the first difference with the second difference yields the relative reprint count of the same-source picture cluster with respect to the specific resource site. This embodiment provides another way to calculate the relative reprint count; because it takes into account how the reprinting of the cluster's pictures differs from that of other pictures, the relative reprint count better reflects whether a picture is an advertisement picture. This embodiment does not limit how the first and second differences are compared; for example, 25/15 or (25±a)/(15±b) are both acceptable, where a and b are constants.
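A minimal sketch of this difference-based calculation follows. The function name and parameter names are hypothetical, the constants a and b are modeled as signed offsets that default to zero, and rejecting a zero denominator is an assumption of the example.

```python
def difference_based_relative_count(site_count, site_average,
                                    all_count, all_average,
                                    a=0.0, b=0.0):
    """Ratio of the first difference to the second difference.

    a and b are the optional constants from the text, passed as signed
    offsets (a positive value adds, a negative value subtracts).
    """
    first_difference = site_count - site_average    # e.g. 30 - 5 = 25
    second_difference = all_count - all_average     # e.g. 35 - 20 = 15
    denominator = second_difference + b
    if denominator == 0:
        # Assumption: the text does not say how a zero denominator is handled.
        raise ValueError("second difference plus b must be non-zero")
    return (first_difference + a) / denominator


relative = difference_based_relative_count(30, 5, 35, 20)
print(round(relative, 4))  # prints 1.6667
```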

Another embodiment of the present invention provides a picture content attribute identification system. Compared with the above embodiment, in the system of this embodiment the first average reprint count calculation module 250 takes, from the pictures of the multiple same-source picture clusters, those located on the specific resource site, and compares the number of those pictures with the number of same-source picture clusters they belong to, obtaining the first average reprint count. For example, if picture site A hosts 100 pictures and those 100 pictures fall into 20 picture clusters, the first average reprint count is 100/20 = 5. The technical solution of this embodiment thus provides a fast and efficient way to obtain the average reprint count.
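The site-restricted average can be sketched as below. The representation of crawl results as (site, cluster id) pairs, one pair per picture occurrence, and the function name are assumptions made for illustration.

```python
def first_average_reprint_count(occurrences, site):
    """Pictures on one site divided by the number of clusters they belong to.

    occurrences is a hypothetical list of (site, cluster_id) pairs, one pair
    per picture occurrence found by the crawler.
    """
    clusters_on_site = [cluster for s, cluster in occurrences if s == site]
    if not clusters_on_site:
        raise ValueError(f"no pictures found on site {site!r}")
    return len(clusters_on_site) / len(set(clusters_on_site))


# Worked example from the text: 100 pictures on site A spread over 20 clusters.
occurrences = [("A", i % 20) for i in range(100)] + [("B", 0), ("B", 1)]
print(first_average_reprint_count(occurrences, "A"))  # prints 5.0
```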

Another embodiment of the present invention provides a picture content attribute identification system. Compared with the above embodiment, in the system of this embodiment the second average reprint count calculation module 260 compares the number of pictures in the multiple same-source picture clusters with the number of those clusters to obtain the second average reprint count. For example, if 10 picture sites (including picture site A) host 1000 pictures, and those 1000 pictures can be clustered into 50 picture clusters, the second average reprint count is 1000/50 = 20. The technical solution of this embodiment thus provides a fast and efficient way to obtain the average reprint count.

As shown in FIG. 6, another embodiment of the present invention provides a picture content attribute identification system. Compared with the above embodiment, the system of this embodiment further includes: a picture link crawling module 270, used to crawl the picture links (URLs) that appear on the multiple resource sites; a picture link detection module 280, used to detect whether a picture link is the same as the link of a picture in the same-source picture cluster, which reflects whether one picture has been reprinted under different URLs, and/or to detect whether the check information (including but not limited to the MD5 value) of the picture behind the link is the same as that of a picture in the same-source picture cluster, which reflects whether multiple identical copies of the picture exist, and/or to detect whether the picture behind the link shares one or more image features with a picture in the same-source picture cluster, which reflects whether the pictures are identical or were obtained by modifying the same picture; the image features in this embodiment include, but are not limited to, contour features, color features, and histogram features; and a picture reprint count statistics module 290, used to determine from the detection results whether the picture link is a reprint of a picture in the same-source picture cluster, and to count the reprints of the pictures in that cluster. This embodiment thus provides a technical solution that can count picture reprints comprehensively.

Another embodiment of the present invention provides a picture content attribute identification system. Compared with the above embodiment, in the system of this embodiment the specific resource site is, for each same-source picture cluster among the multiple clusters, the resource site that reprints the most pictures of that cluster. The site that reprints a picture the most times is likely to be the site through which the merchant behind an advertisement picture spreads it, so the reprint count on that site most effectively reflects whether the picture is an advertisement picture.

Another embodiment of the present invention provides a picture content attribute identification system. Compared with the above embodiment, in the system of this embodiment the pictures of each same-source picture cluster correspond to the same source picture, and each picture in the cluster shares one or more image features with that source picture. In the technical solution of this embodiment, the pictures of each same-source picture cluster are therefore either identical or obtainable by modifying the same picture. The image features in this embodiment include, but are not limited to, contour features, color features, and histogram features.

The algorithms and displays presented herein are not inherently related to any particular computer, virtual system, or other device. Various general-purpose systems may also be used with the teachings herein. The structure required to construct such a system is apparent from the above description. Furthermore, the present invention is not directed at any particular programming language. It should be understood that the content of the invention described herein can be implemented in various programming languages, and that the above description of a specific language is given to disclose the best mode of the invention.

Numerous specific details are set forth in the description provided herein. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques have not been shown in detail so as not to obscure the understanding of this description.

Similarly, it should be understood that, in order to streamline this disclosure and aid the understanding of one or more of the various inventive aspects, the features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof in the above description of exemplary embodiments. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the Detailed Description are hereby expressly incorporated into this Detailed Description, with each claim standing on its own as a separate embodiment of the invention.

Those skilled in the art will understand that the modules in the device of an embodiment can be adaptively changed and placed in one or more devices different from that embodiment. The modules, units, or components of the embodiments may be combined into one module, unit, or component, and may furthermore be divided into multiple sub-modules, sub-units, or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings), and all processes or units of any method or device so disclosed, may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, an equivalent, or a similar purpose.

Furthermore, those skilled in the art will understand that although some embodiments described herein include some features included in other embodiments and not others, combinations of features from different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.

The various component embodiments of the present invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art should understand that a microprocessor or a digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the picture content attribute identification system according to embodiments of the present invention. The present invention can also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for performing part or all of the methods described herein. Such a program implementing the present invention may be stored on a computer-readable medium, or may take the form of one or more signals. Such a signal may be downloaded from an Internet site, provided on a carrier signal, or provided in any other form.

It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention can be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means can be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any order; these words may be interpreted as names.

Claims (10)

1. A picture content attribute identification method, comprising:
calculating the relative reprint counts of multiple same-source picture clusters with respect to a specific resource site;
training a filter model according to the multiple same-source picture clusters and the corresponding relative reprint counts;
identifying picture content attributes in a target picture cluster according to the trained filter model.

2. The picture content attribute identification method according to claim 1, wherein the step of calculating the relative reprint counts of multiple same-source picture clusters with respect to a specific resource site comprises:
for one same-source picture cluster among the multiple same-source picture clusters, comparing the number of reprints of the pictures in the same-source picture cluster on the specific resource site with the number of reprints on multiple resource sites to obtain the relative reprint count of the same-source picture cluster with respect to the specific resource site, the multiple resource sites including the specific resource site.

3. The picture content attribute identification method according to claim 2, wherein the step of comparing the number of reprints of the pictures in the same-source picture cluster on the specific resource site with the number of reprints on multiple resource sites comprises:
calculating a first average reprint count of the pictures on the specific resource site;
calculating a second average reprint count of the pictures on the multiple resource sites;
taking a first difference between the number of reprints of the pictures in the same-source picture cluster on the specific resource site and the first average reprint count, taking a second difference between the number of reprints of the pictures in the same-source picture cluster on the multiple resource sites and the second average reprint count, and comparing the first difference with the second difference to obtain the relative reprint count of the same-source picture cluster with respect to the specific resource site.

4. The picture content attribute identification method according to claim 3, wherein the step of calculating the first average reprint count of the pictures on the specific resource site comprises:
taking, from the pictures of the multiple same-source picture clusters, multiple pictures located on the specific resource site, and comparing the number of the multiple pictures with the number of same-source picture clusters corresponding to the multiple pictures to obtain the first average reprint count.

5. The picture content attribute identification method according to claim 3, wherein the step of calculating the second average reprint count of the pictures on the multiple resource sites comprises:
comparing the number of pictures in the multiple same-source picture clusters with the number of the multiple same-source picture clusters to obtain the second average reprint count.

6. The picture content attribute identification method according to claim 2, further comprising, before the step of comparing the number of reprints of the pictures in the same-source picture cluster on the specific resource site with the number of reprints on multiple resource sites:
crawling the picture links appearing on the multiple resource sites;
detecting whether a picture link is the same as the link corresponding to a picture of the same-source picture cluster, and/or detecting whether the check information of the picture corresponding to the picture link is the same as the check information of a picture of the same-source picture cluster, and/or detecting whether the picture corresponding to the picture link and a picture of the same-source picture cluster have one or more identical image features;
determining, according to the detection result, whether the picture link is a reprint of a picture of the same-source picture cluster, and counting the reprints of the pictures of the same-source picture cluster.

7. The picture content attribute identification method according to claim 2, wherein
the specific resource site is the resource site that reprints the most pictures of each same-source picture cluster among the multiple same-source picture clusters.

8. The picture content attribute identification method according to any one of claims 1 to 7, wherein
the pictures of each same-source picture cluster correspond to the same source picture, and the pictures of each same-source picture cluster and the corresponding source picture have one or more identical image features.

9. A picture content attribute identification system, comprising:
a relative reprint count calculation module, used to calculate the relative reprint counts of multiple same-source picture clusters with respect to a specific resource site;
a training module, used to input the multiple same-source picture clusters and the corresponding relative reprint counts into a filter to train a filter model;
the filter, adapted to obtain the trained filter model according to the training module and to screen a target picture cluster according to the model;
an identification module, used to identify the picture content attributes in the target picture cluster according to the screening of the target picture cluster by the filter.

10. The picture content attribute identification system according to claim 9, wherein
the relative reprint count calculation module, for one same-source picture cluster among the multiple same-source picture clusters, compares the number of reprints of the pictures in the same-source picture cluster on the specific resource site with the number of reprints on multiple resource sites to obtain the relative reprint count of the same-source picture cluster with respect to the specific resource site, the multiple resource sites including the specific resource site.

CN201310632676.8A, priority date 2013-12-02, filing date 2013-12-02: Picture content attribute identification method and system. Status: Active. Publication: CN103617262B (en).

Priority Applications (2)

Application Number | Publication | Priority Date | Filing Date | Title
CN201310632676.8A | CN103617262B (en) | 2013-12-02 | 2013-12-02 | Picture content attribute identification method and system
PCT/CN2014/087109 | WO2015081748A1 (en) | 2013-12-02 | 2014-09-22 | Method and system for identifying content attribute of picture

Applications Claiming Priority (1)

Application Number | Publication | Priority Date | Filing Date | Title
CN201310632676.8A | CN103617262B (en) | 2013-12-02 | 2013-12-02 | Picture content attribute identification method and system

Family ID: 50167965

Family Applications (1)

Application Number | Status | Publication | Priority Date | Filing Date | Title
CN201310632676.8A | Active | CN103617262B (en) | 2013-12-02 | 2013-12-02 | Picture content attribute identification method and system

Patent Citations (5)

* Cited by examiner, † Cited by third party

Publication number | Priority date | Publication date | Assignee | Title
US5832119A * | 1993-11-18 | 1998-11-03 | Digimarc Corporation | Methods for controlling systems using control signals embedded in empirical data
US5832119C1 * | 1993-11-18 | 2002-03-05 | Digimarc Corp | Methods for controlling systems using control signals embedded in empirical data
CN101071433A * | 2007-05-10 | 2007-11-14 | 腾讯科技(深圳)有限公司 | Picture download system and method
CN102419777A * | 2012-01-10 | 2012-04-18 | 凤凰在线(北京)信息技术有限公司 | Internet picture advertisement filtering system and filtering method thereof
CN102591983A * | 2012-01-10 | 2012-07-18 | 凤凰在线(北京)信息技术有限公司 | Advertisement filter system and advertisement filter method

Cited By (6)

* Cited by examiner, † Cited by third party

Publication number | Priority date | Publication date | Assignee | Title
WO2015081748A1 * | 2013-12-02 | 2015-06-11 | 北京奇虎科技有限公司 | Method and system for identifying content attribute of picture
CN105022738A * | 2014-04-21 | 2015-11-04 | 上海京知信息科技有限公司 | Extracting and mapping method of network picture format file on the basis of histograms
CN103995857A * | 2014-05-14 | 2014-08-20 | 北京奇虎科技有限公司 | Method and device for achieving image search and sorting
CN106599177A * | 2016-12-12 | 2017-04-26 | 国云科技股份有限公司 | A processing method for advertising page shielding
CN106599177B * | 2016-12-12 | 2020-02-14 | 国云科技股份有限公司 | Advertisement page shielding processing method
CN107451180A | 2017-06-13 | 2017-12-08 | 百度在线网络技术(北京)有限公司 | Identify method, apparatus, equipment and the computer-readable storage medium of website affinity

Similar Documents

Publication | Publication Date | Title
CN112001282B | 2024-10-29 | Image recognition method
TWI727202B | 2021-05-11 | Method and system for identifying fraudulent publisher networks
CN103617262B | 2017-03-08 | Picture content attribute identification method and system
US20200175550A1 | 2020-06-04 | Method for identifying advertisements for placement in multimedia content elements
CN111212303B | 2022-05-10 | Video recommendation method, server and computer-readable storage medium
US8850305B1 | 2014-09-30 | Automatic detection and manipulation of calls to action in web pages
CN103617261B | 2017-03-08 | Picture content attribute identification method and system
CN102833233B | 2015-07-01 | Method and device for recognizing web pages
US20130246166A1 | 2013-09-19 | Method for determining an area within a multimedia content element over which an advertisement can be displayed
WO2014173349A1 | 2014-10-30 | Method and device for obtaining web page category standards, and method and device for categorizing web page categories
CN102169533A | 2011-08-31 | Commercial webpage malicious tampering detection method
CN103793461B | 2017-05-31 | The analysis method and device of info web
CN102902790B | 2017-06-06 | Web page classification system and method
CN105786847A | 2016-07-20 | Method and system for displaying structured abstracts of commodity web page in e-commerce website
CN106681989A | 2017-05-17 | Method for predicting microblog forwarding probability
CN109426831A | 2019-03-05 | The method, apparatus and computer equipment of picture Similarity matching and model training
CN102902794B | 2016-08-03 | Web page classification system and method
CN104966109B | 2018-08-14 | Medical laboratory sheet image classification method and device
CN102902792B | 2015-10-21 | list page identification system and method
CN113469138B | 2025-04-29 | Object detection method and device, storage medium and electronic device
CN102890717B | 2016-09-28 | Webpage category knowledge base set up system and method
CN108920955B | 2022-03-11 | Webpage backdoor detection method, device, equipment and storage medium
CN102929948B | 2017-03-08 | list page identification system and method
CN111325705A | 2020-06-23 | Image processing method, device, equipment and storage medium
WO2015081748A1 | 2015-06-11 | Method and system for identifying content attribute of picture

Legal Events

Date | Code | Title
2014-03-05 | PB01 | Publication
2014-04-02 | C10 | Entry into substantive examination
2014-04-02 | SE01 | Entry into force of request for substantive examination
2017-03-08 | C14 | Grant of patent or utility model
2017-03-08 | GR01 | Patent grant
2022-08-09 | TR01 | Transfer of patent right

Effective date of registration: 20220727

Address after: Room 801, 8th floor, No. 104, floors 1-19, building 2, yard 6, Jiuxianqiao Road, Chaoyang District, Beijing 100015

Patentee after: BEIJING QIHOO TECHNOLOGY Co.,Ltd.

Address before: 100088 room 112, block D, 28 new street, new street, Xicheng District, Beijing (Desheng Park)

Patentee before: BEIJING QIHOO TECHNOLOGY Co.,Ltd.

Patentee before: Qizhi software (Beijing) Co.,Ltd.

