The invention discloses a video content judgment method based on character recognition, comprising the following steps: A. taking a screenshot of a video picture; B. calling a pre-trained character detection model to analyze the character areas of the screenshot picture, finding and dividing the character areas in the picture, and obtaining one or more character areas; C. after the character areas are detected, calling a pre-trained character recognition model, performing character recognition on each character area in a loop, and recognizing the character content of each character area; D. carrying out natural language processing on the recognized character content, understanding its semantics, and making corresponding video playing settings. The video content judgment method can run in real time on an embedded platform, can identify the character information in a video, and can configure scenes according to the prompts given by the character information.
Description
Video content judgment method based on character recognition
Technical Field
The invention relates to the technical field of image recognition, and in particular to a video content judgment method based on character recognition.
Background
With the rapid development of artificial intelligence technology, artificial intelligence has gradually entered many aspects of human life. Using artificial intelligence technology to make televisions intelligent can greatly improve the user experience and make people's lives more convenient.
Video images in a television often carry a large amount of information. Besides the picture itself, a frame of image may also contain text information, which usually displays important information about the currently playing scene. Compared with the ever-changing image information, analyzing the text information generally makes it easier to determine which scene is currently playing.
At present, the artificial intelligence technology of most products runs on cloud servers on the internet. Owing to the hardware limitations of Android systems, large-scale computation cannot be run on the device, nor can too many resources (such as CPU time) be occupied, so there is as yet no good technical scheme for character recognition in image scenes running on an embedded platform.
Disclosure of Invention
The invention aims to overcome the defects in the background art and provides a video content judgment method based on character recognition, which can run in real time on an embedded platform, can recognize character information in a video, and can perform scene setting (image or voice settings) according to the prompts given by the character information; it is suitable for specific fields such as the television field.
In order to achieve the technical effects, the invention adopts the following technical scheme:
a video content judgment method based on character recognition comprises the following steps:
A. taking a screenshot of a video picture;
B. calling a pre-trained character detection model to analyze character areas of the screenshot picture, finding out and dividing the character areas in the picture, and obtaining one or more character areas;
C. after the character areas are detected, calling a pre-trained character recognition model, performing character recognition on each character area in a loop, and recognizing the character content of each character area;
D. carrying out natural language processing on the recognized character content, understanding its semantics, and making corresponding video playing settings.
Further, step A also comprises dividing and setting, on the screenshot picture, a plurality of image areas requiring character recognition;
step B specifically comprises the following steps:
B1. calling a pre-trained character detection model to analyze the character areas of the screenshot picture, finding and dividing the character areas in the picture, and obtaining one or more character areas;
B2. if the detected character area is within a preset image area requiring character recognition, proceeding to step C; otherwise, returning to step A.
Further, the character detection model in step B is a convolutional neural network.
Further, the convolutional neural network is a MobileNet-SSD neural network based on TensorFlow.
Further, the training procedure for the convolutional neural network is as follows:
S1. collecting a preset number of video image samples with text content according to the input characteristics of the neural network;
S2. for each video image sample with text content, extracting at least the rectangular frame coordinates of the text, the text content, the text language category, and the image size and image format of the sample itself;
S3. for the image samples and the sample information obtained in steps S1 and S2, generating training files and verification files in the tfrecord format supported by TensorFlow, wherein the training files and the verification files contain different images but store the same image format and image information format (an illustrative sketch of this packing step follows these steps);
S4. training the model with the training files to generate the predetermined character detection model, and verifying the generated character detection model with the verification files;
S5. if the verification accuracy is greater than or equal to a preset threshold, or the number of training steps reaches a predetermined number, the training is finished;
S6. if the verification accuracy is lower than the preset threshold, adding video image samples with text content or adjusting model parameters, and repeating steps S1 to S4 until the training is finished.
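Where steps S1 to S3 are realized in code, each labeled sample can be packed into a tf.train.Example record. The sketch below is illustrative only: the feature keys, the (x1, y1, x2, y2) box layout, and the file names are assumptions, since the patent prescribes no particular tfrecord schema.

```python
import tensorflow as tf

def _bytes(values):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=values))

def _floats(values):
    return tf.train.Feature(float_list=tf.train.FloatList(value=values))

def _ints(values):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=values))

def make_example(image_bytes, boxes, texts, language, width, height, img_format):
    # boxes: list of (x1, y1, x2, y2) rectangles, flattened into one float list
    flat = [float(c) for box in boxes for c in box]
    return tf.train.Example(features=tf.train.Features(feature={
        "image/encoded": _bytes([image_bytes]),
        "image/format": _bytes([img_format.encode()]),
        "image/width": _ints([width]),
        "image/height": _ints([height]),
        "image/object/bbox": _floats(flat),
        "image/object/text": _bytes([t.encode("utf-8") for t in texts]),
        "image/object/language": _bytes([language.encode()]),
    }))

# Training and verification files hold different images (step S3):
# with tf.io.TFRecordWriter("train.tfrecord") as writer:
#     writer.write(make_example(...).SerializeToString())
```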
Further, the character recognition model in step C is a convolutional recurrent neural network based on an attention model.
Further, the convolutional recurrent neural network based on the attention model is an Attention-CRNN neural network based on TensorFlow.
Further, the training steps of the convolutional recurrent neural network based on the attention model are as follows:
S101. creating a Chinese dictionary, cutting out the character area images from the video image samples used for the character detection model, and generating a character image sample data set;
S102. combining the sample data set with the Chinese dictionary to generate the tfrecord format files required for training, which are divided into training files and verification files; the training files and the verification files contain different images but store the same image format and image information format;
S103. training the model with the training files to generate the predetermined character recognition model, and verifying the generated character recognition model with the verification files;
S104. if the verification accuracy is greater than or equal to a preset threshold, or the number of training steps reaches a predetermined number, the training is finished;
S105. if the verification accuracy is lower than the preset threshold, adding video image samples with text content or adjusting model parameters, and repeating steps S101 to S103 until the training is finished.
Further, step D specifically includes the following steps:
D1. performing word segmentation on the recognized characters to divide them into individual phrases;
D2. performing keyword matching between each phrase and a predetermined phrase table;
D3. if a phrase in the current image is a predetermined phrase and several consecutive frames of images all contain the predetermined phrase, judging that the current image scene is the predetermined phrase scene and performing corresponding scene processing.
Compared with the prior art, the invention has the following beneficial effects:
With the video content judgment method based on character recognition, the content displayed by the current video image can be judged by automatically recognizing the characters in the video image, and corresponding scene content processing can be carried out.
Drawings
Fig. 1 is a flowchart illustrating a video content judgment method based on character recognition according to an embodiment of the present invention.
Detailed Description
The invention will be further elucidated and described with reference to the embodiments described hereinafter.
Embodiment 1:
As shown in Fig. 1, a video content judgment method based on character recognition, applied in this embodiment to an intelligent television system, mainly comprises the following steps:
Step 1: taking a screenshot of the video picture:
When the television system detects that a video stream is present, a current video image is captured from the video stream every 1 s; the video image is 1080P (size: 1920 × 1080). After the background program acquires the images, it uniformly scales them to a width of 640 and a height of 360 using an image scaling technique and sends them to a pre-trained character detection model for detection.
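A minimal sketch of this scaling step, assuming OpenCV (cv2) is available on the platform; how the raw frame is grabbed from the TV video pipeline is vendor-specific and is not shown.

```python
import cv2

def prepare_frame(frame_1080p):
    """Scale a 1920x1080 screenshot to the 640x360 detector input size."""
    # cv2.resize takes the target size as (width, height)
    return cv2.resize(frame_1080p, (640, 360), interpolation=cv2.INTER_AREA)
```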
Step 2: calling the pre-trained character detection model to analyze the character areas of the screenshot picture, finding the character areas in the picture and dividing them to obtain one or more character areas:
The pre-trained character detection model analyzes the character areas of the image, automatically finds the character areas in the picture, gives the coordinates and the width and height of each character area, and thereby obtains a plurality of character areas.
To improve efficiency, a plurality of image areas to be subjected to character recognition are generally set in advance. When a detected character area lies within a preset image area requiring character recognition, the method proceeds to the next step for character recognition; if it does not, character recognition is skipped and character detection and recognition are performed directly on the next frame of image.
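A hedged sketch of this preset-area filter; the (x, y, w, h) rectangle format and the containment rule (a detection must lie fully inside a preset area) are assumptions made for illustration.

```python
def inside(det, roi):
    """True if detection rectangle det lies fully inside roi; both are (x, y, w, h)."""
    dx, dy, dw, dh = det
    rx, ry, rw, rh = roi
    return dx >= rx and dy >= ry and dx + dw <= rx + rw and dy + dh <= ry + rh

def filter_detections(detections, preset_areas):
    """Keep only detected character areas that fall inside a preset image area."""
    return [d for d in detections if any(inside(d, roi) for roi in preset_areas)]
```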
In this embodiment, the detected text regions are scaled as follows:
For a horizontal text region, the height is fixed at 150; when the width is less than 600, the image is padded to a width of 600. When the width is greater than 600, the image is cut into several 500 × 150 images (the first cut is special: 550 × 150), and the edges are given special processing: 50 × 150 strips taken at the left and right edges of the original image are incorporated into the newly cut images. If the last image is less than 600 wide, it is padded to a width of 600. Finally, the text area image is turned into several 600 × 150 images.
For a vertical character area, based on the principle that the font proportion is about 0.7-1 and on the aspect ratio of the character area, the characters are cut into single characters for recognition; after a single character image is cut out, it is scaled to a fixed height of 150 and padded to a width of 600. Finally, a plurality of character area images are generated.
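The horizontal-region tiling can be sketched as a sliding window. The version below simplifies the embodiment's 550/500 first-tile scheme into uniform 600-wide windows with a 50-pixel overlap and assumes three-channel images; it is an approximation, not the exact cutting rule above.

```python
import cv2
import numpy as np

def tile_horizontal_region(region, tile_w=600, tile_h=150, overlap=50):
    """Scale a horizontal text region to height 150 and cut it into 600x150 tiles."""
    scale = tile_h / region.shape[0]
    width = max(1, int(region.shape[1] * scale))
    img = cv2.resize(region, (width, tile_h))
    tiles, x = [], 0
    while x < img.shape[1]:
        tile = img[:, x:x + tile_w]
        if tile.shape[1] < tile_w:  # pad the last (or only) tile to width 600
            pad = tile_w - tile.shape[1]
            tile = np.pad(tile, ((0, 0), (0, pad), (0, 0)), constant_values=0)
        tiles.append(tile)
        x += tile_w - overlap
    return tiles
```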
Step 3: after the character areas are detected, calling the pre-trained character recognition model, performing character recognition on each character area in a loop, and recognizing the character content of each character area:
The pre-trained Chinese text recognition model is called, character recognition is performed on each character area in a loop, and the character content of each character area is recognized.
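The recognition loop might look as follows, assuming the recognizer has been exported to a .tflite file (as in the training steps later) that takes one 600 × 150 image and emits a sequence of dictionary indices; the model path, tensor layout, and decoding are illustrative assumptions.

```python
import numpy as np
import tensorflow as tf

def recognize_regions(region_images, dictionary, model_path="recognizer.tflite"):
    """Run the character recognition model on each character area in a loop."""
    interpreter = tf.lite.Interpreter(model_path=model_path)
    interpreter.allocate_tensors()
    inp = interpreter.get_input_details()[0]
    out = interpreter.get_output_details()[0]
    texts = []
    for img in region_images:
        x = np.expand_dims(img.astype(np.float32) / 255.0, axis=0)
        interpreter.set_tensor(inp["index"], x)
        interpreter.invoke()
        ids = interpreter.get_tensor(out["index"])[0]
        # map dictionary indices back to characters (padding handling omitted)
        texts.append("".join(dictionary.get(int(i), "") for i in ids))
    return texts
```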
Step 4: performing natural language processing on the recognized text content, understanding its semantics, and making corresponding video playing settings:
the method specifically comprises the following steps:
Step 4.1: performing word segmentation on the recognized characters to divide them into phrases;
Step 4.2: performing keyword matching between each phrase and a predetermined phrase table;
Step 4.3: if a phrase in the current image is a predetermined phrase and several consecutive frames of images all contain the predetermined phrase, judging that the current image scene is the predetermined phrase scene and performing corresponding scene processing.
Specifically, in this embodiment the predefined text includes phrases such as "advertisement" and "news". These phrases each represent an image content category of the current video playback. Natural language processing such as word segmentation and phrase matching is performed on the extracted character content. When a predefined phrase such as an advertisement phrase or a news phrase is detected at the same position in several consecutive frames, the current scene is determined to be an advertisement or news scene, and corresponding video playing settings are subsequently made for the different scenes.
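A minimal sketch of this decision rule, assuming a phrase table mapping predefined phrases to scenes and a requirement of N consecutive matching frames (N = 3 here is an arbitrary choice). English keys stand in for the Chinese phrases, and whitespace splitting stands in for real Chinese word segmentation, for which a library such as jieba would typically be used.

```python
PHRASE_TABLE = {"advertisement": "ad_scene", "news": "news_scene"}

class SceneJudge:
    """Confirm a scene only after the same phrase is seen in N consecutive frames."""

    def __init__(self, required_consecutive=3):
        self.required = required_consecutive
        self.last_scene = None
        self.count = 0

    def update(self, recognized_text):
        words = recognized_text.split()  # placeholder for real word segmentation
        scene = next((PHRASE_TABLE[w] for w in words if w in PHRASE_TABLE), None)
        if scene is not None and scene == self.last_scene:
            self.count += 1
        else:
            self.count = 1 if scene is not None else 0
        self.last_scene = scene
        return scene if self.count >= self.required else None
```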
Specifically, the pre-trained character detection model is a convolutional neural network; in this embodiment a MobileNet-SSD neural network based on Google's TensorFlow is adopted. The training process of the neural network is as follows:
A. according to the input characteristics of the neural network, collecting about 5,000 video image samples with character content from television playing video, and uniformly setting their size to 640 × 360;
B. for each video image sample with text content, extracting information such as the rectangular frame coordinates of the area where the text is located, the text content and the text language category, as well as the image size and image format of the image sample itself;
C. for the image samples and the sample information obtained in the previous two steps, generating training files and verification files in the tfrecord format supported by TensorFlow, wherein the training files and the verification files contain different images but store the same image format and image information format;
D. training the model with the training files to generate the predetermined character detection model, and verifying the generated character detection model with the verification files;
E. if the verification accuracy is greater than or equal to a preset threshold (in this embodiment the preset verification accuracy threshold is 95%), or the number of training steps reaches a predetermined number (20,000 steps), the training is complete;
F. if the verification accuracy is lower than the preset threshold (95%), adding video image samples with text content or adjusting model parameters, and repeating steps A to E until the training is complete;
G. generating a tflite model file to be called by the Android program.
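Step G could be realized as follows, assuming the trained detector was saved as a TensorFlow SavedModel; the directory and file names are illustrative, as the patent only states that a tflite model file is produced for the Android side.

```python
import tensorflow as tf

# Convert the trained detection model to a .tflite file for the Android program.
converter = tf.lite.TFLiteConverter.from_saved_model("mobilenet_ssd_text_detector")
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # optional size/latency optimization
tflite_model = converter.convert()

with open("text_detector.tflite", "wb") as f:
    f.write(tflite_model)
```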
Specifically, the pre-trained character recognition model is a convolutional recurrent neural network based on an attention model; in this embodiment an Attention-CRNN neural network based on Google's TensorFlow is adopted. Although the Attention-CRNN is composed of several different neural networks and components (CNN, RNN, attention), they can be trained end to end with a single loss function. The model can therefore be trained as a whole, and the training process is as follows:
A. creating a Chinese dictionary containing 5,462 Chinese characters. The dictionary has only two columns: the left column contains serial numbers (0, 1, ...) and the right column contains the corresponding Chinese characters (an illustrative sketch follows these steps);
B. cutting out the character area images from the video image samples used for the character detection model, and generating a character image sample data set with an image width of 600 and a height of 150;
C. combining the sample data set with the Chinese dictionary to generate the tfrecord format files required for training, which are likewise divided into training files and verification files; the training files and the verification files contain different images but store the same image format and image information format;
D. training the model with the training files to generate the predetermined character recognition model, and verifying the generated character recognition model with the verification files;
E. if the verification accuracy is greater than or equal to a preset threshold (in this embodiment the preset verification accuracy threshold is 90%), or the number of training steps reaches a predetermined number (20,000 steps), the training is complete;
F. if the verification accuracy is lower than the preset threshold (90%), adding video image samples with text content or adjusting model parameters, and repeating steps A to E until the training is complete;
G. generating a tflite model file to be called by the Android program.
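The two-column dictionary of step A might be stored as a plain text file, one character per line; the file name and tab separator below are assumptions, since the patent only specifies a serial-number column and a character column covering 5,462 characters. The mapping returned by load_dictionary is the kind of index-to-character table assumed by the recognition-loop sketch earlier.

```python
def write_dictionary(chars, path="chinese_dict.txt"):
    """Write the two-column dictionary: left column serial number, right column character."""
    with open(path, "w", encoding="utf-8") as f:
        for idx, ch in enumerate(chars):
            f.write(f"{idx}\t{ch}\n")

def load_dictionary(path="chinese_dict.txt"):
    """Load the dictionary back as a mapping from serial number to character."""
    with open(path, encoding="utf-8") as f:
        return {int(i): ch for i, ch in (line.rstrip("\n").split("\t") for line in f)}
```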
It will be understood that the above embodiments are merely exemplary embodiments adopted to illustrate the principles of the present invention, and the present invention is not limited thereto. It will be apparent to those skilled in the art that various modifications and improvements can be made without departing from the spirit and substance of the invention, and these modifications and improvements are also considered to be within the scope of the invention.
Claims
1. A video content judgment method based on character recognition, characterized by comprising the following steps:
A. taking a screenshot of a video picture;
B. calling a pre-trained character detection model to analyze character areas of the screenshot picture, finding out and dividing the character areas in the picture, and obtaining one or more character areas;
C. after the character areas are detected, calling a character recognition model trained in advance, circularly performing character recognition on each character area, and recognizing the character content of each character area;
D. carrying out natural language processing on the recognized character content, understanding its semantics, and making corresponding video playing settings;
step A further comprises dividing and setting, on the screenshot picture, a plurality of image areas requiring character recognition;
step B specifically comprises the following steps:
B1. calling a pre-trained character detection model to analyze character areas of the screenshot picture, finding out and dividing the character areas in the picture, and obtaining one or more character areas;
B2. if the detected character area is within a preset image area requiring character recognition, proceeding to step C; otherwise, returning to step A;
the character detection model in step B is a convolutional neural network, specifically a MobileNet-SSD neural network based on TensorFlow; the training procedure for the convolutional neural network is as follows:
S1. collecting a preset number of video image samples with text content according to the input characteristics of the neural network;
S2. for each video image sample with text content, extracting at least the rectangular frame coordinates of the text, the text content, the text language category, and the image size and image format of the sample itself;
S3. for the image samples and the sample information obtained in steps S1 and S2, generating training files and verification files in the tfrecord format supported by TensorFlow, wherein the training files and the verification files contain different images but store the same image format and image information format;
S4. training the model with the training files to generate the predetermined character detection model, and verifying the generated character detection model with the verification files;
S5. if the verification accuracy is greater than or equal to a preset threshold, or the number of training steps reaches a predetermined number, the training is finished;
S6. if the verification accuracy is lower than the preset threshold, adding video image samples with text content or adjusting model parameters, and repeating the above steps until the training is finished.
2. The method as claimed in claim 1, wherein the character recognition model in step C is a convolutional recurrent neural network based on an attention model.
3. The method as claimed in claim 2, wherein the convolutional recurrent neural network based on the attention model is an Attention-CRNN neural network based on TensorFlow.
4. The method of claim 3, wherein the training of the convolutional recurrent neural network based on the attention model comprises the following steps:
S101. creating a Chinese dictionary, cutting out the character area images from the video image samples used for the character detection model, and generating a character image sample data set;
S102. combining the sample data set with the Chinese dictionary to generate the tfrecord format files required for training, which are divided into training files and verification files; the training files and the verification files contain different images but store the same image format and image information format;
S103. training the model with the training files to generate the predetermined character recognition model, and verifying the generated character recognition model with the verification files;
S104. if the verification accuracy is greater than or equal to a preset threshold, or the number of training steps reaches a predetermined number, the training is finished;
S105. if the verification accuracy is lower than the preset threshold, adding video image samples with text content or adjusting model parameters, and repeating the above steps until the training is finished.
5. The method for determining video content based on character recognition according to any one of claims 1 to 4, wherein step D specifically comprises the following steps:
D1. performing word segmentation on the recognized characters, and dividing them into individual phrases;
D2. performing keyword matching between each phrase and a predetermined phrase table;
D3. if a phrase in the current image is a predetermined phrase and several consecutive frames of images all contain the predetermined phrase, judging that the current image scene is the predetermined phrase scene and performing corresponding scene processing.
Publication: CN109583443B (application CN201811360543.9A, filed 2018-11-15), "Video content judgment method based on character recognition".