Showing content from https://github.com/huyt16/Twitter100k below:

GitHub - huyt16/Twitter100k

Twitter100k: A Real-world Dataset for Weakly Supervised Cross-Media Retrieval

Yuting Hu, Liang Zheng, Yi Yang, and Yongfeng Huang

This paper contributes a new large-scale dataset for weakly supervised cross-media retrieval, named Twitter100k. It is characterized by two aspects: 1) it contains 100,000 image-text pairs randomly crawled from Twitter and thus places no constraint on the image categories; 2) the text in Twitter100k is written by users in informal language.

Since strongly supervised methods leverage the class labels that may be missing in practice, this paper focuses on weakly supervised learning for cross-media retrieval, in which only text-image pairs are exploited during training. We extensively benchmark the performance of four subspace learning methods and three variants of the Correspondence AutoEncoder, along with various text features on Wikipedia, Flickr30k and Twitter100k.

As a minor contribution, inspired by the characteristics of Twitter100k, we propose an OCR-based cross-media retrieval method. In our experiments, the proposed OCR-based method improves on the baseline performance.

A detailed description is provided in our paper.

For subspace learning methods (CCA, PLS, BLM, GMMFA)
  1. download the data of the three benchmark datasets (Twitter100k_feature (3.5G), Flickr30k_feature (5.5G), Wikipedia_feature (102M)) and put them into the folder feature/ or any other folder convenient to you.
  2. modify the dataset name and the data path variables in the script file run_baseline.m in code/GMA-CVPR2012/.
  3. run the MATLAB script file run_baseline.m.
  4. run retrieve.py for a specific dataset; the ranks of the ground-truth matches will be saved in result/rank/.

For Corr-AE methods (the three Correspondence AutoEncoder variants)
  1. download the data of the three benchmark datasets (Twitter100k (2.0G), Flickr30k (1.9G), Wikipedia (59M)) and put them into the folder feature/ or any other folder convenient to you.
  2. run the python script file genNPYdata.py in code/deepnet-master/deepnet/examples/yutinghu/ to generate the input data for Corr-AE methods.
  3. install deepnet and its dependencies by following the instructions in INSTALL.TXT in code/deepnet-master/ (this may take some patience).
  4. run runall_all.sh in code/deepnet-master/deepnet/examples/yutinghu/wikipedia/ (or flickr30k/ or twitter100k/).
  5. run retrieve_corr_ae.py for a specific dataset; the ranks of the ground-truth matches will be saved in result/rank/.
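
To illustrate what "rank of ground truth" means in the retrieval steps above, here is a minimal sketch in pure Python. It is not the code in retrieve.py or retrieve_corr_ae.py. The function names, the toy feature vectors, and the choice of cosine similarity are all assumptions made for illustration. For each query (say, a text feature), every gallery item (say, an image feature) is scored, and the rank is the position of the paired item in that ordering.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def ground_truth_ranks(queries, gallery):
    """Hypothetical helper: 1-based rank of gallery[i] for queries[i].

    queries[i] and gallery[i] are assumed to be a matching text-image pair
    already projected into a common space.
    """
    ranks = []
    for i, q in enumerate(queries):
        scores = [cosine(q, g) for g in gallery]
        # Rank = 1 + number of gallery items scoring strictly higher
        # than the true match.
        ranks.append(1 + sum(1 for s in scores if s > scores[i]))
    return ranks

# Toy data: three text features and their three paired image features.
text_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
image_feats = [[0.9, 0.1], [0.8, 0.6], [0.1, 1.0]]
print(ground_truth_ranks(text_feats, image_feats))
```

The loops are kept plain so the sketch has no dependencies; a run over 100,000 pairs would need a vectorized implementation.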

You can download the CMC results, saved in MAT-file format, for direct comparison.

The Twitter100k dataset (10G). Access code: tifl
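
As a rough illustration of how a CMC curve relates to saved ranks, the sketch below assumes that CMC here means the cumulative match characteristic: the fraction of queries whose ground-truth match appears within the top k results. The function name and the sample ranks are hypothetical, and the layout of the released MAT-files is not assumed.

```python
def cmc_curve(ranks, max_rank):
    """Hypothetical helper: CMC value at each k = 1..max_rank.

    ranks holds the 1-based rank of the ground-truth match for each query.
    """
    n = len(ranks)
    return [sum(1 for r in ranks if r <= k) / n for k in range(1, max_rank + 1)]

# Toy ranks for four queries (made up for illustration).
sample_ranks = [1, 3, 2, 1]
print(cmc_curve(sample_ranks, 3))
```

A curve that rises faster toward 1.0 means the true match tends to appear earlier in the ranked list.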

