- To set up logging in bash, one has to create the log directory first.
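  Whether the log is captured from bash or from Python, the directory has to exist first; a minimal Python sketch (the path is illustrative, not from the repo):

  ```python
  import logging
  import os

  LOG_DIR = 'experiments/logs'  # illustrative path, not from the repo

  # neither a bash redirect nor logging.FileHandler creates a missing directory,
  # so build it before wiring up the handler
  os.makedirs(LOG_DIR, exist_ok=True)

  logging.basicConfig(
      filename=os.path.join(LOG_DIR, 'train.log'),
      level=logging.INFO,
      format='%(asctime)s %(levelname)s %(message)s')
  logging.info('logging initialized')
  ```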
- Check the dropout used in fc7.
- Consider adding dropout to the embedding layer.
- Total number of raw images: 108249.
- Total number of image descriptions: 108078, with one json file per image, named by the image id.
- Dataset typo:
bckground
{"region_id": 1936, "width": 40, "height": 38, "image_id": 1,
"phrase": "bicycles are seen in the bckground", "y": 320, "x": 318,
"phrase_tokens": ["bicycles", "are", "seen", "in", "the", "bckground"]}
- Reading one gt_region json takes about 1 ms. x_gt_region/id.json has the format:
{
"regions":[ ...
{"region_id": 1382, "width": 82, "height": 139, "image_id": 1,
"phrase": "the clock is green in colour", "y": 57, "x": 421,
"phrase_tokens": ["the", "clock", "is", "green", "in", "colour"]},
...
],
"path": "/home/joe/git/VG_raw_data/img_test/1.jpg",
"width": 800,
"id": 1,
"height": 600
}
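  A minimal sketch of reading one of these per-image files (path and helper name are illustrative):

  ```python
  import json

  def load_gt_regions(json_path):
      """Load one x_gt_region/<id>.json file and return its region list."""
      with open(json_path) as f:
          ann = json.load(f)
      # top-level keys: 'regions', 'path', 'width', 'height', 'id'
      return ann['regions']

  regions = load_gt_regions('x_gt_region/1.json')  # ~1 ms per file
  for r in regions[:3]:
      print(r['region_id'], r['phrase_tokens'], (r['x'], r['y'], r['width'], r['height']))
  ```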
- **Add "gt_phrases" to every roidb, for both the LIMIT_RAM and UNLIMIT_RAM versions.**
- SAVE THE LIMIT_RAM VERSION AS PKL FILES.
- LIMIT_RAM example: 1.pkl
{
'gt_classes': array([1382, 1383, ..., 4090, 4091], dtype=int32),
'flipped': False,
'gt_phrases': [[4, 33, 6, 25, 20, 144], [167, 6, 30, 4, 11], [7, 6, 21, 72],...],
'boxes': array([[421, 57, 503, 196],
[194, 372, 376, 481],
[241, 491, 302, 521],
...], dtype=uint16),
'seg_areas': array([ 11620., 20130., 1922., ...], dtype=float32),
'gt_overlaps': <262x2 sparse matrix of type '<type 'numpy.float32'>'
   with 262 stored elements in Compressed Sparse Row format>
}
updated (via .update) with:
{
'width': 800,
'max_classes': array([1, 1, 1, ...]),
'image': u'/home/joe/git/VG_raw_data/img_test/1.jpg',
'max_overlaps': array([ 1., 1., 1., 1., 1., ...]),
'height': 600,
'image_id': 1
}
**NOTE: gt_phrases are incremented by 1 before saving:**
# increment the stream -- 0 will be the EOS character
stream = [s + 1 for s in stream]
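  To inspect one of these saved entries (the path is illustrative; the dumps above were written from Python 2, so pass encoding='latin1' when loading them under Python 3):

  ```python
  import pickle

  with open('1.pkl', 'rb') as f:
      # the dumps above come from Python 2, hence encoding='latin1' under Python 3
      entry = pickle.load(f, encoding='latin1')

  print(sorted(entry.keys()))
  print(entry['boxes'].shape, len(entry['gt_phrases']))
  ```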
- LIMIT_RAM example: 1_flip.pkl
{
'gt_classes': array([1382, 1383, ..., 4090, 4091], dtype=int32),
'flipped': True,
'gt_phrases': [[4, 33, 6, 25, 20, 144], [167, 6, 30, 4, 11], [7, 6, 21, 72],...],
'boxes': array([[296, 57, 378, 196],
[423, 372, 605, 481],
[497, 491, 558, 521],
...], dtype=uint16),
'seg_areas': array([ 11620., 20130., 1922., ...], dtype=float32),
'gt_overlaps': <262x2 sparse matrix of type '<type 'numpy.float32'>'
   with 262 stored elements in Compressed Sparse Row format>
}
updated (via .update) with:
{
'width': 800,
'max_classes': array([1, 1, 1, ...]),
'image': u'/home/joe/git/VG_raw_data/img_test/1.jpg',
'max_overlaps': array([ 1., 1., 1., 1., 1., ...]),
'height': 600,
'image_id': '1_flip'
}
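  The flipped boxes follow the usual horizontal-flip convention: only the x-coordinates change, mirrored about the image width. A minimal sketch that reproduces the numbers in 1.pkl vs. 1_flip.pkl:

  ```python
  import numpy as np

  def flip_boxes(boxes, width):
      """Horizontally flip [x1, y1, x2, y2] boxes for an image of the given width."""
      flipped = boxes.copy()
      flipped[:, 0] = width - boxes[:, 2] - 1  # new x1 from old x2
      flipped[:, 2] = width - boxes[:, 0] - 1  # new x2 from old x1
      return flipped

  boxes = np.array([[421, 57, 503, 196]], dtype=np.uint16)  # first box of 1.pkl
  print(flip_boxes(boxes, 800))                             # -> [[296  57 378 196]]
  ```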
- UNLIMIT_RAM example: pre_gt_roidb.pkl
{
'gt_classes': array([1382, 1383, ..., 4090, 4091], dtype=int32),
'flipped': False,
'boxes': array([[421, 57, 503, 196],
[194, 372, 376, 481],
[241, 491, 302, 521],
...], dtype=uint16),
'seg_areas': array([ 11620., 20130., 1922., ...], dtype=float32),
'gt_phrases': [[4, 33, 6, 25, 20, 144], [167, 6, 30, 4, 11], [7, 6, 21, 72],...],
'gt_overlaps': <262x2 sparse matrix of type '<type 'numpy.float32'>'
   with 262 stored elements in Compressed Sparse Row format>
}
Does not include ALL_PHRASES.
- UNLIMIT_RAM example:
{1536: [3, 10, 20, 8, 6, 2, 9], 3584: [36, 38, 29, 17, 2, 37], ...}
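  This looks like a mapping from region id (the same ids that appear in gt_classes) to the tokenized phrase; a hedged sketch of how such a dict could be used per roidb entry (the helper name is hypothetical, not from the repo):

  ```python
  # excerpt of the dict shown above: region_id -> tokenized phrase
  all_phrases = {1536: [3, 10, 20, 8, 6, 2, 9], 3584: [36, 38, 29, 17, 2, 37]}

  def phrases_for_entry(gt_classes, all_phrases):
      """Hypothetical helper: collect the phrase for every region id in an entry."""
      return [all_phrases[int(region_id)] for region_id in gt_classes]
  ```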
- The returned roidb is a path to the saved pkls.
- TRAIN.USE_FLIPPED = True needs to:
  - set USE_FLIPPED to True (added on 10.18.17)
  - use the rdl_roidb.prepare_roidb method to process the data
  - filter out the invalid rois beforehand
- Add self.image_index to the visual_genome class for filtered indexes. Update: changed to self._image_index.
- Finished the roi data layer; images are read in BGR order. Example of data.forward() with 1.jpg (gt_phrases shape: num_regions x 10, i.e. max_words):
{
'gt_boxes': array([[ 2.66399994e+02, 5.12999992e+01, 3.40200012e+02,
1.76399994e+02, 1.38200000e+03],
[ 3.80700012e+02, 3.34799988e+02, 5.44500000e+02,
4.32899994e+02, 1.38300000e+03], ...], dtype=float32),
'data': array([[[[-14.9183712 , -28.93106651, -9.59885979],
[-18.94306374, -32.79834747, -13.62355137],
[-11.24244404, -24.31069756, -5.66058874],...]]], dtype=float32),
'im_info': array([[ 540. , 720. , 0.89999998]], dtype=float32),
'gt_phrases': array([[ 4, 33, 6, ..., 0, 0, 0],
[167, 6, 30, ..., 0, 0, 0],
[ 7, 6, 21, ..., 0, 0, 0],...]
}
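  These numbers are consistent with standard Faster R-CNN preprocessing: the 600x800 image is resized by im_scale = 0.9 to 540x720, the (flipped) ground-truth boxes are scaled by the same factor, and the region id is appended as a fifth column. A minimal sketch (the helper name is illustrative):

  ```python
  import numpy as np

  def build_gt_boxes(boxes, gt_classes, im_scale):
      """Scale gt boxes to the resized image and append the class/region id column."""
      scaled = boxes.astype(np.float32) * im_scale
      return np.hstack([scaled, gt_classes[:, None].astype(np.float32)])

  boxes = np.array([[296, 57, 378, 196]], dtype=np.uint16)  # flipped box from 1_flip.pkl
  gt_classes = np.array([1382], dtype=np.int32)
  print(build_gt_boxes(boxes, gt_classes, 0.9))
  # ≈ [[ 266.4   51.3  340.2  176.4  1382. ]]  (cf. gt_boxes above)
  ```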
Output of first 3 regions of 1.jpg:
# length of labels, i.e. number of regions: 262
# sentence data layer input (first 3)
1382.0 [ 4 33 6 25 20 144 0 0 0 0]
1383.0 [167 6 30 4 11 0 0 0 0 0]
1384.0 [ 7 6 21 72 0 0 0 0 0 0]
# sentence data layer output (first 3)
# input sentence
[[ 1. 4. 33. 6. 25. 20. 144. 0. 0. 0. 0.]
[ 1. 167. 6. 30. 4. 11. 0. 0. 0. 0. 0.]
[ 1. 7. 6. 21. 72. 0. 0. 0. 0. 0. 0.]]
# target sentence
[[ 1. 4. 33. 6. 25. 20. 144. 2. 0. 0. 0. 0.]
[ 1. 167. 6. 30. 4. 11. 2. 0. 0. 0. 0. 0.]
[ 1. 7. 6. 21. 72. 2. 0. 0. 0. 0. 0. 0.]]
# cont sentence
[[ 0. 1. 1. 1. 1. 1. 1. 1. 0. 0. 0. 0.]
[ 0. 1. 1. 1. 1. 1. 1. 0. 0. 0. 0. 0.]
[ 0. 1. 1. 1. 1. 1. 0. 0. 0. 0. 0. 0.]]
# cont bbox
[[ 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]]
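  A minimal sketch that reproduces the rows above (SOS = 1, EOS = 2, pad = 0, max_words = 10; the real sentence data layer also handles batching and clipping, this only shows the per-phrase encoding):

  ```python
  import numpy as np

  SOS, EOS, PAD = 1, 2, 0
  T = 12  # time steps = max_words (10) + SOS + EOS

  def encode_phrase(tokens, max_words=10):
      """Build input/target/cont rows for one gt_phrase, as shown above."""
      n = len(tokens)
      input_sent = np.zeros(T - 1, dtype=np.float32)   # 11 steps: SOS + max_words
      target_sent = np.zeros(T, dtype=np.float32)      # 12 steps: SOS + words + EOS
      cont_sent = np.zeros(T, dtype=np.float32)
      cont_bbox = np.zeros(T, dtype=np.float32)

      input_sent[0] = SOS
      input_sent[1:1 + n] = tokens
      target_sent[0] = SOS
      target_sent[1:1 + n] = tokens
      target_sent[1 + n] = EOS
      cont_sent[1:2 + n] = 1.0   # continuation flag: 0 at the first step, 1 until EOS
      cont_bbox[1 + n] = 1.0     # bbox loss is applied only at the EOS step
      return input_sent, target_sent, cont_sent, cont_bbox

  print(encode_phrase([4, 33, 6, 25, 20, 144]))  # first region of 1.jpg
  ```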
- index 0 will be the <pad> character
- index 1 will be the <SOS> character
- index 2 will be the <EOS> character
image: (1, 540, 720, 3)
head: (1, 34, 45, 1024)
rpn: (1, 34, 45, 512)
rpn_cls_score: (1, 34, 45, 24) #2 x 3 x 4
rpn_cls_score_reshape: (1, 34x12, 45, 2)
rpn_cls_prob_reshape: (1, 34x12, 45, 2)
rpn_cls_prob: (1, 34, 45, 24) #2 x 3 x 4
rpn_bbox_pred: (1, 34, 45, 48)
anchors ==> (18360, 4)
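Quick sanity check on the anchor arithmetic (assuming the `#2 x 3 x 4` comment means 2 class scores over 3 ratios x 4 scales, i.e. 12 anchors per location):

```python
# 34x45 feature map, 12 anchors per location (3 ratios x 4 scales)
h, w, num_anchors = 34, 45, 3 * 4
assert h * w * num_anchors == 18360   # matches 'anchors ==> (18360, 4)'
assert 2 * num_anchors == 24          # rpn_cls_score channels: 2 scores per anchor
assert 4 * num_anchors == 48          # rpn_bbox_pred channels: 4 coords per anchor
assert h * num_anchors == 408         # rpn_labels reshaped to (1, 1, 34*12, 45)
```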
proposal layer
proposal_rois: (9, 5) # NMS heavily reduces the number of proposals
proposal_rpn_scores: (9, 1)
anchor target layer
rpn_labels: (1, 1, 408, 45)
rpn_bbox_targets: (1, 34, 45, 48)
rpn_bbox_inside_weights: (1, 34, 45, 48)
rpn_bbox_outside_weights: (1, 34, 45, 48)
proposal_targets_single_class_layer
ensures a fixed number of regions is sampled.
rois: (256, 5)
labels ==> (256,)
clss ==> (256,)
phrases ==> (256, 10)
bbox_targets ==> (256, 4)
bbox_inside_weights ==> (256, 4)
bbox_outside_weights ==> (256, 4)
RPN
pool5 ==> (256, 7, 7, 1024)
fc7 ==> (256, 2048)
name: fc7_before_pool ==> (256, 7, 7, 2048)
cls_prob ==> (256, 2)
sentence data layer
input_sentence ==> (256, 11)
target_sentence ==> (256, 12)
cont_bbox ==> (256, 12)
cont_sentence ==> (256, 12)
embed_caption_layer
name: embedding ==> (10003, 512)
name: embed_input_sentence ==> (256, 11, 512)
name: fc8 ==> (256, 512)
name: im_context ==> (256, 1, 512)
name: im_concat_words ==> (256, 12, 512)
name: captoin_outputs ==> (256, 12, 512)
name: loc_outputs ==> (256, 12, 512)
name: bbox_pred ==> (256, 4)
name: predict_caption ==> (256, 12, 10003)
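The embed_caption_layer shapes are consistent with an embedding lookup of the 11-step input sentence, a 512-d projection of fc7 (fc8, reshaped to im_context), and prepending that image context as step 0 to get 12 time steps. A shape-only numpy sketch (names mirror the dump above; the actual recurrent layers are only indicated in comments):

```python
import numpy as np

N, T_in, D, V = 256, 11, 512, 10003

embedding = np.zeros((V, D), dtype=np.float32)          # (10003, 512)
input_sentence = np.zeros((N, T_in), dtype=np.int64)    # (256, 11)
embed_input_sentence = embedding[input_sentence]        # (256, 11, 512) lookup
fc8 = np.zeros((N, D), dtype=np.float32)                # fc7 projected to 512-d
im_context = fc8[:, None, :]                            # (256, 1, 512)
im_concat_words = np.concatenate([im_context, embed_input_sentence], axis=1)
assert im_concat_words.shape == (256, 12, 512)          # image step + 11 word steps
# caption_outputs / loc_outputs: recurrent states over the 12 steps -> (256, 12, 512)
# predict_caption: vocab-size projection of the caption outputs -> (256, 12, 10003)
```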