python -m dataset.video2frame './dataset/breakfast/original/' './dataset/breakfast/rgb_frame/' --nw=12
python -m dataset.annotation_gather './dataset/breakfast/original/' './dataset/breakfast/'
- For stereo videos, it seems sufficient to use only the ch0 stream;
- Add data augmentation;
- Sample clips based on the GT labels to guarantee the integrity of each video segment;
- Implement positional encoding in the Transformer (see the PE sketch after this list);
- Which normalization should we use in the Transformer?
- Do we need a ReLU activation when computing (Q, K, V) in attention (see the attention sketch after this list)?
- Do we need a Conv layer in the I3D head instead of average pooling?
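As a starting point for the PE item above, here is a minimal sketch of the standard sinusoidal positional encoding from Vaswani et al. (2017). The `offset` argument is an assumption added to express the decoder-offset experiment noted further below; `PositionalEncoding` is a placeholder name, not a module from this repo.

```python
import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    """Standard sinusoidal positional encoding (Vaswani et al., 2017)."""

    def __init__(self, d_model, max_len=512, dropout=0.0):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        position = torch.arange(max_len, dtype=torch.float).unsqueeze(1)   # (max_len, 1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float()
                             * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer('pe', pe.unsqueeze(0))                        # (1, max_len, d_model)

    def forward(self, x, offset=0):
        # x: (batch, seq_len, d_model); `offset` lets the decoder start
        # counting positions after the observed clips instead of from 0.
        x = x + self.pe[:, offset:offset + x.size(1)]
        return self.dropout(x)
```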
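For the (Q, K, V) question, a single-head sketch with a flag that toggles a ReLU after the projections; `SimpleAttention` and `qkv_relu` are hypothetical names used only to frame the experiment.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleAttention(nn.Module):
    """Single-head scaled dot-product attention with an optional ReLU
    applied after the Q/K/V projections (the variant in question)."""

    def __init__(self, d_model, qkv_relu=False):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.qkv_relu = qkv_relu
        self.scale = d_model ** -0.5

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        if self.qkv_relu:
            q, k, v = F.relu(q), F.relu(k), F.relu(v)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ v
```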
- Transformer
- Regressing the length of sub-action segments may work better with **a range rather than a point estimate (or a learned offset)**, since sub-action durations fluctuate (see the interval-loss sketch after this list);
- Generate future features at multiple scales, since small objects only retain usable resolution in shallow features;
- Use multi-scale / local attention instead of global attention in the Transformer (see the mask sketch after this list);
- Should the initial I3D features be passed through an MLP before action recognition?
- Reversed video input: add a learnable forward/reverse embedding, analogous to the PE (see the sketch after this list);
- Initial feature augmentation: add noise (see the same sketch);
- We extract per-clip features by feeding 8 frames through I3D; clips this short may fail to capture temporal information and carry mostly spatial information;
- For simplicity, we sample the training data within each pure action segment, while the evaluation data is built from continuous frames.
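One possible reading of the range-regression idea above: predict a (lo, hi) interval per segment and penalize only GT lengths falling outside it, plus a width term to keep intervals tight. This is purely a sketch; the width weight 0.1 is an arbitrary assumption.

```python
import torch

def interval_regression_loss(pred_lo, pred_hi, gt_len, width_weight=0.1):
    """Penalize only when the GT segment length falls outside the
    predicted [lo, hi] interval; the width term discourages trivially
    wide intervals."""
    below = torch.relu(pred_lo - gt_len)     # GT shorter than the lower bound
    above = torch.relu(gt_len - pred_hi)     # GT longer than the upper bound
    width = torch.relu(pred_hi - pred_lo)
    return (below + above + width_weight * width).mean()
```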
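The local-attention item could be prototyped with a banded boolean mask passed to `nn.MultiheadAttention` (`True` entries are blocked); `window` is an assumed hyperparameter.

```python
import torch

def local_attention_mask(seq_len, window):
    """Boolean mask that is True where attention is *blocked*: each
    position may only attend to neighbors within `window` steps."""
    idx = torch.arange(seq_len)
    dist = (idx.unsqueeze(0) - idx.unsqueeze(1)).abs()
    return dist > window   # (seq_len, seq_len)

# e.g. mask = local_attention_mask(64, window=4), then pass it as
# attn_mask to nn.MultiheadAttention; True entries are not attended to.
```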
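For the reversed-input and feature-noise items, a combined sketch: a learnable two-entry embedding added to the features like the PE, plus Gaussian noise on the raw features. All names and the noise std are assumptions.

```python
import torch
import torch.nn as nn

class DirectionEmbedding(nn.Module):
    """Learnable forward/reverse embedding added to clip features,
    analogous to a positional embedding."""

    def __init__(self, d_model):
        super().__init__()
        self.embed = nn.Embedding(2, d_model)   # 0 = forward, 1 = reversed

    def forward(self, x, reversed_input=False):
        # x: (batch, seq_len, d_model); the embedding broadcasts over
        # batch and sequence dimensions.
        idx = torch.full((1,), int(reversed_input), device=x.device)
        return x + self.embed(idx)

def add_feature_noise(x, std=0.01, training=True):
    """Gaussian noise on the raw I3D features (train-time augmentation)."""
    return x + std * torch.randn_like(x) if training else x
```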
- With lr = 0.0001, training oscillates noticeably; the data looks fine, so lowering the lr to 0.00001 helps;
- Default initialization for all layers works better than xavier_uniform;
- L2-normalizing the input features first seems to have little effect (cf. the `l2_norm` switch in the sketch after this list);
- A smaller batch size seems to work better;
- Performance is much worse without positional embedding;
- Feeding I3D features directly without an FC layer hurts performance a bit; adding a two-layer MLP helps (see the sketch after this list);
- Dropout in the PE hurts performance;
- Starting the decoder PE from an offset instead of position 0 brings no improvement (cf. the `offset` argument in the PE sketch above);
- 4 layers work better than 2; 6 layers are comparable to 4;
- Adding noise to the raw features helps a bit.
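A minimal sketch of the two-layer MLP noted above, with the L2-norm switch included for completeness; the I3D feature dimension of 1024 and all names are assumptions, not the repo's actual code.

```python
import torch.nn as nn
import torch.nn.functional as F

class FeatureProjection(nn.Module):
    """Two-layer MLP applied to I3D clip features before the Transformer."""

    def __init__(self, in_dim=1024, d_model=512, l2_norm=False, dropout=0.3):
        super().__init__()
        self.l2_norm = l2_norm
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, d_model),
            nn.ReLU(inplace=True),
            nn.Dropout(dropout),
            nn.Linear(d_model, d_model),
        )

    def forward(self, x):
        if self.l2_norm:   # had little effect in our runs (see note above)
            x = F.normalize(x, p=2, dim=-1)
        return self.mlp(x)
```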
CUDA_VISIBLE_DEVICES=1 python train.py --nw=4 --lr=0.00001 --bs=4 --e_v='L2_dp0.3_lr0.00001_bs64_dfinit_alldata_inputl2norm'