user-level-audio-auditor (Transcriptions-Only)

Paper: The Audio Auditor: User-Level Membership Inference in Internet of Things Voice Services

Published: PoPETS 2019

Table of Contents

  • Methodology
  • Data Preparation
  • Data Preprocessing (feature extraction)
  • User-level Audio Auditor Model


Methodology

Transcription-only black-box access to the ASR model:

  • Input: audio & its true transcription
  • Output: its predicted transcription

User-level Membership Inference Attack:

  • When querying the target model with a user's data, if this user has any data within the target model's training set, then this user is a user-level member of that training set, even if the queried samples themselves are not members of the training set.

Fig. 2 depicts the workflow of our audio auditor auditing an ASR model. Generally, there are two processes: training and auditing. The former builds a binary classifier as the user-level membership auditor A_audit using a supervised learning algorithm. The latter uses this auditor to audit an ASR model F_tar by querying it with a few audios spoken by one user u. In Section 4.4 of the paper, we show that only a small number of audios per user is needed to determine whether u ∈ U_tar or u ∉ U_tar. Furthermore, a small number of users used to train the auditor is sufficient to provide a satisfying result.

(Figure: methodology overview, i.e., the Fig. 2 workflow described above.)
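
The auditing process can be summarized with a short, purely illustrative Python sketch: `target_asr`, `featurizer`, and `auditor` are hypothetical stand-ins for the transcription-only black-box access, the feature pipeline described in the sections below, and the trained binary classifier.

```python
# A minimal sketch of the auditing process, not the repo's actual code.
# Assumed/hypothetical pieces: target_asr.transcribe(), the featurizer, and a
# trained scikit-learn-style auditor.

def audit_user(user_audios, true_transcripts, target_asr, featurizer, auditor):
    """Return True if the auditor predicts the user is a user-level member (u ∈ U_tar)."""
    records = []
    for audio, true_txt in zip(user_audios, true_transcripts):
        # Transcription-only black-box query to the target ASR model F_tar.
        predicted_txt = target_asr.transcribe(audio)
        records.append((predicted_txt, true_txt, len(audio)))
    # Turn the few queried sentences into one user-level feature vector.
    features = featurizer(records)
    # Binary decision: 1 = user-level member, 0 = non-member.
    return bool(auditor.predict([features])[0])
```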


Data Preparation

Each matrix row contains an id plus 4 columns: {id, transcript_txt, frame_length, txt, txt_length}.

For example, for record id=777-126732-0046, the elements are extracted from /decode_dev_clean_2out_dnn2/decode.1.log and dev-clean-2-true-txt.txt under the folder ./testing_set_for_auditor:

{777-126732-0046, "IN ANY CASE HE HAD NOT THE TIME", 223, "IN ANY CASE HE HAD NOT THE TIME", 31}

  1. log to txt.

decodelog2txt.sh:

  • input: dataset = $1 = testing_set_for_auditor/decode_test_clean_2_user_out_dnn2; label = $2 = nonmember_test_clean_2_user
  • output: txt_f = testing_set_for_auditor/nonmember_test_clean_2_user.txt
  • Strip irrelevant information from the raw transcription results (a Python sketch of this filtering step follows the commands below).

```bash
$ ./decodelog2txt.sh testing_set_for_auditor/decode_train_clean_100_user_out_dnn2 member_train_clean_100_user \
>> out_log/decodelog2txt_member_train_clean_100_user.txt 2>&1 && echo 's' || echo 'e'
$ cp out_log/decodelog2txt_member_train_clean_100_user.txt testing_set_for_auditor/
```
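
decodelog2txt.sh itself is a shell script; the sketch below shows the same filtering idea in Python, under the assumption that each decode.N.log interleaves Kaldi diagnostic output with plain "utterance-id HYPOTHESIS" lines and that only the latter should be kept.

```python
import glob
import re
import sys

# Keep only "utterance-id HYPOTHESIS" lines from the decode logs and drop
# LOG/diagnostic output (assumed log layout; the repo itself uses decodelog2txt.sh).
UTT_LINE = re.compile(r"^(\d+-\d+-\d+)\s+(.+)$")  # e.g. "777-126732-0046 IN ANY CASE ..."

def decode_logs_to_txt(decode_dir, out_txt):
    with open(out_txt, "w") as out:
        for log_path in sorted(glob.glob(f"{decode_dir}/decode.*.log")):
            with open(log_path) as log:
                for line in log:
                    match = UTT_LINE.match(line.strip())
                    if match:
                        out.write(f"{match.group(1)} {match.group(2)}\n")

if __name__ == "__main__":
    # Mirrors the shell usage: <dataset dir> <label>.
    decode_logs_to_txt(sys.argv[1], f"testing_set_for_auditor/{sys.argv[2]}.txt")
```
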
  2. txt to csv.

txt2csv.py:

  • input: txt_in = "testing_set_for_auditor/nonmember_test_clean_2_user.txt"; true_in = "testing_set_for_auditor/test-clean-2-user-true-txt.txt"
  • output: csv_file = "data/nonmember_test-clean-2-user.csv"
  • Convert txt_in (the transcription results) into a matrix keyed by sentence id.
  • Extract 4 features (predicted_txt, frame_length, true_txt, true_txt_length) for each sentence id.
  • Save as .csv keyed by sentence id, with header = ['id', 'predicted_txt', 'true_txt', 'true_txt_length', 'frame_length'] (a sketch of this conversion follows the command below).

```bash
$ python ./txt2csv.py
```
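
A minimal sketch of this conversion, assuming both txt files hold one "utterance-id transcript" line per sentence and that frame lengths are supplied separately (e.g., parsed from the decode logs); only the header above is taken from the repo.

```python
import csv

def load_id_to_text(path):
    """Read a file of 'utterance-id transcript' lines into a dict (assumed layout)."""
    mapping = {}
    with open(path) as f:
        for line in f:
            utt_id, _, text = line.strip().partition(" ")
            if utt_id:
                mapping[utt_id] = text
    return mapping

def txt_to_csv(txt_in, true_in, frame_lengths, csv_file):
    predicted = load_id_to_text(txt_in)
    true_txts = load_id_to_text(true_in)
    with open(csv_file, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "predicted_txt", "true_txt", "true_txt_length", "frame_length"])
        for utt_id, pred_txt in predicted.items():
            true_txt = true_txts.get(utt_id, "")
            # frame_lengths is a dict {utterance id: frame count}, e.g. taken from the decode logs.
            writer.writerow([utt_id, pred_txt, true_txt, len(true_txt), frame_lengths.get(utt_id, 0)])
```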

Data Preprocessing (feature extraction)

Transform the sentence-level records (keyed by 'id') into user-level records (keyed by 'user'). The two string-type features, 'predicted_txt' and 'true_txt', are first reduced to a numeric similarity score; this score, together with the other two numeric features, is then analyzed statistically for each user.

  1. Word2Vec Model Training

word_embedding.py:

  • input: predicted_path = testing_set_for_auditor/*/*.log; true_label_path = True_transcripts/*.txt
  • output: w2vModel = word2vec_libri.model
  • Train a Word2Vec model on both vocabularies (the decoding logs and the true_txt files) and save it as a .model file.
  • Update the pretrained model (word2vec_*.model) with additional total_samples when needed (see the gensim sketch below).

```bash
$ python ./word_embedding.py
```
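
A minimal gensim-based sketch of this training step (gensim 4.x API; the glob patterns and hyperparameters are illustrative, not necessarily those used in word_embedding.py):

```python
import glob
from gensim.models import Word2Vec

def load_sentences(pattern):
    """Tokenize 'utterance-id transcript' lines from every file matching pattern."""
    sentences = []
    for path in glob.glob(pattern):
        with open(path) as f:
            for line in f:
                _, _, text = line.strip().partition(" ")
                if text:
                    sentences.append(text.split())
    return sentences

# Build one vocabulary from both the decoded transcripts and the true transcripts,
# then save the embedding (hyperparameters here are illustrative defaults).
corpus = load_sentences("testing_set_for_auditor/*.txt") + load_sentences("True_transcripts/*.txt")
model = Word2Vec(sentences=corpus, vector_size=100, window=5, min_count=1, workers=4)
model.save("word2vec_libri.model")
```
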
  2. Word2Vec Model Update

When new log files are found:

  • Repeat the Word2Vec Model Training step above, i.e., re-run word_embedding.py to update the pretrained model (an incremental-update sketch follows the command below).

```bash
$ python ./word_embedding.py
```
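
Under the same assumptions, the incremental update could look like this: build_vocab with update=True, then train with the new total number of examples ("new_transcripts.txt" is a placeholder file of "utterance-id transcript" lines).

```python
from gensim.models import Word2Vec

# Incrementally extend the saved embedding with newly decoded transcripts.
new_sentences = [line.split()[1:] for line in open("new_transcripts.txt") if line.strip()]
model = Word2Vec.load("word2vec_libri.model")
model.build_vocab(new_sentences, update=True)
model.train(new_sentences, total_examples=len(new_sentences), epochs=model.epochs)
model.save("word2vec_libri.model")
```
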
  3. Similarity Score Between 'predicted_txt' and 'true_txt'

feature_sentence.py:

  • input: csv = data/member_dev-clean-2.csv
  • output: feats_file = data/member_feats3_dev-clean-2.csv
  • Load pretrained Word2Vec model (word2vec_*.model)
  • Initialize the list that reduces the original 4 features (excluding 'id') to 3 features: convert the 2nd and 3rd columns (string-type features) into word vectors (one vector per word) and one similarity score per sentence. Specifically, ['id', 'predicted_txt', 'true_txt', 'true_txt_length', 'frame_length'] ==> ['id', 'similarity', 'frame_length', 'speed'].
  • Save the initial features as .csv keyed by sentence id (a sketch of this step follows the command below).

```bash
$ python ./feature_sentence.py
```
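
A sketch of this step, assuming 'similarity' is the cosine similarity between the averaged word vectors of the predicted and true transcripts (gensim's n_similarity) and 'speed' is approximated as true-text length per frame; feature_sentence.py's exact definitions may differ.

```python
import pandas as pd
from gensim.models import Word2Vec

model = Word2Vec.load("word2vec_libri.model")

def sentence_similarity(pred_txt, true_txt):
    """Cosine similarity between the mean word vectors of the two transcripts."""
    pred = [w for w in str(pred_txt).split() if w in model.wv]
    true = [w for w in str(true_txt).split() if w in model.wv]
    if not pred or not true:
        return 0.0
    return float(model.wv.n_similarity(pred, true))

df = pd.read_csv("data/member_dev-clean-2.csv")
feats = pd.DataFrame({
    "id": df["id"],
    "similarity": [sentence_similarity(p, t) for p, t in zip(df["predicted_txt"], df["true_txt"])],
    "frame_length": df["frame_length"],
    # Assumed definition: characters of true text per audio frame as a speaking-speed proxy.
    "speed": df["true_txt_length"] / df["frame_length"],
})
feats.to_csv("data/member_feats3_dev-clean-2.csv", index=False)
```
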
  4. Similarity Statistics for Each User

feature_speaker.py:

  • input: feats_csv = data/member_feats3_dev-clean-2.csv
  • output: feats_user_file = data/member_feats3_user_dev-clean-2.csv
  • Statistically aggregate the 3 sentence-level features for each user, where 'id' = user#-chapter#-sentence#. Specifically, ['id', 'similarity', 'frame_length', 'speed'] ==> ['user', 'similarity_statistics', 'frame_length_statistics', 'speed_statistics'].
  • Save the processed features as .csv keyed by user (speaker) id (a sketch of this aggregation follows the command below).

```bash
$ python ./feature_speaker.py
```
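
A pandas sketch of the per-user aggregation; the particular statistics (mean/std/min/max) are an illustrative choice, not necessarily those computed by feature_speaker.py.

```python
import pandas as pd

feats = pd.read_csv("data/member_feats3_dev-clean-2.csv")

# The sentence id is user#-chapter#-sentence#, so the user id is its first field.
feats["user"] = feats["id"].astype(str).str.split("-").str[0]

# Aggregate each sentence-level feature into per-user statistics.
stats = feats.groupby("user")[["similarity", "frame_length", "speed"]].agg(
    ["mean", "std", "min", "max"]
)
stats.columns = ["_".join(col) for col in stats.columns]  # e.g. similarity_mean
stats.reset_index().to_csv("data/member_feats3_user_dev-clean-2.csv", index=False)
```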

User-level Audio Auditor Model

```bash
$ python ./audit_speaker.py
```
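
audit_speaker.py trains the binary user-level auditor on the user-level feature files. A minimal scikit-learn sketch, with a random forest as one reasonable choice of classifier and illustrative file names:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Label the user-level feature files: 1 = user-level member of the target
# model's training set, 0 = non-member (file names here are illustrative).
member = pd.read_csv("data/member_feats3_user_dev-clean-2.csv").assign(label=1)
nonmember = pd.read_csv("data/nonmember_feats3_user_test-clean-2-user.csv").assign(label=0)
data = pd.concat([member, nonmember], ignore_index=True)

X = data.drop(columns=["user", "label"])
y = data["label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

auditor = RandomForestClassifier(n_estimators=100, random_state=0)
auditor.fit(X_train, y_train)
print("auditor accuracy:", accuracy_score(y_test, auditor.predict(X_test)))
```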