SaadARazzaq/spectrogram-rnnrbm-generation

NeuroVocal RNN-RBM: A Hybrid Deep Architecture for Temporal Sequence Modeling of Audio Spectrograms

Abstract

This work presents a novel hybrid deep learning architecture that integrates Recurrent Neural Networks (RNNs) with Restricted Boltzmann Machines (RBMs) for unsupervised temporal modeling of audio spectrograms. The proposed RNN-RBM framework enables robust feature learning, temporal dependency modeling, and generative synthesis of vocalization patterns. By conditioning RBM parameters on temporal context through multi-layer gated RNNs, the model captures both short-term acoustic features and long-term structural patterns in audio data.

Table of Contents

  1. Architecture Overview
  2. Mathematical Foundations
  3. Implementation Details
  4. Experimental Setup
  5. Results & Analysis
  6. Applications
  7. Usage
  8. Future Work

Architecture Overview

Hybrid RNN-RBM Framework

The core innovation lies in the tight coupling of temporal modeling (RNN) and generative modeling (RBM) components:

Input Spectrograms 
    → [RNN Temporal Encoder] 
    → Dynamic RBM Parameters 
    → [Conditional RBM Decoder]
    → Generated Sequences

Component Specifications

  • Spectrogram Input: (time_steps × freq_bins) where freq_bins = 129 (from 256-point FFT)
  • RBM Hidden Layer: Scalable architecture with n_hidden = 1.7 × freq_bins (rounded down to 219)
  • RNN Context Encoding: Multi-layer gated units with n_recurrent = 1.3 × freq_bins (rounded down to 167)
  • Parameter Conditioning: Per-time-step adaptation of RBM biases based on temporal context
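The sizes listed above follow from the FFT length and the two scaling factors. A minimal sketch of that arithmetic, assuming the scaled widths are truncated to integers (the function name `layer_sizes` is hypothetical and not taken from the repository):

```python
def layer_sizes(n_fft=256, hidden_scalar=1.7, recurrent_scalar=1.3):
    """Derive layer widths from the FFT size (scaled widths truncated to int)."""
    freq_bins = n_fft // 2 + 1                       # one-sided spectrum of a real signal
    n_hidden = int(hidden_scalar * freq_bins)        # RBM hidden units
    n_recurrent = int(recurrent_scalar * freq_bins)  # RNN units per layer
    return freq_bins, n_hidden, n_recurrent

print(layer_sizes())  # (129, 219, 167)
```

These values match the n_visible, n_hidden and n_hidden_recurrent entries in the model configuration.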

Mathematical Foundations

1. Conditional RBM Formulation

image
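For reference, a conditional RBM whose biases depend on the recurrent state is usually written as below. This is the standard RNN-RBM notation and only an assumption about what the model implements: the symbols W, W_{uv}, W_{uh}, b_v and b_h are not taken from the repository, and u_{t-1} denotes the temporal context from the previous step.

```latex
\begin{aligned}
b_v^{(t)} &= b_v + W_{uv}^{\top} u_{t-1} \\
b_h^{(t)} &= b_h + W_{uh}^{\top} u_{t-1} \\
E(v_t, h_t \mid u_{t-1}) &= -\,{b_v^{(t)}}^{\top} v_t \;-\; {b_h^{(t)}}^{\top} h_t \;-\; v_t^{\top} W h_t \\
P(v_t, h_t \mid u_{t-1}) &= \frac{\exp\!\big(-E(v_t, h_t \mid u_{t-1})\big)}{Z_t}
\end{aligned}
```

Only the biases change from step to step; the weight matrix W is shared across time, which is consistent with the "Parameter Conditioning" item in the component list.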

2. Multi-Layer Gated RNN Architecture

The temporal context u_t is computed through a three-layer hierarchical RNN:

Layer 1: Input Processing

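from theano.tensor import dot, tanh  # Theano ops, matching the T.* calls used in rbm.py

def build_rnn_layer_1(v_t, u1_tm1, params):
    # params holds nine tensors: update-gate, reset-gate, then candidate-state weights and biases
    (W_in_update, W_hidden_update, b_update,
     W_in_reset, W_hidden_reset, b_reset,
     W_in_hidden, W_reset_hidden, b_hidden) = params

    # Gates as written use tanh (range -1..1); a textbook GRU uses a sigmoid (range 0..1)
    update_gate = tanh(dot(v_t, W_in_update) + dot(u1_tm1, W_hidden_update) + b_update)
    reset_gate = tanh(dot(v_t, W_in_reset) + dot(u1_tm1, W_hidden_reset) + b_reset)
    # Candidate state: the reset gate scales how much of the previous state feeds in
    u1_t_temp = tanh(dot(v_t, W_in_hidden) + dot(u1_tm1 * reset_gate, W_reset_hidden) + b_hidden)
    # Blend the candidate with the previous state using the update gate
    u1_t = (1 - update_gate) * u1_t_temp + update_gate * u1_tm1
    return u1_t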

Layers 2 & 3: Context Refinement

Similar gated mechanisms process the output of previous layers, enabling multi-scale temporal representation learning.
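One way to picture the stacking is the NumPy sketch below. It reuses the Layer 1 update rule for every layer and feeds each layer the fresh output of the one beneath it. How layers 2 and 3 are actually wired in the repository is an assumption here, and `init_gated_layer`, `gated_step` and `stacked_step` are hypothetical names.

```python
import numpy as np

def init_gated_layer(n_in, n_out, rng):
    """Small random weights for one gated layer (hypothetical initialisation)."""
    def weights(rows):
        return rng.normal(0.0, 0.01, size=(rows, n_out))
    return {
        "W_in_update": weights(n_in), "W_hidden_update": weights(n_out), "b_update": np.zeros(n_out),
        "W_in_reset": weights(n_in), "W_hidden_reset": weights(n_out), "b_reset": np.zeros(n_out),
        "W_in_hidden": weights(n_in), "W_reset_hidden": weights(n_out), "b_hidden": np.zeros(n_out),
    }

def gated_step(x_t, u_tm1, p):
    """Same update rule as Layer 1, applied to an arbitrary input x_t."""
    update_gate = np.tanh(x_t @ p["W_in_update"] + u_tm1 @ p["W_hidden_update"] + p["b_update"])
    reset_gate = np.tanh(x_t @ p["W_in_reset"] + u_tm1 @ p["W_hidden_reset"] + p["b_reset"])
    candidate = np.tanh(x_t @ p["W_in_hidden"] + (u_tm1 * reset_gate) @ p["W_reset_hidden"] + p["b_hidden"])
    return (1 - update_gate) * candidate + update_gate * u_tm1

def stacked_step(v_t, prev_states, layers):
    """Advance all layers one time step; layer k reads the new output of layer k-1."""
    new_states, x = [], v_t
    for u_tm1, p in zip(prev_states, layers):
        x = gated_step(x, u_tm1, p)
        new_states.append(x)
    return new_states

rng = np.random.default_rng(0)
n_visible, n_recurrent = 129, 167
layers = [init_gated_layer(n_visible, n_recurrent, rng),
          init_gated_layer(n_recurrent, n_recurrent, rng),
          init_gated_layer(n_recurrent, n_recurrent, rng)]
states = [np.zeros(n_recurrent) for _ in layers]
for v_t in rng.random((10, n_visible)):   # ten random stand-in frames
    states = stacked_step(v_t, states, layers)
u_t = states[-1]                          # top-layer state, used as the temporal context
```

In this sketch only the top-layer state would be passed on to condition the RBM biases; the lower layers serve as intermediate representations.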

3. Training Objective

image
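RBM-based models are commonly trained with contrastive divergence, where the cost is the free-energy gap between a data batch and a sample obtained after k Gibbs steps. The NumPy sketch below illustrates that idea under two assumptions that the text does not confirm: binary visible and hidden units, and a plain CD-k cost. All function names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def free_energy(v, W, bv, bh):
    """F(v) = -v.bv - sum_j log(1 + exp(bh_j + (vW)_j)) for a binary RBM."""
    visible_term = np.sum(v * bv, axis=-1)
    hidden_term = np.sum(np.logaddexp(0.0, v @ W + bh), axis=-1)
    return -visible_term - hidden_term

def gibbs_step(v, W, bv, bh, rng):
    """One v -> h -> v block-Gibbs sweep."""
    h = rng.binomial(1, sigmoid(v @ W + bh))
    return rng.binomial(1, sigmoid(h @ W.T + bv)).astype(float)

def cd_cost(v0, W, bv, bh, k, rng):
    """Mean free-energy gap between data v0 and the k-step Gibbs sample."""
    v = v0
    for _ in range(k):
        v = gibbs_step(v, W, bv, bh, rng)
    return np.mean(free_energy(v0, W, bv, bh) - free_energy(v, W, bv, bh))

rng = np.random.default_rng(0)
W = rng.normal(0.0, 0.01, size=(129, 219))
bv, bh = np.zeros(129), np.zeros(219)
batch = rng.binomial(1, 0.5, size=(20, 129)).astype(float)  # stand-in data, batch_size = 20
cost = cd_cost(batch, W, bv, bh, k=15, rng=rng)
```

In a conditional setting, bv and bh would be replaced by the per-time-step biases computed from the RNN state; `free_energy` accepts biases with a leading batch or time axis for that reason.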

Implementation Details

Spectrogram Preprocessing Pipeline (parser.py)

Adaptive Vocalization Detection

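import numpy as np
from scipy.signal import spectrogram

def parse_segments(data, rate, threshold=90, buffer=5, min_length=15, max_length=500):
    # Multi-stage processing:
    # 1. Signal conditioning with linear smoothing
    window_length = max(1, int(0.004 * rate))  # 4 ms converted to samples
    rectified = np.abs(data)
    smoothed = linear_smooth(rectified, window_length)  # smoothing helper, not shown in this excerpt
    
    # 2. Percentile-based thresholding
    threshold_value = np.percentile(smoothed, threshold)
    indices = smoothed >= threshold_value
    
    # 3. Run detection: zero-pad so every above-threshold run has a start and an end
    bounded = np.hstack(([0], indices, [0]))
    diffs = np.diff(bounded)
    run_starts = np.where(diffs > 0)[0]
    run_ends = np.where(diffs < 0)[0]
    
    # 4. Segment validation and spectrogram computation
    # Assumption: min_length and max_length are in ms, so run durations are compared in ms
    valid_segments = [(s, e) for s, e in zip(run_starts, run_ends)
                      if min_length <= 1000.0 * (e - s) / rate <= max_length]
    for start, end in valid_segments:
        # Clamp the left edge so a negative index cannot wrap around to the end of the array
        f, t, spec = spectrogram(data[max(start - buffer, 0):end + buffer], rate,
                                 noverlap=128, nperseg=256)
        yield f, t, spec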

Key Parameters:

  • Window Length: 4ms converted to samples
  • Minimum Spacing: 2ms between segments
  • Spectrogram: 256-sample segments (256-point FFT) with 128-sample overlap
  • Frequency Range: 0 - Nyquist (rate/2 Hz)
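The millisecond parameters above have to be converted to sample counts before they can be used on a signal. The sketch below shows that conversion together with a simple moving-average smoother. Both `ms_to_samples` and this version of `linear_smooth` are hypothetical stand-ins: the repository's own smoothing helper may use a different kernel.

```python
import numpy as np

def ms_to_samples(ms, rate):
    """Convert a duration in milliseconds to a whole number of samples (at least 1)."""
    return max(1, int(round(ms * rate / 1000.0)))

def linear_smooth(signal, window_length):
    """Moving-average smoothing with a flat kernel; output has the same length as the input."""
    kernel = np.ones(window_length) / window_length
    return np.convolve(signal, kernel, mode="same")

rate = 44100                                 # example sampling rate in Hz
window_length = ms_to_samples(4, rate)       # 4 ms smoothing window
min_spacing = ms_to_samples(2, rate)         # 2 ms minimum gap between segments
envelope = linear_smooth(np.abs(np.sin(np.linspace(0, 50, rate))), window_length)
```

Because the window is expressed in milliseconds, the same settings carry over to recordings with different sampling rates.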

Advanced Training Techniques (rbm.py)

1. Robust Optimization

import theano.tensor as T

# Gradient conditioning: replace NaN/Inf entries with 0.1 * param so the
# update stays finite
not_finite = T.or_(T.isnan(gradient), T.isinf(gradient))
gradient = T.switch(not_finite, 0.1 * param, gradient)

# RMSProp: scale each step by a running average of squared gradients
accu_new = 0.9 * accu + 0.1 * gradient ** 2
param_update = lr * gradient / T.sqrt(accu_new + 1e-6)
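The same update can be written in plain NumPy to make its behaviour easy to check outside a Theano graph. This is a sketch of the rule shown above, not the repository's training loop; the function name `rmsprop_step` and the choice to subtract the step from the parameter are assumptions.

```python
import numpy as np

def rmsprop_step(param, gradient, accu, lr=3e-4, rho=0.9, eps=1e-6):
    """One RMSProp step with non-finite gradient entries replaced by 0.1 * param."""
    not_finite = ~np.isfinite(gradient)                     # True where NaN or +/-Inf
    gradient = np.where(not_finite, 0.1 * param, gradient)
    accu_new = rho * accu + (1.0 - rho) * gradient ** 2     # running mean of squared gradients
    param_new = param - lr * gradient / np.sqrt(accu_new + eps)
    return param_new, accu_new

param = np.array([1.0, 2.0])
gradient = np.array([np.nan, 0.5])       # first entry simulates a blown-up gradient
accu = np.zeros(2)
param, accu = rmsprop_step(param, gradient, accu)
```

With the protection in place, a single bad gradient entry turns into a small step proportional to the parameter itself rather than propagating NaN through the weights.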

2. Multi-phase Learning Schedule

Training uses a stepwise learning-rate decay schedule, lowering the rate in successive phases:

  • Phase 1: lr = 3e-4 - Rapid feature acquisition
  • Phase 2: lr = 1e-4 - Refinement learning
  • Phase 3: lr = 5e-5 to 1e-5 - Fine-tuning
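A phase-based schedule like this can be expressed as a lookup from epoch to learning rate. In the sketch below, the rate values come from the model configuration, while `epochs_per_phase` and the function name `lr_for_epoch` are hypothetical, since the text does not state how long each phase lasts.

```python
LEARNING_RATES = [3e-4, 1e-4, 5e-5, 3e-5, 1e-5]

def lr_for_epoch(epoch, epochs_per_phase=10, rates=LEARNING_RATES):
    """Return the learning rate for a given epoch; the last rate is held once the list is exhausted."""
    phase = min(epoch // epochs_per_phase, len(rates) - 1)
    return rates[phase]

schedule = [lr_for_epoch(e) for e in range(0, 60, 10)]
print(schedule)  # [0.0003, 0.0001, 5e-05, 3e-05, 1e-05, 1e-05]
```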

3. Regularization Strategy

  • L1 Regularization: λ₁ = 1e-4, encouraging sparse weights
  • L2 Regularization: disabled by default (λ₂ = None), configurable
  • Dropout: none explicit; stochastic sampling of the hidden units injects comparable noise
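As an illustration of the L1 term, the sketch below computes the penalty and its subgradient for a weight matrix. Which parameters the repository actually penalises is not stated, so applying it to a single matrix W is an assumption, and the function names are hypothetical.

```python
import numpy as np

def l1_penalty(W, lam=1e-4):
    """L1 cost term: lam times the sum of absolute weights."""
    return lam * np.sum(np.abs(W))

def l1_subgradient(W, lam=1e-4):
    """Subgradient of the L1 term; zero weights receive zero gradient."""
    return lam * np.sign(W)

W = np.array([[1.0, -2.0], [0.0, 3.0]])
penalty = l1_penalty(W)          # 1e-4 * (1 + 2 + 0 + 3)
grad = l1_subgradient(W)
```

The constant-magnitude gradient is what pushes small weights all the way to zero, in contrast to an L2 term whose pull shrinks along with the weight.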

Experimental Setup

Dataset Specifications

  • Input Format: Raw WAV files with variable sampling rates
  • Preprocessing: Automatic segmentation, normalization, spectrogram computation
  • Training/Validation: Temporal cross-validation within sequences

Model Configuration

model_config = {
    'n_visible': 129,           # Fixed by spectrogram resolution
    'n_hidden': 219,            # int(1.7 × n_visible)
    'n_hidden_recurrent': 167,  # int(1.3 × n_visible)
    'learning_rates': [3e-4, 1e-4, 5e-5, 3e-5, 1e-5],
    'batch_size': 20,
    'gibbs_steps': {
        'training': 15,
        'generation': 20
    }
}
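To show how these settings fit together at generation time, the sketch below runs 20 Gibbs steps per frame with biases conditioned on a recurrent state. It is a simplified illustration under several assumptions: binary units, untrained random weights, and a single tanh recurrence standing in for the three gated layers. The parameter names (Wuv, Wuh, Wvu, Wuu, bu) and function names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def random_params(n_visible, n_hidden, n_recurrent, rng):
    """Untrained stand-in parameters; a real run would load trained weights."""
    def w(rows, cols):
        return rng.normal(0.0, 0.01, size=(rows, cols))
    return {
        "W": w(n_visible, n_hidden), "bv": np.zeros(n_visible), "bh": np.zeros(n_hidden),
        "Wuv": w(n_recurrent, n_visible), "Wuh": w(n_recurrent, n_hidden),
        "Wvu": w(n_visible, n_recurrent), "Wuu": w(n_recurrent, n_recurrent),
        "bu": np.zeros(n_recurrent),
    }

def generate(n_steps, p, gibbs_steps=20, seed=0):
    """Sample n_steps frames, one conditional RBM per time step."""
    rng = np.random.default_rng(seed)
    u = np.zeros(p["bu"].shape[0])        # recurrent state (temporal context)
    v = np.zeros(p["bv"].shape[0])        # current visible frame
    frames = []
    for _step in range(n_steps):
        bv_t = p["bv"] + u @ p["Wuv"]     # biases conditioned on the context
        bh_t = p["bh"] + u @ p["Wuh"]
        for _k in range(gibbs_steps):     # block Gibbs sampling within the frame
            h = rng.binomial(1, sigmoid(v @ p["W"] + bh_t))
            v = rng.binomial(1, sigmoid(h @ p["W"].T + bv_t)).astype(float)
        frames.append(v)
        u = np.tanh(p["bu"] + v @ p["Wvu"] + u @ p["Wuu"])  # simplified recurrence
    return np.stack(frames)

params = random_params(n_visible=129, n_hidden=219, n_recurrent=167,
                       rng=np.random.default_rng(1))
spec = generate(n_steps=8, p=params, gibbs_steps=20)
print(spec.shape)  # (8, 129)
```

Each row of the result is one generated frame of 129 frequency bins; with trained weights the rows would form a spectrogram-like sequence.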

Evaluation Metrics

  1. Training Convergence: Negative log-likelihood bounds
  2. Generation Quality: Visual inspection of spectrogram coherence
  3. Temporal Consistency: Long-range dependency modeling
  4. Feature Learning: Hidden unit activation patterns

Results & Analysis

Training Behavior

  • Stable Convergence: Protected gradients prevent training divergence
  • Multi-timescale Learning: RNN captures both frame-level and sequence-level patterns
  • Regularization Efficacy: L1 norm promotes sparse, interpretable features

Generation Capabilities

  • Temporal Coherence: Generated sequences maintain structural consistency beyond training length
  • Multi-scale Patterns: Captures both fine-grained spectral features and broader temporal contours
  • Mode Coverage: Diverse sampling through temperature-based activation

Applications

1. Bioacoustic Research

  • Animal Vocalization Analysis: Unsupervised discovery of call types and sequences
  • Species Identification: Learning distinctive acoustic signatures
  • Behavioral Studies: Temporal pattern analysis in communication

2. Speech Technology

  • Unsupervised Phoneme Learning: Discovering speech units from raw audio
  • Prosody Modeling: Capturing rhythm and intonation patterns
  • Pathological Speech Analysis: Identifying atypical temporal patterns

3. Audio Synthesis

  • Generative Sound Design: Creating novel audio textures and sequences
  • Music Information Retrieval: Learning musical structure and style
  • Voice Conversion: Modeling speaker characteristics

4. Neuroscience Applications

  • Neural Coding: Modeling temporal dependencies in neural recordings
  • Sensory Processing: Understanding hierarchical feature extraction
  • Motor Sequence Generation: Modeling complex temporal behaviors

Usage

Customization Guide

# For custom datasets, modify:
config = {
    'threshold': 85,           # Detection sensitivity
    'min_length': 10,          # Minimum segment length (ms)
    'max_length': 1000,        # Maximum segment length (ms)
    'hidden_scalar': 2.0,      # RBM hidden unit scaling
    'recurrent_scalar': 1.5    # RNN hidden unit scaling
}

Future Work

Architectural Extensions

  1. Attention Mechanisms: Content-based temporal focusing
  2. Hierarchical RNNs: Multi-resolution temporal processing
  3. Variational Extensions: Explicit latent variable modeling

Algorithmic Improvements

  1. Advanced Sampling: Parallel tempering for better mixing
  2. Structured Regularization: Temporal smoothness constraints
  3. Multi-modal Learning: Joint audio-text representation learning

Applications Development

  1. Real-time Synthesis: Streaming audio generation
  2. Transfer Learning: Pre-trained models for new domains
  3. Interpretability Tools: Visualization of learned features


✨ Author

Saad Abdur Razzaq
Machine Learning Engineer | Effixly AI
