This repository accompanies the paper Does Bigger Mean Better Everywhere? Cross-Domain Sentiment Analysis with TF-IDF and DistilBERT.
It compares a classical TF-IDF + Logistic Regression pipeline against fine-tuned DistilBERT for binary sentiment classification. Both models are trained on Sentiment140 tweets and evaluated zero-shot on IMDB movie reviews.
Main result: DistilBERT has a clear in-domain advantage on Twitter (7.3 accuracy points, p < 0.001), but that gap disappears entirely under domain shift. Cross-domain, the two models reach statistical parity (χ² = 0.042, p = 0.838), and the cheaper classical model is the more defensible choice when retraining is not feasible.
A model with a 7-point in-domain accuracy advantage can lose it entirely under domain shift—even against a far simpler baseline. We train TF-IDF + Logistic Regression (TF-IDF+LR) and fine-tuned DistilBERT on Sentiment140 and evaluate both zero-shot on IMDB movie reviews. In-domain, DistilBERT leads by 7.3 accuracy points (85.0% vs. 77.7%, p < 0.001, McNemar's test). Cross-domain, DistilBERT degrades 2.3× faster (12.6 vs. 5.4 points) and the two models reach statistical parity (χ² = 0.042, p = 0.838). The degradation is precision-dominated (DistilBERT precision drop −15.6 vs. recall drop −4.3 points; 3.6× asymmetry), consistent with over-triggering on local positive-affect tokens that do not carry sentiment in long-form text. We recommend the precision-to-recall degradation ratio as a lightweight cross-domain diagnostic, and conclude that in-domain accuracy alone is an insufficient basis for model selection under distribution shift.
Dataset sizes: 1.6M Sentiment140 tweets · 160K used for DistilBERT fine-tuning · 1.28M for TF-IDF+LR · 25K IMDB examples for zero-shot evaluation.
Training data asymmetry: TF-IDF+LR was trained on 1.28M tweets; DistilBERT on 160K due to GPU constraints. This is a known confound — each pipeline is treated as representative of its practical deployment scenario. The cross-domain evaluation is unaffected since both models are evaluated zero-shot on the same 25K IMDB examples.
| Model | Domain | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| TF-IDF + LR | Twitter (in-domain) | 0.777 | 0.765 | 0.798 | 0.781 |
| TF-IDF + LR | IMDB (cross-domain) | 0.723 | 0.701 | 0.777 | 0.737 |
| DistilBERT | Twitter (in-domain) | 0.850 | 0.846 | 0.853 | 0.850 |
| DistilBERT | IMDB (cross-domain) | 0.723 | 0.690 | 0.810 | 0.746 |
In-domain McNemar: χ² = 1113.5, p < 0.001. Cross-domain: χ² = 0.042, p = 0.838.
| Metric | TF-IDF + LR | DistilBERT |
|---|---|---|
| Accuracy | −0.054 | −0.126 |
| Precision | −0.064 | −0.156 |
| Recall | −0.021 | −0.043 |
| F1 | −0.044 | −0.104 |
DistilBERT degrades 2.3× faster on accuracy and 2.4× on precision, yet both models reach statistical parity on IMDB.
- DistilBERT leads TF-IDF+LR by 7.3 accuracy points in-domain (p < 0.001).
- Cross-domain, TF-IDF+LR drops 5.4 points; DistilBERT drops 12.6 points (2.3× faster).
- The two models reach statistical parity on IMDB (χ² = 0.042, p = 0.838) — the 7.3-point advantage is entirely erased.
- DistilBERT's degradation is precision-dominated (−15.6 pt precision vs. −4.3 pt recall; 3.6× asymmetry), consistent with over-triggering on local positive-affect tokens in long-form reviews.
- In-domain accuracy alone is insufficient for model selection under distribution shift.
On Twitter, positive sentiment is local — a tweet with "love" or "amazing" is almost always positive. DistilBERT learns strong associations between positive-affect tokens and the positive class. On IMDB, the same tokens appear in negative reviews ("I wanted to love this film, but..."). Sentiment is determined by a narrative arc spanning hundreds of tokens — a structure a model trained on 12-token tweets is not equipped to capture.
TF-IDF+LR is less susceptible because term frequency is diluted by document length: "love" in a 300-word review contributes far less weight than in a 10-word tweet, producing more symmetric degradation.
| TF-IDF + LR | DistilBERT | |
|---|---|---|
| Preprocessing | Non-alphabetic removal, lowercasing, stopword removal, Porter stemming | Raw tweet text, WordPiece tokenization |
| Training data | 1,280,000 tweets (80/20 split, seed 42) | 160,000 tweets (GPU constraint) |
| Model | Logistic Regression (C=1.0, L2) | distilbert-base-uncased, lr=2×10⁻⁵, 3 epochs, batch 32, FP16 |
| In-domain test | 320,000 examples | 40,000 examples |
| Cross-domain test | 25,000 IMDB examples (zero-shot) | 25,000 IMDB examples (zero-shot) |
McNemar's test uses the continuity-corrected form. In-domain McNemar is restricted to the shared 40K subset for fair pairing.
sentiment_analysis/
├── notebooks/
│ └── sentiment_analysis.ipynb
├── requirements.txt
├── CITING.bib
├── Contributing.md
└── LICENSE
- Open the notebook in Google Colab.
- Upload
notebooks/sentiment_analysis.ipynb. - Switch the runtime to a T4 GPU.
- Run the notebook top to bottom.
- Create a Python 3.10 environment.
- Install PyTorch for your CUDA version.
- Install remaining dependencies:
pip install -r requirements.txt - Open
notebooks/sentiment_analysis.ipynbin Jupyter and run it.
DistilBERT training on CPU is very slow. If you only need the TF-IDF pipeline, run the preprocessing and classical modeling sections of the notebook only.
- Sentiment140 — 1.6M tweets with binary labels derived from emoticons (Go et al., 2009).
- IMDB Movie Reviews — 50K human-annotated reviews; fixed, deterministic 25K balanced subset of the standard test split used for zero-shot evaluation (Maas et al., 2011).
- Ben-David, S. et al. (2010). A theory of learning from different domains. Machine Learning 79(1).
- Blitzer, J. et al. (2007). Biographies, Bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. ACL 2007.
- Devlin, J. et al. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL 2019.
- Dror, R. et al. (2018). The hitchhiker's guide to testing statistical significance in NLP. ACL 2018.
- Go, A., Bhayani, R., & Huang, L. (2009). Twitter Sentiment Classification using Distant Supervision. CS224N Stanford.
- Gururangan, S. et al. (2020). Don't Stop Pretraining: Adapt Language Models to Domains and Tasks. ACL 2020.
- Liu, Y. et al. (2019). RoBERTa: A robustly optimized BERT pretraining approach. arXiv:1907.11692.
- Maas, A. et al. (2011). Learning Word Vectors for Sentiment Analysis. ACL 2011.
- Manning, C.D., Raghavan, P., Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press.
- McNemar, Q. (1947). Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika 12(2).
- Nguyen, D.Q. et al. (2020). BERTweet: A pre-trained language model for English tweets. EMNLP 2020.
- Pan, S.J. & Yang, Q. (2010). A survey on transfer learning. IEEE TKDE 22(10).
- Porter, M.F. (1980). An algorithm for suffix stripping. Program 14(3).
- Ramponi, A. & Plank, B. (2020). Neural unsupervised domain adaptation in NLP — A survey. COLING 2020.
- Ribeiro, M.T. et al. (2016). "Why should I trust you?": Explaining the predictions of any classifier. KDD 2016.
- Sanh, V. et al. (2019). DistilBERT, a distilled version of BERT. EMC2 @ NeurIPS 2019.
- Sundararajan, M. et al. (2017). Axiomatic attribution for deep networks. ICML 2017.
@misc{anonymous2026crossdomain,
title = {Does Bigger Mean Better Everywhere? Cross-Domain Sentiment Analysis with TF-IDF and DistilBERT},
year = {2026},
howpublished = {\url{https://anonymous.4open.science/r/sentiment_analysis-DBC8}},
note = {Cross-domain sentiment analysis with TF-IDF + Logistic Regression and fine-tuned DistilBERT.}
}A machine-readable copy is available in CITING.bib.
MIT License