Updates

Jul 2026

Paper Published in ACL 2026

Our paper "Comparative Analysis of the Intrinsic Metrics for Tokenizers and their effect on Downstream Tasks for Hindi and Marathi" has been published in ACL Main 2026.

Jun 2026

Paper Accepted to DALL (ICDAR 2026 Workshop)

Our paper "Synthetic Line Image Generation from Cropped Grapheme Images for Handwritten Marathi Text Recognition" has been accepted to the ICDAR 2026 Workshop on Document Analysis of Low-resource Languages (DALL).

Jan 2026

HTR of Grantha Manuscripts

I'm exploring different OCR methods, such as Vision Transformers and CRNNs, to improve recognition accuracy. I'm also compiling insights about the variations in the script.

Hello! I'm Shagun Dwivedi

I'm an AI researcher with a passion for data, writing systems, languages, and accuracy. I currently work at the Centre for Interdisciplinary AI, FLAME University, Pune, India. I hold a BS in Data Science and Applications from IIT Madras and a BA (Hons) in Ancient Indian History and Archaeology from the University of Lucknow.

I'm interested in natural language processing, the use of language models in machine perception, and evaluation benchmarks and metrics for LLMs.

Education

2025

BS in Data Science and Applications, IIT Madras

CGPA: 9.04

2023

BA (Hons.) in Ancient Indian History and Archaeology, University of Lucknow

CGPA: 8.02

2020

Class XII (PCM with CS), Jagran Public School, Lucknow

Percentage: 94%

2018

Class X, Jagran Public School, Lucknow

Percentage: 91.4%

Check out some of my projects

Analysis of different Tokenizers and their effect on Downstream tasks for Hindi and Marathi

Evaluation of various subword tokenization methods and their impact on language modeling performance for Hindi and Marathi.

HuggingFace Transformers, LLM Evaluation, Question Answering, Grapheme to Phoneme

Proposing a tokenization method that could help reduce sub-optimal tokenization caused by over-segmentation of text in Indic languages, and its effect on language modeling (using T5).

MANUSCRIPT IN REVIEW

Converting Custom-Embedded Subsetted Non-Unicode Fonts to Searchable Formats

Extracting Indic text from non-Unicode standard PDFs with embedded fonts to a searchable and Unicode-compliant format.

Computational Linguistics, Optical Character Recognition, Low Resource Scripts

Expanding on a previously published case study in Gujarati by including Hindi, Marathi, and Bangla documents.

Handwritten Text Recognition for Pre-Colonial Sanskrit manuscripts

Digitizing manuscripts on Vedanta and Mimansa to make them searchable.

Optical Character Recognition, Finetuning, Evaluation Metrics

Finetuned a ResNet-BiLSTM-CTC model for handwritten manuscripts that outperforms out-of-the-box HTR models on grapheme cluster error rate and character error rate.

AI Language Practice Applications for French and German

Practice platform for beginner French 101 and German 101 courses, customised to university course syllabi.

Machine Learning, Python, Scikit-Learn

Designed a multi-agent system, supported by several custom prompt-engineered OpenAI bots, for the dialog framework. Presented at AIET 2025.

Customer Behavior Prediction

Predicting user behaviour based on the restaurant's offers.

Machine Learning, Python, Scikit-Learn

Performed preprocessing; trained, validated, and evaluated multiple supervised classification models; and selected the best prediction model.

Indian Weather API

API for India Meteorological Department Weather Data

RESTful API, Web Scraping, Flask-Restful

Access a brief or detailed weather report by station name. Get the forecast for the next seven days.

AirIndia Case Study

Analysis and formulation of growth strategies for AirIndia after its acquisition by the TATA Group

Market Research, Analysis, Strategy Development

The case study ranked first in Winter Consulting'22, IIT Guwahati. Conducted extensive market research and formulated strategies for the airline's improvement based on the growth strategy framework.

TrackOn App

Flask-based web application for tracking habits, activities, and other life parameters.

Flask, SQLite, VueJs, Redis, Celery

TrackOn is a web application for tracking habits, activities, and other life parameters. Users can register and log in to create multiple trackers with multiple logs, and review their progress over time with graphs and trend lines.

Publications

A Case Study of Handwritten Text Recognition from Pre-Colonial era Sanskrit Manuscripts

Kartik Chincholikar, Shagun Dwivedi, Kaushik Gopalan, Tarinee Awasthi

World Sanskrit Conference, 2025

This paper presents a comprehensive study on applying handwritten text recognition techniques to pre-colonial era Sanskrit manuscripts. We discuss the challenges unique to these historical documents and our approach to achieving high accuracy in recognition.

A Semi-Automatic Text Recognition Tool for Pre-Colonial Handwritten Manuscripts in Devanāgari Script

Bharath Valaboju, Shagun Dwivedi, Kartik Chincholikar, Kaushik Gopalan, Vinod Vidwans

International Conference on Human-Computer Interaction, 2025

This poster presents an annotation tool which allows the user to extract text from undigitized manuscripts using OCR, following which users can make corrections to the OCR-detected text.

Converting Gujarati Text in Custom-Embedded Subsetted Non-unicode Fonts to Searchable Formats: A Case Study Using Jain Religious Texts

Rishav Jain, Shagun Dwivedi, Kaushik Gopalan

International Conference on Computer Vision and Image Processing, 2024

This study presents a novel approach for extracting Gujarati text from non-Unicode standard PDFs with embedded fonts, addressing the challenges posed by these legacy fonts and providing a pathway to convert and preserve such documents in a searchable, Unicode-compliant format.

Contact Me: