Julia's Data — Brazilian Portuguese Medical Audio Datasets

$ cat README.md


High-quality audio datasets in Brazilian Portuguese with medical context.


Built for teams training ASR, TTS, and conversational AI models. Each record ships with source notes, ground-truth text, segment audio, speaker labels, generation provenance, recording-session metadata, and commercial-readiness fields.


Recorded in native Brazilian Portuguese with curated speakers, consistent capture conditions, and metadata ready for buyer review.

$ tree -L 4 juliasdata-delivery-acme-2026-03-21-v3/


juliasdata-delivery-acme-2026-03-21-v3/
├── DELIVERY.json
├── SCHEMA.md
├── SHA256SUMS
├── manifests/
│   ├── records.jsonl
│   ├── steps.jsonl
│   ├── segments.jsonl
│   ├── speakers.jsonl
│   ├── recording_sessions.jsonl
│   └── provenance.jsonl
└── records/
    └── <record_id>/
        ├── record.json
        ├── source_notes.txt
        ├── rights.json
        ├── provenance/
        └── steps/
            └── <step_type>/
                ├── transcript.jsonl
                ├── media.jsonl
                └── audio/
                    ├── source/*.webm
                    └── wav/*.wav
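
For ingestion, the dataset-wide manifests in the tree above can be read with a few lines of Python. This is a minimal sketch, not the official loader. It assumes each file under `manifests/` holds one JSON object per line, and it makes no assumptions about field names, which are documented in `SCHEMA.md`. The helper names `load_manifests` and `summarize` are hypothetical.

```python
import json
from pathlib import Path


def load_manifests(delivery_root):
    """Read every *.jsonl file under manifests/ into a list of dicts.

    Assumption: one JSON object per non-empty line (standard JSONL).
    """
    manifests = {}
    for path in sorted(Path(delivery_root, "manifests").glob("*.jsonl")):
        with path.open(encoding="utf-8") as handle:
            # Skip blank lines so a trailing newline does not break parsing.
            manifests[path.stem] = [
                json.loads(line) for line in handle if line.strip()
            ]
    return manifests


def summarize(manifests):
    """Map each manifest name (file stem) to its row count."""
    return {name: len(rows) for name, rows in manifests.items()}
```

Calling `summarize(load_manifests("juliasdata-delivery-acme-2026-03-21-v3"))` would return a row count for each manifest file, which is a quick way to compare a delivery against the counts quoted for it.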


$ cat stats.txt


  • Target: 100+ hours of recorded audio
  • Language: Brazilian Portuguese (pt-BR)
  • Speakers: Native, medical background
  • Coverage: Cardiology, neurology, orthopedics, and more
  • Alignment: Segment-level ground-truth text with speaker-aware metadata
  • Readiness: Provenance, rights fields, consent tracking
  • Status: Foundation in production

$ cat availability.txt


An early-access batch is available now with 32 hours of recorded audio across 64 structured records, 3 native pt-BR speakers including one clinician voice, and 1,280 speaker-labeled dialog turns. Coverage currently spans 11 medical specialties plus intake, triage, and patient education, with segment-level clips averaging 7.4 seconds and a broader commercial catalog extending past 100 hours.

$ cat packaging.txt


The dataset can be licensed in three ways: as a 5 to 10 hour pilot pack for evaluation; as a full multi-speaker core dataset with long-form, QA, terminology, and dialog components; or as a custom specialty-specific collection for teams that need targeted coverage, speaker profiles, or delivery formats. Deliveries include dataset-wide manifests, per-record review folders, source WebM audio, WAV derivatives, transcript ledgers, provenance, and review-friendly rights metadata. Licensing is non-exclusive and commercial by default, with dedicated or exclusive options available for custom runs.

$ cat faq.md


  • What is Julia's Data?
    A curated collection of Brazilian Portuguese medical audio datasets built for ASR, TTS, and conversational AI teams, with ground-truth text, recording provenance, and structured delivery metadata.
  • What does each record contain?
    Five components: raw medical notes read aloud, a long-form text (~3k–5k words), question & answer pairs, a medical terminology lexicon, and a two-speaker dialog — all with transcription ground truth.
  • What format is the data in?
    Each delivery includes dataset-wide JSONL manifests, per-record review folders, original WebM/Opus segment files, WAV derivatives, checksums, schema documentation, recording-session metadata, provenance, and rights metadata.
  • Who records the audio?
    Native Brazilian Portuguese speakers record in a controlled home-office setup, including a medical professional voice and additional approved speakers for dialog coverage.
  • Is the dataset commercially ready?
    Yes. Every delivery includes source-to-audio provenance, speaker identity and consent status, recording-session metadata, file-level technical metadata, and rights fields designed for buyer review.
  • How do you handle rights, consent, and PHI risk?
    We track speaker consent and the rights chain at the record level. Medical source material is reviewed, with de-identification and release fields recorded, before it becomes recordable content.
  • What quality checks happen before delivery?
    Each record is reviewed for transcript alignment, speaker labeling, metadata completeness, and file integrity. Delivered files include checksums and structured manifests for ingestion validation.
  • How do I get access?
    An early-access batch is available now. Reach out at julia@juliasdata.com to review sample materials, packaging options, and licensing.
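
The file-integrity check mentioned in the FAQ can be reproduced on the buyer side. The sketch below assumes `SHA256SUMS` follows the common `sha256sum` layout (a hex digest, whitespace, then a path relative to the delivery root); that layout is an assumption, not something the delivery notes confirm. The function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large audio files are not loaded whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksums(delivery_root, sums_name="SHA256SUMS"):
    """Return the listed paths that are missing or whose digest differs.

    Assumption: each line is '<hex digest>  <relative path>' or
    '<hex digest> *<relative path>', as written by sha256sum.
    """
    root = Path(delivery_root)
    failures = []
    for line in (root / sums_name).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        expected, _, name = line.partition(" ")
        # Drop the second space or the binary-mode '*' before the path.
        name = name.lstrip(" *")
        target = root / name
        if not target.is_file() or sha256_of(target) != expected.lower():
            failures.append(name)
    return failures
```

An empty list from `verify_checksums(...)` means every listed file is present and matches its recorded digest. Any returned path is worth raising before ingestion.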

$ echo $CONTACT