$ cat README.md
High-quality audio datasets in Brazilian Portuguese with medical context.
Built for teams training ASR, TTS, and conversational AI models. Each record ships with source notes, ground-truth text, segment audio, speaker labels, generation provenance, recording-session metadata, and commercial-readiness fields.
Recorded in native Brazilian Portuguese with curated speakers, consistent capture conditions, and metadata ready for buyer review.
$ tree -L 4 juliasdata-delivery-acme-2026-03-21-v3/
juliasdata-delivery-acme-2026-03-21-v3/
├── DELIVERY.json
├── SCHEMA.md
├── SHA256SUMS
├── manifests/
│   ├── records.jsonl
│   ├── steps.jsonl
│   ├── segments.jsonl
│   ├── speakers.jsonl
│   ├── recording_sessions.jsonl
│   └── provenance.jsonl
└── records/
    └── <record_id>/
        ├── record.json
        ├── source_notes.txt
        ├── rights.json
        ├── provenance/
        └── steps/
            └── <step_type>/
                ├── transcript.jsonl
                ├── media.jsonl
                └── audio/
                    ├── source/*.webm
                    └── wav/*.wav
$ cat stats.txt
- Target: 100+ hours of recorded audio
- Language: Brazilian Portuguese (pt-BR)
- Speakers: Native, medical background
- Coverage: Cardiology, neurology, orthopedics, and more
- Alignment: Segment-level ground-truth text with speaker-aware metadata
- Readiness: Provenance, rights fields, consent tracking
- Status: Foundation in production
$ cat availability.txt
An early-access batch is available now with 32 hours of recorded audio across 64 structured records, 3 native pt-BR speakers including one clinician voice, and 1,280 speaker-labeled dialog turns. Coverage currently spans 11 medical specialties plus intake, triage, and patient education, with segment-level clips averaging 7.4 seconds and a broader commercial catalog extending past 100 hours.
$ cat packaging.txt
The dataset can be licensed as a 5 to 10 hour pilot pack for evaluation, as a full multi-speaker core dataset with long-form, QA, terminology, and dialog components, or as a custom specialty-specific collection for teams that need targeted coverage, speaker profiles, or delivery formats. Deliveries include dataset-wide manifests, per-record review folders, source WebM audio, WAV derivatives, transcript ledgers, provenance, and review-friendly rights metadata, with non-exclusive commercial licensing by default and dedicated or exclusive options available for custom runs.
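For teams planning ingestion, the dataset-wide manifests listed above are JSONL files, so they can be streamed one line at a time. The following is a minimal Python sketch, not an official loader: `read_jsonl` and `count_by` are hypothetical helper names, and any field name passed as `key` is an assumption until checked against the `SCHEMA.md` shipped with the delivery.

```python
import json
from collections import Counter
from pathlib import Path
from typing import Iterable, Iterator


def read_jsonl(path: Path) -> Iterator[dict]:
    """Yield one dict per non-empty line of a JSONL manifest."""
    with path.open(encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                yield json.loads(line)
            except json.JSONDecodeError as exc:
                # Report the file and line so a broken manifest is easy to locate.
                raise ValueError(f"{path}:{line_no}: invalid JSON") from exc


def count_by(rows: Iterable[dict], key: str) -> Counter:
    """Count rows per value of `key`; rows lacking the key are counted under None."""
    return Counter(row.get(key) for row in rows)
```

A call such as `count_by(read_jsonl(root / "manifests" / "segments.jsonl"), "record_id")` would then give segments per record, assuming the segment rows carry a `record_id` field; that field name is a guess based on the `<record_id>` folder in the delivery tree.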
$ cat faq.md
- What is Julia's Data? A curated collection of Brazilian Portuguese medical audio datasets built for ASR, TTS, and conversational AI teams, with ground-truth text, recording provenance, and structured delivery metadata.
- What does each record contain? Five components: raw medical notes read aloud, a long-form text (~3k–5k words), question & answer pairs, a medical terminology lexicon, and a two-speaker dialog, all with transcription ground truth.
- What format is the data in? Each delivery includes dataset-wide JSONL manifests, per-record review folders, original WebM/Opus segment files, WAV derivatives, checksums, schema documentation, recording-session metadata, provenance, and rights metadata.
- Who records the audio? Native Brazilian Portuguese speakers record in a controlled home-office setup, including a medical professional voice and additional approved speakers for dialog coverage.
- Is the dataset commercially ready? Yes. Every delivery includes source-to-audio provenance, speaker identity and consent status, recording-session metadata, file-level technical metadata, and rights fields designed for buyer review.
- How do you handle rights, consent, and PHI risk? We track speaker consent and rights chain at the record level, and medical source material is reviewed with de-identification and release fields before it becomes recordable content.
- What quality checks happen before delivery? Each record is reviewed for transcript alignment, speaker labeling, metadata completeness, and file integrity. Delivered files include checksums and structured manifests for ingestion validation.
- How do I get access? An early-access batch is available now. Reach out at julia@juliasdata.com to review sample materials, packaging options, and licensing.
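
The file-integrity check mentioned in the FAQ can be repeated on the buyer side against the delivery's `SHA256SUMS` file. The sketch below is one way to do that in Python. It assumes `SHA256SUMS` uses the common `sha256sum` layout (`<hex digest>  <relative path>` per line), which the delivery's `SCHEMA.md` would need to confirm; `verify_delivery` and `sha256_of` are hypothetical names.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large WAV files are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_delivery(root: Path, sums_name: str = "SHA256SUMS") -> list[str]:
    """Return relative paths that are missing or whose digest does not match.

    Assumption: each line looks like '<hex digest>  <relative path>'.
    """
    failures = []
    for line in (root / sums_name).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        expected, _, rel_path = line.partition(" ")
        # Drop the remaining separator space and the optional '*' binary marker.
        rel_path = rel_path.lstrip(" *")
        target = root / rel_path
        if not target.is_file() or sha256_of(target) != expected.lower():
            failures.append(rel_path)
    return failures
```

An empty list from `verify_delivery(Path("juliasdata-delivery-acme-2026-03-21-v3"))` would mean every listed file is present and unmodified. Where GNU coreutils is available, running `sha256sum -c SHA256SUMS` from the delivery root performs the same check under the same format assumption.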
$ echo $CONTACT