describegpt

Infer a "neuro-symbolic" Data Dictionary, Description & Tags or ask questions about a CSV with a configurable, Mini Jinja prompt file, using any OpenAI API-compatible LLM, including local LLMs. (e.g. Markdown, JSON, TOON, JSON Schema, Semantic Markdown, OKF, Everything, Spanish, Mandarin, Controlled Tags; --prompt "What are the top 10 complaint types by community board & borough by year?" - deterministic, hallucination-free SQL RAG result; iterative, session-based SQL RAG refinement - refined SQL RAG result)

Table of Contents | Source: src/cmd/describegpt.rs | 📇🗃️🤖🌐🪄📚⛩️ CKAN

Description | Examples | Usage | Data Analysis/Inferencing Options | Dictionary Options | Tag Options | Stats/Frequency Options | Custom Prompt Options | LLM API Options | Caching Options | MCP Sampling Options | Common Options

Description

Create a "neuro-symbolic" Data Dictionary and/or infer Description & Tags about a Dataset using an OpenAI API-compatible Large Language Model (LLM).

It does this by compiling Summary Statistics & a Frequency Distribution of the Dataset, and then prompting the LLM with detailed, configurable, Mini Jinja-templated prompts that carry this extended statistical context.

The Data Dictionary is "neuro-symbolic" as it uses a hybrid approach. It's primarily populated deterministically using Summary Statistics & Frequency Distribution, and only the human-friendly Label and Description (plus Content Type when --infer-content-type is set) are populated by the "neural network" LLM using the same statistical context.
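The hybrid split described above can be pictured with a small Python sketch. It is illustrative only and is not qsv's implementation: the `build_entry` helper, the key names (`type`, `cardinality`, `null_count`, `label`, `description`) and the sample values are all assumptions made up for the example.

```python
# Illustrative sketch only; not qsv's actual code.
# All key names and sample values below are hypothetical.

def build_entry(field_stats: dict, llm_fields: dict) -> dict:
    """Combine deterministic stats with the two LLM-written fields."""
    # Symbolic part: copied verbatim from the computed statistics.
    entry = dict(field_stats)
    # Neural part: only Label and Description are taken from the LLM output;
    # anything else the LLM returns is ignored.
    entry["label"] = llm_fields.get("label", "")
    entry["description"] = llm_fields.get("description", "")
    return entry


stats = {"name": "borough", "type": "String", "cardinality": 6, "null_count": 0}
llm = {"label": "Borough", "description": "Borough of the complaint.", "cardinality": 999}
entry = build_entry(stats, llm)
```

In this sketch the LLM cannot overwrite a computed statistic, which is the point of the deterministic half of the pipeline.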

CHAT MODE:
You can also use the --prompt option to ask a natural language question about the Dataset.

If the question can be answered by solely using the Dataset's Summary Statistics and Frequency Distribution data, the LLM will return the answer directly.

CHAT SQL RETRIEVAL-AUGMENTED GENERATION (RAG) SUB-MODE:
If the question cannot be answered using the Dataset's Summary Statistics & Frequency Distribution, describegpt will first create a Data Dictionary and a small random sample (default: 100 rows) of the Dataset, and provide both to the LLM as additional context to help it generate a SQL query that DETERMINISTICALLY answers the natural language question.

Two SQL dialects are currently supported - DuckDB (highly recommended) & Polars. If the QSV_DUCKDB_PATH environment variable is set to the absolute path of the DuckDB binary, DuckDB will be used to answer the question. Otherwise, if the "polars" feature is enabled, Polars SQL will be used.

If neither DuckDB nor Polars is available, the SQL query will be returned in a Markdown code block, along with the reasoning behind the query.
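The engine selection described above can be summarized as a short Python sketch. It only restates the documented order of preference; the function name, the `polars_enabled` parameter and the treatment of an empty QSV_DUCKDB_PATH as "unset" are assumptions, not qsv's code.

```python
import os


def choose_sql_engine(env=None, polars_enabled=False):
    """Return which engine would run the generated SQL (hypothetical helper)."""
    env = os.environ if env is None else env
    # DuckDB is preferred whenever QSV_DUCKDB_PATH points at a binary.
    if env.get("QSV_DUCKDB_PATH"):
        return "duckdb"
    # Otherwise fall back to Polars SQL if the "polars" feature is compiled in.
    if polars_enabled:
        return "polars"
    # No engine: the SQL is returned in a Markdown code block, not executed.
    return "none"
```

For instance, `choose_sql_engine({"QSV_DUCKDB_PATH": "/path/to/duckdb"}, polars_enabled=True)` selects DuckDB even though Polars is also available.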

Even in "SQL RAG" mode, though the SQL query is guaranteed to be deterministic, the query itself may not be correct. In the event of a SQL query execution failure, run the same --prompt with the --fresh option to request the LLM to generate a new SQL query.

When using DuckDB, all loaded DuckDB extensions will be sent as additional context to the LLM to let it know what functions (even UDFs!) it can use in the SQL queries it generates. If you want a specific function or technique to be used in the SQL query, mention it in the prompt.

SUPPORTED MODELS & LLM PROVIDERS:
OpenAI's open-weights gpt-oss model (both 20b and 120b variants) was used during development & is recommended for most use cases. It was also tested with OpenAI, TogetherAI, OpenRouter and Google Gemini cloud providers. For Gemini, use the base URL "https://generativelanguage.googleapis.com/v1beta/openai". Local LLMs tested include Ollama, Jan and LM Studio.

Note

LLMs are prone to producing inaccurate information. Verify output results before using them.

CACHING:
As LLM inferencing takes time and can be expensive, describegpt caches the LLM inferencing results in either a disk cache (default) or a Redis cache. It does so by calculating the BLAKE3 hash of the input file and using it as the primary cache key along with the prompt type, model and every flag that influences the rendered prompt (including prompt-file, context-file, language, tag-vocab, num-tags, enum-threshold, infer-content-type, sample-size, fewshot-examples, the QSV_DUCKDB_PATH toggle and the generated Data Dictionary). Changing any of them therefore produces a fresh LLM call rather than stale cached output.
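A rough Python sketch of a composite cache key of this kind is shown below. It is not qsv's key layout: the `cache_key` helper and the JSON encoding are assumptions, and BLAKE2b from the standard library stands in for BLAKE3, which Python does not ship.

```python
import hashlib
import json


def cache_key(file_bytes: bytes, prompt_type: str, model: str, flags: dict) -> str:
    """Derive a key from the file contents plus everything that shapes the prompt."""
    # Stand-in hash: BLAKE2b here, whereas describegpt uses BLAKE3.
    file_hash = hashlib.blake2b(file_bytes, digest_size=32).hexdigest()
    # sort_keys makes the encoding independent of dict insertion order.
    material = json.dumps(
        {"file": file_hash, "kind": prompt_type, "model": model, "flags": flags},
        sort_keys=True,
    )
    return hashlib.blake2b(material.encode("utf-8"), digest_size=32).hexdigest()


data = b"id,borough\n1,Queens\n"
k1 = cache_key(data, "dictionary", "openai/gpt-oss-20b", {"language": "Spanish"})
k2 = cache_key(data, "dictionary", "openai/gpt-oss-20b", {"language": "French"})
```

Here `k1` and `k2` differ because one prompt-affecting flag changed, which mirrors why a changed flag leads to a fresh LLM call.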

The default disk cache is stored in the ~/.qsv-cache/describegpt directory with a default TTL of 28 days and cache hits NOT refreshing an existing cached value's TTL. Adjust the QSV_DISKCACHE_TTL_SECS & QSV_DISKCACHE_TTL_REFRESH env vars to change disk cache settings.

Alternatively a Redis cache can be used instead of the disk cache. This is especially useful if you want to share the cache across the network with other users or computers. The Redis cache is stored in database 3 by default with a TTL of 28 days and cache hits NOT refreshing an existing cached value's TTL. Adjust the QSV_DG_REDIS_CONNSTR, QSV_REDIS_MAX_POOL_SIZE, QSV_REDIS_TTL_SECS & QSV_REDIS_TTL_REFRESH env vars to change Redis cache settings.
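To make the default TTL behavior concrete, here is a minimal Python sketch of a 28-day expiry in which a cache hit leaves the entry's age untouched. The entry layout, the `lookup` helper and the assumption that expiry is measured from the time the value was stored are hypothetical; the actual caches may track expiry differently.

```python
TTL_SECS = 28 * 24 * 60 * 60  # 28 days expressed in seconds (2,419,200)


def lookup(entry: dict, now: int, ttl: int = TTL_SECS, refresh_on_hit: bool = False):
    """Return the cached value, or None once the entry is older than the TTL."""
    if now - entry["stored_at"] >= ttl:
        return None
    if refresh_on_hit:
        # Only when TTL refresh is enabled does a hit restart the clock.
        entry["stored_at"] = now
    return entry["value"]


entry = {"stored_at": 0, "value": "cached dictionary"}
hit = lookup(entry, now=TTL_SECS - 1)   # still valid, age is not reset
miss = lookup(entry, now=TTL_SECS)      # expired
```

With the default of no refresh, a frequently read entry still expires 28 days after it was written.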

Examples

Generate a Data Dictionary, Description & Tags of data.csv using the default OpenAI gpt-oss-20b model (replace <API_KEY> with your OpenAI API key)

qsv describegpt data.csv --api-key <API_KEY> --all

Generate a Data Dictionary of data.csv using the DeepSeek R1:14b model on a local Ollama instance

qsv describegpt data.csv -u http://localhost:11434/v1 --model deepseek-r1:14b --dictionary

Generate a Data Dictionary that also infers a semantic Content Type for each field (e.g. email, city, latitude) so the dictionary can later drive synthetic data generation

qsv describegpt data.csv --dictionary --infer-content-type

Ask questions about the sample NYC 311 dataset using LM Studio with the default gpt-oss-20b model. These are questions that can be answered using the Summary Statistics & Frequency Distribution of the dataset.

qsv describegpt NYC_311.csv --prompt "What is the most common complaint?"

Ask detailed natural language questions that require SQL queries and auto-invoke SQL RAG mode. Generate a DuckDB SQL query to answer the question.

QSV_DUCKDB_PATH=/path/to/duckdb \
qsv describegpt NYC_311.csv -p "What's the breakdown of complaint types by borough descending order?"

Prompt requires a natural language query. Convert query to SQL using the LLM and save results to a file with the --sql-results option. If generated SQL query runs successfully, the file is "results.csv". Otherwise, it is "results.sql".

qsv describegpt NYC_311.csv -p "Aggregate complaint types by community board" --sql-results results
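The file naming rule for --sql-results in the example above can be written down as a tiny Python sketch. The helper name is hypothetical and it only restates the documented behavior.

```python
def sql_results_file(stem: str, query_succeeded: bool) -> str:
    """Pick the extension that matches the outcome of the SQL run."""
    # Success: query results are saved as CSV.
    # Failure: the SQL itself is saved so it can be inspected and fixed.
    return f"{stem}.csv" if query_succeeded else f"{stem}.sql"
```

So `--sql-results results` ends up as either "results.csv" or "results.sql", never both.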

Cache Dictionary, Description & Tags inference results using the Redis cache instead of the disk cache

qsv describegpt data.csv --all --redis-cache

Get fresh Description & Tags inference results from the LLM and refresh disk cache entries for both

qsv describegpt data.csv --description --tags --fresh

Get fresh inference results from the LLM and refresh the Redis cache entries for all three

qsv describegpt data.csv --all --redis-cache --fresh

Forget a cached response for data.csv's data dictionary if it exists and then exit

qsv describegpt data.csv --dictionary --forget

Flush/Remove ALL cached entries in the disk cache

qsv describegpt --flush-cache

Flush/Remove ALL cached entries in the Redis cache

qsv describegpt --redis-cache --flush-cache

Generate Data Dictionary but exclude ID columns from frequency analysis to reduce overhead

qsv describegpt data.csv --dictionary --freq-options "--select '!id,!uuid' --limit 20"

Generate Data Dictionary, Description & Tags but reduce frequency context by showing only top 5 values per field

qsv describegpt data.csv --all --freq-options "--limit 5"

Generate Description using weighted frequencies with ascending sort

qsv describegpt data.csv --description --freq-options "--limit 50 --asc --weight count_column"

Generate a Data Dictionary, Description & Tags using a previously compiled stats CSV file and frequency CSV file instead of running the stats and frequency commands

qsv describegpt data.csv --all --stats-options "file:my_stats.csv" --freq-options "file:my_freq.csv"

For more examples, see tests.

For more detailed info on how describegpt works and how to prepare a prompt file, see https://github.com/dathere/qsv/blob/master/docs/Describegpt.md and https://github.com/dathere/qsv/wiki/AI-and-Documentation#describegpt

Usage

qsv describegpt [options] [<input>]
qsv describegpt --prepare-context [options] [<input>]
qsv describegpt --process-response [options]
qsv describegpt (--redis-cache) (--flush-cache)
qsv describegpt --help

Data Analysis/Inferencing Options

     Option      Type Description Default
 ‑‑dictionary  flag Create a Data Dictionary using a hybrid "neuro-symbolic" pipeline - i.e. the Dictionary is populated deterministically using Summary Statistics and Frequency Distribution data, and only the human-friendly Label and Description (and Content Type when --infer-content-type is set) are populated by the LLM using the same statistical context.
 ‑‑description  flag Infer a general Description of the dataset based on detailed statistical context. An Attribution signature is embedded in the Description.
 ‑‑tags  flag Infer Tags that categorize the dataset based on detailed statistical context. Useful for grouping datasets and filtering.
 ‑A, ‑‑all  flag Shortcut for --dictionary --description --tags.

Dictionary Options

        Option         Type Description Default
 ‑‑num‑examples  integer The number of Example values to include in the dictionary. 5
 ‑‑truncate‑str  integer The maximum length of an Example value in the dictionary. An ellipsis is appended to the truncated value. If zero, no truncation is performed. 25
 ‑‑infer‑content‑type  flag Also have the LLM classify each field's semantic "Content Type", mapped to a curated, documented vocabulary (e.g. email, city, category, name, credit card, etc.) see https://github.com/dathere/qsv/blob/master/src/cmd/synthesize/faker_map.rs. Adds a "Content Type" column/field to the Data Dictionary output. Fields where cardinality equals the row count (i.e. every row has a distinct non-null value - primary keys, surrogate keys, sequence numbers) are deterministically classified as "unique_id", overriding any token the LLM returned for that field. For Date/DateTime fields, the LLM also infers the column's strftime date format (e.g. "date:%m/%d/%Y"); the Markdown, JSON & JSON Schema dictionaries then render Min/Max AND Examples in that inferred format so they match how the dates actually appear in the data, instead of qsv's normalized form. (TSV output keeps Min/Max & Examples in qsv's raw normalized form.)
 ‑‑two‑pass  flag Run a second LLM call that takes the full first-pass Data Dictionary as JSON context and refines each field's Label, Description and (when --infer-content-type is set) Content Type using cross-field awareness. The LLM can then relate fields that belong together (e.g. street_no + street_name + city + state + zip describing a single mailing address; first_name + last_name naming a single person; lat + lng forming a coordinate pair). The refined dictionary becomes the emitted output and is also what downstream Description, Tags and Prompt inference phases see as dictionary context. Roughly doubles dictionary LLM cost and latency, so opt-in. Most useful when combined with --infer-content-type. Allowed with the --dictionary, --all and --prompt inference flags. Mutually exclusive with --prepare-context and --process-response (MCP sampling is single-turn per inference phase).
 ‑‑addl‑cols  flag Add additional columns to the dictionary from the Summary Statistics.
 ‑‑addl‑cols‑list  string A comma-separated list of additional stats columns to add to the dictionary. The columns must be present in the Summary Statistics. If the columns are not present in the Summary Statistics or already in the dictionary, they will be ignored. These values are case-insensitive and automatically set the --addl-cols option to true. "everything" can be used to add all 45 "available" statistics columns. You can adjust the available columns with --stats-options. "everything!" automatically sets --stats-options to compute "all" 51 supported stats. The 6 addl cols are the mode/s & antimode/s stats with each having counts & occurrences. "moar" gets you even moar stats, with detailed outliers info. "moar!" gets you even moar with --advanced stats (Kurtosis, Gini Coefficient & Shannon Entropy) sort_order, sortiness, mean, median, mad, stddev, variance, cv
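Two of the --infer-content-type rules in the table above lend themselves to a short Python sketch: the deterministic "unique_id" override and rendering a date in the inferred strftime format. The helper names are hypothetical, and using ISO 8601 as the fallback for qsv's normalized form is an assumption.

```python
from datetime import date


def final_content_type(cardinality: int, row_count: int, llm_token: str) -> str:
    """Apply the unique_id override on top of whatever the LLM returned."""
    # Every row distinct: the field is classified deterministically.
    if row_count > 0 and cardinality == row_count:
        return "unique_id"
    return llm_token


def render_date(value: date, content_type: str) -> str:
    """Render a date using a token such as "date:%m/%d/%Y"."""
    kind, sep, fmt = content_type.partition(":")
    if kind == "date" and sep and fmt:
        return value.strftime(fmt)
    # Assumed fallback: ISO 8601, standing in for the normalized form.
    return value.isoformat()
```

For example, a field with 1,000 distinct values in a 1,000-row file is reported as "unique_id" even if the LLM suggested "name".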

Tag Options

     Option      Type Description Default
 ‑‑num‑tags  integer The maximum number of tags to infer when the --tags option is used. Maximum allowed value is 50. 10
 ‑‑tag‑vocab  string The CSV file containing the tag vocabulary to use for inferring tags. If no tag vocabulary file is provided, the model will use free-form tags. Supports local files, remote URLs (http/https), CKAN resources (ckan://), and dathere:// scheme. Remote resources are cached locally. The CSV file must have two columns with headers: first column is the tag, second column is the description. Note that qsvlite only supports local files.
 ‑‑cache‑dir  string The directory to use for caching downloaded tag vocabulary resources. If the directory does not exist, qsv will attempt to create it. If the QSV_CACHE_DIR envvar is set, it will be used instead. ~/.qsv-cache
 ‑‑ckan‑api  string The URL of the CKAN API to use for downloading tag vocabulary resources with the "ckan://" scheme. If the QSV_CKAN_API envvar is set, it will be used instead. https://data.dathere.com/api/3/action
 ‑‑ckan‑token  string The CKAN API token to use. Only required if downloading private resources. If the QSV_CKAN_TOKEN envvar is set, it will be used instead.
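As an illustration of the tag vocabulary layout expected by --tag-vocab (a header row, then tag and description columns), the following Python sketch loads and checks such a file. The header names `tag` and `description`, the sample tags and the `load_tag_vocab` helper are assumptions for the example; only the two-column shape comes from the option description.

```python
import csv
import io

MAX_NUM_TAGS = 50  # documented upper bound for --num-tags


def load_tag_vocab(csv_text: str) -> dict:
    """Parse a two-column tag vocabulary into {tag: description}."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, None)
    if header is None or len(header) != 2:
        raise ValueError("tag vocabulary needs a header row with exactly two columns")
    vocab = {}
    for row in reader:
        if len(row) != 2:
            raise ValueError(f"expected 2 columns, got {len(row)}: {row!r}")
        tag, description = row
        vocab[tag] = description
    return vocab


sample = "tag,description\nnoise,Noise complaints\nhousing,Housing and building issues\n"
vocab = load_tag_vocab(sample)
```

A check like this before calling describegpt can catch a malformed vocabulary file early.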

Stats/Frequency Options

      Option       Type Description Default
 ‑‑stats‑options  string Options for the stats command used to generate summary statistics. If it starts with "file:" prefix, the statistics are read from the specified CSV file instead of running the stats command. e.g. "file:my_custom_stats.csv" --infer-dates --infer-boolean --mad --quartiles --percentiles --force --stats-jsonl
 ‑‑freq‑options  string Options for the frequency command used to generate frequency distributions. You can use this to exclude certain variable types from frequency analysis (e.g., --select '!id,!uuid'), limit results differently per use case, or control output format. If --limit is specified here, it takes precedence over --enum-threshold. If it starts with "file:" prefix, the frequency data is read from the specified CSV file instead of running the frequency command. e.g. "file:my_custom_frequency.csv" A "file:"-backed CSV is assumed to use frequency's default "(NULL)" null text; a custom --null-text in a file-supplied CSV is not recognized when validating inferred date/datetime formats. --rank-strategy dense
 ‑‑enum‑threshold  integer The threshold for compiling Enumerations with the frequency command before bucketing other unique values into the "Other" category. This is a convenience shortcut for --freq-options --limit. If --freq-options contains --limit, this flag is ignored. 10
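The precedence between --enum-threshold and a --limit passed inside --freq-options can be sketched in Python as follows. The `effective_limit` helper is hypothetical, it does not handle the "file:" form, and it is not how qsv parses the option string.

```python
import shlex


def effective_limit(freq_options: str = "", enum_threshold: int = 10) -> int:
    """Return the frequency limit that wins under the documented precedence."""
    # shlex keeps quoted arguments such as '!id,!uuid' together.
    tokens = shlex.split(freq_options)
    for i, tok in enumerate(tokens):
        if tok == "--limit" and i + 1 < len(tokens):
            return int(tokens[i + 1])
        if tok.startswith("--limit="):
            return int(tok.split("=", 1)[1])
    # No explicit --limit: fall back to --enum-threshold (default 10).
    return enum_threshold
```

With `--freq-options "--select '!id,!uuid' --limit 20"`, the sketch returns 20 and the enum threshold is ignored.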

Custom Prompt Options

       Option         Type Description Default
 ‑p, ‑‑prompt  string Custom prompt to answer questions about the dataset. The prompt will be answered based on the dataset's Summary Statistics, Frequency data & Data Dictionary. If the prompt CANNOT be answered by looking at this metadata, a SQL query will be generated to answer the question. If the "polars" feature is enabled or the QSV_DUCKDB_PATH environment variable is set, and the --sql-results option is used, the SQL query will be automatically executed and its results returned. Otherwise, the SQL query will be returned along with the reasoning behind it. If the prompt starts with the "file:" prefix, it is read from the specified file, e.g. "file:my_long_prompt.txt".
 ‑‑sql‑results  string The file to save the SQL query results to. Only valid if the --prompt option is used and either the "polars" feature is enabled or the QSV_DUCKDB_PATH environment variable is set. If the SQL query executes successfully, the results will be saved with a ".csv" extension. Otherwise, the query will be saved with a ".sql" extension so the user can inspect why it failed and modify it.
 ‑‑prompt‑file  string The configurable TOML file containing prompts to use for inferencing. If no file is provided, default prompts will be used. The prompt file uses the Mini Jinja template engine (https://docs.rs/minijinja) See https://github.com/dathere/qsv/blob/master/resources/describegpt_defaults.toml
 ‑‑context‑file  string Path to a file with additional context about the dataset - e.g. variable/code labels, provenance & domain notes - injected into the prompts as the {{ context }} Mini Jinja variable. The file TYPE is sniffed from its contents (not its extension). Supported types: plain text, Markdown, CSV, Excel/ODS spreadsheets (extracted to CSV), and PDF or image files (JPEG/PNG/WebP/GIF) sent to the LLM as a multimodal attachment (needs a multimodal model & endpoint; max ~32 MB). Word/PowerPoint (docx/pptx) are NOT supported - convert to PDF or text. By default qsv injects this context into the USER message; custom prompt file templates may reference {{ context }} anywhere to place it instead. If the option is unset or the file is empty, {{ context }} renders as an empty string and the prompts are unaffected. The file's contents are part of the cache key, so editing it produces a fresh LLM call.
 ‑‑markdown‑template  string TOML file with Mini Jinja templates for Markdown output. The TOML contains four wrapper templates - one per inference kind: dictionary_md_template, description_md_template, tags_md_template and custom_prompt_md_template - plus a dictionary_md_body_template that drives the per-field dictionary table that fills the dictionary wrapper's {{ llm_response }}. All template fields are optional; any omitted field falls back to the embedded default, so a minimal TOML can override just the templates you want to change. Custom Mini Jinja filters (pipe_escape, br_replace, human_count, dict_cell, humanize_examples) and template variables are documented inline in the default TOML referenced below. If no file is provided, built-in defaults are used (matching legacy output). See https://github.com/dathere/qsv/blob/master/resources/describegpt_md_defaults.toml
 ‑‑sample‑size  integer The number of rows to randomly sample from the input file for the sample data. Uses the INDEXED sampling method with the qsv sample command. 100
 ‑‑fewshot‑examples  flag By default, few-shot examples are NOT included in the LLM prompt when generating SQL queries. When this option is set, few-shot examples in the default prompt file are included. Though this will increase the quality of the generated SQL, it comes at a cost - increased LLM API call cost in terms of tokens and execution time. See https://en.wikipedia.org/wiki/Prompt_engineering for more info.
 ‑‑session  string Enable stateful session mode for iterative SQL RAG refinement. The session name is the file path of the markdown file where session messages will be stored. When used with --prompt, subsequent queries in the same session will refine the baseline SQL query. SQL query results (10-row sample) and errors are automatically included in subsequent messages for context.
 ‑‑session‑len  integer Maximum number of recent messages to keep in session context before summarizing older messages. Only used when --session is specified. 10
 ‑‑no‑score‑sql  flag Disable scoresql validation of generated SQL queries before execution. By default, when --prompt generates a SQL query and --sql-results is set, the query is scored and iteratively improved if below threshold.
 ‑‑score‑threshold  integer Minimum scoresql score for a SQL query to be accepted. Typical range is 0-100; values >100 will always trigger retries and the below-threshold warning. 50
 ‑‑score‑max‑retries  integer Max LLM re-prompts to improve a low-scoring SQL query. 3
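The score-and-retry behavior controlled by --score-threshold and --score-max-retries can be pictured with this Python sketch. The `score` and `improve` callables are stand-ins for the scoresql check and the LLM re-prompt, and accepting a query at a score equal to the threshold is an assumption.

```python
def refine_sql(sql, score, improve, threshold=50, max_retries=3):
    """Re-prompt until the query scores at or above threshold, or retries run out."""
    current = sql
    for _ in range(max_retries):
        if score(current) >= threshold:
            return current, True
        # Below threshold: ask for an improved query (one LLM re-prompt).
        current = improve(current)
    # Retries exhausted: report whether the last attempt is acceptable.
    return current, score(current) >= threshold


# Toy stand-ins: query "a" scores 10, "b" scores 40, "c" scores 70.
scores = {"a": 10, "b": 40, "c": 70}
result = refine_sql("a", lambda q: scores.get(q, 0), lambda q: chr(ord(q) + 1))
```

Because no score can reach a threshold above 100, such a setting uses up every retry and still ends below threshold, which matches the note in the table.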

LLM API Options

     Option       Type Description Default
 ‑u, ‑‑base‑url  string The LLM API URL. Supports APIs & local LLMs compatible with the OpenAI API specification. Some common base URLs: OpenAI: https://api.openai.com/v1; Gemini: https://generativelanguage.googleapis.com/v1beta/openai; TogetherAI: https://api.together.ai/v1
 ‑m, ‑‑model  string The model to use for inferencing. This model must be compatible with OpenAI API spec. Works with both cloud LLM providers and local LLMs. Tested open weights models include OpenAI's gpt-oss-20b and gpt-oss-120b; Google's Gemma family of open models; and Mistral's Magistral reasoning models. Precedence: explicit CLI flag > QSV_LLM_MODEL env var > prompt file model > built-in default (openai/gpt-oss-20b). No docopt default; same rationale as --base-url above.
 ‑‑language  string The output language/dialect/tone to use for the response. (e.g., "Spanish", "French", "Hindi", "Mandarin", "Italian", "Castilian", "Franglais", "Taglish", "Pig Latin", "Valley Girl", "Pirate", "Shakespearean English", "Chavacano", "Gen Z", "Yoda", etc.)
 ‑‑addl‑props  string Additional model properties to pass to the LLM chat/completion API. Various models support different properties beyond the standard ones. For instance, gpt-oss-20b supports the "reasoning_effort" property. e.g. to set the "reasoning_effort" property to "high" & "temperature" to 0.5, use '{"reasoning_effort": "high", "temperature": 0.5}'
 ‑k, ‑‑api‑key  string The API key to use. If set, takes precedence over the QSV_LLM_APIKEY envvar. Required when the base URL is not localhost. Set to NONE to suppress sending the API key.
 ‑t, ‑‑max‑tokens  integer Limits the number of generated tokens in the output. Set to 0 to disable token limits. If the --base-url is localhost, indicating a local LLM, the default is automatically set to 0. 10000
 ‑‑timeout  integer Timeout for completions in seconds. If 0, no timeout is used. Defaults to 300 if not set. If the --base-url is localhost, indicating a local LLM, the timeout is automatically disabled unless you set --timeout (or the QSV_TIMEOUT envvar) explicitly.
 ‑‑user‑agent  string Specify custom user agent. It supports the following variables - $QSV_VERSION, $QSV_TARGET, $QSV_BIN_NAME, $QSV_KIND and $QSV_COMMAND. Try to follow the syntax here - https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/User-Agent
 ‑‑export‑prompt  string Export the default prompts to the specified file that can be used with the --prompt-file option. The file will be saved with a .toml extension. If the file already exists, it will be overwritten. It will exit after exporting the prompts.
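The model precedence listed for --model (CLI flag, then QSV_LLM_MODEL, then the prompt file, then the built-in default) can be expressed as a short Python sketch. The `resolve_model` helper and its parameters are hypothetical and simply encode the documented order.

```python
DEFAULT_MODEL = "openai/gpt-oss-20b"  # built-in default named in the table


def resolve_model(cli_model=None, env=None, prompt_file_model=None) -> str:
    """Return the first model found in the documented precedence order."""
    env = env or {}
    for candidate in (cli_model, env.get("QSV_LLM_MODEL"), prompt_file_model):
        if candidate:
            return candidate
    return DEFAULT_MODEL
```

For example, `resolve_model(None, {"QSV_LLM_MODEL": "deepseek-r1:14b"}, "some-prompt-file-model")` picks the environment variable over the prompt file.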

Caching Options

      Option       Type Description Default
 ‑‑no‑cache  flag Disable default disk cache.
 ‑‑disk‑cache‑dir  string The directory to store the disk cache. Note that if the directory does not exist, it will be created. If the directory exists, it will be used as is, and will not be flushed. This option allows you to maintain several disk caches for different describegpt jobs (e.g. one for a data portal, another for internal data exchange). ~/.qsv-cache/describegpt
 ‑‑redis‑cache  flag Use Redis instead of the default disk cache to cache LLM completions. It connects to "redis://127.0.0.1:6379/3" by default, with a connection pool size of 20, with a TTL of 28 days, and cache hits NOT refreshing an existing cached value's TTL. This option automatically disables the disk cache.
 ‑‑fresh  flag Send a fresh request to the LLM API, refreshing a cached response if it exists. When a --prompt SQL query fails, you can also use this option to request the LLM to generate a new SQL query.
 ‑‑forget  flag Remove a cached response if it exists and then exit.
 ‑‑flush‑cache  flag Flush the current cache entries on startup. WARNING: This operation is irreversible.
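To show what the default Redis connection string points at, the following Python sketch splits it into host, port and database number using only the standard library. The helper is hypothetical; describegpt itself takes the connection string from QSV_DG_REDIS_CONNSTR or the default.

```python
from urllib.parse import urlparse

DEFAULT_REDIS_URL = "redis://127.0.0.1:6379/3"  # default from --redis-cache


def describe_redis_url(url: str = DEFAULT_REDIS_URL) -> dict:
    """Break a redis:// URL into host, port and database index."""
    parts = urlparse(url)
    # The path component ("/3") carries the database number.
    db = int(parts.path.lstrip("/") or 0)
    return {"host": parts.hostname, "port": parts.port, "db": db}


info = describe_redis_url()
```

The trailing "/3" is why the cache lands in database 3 by default.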

MCP Sampling Options

       Option        Type Description Default
 ‑‑prepare‑context  flag Output the prompt context as JSON to stdout without calling the LLM. JSON includes system/user prompts, cache state, and analysis results for each inference phase. Useful for inspecting prompts or piping to custom LLM integrations. Used by the MCP server for sampling mode.
 ‑‑process‑response  flag Process LLM responses provided as JSON via stdin. Takes the output format from --prepare-context with LLM responses filled in, and produces the final output (dictionary, description, tags, or prompt results). Used by the MCP server for sampling mode.

Common Options

       Option        Type Description Default
 ‑h, ‑‑help  flag Display this message
 ‑‑format  string Output format: Markdown, TSV, JSON, TOON, JSONSchema, SemanticMd, or OKF. TOON is a compact, human-readable encoding of the JSON data model for LLM prompts. See https://toonformat.dev/ for more info. JSONSchema emits the Data Dictionary as a JSON Schema (draft 2020-12) document, enriched with LLM-inferred Label, Description and Content Type (the latter only when the infer-content-type flag is set). qsv- and LLM- specific metadata not modeled by the JSON Schema spec (cardinality, null_count, weighted example counts, content_type, addl stats columns) is preserved via a single x-qsv annotation object per property; unknown keywords are ignored by validators per the 2020-12 spec. The JSONSchema format requires the dictionary inference phase (the dictionary or all flag). The description inference, when also run, becomes the schema's top-level description; tags, when also run, are embedded at x-qsv.tags. The prompt inference is not supported. SemanticMd emits the Data Dictionary as a Semantic Markdown document (https://semanticmd.org/) - human-readable markdown with light, agent-parseable conventions that a companion converter turns into JSON. It enriches each column with a catalog-wide Concept ID (for cross-dataset join discovery), an analytical Role (dimension/measure/identifier/timestamp), join keys & cardinality, data-quality flags, and a richer per-column statistics block; it also emits a dataset grain and a temporal/spatial envelope. To populate Concept/Role/grain, SemanticMd implies --infer-content-type. Like JSONSchema, it requires the dictionary inference phase (the dictionary or all flag). The description inference, when also run, becomes the '# Dataset' description; tags, when also run, are embedded in the document frontmatter. The prompt inference is not supported. 
OKF emits the Data Dictionary as an Open Knowledge Format document (https://github.com/GoogleCloudPlatform/knowledge-catalog/tree/main/okf) - a leaner, vendor-neutral plain-markdown-plus-YAML-frontmatter format. It emits frontmatter (type/title/description/resource/timestamp/tags) and a Schema section with a Column/Type/Content Type/Description/Enumeration table. The type frontmatter key is set via the okf-type flag (default "CSV Table"); ds-source maps to resource and ds-updated maps to timestamp. Like SemanticMd, OKF implies infer-content-type, requires the dictionary inference phase (the dictionary or all flag), folds the description inference into the body prose and a single-sentence frontmatter description, and embeds tags in the frontmatter. The prompt inference is not supported. Markdown
 ‑‑allow‑extra‑cols  flag When the format is JSONSchema, emit additionalProperties as true at the schema root (default is false, strict). Only meaningful with the JSONSchema format; ignored otherwise.
 ‑‑strict‑dates  flag When the format is JSONSchema, emit format date or date-time for columns that stats infers as Date or DateTime. Off by default because qsv's --infer-dates is permissive (accepts strings like "June 27, 1968") and JSON Schema's date formats require RFC 3339, so the validate roundtrip would otherwise fail. Set this only when your source columns are guaranteed to be RFC 3339 full-date / date-time. Mirrors the same flag on the schema command.
 ‑‑ds‑source  string For the SemanticMd & OKF formats only: the dataset source/provenance recorded in the document frontmatter (e.g. a source URL or publisher). For OKF this populates the resource key. Optional; the frontmatter key is omitted when unset. Ignored by other formats.
 ‑‑ds‑updated  string For the SemanticMd & OKF formats only: the dataset's last-updated date recorded in the document frontmatter. For OKF this populates the timestamp key. Optional. Ignored by other formats.
 ‑‑ds‑license  string For the SemanticMd format only: the dataset license recorded in the document frontmatter. Optional. Ignored by other formats.
 ‑‑okf‑type  string For the OKF format only: the value of the required type frontmatter key (e.g. "CSV Table", "BigQuery Table"). Optional; defaults to "CSV Table". Ignored by other formats.
 ‑o, ‑‑output  string Write output to <file> instead of stdout. If --format is set to TSV, separate files will be created for each prompt type with the pattern {filestem}.{kind}.tsv (e.g., output.dictionary.tsv, output.tags.tsv).
 ‑q, ‑‑quiet  flag Do not print status messages to stderr.
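The {filestem}.{kind}.tsv naming pattern used for TSV output can be reproduced with a few lines of Python. The `tsv_output_paths` helper is a hypothetical illustration of the pattern, not qsv's code.

```python
from pathlib import Path


def tsv_output_paths(output: str, kinds: list) -> list:
    """Expand one --output value into one TSV path per inference kind."""
    out = Path(output)
    # Keep the directory, swap the file name for {filestem}.{kind}.tsv.
    return [str(out.with_name(f"{out.stem}.{kind}.tsv")) for kind in kinds]


paths = tsv_output_paths("output.tsv", ["dictionary", "tags"])
```

Here `paths` holds "output.dictionary.tsv" and "output.tags.tsv", matching the example in the --output description.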

Source: src/cmd/describegpt.rs | Table of Contents | README